|
|
|
#1
|
|||
|
|||
|
SQL/XML database question--translation database
Hello--this is a fairly specific technical question and I don't know if anyone will be able to help me, but I know there are lots of smart computery folks here so I figured it's worth a shot.
I have a problem with a DejaVu X translation database that I'm trying to solve, either by using SQL commands to modify the database, or by exporting it to TMX (XML) and using a script to modify the XML, then importing it back into my translation memory program. I can also export the database to other formats such as Excel or Access. The problem is that rather than keeping a single record for each source segment, with multiple languages, for example: [Record #1] [Source] Saturday [French] samedi [Italian] sabato [Latin] aturdaysay my database has duplicate records, one for each language: [Record #1] [Source] Saturday [French] samedi [Record #2] [Source] Saturday [Italian] sabato [Record #3] [Source] Saturday [Latin] aturdaysay I would like to combine these into a single record, as in the first example. The language entries may not necessarily appear consecutive to each other in the database. The other problem is that multiple translations into a language may exist for a single term in the database and I would like to keep both--for example, if "mouse" has been translated into Italian as both "topo" (for the animal) and "mouse" (for the computer peripheral). It doesn't matter which one gets merged into the multilingual record. Does anybody know of a way to do this? Here is a sample of a TMX file showing two duplicate records, the first containing an English-German pair and the second containing an English-Danish pair--I would like them to both appear as part of the same record, with English as the main source segment and both German and Danish specified as <tuv> elements of the same <tu>, if that makes sense. <tu tuid="3" datatype="Text" srclang="en-us" > <prop type="x-Project">7301447</prop> <tuv xml:lang="en-us" creationdate="20051019T204809Z" creationid=" " > <prop type="IsSource">True</prop> <seg>INDICATION FOR USE</seg> </tuv> <tuv xml:lang="de" creationdate="20060302T013912Z" creationid=" " > <prop type="IsSource">False</prop> <seg>Einsatzbereich</seg> </tuv> </tu> <tu tuid="4" datatype="Text" srclang="en-us" > <prop type="x-Project">7301447</prop> <tuv xml:lang="en-us" creationdate="20051019T204809Z" creationid=" " > <prop type="IsSource">True</prop> <seg>INDICATION FOR USE</seg> </tuv> <tuv xml:lang="da" creationdate="20051019T204809Z" creationid=" " > <prop type="IsSource">False</prop> <seg>Indikation</seg> </tuv> </tu> |
| Advertisements | |
|
|
|
|
#2
|
|||
|
|||
|
That's data normalization. You want that.
|
|
#3
|
|||
|
|||
|
If you're trying to get this back into SQL, the best way to fit it into one table, in SQL, is to have the database look like:
(columns) English French Spanish German Italian Sasketchewanian, etc Where you have to define all the languages you want to keep as a separate column. You could then handle multiple instances by using a comma (although obviously you can only hold a few words max in each column to keep under the 8000 bytes/row limit) I can't think of an elegant way to do this in SQL, although its a relatively simple process in a programming language such as VB. If you didn't need duplicates, and just wanted to pull the first word it is a relatively simple SQL query, IE (assuming that the source table has a "language" field and a "translation" field, which would make sense given the format you suggested) INSERT INTO t_NewTable (English, French, Spanish, ...) SELECT ST.Source, (SELECT TOP 1 Translation FROM t_SourceTable WHERE Language = "French"), (SELECT TOP 1 Translation FROM t_SourceTable WHERE Language = "Spanish"), ... FROM t_SourceTable ST GROUP BY Source |
|
#4
|
|||
|
|||
|
Missing piece in my code, should be:
INSERT INTO t_NewTable (English, French, Spanish, ...) SELECT ST.Source, (SELECT TOP 1 Translation FROM t_SourceTable WHERE Language = 'French' AND Source = ST.Source), (SELECT TOP 1 Translation FROM t_SourceTable WHERE Language = 'Spanish' AND Source = ST.Source), ... FROM t_SourceTable ST GROUP BY Source |
|
#5
|
|||
|
|||
|
The problem is that I'm working with a pre-built tool, the translation memory software, which I can't customize. If the translations exist as part of the same record, it decreases the size of the database and also allows me to compare translations of a segment in different languages side-by-side. If I were building my own translation software, I'm sure I could specify the table architecture, and customize the translation memory viewer to show me all segments side by side, even from normalized data. However, the way it is now, it's hard for me to compare translations side by side--a very common task--and I'd like to fix my data so I can do this. There's a tool in the translation software that allows what I need, but you have to manually select each set of segments that you want to consolidate--not a very feasible task when you're dealing with millions of records.
|
|
#6
|
|||
|
|||
|
I'm not sure how you expect to view the resulting table and/or de-normalized data, if you are working with a prebuilt tool - surely the product will only work with the data, as is, no ? Are you trying to create a spreadsheet or XML file from the normalized data that is denormalized and then view in Excel/Internet Explorer, whatever ?
|
|
#7
|
|||
|
|||
|
Quote:
|
![]() |
| Bookmarks |
| Thread Tools | |
| Display Modes | |
|
|