SQL/XML database question--translation database

Hello–this is a fairly specific technical question and I don’t know if anyone will be able to help me, but I know there are lots of smart computery folks here so I figured it’s worth a shot.

I have a problem with a DejaVu X translation database that I’m trying to solve, either by using SQL commands to modify the database, or by exporting it to TMX (XML) and using a script to modify the XML, then importing it back into my translation memory program. I can also export the database to other formats such as Excel or Access.

The problem is that rather than keeping a single record for each source segment, with multiple languages, for example:

[Record #1]
[Source] Saturday
[French] samedi
[Italian] sabato
[Latin] aturdaysay

my database has duplicate records, one for each language:
[Record #1]
[Source] Saturday
[French] samedi

[Record #2]
[Source] Saturday
[Italian] sabato

[Record #3]
[Source] Saturday
[Latin] aturdaysay

I would like to combine these into a single record, as in the first example. The language entries may not necessarily appear consecutive to each other in the database.

The other problem is that multiple translations into a language may exist for a single term in the database and I would like to keep both–for example, if “mouse” has been translated into Italian as both “topo” (for the animal) and “mouse” (for the computer peripheral). It doesn’t matter which one gets merged into the multilingual record.

Does anybody know of a way to do this?

Here is a sample of a TMX file showing two duplicate records, the first containing an English-German pair and the second containing an English-Danish pair–I would like them to both appear as part of the same record, with English as the main source segment and both German and Danish specified as <tuv> elements of the same <tu>, if that makes sense.
<tu
tuid=“3”
datatype=“Text”
srclang=“en-us”
>
<prop type=“x-Project”>7301447</prop>
<tuv
xml:lang=“en-us”
creationdate=“20051019T204809Z”
creationid=" "
>
<prop type=“IsSource”>True</prop>
<seg>INDICATION FOR USE</seg>
</tuv>
<tuv
xml:lang=“de”
creationdate=“20060302T013912Z”
creationid=" "
>
<prop type=“IsSource”>False</prop>
<seg>Einsatzbereich</seg>
</tuv>
</tu>
<tu
tuid=“4”
datatype=“Text”
srclang=“en-us”
>
<prop type=“x-Project”>7301447</prop>
<tuv
xml:lang=“en-us”
creationdate=“20051019T204809Z”
creationid=" "
>
<prop type=“IsSource”>True</prop>
<seg>INDICATION FOR USE</seg>
</tuv>
<tuv
xml:lang=“da”
creationdate=“20051019T204809Z”
creationid=" "
>
<prop type=“IsSource”>False</prop>
<seg>Indikation</seg>
</tuv>
</tu>

That’s data normalization. You want that.

If you’re trying to get this back into SQL, the best way to fit it into one table, in SQL, is to have the database look like:
(columns)
English French Spanish German Italian Sasketchewanian, etc
Where you have to define all the languages you want to keep as a separate column.
You could then handle multiple instances by using a comma (although obviously you can only hold a few words max in each column to keep under the 8000 bytes/row limit)
I can’t think of an elegant way to do this in SQL, although its a relatively simple process in a programming language such as VB.

If you didn’t need duplicates, and just wanted to pull the first word it is a relatively simple SQL query, IE (assuming that the source table has a “language” field and a “translation” field, which would make sense given the format you suggested)

INSERT INTO t_NewTable (English, French, Spanish, …)
SELECT ST.Source, (SELECT TOP 1 Translation FROM t_SourceTable WHERE Language = “French”), (SELECT TOP 1 Translation FROM t_SourceTable WHERE Language = “Spanish”), …

FROM t_SourceTable ST GROUP BY Source

Missing piece in my code, should be:

INSERT INTO t_NewTable (English, French, Spanish, …)
SELECT ST.Source, (SELECT TOP 1 Translation FROM t_SourceTable WHERE Language = ‘French’ AND Source = ST.Source), (SELECT TOP 1 Translation FROM t_SourceTable WHERE Language = ‘Spanish’ AND Source = ST.Source), …

FROM t_SourceTable ST GROUP BY Source

The problem is that I’m working with a pre-built tool, the translation memory software, which I can’t customize. If the translations exist as part of the same record, it decreases the size of the database and also allows me to compare translations of a segment in different languages side-by-side. If I were building my own translation software, I’m sure I could specify the table architecture, and customize the translation memory viewer to show me all segments side by side, even from normalized data. However, the way it is now, it’s hard for me to compare translations side by side–a very common task–and I’d like to fix my data so I can do this. There’s a tool in the translation software that allows what I need, but you have to manually select each set of segments that you want to consolidate–not a very feasible task when you’re dealing with millions of records.

I’m not sure how you expect to view the resulting table and/or de-normalized data, if you are working with a prebuilt tool - surely the product will only work with the data, as is, no ? Are you trying to create a spreadsheet or XML file from the normalized data that is denormalized and then view in Excel/Internet Explorer, whatever ?

I do need the data to stay in the same format so it’s viewable by the software I’m using–but the software I’m using allows either one of the two formats listed above. So I need to transform the second example listed above into the first. I could also create a spreadsheet or XML–I was thinking I might be able to process the XML file more easily–but I would need to then import it back into the software (which allows imports from TMX, Access, Excel, Trados tagged text, etc.) My limitation is that I can’t alter the fundamental setup of the database, only change the values stored in it. Does that make sense? I feel like I’m babbling :frowning: