Jourik Ciesielski

How to leverage multilingual terminology in DITA

If you’re in the localization business, you’ll be familiar with claims about terminology. Updating terminology after a claim involves a lot of unexpected extra work: implementing corrections, performing additional QA, updating translation memories and, if applicable, delivering new files to your client.

Terminology has always been and will always be a delicate topic in the language industry, especially in the field of technical documentation. The DITA standard, one of the most popular architectures for structured writing, makes it possible to set up a special kind of collaboration between technical writers and localization teams that avoids cumbersome terminology extraction from PDF files and tackles quality issues. All you need to do is think outside the box, take full advantage of the XML-based standards the language industry has to offer, and add two specific resources to your DITA project: a dedicated terminology topic and a spot-on XSLT transformation.

1. Terminology topic

There are some specialized glossary topics (glossentry, glossgroup) in DITA, but they fall short when it comes to multilingual support. Creating a custom terminology topic doesn’t have to be difficult, but it’s the part where the out-of-the-box thinking kicks in. Let’s have a look at an example:

The foundation consists of a ditabase file with a topic, a title and a body element. The topic’s body contains p elements with numeric identifiers starting at 1 and counting up. The p elements serve as containers for the actual terms, whose language is determined via the xml:lang attribute. Note that this XML structure is nothing more than an example, so it can be modified or extended in accordance with the needs and wishes of the content author or the localization team. If you need an extra language, no problem. If a term requires a definition, you could add it using a note. If you want to add more metadata to a term, consider organizing it within a div. The only things you need to worry about are keeping the file simple, well-formed and valid.
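The example itself is not reproduced in this text. A topic along the lines described above might look like the sample embedded in the sketch below, which also reads the terms back per language. The ids, languages and terms are invented for illustration, and Python's standard library is used only so that the snippet can be run as is.

```python
import xml.etree.ElementTree as ET

# Hypothetical terminology topic. The element layout follows the description
# above; the ids, languages and terms are invented for illustration.
TERMINOLOGY_TOPIC = """\
<dita>
  <topic id="terminology" xml:lang="en-US">
    <title>Terminology</title>
    <body>
      <p id="1">
        <term xml:lang="en-US">circuit breaker</term>
        <term xml:lang="nl-NL">stroomonderbreker</term>
        <term xml:lang="fr-FR">disjoncteur</term>
      </p>
      <p id="2">
        <term xml:lang="en-US">power supply</term>
        <term xml:lang="nl-NL">voeding</term>
        <term xml:lang="fr-FR">alimentation</term>
      </p>
    </body>
  </topic>
</dita>
"""

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_terms(topic_xml):
    """Return {entry id: {language: term}} for every p element in the topic."""
    root = ET.fromstring(topic_xml)
    entries = {}
    for p in root.iter("p"):
        entries[p.get("id")] = {
            term.get(XML_LANG): (term.text or "").strip()
            for term in p.findall("term")
        }
    return entries


entries = read_terms(TERMINOLOGY_TOPIC)
for entry_id, terms in entries.items():
    print(entry_id, terms)
```

Each p element acts as one concept, and each term element inside it carries one language, which is what makes the later conversion to a term base straightforward.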

Furthermore, terminology doesn’t necessarily have to be multilingual when it is picked up in the XML. The term elements that are supposed to contain the translated terms can be omitted or left empty if the terms still need to be localized. You’ll end up with a multilingual XML file for translation, which might be a tough nut to crack, but there are several translation management systems (e.g. Memsource, memoQ) that offer good support for it.
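To see which entries still need to be localized, you could scan the topic for omitted or empty term elements. The following is a minimal sketch, assuming the hypothetical layout of p elements with language-tagged term children; the function name and sample data are made up for illustration.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical topic: entry 1 has an empty French term, entry 2 omits it.
PARTIAL_TOPIC = """\
<dita>
  <topic id="terminology" xml:lang="en-US">
    <title>Terminology</title>
    <body>
      <p id="1">
        <term xml:lang="en-US">circuit breaker</term>
        <term xml:lang="fr-FR"/>
      </p>
      <p id="2">
        <term xml:lang="en-US">power supply</term>
      </p>
    </body>
  </topic>
</dita>
"""


def untranslated_entries(topic_xml, target_lang):
    """Return the ids of p elements without a non-empty term in target_lang."""
    root = ET.fromstring(topic_xml)
    missing = []
    for p in root.iter("p"):
        translated = any(
            term.get(XML_LANG) == target_lang and (term.text or "").strip()
            for term in p.findall("term")
        )
        if not translated:
            missing.append(p.get("id"))
    return missing


print(untranslated_entries(PARTIAL_TOPIC, "fr-FR"))
```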

2. XSLT transformation

XSLT is a language for transforming XML documents into other XML documents or into other formats such as HTML. Although regular expressions are considered to be the #1 lifesaver in localization, the power of XSLT shouldn’t be underestimated, since practically all language industry standards are XML-based (XLIFF, TMX, TBX, ITS, etc.).

The terminology topic we discussed earlier in this post can be converted easily to TBX (TermBase eXchange) using XSLT:
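The stylesheet itself is not reproduced here. To illustrate the mapping it has to perform, the sketch below does the same conversion with Python's standard library instead of XSLT, purely so that it runs without an XSLT processor. It assumes the hypothetical topic layout of p elements with language-tagged term children and writes the martif-based TBX structure (termEntry, langSet, tig, term); the header text and the "c" id prefix are illustrative choices.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical terminology topic, invented for illustration.
TERMINOLOGY_TOPIC = """\
<dita>
  <topic id="terminology" xml:lang="en-US">
    <title>Terminology</title>
    <body>
      <p id="1">
        <term xml:lang="en-US">circuit breaker</term>
        <term xml:lang="nl-NL">stroomonderbreker</term>
      </p>
      <p id="2">
        <term xml:lang="en-US">power supply</term>
        <term xml:lang="nl-NL"/>
      </p>
    </body>
  </topic>
</dita>
"""


def dita_to_tbx(topic_xml, source_lang="en-US"):
    """Map each p element to a termEntry and each term to a langSet."""
    topic_root = ET.fromstring(topic_xml)

    martif = ET.Element("martif", {"type": "TBX-Basic", XML_LANG: source_lang})
    header = ET.SubElement(martif, "martifHeader")
    source_desc = ET.SubElement(ET.SubElement(header, "fileDesc"), "sourceDesc")
    ET.SubElement(source_desc, "p").text = "Generated from a DITA terminology topic"
    body = ET.SubElement(ET.SubElement(martif, "text"), "body")

    for p in topic_root.iter("p"):
        # XML ids may not start with a digit, hence the "c" prefix.
        entry = ET.SubElement(body, "termEntry", {"id": "c" + p.get("id")})
        for term in p.findall("term"):
            text = (term.text or "").strip()
            if not text:
                continue  # skip terms that still need to be localized
            lang_set = ET.SubElement(entry, "langSet", {XML_LANG: term.get(XML_LANG)})
            tig = ET.SubElement(lang_set, "tig")
            ET.SubElement(tig, "term").text = text
    return martif


tbx = dita_to_tbx(TERMINOLOGY_TOPIC)
if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
    ET.indent(tbx)
print(ET.tostring(tbx, encoding="unicode"))
```

An XSLT version would express the same idea with one template that matches p and one that matches term.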

The TBX result of the conversion can subsequently be imported into the TMS of your choice (Memsource, memoQ, XTM, etc.) and used for leverage as well as QA in the localization process.

This way of managing terminology will undoubtedly raise some questions. Who, for example, is responsible for executing the XSLT transformations? The fact of the matter is that it doesn’t really matter. Technical writers can do it if their authoring tool supports XSLT (e.g. Oxygen), but it probably makes more sense for the localization team to take care of it in a tool such as Notepad++. How about version management? This workflow produces single files that need to be sent back and forth via email, FTP or a shared drive, so nobody can afford to miss a file. And what if there are no resources to write and integrate XSLT in the process?

Despite these downsides, there are many advantages to this way of working. The DITA terminology topic is relatively easy to create and maintain. Converting it to TBX is a matter of seconds and maintaining the corresponding term bases in the TMS isn’t a lot of work either. Furthermore, terminology can be tagged with the term element in the DITA documentation, which creates an additional safety check for the terms in the repository:
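As a rough sketch of such a check, the snippet below collects every term element from a documentation topic and reports the ones that do not occur in the terminology topic. Both XML samples and the function name are hypothetical, and the comparison is a simple case-insensitive match in a single language.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical terminology topic, invented for illustration.
TERMINOLOGY_TOPIC = """\
<dita>
  <topic id="terminology" xml:lang="en-US">
    <title>Terminology</title>
    <body>
      <p id="1"><term xml:lang="en-US">circuit breaker</term></p>
      <p id="2"><term xml:lang="en-US">power supply</term></p>
    </body>
  </topic>
</dita>
"""

# Hypothetical documentation topic with one misspelled term.
DOC_TOPIC = """\
<topic id="installation" xml:lang="en-US">
  <title>Installation</title>
  <body>
    <p>Switch off the <term>circuit breaker</term> before you connect
      the <term>power suply</term>.</p>
  </body>
</topic>
"""


def unknown_terms(doc_xml, terminology_xml, lang="en-US"):
    """Return tagged terms from the documentation that the term list lacks."""
    known = {
        (term.text or "").strip().lower()
        for term in ET.fromstring(terminology_xml).iter("term")
        if term.get(XML_LANG) == lang
    }
    used = [
        " ".join("".join(term.itertext()).split())  # normalize whitespace
        for term in ET.fromstring(doc_xml).iter("term")
    ]
    return [text for text in used if text.lower() not in known]


print(unknown_terms(DOC_TOPIC, TERMINOLOGY_TOPIC))
```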

3. Taking it one step further

If you’re writing documentation for a software application, you need to refer correctly to the texts that are displayed on the user interface of the software. If the software strings are managed in a structured data format like JSON or XML, a similar transformation strategy can be applied to ensure this. The purpose is to convert the software strings as well as their localized versions into reusable components on the one hand, and to pull those components directly into the DITA documentation on the other (similar to how many technical writers treat notes, for example). Let’s take XLIFF 1.2 as an example:
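The original sample file is not reproduced here, so the snippet below uses a small hypothetical XLIFF 1.2 file with two invented UI strings and reads its translation units with Python's standard library. The file name, ids and strings are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# XLIFF 1.2 elements live in this namespace.
XLIFF_NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

# Hypothetical software strings, invented for illustration.
UI_STRINGS_XLIFF = """\
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="ui-strings.json" source-language="en-US"
        target-language="nl-NL" datatype="plaintext">
    <body>
      <trans-unit id="btn_save">
        <source>Save</source>
        <target>Opslaan</target>
      </trans-unit>
      <trans-unit id="btn_cancel">
        <source>Cancel</source>
        <target>Annuleren</target>
      </trans-unit>
    </body>
  </file>
</xliff>
"""


def read_units(xliff_xml):
    """Return (id, source text, target text or None) for every trans-unit."""
    root = ET.fromstring(xliff_xml)
    units = []
    for unit in root.iterfind(".//x:trans-unit", XLIFF_NS):
        target = unit.find("x:target", XLIFF_NS)
        units.append((
            unit.get("id"),
            unit.findtext("x:source", default="", namespaces=XLIFF_NS),
            target.text if target is not None else None,
        ))
    return units


for unit in read_units(UI_STRINGS_XLIFF):
    print(unit)
```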

Both the English and the localized software strings from this XLIFF can be stored as reusable components in dedicated DITA libraries or warehouses using XSLT:
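The stylesheet for this step is not reproduced here either. As a stand-in, the sketch below performs the conversion with Python's standard library rather than XSLT and builds one library topic per language, with each software string stored in a ph element that reuses the trans-unit id. The topic id, title and overall library layout are assumptions, not a prescribed structure.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
XLIFF_NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

# Hypothetical software strings, invented for illustration.
UI_STRINGS_XLIFF = """\
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="ui-strings.json" source-language="en-US"
        target-language="nl-NL" datatype="plaintext">
    <body>
      <trans-unit id="btn_save">
        <source>Save</source>
        <target>Opslaan</target>
      </trans-unit>
      <trans-unit id="btn_cancel">
        <source>Cancel</source>
        <target>Annuleren</target>
      </trans-unit>
    </body>
  </file>
</xliff>
"""


def xliff_to_library(xliff_xml, use_target=False):
    """Build a DITA topic holding one reusable ph element per trans-unit."""
    file_el = ET.fromstring(xliff_xml).find("x:file", XLIFF_NS)
    lang = file_el.get("target-language" if use_target else "source-language")
    child = "x:target" if use_target else "x:source"

    topic = ET.Element("topic", {"id": "ui_strings", XML_LANG: lang})
    ET.SubElement(topic, "title").text = "UI strings"
    body = ET.SubElement(topic, "body")
    for unit in file_el.iterfind(".//x:trans-unit", XLIFF_NS):
        # ph is an inline element, so each one is wrapped in its own p.
        p = ET.SubElement(body, "p")
        ph = ET.SubElement(p, "ph", {"id": unit.get("id")})
        ph.text = unit.findtext(child, default="", namespaces=XLIFF_NS)
    return topic


english_library = xliff_to_library(UI_STRINGS_XLIFF)
dutch_library = xliff_to_library(UI_STRINGS_XLIFF, use_target=True)
for library in (english_library, dutch_library):
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(library)
    print(ET.tostring(library, encoding="unicode"))
```

Because the element ids are identical in every language, the same reference in the documentation can resolve to the right string for each locale.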

The software strings can subsequently be picked up in the documentation using e.g. conkeyrefs or keyrefs:
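In a real project the DITA processor resolves these references at publishing time, typically through a key defined in the map that points to the library topic. The sketch below only imitates that behavior in Python for ph elements that carry a conkeyref of the form key/element id. The key name ui_strings, both XML samples and the resolver function are hypothetical and far simpler than an actual DITA implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical library topic for one language.
UI_LIBRARY = """\
<topic id="ui_strings" xml:lang="en-US">
  <title>UI strings</title>
  <body>
    <p><ph id="btn_save">Save</ph></p>
    <p><ph id="btn_cancel">Cancel</ph></p>
  </body>
</topic>
"""

# Hypothetical documentation topic that pulls in two software strings.
DOC_TOPIC = """\
<topic id="saving" xml:lang="en-US">
  <title>Saving your work</title>
  <body>
    <p>Click <ph conkeyref="ui_strings/btn_save"/> to store your changes,
      or <ph conkeyref="ui_strings/btn_cancel"/> to discard them.</p>
  </body>
</topic>
"""


def resolve_conkeyrefs(doc_xml, key_targets):
    """Fill every element that has a conkeyref with the referenced text.

    key_targets maps a key name to the XML of the topic that key points at.
    """
    libraries = {key: ET.fromstring(xml) for key, xml in key_targets.items()}
    doc = ET.fromstring(doc_xml)
    for el in doc.iter():
        ref = el.attrib.pop("conkeyref", None)
        if ref is None:
            continue
        key, element_id = ref.split("/", 1)
        source = libraries[key].find(f".//{el.tag}[@id='{element_id}']")
        if source is None:
            raise KeyError(ref)
        el.text = source.text
    return doc


resolved = resolve_conkeyrefs(DOC_TOPIC, {"ui_strings": UI_LIBRARY})
sentence = " ".join("".join(resolved.find("body/p").itertext()).split())
print(sentence)
```

Swapping the English library for a localized one under the same key is all it takes to get the localized strings in the translated documentation.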

Just like with regular multilingual terminology, there are some questions that need to be considered. Is there, for example, a workaround if the software strings aren’t available in a structured data format? And since they end up being placeholders in the DITA documentation, won’t they cause loss of context in the localization process?

Nevertheless, there seem to be more advantages than disadvantages to this approach. First of all, software strings will be reproduced identically in the documentation in all circumstances, so the risk of mistakes as a result of copy-paste actions is basically reduced to zero. Second, updated software strings will be synchronized automatically in the documentation and since there’s no manual editing, there will be no translation memory leverage loss. Finally, there’s no need to spend time creating unnecessary term bases and doing cumbersome terminology checks.