Automation, integration and machine translation

Jourik Ciesielski
Jul 5, 2020

Updated: Feb 22, 2021

This article describes a very specific machine translation project carried out by Yamagata Europe. The aim is to demonstrate how Yamagata manages to translate thousands of words at a high quality level in just a couple of minutes, despite the relatively complex source file format.

Source data

The client provides tons of small *.tmx documents that are exported from an unspecified CMS. Note that the *.tmx documents are actually not valid since they do not contain a header element. As a consequence, it is impossible to process them with the default TMX filter in memoQ, the CAT tool for this project.
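The missing header can be confirmed before any import is attempted. The following is a minimal Python sketch, not part of the original workflow; the sample documents and the function name has_tmx_header are hypothetical and only mimic the structure described above.

```python
import xml.etree.ElementTree as ET


def has_tmx_header(xml_text):
    """Return True if the root element is <tmx> and it has a <header> child."""
    root = ET.fromstring(xml_text)
    return root.tag == "tmx" and root.find("header") is not None


# Hypothetical sample resembling the client's export: a body, but no header.
SAMPLE_WITHOUT_HEADER = """<tmx version="1.4"><body>
<tu><tuv lang="EN-US"><seg>Hello</seg></tuv></tu>
</body></tmx>"""

# Hypothetical sample that does carry a header element.
SAMPLE_WITH_HEADER = """<tmx version="1.4"><header srclang="EN-US"/><body/></tmx>"""

for name, sample in [("without header", SAMPLE_WITHOUT_HEADER),
                     ("with header", SAMPLE_WITH_HEADER)]:
    print(name, "->", has_tmx_header(sample))
```

A check like this only tests for the presence of the element; it does not validate the document against the TMX specification.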

Pre-processing the source data

By default, translated text needs to end up inside the tuv elements whose lang attribute holds the target language code, e.g. lang="DE-DE". In the source documents, the seg child elements of these target tuv elements contain the English source sentences.

All the translation units (i.e. content between tu elements) have a prop child element with a type attribute, value = Txt::Quality. A translation unit does not require translation if the value of this prop element is equal to or higher than 80. A set of pre-processing scripts gives certain tu elements a translate="no" attribute based on the above-mentioned quality value. The scripts give all source seg elements a translate="no" attribute as well.

The translation units that require translation (i.e. with quality value < 80) remain unchanged.
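The marking logic described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the scripts used in the project: the source language code EN-US, the sample data and the function names are hypothetical, and the attribute is read as plain lang, as it is named in this article.

```python
import xml.etree.ElementTree as ET

QUALITY_THRESHOLD = 80  # from the article: a value of 80 or higher needs no translation


def quality_of(tu):
    """Return the Txt::Quality value of a tu element as a float, or None."""
    for prop in tu.findall("prop"):
        if prop.get("type") == "Txt::Quality":
            try:
                return float(prop.text)
            except (TypeError, ValueError):
                return None
    return None


def mark_untranslatable(xml_text, source_lang="EN-US"):
    """Add translate="no" to high-quality tu elements and to all source seg elements."""
    root = ET.fromstring(xml_text)
    for tu in root.iter("tu"):
        quality = quality_of(tu)
        if quality is not None and quality >= QUALITY_THRESHOLD:
            tu.set("translate", "no")
        for tuv in tu.findall("tuv"):
            if tuv.get("lang") == source_lang:  # source_lang is an assumption
                for seg in tuv.findall("seg"):
                    seg.set("translate", "no")
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample: one unit above the threshold, one below it.
SAMPLE = """<tmx><body>
<tu><prop type="Txt::Quality">95</prop>
<tuv lang="EN-US"><seg>Save</seg></tuv>
<tuv lang="DE-DE"><seg>Speichern</seg></tuv></tu>
<tu><prop type="Txt::Quality">40</prop>
<tuv lang="EN-US"><seg>Cancel</seg></tuv>
<tuv lang="DE-DE"><seg>Cancel</seg></tuv></tu>
</body></tmx>"""

print(mark_untranslatable(SAMPLE))
```

Units without a readable quality value are left unmarked in this sketch, so they would still be offered for translation; whether that matches the real scripts is not stated in the article.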

Processing the source data in memoQ

Since the upgrade from memoQ 2014 R2 to memoQ 2015, it is possible to run pre-processing scripts automatically upon project creation in memoQ by adding them as automated actions to the project template.

This automated pre-processing not only makes project preparation considerably faster, it also enables people who are not familiar with running scripts to create projects. In other words, manual pre-processing is completely excluded from the workflow.

Adding translate="no" attributes to elements in source documents is of course not enough to exclude them from translation in memoQ. Whether XML elements are translatable or not also needs to be configured in the file filter.

The configurations that are marked in the screenshots produce a situation in which all tu elements with a translate="no" attribute are excluded from translation. The remaining tu elements are imported for translation.

Note that for the source *.tmx documents we use a cascading filter that contains an XML filter and an HTML filter. The XML filter is the basic filter that (as explained earlier in this article) defines the conditions that exclude certain elements and/or attributes from translation. The HTML filter processes embedded HTML tags.
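The effect of the two filter stages can be emulated outside memoQ, for example to check how many segments a document should yield. The Python sketch below is only an approximation of the behaviour described above, not memoQ's implementation; the function names, the sample data and the simple tag pattern are assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Simplified pattern for an embedded HTML tag; real HTML parsing is more involved.
TAG_PATTERN = re.compile(r"(<[^>]+>)")


def xml_stage(xml_text):
    """Stage 1: collect seg text from everything not marked translate="no"."""
    root = ET.fromstring(xml_text)
    segments = []
    for tu in root.iter("tu"):
        if tu.get("translate") == "no":
            continue  # the whole unit is excluded
        for seg in tu.iter("seg"):
            if seg.get("translate") != "no":
                segments.append("".join(seg.itertext()))
    return segments


def html_stage(segment):
    """Stage 2: split a segment into ("tag", ...) and ("text", ...) parts."""
    parts = []
    for piece in TAG_PATTERN.split(segment):
        if not piece:
            continue
        kind = "tag" if TAG_PATTERN.fullmatch(piece) else "text"
        parts.append((kind, piece))
    return parts


# Hypothetical pre-processed sample with one excluded and one translatable unit.
SAMPLE = """<tmx><body>
<tu translate="no"><tuv lang="EN-US"><seg translate="no">Done</seg></tuv>
<tuv lang="DE-DE"><seg>Fertig</seg></tuv></tu>
<tu><tuv lang="EN-US"><seg translate="no">Press &lt;b&gt;OK&lt;/b&gt;</seg></tuv>
<tuv lang="DE-DE"><seg>Press &lt;b&gt;OK&lt;/b&gt;</seg></tuv></tu>
</body></tmx>"""

for segment in xml_stage(SAMPLE):
    print(segment, html_stage(segment))
```

Only the target seg of the second unit survives the first stage, which mirrors the statement that content not requiring translation is never imported.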

The combination of automated pre-processing via the project template and a powerful file filter makes it possible to exclusively extract the translatable translation units from the source TMX data. Content that does not require translation is not imported by memoQ. These assets also enable people without knowledge about the workflow to create projects in very little time.

Machine translation

Thanks to the Systran MT plugin, the machine translation engines that were created and trained intensively for this particular project can be used. These customized engines guarantee a high quality level of the MT output that is delivered to the client.

Post-processing the translated documents

Some post-processing steps are required before delivering translated *.tmx documents back to the client. All the post-processing actions are performed via a set of fully automated regex-based search and replace scripts that are integrated in the project template, 4 in total:

The first script removes all the translate="no" attributes from the target *.tmx documents.
The second one converts unwanted double-escaped HTML entities (e.g. &amp;nbsp;) into actual characters.
The third one converts certain characters into HTML entities (e.g. " -> &quot;).
The last one adjusts the encoding of the target *.tmx documents.
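The four steps listed above could look roughly like the following Python sketch. The actual search and replace patterns are not given in this article, so every pattern here is an assumption: the double-escaped entity is taken to be &amp;nbsp;, the character conversion is taken to be " to &quot; inside seg elements only, and the encoding step is reduced to rewriting the encoding named in the XML declaration, with utf-8 as a placeholder.

```python
import re


def remove_translate_no(text):
    """Step 1: drop translate="no" together with the whitespace before it."""
    return re.sub(r'\s+translate="no"', "", text)


def unescape_double_escaped(text):
    """Step 2 (assumed example): turn &amp;nbsp; into a literal no-break space."""
    return text.replace("&amp;nbsp;", "\u00a0")


def escape_quotes_in_segments(text):
    """Step 3 (assumed example): convert " to &quot; inside seg content only."""
    def fix(match):
        return match.group(1) + match.group(2).replace('"', "&quot;") + match.group(3)
    return re.sub(r"(<seg\b[^>]*>)(.*?)(</seg>)", fix, text, flags=re.DOTALL)


def set_declared_encoding(text, encoding="utf-8"):
    """Step 4 (assumed form): rewrite the encoding named in the XML declaration."""
    pattern = r'(<\?xml[^>]*\bencoding=")[^"]*(")'
    return re.sub(pattern, lambda m: m.group(1) + encoding + m.group(2), text, count=1)


def postprocess(text):
    """Apply the four steps in the order given in the list above."""
    text = remove_translate_no(text)
    text = unescape_double_escaped(text)
    text = escape_quotes_in_segments(text)
    return set_declared_encoding(text)


# Hypothetical exported fragment.
SAMPLE = ('<?xml version="1.0" encoding="utf-16"?>'
          '<tu translate="no"><tuv lang="DE-DE">'
          '<seg>Er sagte "ja"&amp;nbsp;!</seg></tuv></tu>')

print(postprocess(SAMPLE))
```

Restricting the quote conversion to seg content keeps attribute values such as lang="DE-DE" intact, which a global replacement would break.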

As a result, manual post-processing is entirely excluded from the workflow, just like manual pre-processing. Target documents exported by memoQ are immediately ready for delivery to the client.

Export path

As in any LSP or well-organized translation agency, target documents must end up in a fixed folder structure. Translated documents that are ready for delivery to the client need to be placed in a folder called 6_OUTPUT. The corresponding export path rules in memoQ look like this:

Folder rule:

\\SERVER\Projects\<Project>\6_OUTPUT\<TrgLangIso2>\<RelativePath>\<OrigFileNameExt>

File rule:

\\SERVER\Projects\<Project>\6_OUTPUT\<TrgLangIso2>\<OrigFileNameExt>

Thanks to this set of rules, target documents that are exported via the Export (stored path) option in memoQ always end up in the appropriate target language folder in 6_OUTPUT. memoQ also maintains the source folder structure if the folder rule is applied. This is especially useful for avoiding confusion or mistakes when another project manager has to follow up on the project.
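To make the difference between the two rules concrete, the placeholders can be expanded by hand. The Python sketch below only illustrates the substitution idea and is not how memoQ resolves the rules internally; the project name, language code, relative path and file name are hypothetical values.

```python
import re

# The two rules exactly as given above (raw strings keep the backslashes intact).
FOLDER_RULE = r"\\SERVER\Projects\<Project>\6_OUTPUT\<TrgLangIso2>\<RelativePath>\<OrigFileNameExt>"
FILE_RULE = r"\\SERVER\Projects\<Project>\6_OUTPUT\<TrgLangIso2>\<OrigFileNameExt>"


def expand_rule(rule, values):
    """Replace every <Placeholder> in the rule with its value from the dict."""
    return re.sub(r"<(\w+)>", lambda m: values[m.group(1)], rule)


# Hypothetical values for one exported document.
VALUES = {
    "Project": "Demo",
    "TrgLangIso2": "de",
    "RelativePath": r"manuals\chapter1",
    "OrigFileNameExt": "intro.tmx",
}

print(expand_rule(FOLDER_RULE, VALUES))
print(expand_rule(FILE_RULE, VALUES))
```

The only difference between the two results is the RelativePath part, which is what preserves the source folder structure under the target language folder.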

Conclusion

Both translation buyers and translation suppliers aim to process very big text volumes in very little time. Long-term costs need to be kept under control and a certain quality level has to be guaranteed. This goal can be achieved by an efficient combination of automation, integration and machine translation, which definitely applies to the *.tmx documents that have been described in this article. The pre-processing script that needs to be executed in order to filter the translatable content is fully automated via the project template. The powerful concept of cascading file filters finishes the work of the scripts by importing the filtered content only. The translation process is just a matter of seconds via the integration of heavily customized machine translation engines. The post-processing steps that are required to prepare the translated documents for delivery are integrated in the template as well. Thanks to the custom export path rules, target documents automatically end up in the folder that will be used for delivery to the client.

The actions described in this article lead to a workflow in which only one thing is still done manually: selecting the source documents to import for translation. All other manual interventions are excluded or significantly reduced. This saves precious time, and therefore also precious money, which is advantageous for all parties involved.