A localization challenge: Asciidoc
One of the nice things about localization engineering is that it continuously brings new challenges. There is one particular file format that I consider to be “challenging”: Asciidoc. Please note that this post doesn’t aim to suggest a general workflow for the translation of Asciidoc documents. The steps described in this article may differ from other possible workflows.
Asciidoc is two things:
A plain-text writing format for authoring notes, articles, documentation, books, eBooks, web pages, slide decks, blog posts, etc.
A text processor and tool chain for converting Asciidoc documents into various formats including HTML, DocBook, PDF and ePub.
So yes, Asciidoc documents can easily be converted into HTML, for example. And yes, you could easily translate the HTML output of that conversion. But what if your client wants you to translate the native Asciidoc (*.adoc) documents? That’s where the challenging engineering part kicks in.
Defining the translatable content
The markup, structure and syntax of Asciidoc documents are relatively complicated, so it is not easy to tell which parts of the text are translatable. All the content in Asciidoc documents is placed between sequences of certain delimiter characters (e.g. = - /). We know for sure that content enclosed by the “/” delimiter consists exclusively of comments, and we can assume that content enclosed by the other delimiters may be translatable.
Setting up a file filter in a CAT tool
Since translatability depends on which delimiter is present, a very specific file filter is needed to import the translatable content properly into a CAT tool. The regex text filter in memoQ can do this, as it lets you write a custom regular expression that defines the start (and, if required, the end) of paragraphs.
Additionally, memoQ lets you write a second regular expression that defines the translatable portion of those paragraphs.
The regular expressions from the screenshots mark everything that follows a sequence of the “=” or “-” delimiter as a translatable paragraph, while content that follows the “/” delimiter is not treated as translatable text.
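The actual memoQ expressions are only visible in the screenshots, so the following Python sketch merely approximates the idea and is not the real filter configuration. It assumes that a delimiter is a line consisting of four or more identical “=”, “-” or “/” characters. The names PARAGRAPH_START, TRANSLATABLE and import_segments are made up for this illustration.

```python
import re

# Assumption: a delimiter line is four or more '=', '-' or '/' characters.
# This pattern plays the role of the "paragraph start" expression.
PARAGRAPH_START = re.compile(r"^(?:={4,}|-{4,}|/{4,})[ \t]*$", re.MULTILINE)

# This pattern plays the role of the "translatable portion" expression:
# only text that follows a '=' or '-' delimiter line is captured.
TRANSLATABLE = re.compile(
    r"\A(?:={4,}|-{4,})[ \t]*\n(?P<text>.*)\Z", re.DOTALL
)


def import_segments(document):
    """Return the text that follows '=' or '-' delimiter lines."""
    # Every delimiter line opens a new paragraph that runs until the
    # next delimiter line (or the end of the document).
    starts = [m.start() for m in PARAGRAPH_START.finditer(document)]
    bounds = starts + [len(document)]

    segments = []
    for begin, end in zip(bounds, bounds[1:]):
        paragraph = document[begin:end]
        match = TRANSLATABLE.match(paragraph)
        # Paragraphs opened by '/' never match; empty bodies are skipped.
        if match and match.group("text").strip():
            segments.append(match.group("text").strip())
    return segments
```

For a document that contains a “====” block, a “////” comment block and a “----” block, import_segments would return only the text of the first and the last block. Note that, in this sketch, any text between a closing “////” line and the next delimiter is also skipped.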
Regex, regex, and regex some more
Translatable paragraphs in Asciidoc documents still contain a lot of placeholders and other markup elements. These untranslatable elements can be excluded from translation by chaining a regex tagger to the regex text filter and then regexing the daylights out of the Asciidoc documents.
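To give an idea of what such a tagger does, here is a minimal Python sketch. It is not memoQ’s regex tagger. It only assumes three kinds of untranslatable inline elements (attribute references, cross-reference IDs without a label, and inline anchors), and a real project would most likely need a longer list. The names TAG_PATTERNS, tag_segment and untag_segment are hypothetical.

```python
import re

# Assumption: these three element types must never be translated.
TAG_PATTERNS = re.compile(
    r"\{[\w-]+\}"        # attribute reference, e.g. {product-name}
    r"|<<[\w-]+>>"       # cross reference without label, e.g. <<setup>>
    r"|\[\[[\w-]+\]\]"   # inline anchor, e.g. [[setup]]
)


def tag_segment(segment):
    """Replace untranslatable elements with numbered tags."""
    tags = []

    def protect(match):
        # Remember the original markup and hand out the next tag number.
        tags.append(match.group(0))
        return f"<{len(tags)}/>"

    return TAG_PATTERNS.sub(protect, segment), tags


def untag_segment(segment, tags):
    """Put the original markup back in place of the numbered tags."""
    return re.sub(r"<(\d+)/>", lambda m: tags[int(m.group(1)) - 1], segment)
```

The translator then only sees numbered tags, which can be moved around in the target sentence and are expanded back into the original markup on export.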
The configurations discussed in this post make it possible to import the translatable content properly, even though some segments will unavoidably look a bit messy.