Posted by: patenttranslator | May 3, 2015

Language Technology Tools – The Clash Between Bombastic Commercial Propaganda and Somewhat Depressing Reality

 
I have been reading, on Facebook and in other equally trustworthy venues, numerous press releases about the incredible progress that has been achieved with language automation tools. The press releases use a lot of cool terminology, with impressive-sounding terms celebrating new integration capabilities (with source code), mobile applications for an improved customer interface, integration capabilities enabling customers to promptly deliver product experiences to global users, highly extensible platforms completely automating the translation process, automatic content detection, etc.

All of this new, highly innovative technology is then in the final stage integrated with cloud technology, which to me means that invisible beings residing in clouds are in charge of the seemingly least important part of the process, which would be the translating bit.

These invisible beings are probably not translators, and maybe not even real people, since only angels can reside in clouds, at least based on the teachings of Catholic religion. How many translators can be instantly translating new content in the cloud is a question that should prove to be no less interesting to consider and analyze than the question of how many angels can dance on the tip of a pin.

Mad Patent Translator is also proudly using various language and technology tools greatly facilitating the translating process. But this technology has nothing to do with the impressive terms thrown around with abandon in the modern version of the “Translation Industry”. Although at this point, I wonder whether it would be more appropriate to call it something else and instead refer to a certain segment of highly propagandized and highly automated “Translation Industry” for instance as “Language Conversion Industry”. The word “translation” is not really mentioned much in the press releases, and translators are never mentioned in them either, at least not as real human beings.

The bombastic propaganda may be working – a few days ago I received a Price Quote Request through a link on my website, in which a paralegal from a patent law firm wondered how much it would cost “to convert a Japanese patent application to English”.

So let’s consider how this translator is applying cool language tools and translation technology to his own work.

This week I am translating, among other things, 5 Japanese patent applications ranging in length from 2 to 19 pages. The term page, however, can be somewhat misleading, because older Japanese patent applications have 4 small pages on one page when printed on paper, for a total of between about 800 and about 1,200 words in English translation.

Here is my first step in the application of cutting-edge technology: as an intrepid early pioneer of innovation in the field of translation technology, I discovered more than 20 years ago that the entire translating process is greatly facilitated, and its quality enhanced, when the tiny Japanese characters, which somehow must be squished into 4 minuscule pages of the Japanese A4 page format, are enlarged on a copy machine, preferably at a ratio of 1 : 1.5.

The enlarged source text then fits perfectly on a second translation technology tool that I have been using for a long time, called a document holder, or paper stand – an inexpensive, trusty tool that further improves my translating experience. I have been using this tool with great success for more than three decades.

The third, more recent tool of advanced technology that I am frequently using now when I translate patents is machine translation. But one must use it with caution.

Some translators are now beginning to refer to machine translation as machine pseudo-translation, and with good reason. Although machine translation can be very useful for translation (or pseudo-translation) of patents in languages such as German, or Russian, or French, provided that the sentences are not too long and that the software can for example match correctly the right verb, hiding at the end of a long sentence in German, with the right object or subject, machine translation does not work nearly as well with Japanese.

I will now attempt to demonstrate my claim on something that I was translating today. Here is a very simple sentence from a Japanese patent application claim:

“以上の工程を 含む半 導 体装 置の 型 造 方法であり 、 酸化等の 然処理によるゲート 電 極 配 線の表質の問 題がなくなり安定し た半 導体装 置を 提 供 で き る”.

which says something like this:

“A method for manufacturing a semiconductor device including the stages mentioned above, which makes it possible to provide a stable semiconductor device, free of electrode gate wiring problems due to thermal processing with oxidizing, etc”.

was translated by GoogleTranslate as follows:

“Ri Oh semi-conductive KaradaSo location of the type production method, including the above steps, a deer semi-conductor equipment table quality problems of the gate electrode wiring that I have such clauses were Na Ri stability in processes such as oxidation ∎ You can in the provision”

GoogleTranslate and other machine translation programs will generally do a much better job than what we see above, especially with European languages. (Except when they don’t, of course.)

But one big problem with the translation of patents is that many older patent applications exist only as a scanned PDF file that must first be converted to editable digital text. The conversion in itself is not a problem, and there are many software packages that can be used for this purpose. But because some of the characters in Japanese or other languages will invariably be misread by the software when we are talking about older documents, erroneous characters are introduced into the converted digital file, which then makes it impossible for the machine translation software to interpret the file in a way that would make sense at least on some level.

Even when the conversion from PDF to a digital file is perfect, as is the case in the two lines of Japanese text above, if you take a closer look at these two lines in Japanese, you will see that the spacing between the characters is not perfectly uniform. This is not a problem for human eyes, but a huge, perhaps insolvable problem for a scanner. A small irregularity (lack of perfectly uniform spacing), combined with the fact that there are no spaces between Japanese words (Japanesetextiswrittenlikethis), will thus result in completely useless machine translation, such as the pseudo-translation above.
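
The irregular spacing, at least, is something one can try to clean up before the text is handed to a machine translation program. What follows is only a minimal sketch in Python, not a description of what any OCR package or GoogleTranslate actually does: the function name close_ocr_gaps and the choice of Unicode ranges are assumptions made for illustration. It simply deletes whitespace that sits between two Japanese characters, and it does nothing about characters that the OCR software misread in the first place.

```python
import re

# Hiragana and katakana, CJK ideographs, and CJK punctuation such as "、" and "。".
# The ideographic space U+3000 is deliberately left out so that \s can match it.
# These ranges are an illustrative assumption, not an exhaustive list.
_JA = r"\u3001-\u303f\u3040-\u30ff\u4e00-\u9fff"

# Whitespace that sits directly between two Japanese characters.
_GAP = re.compile(rf"(?<=[{_JA}])\s+(?=[{_JA}])")


def close_ocr_gaps(text: str) -> str:
    """Remove stray whitespace between adjacent Japanese characters."""
    return _GAP.sub("", text)


if __name__ == "__main__":
    # A fragment of the OCR output quoted above, with its irregular spacing.
    print(close_ocr_gaps("安定し た半 導体装 置を 提 供 で き る"))
```

Spaces between English words are left alone, because neither neighbor falls in the Japanese ranges. Whether the cleaned-up text then translates any better is a separate question.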

The translation agencies who describe in almost adulatory language the nifty language technology tools that they are trying to sell to new customers live in a universe that does not seem to have anything in common with the real world in which translators must translate real documents, namely in such a way so that the goal of the translation would be met – or at least so that these documents would make sense in another language.

They have created a special world for a new kind of “translation industry”, or language conversion industry, a world in which “enterprise-grade translation management platform is integrated with the version control systems developers use to manage their product strings, including Git, Mercurial, Subversion and CVS, optimizes product internationalization and accelerates product release cycles, allowing companies to increase user engagement and satisfaction by providing a localized web, desktop or mobile app experience”.

This fabulous new world has almost nothing to do with translation, or maybe a little bit, since in the end the cloud workers (also referred to as clown workers), whoever they are and wherever they may be hiding, must ultimately be unleashed to “translate the corpus” from one language to another, or probably to many other languages (to the extent permitted by the new technology).

Personally, if I were running an innovative language conversion enterprise, I would make sure to specialize only in translation into languages that my customers do not understand.


Responses

  1. Great!

    “These invisible beings are probably not translators, and maybe not even real people, since only angels can reside in clouds, at least based on the teachings of Catholic religion. How many translators can be instantly translating new content in the cloud is a question that should prove to be no less interesting to consider and analyze than the question of how many angels can dance on the tip of a pin.”

    Perfect!

    Btw, Harry’s written a book on translators and agencies. He’s going to publish it soon, in a week or two, online, free (in BG). He mentions you in his book, too. Thank you, dear Steve, you’ve been a truly real inspiration.


  2. Can you send me a link? I will try to read some of it (with the help of machine pseudo-translation).


  3. “Personally, if I were running an innovative language conversion enterprise, I would make sure to specialize only in translation into languages that my customers do not understand.”

    Steve, that’s exactly the trick of selling innovative language conversion tools!

    However, the English version you got from GoogleTranslate of that simple sentence in Japanese is different than what I got from GoogleTranslate, which is as follows:

    “Is a type method for producing a semiconductor device including the above steps, it is possible to provide a semiconductor device in which the table quality problems in the gate electrode wiring eliminates stable natural by treatment such as oxidation”

    This version isn’t much improved. The reason why Machine Translation cannot produce a sentence that “flows” out of a Japanese sentence that flows in its original is that MT does not break down the string; it processes the strings only as bits/bytes, without analysing the composition of the string as we humans who have learned Japanese would do:

    “{以上(の)工程(を)[含(む)]半導体装置(の)型造方法(で)[あ(り)]} 、{ 酸化等(の)然処理(に)[よる]ゲート電極配線(の)表質(の)問題(が)[なくな(り)]}{安定した半導体装置(を)[提供(で)き(る)]}”

    While analysing, we know the different functions of the characters (sounds or strings of bits/bytes) in the parentheses and in the parentheses that are in the brackets. And we know the different functions of the strings in braces. This is why we humans find this sentence simple, while machines can never figure out the way to break down sentences or even simple expressions such as the ones in braces in the Japanese sentence above.

    I wonder whether a monolingual human being could ever “post-edit” such a sentence to the accuracy of your “something like this”:

    “A method for manufacturing a semiconductor device including the stages mentioned above, which makes it possible to provide a stable semiconductor device, free of electrode gate wiring problems due to thermal processing with oxidizing, etc.”

    Either machine-assisted human translation or human-assisted machine translation is destined to fail, because it demands more energy to correct the faults caused by the MT pseudo-translation.

    BTW, I like the speech given by David Bellos very much. It reminds me of a volume of short stories written by Lu Qiao (Nelson Ikon Wu), a late Chinese writer who used to teach at Yale, San Francisco State College, Kyoto University in Japan and Washington University in St. Louis. The volume is titled “Son of Man”. I wished to translate it into German, and did translate 2 of its short stories during my student days in Germany. There is a story about a man who, during his travels around the globe, chanced upon an orangutan, whom he taught a human language. Through this orangutan the man was introduced to a society of orangutans, with whom he formed a close friendship and whose language he learned. When the man returned to human society, he had lost his ability to speak human. This is a beautiful, sad story. I wish that someone would translate it into English for a broader readership.

    Human languages are not just about communication. They are more than that. Communication systems of other species may also be about more than communication. You’ve been with dogs for many years. You must know what I mean. As to the translatability of a dog’s language, we must ask Dr. Dolittle, not David Bellos who translates from French into English (La Vie mode d’emploi – perfect for a manual translator! And I hope also good enough for a patent translator).


  4. I like your analysis of the problem with MT relating to grammar rules. Google Translate uses a different approach, and although this is in some cases much better than the approach based on analyzing grammar, what if there is no similar, previous translation, or what if a similar translation says the exact opposite?

    Machine translation is an excellent tool, but it simply is not translation, never will be, and trying to edit it is mostly a fool’s errand, no matter what The Translation Industry is saying.


  5. Hello Steve, I always look forward to your posts. A little off topic, but I wondered what OCR software you used. Please share if you don’t mind. Every one that I’ve used has been really bad.


  6. Hi Kirk:

    Thank you for commenting. I am glad you are enjoying my silly blog.

    For conversion from Japanese, German, French or English, I use the Adobe Export PDF utility. It is pretty good, the license costs only 20 dollars a year, and you can use it from as many computers as needed once you have an account with Adobe.

    The disadvantage is that the converted files are stored in the cloud, so I use it mostly for documents that are in the public domain, such as patent applications. The other disadvantage is that the utility only converts a limited number of languages. For example, it does not convert Russian, Polish, Czech, Chinese or Korean, which is something that I need every now and then.

    That is why I bought two inexpensive Samsung printers, SCX-3405FW and ProExpress M3870FW, a couple of years ago, which come with software for conversion from many more languages. I wrote a post describing the capabilities of one of the printers:

    An Inexpensive, Multifunctional Printer, Perfect for Translators

    I understand many people use ABBYY FineReader software for this purpose, but I did not buy it because I don’t know whether it has even more useful functions than what I already have.


  7. “Can you send me a link?”

    Yes. As soon as it’s ready, you’ll have a link. We’re editing it now. Most of the book is dedicated to the specific problems here in Bulgaria. These problems are irrelevant to you. But some universal problems are discussed as well, such as: misleading advertising, lack of human resources, anonymous translators, etc.


  8. Hi Steve. Thanks for the very detailed answer. I am going to get the Adobe utility. For what it’s worth, I tried ABBYY but didn’t get acceptable results. Although I think it was version 7 at that time, and seeing that it’s on v.12 now, it might not be a relevant comparison anymore.


  9. I agree with the comments about OCR: it’s all right as far as it goes, but it never goes far enough!

    For example, all it takes is for someone to decide that, to make the layout of the patent claims prettier, they are going to type “characterised in that” with much broader spacing. They may not do this the proper way, which in Word is to expand the character spacing (probably under Format > Font), and may instead literally type c[space]h[space]…; but no matter how they do it, the OCR program is liable to render it as c[space]h[space]…, thus requiring time-consuming editing to get rid of all those spaces!

    Then there was the secretarial service local to me which, a while ago, was asked to produce a contract based on a PDF of an existing one, but changing the details of the parties, adding a few pages, etc. etc. They ran the text through the OCR and it all looked great – until they started trying to edit it and found that what the OCR program had produced wasn’t compatible with the way Word (or any other WP program, I suppose) worked. The page numbering was a total mess, the sub-paragraph numbers had been put in one column with the associated text in another column, rather than indenting the paragraphs, and they eventually ended up having to strip all the formatting out and start again from scratch!


