Machine translation is one of the most successful AI applications in natural language processing. High-quality machine translation systems such as Google Translate and Microsoft's Bing Translator need large-scale bilingual data sets, up to millions of sentence pairs, for training.
However, many of the world's languages lack such resources. That is why building effective machine translation for low-resource languages, including Southeast Asian languages, is an urgent and challenging task.
Recently, the Information Technology Institute under the Vietnam Academy of Science and Technology has researched and mastered the most advanced machine translation technology. It has also successfully built a multilingual text translation system that translates from Vietnamese to regional languages, including Lao, Khmer, Thai, Malay and Indonesian, and vice versa.
According to the developer, challenges arose when building the machine translation model. The difficulties came not only from the scarcity of bilingual data, but also from the diversity in morphology, the lack of word and sentence separation, and polysemy.
The AI model developed by the IT Institute can learn to adapt to all of these language characteristics. Additional languages can be added to the software quickly when necessary, with translation quality equal to that of foreign-made products.
The special feature of the multilingual translation software is that it runs independently and stores data on site, with no need to use the API (application programming interface) of other service providers. This ensures information security and prevents data leakage.
The problem with well-known translation systems, such as Google Translate and Bing Translator, is domain-specific adaptation. In other words, they handle everyday vocabulary well for most users, but their quality is poor when translating specialized terminology in fields such as health, law and security.
To fix the problem, the research team of the IT Institute has developed a translation system with the Vietnamese language at the center, capable of two-way translation to and from low-resource languages.
The translation software is of relatively high quality, equal to or even higher than that of Google Translate on the same documents. There is no limit on the length of documents to be translated.
In 2022-2023, work on the system focused on deploying large language models (LLMs), with priority given to five language pairs: Vietnamese-Khmer, Vietnamese-Lao, Vietnamese-Thai, Vietnamese-Malay and Vietnamese-Indonesian.
For English translation, the software developed at the IT Institute has the same high quality as Google Translate.
Le My