Performing natural language processing in non-English languages is a challenge, but today it is possible to get great results with multilingual natural language processing. At last, anyone can perform natural language processing in French, Japanese, Spanish, Russian, Chinese, German, and many more languages.
Almost 7,000 different languages are spoken in the world today! Each language has its own rules, and some languages work very differently from one another. For example, French, Spanish, and Italian are very similar, but they have little in common with ideograph-based Asian languages like Chinese and Japanese.
The consequence is that different techniques have to be used to create language models that are able to deal with all these languages.
In short, different languages may require different vector spaces, even if some pre-trained language embeddings already exist. This is an active research field.
So what are the solutions?
A first approach is to train a dedicated model for a specific language. For example, several new versions of BERT have been trained on various languages. German BERT, from Deepset AI, is a good example of a BERT model trained from scratch on German text: see German BERT here.
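To give an idea of what using such a language-specific model looks like, here is a minimal sketch based on the Hugging Face Transformers library, assuming German BERT is published on the Hugging Face Hub under the identifier "bert-base-german-cased":

```python
# Minimal sketch: using a language-specific model (German BERT) with the
# Hugging Face Transformers library. The model identifier
# "bert-base-german-cased" is assumed here.
from transformers import pipeline

# Load German BERT for masked word prediction.
fill_mask = pipeline("fill-mask", model="bert-base-german-cased")

# The model only handles German, but it handles it very well.
for prediction in fill_mask("Berlin ist die [MASK] von Deutschland."):
    print(prediction["token_str"], round(prediction["score"], 3))
```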
The problem is that this technique doesn't scale well. Training a new model takes time and costs a lot of money. Training several models remains affordable for small models like spaCy's, and Explosion AI (the company behind spaCy) does a great job of maintaining pre-trained models in many languages: see more here. But natural language processing models are getting bigger and bigger, and training these big models is very costly. For example, training the latest GPT models (GPT-3, GPT-J, and GPT-NeoX) took several weeks and cost millions of dollars. Training new versions of these models is not something everybody can do.
It also doesn't scale well from an inference perspective. If a company needs to use natural language processing in production in several languages, it has to maintain several models and provision several servers and GPUs, which can prove extremely costly. This is one of the reasons why, at NLP Cloud, we try to avoid this strategy as much as possible.
A second approach is to leverage multilingual models.
In recent years, new multilingual models have appeared and have proved to be very accurate, sometimes even more accurate than dedicated non-English models. The most popular ones are mBERT, XLM, and XLM Roberta. XLM Roberta seems to be the most accurate multilingual model, and it performs very well on the XNLI evaluation dataset (a cross-lingual benchmark used to assess the quality of multilingual models).
Some very good pre-trained models based on XLM Roberta are available. For example, for text classification in many languages, the best one is XLM Roberta Large XNLI: see this model here.
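As an illustration, here is a minimal sketch of zero-shot multilingual classification with the Hugging Face Transformers library, assuming the model is hosted on the Hugging Face Hub as "joeddav/xlm-roberta-large-xnli":

```python
# Minimal sketch: zero-shot text classification across languages with
# XLM Roberta Large XNLI. The Hub identifier "joeddav/xlm-roberta-large-xnli"
# is assumed here.
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="joeddav/xlm-roberta-large-xnli",
)

# A French sentence classified against Spanish candidate labels: the model
# copes with the language mix thanks to its multilingual pre-training.
result = classifier(
    "La nouvelle console de jeux sortira cet hiver.",
    candidate_labels=["deportes", "tecnología", "política"],
)
print(result["labels"][0], round(result["scores"][0], 3))
```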
For the moment, there is no really good multilingual model for text generation. For example, GPT models are excellent in English and not bad in several non-English languages, but far from impressive. BigScience is currently working on very large multilingual text generation models. It seems promising! See more here.
BigScience just announced a multilingual 176-billion-parameter Transformer model
The last strategy is to use translation: translate your non-English content into English, send the English content to the model, and translate the result back into the original language.
This technique might sound like a hack, but it has advantages. Maintaining a translation workflow might be less expensive than training dedicated models, and every language in the world can easily be supported.
In recent years, advanced translation models based on deep learning have been created. They are fast and give very good results. For example, Helsinki NLP released a series of deep-learning-based translation models. You can use the most popular ones on NLP Cloud: see more here.
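Here is a minimal sketch of the translate, process, translate back workflow described above, assuming the Opus-MT checkpoints "Helsinki-NLP/opus-mt-fr-en" and "Helsinki-NLP/opus-mt-en-fr" and the default English summarization model of the Transformers library:

```python
# Minimal sketch: translate French to English, process with an English-only
# model, then translate the result back to French. The Opus-MT model names
# are assumed here.
from transformers import pipeline

fr_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
en_to_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
summarizer = pipeline("summarization")  # English-only model by default

def summarize_french(text: str) -> str:
    """Summarize a French text with an English-only summarization model."""
    english_text = fr_to_en(text)[0]["translation_text"]
    english_summary = summarizer(english_text, max_length=60)[0]["summary_text"]
    return en_to_fr(english_summary)[0]["translation_text"]
```

The two extra translation calls are also what increase the overall response time mentioned below.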
Adding translation to your workflow will increase the overall response time, though, so it might not be suitable if you're looking for very fast results.
Multilingual natural language processing is not a solved problem, but a lot of progress has been made in recent years. It is now possible to perform natural language processing in non-English languages with very good results, thanks to language-specific models, multilingual models, and translation.
At NLP Cloud, we believe that understanding and generating text in many languages is crucial, so we released a dedicated add-on called the "multilingual add-on". Once enabled, all our AI models work well in more than 20 languages, including GPT models like GPT-J and GPT-NeoX: see it here. We also offer advanced multilingual models like spaCy and XLM Roberta.
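As a purely illustrative sketch of how this could look from a client's perspective, here is an example with the NLP Cloud Python client; the model name and the lang parameter used to declare the input language are assumptions, so the official documentation should be considered the reference:

```python
# Illustrative sketch only: the "lang" parameter and the model name are
# assumptions, not a confirmed API; refer to the NLP Cloud documentation.
import nlpcloud

# Instantiate a client for a GPT model, declaring a French input language.
client = nlpcloud.Client("gpt-j", "<your API token>", gpu=True, lang="fr")

# The French prompt is handled transparently by the multilingual add-on.
print(client.generation("Résume les avantages du traitement multilingue du langage naturel :"))
```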
François
Full-stack engineer at NLP Cloud