Language Detection

What is Language Detection?

Language detection is the task of automatically identifying the language in which a text was written. When a piece of text mixes several languages, it is also possible to detect each of them.

Let's say you have the following block of text:

NLP Cloud is an easy way to leverage Natural Language Processing in production. The API has been released early January 2021. Cette API est à la fois peu onéreuse et très robuste.

As you can see, this text contains 2 languages: English and French. Around 2/3 of the text is in English, and 1/3 is in French.

If we perform language detection on this text, we get both languages along with the proportion of the text written in each, something like this: english: 0.66 and french: 0.33.
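To make the idea of per-language proportions concrete, here is a minimal sketch in Python. It is a toy and not how NLP Cloud or any production detector works: it assumes a tiny hand-picked stopword list per language (`STOPWORDS`), labels each sentence by counting stopword hits, and reports the share of sentences per language. The names `guess_language` and `language_proportions` are made up for this illustration.

```python
import re
from collections import Counter

# Tiny, hand-picked stopword lists (an assumption made for this sketch only;
# real detectors rely on statistical models trained on large corpora).
STOPWORDS = {
    "english": {"the", "is", "an", "in", "to", "has", "been", "and", "of"},
    "french": {"cette", "est", "à", "la", "et", "très", "le", "les", "une", "peu"},
}

SAMPLE_TEXT = (
    "NLP Cloud is an easy way to leverage Natural Language Processing in production. "
    "The API has been released early January 2021. "
    "Cette API est à la fois peu onéreuse et très robuste."
)


def guess_language(sentence):
    """Return the language whose stopwords appear most often, or 'unknown'."""
    words = re.findall(r"\w+", sentence.lower())
    scores = {
        language: sum(word in stopwords for word in words)
        for language, stopwords in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def language_proportions(text):
    """Split text into sentences and return the share of sentences per language."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return {}
    counts = Counter(guess_language(s) for s in sentences)
    return {language: count / len(sentences) for language, count in counts.items()}


if __name__ == "__main__":
    for language, share in language_proportions(SAMPLE_TEXT).items():
        print(f"{language}: {share:.2f}")
```

On the sample text, two of the three sentences are labeled English and one French, which is in line with the rough 2/3 and 1/3 split described above. Counting sentences is only one possible way to measure proportions; counting words or characters would be just as reasonable.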

Why Use Language Detection?

Language detection is useful in many scenarios. Here are a few examples.

Multilingual Support

Companies that can afford it offer customer support in multiple languages. To route each incoming message to the right support agent, the language of the message must first be detected automatically.
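As a rough sketch of that triage step, the Python snippet below takes the scores returned by some language detector and picks a support queue. Everything here is hypothetical: the queue names, the language codes, the `min_score` threshold, and the shape of the `scores` dictionary are assumptions for illustration, not part of any specific API.

```python
# Hypothetical mapping from detected language code to a support queue.
SUPPORT_QUEUES = {
    "en": "support-english",
    "fr": "support-french",
    "es": "support-spanish",
}
# Queue used when the language is unsupported or the detection is too uncertain.
FALLBACK_QUEUE = "support-general"


def pick_queue(scores, min_score=0.5):
    """Pick a support queue from detector scores such as {"fr": 0.9, "en": 0.1}."""
    if not scores:
        return FALLBACK_QUEUE
    # Keep the language with the highest score.
    language, score = max(scores.items(), key=lambda item: item[1])
    if score < min_score:
        return FALLBACK_QUEUE
    return SUPPORT_QUEUES.get(language, FALLBACK_QUEUE)


if __name__ == "__main__":
    print(pick_queue({"fr": 0.9, "en": 0.1}))
    print(pick_queue({"de": 0.95}))
```

Sending uncertain or unsupported messages to a general queue is one possible design choice; it avoids handing a message to an agent who cannot read it.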

Machine Translation

Language detection is often the first step in machine translation: in general, you first need to detect the language of the source text, and then translate the text with the matching translation model.

First Step in an NLP Workflow

It is often useful to perform language detection as a first step, in order to know which model to use later. For example, let's say that you have entity extraction (NER) models in several languages. Before choosing one of them, you need to know the language of your text.
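The model selection described above could look something like the following Python sketch. The registry `NER_MODELS`, the model names, and the `detect_language` callable are all placeholders invented for this example; in practice the detector would be a real language detection model or API call.

```python
# Hypothetical registry mapping a language code to the name of an NER model.
NER_MODELS = {
    "en": "ner-model-english",
    "fr": "ner-model-french",
    "es": "ner-model-spanish",
}


def select_ner_model(text, detect_language):
    """Detect the language of `text`, then return (language, model name).

    `detect_language` is any callable that takes a string and returns a
    language code; it is passed in so the sketch stays detector-agnostic.
    """
    language = detect_language(text)
    if language not in NER_MODELS:
        raise ValueError(f"No NER model available for language: {language!r}")
    return language, NER_MODELS[language]


if __name__ == "__main__":
    # Stand-in detector that always answers "fr", for demonstration only.
    print(select_ner_model("Cette API est très robuste.", lambda text: "fr"))
```

Raising an error for an unsupported language is just one option; falling back to a multilingual model, if one is available, would be another.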