Tokenization and Lemmatization API

What is Tokenization?

Tokenization is the process of splitting a text into smaller units called tokens. What a token is depends on the tokenizer you're using: it can be a word, a character, or a sub-word (for example, the English word "higher" contains two sub-words: "high" and "er"). Punctuation marks like "!", ".", and ";" can be tokens too.
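As a quick illustration, here is a minimal Python sketch (not a production tokenizer) showing the same sentence split into word-level and character-level tokens; the sample sentence and the regular expression are just assumptions made for the example.

# A minimal illustration of different token granularities (not a production
# tokenizer): the same sentence split into word-level and character-level tokens.
import re

text = "Higher accuracy matters!"

# Word-level tokens: keep runs of word characters, and punctuation as separate tokens.
word_tokens = re.findall(r"\w+|[^\w\s]", text)
print(word_tokens)   # ['Higher', 'accuracy', 'matters', '!']

# Character-level tokens: every (non-space) character becomes a token.
char_tokens = list(text.replace(" ", ""))
print(char_tokens)   # ['H', 'i', 'g', 'h', 'e', 'r', ...]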

Tokenization is a fundamental step in almost every Natural Language Processing operation. Because languages are structured differently, tokenization works differently in each language.

What is Lemmatization?

Lemmatization is about extracting the basic form of a word (typically the kind of entry you would find in a dictionary). For example, the lemma of "apple" is still "apple", but the lemma of "is" is "be".
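As a quick illustration, here is a minimal sketch of lemmatization with spaCy, assuming the en_core_web_sm English model is installed (python -m spacy download en_core_web_sm); the sample sentence is only an example.

# A minimal sketch of lemmatization with spaCy, assuming the en_core_web_sm
# model has been installed with: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The apple is red")

# Print each token together with its lemma (basic form).
for token in doc:
    print(token.text, "->", token.lemma_)
# The -> the
# apple -> apple
# is -> be
# red -> red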

Lemmatization, like tokenization, is a fundamental step in many Natural Language Processing operations. Because languages are structured differently, lemmatization also works differently in each language.

Why Use Tokenization and Lemmatization?

You usually don't use tokenization and lemmatization on their own, but as the first step in a Natural Language Processing pipeline. Tokenization can be a costly operation that significantly impacts the performance of a Natural Language Processing model, so the choice of tokenizer matters.

Tokenization and Lemmatization with spaCy and Ginza

spaCy is an excellent Natural Language Processing framework that performs fast and accurate tokenization in many languages (see more here). The Ginza model, based on spaCy and released by Megagon Labs, performs extremely well on Japanese (see the project here).
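As a quick sketch, here is how Japanese tokenization and lemmatization might look with Ginza, assuming the ja_ginza model is installed (pip install ginza ja_ginza); the sample sentence is only an example.

# A sketch of Japanese tokenization and lemmatization with Ginza, assuming the
# ja_ginza model has been installed with: pip install ginza ja_ginza
import spacy

nlp = spacy.load("ja_ginza")
doc = nlp("私は東京に住んでいます。")  # "I live in Tokyo."

# Print each Japanese token with its lemma.
for token in doc:
    print(token.text, "->", token.lemma_)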

Tokenization and Lemmatization Inference API

Building an inference API for tokenization and lemmatization is a useful step that can make Natural Language Processing work much easier. With an API, you can automate tokenization and lemmatization and call them from any programming language, not only Python.

NLP Cloud's Tokenization and Lemmatization API

NLP Cloud offers a tokenization and lemmatization API that lets you perform these operations out of the box, based on spaCy and Ginza, with excellent performance. Tokenization and lemmatization are not very resource intensive, so the response time (latency) when calling them through the NLP Cloud API is very good. You can use the API in 15 different languages.
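As a hedged sketch, here is how a call to the tokenization and lemmatization endpoint might look with the nlpcloud Python client (pip install nlpcloud); the model name, the tokens() method, and the response fields shown below are assumptions for illustration, so refer to the documentation for the exact names.

# A sketch of calling the NLP Cloud tokenization/lemmatization endpoint through
# the nlpcloud Python client. The model name ("en_core_web_lg"), the tokens()
# method, and the response fields are assumptions; check the docs for exact names.
import nlpcloud

client = nlpcloud.Client("en_core_web_lg", "<your API token>")
response = client.tokens("John Doe is a Go developer at Google.")

# Assumed response shape: a list of tokens, each with its text and lemma.
for token in response["tokens"]:
    print(token["text"], "->", token["lemma"])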

For more details, see our documentation about tokenization and lemmatization here.