Tokenization and Lemmatization API, Based on spaCy

What is Tokenization?

Tokenization is about splitting a text into smaller units called tokens. What counts as a token depends on the type of tokenizer you're using. A token can be a word, a character, or a subword (for example, the English word "higher" contains 2 subwords: "high" and "er"). Punctuation marks like "!", ".", and ";" can be tokens too.
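To make the idea of token granularity concrete, here is a minimal sketch of a word-level and a character-level tokenizer using only Python's standard library. The function names are hypothetical and chosen for this illustration. This is not how spaCy tokenizes text; a production tokenizer handles many more cases (contractions, URLs, language-specific rules).

```python
import re

# Hypothetical helper names, used for illustration only.
def word_tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text):
    """Split text into one token per non-whitespace character."""
    return [ch for ch in text if not ch.isspace()]

sentence = "Higher prices; fewer buyers!"
print(word_tokenize(sentence))
print(char_tokenize("NLP!"))
```

With the word-level function, the punctuation marks ";" and "!" come out as separate tokens, which matches the description above.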

Tokenization is a fundamental step in most Natural Language Processing operations. Because languages are structured differently (for example, not all of them separate words with spaces), tokenization works differently in each language.

What is Lemmatization?

Lemmatization is about extracting the base form of a word (typically the kind of word you could find in a dictionary). For example, the lemma of "apple" would still be "apple" but the lemma of "is" would be "be".
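The examples above can be sketched with a simple lookup table. This is a toy illustration only: the table and the function name are invented for this example, and real lemmatizers such as the one in spaCy rely on much larger resources and on part-of-speech information.

```python
# Toy lookup table, invented for this example; it is not taken from spaCy.
LEMMA_TABLE = {
    "is": "be",
    "are": "be",
    "was": "be",
    "apples": "apple",
}

def lemmatize(token):
    """Return the lemma of a token, or the lowercased token if it is unknown."""
    key = token.lower()
    return LEMMA_TABLE.get(key, key)

for word in ["apple", "is", "Apples"]:
    print(word, "->", lemmatize(word))
```

Words that are already in their base form, like "apple", are returned unchanged, while "is" maps to "be".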

Lemmatization, like tokenization, is a fundamental step in every Natural Language Processing operation. Given the various existing language structures, lemmatization is different in every language.


Why Use Tokenization and Lemmatization?

You usually don't use tokenization and lemmatization alone but as a first step in a natural language processing pipeline. Tokenization is often a costly operation that can significantly impact the performance of a Natural Language Processing model, so the choice of the tokenizer is important.
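As a rough sketch of what "first step in a pipeline" can mean, the snippet below tokenizes a text, lemmatizes each token, and then counts the lemmas, which is the kind of input a simple downstream model might consume. The lemma table and function names are assumptions made up for this example, and the regular expression stands in for a real tokenizer.

```python
import re
from collections import Counter

# Toy lemma table, invented for this example.
TOY_LEMMAS = {"is": "be", "are": "be", "cats": "cat"}

def preprocess(text):
    """Step 1: tokenize into lowercase words. Step 2: map each token to its lemma."""
    tokens = re.findall(r"\w+", text.lower())
    return [TOY_LEMMAS.get(token, token) for token in tokens]

def bag_of_words(text):
    """Step 3: count lemmas, a simple feature representation for later stages."""
    return Counter(preprocess(text))

counts = bag_of_words("The cat is here. The cats are there.")
print(counts)
```

Because "cat" and "cats" share a lemma, they are counted together, which is one reason lemmatization is applied before later stages.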

NLP Cloud's Tokenization and Lemmatization API

NLP Cloud offers a tokenization and lemmatization API that allows you to perform tokenization and lemmatization out of the box, based on spaCy and GiNZA, with excellent performance. Tokenization and lemmatization are not very resource intensive, so the response time (latency) when performing them through the NLP Cloud API is very low. You can do it in 15 different languages.

For more details, see our documentation about tokenization and lemmatization.

Frequently Asked Questions

What is tokenization and why is it important in text analysis?

Tokenization is the process of breaking down text into smaller units, such as words, phrases, or symbols, known as tokens. It is crucial in text analysis for structuring data, enabling more accurate parsing, and facilitating tasks like sentiment analysis and topic modeling.

How does lemmatization differ from stemming, and why would I choose one over the other?

Lemmatization involves reducing a word to its base or dictionary form, taking into account its meaning and part of speech, whereas stemming simply removes prefixes and suffixes without considering context. You might choose lemmatization for tasks requiring high linguistic accuracy, like sentiment analysis, and stemming for faster processing in applications where perfect accuracy is less critical.
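The difference can be illustrated with a deliberately naive suffix-stripping stemmer next to a dictionary lookup. Both are simplified sketches with hypothetical names and a made-up lemma table; they are not the algorithms used by spaCy or by established stemmers such as Porter's.

```python
# Suffixes tried in order by the naive stemmer (an assumption for this sketch).
SUFFIXES = ("ing", "es", "ed", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemma table, invented for this example.
LEMMAS = {"studies": "study", "was": "be", "running": "run"}

def lookup_lemma(word):
    """Return the dictionary form if known, else the lowercased word."""
    word = word.lower()
    return LEMMAS.get(word, word)

for w in ["studies", "was", "running"]:
    print(w, "| stem:", naive_stem(w), "| lemma:", lookup_lemma(w))
```

The stemmer is fast but can return strings that are not real words (such as "studi"), while the lemma lookup returns dictionary forms at the cost of needing linguistic knowledge.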

What is spaCy?

spaCy is an open-source software library for advanced natural language processing (NLP), designed specifically for production use. It offers pre-trained statistical models and word vectors, and supports tokenization, named entity recognition, part of speech tagging, and dependency parsing among other NLP capabilities.

What is GiNZA?

GiNZA is an open-source Natural Language Processing (NLP) library for Japanese, built on top of spaCy. It provides advanced NLP features such as tokenization, lemmatization, and named entity recognition tailored specifically for the Japanese language.

What are the supported languages or locales for this tokenization/lemmatization API?

Our tokenization/lemmatization API, based on spaCy and GiNZA, supports 15 languages.

Can I try the tokenization/lemmatization API for free?

Yes, like all the API endpoints on NLP Cloud, the tokenization/lemmatization API can be tested for free.

How does your AI API handle data privacy and security during the tokenization/lemmatization process?

NLP Cloud is focused on data privacy by design: we do not log or store the content of the requests you make on our API. NLP Cloud is both HIPAA and GDPR compliant.