The natural language processing market is booming and many new tools have recently appeared in the ecosystem. Here are the libraries, frameworks, languages, services, and actors you should know about in order to integrate text understanding and text generation in your project in 2022.
Python has been the de facto standard language in data science for many years. If you are working on a natural language processing project, there will most likely be some Python code somewhere.
Python is an expressive and simple high-level language, which makes it well suited to machine learning applications. Even more importantly, Python benefits from a comprehensive ecosystem of libraries and frameworks that make data scientists' lives easier.
Whether you're working on a research project or a production project, whether you're training new models or using them for inference, you will most likely have to use Python. If you absolutely need to use another language, you may find good libraries there too, but only for basic use cases. For more advanced use cases, the solution is to adopt a microservices strategy and call the model through a REST API.
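As a rough illustration of the microservices approach, here is a minimal sketch that uses only the Python standard library. The `classify_sentiment` function is a hypothetical stand-in for a real model, and the route and JSON field names (`text`, `label`) are assumptions made for this example. The idea is simply that the model lives in a small Python service, and any other language can reach it with a plain HTTP POST.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


def classify_sentiment(text: str) -> str:
    """Hypothetical stand-in for a real model's inference call."""
    positive_words = {"good", "great", "love"}
    return "positive" if positive_words & set(text.lower().split()) else "negative"


class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body sent by the client, whatever language it is written in.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"label": classify_sentiment(payload["text"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # Keep the example quiet.


def start_server() -> HTTPServer:
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(("127.0.0.1", 0), InferenceHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call_service(port: int, text: str) -> dict:
    # This client is in Python for brevity; any language with an HTTP client works.
    connection = HTTPConnection("127.0.0.1", port, timeout=5)
    connection.request(
        "POST",
        "/",
        body=json.dumps({"text": text}),
        headers={"Content-Type": "application/json"},
    )
    result = json.loads(connection.getresponse().read())
    connection.close()
    return result
```

Calling `start_server()` and then `call_service(server.server_address[1], "I love this library")` returns a small JSON object with the predicted label. A production service would typically sit behind a proper web framework and load a real model once at startup.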
Hugging Face Hub is a central repository that stores most of the open-source natural language processing models.
On Hugging Face, it is easy to discover new AI models but also upload and share yours. It is also a great place to browse and find datasets for your next project. Models and datasets can be easily downloaded and used through their Transformers framework (see below).
Hugging Face's vision is to "democratize" natural language processing and become the "GitHub of machine learning".
OpenAI is the company behind GPT-3, one of the most advanced language AI models ever created.
The first two versions of this model (GPT and GPT-2) were open-source, but OpenAI decided not to open-source GPT-3. If you want to use GPT-3, you need to subscribe to the OpenAI API. Only Microsoft has access to the source code of GPT-3, as it purchased an exclusive license.
The GPT models are text generation AI models that are very good at writing text like a human. It is actually quite hard for a human to detect whether a piece of text was written by a real person or by GPT-3...
It cost OpenAI millions of dollars to design and train this new AI. If you want to use it, you will have to go through a demanding validation process, as OpenAI doesn't allow all types of applications to use its model.
New open-source models, such as GPT-J and GPT-NeoX, are now being released in order to catch up with OpenAI.
This is us!
NLP Cloud is an API that lets you easily use the most advanced natural language processing AI models in production.
For example, you can generate text with GPT-J and GPT-NeoX, summarize content with Facebook's BART Large CNN, classify a piece of text with RoBERTa, extract entities with spaCy, translate content with NLLB-200... and much more.
On NLP Cloud it is also possible to train and fine-tune your own AI, or deploy your own in-house models. For example, if you want to create your own medical chatbot based on GPT-J, you simply need to upload your dataset made up of your own examples coming from your industry, then start the training process, and use your final model in production through the API.
DeepSpeed is an open-source framework by Microsoft that focuses on model parallelization.
What does it mean exactly?
AI models are getting bigger and bigger (see GPT-3, GPT-J, GPT-NeoX 20B, T0, Fairseq 13B...). These huge models open the door to tons of new applications, but they are also very hard to run.
Training these models, and reliably running them in production for inference, can be done either through vertical scalability (using large accelerators like NVIDIA A100 GPUs or Google TPUs) or through horizontal scalability (using several small GPUs in parallel).
The second approach is increasingly popular because it is cheaper and scales better. Nevertheless, performing distributed training and inference is far from easy, which is where DeepSpeed really helps.
DeepSpeed originally targeted training tasks, but it is now increasingly used for inference too, as it is easy to use and integrates with Hugging Face Transformers (see below).
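To make the idea of splitting one model across several small devices more concrete, here is a toy sketch in pure Python. It does not use the framework's actual API, and all function names are hypothetical. It only shows the underlying principle: a weight matrix is split column-wise into shards, each simulated device multiplies the input by its own shard, and the partial outputs are concatenated into the same result a single large device would have produced.

```python
def matvec(x, weights):
    """Multiply row vector x by a matrix given as a list of rows."""
    columns = len(weights[0])
    return [sum(x[i] * weights[i][j] for i in range(len(x))) for j in range(columns)]


def shard_columns(weights, num_devices):
    """Split the matrix column-wise into at most num_devices contiguous shards."""
    columns = len(weights[0])
    shard_size = -(-columns // num_devices)  # ceiling division
    return [
        [row[start:start + shard_size] for row in weights]
        for start in range(0, columns, shard_size)
    ]


def parallel_matvec(x, weights, num_devices):
    """Each simulated device handles one shard; the outputs are concatenated."""
    output = []
    for shard in shard_columns(weights, num_devices):
        # On real hardware these multiplications would run concurrently,
        # and each device would only hold its own shard in memory.
        output.extend(matvec(x, shard))
    return output
```

Each device only needs to store a fraction of the weights, which is why several small GPUs can together serve a model that would not fit on any one of them. The hard part in practice is the communication and scheduling between devices, which this sketch leaves out entirely.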
BigScience is a collective of researchers and companies who work on big language models.
Their first workshop produced an AI model called T0 that performs very well at understanding human instructions.
They are now working on much bigger models: their goal is to create open-source multilingual AI models that are bigger and more advanced than GPT-3.
spaCy is a Python natural language processing framework that is well suited for production: it is both fast and easy to play with.
This is a framework maintained by a German AI company called Explosion AI.
spaCy is very good at Named Entity Recognition (also known as entity extraction), and it supports around 50 different languages. It provides pre-trained models, and you can easily create your own models from annotated examples.
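To give a feel for what annotated examples for entity extraction can look like, here is a small pure-Python sketch. It assumes a common convention in which each entity is written as a `(start, end, label)` character span, and it converts those spans into one BIO tag per token. The helper names and the whitespace tokenization are simplifications made for this example, not the framework's own code.

```python
def tokens_with_offsets(text):
    """Whitespace-split text into (token, start, end) character offsets."""
    result = []
    position = 0
    for token in text.split():
        start = text.index(token, position)
        end = start + len(token)
        result.append((token, start, end))
        position = end
    return result


def bio_tags(text, entities):
    """Convert (start, end, label) character spans into one BIO tag per token."""
    tags = []
    for _, start, end in tokens_with_offsets(text):
        tag = "O"  # "outside": the token belongs to no entity
        for ent_start, ent_end, label in entities:
            if start >= ent_start and end <= ent_end:
                # "B-" marks the first token of an entity, "I-" the following ones.
                tag = ("B-" if start == ent_start else "I-") + label
                break
        tags.append(tag)
    return tags


# One hypothetical annotated example: the text plus its entity spans.
text = "John Smith works at Google"
entities = [(0, 10, "PERSON"), (20, 26, "ORG")]
print(bio_tags(text, entities))
```

A training set is then simply many such pairs of text and spans, ideally taken from the same kind of documents the model will see in production.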
The Transformers framework was released by Hugging Face a couple of years ago. Most of the advanced natural language processing models are now based on Transformers.
This is a Python module based on PyTorch, TensorFlow, and JAX that can be used either for training or inference.
Hugging Face Transformers makes it very easy to download models from the Hugging Face Hub and upload models to it.
The tokenizers library from Hugging Face is a set of advanced natural language processing tokenizers, used by transformer-based models.
Tokenization is about splitting an input text into small words or subwords that can then be encoded and processed by the AI model.
Tokenization might sound like a detail, but it is not. It is actually a critical part of natural language processing, and using the right tokenizer makes a huge difference in terms of both result quality and performance.
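To illustrate what splitting text into subwords and encoding them means, here is a minimal pure-Python sketch of greedy longest-match subword tokenization, in the spirit of WordPiece. It is not the library's implementation. The tiny vocabulary, the `##` continuation prefix, and the `[UNK]` token are assumptions chosen for this example.

```python
def tokenize_word(word, vocab, unknown="[UNK]"):
    """Split one word into the longest subwords found in vocab, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unknown]  # no known piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab):
    """Lowercase, split on whitespace, then split each word into subwords."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(tokenize_word(word, vocab))
    return tokens


def encode(tokens, vocab):
    """Map each token to its integer id, which is what the model actually sees."""
    return [vocab[token] for token in tokens]


# Hypothetical toy vocabulary mapping tokens to ids.
vocab = {"[UNK]": 0, "token": 1, "##ization": 2, "is": 3, "fun": 4}
tokens = tokenize("Tokenization is fun", vocab)
print(tokens, encode(tokens, vocab))
```

Here "tokenization" is not in the vocabulary as a whole word, so it is split into `token` and `##ization`. Real tokenizers learn vocabularies with tens of thousands of entries from large corpora, which is a big part of why the choice of tokenizer affects quality and speed.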
NLTK stands for Natural Language Toolkit. It is a Python framework that has been around for many years and that is great for research and education.
NLTK is not a production-oriented framework, but it is perfect for data scientists trying to ramp up on natural language processing.
The natural language processing field evolved considerably in 2021. Today, more and more companies want to use language AI models in production, and it is interesting to see that in 2022 the ecosystem has pretty much nothing to do with what it was 5 years ago.
Libraries and frameworks are getting more and more advanced, and the creation of large language models like GPT-3 raises new interesting challenges.
Can't wait to see what 2023 will be like!
Juliette