Advanced Transformer-based deep learning models for Natural Language Processing give impressive results, but achieving high-speed inference is hard. In this article we summarize the best options you have if you want to decrease the latency of your predictions in production.
When one wants to improve the speed of Transformer based Natural Language Processing models, the "naive" approach is to use more advanced hardware.
Most of the time this is unavoidable, as pure software-based solutions all have limits. Imagine you're performing inference on a CPU: you may spend weeks working on low-level optimizations for your favorite Transformer-based model, and it's still very likely that you would get a better speedup by simply moving your model to an NVIDIA A100 GPU.
These days the most widely used GPUs for inference in production are the NVIDIA Tesla T4 and V100.
But what if you already upgraded your hardware? Or what if your budget is limited and you cannot afford the latest cutting-edge GPUs? Read on!
Batch inference is about sending several requests to your model at the same time, so that it processes them all at once.
Batch inference is very powerful because it takes almost the same time for your model to address several requests as it takes to address a single one. Under the hood, some operations are factorized, so that instead of doing everything n times, the model only has to do it once.
Technically speaking, it doesn't decrease the latency of your requests, because your requests won't be addressed faster, but it will dramatically improve the throughput of your application (your application can handle more requests with the same hardware).
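This factorization can be illustrated with a toy NumPy sketch (a single weight matrix stands in for the model here, purely for illustration): passing a stack of requests through one matrix multiply gives exactly the same outputs as looping over the requests one by one, but in a single large operation that hardware handles far more efficiently.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "layer": one weight matrix standing in for a model's forward pass.
weights = rng.standard_normal((512, 512))

# 8 separate requests, each a 512-dimensional input vector.
requests = [rng.standard_normal(512) for _ in range(8)]

# Naive serving: one forward pass per request.
one_by_one = [request @ weights for request in requests]

# Batch inference: stack the requests and do a single forward pass.
batch = np.stack(requests)    # shape (8, 512)
batched = batch @ weights     # one matrix multiply for all 8 requests

# Both approaches produce the same outputs; the batched version
# simply factorizes the work into one large operation.
assert np.allclose(np.stack(one_by_one), batched)
```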
Batch inference is not perfect though.
First, it is not always suited to online inference (i.e. customer-facing applications), because in order to build your batches you will have to buffer some user requests, so some users will have to wait longer than usual.
The second challenge is that batch inference works best for similar requests. For example, if you're deploying a text generation Natural Language Processing model, batching will be more efficient if you create batches of requests that have the same length.
Last of all, batch inference is not performed by your deep learning model itself, but by a higher level layer, like a dedicated inference server. It's not always easy to implement such a layer. For example, NVIDIA's Triton Inference Server (see below) is very good at performing batch inference, but you first need to find a way to make your model compliant with Triton.
Once you manage to get your model into Triton, batch inference is dead simple. For example, here is how to create batches of up to 128 requests, and wait at most 5 seconds before processing the batch (this should be put in the "config.pbtxt" file).
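A minimal sketch of the relevant part of "config.pbtxt" might look like this, using Triton's dynamic batching settings (note that the delay is expressed in microseconds):

```
max_batch_size: 128
dynamic_batching {
  max_queue_delay_microseconds: 5000000
}
```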
Many people and companies are working hard on low-level optimizations for some Transformer-based Natural Language Processing models. It can be a good idea for you to leverage these custom implementations of your model.
Most of the time these custom implementations are easy to use and require almost no additional work from you. Let me mention some of them here:
Here is an example of how you can perform inference with a GPT Neo 2.7B model thanks to DeepSpeed. Not too hard, is it?
# Filename: gpt-neo-2.7b-generation.py
import os
import torch
import deepspeed
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B',
                     device=local_rank)
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.float,
                                           replace_method='auto')
string = generator("DeepSpeed is", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)
You can then launch this script with the DeepSpeed launcher, for example: deepspeed --num_gpus 1 gpt-neo-2.7b-generation.py
When using one of these custom implementations, don't forget to rigorously test the quality of your new model, as you can never be 100% sure that there are no bugs in these new implementations.
Microsoft and NVIDIA both worked on advanced inference engines in order to improve inference performance.
ONNX Runtime ("ORT"), by Microsoft, is a cross-platform inference and training machine-learning accelerator (see here). TensorRT ("TRT"), by NVIDIA, is an SDK for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning inference applications (see here).
Additionally, the NVIDIA Triton Inference Server is an inference serving software that makes AI inference easier by making it possible to deploy AI models from various frameworks. Thanks to Triton you can for example perform batch inference easily, run multiple deep learning models concurrently on the same GPU, deploy models on multiple GPUs, and more. See here.
The above seems exciting, but of course it's not that simple... In order for your model to leverage these dedicated inference engines, you first need to convert your existing model into a proper format.
You have several choices:
When playing with the above methods, you should be very careful with the quality of your exported models. You might think that you successfully exported your model, but it's possible that you lost some accuracy in the process, so be very rigorous about how you test your new exported model.
Modern Transformer-based Natural Language Processing models give impressive results, so more and more companies want to use them in production. But very often it turns out that performance is disappointing...
Working on improving the speed of your predictions is crucial but, as you saw above, there is no one-size-fits-all solution.
If you have questions about how to speed up your inference, please don't hesitate to contact us! Or don't bother with infrastructure and simply subscribe to NLP Cloud!
Devops engineer at NLP Cloud