How To Speed Up Deep Learning Inference For Natural Language Processing Transformers

Use a better CPU or GPU

When you want to speed up Transformer-based Natural Language Processing models, the "naive" approach is to use more advanced hardware.

Most of the time this is an inevitable solution, as pure software-based solutions all have limits. Imagine you're performing inference on a CPU. You may spend weeks working on low-level optimizations for your favorite Transformer-based model, but you will very likely still get a bigger speedup by simply moving your model to an NVIDIA A100 GPU.

These days the most widely used GPUs for inference in production are the NVIDIA Tesla T4 and V100.

NVIDIA Tesla GPU

But what if you have already upgraded your hardware? Or what if your budget is limited and you cannot afford the latest cutting-edge, expensive GPUs? Read on!

Batch Inference

Batch inference is about sending several requests at the same time to your model, so it addresses your requests all at once.

Batch inference is very powerful because your model will take almost the same time to process several requests as it takes to process a single one. Under the hood, some operations are factorized, so that instead of doing everything n times, the model only has to do it once.

Technically speaking, it doesn't decrease the latency of your requests, because your requests won't be addressed faster, but it will dramatically improve the throughput of your application (your application can handle more requests with the same hardware).
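To make the difference between latency and throughput concrete, here is a toy cost model. The numbers (a fixed 40 ms overhead per forward pass plus 2 ms per request) are made-up assumptions rather than measurements, and `batch_stats` is a hypothetical helper, not part of any library.

```python
def batch_stats(batch_size, fixed_ms=40.0, per_request_ms=2.0):
    """Return (latency_ms, throughput_rps) for one forward pass.

    Assumption: a forward pass costs a fixed overhead plus a small
    per-request cost. Real models will behave differently.
    """
    latency_ms = fixed_ms + per_request_ms * batch_size
    throughput_rps = batch_size / (latency_ms / 1000.0)
    return latency_ms, throughput_rps


for size in (1, 8, 32):
    latency, throughput = batch_stats(size)
    print(f"batch={size:>2}  latency={latency:.0f} ms  throughput={throughput:.1f} req/s")
```

Under these assumptions, the latency of each request grows slightly with the batch size, while the number of requests served per second grows much faster.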

Batch inference is not perfect though.

First, it is not always suited for online inference (i.e. for customer-facing applications), because in order to build your batches you will have to buffer some user requests, so some users will have to wait longer than usual.
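The buffering described here can be sketched with a simple queue. This is only an illustration of the idea, not how any particular inference server implements it; `collect_batch` and its parameters are hypothetical names.

```python
import queue
import time


def collect_batch(requests, max_batch_size, max_wait_s):
    """Pull up to max_batch_size items from a queue, waiting at most max_wait_s in total."""
    batch = []
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # waited long enough: ship a partial batch
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # deadline reached with no new request
    return batch


pending = queue.Queue()
for text in ("first request", "second request", "third request"):
    pending.put(text)

print(collect_batch(pending, max_batch_size=2, max_wait_s=0.05))
print(collect_batch(pending, max_batch_size=2, max_wait_s=0.05))
```

The second call shows the trade-off: the last request sits in the buffer for the full waiting time before it is processed in a partial batch.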

Second, batch inference works better for similar requests. For example, if you're deploying a text generation Natural Language Processing model, batching will be more efficient if you create batches of requests that have the same length.
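One way to see why similar lengths matter is to count padding tokens. The sketch below assumes that every request in a batch is padded to the length of the longest one; the token counts are invented for illustration, and both helpers are hypothetical.

```python
def padding_cost(batches):
    """Count padding tokens needed when each batch is padded to its longest request."""
    return sum(max(batch) * len(batch) - sum(batch) for batch in batches)


def make_batches(lengths, batch_size):
    """Split a list of request lengths into consecutive batches of batch_size."""
    return [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]


# Hypothetical token counts for eight incoming requests.
lengths = [12, 480, 15, 500, 9, 460, 14, 510]

arrival_order = make_batches(lengths, batch_size=4)
by_length = make_batches(sorted(lengths), batch_size=4)

print("padding tokens, arrival order:", padding_cost(arrival_order))
print("padding tokens, sorted by length:", padding_cost(by_length))
```

Grouping requests of similar length means the model spends far less compute on padding, at the cost of reordering requests.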

Last of all, batch inference is not performed by your deep learning model itself, but by a higher level layer, like a dedicated inference server. It's not always easy to implement such a layer. For example, NVIDIA's Triton Inference Server (see below) is very good at performing batch inference, but you first need to find a way to make your model compliant with Triton.

Once you manage to export your model into Triton, batch inference is dead simple. For example, here is how to create batches of up to 128 requests and wait for 5 seconds maximum before processing a batch (this goes in the "config.pbtxt" file):

max_batch_size: 128
dynamic_batching {
  max_queue_delay_microseconds: 5000000
}

Leverage Custom Implementations

Many people and companies are working hard on low level optimizations for some Transformer-based Natural Language Processing models. It can be a good idea for you to leverage these custom implementations of your model.

Most of the time these custom implementations are easy to use and require almost no additional work from you. Let me mention some of them here:

• DeepSpeed: this library by Microsoft was initially dedicated to improving the performance of model training, especially on multiple GPUs. They recently released specific model kernels for inference, and it seems they are planning to gradually support more and more models.
• FasterTransformer: this framework was created by NVIDIA in order to make inference of Transformer-based models more efficient. If your model is supported, you will have to build a new implementation of it using their library. Then you can run your model as-is or load it into their Triton Inference Server by using their FasterTransformer Triton Backend.
• TurboTransformers: Tencent created this library for WeChat and then open-sourced it. Please refer to their documentation to see if your model is supported.
• Fast T5: a library dedicated to improving the speed of T5 models.
• CTranslate2: a library dedicated to improving the speed of models based on OpenNMT and FairSeq.

Here is an example of how you can perform inference with the GPT Neo 2.7B model using DeepSpeed. Not too hard, is it?

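# Filename: gpt-neo-2.7b-generation.py
import os

import deepspeed
import torch
from transformers import pipeline

# LOCAL_RANK and WORLD_SIZE are set by the distributed launcher;
# the defaults cover a single-GPU run.
local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

# Load the Hugging Face pipeline on the GPU assigned to this process.
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B',
                     device=local_rank)

# Swap in DeepSpeed's optimized inference kernels, splitting the model
# across world_size GPUs.
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.float,
                                           replace_method='auto')

string = generator("DeepSpeed is", do_sample=True, min_length=50)
# Print from a single process only when running on several GPUs.
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)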

When using one of these custom implementations, don't forget to rigorously test the quality of your new model, as you can never be 100% sure there is no bug in these new implementations.
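A simple first check is to run the same prompts through both implementations and measure how often the outputs match. The sketch below uses two stand-in functions instead of real models; `agreement_rate` is a hypothetical helper. For such a comparison to be meaningful with a text generation model, sampling would need to be disabled so that outputs are deterministic.

```python
def agreement_rate(reference_fn, optimized_fn, prompts):
    """Fraction of prompts for which both implementations return the same output."""
    if not prompts:
        raise ValueError("need at least one prompt")
    matches = sum(reference_fn(p) == optimized_fn(p) for p in prompts)
    return matches / len(prompts)


# Stand-ins for the two implementations; in practice these would wrap the
# original model and the optimized one, both with sampling disabled.
def reference_generate(prompt):
    return prompt.upper()


def optimized_generate(prompt):
    # Deliberately differs on one input to simulate a regression.
    return prompt.upper() if prompt != "c" else "?"


prompts = ["a", "b", "c", "d"]
print(agreement_rate(reference_generate, optimized_generate, prompts))
```

A rate below 1.0 does not always mean there is a bug, since small numerical differences can change a generated token, but it tells you which prompts to inspect by hand.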

Use a Dedicated Inference Engine

Microsoft and NVIDIA have both worked on advanced inference engines in order to improve inference performance.

ONNX Runtime ("ORT"), by Microsoft, is a cross-platform inference and training machine-learning accelerator. TensorRT ("TRT"), by NVIDIA, is an SDK for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning inference applications.

Additionally, the NVIDIA Triton Inference Server is inference serving software that makes AI inference easier by making it possible to deploy AI models from various frameworks. Thanks to Triton you can, for example, perform batch inference easily, run multiple deep learning models concurrently on the same GPU, deploy models on multiple GPUs, and more.

The above seems exciting, but of course it's not that simple... In order for your model to leverage these dedicated inference engines, you first need to convert your existing model into a proper format.

You have several choices:

• Exporting your Transformer-based model to ONNX. Hugging Face tries to make your job easier by providing you with a ready-to-use script for such an export: convert_graph_to_onnx.py. Unfortunately it's not compatible with all models, so you may have to get your hands dirty and write your own export script.
• Exporting your Transformer-based model to TorchScript, using either the tracing or the scripting method. Here too, it won't work for all models and you might have to dig deeper into low-level code.
• Converting your ONNX model to TensorRT for better performance and even further optimizations (like quantization). But this requires your model to be exported to ONNX first.

When playing with the above methods, you should be very careful with the quality of your exported models. You might think that you successfully exported your model, but it's possible that you lost some accuracy in the process, so be very rigorous about how you test your newly exported model.
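One common sanity check is to feed the same input to the original and the exported model and compare the raw numerical outputs. The sketch below uses made-up logits rather than real model outputs, and the tolerance is an assumption that you would need to tune for your task and precision; `max_abs_diff` is a hypothetical helper.

```python
def max_abs_diff(reference, exported):
    """Largest element-wise absolute difference between two equally sized output vectors."""
    if len(reference) != len(exported):
        raise ValueError("outputs have different lengths")
    return max(abs(r - e) for r, e in zip(reference, exported))


# Made-up logits for one input, standing in for real model outputs.
reference_logits = [2.31, -0.57, 0.04, 1.18]
exported_logits = [2.3102, -0.5699, 0.0401, 1.1797]

tolerance = 1e-3  # assumption: pick a threshold that suits your task and precision
diff = max_abs_diff(reference_logits, exported_logits)
print("within tolerance:", diff <= tolerance)
```

Such a check only covers the inputs you try, so it is worth running it on a representative sample and completing it with task-level metrics.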

Conclusion

Modern Transformer-based Natural Language Processing models give impressive results, so more and more companies want to use them in production. But very often their performance turns out to be disappointing...

Working on improving the speed of your predictions is crucial but, as you can see above, there is no one-size-fits-all solution.

If you have questions about how to speed up your inference, please don't hesitate to contact us! Or don't bother with infrastructure and simply subscribe to NLP Cloud!

Abhinav
Devops engineer at NLP Cloud