Deploy LLaMA 3, Mistral, and Mixtral on AWS EC2 with vLLM

Many advanced open-source LLMs have been released over the last couple of years, but deploying these models into production is still a technical challenge. In this article, we show how to deploy some of the best recent LLMs on AWS EC2: LLaMA 3 70B, Mistral 7B, and Mixtral 8x7B. We will use vLLM, an advanced inference engine that supports batch inference in order to maximise throughput.

LLaMA 3, Mistral, and Mixtral

Meta has created and released the LLaMA 3 family of large language models (LLMs), a collection of pre-trained and instruction-tuned generative text models. These models come in two sizes: 8 billion and 70 billion parameters.

Mistral AI, a startup co-founded by former researchers from Google DeepMind and Meta, made a significant entrance into the world of LLMs with Mistral 7B and then Mixtral 8x7B.

What makes Mistral 7B particularly impressive is its performance. In various benchmarks, it has outperformed Llama 2 13B and even exceeded Llama 1 34B on many metrics. This suggests that Mistral 7B provides similar or better capabilities with a significantly lower computational overhead. When it comes to coding tasks, Mistral 7B competes with CodeLlama 7B, and its compact size (around 13.4 GB of fp16 weights) enables it to run on standard machines.

Mixtral 8x7B is a versatile and fast model suitable for a wide range of applications. It delivers roughly six times faster inference than Llama 2 70B while matching or outperforming it on most benchmarks. The model handles multiple languages, has native coding abilities, and can manage sequences up to 32k tokens in length.

All these models are open-source and you can deploy them on your own server if you manage to get access to the right hardware. Let's see how to deploy them on AWS EC2 with vLLM.

Batch Inference and Multi-GPU Inference with vLLM

vLLM is a fast and easy-to-use library tailored for efficient LLM inference and serving. The performance of vLLM comes from several advanced techniques, such as paged attention for efficient management of attention key and value memory, continuous batching of incoming requests, and customized CUDA kernels.

Moreover, vLLM offers good flexibility through distributed inference (via tensor parallelism), output streaming, and accommodation for both NVIDIA and AMD GPU architectures.

In particular, vLLM will be very helpful for deploying LLaMA 3, Mistral, and Mixtral, because it allows us to deploy our models on AWS EC2 instances that embed several smaller GPUs (like the NVIDIA A10) instead of one single big GPU (like the NVIDIA A100 or H100). vLLM also allows us to dramatically increase the throughput of our models thanks to batch inference.

Provision the Right Hardware on AWS EC2

Deploying LLMs is a challenge for many reasons: VRAM (GPU memory) usage, inference speed, throughput, disk space usage... Here we need to make sure that we will provision a GPU instance on AWS EC2 that has enough VRAM to run our models.

G5 instances are a good choice because they give you access to modern NVIDIA A10 GPUs and can scale up to 192 GB of VRAM (the g5.48xlarge instance has 8 GPUs with 24 GB of VRAM each), while remaining quite cost effective.

Mistral 7B is the easiest model to deploy, as it requires around 14 GB of VRAM. Then comes Mixtral 8x7B at around 110 GB and LLaMA 3 70B at around 140 GB. These figures assume fp16 weights, not fp32, and no quantization of any sort.
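To see where these figures come from: fp16 stores 2 bytes per parameter, so the weights alone take roughly 2 GB per billion parameters, and the KV cache and activations then need some extra headroom on top (which is why the Mixtral figure above is higher than the raw weight size). Here is a minimal back-of-the-envelope sketch, with approximate parameter counts:

# Rough VRAM needed for the model weights in fp16: 2 bytes per parameter.
# KV cache and activations require extra headroom on top of this.
def fp16_weights_gb(n_params_billion: float) -> float:
    return n_params_billion * 2  # ~2 GB per billion parameters

for name, billions in [("Mistral 7B", 7.3), ("Mixtral 8x7B", 46.7), ("LLaMA 3 70B", 70.6)]:
    print(f"{name}: ~{fp16_weights_gb(billions):.0f} GB of weights")
# Mistral 7B: ~15 GB, Mixtral 8x7B: ~93 GB, LLaMA 3 70B: ~141 GB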

Consequently, Mistral 7B can run on a g5.xlarge instance, but Mixtral 8x7B and LLaMA 3 70B require a g5.48xlarge instance, so we are going to provision a g5.48xlarge instance in this tutorial.

In order to provision such an instance, log into your AWS EC2 console and launch a new instance: select the NVIDIA Deep Learning AMI on a g5.48xlarge instance type. You will need at least 300 GB of disk space.

Deep Learning AMI on G5 instance on AWS
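If you prefer to script the provisioning instead of clicking through the console, here is a minimal sketch using boto3. The AMI ID, key pair, and security group below are placeholders that you have to replace with your own values (Deep Learning AMI IDs are region-specific), and the root device name depends on the AMI:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Placeholder values: replace the AMI ID, key pair, and security group with your own.
response = ec2.run_instances(
    ImageId="ami-xxxxxxxxxxxxxxxxx",  # Deep Learning AMI ID for your region
    InstanceType="g5.48xlarge",
    KeyName="my-key-pair",
    SecurityGroupIds=["sg-xxxxxxxx"],
    MinCount=1,
    MaxCount=1,
    BlockDeviceMappings=[{
        "DeviceName": "/dev/sda1",  # root device name depends on the AMI
        "Ebs": {"VolumeSize": 300, "VolumeType": "gp3"},  # at least 300 GB of disk
    }],
)
print(response["Instances"][0]["InstanceId"])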

Install vLLM for Distributed Inference

vLLM installation is quite straightforward. Let's open an SSH connection to our newly created AWS instance, and install vLLM with pip:

pip install vllm

As we will use vLLM for distributed inference on 8 x A10 GPUs, we need to install Ray too:

pip install ray

In case of compatibility issues during the installation process, it might be easier for you to build vLLM from source or use their Docker image: see the installation documentation for more details.

Create the Inference Script

You can now create your first inference script. Create a Python file that contains the following:

from vllm import LLM

# Replace the model name with the one you want to use:
# Mixtral base: mistralai/Mixtral-8x7B-v0.1
# Mixtral instruct: mistralai/Mixtral-8x7B-Instruct-v0.1
# Mistral 7B base: mistralai/Mistral-7B-v0.1
# Mistral 7B instruct: mistralai/Mistral-7B-Instruct-v0.1
# LLaMA 3 70B base: meta-llama/Meta-Llama-3-70B
# LLaMA 3 70B instruct: meta-llama/Meta-Llama-3-70B-Instruct
llm = LLM("mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=8)

# generate() returns a list of RequestOutput objects; print the generated text.
outputs = llm.generate("What is batch inference?")
print(outputs[0].outputs[0].text)

Now run your script with Python, which returns something like this:

Batch inference is the process of applying machine learning models to a batch of data inputs all at once, rather than processing each input individually in real-time. In batch inference, a large set of inputs is collected and then processed together as a group, or "batch," by the machine learning model.

Batch inference is often used in scenarios where real-time predictions are not required, and where there is a large volume of data that needs to be processed. It can be more efficient and cost-effective than real-time inference, as it allows for the efficient use of computing resources and can reduce the latency associated with processing individual inputs.

Batch inference is commonly used in applications such as image recognition, natural language processing, and predictive analytics. For example, a company might use batch inference to analyze customer data and generate insights about purchasing patterns, or to analyze sensor data from industrial equipment to identify potential maintenance issues.

In summary, batch inference is the process of applying machine learning models to a batch of data inputs all at once, as an alternative to real-time inference. It is commonly used in scenarios where real-time predictions are not required and where there is a large volume of data that needs to be processed efficiently.

As you can see, this is a piece of cake. You simply have to adapt tensor_parallel_size to the number of GPUs available on your instance.
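Note that llm.generate also accepts a list of prompts, which is where vLLM's batch inference really shines in terms of throughput. Here is a minimal sketch (the prompts and sampling settings are purely illustrative):

from vllm import LLM, SamplingParams

# Illustrative prompts and sampling settings.
prompts = [
    "What is batch inference?",
    "Explain tensor parallelism in one paragraph.",
    "Give three use cases for open-source LLMs.",
]
sampling_params = SamplingParams(temperature=0.7, max_tokens=200)

llm = LLM("mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=8)

# vLLM processes the whole list as a batch, maximizing GPU utilization.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt)
    print(output.outputs[0].text)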

The above was an offline, one-off scenario. Now we want to start a proper inference server that can handle multiple requests and perform batch inference on the fly. First, start the server:

python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mixtral-8x7B-Instruct-v0.1 \
--tensor-parallel-size 8

After some time, once the model is correctly loaded in VRAM, you can open a second shell window and make some requests:

curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
    "prompt": "What is batch inference?",
    "max_tokens": 500
}'

This will return the same kind of result as before, but this time you can send several requests at the same time and vLLM will batch them on the fly.
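For example, here is a small sketch that sends several requests concurrently using the official openai Python client pointed at the local vLLM endpoint (the api_key value is a dummy, since vLLM does not check it unless you configure one):

from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

# vLLM exposes an OpenAI-compatible API, so the openai client works out of the box.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

prompts = [
    "What is batch inference?",
    "What is tensor parallelism?",
    "What is paged attention?",
]

def complete(prompt: str) -> str:
    response = client.completions.create(
        model="mistralai/Mixtral-8x7B-Instruct-v0.1",
        prompt=prompt,
        max_tokens=200,
    )
    return response.choices[0].text

# Concurrent requests are batched together by vLLM on the GPU.
with ThreadPoolExecutor(max_workers=len(prompts)) as executor:
    for answer in executor.map(complete, prompts):
        print(answer)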

Conclusion

Leveraging an advanced inference server like vLLM is beneficial if you want to maximize the utilization of your GPU and easily deploy your model on several GPUs in parallel.

As you can see, it is quite easy to deploy the most advanced open-source AI models like LLaMA 3, Mistral, and Mixtral, on your own server thanks to this technique.

In this tutorial we used AWS EC2, but we could of course have used other vendors. The main challenges will be the cost of the GPUs and their availability.

If you do not want to deploy such LLMs by yourself, we recommend that you use our NLP Cloud API instead. It will save you a lot of time and might even be cheaper than deploying your own LLMs. If not done yet, feel free to have a try!

Vincent
Developer Advocate at NLP Cloud