LLaMA 3.1 405B is a large language model developed by Meta AI. With 405 billion parameters, it is one of the largest openly available AI models. It is part of the Llama 3.1 family, which includes models of different sizes (8B, 70B, and 405B parameters).
The 405B version is particularly notable for its scale: it aims to match or even surpass top closed-source models like GPT-4 on various benchmarks, indicating state-of-the-art capabilities in language understanding, generation, and other tasks. Llama 3.1 models are also designed with enhanced multilingual support, capable of understanding and generating text in multiple languages, which broadens their applicability across different regions and user bases.
In this article, we show how to install and deploy LLaMA 3.1 405B into production on Google Cloud Platform (GCP) Compute Engine. We first talk about hardware requirements, then instance provisioning on GCP, and deployment and quantization with vLLM.
The hardware requirements for running Llama 3.1 405B are quite extensive due to its size. As usual when deploying LLMs, the most complex part is the GPU: you will need a lot of VRAM (i.e. GPU memory) to deploy this model. In the native 16-bit precision (bf16/fp16), the weights alone take roughly 810GB of VRAM (405 billion parameters x 2 bytes), before even counting the KV cache. In fp8 (8-bit) precision, the weights shrink to roughly 405GB.
Given these needs, you'd typically look at setups like a single node with 8xH100 80GB GPUs (640GB of total VRAM) running the model in fp8, or two such nodes (8xH100 or 8xA100 80GB each) if you want to run it in 16-bit precision.
As usual, you have to be careful with quantization and make sure that the quality of the model does not suffer too much. In our tests, fp8 quantization did not noticeably harm the quality of the model, so that is what we are going to use in this article.
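To get a feel for these numbers, here is a small back-of-the-envelope calculation. It is only a rough sketch: it accounts for weights and KV cache but not activations or framework overhead, and it uses the published Llama 3.1 405B architecture figures (126 layers, 8 KV heads, head dimension of 128):

# Rough VRAM estimate for Llama 3.1 405B (weights + KV cache only)
NUM_PARAMS = 405e9    # 405 billion parameters
NUM_LAYERS = 126      # transformer layers
NUM_KV_HEADS = 8      # grouped-query attention KV heads
HEAD_DIM = 128        # dimension per attention head

def weights_gb(bytes_per_param):
    return NUM_PARAMS * bytes_per_param / 1e9

def kv_cache_mb_per_token(bytes_per_value):
    # 2 tensors (K and V) per layer, each of size num_kv_heads * head_dim
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * bytes_per_value / 1e6

print(f"fp16 weights: {weights_gb(2):.0f} GB")                         # ~810 GB
print(f"fp8 weights: {weights_gb(1):.0f} GB")                          # ~405 GB
print(f"fp16 KV cache: {kv_cache_mb_per_token(2):.2f} MB per token")   # ~0.52 MB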
Google Cloud Platform (GCP) is an interesting provider for deploying and scaling your AI workloads. It is relatively cheap and has a good GPU offering (NVIDIA H100 80GB, NVIDIA A100 80GB, NVIDIA V100, NVIDIA L4, NVIDIA T4...).
It is also quite flexible in terms of instance choices. For example, you can provision instances with one or more H100 GPUs: 1xH100, 2xH100, 4xH100, or 8xH100.
If your account is new, you might not be allowed to provision GPU instances yet. In that case, you will need to go through support and ask for a quota increase.
As a first step you will want to create a new project on GCP. Then you will want to enable the Compute Engine API for your project. You can do this by going to the API Library in the GCP Console and searching for "Compute Engine". Click on it and then click on "Enable" to activate the API.
Once you have enabled the API, you will be able to create a new instance. You can do this by going to the "VM instances" section in the GCP Console and clicking on "Create instance".
You will then be asked to choose a machine type. For LLaMA 3.1 405B in fp8 mode, you will want to choose an a3-highgpu-8g machine, which comes with 8xH100 GPUs.
GCP Instance for LLaMA 3.1 405B
You will then be able to set many details for your VM, like networking, storage, etc. We are not going to review all these settings in this article, but we will focus on the image type and storage.
In order to download the LLaMA 3.1 405B model weights in fp8 format, you will need at least 500GB of disk space. We also recommend using a Linux Deep Learning image with CUDA 12 already installed, as it saves some setup work later. You can configure both in the "OS and storage" section:
GCP Image And Disk Space for LLaMA 3.1 405B
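If you prefer the command line to the console, you can create an equivalent instance with gcloud. This is only a rough sketch: the instance name, zone, image family, and disk size below are placeholder assumptions to verify against the GCP documentation, H100 capacity is limited in many regions, and A3 machines may require additional networking or reservation settings:

gcloud compute instances create llama-405b \
    --zone=us-central1-a \
    --machine-type=a3-highgpu-8g \
    --maintenance-policy=TERMINATE \
    --image-family=common-cu121 \
    --image-project=deeplearning-platform-release \
    --boot-disk-size=1000GB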
vLLM, which stands for Virtual Large Language Model, represents a significant advancement in the field of AI, particularly in how large language models (LLMs) are served and utilized for inference.
vLLM is engineered for high-throughput and low-latency inference, making it ideal for applications where quick and efficient language processing is crucial. It achieves this through innovative techniques like PagedAttention, which optimizes memory usage by managing attention key and value memory more efficiently, allowing for up to 24x higher throughput compared to traditional serving with HuggingFace Transformers.
The core of vLLM's efficiency lies in its memory management. By using PagedAttention, vLLM divides the key-value (KV) cache into blocks, which allows for better memory utilization and reduces fragmentation, a common bottleneck in GPU memory usage for LLMs. This approach not only speeds up processing but also allows for handling more requests simultaneously without significant performance drops.
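To illustrate the idea (this is only a toy sketch of block-based allocation, not vLLM's actual implementation), the KV cache is carved into fixed-size blocks that are handed out to a sequence as it grows, instead of reserving one large contiguous region per sequence up front:

BLOCK_SIZE = 16  # tokens stored per KV cache block

class ToyBlockAllocator:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.seq_blocks = {}   # sequence id -> allocated block ids
        self.seq_tokens = {}   # sequence id -> tokens generated so far

    def append_token(self, seq_id):
        # Only grab a new block when the previous one is full, so memory
        # grows in small increments and fragmentation stays low.
        tokens = self.seq_tokens.get(seq_id, 0)
        if tokens % BLOCK_SIZE == 0:
            self.seq_blocks.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_tokens[seq_id] = tokens + 1

    def free(self, seq_id):
        # Return the blocks of a finished sequence to the shared pool.
        self.free_blocks.extend(self.seq_blocks.pop(seq_id, []))
        self.seq_tokens.pop(seq_id, None)

allocator = ToyBlockAllocator(num_blocks=1024)
for _ in range(40):
    allocator.append_token(seq_id=0)
print(len(allocator.seq_blocks[0]))  # 3 blocks for 40 tokens, nothing pre-reserved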
As an inference and serving engine, vLLM focuses on not just running LLMs but doing so in a way that maximizes resource utilization. It employs techniques like continuous batching of incoming requests, which ensures that the GPU remains fully utilized, thereby reducing idle times and increasing overall efficiency.
vLLM supports various quantization techniques (like GPTQ, AWQ, INT4, INT8, FP8) which reduce the precision of model weights, thereby decreasing memory usage and potentially speeding up inference.
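In practice that means vLLM can either load a checkpoint that was quantized offline, as we do below with the FP8 model prepared by Neural Magic, or quantize supported models itself when you pass a quantization argument. A minimal sketch, assuming a recent vLLM version and shown here on the smaller 8B model (the Meta repo is gated on the HuggingFace Hub, so you need to have been granted access):

from vllm import LLM

# On-the-fly fp8 quantization of a smaller Llama 3.1 model (illustrative only)
llm = LLM("meta-llama/Meta-Llama-3.1-8B-Instruct", quantization="fp8")
outputs = llm.generate("Hello!")
print(outputs[0].outputs[0].text)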
Installing vLLM is relatively easy. Let's connect to our GCP VM instance, and install vLLM using pip:
pip install vllm
We are going to perform distributed inference on 8 x H100 GPUs, so we need to install Ray as well:
pip install ray
If you experience compatibility problems while installing vLLM, it may be simpler for you to compile vLLM from source or use their Docker image: have a look at the vLLM installation instructions.
Let's start with a basic Python example to test our model:
from vllm import LLM

# Load LLaMA 3.1 405B (FP8 checkpoint from Neural Magic) across 8 GPUs
llm = LLM("neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8-dynamic", tensor_parallel_size=8)

# generate() returns a list of RequestOutput objects: print the generated text
outputs = llm.generate("What is the difference between fp8 quantization and int8 quantization?")
print(outputs[0].outputs[0].text)
You can now run the Python script. If this is the first time you run it, you will need to wait for the model to be downloaded and loaded onto the GPUs; you will then receive a response like this:
FP8 (Floating-Point 8) and INT8 (Integer 8) are both quantization techniques used to reduce the precision of numerical values in deep learning models, but they differ in their representation and behavior.
**INT8 Quantization**
INT8 quantization represents numbers as 8-bit integers, which can take on values between -128 and 127 (or 0 and 255 for unsigned integers). This means that the precision of the numbers is limited to 8 bits, and any values outside this range are clipped or saturated.
INT8 quantization is a simple and widely used technique, especially for integer-based architectures like ARM and x86. Most deep learning frameworks, including TensorFlow and PyTorch, support INT8 quantization.
**FP8 Quantization**
FP8 quantization, on the other hand, represents numbers as 8-bit floating-point numbers, with 1 sign bit, 2 exponent bits, and 5 mantissa bits. This allows for a much larger dynamic range than INT8, with values that can be as small as 2^-14 or as large as 2^15.
FP8 quantization is a more recent development, and its main advantage is that it can provide better accuracy than INT8 quantization, especially for models that require a large dynamic range, such as those with batch normalization or depthwise separable convolutions. FP8 is also more suitable for models that are sensitive to quantization noise, like those with recurrent neural networks (RNNs) or long short-term memory (LSTM) networks.
**Key differences**
Here are the key differences between FP8 and INT8 quantization:
1. **Dynamic range**: FP8 has a much larger dynamic range than INT8, which means it can represent a wider range of values.
2. **Precision**: FP8 has a lower precision than INT8, with 5 mantissa bits compared to 8 bits for INT8.
3. **Behavior**: FP8 is more suitable for models that require a large dynamic range, while INT8 is better suited for models with smaller weights and activations.
4. **Hardware support**: INT8 is widely supported by most hardware platforms, while FP8 is still an emerging standard, with limited hardware support.
In summary, FP8 quantization offers better accuracy and a larger dynamic range than INT8 quantization, but it requires more sophisticated hardware support and may not be suitable for all models or applications.
The LLaMA 3.1 405B model has already been quantized in fp8 for vLLM by Neural Magic, so we do not need to perform the quantization again. We simply load the quantized model from the HuggingFace Hub.
The tensor_parallel_size parameter is set to the number of GPUs available on our machine (8 in our case), so that the model weights are split across all of them.
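By default, generate() uses vLLM's default sampling settings. In practice you will usually want to pass a SamplingParams object to control things like temperature and the maximum number of generated tokens (the values below are arbitrary examples):

from vllm import LLM, SamplingParams

llm = LLM("neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8-dynamic", tensor_parallel_size=8)

# Control randomness and response length explicitly
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=300)
outputs = llm.generate("Explain fp8 quantization in one paragraph.", sampling_params)
print(outputs[0].outputs[0].text)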
This simple Python script is not a proper production server though. We will now start the inference server, so that we can serve many requests in parallel and maximize throughput:
python -m vllm.entrypoints.openai.api_server \
--model neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8-dynamic \
--tensor-parallel-size 8
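The server exposes many more options. Two that are often useful for very large models are --max-model-len, which caps the context length (and therefore the size of the KV cache), and --gpu-memory-utilization, which controls how much of each GPU's memory vLLM is allowed to reserve. The values below are only examples to adapt to your workload:

python -m vllm.entrypoints.openai.api_server \
    --model neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8-dynamic \
    --tensor-parallel-size 8 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90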
Once the model is loaded, you can start a second terminal and make some requests:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8-dynamic",
"prompt": "Who are you?"
}'
I'm an artificial intelligence model known as Llama. Llama stands for "Large Language Model Meta AI."
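Since vLLM exposes an OpenAI-compatible API, you can also query the server with the official openai Python client instead of curl. A minimal sketch (the api_key value is a dummy, since our local server does not check it):

from openai import OpenAI

# Point the OpenAI client at our local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

completion = client.completions.create(
    model="neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8-dynamic",
    prompt="Who are you?",
    max_tokens=100,
)
print(completion.choices[0].text)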
LLaMA 3.1 405B is a cutting-edge generative AI model, but deploying it into production is not easy.
The biggest challenge is to find the right hardware. GPUs are very costly and there is a global shortage. But once you manage to provision the right GPUs, deploying the model with an inference server like vLLM is quite easy.
For such large models, you might want to leverage quantization to reduce the VRAM usage and improve the latency, like we did. But be careful: quantization is not always a silver bullet as it can decrease the accuracy of the model.
If you cannot or do not want to deploy LLaMA 3.1 405B by yourself, you can easily use it on NLP Cloud and leverage this great model at scale in production. Try LLaMA 3.1 405B on NLP Cloud now!
If you have questions about LLaMA 3.1 405B and AI in general, please don't hesitate to ask us, it's always a pleasure to advise!
Julien
CTO at NLP Cloud