Installing and Deploying LLaMA 3.1 405B Into Production on GCP Compute Engine

LLaMA 3.1 405B: Hardware Requirements

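Llama 3.1 405B is demanding to run because of its size and complexity. As always when deploying LLMs, the hardest part is the GPUs. The model needs a very large amount of VRAM (GPU memory):

In fp16 mode (the default mode for this model), the model requires around 972GB of VRAM
In fp8 mode, using quantization, it requires around 486GB of VRAM
With 4-bit quantization, the VRAM requirement drops further to around 243GB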

Given these requirements, you would typically consider the following setups:

8 x AMD MI300 192GB GPUs for 16-bit mode.
8 x NVIDIA A100/H100 80GB GPUs for 8-bit mode.
4 x NVIDIA A100/H100 80GB GPUs for 4-bit mode.
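The figures above can be reproduced with a little arithmetic. The sketch below assumes that memory is dominated by the weights (parameters times bytes per parameter) plus roughly 20% of headroom. The 1.2 overhead factor is our assumption, chosen because it matches the numbers quoted above. It is not an official sizing rule, and real usage also depends on context length and batch size.

```python
import math

PARAMS_BILLION = 405   # parameter count of LLaMA 3.1 405B, in billions
OVERHEAD = 1.2         # assumption: ~20% headroom on top of the raw weights


def estimate_vram_gb(bits_per_weight, params_billion=PARAMS_BILLION):
    """Weight memory in GB, scaled by the assumed overhead factor."""
    weights_gb = params_billion * bits_per_weight / 8  # 1e9 params * bytes per param = GB
    return weights_gb * OVERHEAD


def min_gpus(vram_gb, gpu_memory_gb):
    """Smallest GPU count whose combined memory covers the estimate."""
    return math.ceil(vram_gb / gpu_memory_gb)


for bits, gpu_gb in [(16, 192), (8, 80), (4, 80)]:
    need = estimate_vram_gb(bits)
    print(f"{bits}-bit: ~{need:.0f} GB -> at least {min_gpus(need, gpu_gb)} x {gpu_gb}GB GPUs")
```

The raw minimum comes out at 6, 7 and 4 GPUs respectively. In practice the GPU count is usually rounded up to a size that instances actually offer (1, 2, 4 or 8) and that splits the model evenly, which is why the 16-bit and 8-bit setups above use 8 GPUs.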

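As always, you need to be careful with quantization and make sure it does not degrade the quality of the model too much. In our tests, fp8 quantization did not seem to hurt the quality of the model, so that is what we use in this article.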

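Provisioning an Instance for LLaMA 3.1 405B on GCP Compute Engine

Google Cloud Platform (GCP) is an interesting provider for deploying and scaling AI workloads. Its prices are relatively low and it offers good GPUs (NVIDIA H100 80GB, NVIDIA A100 80GB, NVIDIA V100, NVIDIA L4, NVIDIA T4...).

GCP is also fairly flexible when it comes to instance selection. For example, you can provision an instance with one or several H100 GPUs: 1xH100, 2xH100, 4xH100, or 8xH100.

However, if your account is new, you may not be allowed to provision GPU instances. In that case, you need to ask support for a quota increase.

As a first step, create a new project on GCP, then enable the Compute Engine API for that project. To do so, go to the API library in the GCP console and search for "Compute Engine". Click it, then click "Enable" to activate the API.

Once the API is enabled, you can create a new instance. Go to the "VM instances" section of the GCP console and click "Create instance".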

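You are then asked to select a machine type. For LLaMA 3.1 405B in fp8 mode, select the a3-highgpu-8g machine, which comes with 8 x H100 GPUs (the a3-highgpu-1g variant only has a single H100).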

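[Figure: GCP instance for LLaMA 3.1 405B]

You then need to set many details for the VM, such as networking and storage. We will not go through all of these settings here, but we will focus on the image type and the storage.

To download the LLaMA 3.1 405B weights in fp8 format, you need at least 500GB of disk space. We also recommend using a Linux deep learning image with CUDA 12 preinstalled, which saves some setup work. You can configure both in the "OS and storage" section:

[Figure: GCP image and disk space for LLaMA 3.1 405B]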
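If you prefer to script this step rather than click through the console, the same choices can be expressed as a gcloud command. The helper below is only a sketch that assembles the command without running it. The instance name, zone, and image family are placeholders of ours, not values from this article. The 1000GB default disk size is also our choice, to leave room above the 500GB minimum. To find a real CUDA 12 image family, list the images in the `deeplearning-platform-release` project.

```python
import shlex


def build_create_command(name, zone, image_family, disk_gb=1000):
    """Assemble (but do not run) a gcloud command for an 8 x H100 instance."""
    if disk_gb < 500:
        raise ValueError("the fp8 weights alone need about 500GB of disk")
    return [
        "gcloud", "compute", "instances", "create", name,
        f"--zone={zone}",
        "--machine-type=a3-highgpu-8g",                   # A3 machine type with 8 x H100 80GB
        f"--image-family={image_family}",                 # placeholder: pick a CUDA 12 image
        "--image-project=deeplearning-platform-release",  # project hosting Deep Learning VM images
        f"--boot-disk-size={disk_gb}GB",
        "--maintenance-policy=TERMINATE",                 # GPU VMs cannot live-migrate
    ]


# Hypothetical values, for illustration only
cmd = build_create_command("llama-405b", "us-central1-a", "YOUR_IMAGE_FAMILY")
print(shlex.join(cmd))
```

Check the printed command against your own project, quota, and zone availability before running it.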

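Installing vLLM

vLLM (short for Virtual Large Language Model) is an open-source engine for serving large language models (LLMs) and running inference on them.

vLLM is designed for high-throughput, low-latency inference, which makes it a good fit for applications that need fast and efficient language processing. It achieves this through techniques such as PagedAttention, which optimizes memory usage by managing the attention key and value memory more efficiently. Compared to traditional approaches such as HuggingFace Transformers, this can improve throughput by up to 24x.

The core of vLLM's efficiency is its memory management. With PagedAttention, vLLM splits the key-value (KV) cache into blocks, which improves memory utilization and reduces fragmentation, a common bottleneck in how LLMs use GPU memory. This not only speeds up processing but also lets the server handle more requests at the same time without a noticeable drop in performance.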
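To make the block idea concrete, here is a toy allocator in plain Python. It is a simplified sketch of block-based KV cache bookkeeping, not vLLM's actual implementation. The class name, the block size of 16 tokens, and the pool of 8 blocks are all illustrative assumptions.

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative value)


class PagedKVCache:
    """Toy allocator: each sequence maps its logical blocks to physical blocks."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:  # no block yet, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=8)
for _ in range(20):
    cache.append_token("a")  # 20 tokens occupy 2 blocks
for _ in range(5):
    cache.append_token("b")  # 5 tokens occupy 1 block
print(len(cache.block_tables["a"]), len(cache.block_tables["b"]), len(cache.free_blocks))
cache.free("a")
print(len(cache.free_blocks))
```

Because memory is handed out one block at a time, a sequence never reserves space for tokens it has not generated yet, and the blocks of a finished sequence are immediately reusable by others.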

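As an inference and serving engine, vLLM focuses not only on running LLMs but on running them in a way that makes the most of the available resources. It uses techniques such as continuous batching of incoming requests to keep the GPUs busy, which reduces idle time and improves overall efficiency.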
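The benefit of continuous batching can be seen in a small simulation. The sketch below is a toy model under simplifying assumptions of ours: every request needs a known number of decode steps, a batch holds at most `batch_size` requests, and one step advances every running request by one token. It does not reflect vLLM's real scheduler.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size]) for i in range(0, len(lengths), batch_size))


def continuous_batching_steps(lengths, batch_size):
    """A finished request frees its slot for the next queued request right away."""
    queue = deque(lengths)
    running = []  # remaining decode steps of each in-flight request
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        # One decode step for every slot; requests with one step left finish now
        running = [r - 1 for r in running if r > 1]
        steps += 1
    return steps


lengths = [2, 8, 3, 8, 2, 2, 3, 2]  # decode steps needed per request (made-up workload)
print(static_batching_steps(lengths, 4), continuous_batching_steps(lengths, 4))
```

On this made-up workload, static batching needs 11 steps while continuous batching needs 8, because short requests no longer wait for the longest request in their batch.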

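vLLM supports a range of quantization techniques (such as GPTQ, AWQ, INT4, INT8, and FP8). These techniques lower the precision of the model weights, which reduces memory usage and speeds up inference.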
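As a rough illustration of what "lowering the precision of the weights" means, the sketch below applies symmetric per-tensor INT8 quantization to a handful of made-up weights. We picked this scheme because it is the simplest to show. It is not the FP8 scheme used by the checkpoint later in this article, and production libraries typically work per channel or per group rather than per tensor.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map [-max|w|, +max|w|] onto [-127, 127]."""
    scale = (max(abs(w) for w in weights) / 127) or 1.0  # guard against all-zero input
    return [round(w / scale) for w in weights], scale


def dequantize(q_weights, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in q_weights]


weights = [0.12, -0.53, 0.98, -1.27, 0.004]  # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Each weight now fits in one byte instead of two, at the cost of a rounding error of at most half a quantization step. Here the smallest weight rounds to zero, which is the kind of loss that can degrade model quality.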

Installing vLLM is fairly simple. Let's connect to the GCP VM instance and install vLLM with pip:

pip install vllm

We are going to run distributed inference across the 8 x H100 GPUs, so we also install Ray:

pip install ray

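If you run into compatibility issues while installing vLLM, it may be easier to build vLLM from source or to use its Docker image: see the vLLM installation instructions.

Inference Server for LLaMA 3.1 405B

Let's start with a basic Python example to test the model: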

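from vllm import LLM, SamplingParams

# Load LLaMA 3.1 405B (fp8 checkpoint), sharded across 8 GPUs
llm = LLM("neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8-dynamic", tensor_parallel_size=8)

# Raise max_tokens so the answer is not cut off by the small default limit
params = SamplingParams(max_tokens=512)
outputs = llm.generate("What is the difference between fp8 quantization and int8 quantization?", params)

# generate() returns a list of request outputs; print only the generated text
print(outputs[0].outputs[0].text)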

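You can now run the Python script. On the first run, you need to wait for the model to be downloaded and loaded onto the GPUs. You then get a response like this: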

FP8 (Floating-Point 8) and INT8 (Integer 8) are both quantization techniques used to reduce the precision of numerical values in deep learning models, but they differ in their representation and behavior.

**INT8 Quantization**

INT8 quantization represents numbers as 8-bit integers, which can take on values between -128 and 127 (or 0 and 255 for unsigned integers). This means that the precision of the numbers is limited to 8 bits, and any values outside this range are clipped or saturated.

INT8 quantization is a simple and widely used technique, especially for integer-based architectures like ARM and x86. Most deep learning frameworks, including TensorFlow and PyTorch, support INT8 quantization.

**FP8 Quantization**

FP8 quantization, on the other hand, represents numbers as 8-bit floating-point numbers, with 1 sign bit, 2 exponent bits, and 5 mantissa bits. This allows for a much larger dynamic range than INT8, with values that can be as small as 2^-14 or as large as 2^15.

FP8 quantization is a more recent development, and its main advantage is that it can provide better accuracy than INT8 quantization, especially for models that require a large dynamic range, such as those with batch normalization or depthwise separable convolutions. FP8 is also more suitable for models that are sensitive to quantization noise, like those with recurrent neural networks (RNNs) or long short-term memory (LSTM) networks.

**Key differences**

Here are the key differences between FP8 and INT8 quantization:

1. **Dynamic range**: FP8 has a much larger dynamic range than INT8, which means it can represent a wider range of values.
2. **Precision**: FP8 has a lower precision than INT8, with 5 mantissa bits compared to 8 bits for INT8.
3. **Behavior**: FP8 is more suitable for models that require a large dynamic range, while INT8 is better suited for models with smaller weights and activations.
4. **Hardware support**: INT8 is widely supported by most hardware platforms, while FP8 is still an emerging standard, with limited hardware support.

In summary, FP8 quantization offers better accuracy and a larger dynamic range than INT8 quantization, but it requires more sophisticated hardware support and may not be suitable for all models or applications.

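The LLaMA 3.1 405B model has already been quantized to fp8 for vLLM by Neural Magic, so we do not need to quantize it ourselves. We simply load the quantized model from the HuggingFace Hub.

The tensor_parallel_size parameter is set to the number of GPUs on the machine.

This simple Python script is not a proper production server, though. We will now start an inference server that can handle many requests in parallel and maximize throughput: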

python -m vllm.entrypoints.openai.api_server \
--model neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8-dynamic \
--tensor-parallel-size 8

Once the model is loaded, you can open a second terminal and send some requests:

curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8-dynamic",
    "prompt": "Who are you?"
}'

I'm an artificial intelligence model known as Llama. Llama stands for "Large Language Model Meta AI."
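The same request can be sent from Python. The sketch below uses only the standard library and assumes the server started above is listening on localhost:8000. The helper names `build_request` and `complete` and the `max_tokens` default of 128 are our own choices for illustration.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1/completions"
MODEL = "neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8-dynamic"


def build_request(prompt, max_tokens=128):
    """Build the same JSON POST request as the curl command above."""
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, max_tokens=128):
    """Send the request and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(prompt, max_tokens)) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server running, `print(complete("Who are you?"))` should print an answer similar to the one above.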

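Conclusion

LLaMA 3.1 405B is an advanced generative AI model, but deploying it into production is not trivial.

The biggest challenge is finding the right hardware. GPUs are expensive and in short supply worldwide. Once you have the right GPUs, though, deploying the model with an inference server like vLLM is fairly easy.

For a model this large, you will probably want to use quantization, as we did, to reduce VRAM usage and improve latency. Be careful though: quantization is not always a silver bullet, because it can reduce the accuracy of the model.

If you cannot or do not want to deploy LLaMA 3.1 405B yourself, you can easily use it on NLP Cloud and run this model in production at scale. Try LLaMA 3.1 405B on NLP Cloud now!

If you have questions about LLaMA 3.1 405B or about AI in general, please do not hesitate to ask us. We are always happy to help!

Julien
CTO at NLP Cloud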

September 17, 2024