
LLM Inference Optimization Techniques

Inference optimization is a critical part of generative AI applications deployed in production. Using LLMs efficiently at scale is a challenge, and many techniques have been developed over the last few years to make inference faster and cheaper. Let's review these techniques in this article.

A Focus on LLM Architecture

Large language models (LLMs) are all based on the transformer architecture, introduced in 2017 by Vaswani et al. The transformer architecture delivers strong accuracy, few-shot learning, and near-human abilities across diverse language tasks. However, these foundation models, often comprising tens to hundreds of billions of parameters, are costly to train and resource-intensive during inference. Inference costs escalate with long input contexts, which require the model to process large amounts of input data. This makes efficient inference a critical challenge, particularly in managing memory and compute resources.

The Transformer Architecture

More specifically, most well-known LLMs are decoder-only models, like GPT-3, GPT-4, LLaMA, Mistral, DeepSeek, etc. These models are pretrained on a causal language modeling task, functioning as next-word predictors. They take a sequence of tokens as input and produce the following tokens autoregressively until a stopping condition is reached.

LLM inference in decoder-only models involves two key phases: the prefill phase and the decode phase. In the prefill phase, the model processes the input tokens to compute the intermediate states (keys and values) needed to generate the first new token. This phase resembles a matrix-matrix operation, is highly parallel, and efficiently utilizes GPU compute. Conversely, the decode phase generates tokens one at a time, each relying on the states of all previous tokens. This matrix-vector operation is memory-bound: latency is dictated primarily by the speed at which data (weights, keys, values) is transferred from GPU memory, rather than by computation speed, leaving GPU compute power underutilized.
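To make the two phases concrete, here is a minimal sketch of naive autoregressive generation, assuming a hypothetical `model(tokens)` callable that returns next-token logits for every position. The prefill is a single parallel pass over the prompt; the decode loop then produces one token per forward pass (and, without KV caching, wastefully recomputes the past at every step):

```python
import torch

# Naive autoregressive generation sketch. `model(tokens)` is a placeholder for
# any decoder-only LLM that maps a 1-D tensor of token ids to per-position
# next-token logits of shape (seq_len, vocab_size).
def generate(model, prompt_ids: torch.Tensor, max_new_tokens: int = 32, eos_id: int = 0):
    # Prefill: the whole prompt is processed in one parallel, compute-bound pass.
    logits = model(prompt_ids)
    next_id = int(logits[-1].argmax())
    generated = [next_id]

    # Decode: tokens are produced one at a time, each step depending on all
    # previous tokens. This is the memory-bound part of inference.
    for _ in range(max_new_tokens - 1):
        if next_id == eos_id:
            break
        context = torch.cat([prompt_ids, torch.tensor(generated)])
        logits = model(context)      # recomputes past positions; KV caching avoids this
        next_id = int(logits[-1].argmax())
        generated.append(next_id)
    return generated
```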

Optimizing the decode phase is a focal point for addressing inference challenges. Solutions include developing efficient attention mechanisms and better management of keys and values to reduce memory bottlenecks. The post highlights practical approaches to enhance inference performance, assuming readers have a basic understanding of the transformer architecture and attention mechanisms. These optimizations are crucial for improving throughput and reducing latency in real-world LLM deployments.

A further complication arises from the use of different tokenizers across LLMs, which affects token comparability. A token corresponds to roughly four English characters on average, but the exact representation varies depending on the tokenizer, making direct comparisons of inference throughput (e.g., tokens per second) across models misleading. This variability underscores the need for standardized evaluation metrics to accurately assess and compare LLM inference performance.
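As a quick illustration, the same sentence can map to quite different token counts depending on the tokenizer. A small check with Hugging Face tokenizers (the checkpoints below are just common, openly available examples; the T5 tokenizer additionally needs the sentencepiece package):

```python
from transformers import AutoTokenizer

text = "Inference optimization makes large language models cheaper to serve."

# The same text yields different token counts under different tokenizers,
# so tokens/second is only comparable between models sharing a tokenizer.
for name in ["gpt2", "bert-base-uncased", "t5-small"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, len(tok(text)["input_ids"]))
```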

Batching

Batching is a key strategy for improving GPU utilization and throughput in large language models (LLMs). By processing multiple requests simultaneously using the same model, batching distributes the memory cost of model weights across requests, allowing larger batches to leverage more GPU compute power. However, there’s a limit to batch size, as excessively large batches can cause memory overflow due to the memory demands of LLMs, particularly related to key-value (KV) caching (more on this later).

Batching Techniques

Traditional or static batching has limitations because requests within a batch often generate different numbers of completion tokens, leading to varied execution times. This causes all requests to wait for the slowest one to complete, which can be problematic when generation lengths vary significantly. To address this, advanced techniques like in-flight batching have been developed to optimize performance.

In-flight batching, also known as continuous batching, tackles the challenges posed by the dynamic nature of LLM workloads, which can range from simple chatbot responses to complex document summarization or code generation. These tasks produce outputs of vastly different sizes, making it hard to batch and execute requests efficiently in parallel. Unlike static batching, in-flight batching allows the server to evict completed sequences from the batch immediately and start processing new requests while others are still in progress. This approach significantly boosts GPU utilization by adapting to the varying execution times of requests in real-world scenarios.
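The idea can be summarized with a toy scheduler loop. This is only a sketch: `decode_step` is a stand-in for one batched decode iteration of the engine, and real servers (vLLM, TensorRT-LLM, ...) also account for KV cache capacity, priorities, and preemption:

```python
from collections import deque

# Continuous (in-flight) batching sketch: finished sequences are evicted and
# new requests admitted at every decode step, instead of waiting for the
# whole batch to finish as in static batching.
def serve(decode_step, pending: deque, max_batch_size: int = 8):
    active = []                                   # requests currently decoding
    while pending or active:
        # Fill freed slots immediately with waiting requests.
        while pending and len(active) < max_batch_size:
            active.append(pending.popleft())

        finished = decode_step(active)            # one token for every active request
        active = [r for r in active if r not in finished]   # evict completed sequences
```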

Multi-GPU Deployment With Model Parallelization

Model parallelization is a critical strategy for managing the memory and computational demands of large-scale machine learning models by distributing them across multiple GPUs. This approach allows for the handling of larger models or input batches that exceed the memory capacity of a single device, making it essential for both training and inference when memory constraints are tight. Various techniques exist for splitting model weights, including pipeline parallelism, tensor parallelism, and sequence parallelism, each addressing different aspects of model distribution. Unlike data parallelism, which focuses on replicating model weights across devices to process larger input batches during training, these methods are more relevant for reducing memory footprints during both training and inference.

Multiple NVIDIA GPUs

Pipeline parallelism divides the model vertically into sequential chunks, with each chunk containing a subset of layers assigned to a separate device. For instance, in a four-way pipeline setup, each device handles a quarter of the model’s layers, passing outputs to the next device in sequence. While this significantly reduces per-device memory requirements, it introduces inefficiencies known as "pipeline bubbles," where devices may idle while waiting for outputs from previous layers. Microbatching, which splits input batches into smaller sub-batches for sequential processing, can reduce these bubbles but not eliminate them entirely, as idle times persist during forward and backward passes.
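As a rough sketch of the idea (ignoring microbatching and scheduling), a model's layers can simply be partitioned into stages pinned to different devices, with activations handed from one stage to the next. The class below is illustrative only and assumes the layer count divides evenly across devices:

```python
import torch
import torch.nn as nn

# Naive pipeline parallelism: consecutive chunks of layers live on different
# devices and activations flow through the stages in order.
class PipelinedModel(nn.Module):
    def __init__(self, layers: list[nn.Module], devices: list[str]):
        super().__init__()
        chunk = len(layers) // len(devices)       # assumes an even split
        self.devices = devices
        self.stages = nn.ModuleList([
            nn.Sequential(*layers[i * chunk:(i + 1) * chunk]).to(dev)
            for i, dev in enumerate(devices)
        ])

    def forward(self, x):
        for stage, dev in zip(self.stages, self.devices):
            x = stage(x.to(dev))                  # move activations to the next stage
        return x
```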

Tensor parallelism, in contrast, shards individual layers horizontally into smaller computational blocks that can be executed independently across devices. This is particularly effective for transformer components like attention blocks and multi-layer perceptrons (MLPs), where, for example, different attention heads can be assigned to separate devices for parallel computation. However, tensor parallelism is less effective for operations like LayerNorm and Dropout, which cannot be easily divided and must be replicated across devices, leading to redundant memory usage for storing activations. This limitation highlights the need for complementary approaches to optimize memory efficiency.
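The core trick behind tensor parallelism can be shown on a single linear layer: shard the weight matrix column-wise, compute each shard on its own device, and gather the partial outputs (real frameworks such as Megatron-LM use collective communication instead of the naive gather below):

```python
import torch

# Column-parallel linear layer sketch: each device holds a slice of the output
# features and computes its part of x @ W independently.
def column_parallel_linear(x: torch.Tensor, weight: torch.Tensor, devices: list[str]):
    shards = torch.chunk(weight, len(devices), dim=1)          # split output features
    partial = [
        (x.to(dev) @ shard.to(dev)).to(devices[0])             # gather on the first device
        for shard, dev in zip(shards, devices)
    ]
    return torch.cat(partial, dim=-1)
```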

Sequence parallelism addresses the memory inefficiencies of operations like LayerNorm and Dropout by partitioning them along the input sequence dimension, leveraging their independence across sequence elements. This method reduces the memory footprint of redundant activations, making it a valuable complement to tensor parallelism. These parallelization techniques are not mutually exclusive and can be combined to optimize large language models (LLMs) further. Additionally, specific optimization strategies for the attention module can enhance scalability and reduce per-GPU memory demands, enabling more efficient training and inference for large models.

Attention Optimization

The 2017 paper *Attention Is All You Need* by Vaswani et al. introduced the Transformer model, with self-attention as its cornerstone. Self-attention enables the model to assess the relevance of different words in a sentence relative to each other, enhancing contextual understanding for tasks like natural language processing. The paper formalized self-attention, particularly through the scaled dot-product attention (SDPA) mechanism, which maps query and key-value pairs to an output, making it a pivotal component in modern neural networks. Here are some of the most important techniques to optimize attention computations:

The Attention Paper

Multi-head attention (MHA) builds on SDPA by running multiple attention operations in parallel, each with distinct projections of query, key, and value matrices. These parallel operations, or "heads," focus on different representational subspaces, enriching the model’s understanding of the input. The outputs from these heads are concatenated and linearly projected, maintaining computational efficiency comparable to single-head attention by reducing the dimensionality of each head (e.g., dividing the model dimension by the number of heads, such as 8).

Multi-query attention (MQA) optimizes MHA for inference by sharing key and value projections across multiple attention heads while keeping multiple query projections. This reduces memory bandwidth demands and the size of the key-value (KV) cache, enabling larger batch sizes and better compute utilization. However, MQA may slightly reduce accuracy, and models leveraging it require training or fine-tuning with MQA enabled to maintain performance.

Grouped-query attention (GQA) balances MHA and MQA by grouping query heads and sharing key-value projections within each group, achieving near-MHA quality with computational efficiency closer to MQA. Models like Llama 2 70B use GQA, and those trained with MHA can be adapted to GQA with minimal additional training. Both MQA and GQA reduce KV cache memory demands, though further optimizations in cache management remain necessary.
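The three variants differ only in how many key/value heads the query heads share. Here is a compact sketch using PyTorch's scaled_dot_product_attention (head counts are illustrative; setting the number of KV heads equal to the number of query heads gives MHA, and setting it to 1 gives MQA):

```python
import torch
import torch.nn.functional as F

# Grouped-query attention sketch. Tensors have shape (batch, heads, seq, head_dim);
# each group of query heads shares one key/value head.
def grouped_query_attention(q, k, v):
    n_q, n_kv = q.shape[1], k.shape[1]
    assert n_q % n_kv == 0
    group = n_q // n_kv
    k = k.repeat_interleave(group, dim=1)    # broadcast each KV head to its query group
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

q = torch.randn(1, 8, 16, 64)                # 8 query heads
k = torch.randn(1, 2, 16, 64)                # 2 shared KV heads -> 4 queries per group
v = torch.randn(1, 2, 16, 64)
out = grouped_query_attention(q, k, v)       # (1, 8, 16, 64)
```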

FlashAttention enhances attention mechanisms by reordering computations to leverage GPU memory hierarchies more effectively. Unlike traditional layer-by-layer processing, FlashAttention fuses operations and uses "tiling" to compute small portions of the output matrix at a time, minimizing reads and writes to GPU memory. This I/O-aware, exact attention algorithm integrates seamlessly into existing models without modifications, offering significant speedups by optimizing data movement.
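In practice you rarely implement FlashAttention yourself: it is available through libraries such as flash-attn, and recent PyTorch versions can dispatch to a Flash backend behind F.scaled_dot_product_attention. A hedged example, assuming a PyTorch build (2.3 or newer) that exposes torch.nn.attention.sdpa_kernel, a supported NVIDIA GPU, and fp16/bf16 inputs:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# Request the FlashAttention backend for the fused SDPA kernel.
q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```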

Key-Value Caching

KV caching is a critical optimization technique used during the decode phase of large language models (LLMs) to improve the efficiency of self-attention computations. In this phase, each generated token depends on the key (K) and value (V) tensors of all previous tokens, including those computed during the prefill stage and subsequent decode steps. Instead of recomputing these tensors for every token at each time step, KV caching stores them in GPU memory, appending new tensors to the cache as they are computed. Typically, a separate KV cache is maintained for each layer of the model, significantly reducing redundant computations and speeding up the decoding process.

Key-Value Caching
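To see what is actually cached, here is a toy single-head attention decode step: the keys and values of past tokens are kept in memory, and only the new token's projections are computed and appended (wq, wk, wv are stand-ins for the layer's projection matrices):

```python
import torch
import torch.nn.functional as F

# One decode step with a KV cache, single head, shapes (tokens, dim).
def decode_step(x_new, wq, wk, wv, k_cache, v_cache):
    q = x_new @ wq                                  # query for the new token only
    k_cache = torch.cat([k_cache, x_new @ wk])      # append the new key
    v_cache = torch.cat([v_cache, x_new @ wv])      # append the new value
    scores = (q @ k_cache.T) / k_cache.shape[-1] ** 0.5
    out = F.softmax(scores, dim=-1) @ v_cache       # attend over all cached tokens
    return out, k_cache, v_cache

d = 64
wq, wk, wv = (torch.randn(d, d) for _ in range(3))
k_cache = v_cache = torch.empty(0, d)
for x_new in torch.randn(5, 1, d):                  # five decode steps
    out, k_cache, v_cache = decode_step(x_new, wq, wk, wv, k_cache, v_cache)
```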

The memory requirements for LLMs on GPUs are primarily driven by two components: model weights and the KV cache. Model weights, which consist of the model’s parameters, occupy substantial memory; for instance, a 7-billion-parameter model like Llama 2 7B in 16-bit precision requires approximately 14 GB. The KV cache, on the other hand, stores self-attention tensors to avoid recomputation, with its size determined by factors such as the number of layers, attention heads, head dimensions, and precision. For each token, the cache size is calculated as 2 * num_layers * (num_heads * dim_head) * precision_in_bytes, where the factor of 2 accounts for both K and V matrices. For a batch of inputs, the total KV cache size scales with batch size and sequence length, potentially reaching significant sizes, such as ~2 GB for a Llama 2 7B model with a sequence length of 4,096 and batch size of 1.
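Plugging Llama 2 7B-style dimensions into that formula reproduces the numbers above (32 layers, 32 heads of dimension 128, 16-bit precision; exact values may differ slightly depending on the configuration):

```python
# KV cache size estimate: 2 (K and V) * layers * heads * head_dim * bytes per value.
num_layers, num_heads, dim_head, precision_bytes = 32, 32, 128, 2
bytes_per_token = 2 * num_layers * num_heads * dim_head * precision_bytes

seq_len, batch_size = 4096, 1
total_bytes = bytes_per_token * seq_len * batch_size
print(f"{bytes_per_token / 1024:.0f} KiB per token, ~{total_bytes / 1024**3:.1f} GiB per request")
```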

Managing the KV cache efficiently poses challenges due to its linear growth with batch size and sequence length, which can limit throughput and complicate handling long-context inputs. A common inefficiency arises from static over-provisioning, where memory is reserved for the maximum supported sequence length (e.g., 2,048 tokens), regardless of the actual input size. This leads to significant memory waste or fragmentation, as much of the reserved space often remains unused throughout the request’s lifetime, tying up valuable GPU memory resources.

To address these inefficiencies, the PagedAttention algorithm introduces a novel approach inspired by operating system paging. It divides the KV cache into fixed-size blocks, each representing a set number of tokens, which can be stored non-contiguously in memory. A block table tracks these blocks, fetching them as needed during attention computations. As new tokens are generated, additional blocks are allocated dynamically. This method minimizes memory wastage by eliminating the need for contiguous allocation and over-provisioning, enabling larger batch sizes and improving throughput, thus making it a significant advancement in managing KV cache memory for LLMs.
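A toy version of the bookkeeping involved (block size and class names are made up for illustration; vLLM's actual implementation manages GPU memory, reference counting, and copy-on-write):

```python
# PagedAttention-style bookkeeping sketch: the KV cache is carved into fixed-size
# blocks, and each sequence maps logical block positions to physical block ids.
BLOCK_SIZE = 16  # tokens per block

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free_blocks.pop()             # blocks need not be contiguous

    def free(self, block_id: int):
        self.free_blocks.append(block_id)

class Sequence:
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []          # logical position -> physical block
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:     # current block is full, grab a new one
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1
```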

Model Optimization

In this section we discuss various techniques for optimizing large language models (LLMs) to reduce their memory consumption and enhance performance on GPUs. Key methods include quantization, sparsity, and distillation, each targeting different aspects of model efficiency. These techniques modify model weights, leverage GPU hardware acceleration, and transfer knowledge to smaller models, enabling larger models to run on limited hardware while maintaining performance. These methods can degrade the accuracy of the model, so they should be used with caution.

Quantization reduces the precision of a model’s weights and activations, typically from 32 or 16 bits to 8 or fewer bits, allowing models to occupy less memory and transfer data more efficiently. While quantizing weights is straightforward due to their fixed nature post-training, quantizing activations is more complex due to outliers that expand their dynamic range. Techniques like LLM.int8() address this by selectively applying higher precision to certain activations, or by reusing the dynamic range of quantized weights for activations, though GPUs may require converting weights back to higher precision for operations.
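The basic mechanics of weight quantization are easy to sketch: pick a scale, round to 8-bit integers, and dequantize on the fly when computing. Production schemes (LLM.int8(), GPTQ, AWQ, ...) are considerably more sophisticated about outliers and per-channel scales; this is only a minimal illustration:

```python
import torch

# Minimal symmetric per-tensor int8 quantization of a weight matrix.
def quantize_int8(w: torch.Tensor):
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale

w = torch.randn(4096, 4096)
q, scale = quantize_int8(w)
print("storage: 4 bytes -> 1 byte per weight, max abs error:",
      (dequantize(q, scale) - w).abs().max().item())
```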

Sparsity involves pruning model values close to zero, creating sparse matrices that require less memory. GPUs support structured sparsity, such as representing two out of every four values as zeros, which accelerates computations. Combining sparsity with quantization can further enhance execution speed. Research continues to explore optimal sparse representations for LLMs, indicating a promising avenue for improving inference speeds.
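A quick sketch of the 2:4 pattern mentioned above: in every group of four consecutive weights, the two smallest-magnitude values are zeroed, which sparse tensor cores can then skip (real pruning pipelines also fine-tune afterwards to recover accuracy):

```python
import torch

# 2:4 structured sparsity sketch: keep the two largest-magnitude values in
# every group of four weights and zero the other two.
def prune_2_of_4(w: torch.Tensor) -> torch.Tensor:
    groups = w.reshape(-1, 4)
    drop = groups.abs().topk(2, dim=1, largest=False).indices   # two smallest per group
    return groups.scatter(1, drop, 0.0).reshape(w.shape)

w = torch.randn(8, 8)
print(prune_2_of_4(w))
```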

Distillation transfers knowledge from a larger “teacher” model to a smaller “student” model, compressing size while preserving performance. For example, DistilBERT achieves a 40% size reduction and 60% speed increase compared to BERT, retaining 97% of its capabilities. Distillation can involve mimicking the teacher’s outputs or using teacher-generated data for training, with methods like "Distilling Step by Step!" incorporating rationales for efficient learning. However, restrictive licenses on many advanced LLMs limit the availability of suitable teacher models for distillation.
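At training time, distillation often boils down to adding a loss term that pushes the student's output distribution towards the teacher's softened distribution. A minimal sketch of that combined loss (the temperature and weighting below are typical hyperparameters, not canonical values):

```python
import torch
import torch.nn.functional as F

# Knowledge-distillation loss: KL divergence to the teacher's softened
# distribution, mixed with the usual cross-entropy on the hard labels.
def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```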

Speculative Inference

Speculative inference, also known as speculative sampling or assisted generation, is a method to parallelize the execution of autoregressive large language models (LLMs) like GPT-style models, which typically generate text token by token. In standard execution, each token depends on all prior tokens for context, making parallel generation impossible as the nth token must be generated before the (n+1)th. Speculative inference addresses this by using a "cheaper" draft model to predict multiple future tokens simultaneously, which are then verified or rejected in parallel by the main model, allowing faster text generation.

The process involves generating a draft continuation of several tokens using a less resource-intensive method, followed by parallel verification by the main model using the draft as speculative context. If the verification model matches the draft tokens, they are accepted; otherwise, non-matching tokens and subsequent ones are discarded, and the process repeats with a new draft. Draft tokens can be generated using various approaches, such as training multiple models, fine-tuning multiple heads on a pretrained model to predict future tokens, or employing a smaller draft model alongside a larger, more capable verification model, each with its own tradeoffs.
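A greedy version of the draft-and-verify loop looks roughly like this. `draft_model` and `target_model` are placeholders for callables returning per-position next-token logits; real implementations verify against sampled distributions rather than the argmax used here:

```python
import torch

# One greedy speculative-decoding step: the draft model proposes k tokens,
# the target model checks them all in a single parallel pass, and tokens are
# accepted up to the first disagreement.
def speculative_step(target_model, draft_model, tokens: list[int], k: int = 4):
    draft = list(tokens)
    for _ in range(k):                                # k cheap sequential draft steps
        draft.append(int(draft_model(draft)[-1].argmax()))

    logits = target_model(draft)                      # one expensive parallel pass
    accepted = list(tokens)
    for i in range(len(tokens), len(draft)):
        predicted = int(logits[i - 1].argmax())       # target's own choice at position i
        if predicted == draft[i]:
            accepted.append(predicted)                # draft token verified
        else:
            accepted.append(predicted)                # correct it and drop the rest of the draft
            break
    else:
        accepted.append(int(logits[-1].argmax()))     # all verified: one bonus token for free
    return accepted
```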

Disaggregated Inference

Disaggregated inference is a technique where the computational tasks are split across different hardware to optimize performance, cost, and resource use. Specifically, it separates the prefilling and decoding phases. By disaggregating these phases, each can be assigned to hardware best suited for its computational demands, improving efficiency and scalability.

Disaggregated Inference

Prefilling is compute-intensive, requiring significant matrix multiplications to process the entire input prompt and produce KV caches. This phase benefits from high-performance hardware like GPUs or TPUs, which excel at parallel computations. Since prefilling is a one-time task per inference request, it can be offloaded to a centralized, powerful compute node optimized for such workloads. This setup allows for faster processing of large prompts and reduces the burden on less capable devices, making it ideal for cloud-based or data-center environments where high-throughput hardware is available.

Decoding, in contrast, is memory-bound and involves iterative token generation, relying heavily on access to the KV caches. It requires less computational power but needs fast memory access, making it suitable for less powerful, memory-optimized hardware like CPUs or edge devices. By moving decoding to separate hardware, potentially closer to the end user (such as on-premises servers or edge devices), disaggregated inference reduces latency and network bandwidth demands. This separation enables flexible deployment, where prefilling runs on high-end cloud servers and decoding occurs on local or edge devices, optimizing resource allocation and enabling efficient scaling for applications like real-time chatbots or interactive AI systems.

Conclusion

Many inference optimization techniques have been developed over the last few years to make LLM inference faster and cheaper in production.

Implementing these techniques requires a deep understanding of the LLM architecture and of the hardware you are using, so it is generally easier to use an existing inference engine that already implements them, such as vLLM, TensorRT-LLM, or LMDeploy. We have actually implemented these techniques in our own inference engine at NLP Cloud, and we have written a blog post about inference engines if you want to deploy your own models: you can read it here.

If you cannot or do not want to deploy your own LLMs yourself, you can use NLP Cloud and leverage fast generative AI models at scale in production. Try fast inference on NLP Cloud now!

If you have questions about inference engines in general, please don't hesitate to ask us; it's always a pleasure to advise!

Julien
CTO at NLP Cloud