Running large language models (LLMs) locally requires careful estimation of VRAM requirements to ensure smooth inference and training. While my previous guide provided a simplified formula, this post moves into more advanced calculations, including hidden dimensions, layer counts, quantization effects, and modality-specific overheads.
By the end of this guide, you'll understand how to estimate VRAM usage with greater accuracy and plan your hardware requirements accordingly.
VRAM Calculation Formula
The more precise formula for estimating VRAM usage considers model parameters, hidden dimensions, layers, modality overhead, and KV cache for inference.
$$ \text{Base VRAM} = \text{Model Weights} + \text{Activation Memory} + \text{KV Cache} + \text{Optimizer States} $$
$$ \text{Total VRAM} = \text{Base VRAM} \times \text{Modality Overhead} $$
Breaking this down:
Model Weights (GB)
The model weights refer to the memory required to store the neural network's parameters. The size of the model weights is determined by the total number of parameters and the precision (or quantization) level used.
$$ \text{Model Weights} = \frac{\text{Number of Parameters} \times \text{Bytes Per Parameter}}{10^9} $$
To calculate this:
Get the number of parameters – This is usually in the model's name:
- LLaMA 7B = 7 billion parameters
- LLaMA 13B = 13 billion parameters
- Mistral 7B = 7 billion parameters
For reference, you can roughly estimate the parameter count of a standard dense transformer using: $$ \text{Number of Parameters} \approx \text{Hidden Dimension}^2 \times \text{Num Layers} \times 12 $$ where the factor of 12 accounts for the attention and feed-forward weights in each layer.
However, using the number in the model name (like "7B") is accurate enough for VRAM calculations.
Choose the quantization level – The memory per parameter depends on the precision used, which is detailed in the next section.
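To make the arithmetic concrete, here is a minimal Python sketch of the weight formula; the function name and example inputs are my own, not from any library:

```python
def model_weights_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed to store the model weights, in GB."""
    return num_params * bytes_per_param / 1e9

# LLaMA 13B in FP16 (2 bytes per parameter)
print(model_weights_gb(13e9, 2.0))  # -> 26.0 GB
# Mistral 7B quantized to INT4 (0.5 bytes per parameter)
print(model_weights_gb(7e9, 0.5))   # -> 3.5 GB
```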
Activation Memory (GB) (For Training and Some Inference Tasks)
Activation memory refers to the VRAM required to store intermediate activations during forward and backward passes. While inference uses minimal activation memory, training requires significantly more due to backpropagation.
The formula for activation memory is:
$$ \text{Activation Memory} = \text{Model Weights} \times \text{Activation Factor} $$
The activation factor varies based on the context:
- Inference: Typically 0.2x the model weights, since only the activations of the current sequence need to be stored.
- Training: Approximately 2x the model weights, as activations from all layers need to be retained for gradient calculation during backpropagation.
For large models, activation memory can be the bottleneck in VRAM usage, often exceeding the model weights in training scenarios. Techniques like activation checkpointing can reduce this memory requirement at the cost of additional compute.
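As a small sketch, the same estimate in code, using the 0.2x/2x activation factors above (the function name and defaults are illustrative assumptions):

```python
def activation_memory_gb(weights_gb: float, training: bool = False) -> float:
    """Estimate activation memory from the weight size and a context-dependent factor."""
    activation_factor = 2.0 if training else 0.2
    return weights_gb * activation_factor

print(activation_memory_gb(26.0))                 # inference: ~5.2 GB
print(activation_memory_gb(26.0, training=True))  # training:  ~52 GB
```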
KV Cache (GB) (For Inference Only)
The KV (Key-Value) cache is used during autoregressive decoding in transformers to store past token representations, reducing redundant computation.
The required KV cache size is given by:
$$ \text{KV Cache} = \text{Batch Size} \times \text{Sequence Length} \times \text{KV Cache Per Token} $$
Where:
- Batch Size: The number of sequences processed in parallel.
- Sequence Length: The number of tokens per sequence.
- KV Cache Per Token: The memory needed per token, calculated as:
$$ \text{KV Cache Per Token} = \frac{\text{Hidden Dimension} \times 2 \times \text{Bytes Per Element} \times \text{Num Layers}}{10^9} $$
Note:
- Bytes Per Element depends on precision: FP16/BF16 use 2 bytes, FP32 uses 4 bytes.
- The factor of 2 accounts for both the Key and Value (K+V) tensors per token.
Since the KV cache grows linearly with sequence length, long sequences require more VRAM. For example, an LLaMA 7B model with a batch size of 4 and a sequence length of 2048 will use significantly more VRAM than a sequence length of 512. Techniques like flash attention and grouped-query attention (GQA) help optimize KV cache memory usage.
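Here is a hedged sketch that combines the two KV cache formulas above; the helper name and example values are mine:

```python
def kv_cache_gb(batch_size: int, seq_len: int, hidden_dim: int,
                num_layers: int, bytes_per_element: float = 2.0) -> float:
    """KV cache size in GB: one Key and one Value tensor per token, per layer."""
    kv_per_token_gb = hidden_dim * 2 * bytes_per_element * num_layers / 1e9
    return batch_size * seq_len * kv_per_token_gb

# LLaMA 7B (hidden 4096, 32 layers), FP16, batch size 4
print(kv_cache_gb(4, 2048, 4096, 32))  # ~4.3 GB
print(kv_cache_gb(4, 512, 4096, 32))   # ~1.1 GB
```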
Optimizer States (GB) (For Training Only)
Optimizer states store additional information needed for parameter updates during training.
This typically requires 3× to 4× the model weights for Adam-based optimizers, depending on implementation.
A conservative formula for optimizer state memory is:
$$ \text{Optimizer States} = 4 \times \text{Model Weights} $$
This accounts for:
- The primary weight updates.
- First-moment estimates (mean of past gradients).
- Second-moment estimates (uncentered variance of past gradients).
- Additional buffer storage.
For example, a 13B model in FP16 precision (≈26GB model weights) would require 4 × 26GB = 104GB just for optimizer states. Optimizer memory is one of the biggest constraints when training large models; to mitigate this, optimizers like Lion or Sophia aim to reduce state size while maintaining performance.
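The same estimate in code, using the conservative 4x factor from the formula above (a sketch, not a measured value):

```python
def optimizer_states_gb(weights_gb: float, factor: float = 4.0) -> float:
    """Optimizer state memory for Adam-style optimizers (updates + two moments + buffers)."""
    return factor * weights_gb

print(optimizer_states_gb(26.0))  # 13B in FP16: ~104 GB
```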
Modality Overhead Factor
Different model types require additional VRAM due to extra processing demands beyond the standard transformer architecture. This overhead factor adjusts the VRAM estimate based on modality:
| Model Type | Overhead Factor |
|---|---|
| Text | 1.15x |
| Audio | 1.30x |
| Video | 1.40x |
| Reasoning | 1.25x |
| Multimodal | 1.50x+ |
- Text models primarily process tokens sequentially and have the lowest overhead.
- Audio models require additional FFT operations and often work with continuous signals.
- Video models process multi-frame sequences, significantly increasing memory requirements.
- Reasoning models (e.g., for complex logical inference) may require additional intermediate memory storage.
- Multimodal models (processing text, image, and audio together) have the highest overhead due to extra cross-modal attention layers.
These overhead factors help refine VRAM estimates for real-world deployments, especially for applications like generative video synthesis or large-scale multimodal AI assistants.
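As a sketch, the table maps naturally onto a lookup dictionary (the names and structure below are my own choice, not a standard API):

```python
MODALITY_OVERHEAD = {
    "text": 1.15,
    "audio": 1.30,
    "video": 1.40,
    "reasoning": 1.25,
    "multimodal": 1.50,  # lower bound; can be higher
}

def total_vram_gb(base_vram_gb: float, modality: str = "text") -> float:
    """Apply the modality overhead factor to a base VRAM estimate."""
    return base_vram_gb * MODALITY_OVERHEAD[modality]

print(total_vram_gb(34.56, "text"))  # ~39.7 GB
```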
Bytes Per Parameter in Different Quantization Levels
Quantization reduces VRAM usage by representing model parameters with fewer bits. Here are common quantization levels and their bytes per parameter:
| Precision | Bytes Per Parameter |
|---|---|
| FP32 | 4.0 |
| FP16 | 2.0 |
| BF16 | 2.0 |
| INT8 | 1.0 |
| INT4 | 0.5 |
For quantized inference, choosing INT4 over FP16 reduces model weight memory by 4x, which is useful for running large models on consumer GPUs.
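Combining this table with the weight formula from earlier gives a quick comparison; the constant name is mine and the output is illustrative:

```python
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "BF16": 2.0, "INT8": 1.0, "INT4": 0.5}

# Weight memory for a 13B-parameter model at each precision
for precision, bytes_per_param in BYTES_PER_PARAM.items():
    print(f"{precision}: {13e9 * bytes_per_param / 1e9:.1f} GB")
# FP32: 52.0 GB, FP16/BF16: 26.0 GB, INT8: 13.0 GB, INT4: 6.5 GB
```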
Hidden Dimensions and Layers
Two architectural details influence VRAM usage: hidden dimension size and number of layers. These parameters affect both model weight size and KV cache size.
Finding Hidden Dimension and Layers for a Model
You can retrieve the hidden dimension and number of layers programmatically using Python:
```python
from transformers import AutoConfig

# Load only the model configuration (no weights are downloaded)
model_name = "meta-llama/Llama-2-7b-hf"
config = AutoConfig.from_pretrained(model_name)

hidden_dim = config.hidden_size
num_layers = config.num_hidden_layers
print(f"Hidden Dimension: {hidden_dim}, Layers: {num_layers}")
```
Default Hidden Dimensions and Layers
If the model documentation does not provide the hidden dimension or layer count, the table below lists default values based on typical architectures. When they are not documented, you can often find the exact numbers in the model's source code; for example, Meta's Llama 3 model defines these values in its model implementation.
| Model Parameters | Hidden Dimension | Number of Layers |
|---|---|---|
| 1B | 2048 | 22 |
| 3B | 3072 | 26 |
| 7B | 4096 | 32 |
| 13B | 5120 | 40 |
| 30B | 7168 | 60 |
| 65B | 8192 | 80 |
| 120B | 12288 | 96 |
| 405B | 16384 | 120 |
| 671B | 20480 | 160 |
These values are a general reference; actual architectures may vary. If precise values are required, check the model's source code or official documentation.
Example Calculation: Running LLaMA-13B
Let's estimate the VRAM required for running LLaMA-13B in FP16 with a sequence length of 4096 and batch size 1.
Given:
- 13B parameters
- Hidden Dimension: 5120
- Layers: 40
- Precision: FP16 (2 bytes per parameter)
- Batch Size: 1
- Sequence Length: 4096
Calculations:
Model Weights: $$ 13B \times 2 = 26 \text{ GB} $$
Activation Memory (0.2x for inference): $$ 26 \times 0.2 = 5.2 \text{ GB} $$
KV Cache Per Token: $$ \frac{5120 \times 2 \times 2 \times 40}{10^9} \approx 0.00082 \text{ GB} \approx 0.82 \text{ MB per token} $$
KV Cache Total: $$ 1 \times 4096 \times 0.82 \text{ MB} \approx 3.36 \text{ GB} $$
Total VRAM Required:
$$ (26 + 5.2 + 3.36) \times 1.15 \approx 39.7 \text{ GB} $$
This means LLaMA-13B in FP16 needs roughly 40GB of VRAM for comfortable inference. With INT4 quantization, the requirement drops to about 15GB.
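To tie the walkthrough together, here is a hedged end-to-end sketch that reproduces the numbers above; the function name and argument layout are my own:

```python
def estimate_inference_vram_gb(num_params: float, hidden_dim: int, num_layers: int,
                               bytes_per_param: float, batch_size: int, seq_len: int,
                               modality_overhead: float = 1.15) -> float:
    """Rough inference estimate: (weights + activations + KV cache) x modality overhead."""
    weights = num_params * bytes_per_param / 1e9
    activations = 0.2 * weights  # inference activation factor
    # Assumes the KV cache is stored at the same precision as the weights
    kv_cache = batch_size * seq_len * (hidden_dim * 2 * bytes_per_param * num_layers) / 1e9
    return (weights + activations + kv_cache) * modality_overhead

# LLaMA-13B, FP16, batch size 1, 4096-token context
print(estimate_inference_vram_gb(13e9, 5120, 40, 2.0, 1, 4096))  # ~39.7 GB
```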
Conclusion
This guide provides an accurate approach to estimating VRAM for LLMs by incorporating hidden dimensions, layer counts, modality overhead, and KV cache calculations. If you're working with limited VRAM, techniques like model offloading, low-rank adaptation (LoRA), and model sharding can help optimize performance on consumer-grade GPUs. Always benchmark VRAM usage to validate estimates, as real-world hardware variations may introduce small discrepancies.