Running large language models (LLMs) locally requires careful estimation of VRAM requirements to ensure smooth inference and training. While my previous guide provided a simplified formula, this post moves into more advanced calculations, including hidden dimensions, layer counts, quantization effects, and modality-specific overheads.
By the end of this guide, you'll understand how to estimate VRAM usage with greater accuracy and plan your hardware requirements accordingly.
VRAM Calculation Formula
The more precise formula for estimating VRAM usage considers model parameters, hidden dimensions, layers, modality overhead, and KV cache for inference.
$$ \text{Base VRAM} = \text{Model Weights} + \text{Activation Memory} + \text{KV Cache} + \text{Optimizer States} $$
$$ \text{Total VRAM} = \text{Base VRAM} \times \text{Modality Overhead} $$
Breaking this down:
Model Weights (GB)
The model weights refer to the memory required to store the neural network's parameters. The size of the model weights is determined by the total number of parameters and the precision (or quantization) level used.
$$ \text{Model Weights} = \frac{\text{Number of Parameters} \times \text{Bytes Per Parameter}}{10^9} $$
To calculate this:
Get the number of parameters – This is usually in the model's name:
- LLaMA 7B = 7 billion parameters
- LLaMA 13B = 13 billion parameters
- Mistral 7B = 7 billion parameters
For reference, you can roughly estimate the parameter count of a standard dense transformer using: $$ \text{Number of Parameters} \approx \text{Hidden Dimension}^2 \times \text{Num Layers} \times 12 $$ where the factor of 12 accounts for the attention and feed-forward weights in each layer.
However, using the number in the model name (like "7B") is accurate enough for VRAM calculations.
Choose the quantization level – The memory per parameter depends on the precision used, which is detailed in the next section.
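To make the arithmetic concrete, here is a minimal Python sketch of the weight formula; the function name and example inputs are my own, not from any library:

```python
def model_weights_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed to store the model weights, in GB."""
    return num_params * bytes_per_param / 1e9

# LLaMA 13B in FP16 (2 bytes per parameter)
print(model_weights_gb(13e9, 2.0))  # -> 26.0 GB
# Mistral 7B quantized to INT4 (0.5 bytes per parameter)
print(model_weights_gb(7e9, 0.5))   # -> 3.5 GB
```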
Activation Memory (GB) (For Training and Some Inference Tasks)
Activation memory refers to the VRAM required to store intermediate activations during forward and backward passes. While inference uses minimal activation memory, training requires significantly more due to backpropagation.
The formula for activation memory is:
$$ \text{Activation Memory} = \text{Model Weights} \times \text{Activation Factor} $$
The activation factor varies based on the context:
- Inference: Typically 0.2x the model weights, since only the activations of the current sequence need to be stored.
- Training: Approximately 2x the model weights, as activations from all layers need to be retained for gradient calculation during backpropagation.
For large models, activation memory can be the bottleneck in VRAM usage, often exceeding the model weights in training scenarios. Techniques like activation checkpointing can reduce this memory requirement at the cost of additional compute.
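As a small sketch, the same estimate in code, using the 0.2x/2x activation factors above (the function name and defaults are illustrative assumptions):

```python
def activation_memory_gb(weights_gb: float, training: bool = False) -> float:
    """Estimate activation memory from the weight size and a context-dependent factor."""
    activation_factor = 2.0 if training else 0.2
    return weights_gb * activation_factor

print(activation_memory_gb(26.0))                 # inference: ~5.2 GB
print(activation_memory_gb(26.0, training=True))  # training:  ~52 GB
```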
KV Cache (GB) (For Inference Only)
The KV (Key-Value) cache is used during autoregressive decoding in transformers to store past token representations, reducing redundant computation.
The required KV cache size is given by:
$$ \text{KV Cache} = \text{Batch Size} \times \text{Sequence Length} \times \text{KV Cache Per Token} $$
Where:
- Batch Size: The number of sequences processed in parallel.
- Sequence Length: The number of tokens per sequence.
- KV Cache Per Token: The memory needed per token, calculated as:
$$ \text{KV Cache Per Token} = \frac{\text{Hidden Dimension} \times 2 \times \text{Bytes Per Element} \times \text{Num Layers}}{10^9} $$
Note:
- Bytes Per Element depends on precision: FP16/BF16 use 2 bytes, FP32 uses 4 bytes.
- The factor of 2 accounts for both the Key and Value (K+V) tensors per token.
Since the KV cache grows linearly with sequence length, long sequences require more VRAM. For example, an LLaMA 7B model with a batch size of 4 and a sequence length of 2048 will use significantly more VRAM than a sequence length of 512. Techniques like flash attention and grouped-query attention (GQA) help optimize KV cache memory usage.
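Here is a hedged sketch that combines the two KV cache formulas above; the helper name and example values are mine:

```python
def kv_cache_gb(batch_size: int, seq_len: int, hidden_dim: int,
                num_layers: int, bytes_per_element: float = 2.0) -> float:
    """KV cache size in GB: one Key and one Value tensor per token, per layer."""
    kv_per_token_gb = hidden_dim * 2 * bytes_per_element * num_layers / 1e9
    return batch_size * seq_len * kv_per_token_gb

# LLaMA 7B (hidden 4096, 32 layers), FP16, batch size 4
print(kv_cache_gb(4, 2048, 4096, 32))  # ~4.3 GB
print(kv_cache_gb(4, 512, 4096, 32))   # ~1.1 GB
```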
Optimizer States (GB) (For Training Only)
Optimizer states store additional information needed for parameter updates during training.
This typically requires 3× to 4× the model weights for Adam-based optimizers, depending on implementation.
A conservative formula for optimizer state memory is:
$$ \text{Optimizer States} = 4 \times \text{Model Weights} $$
This accounts for:
- The primary weight updates.
- First-moment estimates (mean of past gradients).
- Second-moment estimates (uncentered variance of past gradients).
- Additional buffer storage.
For example, a 13B model in FP16 precision (≈26GB model weights) would require 4 × 26GB = 104GB just for optimizer states. Optimizer memory is one of the biggest constraints when training large models; to mitigate this, optimizers like Lion or Sophia aim to reduce state size while maintaining performance.
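The same estimate in code, using the conservative 4x factor from the formula above (a sketch, not a measured value):

```python
def optimizer_states_gb(weights_gb: float, factor: float = 4.0) -> float:
    """Optimizer state memory for Adam-style optimizers (updates + two moments + buffers)."""
    return factor * weights_gb

print(optimizer_states_gb(26.0))  # 13B in FP16: ~104 GB
```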
Modality Overhead Factor
Different model types require additional VRAM due to extra processing demands beyond the standard transformer architecture. This overhead factor adjusts the VRAM estimate based on modality:
| Model Type | Overhead Factor |
|---|---|
| Text | 1.15x |
| Audio | 1.30x |
| Video | 1.40x |
| Reasoning | 1.25x |
| Multimodal | 1.50x+ |
- Text models primarily process tokens sequentially and have the lowest overhead.
- Audio models require additional FFT operations and often work with continuous signals.
- Video models process multi-frame sequences, significantly increasing memory requirements.
- Reasoning models (e.g., for complex logical inference) may require additional intermediate memory storage.
- Multimodal models (processing text, image, and audio together) have the highest overhead due to extra cross-modal attention layers.
These overhead factors help refine VRAM estimates for real-world deployments, especially for applications like generative video synthesis or large-scale multimodal AI assistants.
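As a sketch, the table maps naturally onto a lookup dictionary (the names and structure below are my own choice, not a standard API):

```python
MODALITY_OVERHEAD = {
    "text": 1.15,
    "audio": 1.30,
    "video": 1.40,
    "reasoning": 1.25,
    "multimodal": 1.50,  # lower bound; can be higher
}

def total_vram_gb(base_vram_gb: float, modality: str = "text") -> float:
    """Apply the modality overhead factor to a base VRAM estimate."""
    return base_vram_gb * MODALITY_OVERHEAD[modality]

print(total_vram_gb(34.56, "text"))  # ~39.7 GB
```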
Bytes Per Parameter in Different Quantization Levels
Quantization reduces VRAM usage by representing model parameters with fewer bits. Here are common quantization levels and their bytes per parameter:
| Precision | Bytes Per Parameter |
|---|---|
| FP32 | 4.0 |
| FP16 | 2.0 |
| BF16 | 2.0 |
| INT8 | 1.0 |
| INT4 | 0.5 |
For quantized inference, choosing INT4 over FP16 reduces model weight memory by 4x, which is useful for running large models on consumer GPUs.
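Combining this table with the weight formula from earlier gives a quick comparison; the constant name is mine and the output is illustrative:

```python
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "BF16": 2.0, "INT8": 1.0, "INT4": 0.5}

# Weight memory for a 13B-parameter model at each precision
for precision, bytes_per_param in BYTES_PER_PARAM.items():
    print(f"{precision}: {13e9 * bytes_per_param / 1e9:.1f} GB")
# FP32: 52.0 GB, FP16/BF16: 26.0 GB, INT8: 13.0 GB, INT4: 6.5 GB
```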
Hidden Dimensions and Layers
Two architectural details influence VRAM usage: hidden dimension size and number of layers. These parameters affect both model weight size and KV cache size.
Finding Hidden Dimension and Layers for a Model
You can retrieve the hidden dimension and number of layers programmatically using Python:
```python
from transformers import AutoConfig

# Load only the model configuration (no weights are downloaded)
model_name = "meta-llama/Llama-2-7b-hf"
config = AutoConfig.from_pretrained(model_name)

hidden_dim = config.hidden_size
num_layers = config.num_hidden_layers
print(f"Hidden Dimension: {hidden_dim}, Layers: {num_layers}")
```
Default Hidden Dimensions and Layers
If the model documentation does not provide the hidden dimension or layer count, the table below lists default values based on typical architectures. When they are not documented, you can often find the exact numbers in the model's source code; for example, Meta's Llama 3 model defines these values in its model implementation.
| Model Parameters | Hidden Dimension | Number of Layers |
|---|---|---|
| 1B | 2048 | 22 |
| 3B | 3072 | 26 |
| 7B | 4096 | 32 |
| 13B | 5120 | 40 |
| 30B | 7168 | 60 |
| 65B | 8192 | 80 |
| 120B | 12288 | 96 |
| 405B | 16384 | 120 |
| 671B | 20480 | 160 |
These values are a general reference; actual architectures may vary. If precise values are required, check the model's source code or official documentation.
Example Calculation: Running LLaMA-13B
Let's estimate the VRAM required for running LLaMA-13B in FP16 with a sequence length of 4096 and batch size 1.
Given:
- 13B parameters
- Hidden Dimension: 5120
- Layers: 40
- Precision: FP16 (2 bytes per parameter)
- Batch Size: 1
- Sequence Length: 4096
Calculations:
Model Weights: $$ 13B \times 2 = 26 \text{ GB} $$
Activation Memory (0.2x for inference): $$ 26 \times 0.2 = 5.2 \text{ GB} $$
KV Cache Per Token: $$ \frac{5120 \times 2 \times 2 \times 40}{10^9} \approx 0.00082 \text{ GB} \approx 0.82 \text{ MB per token} $$
KV Cache Total: $$ 1 \times 4096 \times 0.82 \text{ MB} \approx 3.36 \text{ GB} $$
Total VRAM Required:
$$ (26 + 5.2 + 3.36) \times 1.15 \approx 39.7 \text{ GB} $$
This means LLaMA-13B in FP16 needs roughly 40GB of VRAM for comfortable inference. With INT4 quantization, the requirement drops to about 15GB.
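To tie the walkthrough together, here is a hedged end-to-end sketch that reproduces the numbers above; the function name and argument layout are my own:

```python
def estimate_inference_vram_gb(num_params: float, hidden_dim: int, num_layers: int,
                               bytes_per_param: float, batch_size: int, seq_len: int,
                               modality_overhead: float = 1.15) -> float:
    """Rough inference estimate: (weights + activations + KV cache) x modality overhead."""
    weights = num_params * bytes_per_param / 1e9
    activations = 0.2 * weights  # inference activation factor
    # Assumes the KV cache is stored at the same precision as the weights
    kv_cache = batch_size * seq_len * (hidden_dim * 2 * bytes_per_param * num_layers) / 1e9
    return (weights + activations + kv_cache) * modality_overhead

# LLaMA-13B, FP16, batch size 1, 4096-token context
print(estimate_inference_vram_gb(13e9, 5120, 40, 2.0, 1, 4096))  # ~39.7 GB
```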
Conclusion
This guide provides an accurate approach to estimating VRAM for LLMs by incorporating hidden dimensions, layer counts, modality overhead, and KV cache calculations. If you're working with limited VRAM, techniques like model offloading, low-rank adaptation (LoRA), and model sharding can help optimize performance on consumer-grade GPUs. Always benchmark VRAM usage to validate estimates, as real-world hardware variations may introduce small discrepancies.