Specific calculators, planners, and explainers for jobs that usually get buried in forums.
LLM VRAM and context calculator
Estimate how much GPU memory a transformer needs for weights and KV cache, then work backwards to the longest context or highest concurrency your hardware can sustain.
What this page models
- Weight memory from total parameter count and chosen quantization.
- KV-cache growth from layers, KV heads, head width, context, and concurrency.
- Partial GPU offload for llama.cpp-style planning where not all weights live on GPU.
- Cluster fit against one or more GPUs after leaving some VRAM in reserve (see the sketch after this list).
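
The sketch below pulls these pieces together in Python. The model shape, quantization bits, and reserve figure are illustrative assumptions, not the page's exact inputs or defaults.

```python
def weight_bytes(params_billion: float, bits_per_weight: float) -> float:
    """Weight memory from parameter count and quantization (e.g. ~4.5 bits/weight for Q4_K_M)."""
    return params_billion * 1e9 * bits_per_weight / 8

def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context: int, sequences: int, bytes_per_elem: int = 2) -> float:
    """One key and one value tensor per layer, per token, per concurrent sequence (FP16 cache)."""
    return 2 * layers * kv_heads * head_dim * context * sequences * bytes_per_elem

def fits(total_bytes: float, gpus: int, vram_gb: float, reserve_gb: float = 1.5) -> bool:
    """Fit check against the cluster after holding back a per-GPU reserve."""
    return total_bytes <= gpus * (vram_gb - reserve_gb) * 1024**3

# Illustrative 8B-class shape: 32 layers, 8 KV heads, 128-wide heads,
# Q4 weights, 16k context, 4 concurrent requests, one 24 GB GPU.
w = weight_bytes(8.0, 4.5)
kv = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, context=16384, sequences=4)
print(f"weights {w / 1024**3:.1f} GiB, KV cache {kv / 1024**3:.1f} GiB, "
      f"fits on 1x24 GB: {fits(w + kv, gpus=1, vram_gb=24)}")
```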
Why KV heads matter
The cache holds one key and one value tensor per layer, so it grows with 2 × layers × KV heads × head width × tokens × sequences × bytes per element. Grouped-query attention can shrink that sharply versus a design where every attention head owns its own KV state.
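
A quick comparison makes the saving concrete. The figures below assume a hypothetical 32-layer model with 128-wide heads, 32k of context, one sequence, and an FP16 cache; only the number of KV heads changes.

```python
layers, head_dim, context, sequences, bytes_per_elem = 32, 128, 32768, 1, 2

def kv_gib(kv_heads: int) -> float:
    # 2 x layers x KV heads x head width x tokens x sequences x bytes per element
    return 2 * layers * kv_heads * head_dim * context * sequences * bytes_per_elem / 1024**3

print(f"32 KV heads (every head keeps its own K/V): {kv_gib(32):.0f} GiB")
print(f" 8 KV heads (grouped-query attention):      {kv_gib(8):.0f} GiB")
```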
How to use the offload field
If you are planning a local runtime that keeps only some layers on GPU, lower the "Weights kept on GPU" value. The page then separates total model weight memory from the portion that must fit in VRAM.
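
As a rough sketch of that split, assuming the field is expressed as a fraction of the weights (the real input may count layers instead):

```python
def split_weights(total_weight_gib: float, fraction_on_gpu: float) -> tuple[float, float]:
    """Split weight memory into the part that must fit in VRAM and the part left in system RAM."""
    on_gpu = total_weight_gib * fraction_on_gpu
    return on_gpu, total_weight_gib - on_gpu

# e.g. a 26 GiB quantized model with 60% of the weights kept on GPU
gpu_gib, cpu_gib = split_weights(26.0, 0.6)
print(f"must fit in VRAM: {gpu_gib:.1f} GiB, stays in system RAM: {cpu_gib:.1f} GiB")
```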
Common questions
Why does context length increase VRAM usage so quickly?
Longer context grows KV-cache memory, and that growth can become the dominant cost even when the model weights themselves still fit comfortably.
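
A rough illustration, reusing the hypothetical 8B-class shape from above (Q4 weights, 32 layers, 8 KV heads, 128-wide heads, FP16 cache, one sequence):

```python
weights_gib = 8e9 * 4.5 / 8 / 1024**3  # ~4.2 GiB of Q4 weights
for context in (4096, 32768, 131072):
    kv_gib = 2 * 32 * 8 * 128 * context * 1 * 2 / 1024**3
    print(f"{context:>7} tokens -> KV cache {kv_gib:5.1f} GiB vs {weights_gib:.1f} GiB of weights")
```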
What does offload change in a local LLM setup?
Offloading moves some model memory off the GPU, which can make a larger model fit, but usually at a speed penalty compared with keeping everything in VRAM.
Is parameter count enough to estimate whether a model will fit?
No. Quantization, context length, concurrency, KV-cache precision, and GPU count all affect the real memory budget, which is why one rule-of-thumb number is often misleading.
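
For example, two illustrative configurations of the same 8-billion-parameter, 32-layer model land an order of magnitude apart:

```python
def total_gib(bits_per_weight: float, context: int, sequences: int) -> float:
    weights = 8e9 * bits_per_weight / 8              # weight bytes
    kv = 2 * 32 * 8 * 128 * context * sequences * 2  # FP16 KV-cache bytes
    return (weights + kv) / 1024**3

print(f"Q4 weights, 4k context, 1 request:     {total_gib(4.5, 4096, 1):5.1f} GiB")
print(f"FP16 weights, 32k context, 8 requests: {total_gib(16, 32768, 8):5.1f} GiB")
```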