Specific calculators, planners, and explainers for jobs that usually get buried in forums.
LLM VRAM and context calculator
Estimate how much GPU memory a transformer needs for weights and KV cache, then work backwards to the longest context or highest concurrency your hardware can sustain.
What this page models
- Weight memory from total parameter count and chosen quantization.
- KV-cache growth from layers, KV heads, head width, context, and concurrency.
- Partial GPU offload for llama.cpp-style planning where not all weights live on GPU.
- Cluster fit against one or more GPUs after leaving some VRAM in reserve (see the sketch after this list).
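
The sketch below pulls these pieces together in Python. The model shape, quantization bits, and reserve figure are illustrative assumptions, not the page's exact inputs or defaults.

```python
def weight_bytes(params_billion: float, bits_per_weight: float) -> float:
    """Weight memory from parameter count and quantization (e.g. ~4.5 bits/weight for Q4_K_M)."""
    return params_billion * 1e9 * bits_per_weight / 8

def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context: int, sequences: int, bytes_per_elem: int = 2) -> float:
    """One key and one value tensor per layer, per token, per concurrent sequence (FP16 cache)."""
    return 2 * layers * kv_heads * head_dim * context * sequences * bytes_per_elem

def fits(total_bytes: float, gpus: int, vram_gb: float, reserve_gb: float = 1.5) -> bool:
    """Fit check against the cluster after holding back a per-GPU reserve."""
    return total_bytes <= gpus * (vram_gb - reserve_gb) * 1024**3

# Illustrative 8B-class shape: 32 layers, 8 KV heads, 128-wide heads,
# Q4 weights, 16k context, 4 concurrent requests, one 24 GB GPU.
w = weight_bytes(8.0, 4.5)
kv = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, context=16384, sequences=4)
print(f"weights {w / 1024**3:.1f} GiB, KV cache {kv / 1024**3:.1f} GiB, "
      f"fits on 1x24 GB: {fits(w + kv, gpus=1, vram_gb=24)}")
```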
Why KV heads matter
The cache holds one key and one value tensor per layer, so it grows with 2 × layers × KV heads × head width × tokens × sequences × bytes per element. Grouped-query attention can shrink that sharply versus a design where every attention head owns its own KV state.
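
A quick comparison makes the saving concrete. The figures below assume a hypothetical 32-layer model with 128-wide heads, 32k of context, one sequence, and an FP16 cache; only the number of KV heads changes.

```python
layers, head_dim, context, sequences, bytes_per_elem = 32, 128, 32768, 1, 2

def kv_gib(kv_heads: int) -> float:
    # 2 x layers x KV heads x head width x tokens x sequences x bytes per element
    return 2 * layers * kv_heads * head_dim * context * sequences * bytes_per_elem / 1024**3

print(f"32 KV heads (every head keeps its own K/V): {kv_gib(32):.0f} GiB")
print(f" 8 KV heads (grouped-query attention):      {kv_gib(8):.0f} GiB")
```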
How to use the offload field
If you are planning a local runtime that keeps only some layers on GPU, lower the "Weights kept on GPU" value. The page then separates total model weight memory from the portion that must fit in VRAM.
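
As a rough sketch of that split, assuming the field is expressed as a fraction of the weights (the real input may count layers instead):

```python
def split_weights(total_weight_gib: float, fraction_on_gpu: float) -> tuple[float, float]:
    """Split weight memory into the part that must fit in VRAM and the part left in system RAM."""
    on_gpu = total_weight_gib * fraction_on_gpu
    return on_gpu, total_weight_gib - on_gpu

# e.g. a 26 GiB quantized model with 60% of the weights kept on GPU
gpu_gib, cpu_gib = split_weights(26.0, 0.6)
print(f"must fit in VRAM: {gpu_gib:.1f} GiB, stays in system RAM: {cpu_gib:.1f} GiB")
```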
Common questions
Why does context length increase VRAM usage so quickly?
Longer context grows KV-cache memory, and that growth can become the dominant cost even when the model weights themselves still fit comfortably.
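
A rough illustration, reusing the hypothetical 8B-class shape from above (Q4 weights, 32 layers, 8 KV heads, 128-wide heads, FP16 cache, one sequence):

```python
weights_gib = 8e9 * 4.5 / 8 / 1024**3  # ~4.2 GiB of Q4 weights
for context in (4096, 32768, 131072):
    kv_gib = 2 * 32 * 8 * 128 * context * 1 * 2 / 1024**3
    print(f"{context:>7} tokens -> KV cache {kv_gib:5.1f} GiB vs {weights_gib:.1f} GiB of weights")
```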
What does offload change in a local LLM setup?
Offloading moves some model memory off the GPU, which can make a larger model fit, but usually at a speed penalty compared with keeping everything in VRAM.
Is parameter count enough to estimate whether a model will fit?
No. Quantization, context length, concurrency, KV-cache precision, and GPU count all affect the real memory budget, which is why one rule-of-thumb number is often misleading.
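
For example, two illustrative configurations of the same 8-billion-parameter, 32-layer model land an order of magnitude apart:

```python
def total_gib(bits_per_weight: float, context: int, sequences: int) -> float:
    weights = 8e9 * bits_per_weight / 8              # weight bytes
    kv = 2 * 32 * 8 * 128 * context * sequences * 2  # FP16 KV-cache bytes
    return (weights + kv) / 1024**3

print(f"Q4 weights, 4k context, 1 request:     {total_gib(4.5, 4096, 1):5.1f} GiB")
print(f"FP16 weights, 32k context, 8 requests: {total_gib(16, 32768, 8):5.1f} GiB")
```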