Will it run on your machine?
Pick your hardware. See every local model that fits, the quant to use, and the one actually worth running.
For each model and quant, the estimate is weights + KV cache + overhead, compared against the memory your hardware can actually give the model.
- Weights = parameters × bytes-per-param for the quant (Q4_K_M ≈ 0.58, Q8_0 ≈ 1.06, FP16 = 2.0). Mixture-of-experts models keep all experts in memory, so weights use the total parameter count, not the active count.
- KV cache = 2 × layers × KV-heads × head-dim × tokens × 2 bytes, using each model's real attention shape. The leading 2 counts the separate key and value tensors; the trailing 2 bytes assumes a 16-bit cache, and quantizing the cache to 8-bit roughly halves it. Most modern models use grouped-query attention, which means far fewer KV-heads and a much smaller cache than the parameter count would suggest.
- Usable memory = VRAM minus a small driver reserve on a dedicated card, or roughly 67-75% of unified memory on Apple Silicon (the share grows on bigger machines; macOS holds the rest, and you can raise the limit manually).
- Runs well means it sits under 90% of usable memory. Runs tight means it fits, but uses 90% or more of usable memory, so there is little headroom. CPU offload means it only fits by spilling layers into system RAM, which is much slower.
- Speed (tokens/sec) is a rough, bandwidth-based estimate: generation reads the active weights once per token, so tok/s ≈ memory bandwidth ÷ active model size in bytes. Two machines with the same amount of memory can differ in speed by a factor of five to eight, so "fits" doesn't mean "fast." Watch the tok/s, not just the verdict.
- Best for you is the largest, most capable model that still fits comfortably at Q4_K_M or better. Bigger usually means smarter, but a model that only just fits, or fits only at a low quant, isn't always better than a smaller one with room to spare. Treat it as a strong default, not gospel.
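The arithmetic in the list above can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual code: the `ModelShape` container, the function names, the 1 GB overhead default, and the decimal-gigabyte convention are all assumptions made for the example, and the "runs tight" band is read as "fits, but at 90% or more of usable memory." The 8B dense model shape is an illustrative example, not a specific model from the list.

```python
from dataclasses import dataclass

# Bytes per parameter for each quant, taken from the list above.
BYTES_PER_PARAM = {"Q4_K_M": 0.58, "Q8_0": 1.06, "FP16": 2.0}
GB = 1e9  # assumption: decimal gigabytes


@dataclass
class ModelShape:
    """Hypothetical container for the values the estimate needs."""
    total_params: float   # all parameters (every expert, for MoE models)
    active_params: float  # parameters read per token (equals total if dense)
    layers: int
    kv_heads: int
    head_dim: int


def weights_gb(m: ModelShape, quant: str) -> float:
    # Weights use the total parameter count, even for MoE models.
    return m.total_params * BYTES_PER_PARAM[quant] / GB


def kv_cache_gb(m: ModelShape, tokens: int, cache_bytes: int = 2) -> float:
    # Leading 2 = keys and values; cache_bytes=2 is a 16-bit cache, 1 is 8-bit.
    return 2 * m.layers * m.kv_heads * m.head_dim * tokens * cache_bytes / GB


def required_gb(m: ModelShape, quant: str, tokens: int,
                overhead_gb: float = 1.0, cache_bytes: int = 2) -> float:
    # overhead_gb is a placeholder value; the real overhead depends on the runtime.
    return weights_gb(m, quant) + kv_cache_gb(m, tokens, cache_bytes) + overhead_gb


def verdict(needed_gb: float, usable_gb: float) -> str:
    if needed_gb < 0.9 * usable_gb:
        return "runs well"
    if needed_gb <= usable_gb:   # assumed reading of "fits with little headroom"
        return "runs tight"
    return "CPU offload"


def tokens_per_sec(m: ModelShape, quant: str, bandwidth_gb_s: float) -> float:
    # Bandwidth-bound estimate: the active weights are read once per token.
    return bandwidth_gb_s / (m.active_params * BYTES_PER_PARAM[quant] / GB)


if __name__ == "__main__":
    dense_8b = ModelShape(8e9, 8e9, layers=32, kv_heads=8, head_dim=128)
    need = required_gb(dense_8b, "Q4_K_M", tokens=8192)
    print(f"needs {need:.2f} GB -> {verdict(need, usable_gb=8.0)}")
    print(f"~{tokens_per_sec(dense_8b, 'Q4_K_M', bandwidth_gb_s=400):.0f} tok/s")
```

With these example numbers, the 8B model at Q4_K_M with an 8,192-token context needs about 6.71 GB (4.64 GB of weights, about 1.07 GB of cache, plus the assumed 1 GB overhead), which lands in "runs well" on 8 GB of usable memory and works out to roughly 86 tok/s at 400 GB/s. Passing `cache_bytes=1` models the 8-bit cache and halves the cache term.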
These are close estimates, not exact VRAM accounting. We model the common K-quants; the IQ-quant family packs models even smaller at very low bit-rates. Architecture values are best-effort. Sliding-window models like Gemma are flagged on their own cards, since their long-context cache is smaller than shown. Real usage also shifts with your runtime and settings. The model list is current as of June 2026, and this space moves fast. When it says tight, leave yourself a little headroom.
You know those sites you use once and never open again? Yeah, this is one of those. I built it for fun. If it saved you a little time, you can buy me a coffee.