The best local LLM for your GPU in 2026 — a VRAM-by-VRAM guide
Every week someone asks r/LocalLLaMA “what can I run on my 3060?” and gets fifteen conflicting answers. This guide is the consolidated, honest version — the same logic our hardware scanner uses, written out so you can sanity-check it.
The one rule that matters
A model’s GGUF file size ≈ parameters × bits-per-weight ÷ 8, plus ~5% overhead. At Q4_K_M (≈4.85 effective bits), an 8B model is ~5 GB, a 14B is ~9 GB, a 32B is ~20 GB, a 70B is ~45 GB. Add 0.5–2 GB for context cache, and leave ~1 GB for your display. If weights + cache exceed VRAM, layers spill to system RAM and dense-model speed falls off a cliff.
4 GB (GTX 1650, RX 6500)
Run Gemma 3 4B or Qwen3 4B at Q3_K_M–Q4_K_M, fully on GPU. These punch far above their size in 2026 — genuinely useful for chat, summarization and light coding. Skip the temptation to offload a 12B; interactive speed dies.
6–8 GB (GTX 1660, RTX 3050/4060)
The 8B class is yours: Qwen3 8B, Llama 3.1 8B, or DeepSeek-R1 Distill 8B if you want visible reasoning chains — all at Q4_K_M or Q5_K_M with room for 16–32k context. On 8 GB, Gemma 3 12B at Q4_K_M just barely fits and is worth trying for chat quality.
10–12 GB (RTX 3080, 3060 12GB, 4070)
The sweet spot for price/performance. Gemma 3 12B at Q5_K_M or Qwen3 14B at Q4_K_M run fully on GPU at 30+ tokens/sec with 32k context. If you have 32 GB+ of system RAM, Qwen3 30B-A3B (MoE) with partial offload is the dark-horse pick: 30B-class quality that stays usable because only 3B parameters are active per token.
16 GB (RTX 4080, 4060 Ti 16GB)
Qwen3 14B at Q8_0 (near-lossless) or Mistral Small 3.2 24B at Q4_K_M. The Mistral is the workhorse choice for agents and function calling; 16 GB fits it with 16k context.
24 GB (RTX 3090/4090)
The community-consensus answer is Qwen3 32B at Q4_K_M — frontier-adjacent reasoning, fully on GPU, ~32k context. Alternatives: Gemma 3 27B at Q5_K_M for chat + vision, or Qwen3 14B at Q8_0 when you want maximum fidelity and 64k+ context.
Apple Silicon
Unified memory changes the math: macOS lets the GPU use ~70% of system RAM. A 36 GB MacBook Pro runs Qwen3 32B at Q4_K_M; a 64 GB machine runs 70B-class models at Q4; a 128–192 GB Mac Studio runs Llama 3.3 70B at Q8_0 — near-lossless — comfortably. Prompt processing is slower than NVIDIA, decode speed is fine.
CPU-only
With 16 GB of RAM: an 8B at Q4_K_M gives ~5–8 tokens/sec on a modern laptop — usable for short tasks. Prefer Qwen3 8B or Llama 3.1 8B. Anything 12B+ on CPU is a batch-job tool, not a chat partner. With 8 GB of RAM, use a 4B and keep context short.
FAQ
What does Q4_K_M mean in GGUF model names?
It's a quantization level: weights compressed to roughly 4.85 bits each. Q4_K_M is the community-standard sweet spot — about 70% smaller than FP16 with minimal quality loss. Q8_0 is near-lossless but twice the size; Q2_K is a last resort that visibly degrades output.
How much VRAM do I need to run a 14B model locally?
A 14B model at Q4_K_M needs about 9 GB for weights plus context cache, so a 12 GB card (RTX 3060 12GB, RTX 4070) runs it fully on GPU. At Q8_0 you need about 16 GB.
Can I run a 70B model at home?
Yes, with 48 GB+ of VRAM (2×24 GB cards) or a Mac with 64 GB+ unified memory. A Mac Studio with 128–192 GB runs Llama 3.3 70B at Q8_0 comfortably. On a single 24 GB card, a 70B only runs with heavy CPU offload — usable for batch work, painful for chat.
Is partial GPU offload worth it?
For dense models, usually not for interactive chat — splitting a 24B+ dense model between GPU and RAM drops you to a few tokens/second. The exception is MoE models like Qwen3 30B-A3B: only ~3B parameters are active per token, so they stay usable even with experts held in system RAM.
Skip the manual math — scan your machine
One command reads your GPU, VRAM, RAM and CPU and prints the exact model + quant + Ollama config that fits:
curl -fsSL https://butler.aiskillhub.info/scan.sh | sh