Question 1

What does Q4_K_M mean in GGUF model names?

Accepted Answer

It's a quantization level. Q4 means nominally 4-bit weights, K marks llama.cpp's k-quant scheme, and M is the medium variant, which averages roughly 4.85 bits per weight. Q4_K_M is the community-standard sweet spot: about 70% smaller than FP16 with minimal quality loss. Q8_0 is near-lossless but nearly twice the size of Q4_K_M; Q2_K is a last resort that visibly degrades output.
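The size figures follow from bits per weight. Here is a minimal sketch of that arithmetic, assuming 4.85 bits for Q4_K_M (the figure quoted above) and 8.5 bits for Q8_0 (blocks of 32 int8 weights plus one 16-bit scale). The function names are illustrative and not part of any library.

```python
# Approximate bits per weight. These are averages; real files vary slightly
# because some tensors are stored at a different precision.
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.85}


def weights_gb(params_billion: float, quant: str) -> float:
    """Estimated size of the weights in GB (10^9 bytes)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def reduction_vs_fp16(quant: str) -> float:
    """Fraction of the FP16 size saved by a given quantization."""
    return 1 - BITS_PER_WEIGHT[quant] / BITS_PER_WEIGHT["FP16"]


if __name__ == "__main__":
    for quant in ("Q8_0", "Q4_K_M"):
        print(f"{quant}: 7B model ~{weights_gb(7, quant):.1f} GB, "
              f"{reduction_vs_fp16(quant):.0%} smaller than FP16")
```

Under these assumptions Q4_K_M comes out just under 70% smaller than FP16, and Q8_0 is about 1.75 times the size of Q4_K_M.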

Question 2

How much VRAM do I need to run a 14B model locally?

Accepted Answer

A 14B model at Q4_K_M needs about 9 GB for weights plus context cache, so a 12 GB card (RTX 3060 12GB, RTX 4070) runs it fully on GPU. At Q8_0 you need about 16 GB.
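A rough way to reproduce these numbers is to add the quantized weight size to the size of the key/value cache. The sketch below assumes a hypothetical 14B architecture (48 layers, 8 KV heads, head dimension 128), a 4096-token context, and a 16-bit cache. Those shape values are illustrative, so check the config of the model you actually run. Runtime compute buffers are not counted.

```python
from dataclasses import dataclass


@dataclass
class ModelShape:
    params_billion: float
    n_layers: int
    n_kv_heads: int
    head_dim: int


def kv_cache_gb(shape: ModelShape, context_tokens: int, bytes_per_value: int = 2) -> float:
    """Key/value cache size in GB: two tensors (K and V) per layer per token."""
    values = 2 * shape.n_layers * shape.n_kv_heads * shape.head_dim * context_tokens
    return values * bytes_per_value / 1e9


def total_gb(shape: ModelShape, bits_per_weight: float, context_tokens: int) -> float:
    """Quantized weights plus KV cache, in GB."""
    weights = shape.params_billion * bits_per_weight / 8
    return weights + kv_cache_gb(shape, context_tokens)


# Hypothetical 14B shape, for illustration only.
HYPOTHETICAL_14B = ModelShape(params_billion=14.0, n_layers=48, n_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for name, bpw in (("Q4_K_M", 4.85), ("Q8_0", 8.5)):
        print(f"{name}: ~{total_gb(HYPOTHETICAL_14B, bpw, 4096):.1f} GB")
```

With these assumptions the estimate is roughly 9.3 GB at Q4_K_M and 15.7 GB at Q8_0. Longer contexts grow the cache linearly, so a 12 GB card leaves limited room beyond a few thousand tokens.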

Question 3

Can I run a 70B model at home?

Accepted Answer

Yes, at Q4_K_M, with 48 GB+ of VRAM (2×24 GB cards) or a Mac with 64 GB+ unified memory. A Mac Studio with 128–192 GB runs Llama 3.3 70B at Q8_0 comfortably. On a single 24 GB card, a 70B model only runs with heavy CPU offload: usable for batch work, painful for chat.
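The same arithmetic shows which quantization fits a given memory budget. This is a sketch only: the 4 GB headroom for context cache and runtime buffers is an assumed placeholder, and `best_quant` is a hypothetical helper, not a real tool.

```python
from typing import Optional

# Candidate quantizations, highest quality first, with approximate bits per weight.
QUANTS_BEST_FIRST = [("Q8_0", 8.5), ("Q4_K_M", 4.85)]


def best_quant(params_billion: float, memory_gb: float,
               headroom_gb: float = 4.0) -> Optional[str]:
    """Return the highest-quality quant whose weights plus headroom fit, else None."""
    for name, bits_per_weight in QUANTS_BEST_FIRST:
        weights_gb = params_billion * bits_per_weight / 8
        if weights_gb + headroom_gb <= memory_gb:
            return name
    return None  # nothing fits fully; CPU offload would be required


if __name__ == "__main__":
    for memory_gb in (24, 48, 64, 128):
        choice = best_quant(70, memory_gb)
        print(f"{memory_gb} GB: {choice or 'needs CPU offload'}")
```

On a Mac, not all unified memory is available to the GPU, so treat the budget passed in as the usable portion rather than the advertised total.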

Question 4

Is partial GPU offload worth it?

Accepted Answer

For dense models, usually not for interactive chat — splitting a 24B+ dense model between GPU and RAM drops you to a few tokens/second. The exception is MoE models like Qwen3 30B-A3B: only ~3B parameters are active per token, so they stay usable even with experts held in system RAM.
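One way to see why MoE models tolerate offload is a bandwidth-bound estimate: each generated token has to read the active weights once, so speed is limited by how fast memory can deliver them. The sketch below is a simplified upper bound under assumed bandwidths (500 GB/s for VRAM, 50 GB/s for system RAM). Both figures are placeholders, and real throughput will be lower.

```python
def tokens_per_second(active_params_billion: float, bits_per_weight: float,
                      fraction_in_ram: float,
                      gpu_gb_per_s: float, ram_gb_per_s: float) -> float:
    """Upper-bound decode speed when every active weight is read once per token."""
    gb_per_token = active_params_billion * bits_per_weight / 8
    seconds = (gb_per_token * (1 - fraction_in_ram) / gpu_gb_per_s
               + gb_per_token * fraction_in_ram / ram_gb_per_s)
    return 1 / seconds


if __name__ == "__main__":
    # Assumed bandwidths, for illustration only.
    GPU_BW, RAM_BW = 500.0, 50.0
    # Dense 24B at Q4_K_M with half of the weights held in system RAM.
    dense = tokens_per_second(24, 4.85, 0.5, GPU_BW, RAM_BW)
    # MoE with ~3B active parameters, pessimistically all read from system RAM.
    moe = tokens_per_second(3, 4.85, 1.0, GPU_BW, RAM_BW)
    print(f"dense 24B, half offloaded: ~{dense:.0f} tok/s")
    print(f"MoE, 3B active, all in RAM: ~{moe:.0f} tok/s")
```

Even with every active weight served from system RAM, the MoE case reads about an eighth of the data per token that the dense model does, which is where the gap comes from.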

The best local LLM for your GPU in 2026 — a VRAM-by-VRAM guide

The one rule that matters

4 GB (GTX 1650, RX 6500)

6–8 GB (GTX 1660, RTX 3050/4060)

10–12 GB (RTX 3080, 3060 12GB, 4070)

16 GB (RTX 4080, 4060 Ti 16GB)

24 GB (RTX 3090/4090)

Apple Silicon

CPU-only

FAQ
