Qwodel
Concepts

Quantization is always a trade-off. This page helps you understand the three dimensions of that trade-off — quality, size, and speed — so you can pick the right format for your use case.


What is Perplexity?

Perplexity (PPL) is the standard metric for measuring how well a language model predicts held-out text. Lower is better. A model with perplexity 5.2 predicts text more accurately than one with perplexity 5.9.

When you quantize a model, you slightly reduce its quality (raise its perplexity). The question is: how much quality are you willing to trade for a smaller file and faster inference?
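The relationship between loss and perplexity is simple: perplexity is the exponential of the mean per-token negative log-likelihood. A minimal illustration (not part of any library; the function name and inputs are made up for this sketch):

```python
import math

def perplexity(nll_per_token):
    """Perplexity = exp(mean per-token negative log-likelihood,
    natural log). Lower NLL -> lower perplexity -> better model."""
    return math.exp(sum(nll_per_token) / len(nll_per_token))

# A model with lower average NLL has lower perplexity:
better = perplexity([1.60, 1.70, 1.65])
worse = perplexity([1.75, 1.80, 1.78])
```

Quantization nudges the average NLL up slightly, which is why it shows up as a perplexity increase.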


The GGUF Format Ladder

For GGUF, the format name encodes the bit depth. Here's a practical ladder from smallest/fastest to largest/best:

| Format | Bits per weight | Size (7B model) | Perplexity loss | Recommended for |
|--------|-----------------|-----------------|-----------------|-----------------|
| Q2_K   | ~2.6  | ~2.7 GB | High        | Experimentation only |
| Q3_K_M | ~3.35 | ~3.3 GB | Medium-high | Tight RAM constraints |
| Q4_0   | 4.0   | ~3.8 GB | Medium      | Legacy compatibility |
| Q4_K_S | 4.4   | ~4.0 GB | Medium      | Slightly better than Q4_0 |
| Q4_K_M | 4.8   | ~4.1 GB | Low         | ✓ Default recommendation |
| Q5_K_S | 5.5   | ~4.6 GB | Very low    | More RAM, better quality |
| Q5_K_M | 5.7   | ~4.8 GB | Very low    | Quality-focused deployments |
| Q6_K   | 6.6   | ~5.5 GB | Minimal     | Near-lossless at 6-bit |
| Q8_0   | 8.0   | ~7.2 GB | Near-zero   | Testing / highest quality |

Rule of thumb: Q4_K_M is the right choice for 90% of users. Only go lower if you're constrained by RAM; only go higher if quality is more important than file size.
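The size column above follows directly from the bits-per-weight column: file size is roughly parameter count times average bits per weight, divided by 8. A rough estimator (illustrative only; real GGUF files add some metadata overhead, and "7B" models often have slightly fewer than 7 billion parameters):

```python
def gguf_size_gb(n_params, bits_per_weight):
    """Rough GGUF file size in GB: parameters x average bits
    per weight / 8 bits per byte. Ignores metadata overhead."""
    return n_params * bits_per_weight / 8 / 1e9

# A 7B model at Q4_K_M's ~4.8 bits per weight:
gguf_size_gb(7e9, 4.8)  # roughly 4 GB, matching the table
```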


The "K" Suffix

Formats with _K (e.g., Q4_K_M, Q5_K_S) use k-quants — a smarter quantization scheme that uses mixed precision (some layers at higher bits) for better quality at the same average bit width. Always prefer _K variants over plain ones (e.g., Q4_K_M over Q4_0).
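This mixing is why a nominally "4-bit" format like Q4_K_M averages 4.8 bits per weight: keeping a few sensitive layers at higher precision raises the weighted average. A sketch of the arithmetic (the layer split below is invented for illustration, not the actual k-quant layout):

```python
def average_bpw(layers):
    """Average bits per weight across a mixed-precision model,
    weighted by each layer group's parameter count.
    layers: list of (n_params, bits) pairs."""
    total_bits = sum(n * b for n, b in layers)
    total_params = sum(n for n, _ in layers)
    return total_bits / total_params

# Hypothetical split: 30% of weights at 6-bit, 70% at 4-bit
average_bpw([(3e9, 6), (7e9, 4)])  # -> 4.6 average bits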


AWQ: Always 4-bit

AWQ only supports one format: int4 (W4A16 — 4-bit weights, 16-bit activations). The quality trade-off is managed by the calibration step, not by choosing a format. AWQ at int4 typically outperforms GGUF Q4_K_M on perplexity because of its activation-aware approach.
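"W4A16" means the weights are stored as 4-bit integers with a shared scale per group, while activations stay in 16-bit float. A minimal sketch of the weight-only storage scheme (pure illustration; real AWQ additionally rescales salient channels using activation statistics before this step, which is the "activation-aware" part omitted here):

```python
def quantize_w4(weights, group_size=128):
    """Weight-only 4-bit quantization sketch: each group shares
    one float scale; values are stored as integers in [-8, 7].
    Assumes each group contains at least one nonzero weight."""
    out = []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / 7  # map largest weight to +/-7
        q = [max(-8, min(7, round(w / scale))) for w in group]
        out.append((scale, q))
    return out

def dequantize(groups):
    """Recover approximate float weights from (scale, ints) groups."""
    return [scale * q for scale, qs in groups for q in qs]
```

Activations are never quantized, so at inference each 4-bit weight is dequantized (or fused into the matmul) and multiplied against fp16 activations.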


CoreML: Float16 vs Int

| Format | Bits | Notes |
|--------|------|-------|
| float16        | 16 | Minimal quality loss. Best starting point. |
| int8_linear    | 8  | ~2x smaller than float16. Good quality. |
| int8_symmetric | 8  | Faster ops on the ANE. Similar quality. |
| int6           | 6  | Balance between int4 and int8. |
| int4           | 4  | Maximum compression. iOS 18+ only. |
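"Linear" here means affine quantization: the tensor's float range is mapped onto the int8 range with a scale and zero point. A minimal sketch of the idea (illustrative only; Core ML computes these parameters per channel, not per tensor as below):

```python
def int8_linear(values):
    """Affine int8 quantization sketch: map [min, max] of the
    values onto [-128, 127] via a scale and zero point.
    Assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255
    zero = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero)) for v in values]
    return scale, zero, q

def dequant(scale, zero, q):
    """Recover approximate floats: scale * (int - zero_point)."""
    return [scale * (v - zero) for v in q]
```

The symmetric variant (int8_symmetric) fixes the zero point at 0, which drops one add per multiply and maps better onto the Apple Neural Engine.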

Summary: Choosing a Format

Prioritize compatibility?      → Q4_K_M (GGUF)
Prioritize quality?            → Q6_K or Q8_0 (GGUF), or AWQ int4 on GPU
Prioritize smallest size?      → Q2_K or Q3_K_M (GGUF), int4 (CoreML)
Targeting iPhone?              → float16 or int8_linear (CoreML)