TurboQuant — Vector Quantization¶
Part of Statistical Methods — MORIE’s statistical-methods reference.
MORIE implements the TurboQuant algorithm as a standalone vector-compression library (`morie.quant`). It is validated against the paper's theoretical bounds and achieves near-optimal distortion rates.
Important
`morie.quant` compresses arbitrary vectors; it is not integrated into Ollama's inference pipeline. For runtime KV-cache compression during LLM inference, use Ollama's built-in support:

```bash
OLLAMA_KV_CACHE_TYPE="q8_0" ollama serve
```
TurboQuant integration into llama.cpp is pending (estimated Q2 2026).
Weight Quantization vs KV-Cache Quantization¶
These are different techniques that solve different problems:
| | Weight quantization | KV-cache quantization |
|---|---|---|
| **Compresses** | Model parameters (permanent, on disk) | Runtime attention scratchpad (ephemeral, in RAM) |
| **Reduces** | Model file size and VRAM needed to load | Context memory during generation |
| **Best for** | Fitting larger models on GPU (8B → 5GB file) | Fitting longer contexts on the same hardware |
| **Calibration** | Per-model tuning required | Data-oblivious (no calibration data required) |
| **Status in Ollama** | Handled automatically via GGUF | `q8_0`/`q4_0` today; TurboQuant integration pending |
Both can be combined: a Q4_K_M model with q8_0 KV-cache is the practical sweet spot for consumer hardware.
References¶
Zandieh, A., Daliri, M., Hadian, M., & Mirrokni, V. (2026). TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate. ICLR 2026. arXiv:2504.19874
Zandieh, A., Daliri, M., & Han, I. (2025). QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead. AAAI 2025, 39(24), 25805-25813. arXiv:2406.03482
Han, I., Kacham, P., Karbasi, A., Mirrokni, V., & Zandieh, A. (2026). PolarQuant: Quantization of KV Cache via Polar Coordinate Transforms. AISTATS 2026. arXiv:2502.02617
Algorithm¶
Stage 1 — TurboQuant_MSE (PolarQuant):
1. Generate a random rotation matrix \(\Pi\) via QR decomposition.
2. Rotate: \(\mathbf{y} = \Pi \cdot \mathbf{x}\).
3. Normalize: store \(\|\mathbf{y}\|_2\), compute \(\hat{\mathbf{y}} = \mathbf{y} / \|\mathbf{y}\|_2\).
4. Scalar-quantize each coordinate via a Lloyd-Max codebook optimized for the Beta-shaped coordinate distribution
   \[f_X(x) = \frac{\Gamma(d/2)}{\sqrt{\pi}\,\Gamma((d{-}1)/2)}\,(1 - x^2)^{(d-3)/2}\]
5. Store `(norm, indices)` per block.
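A minimal NumPy sketch of Stage 1, assuming a precomputed Lloyd-Max codebook (see Codebook Resolution below); the function and variable names are illustrative, not the `morie.quant` internals:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128
Pi, _ = np.linalg.qr(rng.standard_normal((d, d)))    # random rotation via QR

def stage1_encode(x, codebook):
    y = Pi @ x
    norm = np.linalg.norm(y)                          # stored exactly
    y_hat = y / norm                                  # coordinates follow the density above
    # Nearest Lloyd-Max centroid for every coordinate
    indices = np.abs(y_hat[:, None] - codebook[None, :]).argmin(axis=1)
    return norm, indices                              # the per-block payload

def stage1_decode(norm, indices, codebook):
    return Pi.T @ (norm * codebook[indices])          # Pi is orthogonal: inverse = transpose
```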
Stage 2 — QJL Error Correction:
1. Compute the residual: \(\mathbf{r} = \mathbf{x} - \hat{\mathbf{x}}_{\text{MSE}}\).
2. Project and sign-quantize: \(Q(\mathbf{r}) = \text{sign}(S \cdot \mathbf{r})\).
3. Dequantize: \(Q^{-1}(\mathbf{z}) = \frac{\sqrt{\pi/2}}{d} \, S^T \mathbf{z}\).
4. Guarantee: \(\mathbb{E}[\langle \mathbf{y}, Q^{-1}(Q(\mathbf{r})) \rangle] = \langle \mathbf{y}, \mathbf{r} \rangle\) (unbiased).
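The corresponding Stage 2 step, written directly from the formulas above; the square Gaussian projection `S` is a simplifying assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128
S = rng.standard_normal((d, d))    # Gaussian JL projection (square, for simplicity)

def qjl_quantize(r):
    return np.sign(S @ r)          # one sign bit per projected coordinate

def qjl_dequantize(z):
    return np.sqrt(np.pi / 2) / d * (S.T @ z)   # unbiased inner-product estimator
```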
Theoretical Bounds¶
MSE distortion upper bound (the form consistent with the per-bit bound values quoted in the results below):
\[D_{\text{MSE}}(b) \le 2.72 \cdot 2^{-2b}\]
A corresponding upper bound holds for the inner-product distortion of TurboQuant_Prod; see the paper for its exact form.
Information-theoretic lower bound:
\[D(b) \ge 2^{-2b}\]
The gap between TurboQuant's achievable rate and this lower bound is approximately \(2.72\times\).
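A quick numeric check that the \(2.72 \cdot 2^{-2b}\) form above reproduces the per-bit bound values quoted in the synthetic-validation tables below:

```python
# The quoted bounds are the 2.72 * 2**(-2b) values rounded to three decimals.
quoted = {2: 0.170, 3: 0.043, 4: 0.011, 5: 0.003}
for b, bound in quoted.items():
    assert abs(2.72 * 2.0 ** (-2 * b) - bound) <= 6e-4, b
```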
Experimental Results¶
Synthetic Validation¶
Validated on random Gaussian vectors (d=128 and d=256):
Python path (`morie.quant`):

| Bits | d | MSE | Bound | Cosine | Compression |
|---|---|---|---|---|---|
| 2 | 128 | 0.096 | 0.170 | 0.946 | 7.1× |
| 3 | 128 | 0.030 | 0.043 | 0.984 | 4.9× |
| 4 | 128 | 0.007 | 0.011 | 0.996 | 3.8× |
| 2 | 256 | 0.093 | 0.170 | 0.955 | 7.5× |
| 3 | 256 | 0.028 | 0.043 | 0.987 | 5.1× |
| 4 | 256 | 0.009 | 0.011 | 0.996 | 3.9× |
| 5 | 256 | 0.002 | 0.003 | 0.999 | 3.2× |
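Any row of this table can be spot-checked against the public API from the Implementation section below. A sketch, assuming MSE is measured per coordinate on standard normal inputs (the metric definitions are an assumption):

```python
import numpy as np
from morie.quant import turboquant_mse, turboquant_mse_decode

rng = np.random.default_rng(0)
errs, cosines = [], []
for _ in range(200):
    x = rng.standard_normal(128)
    x_hat = turboquant_mse_decode(turboquant_mse(x, bits=3))
    errs.append(np.mean((x - x_hat) ** 2))
    cosines.append(x @ x_hat / (np.linalg.norm(x) * np.linalg.norm(x_hat)))
print(np.mean(errs), np.mean(cosines))  # expect roughly 0.030 and 0.984
```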
C path (`quant_ggml.dylib`):

| Bits | MSE | Cosine | Compression |
|---|---|---|---|
| 2 | 0.110 | 0.935 | 7.5× |
| 3 | 0.028 | 0.984 | 5.1× |
| 4 | 0.008 | 0.995 | 3.9× |
All 3-bit and 4-bit results are within the paper’s theoretical MSE bounds. The C path uses Walsh-Hadamard Transform (WHT) with \(1/\sqrt{d}\) normalization instead of full QR decomposition.
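For reference, the WHT rotation is a few lines of NumPy. A sketch, assuming \(d\) is a power of two; this mirrors the C path's transform but is not the `quant_ggml.c` code itself:

```python
import numpy as np

def wht_rotate(x):
    """Iterative Walsh-Hadamard transform with 1/sqrt(d) normalization."""
    y = np.asarray(x, dtype=np.float64).copy()
    d = y.shape[0]                       # must be a power of two
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):     # standard butterfly pass
            a = y[i:i + h].copy()
            b = y[i + h:i + 2 * h].copy()
            y[i:i + h] = a + b
            y[i + h:i + 2 * h] = a - b
        h *= 2
    return y / np.sqrt(d)                # 1/sqrt(d), never 1/d (see Critical Notes)
```

Because the normalized transform is its own inverse, `wht_rotate(wht_rotate(v))` recovers `v`, which makes a convenient unit test.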
End-to-End Model Validation¶
TurboQuant was validated end-to-end on a 50.3M-parameter autoresearch GPT model (Karpathy's autoresearch: depth=4, d_model=256, 4 heads, vocab=8192) using post-training quantization of all 2D weight matrices. The model was trained on an Apple M2 (MPS) and quantized/evaluated on a 16GB ARM64 system.
Post-training quantization on the autoresearch 50.3M GPT:

| Bits | val_bpb | Δ bpb | Cosine | Compression | Hardware |
|---|---|---|---|---|---|
| fp32 baseline | 1.614 | – | 1.000 | 1.0× | M2 |
| 2-bit | 2.019 | +0.405 (+25.1%) | 0.940 | 14.6× | M2 |
| 3-bit | 2.001 | +0.387 (+24.0%) | 0.983 | 10.0× | M2 |
| 4-bit | 2.002 | +0.388 (+24.0%) | 0.995 | 7.6× | M2 |
| 5-bit | 2.028 | +0.0004 (+0.02%) | 0.999 | 6.2× | ARM64 |
Note
The 5-bit result uses a different baseline (a Pi-trained model, val_bpb = 2.028) than the 2- to 4-bit results (M2-trained, val_bpb = 1.614). The key metric is the percentage bpb degradation from quantization, not the absolute val_bpb:

- 4-bit: +24% bpb loss (significant quality degradation)
- 5-bit: +0.02% bpb loss (essentially lossless)

This roughly 1000× improvement in preservation comes from two changes: (1) a 50,000-point Lloyd-Max codebook grid (up from 10,000) for finer centroid placement, and (2) the extra bit providing 32 quantization levels per block instead of 16.
Per-Parameter Analysis (5-bit)¶
All 28 quantized weight tensors achieved cosine > 0.998:
| Tensor | Shape | MSE | Cosine |
|---|---|---|---|
| `transformer.wte.weight` | [8192, 256] | 0.1799 | 0.9986 |
| `transformer.h.*.attn.c_{q,k,v,proj}` | [256, 256] | 0.0001 | 0.9985 |
| `transformer.h.*.mlp.c_fc` | [1024, 256] | 0.0001 | 0.9986 |
| `transformer.h.*.mlp.c_proj` | [256, 1024] | 0.0000 | 0.9985 |
| `lm_head.weight` | [8192, 256] | 0.0000 | 0.9986 |
| `value_embeds.{1,3}.weight` | [8192, 256] | 0.1672 | 0.9986 |
The embedding tables (wte, value_embeds) have the highest absolute MSE due to their large magnitudes, but still achieve > 0.998 cosine because the error is proportional to signal strength.
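That relationship can be made precise. Assuming the quantization error is roughly uncorrelated with the signal, the cosine depends only on the relative error \(\rho\), not on the absolute MSE:

\[\cos(\mathbf{x}, \hat{\mathbf{x}}) \approx \frac{1}{\sqrt{1+\rho}} \approx 1 - \frac{\rho}{2}, \qquad \rho = \frac{\|\mathbf{x} - \hat{\mathbf{x}}\|_2^2}{\|\mathbf{x}\|_2^2}\]

With \(\rho \approx 0.003\) at 5 bits (the bound from the synthetic tables), this predicts a cosine of about 0.9985, matching the per-parameter results above.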
Why MSE-only Outperforms Prod at ≥ 4 Bits¶
TurboQuant has two modes:
- TurboQuant_MSE (Stage 1 only): all \(b\) bits go to scalar quantization
- TurboQuant_Prod (Stage 1 + Stage 2): \(b{-}1\) bits for MSE + 1 bit for QJL error correction
At ≥ 4 bits, Prod is worse because it sacrifices a full bit (halving resolution) for a QJL correction that provides diminishing returns. At 2-3 bits, the QJL correction is more valuable because the Stage 1 quantization is coarse.
At b = 5: MSE uses 32 levels directly, while Prod uses 16 + 1-bit QJL. The factor-of-2 resolution loss outweighs the error correction benefit.
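Concretely, since MSE distortion scales as \(2^{-2b}\) (see Theoretical Bounds), surrendering one bit quadruples the Stage 1 error:

\[\frac{D_{\text{MSE}}(b-1)}{D_{\text{MSE}}(b)} \approx \frac{2^{-2(b-1)}}{2^{-2b}} = 4\]

so at \(b \ge 4\) the single QJL bit would have to cancel roughly three quarters of the Stage 1 error just to break even.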
Significance for Causal Inference¶
TurboQuant’s unbiased property (\(\mathbb{E}[\hat{x}] = x\)) is critical for epidemiological applications. Biased quantization would systematically shift treatment effect estimates (ATE, CATE), potentially invalidating causal conclusions. The data-oblivious nature (no calibration data required) ensures the method is safe for any dataset without risk of overfitting to calibration distributions.
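The unbiasedness claim can be checked empirically with the Stage 2 sketch from the Algorithm section by averaging over independent projections; this is a Monte Carlo check under the unit-norm residual assumption, not part of `morie.quant`:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64
r = rng.standard_normal(d)
r /= np.linalg.norm(r)                      # unit residual (norm stored separately)
acc = np.zeros(d)
trials = 20_000
for _ in range(trials):
    S = rng.standard_normal((d, d))         # fresh Gaussian projection each round
    acc += np.sqrt(np.pi / 2) / d * (S.T @ np.sign(S @ r))
print(np.linalg.norm(acc / trials - r))     # shrinks toward 0 as trials grow
```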
Implementation¶
Python (`morie.quant`):

```python
import numpy as np

from morie.quant import turboquant_mse, turboquant_mse_decode

x = np.random.default_rng(42).standard_normal(256)
block = turboquant_mse(x, bits=3)      # 5.1x compression
x_hat = turboquant_mse_decode(block)   # reconstruct
```
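At 2-3 bits the Prod mode is preferable (see "Why MSE-only Outperforms Prod" above). A sketch, assuming a `turboquant_prod` entry point that mirrors the MSE API; the decoder name `turboquant_prod_decode` is an assumption, not confirmed by this page:

```python
# Assumed API: turboquant_prod mirrors turboquant_mse; the decoder
# name turboquant_prod_decode is an assumption.
from morie.quant import turboquant_prod, turboquant_prod_decode

block = turboquant_prod(x, bits=2)     # (b-1) bits for MSE + 1 bit for QJL
x_hat = turboquant_prod_decode(block)
```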
C (`quant_ggml.c`):

```bash
# Compile (macOS)
cc -O2 -march=native -shared -o quant_ggml.dylib quant_ggml.c -lm
```

```python
import numpy as np

from morie.quant_bridge import GGMLTurboQuant

x = np.random.default_rng(42).standard_normal(256)  # same test vector as above

tq = GGMLTurboQuant()  # auto-loads the .dylib
block = tq.quantize(x.astype(np.float32), bits=3)
x_hat = tq.dequantize(block, bits=3)
```
Codebook Resolution¶
Lloyd-Max codebook quality is controlled by the grid-resolution parameter in `morie.quant`. The codebook is built by discretizing the Beta distribution PDF on a uniform grid and running the Lloyd-Max iteration.

- 10,000-point grid: centroid precision ~1e-4, cosine 0.995 at 4-bit.
- 50,000-point grid (default): centroid precision ~2e-5, cosine 0.999 at 5-bit.

The 5× increase in grid resolution (`np.linspace(lo, hi, 50000)`) provides finer centroid placement, reducing quantization MSE by approximately 15% at 4+ bits. This has been the default since 2026-04-14.
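A hedged sketch of that construction, using the coordinate density from the Algorithm section; the names, initialization, and iteration count are illustrative, not the actual `morie.quant` internals:

```python
import numpy as np

def lloyd_max_codebook(bits, d, grid_points=50_000, iters=100):
    """Lloyd-Max centroids for the Beta-shaped coordinate density (up to scale)."""
    levels = 2 ** bits
    x = np.linspace(-1.0, 1.0, grid_points)            # uniform grid over [-1, 1]
    pdf = np.clip(1.0 - x ** 2, 0.0, None) ** ((d - 3) / 2.0)
    pdf /= pdf.sum()                                   # discrete probability mass
    cdf = np.cumsum(pdf)
    # Initialize centroids at evenly spaced quantiles of the distribution
    centroids = np.interp((np.arange(levels) + 0.5) / levels, cdf, x)
    for _ in range(iters):
        # Assignment step: nearest centroid for every grid point
        idx = np.abs(x[:, None] - centroids[None, :]).argmin(axis=1)
        # Update step: probability-weighted mean of each cell
        for k in range(levels):
            w = pdf[idx == k]
            if w.sum() > 0:
                centroids[k] = (x[idx == k] * w).sum() / w.sum()
    return centroids
```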
Reproduce¶
5-bit model compression (ARM64 reference):

```bash
cd dev/autoresearch-macos
python3 quantize_eval.py --bits 5 --eval-tokens 16384
```

Sweep all bit widths:

```bash
python3 quantize_eval.py --all --eval-tokens 16384
```

Full pipeline (train → quantize → benchmark → GGUF):

```bash
./run_full_experiment.sh
```
Critical Notes¶
- WHT normalization must use \(1/\sqrt{d}\), never \(1/d\).
- Temperature for LLM-assisted code generation: T=0.1 (never 0.7 for math).
- Lloyd-Max codebooks use 50,000-point grids (default) built from the exact Beta distribution.
- QJL provides unbiased inner-product estimation, which is essential for causal inference, where biased attention drifts ATE/CATE estimates.
- At ≥ 4 bits, use `turboquant_mse` (not `turboquant_prod`) for best results.
- 5-bit is the recommended default: 6.2× compression with cosine > 0.998.