TurboQuant: Google's Game-Changing AI Compression Algorithm

A deep dive into how Google cut LLM KV cache memory requirements by 6x with zero accuracy loss — and why it matters.



TL;DR

Google Research dropped TurboQuant in March 2026, and it's kind of a big deal. It compresses the key-value (KV) cache in large language models down to roughly 3 bits per value without any accuracy loss. Think of it as zipping your AI's working memory — but way smarter, because the model doesn't even notice the difference.

The numbers:

6x reduction in KV cache memory

8x speed boost on attention logit computation (tested on Nvidia H100 GPUs)

Zero accuracy loss across standard benchmarks

No training or fine-tuning required — just plug it in


The Problem: Why AI Models Are So Hungry for Memory

When you're running a large language model — whether it's Gemini, GPT, Mistral, whatever — the model doesn't just load its weights and go. It maintains something called a key-value cache.

Here's the simple version: every time the model processes a token, it computes a "key" and a "value" pair. It stores these so it doesn't have to recompute them every time. It's like a cheat sheet the model keeps while writing an essay.

The problem? As context windows grow — 100K tokens, 1M tokens — that cheat sheet gets massive. We're talking tens of gigabytes just for the cache. This is why running these models costs a fortune. You need beefy GPUs with huge amounts of memory.
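
To make "tens of gigabytes" concrete, here's some quick back-of-the-envelope arithmetic. The layer count, head count, and head size below are illustrative assumptions (ballpark for a mid-sized open model), not any particular model's config:

# Rough KV cache size at a 100K-token context (illustrative dimensions)
n_layers, n_heads, head_dim = 32, 32, 128
bytes_per_value = 2              # fp16
context_len = 100_000            # tokens

# Both keys and values are cached, hence the factor of 2
cache_bytes = 2 * n_layers * n_heads * head_dim * bytes_per_value * context_len
print(f"fp16 KV cache: {cache_bytes / 1e9:.1f} GB")            # ~52 GB
print(f"At ~6x compression: {cache_bytes / 6 / 1e9:.1f} GB")   # ~8.7 GB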

People have tried to compress this cache before. But every method comes with a trade-off: you compress the data, but you need extra bits to store "decompression keys" (quantization constants). It's like shrinking a zip file but having to attach a giant decoder ring. You save space, but not nearly as much as you'd hope.
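
Here's a rough sketch of where that overhead comes from with conventional block-wise quantization. The 32-value block size and fp16 constants are arbitrary choices for illustration:

# Conventional block-wise quantization: every block of values carries its own
# scale and zero-point (the "decoder ring"), stored in full precision
block_size = 32                  # values per block (arbitrary for illustration)
bits_per_value = 3
constant_bits = 2 * 16           # one fp16 scale + one fp16 zero-point per block

effective_bits = bits_per_value + constant_bits / block_size
print(f"Effective bits per value: {effective_bits:.1f}")   # 4.0 bits, not 3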

TurboQuant solves this.


Architecture Overview

Here's how the whole thing fits together at a high level.

The pipeline is dead simple: PolarQuant does the heavy lifting, QJL cleans up the tiny errors, and the result goes into a compressed cache that's 6x smaller than the original.


How It Works: Two-Stage Compression

Stage 1: PolarQuant — The Heavy Lifter

PolarQuant is where the magic starts. Here's what it does:

Random rotation: Take the data vectors and randomly rotate them. This simplifies the geometry — makes the data easier to compress.

Polar coordinate conversion: Instead of using standard X, Y, Z coordinates, convert vectors to polar coordinates. Think of it like replacing "go 3 blocks east, 4 blocks north" with "go 5 blocks at 53 degrees north of east." Same destination, but now the data has a predictable, circular pattern.

Scalar quantization: Apply standard quantization to each dimension independently. Because the data is now in polar form, you know exactly where the boundaries are — no normalization overhead.


The key insight: by rotating and converting to polar coordinates, TurboQuant eliminates the memory overhead that traditional quantization methods can't avoid. No extra bits needed for quantization constants because the pattern is already predictable.
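
A minimal 2-D illustration of that insight, reusing the blocks analogy from above: because every angle lands in a known, fixed range, the quantization grid can be hard-coded instead of stored alongside the data. The 3-bit grid here is just an example:

import numpy as np

# "3 blocks east, 4 blocks north" in polar form: radius 5, angle ~53 degrees
x, y = 3.0, 4.0
radius = np.hypot(x, y)            # 5.0
angle = np.arctan2(y, x)           # ~0.927 rad (~53.1 degrees)

# Angles always fall in [0, 2*pi), so a fixed 3-bit grid needs no stored constants
grid = np.linspace(0.0, 2 * np.pi, 2 ** 3)
quantized = grid[np.argmin(np.abs(grid - angle))]
print(f"radius={radius:.1f}, angle={np.degrees(angle):.1f} deg, "
      f"quantized={np.degrees(quantized):.1f} deg")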

Stage 2: QJL — The 1-Bit Error Killer

After PolarQuant does its thing, there's a tiny bit of error left. That's where QJL (Quantized Johnson-Lindenstrauss) comes in.

QJL uses a mathematical technique called the Johnson-Lindenstrauss Transform to preserve distances and relationships between data points while compressing each remaining error value to a single sign bit (+1 or -1).

One bit. That's it.


The estimator is the clever part — it balances a high-precision query with the low-precision simplified data, giving you accurate attention scores without the overhead.
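
Here's a quick numerical sanity check of that idea (a sketch, not the paper's exact estimator): project the query and the key with the same random Gaussian matrix, keep only the sign bits on the key side, and undo the shrinkage that sign quantization introduces. With a large enough projection, the 1-bit estimate lands in the right neighborhood of the true dot product:

import numpy as np

rng = np.random.default_rng(0)
d, m = 128, 16384                      # original dim; large m keeps the demo estimate tight
q = rng.standard_normal(d)             # full-precision query
k = rng.standard_normal(d)             # key we only keep sign bits for

S = rng.standard_normal((m, d))        # random JL-style Gaussian projection
sign_bits = np.sign(S @ k)             # 1 bit per projected coordinate

# Sign quantization shrinks the correlation by sqrt(2/pi); multiplying by
# sqrt(pi/2) and the key's norm (kept at full precision) removes that bias
estimate = np.sqrt(np.pi / 2) * np.linalg.norm(k) * np.mean((S @ q) * sign_bits)

print(f"true <q, k>   : {q @ k: .2f}")
print(f"1-bit estimate: {estimate: .2f}")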


Code: Implementing TurboQuant

Here's a simplified Python implementation showing the core concepts. This isn't production-ready (Google's actual implementation is more optimized), but it shows how the pieces fit together.

Core PolarQuant Implementation

import numpy as np


class PolarQuant:
    """
    PolarQuant: Converts vectors to polar coordinates for overhead-free quantization.

    The key idea: by rotating vectors randomly and converting to polar form,
    we eliminate the need for per-block quantization constants.
    """

    def __init__(self, n_dims: int, n_bits: int = 3):
        self.n_dims = n_dims
        self.n_bits = n_bits
        self.n_levels = 2 ** n_bits

        # Generate a random rotation matrix (done once)
        self.rotation_matrix = self._generate_rotation_matrix(n_dims)

    def _generate_rotation_matrix(self, dim: int) -> np.ndarray:
        """Generate a random orthogonal rotation matrix."""
        A = np.random.randn(dim, dim)
        # QR decomposition (Householder-based in LAPACK) gives an orthogonal matrix
        Q, _ = np.linalg.qr(A)
        return Q

    def _to_polar(self, vector: np.ndarray) -> tuple:
        """Convert a vector to polar (hyperspherical) coordinates."""
        # This is simplified — a real implementation handles dimension groups
        radius = np.linalg.norm(vector)
        angles = np.zeros(len(vector) - 1)
        if radius == 0:
            return 0.0, angles

        remaining = vector.astype(np.float64)
        for i in range(len(vector) - 1):
            r = np.linalg.norm(remaining)
            if r < 1e-10:
                break  # the rest of the vector is ~0; those angles stay 0
            # Compute angle against the remaining dimensions
            cos_angle = np.clip(remaining[0] / r, -1.0, 1.0)
            angles[i] = np.arccos(cos_angle)
            remaining = remaining[1:]

        # arccos alone loses the sign of the last coordinate; fold it into the
        # final angle so the round trip is faithful (final angle range: [0, 2*pi))
        if vector[-1] < 0:
            angles[-1] = 2 * np.pi - angles[-1]

        return radius, angles

    def _from_polar(self, radius: float, angles: np.ndarray) -> np.ndarray:
        """Convert polar coordinates back to a vector."""
        vec = np.zeros(len(angles) + 1)
        vec[0] = radius * np.cos(angles[0])

        running_sin = np.sin(angles[0])
        for i in range(1, len(angles)):
            vec[i] = radius * running_sin * np.cos(angles[i])
            running_sin *= np.sin(angles[i])

        vec[-1] = radius * running_sin
        return vec

    def _scalar_quantize(self, values: np.ndarray,
                         min_val: float, max_val: float) -> tuple:
        """Uniform scalar quantization."""
        step = (max_val - min_val) / (self.n_levels - 1)
        if step < 1e-10:
            indices = np.zeros(len(values), dtype=np.int32)
            return indices, min_val, step

        indices = np.round((values - min_val) / step).astype(np.int32)
        indices = np.clip(indices, 0, self.n_levels - 1)
        return indices, min_val, step

    def compress(self, kv_vectors: np.ndarray) -> dict:
        """
        Compress KV vectors using PolarQuant.

        Args:
            kv_vectors: Shape (seq_len, head_dim) — the KV cache vectors

        Returns:
            Compressed representation
        """
        # Step 1: Apply random rotation
        rotated = kv_vectors @ self.rotation_matrix

        # Step 2: Convert to polar coordinates and quantize
        compressed = []
        metadata = []
        max_radius = np.linalg.norm(rotated, axis=1).max()

        for vec in rotated:
            radius, angles = self._to_polar(vec)

            # Quantize radius (a real implementation would give it more precision)
            rad_idx, rad_min, rad_step = self._scalar_quantize(
                np.array([radius]), 0.0, max_radius
            )

            # Quantize angles (known range [0, 2*pi) → no per-vector constants)
            ang_idx, ang_min, ang_step = self._scalar_quantize(
                angles, 0.0, 2 * np.pi
            )

            compressed.append({
                'radius_idx': rad_idx[0],
                'angles_idx': ang_idx
            })
            metadata.append({
                'rad_min': rad_min, 'rad_step': rad_step,
                'ang_min': ang_min, 'ang_step': ang_step
            })

        return {
            'compressed': compressed,
            'metadata': metadata,
            'rotation_matrix': self.rotation_matrix,
            'n_bits': self.n_bits
        }

    def decompress(self, data: dict) -> np.ndarray:
        """Decompress back to original vector space."""
        vectors = []

        for comp, meta in zip(data['compressed'], data['metadata']):
            # Reconstruct radius
            radius = meta['rad_min'] + comp['radius_idx'] * meta['rad_step']

            # Reconstruct angles
            angles = meta['ang_min'] + comp['angles_idx'] * meta['ang_step']

            # Convert from polar
            vec = self._from_polar(radius, angles)
            vectors.append(vec)

        # Reverse rotation
        vectors = np.array(vectors)
        return vectors @ data['rotation_matrix'].T

QJL Error Correction


class QJLErrorCorrector:
    """
    QJL: Quantized Johnson-Lindenstrauss for 1-bit error correction.

    Takes residual errors from PolarQuant and eliminates bias
    using a single sign bit per value.
    """

    def __init__(self, input_dim: int, target_dim: int):
        self.input_dim = input_dim
        self.target_dim = target_dim

        # Random JL projection matrix
        # Scale by 1/sqrt(target_dim) for distance preservation
        self.projection = np.random.randn(input_dim, target_dim)
        self.projection /= np.sqrt(target_dim)

    def encode_residual(self, residual: np.ndarray) -> np.ndarray:
        """
        Encode residual error using 1-bit sign quantization.

        Args:
            residual: The small errors left after PolarQuant,
                shape (seq_len, input_dim)

        Returns:
            1-bit encoded representation (sign bits), shape (seq_len, target_dim)
        """
        # Project to lower dimension
        projected = residual @ self.projection

        # 1-bit quantization: just the sign
        sign_bits = np.sign(projected)
        # Replace zeros randomly with +1 or -1
        zeros = sign_bits == 0
        sign_bits[zeros] = np.random.choice([-1, 1], size=np.count_nonzero(zeros))

        return sign_bits

    def compute_attention_score(self, query: np.ndarray,
                                compressed_key: np.ndarray,
                                sign_bits: np.ndarray) -> np.ndarray:
        """
        Compute the QJL correction term for the attention scores.

        The estimator combines the high-precision query with the 1-bit
        key residuals to produce one correction value per key position.
        """
        # High-precision query projection, shape (target_dim,)
        query_proj = query @ self.projection

        # The QJL estimator, per key position: sign quantization shrinks inner
        # products by a factor of sqrt(2/pi) (from E[sign(x)] = erf(x/sqrt(2)),
        # which linearizes to sqrt(2/pi) * x), so dividing by that factor removes
        # the bias. The residual norms aren't stored in this simplified sketch,
        # so the correction is left unnormalized.
        raw_estimate = np.mean(query_proj * sign_bits, axis=-1)

        bias_correction = np.sqrt(2 / np.pi)

        return raw_estimate / bias_correction

Putting It Together: Full TurboQuant Pipeline


class TurboQuant:
    """
    Full TurboQuant pipeline combining PolarQuant + QJL.

    This is the complete compression/decompression pipeline
    for KV cache quantization.
    """

    def __init__(self, head_dim: int, polar_bits: int = 2, qjl_target_dim: int = 64):
        self.head_dim = head_dim
        self.polar_quant = PolarQuant(n_dims=head_dim, n_bits=polar_bits)
        self.qjl = QJLErrorCorrector(input_dim=head_dim, target_dim=qjl_target_dim)
        self.polar_bits = polar_bits

        # Total effective bits: ~2-3 bits for PolarQuant + 1 bit for QJL
        # vs the original 32 bits → 6-10x compression
        self.effective_bits = polar_bits + 1 + (32 / head_dim)  # overhead amortized

    def compress_kv_cache(self, keys: np.ndarray, values: np.ndarray) -> dict:
        """
        Compress the full KV cache.

        Args:
            keys: Shape (seq_len, n_heads, head_dim)
            values: Shape (seq_len, n_heads, head_dim)

        Returns:
            Compressed KV cache representation
        """
        compressed_keys = []
        compressed_values = []
        qjl_signs_k = []
        qjl_signs_v = []

        for head_idx in range(keys.shape[1]):
            # Compress keys with PolarQuant
            k_data = self.polar_quant.compress(keys[:, head_idx, :])

            # Compute residual errors for QJL
            k_decompressed = self.polar_quant.decompress(k_data)
            k_residual = keys[:, head_idx, :] - k_decompressed

            # Compress residual with QJL (1 bit)
            k_signs = self.qjl.encode_residual(k_residual)

            compressed_keys.append(k_data)
            qjl_signs_k.append(k_signs)

            # Same for values
            v_data = self.polar_quant.compress(values[:, head_idx, :])
            v_decompressed = self.polar_quant.decompress(v_data)
            v_residual = values[:, head_idx, :] - v_decompressed
            v_signs = self.qjl.encode_residual(v_residual)

            compressed_values.append(v_data)
            qjl_signs_v.append(v_signs)

        return {
            'keys': compressed_keys,
            'values': compressed_values,
            'qjl_signs_keys': qjl_signs_k,
            'qjl_signs_values': qjl_signs_v,
            'original_shape': keys.shape,
            'compression_ratio': 32 / self.effective_bits
        }

    def compute_attention(self, query: np.ndarray, cache: dict) -> np.ndarray:
        """
        Compute attention scores using the compressed KV cache.

        This is where the magic happens — attention computation
        directly on compressed data with QJL-corrected scores.
        """
        seq_len, n_heads, head_dim = cache['original_shape']
        scores = np.zeros((n_heads, seq_len))

        for head_idx in range(n_heads):
            # Decompress keys (PolarQuant only — QJL correction happens in the score)
            k_decompressed = self.polar_quant.decompress(cache['keys'][head_idx])

            # Standard attention dot product on decompressed keys
            base_scores = query[head_idx] @ k_decompressed.T

            # QJL correction: bias-corrected residual term, one value per position
            qjl_signs = cache['qjl_signs_keys'][head_idx]
            qjl_correction = self.qjl.compute_attention_score(
                query[head_idx], k_decompressed, qjl_signs
            )

            # Combine: base scores from PolarQuant + QJL correction
            scores[head_idx] = base_scores + qjl_correction

        return scores


# --- Usage Example ---

# Simulate a KV cache for a small model
# (heads up: this loop-heavy pure-Python demo can take a few minutes at these sizes)
seq_len = 2048   # context length
n_heads = 32     # attention heads
head_dim = 128   # dimension per head

# Generate random KV cache (in practice, this comes from the model)
keys = np.random.randn(seq_len, n_heads, head_dim).astype(np.float32)
values = np.random.randn(seq_len, n_heads, head_dim).astype(np.float32)

original_size = keys.nbytes + values.nbytes
print(f"Original KV cache size: {original_size / 1024 / 1024:.1f} MB")

# Compress
turbo = TurboQuant(head_dim=head_dim, polar_bits=2, qjl_target_dim=64)
compressed = turbo.compress_kv_cache(keys, values)

# Approximate compressed size from the effective bit budget
n_values = keys.size + values.size
compressed_size = n_values * turbo.effective_bits / 8  # bytes
print(f"Compressed size: ~{compressed_size / 1024 / 1024:.1f} MB")
print(f"Compression ratio: {compressed['compression_ratio']:.1f}x")

# Compute attention on compressed cache
query = np.random.randn(n_heads, head_dim).astype(np.float32)
scores = turbo.compute_attention(query, compressed)
print(f"Attention scores shape: {scores.shape}")
print("✓ TurboQuant pipeline complete")

Data Flow Architecture

Here's how data flows through the entire system during inference: incoming key and value vectors get randomly rotated, converted to polar coordinates, and scalar-quantized by PolarQuant; the small residual left over is sign-encoded by QJL; both land in the compressed cache. At decode time, attention scores are computed directly against that compressed cache, with the QJL estimator correcting for the residual error.


Performance Comparison

Let's put the numbers in perspective:

KV cache memory: 6x smaller

Attention logit computation: 8x faster (tested on Nvidia H100 GPUs)

Accuracy on standard benchmarks: no loss

Integration: no training or fine-tuning required — just plug it in


Where TurboQuant Fits in the LLM Stack

TurboQuant lives in the Memory Layer, specifically managing the KV cache. It doesn't touch model weights or activations — just the cache that grows with sequence length.


Market Impact: The Ripple Effect

The day Google announced TurboQuant, memory chip stocks took a hit:

Samsung: Down ~3%

Micron: Down ~5%

SK Hynix: Down ~4%

The market's logic: if models need 6x less memory, who's buying all these chips?

But analysts see it differently. Morgan Stanley's take: more intense computing, not less demand. Lower inference costs → more AI adoption → more total compute needed. The same thing happened when compression algorithms improved internet bandwidth — people didn't use less internet, they used more.

Forbes theorized that reducing hardware barriers could actually accelerate localized AI projects, paradoxically driving up total long-term chip consumption.

The community reaction was fast:

Ported to llama.cpp within 24 hours

Ported to MLX (Apple Silicon) within 48 hours

Reddit calling it "the democratization of local AI"


What This Means for You

If you're running LLMs at scale: Expect inference costs to drop 30-50%. This isn't a maybe — it's a when. Once TurboQuant is integrated into popular inference frameworks, it's just a flag you flip.

If you're building on-device AI: TurboQuant makes larger models viable on constrained hardware. Running a 13B model on a phone? Getting closer every day.

If you're in the chip business: The narrative shifts from "more memory" to "smarter memory." Companies that adapt win. Companies that don't, well...

If you're a researcher: The paper is at arxiv.org/abs/2504.19874. Being presented at ICLR 2026. The code approach is elegant — worth studying even if you never use TurboQuant itself.


Sources

Google Research Blog: turboquant-redefining-ai-efficiency-with-extreme-compression

Paper: arxiv.org/abs/2504.19874

Tom's Hardware: turboquant-compresses-llm-kv-caches-to-3-bits

TechCrunch: google-turboquant-ai-memory-compression

VentureBeat: turboquant-algorithm-speeds-up-ai-memory-8x


Written for people who want to understand the how, not just the what.