TurboQuant: Google's Game-Changing AI Compression Algorithm

A deep dive into how Google cut LLM KV cache memory requirements by 6x with zero accuracy loss — and why it matters.



TL;DR

Google Research dropped TurboQuant in March 2026, and it's kind of a big deal. It compresses the key-value (KV) cache in large language models down to roughly 3 bits per value without any accuracy loss. Think of it as zipping your AI's working memory — but way smarter, because the model doesn't even notice the difference.

The numbers:

6x reduction in KV cache memory

8x speed boost on attention logit computation (tested on Nvidia H100 GPUs)

Zero accuracy loss across standard benchmarks

No training or fine-tuning required — just plug it in


The Problem: Why AI Models Are So Hungry for Memory

When you're running a large language model — whether it's Gemini, GPT, Mistral, whatever — the model doesn't just load its weights and go. It maintains something called a key-value cache.

Here's the simple version: every time the model processes a token, it computes a "key" and a "value" pair. It stores these so it doesn't have to recompute them every time. It's like a cheat sheet the model keeps while writing an essay.

The problem? As context windows grow — 100K tokens, 1M tokens — that cheat sheet gets massive. We're talking tens of gigabytes just for the cache. This is why running these models costs a fortune. You need beefy GPUs with huge amounts of memory.
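
To make "tens of gigabytes" concrete, here's some quick back-of-the-envelope arithmetic. The layer count, head count, and head size below are illustrative assumptions (ballpark for a mid-sized open model), not any particular model's config:

# Rough KV cache size at a 100K-token context (illustrative dimensions)
n_layers, n_heads, head_dim = 32, 32, 128
bytes_per_value = 2              # fp16
context_len = 100_000            # tokens

# Both keys and values are cached, hence the factor of 2
cache_bytes = 2 * n_layers * n_heads * head_dim * bytes_per_value * context_len
print(f"fp16 KV cache: {cache_bytes / 1e9:.1f} GB")            # ~52 GB
print(f"At ~6x compression: {cache_bytes / 6 / 1e9:.1f} GB")   # ~8.7 GB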

People have tried to compress this cache before. But every method comes with a trade-off: you compress the data, but you need extra bits to store "decompression keys" (quantization constants). It's like shrinking a zip file but having to attach a giant decoder ring. You save space, but not nearly as much as you'd hope.
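
Here's a rough sketch of where that overhead comes from with conventional block-wise quantization. The 32-value block size and fp16 constants are arbitrary choices for illustration:

# Conventional block-wise quantization: every block of values carries its own
# scale and zero-point (the "decoder ring"), stored in full precision
block_size = 32                  # values per block (arbitrary for illustration)
bits_per_value = 3
constant_bits = 2 * 16           # one fp16 scale + one fp16 zero-point per block

effective_bits = bits_per_value + constant_bits / block_size
print(f"Effective bits per value: {effective_bits:.1f}")   # 4.0 bits, not 3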

TurboQuant solves this.


Architecture Overview

Here's how the whole thing fits together at a high level.

The pipeline is dead simple: PolarQuant does the heavy lifting, QJL cleans up the tiny errors, and the result goes into a compressed cache that's 6x smaller than the original.


How It Works: Two-Stage Compression

Stage 1: PolarQuant — The Heavy Lifter

PolarQuant is where the magic starts. Here's what it does:

Random rotation: Take the data vectors and randomly rotate them. This simplifies the geometry — makes the data easier to compress.

Polar coordinate conversion: Instead of using standard X, Y, Z coordinates, convert vectors to polar coordinates. Think of it like replacing "go 3 blocks east, 4 blocks north" with "go 5 blocks at 53 degrees north of east." Same destination, but now the data has a predictable, circular pattern.

Scalar quantization: Apply standard quantization to each dimension independently. Because the data is now in polar form, you know exactly where the boundaries are — no normalization overhead.


The key insight: by rotating and converting to polar coordinates, TurboQuant eliminates the memory overhead that traditional quantization methods can't avoid. No extra bits needed for quantization constants because the pattern is already predictable.
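
A minimal 2-D illustration of that insight, reusing the blocks analogy from above: because every angle lands in a known, fixed range, the quantization grid can be hard-coded instead of stored alongside the data. The 3-bit grid here is just an example:

import numpy as np

# "3 blocks east, 4 blocks north" in polar form: radius 5, angle ~53 degrees
x, y = 3.0, 4.0
radius = np.hypot(x, y)            # 5.0
angle = np.arctan2(y, x)           # ~0.927 rad (~53.1 degrees)

# Angles always fall in [0, 2*pi), so a fixed 3-bit grid needs no stored constants
grid = np.linspace(0.0, 2 * np.pi, 2 ** 3)
quantized = grid[np.argmin(np.abs(grid - angle))]
print(f"radius={radius:.1f}, angle={np.degrees(angle):.1f} deg, "
      f"quantized={np.degrees(quantized):.1f} deg")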

Stage 2: QJL — The 1-Bit Error Killer

After PolarQuant does its thing, there's a tiny bit of error left. That's where QJL (Quantized Johnson-Lindenstrauss) comes in.

QJL uses a mathematical technique called the Johnson-Lindenstrauss Transform to preserve distances and relationships between data points while compressing each remaining error value to a single sign bit (+1 or -1).

One bit. That's it.


The estimator is the clever part — it balances a high-precision query with the low-precision simplified data, giving you accurate attention scores without the overhead.
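
Here's a quick numerical sanity check of that idea (a sketch, not the paper's exact estimator): project the query and the key with the same random Gaussian matrix, keep only the sign bits on the key side, and undo the shrinkage that sign quantization introduces. With a large enough projection, the 1-bit estimate lands in the right neighborhood of the true dot product:

import numpy as np

rng = np.random.default_rng(0)
d, m = 128, 16384                      # original dim; large m keeps the demo estimate tight
q = rng.standard_normal(d)             # full-precision query
k = rng.standard_normal(d)             # key we only keep sign bits for

S = rng.standard_normal((m, d))        # random JL-style Gaussian projection
sign_bits = np.sign(S @ k)             # 1 bit per projected coordinate

# Sign quantization shrinks the correlation by sqrt(2/pi); multiplying by
# sqrt(pi/2) and the key's norm (kept at full precision) removes that bias
estimate = np.sqrt(np.pi / 2) * np.linalg.norm(k) * np.mean((S @ q) * sign_bits)

print(f"true <q, k>   : {q @ k: .2f}")
print(f"1-bit estimate: {estimate: .2f}")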


Code: Implementing TurboQuant

Here's a simplified Python implementation showing the core concepts. This isn't production-ready (Google's actual implementation is more optimized), but it shows how the pieces fit together.

Core PolarQuant Implementation

import numpy as np


class PolarQuant:
    """
    PolarQuant: Converts vectors to polar coordinates for overhead-free quantization.

    The key idea: by rotating vectors randomly and converting to polar form,
    we eliminate the need for per-block quantization constants.
    """

    def __init__(self, n_dims: int, n_bits: int = 3):
        self.n_dims = n_dims
        self.n_bits = n_bits
        self.n_levels = 2 ** n_bits

        # Generate a random rotation matrix (done once)
        self.rotation_matrix = self._generate_rotation_matrix(n_dims)

    def _generate_rotation_matrix(self, dim: int) -> np.ndarray:
        """Generate a random orthogonal rotation matrix."""
        A = np.random.randn(dim, dim)
        # QR decomposition (Householder-based in LAPACK) gives an orthogonal matrix
        Q, _ = np.linalg.qr(A)
        return Q

    def _to_polar(self, vector: np.ndarray) -> tuple:
        """Convert a vector to polar (hyperspherical) coordinates."""
        # This is simplified — a real implementation handles dimension groups
        radius = np.linalg.norm(vector)
        angles = np.zeros(len(vector) - 1)
        if radius == 0:
            return 0.0, angles

        remaining = vector.astype(np.float64)
        for i in range(len(vector) - 1):
            r = np.linalg.norm(remaining)
            if r < 1e-10:
                break  # the rest of the vector is ~0; those angles stay 0
            # Compute angle against the remaining dimensions
            cos_angle = np.clip(remaining[0] / r, -1.0, 1.0)
            angles[i] = np.arccos(cos_angle)
            remaining = remaining[1:]

        # arccos alone loses the sign of the last coordinate; fold it into the
        # final angle so the round trip is faithful (final angle range: [0, 2*pi))
        if vector[-1] < 0:
            angles[-1] = 2 * np.pi - angles[-1]

        return radius, angles

    def _from_polar(self, radius: float, angles: np.ndarray) -> np.ndarray:
        """Convert polar coordinates back to a vector."""
        vec = np.zeros(len(angles) + 1)
        vec[0] = radius * np.cos(angles[0])

        running_sin = np.sin(angles[0])
        for i in range(1, len(angles)):
            vec[i] = radius * running_sin * np.cos(angles[i])
            running_sin *= np.sin(angles[i])

        vec[-1] = radius * running_sin
        return vec

    def _scalar_quantize(self, values: np.ndarray,
                         min_val: float, max_val: float) -> tuple:
        """Uniform scalar quantization."""
        step = (max_val - min_val) / (self.n_levels - 1)
        if step < 1e-10:
            indices = np.zeros(len(values), dtype=np.int32)
            return indices, min_val, step

        indices = np.round((values - min_val) / step).astype(np.int32)
        indices = np.clip(indices, 0, self.n_levels - 1)
        return indices, min_val, step

    def compress(self, kv_vectors: np.ndarray) -> dict:
        """
        Compress KV vectors using PolarQuant.

        Args:
            kv_vectors: Shape (seq_len, head_dim) — the KV cache vectors

        Returns:
            Compressed representation
        """
        # Step 1: Apply random rotation
        rotated = kv_vectors @ self.rotation_matrix

        # Step 2: Convert to polar coordinates and quantize
        compressed = []
        metadata = []
        max_radius = np.linalg.norm(rotated, axis=1).max()

        for vec in rotated:
            radius, angles = self._to_polar(vec)

            # Quantize radius (a real implementation would give it more precision)
            rad_idx, rad_min, rad_step = self._scalar_quantize(
                np.array([radius]), 0.0, max_radius
            )

            # Quantize angles (known range [0, 2*pi) → no per-vector constants)
            ang_idx, ang_min, ang_step = self._scalar_quantize(
                angles, 0.0, 2 * np.pi
            )

            compressed.append({
                'radius_idx': rad_idx[0],
                'angles_idx': ang_idx
            })
            metadata.append({
                'rad_min': rad_min, 'rad_step': rad_step,
                'ang_min': ang_min, 'ang_step': ang_step
            })

        return {
            'compressed': compressed,
            'metadata': metadata,
            'rotation_matrix': self.rotation_matrix,
            'n_bits': self.n_bits
        }

    def decompress(self, data: dict) -> np.ndarray:
        """Decompress back to original vector space."""
        vectors = []

        for comp, meta in zip(data['compressed'], data['metadata']):
            # Reconstruct radius
            radius = meta['rad_min'] + comp['radius_idx'] * meta['rad_step']

            # Reconstruct angles
            angles = meta['ang_min'] + comp['angles_idx'] * meta['ang_step']

            # Convert from polar
            vec = self._from_polar(radius, angles)
            vectors.append(vec)

        # Reverse rotation
        vectors = np.array(vectors)
        return vectors @ data['rotation_matrix'].T

QJL Error Correction


class QJLErrorCorrector:
    """
    QJL: Quantized Johnson-Lindenstrauss for 1-bit error correction.

    Takes residual errors from PolarQuant and eliminates bias
    using a single sign bit per value.
    """

    def __init__(self, input_dim: int, target_dim: int):
        self.input_dim = input_dim
        self.target_dim = target_dim

        # Random JL projection matrix
        # Scale by 1/sqrt(target_dim) for distance preservation
        self.projection = np.random.randn(input_dim, target_dim)
        self.projection /= np.sqrt(target_dim)

    def encode_residual(self, residual: np.ndarray) -> np.ndarray:
        """
        Encode residual error using 1-bit sign quantization.

        Args:
            residual: The small errors left after PolarQuant,
                shape (seq_len, input_dim)

        Returns:
            1-bit encoded representation (sign bits), shape (seq_len, target_dim)
        """
        # Project to lower dimension
        projected = residual @ self.projection

        # 1-bit quantization: just the sign
        sign_bits = np.sign(projected)
        # Replace zeros randomly with +1 or -1
        zeros = sign_bits == 0
        sign_bits[zeros] = np.random.choice([-1, 1], size=np.count_nonzero(zeros))

        return sign_bits

    def compute_attention_score(self, query: np.ndarray,
                                compressed_key: np.ndarray,
                                sign_bits: np.ndarray) -> np.ndarray:
        """
        Compute the QJL correction term for the attention scores.

        The estimator combines the high-precision query with the 1-bit
        key residuals to produce one correction value per key position.
        """
        # High-precision query projection, shape (target_dim,)
        query_proj = query @ self.projection

        # The QJL estimator, per key position: sign quantization shrinks inner
        # products by a factor of sqrt(2/pi) (from E[sign(x)] = erf(x/sqrt(2)),
        # which linearizes to sqrt(2/pi) * x), so dividing by that factor removes
        # the bias. The residual norms aren't stored in this simplified sketch,
        # so the correction is left unnormalized.
        raw_estimate = np.mean(query_proj * sign_bits, axis=-1)

        bias_correction = np.sqrt(2 / np.pi)

        return raw_estimate / bias_correction

Putting It Together: Full TurboQuant Pipeline


class TurboQuant:
    """
    Full TurboQuant pipeline combining PolarQuant + QJL.

    This is the complete compression/decompression pipeline
    for KV cache quantization.
    """

    def __init__(self, head_dim: int, polar_bits: int = 2, qjl_target_dim: int = 64):
        self.head_dim = head_dim
        self.polar_quant = PolarQuant(n_dims=head_dim, n_bits=polar_bits)
        self.qjl = QJLErrorCorrector(input_dim=head_dim, target_dim=qjl_target_dim)
        self.polar_bits = polar_bits

        # Total effective bits: ~2-3 bits for PolarQuant + 1 bit for QJL
        # vs the original 32 bits → 6-10x compression
        self.effective_bits = polar_bits + 1 + (32 / head_dim)  # overhead amortized

    def compress_kv_cache(self, keys: np.ndarray, values: np.ndarray) -> dict:
        """
        Compress the full KV cache.

        Args:
            keys: Shape (seq_len, n_heads, head_dim)
            values: Shape (seq_len, n_heads, head_dim)

        Returns:
            Compressed KV cache representation
        """
        compressed_keys = []
        compressed_values = []
        qjl_signs_k = []
        qjl_signs_v = []

        for head_idx in range(keys.shape[1]):
            # Compress keys with PolarQuant
            k_data = self.polar_quant.compress(keys[:, head_idx, :])

            # Compute residual errors for QJL
            k_decompressed = self.polar_quant.decompress(k_data)
            k_residual = keys[:, head_idx, :] - k_decompressed

            # Compress residual with QJL (1 bit)
            k_signs = self.qjl.encode_residual(k_residual)

            compressed_keys.append(k_data)
            qjl_signs_k.append(k_signs)

            # Same for values
            v_data = self.polar_quant.compress(values[:, head_idx, :])
            v_decompressed = self.polar_quant.decompress(v_data)
            v_residual = values[:, head_idx, :] - v_decompressed
            v_signs = self.qjl.encode_residual(v_residual)

            compressed_values.append(v_data)
            qjl_signs_v.append(v_signs)

        return {
            'keys': compressed_keys,
            'values': compressed_values,
            'qjl_signs_keys': qjl_signs_k,
            'qjl_signs_values': qjl_signs_v,
            'original_shape': keys.shape,
            'compression_ratio': 32 / self.effective_bits
        }

    def compute_attention(self, query: np.ndarray, cache: dict) -> np.ndarray:
        """
        Compute attention scores using the compressed KV cache.

        This is where the magic happens — attention computation
        directly on compressed data with QJL-corrected scores.
        """
        seq_len, n_heads, head_dim = cache['original_shape']
        scores = np.zeros((n_heads, seq_len))

        for head_idx in range(n_heads):
            # Decompress keys (PolarQuant only — QJL correction happens in the score)
            k_decompressed = self.polar_quant.decompress(cache['keys'][head_idx])

            # Standard attention dot product on decompressed keys
            base_scores = query[head_idx] @ k_decompressed.T

            # QJL correction: bias-corrected residual term, one value per position
            qjl_signs = cache['qjl_signs_keys'][head_idx]
            qjl_correction = self.qjl.compute_attention_score(
                query[head_idx], k_decompressed, qjl_signs
            )

            # Combine: base scores from PolarQuant + QJL correction
            scores[head_idx] = base_scores + qjl_correction

        return scores


# --- Usage Example ---

# Simulate a KV cache for a small model
# (heads up: this loop-heavy pure-Python demo can take a few minutes at these sizes)
seq_len = 2048   # context length
n_heads = 32     # attention heads
head_dim = 128   # dimension per head

# Generate random KV cache (in practice, this comes from the model)
keys = np.random.randn(seq_len, n_heads, head_dim).astype(np.float32)
values = np.random.randn(seq_len, n_heads, head_dim).astype(np.float32)

original_size = keys.nbytes + values.nbytes
print(f"Original KV cache size: {original_size / 1024 / 1024:.1f} MB")

# Compress
turbo = TurboQuant(head_dim=head_dim, polar_bits=2, qjl_target_dim=64)
compressed = turbo.compress_kv_cache(keys, values)

# Approximate compressed size from the effective bit budget
n_values = keys.size + values.size
compressed_size = n_values * turbo.effective_bits / 8  # bytes
print(f"Compressed size: ~{compressed_size / 1024 / 1024:.1f} MB")
print(f"Compression ratio: {compressed['compression_ratio']:.1f}x")

# Compute attention on compressed cache
query = np.random.randn(n_heads, head_dim).astype(np.float32)
scores = turbo.compute_attention(query, compressed)
print(f"Attention scores shape: {scores.shape}")
print("✓ TurboQuant pipeline complete")

Data Flow Architecture

Here's how data flows through the entire system during inference: incoming key and value vectors get randomly rotated, converted to polar coordinates, and scalar-quantized by PolarQuant; the small residual left over is sign-encoded by QJL; both land in the compressed cache. At decode time, attention scores are computed directly against that compressed cache, with the QJL estimator correcting for the residual error.


Performance Comparison

Let's put the numbers in perspective:

KV cache memory: 6x smaller

Attention logit computation: 8x faster (tested on Nvidia H100 GPUs)

Accuracy on standard benchmarks: no loss

Integration: no training or fine-tuning required — just plug it in


Where TurboQuant Fits in the LLM Stack

TurboQuant lives in the Memory Layer, specifically managing the KV cache. It doesn't touch model weights or activations — just the cache that grows with sequence length.


Market Impact: The Ripple Effect

The day Google announced TurboQuant, memory chip stocks took a hit:

Samsung: Down ~3%

Micron: Down ~5%

SK Hynix: Down ~4%

The market's logic: if models need 6x less memory, who's buying all these chips?

But analysts see it differently. Morgan Stanley's take: more intense computing, not less demand. Lower inference costs → more AI adoption → more total compute needed. The same thing happened when compression algorithms improved internet bandwidth — people didn't use less internet, they used more.

Forbes theorized that reducing hardware barriers could actually accelerate localized AI projects, paradoxically driving up total long-term chip consumption.

The community reaction was fast:

Ported to llama.cpp within 24 hours

Ported to MLX (Apple Silicon) within 48 hours

Reddit calling it "the democratization of local AI"


What This Means for You

If you're running LLMs at scale: Expect inference costs to drop 30-50%. This isn't a maybe — it's a when. Once TurboQuant is integrated into popular inference frameworks, it's just a flag you flip.

If you're building on-device AI: TurboQuant makes larger models viable on constrained hardware. Running a 13B model on a phone? Getting closer every day.

If you're in the chip business: The narrative shifts from "more memory" to "smarter memory." Companies that adapt win. Companies that don't, well...

If you're a researcher: The paper is at arxiv.org/abs/2504.19874. Being presented at ICLR 2026. The code approach is elegant — worth studying even if you never use TurboQuant itself.


Sources

Google Research Blog: turboquant-redefining-ai-efficiency-with-extreme-compression

Paper: arxiv.org/abs/2504.19874

Tom's Hardware: turboquant-compresses-llm-kv-caches-to-3-bits

TechCrunch: google-turboquant-ai-memory-compression

VentureBeat: turboquant-algorithm-speeds-up-ai-memory-8x


Written for people who want to understand the how, not just the what.