TurboQuant: 6x more space efficient, 8x faster inference

Is cheap RAM coming back? Google's new TurboQuant makes models:

6x more space efficient (= less memory)
8x faster inference (= faster response)

It does this by compressing the model's vectors into lower-precision numbers. This is quantisation, which normally maps high-precision floating-point values to 16-bit, 8-bit or even smaller representations. The catch is that overly aggressive quantisation degrades the output (lower quality, more hallucinations, etc).
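
To make the trade-off concrete, here is a minimal sketch of plain symmetric integer quantisation. It is not TurboQuant, only the textbook baseline; the function names and the sample vector are made up for illustration. It shows the rounding error growing as the bit width shrinks.

```python
def quantise(values, bits=8):
    """Map floats to signed integers using one shared scale (symmetric scheme)."""
    qmax = 2 ** (bits - 1) - 1              # e.g. 127 for 8 bits
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantise(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


def max_error(values, bits):
    """Largest absolute difference after a quantise/dequantise round trip."""
    codes, scale = quantise(values, bits)
    restored = dequantise(codes, scale)
    return max(abs(a - b) for a, b in zip(values, restored))


# Hypothetical sample vector, chosen only for illustration.
vector = [0.12, -0.53, 0.97, -0.08, 0.44]
for bits in (8, 4, 2):
    print(bits, "bits -> max error", round(max_error(vector, bits), 4))
```

Fewer bits means less memory but a coarser grid, which is where the quality problems come from.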

TurboQuant uses a new approach (the PolarQuant method): rather than storing vectors in standard Cartesian coordinates (X, Y, Z), it converts them into polar coordinates consisting of a radius and a set of angles. It adds a few more tricks on top, and the results are rather impressive!
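
As a rough illustration of the polar idea, the sketch below pairs up coordinates, converts each pair to a radius and an angle, and stores the angle as a small integer code. This is an assumption-laden toy, not the PolarQuant algorithm itself: the pairing scheme, the uniform angle grid, keeping the radius at full precision, and all function names are choices made here for illustration. It assumes an even-length vector.

```python
import math


def to_polar_pairs(vector):
    """Turn consecutive (x, y) pairs into (radius, angle) pairs."""
    return [(math.hypot(x, y), math.atan2(y, x))
            for x, y in zip(vector[0::2], vector[1::2])]


def quantise_angle(angle, bits):
    """Snap an angle in [-pi, pi] to one of 2**bits evenly spaced levels."""
    levels = 2 ** bits
    step = 2 * math.pi / levels
    return round(angle / step) % levels     # integer code in [0, levels)


def from_polar_pairs(radii, codes, bits):
    """Rebuild Cartesian coordinates from radii and quantised angle codes."""
    step = 2 * math.pi / (2 ** bits)
    out = []
    for r, code in zip(radii, codes):
        out += [r * math.cos(code * step), r * math.sin(code * step)]
    return out


# Hypothetical 4-dimensional vector, i.e. two (x, y) pairs.
vector = [0.6, 0.8, -1.0, 0.5]
bits = 4
polar = to_polar_pairs(vector)
radii = [r for r, _ in polar]
codes = [quantise_angle(a, bits) for _, a in polar]
restored = from_polar_pairs(radii, codes, bits)
print(codes, [round(v, 3) for v in restored])
```

One property worth noticing in this toy: the length of each pair is preserved exactly, and only its direction is approximated, so the error is bounded by the angle step.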

Science takes time and smart people. The timeline shows it:

Jun 2024: QJL: 1-Bit Quantized JL Transform for KV Cache Quantization
Feb 2025: PolarQuant: Quantizing KV Caches with Polar Transformation
Apr 2025: TurboQuant: Online Vector Quantization
Mar 2026: Google publishes their post

This will require requantising models, but I have no doubt that many labs will prioritise it, given the potential savings.