TurboQuant: 6x more space efficient, 8x faster inference

Is cheap RAM coming back? Google's new TurboQuant makes models:

  • 6x more space efficient (= less memory)
  • 8x faster inference (= faster response)

It does this by "compressing" the model's vectors/data into smaller numbers. This is quantisation, which normally converts high-precision floating-point values into lower-precision formats: 16-bit floats, 8-bit integers or even smaller integers. The catch is that quantising too aggressively degrades the output (lower quality, more hallucinations, etc).
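To make the trade-off concrete, here is a minimal sketch of plain uniform quantisation in Python. It is not TurboQuant, only the textbook baseline; the function names `quantize` and `dequantize` and the sample values are my own, chosen for illustration.

```python
def quantize(values, bits=8):
    """Symmetric uniform quantisation: floats -> signed integer codes plus one scale."""
    qmax = 2 ** (bits - 1) - 1                  # largest code, e.g. 127 for 8 bits
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0    # float step represented by one code unit
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floats."""
    return [c * scale for c in codes]


# Illustrative values only, not taken from any real model.
values = [0.12, -0.98, 0.45, 0.03]

for bits in (8, 4, 2):
    codes, scale = quantize(values, bits)
    restored = dequantize(codes, scale)
    worst = max(abs(a - b) for a, b in zip(values, restored))
    print(f"{bits}-bit codes: {codes}  worst error: {worst:.4f}")
```

Fewer bits mean fewer distinct codes, so the worst-case error grows as the bit width shrinks. That growing error is the quality problem described above.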

TurboQuant uses a new approach (the PolarQuant method): rather than storing vectors in standard Cartesian coordinates (X, Y, Z), it converts them into polar coordinates: a radius and a set of angles. It adds some more tricks on top, building on the earlier QJL work listed in the timeline. And the results are rather impressive!
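For intuition, here is a small sketch of the coordinate change itself: the standard conversion of an n-dimensional vector into one radius and n-1 angles, and back. This is only the textbook transformation, not the published PolarQuant algorithm, and the names `to_polar` and `to_cartesian` are hypothetical.

```python
import math


def to_polar(x):
    """Convert a vector (2 or more components) to (radius, angles)."""
    n = len(x)
    radius = math.sqrt(sum(v * v for v in x))
    angles = []
    # All angles but the last lie in [0, pi].
    for k in range(n - 2):
        tail = math.sqrt(sum(v * v for v in x[k + 1:]))
        angles.append(math.atan2(tail, x[k]))
    # The last angle covers the full circle, so the sign of x[-1] is kept.
    angles.append(math.atan2(x[-1], x[-2]))
    return radius, angles


def to_cartesian(radius, angles):
    """Inverse of to_polar."""
    x = []
    running = radius            # radius times the product of sines so far
    for a in angles:
        x.append(running * math.cos(a))
        running *= math.sin(a)
    x.append(running)
    return x


vector = [1.0, 2.0, -2.0]       # illustrative values only
radius, angles = to_polar(vector)
print(radius, angles)
print(to_cartesian(radius, angles))
```

The radius carries the vector's overall magnitude, while every angle falls in a known, bounded range regardless of how large the vector is. That bounded range is one reason a polar representation can be attractive to quantise.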

Science takes time and smart people. The timeline shows it:

  • Jun 2024: QJL: 1-Bit Quantized JL Transform for KV Cache Quantization
  • Feb 2025: PolarQuant: Quantizing KV Caches with Polar Transformation
  • Apr 2025: TurboQuant: Online Vector Quantization
  • Mar 2026: Google publishes their post

This will require requantising models, but I have no doubt that many labs will prioritise this, seeing the potential savings.

Olivier Reuland