Is cheap RAM coming back? Google's new TurboQuant makes models:
- 6x more space efficient (= less memory)
- 8x faster inference (= faster response)
It does this by "compressing" the vectors a model works with into smaller numbers (the papers listed below target the KV cache). This is quantisation, which normally converts high-precision floating-point values into 16-bit, 8-bit or even smaller formats. The catch is that too much quantisation hurts output quality (more errors, hallucinations, etc).
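To make the trade-off concrete, here is a minimal sketch of plain uniform quantisation in Python. It is not TurboQuant, just the textbook baseline, and the function names and the random test vector are made up for illustration. It shows how the round-trip error grows as the bit width shrinks.

```python
import random

def quantize(values, bits):
    """Symmetric uniform quantisation: map floats to signed integer codes."""
    levels = 2 ** (bits - 1) - 1                 # e.g. 127 for 8 bits
    # Scale so the largest magnitude maps to the largest code (1.0 if all zeros).
    scale = max(abs(v) for v in values) / levels or 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate floats."""
    return [c * scale for c in codes]

def mean_abs_error(values, bits):
    """Average absolute difference after a quantise/dequantise round trip."""
    codes, scale = quantize(values, bits)
    restored = dequantize(codes, scale)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)

random.seed(0)
# Hypothetical stand-in for a vector of model activations.
vector = [random.gauss(0.0, 1.0) for _ in range(1024)]
errors = {bits: mean_abs_error(vector, bits) for bits in (8, 4, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit: mean absolute error = {err:.4f}")
```

Fewer bits means fewer distinct codes, so each stored value lands further from the original. That gap is what smarter schemes try to shrink.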
TurboQuant builds on a new approach (the PolarQuant method): rather than storing vectors in standard Cartesian coordinates (X, Y, Z), it converts them into polar coordinates consisting of a radius and a set of angles. It adds a few more tricks on top, and the results are rather impressive!
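As a rough illustration of the coordinate change, the sketch below converts a vector to a radius plus angles using the textbook hyperspherical formulas, quantises only the angles, and converts back. This is an assumption-laden toy, not the PolarQuant algorithm itself: the helper names, the 8-bit angle grid and the example vector are all hypothetical.

```python
import math

def to_polar(x):
    """Cartesian vector (length >= 2) -> (radius, list of len(x) - 1 angles)."""
    r = math.sqrt(sum(v * v for v in x))
    angles = []
    for i in range(len(x) - 2):
        tail = math.sqrt(sum(v * v for v in x[i + 1:]))
        angles.append(math.atan2(tail, x[i]))    # lies in [0, pi]
    angles.append(math.atan2(x[-1], x[-2]))      # lies in (-pi, pi]
    return r, angles

def from_polar(r, angles):
    """Inverse of to_polar."""
    x = []
    sin_prod = r                                 # r times the sines seen so far
    for a in angles:
        x.append(sin_prod * math.cos(a))
        sin_prod *= math.sin(a)
    x.append(sin_prod)
    return x

def quantize_angle(a, bits):
    """Snap an angle in [-pi, pi] to one of 2**bits evenly spaced codes."""
    steps = 2 ** bits - 1
    return round((a + math.pi) / (2 * math.pi) * steps)

def dequantize_angle(code, bits):
    """Map a code back to its angle."""
    steps = 2 ** bits - 1
    return code / steps * 2 * math.pi - math.pi

vector = [1.0, 2.0, 2.0]                         # hypothetical example vector
r, angles = to_polar(vector)
codes = [quantize_angle(a, 8) for a in angles]   # small integers to store
approx = from_polar(r, [dequantize_angle(c, 8) for c in codes])
print("radius:", r)
print("angle codes:", codes)
print("reconstructed:", approx)
```

One property worth noticing in this toy version: because the radius is kept separately, the reconstructed vector has the same length as the original no matter how coarsely the angles are rounded. Only its direction drifts.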
Science takes time and smart people. The timeline shows it:
- Jun 2024: QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead
- Feb 2025: PolarQuant: Quantizing KV Caches with Polar Transformation
- Apr 2025: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
- Mar 2026: Google publishes their post
This will require requantising models, but I have no doubt that many labs will prioritise this, seeing the potential savings.