Model Compression and Quantization
This page tracks compression, quantization, and low-bit inference.
Sources in this batch
- “Everything looks fine at 4-bit” is a video source on aggressive quantization.
- A Substack guide covers quantized neural networks.
- Google Research’s TurboQuant post frames extreme compression as a route to AI efficiency.
Research interest
The central question is how far compression can go before a model's qualitative behavior breaks down. If many workloads remain usable at 4-bit precision or under newer, more aggressive compression schemes, the deployment frontier shifts toward local and edge inference. The hard research problem is not just retaining perplexity, though; it is whether reasoning, tool use, calibration, and long-context behavior survive quantization.
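To make "4-bit" concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is not the method used by any of the sources above. The function names, the single shared scale, and the Gaussian stand-in for weights are all assumptions chosen for illustration. It shows only the weight-reconstruction error at different bit widths, which is roughly the kind of signal that perplexity-style metrics track. It does not measure reasoning or tool-use behavior.

```python
import random


def quantize_symmetric(weights, bits=4):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level, then clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from integer levels."""
    return [v * scale for v in q]


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Assumption: a Gaussian sample stands in for a real weight tensor.
random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1024)]

errors = {}
for bits in (8, 4, 2):
    q, scale = quantize_symmetric(weights, bits)
    errors[bits] = mean_abs_error(weights, dequantize(q, scale))
    print(f"{bits}-bit: mean abs reconstruction error = {errors[bits]:.4f}")
```

Practical schemes usually use per-channel or per-group scales rather than one scale per tensor. The single scale here keeps the sketch short and likely overstates the error relative to those schemes.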
Open questions:
- Which capabilities degrade first under extreme compression?
- Can quantization-aware training (QAT) preserve agent and tool-use behavior better than post-training quantization (PTQ)?
- How should evals distinguish cosmetic output quality from reasoning robustness?
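One way to approach the first and last questions is to report retention per capability instead of a single aggregate score. The sketch below assumes each capability already has a scalar score for the full-precision and quantized models. The function name `retention_report`, the capability names, and every number are hypothetical placeholders, not measurements from any model or source.

```python
def retention_report(baseline, quantized):
    """Return (capability, retention) pairs, most degraded first.

    retention = quantized score / baseline score; 1.0 means no measured loss.
    """
    report = []
    for name, base in baseline.items():
        if base <= 0:
            continue  # skip capabilities without a meaningful baseline
        report.append((name, quantized[name] / base))
    return sorted(report, key=lambda item: item[1])


# Placeholder scores for illustration only; not real evaluation results.
baseline = {"fluency": 0.90, "reasoning": 0.80, "tool_use": 0.70, "long_context": 0.60}
quantized = {"fluency": 0.89, "reasoning": 0.68, "tool_use": 0.49, "long_context": 0.45}

report = retention_report(baseline, quantized)
for name, retention in report:
    print(f"{name}: retention {retention:.2f}")
```

With a breakdown like this, a model whose fluency retention stays near 1.0 while tool-use retention drops sharply would be flagged as cosmetically fine but functionally degraded, which an averaged score would hide.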