Kamil Józwik

LLM quantization

Quantization is a model compression technique that reduces the size and computational requirements of LLMs.


Large language models are revolutionizing our daily lives, powering everything from intelligent code completion to sophisticated chatbots. However, their power comes with a significant drawback: size. State-of-the-art LLMs often contain billions of parameters, making them incredibly demanding in terms of memory, computational power, and consequently, operational cost. A model like Llama-3-70B, for instance, which is far from the largest available, can require upwards of 140 GB of high-speed memory just to load, restricting its use to high-end hardware.

This is where quantization enters the picture. It's a powerful model compression technique designed to make these massive models significantly leaner and faster. Think of it like compressing a high-resolution image by reducing the number of colors – quantization reduces the precision of the numbers (weights and sometimes activations) within the model, typically converting them from 32-bit or 16-bit floating-point numbers (FP32/FP16) to lower-precision formats like 8-bit or even 4-bit integers (INT8/INT4).

The result? A dramatic reduction in model size and a significant boost in inference speed. While this compression involves a small trade-off in model accuracy (often negligible with modern techniques), the gains in efficiency are substantial. For example, a 6-billion-parameter model might shrink from roughly 24 GB in FP32 (or 12 GB in FP16) to around 3.7 GB in INT4.
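
If you want a quick intuition for numbers like these, a back-of-the-envelope calculation is enough. The sketch below is purely illustrative: the helper function and the ~10% overhead allowance for scaling factors and metadata are assumptions, not exact figures for any particular model or format.

```python
# Rough estimate of LLM weight memory at different precisions.
# The 10% overhead for scales, zero-points, and metadata is an illustrative guess.

def estimate_weight_memory_gb(num_params: float, bits_per_param: float,
                              overhead: float = 0.10) -> float:
    """Approximate memory needed just to hold the weights, in gigabytes."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total * (1 + overhead) / 1e9

for label, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"6B model @ {label}: ~{estimate_weight_memory_gb(6e9, bits):.1f} GB")

# Roughly: FP32 ~26 GB, FP16 ~13 GB, INT8 ~6.6 GB, INT4 ~3.3 GB
```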

For us, software developers building AI-powered apps, quantization is the key to:

  1. Running powerful LLMs on accessible hardware: Deploy models on consumer-grade GPUs, standard CPUs, or even mobile and edge devices.
  2. Building faster, more responsive AI applications: Lower inference latency leads to a better user experience.
  3. Reducing operational costs: Lower memory and compute requirements translate to cheaper hosting and energy savings.
  4. Democratizing AI: Making advanced LLM capabilities available in a wider range of applications and environments.

In essence, quantization bridges the gap between the cutting-edge capabilities of large models and the practical constraints of real-world deployment. This guide will demystify quantization, equipping you with the knowledge to confidently leverage this technique in your AI-powered projects.

How does it work?

At its heart, quantization is about representing the numerical values within an LLM using fewer bits. Models are primarily composed of weights (parameters learned during training) and activations (intermediate outputs calculated during inference). These are typically stored as high-precision floating-point numbers (like FP32 or FP16). Quantization converts these numbers into lower-precision formats, most commonly 8-bit (INT8) or 4-bit (INT4) integers.

The core mechanism involves mapping a wide range of high-precision values onto a much smaller, discrete set of low-precision values. Imagine rounding the number 3.14159265 down to 3.14. The rounded version is slightly less precise but takes up less space. Similarly, in an LLM, a 32-bit weight might be approximated by one of just 256 possible values (for INT8) or even 16 possible values (for INT4).

To manage this mapping effectively and minimize the loss of information, quantization techniques often employ scaling factors and zero-points. These parameters help define how the original range of floating-point values corresponds to the target integer range, ensuring that important values (like zero) are represented accurately.
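
To make scale and zero-point concrete, here is a minimal NumPy sketch of asymmetric affine INT8 quantization. All names are illustrative; real libraries do this per layer, per channel, or per block, and far more efficiently.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Asymmetric affine quantization of a float tensor to unsigned 8-bit."""
    qmin, qmax = 0, 255                                      # 256 representable levels
    scale = float(x.max() - x.min()) / (qmax - qmin)         # real units per integer step
    zero_point = int(round(qmin - float(x.min()) / scale))   # integer that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Map the integers back to approximate floats for computation."""
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(4, 4).astype(np.float32)  # stand-in for a weight block
q, scale, zp = quantize_int8(weights)
print("max quantization error:", np.abs(weights - dequantize(q, scale, zp)).max())
```

Note how dequantization recovers only an approximation of the original values; that difference is the quantization error discussed below.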

The impact on model size is significant:

  • INT8 Quantization: Uses 8 bits per parameter, roughly 4 times smaller than FP32.
  • INT4 Quantization: Uses 4 bits per parameter, roughly 8 times smaller than FP32.

This compression isn't free; reducing precision inevitably introduces a small amount of quantization error, as multiple distinct high-precision values might map to the same low-precision value. The goal of sophisticated quantization techniques is to perform this mapping intelligently, minimizing the impact of this error on the model's final output. Modern methods employ clever strategies like non-uniform value distributions (allocating more precision to more common value ranges), per-group scaling (using different scaling factors for different parts of the model), or being "activation-aware" (protecting weights that have a larger impact on the model's calculations).
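
As a rough illustration of per-group scaling, the sketch below quantizes a weight matrix to INT4 with a separate "absmax" scale for every block of 64 values. The group size and function names are arbitrary choices for this example, not the scheme of any particular library.

```python
import numpy as np

def quantize_int4_grouped(weights: np.ndarray, group_size: int = 64):
    """Symmetric absmax INT4 quantization with one scale per group of weights.

    Smaller groups track local value ranges more closely, which is why most
    INT4 schemes quantize in blocks rather than per whole tensor.
    """
    flat = weights.reshape(-1, group_size)
    scales = np.abs(flat).max(axis=1, keepdims=True) / 7.0 + 1e-12  # INT4 range: -8..7
    q = np.clip(np.round(flat / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_grouped(q: np.ndarray, scales: np.ndarray, shape) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(shape)

w = np.random.randn(128, 64).astype(np.float32)
q, scales = quantize_int4_grouped(w)
w_hat = dequantize_grouped(q, scales, w.shape)
print("mean absolute error:", np.abs(w - w_hat).mean())
```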

During inference, the model performs computations using these low-bit representations. Sometimes, the low-bit weights are temporarily "dequantized" back to a higher precision (like FP16) just for the mathematical operation, but the memory saving comes from storing and transferring the weights in their compressed low-bit format.

I recommend checking out the Optimize Your AI - Quantization Explained video to make it even clearer.

Pros and cons

Quantization offers compelling benefits, but we also need to understand the trade-offs involved.

Advantages of quantized models

  1. Drastically reduced memory footprint: This is the most immediate benefit. Quantized models require significantly less RAM or VRAM.

    Example 1: A Llama2-70B model shrinks from ~138 GB (FP16) to ~40 GB (INT4).

    Example 2: Google's Gemma 3 27B model drops from ~54 GB (BF16) to ~14 GB (INT4). This allows larger, more capable models to fit on hardware that couldn't previously handle them.

  2. Faster inference speed: Processing fewer bits per number and reducing the amount of data transferred from memory (memory bandwidth) leads to quicker computations and lower latency. Useful for real-time applications.

  3. Run on cheaper & more accessible hardware: Quantization enables powerful LLMs to run on consumer GPUs, standard CPUs, laptops, and even mobile or edge devices, broadening the reach of AI applications. INT8 and INT4 models can often run effectively on CPUs where FP16/FP32 versions would be impractical.

  4. Lower power consumption and cost savings: Reduced computational load means less energy consumed, crucial for battery-powered devices and reducing operational costs in cloud deployments (potentially using fewer or smaller instances).

  5. Scalability and deployment flexibility: Smaller models are easier to deploy and scale across diverse environments and hardware platforms, including older systems that might even lack full floating-point support.

  6. Minimal quality loss (with good techniques): While there's a potential for accuracy reduction, modern quantization methods (covered later in this guide) often preserve model performance remarkably well, frequently retaining over 95% of the original model's accuracy on benchmark tasks.

Disadvantages and trade-offs

  1. Potential accuracy loss: This is the primary trade-off. Reducing precision introduces approximation errors, which can lead to a slight degradation in performance. The impact depends heavily on the quantization aggressiveness (INT4 is riskier than INT8), the technique used, the model architecture, and the specific task.
  2. Sensitivity to model and task: Some models, particularly those with extreme weight values or outliers, might suffer more from quantization. Tasks requiring high numerical precision (like complex math or logical reasoning) can also be more sensitive.
  3. Limited fine-tuning capability: Once weights are quantized to very low precision (especially INT4), directly fine-tuning them becomes difficult because the weights can only change in large, discrete steps. Quantized models are thus typically used for inference only. If fine-tuning is needed, you usually revert to a higher-precision version or use techniques like LoRA adapters on top of the frozen quantized base model (e.g., QLoRA). I cover fine-tuning in a dedicated article.
  4. Hardware and software dependencies: Achieving the speed benefits of quantization often requires specialized software libraries (like bitsandbytes, llama.cpp, AutoGPTQ) or hardware with optimized low-precision compute kernels (common in newer GPUs). Without these, low-bit operations might be emulated inefficiently, potentially slowing things down. Framework support also varies.
  5. Compatibility and complexity: Quantized models sometimes come in specific file formats (like GGUF for CPU use) that might require different loading mechanisms or tools compared to standard PyTorch or TensorFlow models, adding a layer of complexity to the development workflow.

Despite these considerations, the cost-benefit analysis for quantization is overwhelmingly positive in most scenarios. The efficiency gains are often massive, while the accuracy impact can be minimized with careful technique selection.

Two paths to quantization: PTQ vs. QAT

There are two primary strategies for quantizing an LLM:

Post-training quantization (PTQ)

  • What it is: Quantization is applied after the model has been fully trained. You take a pre-trained model (usually FP16 or FP32) and convert its weights (and sometimes activations) to a lower precision format like INT8 or INT4.
  • Process: It's typically a one-shot conversion process. Some PTQ methods use a small "calibration" dataset to analyze the typical range of activations and determine optimal scaling factors, while others might not require any data.
  • Pros:
    • Simple and fast: Doesn't require retraining the model.
    • Cost-effective: No need for extensive training data or compute resources for retraining.
  • Cons:
    • Potential for higher accuracy loss: Since the model wasn't trained with quantization in mind, the direct precision reduction can sometimes impact performance more significantly, especially with very low bit-widths (like INT4).

Quantization-aware training (QAT)

  • What it is: Quantization is incorporated during the model training or fine-tuning process. The training loop simulates the effects of quantization (often called "fake quantization"; see the sketch after this list).
  • Process: This allows the model to learn and adapt its weights to compensate for the noise and limitations introduced by low-precision representation. It doesn't usually mean training from scratch; often, a pre-trained model is fine-tuned for a few epochs with QAT enabled to "repair" quantization damage.
  • Pros:
    • Higher accuracy: Generally yields better performance than PTQ, especially for very low bit quantization (INT4 or below), as the model learns to be robust to precision loss. Studies show QAT can recover a large portion (e.g., ~96%) of the accuracy lost by naive PTQ. Google reported QAT halved the performance drop for Gemma 3 INT4 compared to PTQ.
  • Cons:
    • More complex and costly: Requires a full training/fine-tuning loop, demanding significant computational resources, time, and representative training data.
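
To show what "fake quantization" means in practice, here is a minimal PyTorch sketch of a straight-through estimator: the forward pass rounds values as real quantization would, while the backward pass lets gradients through untouched so the model can keep learning. It's a toy illustration, not how any specific QAT framework implements it.

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Simulate INT8 rounding in the forward pass, but let gradients flow
    through unchanged (straight-through estimator) so training still works."""

    @staticmethod
    def forward(ctx, x):
        scale = x.detach().abs().max() / 127.0 + 1e-8   # symmetric per-tensor scale
        q = torch.clamp(torch.round(x / scale), -127, 127)
        return q * scale                                # dequantize immediately

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                              # pretend rounding is the identity

# A weight tensor passed through the fake-quant op during fine-tuning learns to
# tolerate the rounding noise it will see after real quantization.
w = torch.randn(16, 16, requires_grad=True)
loss = FakeQuantSTE.apply(w).sum()
loss.backward()
print(w.grad.shape)  # gradients still reach the full-precision weights
```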

Which to choose?

  • For developers primarily using pre-trained models and seeking efficiency gains quickly, PTQ (or downloading a model already quantized via PTQ) is often sufficient and much easier.
  • For developers training or fine-tuning models who need the absolute best possible accuracy in a quantized format, QAT is the preferred, albeit more involved, approach.

Quantization techniques and formats

The world of LLM quantization is filled with acronyms and specific techniques. Here's a breakdown of the most common ones you'll encounter when working with quantized models on platforms like Hugging Face.

Common precision levels and formats:

  • FP16 / BF16: 16-bit floating-point formats. Often the "standard" precision for efficient LLM inference before further quantization. BF16 (BFloat16) offers a wider dynamic range than standard FP16, making it more stable for training, and is often used as the high-quality baseline for comparison.
  • INT8 (8-bit Integer): Quantizes weights to 8-bit integers. Offers a good balance, providing significant memory reduction (~4x vs FP32) with typically negligible impact on accuracy. Widely supported.
  • INT4 (4-bit Integer): Quantizes weights to 4-bit integers. Provides maximum compression (~8x vs FP32) but carries a higher risk of accuracy loss if not done carefully. Requires specialized techniques and libraries for good results.

Popular quantization algorithms and techniques:

  • GPTQ (Generative Pre-trained Transformer Quantization): An advanced PTQ method, typically used for 4-bit quantization. It quantizes layer by layer, adjusting remaining weights to compensate for errors. Known for achieving good accuracy with PTQ, especially on GPUs. Often uses "grouped quantization" (sharing scaling factors for blocks of weights).
  • AWQ (Activation-Aware Weight Quantization): Another sophisticated 4-bit PTQ method. It identifies and "protects" the most important weights (those corresponding to large activation values) during quantization, often yielding great accuracy, particularly for instruction-following models. Optimized for GPU inference.
  • QLoRA (Quantized Low-Rank Adaptation): An efficient fine-tuning technique, not just a quantization format. It freezes the base LLM weights in 4-bit precision (using the special NF4 data type) and trains only small, low-rank adapter (LoRA) matrices. This allows fine-tuning massive models on modest hardware (like a single GPU) while maintaining high performance (see the setup sketch after this list).
  • GGML / GGUF (GPT-Generated Unified Format): These are file formats and associated quantization methods primarily designed for running LLMs efficiently on CPUs (including Apple Silicon), though GGUF supports GPU offloading. GGUF is the successor to GGML, offering more flexibility and metadata storage. Files in this format contain pre-quantized weights (often INT4, INT8 variants) ready for use with libraries like llama.cpp.
  • NF4 (NormalFloat 4-bit): A specialized 4-bit data type introduced with QLoRA. It uses a non-uniform distribution of values optimized for the typical bell-curve (normal) distribution of weights in neural networks, aiming for better information preservation than standard 4-bit floats or integers.
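
For context, this is roughly what a QLoRA setup looks like with Hugging Face transformers, bitsandbytes, and peft. Treat it as a sketch: the model ID, adapter rank, and target modules are assumptions you would adapt to your own project and hardware.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # the NF4 data type from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # precision used for the actual math
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",           # assumption: swap in your own base model
    quantization_config=bnb_config,
    device_map="auto",
)

model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # which layers get adapters is a choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # frozen 4-bit base + trainable adapters
model.print_trainable_parameters()
```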

Other related terms:

  • QAT (Quantization-Aware Training): As described earlier, indicates the model was trained/fine-tuned with quantization simulation, leading to highly accurate quantized models.
  • PRILoRA (Pruned and Rank-Increasing LoRA): An optimization technique for LoRA fine-tuning (often used alongside quantization like QLoRA). It uses varying adapter ranks and prunes weights for better parameter efficiency. Not a quantization method itself.

Finding and using quantized models

The good news is that you often don't need to perform quantization yourself. The AI community, including model creators and dedicated contributors, actively quantizes popular open-source LLMs and shares them online.

Where to find pre-quantized models:

  1. Hugging Face Hub: This is the primary repository. You can find thousands of pre-trained models, including numerous quantized versions.
  2. GitHub: Many repositories host quantization code, tools, and sometimes links to quantized model weights (e.g., the llm-awq repo).
  3. Community Hubs: Forums and Discord channels related to specific models or tools (like llama.cpp) often share quantized model files.

How to find them on Hugging Face:

  • Search: Use keywords like the model name plus the desired precision or method (e.g., "Llama-3 8B GGUF", "Mistral 7B GPTQ", "Gemma 4bit"). You can also search programmatically, as sketched after this list.
  • Use collections and tags: Model creators (like Google for Gemma 3 QAT) or the community might curate collections of quantized models. Look for tags like quantized, int8, gguf.
  • Check model cards: The documentation (README or model card) for a base model often links to official or community-provided quantized variants.
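
If you prefer to search from code, the huggingface_hub client can do the same thing. The query below is just an example; adjust the search string and limit to the model you are after.

```python
from huggingface_hub import HfApi

api = HfApi()
# List the five most-downloaded models matching the search term.
for model in api.list_models(search="Llama-3 8B GGUF", sort="downloads",
                             direction=-1, limit=5):
    print(model.id)
```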

Interpreting model names

As you may have noticed, suffixes are key to understanding what a model actually is. For example, the Hugging Face model ID google/gemma-3-12b-it-qat-q4_0-gguf tells you:

  • google: The creator/organization.
  • gemma-3-12b: The base model (Gemma 3, 12 Billion parameters).
  • it: Instruction-tuned version.
  • qat: Quantization-aware trained.
  • q4_0: A specific 4-bit quantization recipe within GGUF.
  • gguf: The file format, suitable for llama.cpp.

Virtually every popular open-source LLM has readily available quantized versions:

  • Llama (Meta): Llama 2 and Llama 3 models (7B, 8B, 13B, 70B) are widely available in GGUF, GPTQ, and AWQ formats.
  • Gemma (Google): Google released official QAT 4-bit versions of Gemma 3, alongside community GGUF/GPTQ versions.
  • Mistral / Mixtral (Mistral AI): High-performance models like Mistral 7B are heavily quantized (GGUF, GPTQ, AWQ) allowing them to run even on low-resource devices.
  • Qwen (Alibaba): These models, including multi-modal variants, have official and community quantized versions (e.g., AWQ, GGUF).
  • DeepSeek (DeepSeek AI): Strong coding/reasoning models available in GGUF, GPTQ, AWQ formats.

Illustrative size reductions (typical INT4)

| Model name | Original size (FP16 approx.) | Quantized size (INT4 approx.) | Common quantization formats |
| --- | --- | --- | --- |
| Llama 3 8B | ~16 GB | ~4-5 GB | GGUF, GPTQ, AWQ |
| Gemma 3 7B | ~14 GB | ~3-4 GB | GGUF, GPTQ, QAT (GGUF) |
| DeepSeek 7B | ~14 GB | ~3-4 GB | GGUF, GPTQ, AWQ |
| Qwen 7B | ~14 GB | ~3-4 GB | GGUF, GPTQ |
| Mistral 7B | ~14 GB | ~3-4 GB | GGUF, GPTQ, AWQ |
| Llama 3 70B | ~140 GB | ~35-45 GB | GGUF, GPTQ, AWQ |
| Qwen 72B | ~144 GB | ~36-46 GB | GGUF, GPTQ |

Note: Actual sizes vary based on the specific quantization method and metadata.

Practical guidance for developers

We've talked a lot about theory, but choosing (and potentially creating) quantized models involves a number of practical considerations.

Choosing the right quantized model:

  1. Hardware constraints: Start here. How much VRAM (GPU) or RAM (CPU) do you have? This dictates the maximum quantized model size you can run. A 13B model might need ~26 GB in FP16, but only ~6.5 GB in INT4.
  2. Capability vs. precision trade-off: Often, a larger model quantized to a lower precision (e.g., 30B INT4) will outperform a smaller model at higher precision (e.g., 7B FP16) while potentially using similar resources. Aim for the most capable model your hardware can handle after quantization.
  3. Task sensitivity: Is your task highly sensitive to numerical precision (complex math, subtle reasoning)? If so, consider INT8 over INT4, or look for models specifically quantized with high-accuracy methods like AWQ or QAT. For general chat or text generation, INT4 is often sufficient.
  4. Prefer QAT models: If available (like Google's Gemma 3 QAT), these models are specifically trained for low precision and likely offer the best accuracy post-quantization.
  5. Check community feedback: Look at discussions on the model's Hugging Face page or forums (like Reddit's r/LocalLLaMA). See if others report issues or successes with specific quantized versions.
  6. Experiment: If unsure, download and test INT8 and INT4 versions. The smaller file sizes make experimentation feasible.

Quantizing models yourself (if necessary)

While using pre-quantized models is easiest, you might need to quantize a custom fine-tuned model or a model not yet available in your desired format. Fortunately, user-friendly tools exist:

  1. Hugging Face Transformers + bitsandbytes: The simplest way for quick PTQ inference. Just add a parameter when loading. This performs on-the-fly quantization (using methods like NF4 for 4-bit) without needing separate quantization steps. Ideal for quick tests and GPU inference (see the sketch after this list).
  2. GPTQModel: For creating persistent GPTQ-quantized model files (usually 4-bit). Requires a small script but provides high-quality PTQ results suitable for GPU inference.
  3. AutoAWQ / Transformers AWQ Integration: Tools and library support for applying AWQ quantization (usually 4-bit). May require a small calibration dataset but offers good accuracy, especially for instruction-tuned models on GPUs.
  4. llama.cpp: Includes command-line tools to convert models (usually from PyTorch FP16 format) into the GGUF format with various quantization levels (Q4, Q5, Q8, etc.). The primary tool for creating CPU-optimized quantized models.
  5. Other Libraries:
    • Hugging Face Optimum: Extends Transformers with optimized pipelines, supporting various backends and quantization methods (PTQ, QAT).
    • vLLM: A high-throughput inference server that supports various quantization backends, including bitsandbytes.
    • Quanto: A newer Hugging Face library aiming to simplify the quantization workflow within the ecosystem.
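
As a concrete example of option 1, this is roughly what on-the-fly 4-bit loading with bitsandbytes looks like. The model ID and prompt are placeholders, and you'll need a GPU plus the bitsandbytes and accelerate packages installed.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"   # assumption: any causal LM on the Hub

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True,
                                           bnb_4bit_quant_type="nf4"),  # on-the-fly 4-bit
    device_map="auto",          # let accelerate place layers on the available GPU(s)
)

inputs = tokenizer("Explain quantization in one sentence.",
                   return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```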

Summary of common quantization tools

| Tool/Library | Supported Methods (Examples) | Primary Use Case | Key Features |
| --- | --- | --- | --- |
| HF Transformers + bitsandbytes | INT8, INT4 (NF4/FP4) PTQ | Easy on-the-fly quantization for inference | Simple load_in_Xbit flag, good integration |
| AutoGPTQ | GPTQ (INT4, INT3, etc.) PTQ | Creating persistent GPTQ models for GPU | High-accuracy PTQ, requires separate quantization step |
| AutoAWQ / HF AWQ | AWQ (INT4) PTQ | Creating persistent AWQ models for GPU | Activation-aware, high accuracy, calibration needed |
| llama.cpp | GGML/GGUF (various Q levels) PTQ | Creating GGUF models for CPU/Apple Silicon | Efficient CPU inference, versatile quant options |
| HF Optimum | Various PTQ, QAT, AWQ, GPTQ | Optimized inference across backends | Performance enhancements, hardware acceleration |
| vLLM | bitsandbytes, AWQ, GPTQ, etc. | High-throughput LLM serving | Fast inference engine, supports multiple quant types |
| Quanto | Basic PTQ | Streamlined quantization in HF ecosystem | Simplified API for common quantization tasks |

How hard is it for a developer?

  • Using pre-quantized models: Generally very easy. It often involves just pointing your code to the quantized model ID, perhaps installing an extra library (like optimum, auto-gptq, or llama-cpp-python), and using the correct loading function/flags shown in the model card examples.
  • Performing basic PTQ (e.g., load_in_4bit): Extremely easy. It's often just one extra parameter during model loading.
  • Using tools like AutoGPTQ/AutoAWQ/llama.cpp: Moderately easy. Requires running a script or command-line tool, following tutorials. You don't need deep ML knowledge but need to follow instructions carefully.
  • Performing QAT: More difficult. This requires understanding model training pipelines, managing datasets, and significant compute resources. It's closer to ML engineering than typical software development.

Leveraging quantization, especially using pre-quantized models or simple PTQ tools, is well within the reach of software developers without deep ML expertise. The ecosystem has matured enough to make efficiency accessible.

Running locally

One of the most exciting aspects of quantization is that it allows you to run powerful LLMs locally on your own hardware. One of the most popular tools for this is Ollama, which provides a simple command-line interface for downloading and running LLMs.

Luckily for us, the Ollama library already hosts plenty of quantized versions of models like llama3.3 and phi3.
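
Once a model has been pulled (e.g., with ollama pull llama3.2), you can also call it from code. Below is a minimal sketch using the ollama Python client; the model tag and prompt are just examples, and it assumes the Ollama server is running locally.

```python
import ollama  # assumes `pip install ollama` and a running Ollama server

# The model tag must match something you've already pulled, e.g. `ollama pull llama3.2`.
response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Summarize what INT4 quantization does."}],
)
print(response["message"]["content"])
```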

You can check out the Comparing Quantizations of the Same Model video to see it in action.

Conclusion

Quantization is no longer an obscure research topic; it's a technique for practical, real-world deployment of Large Language Models. It directly addresses the critical challenge of LLM size and resource consumption, acting as a key enabler for bringing powerful AI capabilities to a wider range of applications and hardware.

The ability to run capable LLMs on standard hardware opens up new possibilities for innovation. Quantization empowers you to build more responsive, cost-effective, and widely deployable AI-powered features and applications. The field continues to evolve rapidly, but the principles and tools discussed here provide a solid foundation.

Don't let the size of LLMs be a barrier. Embrace quantization, experiment with these efficient models, and unlock the potential to build smarter, leaner, and more accessible AI solutions. Happy quantizing!