Advanced LLM Compression: A Hands-on Implementation Guide for FP8, GPTQ, and SmoothQuant using llmcompressor

May 19, 2026 • guides

Deploying state-of-the-art Large Language Models (LLMs) into production runs into a critical infrastructure bottleneck: memory capacity and bandwidth. While high-parameter models deliver impressive reasoning, their raw weights (typically stored in FP16 or BF16) demand large GPU VRAM allocations and must be streamed from memory for every generated token. This leads to slow per-token decoding, restricted batch sizes, and soaring hosting costs.

One of the most effective tools for shrinking this footprint is Post-Training Quantization (PTQ). By reducing the numerical precision of model weights and, in some schemes, activations, you can shrink checkpoint sizes, cut memory traffic, and boost throughput on hardware that has kernels for the lower-precision format.
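To make the footprint argument concrete, the arithmetic below estimates raw weight storage at different precisions. It is a back-of-the-envelope sketch: the function name and the 7-billion-parameter count are illustrative assumptions, and the estimate ignores quantization scales, layers kept in higher precision, and runtime buffers such as the KV cache.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Raw weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

# Hypothetical 7B-parameter model, chosen only for illustration.
NUM_PARAMS = 7e9
footprints = {bits: weight_footprint_gb(NUM_PARAMS, bits) for bits in (16, 8, 4)}

for bits, gb in footprints.items():
    print(f"{bits:>2}-bit weights: {gb:.1f} GB")
```

Halving the bit width halves the bytes that have to be stored and streamed, which is why weight-only schemes help most when decoding is limited by memory traffic.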

In this systems guide, we build a small quantization and benchmarking lab from scratch. Using the llmcompressor framework, we compress an instruction-tuned model with three quantization recipes: FP8 Dynamic Quantization, GPTQ W4A16, and SmoothQuant + GPTQ W8A8. We then benchmark all variants for disk size, latency, perplexity, generation speed (tokens/sec), and output quality on a sample prompt.

The Quantization Landscape: Strategic Trade-offs

Choosing a quantization recipe is a balancing act between compute efficiency, dataset calibration costs, and numerical precision recovery.

Quantization Method	Weight / Activation Precision	Calibration Data Required?	Key Advantage	Bottleneck Addressed
FP16 Baseline	16-bit / 16-bit	No	No quantization error (reference)	None (standard baseline)
FP8 Dynamic	8-bit / 8-bit (dynamic activation scaling)	No (data-free)	One-shot quantization, no data needed	Reduces weight memory; activation scales computed at runtime
GPTQ W4A16	4-bit / 16-bit	Yes (UltraChat SFT)	Largest size reduction (~75% of quantized weights)	Relieves memory bandwidth limits
SmoothQuant W8A8	8-bit / 8-bit (co-scaled)	Yes (UltraChat SFT)	Tames activation outliers	Speeds up compute and reduces bandwidth together

Understanding the Architecture: Quantization Mechanics

Before executing the code, we must analyze the structural mechanics of our three target quantization strategies:

1. FP8 Dynamic Quantization

Floating Point 8 (FP8) quantization represents weights and activations in an 8-bit floating-point format (E4M3 or E5M2). In the FP8_DYNAMIC scheme, weight scales are computed once at quantization time from the weights themselves, while activation scales are computed on the fly during the forward pass instead of being calibrated offline. The method is therefore data-free: it requires no calibration dataset and offers a fast, plug-and-play compression path that roughly halves the size of the quantized layers relative to 16-bit weights.
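The dynamic part can be sketched in plain Python: for each token, derive a scale from that token's own largest activation, then round the scaled values onto the E4M3 grid. This is a simplified scalar simulation for intuition only. The helper names are hypothetical, the rounding routine is an approximation of the format written for this example (1 implicit plus 3 mantissa bits, largest finite value 448), and real kernels do this in hardware on FP8-capable GPUs.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in the E4M3 format

def round_to_e4m3(v: float) -> float:
    """Round v to the nearest E4M3-representable value (simplified simulation)."""
    if v == 0.0:
        return 0.0
    sign = -1.0 if v < 0 else 1.0
    a = min(abs(v), FP8_E4M3_MAX)
    _, exp = math.frexp(a)        # a = m * 2**exp with 0.5 <= m < 1
    exp = max(exp, -5)            # below 2**-6 the grid uses the fixed subnormal step
    step = 2.0 ** (exp - 4)       # 4 significant bits -> 8 steps inside each binade
    return sign * min(round(a / step) * step, FP8_E4M3_MAX)

def quantize_fp8_dynamic(row):
    """Quantize one token's activations with a scale derived from that row alone."""
    amax = max(abs(x) for x in row)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    codes = [round_to_e4m3(x / scale) for x in row]  # values that would be stored in FP8
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

activations = [0.3, 1.7, -4.0]   # toy activation row
codes, scale = quantize_fp8_dynamic(activations)
restored = dequantize(codes, scale)
print(codes)
print([round(r, 4) for r in restored])
```

Because the scale is recomputed from each row at runtime, no calibration pass is needed; the cost is a small amount of extra work in every forward pass.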

2. GPTQ (Post-Training Quantization for Generative Pre-trained Transformers)

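GPTQ is an approximate second-order optimization method. It quantizes the weights of each layer one column at a time and compensates for the resulting quantization error by adjusting the weights that are not yet quantized, using the inverse Hessian of the layer's reconstruction error (estimated from calibration activations).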
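The error-compensation idea can be seen in a two-weight toy case. The sketch below is not llmcompressor's implementation (production GPTQ processes blocks of columns and uses a Cholesky factorization); the Hessian values, the weights, and the integer quantization grid are all assumptions picked to keep the arithmetic readable.

```python
def quad_loss(err, H):
    """Layer-output error err^T H err for a weight error vector err."""
    n = len(err)
    return sum(err[i] * H[i][j] * err[j] for i in range(n) for j in range(n))

def inverse_2x2(H):
    (a, b), (c, d) = H
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Toy Hessian H = X X^T for two strongly correlated input channels (assumed values).
H = [[1.0, 0.9], [0.9, 1.0]]
w = [0.4, 0.3]   # original weights; the quantization grid is the integers

# Round-to-nearest baseline: each weight is rounded independently.
rtn = [float(round(x)) for x in w]
rtn_loss = quad_loss([w[i] - rtn[i] for i in range(2)], H)

# GPTQ-style step: quantize w[0], then shift w[1] to absorb the error it caused.
Hinv = inverse_2x2(H)
q0 = float(round(w[0]))
err0 = (w[0] - q0) / Hinv[0][0]
w1_updated = w[1] - err0 * Hinv[1][0]
q1 = float(round(w1_updated))
gptq = [q0, q1]
gptq_loss = quad_loss([w[i] - gptq[i] for i in range(2)], H)

print("round-to-nearest:", rtn, round(rtn_loss, 3))
print("with compensation:", gptq, round(gptq_loss, 3))
```

In this example the compensated solution rounds the second weight up instead of down, and the output error drops from about 0.466 to about 0.146. The gain comes from the correlation between the two input channels; with uncorrelated inputs the update would be zero.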

W4A16 Scheme: Compresses weights to 4-bit integers while keeping activations in 16-bit. At inference time, the kernel dequantizes the 4-bit weights to 16-bit on the fly just before the matrix multiply, so less data has to travel from GPU memory. This is highly effective for memory-bound workloads (low batch sizes, single-user inference).
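As a rough picture of what a W4A16 checkpoint stores, the sketch below quantizes weights to signed 4-bit codes with one scale per group and then dequantizes them again. It uses plain round-to-nearest rather than GPTQ's error correction, a group size of 4 instead of the 128 suggested by the output directory name used later in the script, and a scale that maps the largest magnitude to 7; all of these are simplifying assumptions, and the function names are hypothetical.

```python
def quantize_group_int4(weights, group_size):
    """Symmetric round-to-nearest INT4 quantization with one scale per group."""
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        amax = max(abs(w) for w in group)
        scale = amax / 7 if amax > 0 else 1.0   # map the largest magnitude to +/-7
        scales.append(scale)
        # Clamp to the signed 4-bit range [-8, 7].
        codes.extend(max(-8, min(7, round(w / scale))) for w in group)
    return codes, scales

def dequantize_group_int4(codes, scales, group_size):
    """Rebuild floating-point weights from the 4-bit codes and group scales."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]

weights = [1.4, -0.6, 0.0, 0.2, 0.07, -0.02, 0.03, 0.01]   # toy values
codes, scales = quantize_group_int4(weights, group_size=4)
restored = dequantize_group_int4(codes, scales, group_size=4)
print(codes)
print([round(r, 3) for r in restored])
```

Giving each group its own scale is what lets the small-magnitude second group keep its resolution even though the first group contains much larger values.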

3. SmoothQuant

Standard 8-bit quantization (W8A8) frequently degrades model quality because of activation outliers: specific channels in LLM activations whose values can be up to ~100x larger than the rest. SmoothQuant addresses this by introducing a per-channel smoothing vector $s$:

$$Y = (X \cdot \mathrm{diag}(s)^{-1}) \cdot (\mathrm{diag}(s) \cdot W)$$

This co-scaling divides each activation channel by its factor $s_j$ and folds the inverse scaling into the corresponding weight rows, so the product $Y$ is mathematically unchanged. The outliers are flattened on the activation side, which keeps the 8-bit activation grid from being dominated by a handful of extreme channels.
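A small numeric check makes the equivalence tangible. The sketch below uses the convention that activations X have shape (tokens x channels) and multiply weights W of shape (channels x outputs), and it picks the per-channel factors with the SmoothQuant rule $s_j = \max|X_j|^{\alpha} / \max|W_j|^{1-\alpha}$. The matrices, the value of alpha, and the helper names are illustrative assumptions.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def smoothing_scales(X, W, alpha):
    """s_j = max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha) per input channel j."""
    scales = []
    for j in range(len(W)):
        act_max = max(abs(row[j]) for row in X)
        w_max = max(abs(v) for v in W[j])
        scales.append(act_max ** alpha / w_max ** (1 - alpha))
    return scales

# Toy data: channel 1 of the activations is an outlier channel.
X = [[1.0, 100.0], [2.0, -80.0]]
W = [[0.5, -0.2], [0.01, 0.02]]

s = smoothing_scales(X, W, alpha=0.5)
X_smooth = [[x / s[j] for j, x in enumerate(row)] for row in X]   # activations divided by s
W_smooth = [[w * s[j] for w in W[j]] for j in range(len(W))]      # weight rows multiplied by s

Y = matmul(X, W)
Y_smooth = matmul(X_smooth, W_smooth)

print("max |X| before:", max(abs(x) for row in X for x in row))
print("max |X| after: ", round(max(abs(x) for row in X_smooth for x in row), 3))
print("outputs match: ", all(abs(a - b) < 1e-9 for ra, rb in zip(Y, Y_smooth) for a, b in zip(ra, rb)))
```

The output is unchanged while the largest activation shrinks from 100 to about 1.4, at the price of larger values in the second weight row. A higher alpha moves more of that burden onto the weights.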

Step-by-Step Benchmarking & Compilation Code

Here is the complete implementation pipeline. You can run it in a Colab notebook with a single T4 or A10G GPU. The script installs the required packages, prepares the calibration dataset, runs the three quantization passes, and prints a benchmark summary table. Note that neither the T4 nor the A10G has native FP8 tensor cores, so the speed numbers for the FP8 variant on these GPUs say little about what FP8-capable hardware would deliver.

import subprocess, sys
def pip(*pkgs):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", *pkgs])
pip("llmcompressor", "compressed-tensors", "transformers>=4.45", "accelerate", "datasets")

import os, gc, time, json, math
from pathlib import Path
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

assert torch.cuda.is_available(), "Enable a GPU: Runtime > Change runtime type > T4 GPU"
print("GPU:", torch.cuda.get_device_name(0), "| CUDA:", torch.version.cuda, "| torch:", torch.__version__)

MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"
WORKDIR = Path("/content/quant_lab")
WORKDIR.mkdir(exist_ok=True)
os.chdir(WORKDIR)

def free_mem():
    gc.collect()
    torch.cuda.empty_cache()

def dir_size_gb(path):
    total = 0
    for root, _, files in os.walk(path):
        for f in files:
            total += os.path.getsize(os.path.join(root, f))
    return total / 1e9

def time_generation(model, tok, prompt, max_new_tokens=64):
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    # Warmup
    _ = model.generate(**inputs, max_new_tokens=4, do_sample=False)
    torch.cuda.synchronize()
    t0 = time.time()
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        pad_token_id=tok.eos_token_id
    )
    torch.cuda.synchronize()
    dt = time.time() - t0
    new_ids = out[0][inputs["input_ids"].shape[1]:]
    # Count the tokens actually generated: greedy decoding can stop early at EOS.
    return tok.decode(new_ids, skip_special_tokens=True), dt, len(new_ids) / dt

@torch.no_grad()
def wikitext_ppl(model, tok, seq_len=512, max_chunks=20, stride=512):
    """Light WikiText-2 perplexity probe (fast, indicative)."""
    ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    text = "\n\n".join(t for t in ds["text"][:400] if t.strip())
    enc = tok(text, return_tensors="pt").input_ids.to(model.device)
    nll_sum, tok_count = 0.0, 0
    for begin in range(0, enc.size(1) - seq_len, stride):
        chunk = enc[:, begin:begin+seq_len]
        out = model(chunk, labels=chunk)
        nll_sum += out.loss.float().item() * seq_len
        tok_count += seq_len
        if tok_count // seq_len >= max_chunks: break
    return math.exp(nll_sum / tok_count)

results = {}
PROMPT = (
    "<|im_start|>user\nIn two sentences, explain why post-training "
    "quantization works for large language models.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

def benchmark(label, model_path_or_id):
    free_mem()
    print(f"\n──── benchmarking: {label} ────")
    tok = AutoTokenizer.from_pretrained(model_path_or_id)
    m = AutoModelForCausalLM.from_pretrained(
        model_path_or_id, torch_dtype="auto", device_map="cuda"
    ).eval()
    sample, dt, tps = time_generation(m, tok, PROMPT)
    ppl = wikitext_ppl(m, tok)
    size = dir_size_gb(model_path_or_id) if os.path.isdir(str(model_path_or_id)) else None
    results[label] = {
        "size_gb": size,
        "ppl": round(ppl, 3),
        "latency_s": round(dt, 3),
        "tok_per_s": round(tps, 1),
        "sample": sample.strip().replace("\n", " ")[:180]
    }
    print(json.dumps(results[label], indent=2))
    del m; free_mem()

print("\n════════════ Baseline (FP16) ════════════")
benchmark("00_fp16_baseline", MODEL_ID)

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

print("\n════════════ Recipe 1: FP8_DYNAMIC ════════════")
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tok = AutoTokenizer.from_pretrained(MODEL_ID)

recipe_fp8 = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)
oneshot(model=model, recipe=recipe_fp8)
FP8_DIR = "Qwen2.5-0.5B-FP8-Dynamic"
model.save_pretrained(FP8_DIR, save_compressed=True)
tok.save_pretrained(FP8_DIR)
del model; free_mem()
benchmark("01_fp8_dynamic", FP8_DIR)

# Calibration dataset preparation
NUM_CALIB_SAMPLES = 256
MAX_SEQ_LEN = 1024
tok = AutoTokenizer.from_pretrained(MODEL_ID)
raw = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_CALIB_SAMPLES}]")

def to_text(ex):
    return {"text": tok.apply_chat_template(ex["messages"], tokenize=False)}

def tokenize(ex):
    return tok(ex["text"], padding=False, truncation=True, max_length=MAX_SEQ_LEN, add_special_tokens=False)

calib_ds = (raw.shuffle(seed=42)
            .map(to_text)
            .map(tokenize, remove_columns=raw.column_names))
print("Calibration set:", len(calib_ds), "samples, max_seq_len =", MAX_SEQ_LEN)

from llmcompressor.modifiers.quantization import GPTQModifier
print("\n════════════ Recipe 2: GPTQ W4A16 ════════════")
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
recipe_w4a16 = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    dampening_frac=0.01,
)
oneshot(
    model=model,
    dataset=calib_ds,
    recipe=recipe_w4a16,
    max_seq_length=MAX_SEQ_LEN,
    num_calibration_samples=NUM_CALIB_SAMPLES,
)
W4A16_DIR = "Qwen2.5-0.5B-W4A16-G128"
model.save_pretrained(W4A16_DIR, save_compressed=True)
tok.save_pretrained(W4A16_DIR)
del model; free_mem()
benchmark("02_gptq_w4a16", W4A16_DIR)

from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
print("\n════════════ Recipe 3: SmoothQuant + GPTQ W8A8 ════════════")
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
recipe_w8a8 = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]
oneshot(
    model=model,
    dataset=calib_ds,
    recipe=recipe_w8a8,
    max_seq_length=MAX_SEQ_LEN,
    num_calibration_samples=NUM_CALIB_SAMPLES,
)
W8A8_DIR = "Qwen2.5-0.5B-W8A8-SmoothQuant"
model.save_pretrained(W8A8_DIR, save_compressed=True)
tok.save_pretrained(W8A8_DIR)
del model; free_mem()
benchmark("03_smoothquant_w8a8", W8A8_DIR)

print("\n══════════════════════ FINAL SUMMARY ══════════════════════")
print(f"{'Variant':<26}{'Size GB':>9}{'PPL':>10}{'tok/s':>9}{'Latency':>11}")
print("-" * 65)
for k, v in results.items():
    size = f"{v['size_gb']:.3f}" if v['size_gb'] else "  (hub) "
    print(f"{k:<26}{size:>9}{v['ppl']:>10.2f}{v['tok_per_s']:>9.1f}"
          f"{v['latency_s']:>10.2f}s")
print("\nSample completions (greedy, 64 new tokens):")
for k, v in results.items():
    print(f"\n[{k}]\n  → {v['sample']}")

Step-by-Step Deep-Dive of the Recipes

Let's walk through each phase of the script to see what llmcompressor is doing at each step:

Setting Up the Laboratory Environment

The script starts by installing its dependencies with pip and importing the key libraries. We assert that a CUDA GPU is available and use Qwen/Qwen2.5-0.5B-Instruct as our baseline instruction-tuned model.

MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"
WORKDIR = Path("/content/quant_lab")
WORKDIR.mkdir(exist_ok=True)
os.chdir(WORKDIR)

The Evaluation Harness

Evaluating compression performance requires checking both system footprint and mathematical sanity. Our script uses two primary benchmark functions:

time_generation(): Measures end-to-end latency and tokens-per-second generation speed. A short 4-token warm-up generation runs first so that one-time CUDA initialization is excluded from the measurement.
wikitext_ppl(): A lightweight probe that scores up to 20 chunks of 512 tokens from the WikiText-2 test split. The closer a compressed model's perplexity (PPL) stays to the baseline, the better it has preserved the original model's predictive quality.
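For reference, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below shows the calculation on made-up per-token probabilities; the numbers and the function name are hypothetical and do not come from any model run.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over the scored tokens."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# Hypothetical per-token log-probabilities (natural log) from two models.
baseline_lp = [math.log(0.25)] * 8
quantized_lp = [math.log(0.20)] * 8

ppl_base = perplexity(baseline_lp)
ppl_quant = perplexity(quantized_lp)
print(f"baseline PPL:  {ppl_base:.2f}")
print(f"quantized PPL: {ppl_quant:.2f}")
```

A model that assigns probability 0.25 to every correct token has a perplexity of 4, as if it were choosing uniformly among four options. Quantization that lowers those probabilities pushes the number up.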

Applying FP8_DYNAMIC Quantization

We use llmcompressor's QuantizationModifier to target all Linear layers except the output projection (lm_head), which is left in its original precision. Because the scheme needs no calibration data, oneshot() is called without a dataset:

recipe_fp8 = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)
oneshot(model=model, recipe=recipe_fp8)
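After saving, it can be worth confirming that the checkpoint actually carries quantization metadata. The helper below is a minimal sketch that assumes the scheme is recorded under a `quantization_config` key in the checkpoint's config.json, which is where Hugging Face checkpoints conventionally keep it. The helper name is hypothetical, and the fields inside that entry are not described in this guide and may vary between library versions.

```python
import json
from pathlib import Path

def read_quantization_config(checkpoint_dir):
    """Return the quantization_config dict from config.json, or None if absent."""
    config_path = Path(checkpoint_dir) / "config.json"
    if not config_path.is_file():
        return None
    with config_path.open() as fh:
        return json.load(fh).get("quantization_config")
```

Calling `read_quantization_config("Qwen2.5-0.5B-FP8-Dynamic")` after the save step should return a dictionary rather than `None` if the compressed format was written as expected.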

Calibration Dataset Pipeline

For the calibration-based recipes (GPTQ and SmoothQuant), calibration data is essential. We pull 256 conversations from HuggingFaceH4/ultrachat_200k, render them with Qwen's chat template, tokenize them, and truncate each sample to 1024 tokens:

calib_ds = (raw.shuffle(seed=42)
            .map(to_text)
            .map(tokenize, remove_columns=raw.column_names))

[!IMPORTANT] Use chat templates during calibration: Apply the same chat format (e.g. ChatML markers like <|im_start|>) that the target model saw during instruction tuning. Calibrating on unformatted raw text yields activation statistics that differ from real chat traffic, which can skew the quantization scales and degrade perplexity.
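To show what the rendered calibration text looks like, here is a simplified stand-in for the tokenizer's chat template that emits the same ChatML markers as the benchmark prompt. It is only an illustration: the function name is hypothetical, and the real `tok.apply_chat_template` may add details such as a default system message, so the tokenizer's own template is the one to use in the pipeline.

```python
def to_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts with ChatML markers."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        text += "<|im_start|>assistant\n"   # cue the model to answer next
    return text

conversation = [
    {"role": "user", "content": "What is quantization?"},
    {"role": "assistant", "content": "Storing numbers with fewer bits."},
]
formatted = to_chatml(conversation)
print(formatted)
```

Calibration samples rendered this way contain the role markers and turn boundaries the model will see at inference time, so the activation statistics gathered during calibration match deployment traffic more closely.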

GPTQ W4A16 Compression

We run the 4-bit weight recipe using GPTQModifier. Setting dampening_frac=0.01 adds a small damping term to the diagonal of the Hessian before it is inverted, which guards against numerical instability when the Hessian is ill-conditioned:

recipe_w4a16 = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    dampening_frac=0.01,
)
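The effect of damping can be illustrated on a Hessian that would otherwise be singular. The sketch follows the convention of the reference GPTQ implementation, where the damping term is a fraction of the mean of the Hessian diagonal; llmcompressor's internals may differ in detail, and the matrix and helper names here are assumptions for illustration.

```python
def dampen(H, dampening_frac):
    """Add dampening_frac * mean(diag(H)) to every diagonal entry of H."""
    n = len(H)
    damp = dampening_frac * sum(H[i][i] for i in range(n)) / n
    return [[H[i][j] + (damp if i == j else 0.0) for j in range(n)] for i in range(n)]

def det_2x2(H):
    return H[0][0] * H[1][1] - H[0][1] * H[1][0]

# Two perfectly correlated input channels give a singular Hessian (toy values).
H = [[1.0, 1.0], [1.0, 1.0]]
H_damped = dampen(H, dampening_frac=0.01)

print("determinant before:", det_2x2(H))
print("determinant after: ", round(det_2x2(H_damped), 4))
```

The undamped matrix has determinant zero and cannot be inverted, while the damped one can. Larger fractions make the inversion more stable but also blur the second-order information GPTQ relies on.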

SmoothQuant W8A8 Integration

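To tackle activation outliers, we stack two modifiers. First, SmoothQuantModifier rescales activation channels and folds the inverse scales into the weights; smoothing_strength=0.8 is the migration strength, which controls how much of the quantization difficulty is shifted from activations to weights. Second, GPTQModifier quantizes the smoothed layers to 8-bit weights and activations: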

recipe_w8a8 = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

Key Technical Takeaways

[!TIP] Production Deployment Best Practices

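Memory Bandwidth-Bound Tasks: Use GPTQ W4A16 when single-user generation speed (tokens/sec) is the primary constraint. It shrinks the quantized weights to roughly 25% of their 16-bit size (slightly more once per-group scales are counted, and layers left unquantized such as lm_head keep their original size), making it a good fit for smaller devices.

Compute-Bound Environments: Use SmoothQuant W8A8 in high-concurrency systems (large batches, multi-tenant servers). Keeping both weights and activations in 8-bit lets the matrix multiplies run on the GPU's INT8 hardware units.

Dynamic Fast-Track: Use FP8 Dynamic if you need fast deployment with no access to calibration data. It typically stays close to baseline accuracy with zero calibration time, but it needs a GPU with native FP8 support to turn the smaller format into speed.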
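The size figure quoted for W4A16 can be refined with a little arithmetic on the per-group scales. The sketch below assumes a group size of 128 (as the W4A16-G128 directory name in the script suggests), one 16-bit scale per group, and no stored zero points; these are assumptions about the storage layout, and the function names are hypothetical.

```python
def effective_bits(weight_bits, group_size, scale_bits=16):
    """Average storage per weight when each group carries one scale."""
    return weight_bits + scale_bits / group_size

def size_ratio(weight_bits, group_size, baseline_bits=16, scale_bits=16):
    """Quantized size as a fraction of the baseline precision."""
    return effective_bits(weight_bits, group_size, scale_bits) / baseline_bits

bits = effective_bits(4, 128)
ratio = size_ratio(4, 128)
print(f"effective bits per weight: {bits}")
print(f"fraction of 16-bit size:   {ratio:.1%}")
```

Under these assumptions each weight costs 4.125 bits, or about 25.8% of the 16-bit size for the quantized layers. On a small model, where the embeddings and lm_head make up a large share of the parameters, the on-disk reduction for the whole checkpoint will be noticeably smaller than that.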
