AWQ Backend

GPU-based INT4 quantization using llm-compressor. Requires an NVIDIA GPU with CUDA.

Install:

pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 \
    --index-url https://download.pytorch.org/whl/cu121
pip install qwodel[awq]
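
After installing, a quick sanity check that the CUDA build of PyTorch is active (standard PyTorch APIs, nothing Qwodel-specific):

import torch

print(torch.__version__)          # expect a +cu121 build string
print(torch.cuda.is_available())  # expect True on a working CUDA setup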

Supported Formats

| Format | Description |
|--------|-------------|
| int4   | 4-bit weight quantization (W4A16). Best for GPU inference. |

Parameters

Quantizer(...) — Initialization

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| calibration_dataset | str | "wikitext:wikitext-2-raw-v1" | Calibration dataset. Supports HF IDs, repo:subset syntax, and local .json/.jsonl/.txt files. |
| calibration_split | str | "train" | Dataset split to use. |
| token | str | None | HuggingFace API token for gated/private models. |
| batch_size | int | Auto | Calibration batch size. Auto-selected based on available VRAM. |
| seq_length | int | Auto | Max sequence length for calibration. Auto-selected based on VRAM. |
| num_samples | int | Auto | Number of calibration samples. Auto-selected based on VRAM. |
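
For example, calibration_dataset accepts all three source forms described above (a sketch; the second dataset ID and the file path are illustrative placeholders):

from qwodel import Quantizer

common = dict(backend="awq", model_path="./gemma-2", output_dir="./output")

# HF dataset with a subset, via repo:subset syntax (the default):
Quantizer(**common, calibration_dataset="wikitext:wikitext-2-raw-v1")

# A plain HF dataset ID:
Quantizer(**common, calibration_dataset="allenai/c4")

# A local file (.json, .jsonl, or .txt):
Quantizer(**common, calibration_dataset="./calibration.jsonl")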

quantize(...) — Runtime overrides

| Parameter | Type | Description |
|-----------|------|-------------|
| batch_size | int | Override batch size. |
| seq_len | int | Override sequence length. |
| num_samples | int | Override number of calibration samples. |
| ignore | List[str] | Modules to skip. Supports exact names and re: regex patterns (e.g. ["lm_head", "re:.*vision_tower.*"]). |
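
For instance, to quantize with a smaller batch while keeping the output head and a vision encoder in full precision (a sketch; quantizer is a Quantizer instance as in the Example below, and the module names follow the regex example above):

output = quantizer.quantize(
    format="int4",
    batch_size=2,                                # override the auto-selected value
    ignore=["lm_head", "re:.*vision_tower.*"],   # exact name + re: regex pattern
)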

VRAM Auto-Config

When batch_size, seq_length, and num_samples are not set, they are automatically chosen based on available GPU VRAM headroom:

| VRAM Headroom | batch_size | seq_len | num_samples |
|---------------|------------|---------|-------------|
| < 4 GB        | 1          | 2048    | 32          |
| 4–8 GB        | 2          | 4096    | 64          |
| 8–16 GB       | 4          | 4096    | 128         |
| 16–24 GB      | 8          | 8192    | 128         |
| > 24 GB       | 16         | 8192    | 256         |
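
To see which row applies to your GPU, you can check free VRAM with PyTorch (a sketch; how Qwodel measures headroom internally may differ):

import torch

# Free and total memory on device 0, in bytes.
free, total = torch.cuda.mem_get_info(0)
print(f"Headroom: {free / 1024**3:.1f} GB free of {total / 1024**3:.1f} GB")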

Example

from qwodel import Quantizer

quantizer = Quantizer(
    backend="awq",
    model_path="./gemma-2",
    output_dir="./output",
    calibration_dataset="wikitext:wikitext-2-raw-v1",
    token="hf_..."           # only needed for gated models
)
output = quantizer.quantize(format="int4")
print(f"Output: {output}")

CLI:

qwodel quantize ./gemma-2 --backend awq --format int4 --output ./output

After Quantization

Your output is a directory of safetensors files that can be loaded directly with vLLM.
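
A minimal loading sketch, assuming vLLM is installed and the output directory from the example above (the prompt is illustrative):

from vllm import LLM, SamplingParams

# vLLM picks up the quantization scheme from the checkpoint's config.
llm = LLM(model="./output")
result = llm.generate(["Quantization works by"], SamplingParams(max_tokens=32))
print(result[0].outputs[0].text)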