# Qwodel

Complete reference for the qwodel Python API.


## Install Full Qwodel

Note: the `all` extra installs the dependencies for every backend.

Install: `pip install qwodel[all]`

## Quantizer Class

The main entry point for all quantization operations.

```python
from qwodel import Quantizer
```

### Constructor

```python
Quantizer(
    backend,
    model_path,
    output_dir="./quantized_models",
    progress_callback=None,
    **backend_kwargs
)
```
| Parameter | Type | Required | Description |
|---|---|---|---|
| `backend` | `str` | Yes | Backend to use: `"awq"`, `"gguf"`, or `"coreml"`. |
| `model_path` | `str` | Yes | Path to the source model. Can be a local directory, `.gguf` file, or HuggingFace model ID. |
| `output_dir` | `str` | No | Directory to save the output. Defaults to `./quantized_models`. |
| `progress_callback` | `Callable` | No | Optional callback `(percent: int, stage: str, message: str)` for progress updates. |
| `**backend_kwargs` | `Any` | No | Backend-specific arguments; see each backend section below. |
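For example, a minimal construction with a progress callback might look like this (the model path is a placeholder):

```python
from qwodel import Quantizer

# Sketch only: "./llama-3" stands in for any local model directory.
def on_progress(percent: int, stage: str, message: str) -> None:
    print(f"[{percent:3d}%] {stage}: {message}")

quantizer = Quantizer(
    backend="gguf",
    model_path="./llama-3",
    progress_callback=on_progress,
)
```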

### Methods

#### `quantize(format, **kwargs) → Path`

Runs the quantization process.

| Parameter | Type | Required | Description |
|---|---|---|---|
| `format` | `str` | Yes | Quantization format string (e.g., `"int4"`, `"Q4_K_M"`, `"float16"`). |
| `**kwargs` | `Any` | No | Runtime overrides for backend-specific parameters. |

**Returns:** a `Path` to the quantized model file or directory.
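Continuing the sketch above, a call with no runtime overrides:

```python
# The returned Path points at the quantized artifact; exact file
# naming is backend-specific.
output_path = quantizer.quantize(format="Q4_K_M")
print(f"Wrote {output_path}")
```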


#### `get_model_info() → Dict`

Returns metadata about the quantized model.

| Key | Type | Description |
|---|---|---|
| `source_model` | `str` | Path to the input model. |
| `quantized_model` | `str` | Path to the output model. |
| `backend` | `str` | Backend used. |
| `file_size` | `int` | Output file size in bytes. |
| `input_format` | `str` | Detected format of the source model. |
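For instance, to report the output size after a run (continuing the sketch above):

```python
info = quantizer.get_model_info()
size_mb = info["file_size"] / (1024 * 1024)  # bytes -> MB
print(f"{info['quantized_model']} ({info['backend']}): {size_mb:.1f} MB")
```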

#### `list_formats(backend=None) → Dict` (static)

Lists available quantization formats.

| Parameter | Type | Description |
|---|---|---|
| `backend` | `str \| None` | Specific backend name, or `None` to list all. |

#### `list_backends() → List[str]` (static)

Returns a list of all registered backend names.
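Both static methods work without constructing a `Quantizer`, which makes them useful for validating user input up front. A small sketch:

```python
from qwodel import Quantizer

print(Quantizer.list_backends())        # all registered backend names
print(Quantizer.list_formats("gguf"))   # formats for one backend
print(Quantizer.list_formats())         # formats for every backend
```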


## Convenience Function

A one-call wrapper for simple jobs:

```python
from qwodel import quantize

quantize(
    model_path="./my-model",
    output_dir="./output",
    backend="gguf",
    format="Q4_K_M"
)
```
| Parameter | Type | Required | Description |
|---|---|---|---|
| `model_path` | `str` | Yes | Path to source model. |
| `output_dir` | `str` | No | Output directory. |
| `backend` | `str` | Yes | Backend name. |
| `format` | `str` | Yes | Quantization format. |
| `progress_callback` | `Callable` | No | Optional progress callback. |
| `**kwargs` | `Any` | No | Additional backend/format arguments. |
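Extra keyword arguments are forwarded to the backend, so backend-specific options from the sections below can be passed directly. A hedged sketch (paths and the sample count are illustrative):

```python
from qwodel import quantize

quantize(
    model_path="./my-model",
    output_dir="./output",
    backend="awq",
    format="int4",
    progress_callback=lambda pct, stage, msg: print(pct, stage, msg),
    num_samples=64,   # AWQ-specific argument forwarded via **kwargs
)
```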

## Backends

### AWQ Backend (`backend="awq"`)

GPU-based INT4 quantization using llm-compressor. Requires an NVIDIA GPU with CUDA.

Install: `pip install qwodel[awq]`

#### Supported Formats

| Format | Description |
|---|---|
| `int4` | 4-bit weight quantization (W4A16). Best for GPU inference. |

#### Initialization Parameters (`Quantizer(...)`)

| Parameter | Type | Default | Description |
|---|---|---|---|
| `calibration_dataset` | `str` | `"wikitext:wikitext-2-raw-v1"` | Dataset for calibration. Supports HF IDs, `repo:subset` syntax, and local `.json`/`.jsonl`/`.txt` files. |
| `calibration_split` | `str` | `"train"` | Dataset split to use. |
| `token` | `str` | `None` | HuggingFace API token for gated/private models. |
| `batch_size` | `int` | Auto | Calibration batch size. Auto-selected based on available VRAM. |
| `seq_length` | `int` | Auto | Max sequence length for calibration. Auto-selected based on VRAM. |
| `num_samples` | `int` | Auto | Number of calibration samples. Auto-selected based on VRAM. |

#### Runtime Overrides (`quantize(...)`)

These can be passed to `quantize()` to override the values set at init time.

| Parameter | Type | Description |
|---|---|---|
| `batch_size` | `int` | Override batch size. |
| `seq_len` | `int` | Override sequence length. |
| `num_samples` | `int` | Override number of calibration samples. |
| `ignore` | `List[str]` | Modules to exclude from quantization. Supports exact names and `re:` regex patterns (e.g., `["lm_head", "re:.*vision_tower.*"]`). |

#### VRAM Auto-Config

When `batch_size`, `seq_length`, and `num_samples` are not set, they are chosen automatically based on available GPU VRAM headroom:

| VRAM Headroom | batch_size | seq_len | num_samples |
|---|---|---|---|
| < 4 GB | 1 | 2048 | 32 |
| 4–8 GB | 2 | 4096 | 64 |
| 8–16 GB | 4 | 4096 | 128 |
| 16–24 GB | 8 | 8192 | 128 |
| > 24 GB | 16 | 8192 | 256 |
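Putting the AWQ section together, a sketch (model path and ignore patterns are placeholders):

```python
from qwodel import Quantizer

quantizer = Quantizer(
    backend="awq",
    model_path="./llama-3",                            # placeholder
    output_dir="./output",
    calibration_dataset="wikitext:wikitext-2-raw-v1",  # the default, shown explicitly
    calibration_split="train",
)
quantizer.quantize(
    format="int4",
    num_samples=128,                            # override the VRAM-based default
    ignore=["lm_head", "re:.*vision_tower.*"],  # skip these modules
)
```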

### GGUF Backend (`backend="gguf"`)

CPU-friendly quantization for llama.cpp-compatible runtimes.

Install: `pip install qwodel[gguf]`

#### Supported Formats

| Format | Description |
|---|---|
| `Q4_K_M` | Best balance of speed and quality. Recommended for most users. |
| `Q8_0` | Near-lossless quality. Requires more RAM. |
| `Q2_K` | Maximum compression. Reduced quality. |
| `Q3_K_M` | 3-bit medium quality. |
| `Q4_0` | Compact 4-bit, slightly smaller than Q4_K_M. |
| `Q4_K_S` | Small 4-bit K-quant. |
| `Q5_K_M` | Better quality than Q4_K_M, slightly larger. |
| `Q5_K_S` | Small 5-bit K-quant. |
| `Q6_K` | High quality, between Q8_0 and Q4_K_M. |
| `IQ4_NL` | 4.5 bpw importance-based quantization. |
| `IQ3_M` | 3.66 bpw compact importance quantization. |

#### Initialization Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_path` | `str` | (required) | Path to HuggingFace model directory or existing `.gguf` file. |
| `output_dir` | `str` | `./quantized_models` | Output directory. |

Note: GGUF has no additional backend-specific init parameters.

#### Example

```python
from qwodel import Quantizer

quantizer = Quantizer(
    backend="gguf",
    model_path="./llama-3",
    output_dir="./output"
)
quantizer.quantize(format="Q4_K_M")
```

### CoreML Backend (`backend="coreml"`)

Quantizes models for Apple devices (iOS, macOS, iPadOS) using coremltools.

Install: `pip install qwodel[coreml]`

#### Supported Formats

| Format | Compression | Notes |
|---|---|---|
| `float16` | ~2x | Half-precision. Minimal accuracy loss. Universal compatibility. |
| `int8_linear` | ~4x | 8-bit linear quantization. Good accuracy. |
| `int8_symmetric` | ~4x | 8-bit symmetric quantization. Faster ops. |
| `int4` | ~8x | 4-bit palettization. iOS 18+ only. |
| `int6` | ~5x | 6-bit palettization. A balance between int4 and int8. |

#### Initialization Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `input_shape` | `tuple` | `(1, 512)` | `(batch_size, seq_length)` for model tracing. |
| `compute_units` | `str` | `"ALL"` | CoreML compute units: `"ALL"`, `"CPU_ONLY"`, `"CPU_AND_GPU"`. |
| `seq_length` | `int` | `512` | Maximum sequence length for the dynamic shape range. |

#### Example

```python
from qwodel import Quantizer

quantizer = Quantizer(
    backend="coreml",
    model_path="./my-model",
    output_dir="./output",
    compute_units="ALL",
    seq_length=512
)
quantizer.quantize(format="int8_linear")
```

## Logging

qwodel uses Python's standard `logging` module. Configure it in your application:

```python
import logging
logging.basicConfig(level=logging.INFO)
```

The logger name for each backend is its class name (e.g., `AWQQuantizer`, `GGUFQuantizer`, `CoreMLQuantizer`).
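To raise verbosity for one backend without flooding the rest, target its logger by class name (a sketch using the standard library only):

```python
import logging

logging.basicConfig(level=logging.WARNING)                   # keep everything quiet...
logging.getLogger("GGUFQuantizer").setLevel(logging.DEBUG)   # ...except the GGUF backend
```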


## Exceptions

| Exception | Description |
|---|---|
| `QuantizationError` | General quantization failure. |
| `ValidationError` | Invalid input path, format, or architecture. |
| `DependencyError` | Missing required library or binary. |
| `FormatNotSupportedError` | Requested format is not supported by the backend. |
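A hedged error-handling sketch; it assumes the exception classes are importable from the top-level `qwodel` package (adjust the import to wherever your installed version exposes them):

```python
from qwodel import Quantizer
from qwodel import QuantizationError, ValidationError  # assumed import location

try:
    quantizer = Quantizer(backend="gguf", model_path="./llama-3")
    quantizer.quantize(format="Q4_K_M")
except ValidationError as exc:
    print(f"Bad input path or format: {exc}")
except QuantizationError as exc:
    print(f"Quantization failed: {exc}")
```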
