Integrations
Ollama is one of the easiest ways to serve a GGUF model locally. This guide shows how to create a custom Modelfile from a Qwodel-quantized GGUF and run it.
Prerequisites
- Install Ollama for your platform.
- A .gguf file produced by Qwodel.
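To confirm the Ollama install and its local server are working before you continue, the commands below should succeed (exact output varies by platform and version):

```sh
# Print the installed Ollama version
ollama --version

# List locally registered models; an empty list is expected on a fresh install
ollama list
```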
Step 1: Quantize your model
from qwodel import Quantizer
quantizer = Quantizer(
backend="gguf",
model_path="./llama-3",
output_dir="./output"
)
output = quantizer.quantize(format="Q4_K_M")
# output → ./output/llama-3-q4_k_m.gguf
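Before moving on, it is worth confirming the quantized file was actually written; the path below follows the example above:

```sh
# The output directory from Step 1 should now contain the quantized model
ls -lh ./output/llama-3-q4_k_m.gguf
```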
Step 2: Create a Modelfile
Create a file named Modelfile (no extension) in the directory containing ./output, so the relative path in FROM resolves:
FROM ./output/llama-3-q4_k_m.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
SYSTEM "You are a helpful assistant."Common Modelfile parameters
| Parameter | Description |
|---|---|
| temperature | Sampling temperature (0.0 = deterministic, 1.0 = creative) |
| top_p | Nucleus sampling threshold |
| num_ctx | Context window size in tokens |
| SYSTEM | System prompt prepended to every conversation |
Step 3: Create and run the model
# Register the model with Ollama
ollama create my-llama3 --file ./Modelfile
# Run it interactively
ollama run my-llama3
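ollama run also accepts a prompt as an argument, which is handy for a quick smoke test or for use in shell scripts:

```sh
# One-shot, non-interactive generation; the reply is printed to stdout
ollama run my-llama3 "Explain quantization in one sentence."
```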
Step 4: Use via API
Ollama exposes a local REST API compatible with the OpenAI client:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # required but unused
)
response = client.chat.completions.create(
model="my-llama3",
messages=[{"role": "user", "content": "Explain quantization in one sentence."}]
)
print(response.choices[0].message.content)
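If you would rather not go through the OpenAI client, Ollama also serves its own REST endpoints on the same port. A minimal sketch with curl (setting "stream": false returns a single JSON object instead of a token stream):

```sh
# Native Ollama chat endpoint
curl http://localhost:11434/api/chat -d '{
  "model": "my-llama3",
  "messages": [{"role": "user", "content": "Explain quantization in one sentence."}],
  "stream": false
}'
```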
Useful Ollama CLI commands
ollama list # List all registered models
ollama show my-llama3 # Show model info
ollama rm my-llama3 # Remove a model
ollama ps # Show running models
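To double-check which settings a registered model was built with, recent Ollama versions can print the Modelfile back out:

```sh
# Reprint the Modelfile used to create the model, including PARAMETER and SYSTEM lines
ollama show my-llama3 --modelfile
```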