Integrations
llama-cpp-python provides Python bindings for llama.cpp, letting you load and run a GGUF model directly in Python without a separate server.
Install

pip install llama-cpp-python

For GPU acceleration (CUDA):
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python

Step 1: Quantize your model
from qwodel import Quantizer

quantizer = Quantizer(
    backend="gguf",
    model_path="./llama-3",
    output_dir="./output",
)
output = quantizer.quantize(format="Q4_K_M")
# output → ./output/llama-3-q4_k_m.gguf
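
As a quick sanity check, you can confirm the quantized file landed and eyeball its size. This is plain Python against the output path shown above; nothing qwodel-specific is assumed:

import os

gguf_path = "./output/llama-3-q4_k_m.gguf"
# Q4_K_M typically shrinks a model to roughly a third of its FP16 size
print(f"{os.path.getsize(gguf_path) / 1e9:.2f} GB")

Step 2: Load and run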
from llama_cpp import Llama

llm = Llama(
    model_path="./output/llama-3-q4_k_m.gguf",
    n_ctx=4096,        # Context window
    n_gpu_layers=-1,   # -1 = offload all layers to GPU (omit for CPU-only)
    verbose=False,
)
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is quantization?"},
    ]
)
print(response["choices"][0]["message"]["content"])
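
For interactive use, the same call can stream tokens as they are generated. Pass stream=True and llama-cpp-python returns an iterator of OpenAI-style chunks (the first chunk carries only the role, so check for a content key):

stream = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is quantization?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        print(delta["content"], end="", flush=True)

Text completion (non-chat)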
output = llm(
    "The capital of France is",
    max_tokens=64,
    temperature=0.7,
    top_p=0.9,
    stop=["\n"],
)
print(output["choices"][0]["text"])
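
Because prompts must fit within n_ctx, it can be useful to count tokens before sending a long prompt. The Llama object exposes the model's tokenizer directly; note that tokenize expects bytes:

tokens = llm.tokenize(b"The capital of France is")
print(len(tokens))  # prompt length in tokens (includes BOS by default); must fit in n_ctx

OpenAI-compatible server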
llama-cpp-python includes a built-in HTTP server (installed via the server extras: pip install 'llama-cpp-python[server]'):
python -m llama_cpp.server \
    --model ./output/llama-3-q4_k_m.gguf \
    --n_ctx 4096 \
    --n_gpu_layers -1

Then use it like any OpenAI endpoint:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
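
Streaming works through the server as well, using the standard OpenAI streaming interface:

stream = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Key parameters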
| Parameter | Description |
|---|---|
| n_ctx | Context window in tokens (default 512) |
| n_gpu_layers | Number of layers to offload to GPU (-1 = all) |
| n_threads | CPU threads to use for inference |
| verbose | Print llama.cpp debug output |
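
For CPU-only inference, n_threads is the main tuning knob. As a starting point (a heuristic to benchmark on your machine, not a rule), matching the number of physical cores usually beats using every logical core:

import multiprocessing

from llama_cpp import Llama

llm = Llama(
    model_path="./output/llama-3-q4_k_m.gguf",
    n_ctx=4096,
    # cpu_count() reports logical cores; halving approximates
    # physical cores on hyperthreaded machines
    n_threads=multiprocessing.cpu_count() // 2,
)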
