# Distillation

> Train a cheap student model from teacher labels: the wedge that separates OpenTracy from a generic LLM gateway

OpenTracy ghost with dollar-sign eyes — distillation is the cost-reduction wedge

Distillation is how you go from "paying $0.02 per call to GPT-4o" to "paying $0.0005 per call to a small model that I fine-tuned **on my own traffic** to match GPT-4o's output". It's the wedge: the compounding value that a plain gateway can't offer.

## The core idea

A **teacher** is a large, expensive model that already does the task well (GPT-4o, Claude Sonnet, etc.).

A **student** is a small, cheap open model (llama-3.2-1b, qwen3-0.6b, etc.).

Distillation trains the student to **imitate the teacher's behavior on a specific dataset**, usually the one you built from your own traces (see [Datasets](/concepts/datasets)).

The student won't be as smart as the teacher in general. It will be roughly as good as the teacher on the narrow slice of prompts you distilled, and 10–100× cheaper to run.

## What the pipeline does

1. For each prompt in the dataset, the teacher is called N times (default 4) with temperature > 0. This produces N candidate responses per prompt.
2. A judge model scores each candidate. The top-k (default: best 2) survive; bad candidates are dropped. This is the **"best-of-N"** part of BOND (Best-Of-N Distillation).
3. The student is fine-tuned on (prompt → curated\_response) pairs using the BOND loss, a blend of supervised fine-tuning, preference optimization, and KL regularization. Runs on GPU via Unsloth + TRL.
4. The trained LoRA adapter is saved and optionally converted to GGUF (quantized) for serving on CPU or edge. Output: a directory you can load into any inference engine that speaks GGUF/llama.cpp.
5. Register the distilled model in OpenTracy's model registry. Point a routing alias at it. Your app keeps calling `model="smart"` and the requests now flow through your custom student.

## Running a distillation job

The simplest path is `ot.distill()`, which runs the pipeline in-process and returns a callable [`Student`](/api-reference/distill#student-class). No REST service, no job polling, no ClickHouse.

```python
import opentracy as ot

student = ot.distill(
    dataset="tickets.jsonl",   # path, list of dicts, or a callable
    teacher="openai/gpt-4o",
    student="llama-3.2-1b",
    steps=100,
    n_samples=4,               # BOND candidates per prompt
    quantize="q4_k_m",         # or None to skip GGUF export
)

print(student("Classify: refund please"))   # local inference, $0
```

Pass `on_progress=callback` for a tidy phase-by-phase timeline. See [`ot.distill` reference](/api-reference/distill) for every parameter.

For the queued, multi-tenant, REST-backed flow (jobs persisted in ClickHouse, UI observability, resumable on restart), use the [`Distiller`](/api-reference/distiller) HTTP client instead. It is the same engine under the hood in a different deployment shape.

## Choosing a teacher and a student

**Teacher**: pick the model you'd use in production if cost weren't an issue. GPT-4o, Claude Sonnet, or Gemini 1.5 Pro are good defaults. The student will learn to match this model's output style and accuracy, but only on the distilled task.

**Student**: the smallest model that can plausibly handle your task's output. Rule of thumb:

| Task                                 | Student floor      |
| ------------------------------------ | ------------------ |
| Classification (few labels)          | 0.6B (qwen3-0.6b)  |
| Structured extraction (JSON)         | 1B (llama-3.2-1b)  |
| Short-form generation (< 200 tokens) | 1–3B               |
| Long-form + reasoning                | 8B+ (llama-3.1-8b) |

Smaller is cheaper to run but harder to train. If training fails to converge, move up a tier.
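
The rule of thumb above, including the "move up a tier" fallback, can be written down as plain data. This is a minimal sketch, assuming size tiers of 0.6B, 1B, 3B, and 8B read off the table; the names `TIERS_B`, `FLOOR_B`, `student_floor`, and `move_up` are hypothetical helpers for illustration and are not part of the OpenTracy API.

```python
# Hypothetical helpers, not part of OpenTracy: they only encode the
# rule-of-thumb table as data so a script can pick a starting size.

# Student sizes in billions of parameters, smallest to largest (assumed tiers).
TIERS_B = [0.6, 1.0, 3.0, 8.0]

# Smallest size worth trying per task type, taken from the table.
FLOOR_B = {
    "classification": 0.6,
    "structured_extraction": 1.0,
    "short_form_generation": 1.0,  # table gives a 1-3B range; start at the low end
    "long_form_reasoning": 8.0,
}


def student_floor(task: str) -> float:
    """Return the smallest student size (in B params) to try for a task."""
    return FLOOR_B[task]


def move_up(size_b: float) -> float:
    """Return the next tier up, or the same size if already at the top tier."""
    idx = TIERS_B.index(size_b)
    return TIERS_B[min(idx + 1, len(TIERS_B) - 1)]


size = student_floor("classification")
print(size)           # 0.6
print(move_up(size))  # 1.0
```

Starting at the floor and stepping up only when training fails to converge keeps the serving cost as low as the task allows.
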
Discover the full current list of teacher and student models:

```python
# Via the REST client (self-hosted stack):
from opentracy import Distiller

d = Distiller(base_url="http://localhost:8000")
for t in d.teacher_models():
    print(t["id"], t["provider"])
for s in d.student_models():
    print(s["id"], s.get("params"))

# Via the in-process path (no server needed):
from opentracy.distillation.schemas import STUDENT_MODEL_MAP, TEACHER_MODEL_MAP

print(list(STUDENT_MODEL_MAP.keys()))
print(list(TEACHER_MODEL_MAP.keys()))
```

## The BOND hyperparameters

The BOND loss has two knobs worth knowing:

* **`bond_beta`** (default `0.5`): how hard to push the student toward preferred responses vs. dispreferred. Higher = more aggressive preference shift; lower = gentler, more SFT-like.
* **`bond_gamma`** (default `0.1`): KL regularization strength. Keeps the student close to its initial weights so you don't destroy general capability. Raise it if your student overfits or starts babbling.

You rarely need to tune these; the defaults are good for most tasks. If you're getting bad results, look at dataset quality first, before touching BOND parameters.

## Hardware requirements

Training runs on GPU. The Docker image (`opentracy-api`) is built on the `nvidia/cuda:12.6` base and supports `--gpus all`. Minimum specs:

| Student size | Min VRAM      | Typical training time (500 prompts, 100 steps) |
| ------------ | ------------- | ---------------------------------------------- |
| 0.6B–1B      | 8 GB          | 10–20 minutes                                  |
| 3B           | 16 GB         | 30–60 minutes                                  |
| 8B           | 24 GB (4-bit) | 2–4 hours                                      |

Without a GPU, training will fail. Use the `estimate` endpoint first to validate before kicking off a job.

## After training: the alias swap

The closing move. `student.deploy(alias)` writes the mapping into `~/.opentracy/aliases.json`. From that point on, any `ot.completion(model=alias, ...)` call from any Python process owned by the same user dispatches to this student locally.

```python
import opentracy as ot

student = ot.distill(dataset=..., teacher=..., student=...)
student.save("./ticket-classifier-v1")   # durable artifact path
student.deploy("ticket-classifier")      # register alias

# Now any caller (your app, a FastAPI server, a batch job)
# transparently hits the local student with this single string:
resp = ot.completion(
    model="ticket-classifier",
    messages=[{"role": "user", "content": "Classify: refund please"}],
)
print(resp.choices[0].message.content, resp._cost)   # "billing" 0.0
```

Re-pointing, listing, or removing aliases is a one-liner:

```python
ot.set_alias("smart", backend="peft", model_path=..., base_model=...)
ot.list_aliases()
ot.unset_alias("smart")
```

This is the moment the cost savings actually land on your invoice. **Your app code didn't change**; only the thing on the other side of the alias got 10–100× cheaper.

## Common pitfalls

**Distilling a mixed bag instead of a single cluster.** One dataset should be one coherent task. If you mix "JSON extraction" and "creative writing" into the same dataset, the student gets confused. Distill each task separately and point a separate alias at each student.

**Training before the teacher is right.** If your teacher is giving 70% accurate answers, your student will cap out below that. Fix prompting and model choice first; then distill.

**Evaluating the student only on training examples.** Always evaluate on held-out traces. OpenTracy's evaluation framework handles this: pass a dataset with a test split and it will report accuracy on rows the student never saw.

## Next

* Every method of the `Distiller` client with parameters and return types.
* Distillation requires the engine + GPU; this guide sets up Docker Compose.