# Distillation

> Train a cheap student model from teacher labels — the wedge that separates OpenTracy from a generic LLM gateway

<img src="https://mintcdn.com/opentracy/GPH9CFICBELzB50g/images/money.jpeg?fit=max&auto=format&n=GPH9CFICBELzB50g&q=85&s=eff74340f9ade39ecc6dc53ef93bb8fa" alt="OpenTracy ghost with dollar-sign eyes — distillation is the cost-reduction wedge" width="1280" height="698" data-path="images/money.jpeg" />

Distillation is how you go from paying $0.02 per call to GPT-4o to
paying $0.0005 per call to a small model that you fine-tuned **on your own
traffic** to match GPT-4o's output. It is the wedge: the compounding
value that a plain gateway can't offer.

## The core idea

A **teacher** is a large, expensive model that already does the task well
(GPT-4o, Claude Sonnet, etc.). A **student** is a small, cheap open model
(llama-3.2-1b, qwen3-0.6b, etc.). Distillation trains the student to
**imitate the teacher's behavior on a specific dataset** — usually the
one you built from your own traces (see [Datasets](/concepts/datasets)).

The student won't be as smart as the teacher in general. It will be
roughly as good as the teacher on the narrow slice of prompts you
distilled — and 10–100× cheaper to run.
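
As a rough illustration of what that price gap means, the sketch below uses the two per-call prices quoted in the introduction. The monthly call volume is a made-up number chosen only to make the arithmetic concrete, and the calculation ignores the one-off cost of training.

```python
# Per-call prices quoted in the introduction of this page.
TEACHER_COST_PER_CALL = 0.02
STUDENT_COST_PER_CALL = 0.0005


def monthly_savings(calls_per_month: int) -> float:
    """Dollars saved per month if every call goes to the student instead."""
    return calls_per_month * (TEACHER_COST_PER_CALL - STUDENT_COST_PER_CALL)


cost_ratio = TEACHER_COST_PER_CALL / STUDENT_COST_PER_CALL

# 1_000_000 calls/month is a hypothetical volume, not a measured one.
print(f"student is about {cost_ratio:.0f}x cheaper per call")
print(f"savings at 1M calls/month: ${monthly_savings(1_000_000):,.2f}")
```

Swap in your own volume and prices to see where your workload lands in the 10–100× range.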

## What the pipeline does

<Steps>
  <Step title="Data generation">
    For each prompt in the dataset, the teacher is called N times (default 4)
    with temperature > 0. This produces N candidate responses per prompt.
  </Step>

  <Step title="Curation">
    A judge model scores each candidate. The top-k (default: best 2)
    survive — bad candidates are dropped. This is the **"best-of-N"** part
    of BOND (Best-Of-N Distillation).
  </Step>

  <Step title="Training">
    The student is fine-tuned on (prompt → curated\_response) pairs using
    the BOND loss, a blend of supervised fine-tuning, preference optimization,
    and KL regularization. Runs on GPU via Unsloth + TRL.
  </Step>

  <Step title="Export">
    The trained LoRA adapter is saved and optionally converted to GGUF
    (quantized) for serving on CPU or edge. Output: a directory you can
    load into any inference engine that speaks GGUF/llama.cpp.
  </Step>

  <Step title="Serve">
    Register the distilled model in OpenTracy's model registry. Point a
    routing alias at it. Your app keeps calling `model="smart"` and the
    requests now flow through your custom student.
  </Step>
</Steps>
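
The data generation and curation steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not OpenTracy's implementation: `best_of_n`, `toy_teacher`, and `toy_judge` are hypothetical names, and the teacher and judge are stubs standing in for real model calls.

```python
import random
from typing import Callable, List, Tuple


def best_of_n(
    prompt: str,
    teacher: Callable[[str], str],
    judge: Callable[[str, str], float],
    n_samples: int = 4,
    top_k: int = 2,
) -> List[Tuple[str, str]]:
    """Sample n_samples candidates from the teacher, keep the top_k by judge score."""
    candidates = [teacher(prompt) for _ in range(n_samples)]
    # sorted() is stable, so equally scored candidates keep their sampling order.
    ranked = sorted(candidates, key=lambda c: judge(prompt, c), reverse=True)
    return [(prompt, response) for response in ranked[:top_k]]


# Stand-ins for real model calls (hypothetical).
rng = random.Random(0)
LABELS = ["billing", "shipping", "account", "other"]


def toy_teacher(prompt: str) -> str:
    # A real teacher would be sampled with temperature > 0.
    return rng.choice(LABELS)


def toy_judge(prompt: str, response: str) -> float:
    # Pretend the judge knows that refund requests are billing questions.
    return 1.0 if response == "billing" else 0.0


pairs = best_of_n("Classify: refund please", toy_teacher, toy_judge)
print(pairs)  # two (prompt, response) pairs, best-scored first
```

The surviving pairs are what the training step consumes; the dropped candidates can double as dispreferred examples for a preference term.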

## Running a distillation job

The simplest path is `ot.distill()`, which runs the pipeline in-process and
returns a callable [`Student`](/api-reference/distill#student-class).
No REST service, no job polling, no ClickHouse.

```python theme={null}
import opentracy as ot

student = ot.distill(
    dataset="tickets.jsonl",              # path, list of dicts, or a callable
    teacher="openai/gpt-4o",
    student="llama-3.2-1b",
    steps=100,
    n_samples=4,                          # BOND candidates per prompt
    quantize="q4_k_m",                    # or None to skip GGUF export
)

print(student("Classify: refund please"))  # local inference, $0
```

Pass `on_progress=callback` for a tidy phase-by-phase timeline. See
[`ot.distill` reference](/api-reference/distill) for every parameter.

For the queued, multi-tenant, REST-backed flow (jobs persisted in
ClickHouse, UI observability, resumable on restart), use the
[`Distiller`](/api-reference/distiller) HTTP client instead — same
engine under the hood, different deployment shape.

## Choosing a teacher and a student

**Teacher**: pick the model you'd use in production if cost weren't an
issue. GPT-4o, Claude Sonnet, or Gemini 1.5 Pro are good defaults. The
student will learn to match this model's output style and accuracy — on
the distilled task only.

**Student**: the smallest model that can plausibly handle your task's
output. Rule of thumb:

| Task                               | Student floor      |
| ---------------------------------- | ------------------ |
| Classification (few labels)        | 0.6B (qwen3-0.6b)  |
| Structured extraction (JSON)       | 1B (llama-3.2-1b)  |
| Short-form generation (\< 200 tok) | 1–3B               |
| Long-form + reasoning              | 8B+ (llama-3.1-8b) |

Smaller is cheaper to run but harder to train. If training fails to
converge, move up a tier.

To list the teacher and student models currently supported:

```python theme={null}
# Via the REST client (self-hosted stack):
from opentracy import Distiller
d = Distiller(base_url="http://localhost:8000")
for t in d.teacher_models():
    print(t["id"], t["provider"])
for s in d.student_models():
    print(s["id"], s.get("params"))

# Via the in-process path (no server needed):
from opentracy.distillation.schemas import STUDENT_MODEL_MAP, TEACHER_MODEL_MAP
print(list(STUDENT_MODEL_MAP.keys()))
print(list(TEACHER_MODEL_MAP.keys()))
```

## The BOND hyperparameters

The BOND loss has two knobs worth knowing:

* **`bond_beta`** (default `0.5`) — how hard to push the student toward
  preferred responses vs. dispreferred. Higher = more aggressive
  preference shift; lower = gentler, more SFT-like.
* **`bond_gamma`** (default `0.1`) — KL regularization strength. Keeps
  the student's output distribution close to that of its base model so
  you don't destroy general capability. Raise it if your student overfits
  or starts babbling.

You rarely need to tune these — defaults are good for most tasks. If
you're getting bad results, first look at dataset quality before touching
BOND parameters.
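
To build intuition for how the two knobs interact, here is a toy scalar version of a blended loss with the three ingredients named earlier: supervised fine-tuning, a preference term, and KL regularization. The exact formula OpenTracy uses is not given on this page, so the DPO-style preference term and the function name `blended_loss` are assumptions made purely for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def blended_loss(
    logp_chosen: float,        # student log-prob of the curated response
    logp_rejected: float,      # student log-prob of a dropped candidate
    ref_logp_chosen: float,    # same two quantities under the frozen base model
    ref_logp_rejected: float,
    kl_to_base: float,         # estimate of KL(student || base), non-negative
    bond_beta: float = 0.5,
    bond_gamma: float = 0.1,
) -> float:
    """Illustrative blend: SFT + preference + KL. Not OpenTracy's actual loss."""
    sft = -logp_chosen
    # How much more the student favours chosen over rejected than the base does.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    preference = -log_sigmoid(bond_beta * margin)
    return sft + preference + bond_gamma * kl_to_base


# Same inputs, two values of bond_gamma: the KL penalty weighs more heavily.
print(blended_loss(-1.0, -3.0, -1.5, -2.5, kl_to_base=0.4, bond_gamma=0.1))
print(blended_loss(-1.0, -3.0, -1.5, -2.5, kl_to_base=0.4, bond_gamma=0.5))
```

In this sketch a larger `bond_beta` steepens the preference term, while a larger `bond_gamma` makes drifting away from the base model more expensive, which matches the qualitative descriptions above.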

## Hardware requirements

Training runs on GPU. The Docker image (`opentracy-api`) is built on the
`nvidia/cuda:12.6` base and supports `--gpus all`. Minimum specs:

| Student size | Min VRAM      | Typical training time (500 prompts, 100 steps) |
| ------------ | ------------- | ---------------------------------------------- |
| 0.6B–1B      | 8 GB          | 10–20 minutes                                  |
| 3B           | 16 GB         | 30–60 minutes                                  |
| 8B           | 24 GB (4-bit) | 2–4 hours                                      |

Without a GPU, training will fail. Use the `estimate` endpoint first
to validate before kicking off a job.
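
For a quick local sanity check before submitting anything, the table above can be encoded as a lookup. This is a sketch, not part of the OpenTracy API: `min_vram_gb` and `fits` are hypothetical helpers, sizes that fall between rows are rounded up to the next row as an assumption, and the 8B row assumes 4-bit loading as stated in the table.

```python
from typing import Optional

# (largest student size in billions of parameters, minimum VRAM in GB),
# copied from the hardware table on this page.
VRAM_TABLE = [(1.0, 8), (3.0, 16), (8.0, 24)]


def min_vram_gb(student_params_b: float) -> Optional[int]:
    """Minimum VRAM for a student size, or None if it is larger than the table covers."""
    for max_params_b, vram_gb in VRAM_TABLE:
        if student_params_b <= max_params_b:
            return vram_gb
    return None


def fits(student_params_b: float, available_vram_gb: float) -> bool:
    """True if the table says this student should train on the given GPU."""
    needed = min_vram_gb(student_params_b)
    return needed is not None and available_vram_gb >= needed


print(fits(1.0, 8))    # 1B student on an 8 GB card
print(fits(8.0, 16))   # 8B student on a 16 GB card
```

The available VRAM figure has to come from your own environment; the sketch only compares it against the documented minimums.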

## After training: the alias swap

The closing move. `student.deploy(alias)` writes the mapping into
`~/.opentracy/aliases.json` — from that point on, any
`ot.completion(model=alias, ...)` call from any Python process owned by
the same user dispatches to this student locally.

```python theme={null}
student = ot.distill(dataset=..., teacher=..., student=...)
student.save("./ticket-classifier-v1")       # durable artifact path
student.deploy("ticket-classifier")          # register alias

# Now any caller — your app, a FastAPI server, a batch job —
# transparently hits the local student with this single string:
resp = ot.completion(
    model="ticket-classifier",
    messages=[{"role": "user", "content": "Classify: refund please"}],
)
print(resp.choices[0].message.content, resp._cost)   # "billing" 0.0
```

Re-pointing, listing, or removing aliases is a one-liner:

```python theme={null}
ot.set_alias("smart", backend="peft", model_path=..., base_model=...)
ot.list_aliases()
ot.unset_alias("smart")
```

This is the moment cost savings actually land on your invoice. **Your
app code didn't change**; only the thing on the other side of the alias
got 10–100× cheaper.
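
The reason app code stays untouched is plain indirection: callers hold a stable name, and only the target behind that name changes. The toy registry below illustrates the idea. It is not how OpenTracy stores aliases, and `AliasRegistry`, `teacher_backend`, and `student_backend` are hypothetical names.

```python
from typing import Callable, Dict

Backend = Callable[[str], str]


class AliasRegistry:
    """Toy alias table: callers use a stable name, the target can change."""

    def __init__(self) -> None:
        self._targets: Dict[str, Backend] = {}

    def set_alias(self, alias: str, backend: Backend) -> None:
        self._targets[alias] = backend

    def complete(self, alias: str, prompt: str) -> str:
        return self._targets[alias](prompt)


def teacher_backend(prompt: str) -> str:
    # Stand-in for a paid call to the teacher model.
    return "teacher:billing"


def student_backend(prompt: str) -> str:
    # Stand-in for local inference on the distilled student.
    return "student:billing"


registry = AliasRegistry()
registry.set_alias("ticket-classifier", teacher_backend)
before = registry.complete("ticket-classifier", "Classify: refund please")

# The swap: same alias, same call site, different backend.
registry.set_alias("ticket-classifier", student_backend)
after = registry.complete("ticket-classifier", "Classify: refund please")
print(before, "->", after)
```

The call site is identical before and after the swap, which is what makes rolling back as cheap as re-pointing the alias.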

## Common pitfalls

<Warning>
  **Distilling a mixed bag instead of a single cluster.** One dataset
  should be one coherent task. If you mix "JSON extraction" and "creative
  writing" into the same dataset, the student gets confused. Distill each
  task separately and give each student its own alias.
</Warning>

<Warning>
  **Training before the teacher is right.** If your teacher is giving 70%
  accurate answers, your student will cap out below that. Fix prompting
  and model choice first; then distill.
</Warning>

<Warning>
  **Evaluating the student only on training examples.** Always evaluate
  on held-out traces. OpenTracy's evaluation framework handles this —
  pass a dataset with a test split and it will report accuracy on rows
  the student never saw.
</Warning>
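
If your dataset does not have a test split yet, one simple way to make one is to hash each prompt and assign the row by hash value, so that duplicate prompts can never end up on both sides. The sketch below is a generic approach, not OpenTracy's evaluation framework; `split_rows` and the `prompt` field name are assumptions about how your rows are shaped.

```python
import hashlib
from typing import Dict, List, Tuple

Row = Dict[str, str]


def split_rows(
    rows: List[Row],
    test_fraction: float = 0.2,
    key: str = "prompt",  # assumed field name; adjust to your dataset
) -> Tuple[List[Row], List[Row]]:
    """Deterministically assign each row to train or test by hashing its prompt."""
    train: List[Row] = []
    test: List[Row] = []
    for row in rows:
        digest = hashlib.sha256(row[key].encode("utf-8")).digest()
        # Map the first 8 bytes of the hash to a number in [0, 1).
        bucket = int.from_bytes(digest[:8], "big") / 2**64
        (test if bucket < test_fraction else train).append(row)
    return train, test


rows = [{"prompt": f"ticket {i}"} for i in range(200)]
train, test = split_rows(rows)
print(len(train), len(test))
```

Because the assignment depends only on the prompt text, re-running the split after adding new traces keeps every existing row on the side it was already on.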

## Next

<CardGroup cols={2}>
  <Card title="Distiller reference" icon="code" href="/api-reference/distiller">
    Every method of the `Distiller` client with parameters and return types.
  </Card>

  <Card title="Self-host the full stack" icon="server" href="/guides/self-host">
    Distillation requires the engine + GPU — this guide sets up Docker Compose.
  </Card>
</CardGroup>
