# Auto-routing

> How the router picks a model per prompt — semantic clusters, per-model error profiles, cost-quality tradeoff

<img src="https://mintcdn.com/opentracy/GPH9CFICBELzB50g/images/routing_with_text.jpeg?fit=max&auto=format&n=GPH9CFICBELzB50g&q=85&s=bf8759ba9e90bf4d85ca4d48121f52ca" alt="Multiple OpenTracy ghosts fanning out to different providers — one router, many models" width="1280" height="698" data-path="images/routing_with_text.jpeg" />

The auto-router is the piece that turns **"I have thirteen models I could
call"** into **"call the right one for this specific prompt"** — without
you writing any rules.

It's based on the idea that prompts with similar *meaning* have similar
*difficulty*, and different models have different strengths on different
kinds of prompts. So if you can (a) group prompts by meaning and (b) know
each model's error rate on each group, you can route by minimizing
`expected_error + λ·cost`.

## The three moving parts

### 1. Embedder

Every prompt is run through a sentence embedder (MiniLM-L6-v2, bundled in
the wheel) to produce a 384-dimensional vector. This is a pure function:
the same prompt always produces the same vector.

### 2. Cluster assigner

The embedder output is assigned to one of **100 pre-trained semantic
clusters**. Cluster centroids live in the weights package you downloaded
on first run. Examples (cluster names from the default weights):

* cluster 47 → "mathematical proofs and formal reasoning"
* cluster 84 → "short factual lookup"
* cluster 88 → "data-structure code generation"
* cluster 29 → "creative short-form writing"

Clusters are assigned by nearest centroid (cosine distance). You can opt
into **soft assignment** — a full probability distribution over the 100
clusters — via `use_soft_assignment=True` when loading the router.
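
To make the two assignment modes concrete, here is a minimal pure-Python sketch on toy 3-dimensional vectors. The function names are hypothetical, and the softmax-with-temperature used for the soft distribution is an assumption for illustration; this page doesn't say how the router actually derives its probabilities.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def hard_assign(embedding, centroids):
    """Nearest centroid by cosine distance, i.e. highest cosine similarity."""
    sims = [cosine_similarity(embedding, c) for c in centroids]
    return max(range(len(sims)), key=sims.__getitem__)

def soft_assign(embedding, centroids, temperature=0.1):
    """Probability distribution over clusters.

    Assumption: a softmax over cosine similarities. The real router may
    compute its soft assignment differently.
    """
    sims = [cosine_similarity(embedding, c) for c in centroids]
    peak = max(sims)  # subtract the max for numerical stability
    weights = [math.exp((s - peak) / temperature) for s in sims]
    total = sum(weights)
    return [w / total for w in weights]

# Toy stand-ins: 3 centroids in 3 dimensions instead of 100 in 384.
centroids = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
embedding = [0.2, 0.9, 0.1]

cluster_id = hard_assign(embedding, centroids)
cluster_probabilities = soft_assign(embedding, centroids)
```

Hard assignment keeps only the winning index, while the soft distribution also tells you how close the runner-up clusters were.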

### 3. Per-model error profiles

For every model the router knows about, there's a vector Ψ of length 100:
`Ψ[i]` is the model's empirical error rate on cluster `i`. Error is
measured as "fraction of validation examples where this model got it
wrong" during profile fitting.

A routing decision is then:

```
score(model) = Ψ_model[cluster] + λ · cost_per_1k(model)
selected     = argmin over candidate models of score(model)
```

`λ` is the `cost_weight` argument you pass to `load_router()`.
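
The scoring rule is small enough to sketch in a few lines. The model names, error rates, and costs below are made-up illustration values, not numbers from the shipped profiles, and `route` here is a hypothetical stand-in rather than the library's implementation.

```python
def route(cluster_id, error_profiles, cost_per_1k, cost_weight):
    """Score every model for one cluster and return the argmin plus all scores."""
    scores = {
        model: profile[cluster_id] + cost_weight * cost_per_1k[model]
        for model, profile in error_profiles.items()
    }
    selected = min(scores, key=scores.get)
    return selected, scores

# Hypothetical profiles over 3 clusters (the real Ψ has length 100).
error_profiles = {
    "small-model": [0.05, 0.40, 0.10],
    "large-model": [0.02, 0.08, 0.03],
}
# Hypothetical costs, in whatever unit cost_per_1k uses.
cost_per_1k = {"small-model": 0.01, "large-model": 0.25}

# Cluster 0 is easy for both models, cluster 1 is hard for the small one.
easy_choice, easy_scores = route(0, error_profiles, cost_per_1k, cost_weight=0.5)
hard_choice, hard_scores = route(1, error_profiles, cost_per_1k, cost_weight=0.5)
```

On the easy cluster the cheap model wins because the error gap is smaller than the cost penalty; on the hard cluster the larger model's lower error outweighs its cost.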

## Using it

The whole thing collapses to two lines:

```python theme={null}
import opentracy as ot

router = ot.load_router(cost_weight=0.5)
decision = router.route("Write a Python function that reverses a linked list.")
```

The returned `RoutingDecision`:

```python theme={null}
decision.selected_model           # "gpt-4o"
decision.cluster_id               # 88
decision.expected_error           # 0.000
decision.cost_adjusted_score      # 0.0031
decision.all_scores               # {model_id: score, ...} — every candidate
decision.cluster_probabilities    # np.ndarray(100,) — soft distribution
decision.reasoning                # human-readable explanation
```

## Tuning the cost-quality dial

`cost_weight` (λ) is the one knob you'll actually touch:

| λ      | Behavior                                                                   |
| ------ | -------------------------------------------------------------------------- |
| `0.0`  | Pick whichever model has lowest predicted error, ignore cost.              |
| `0.5`  | Balanced — common default. A tiny error delta won't justify a 10× cost.    |
| `1.0`  | Strongly prefer cheaper models; only escalate if they're demonstrably bad. |
| `2.0+` | Aggressively cheap; escalate only on the worst prompts.                    |

Try a few values on your traffic. The right number depends on how much
quality degradation you can tolerate.
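
To see why the dial behaves this way, it helps to sweep λ on a single cluster with two candidates. This is a toy sketch with invented numbers: the selection flips at the point where the error gap equals λ times the cost gap.

```python
# Hypothetical error rates for one cluster and hypothetical costs.
error_rate = {"small-model": 0.12, "large-model": 0.04}
cost_per_1k = {"small-model": 0.01, "large-model": 0.25}

def pick(cost_weight):
    """Model with the lowest error + λ·cost for this cluster."""
    return min(
        error_rate,
        key=lambda m: error_rate[m] + cost_weight * cost_per_1k[m],
    )

# Selection at several values of λ.
sweep = {lam: pick(lam) for lam in (0.0, 0.25, 0.5, 1.0, 2.0)}

# λ at which both models score the same: error gap divided by cost gap.
break_even = (error_rate["small-model"] - error_rate["large-model"]) / (
    cost_per_1k["large-model"] - cost_per_1k["small-model"]
)
```

Below the break-even λ the larger model is selected for this cluster; above it the cheaper one is. Every cluster has its own break-even point, which is why a single λ still produces different choices for different prompts.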

## Restricting the candidate pool

By default the router considers every model in the loaded registry. You
can restrict it:

```python theme={null}
# Only route among these three
router = ot.load_router(
    allowed_models=["gpt-4o-mini", "ministral-3b-latest", "gpt-4o"],
    cost_weight=0.5,
)
```

Or override per-call:

```python theme={null}
decision = router.route(prompt, available_models=["gpt-4o-mini", "gpt-4o"])
```

Useful when you want to A/B test a model subset, or when certain models
aren't available in a tenant.
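
For the per-tenant case, one possible pattern is a small wrapper around the per-call override. The tenant IDs, the `TENANT_MODELS` mapping, and `route_for_tenant` are hypothetical names for illustration; the only router call used is `route(prompt, available_models=...)` as shown above.

```python
# Hypothetical mapping from tenant ID to the models that tenant may use.
TENANT_MODELS = {
    "tenant-a": ["gpt-4o-mini", "gpt-4o"],
    "tenant-b": ["gpt-4o-mini", "ministral-3b-latest"],
}

def route_for_tenant(router, tenant_id, prompt):
    """Route a prompt using only the models allowed for this tenant.

    Tenants without an entry fall through to the router's default pool.
    """
    models = TENANT_MODELS.get(tenant_id)
    if models is None:
        return router.route(prompt)
    return router.route(prompt, available_models=models)
```

Keeping the mapping outside the router means one loaded router can serve every tenant, with the restriction applied only at call time.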

## The two backends

One `load_router` parameter you'll rarely need to touch is `engine`. It
selects one of the two backends, or an `auto` mode that picks for you:

* `engine="go"` (**default**): spawns the bundled Go engine as a
  subprocess. Fast (sub-millisecond routing), production path. The Go
  binary is bundled per-platform; if it isn't present you'll see a clear
  error.
* `engine="python"`: pure Python implementation, no subprocess. Slower,
  but useful in environments where process-spawn is forbidden or where
  you want to introspect every internal (e.g. swap the cluster assigner,
  monkey-patch profiles).
* `engine="auto"`: prefer Go, fall back to Python if the binary is
  missing. **Not recommended** as a default because the fallback is
  silent. If something's wrong with the binary, you want to know, not
  route 10× slower without noticing.
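
If you do want a fallback but not a silent one, a thin wrapper can make the downgrade visible in your logs. This is a sketch under assumptions: the helper name is hypothetical, and because this page doesn't say which exception a missing Go binary raises, the wrapper catches `Exception` broadly. Narrow it once you know the real type.

```python
import logging

logger = logging.getLogger("routing")

def load_router_with_loud_fallback(load_router, **kwargs):
    """Try the Go engine first; log a warning before falling back to Python.

    `load_router` is passed in (e.g. `ot.load_router`) so the helper has no
    import-time dependency and is easy to exercise with a stub.
    """
    try:
        return load_router(engine="go", **kwargs)
    except Exception as exc:  # assumption: exact exception type is not documented here
        logger.warning(
            "Go engine unavailable (%s); falling back to engine='python'", exc
        )
        return load_router(engine="python", **kwargs)
```

You would call it as `load_router_with_loud_fallback(ot.load_router, cost_weight=0.5)`. If the Python engine also fails, its error propagates unchanged.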

## How routing changes over time

The profiles you load were fitted on the benchmark used to train the
weights. Your production traffic will differ: maybe your users ask more
code questions than the benchmark assumed. Two mechanisms adapt the router:

1. **`blend_with_profiles`** — periodically combine the benchmark's
   per-model error profile with the one observed in production:
   `Ψ_new = α · Ψ_prod + (1 - α) · Ψ_benchmark`. The `feedback` module
   has utilities for this. See the "self-learning" section of the
   [basic\_router\_to\_self\_learning notebook](https://github.com/OpenTracy/opentracy/blob/main/notebooks/basic_router_to_self_learning.ipynb).

2. **Alias swapping** — when a distilled student is ready for a cluster
   you've worked on, you add it to the registry, point the alias at it,
   and from that moment the router can select it for prompts in that
   cluster. See [Distillation](/concepts/distillation).
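
The blend formula from the first mechanism is a per-cluster weighted average. Here is a minimal sketch of the arithmetic on a toy 3-cluster profile; `blend_profiles` is a hypothetical helper, not the signature of the library's `blend_with_profiles`, and the numbers are invented.

```python
def blend_profiles(psi_benchmark, psi_prod, alpha):
    """Return α·Ψ_prod + (1 - α)·Ψ_benchmark, cluster by cluster."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if len(psi_benchmark) != len(psi_prod):
        raise ValueError("profiles must cover the same clusters")
    return [
        alpha * prod + (1.0 - alpha) * bench
        for prod, bench in zip(psi_prod, psi_benchmark)
    ]

# Hypothetical error rates for one model over 3 clusters.
psi_benchmark = [0.10, 0.30, 0.05]
psi_prod = [0.20, 0.10, 0.05]

# α = 0.25 keeps the profile mostly anchored to the benchmark.
psi_new = blend_profiles(psi_benchmark, psi_prod, alpha=0.25)
```

A small α moves the profile slowly toward production behavior, which is the safer choice while you have few production observations per cluster.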

## When auto-routing isn't enough

For two shapes of problem, auto-routing alone won't cut it:

* **You have hard policy constraints.** "Never route X to Anthropic." In
  that case combine with a `Router` (explicit, rule-based) — the logical
  alias can still be semantic, but the candidates are constrained.
* **Your prompts don't cluster well.** If everything you do is one narrow
  domain that doesn't match any of the pre-trained clusters, you'll get
  mediocre routing decisions. Solution: retrain the weights on your
  traffic (`opentracy.training.full_training_pipeline`), or fall
  back to `Router` with hand-picked deployments.

## Next

<CardGroup cols={2}>
  <Card title="Distillation" icon="wand-magic-sparkles" href="/concepts/distillation">
    The counterpart — how the student models that auto-routing swaps in get built.
  </Card>

  <Card title="Router reference" icon="code" href="/api-reference/load-router">
    `load_router` parameters, `.route()` / `.route_batch()` signatures, full `RoutingDecision` schema.
  </Card>
</CardGroup>
