# Datasets

> How curated traces become the training-ready data that feeds distillation and evaluation

A **dataset** is a named, versioned bundle of rows shaped for training or
evaluation. Each row is a `(prompt, response, metadata)` tuple. Datasets
are how you go from "a million raw traces" to "500 high-quality examples
of ticket classification I want to distill a model on".

Datasets are the artifact that makes everything downstream work. You
can't distill without one. You can't run a rigorous evaluation without one.

## Three ways a dataset is born

### 1. Auto-clustered from traces (the fast path)

The engine clusters your traces by prompt embedding similarity, names each
cluster with an LLM, and offers each named cluster as a candidate dataset:

```
cluster 12 → "SQL generation"           (2,341 traces)
cluster 34 → "Support ticket triage"    (8,902 traces)
cluster 51 → "Code review feedback"     (410 traces)
```

In the UI you click **Promote to dataset**, give it a name, and pick the
rows to include. This is the shortest path from running traffic to
training data.
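To make the grouping step concrete, here is a minimal sketch of clustering by embedding similarity. It assumes a simple greedy scheme with a cosine threshold; the engine's actual algorithm is not described here, and `greedy_cluster`, `cosine`, and the `threshold` value are hypothetical names chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def greedy_cluster(embeddings, threshold=0.9):
    """Assign each embedding to the first cluster whose seed is similar enough.

    Illustrative only: not the engine's clustering algorithm.
    """
    seeds = []        # one representative vector per cluster
    assignments = []  # cluster index for each input embedding
    for vec in embeddings:
        for idx, seed in enumerate(seeds):
            if cosine(vec, seed) >= threshold:
                assignments.append(idx)
                break
        else:
            # No existing cluster is close enough, so start a new one.
            seeds.append(vec)
            assignments.append(len(seeds) - 1)
    return assignments


# Toy 2-D vectors standing in for prompt embeddings.
toy_embeddings = [(1.0, 0.0), (0.99, 0.05), (0.0, 1.0), (0.02, 0.98)]
cluster_ids = greedy_cluster(toy_embeddings)
print(cluster_ids)
```

Real prompt embeddings have hundreds of dimensions, and a production system would more likely use a library clustering routine, but the idea is the same: prompts that land close together in embedding space end up in the same candidate dataset.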

### 2. Uploaded from a file

If you already have labeled data — a JSONL, a CSV, a Hugging Face dataset —
you can upload it directly:

```python theme={null}
from opentracy import Distiller

d = Distiller(base_url="http://localhost:8000")
dataset = d.upload_dataset(
    name="invoice-extraction-v1",
    path="./data/invoices.jsonl",
    # jsonl rows: {"prompt": "...", "response": "...", "metadata": {...}}
)
```
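Before uploading, it can help to check that every JSONL line matches the row shape shown in the comment above. The following is a minimal local sketch using only the standard library; `validate_jsonl_lines` is a hypothetical helper, not part of the `opentracy` package, and treating all three keys as required is an assumption.

```python
import json

# Assumed required keys, taken from the row shape in the upload example.
REQUIRED_KEYS = ("prompt", "response", "metadata")


def validate_jsonl_lines(lines):
    """Return (line_number, problem) pairs for lines that do not fit the shape."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(row, dict):
            problems.append((lineno, "row is not a JSON object"))
            continue
        missing = [key for key in REQUIRED_KEYS if key not in row]
        if missing:
            problems.append((lineno, f"missing keys: {', '.join(missing)}"))
    return problems


sample_lines = [
    '{"prompt": "Extract the total.", "response": "42.00", "metadata": {}}',
    '{"prompt": "Extract the date."}',
    "not json",
]
problems = validate_jsonl_lines(sample_lines)
for lineno, problem in problems:
    print(f"line {lineno}: {problem}")
```

To check a real file, pass an open file handle instead of a list, for example `validate_jsonl_lines(open("./data/invoices.jsonl", encoding="utf-8"))`.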

### 3. Generated from prompts (synthetic)

Start with a list of prompts you care about. The engine asks a teacher
model to generate N responses per prompt, scores them with a judge model,
and keeps the highest-scoring ones. Useful when you have prompt ideas but
no labeled responses yet.

```python theme={null}
d.generate_dataset(
    name="python-docstrings-v1",
    prompts_path="./data/prompt_seeds.txt",
    teacher="openai/gpt-4o",
    n_samples=4,
    judge="openai/gpt-4o-mini",
)
```
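The generate-judge-keep loop behind this call can be sketched in a few lines. This is an illustration of the general best-of-N pattern, not the engine's implementation: `best_of_n`, `fake_teacher`, `fake_judge`, and the `keep` parameter are all hypothetical, and the two stand-in functions replace real model calls.

```python
def best_of_n(prompts, generate, judge, n_samples=4, keep=1):
    """For each prompt, sample n responses, score them, keep the best ones."""
    rows = []
    for prompt in prompts:
        candidates = [generate(prompt, i) for i in range(n_samples)]
        scored = sorted(
            ((judge(prompt, candidate), candidate) for candidate in candidates),
            key=lambda pair: pair[0],
            reverse=True,  # highest judge score first
        )
        for score, response in scored[:keep]:
            rows.append({
                "prompt": prompt,
                "response": response,
                "metadata": {"judge_score": score},
            })
    return rows


# Stand-ins for the teacher and judge models (hypothetical, no network calls).
def fake_teacher(prompt, sample_index):
    return f"{prompt} :: draft {sample_index}"


def fake_judge(prompt, response):
    # Toy scoring rule: later drafts score higher.
    return int(response.rsplit(" ", 1)[1]) / 10


synthetic_rows = best_of_n(
    ["Write a docstring for add(a, b)"], fake_teacher, fake_judge
)
print(synthetic_rows[0]["response"])
```

Using a cheaper model as the judge than as the teacher, as in the call above, keeps the cost of scoring N candidates per prompt manageable.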

## What a dataset row looks like

```json theme={null}
{
  "row_id": "r_0000123",
  "prompt": "Classify this ticket into one of: billing, technical, feature_request. Ticket: Where can I download my invoice for March?",
  "response": "billing",
  "metadata": {
    "source_trace_id": "t_af91",
    "teacher_model": "openai/gpt-4o",
    "judge_score": 0.92,
    "cluster_id": 34,
    "tags": ["ticket_classifier", "reviewed"]
  }
}
```

Rows that came from traces retain a `source_trace_id`, so you can always
trace a row back to the original request.
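If you handle rows in your own code, a small typed wrapper makes the shape explicit. This is a sketch under the assumption that rows arrive as plain dictionaries like the JSON above; `DatasetRow` is a hypothetical class, not a type exported by `opentracy`.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class DatasetRow:
    """Hypothetical local representation of one dataset row."""
    row_id: str
    prompt: str
    response: str
    metadata: dict[str, Any] = field(default_factory=dict)

    @property
    def source_trace_id(self) -> Optional[str]:
        # Only rows promoted from traces are expected to carry this key.
        return self.metadata.get("source_trace_id")


traced_row = DatasetRow(
    row_id="r_0000123",
    prompt="Classify this ticket ...",
    response="billing",
    metadata={"source_trace_id": "t_af91", "judge_score": 0.92},
)
uploaded_row = DatasetRow(row_id="r_0000124", prompt="...", response="technical")

print(traced_row.source_trace_id)    # id of the originating trace
print(uploaded_row.source_trace_id)  # None: no trace behind this row
```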

## Curation: filtering the bad rows

Raw traces have noise. A good dataset is **curated**: you keep the useful
rows and drop the rest. The engine ships a curation pipeline with three
stages:

<Steps>
  <Step title="Judge">
    An LLM judge (configurable — defaults to a cheap model like
    `openai/gpt-4o-mini`) scores each row on helpfulness, relevance, and
    format. Rows below a threshold are flagged.
  </Step>

  <Step title="Filter">
    Apply rules: drop rows with errors, drop rows outside the target
    cluster, drop rows above a length limit, drop rows with flagged PII.
  </Step>

  <Step title="Review">
    Human review in the UI for a small slice of borderline rows, usually
    50–100. It's optional, but cheap insurance for your first distillation.
  </Step>
</Steps>

The stages are implemented in `opentracy.distillation.curation`, and you
can run each one standalone if you're building a pipeline manually. See the
[API reference](/api-reference/distiller#curate) for details.
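The three stages can be sketched as one pass over the rows. This is an illustration of the logic only, not the `opentracy.distillation.curation` API: `curate`, `make_row`, the metadata keys `error` and `pii_flagged`, and every default value below are assumptions.

```python
def curate(rows, judge, threshold=0.7, review_band=0.1,
           max_chars=4000, target_cluster=None):
    """Split rows into (kept, needs_review, dropped). Illustrative only."""
    kept, needs_review, dropped = [], [], []
    for row in rows:
        meta = row.get("metadata", {})

        # Stage 1: judge. Score the row; anything under the threshold is flagged.
        score = judge(row)
        flagged = score < threshold

        # Stage 2: filter. Rule-based drops that apply regardless of score.
        too_long = len(row["prompt"]) + len(row["response"]) > max_chars
        off_cluster = (target_cluster is not None
                       and meta.get("cluster_id") != target_cluster)
        if meta.get("error") or meta.get("pii_flagged") or too_long or off_cluster:
            dropped.append(row)
        # Stage 3: review. Scores near the threshold go to a human queue.
        elif abs(score - threshold) <= review_band:
            needs_review.append(row)
        elif flagged:
            dropped.append(row)
        else:
            kept.append(row)
    return kept, needs_review, dropped


def make_row(row_id, score, **meta):
    """Build a toy row with a precomputed judge score."""
    return {"row_id": row_id, "prompt": "p", "response": "r",
            "metadata": {"judge_score": score, "cluster_id": 34, **meta}}


toy_rows = [
    make_row("a", 0.92),                 # clearly good
    make_row("b", 0.72),                 # borderline
    make_row("c", 0.30),                 # clearly bad
    make_row("d", 0.95, error=True),     # errored trace
    make_row("e", 0.90, cluster_id=12),  # wrong cluster
]
kept, needs_review, dropped = curate(
    toy_rows,
    judge=lambda row: row["metadata"]["judge_score"],  # stand-in for an LLM judge
    target_cluster=34,
)
print(len(kept), len(needs_review), len(dropped))
```

Sending only the rows near the threshold to human review keeps the review queue small, which matches the idea of reviewing a borderline slice rather than the whole dataset.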

## Versioning

Datasets are immutable once frozen. If you curate more rows or change the
judge, you get a new version:

```
invoice-extraction-v1  (frozen, 847 rows)
invoice-extraction-v2  (frozen, 1,240 rows, stricter judge threshold)
invoice-extraction-v3  (active, 1,240 rows + 312 newly reviewed)
```

Every distillation job records the dataset version it trained on, so you
can always reproduce a result or compare student models trained on
different data.
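The freeze-then-fork behavior can be modeled in a few lines. This is a conceptual sketch, not how the engine stores datasets: `DatasetVersion`, its methods, and the content fingerprint are hypothetical and exist only to show why an immutable version makes a training run reproducible.

```python
import hashlib
import json


class DatasetVersion:
    """Hypothetical model of a dataset version that locks once frozen."""

    def __init__(self, name, version, rows=()):
        self.name = name
        self.version = version
        self._rows = list(rows)  # copy so versions never share a list
        self.frozen = False

    @property
    def label(self):
        return f"{self.name}-v{self.version}"

    def add_rows(self, rows):
        if self.frozen:
            raise RuntimeError(f"{self.label} is frozen; create a new version")
        self._rows.extend(rows)

    def freeze(self):
        """Lock the version and return a fingerprint of its contents."""
        self.frozen = True
        return self.fingerprint()

    def next_version(self):
        """Start an editable successor seeded with the current rows."""
        return DatasetVersion(self.name, self.version + 1, self._rows)

    def fingerprint(self):
        payload = json.dumps(self._rows, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


v1 = DatasetVersion("invoice-extraction", 1, [{"prompt": "p1", "response": "r1"}])
v1_fingerprint = v1.freeze()

v2 = v1.next_version()
v2.add_rows([{"prompt": "p2", "response": "r2"}])

# What a training job might record so the run can be reproduced later.
job_record = {"dataset": v1.label, "fingerprint": v1_fingerprint}
print(job_record["dataset"], v2.label)
```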

## Two things you do with a dataset

<CardGroup cols={2}>
  <Card title="Distill" icon="wand-magic-sparkles" href="/concepts/distillation">
    Hand the dataset to the distillation pipeline. A student model gets
    fine-tuned on the teacher's labels and comes out as a LoRA adapter
    you can serve.
  </Card>

  <Card title="Evaluate" icon="magnifying-glass-chart">
    Pick a dataset as the benchmark. Run any model against it and compare
    accuracy, cost per row, and latency — including models you're
    considering swapping in via an alias.
  </Card>
</CardGroup>
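To illustrate the evaluation side, here is a minimal sketch that runs a model over benchmark rows and reports accuracy, cost per row, and latency. It assumes exact-match scoring and a model callable that returns its own cost; `evaluate`, `keyword_model`, and the `text` / `cost_usd` fields are hypothetical and not part of the `opentracy` API.

```python
import time


def evaluate(rows, model):
    """Run `model` over benchmark rows and summarize the results.

    `model` is any callable taking a prompt and returning
    {"text": str, "cost_usd": float} (an assumed interface).
    """
    if not rows:
        raise ValueError("benchmark dataset is empty")
    correct = 0
    total_cost = 0.0
    latencies = []
    for row in rows:
        start = time.perf_counter()
        result = model(row["prompt"])
        latencies.append(time.perf_counter() - start)
        total_cost += result["cost_usd"]
        # Exact match suits classification-style tasks; other tasks need a judge.
        if result["text"].strip() == row["response"].strip():
            correct += 1
    n = len(rows)
    return {
        "accuracy": correct / n,
        "cost_per_row": total_cost / n,
        "mean_latency_s": sum(latencies) / n,
    }


benchmark = [
    {"prompt": "Where can I download my invoice?", "response": "billing"},
    {"prompt": "The app crashes on login", "response": "technical"},
]


def keyword_model(prompt):
    """Toy classifier standing in for a real model call."""
    label = "billing" if "invoice" in prompt else "feature_request"
    return {"text": label, "cost_usd": 0.001}


report = evaluate(benchmark, keyword_model)
print(report["accuracy"], report["cost_per_row"])
```

Running the same function with two different model callables against the same frozen dataset gives directly comparable numbers.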

## Common mistakes

<Warning>
  **Don't distill with fewer than 200 rows.** Below that threshold the student tends
  to overfit and doesn't generalize beyond the training prompts. 500–2000
  is the sweet spot for most tasks.
</Warning>

<Warning>
  **Don't mix unrelated clusters in one dataset.** If you put
  "SQL generation" and "customer emails" in the same dataset, the student
  learns neither well. One dataset = one coherent task.
</Warning>

<Warning>
  **Don't skip curation on the first run.** Raw traces include failures,
  refusals, and truncated outputs. Let the judge drop those before
  training; otherwise the student learns the noise.
</Warning>
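The three warnings above can be turned into a quick pre-flight check before you start a distillation job. This is a sketch, not a built-in feature: `preflight` is a hypothetical helper, the 200-row minimum comes from the first warning, and reading `cluster_id` and `judge_score` from row metadata assumes rows follow the shape shown earlier.

```python
MIN_ROWS = 200  # lower bound from the first warning above


def preflight(rows, min_rows=MIN_ROWS):
    """Return human-readable warnings for the three common mistakes."""
    warnings = []

    # Mistake 1: too few rows to generalize.
    if len(rows) < min_rows:
        warnings.append(
            f"only {len(rows)} rows; fewer than {min_rows} risks overfitting"
        )

    # Mistake 2: rows drawn from more than one cluster.
    clusters = {row.get("metadata", {}).get("cluster_id") for row in rows}
    clusters.discard(None)  # uploaded rows may have no cluster at all
    if len(clusters) > 1:
        warnings.append(f"rows span {len(clusters)} clusters: {sorted(clusters)}")

    # Mistake 3: rows that never went through the judge.
    unjudged = sum(
        1 for row in rows if "judge_score" not in row.get("metadata", {})
    )
    if unjudged:
        warnings.append(f"{unjudged} rows have no judge_score; was curation skipped?")

    return warnings


good_row = {"prompt": "p", "response": "r",
            "metadata": {"cluster_id": 34, "judge_score": 0.9}}
stray_row = {"prompt": "p", "response": "r", "metadata": {"cluster_id": 12}}

risky_warnings = preflight([good_row] * 150 + [stray_row])
clean_warnings = preflight([good_row] * 500)
for message in risky_warnings:
    print("WARNING:", message)
```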
