
# CI Regression Detection

> Automatically catch quality regressions on every pull request using HoneyHive experiment comparison

## Introduction

Running evaluations in CI means every pull request gets a quality gate: if a code change degrades a metric beyond your threshold, the build fails before the change ships.

This guide covers:

* **Setting a baseline run** - tie an evaluation run to your pipeline ID so you can compare against it
* **Detecting regressions** - call `compare_runs()` on every PR and exit non-zero on degradation
* **GitHub Actions YAML** - a copy-paste workflow that automates the full loop

**Prerequisites:** You can already run evaluations with `evaluate()`. If not, start with the [Experiments introduction](/v2/evaluation/introduction) first.

***

## Worked example: evaluating a sentiment classifier

This section walks through evaluating a minimal sentiment classifier end to end: dataset, evaluators, baseline run, and regression gate. Swap in your own function and dataset - the evaluation flow only cares about the shape of the output.

### The function under test

A small classifier that tags a product review as `"positive"` or `"negative"`:

```python
# classifier.py
import os
from anthropic import Anthropic

_client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def classify_sentiment(review: str) -> str:
    response = _client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=10,
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "Classify this product review as 'positive' or 'negative'. "
                "Return one word only.\n\n"
                f"Review: {review}"
            ),
        }],
    )
    return response.content[0].text.strip().lower()
```

### The dataset

Three reviews with expected labels. A local list is enough for this example; for larger datasets see [Upload Datasets](/v2/datasets/import).

```python
# dataset.py
test_cases = [
    {
        "inputs": {"review": "Fantastic product - works exactly as advertised."},
        "ground_truth": {"expected": "positive"},
    },
    {
        "inputs": {"review": "Broke after two days. Total waste of money."},
        "ground_truth": {"expected": "negative"},
    },
    {
        "inputs": {"review": "Shipping was fast and the quality is excellent."},
        "ground_truth": {"expected": "positive"},
    },
]
```
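
Since the evaluation flow depends on the shape of each datapoint, a quick shape check before running can catch typos early. The following is a minimal sketch: `find_malformed_cases` is a hypothetical helper, not part of the HoneyHive SDK, and the allowed labels are this example's own convention.

```python
def find_malformed_cases(cases):
    """Return human-readable problems; an empty list means the shape is fine."""
    problems = []
    for i, case in enumerate(cases):
        inputs = case.get("inputs")
        truth = case.get("ground_truth")
        # Each datapoint needs the review text the classifier will read.
        if not isinstance(inputs, dict) or "review" not in inputs:
            problems.append(f"case {i}: missing inputs['review']")
        # Labels are compared case-sensitively by exact_match, so enforce lowercase.
        if not isinstance(truth, dict) or truth.get("expected") not in {"positive", "negative"}:
            problems.append(f"case {i}: ground_truth['expected'] must be 'positive' or 'negative'")
    return problems


# Illustrative data: the second case has a capitalized label on purpose.
sample_cases = [
    {"inputs": {"review": "Works great."}, "ground_truth": {"expected": "positive"}},
    {"inputs": {"review": "Stopped working."}, "ground_truth": {"expected": "Negative"}},
]
problems = find_malformed_cases(sample_cases)
print(problems)
```

Running a check like this at the top of `baseline.py` fails fast on a malformed dataset instead of producing misleading scores.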

### Evaluator 1: `exact_match`

Programmatic evaluator. Returns `1.0` when the prediction matches the expected label, `0.0` otherwise. Follows the [evaluator signature](/v2/evaluators/client_side) `(outputs, inputs, ground_truth)`:

```python
# evaluators.py
def exact_match(outputs, inputs, ground_truth):
    return 1.0 if outputs.strip().lower() == ground_truth["expected"] else 0.0
```
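
Because evaluators are plain functions, they can be sanity-checked locally before any HoneyHive run. A minimal sketch follows; the sample outputs are made up for illustration and are not real model responses.

```python
def exact_match(outputs, inputs, ground_truth):
    return 1.0 if outputs.strip().lower() == ground_truth["expected"] else 0.0


# Illustrative values only.
sample_inputs = {"review": "Broke after two days. Total waste of money."}
sample_truth = {"expected": "negative"}

scores = {
    "clean": exact_match("negative", sample_inputs, sample_truth),
    "padded": exact_match("  Negative\n", sample_inputs, sample_truth),  # whitespace and case are normalized
    "wrong": exact_match("positive", sample_inputs, sample_truth),
    "verbose": exact_match("negative.", sample_inputs, sample_truth),    # trailing punctuation is not stripped
}
print(scores)
```

The last case is worth noting: a reply such as `"negative."` scores `0.0`, so the classifier prompt's "one word only" instruction matters for this metric.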

### Evaluator 2: `llm_judge`

LLM-as-judge evaluator for a softer correctness signal - useful when the model returns a valid label that still disagrees with the expected one. Setting temperature to 0 reduces run-to-run variation in scores, though it does not guarantee identical outputs:

```python
# evaluators.py (continued)
import json
import os
from anthropic import Anthropic

_judge = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def llm_judge(outputs, inputs, ground_truth):
    prompt = (
        "Score whether the predicted sentiment matches the review's true sentiment. "
        'Return JSON only: {"score": 0 or 1}.\n\n'
        f"Review: {inputs['review']}\n"
        f"Predicted: {outputs}\n"
        f"Expected: {ground_truth['expected']}"
    )
    response = _judge.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=50,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.content[0].text
    start, end = text.find("{"), text.rfind("}")
    return float(json.loads(text[start : end + 1])["score"])
```
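
The judge's reply is parsed by slicing between the first `{` and the last `}`, which fails with an opaque JSON error if the model returns no JSON at all. One possible hardening is sketched below; `parse_judge_score` is a hypothetical helper introduced here for illustration, and the choice to raise rather than return a default score is an assumption about what you want in CI.

```python
import json


def parse_judge_score(text: str) -> float:
    """Extract {"score": 0 or 1} from a judge reply; raise ValueError otherwise."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError(f"no JSON object in judge reply: {text!r}")
    try:
        score = json.loads(text[start : end + 1])["score"]
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        raise ValueError(f"malformed judge reply: {text!r}") from exc
    # Reject anything other than 0 or 1 (for example the string "1" or 0.5).
    if score not in (0, 1):
        raise ValueError(f"score out of range: {score!r}")
    return float(score)


print(parse_judge_score('Here you go: {"score": 1}'))  # 1.0
```

Raising makes a broken judge reply visible instead of silently counting it as a wrong answer. How `evaluate()` handles an exception raised inside an evaluator is not covered in this guide, so check that behavior before relying on it.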

### Wiring it into `evaluate()`

Feed the function, dataset, and evaluators into `evaluate()`. The `run_id` is derived directly from the commit SHA (see [section 1](#1-setting-a-baseline-run)), so the same commit always maps to the same `run_id`:

```python
# baseline.py
import os
from honeyhive import evaluate
from classifier import classify_sentiment
from dataset import test_cases
from evaluators import exact_match, llm_judge
from run_id_utils import run_id_from_sha

sha = os.environ["GITHUB_SHA"]
run_id = run_id_from_sha(sha)

def classifier_function(datapoint):
    return classify_sentiment(datapoint["inputs"]["review"])

evaluate(
    function=classifier_function,
    dataset=test_cases,
    evaluators=[exact_match, llm_judge],
    project=os.environ["HH_PROJECT"],
    name=f"sentiment-run-{sha[:7]}",
    run_id=run_id,
    api_key=os.environ["HH_API_KEY"],
)
```

Once this script runs inside the workflow from [section 3](#3-github-actions-workflow), every PR is compared against its baseline on classifier quality. A regression in `exact_match` or `llm_judge` means the classifier has started mislabeling reviews it previously handled correctly.

**References**

* [Evaluator templates](/v2/evaluators/evaluator-templates) - patterns for programmatic and LLM-judge evaluators.
* [honeyhive python-sdk](https://github.com/honeyhiveai/python-sdk) - install and auth reference.

***

## 1. Setting a Baseline Run

A baseline run is an ordinary `evaluate()` call tagged with a `run_id` that your PR jobs can later reference. HoneyHive validates `run_id` as a strict UUIDv4, so plain strings like `"ci-abc123"` are rejected. Derive the `run_id` deterministically from the commit SHA, forcing the UUIDv4 version and variant bits so the result stays valid:

```python
# run_id_utils.py
import uuid

def run_id_from_sha(git_sha: str) -> str:
    """Deterministically derive a valid UUIDv4 string from a git SHA.
    Same SHA -> same run_id, everywhere, no state needed."""
    b = bytearray(bytes.fromhex(git_sha)[:16])
    b[6] = (b[6] & 0x0f) | 0x40   # version = 4
    b[8] = (b[8] & 0x3f) | 0x80   # RFC 4122 variant
    return str(uuid.UUID(bytes=bytes(b)))
```
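
A quick local check of the helper, using a placeholder SHA of forty `a` characters rather than a real commit, shows that the output is stable and parses as a version-4, RFC 4122 UUID. It also shows one caveat that follows from the implementation: the helper needs at least 16 bytes of hex, so abbreviated SHAs are rejected.

```python
import uuid


def run_id_from_sha(git_sha: str) -> str:
    b = bytearray(bytes.fromhex(git_sha)[:16])
    b[6] = (b[6] & 0x0f) | 0x40   # version = 4
    b[8] = (b[8] & 0x3f) | 0x80   # RFC 4122 variant
    return str(uuid.UUID(bytes=bytes(b)))


sha = "a" * 40  # placeholder, not a real commit
first = run_id_from_sha(sha)
second = run_id_from_sha(sha)
parsed = uuid.UUID(first)

print(first)  # aaaaaaaa-aaaa-4aaa-aaaa-aaaaaaaaaaaa
print(first == second, parsed.version, parsed.variant == uuid.RFC_4122)  # True 4 True

# An abbreviated SHA has too few bytes to fill a UUID.
try:
    run_id_from_sha(sha[:8])
    short_ok = True
except (ValueError, IndexError):
    short_ok = False
print(short_ok)  # False
```

In practice this means passing the full SHA from the CI environment, not the seven-character form used in run names.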

Because `run_id` is a pure function of the SHA, the baseline for any commit is the same across retries, and the PR job can reconstruct its base commit's `run_id` without any state-passing - no cache, no artifact, no metadata lookup.

```python
# baseline.py - runs once per commit (push to main AND every PR)
import os
from honeyhive import evaluate
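from run_id_utils import run_id_from_sha
# Placeholders used below: your_pipeline_function, test_cases, accuracy_evaluator,
# and latency_evaluator. Import your own function, dataset, and evaluators here.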

sha = os.environ["GITHUB_SHA"]
run_id = run_id_from_sha(sha)

result = evaluate(
    function=your_pipeline_function,      # the function under test
    dataset=test_cases,                   # local list, or use dataset_id= for managed datasets
    evaluators=[accuracy_evaluator, latency_evaluator],
    project=os.environ["HH_PROJECT"],
    name=f"run-{sha[:7]}",
    run_id=run_id,
    api_key=os.environ["HH_API_KEY"],
)
```

For full `evaluate()` options, see the [Experiments introduction](/v2/evaluation/introduction).

***

## 2. Detecting Regressions

Once the PR's evaluation has run (via the same `baseline.py` script above), call `compare_runs()` against the baseline to detect regressions. The PR workflow passes the PR head SHA and the base SHA in as `PR_SHA` and `BASELINE_SHA`; the script derives both `run_id`s directly from them.

```python
# regression_check.py - run this on every pull request, after baseline.py
import os
import sys
from honeyhive import HoneyHive
from honeyhive.experiments import compare_runs
from run_id_utils import run_id_from_sha

pr_run_id = run_id_from_sha(os.environ["PR_SHA"])
baseline_run_id = run_id_from_sha(os.environ["BASELINE_SHA"])

client = HoneyHive(api_key=os.environ["HH_API_KEY"])

comparison = compare_runs(
    client=client,
    new_run_id=pr_run_id,
    old_run_id=baseline_run_id,
    project_id=os.environ["HH_PROJECT"],
)

degraded = comparison.list_degraded_metrics()

if degraded:
    print("Regression detected:")
    for metric_name in degraded:
        delta = comparison.get_metric_delta(metric_name)
        if delta:
            old = delta.get("old_aggregate") or 0
            new = delta.get("new_aggregate") or 0
            print(f"  {metric_name}: {old:.4f} -> {new:.4f} ({new - old:+.4f})")
    sys.exit(1)

improved = comparison.list_improved_metrics()
if improved:
    print("Improvements detected:")
    for metric_name in improved:
        delta = comparison.get_metric_delta(metric_name)
        if delta:
            old = delta.get("old_aggregate") or 0
            new = delta.get("new_aggregate") or 0
            print(f"  {metric_name}: {old:.4f} -> {new:.4f} ({new - old:+.4f})")

if not improved:
    print("All metrics stable.")
print("No regression detected.")
sys.exit(0)
```

### Key methods on `RunComparisonResult`

| Method                    | Returns     | Description                                                                                                              |
| ------------------------- | ----------- | ------------------------------------------------------------------------------------------------------------------------ |
| `list_degraded_metrics()` | `list[str]` | Metric names where at least one datapoint degraded                                                                       |
| `list_improved_metrics()` | `list[str]` | Metric names where at least one datapoint improved                                                                       |
| `get_metric_delta(name)`  | `dict`      | Delta dict with `old_aggregate`, `new_aggregate`, `improved_count`, `degraded_count`, `improved` (IDs), `degraded` (IDs) |

Calling `sys.exit(1)` on any degraded metric is the strictest gate: per the table above, a single degraded datapoint is enough to fail the build. For per-datapoint breakdowns and a richer comparison workflow, see [Comparing Experiments](/v2/evaluation/comparing_evals).

<Tip>
  The snippet above gates on **every** metric. In production, teams typically gate only on their most critical or sensitive metrics (e.g. `accuracy`, `hallucination_rate`) and treat the rest as informational. Filter `list_degraded_metrics()` to a critical subset:

  ```python
  CRITICAL_METRICS = {"accuracy", "hallucination_rate"}

  degraded = comparison.list_degraded_metrics()
  critical_degraded = [m for m in degraded if m in CRITICAL_METRICS]

  if critical_degraded:
      print(f"Critical regression in: {critical_degraded}")
      sys.exit(1)
  ```
</Tip>
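
To fail only when a metric drops by more than a tolerance, the aggregates in each delta dict can be compared directly. The sketch below is one way to do that, under two assumptions: higher values are better for every gated metric, and `deltas` is a plain dict built by calling `comparison.get_metric_delta(name)` for each metric of interest. The function name and the tolerance values are hypothetical.

```python
def metrics_beyond_threshold(deltas, max_drop):
    """Return {metric: drop} for metrics whose aggregate fell by more than allowed."""
    failures = {}
    for name, allowed in max_drop.items():
        delta = deltas.get(name)
        if not delta:
            continue  # metric missing from the comparison; nothing to gate on
        old = delta.get("old_aggregate") or 0
        new = delta.get("new_aggregate") or 0
        drop = old - new  # positive when the PR run scored lower
        if drop > allowed:
            failures[name] = drop
    return failures


# Made-up aggregates for illustration.
sample_deltas = {
    "exact_match": {"old_aggregate": 1.0, "new_aggregate": 0.5},
    "llm_judge": {"old_aggregate": 1.0, "new_aggregate": 1.0},
}
failures = metrics_beyond_threshold(sample_deltas, {"exact_match": 0.25, "llm_judge": 0.25})
print(failures)  # {'exact_match': 0.5}
```

In `regression_check.py`, a non-empty result would replace the `if degraded:` condition before calling `sys.exit(1)`. For lower-is-better metrics such as latency, the sign of `drop` would need to be flipped.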

***

## 3. GitHub Actions Workflow

The workflow below has two jobs:

1. **`run-evaluation`** - runs on push to `main` (creating the baseline run) and on every PR (creating the PR run)
2. **`detect-regression`** - runs only on PRs, compares the two runs and posts a comment

```yaml
# .github/workflows/eval-regression.yml
name: Evaluation Regression Check

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  HH_API_KEY: ${{ secrets.HH_API_KEY }}
  HH_PROJECT: ${{ vars.HH_PROJECT }}

jobs:
  run-evaluation:
    name: Run Evaluation
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: |
          pip install honeyhive
          pip install -r requirements.txt

      - name: Run evaluation
        env:
          # On push events this is the pushed commit; on pull_request events
          # github.sha is the temporary merge commit, so use the PR head SHA.
          GITHUB_SHA: ${{ github.event.pull_request.head.sha || github.sha }}
        run: python scripts/baseline.py

  detect-regression:
    name: Detect Regression
    runs-on: ubuntu-latest
    needs: run-evaluation
    if: github.event_name == 'pull_request'

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: |
          pip install honeyhive
          pip install -r requirements.txt

      - name: Run regression check
        id: regression
        env:
          PR_SHA: ${{ github.event.pull_request.head.sha }}
          BASELINE_SHA: ${{ github.event.pull_request.base.sha }}
        run: |
          set +e
          python scripts/regression_check.py 2>&1 | tee regression_output.txt
          echo "exit_code=${PIPESTATUS[0]}" >> "$GITHUB_OUTPUT"

      - name: Post PR comment
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const output = fs.readFileSync('regression_output.txt', 'utf8');
            const exitCode = '${{ steps.regression.outputs.exit_code }}';
            const status = exitCode === '0' ? '✅ No regression' : '❌ Regression detected';
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `**HoneyHive Evaluation - ${status}**\n\n\`\`\`\n${output}\n\`\`\``
            });

      - name: Fail on regression
        if: steps.regression.outputs.exit_code != '0'
        run: exit 1
```

**How it works:**

* Both jobs derive `run_id` from the commit SHA via the same `run_id_from_sha()` helper, so the same commit always produces the same `run_id`. No cache, no artifact, no metadata passing.
* On `push` to `main`: `baseline.py` runs on the pushed commit. That commit's `run_id` becomes the baseline any future PR branching off it will look up.
* On `pull_request`: `run-evaluation` produces the PR head's run, and `detect-regression` derives the baseline `run_id` from `github.event.pull_request.base.sha` and compares.
* `github.event.pull_request.head.sha` is used instead of `github.sha` on PR events because `github.sha` is GitHub's temporary merge commit, not the PR head.
* The PR comment posts the full output whether the check passes or fails, so reviewers see exactly which metrics changed.
* `baseline.py` is the script that runs the evaluation - for a concrete `function`, `dataset`, and `evaluators` list, see the [worked example above](#worked-example-evaluating-a-sentiment-classifier).
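
The SHA selection described above can also be done inside the Python scripts by reading the event payload that GitHub Actions writes to the file named in `GITHUB_EVENT_PATH`. The following is a minimal sketch: `load_event_payload` and `select_shas` are hypothetical helpers, and the sample payload is trimmed to the two fields used here.

```python
import json
import os


def load_event_payload() -> dict:
    """Read the webhook payload that GitHub Actions stores at GITHUB_EVENT_PATH."""
    path = os.environ.get("GITHUB_EVENT_PATH")
    if not path:
        return {}
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def select_shas(event_name: str, event: dict, github_sha: str):
    """Return (eval_sha, baseline_sha); baseline_sha is None outside pull requests."""
    if event_name == "pull_request":
        pr = event["pull_request"]
        # Use the PR head, not the temporary merge commit in github_sha.
        return pr["head"]["sha"], pr["base"]["sha"]
    return github_sha, None


# Hypothetical payload with placeholder SHAs.
sample_event = {"pull_request": {"head": {"sha": "b" * 40}, "base": {"sha": "c" * 40}}}
pr_shas = select_shas("pull_request", sample_event, "d" * 40)
push_shas = select_shas("push", {}, "d" * 40)
print(pr_shas[0][:7], pr_shas[1][:7], push_shas[0][:7], push_shas[1])  # bbbbbbb ccccccc ddddddd None
```

In CI the call would look like `select_shas(os.environ["GITHUB_EVENT_NAME"], load_event_payload(), os.environ["GITHUB_SHA"])`, which keeps the head/base logic in one place instead of splitting it between the workflow file and the scripts.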

***

## REST API (Non-Python CI)

If your CI doesn't use Python, you can drive the same comparison via the REST API. Start a run, wait for it to complete, then call the comparison endpoint.

For the full REST flow covering run creation and event logging, see [Experiments via API](/v2/evaluation/via-api).

The run comparison endpoint:

```bash
curl -X GET "https://api.dp1.us.honeyhive.ai/v1/runs/${NEW_RUN_ID}/compare-with/${BASELINE_RUN_ID}" \
  -H "Authorization: Bearer $HH_API_KEY"
```

The response contains a `metrics` array where each entry includes `metric_name`, `old_aggregate`, `new_aggregate`, `improved_count`, and `degraded_count` - the same data surface as `RunComparisonResult`.
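
The gating decision over that response is a small amount of JSON handling. It is sketched here in Python only to stay consistent with the rest of this guide; the same logic ports to any language with a JSON parser. The sample payload is hypothetical and contains only the fields named above, and `summarize_degraded` is an illustrative helper, not an SDK function.

```python
def summarize_degraded(payload: dict) -> list[str]:
    """Return one formatted line per metric with at least one degraded datapoint."""
    lines = []
    for metric in payload.get("metrics", []):
        if (metric.get("degraded_count") or 0) > 0:
            old = metric.get("old_aggregate") or 0
            new = metric.get("new_aggregate") or 0
            lines.append(f"{metric['metric_name']}: {old:.4f} -> {new:.4f} ({new - old:+.4f})")
    return lines


# Hypothetical response body with made-up numbers.
sample_payload = {
    "metrics": [
        {"metric_name": "exact_match", "old_aggregate": 1.0, "new_aggregate": 0.5,
         "improved_count": 0, "degraded_count": 1},
        {"metric_name": "llm_judge", "old_aggregate": 1.0, "new_aggregate": 1.0,
         "improved_count": 0, "degraded_count": 0},
    ]
}
degraded_lines = summarize_degraded(sample_payload)
print(degraded_lines)  # ['exact_match: 1.0000 -> 0.5000 (-0.5000)']
```

A CI step would exit non-zero when the returned list is non-empty, mirroring what `regression_check.py` does with `list_degraded_metrics()`.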

***

## Summary

| Step                    | What happens                                                                                                             |
| ----------------------- | ------------------------------------------------------------------------------------------------------------------------ |
| Push to `main`          | `evaluate()` runs with `run_id` derived from `github.sha`; this is the baseline for any PR that branches off this commit |
| PR opened or updated    | `evaluate()` runs on the PR head with `run_id` derived from `pr.head.sha`; baseline `run_id` derived from `pr.base.sha`  |
| `compare_runs()` called | Returns `RunComparisonResult` with degraded/improved metrics                                                             |
| Degraded metric found   | `sys.exit(1)` - CI fails, PR blocked                                                                                     |
| All metrics stable      | `sys.exit(0)` - CI passes, PR unblocked                                                                                  |
| PR comment posted       | Reviewers see exact metric deltas inline                                                                                 |
