
Introduction

Running evaluations in CI means every pull request gets a quality gate: if a code change degrades a metric beyond your threshold, the build fails before the change ships. This guide covers:
  • Setting a baseline run - tie an evaluation run to a run_id derived from your commit SHA so PR jobs can compare against it
  • Detecting regressions - call compare_runs() on every PR and exit non-zero on degradation
  • GitHub Actions YAML - a copy-paste workflow that automates the full loop
Prerequisites: You can already run evaluations with evaluate(). If not, start with the Experiments introduction.
compare_runs() is available in honeyhive >= 1.0.0rc21 (pre-release). Pin your dependency accordingly: honeyhive>=1.0.0rc21.

Worked example: evaluating a sentiment classifier

This section walks through evaluating a minimal sentiment classifier end to end: dataset, evaluators, baseline run, and regression gate. Swap in your own function and dataset - the evaluation flow only cares about the shape of the output.

The function under test

A small classifier that tags a product review as "positive" or "negative":
# classifier.py
import os
from anthropic import Anthropic

_client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def classify_sentiment(review: str) -> str:
    response = _client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=10,
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "Classify this product review as 'positive' or 'negative'. "
                "Return one word only.\n\n"
                f"Review: {review}"
            ),
        }],
    )
    return response.content[0].text.strip().lower()

The dataset

Three reviews with expected labels. A local list is enough for this example; for larger datasets see Upload Datasets.
# dataset.py
test_cases = [
    {
        "inputs": {"review": "Fantastic product - works exactly as advertised."},
        "ground_truth": {"expected": "positive"},
    },
    {
        "inputs": {"review": "Broke after two days. Total waste of money."},
        "ground_truth": {"expected": "negative"},
    },
    {
        "inputs": {"review": "Shipping was fast and the quality is excellent."},
        "ground_truth": {"expected": "positive"},
    },
]

Evaluator 1: exact_match

A programmatic evaluator that returns 1.0 when the prediction matches the expected label and 0.0 otherwise. It follows the evaluator signature (outputs, inputs, ground_truth):
# evaluators.py
def exact_match(outputs, inputs, ground_truth):
    return 1.0 if outputs.strip().lower() == ground_truth["expected"] else 0.0
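A quick sanity check of the normalization behavior (the evaluator is reproduced here so the snippet runs standalone; the inputs are illustrative):

```python
# exact_match strips whitespace and lowercases before comparing,
# so "Positive\n" still counts as a match for "positive".
def exact_match(outputs, inputs, ground_truth):
    return 1.0 if outputs.strip().lower() == ground_truth["expected"] else 0.0

print(exact_match("Positive\n", {}, {"expected": "positive"}))  # 1.0
print(exact_match("negative", {}, {"expected": "positive"}))    # 0.0
```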

Evaluator 2: llm_judge

LLM-as-judge evaluator for a softer correctness signal - useful when the model returns a valid label that still disagrees with the expected one. Temperature 0 keeps scores reproducible across runs:
# evaluators.py (continued)
import json
import os
from anthropic import Anthropic

_judge = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def llm_judge(outputs, inputs, ground_truth):
    prompt = (
        "Score whether the predicted sentiment matches the review's true sentiment. "
        'Return JSON only: {"score": 0 or 1}.\n\n'
        f"Review: {inputs['review']}\n"
        f"Predicted: {outputs}\n"
        f"Expected: {ground_truth['expected']}"
    )
    response = _judge.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=50,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.content[0].text
    start, end = text.find("{"), text.rfind("}")
    return float(json.loads(text[start : end + 1])["score"])
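The brace-scan parsing at the end is worth a closer look: it tolerates judges that wrap the JSON in prose instead of returning it bare. A standalone sketch of just that step (extract_score is a hypothetical helper name, not part of any SDK):

```python
import json

def extract_score(text: str) -> float:
    """Pull the first-to-last {...} span out of a judge reply and read its score."""
    start, end = text.find("{"), text.rfind("}")
    return float(json.loads(text[start : end + 1])["score"])

print(extract_score('{"score": 1}'))                     # 1.0 - bare JSON
print(extract_score('Sure! Here it is: {"score": 0}'))   # 0.0 - JSON wrapped in prose
```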

Wiring it into evaluate()

Feed the function, dataset, and evaluators into evaluate(). The run_id is derived directly from the commit SHA (see section 1), so the same commit always maps to the same run_id:
# baseline.py
import os
from honeyhive import evaluate
from classifier import classify_sentiment
from dataset import test_cases
from evaluators import exact_match, llm_judge
from run_id_utils import run_id_from_sha

sha = os.environ["GITHUB_SHA"]
run_id = run_id_from_sha(sha)

def classifier_function(datapoint):
    return classify_sentiment(datapoint["inputs"]["review"])

evaluate(
    function=classifier_function,
    dataset=test_cases,
    evaluators=[exact_match, llm_judge],
    project=os.environ["HH_PROJECT"],
    name=f"sentiment-run-{sha[:7]}",
    run_id=run_id,
    api_key=os.environ["HH_API_KEY"],
)
Once this script runs inside the workflow from section 3, every PR's classifier quality is compared against the baseline. A regression in exact_match or llm_judge means the classifier started mislabeling reviews it used to handle correctly.

1. Setting a Baseline Run

A baseline run is an ordinary evaluate() call tagged with a run_id that your PR jobs can later reference. HoneyHive validates run_id as a strict UUIDv4, so plain strings like "ci-abc123" are rejected. Derive the run_id deterministically from the commit SHA, forcing the UUIDv4 version and variant bits so the result stays valid:
# run_id_utils.py
import uuid

def run_id_from_sha(git_sha: str) -> str:
    """Deterministically derive a valid UUIDv4 string from a git SHA.
    Same SHA -> same run_id, everywhere, no state needed."""
    b = bytearray(bytes.fromhex(git_sha)[:16])
    b[6] = (b[6] & 0x0f) | 0x40   # version = 4
    b[8] = (b[8] & 0x3f) | 0x80   # RFC 4122 variant
    return str(uuid.UUID(bytes=bytes(b)))
Because run_id is a pure function of the SHA, the baseline for any commit is the same across retries, and the PR job can reconstruct its base commit’s run_id without any state-passing - no cache, no artifact, no metadata lookup.
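A quick check of that property (the helper is reproduced so the snippet runs standalone; the SHA is a made-up stand-in, not a real commit):

```python
import uuid

def run_id_from_sha(git_sha: str) -> str:
    # same helper as run_id_utils.py
    b = bytearray(bytes.fromhex(git_sha)[:16])
    b[6] = (b[6] & 0x0f) | 0x40   # version = 4
    b[8] = (b[8] & 0x3f) | 0x80   # RFC 4122 variant
    return str(uuid.UUID(bytes=bytes(b)))

sha = "a" * 40  # stand-in for a real 40-hex-char git SHA
assert run_id_from_sha(sha) == run_id_from_sha(sha)   # deterministic across runs
assert uuid.UUID(run_id_from_sha(sha)).version == 4   # always a valid UUIDv4
```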
# baseline.py - runs once per commit (push to main AND every PR)
import os
from honeyhive import evaluate
from run_id_utils import run_id_from_sha

sha = os.environ["GITHUB_SHA"]
run_id = run_id_from_sha(sha)

result = evaluate(
    function=your_pipeline_function,      # the function under test
    dataset=test_cases,                   # local list, or use dataset_id= for managed datasets
    evaluators=[accuracy_evaluator, latency_evaluator],
    project=os.environ["HH_PROJECT"],
    name=f"run-{sha[:7]}",
    run_id=run_id,
    api_key=os.environ["HH_API_KEY"],
)
For full evaluate() options, see the Experiments introduction.

2. Detecting Regressions

Once the PR’s evaluation has run (via the same baseline.py script above), call compare_runs() against the baseline to detect regressions. The PR workflow passes in the PR head SHA and base SHA as PR_SHA and BASELINE_SHA; the script derives both run_ids directly from them.
# regression_check.py - run this on every pull request, after baseline.py
import os
import sys
from honeyhive import HoneyHive
from honeyhive.experiments import compare_runs
from run_id_utils import run_id_from_sha

pr_run_id = run_id_from_sha(os.environ["PR_SHA"])
baseline_run_id = run_id_from_sha(os.environ["BASELINE_SHA"])

client = HoneyHive(api_key=os.environ["HH_API_KEY"])

comparison = compare_runs(
    client=client,
    new_run_id=pr_run_id,
    old_run_id=baseline_run_id,
    project_id=os.environ["HH_PROJECT"],
)

degraded = comparison.list_degraded_metrics()

if degraded:
    print("Regression detected:")
    for metric_name in degraded:
        delta = comparison.get_metric_delta(metric_name)
        if delta:
            old = delta.get("old_aggregate") or 0
            new = delta.get("new_aggregate") or 0
            print(f"  {metric_name}: {old:.4f} -> {new:.4f} ({new - old:+.4f})")
    sys.exit(1)

improved = comparison.list_improved_metrics()
if improved:
    print("Improvements detected:")
    for metric_name in improved:
        delta = comparison.get_metric_delta(metric_name)
        if delta:
            old = delta.get("old_aggregate") or 0
            new = delta.get("new_aggregate") or 0
            print(f"  {metric_name}: {old:.4f} -> {new:.4f} ({new - old:+.4f})")

if not improved:
    print("All metrics stable.")
print("No regression detected.")
sys.exit(0)

Key methods on RunComparisonResult

Method | Returns | Description
list_degraded_metrics() | list[str] | Metric names where at least one datapoint degraded
list_improved_metrics() | list[str] | Metric names where at least one datapoint improved
get_metric_delta(name) | dict | Delta dict with old_aggregate, new_aggregate, improved_count, degraded_count, improved (IDs), degraded (IDs)
sys.exit(1) on any degraded metric is the strictest possible gate. For per-datapoint breakdowns and a richer comparison workflow, see Comparing Experiments.
The snippet above gates on every metric. In production, teams typically gate only on their most critical or sensitive metrics (e.g. accuracy, hallucination_rate) and treat the rest as informational. Filter list_degraded_metrics() to a critical subset:
CRITICAL_METRICS = {"accuracy", "hallucination_rate"}

degraded = comparison.list_degraded_metrics()
critical_degraded = [m for m in degraded if m in CRITICAL_METRICS]

if critical_degraded:
    print(f"Critical regression in: {critical_degraded}")
    sys.exit(1)
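Another common refinement is tolerating small aggregate noise before failing the build. A sketch of that idea (the TOLERANCE value and is_real_regression helper are illustrative, not part of the SDK; the delta dict shape matches get_metric_delta above):

```python
TOLERANCE = 0.02  # ignore aggregate drops of 2 points or less

def is_real_regression(delta: dict, tolerance: float = TOLERANCE) -> bool:
    """True when the metric's aggregate dropped by more than the allowed tolerance."""
    old = delta.get("old_aggregate") or 0
    new = delta.get("new_aggregate") or 0
    return (old - new) > tolerance

# A 1-point dip stays within tolerance; a 10-point dip does not:
print(is_real_regression({"old_aggregate": 0.90, "new_aggregate": 0.89}))  # False
print(is_real_regression({"old_aggregate": 0.90, "new_aggregate": 0.80}))  # True
```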

3. GitHub Actions Workflow

The workflow below has two jobs:
  1. run-evaluation - runs on push to main (sets the baseline) and on every PR (sets the PR run)
  2. detect-regression - runs only on PRs, compares the two runs and posts a comment
# .github/workflows/eval-regression.yml
name: Evaluation Regression Check

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  HH_API_KEY: ${{ secrets.HH_API_KEY }}
  HH_PROJECT: ${{ vars.HH_PROJECT }}

jobs:
  run-evaluation:
    name: Run Evaluation
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: |
          pip install "honeyhive>=1.0.0rc21"
          pip install -r requirements.txt

      - name: Run evaluation
        env:
          # On push events this is the pushed commit; on pull_request events
          # github.sha is the temporary merge commit, so use the PR head SHA.
          GITHUB_SHA: ${{ github.event.pull_request.head.sha || github.sha }}
        run: python scripts/baseline.py

  detect-regression:
    name: Detect Regression
    runs-on: ubuntu-latest
    needs: run-evaluation
    if: github.event_name == 'pull_request'

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: |
          pip install "honeyhive>=1.0.0rc21"
          pip install -r requirements.txt

      - name: Run regression check
        id: regression
        env:
          PR_SHA: ${{ github.event.pull_request.head.sha }}
          BASELINE_SHA: ${{ github.event.pull_request.base.sha }}
        run: |
          set +e
          python scripts/regression_check.py 2>&1 | tee regression_output.txt
          echo "exit_code=${PIPESTATUS[0]}" >> "$GITHUB_OUTPUT"

      - name: Post PR comment
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const output = fs.readFileSync('regression_output.txt', 'utf8');
            const exitCode = '${{ steps.regression.outputs.exit_code }}';
            const status = exitCode === '0' ? '✅ No regression' : '❌ Regression detected';
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `**HoneyHive Evaluation - ${status}**\n\n\`\`\`\n${output}\n\`\`\``
            });

      - name: Fail on regression
        if: steps.regression.outputs.exit_code != '0'
        run: exit 1
How it works:
  • Both jobs derive run_id from the commit SHA via the same run_id_from_sha() helper, so the same commit always produces the same run_id. No cache, no artifact, no metadata passing.
  • On push to main: baseline.py runs on the pushed commit. That commit’s run_id becomes the baseline any future PR branching off it will look up.
  • On pull_request: run-evaluation produces the PR head’s run, and detect-regression derives the baseline run_id from github.event.pull_request.base.sha and compares.
  • github.event.pull_request.head.sha is used instead of github.sha on PR events because github.sha is GitHub’s temporary merge commit, not the PR head.
  • The PR comment posts the full output whether the check passes or fails, so reviewers see exactly which metrics changed.
  • baseline.py is the script that runs the evaluation - for a concrete function, dataset, and evaluators list, see the worked example above.

REST API (Non-Python CI)

If your CI doesn’t use Python, you can drive the same comparison via the REST API. Start a run, wait for it to complete, then call the comparison endpoint. For the full REST flow covering run creation and event logging, see Experiments via API. The run comparison endpoint:
curl -X GET "https://api.honeyhive.ai/v1/runs/${NEW_RUN_ID}/compare-with/${BASELINE_RUN_ID}" \
  -H "Authorization: Bearer $HH_API_KEY"
The response contains a metrics array where each entry includes metric_name, old_aggregate, new_aggregate, improved_count, and degraded_count - the same data surface as RunComparisonResult.
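Whatever language drives the request, the gating logic over that response shape stays small. A Python sketch against a hypothetical payload (the field names follow the response description above; the gate helper and sample values are illustrative):

```python
# Hypothetical response body from the compare-with endpoint.
response_json = {
    "metrics": [
        {"metric_name": "exact_match", "old_aggregate": 1.0,
         "new_aggregate": 0.67, "improved_count": 0, "degraded_count": 1},
        {"metric_name": "llm_judge", "old_aggregate": 1.0,
         "new_aggregate": 1.0, "improved_count": 0, "degraded_count": 0},
    ]
}

def gate(payload: dict) -> int:
    """Return the CI exit code: 1 if any metric has degraded datapoints, else 0."""
    degraded = [m["metric_name"] for m in payload["metrics"] if m["degraded_count"] > 0]
    if degraded:
        print(f"Regression detected: {degraded}")
        return 1
    return 0

print(gate(response_json))  # 1 -> fail the build
```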

Summary

Step | What happens
Push to main | evaluate() runs with run_id derived from github.sha; this is the baseline for any PR that branches off this commit
Pull request opens | evaluate() runs on the PR head with run_id derived from pr.head.sha; the baseline run_id is derived from pr.base.sha
compare_runs() called | Returns RunComparisonResult with degraded/improved metrics
Degraded metric found | sys.exit(1) - CI fails, PR blocked
All metrics stable | sys.exit(0) - CI passes, PR unblocked
PR comment posted | Reviewers see exact metric deltas inline