
Introduction

Running evaluations in CI means every pull request gets a quality gate: if a code change degrades a metric beyond your threshold, the build fails before the change ships. This guide covers:
  • Setting a baseline run - tie an evaluation run to a run_id derived from your commit SHA so PR jobs can compare against it
  • Detecting regressions - call compare_runs() on every PR and exit non-zero on degradation
  • GitHub Actions YAML - a copy-paste workflow that automates the full loop
Prerequisites: You can already run evaluations with evaluate(). If not, start with the Experiments introduction.
compare_runs() is available in honeyhive >= 1.0.0rc21 (pre-release). Pin your dependency accordingly: honeyhive>=1.0.0rc21.

Worked example: evaluating a sentiment classifier

This section walks through evaluating a minimal sentiment classifier end to end: dataset, evaluators, baseline run, and regression gate. Swap in your own function and dataset - the evaluation flow only cares about the shape of the output.

The function under test

A small classifier that tags a product review as "positive" or "negative":
# classifier.py
import os
from anthropic import Anthropic

_client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def classify_sentiment(review: str) -> str:
    response = _client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=10,
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "Classify this product review as 'positive' or 'negative'. "
                "Return one word only.\n\n"
                f"Review: {review}"
            ),
        }],
    )
    return response.content[0].text.strip().lower()

The dataset

Three reviews with expected labels. A local list is enough for this example; for larger datasets see Upload Datasets.
# dataset.py
test_cases = [
    {
        "inputs": {"review": "Fantastic product - works exactly as advertised."},
        "ground_truth": {"expected": "positive"},
    },
    {
        "inputs": {"review": "Broke after two days. Total waste of money."},
        "ground_truth": {"expected": "negative"},
    },
    {
        "inputs": {"review": "Shipping was fast and the quality is excellent."},
        "ground_truth": {"expected": "positive"},
    },
]

Evaluator 1: exact_match

A programmatic evaluator that returns 1.0 when the prediction matches the expected label and 0.0 otherwise. It follows the evaluator signature (outputs, inputs, ground_truth):
# evaluators.py
def exact_match(outputs, inputs, ground_truth):
    return 1.0 if outputs.strip().lower() == ground_truth["expected"] else 0.0
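A quick sanity check of the normalization behavior (the evaluator is reproduced here so the snippet runs standalone; the inputs are illustrative):

```python
# exact_match strips whitespace and lowercases before comparing,
# so "Positive\n" still counts as a match for "positive".
def exact_match(outputs, inputs, ground_truth):
    return 1.0 if outputs.strip().lower() == ground_truth["expected"] else 0.0

print(exact_match("Positive\n", {}, {"expected": "positive"}))  # 1.0
print(exact_match("negative", {}, {"expected": "positive"}))    # 0.0
```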

Evaluator 2: llm_judge

LLM-as-judge evaluator for a softer correctness signal - useful when the model returns a valid label that still disagrees with the expected one. Temperature 0 keeps scores reproducible across runs:
# evaluators.py (continued)
import json
import os
from anthropic import Anthropic

_judge = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def llm_judge(outputs, inputs, ground_truth):
    prompt = (
        "Score whether the predicted sentiment matches the review's true sentiment. "
        'Return JSON only: {"score": 0 or 1}.\n\n'
        f"Review: {inputs['review']}\n"
        f"Predicted: {outputs}\n"
        f"Expected: {ground_truth['expected']}"
    )
    response = _judge.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=50,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.content[0].text
    start, end = text.find("{"), text.rfind("}")
    return float(json.loads(text[start : end + 1])["score"])
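The brace-scan parsing at the end is worth a closer look: it tolerates judges that wrap the JSON in prose instead of returning it bare. A standalone sketch of just that step (extract_score is a hypothetical helper name, not part of any SDK):

```python
import json

def extract_score(text: str) -> float:
    """Pull the first-to-last {...} span out of a judge reply and read its score."""
    start, end = text.find("{"), text.rfind("}")
    return float(json.loads(text[start : end + 1])["score"])

print(extract_score('{"score": 1}'))                     # 1.0 - bare JSON
print(extract_score('Sure! Here it is: {"score": 0}'))   # 0.0 - JSON wrapped in prose
```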

Wiring it into evaluate()

Feed the function, dataset, and evaluators into evaluate(). The run_id is derived directly from the commit SHA (see section 1), so the same commit always maps to the same run_id:
# baseline.py
import os
from honeyhive import evaluate
from classifier import classify_sentiment
from dataset import test_cases
from evaluators import exact_match, llm_judge
from run_id_utils import run_id_from_sha

sha = os.environ["GITHUB_SHA"]
run_id = run_id_from_sha(sha)

def classifier_function(datapoint):
    return classify_sentiment(datapoint["inputs"]["review"])

evaluate(
    function=classifier_function,
    dataset=test_cases,
    evaluators=[exact_match, llm_judge],
    project=os.environ["HH_PROJECT"],
    name=f"sentiment-run-{sha[:7]}",
    run_id=run_id,
    api_key=os.environ["HH_API_KEY"],
)
Once this script runs inside the workflow from section 3, every PR's classifier quality is compared against the baseline. A regression in exact_match or llm_judge means the classifier started mislabeling reviews it used to handle correctly.

1. Setting a Baseline Run

A baseline run is an ordinary evaluate() call tagged with a run_id that your PR jobs can later reference. HoneyHive validates run_id as a strict UUIDv4, so plain strings like "ci-abc123" are rejected. Derive the run_id deterministically from the commit SHA, forcing the UUIDv4 version and variant bits so the result stays valid:
# run_id_utils.py
import uuid

def run_id_from_sha(git_sha: str) -> str:
    """Deterministically derive a valid UUIDv4 string from a git SHA.
    Same SHA -> same run_id, everywhere, no state needed."""
    b = bytearray(bytes.fromhex(git_sha)[:16])
    b[6] = (b[6] & 0x0f) | 0x40   # version = 4
    b[8] = (b[8] & 0x3f) | 0x80   # RFC 4122 variant
    return str(uuid.UUID(bytes=bytes(b)))
Because run_id is a pure function of the SHA, the baseline for any commit is the same across retries, and the PR job can reconstruct its base commit’s run_id without any state-passing - no cache, no artifact, no metadata lookup.
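A quick check of that property (the helper is reproduced so the snippet runs standalone; the SHA is a made-up stand-in, not a real commit):

```python
import uuid

def run_id_from_sha(git_sha: str) -> str:
    # same helper as run_id_utils.py
    b = bytearray(bytes.fromhex(git_sha)[:16])
    b[6] = (b[6] & 0x0f) | 0x40   # version = 4
    b[8] = (b[8] & 0x3f) | 0x80   # RFC 4122 variant
    return str(uuid.UUID(bytes=bytes(b)))

sha = "a" * 40  # stand-in for a real 40-hex-char git SHA
assert run_id_from_sha(sha) == run_id_from_sha(sha)   # deterministic across runs
assert uuid.UUID(run_id_from_sha(sha)).version == 4   # always a valid UUIDv4
```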
# baseline.py - runs once per commit (push to main AND every PR)
import os
from honeyhive import evaluate
from run_id_utils import run_id_from_sha

sha = os.environ["GITHUB_SHA"]
run_id = run_id_from_sha(sha)

result = evaluate(
    function=your_pipeline_function,      # the function under test
    dataset=test_cases,                   # local list, or use dataset_id= for managed datasets
    evaluators=[accuracy_evaluator, latency_evaluator],
    project=os.environ["HH_PROJECT"],
    name=f"run-{sha[:7]}",
    run_id=run_id,
    api_key=os.environ["HH_API_KEY"],
)
For full evaluate() options, see the Experiments introduction.

2. Detecting Regressions

Once the PR’s evaluation has run (via the same baseline.py script above), call compare_runs() against the baseline to detect regressions. The PR workflow passes in the PR head SHA and base SHA as PR_SHA and BASELINE_SHA; the script derives both run_ids directly from them.
# regression_check.py - run this on every pull request, after baseline.py
import os
import sys
from honeyhive import HoneyHive
from honeyhive.experiments import compare_runs
from run_id_utils import run_id_from_sha

pr_run_id = run_id_from_sha(os.environ["PR_SHA"])
baseline_run_id = run_id_from_sha(os.environ["BASELINE_SHA"])

client = HoneyHive(api_key=os.environ["HH_API_KEY"])

comparison = compare_runs(
    client=client,
    new_run_id=pr_run_id,
    old_run_id=baseline_run_id,
    project_id=os.environ["HH_PROJECT"],
)

degraded = comparison.list_degraded_metrics()

if degraded:
    print("Regression detected:")
    for metric_name in degraded:
        delta = comparison.get_metric_delta(metric_name)
        if delta:
            old = delta.get("old_aggregate") or 0
            new = delta.get("new_aggregate") or 0
            print(f"  {metric_name}: {old:.4f} -> {new:.4f} ({new - old:+.4f})")
    sys.exit(1)

improved = comparison.list_improved_metrics()
if improved:
    print("Improvements detected:")
    for metric_name in improved:
        delta = comparison.get_metric_delta(metric_name)
        if delta:
            old = delta.get("old_aggregate") or 0
            new = delta.get("new_aggregate") or 0
            print(f"  {metric_name}: {old:.4f} -> {new:.4f} ({new - old:+.4f})")

if not improved:
    print("All metrics stable.")
print("No regression detected.")
sys.exit(0)

Key methods on RunComparisonResult

Method | Returns | Description
list_degraded_metrics() | list[str] | Metric names where at least one datapoint degraded
list_improved_metrics() | list[str] | Metric names where at least one datapoint improved
get_metric_delta(name) | dict | Delta dict with old_aggregate, new_aggregate, improved_count, degraded_count, improved (IDs), degraded (IDs)
sys.exit(1) on any degraded metric is the strictest possible gate. For per-datapoint breakdowns and a richer comparison workflow, see Comparing Experiments.
The snippet above gates on every metric. In production, teams typically gate only on their most critical or sensitive metrics (e.g. accuracy, hallucination_rate) and treat the rest as informational. Filter list_degraded_metrics() to a critical subset:
CRITICAL_METRICS = {"accuracy", "hallucination_rate"}

degraded = comparison.list_degraded_metrics()
critical_degraded = [m for m in degraded if m in CRITICAL_METRICS]

if critical_degraded:
    print(f"Critical regression in: {critical_degraded}")
    sys.exit(1)
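Another common refinement is tolerating small aggregate noise before failing the build. A sketch of that idea (the TOLERANCE value and is_real_regression helper are illustrative, not part of the SDK; the delta dict shape matches get_metric_delta above):

```python
TOLERANCE = 0.02  # ignore aggregate drops of 2 points or less

def is_real_regression(delta: dict, tolerance: float = TOLERANCE) -> bool:
    """True when the metric's aggregate dropped by more than the allowed tolerance."""
    old = delta.get("old_aggregate") or 0
    new = delta.get("new_aggregate") or 0
    return (old - new) > tolerance

# A 1-point dip stays within tolerance; a 10-point dip does not:
print(is_real_regression({"old_aggregate": 0.90, "new_aggregate": 0.89}))  # False
print(is_real_regression({"old_aggregate": 0.90, "new_aggregate": 0.80}))  # True
```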

3. GitHub Actions Workflow

The workflow below has two jobs:
  1. run-evaluation - runs on push to main (sets the baseline) and on every PR (sets the PR run)
  2. detect-regression - runs only on PRs, compares the two runs and posts a comment
# .github/workflows/eval-regression.yml
name: Evaluation Regression Check

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  HH_API_KEY: ${{ secrets.HH_API_KEY }}
  HH_PROJECT: ${{ vars.HH_PROJECT }}

jobs:
  run-evaluation:
    name: Run Evaluation
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: |
          pip install "honeyhive>=1.0.0rc21"
          pip install -r requirements.txt

      - name: Run evaluation
        env:
          # On push events this is the pushed commit; on pull_request events
          # github.sha is the temporary merge commit, so use the PR head SHA.
          GITHUB_SHA: ${{ github.event.pull_request.head.sha || github.sha }}
        run: python scripts/baseline.py

  detect-regression:
    name: Detect Regression
    runs-on: ubuntu-latest
    needs: run-evaluation
    if: github.event_name == 'pull_request'

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: |
          pip install "honeyhive>=1.0.0rc21"
          pip install -r requirements.txt

      - name: Run regression check
        id: regression
        env:
          PR_SHA: ${{ github.event.pull_request.head.sha }}
          BASELINE_SHA: ${{ github.event.pull_request.base.sha }}
        run: |
          set +e
          python scripts/regression_check.py 2>&1 | tee regression_output.txt
          echo "exit_code=${PIPESTATUS[0]}" >> "$GITHUB_OUTPUT"

      - name: Post PR comment
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const output = fs.readFileSync('regression_output.txt', 'utf8');
            const exitCode = '${{ steps.regression.outputs.exit_code }}';
            const status = exitCode === '0' ? '✅ No regression' : '❌ Regression detected';
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `**HoneyHive Evaluation - ${status}**\n\n\`\`\`\n${output}\n\`\`\``
            });

      - name: Fail on regression
        if: steps.regression.outputs.exit_code != '0'
        run: exit 1
How it works:
  • Both jobs derive run_id from the commit SHA via the same run_id_from_sha() helper, so the same commit always produces the same run_id. No cache, no artifact, no metadata passing.
  • On push to main: baseline.py runs on the pushed commit. That commit’s run_id becomes the baseline any future PR branching off it will look up.
  • On pull_request: run-evaluation produces the PR head’s run, and detect-regression derives the baseline run_id from github.event.pull_request.base.sha and compares.
  • github.event.pull_request.head.sha is used instead of github.sha on PR events because github.sha is GitHub’s temporary merge commit, not the PR head.
  • The PR comment posts the full output whether the check passes or fails, so reviewers see exactly which metrics changed.
  • baseline.py is the script that runs the evaluation - for a concrete function, dataset, and evaluators list, see the worked example above.

REST API (Non-Python CI)

If your CI doesn’t use Python, you can drive the same comparison via the REST API. Start a run, wait for it to complete, then call the comparison endpoint. For the full REST flow covering run creation and event logging, see Experiments via API. The run comparison endpoint:
curl -X GET "https://api.honeyhive.ai/v1/runs/${NEW_RUN_ID}/compare-with/${BASELINE_RUN_ID}" \
  -H "Authorization: Bearer $HH_API_KEY"
The response contains a metrics array where each entry includes metric_name, old_aggregate, new_aggregate, improved_count, and degraded_count - the same data surface as RunComparisonResult.
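Whatever language drives the request, the gating logic over that response shape stays small. A Python sketch against a hypothetical payload (the field names follow the response description above; the gate helper and sample values are illustrative):

```python
# Hypothetical response body from the compare-with endpoint.
response_json = {
    "metrics": [
        {"metric_name": "exact_match", "old_aggregate": 1.0,
         "new_aggregate": 0.67, "improved_count": 0, "degraded_count": 1},
        {"metric_name": "llm_judge", "old_aggregate": 1.0,
         "new_aggregate": 1.0, "improved_count": 0, "degraded_count": 0},
    ]
}

def gate(payload: dict) -> int:
    """Return the CI exit code: 1 if any metric has degraded datapoints, else 0."""
    degraded = [m["metric_name"] for m in payload["metrics"] if m["degraded_count"] > 0]
    if degraded:
        print(f"Regression detected: {degraded}")
        return 1
    return 0

print(gate(response_json))  # 1 -> fail the build
```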

Summary

Step | What happens
Push to main | evaluate() runs with run_id derived from github.sha; this is the baseline for any PR that branches off this commit
Pull request opens | evaluate() runs on the PR head with run_id derived from pr.head.sha; the baseline run_id is derived from pr.base.sha
compare_runs() called | Returns RunComparisonResult with degraded/improved metrics
Degraded metric found | sys.exit(1) - CI fails, PR blocked
All metrics stable | sys.exit(0) - CI passes, PR unblocked
PR comment posted | Reviewers see exact metric deltas inline