Introduction
Running evaluations in CI means every pull request gets a quality gate: if a code change degrades a metric beyond your threshold, the build fails before the change ships.
This guide covers:
- Setting a baseline run - tie an evaluation run to a run_id derived from the commit SHA so PR jobs can compare against it
- Detecting regressions - call compare_runs() on every PR and exit non-zero on degradation
- GitHub Actions YAML - a copy-paste workflow that automates the full loop
Prerequisites: You can already run evaluations with evaluate(). If not, start with the Experiments introduction first.
compare_runs() is available in honeyhive >= 1.0.0rc21 (pre-release). Pin your dependency accordingly: honeyhive>=1.0.0rc21.
Worked example: evaluating a sentiment classifier
This section walks through evaluating a minimal sentiment classifier end to end: dataset, evaluators, baseline run, and regression gate. Swap in your own function and dataset - the evaluation flow only cares about the shape of the output.
The function under test
A small classifier that tags a product review as "positive" or "negative":
```python
# classifier.py
import os

from anthropic import Anthropic

_client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])


def classify_sentiment(review: str) -> str:
    response = _client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=10,
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "Classify this product review as 'positive' or 'negative'. "
                "Return one word only.\n\n"
                f"Review: {review}"
            ),
        }],
    )
    return response.content[0].text.strip().lower()
```
The dataset
Three reviews with expected labels. A local list is enough for this example; for larger datasets see Upload Datasets.
```python
# dataset.py
test_cases = [
    {
        "inputs": {"review": "Fantastic product - works exactly as advertised."},
        "ground_truth": {"expected": "positive"},
    },
    {
        "inputs": {"review": "Broke after two days. Total waste of money."},
        "ground_truth": {"expected": "negative"},
    },
    {
        "inputs": {"review": "Shipping was fast and the quality is excellent."},
        "ground_truth": {"expected": "positive"},
    },
]
```
Evaluator 1: exact_match
Programmatic evaluator. Returns 1.0 when the prediction matches the expected label, 0.0 otherwise. Follows the evaluator signature (outputs, inputs, ground_truth):
```python
# evaluators.py
def exact_match(outputs, inputs, ground_truth):
    return 1.0 if outputs.strip().lower() == ground_truth["expected"] else 0.0
```
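Because exact_match is pure Python, you can sanity-check the normalization behavior locally before wiring it into a run. The function is inlined below so the snippet stands alone; the sample outputs and expected labels are illustrative values, not part of the dataset above:

```python
# Same evaluator as evaluators.py, inlined for a self-contained check.
def exact_match(outputs, inputs, ground_truth):
    return 1.0 if outputs.strip().lower() == ground_truth["expected"] else 0.0

# Trailing whitespace and casing from the model are normalized away.
assert exact_match("Positive\n", {}, {"expected": "positive"}) == 1.0
assert exact_match("negative", {}, {"expected": "positive"}) == 0.0
```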
Evaluator 2: llm_judge
LLM-as-judge evaluator for a softer correctness signal - useful when the model returns a valid label that still disagrees with the expected one. Temperature 0 keeps scores reproducible across runs:
```python
# evaluators.py (continued)
import json
import os

from anthropic import Anthropic

_judge = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])


def llm_judge(outputs, inputs, ground_truth):
    prompt = (
        "Score whether the predicted sentiment matches the review's true sentiment. "
        'Return JSON only: {"score": 0 or 1}.\n\n'
        f"Review: {inputs['review']}\n"
        f"Predicted: {outputs}\n"
        f"Expected: {ground_truth['expected']}"
    )
    response = _judge.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=50,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.content[0].text
    # Extract the JSON object even if the model wraps it in extra text.
    start, end = text.find("{"), text.rfind("}")
    return float(json.loads(text[start : end + 1])["score"])
```
Wiring it into evaluate()
Feed the function, dataset, and evaluators into evaluate(). The run_id is derived directly from the commit SHA (see section 1), so the same commit always maps to the same run_id:
```python
# baseline.py
import os

from honeyhive import evaluate

from classifier import classify_sentiment
from dataset import test_cases
from evaluators import exact_match, llm_judge
from run_id_utils import run_id_from_sha

sha = os.environ["GITHUB_SHA"]
run_id = run_id_from_sha(sha)


def classifier_function(datapoint):
    return classify_sentiment(datapoint["inputs"]["review"])


evaluate(
    function=classifier_function,
    dataset=test_cases,
    evaluators=[exact_match, llm_judge],
    project=os.environ["HH_PROJECT"],
    name=f"sentiment-run-{sha[:7]}",
    run_id=run_id,
    api_key=os.environ["HH_API_KEY"],
)
```
Once this script runs inside the workflow from section 3, every PR has its classifier quality compared against the baseline. A regression in exact_match or llm_judge means the classifier started mislabeling reviews it used to handle correctly.
1. Setting a Baseline Run
A baseline run is an ordinary evaluate() call tagged with a run_id that your PR jobs can later reference. HoneyHive validates run_id as a strict UUIDv4, so plain strings like "ci-abc123" are rejected. Derive the run_id deterministically from the commit SHA, forcing the UUIDv4 version and variant bits so the result stays valid:
```python
# run_id_utils.py
import uuid


def run_id_from_sha(git_sha: str) -> str:
    """Deterministically derive a valid UUIDv4 string from a git SHA.

    Same SHA -> same run_id, everywhere, no state needed.
    Pass the full hex SHA: at least 32 hex characters are needed
    to fill the 16 bytes of a UUID.
    """
    b = bytearray(bytes.fromhex(git_sha)[:16])
    b[6] = (b[6] & 0x0F) | 0x40  # version = 4
    b[8] = (b[8] & 0x3F) | 0x80  # RFC 4122 variant
    return str(uuid.UUID(bytes=bytes(b)))
```
Because run_id is a pure function of the SHA, the baseline for any commit is the same across retries, and the PR job can reconstruct its base commit’s run_id without any state-passing - no cache, no artifact, no metadata lookup.
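A quick check of both properties - determinism and strict UUIDv4 validity. The helper is inlined from run_id_utils.py so the snippet stands alone, and the SHA is an arbitrary 40-character example, not a real commit:

```python
import uuid


def run_id_from_sha(git_sha: str) -> str:
    # Same implementation as run_id_utils.py above.
    b = bytearray(bytes.fromhex(git_sha)[:16])
    b[6] = (b[6] & 0x0F) | 0x40  # version = 4
    b[8] = (b[8] & 0x3F) | 0x80  # RFC 4122 variant
    return str(uuid.UUID(bytes=bytes(b)))


sha = "a" * 40  # arbitrary example SHA
rid = run_id_from_sha(sha)
assert rid == run_id_from_sha(sha)    # deterministic: same SHA, same run_id
assert uuid.UUID(rid).version == 4    # passes strict UUIDv4 validation
```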
```python
# baseline.py - runs once per commit (push to main AND every PR)
import os

from honeyhive import evaluate

from run_id_utils import run_id_from_sha

sha = os.environ["GITHUB_SHA"]
run_id = run_id_from_sha(sha)

result = evaluate(
    function=your_pipeline_function,  # the function under test
    dataset=test_cases,               # local list, or use dataset_id= for managed datasets
    evaluators=[accuracy_evaluator, latency_evaluator],
    project=os.environ["HH_PROJECT"],
    name=f"run-{sha[:7]}",
    run_id=run_id,
    api_key=os.environ["HH_API_KEY"],
)
```
For full evaluate() options, see the Experiments introduction.
2. Detecting Regressions
Once the PR’s evaluation has run (via the same baseline.py script above), call compare_runs() against the baseline to detect regressions. The PR workflow passes the PR head SHA and the base SHA in as PR_SHA and BASELINE_SHA; the script derives both run_ids directly from them.
```python
# regression_check.py - run this on every pull request, after baseline.py
import os
import sys

from honeyhive import HoneyHive
from honeyhive.experiments import compare_runs

from run_id_utils import run_id_from_sha

pr_run_id = run_id_from_sha(os.environ["PR_SHA"])
baseline_run_id = run_id_from_sha(os.environ["BASELINE_SHA"])

client = HoneyHive(api_key=os.environ["HH_API_KEY"])
comparison = compare_runs(
    client=client,
    new_run_id=pr_run_id,
    old_run_id=baseline_run_id,
    project_id=os.environ["HH_PROJECT"],
)


def print_deltas(metric_names):
    """Print old -> new aggregates for each named metric."""
    for metric_name in metric_names:
        delta = comparison.get_metric_delta(metric_name)
        if delta:
            old = delta.get("old_aggregate") or 0
            new = delta.get("new_aggregate") or 0
            print(f"  {metric_name}: {old:.4f} -> {new:.4f} ({new - old:+.4f})")


degraded = comparison.list_degraded_metrics()
if degraded:
    print("Regression detected:")
    print_deltas(degraded)
    sys.exit(1)

improved = comparison.list_improved_metrics()
if improved:
    print("Improvements detected:")
    print_deltas(improved)
else:
    print("All metrics stable.")

print("No regression detected.")
sys.exit(0)
```
Key methods on RunComparisonResult
| Method | Returns | Description |
|---|---|---|
| list_degraded_metrics() | list[str] | Metric names where at least one datapoint degraded |
| list_improved_metrics() | list[str] | Metric names where at least one datapoint improved |
| get_metric_delta(name) | dict | Delta dict with old_aggregate, new_aggregate, improved_count, degraded_count, improved (IDs), degraded (IDs) |
sys.exit(1) on any degraded metric is the minimal threshold. For per-datapoint breakdowns and a richer comparison workflow, see Comparing Experiments.
The snippet above gates on every metric. In production, teams typically gate only on their most critical or sensitive metrics (e.g. accuracy, hallucination_rate) and treat the rest as informational. Filter list_degraded_metrics() to a critical subset:

```python
CRITICAL_METRICS = {"accuracy", "hallucination_rate"}

degraded = comparison.list_degraded_metrics()
critical_degraded = [m for m in degraded if m in CRITICAL_METRICS]
if critical_degraded:
    print(f"Critical regression in: {critical_degraded}")
    sys.exit(1)
```
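Gating on any degradation at all can also be noisy for LLM-judge metrics, where one flipped datapoint moves the aggregate. One option is to tolerate small aggregate drops. This is a sketch, not SDK behavior: the helper name and the 2-point threshold are illustrative, and it operates on the delta dict shape documented in the table above (old_aggregate / new_aggregate):

```python
TOLERANCE = 0.02  # illustrative: allow up to a 2-point absolute drop


def exceeds_tolerance(delta: dict, tolerance: float = TOLERANCE) -> bool:
    """True when the aggregate dropped by more than the allowed tolerance."""
    old = delta.get("old_aggregate") or 0
    new = delta.get("new_aggregate") or 0
    return (old - new) > tolerance


# Usage against a compare_runs() result would look like:
# blocking = [m for m in comparison.list_degraded_metrics()
#             if exceeds_tolerance(comparison.get_metric_delta(m) or {})]
```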
3. GitHub Actions Workflow
The workflow below has two jobs:
- run-evaluation - runs on push to main (sets the baseline) and on every PR (sets the PR run)
- detect-regression - runs only on PRs, compares the two runs and posts a comment
```yaml
# .github/workflows/eval-regression.yml
name: Evaluation Regression Check

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  HH_API_KEY: ${{ secrets.HH_API_KEY }}
  HH_PROJECT: ${{ vars.HH_PROJECT }}

jobs:
  run-evaluation:
    name: Run Evaluation
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: |
          pip install "honeyhive>=1.0.0rc21"
          pip install -r requirements.txt
      - name: Run evaluation
        env:
          # On push events this is the pushed commit; on pull_request events
          # github.sha is the temporary merge commit, so use the PR head SHA.
          GITHUB_SHA: ${{ github.event.pull_request.head.sha || github.sha }}
        run: python scripts/baseline.py

  detect-regression:
    name: Detect Regression
    runs-on: ubuntu-latest
    needs: run-evaluation
    if: github.event_name == 'pull_request'
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: |
          pip install "honeyhive>=1.0.0rc21"
          pip install -r requirements.txt
      - name: Run regression check
        id: regression
        env:
          PR_SHA: ${{ github.event.pull_request.head.sha }}
          BASELINE_SHA: ${{ github.event.pull_request.base.sha }}
        run: |
          set +e
          python scripts/regression_check.py 2>&1 | tee regression_output.txt
          echo "exit_code=${PIPESTATUS[0]}" >> "$GITHUB_OUTPUT"
      - name: Post PR comment
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const output = fs.readFileSync('regression_output.txt', 'utf8');
            const exitCode = '${{ steps.regression.outputs.exit_code }}';
            const status = exitCode === '0' ? '✅ No regression' : '❌ Regression detected';
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `**HoneyHive Evaluation - ${status}**\n\n\`\`\`\n${output}\n\`\`\``
            });
      - name: Fail on regression
        if: steps.regression.outputs.exit_code != '0'
        run: exit 1
```
How it works:
- Both jobs derive run_id from the commit SHA via the same run_id_from_sha() helper, so the same commit always produces the same run_id. No cache, no artifact, no metadata passing.
- On push to main: baseline.py runs on the pushed commit. That commit's run_id becomes the baseline any future PR branching off it will look up.
- On pull_request: run-evaluation produces the PR head's run, and detect-regression derives the baseline run_id from github.event.pull_request.base.sha and compares. github.event.pull_request.head.sha is used instead of github.sha on PR events because github.sha is GitHub's temporary merge commit, not the PR head.
- The PR comment posts the full output whether the check passes or fails, so reviewers see exactly which metrics changed.

baseline.py is the script that runs the evaluation - for a concrete function, dataset, and evaluators list, see the worked example above.
REST API (Non-Python CI)
If your CI doesn’t use Python, you can drive the same comparison via the REST API. Start a run, wait for it to complete, then call the comparison endpoint.
For the full REST flow covering run creation and event logging, see Experiments via API.
The run comparison endpoint:
```shell
curl -X GET "https://api.honeyhive.ai/v1/runs/${NEW_RUN_ID}/compare-with/${BASELINE_RUN_ID}" \
  -H "Authorization: Bearer $HH_API_KEY"
```
The response contains a metrics array where each entry includes metric_name, old_aggregate, new_aggregate, improved_count, and degraded_count - the same data surface as RunComparisonResult.
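That response can be turned into a pass/fail signal with jq in any CI shell. The response body below is a hand-written sample shaped like the fields listed above (illustrative values, not captured API output):

```shell
# Hand-written sample matching the documented metrics-array shape.
response='{"metrics":[
  {"metric_name":"accuracy","old_aggregate":0.9,"new_aggregate":0.8,"improved_count":0,"degraded_count":2},
  {"metric_name":"latency","old_aggregate":1.2,"new_aggregate":1.1,"improved_count":3,"degraded_count":0}
]}'

# Names of metrics with at least one degraded datapoint.
degraded=$(echo "$response" | jq -r '[.metrics[] | select(.degraded_count > 0) | .metric_name] | join(",")')
echo "degraded: $degraded"
# In CI, exit non-zero when $degraded is non-empty to fail the build.
```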
Summary
| Step | What happens |
|---|---|
| Push to main | evaluate() runs with run_id derived from github.sha; this is the baseline for any PR that branches off this commit |
| Pull request opens | evaluate() runs on the PR head with run_id derived from pr.head.sha; baseline run_id derived from pr.base.sha |
| compare_runs() called | Returns RunComparisonResult with degraded/improved metrics |
| Degraded metric found | sys.exit(1) - CI fails, PR blocked |
| All metrics stable | sys.exit(0) - CI passes, PR unblocked |
| PR comment posted | Reviewers see exact metric deltas inline |