Experiments in HoneyHive help you systematically test and improve your AI applications. Whether you’re iterating on prompts, comparing models, or optimizing your RAG pipeline, experiments provide a structured way to measure improvements and ensure reliability.
HoneyHive Experiments dashboard showing run results and metrics

Run Your First Experiment

New to experiments? Follow our hands-on tutorial to run your first experiment in 10 minutes.

Core Concepts

An experiment consists of three parts:
| Component  | What it is                                  | Example                                |
| ---------- | ------------------------------------------- | -------------------------------------- |
| Function   | The code you want to evaluate               | A prompt, RAG pipeline, or agent       |
| Dataset    | Test cases with inputs and expected outputs | Customer queries with correct intents  |
| Evaluators | Functions that score outputs                | Accuracy check, LLM-as-judge           |
from honeyhive import evaluate

result = evaluate(
    function=my_classifier,      # Your function
    dataset=test_cases,          # Your test data
    evaluators=[accuracy_check], # Your scoring functions
    name="intent-classifier-v2"
)
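The three parts passed to evaluate() might look like the following minimal sketch. All names here (my_classifier, test_cases, accuracy_check) are illustrative, and the evaluator signature shown is an assumption — check the SDK reference for the exact shape it expects:

```python
def my_classifier(inputs):
    """Function under test: maps a customer query to an intent label."""
    query = inputs["query"].lower()
    if "refund" in query:
        return "refund_request"
    return "general_inquiry"

# Dataset: test cases pairing inputs with expected (ground-truth) outputs.
test_cases = [
    {"inputs": {"query": "I want a refund"},
     "ground_truths": {"intent": "refund_request"}},
    {"inputs": {"query": "What are your hours?"},
     "ground_truths": {"intent": "general_inquiry"}},
]

def accuracy_check(output, inputs, ground_truths):
    """Evaluator: 1.0 when the predicted intent matches ground truth."""
    return 1.0 if output == ground_truths["intent"] else 0.0
```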

Custom Run IDs

By default, evaluate() generates a UUID for each run. You can pass a custom run_id to correlate results with specific CI pipeline runs or to enable deterministic identifiers:
import os

result = evaluate(
    function=my_pipeline,
    dataset=test_cases,
    evaluators=[accuracy],
    name="nightly-regression",
    run_id=f"ci-{os.environ['CI_PIPELINE_ID']}"
)

Why Use Experiments?

  • Iterate with confidence - Test prompt variations, model configurations, and architectural changes against consistent metrics
  • Track improvements - Monitor how changes affect key metrics over time
  • Automate quality checks - Run experiments in CI/CD pipelines to catch issues before deployment
  • Compare approaches - Evaluate different models, retrieval methods, or chunking strategies side-by-side
  • Ensure reliability - Catch regressions by testing across diverse scenarios before deploying

How It Works

When you call evaluate():
  1. Run - Your function executes on each datapoint (with automatic tracing)
  2. Score - Evaluators measure each output against ground truth
  3. Aggregate - HoneyHive computes metrics (average, min, max)
  4. View - Results appear in the dashboard for analysis
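Conceptually, the four steps above reduce to a loop like this simplified sketch (the real SDK also traces each execution and uploads results; the evaluator signature is an assumption carried over from the earlier example):

```python
def run_experiment(function, dataset, evaluators):
    """Simplified sketch of evaluate(): run, score, then aggregate."""
    per_datapoint = []
    for datapoint in dataset:
        # 1. Run: execute the function on each datapoint's inputs
        output = function(datapoint["inputs"])
        # 2. Score: each evaluator measures the output against ground truth
        scores = {
            e.__name__: e(output, datapoint["inputs"], datapoint["ground_truths"])
            for e in evaluators
        }
        per_datapoint.append(scores)
    # 3. Aggregate: compute average, min, max per evaluator
    aggregates = {}
    for name in per_datapoint[0]:
        values = [s[name] for s in per_datapoint]
        aggregates[name] = {
            "average": sum(values) / len(values),
            "min": min(values),
            "max": max(values),
        }
    # 4. View: in the real SDK these results land in the dashboard
    return aggregates
```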

Trace Linking

Every execution creates a traced session with metadata that links it to:
  • run_id - Groups all traces from a single experiment run together
  • datapoint_id - Identifies which test case produced each trace
This linking enables powerful comparisons:
  • Same datapoint, different runs - Compare how prompt v1 vs v2 handled the same input
  • Aggregate metrics - See average accuracy across all test cases in a run
  • Regression detection - Identify which specific inputs degraded between versions
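As a sketch of how that linking enables regression detection, the snippet below joins two runs on datapoint_id and reports which inputs got worse. The trace records are plain dicts standing in for exported traces; the field layout is an assumption for illustration:

```python
def find_regressions(traces, baseline_run, candidate_run, metric="accuracy"):
    """Return datapoint_ids whose metric dropped between two runs."""
    # Index each run's traces by datapoint_id -> metric value
    by_run = {baseline_run: {}, candidate_run: {}}
    for t in traces:
        if t["run_id"] in by_run:
            by_run[t["run_id"]][t["datapoint_id"]] = t["metrics"][metric]
    base, cand = by_run[baseline_run], by_run[candidate_run]
    # Same datapoint, different runs: flag anything that scored lower
    return [dp for dp in base if dp in cand and cand[dp] < base[dp]]
```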

Git Context

When you run evaluate() from a Git repository, the SDK automatically captures Git metadata on each experiment run:
  • Commit hash and branch name
  • Author and remote URL
  • Dirty status (whether there are uncommitted changes)
This makes it easy to trace any experiment result back to the exact code that produced it, which is especially useful in CI/CD pipelines.
For deeper understanding of the framework design and evaluation philosophy, see Evaluation Framework.

Next Steps