Experiments in HoneyHive help you systematically test and improve your AI applications. Whether you’re iterating on prompts, comparing models, or optimizing your RAG pipeline, experiments provide a structured way to measure improvements and ensure reliability.
HoneyHive Experiments dashboard showing run results and metrics

Run Your First Experiment

New to experiments? Follow our hands-on tutorial to run your first experiment in 10 minutes.

Core Concepts

An experiment consists of three parts:
| Component  | What it is                                  | Example                                |
| ---------- | ------------------------------------------- | -------------------------------------- |
| Function   | The code you want to evaluate               | A prompt, RAG pipeline, or agent       |
| Dataset    | Test cases with inputs and expected outputs | Customer queries with correct intents  |
| Evaluators | Functions that score outputs                | Accuracy check, LLM-as-judge           |
from honeyhive import evaluate

result = evaluate(
    function=my_classifier,      # Your function
    dataset=test_cases,          # Your test data
    evaluators=[accuracy_check], # Your scoring functions
    name="intent-classifier-v2"
)
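The three parts passed to evaluate() might look like the following minimal sketch. All names here (my_classifier, test_cases, accuracy_check) are illustrative, and the evaluator signature shown is an assumption — check the SDK reference for the exact shape it expects:

```python
def my_classifier(inputs):
    """Function under test: maps a customer query to an intent label."""
    query = inputs["query"].lower()
    if "refund" in query:
        return "refund_request"
    return "general_inquiry"

# Dataset: test cases pairing inputs with expected (ground-truth) outputs.
test_cases = [
    {"inputs": {"query": "I want a refund"},
     "ground_truths": {"intent": "refund_request"}},
    {"inputs": {"query": "What are your hours?"},
     "ground_truths": {"intent": "general_inquiry"}},
]

def accuracy_check(output, inputs, ground_truths):
    """Evaluator: 1.0 when the predicted intent matches ground truth."""
    return 1.0 if output == ground_truths["intent"] else 0.0
```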

Custom Run IDs

By default, evaluate() generates a UUID for each run. You can pass a custom run_id to correlate results with specific CI pipeline runs or to enable deterministic identifiers:
import os

result = evaluate(
    function=my_pipeline,
    dataset=test_cases,
    evaluators=[accuracy],
    name="nightly-regression",
    run_id=f"ci-{os.environ['CI_PIPELINE_ID']}"
)

Why Use Experiments?

  • Iterate with confidence - Test prompt variations, model configurations, and architectural changes against consistent metrics
  • Track improvements - Monitor how changes affect key metrics over time
  • Automate quality checks - Run experiments in CI/CD pipelines to catch issues before deployment
  • Compare approaches - Evaluate different models, retrieval methods, or chunking strategies side-by-side
  • Ensure reliability - Catch regressions by testing across diverse scenarios before deploying

How It Works

When you call evaluate():
  1. Run - Your function executes on each datapoint (with automatic tracing)
  2. Score - Evaluators measure each output against ground truth
  3. Aggregate - HoneyHive computes metrics (average, min, max)
  4. View - Results appear in the dashboard for analysis
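Conceptually, the four steps above reduce to a loop like this simplified sketch (the real SDK also traces each execution and uploads results; the evaluator signature is an assumption carried over from the earlier example):

```python
def run_experiment(function, dataset, evaluators):
    """Simplified sketch of evaluate(): run, score, then aggregate."""
    per_datapoint = []
    for datapoint in dataset:
        # 1. Run: execute the function on each datapoint's inputs
        output = function(datapoint["inputs"])
        # 2. Score: each evaluator measures the output against ground truth
        scores = {
            e.__name__: e(output, datapoint["inputs"], datapoint["ground_truths"])
            for e in evaluators
        }
        per_datapoint.append(scores)
    # 3. Aggregate: compute average, min, max per evaluator
    aggregates = {}
    for name in per_datapoint[0]:
        values = [s[name] for s in per_datapoint]
        aggregates[name] = {
            "average": sum(values) / len(values),
            "min": min(values),
            "max": max(values),
        }
    # 4. View: in the real SDK these results land in the dashboard
    return aggregates
```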

Trace Linking

Every execution creates a traced session with metadata that links it to:
  • run_id - Groups all traces from a single experiment run together
  • datapoint_id - Identifies which test case produced each trace
This linking enables powerful comparisons:
  • Same datapoint, different runs - Compare how prompt v1 vs v2 handled the same input
  • Aggregate metrics - See average accuracy across all test cases in a run
  • Regression detection - Identify which specific inputs degraded between versions
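As a sketch of how that linking enables regression detection, the snippet below joins two runs on datapoint_id and reports which inputs got worse. The trace records are plain dicts standing in for exported traces; the field layout is an assumption for illustration:

```python
def find_regressions(traces, baseline_run, candidate_run, metric="accuracy"):
    """Return datapoint_ids whose metric dropped between two runs."""
    # Index each run's traces by datapoint_id -> metric value
    by_run = {baseline_run: {}, candidate_run: {}}
    for t in traces:
        if t["run_id"] in by_run:
            by_run[t["run_id"]][t["datapoint_id"]] = t["metrics"][metric]
    base, cand = by_run[baseline_run], by_run[candidate_run]
    # Same datapoint, different runs: flag anything that scored lower
    return [dp for dp in base if dp in cand and cand[dp] < base[dp]]
```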

Git Context

When you run evaluate() from a Git repository, the SDK automatically captures Git metadata on each experiment run:
  • Commit hash and branch name
  • Author and remote URL
  • Dirty status (whether there are uncommitted changes)
This makes it easy to trace any experiment result back to the exact code that produced it, which is especially useful in CI/CD pipelines.
For deeper understanding of the framework design and evaluation philosophy, see Evaluation Framework.

Next Steps