After running multiple experiments, you can compare their results side-by-side to understand what changed and why. This is essential when iterating on prompts, testing different models, or tuning parameters.
Experiment comparison view showing metrics summary, distribution charts, and side-by-side outputs

How to Compare Runs

  1. Go to Experiments in the sidebar
  2. Select an experiment run to view its details
  3. Click Compared with and select another run from the dropdown
  4. The view updates to show side-by-side comparison
Runs are comparable when they share common datapoints (matched by datapoint_id). For best results, run experiments against the same dataset.
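Under the hood, comparability comes down to intersecting datapoint IDs. A minimal sketch of that matching logic (the run dicts and score fields below are illustrative, not the actual API response shape):

```python
# Hypothetical run results keyed by datapoint_id.
run_a = {"dp-1": {"accuracy": 0.8}, "dp-2": {"accuracy": 0.6}, "dp-3": {"accuracy": 0.9}}
run_b = {"dp-1": {"accuracy": 0.9}, "dp-2": {"accuracy": 0.7}, "dp-4": {"accuracy": 0.5}}

# Runs are only compared on datapoints present in both.
common = sorted(set(run_a) & set(run_b))
print(common)  # ['dp-1', 'dp-2']
```

Datapoints that appear in only one run (dp-3 and dp-4 above) are excluded from the comparison, which is why running both experiments against the same dataset gives the most complete results.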

Comparison Features

Aggregated Metrics

View aggregate scores for each metric across both runs. The summary highlights improved and regressed counts, so you can see at a glance which run performed better on each metric.
Aggregated metrics comparison showing average scores for each evaluator
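The improved/regressed counts boil down to a per-datapoint comparison on each shared metric. A rough sketch of that computation (the score dicts are invented for illustration):

```python
# Hypothetical per-datapoint scores for one metric in two runs.
old_scores = {"dp-1": 0.6, "dp-2": 0.8, "dp-3": 0.7}
new_scores = {"dp-1": 0.9, "dp-2": 0.8, "dp-3": 0.5}

# Count datapoints where the new run scored higher or lower.
improved = sum(1 for dp in old_scores if new_scores[dp] > old_scores[dp])
regressed = sum(1 for dp in old_scores if new_scores[dp] < old_scores[dp])
print(improved, regressed)  # 1 1
```

Ties (dp-2 above) count toward neither bucket, so improved + regressed can be less than the total number of common datapoints.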

Metric Distribution

Analyze how scores are distributed across datapoints. This helps identify whether improvements are consistent or driven by outliers.
Distribution chart comparing metric scores between two experiment runs
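A distribution chart can reveal, for example, a higher average that is driven by a handful of outliers rather than a broad improvement. If you want a quick local approximation, you can bucket a run's scores yourself (the score list here is made up):

```python
from collections import Counter

# Hypothetical metric scores across datapoints for one run.
scores = [0.2, 0.55, 0.6, 0.62, 0.9, 0.95, 0.97]

# Bucket scores into four 0.25-wide bins to see the distribution's shape.
bins = Counter(min(int(s / 0.25), 3) for s in scores)
for b in range(4):
    lo, hi = b * 0.25, (b + 1) * 0.25
    print(f"{lo:.2f}-{hi:.2f}: {'#' * bins[b]}")
```

A run whose scores cluster in one bin behaves very differently from one with the same average spread across all four, even though their aggregates match.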

Improved/Regressed Filtering

Filter the datapoint table to show only cases where performance improved or regressed on a specific metric.
Filter dropdown for selecting improved or regressed datapoints

Output Diff Viewer

Toggle Diff mode to see side-by-side outputs for each datapoint, with differences highlighted. This helps you understand exactly how outputs changed between runs.
Side-by-side output comparison with diff highlighting
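The diff view is a line-level comparison in spirit; if you want the same view locally, Python's standard-library difflib produces an equivalent diff (the outputs below are fabricated examples):

```python
import difflib

# Two hypothetical outputs for the same datapoint in different runs.
old_output = "The capital of France is Paris.\nIt has 2.1M residents."
new_output = "The capital of France is Paris.\nIts population is about 2.1 million."

# unified_diff marks removed lines with "-" and added lines with "+".
diff = list(difflib.unified_diff(
    old_output.splitlines(), new_output.splitlines(),
    fromfile="old_run", tofile="new_run", lineterm="",
))
for line in diff:
    print(line)
```

Unchanged lines (the first sentence) are shown as context, so only the lines that actually differ between runs draw your attention.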

Step-Level Comparisons

For multi-step traces, compare metrics at each individual step using the Viewing Event dropdown. This shows how changes affect specific stages of your pipeline.

Programmatic Comparison

Use compare_runs() to analyze differences in code:
from honeyhive import HoneyHive
from honeyhive.experiments import compare_runs

client = HoneyHive()
comparison = compare_runs(
    client=client,
    new_run_id="run-abc123",
    old_run_id="run-xyz789"
)

print(f"Common datapoints: {comparison.common_datapoints}")

for metric, delta in comparison.metric_deltas.items():
    old = delta.get("old_aggregate", 0) or 0
    new = delta.get("new_aggregate", 0) or 0
    print(f"{metric}: {old:.2f} → {new:.2f} ({new - old:+.2f})")
The compare_runs() function returns:
  • common_datapoints - Number of shared datapoints between runs
  • metric_deltas - Per-metric comparison with old/new aggregates and improved/degraded counts
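For example, to flag metrics that regressed overall, you can scan the deltas. The dict below mimics the fields described above; treat the exact key names as an assumption to verify against your SDK version:

```python
# Shaped like compare_runs().metric_deltas as described above (assumed keys).
metric_deltas = {
    "accuracy": {"old_aggregate": 0.72, "new_aggregate": 0.81, "improved": 14, "degraded": 3},
    "latency_score": {"old_aggregate": 0.90, "new_aggregate": 0.84, "improved": 2, "degraded": 9},
}

# Collect metrics whose aggregate score dropped in the new run.
regressions = [
    metric for metric, delta in metric_deltas.items()
    if (delta.get("new_aggregate") or 0) < (delta.get("old_aggregate") or 0)
]
print(regressions)  # ['latency_score']
```

A check like this fits naturally in CI: fail the build if any metric you care about lands in the regressions list.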

Best Practices

| Practice | Why It Matters |
| --- | --- |
| Same dataset | Ensures you’re comparing apples to apples |
| One change at a time | Isolates the impact of each change |
| Sufficient sample size | Avoids conclusions based on outliers |
| Name runs descriptively | Makes it easy to identify what changed (e.g., gpt-4o-temp-0.3 vs gpt-4o-temp-0.7) |