After running multiple experiments, you can compare their results side-by-side to understand what changed and why. This is essential when iterating on prompts, testing different models, or tuning parameters.
Experiment comparison view showing metrics summary, distribution charts, and side-by-side outputs

How to Compare Runs

  1. Go to Experiments in the sidebar
  2. Select an experiment run to view its details
  3. Click Compared with and select another run from the dropdown
  4. The view updates to show side-by-side comparison
Runs are comparable when they share common datapoints (matched by datapoint_id). For best results, run experiments against the same dataset.
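Under the hood, comparability comes down to intersecting datapoint IDs. A minimal sketch of that matching logic (the run dicts and score fields below are illustrative, not the actual API response shape):

```python
# Hypothetical run results keyed by datapoint_id.
run_a = {"dp-1": {"accuracy": 0.8}, "dp-2": {"accuracy": 0.6}, "dp-3": {"accuracy": 0.9}}
run_b = {"dp-1": {"accuracy": 0.9}, "dp-2": {"accuracy": 0.7}, "dp-4": {"accuracy": 0.5}}

# Runs are only compared on datapoints present in both.
common = sorted(set(run_a) & set(run_b))
print(common)  # ['dp-1', 'dp-2']
```

Datapoints that appear in only one run (dp-3 and dp-4 above) are excluded from the comparison, which is why running both experiments against the same dataset gives the most complete results.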

Comparison Features

Aggregated Metrics

View aggregate scores for each metric across both runs. The summary highlights improved and regressed counts, so you can see at a glance which run performed better on each metric.
Aggregated metrics comparison showing average scores for each evaluator
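The improved/regressed counts boil down to a per-datapoint comparison on each shared metric. A rough sketch of that computation (the score dicts are invented for illustration):

```python
# Hypothetical per-datapoint scores for one metric in two runs.
old_scores = {"dp-1": 0.6, "dp-2": 0.8, "dp-3": 0.7}
new_scores = {"dp-1": 0.9, "dp-2": 0.8, "dp-3": 0.5}

# Count datapoints where the new run scored higher or lower.
improved = sum(1 for dp in old_scores if new_scores[dp] > old_scores[dp])
regressed = sum(1 for dp in old_scores if new_scores[dp] < old_scores[dp])
print(improved, regressed)  # 1 1
```

Ties (dp-2 above) count toward neither bucket, so improved + regressed can be less than the total number of common datapoints.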

Metric Distribution

Analyze how scores are distributed across datapoints. This helps identify whether improvements are consistent or driven by outliers.
Distribution chart comparing metric scores between two experiment runs
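A distribution chart can reveal, for example, a higher average that is driven by a handful of outliers rather than a broad improvement. If you want a quick local approximation, you can bucket a run's scores yourself (the score list here is made up):

```python
from collections import Counter

# Hypothetical metric scores across datapoints for one run.
scores = [0.2, 0.55, 0.6, 0.62, 0.9, 0.95, 0.97]

# Bucket scores into four 0.25-wide bins to see the distribution's shape.
bins = Counter(min(int(s / 0.25), 3) for s in scores)
for b in range(4):
    lo, hi = b * 0.25, (b + 1) * 0.25
    print(f"{lo:.2f}-{hi:.2f}: {'#' * bins[b]}")
```

A run whose scores cluster in one bin behaves very differently from one with the same average spread across all four, even though their aggregates match.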

Improved/Regressed Filtering

Filter the datapoint table to show only cases where performance improved or regressed on a specific metric.
Filter dropdown for selecting improved or regressed datapoints

Output Diff Viewer

Toggle Diff mode to see side-by-side outputs for each datapoint, with differences highlighted. This helps you understand exactly how outputs changed between runs.
Side-by-side output comparison with diff highlighting
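The diff view is a line-level comparison in spirit; if you want the same view locally, Python's standard-library difflib produces an equivalent diff (the outputs below are fabricated examples):

```python
import difflib

# Two hypothetical outputs for the same datapoint in different runs.
old_output = "The capital of France is Paris.\nIt has 2.1M residents."
new_output = "The capital of France is Paris.\nIts population is about 2.1 million."

# unified_diff marks removed lines with "-" and added lines with "+".
diff = list(difflib.unified_diff(
    old_output.splitlines(), new_output.splitlines(),
    fromfile="old_run", tofile="new_run", lineterm="",
))
for line in diff:
    print(line)
```

Unchanged lines (the first sentence) are shown as context, so only the lines that actually differ between runs draw your attention.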

Step-Level Comparisons

For multi-step traces, compare metrics at each individual step using the Viewing Event dropdown. This shows how changes affect specific stages of your pipeline.

Programmatic Comparison

Use compare_runs() to analyze differences in code:
from honeyhive import HoneyHive
from honeyhive.experiments import compare_runs

client = HoneyHive()
comparison = compare_runs(
    client=client,
    new_run_id="run-abc123",
    old_run_id="run-xyz789"
)

print(f"Common datapoints: {comparison.common_datapoints}")

for metric, delta in comparison.metric_deltas.items():
    old = delta.get("old_aggregate", 0) or 0
    new = delta.get("new_aggregate", 0) or 0
    print(f"{metric}: {old:.2f} → {new:.2f} ({new - old:+.2f})")
The compare_runs() function returns:
  • common_datapoints - Number of shared datapoints between runs
  • metric_deltas - Per-metric comparison with old/new aggregates and improved/degraded counts
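For example, to flag metrics that regressed overall, you can scan the deltas. The dict below mimics the fields described above; treat the exact key names as an assumption to verify against your SDK version:

```python
# Shaped like compare_runs().metric_deltas as described above (assumed keys).
metric_deltas = {
    "accuracy": {"old_aggregate": 0.72, "new_aggregate": 0.81, "improved": 14, "degraded": 3},
    "latency_score": {"old_aggregate": 0.90, "new_aggregate": 0.84, "improved": 2, "degraded": 9},
}

# Collect metrics whose aggregate score dropped in the new run.
regressions = [
    metric for metric, delta in metric_deltas.items()
    if (delta.get("new_aggregate") or 0) < (delta.get("old_aggregate") or 0)
]
print(regressions)  # ['latency_score']
```

A check like this fits naturally in CI: fail the build if any metric you care about lands in the regressions list.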

Best Practices

| Practice | Why It Matters |
| --- | --- |
| Same dataset | Ensures you’re comparing apples to apples |
| One change at a time | Isolates the impact of each change |
| Sufficient sample size | Avoids conclusions based on outliers |
| Name runs descriptively | Makes it easy to identify what changed (e.g., gpt-4o-temp-0.3 vs gpt-4o-temp-0.7) |