# Comparing Experiments

> Compare experiment runs to identify improvements and regressions across prompts, models, or configurations.

After running multiple experiments, you can compare their results side-by-side to understand what changed and why. This is essential when iterating on prompts, testing different models, or tuning parameters.

<Frame>
  <img src="https://mintcdn.com/honeyhiveai/qmpHooEVX6j-ieIE/images/comparitive_evals_screenshot.png?fit=max&auto=format&n=qmpHooEVX6j-ieIE&q=85&s=5d670083096f94d4380834c60c01fd60" alt="Experiment comparison view showing metrics summary, distribution charts, and side-by-side outputs" width="2622" height="1516" data-path="images/comparitive_evals_screenshot.png" />
</Frame>

## How to Compare Runs

1. Go to **Experiments** in the sidebar
2. Select an experiment run to view its details
3. Click **compared with** and select another run from the dropdown
4. The view updates to show a side-by-side comparison

<Note>
  Two runs are comparable when they share datapoints (matched by `datapoint_id`). For best results, run both experiments against the same dataset.
</Note>
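
As a rough illustration of what matching by `datapoint_id` implies, the sketch below partitions two runs into shared, old-only, and new-only datapoints using set operations. The dictionaries and the `split_datapoints` helper are hypothetical stand-ins for illustration, not part of the HoneyHive SDK.

```python
# Hypothetical per-run scores keyed by datapoint_id (illustrative data only).
old_run = {"dp-1": 0.70, "dp-2": 0.55, "dp-3": 0.90}
new_run = {"dp-2": 0.60, "dp-3": 0.85, "dp-4": 0.75}


def split_datapoints(old, new):
    """Partition datapoint IDs into shared, old-only, and new-only groups."""
    old_ids, new_ids = set(old), set(new)
    return {
        "common": sorted(old_ids & new_ids),
        "old_only": sorted(old_ids - new_ids),
        "new_only": sorted(new_ids - old_ids),
    }


groups = split_datapoints(old_run, new_run)
print(groups)
```

Only the datapoints in the `common` group can be compared directly, which is why running both experiments against the same dataset gives the most complete comparison.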

## Comparison Features

### Aggregated Metrics

View aggregate scores for each metric across both runs. The summary highlights improved and regressed counts, so you can see at a glance which run performed better on each metric.

<Frame>
  <img src="https://mintcdn.com/honeyhiveai/qmpHooEVX6j-ieIE/images/comparitive_evals_aggregates.png?fit=max&auto=format&n=qmpHooEVX6j-ieIE&q=85&s=6d2fc96c08a300fa34c261c467c226b1" alt="Aggregated metrics comparison showing average scores for each evaluator" width="858" height="714" data-path="images/comparitive_evals_aggregates.png" />
</Frame>

### Metric Distribution

Analyze how scores are distributed across datapoints. This helps identify whether improvements are consistent or driven by outliers.

<Frame>
  <img src="https://mintcdn.com/honeyhiveai/qmpHooEVX6j-ieIE/images/comparitive_evals_distribution.png?fit=max&auto=format&n=qmpHooEVX6j-ieIE&q=85&s=68c343008fa51fb0c1ca6de800858caf" alt="Distribution chart comparing metric scores between two experiment runs" width="1716" height="686" data-path="images/comparitive_evals_distribution.png" />
</Frame>
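
To see why the distribution matters, consider a minimal sketch that compares the mean and median of per-datapoint score changes. The `deltas` list is made-up data, and this is only one simple way to spot outlier-driven gains, not how the dashboard computes its charts.

```python
from statistics import mean, median

# Hypothetical per-datapoint score changes (new minus old) for one metric.
deltas = [0.01, 0.02, 0.00, -0.01, 0.01, 0.95]

avg = mean(deltas)
mid = median(deltas)
# Fraction of datapoints whose score went up at all.
improved_share = sum(d > 0 for d in deltas) / len(deltas)

print(f"mean delta:   {avg:+.3f}")
print(f"median delta: {mid:+.3f}")
print(f"share of datapoints improved: {improved_share:.0%}")
```

When the mean is far above the median, as in this made-up data, a single large change is probably inflating the aggregate.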

### Improved/Regressed Filtering

Filter the datapoint table to show only cases where performance improved or regressed on a specific metric.

<Frame>
  <img src="https://mintcdn.com/honeyhiveai/qmpHooEVX6j-ieIE/images/comparitive_evals_filter_select.png?fit=max&auto=format&n=qmpHooEVX6j-ieIE&q=85&s=c87709b795311a851a717318a5f130e3" alt="Filter dropdown for selecting improved or regressed datapoints" width="850" height="690" data-path="images/comparitive_evals_filter_select.png" />
</Frame>
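
Conceptually, this filter labels each shared datapoint by the sign of its score change. The sketch below shows that idea under the assumption that higher scores are better. The `classify` helper and its `tolerance` parameter are hypothetical and do not reflect the platform's internal logic.

```python
def classify(old_scores, new_scores, tolerance=0.0):
    """Label each shared datapoint as improved, regressed, or unchanged."""
    labels = {}
    for dp_id in old_scores.keys() & new_scores.keys():
        change = new_scores[dp_id] - old_scores[dp_id]
        if change > tolerance:
            labels[dp_id] = "improved"
        elif change < -tolerance:
            labels[dp_id] = "regressed"
        else:
            labels[dp_id] = "unchanged"
    return labels


# Illustrative scores for a single metric, keyed by datapoint_id.
old_scores = {"dp-1": 0.5, "dp-2": 0.8, "dp-3": 0.9}
new_scores = {"dp-1": 0.7, "dp-2": 0.6, "dp-3": 0.9}

labels = classify(old_scores, new_scores)
regressed = sorted(dp for dp, label in labels.items() if label == "regressed")
print(regressed)
```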

### Output Diff Viewer

Toggle **Diff** mode to see side-by-side outputs for each datapoint, with differences highlighted. This helps you understand exactly how outputs changed between runs.

<Frame>
  <img src="https://mintcdn.com/honeyhiveai/qmpHooEVX6j-ieIE/images/comparitive_evals_diff.png?fit=max&auto=format&n=qmpHooEVX6j-ieIE&q=85&s=e01194418e075d994b078228fd7dbd20" alt="Side-by-side output comparison with diff highlighting" width="2554" height="1116" data-path="images/comparitive_evals_diff.png" />
</Frame>
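
If you want a similar view outside the UI, a text diff of two outputs can be approximated with Python's standard `difflib` module. This is a sketch with made-up outputs. The dashboard's diff rendering is separate and may use a different algorithm.

```python
import difflib

# Hypothetical outputs for the same datapoint from two runs.
old_output = "The capital of France is Paris.\nIt has about 2 million residents."
new_output = "The capital of France is Paris.\nIt has roughly 2.1 million residents."

# unified_diff yields header lines, then lines prefixed with ' ', '-', or '+'.
diff_lines = list(
    difflib.unified_diff(
        old_output.splitlines(),
        new_output.splitlines(),
        fromfile="old_run",
        tofile="new_run",
        lineterm="",
    )
)
print("\n".join(diff_lines))
```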

### Step-Level Comparisons

For multi-step traces, compare metrics at each individual step using the **Viewing Event** dropdown. This shows how changes affect specific stages of your pipeline.

## Programmatic Comparison

Use `compare_runs()` to compare two runs from code:

```python
from honeyhive import HoneyHive
from honeyhive.experiments import compare_runs

client = HoneyHive()
comparison = compare_runs(
    client=client,
    new_run_id="run-abc123",
    old_run_id="run-xyz789",
    project_id="my-project",
    aggregate_function="average",  # "average", "sum", "min", or "max"
)

print(f"Common datapoints: {comparison.common_datapoints}")

for metric, delta in comparison.metric_deltas.items():
    old = delta.get("old_aggregate", 0) or 0
    new = delta.get("new_aggregate", 0) or 0
    print(f"{metric}: {old:.2f} → {new:.2f} ({new - old:+.2f})")
```
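
To make the `aggregate_function` options concrete, the sketch below applies plain Python equivalents of the four names to one list of scores. The `AGGREGATORS` mapping is an assumption for illustration, and the SDK's own implementation may differ (for example, in how it treats missing values).

```python
from statistics import mean

# Assumed plain-Python equivalents of the documented aggregate_function names.
AGGREGATORS = {"average": mean, "sum": sum, "min": min, "max": max}

# Hypothetical per-datapoint scores for one metric in one run.
scores = [0.2, 0.9, 0.7, 0.6]

summary = {name: fn(scores) for name, fn in AGGREGATORS.items()}
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Which aggregate is most useful depends on the metric. For instance, `min` surfaces the worst case, while `average` summarizes typical behavior.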

### Working with Results

The `compare_runs()` function returns a `RunComparisonResult` object. Key properties include `common_datapoints`, `new_only_datapoints`, `old_only_datapoints`, and `metric_deltas`. Use the helper methods to quickly identify changes:

* `list_improved_metrics()` - metric names where at least one datapoint improved
* `list_degraded_metrics()` - metric names where at least one datapoint degraded
* `get_metric_delta(name)` - detailed delta for a specific metric, including `old_aggregate`, `new_aggregate`, `improved_count`, `degraded_count`, and lists of affected datapoint IDs

**Example: identify regressions**

```python
from honeyhive import HoneyHive
from honeyhive.experiments import compare_runs

client = HoneyHive()
comparison = compare_runs(
    client=client,
    new_run_id="run-new",
    old_run_id="run-old",
    project_id="my-project",
)

# List metrics that got better or worse
print("Improved:", comparison.list_improved_metrics())
print("Degraded:", comparison.list_degraded_metrics())

# Drill into a specific metric
delta = comparison.get_metric_delta("accuracy")
if delta:
    # Fall back to 0 if an aggregate is missing or None
    old = delta.get("old_aggregate", 0) or 0
    new = delta.get("new_aggregate", 0) or 0
    print(f"Accuracy: {old:.2f} → {new:.2f}")
    print(f"  Improved on {delta['improved_count']} datapoints")
    print(f"  Degraded on {delta['degraded_count']} datapoints")
    if delta["degraded"]:
        print(f"  Degraded datapoint IDs: {delta['degraded']}")
```
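
The same delta fields can drive a simple regression gate. The sketch below works on a plain dictionary shaped like `metric_deltas`, so it runs without a HoneyHive connection. The `find_regressions` helper, its `max_drop` threshold, and the sample values are hypothetical, and it assumes higher scores are better.

```python
def find_regressions(metric_deltas, max_drop=0.0):
    """Return metrics whose aggregate fell by more than max_drop."""
    regressions = {}
    for metric, delta in metric_deltas.items():
        # Treat missing or None aggregates as 0, as in the examples above.
        old = delta.get("old_aggregate") or 0
        new = delta.get("new_aggregate") or 0
        if old - new > max_drop:
            regressions[metric] = new - old
    return regressions


# Stand-in for comparison.metric_deltas (illustrative values only).
sample_deltas = {
    "accuracy": {"old_aggregate": 0.82, "new_aggregate": 0.78},
    "relevance": {"old_aggregate": 0.70, "new_aggregate": 0.74},
}

regressions = find_regressions(sample_deltas, max_drop=0.01)
for metric, change in regressions.items():
    print(f"{metric} regressed by {change:+.2f}")
```

In a CI job, you could exit with a nonzero status when `regressions` is non-empty so that the pipeline fails.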

## Best Practices

| Practice                    | Why It Matters                                                                        |
| --------------------------- | ------------------------------------------------------------------------------------- |
| **Same dataset**            | Ensures score differences reflect your change rather than different inputs            |
| **One change at a time**    | Isolates the impact of each change                                                    |
| **Sufficient sample size**  | Avoids conclusions based on outliers                                                  |
| **Name runs descriptively** | Makes it easy to identify what changed (e.g., `gpt-4o-temp-0.3` vs `gpt-4o-temp-0.7`) |

## Related

<CardGroup cols={2}>
  <Card title="Run Your First Experiment" icon="flask" href="/v2/introduction/experiments-quickstart">
    Tutorial with step-by-step comparison example
  </Card>

  <Card title="CI Regression Detection" icon="github" href="/v2/evaluation/ci-regression-detection">
    Use `compare_runs()` in CI to block PRs on metric regressions
  </Card>

  <Card title="Evaluation Framework" icon="book" href="/v2/evaluation/concepts">
    Understand the underlying evaluation architecture
  </Card>
</CardGroup>
