Comparative Evaluations allow you to run multiple evaluations using the same dataset (linked by dataset_id) and compare their results side-by-side. This is particularly useful when you want to benchmark different models, prompts, or configurations against each other.

The comparison view allows you to:

  • Compare aggregated metrics across evaluation runs
  • Filter for events that improved or regressed on specific metrics
  • Diff the outputs of corresponding events
  • Analyze how key metrics are distributed

Setting Up Comparative Evaluations

The steps below follow the same dataset flow as the Evaluations tab. Follow them to set up and run comparative evaluations.

Steps

1. Set Up Evaluators

Evaluators can be set up either client-side, before the events are logged, or server-side, where they are processed after ingestion. Set up server-side evaluators to score the sessions and events produced by the evaluation.
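As a concrete illustration, here is a minimal sketch of client-side evaluators: plain Python functions that score a single completion. The function names, the (outputs, inputs, ground_truth) signature, and the "answer" field are assumptions for this example; check the SDK reference for the exact evaluator interface, and configure server-side evaluators from the HoneyHive dashboard.

```python
# Hypothetical client-side evaluators -- plain Python functions that score one
# completion. The (outputs, inputs, ground_truth) signature and the "answer"
# field are assumptions for this sketch; the SDK's expected interface may differ.

def exact_match(outputs: dict, inputs: dict, ground_truth: dict) -> float:
    """Return 1.0 if the generated answer matches the reference answer."""
    expected = str((ground_truth or {}).get("answer", "")).strip().lower()
    actual = str((outputs or {}).get("answer", "")).strip().lower()
    return 1.0 if expected and expected == actual else 0.0

def answer_length(outputs: dict, inputs: dict, ground_truth: dict) -> int:
    """A simple numeric metric: character length of the generated answer."""
    return len(str((outputs or {}).get("answer", "")))
```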

2. Set Up Input Data

Create a dataset by uploading one or by curating it from events already logged in HoneyHive.

Each datapoint is a different scenario you want to evaluate.

A datapoint in a sample dataset with mapped inputs.
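For illustration, a datapoint might look like the sketch below. The inputs and ground_truth field names are assumptions for this example; map your own columns when you create the dataset.

```python
# Illustrative datapoint shape (field names are assumptions for this sketch;
# map your own columns when creating the dataset in HoneyHive).
datapoint = {
    "inputs": {
        # Fields passed to the function under evaluation
        "question": "What is the capital of France?",
        "context": "France is a country in Western Europe. Its capital is Paris.",
    },
    "ground_truth": {
        # Optional reference values used by evaluators
        "answer": "Paris",
    },
}
```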

3. Create the Flow You Want to Evaluate

The function to evaluate should accept an input parameter. This parameter is an object that maps to each datapoint (or JSON object) passed to the evaluation.

The value returned by the function maps to the outputs field of each session in the evaluation.
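A minimal sketch of such a flow is shown below. It accepts the mapped inputs from one datapoint and returns a dictionary that becomes the session's outputs. The OpenAI call and model name are placeholders for whatever pipeline you are actually evaluating.

```python
# Minimal flow under evaluation. It receives one datapoint's mapped inputs and
# returns a dict that is stored as the session's outputs. The OpenAI call and
# model name are placeholders; substitute your own pipeline. Assumes
# OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

def qa_flow(inputs: dict) -> dict:
    prompt = f"Context: {inputs['context']}\n\nQuestion: {inputs['question']}"
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return {"answer": response.choices[0].message.content}
```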

4. Run the Evaluation
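A hedged sketch of kicking off the run from the SDK, tying together the flow, dataset, and evaluators defined above, might look like this. The exact parameter names (hh_api_key, hh_project, dataset_id, evaluators, and so on) are assumptions and may differ across SDK versions; consult the SDK reference before running.

```python
# Hedged sketch of launching an evaluation run with the HoneyHive SDK.
# Parameter names below are assumptions and may vary by SDK version.
from honeyhive import evaluate

evaluate(
    function=qa_flow,                      # the flow defined in step 3
    hh_api_key="YOUR_HONEYHIVE_API_KEY",
    hh_project="YOUR_PROJECT_NAME",
    name="qa-flow-gpt-4o-mini",            # appears as the evaluation run name
    dataset_id="YOUR_DATASET_ID",          # shared dataset_id makes runs comparable
    evaluators=[exact_match, answer_length],
)
```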

Viewing Comparative Results

To view and compare evaluation results:

  1. Navigate to the ‘Evaluations’ section in your HoneyHive dashboard.
  2. Select an evaluation to compare.
  3. Click the ‘Compare with’ button to select other eligible evaluations.

Advanced Comparison Features

1. Aggregated Metrics

HoneyHive automatically calculates and compares aggregates from:

  • Server-side metrics
  • Client-side metrics
  • Composite metrics at the session level

2. Improved/Regressed Events

Filter for events that have improved or regressed on specific metrics: select the metric and operation you want, then view the corresponding events in the events table.

3. Output Diff Viewer

Compare outputs and metrics of corresponding events with the same event name.

4. Metric Distribution

Analyze the distribution of various metrics for deeper insights.

Best Practices

  1. Use a consistent dataset for all compared evaluations.
  2. Isolate one change at a time (e.g., model, prompt, temperature) to understand its specific impact (see the sketch after this list).
  3. Ensure a sufficient sample size for statistically significant conclusions.
  4. Document configurations used in each evaluation for future reference.
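To illustrate the first two practices, the hedged sketch below runs two evaluations against the same dataset_id that differ only in the model, so they are directly eligible for side-by-side comparison. make_flow is a hypothetical helper, and the evaluate parameters carry the same assumptions as in the run step above.

```python
# Two runs over the same dataset that differ only in the model, so the
# comparison isolates that single change. make_flow is a hypothetical helper;
# evaluate(...) parameter names are assumptions, as noted earlier.
def make_flow(model: str):
    def flow(inputs: dict) -> dict:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": inputs["question"]}],
            temperature=0.0,
        )
        return {"answer": response.choices[0].message.content}
    return flow

for model in ["gpt-4o-mini", "gpt-4o"]:
    evaluate(
        function=make_flow(model),
        hh_api_key="YOUR_HONEYHIVE_API_KEY",
        hh_project="YOUR_PROJECT_NAME",
        name=f"qa-flow-{model}",           # run name documents the configuration
        dataset_id="YOUR_DATASET_ID",      # same dataset for every compared run
        evaluators=[exact_match, answer_length],
    )
```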

Conclusion

Comparative Evaluations in HoneyHive provide a powerful tool for benchmarking different LLM configurations. Leverage this feature to make data-driven decisions about optimal models, prompts, or parameters for your specific use case.