Comparative Evaluations
Learn how to perform and compare multiple evaluations in HoneyHive
Comparative Evaluations allow you to run multiple evaluations over the same dataset (linked by `dataset_id`) and compare their results side-by-side. This is particularly useful when you want to benchmark different models, prompts, or configurations against each other.
The comparison view lets you compare aggregated metrics, filter for improved or regressed events, diff outputs, and analyze metric distributions across runs.
Setting Up Comparative Evaluations
These steps mirror the dataset flow in the Evaluations tab. Follow them to set up and run comparative evaluations.
Steps
Setup Evaluators
Evaluators can be set up either client-side, before the events are logged, or processed server-side after ingestion. Set up server-side evaluators to score the sessions and events produced by the evaluation.
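To make the client-side option concrete, here is a minimal sketch of an evaluator function that scores an output before it is logged. The function name, signature, and returned shape are assumptions for illustration, not the HoneyHive SDK's API.

```python
# Illustrative client-side evaluator: scores a model output against the
# expected answer before the event is logged. Names here are assumptions.
def exact_match_evaluator(output: str, ground_truth: str) -> dict:
    """Return a metric dict comparing the model output to the expected answer."""
    score = 1.0 if output.strip().lower() == ground_truth.strip().lower() else 0.0
    return {"metric": "exact_match", "score": score}

result = exact_match_evaluator("Paris", "paris")  # case-insensitive match
```

Server-side evaluators are configured in the HoneyHive UI instead and run automatically over ingested sessions.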
Setup input data
Create a dataset by uploading one or by building it from events logged in HoneyHive.
Each datapoint is a different scenario you want to evaluate.
A datapoint in a sample dataset with mapped inputs.
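As a sketch, a dataset is a list of datapoints, each carrying the inputs for one scenario. The `inputs` and `ground_truth` field names below are illustrative assumptions; adapt them to your schema.

```python
# Illustrative datapoint shape for an uploaded dataset. Each datapoint is one
# scenario; its "inputs" object is what the evaluated function receives.
datapoints = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "ground_truth": {"answer": "Paris"},
    },
    {
        "inputs": {"question": "What is 2 + 2?"},
        "ground_truth": {"answer": "4"},
    },
]
```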
Create the flow you want to evaluate
The function to evaluate should accept a parameter `input`. This parameter is an object that maps to each datapoint (or JSON object) passed to the evaluation. The value returned by the function maps to the `outputs` field of each session in the evaluation.
Run evaluation
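Conceptually, running the evaluation applies the flow to every datapoint and records one session per datapoint. The harness below is only a local sketch of that data flow; the actual HoneyHive SDK run also logs each session and links it to the dataset via `dataset_id`.

```python
def run_evaluation(flow, datapoints):
    """Conceptual harness: run the flow on each datapoint and collect sessions.
    Illustrative only -- the real SDK handles logging and dataset linkage."""
    sessions = []
    for dp in datapoints:
        outputs = flow(dp["inputs"])
        sessions.append({"inputs": dp["inputs"], "outputs": outputs})
    return sessions

sessions = run_evaluation(
    lambda inputs: {"answer": inputs["question"].upper()},
    [{"inputs": {"question": "hi"}}],
)
```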
Viewing Comparative Results
To view and compare evaluation results:
- Navigate to the ‘Evaluations’ section in your HoneyHive dashboard.
- Select an evaluation to compare.
- Click the ‘Compare with’ button to select other eligible evaluations.
Advanced Comparison Features
1. Aggregated Metrics
HoneyHive automatically calculates and compares aggregates from:
- Server-side metrics
- Client-side metrics
- Composite metrics at the session level
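To illustrate what an aggregate over a run looks like, here is a sketch that averages one metric across sessions. The `metrics` field name is an assumption for this example.

```python
from statistics import mean

def aggregate_metric(sessions: list, metric: str) -> float:
    """Average a per-session metric across one evaluation run (illustrative)."""
    return mean(s["metrics"][metric] for s in sessions)

run_a = [{"metrics": {"exact_match": 1.0}}, {"metrics": {"exact_match": 0.0}}]
avg = aggregate_metric(run_a, "exact_match")  # → 0.5
```

Comparing such aggregates between two runs over the same dataset is the core of the side-by-side view.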
2. Improved/regressed events
- Filter for events that have improved or regressed on specific metrics.
- Select the metric and operation you want.
- View the corresponding events in the events table.
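The filtering above amounts to a per-datapoint comparison of one metric between two runs. A sketch of that logic (the dict shapes are assumptions for illustration):

```python
def improved_regressed(baseline: dict, candidate: dict, metric: str):
    """Split shared datapoints into improved and regressed lists (illustrative).

    `baseline` and `candidate` map a datapoint id to its metrics dict."""
    improved, regressed = [], []
    for dp_id in baseline.keys() & candidate.keys():
        before, after = baseline[dp_id][metric], candidate[dp_id][metric]
        if after > before:
            improved.append(dp_id)
        elif after < before:
            regressed.append(dp_id)
    return improved, regressed

base = {"dp1": {"f1": 0.4}, "dp2": {"f1": 0.9}}
cand = {"dp1": {"f1": 0.7}, "dp2": {"f1": 0.5}}
imp, reg = improved_regressed(base, cand, "f1")
```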
3. Output Diff Viewer
Compare outputs and metrics of corresponding events with the same event name.
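The underlying idea of an output diff can be sketched with the standard library; this is not HoneyHive's implementation, just an illustration of a line-level diff between two corresponding outputs.

```python
import difflib

def output_diff(a: str, b: str) -> list:
    """Line diff between two corresponding event outputs (illustrative)."""
    return list(difflib.unified_diff(a.splitlines(), b.splitlines(), lineterm=""))

diff = output_diff("hello\nworld", "hello\nWorld")
```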
4. Metric Distribution
Analyze the distribution of various metrics for deeper insights.
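As a sketch of what a distribution view summarizes, here is a simple bucketing of per-event latencies (the metric and bucket size are illustrative assumptions):

```python
from collections import Counter

def latency_histogram(latencies_ms, bucket_ms=100):
    """Bucket per-event latencies into a simple histogram (illustrative)."""
    return Counter((l // bucket_ms) * bucket_ms for l in latencies_ms)

hist = latency_histogram([120, 180, 250, 90])  # buckets 0, 100, 200
```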
Best Practices
- Use a consistent dataset for all compared evaluations.
- Isolate one change at a time (e.g., model, prompt, temperature) to understand its specific impact.
- Ensure a sufficient sample size for statistically significant conclusions.
- Document configurations used in each evaluation for future reference.
Conclusion
Comparative Evaluations in HoneyHive provide a powerful tool for benchmarking different LLM configurations. Leverage this feature to make data-driven decisions about optimal models, prompts, or parameters for your specific use case.