Introduction
Get started with running experiments in HoneyHive
Experiments in HoneyHive make it easy to systematically test and compare different configurations of your application.
What is an Experiment Run?
An experiment in HoneyHive is a group of sessions that share the same configuration, iterated over a dataset or list of inputs. An experiment run in HoneyHive consists of three main components (a short sketch follows the list below):
- Configuration/Variant: This is what you are evaluating. It could be different models, prompts, or any other configuration you want to test. Each variant is captured through traced sessions that record the code execution.
- Dataset: This is what you are evaluating against. The dataset provides the input data for the experiment, so you can compare the performance of different configurations across consistent inputs.
- Evaluators: These are the metrics or criteria you are measuring. Evaluators help you assess the performance of the configuration against the dataset.
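To make the three components concrete, here is a minimal sketch in plain Python. It is illustrative only and does not use the HoneyHive SDK: the `variant` dict, `length_evaluator`, and `run_variant` names are assumptions, and the OpenAI client is just a stand-in for the application under test.

```python
# Illustrative sketch only -- not the HoneyHive SDK API.
# It shows the three components of an experiment run in plain Python.
from openai import OpenAI  # hypothetical application dependency

client = OpenAI()

# 1. Configuration / variant: what you are evaluating.
variant = {"model": "gpt-4o", "temperature": 0.2}

# 2. Dataset: consistent inputs to evaluate against.
dataset = [
    {"question": "What is HoneyHive?"},
    {"question": "How do experiment runs link sessions?"},
]

# 3. Evaluators: metrics applied to each output.
def length_evaluator(output: str) -> int:
    """Toy metric: number of characters in the response."""
    return len(output)

def run_variant(inputs: dict) -> str:
    """The application code under test; each call would be a traced session."""
    response = client.chat.completions.create(
        model=variant["model"],
        temperature=variant["temperature"],
        messages=[{"role": "user", "content": inputs["question"]}],
    )
    return response.choices[0].message.content

# Run the variant over every datapoint and score each output.
results = []
for datapoint in dataset:
    output = run_variant(datapoint)
    results.append({"input": datapoint, "output": output,
                    "length": length_evaluator(output)})
```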
Why run experiments with HoneyHive?
Our approach of measuring performance against consistent datasets using standardized metrics helps you:
- Compare different models: Test multiple LLMs (like GPT-4, Claude, PaLM) against the same inputs to evaluate their relative performance, response quality, and cost-effectiveness (a sketch of such a comparison follows this list)
- Fine-tune prompts: Systematically test different prompt variations and templates to identify which ones produce the most accurate and relevant outputs for your use case
- Optimize parameters: Experiment with various configuration parameters like temperature, max tokens, and top_p to find the optimal balance between creativity and precision
- Evaluate different approaches: Compare different architectural choices like retrieval methods, chunking strategies, or vector databases using consistent metrics. HoneyHive supports distributed tracing, so any client architecture can be instrumented.
- Track improvements: Monitor how changes to your application affect key metrics over time to ensure continuous improvement. Our GitHub integration allows you to track these improvements alongside changes in your codebase.
- Ensure reliability: Test your application’s performance across diverse scenarios to identify potential edge cases and failure modes. Our GitHub integration allows you to set failure conditions for your experiments.
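Comparing models and sampling parameters typically means defining a small grid of variants and running each one over the same dataset with the same evaluators. The sketch below is illustrative only; the variant structure, names, and model identifiers are assumptions, not part of the HoneyHive SDK.

```python
# Illustrative sketch only -- variant names and structure are assumptions,
# not part of the HoneyHive SDK.
from itertools import product

models = ["gpt-4o", "claude-3-5-sonnet"]  # example model identifiers
temperatures = [0.0, 0.7]

# Build one variant per (model, temperature) combination. Each variant
# would be run over the same dataset and scored by the same evaluators,
# so results stay comparable across the grid.
variants = [
    {"name": f"{model}@t={temp}", "model": model, "temperature": temp}
    for model, temp in product(models, temperatures)
]

for variant in variants:
    print(variant["name"])  # e.g. "gpt-4o@t=0.0"; run one experiment per variant
```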
How are experiment runs tracked?
In HoneyHive, related sessions are linked using the `run_id` field on the session’s `metadata`. The experiment run with that `run_id` is linked to a dataset, and this dataset-run linking allows different experiment runs to be compared against each other at an aggregate level. To compare sessions run on the same input, we link them using the `datapoint_id` field on the `metadata`, which tells us whether they can be compared at the datapoint level.
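Conceptually, each session’s metadata carries both identifiers. Here is a hedged sketch of what that linking data might look like; the identifier values are made up, and only the field names (`run_id`, `datapoint_id`) come from the description above.

```python
# Illustrative sketch of the linking fields described above.
# Identifier values are made up; only the field names come from the docs.
import uuid

run_id = str(uuid.uuid4())  # shared by every session in this experiment run

sessions = []
for datapoint_id in ["dp_001", "dp_002"]:  # one entry per dataset datapoint
    session_metadata = {
        "run_id": run_id,              # groups sessions into one experiment run
        "datapoint_id": datapoint_id,  # lets sessions from different runs on the
                                       # same input be compared datapoint-by-datapoint
    }
    sessions.append(session_metadata)

# Sessions from two different runs that share a datapoint_id can be compared
# on that input; runs linked to the same dataset can be compared in aggregate.
```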
The flexibility here helps you compare the performance of your application across any dimension of configuration you want to test: models, chunking strategies, vector databases, prompts, and so on.