Introduction
Get started running experiments with HoneyHive
Experiments in HoneyHive help you systematically test and improve your AI applications. Whether you’re iterating on prompts, comparing models, or optimizing your RAG pipeline, experiments provide a structured way to measure improvements and ensure reliability.
What is an experiment?
An experiment in HoneyHive consists of three core components:
- Application Logic: The components you want to evaluate - this could be different models, prompts, retrieval strategies, or any configuration you want to test.
- Dataset: The inputs you’re evaluating against. Using consistent test data ensures you can reliably compare different versions of your application.
- Evaluators: The metrics and criteria you’re measuring. Evaluators help quantify improvements and catch regressions across different configurations.
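To make these components concrete, here is a minimal, hypothetical sketch of what each piece might look like for a simple question-answering app. The function names, dataset fields, and evaluator shown here are illustrative assumptions, not HoneyHive SDK calls; see the SDK reference for how to register each piece in an actual experiment.

```python
# 1. Application logic: the pipeline under test (hypothetical stub).
def answer_question(inputs: dict) -> str:
    # In a real app this would call your model or RAG pipeline.
    return f"stub answer for: {inputs['question']}"

# 2. Dataset: consistent inputs (and optional ground truth) to evaluate against.
dataset = [
    {"question": "What is HoneyHive?", "expected": "An AI evaluation platform."},
    {"question": "What does a run_id identify?", "expected": "A single experiment run."},
]

# 3. Evaluator: a metric computed on each output.
def exact_match(output: str, datapoint: dict) -> float:
    return 1.0 if output.strip() == datapoint["expected"].strip() else 0.0
```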
Why run experiments with HoneyHive?
Experiments provide a systematic approach to improving your AI applications:
- Iterate with confidence: Test prompt variations, model configurations, and architectural changes against consistent metrics
- Track improvements: Monitor how changes affect key metrics over time and ensure continuous improvement
- Automate quality checks: With GitHub integration, automatically run experiments on code changes and set performance thresholds
- Compare approaches: Evaluate different models, retrieval methods, or chunking strategies using standardized metrics
- Ensure reliability: Catch potential issues by testing across diverse scenarios before deploying to production
How do experiments work?
HoneyHive uses metadata linking to track and organize experiment traces:
Trace Metadata and Linking
Every trace in HoneyHive contains metadata that links it to specific experiments and inputs. The run_id in the metadata links related test traces together, while the datapoint_id connects traces that were run on the same input.
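The sketch below illustrates this linking with two hypothetical traces. The field names mirror the concepts above; the exact trace schema in HoneyHive may differ.

```python
# Two illustrative traces from different configurations run on the same input.
trace_a = {
    "metadata": {
        "run_id": "run_gpt4",      # experiment run this trace belongs to
        "datapoint_id": "dp_001",  # input this trace was generated from
    },
    "output": "Answer from configuration A",
}

trace_b = {
    "metadata": {
        "run_id": "run_claude",    # a different run / configuration
        "datapoint_id": "dp_001",  # same input -> directly comparable
    },
    "output": "Answer from configuration B",
}

# Traces sharing a run_id aggregate into one experiment run;
# traces sharing a datapoint_id compare configurations on the same input.
comparable = trace_a["metadata"]["datapoint_id"] == trace_b["metadata"]["datapoint_id"]
```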
Experiment Structure
- Experiment-Dataset Relationship
  - Each experiment run (identified by run_id) is linked to a specific dataset
  - This dataset-run linking enables aggregate comparison across different configurations
  - Multiple runs can use the same dataset, allowing you to test different approaches against consistent inputs
- Trace Comparison
  - Traces with the same datapoint_id represent different configurations tested on identical inputs
  - This enables direct comparison of performance for specific inputs
  - Example: Compare how different LLM models handle the same prompt, or how different RAG configurations retrieve for the same query
- Performance Tracking
  - Evaluators measure performance metrics for each trace
  - Results can be analyzed at both individual trace and aggregate run levels
  - Metrics are tracked over time to identify improvements or regressions
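The two analysis levels above can be pictured as a simple group-by over evaluated traces. This is a conceptual sketch with made-up scores, not a HoneyHive API: grouping by run_id gives aggregate run metrics, while grouping by datapoint_id lines up runs side by side on the same input.

```python
from collections import defaultdict

# Hypothetical evaluated traces: (run_id, datapoint_id, evaluator score).
results = [
    ("run_gpt4",   "dp_001", 0.9),
    ("run_gpt4",   "dp_002", 0.7),
    ("run_claude", "dp_001", 0.8),
    ("run_claude", "dp_002", 0.95),
]

# Aggregate run level: average score per run_id.
per_run = defaultdict(list)
for run_id, _, score in results:
    per_run[run_id].append(score)
run_averages = {run: sum(s) / len(s) for run, s in per_run.items()}

# Individual trace level: compare runs on the same datapoint_id.
per_datapoint = defaultdict(dict)
for run_id, datapoint_id, score in results:
    per_datapoint[datapoint_id][run_id] = score

print(run_averages)   # {'run_gpt4': 0.8, 'run_claude': 0.875}
print(per_datapoint)  # {'dp_001': {'run_gpt4': 0.9, 'run_claude': 0.8}, ...}
```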
Integration with Development Workflow
The experiment framework integrates with GitHub to:
- Trigger automated experiment runs on code changes
- Set performance thresholds that must be met
- Track metric improvements across commits
- Alert on performance regressions
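As one way to picture the threshold idea, the following hypothetical script could run as a CI step and fail the build when a run's aggregate score falls below a target. The fetch_run_average stub and the EXPERIMENT_RUN_SCORE variable are assumptions for illustration; in practice the score would come from HoneyHive for the run triggered by your workflow.

```python
import os
import sys

THRESHOLD = 0.8  # minimum acceptable aggregate score for the run

def fetch_run_average(run_id: str) -> float:
    # Placeholder: read the aggregate score from an env var set by an earlier step.
    return float(os.environ.get("EXPERIMENT_RUN_SCORE", "0.0"))

def main(run_id: str) -> None:
    score = fetch_run_average(run_id)
    if score < THRESHOLD:
        print(f"Run {run_id} scored {score:.2f}, below threshold {THRESHOLD}")
        sys.exit(1)  # non-zero exit fails the CI job
    print(f"Run {run_id} scored {score:.2f}, threshold met")

if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else "local-run")
```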
This metadata-driven approach to testing and evaluation lets you compare performance across any configuration dimension - whether you’re testing different prompts, models, or entire pipeline architectures.