Get started running experiments in HoneyHive
Experiments in HoneyHive help you systematically test and improve your AI applications. Whether you’re iterating on prompts, comparing models, or optimizing your RAG pipeline, experiments provide a structured way to measure improvements and ensure reliability.
An experiment in HoneyHive consists of three core components (a sketch follows this list):
Application Logic: The core function you want to evaluate - this could be different models, prompts, retrieval strategies, or an end-to-end agent.
Dataset: A set of inputs (and optionally target outputs) you’re evaluating against. Using consistent test cases ensures you can reliably compare different versions of your application as you iterate.
Evaluators: The metrics and criteria you’re measuring. Evaluators help quantify improvements and catch regressions across versions as you iterate. These can be automated (e.g. code or LLM evaluators) or performed by a human.
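To make these components concrete, here is a minimal, framework-agnostic sketch in Python. The `run_app` function, the inline dataset, and the `contains_answer` evaluator are hypothetical placeholders rather than HoneyHive APIs; in practice you would wire equivalent pieces into HoneyHive’s experiment tooling.

```python
# Minimal, framework-agnostic sketch of the three components.
# run_app, the inline dataset, and contains_answer are hypothetical placeholders.

def run_app(inputs: dict) -> str:
    """Application logic under test (stand-in for your model, prompt, or agent)."""
    return f"Answer to: {inputs['question']}"

# Dataset: consistent inputs plus optional target outputs (ground truths).
dataset = [
    {"inputs": {"question": "What is HoneyHive?"},
     "ground_truths": {"answer": "an AI evaluation and observability platform"}},
    {"inputs": {"question": "What is an evaluator?"},
     "ground_truths": {"answer": "a metric that scores application outputs"}},
]

# Evaluator: scores an output against the datapoint it was produced from.
def contains_answer(output: str, ground_truths: dict) -> bool:
    return ground_truths["answer"].lower() in output.lower()

# Run the application over every datapoint and score the outputs.
scores = [
    contains_answer(run_app(dp["inputs"]), dp["ground_truths"])
    for dp in dataset
]
print(f"Pass rate: {sum(scores) / len(scores):.0%}")
```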
Every trace in HoneyHive contains metadata that links it to the specific experiment and datapoint it was run against (i.e. an inputs and ground_truths pair). The run_id in the metadata ties together all traces from the same experiment run, while the datapoint_id connects traces that were run on the same test case / datapoint.
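As an illustration of how those two IDs group traces, the snippet below compares outputs across runs for the same test case. The trace records and values are hypothetical; only the run_id and datapoint_id field names come from the description above.

```python
from collections import defaultdict

# Hypothetical trace records carrying the two metadata fields described above.
traces = [
    {"run_id": "run_a", "datapoint_id": "dp_1", "output": "Paris"},
    {"run_id": "run_b", "datapoint_id": "dp_1", "output": "Paris, France"},
    {"run_id": "run_a", "datapoint_id": "dp_2", "output": "42"},
]

# run_id groups traces from the same experiment run; datapoint_id groups
# traces produced from the same test case, so grouping by it lets you
# compare how different runs handled identical inputs.
by_datapoint = defaultdict(dict)
for trace in traces:
    by_datapoint[trace["datapoint_id"]][trace["run_id"]] = trace["output"]

print(by_datapoint["dp_1"])  # {'run_a': 'Paris', 'run_b': 'Paris, France'}
```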
The experiment framework integrates with GitHub (see the CI gate sketch after this list) to:
Trigger automated experiment runs on code changes
Set performance thresholds that must be met
Track metric improvements across commits
Alert on performance regressions
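As a rough sketch of how a performance threshold can gate a CI run, the script below fails the job when an aggregate score falls short. The run_experiment function and the 0.90 threshold are hypothetical stand-ins; this is not HoneyHive’s GitHub integration itself, just a generic illustration of the threshold-gating idea.

```python
import sys

# Hypothetical CI gate: run_experiment() stands in for whatever executes your
# experiment (e.g. the sketch above) and returns an aggregate score in [0, 1].
def run_experiment() -> float:
    return 0.92  # placeholder value for illustration

THRESHOLD = 0.90  # performance threshold the run must meet

score = run_experiment()
print(f"Experiment score: {score:.2f} (threshold {THRESHOLD:.2f})")

# A non-zero exit code fails the CI job, blocking the change on a regression.
if score < THRESHOLD:
    sys.exit(1)
```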
This metadata-driven approach to testing and evaluation lets you compare performance across any configuration dimension - whether you’re testing different prompts, models, or entire pipeline architectures.