dataset_id to evaluate() so HoneyHive loads your datapoints, records the run, and aggregates results in the Experiments dashboard. Your function and client-side evaluators run in your environment; experiment runs and comparisons live in HoneyHive.
HoneyHive datasets let your team upload, curate, version, and reuse the same test cases across local experiments, CI jobs, and dashboard comparisons.
What you’ll run: a support Q&A agent that answers questions from a HoneyHive dataset with an LLM, scores each response with an LLM-as-judge evaluator, and sends traced results to the Experiments dashboard.
Time: ~10 minutes
Prerequisites
- Python 3.11+
- A HoneyHive dataset with input fields and ground truth fields
- The dataset ID from the HoneyHive Datasets page
- A HoneyHive API key
- An OpenAI API key for the example support agent
Upload a file
Import JSON, JSONL, or CSV through the UI
Upload with the SDK
Create a dataset from Python
Curate from traces
Build a dataset from production traces
Step 1: Install dependencies and set credentials
Project scope comes from the API key. You do not need to set
HH_PROJECT or pass a project argument. evaluate() wraps your function in a chain trace automatically. Pass instrumentors to also capture OpenAI model spans inside that function.Step 2: Define the function to evaluate
evaluate() passes the full datapoint dictionary to your function. Read inputs from datapoint["inputs"], call your application logic, and return a JSON-serializable output.
This example evaluates a small support agent: a system prompt plus policy context, then an LLM call that answers the user question.
Step 3: Write a client-side LLM-as-judge evaluator
Client-side evaluators run in your environment during the experiment. They receive the function output, datapoint inputs, and datapoint ground truth. LLM answers are paraphrased, so checking whether the expected string appears in the response is brittle. Use an LLM-as-judge instead to score semantic correctness. This follows the same client-side pattern asllm_judge in CI Regression Detection. For evaluators that run on HoneyHive after ingestion, see LLM Evaluators.
(outputs, inputs, ground_truth). Return a scalar score, or return a dictionary with a scalar score field and optional explanation (shown in HoneyHive per datapoint).
Step 4: Run the experiment with dataset_id
Pass your dataset ID so HoneyHive loads the datapoints for this run:evaluate() fetches the dataset from HoneyHive, runs your function on each datapoint, runs your evaluator, and uploads the traced results.
Step 5: View and compare results
Open Experiments in HoneyHive. The run shows aggregate evaluator scores, per-datapoint outputs, and traces for each execution. Run the same script again after changing your function, prompt, model, or retrieval strategy. Because both runs use the same HoneyHive dataset, HoneyHive can compare results by datapoint. See Compare experiment runs for side-by-side diffs and regressions.Troubleshooting
| Issue | Fix |
|---|---|
HH_DATASET_ID is missing | Copy the dataset ID from the HoneyHive Datasets page and export it before running the script |
OPENAI_API_KEY is missing | Export your OpenAI API key before running the support agent example |
| Function argument mismatch | Define your function as def fn(datapoint), not def fn(inputs, ground_truth) |
| Evaluator receives empty ground truth | Check the dataset field mapping and make sure expected-answer fields are mapped to ground truth |
| Judge returns invalid JSON | Keep temperature=0 and require JSON-only output in the judge system prompt |
| Dashboard results appear after a short delay | Trace ingestion is asynchronous, so the run can arrive before every trace is visible |
| Server-side evaluator scores missing from the run | Configure evaluators in HoneyHive. They run on ingested traces and do not belong in the evaluators argument to evaluate() |
Next steps
Compare Experiments
Compare runs against the same HoneyHive dataset
CI Regression Detection
Run HoneyHive dataset experiments in pull request checks
Dataset Curation
Grow your eval set from production traces
Client-Side Evaluators
Go deeper on evaluator functions for experiments and CI
Server-Side LLM Evaluators
Run LLM judges on HoneyHive instead of in your script

