Run an experiment against a dataset stored in HoneyHive. Pass dataset_id to evaluate() so HoneyHive loads your datapoints, records the run, and aggregates results in the Experiments dashboard. Your function and client-side evaluators run in your environment; experiment runs and comparisons live in HoneyHive.

HoneyHive datasets let your team upload, curate, version, and reuse the same test cases across local experiments, CI jobs, and dashboard comparisons.

What you'll run: a support Q&A agent that answers questions from a HoneyHive dataset with an LLM, scores each response with an LLM-as-judge evaluator, and sends traced results to the Experiments dashboard.

Time: ~10 minutes

Prerequisites

  • Python 3.11+
  • A HoneyHive dataset with input fields and ground truth fields
  • The dataset ID from the HoneyHive Datasets page
  • A HoneyHive API key
  • An OpenAI API key for the example support agent
To create a HoneyHive API key, go to Settings > Project > API Keys and click Create API Key. Copy the key from the modal; it is shown only once.

If you do not have a HoneyHive dataset yet, create one first:

  • Upload a file: import JSON, JSONL, or CSV through the UI
  • Upload with the SDK: create a dataset from Python
  • Curate from traces: build a dataset from production traces

This guide assumes each datapoint has:
{
  "inputs": {
    "question": "How do I reset my password?"
  },
  "ground_truth": {
    "answer": "Use the password reset link on the login page."
  }
}
Add more datapoints that match your support policy, for example refund or billing questions. Adapt the field names in the examples to match your dataset.
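
Before running an experiment, it can help to check that a local copy of your datapoints matches this shape. The sketch below is a hypothetical helper (validate_datapoint is not part of the HoneyHive SDK) and assumes the question and answer field names used in this guide; pass different keys if your dataset uses other names.

```python
def validate_datapoint(datapoint, input_key="question", truth_key="answer"):
    """Return a list of problems found in one datapoint (empty list means OK)."""
    problems = []
    inputs = datapoint.get("inputs")
    ground_truth = datapoint.get("ground_truth")
    # Each section must be a dictionary containing a non-empty expected field.
    if not isinstance(inputs, dict) or not inputs.get(input_key):
        problems.append(f"missing inputs.{input_key}")
    if not isinstance(ground_truth, dict) or not ground_truth.get(truth_key):
        problems.append(f"missing ground_truth.{truth_key}")
    return problems


example = {
    "inputs": {"question": "How do I reset my password?"},
    "ground_truth": {"answer": "Use the password reset link on the login page."},
}
print(validate_datapoint(example))  # prints []
```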

Step 1: Install dependencies and set credentials

pip install "honeyhive[openinference-openai]" openai
export HH_API_KEY="your-honeyhive-api-key"
export OPENAI_API_KEY="your-openai-api-key"
export HH_DATASET_ID="your-dataset-id"

Then create the OpenAI client at the top of your Python script. It reads OPENAI_API_KEY from the environment:
from openai import OpenAI

client = OpenAI()
Project scope comes from the API key. You do not need to set HH_PROJECT or pass a project argument. evaluate() wraps your function in a chain trace automatically. Pass instrumentors to also capture OpenAI model spans inside that function.
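
A missing environment variable is easier to diagnose if the script checks for it up front. The following is a minimal sketch, assuming the three variable names exported in this step; missing_env_vars is a hypothetical helper, not a HoneyHive function.

```python
import os

# Variable names used in this guide.
REQUIRED_VARS = ("HH_API_KEY", "OPENAI_API_KEY", "HH_DATASET_ID")


def missing_env_vars(environ=os.environ):
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


# Demonstration with an explicit mapping instead of the real environment.
print(missing_env_vars({"HH_API_KEY": "example"}))
# prints ['OPENAI_API_KEY', 'HH_DATASET_ID']
```

In a real script, call missing_env_vars() with no arguments before the experiment starts and exit with a clear message if the list is not empty.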

Step 2: Define the function to evaluate

evaluate() passes the full datapoint dictionary to your function. Read inputs from datapoint["inputs"], call your application logic, and return a JSON-serializable output. This example evaluates a small support agent: a system prompt plus policy context, then an LLM call that answers the user question.
SUPPORT_POLICY = """
- Password reset: use the password reset link on the login page.
- Refunds: refunds are available within 30 days of purchase.
- Billing questions: email support@example.com.
"""

def answer_question(datapoint):
    question = datapoint["inputs"]["question"]

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a helpful customer support agent. "
                    "Answer using only the policy below. Keep answers concise.\n"
                    f"{SUPPORT_POLICY}"
                ),
            },
            {"role": "user", "content": question},
        ],
        temperature=0,
    )

    return {"answer": response.choices[0].message.content.strip()}
Swap the model, prompt, or policy for your own agent, RAG pipeline, or tool-using workflow. The experiment only needs a function that returns JSON-serializable outputs.
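
To illustrate that contract, here is a sketch of a stand-in function that takes the same datapoint argument but makes no API call, which can be handy for smoke-testing a pipeline. Both keyword_answer and is_json_serializable are hypothetical names introduced for this example, and the canned answers simply mirror the policy text above.

```python
import json


def keyword_answer(datapoint):
    """Hypothetical stand-in for answer_question that needs no API call."""
    question = datapoint["inputs"]["question"].lower()
    if "password" in question:
        answer = "Use the password reset link on the login page."
    elif "refund" in question:
        answer = "Refunds are available within 30 days of purchase."
    else:
        answer = "Email support@example.com."
    return {"answer": answer}


def is_json_serializable(value):
    """Return True when json.dumps can encode the value."""
    try:
        json.dumps(value)
        return True
    except (TypeError, ValueError):
        return False


output = keyword_answer({"inputs": {"question": "How do I reset my password?"}})
print(output["answer"], is_json_serializable(output))
```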

Step 3: Write a client-side LLM-as-judge evaluator

Client-side evaluators run in your environment during the experiment. They receive the function output, datapoint inputs, and datapoint ground truth.

LLM answers often paraphrase the expected text, so checking whether the expected string appears in the response is brittle. Use an LLM-as-judge instead to score semantic correctness. This follows the same client-side pattern as llm_judge in CI Regression Detection. For evaluators that run on HoneyHive after ingestion, see LLM Evaluators.
import json

def answer_quality_judge(outputs, inputs, ground_truth=None):
    expected = (ground_truth or {}).get("answer", "")
    actual = outputs.get("answer", "")
    if not expected:
        return {"score": 0.0, "explanation": "No ground-truth answer to score against"}

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        max_tokens=100,
        messages=[
            {
                "role": "system",
                "content": (
                    'Return JSON only: {"score": 0 or 1, "explanation": "brief reason"}. '
                    "Score 1 when the actual answer conveys the same guidance as the expected answer, "
                    "even if wording differs. Score 0 if it contradicts or misses the key guidance."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Question: {inputs.get('question', '')}\n"
                    f"Expected guidance: {expected}\n"
                    f"Actual answer: {actual}"
                ),
            },
        ],
    )

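    text = (response.choices[0].message.content or "").strip()
    # Extract the outermost JSON object in case the judge adds surrounding text.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        return {"score": 0.0, "explanation": "Judge returned invalid JSON"}

    try:
        parsed = json.loads(text[start : end + 1])
        score = float(parsed.get("score", 0))
    except (json.JSONDecodeError, TypeError, ValueError):
        # Malformed JSON or a non-numeric score counts as a failed judgment.
        return {"score": 0.0, "explanation": "Judge returned invalid JSON"}

    return {"score": score, "explanation": parsed.get("explanation", "")}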
Evaluator signature: (outputs, inputs, ground_truth). Return a scalar score, or return a dictionary with a scalar score field and optional explanation (shown in HoneyHive per datapoint).
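
For comparison with the dictionary-returning judge, the sketch below shows the scalar form of the same signature. concise_answer and MAX_WORDS are hypothetical names chosen for this example; the 40-word threshold is an arbitrary assumption, not a HoneyHive default.

```python
MAX_WORDS = 40  # arbitrary threshold chosen for illustration


def concise_answer(outputs, inputs, ground_truth=None):
    """Return 1.0 when the answer is non-empty and at most MAX_WORDS long, else 0.0."""
    words = outputs.get("answer", "").split()
    return 1.0 if 0 < len(words) <= MAX_WORDS else 0.0


print(concise_answer({"answer": "Use the password reset link."}, {"question": "..."}))
# prints 1.0
```

A deterministic check like this costs nothing to run, so it can sit alongside an LLM judge in the same evaluators list.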

Step 4: Run the experiment with dataset_id

Pass your dataset ID so HoneyHive loads the datapoints for this run:
import os

from honeyhive import evaluate
from openinference.instrumentation.openai import OpenAIInstrumentor

result = evaluate(
    function=answer_question,
    dataset_id=os.environ["HH_DATASET_ID"],
    evaluators=[answer_quality_judge],
    name="honeyhive-dataset-qa-v1",
    instrumentors=[lambda: OpenAIInstrumentor()],
    # server_url="https://your-data-plane-host",  # self-hosted or dedicated deployments only
)

print(f"Run {result.run_id}: success={result.success}")
result.print_table()
evaluate() fetches the dataset from HoneyHive, runs your function on each datapoint, runs your evaluator, and uploads the traced results.
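
Conceptually, that flow resembles the loop below. This is only an illustrative sketch, not HoneyHive's implementation: run_locally is a hypothetical helper that skips dataset fetching, tracing, and upload, and it assumes each evaluator returns either a number or a dictionary with a score field.

```python
def run_locally(function, datapoints, evaluators):
    """Run the function and every evaluator on each datapoint; return one row per datapoint."""
    rows = []
    for datapoint in datapoints:
        outputs = function(datapoint)
        scores = {}
        for evaluator in evaluators:
            result = evaluator(
                outputs, datapoint.get("inputs", {}), datapoint.get("ground_truth")
            )
            # Accept both scalar results and {"score": ...} dictionaries.
            scores[evaluator.__name__] = (
                result["score"] if isinstance(result, dict) else result
            )
        rows.append({"outputs": outputs, "scores": scores})
    return rows


def echo_answer(datapoint):
    return {"answer": datapoint["inputs"]["question"].upper()}


def exact_match(outputs, inputs, ground_truth=None):
    expected = (ground_truth or {}).get("answer", "")
    return 1.0 if outputs.get("answer") == expected else 0.0


datapoints = [
    {"inputs": {"question": "ping"}, "ground_truth": {"answer": "PING"}},
    {"inputs": {"question": "pong"}, "ground_truth": {"answer": "nope"}},
]
rows = run_locally(echo_answer, datapoints, [exact_match])
mean_score = sum(row["scores"]["exact_match"] for row in rows) / len(rows)
print(mean_score)  # prints 0.5
```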

Step 5: View and compare results

Open Experiments in HoneyHive. The run shows aggregate evaluator scores, per-datapoint outputs, and traces for each execution. Run the same script again after changing your function, prompt, model, or retrieval strategy. Because both runs use the same HoneyHive dataset, HoneyHive can compare results by datapoint. See Compare experiment runs for side-by-side diffs and regressions.
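
The idea behind a per-datapoint comparison can be sketched in a few lines. The example assumes you have already collected each run's scores into a dictionary keyed by datapoint ID; that structure, the IDs, and find_regressions are hypothetical and do not reflect how HoneyHive stores or exposes results.

```python
def find_regressions(baseline_scores, candidate_scores):
    """Return the datapoint IDs whose score dropped from baseline to candidate."""
    # Only datapoints present in both runs can be compared.
    shared = baseline_scores.keys() & candidate_scores.keys()
    return sorted(
        dp_id for dp_id in shared if candidate_scores[dp_id] < baseline_scores[dp_id]
    )


baseline = {"dp-1": 1.0, "dp-2": 1.0, "dp-3": 0.0}
candidate = {"dp-1": 1.0, "dp-2": 0.0, "dp-3": 1.0}
print(find_regressions(baseline, candidate))  # prints ['dp-2']
```

Matching on datapoint ID is what makes the comparison meaningful: two runs with the same average score can still differ on which questions they get right.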

Troubleshooting

| Issue | Fix |
| --- | --- |
| HH_DATASET_ID is missing | Copy the dataset ID from the HoneyHive Datasets page and export it before running the script |
| OPENAI_API_KEY is missing | Export your OpenAI API key before running the support agent example |
| Function argument mismatch | Define your function as def fn(datapoint), not def fn(inputs, ground_truth) |
| Evaluator receives empty ground truth | Check the dataset field mapping and make sure expected-answer fields are mapped to ground truth |
| Judge returns invalid JSON | Keep temperature=0 and require JSON-only output in the judge system prompt |
| Dashboard results appear after a short delay | Trace ingestion is asynchronous, so the run can arrive before every trace is visible |
| Server-side evaluator scores missing from the run | Configure evaluators in HoneyHive. They run on ingested traces and do not belong in the evaluators argument to evaluate() |
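
For the function argument mismatch, one option is to wrap an existing two-argument function rather than rewrite it. The adapter below is a minimal sketch; as_datapoint_function and legacy_answer are hypothetical names, and the wrapper assumes the inputs and ground_truth keys used in this guide.

```python
def as_datapoint_function(fn):
    """Adapt fn(inputs, ground_truth) to the single-argument fn(datapoint) form."""

    def wrapper(datapoint):
        return fn(datapoint.get("inputs", {}), datapoint.get("ground_truth"))

    # Copy only the name. functools.wraps would also set __wrapped__, which makes
    # inspect.signature report the original two-argument signature.
    wrapper.__name__ = getattr(fn, "__name__", "wrapped_function")
    return wrapper


def legacy_answer(inputs, ground_truth):
    return {"answer": f"You asked: {inputs['question']}"}


adapted = as_datapoint_function(legacy_answer)
print(adapted({"inputs": {"question": "Where is my invoice?"}}))
# prints {'answer': 'You asked: Where is my invoice?'}
```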

Next steps

  • Compare Experiments: compare runs against the same HoneyHive dataset
  • CI Regression Detection: run HoneyHive dataset experiments in pull request checks
  • Dataset Curation: grow your eval set from production traces
  • Client-Side Evaluators: go deeper on evaluator functions for experiments and CI
  • Server-Side LLM Evaluators: run LLM judges on HoneyHive instead of in your script