# Run Experiments with HoneyHive Datasets

> Run experiments against HoneyHive datasets with `evaluate(dataset_id=...)`. HoneyHive loads datapoints and aggregates run results server-side while your function and evaluators execute locally.

Run an experiment against a dataset stored in HoneyHive. Pass `dataset_id` to `evaluate()` so HoneyHive loads your datapoints, records the run, and aggregates results in the Experiments dashboard. Your function and client-side evaluators run in your environment; experiment runs and comparisons live in HoneyHive.

HoneyHive datasets let your team upload, curate, version, and reuse the same test cases across local experiments, CI jobs, and dashboard comparisons.

**What you'll run:** a support Q&A agent that answers questions from a HoneyHive dataset with an LLM, scores each response with an LLM-as-judge evaluator, and sends traced results to the Experiments dashboard.

**Time:** ~10 minutes

## Prerequisites

* Python 3.11+
* A HoneyHive dataset with input fields and ground truth fields
* The dataset ID from the HoneyHive Datasets page
* A HoneyHive API key
* An OpenAI API key for the example support agent

To create a HoneyHive API key, go to [**Settings > Project > API Keys**](https://app.us.honeyhive.ai/settings/project/keys) and click **Create API Key**. Copy the key from the modal right away; it is shown only once.

If you do not have a HoneyHive dataset yet, create one first:

<CardGroup cols={3}>
  <Card title="Upload a file" icon="upload" href="/v2/datasets/import#upload-via-ui">
    Import JSON, JSONL, or CSV through the UI
  </Card>

  <Card title="Upload with the SDK" icon="code" href="/v2/datasets/import#upload-via-sdk">
    Create a dataset from Python
  </Card>

  <Card title="Curate from traces" icon="filter" href="/v2/datasets/dataset-curation">
    Build a dataset from production traces
  </Card>
</CardGroup>

This guide assumes each datapoint has:

```json
{
  "inputs": {
    "question": "How do I reset my password?"
  },
  "ground_truth": {
    "answer": "Use the password reset link on the login page."
  }
}
```

Add more datapoints that match your support policy, for example refund or billing questions.

Adapt the field names in the examples to match your dataset.
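Before uploading or adapting a dataset, it can help to confirm that each datapoint matches the shape this guide assumes. The following is a minimal local sketch, not part of the HoneyHive SDK; `check_datapoint` and its default key names are hypothetical and mirror the example datapoint above.

```python
def check_datapoint(datapoint, input_keys=("question",), truth_keys=("answer",)):
    """Return a list of problems found in one datapoint; an empty list means it matches."""
    problems = []
    for section, keys in (("inputs", input_keys), ("ground_truth", truth_keys)):
        value = datapoint.get(section)
        if not isinstance(value, dict):
            problems.append(f"missing or non-object '{section}'")
            continue
        for key in keys:
            # Treat an absent key and an empty value the same way.
            if not value.get(key):
                problems.append(f"'{section}.{key}' is missing or empty")
    return problems


example = {
    "inputs": {"question": "How do I reset my password?"},
    "ground_truth": {"answer": "Use the password reset link on the login page."},
}
print(check_datapoint(example))  # []
print(check_datapoint({"inputs": {"question": "Hi"}}))
```

If your dataset uses different field names, pass them through `input_keys` and `truth_keys` instead of changing the function.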

## Step 1: Install dependencies and set credentials

```bash
pip install "honeyhive[openinference-openai]" openai
export HH_API_KEY="your-honeyhive-api-key"
export OPENAI_API_KEY="your-openai-api-key"
export HH_DATASET_ID="your-dataset-id"
```

```python
from openai import OpenAI

client = OpenAI()
```
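A missing environment variable is a common reason a first run fails, so a small pre-flight check can save a round trip. This is a sketch using only the standard library; `missing_env_vars` is a hypothetical helper, and the variable names are the ones exported above.

```python
import os

REQUIRED_ENV_VARS = ("HH_API_KEY", "OPENAI_API_KEY", "HH_DATASET_ID")


def missing_env_vars(names=REQUIRED_ENV_VARS, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Set these environment variables before running:", ", ".join(missing))
```

Passing a plain dictionary as `env` makes the helper easy to test without touching the real environment.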

<Note>
  Project scope comes from the API key. You do not need to set `HH_PROJECT` or pass a `project` argument. `evaluate()` wraps your function in a chain trace automatically. Pass `instrumentors` to also capture OpenAI model spans inside that function.
</Note>

## Step 2: Define the function to evaluate

`evaluate()` passes the full datapoint dictionary to your function. Read inputs from `datapoint["inputs"]`, call your application logic, and return a JSON-serializable output.

This example evaluates a small support agent: a system prompt plus policy context, then an LLM call that answers the user question.

```python
SUPPORT_POLICY = """
- Password reset: use the password reset link on the login page.
- Refunds: refunds are available within 30 days of purchase.
- Billing questions: email support@example.com.
"""

def answer_question(datapoint):
    question = datapoint["inputs"]["question"]

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a helpful customer support agent. "
                    "Answer using only the policy below. Keep answers concise.\n"
                    f"{SUPPORT_POLICY}"
                ),
            },
            {"role": "user", "content": question},
        ],
        temperature=0,
    )

    return {"answer": response.choices[0].message.content.strip()}
```

Swap the model, prompt, or policy for your own agent, RAG pipeline, or tool-using workflow. The experiment only needs a function that returns JSON-serializable outputs.
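To check the function contract without spending tokens, you can call a stand-in function on one sample datapoint and confirm that the return value survives `json.dumps`. The sketch below is illustrative only: `stub_answer` is a hypothetical replacement for `answer_question` that makes no LLM call, and `is_json_serializable` is a hypothetical helper.

```python
import json


def stub_answer(datapoint):
    # Same signature as answer_question: receives the full datapoint dictionary.
    question = datapoint["inputs"]["question"]
    return {"answer": f"Echo: {question}"}


def is_json_serializable(value):
    """Return True if json.dumps accepts the value, False otherwise."""
    try:
        json.dumps(value)
        return True
    except (TypeError, ValueError):
        return False


sample = {
    "inputs": {"question": "How do I reset my password?"},
    "ground_truth": {"answer": "Use the password reset link on the login page."},
}
output = stub_answer(sample)
print(output, is_json_serializable(output))
```

Values such as sets, custom objects, or open file handles fail this check, so convert them to strings, lists, or dictionaries before returning.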

## Step 3: Write a client-side LLM-as-judge evaluator

Client-side evaluators run in your environment during the experiment. They receive the function output, datapoint inputs, and datapoint ground truth.

LLM answers often paraphrase the expected text, so checking whether the expected string appears in the response is brittle. Use an LLM-as-judge instead to score semantic correctness. This follows the same client-side pattern as [`llm_judge` in CI Regression Detection](/v2/evaluation/ci-regression-detection). For evaluators that run on HoneyHive after ingestion, see [LLM Evaluators](/v2/evaluators/llm).

```python
import json

def answer_quality_judge(outputs, inputs, ground_truth=None):
    expected = (ground_truth or {}).get("answer", "")
    actual = outputs.get("answer", "")
    if not expected:
        return {"score": 0.0, "explanation": "No ground-truth answer to score against"}

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        max_tokens=100,
        messages=[
            {
                "role": "system",
                "content": (
                    'Return JSON only: {"score": 0 or 1, "explanation": "brief reason"}. '
                    "Score 1 when the actual answer conveys the same guidance as the expected answer, "
                    "even if wording differs. Score 0 if it contradicts or misses the key guidance."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Question: {inputs.get('question', '')}\n"
                    f"Expected guidance: {expected}\n"
                    f"Actual answer: {actual}"
                ),
            },
        ],
    )

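    # The message content can be None, so fall back to an empty string.
    text = (response.choices[0].message.content or "").strip()
    # Keep only the outermost JSON object in case the judge adds extra text.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        return {"score": 0.0, "explanation": "Judge returned invalid JSON"}

    try:
        parsed = json.loads(text[start : end + 1])
        score = float(parsed.get("score", 0))
    except (ValueError, TypeError):
        # json.JSONDecodeError is a ValueError; a non-numeric score raises either.
        return {"score": 0.0, "explanation": "Judge returned invalid JSON"}

    return {
        "score": score,
        "explanation": parsed.get("explanation", ""),
    }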
```

**Evaluator signature:** `(outputs, inputs, ground_truth)`. Return a scalar score, or return a dictionary with a scalar `score` field and optional `explanation` (shown in HoneyHive per datapoint).
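Not every check needs an LLM. As a rough illustration of the two return shapes described above, here are two deterministic evaluators with the same signature. Both function names and the `MAX_WORDS` limit are hypothetical choices for this sketch.

```python
MAX_WORDS = 60  # Hypothetical limit for a "concise" support answer.


def answer_not_empty(outputs, inputs, ground_truth=None):
    # Scalar form: 1.0 when the answer has any non-whitespace text.
    return 1.0 if outputs.get("answer", "").strip() else 0.0


def answer_is_concise(outputs, inputs, ground_truth=None):
    # Dictionary form: a scalar score plus a short explanation.
    words = len(outputs.get("answer", "").split())
    return {
        "score": 1.0 if 0 < words <= MAX_WORDS else 0.0,
        "explanation": f"{words} words (limit {MAX_WORDS})",
    }


sample_outputs = {"answer": "Use the password reset link on the login page."}
print(answer_not_empty(sample_outputs, {}))
print(answer_is_concise(sample_outputs, {}))
```

Cheap checks like these can sit alongside the LLM judge in the same `evaluators` list.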

## Step 4: Run the experiment with `dataset_id`

Pass your dataset ID so HoneyHive loads the datapoints for this run:

```python
import os

from honeyhive import evaluate
from openinference.instrumentation.openai import OpenAIInstrumentor

result = evaluate(
    function=answer_question,
    dataset_id=os.environ["HH_DATASET_ID"],
    evaluators=[answer_quality_judge],
    name="honeyhive-dataset-qa-v1",
    instrumentors=[lambda: OpenAIInstrumentor()],
    # server_url="https://your-data-plane-host",  # self-hosted or dedicated deployments only
)

print(f"Run {result.run_id}: success={result.success}")
result.print_table()
```

`evaluate()` fetches the dataset from HoneyHive, runs your function on each datapoint, runs your evaluator, and uploads the traced results.

## Step 5: View and compare results

Open **Experiments** in HoneyHive. The run shows aggregate evaluator scores, per-datapoint outputs, and traces for each execution.

Run the same script again after changing your function, prompt, model, or retrieval strategy. Because both runs use the same HoneyHive dataset, HoneyHive can compare results by datapoint. See [Compare experiment runs](/v2/evaluation/comparing_evals) for side-by-side diffs and regressions.
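HoneyHive performs this comparison in the dashboard. To make the idea of a per-datapoint comparison concrete, the sketch below assumes you have already collected each run's scores into a plain dictionary keyed by datapoint ID. That data layout, the IDs, and `find_regressions` are all hypothetical and are not part of the SDK's result object.

```python
def find_regressions(baseline, candidate):
    """Return {datapoint_id: (old_score, new_score)} for scores that dropped."""
    # Only datapoints present in both runs can be compared.
    shared_ids = sorted(baseline.keys() & candidate.keys())
    return {
        dp_id: (baseline[dp_id], candidate[dp_id])
        for dp_id in shared_ids
        if candidate[dp_id] < baseline[dp_id]
    }


run_v1 = {"dp-1": 1.0, "dp-2": 1.0, "dp-3": 0.0}
run_v2 = {"dp-1": 1.0, "dp-2": 0.0, "dp-3": 1.0}
print(find_regressions(run_v1, run_v2))  # {'dp-2': (1.0, 0.0)}
```

Matching on datapoint ID is what makes the comparison meaningful: the aggregate score is unchanged between these two runs, yet one datapoint regressed.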

## Troubleshooting

| Issue                                             | Fix                                                                                                                           |
| ------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------- |
| `HH_DATASET_ID` is missing                        | Copy the dataset ID from the HoneyHive Datasets page and export it before running the script                                  |
| `OPENAI_API_KEY` is missing                       | Export your OpenAI API key before running the support agent example                                                           |
| Function argument mismatch                        | Define your function as `def fn(datapoint)`, not `def fn(inputs, ground_truth)`                                               |
| Evaluator receives empty ground truth             | Check the dataset field mapping and make sure expected-answer fields are mapped to ground truth                               |
| Judge returns invalid JSON                        | Keep `temperature=0` and require JSON-only output in the judge system prompt                                                  |
| Dashboard results appear after a short delay      | Trace ingestion is asynchronous, so the run can appear before every trace is visible. Wait a moment, then refresh the page    |
| Server-side evaluator scores missing from the run | Configure evaluators in HoneyHive. They run on ingested traces and do not belong in the `evaluators` argument to `evaluate()` |

## Next steps

<CardGroup cols={2}>
  <Card title="Compare Experiments" icon="code-compare" href="/v2/evaluation/comparing_evals">
    Compare runs against the same HoneyHive dataset
  </Card>

  <Card title="CI Regression Detection" icon="github" href="/v2/evaluation/ci-regression-detection">
    Run HoneyHive dataset experiments in pull request checks
  </Card>

  <Card title="Dataset Curation" icon="filter" href="/v2/datasets/dataset-curation">
    Grow your eval set from production traces
  </Card>

  <Card title="Client-Side Evaluators" icon="user-check" href="/v2/evaluators/client_side">
    Go deeper on evaluator functions for experiments and CI
  </Card>

  <Card title="Server-Side LLM Evaluators" icon="robot" href="/v2/evaluators/llm">
    Run LLM judges on HoneyHive instead of in your script
  </Card>
</CardGroup>
