# Concepts

> How experiments work in HoneyHive

AI quality is multidimensional. A response can be accurate but unhelpful, or fluent but hallucinated. Some dimensions (latency, format compliance, toxicity) are measurable by code. Others (helpfulness, brand voice, domain correctness) require human judgment or LLM-based assessment. Useful evaluation needs both, flowing through the same pipeline and producing comparable, trackable metrics.

HoneyHive's experiment framework is built on this idea. An experiment runs your function against a dataset and scores the outputs with evaluators - automated, LLM-based, or human. You define what to test, how to run it, and how to score it, each independently. Critically, `evaluate()` produces fully traced sessions using the same OpenTelemetry infrastructure as production, so evaluation and observability aren't separate workflows - they're the same system.

For a hands-on walkthrough, see the [Experiments Quickstart](/v2/introduction/experiments-quickstart).

***

## Experiment Structure

Every experiment combines three independent parts:

```mermaid theme={null}
graph LR
    A[Dataset] --> B[Your Function]
    B --> C[Evaluators]
    C --> D[Results]
```

| Component      | What it is                                  | Interface                                                                               |
| -------------- | ------------------------------------------- | --------------------------------------------------------------------------------------- |
| **Dataset**    | Test cases with inputs and expected outputs | List of `{inputs, ground_truth}` dicts, or a `dataset_id` referencing a managed dataset |
| **Function**   | Your application logic                      | `def fn(datapoint)` → output dict                                                       |
| **Evaluators** | Scoring functions that assess outputs       | `def evaluator(outputs, inputs, ground_truth)` → score                                  |

The **function** is whatever you're trying to evaluate - a single LLM call, a RAG pipeline, a multi-agent system, or an API wrapper around an external service. It receives a datapoint and returns an output dict. There are no constraints on what happens inside: call models, query databases, invoke tools, orchestrate sub-agents. If your code can run it, `evaluate()` can test it.

These three components are deliberately decoupled. You can reuse a dataset across multiple functions, run the same function against different datasets, and swap evaluators without changing anything else.

Here's a complete example:

```python theme={null}
from honeyhive import evaluate

# Dataset: each datapoint pairs `inputs` with the expected `ground_truth`.
dataset = [
    {"inputs": {"text": "I was charged twice"}, "ground_truth": {"intent": "billing"}},
    {"inputs": {"text": "App crashes on login"}, "ground_truth": {"intent": "technical"}},
]

# Function: receives one datapoint and returns an output dict.
# `call_llm` is a placeholder for your own model client; it is not part of honeyhive.
def classify(datapoint):
    text = datapoint["inputs"]["text"]
    response = call_llm(f"Classify intent: {text}. Reply: billing, technical, account, or general.")
    return {"intent": response.strip().lower()}

# Evaluator: scores one output against its ground truth (1.0 = match, 0.0 = mismatch).
def intent_match(outputs, inputs, ground_truth):
    return 1.0 if outputs.get("intent") == ground_truth.get("intent") else 0.0

result = evaluate(
    function=classify,
    dataset=dataset,
    evaluators=[intent_match],
    name="classifier-v1"
)
```
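
To make the decoupling concrete, the sketch below runs two different functions against the same dataset and evaluator. It does not call HoneyHive: `run_locally` and `accuracy` are hypothetical local helpers that only mimic the dataset → function → evaluators flow, and both classifiers are keyword stand-ins rather than real model calls.

```python
from statistics import mean

def run_locally(function, dataset, evaluators):
    """Hypothetical stand-in for an experiment run (not the honeyhive API):
    call the function on each datapoint, then score the output."""
    rows = []
    for datapoint in dataset:
        outputs = function(datapoint)
        scores = {
            ev.__name__: ev(outputs, datapoint["inputs"], datapoint["ground_truth"])
            for ev in evaluators
        }
        rows.append({"inputs": datapoint["inputs"], "outputs": outputs, "scores": scores})
    return rows

def accuracy(rows, metric):
    """Average one evaluator's score across all datapoints."""
    return mean(row["scores"][metric] for row in rows)

dataset = [
    {"inputs": {"text": "I was charged twice"}, "ground_truth": {"intent": "billing"}},
    {"inputs": {"text": "App crashes on login"}, "ground_truth": {"intent": "technical"}},
]

# Two interchangeable functions with the same (datapoint) -> dict interface.
def keyword_classifier(datapoint):
    text = datapoint["inputs"]["text"].lower()
    return {"intent": "billing" if "charged" in text else "technical"}

def always_general(datapoint):
    return {"intent": "general"}

# One evaluator, reused unchanged for both functions.
def intent_match(outputs, inputs, ground_truth):
    return 1.0 if outputs.get("intent") == ground_truth.get("intent") else 0.0

keyword_rows = run_locally(keyword_classifier, dataset, [intent_match])
baseline_rows = run_locally(always_general, dataset, [intent_match])

keyword_accuracy = accuracy(keyword_rows, "intent_match")
baseline_accuracy = accuracy(baseline_rows, "intent_match")
```

Because the dataset and evaluator are untouched between the two runs, any difference in the scores can be attributed to the function alone.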

***

## Built-in Tracing

When you call `evaluate()`, your function is automatically traced using HoneyHive's OpenTelemetry-based tracing. Every datapoint execution produces a full traced session, identical in structure to production traces, with no additional setup.

This means all tracing primitives work inside your function:

* **Auto-instrumentation** - LLM calls via OpenAI, Anthropic, etc. are captured automatically if you've configured [instrumentors](/v2/integrations/google-adk)
* **Custom spans** - Use `@trace` to create spans for any step in your pipeline
* **Enrichment** - Call `enrich_span()` to attach metrics, metadata, or feedback to any span
* **Nested traces** - Multi-agent orchestration, sub-agent calls, and tool chains are traced with full parent-child relationships

```python theme={null}
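from honeyhive import evaluate, trace, enrich_span

# `call_llm`, `my_dataset`, and `quality_check` are placeholders for your own
# model client, dataset, and evaluator.

@trace
def create_plan(query):
    plan = call_llm(f"Create a plan for: {query}")
    # Attach a custom metric to the current span.
    enrich_span(metrics={"plan_steps": plan.count("\n") + 1})
    return plan

@trace
def execute(plan):
    result = call_llm(f"Execute this plan:\n{plan}")
    return result

def agent(datapoint):
    query = datapoint["inputs"]["query"]
    plan = create_plan(query)
    # Return an output dict, matching the function interface described above.
    return {"result": execute(plan)}

result = evaluate(
    function=agent,
    dataset=my_dataset,
    evaluators=[quality_check],
    name="agent-v2"
)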
```

After this runs, every datapoint has a fully traced session in the dashboard. You can inspect each LLM call, see latency and token usage per step, and drill into the exact execution path, the same way you would for production traffic.

***

## Evaluator Types

HoneyHive supports four evaluator types, differentiated by what runs the evaluation logic.

| Type             | What runs the logic  | Can run                             | Best for                                         |
| ---------------- | -------------------- | ----------------------------------- | ------------------------------------------------ |
| **Code**         | Deterministic Python | Client-side or server-side          | Format checks, metrics, validation               |
| **LLM-as-judge** | An LLM               | Server-side (or custom client-side) | Subjective quality, relevance, tone              |
| **Human**        | Domain experts       | Server-side only                    | Edge cases, compliance, ground truth curation    |
| **Composite**    | Aggregation formula  | Server-side only                    | Weighted quality indexes, tiered pass/fail gates |

For implementation details: [Code](/v2/evaluators/python) | [LLM](/v2/evaluators/llm) | [Human](/v2/evaluators/human) | [Composite](/v2/evaluators/composites)
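
As a rough illustration of the **Code** type, the sketch below is a deterministic format check written against the client-side `(outputs, inputs, ground_truth)` interface. The `raw` output key and the required keys are assumptions made up for this example; adapt them to whatever your function actually returns.

```python
import json

# Hypothetical schema: keys the model's JSON reply is expected to contain.
REQUIRED_KEYS = {"intent", "confidence"}

def valid_json_format(outputs, inputs, ground_truth):
    """Return 1.0 if outputs["raw"] is a JSON object with all required keys, else 0.0."""
    raw = outputs.get("raw", "")
    try:
        parsed = json.loads(raw)
    except (TypeError, json.JSONDecodeError):
        return 0.0  # not a string, or not valid JSON
    if not isinstance(parsed, dict):
        return 0.0  # valid JSON, but not an object (e.g. a list)
    return 1.0 if REQUIRED_KEYS.issubset(parsed) else 0.0

ok_score = valid_json_format(
    {"raw": '{"intent": "billing", "confidence": 0.9}'}, {}, {}
)
missing_key_score = valid_json_format({"raw": '{"intent": "billing"}'}, {}, {})
```

A check like this needs no model call and always gives the same score for the same output, which is what makes code evaluators a good fit for format and validation rules.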

***

## Client-Side vs Server-Side

Evaluators can run in two places, each with a different interface and different tradeoffs. This is the most important architectural distinction in the evaluation system.

### Client-side evaluators

Client-side evaluators run in your environment during `evaluate()`. You define them as Python functions and pass them directly:

```python theme={null}
def length_check(outputs, inputs, ground_truth):
    return 1.0 if len(outputs.get("answer", "")) > 50 else 0.0

result = evaluate(
    function=my_function,
    dataset=my_dataset,
    evaluators=[length_check],
    name="my-experiment"
)
```

**Interface:** `(outputs, inputs, ground_truth)` → score

| Argument       | Contains                                   |
| -------------- | ------------------------------------------ |
| `outputs`      | Return value of your function              |
| `inputs`       | The `inputs` dict from the datapoint       |
| `ground_truth` | The `ground_truth` dict from the datapoint |

**Use when:** You need custom libraries, proprietary models, access to local resources, or are working with sensitive data that shouldn't leave your environment.
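
One way to use a proprietary or local model as a judge is to wrap it in a client-side evaluator. The following is a minimal sketch under stated assumptions: `make_llm_judge` is a hypothetical helper, `call_llm` is whatever callable you supply that takes a prompt string and returns a string, and the `query` and `answer` keys are illustrative. Mapping an unparseable reply to `0.0` is a design choice for this example, not a HoneyHive convention.

```python
def make_llm_judge(call_llm, criterion):
    """Build a client-side evaluator that asks a model for a 1-5 rating.

    `call_llm` is a caller-supplied function (prompt: str) -> str.
    """
    def llm_judge(outputs, inputs, ground_truth):
        prompt = (
            f"Rate the answer from 1 to 5 for {criterion}.\n"
            f"Question: {inputs.get('query', '')}\n"
            f"Answer: {outputs.get('answer', '')}\n"
            "Reply with a single digit."
        )
        reply = call_llm(prompt).strip()
        if not reply or reply[0] not in "12345":
            return 0.0  # unparseable reply: treated as the lowest score here
        return (int(reply[0]) - 1) / 4  # normalize 1..5 to 0.0..1.0
    return llm_judge

# Usage with a fake model so the sketch runs offline.
def fake_model(prompt):
    return " 4\n"

helpfulness = make_llm_judge(fake_model, "helpfulness")
score = helpfulness({"answer": "Restart the app."}, {"query": "App crashes on login"}, {})
```

The returned `llm_judge` function has the same three-argument signature as any other client-side evaluator, so it could be passed in the `evaluators` list alongside code-based checks.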

### Server-side evaluators

Server-side evaluators are configured in the HoneyHive UI and run on HoneyHive's infrastructure. When enabled, they execute automatically on every trace that matches your event filters, in both production and experiments, without any code changes.

```python theme={null}
# Server-side evaluators run automatically on matching traces.
# You don't pass them to evaluate().
result = evaluate(
    function=my_function,
    dataset=my_dataset,
    name="my-experiment"
)
```

**Interface:** `event` dict (for Python evaluators) or `{{ }}` template syntax (for LLM evaluators)

| Property     | Python access                       | LLM template access           |
| ------------ | ----------------------------------- | ----------------------------- |
| Outputs      | `event["outputs"]["result"]`        | `{{ outputs.result }}`        |
| Inputs       | `event["inputs"]["query"]`          | `{{ inputs.query }}`          |
| Ground truth | `event["feedback"]["ground_truth"]` | `{{ feedback.ground_truth }}` |
| Event type   | `event["event_type"]`               | `{{ event_type }}`            |
| Event name   | `event["event_name"]`               | `{{ event_name }}`            |

<Note>
  The property keys (like `result`, `query`) depend on how your functions are traced. Click **Show Schema** in the evaluator console to see available fields for your events.
</Note>

**Use when:** You want consistent evaluation across all traces, zero-code-change monitoring, centralized management, or built-in version control.
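
The `event` access paths in the table above can be sketched as a plain Python function. This is only an illustration: the function name, its single-argument signature, and the sample event are assumptions, and the entry point the evaluator console actually expects may differ. The `result` and `ground_truth` keys follow the table, but yours depend on how your functions are traced.

```python
def exact_match(event):
    """Compare the traced output with the ground truth stored on the event.

    Assumes the evaluator receives the event as a single dict (an assumption).
    """
    result = event.get("outputs", {}).get("result")
    expected = event.get("feedback", {}).get("ground_truth")
    if result is None or expected is None:
        return 0.0  # e.g. production traces without ground truth
    same = str(result).strip().lower() == str(expected).strip().lower()
    return 1.0 if same else 0.0

# Hypothetical event shaped like the access paths in the table.
sample_event = {
    "event_type": "session",
    "event_name": "classify",
    "inputs": {"query": "I was charged twice"},
    "outputs": {"result": "Billing"},
    "feedback": {"ground_truth": "billing"},
}

sample_score = exact_match(sample_event)
```

Using `.get()` with defaults keeps the function from raising on events that lack `feedback`, which matters because the same evaluator may run on production traces that carry no ground truth.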

### Choosing between them

|                    | Client-side                          | Server-side                                     |
| ------------------ | ------------------------------------ | ----------------------------------------------- |
| **Where it runs**  | Your environment                     | HoneyHive infrastructure                        |
| **When it runs**   | During `evaluate()` only             | Every matching trace (production + experiments) |
| **Setup**          | Define in code, pass to `evaluate()` | Configure once in HoneyHive UI                  |
| **Data interface** | `(outputs, inputs, ground_truth)`    | `event` dict or `{{ }}` templates               |
| **Versioning**     | Your source control                  | Built-in version history with rollback          |
| **Latency**        | Synchronous                          | Asynchronous (post-ingestion)                   |

You can use both together. A common pattern: client-side evaluators for experiment-specific scoring, server-side evaluators for baseline checks (toxicity, format, PII) that run on all traces automatically.

***

## Evaluation Scope

Evaluators can target different levels of your application.

| Scope             | What it evaluates                              | How                                                                                                       |
| ----------------- | ---------------------------------------------- | --------------------------------------------------------------------------------------------------------- |
| **Session-level** | End-to-end pipeline output                     | Pass evaluators to `evaluate()`, or set the server-side evaluator filter to `event_type: session`         |
| **Span-level**    | Individual steps (LLM calls, retrieval, tools) | Call `enrich_span(metrics={...})` inside traced functions, or filter server-side evaluators by event name |

For multi-step pipelines like RAG, combine both:

```python theme={null}
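from honeyhive import evaluate, trace, enrich_span

# `search`, `call_llm`, and `my_dataset` are placeholders for your own
# retriever, model client, and dataset.

@trace
def retrieve(query):
    docs = search(query)
    enrich_span(metrics={"num_docs": len(docs)})  # span-level metric
    return docs

@trace
def generate(docs, query):
    answer = call_llm(f"Answer using these documents:\n{docs}\n\nQuestion: {query}")
    enrich_span(metrics={"answer_length": len(answer)})  # span-level metric
    return answer

def rag_pipeline(datapoint):
    query = datapoint["inputs"]["query"]
    docs = retrieve(query)
    return {"answer": generate(docs, query)}

# Session-level evaluator: scores the pipeline's final output.
def answer_quality(outputs, inputs, ground_truth):
    expected = ground_truth.get("answer", "")
    # Guard against a missing expected answer: "" is a substring of every string.
    if not expected:
        return 0.0
    return 1.0 if expected.lower() in outputs.get("answer", "").lower() else 0.0

result = evaluate(
    function=rag_pipeline,
    dataset=my_dataset,
    evaluators=[answer_quality],
    name="rag-eval"
)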
```

After running, the dashboard shows `answer_quality` at the session level, and `num_docs` and `answer_length` on the `retrieve` and `generate` spans, respectively.

***

## Human Review

Automated evaluators handle measurable dimensions, but some assessments need human judgment. HoneyHive provides two ways to add human evaluation to your experiments:

### Review Mode

Open any experiment run and click **Review Mode** to annotate results directly. You can review the full session output or drill into any individual span within a trace, such as a specific sub-agent's response or a retrieval step. Each span can be annotated independently.

### Annotation Queues

For systematic review, create an [annotation queue](/v2/evaluation/annotation-queues) that filters specific events for targeted annotation. Queues can target full sessions or specific nested events (e.g., only the `generate_response` span in a RAG pipeline, or a particular sub-agent's output). Events matching your filter criteria are routed to the queue automatically.

Both approaches use annotation fields defined by [Human Evaluators](/v2/evaluators/human), such as quality ratings, categorical labels, or free-text feedback. Create human evaluators first to configure what your team will assess.

***

## Next Steps

<CardGroup cols={2}>
  <Card title="Run Your First Experiment" icon="flask" href="/v2/introduction/experiments-quickstart">
    Hands-on tutorial with a real example
  </Card>

  <Card title="Client-Side Evaluators" icon="code" href="/v2/evaluators/client_side">
    Write evaluator functions for experiments
  </Card>

  <Card title="Server-Side Evaluators" icon="server" href="/v2/evaluators/python">
    Configure Python evaluators in the HoneyHive UI
  </Card>

  <Card title="LLM Evaluators" icon="sparkles" href="/v2/evaluators/llm">
    Use LLMs for subjective quality assessment
  </Card>

  <Card title="Compare Experiments" icon="code-compare" href="/v2/evaluation/comparing_evals">
    Identify improvements and regressions across runs
  </Card>

  <Card title="Annotation Queues" icon="user-check" href="/v2/evaluation/annotation-queues">
    Set up human review workflows
  </Card>
</CardGroup>
