evaluate() produces fully traced sessions using the same OpenTelemetry infrastructure as production, so evaluation and observability aren’t separate workflows - they’re the same system.
For a hands-on walkthrough, see the Experiments Quickstart.
Experiment Structure
Every experiment combines three independent parts:

| Component | What it is | Interface |
|---|---|---|
| Dataset | Test cases with inputs and expected outputs | List of {inputs, ground_truth} dicts, or a dataset_id referencing a managed dataset |
| Function | Your application logic | def fn(datapoint) → output dict |
| Evaluators | Scoring functions that assess outputs | def eval(outputs, inputs, ground_truth) → score |
If your application logic can be wrapped in a function, evaluate() can test it.
These three components are deliberately decoupled. You can reuse a dataset across multiple functions, run the same function against different datasets, and swap evaluators without changing anything else.
Here’s a complete example:
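A minimal sketch of the three components wired together, assuming the HoneyHive Python SDK. The dataset, function, and evaluator follow the interfaces from the table above; the application logic is a hypothetical stub, and exact evaluate() parameter names may vary by SDK version.

```python
# Dataset: test cases with inputs and expected outputs
dataset = [
    {"inputs": {"query": "What is HoneyHive?"},
     "ground_truth": {"answer": "An LLM observability and evaluation platform."}},
]

# Function: your application logic (a hypothetical stub standing in
# for a real LLM pipeline)
def answer_query(datapoint):
    query = datapoint["inputs"]["query"]
    return {"result": f"Answer to: {query}"}

# Evaluator: scores one datapoint's output
def answer_length(outputs, inputs, ground_truth):
    return len(outputs["result"].split())

if __name__ == "__main__":
    from honeyhive import evaluate  # requires HoneyHive credentials

    evaluate(
        function=answer_query,
        dataset=dataset,
        evaluators=[answer_length],
    )
```

Because the components are decoupled, swapping in a different dataset or evaluator requires no changes to the function itself.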
Built-in Tracing
When you call evaluate(), your function is automatically traced using HoneyHive’s OpenTelemetry-based tracing. Every datapoint execution produces a full traced session, identical in structure to production traces, with no additional setup.
This means all tracing primitives work inside your function:
- Auto-instrumentation - LLM calls via OpenAI, Anthropic, etc. are captured automatically if you’ve configured instrumentors
- Custom spans - Use `@trace` to create spans for any step in your pipeline
- Enrichment - Call `enrich_span()` to attach metrics, metadata, or feedback to any span
- Nested traces - Multi-agent orchestration, sub-agent calls, and tool chains are traced with full parent-child relationships
Evaluator Types
HoneyHive supports four evaluator types, differentiated by what runs the evaluation logic.

| Type | What runs the logic | Can run | Best for |
|---|---|---|---|
| Code | Deterministic Python | Client-side or server-side | Format checks, metrics, validation |
| LLM-as-judge | An LLM model | Server-side (or custom client-side) | Subjective quality, relevance, tone |
| Human | Domain experts | Server-side only | Edge cases, compliance, ground truth curation |
| Composite | Aggregation formula | Server-side only | Weighted quality indexes, tiered pass/fail gates |
Client-Side vs Server-Side
Evaluators can run in two places, each with a different interface and different tradeoffs. This is the most important architectural distinction in the evaluation system.

Client-side evaluators
Run in your environment during evaluate(). You define them as Python functions and pass them directly:
(outputs, inputs, ground_truth) → score
| Argument | Contains |
|---|---|
| outputs | Return value of your function |
| inputs | The inputs dict from the datapoint |
| ground_truth | The ground_truth dict from the datapoint |
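A sketch of a client-side evaluator following the `(outputs, inputs, ground_truth) → score` interface above. The `result` and `answer` keys are hypothetical; use whatever your function returns and your dataset stores.

```python
def exact_match(outputs, inputs, ground_truth):
    # Compare the function's output against the expected answer.
    # "result" and "answer" are hypothetical keys for illustration.
    predicted = outputs.get("result", "").strip().lower()
    expected = ground_truth.get("answer", "").strip().lower()
    return 1.0 if predicted == expected else 0.0
```

Because this is plain Python running in your environment, you can unit-test it and version it alongside the rest of your code.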
Server-side evaluators
Configured in the HoneyHive UI and run on HoneyHive’s infrastructure. They execute automatically on every matching trace, both from production and experiments, without any code changes. Instead of function arguments, they access trace data through an event dict (for Python evaluators) or {{ }} template syntax (for LLM evaluators):
| Property | Python access | LLM template access |
|---|---|---|
| Outputs | event["outputs"]["result"] | {{ outputs.result }} |
| Inputs | event["inputs"]["query"] | {{ inputs.query }} |
| Ground truth | event["feedback"]["ground_truth"] | {{ feedback.ground_truth }} |
| Event type | event["event_type"] | {{ event_type }} |
| Event name | event["event_name"] | {{ event_name }} |
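A hedged sketch of a server-side Python evaluator using the event dict access shown in the table. The `result` and `ground_truth` keys depend on how your functions are traced (check Show Schema in the evaluator console); the containment check is just an illustrative scoring rule.

```python
def contains_answer(event):
    # Server-side Python evaluators receive the full event dict.
    output = event["outputs"].get("result", "")
    expected = event.get("feedback", {}).get("ground_truth", "")
    # Simple containment check: does the output mention the expected answer?
    return 1.0 if expected and expected.lower() in output.lower() else 0.0
```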
The property keys (like `result`, `query`) depend on how your functions are traced. Click Show Schema in the evaluator console to see available fields for your events.

Choosing between them
| | Client-side | Server-side |
|---|---|---|
| Where it runs | Your environment | HoneyHive infrastructure |
| When it runs | During evaluate() only | Every matching trace (production + experiments) |
| Setup | Define in code, pass to evaluate() | Configure once in HoneyHive UI |
| Data interface | (outputs, inputs, ground_truth) | event dict or {{ }} templates |
| Versioning | Your source control | Built-in version history with rollback |
| Latency | Synchronous | Asynchronous (post-ingestion) |
Evaluation Scope
Evaluators can target different levels of your application.

| Scope | What it evaluates | How |
|---|---|---|
| Session-level | End-to-end pipeline output | Pass evaluators to evaluate(), or set server-side evaluator filter to event_type: session |
| Span-level | Individual steps (LLM calls, retrieval, tools) | Call enrich_span(metrics={...}) inside traced functions, or filter server-side evaluators by event name |
For example, a RAG pipeline might report `answer_quality` at the session level and `num_docs` and `answer_length` at individual span levels.
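A sketch of how a traced pipeline step could attach span-level metrics. The metric computation is plain Python; the `trace` and `enrich_span` import path, the stubbed retrieval and generation steps, and the field names are assumptions, shown behind a main guard since they require the HoneyHive SDK at runtime.

```python
def retrieval_metrics(docs, answer):
    # Span-level metrics computed from a single step's own data
    return {"num_docs": len(docs), "answer_length": len(answer.split())}

if __name__ == "__main__":
    from honeyhive import trace, enrich_span  # requires the HoneyHive SDK

    @trace
    def retrieve_and_answer(query):
        docs = ["doc a", "doc b"]          # hypothetical retrieval step
        answer = f"Answer to {query}"      # hypothetical generation step
        # Attach span-level metrics to this step's span
        enrich_span(metrics=retrieval_metrics(docs, answer))
        return {"result": answer}
```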
Human Review
Automated evaluators handle measurable dimensions, but some assessments need human judgment. HoneyHive provides two ways to add human evaluation to your experiments:

Review Mode
Open any experiment run and click Review Mode to annotate results directly. You can review the full session output or drill into any individual span within a trace, such as a specific sub-agent’s response or a retrieval step. Each span can be annotated independently.

Annotation Queues
For systematic review, create an annotation queue that filters specific events for targeted annotation. Queues can target full sessions or specific nested events (e.g., only the `generate_response` span in a RAG pipeline, or a particular sub-agent’s output). Events matching your filter criteria are routed to the queue automatically.
Both approaches use annotation fields defined by Human Evaluators, such as quality ratings, categorical labels, or free-text feedback. Create human evaluators first to configure what your team will assess.
Next Steps
- Run Your First Experiment - Hands-on tutorial with a real example
- Client-Side Evaluators - Write evaluator functions for experiments
- Server-Side Evaluators - Configure Python evaluators in the HoneyHive UI
- LLM Evaluators - Use LLMs for subjective quality assessment
- Compare Experiments - Identify improvements and regressions across runs
- Annotation Queues - Set up human review workflows