AI quality is multidimensional. A response can be accurate but unhelpful, or fluent but hallucinated. Some dimensions (latency, format compliance, toxicity) are measurable by code. Others (helpfulness, brand voice, domain correctness) require human judgment or LLM-based assessment. Useful evaluation needs both, flowing through the same pipeline and producing comparable, trackable metrics.

HoneyHive’s experiment framework is built on this idea. An experiment runs your function against a dataset and scores the outputs with evaluators - automated, LLM-based, or human. You define what to test, how to run it, and how to score it, each independently.

Critically, evaluate() produces fully traced sessions using the same OpenTelemetry infrastructure as production, so evaluation and observability aren’t separate workflows - they’re the same system. For a hands-on walkthrough, see the Experiments Quickstart.

Experiment Structure

Every experiment combines three independent parts:
| Component | What it is | Interface |
| --- | --- | --- |
| Dataset | Test cases with inputs and expected outputs | List of {inputs, ground_truth} dicts, or a dataset_id referencing a managed dataset |
| Function | Your application logic | def fn(datapoint) → output dict |
| Evaluators | Scoring functions that assess outputs | def eval(outputs, inputs, ground_truth) → score |
The function is whatever you’re trying to evaluate - a single LLM call, a RAG pipeline, a multi-agent system, or an API wrapper around an external service. It receives a datapoint and returns an output dict. There are no constraints on what happens inside: call models, query databases, invoke tools, orchestrate sub-agents. If your code can run it, evaluate() can test it.

These three components are deliberately decoupled. You can reuse a dataset across multiple functions, run the same function against different datasets, and swap evaluators without changing anything else. Here’s a complete example:
```python
from honeyhive import evaluate

dataset = [
    {"inputs": {"text": "I was charged twice"}, "ground_truth": {"intent": "billing"}},
    {"inputs": {"text": "App crashes on login"}, "ground_truth": {"intent": "technical"}},
]

def classify(datapoint):
    text = datapoint["inputs"]["text"]
    response = call_llm(f"Classify intent: {text}. Reply: billing, technical, account, or general.")
    return {"intent": response.strip().lower()}

def intent_match(outputs, inputs, ground_truth):
    return 1.0 if outputs.get("intent") == ground_truth.get("intent") else 0.0

result = evaluate(
    function=classify,
    dataset=dataset,
    evaluators=[intent_match],
    name="classifier-v1"
)
```
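Because evaluators are plain Python functions, you can sanity-check them locally before handing them to evaluate(). A quick sketch, with made-up sample outputs for illustration:

```python
# Evaluators are ordinary functions, so they can be exercised in isolation.
def intent_match(outputs, inputs, ground_truth):
    return 1.0 if outputs.get("intent") == ground_truth.get("intent") else 0.0

# Hypothetical outputs your function might produce for two datapoints.
score_ok = intent_match(
    {"intent": "billing"},
    {"text": "I was charged twice"},
    {"intent": "billing"},
)
# → 1.0

score_miss = intent_match(
    {"intent": "general"},
    {"text": "App crashes on login"},
    {"intent": "technical"},
)
# → 0.0
```

This also makes evaluators easy to cover in your regular unit-test suite, independent of any experiment run.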

Built-in Tracing

When you call evaluate(), your function is automatically traced using HoneyHive’s OpenTelemetry-based tracing. Every datapoint execution produces a full traced session, identical in structure to production traces, with no additional setup. This means all tracing primitives work inside your function:
  • Auto-instrumentation - LLM calls via OpenAI, Anthropic, etc. are captured automatically if you’ve configured instrumentors
  • Custom spans - Use @trace to create spans for any step in your pipeline
  • Enrichment - Call enrich_span() to attach metrics, metadata, or feedback to any span
  • Nested traces - Multi-agent orchestration, sub-agent calls, and tool chains are traced with full parent-child relationships
```python
from honeyhive import evaluate, trace, enrich_span

@trace
def create_plan(query):
    plan = call_llm(f"Create a plan for: {query}")
    enrich_span(metrics={"plan_steps": plan.count("\n") + 1})
    return plan

@trace
def execute(plan):
    result = call_llm(f"Execute this plan:\n{plan}")
    return result

def agent(datapoint):
    query = datapoint["inputs"]["query"]
    plan = create_plan(query)
    return execute(plan)

result = evaluate(
    function=agent,
    dataset=my_dataset,
    evaluators=[quality_check],
    name="agent-v2"
)
```
After this runs, every datapoint has a fully traced session in the dashboard. You can inspect each LLM call, see latency and token usage per step, and drill into the exact execution path, the same way you would for production traffic.
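The quality_check evaluator passed to evaluate() in the agent example isn’t defined in that snippet. A minimal placeholder might look like the following; the scoring criterion (a non-trivially-long output) is an illustrative assumption, not HoneyHive’s definition:

```python
# Hypothetical quality_check evaluator for the agent example.
# The agent returns the raw LLM result, so outputs may be a plain string.
def quality_check(outputs, inputs, ground_truth):
    text = str(outputs or "")
    # Illustrative heuristic: treat very short or empty outputs as failures.
    return 1.0 if len(text.strip()) > 20 else 0.0
```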

Evaluator Types

HoneyHive supports four evaluator types, differentiated by what runs the evaluation logic.
| Type | What runs the logic | Can run | Best for |
| --- | --- | --- | --- |
| Code | Deterministic Python | Client-side or server-side | Format checks, metrics, validation |
| LLM-as-judge | An LLM model | Server-side (or custom client-side) | Subjective quality, relevance, tone |
| Human | Domain experts | Server-side only | Edge cases, compliance, ground truth curation |
| Composite | Aggregation formula | Server-side only | Weighted quality indexes, tiered pass/fail gates |
For implementation details: Code | LLM | Human | Composite
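Composite evaluators combine other scores with an aggregation formula configured on the server. The idea can be sketched in plain Python; the metric names, weights, and 0.7 threshold below are invented for illustration:

```python
# Sketch of a weighted quality index with a pass/fail gate.
# Metric names, weights, and the threshold are illustrative assumptions.
def composite_score(scores, weights, threshold=0.7):
    total = sum(scores[name] * w for name, w in weights.items())
    return {"quality_index": total, "passed": total >= threshold}

result = composite_score(
    {"relevance": 1.0, "tone": 0.5, "format": 1.0},
    {"relevance": 0.5, "tone": 0.3, "format": 0.2},
)
# quality_index ≈ 0.85, passed = True
```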

Client-Side vs Server-Side

Evaluators can run in two places, each with a different interface and different tradeoffs. This is the most important architectural distinction in the evaluation system.

Client-side evaluators

Run in your environment during evaluate(). You define them as Python functions and pass them directly:
```python
def length_check(outputs, inputs, ground_truth):
    return 1.0 if len(outputs.get("answer", "")) > 50 else 0.0

result = evaluate(
    function=my_function,
    dataset=my_dataset,
    evaluators=[length_check],
    name="my-experiment"
)
```
Interface: (outputs, inputs, ground_truth) → score

| Argument | Contains |
| --- | --- |
| outputs | Return value of your function |
| inputs | The inputs dict from the datapoint |
| ground_truth | The ground_truth dict from the datapoint |
Use when: You need custom libraries, proprietary models, access to local resources, or are working with sensitive data that shouldn’t leave your environment.
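For example, a client-side evaluator can lean on local code so that sensitive text never leaves your environment. A sketch using a simple regex check for leaked email addresses (the pattern and scoring are illustrative, not a production PII detector):

```python
import re

# Hypothetical client-side evaluator: penalize outputs containing
# email addresses. Runs entirely in your own environment.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def no_pii(outputs, inputs, ground_truth):
    text = str(outputs.get("answer", ""))
    return 0.0 if EMAIL_RE.search(text) else 1.0
```

You would pass no_pii in the evaluators list exactly like length_check above.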

Server-side evaluators

Configured in the HoneyHive UI and run on HoneyHive’s infrastructure. They execute automatically on every matching trace, both from production and experiments, without any code changes.
```python
# Server-side evaluators run automatically on matching traces.
# You don't pass them to evaluate().
result = evaluate(
    function=my_function,
    dataset=my_dataset,
    name="my-experiment"
)
```
Interface: event dict (for Python evaluators) or {{ }} template syntax (for LLM evaluators)

| Property | Python access | LLM template access |
| --- | --- | --- |
| Outputs | event["outputs"]["result"] | {{ outputs.result }} |
| Inputs | event["inputs"]["query"] | {{ inputs.query }} |
| Ground truth | event["feedback"]["ground_truth"] | {{ feedback.ground_truth }} |
| Event type | event["event_type"] | {{ event_type }} |
| Event name | event["event_name"] | {{ event_name }} |
The property keys (like result, query) depend on how your functions are traced. Click Show Schema in the evaluator console to see available fields for your events.
Use when: You want consistent evaluation across all traces, zero-code-change monitoring, centralized management, or built-in version control.
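A server-side Python evaluator receives the event dict described above. A sketch of what such a function might look like; the result and ground_truth keys follow the table, but your actual event schema may differ, so check Show Schema:

```python
# Sketch of a server-side Python evaluator operating on an event dict.
# The "result" and "ground_truth" field names are assumptions based on
# the property table above.
def exact_match(event):
    predicted = event["outputs"].get("result", "")
    expected = event["feedback"].get("ground_truth", "")
    return 1.0 if predicted == expected else 0.0

# A hypothetical event, shaped like the table above.
sample_event = {
    "event_type": "session",
    "event_name": "classifier-v1",
    "inputs": {"query": "I was charged twice"},
    "outputs": {"result": "billing"},
    "feedback": {"ground_truth": "billing"},
}
```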

Choosing between them

| | Client-side | Server-side |
| --- | --- | --- |
| Where it runs | Your environment | HoneyHive infrastructure |
| When it runs | During evaluate() only | Every matching trace (production + experiments) |
| Setup | Define in code, pass to evaluate() | Configure once in HoneyHive UI |
| Data interface | (outputs, inputs, ground_truth) | event dict or {{ }} templates |
| Versioning | Your source control | Built-in version history with rollback |
| Latency | Synchronous | Asynchronous (post-ingestion) |
You can use both together. A common pattern: client-side evaluators for experiment-specific scoring, server-side evaluators for baseline checks (toxicity, format, PII) that run on all traces automatically.

Evaluation Scope

Evaluators can target different levels of your application.
| Scope | What it evaluates | How |
| --- | --- | --- |
| Session-level | End-to-end pipeline output | Pass evaluators to evaluate(), or set server-side evaluator filter to event_type: session |
| Span-level | Individual steps (LLM calls, retrieval, tools) | Call enrich_span(metrics={...}) inside traced functions, or filter server-side evaluators by event name |
For multi-step pipelines like RAG, combine both:
```python
from honeyhive import evaluate, trace, enrich_span

@trace
def retrieve(query):
    docs = search(query)
    enrich_span(metrics={"num_docs": len(docs)})
    return docs

@trace
def generate(docs, query):
    answer = call_llm(docs, query)
    enrich_span(metrics={"answer_length": len(answer)})
    return answer

def rag_pipeline(datapoint):
    query = datapoint["inputs"]["query"]
    docs = retrieve(query)
    return generate(docs, query)

def answer_quality(outputs, inputs, ground_truth):
    expected = ground_truth.get("answer", "")
    return 1.0 if expected.lower() in str(outputs).lower() else 0.0

result = evaluate(
    function=rag_pipeline,
    dataset=my_dataset,
    evaluators=[answer_quality],
    name="rag-eval"
)
```
After running, the dashboard shows answer_quality at the session level and num_docs, answer_length at individual span levels.

Human Review

Automated evaluators handle measurable dimensions, but some assessments need human judgment. HoneyHive provides two ways to add human evaluation to your experiments:

Review Mode

Open any experiment run and click Review Mode to annotate results directly. You can review the full session output or drill into any individual span within a trace, such as a specific sub-agent’s response or a retrieval step. Each span can be annotated independently.

Annotation Queues

For systematic review, create an annotation queue that filters specific events for targeted annotation. Queues can target full sessions or specific nested events (e.g., only the generate_response span in a RAG pipeline, or a particular sub-agent’s output). Events matching your filter criteria are routed to the queue automatically.

Both approaches use annotation fields defined by Human Evaluators, such as quality ratings, categorical labels, or free-text feedback. Create human evaluators first to configure what your team will assess.

Next Steps