AI quality is multidimensional. A response can be accurate but unhelpful, or fluent but hallucinated. Some dimensions (latency, format compliance, toxicity) are measurable by code. Others (helpfulness, brand voice, domain correctness) require human judgment or LLM-based assessment. Useful evaluation needs both, flowing through the same pipeline and producing comparable, trackable metrics.

HoneyHive’s experiment framework is built on this idea. An experiment runs your function against a dataset and scores the outputs with evaluators - automated, LLM-based, or human. You define what to test, how to run it, and how to score it, each independently.

Critically, evaluate() produces fully traced sessions using the same OpenTelemetry infrastructure as production, so evaluation and observability aren’t separate workflows - they’re the same system. For a hands-on walkthrough, see the Experiments Quickstart.

Experiment Structure

Every experiment combines three independent parts:
| Component | What it is | Interface |
| --- | --- | --- |
| Dataset | Test cases with inputs and expected outputs | List of {inputs, ground_truth} dicts, or a dataset_id referencing a managed dataset |
| Function | Your application logic | def fn(datapoint) → output dict |
| Evaluators | Scoring functions that assess outputs | def eval(outputs, inputs, ground_truth) → score |
The function is whatever you’re trying to evaluate - a single LLM call, a RAG pipeline, a multi-agent system, or an API wrapper around an external service. It receives a datapoint and returns an output dict. There are no constraints on what happens inside: call models, query databases, invoke tools, orchestrate sub-agents. If your code can run it, evaluate() can test it.

These three components are deliberately decoupled. You can reuse a dataset across multiple functions, run the same function against different datasets, and swap evaluators without changing anything else. Here’s a complete example:
```python
from honeyhive import evaluate

dataset = [
    {"inputs": {"text": "I was charged twice"}, "ground_truth": {"intent": "billing"}},
    {"inputs": {"text": "App crashes on login"}, "ground_truth": {"intent": "technical"}},
]

def classify(datapoint):
    text = datapoint["inputs"]["text"]
    response = call_llm(f"Classify intent: {text}. Reply: billing, technical, account, or general.")
    return {"intent": response.strip().lower()}

def intent_match(outputs, inputs, ground_truth):
    return 1.0 if outputs.get("intent") == ground_truth.get("intent") else 0.0

result = evaluate(
    function=classify,
    dataset=dataset,
    evaluators=[intent_match],
    name="classifier-v1"
)
```
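Because evaluators are plain Python functions, you can sanity-check them locally before handing them to evaluate(). A quick sketch, with made-up sample outputs for illustration:

```python
# Evaluators are ordinary functions, so they can be exercised in isolation.
def intent_match(outputs, inputs, ground_truth):
    return 1.0 if outputs.get("intent") == ground_truth.get("intent") else 0.0

# Hypothetical outputs your function might produce for two datapoints.
score_ok = intent_match(
    {"intent": "billing"},
    {"text": "I was charged twice"},
    {"intent": "billing"},
)
# → 1.0

score_miss = intent_match(
    {"intent": "general"},
    {"text": "App crashes on login"},
    {"intent": "technical"},
)
# → 0.0
```

This also makes evaluators easy to cover in your regular unit-test suite, independent of any experiment run.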

Built-in Tracing

When you call evaluate(), your function is automatically traced using HoneyHive’s OpenTelemetry-based tracing. Every datapoint execution produces a full traced session, identical in structure to production traces, with no additional setup. This means all tracing primitives work inside your function:
  • Auto-instrumentation - LLM calls via OpenAI, Anthropic, etc. are captured automatically if you’ve configured instrumentors
  • Custom spans - Use @trace to create spans for any step in your pipeline
  • Enrichment - Call enrich_span() to attach metrics, metadata, or feedback to any span
  • Nested traces - Multi-agent orchestration, sub-agent calls, and tool chains are traced with full parent-child relationships
```python
from honeyhive import evaluate, trace, enrich_span

@trace
def create_plan(query):
    plan = call_llm(f"Create a plan for: {query}")
    enrich_span(metrics={"plan_steps": plan.count("\n") + 1})
    return plan

@trace
def execute(plan):
    result = call_llm(f"Execute this plan:\n{plan}")
    return result

def agent(datapoint):
    query = datapoint["inputs"]["query"]
    plan = create_plan(query)
    return execute(plan)

result = evaluate(
    function=agent,
    dataset=my_dataset,
    evaluators=[quality_check],
    name="agent-v2"
)
```
After this runs, every datapoint has a fully traced session in the dashboard. You can inspect each LLM call, see latency and token usage per step, and drill into the exact execution path, the same way you would for production traffic.
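The quality_check evaluator passed to evaluate() in the agent example isn’t defined in that snippet. A minimal placeholder might look like the following; the scoring criterion (a non-trivially-long output) is an illustrative assumption, not HoneyHive’s definition:

```python
# Hypothetical quality_check evaluator for the agent example.
# The agent returns the raw LLM result, so outputs may be a plain string.
def quality_check(outputs, inputs, ground_truth):
    text = str(outputs or "")
    # Illustrative heuristic: treat very short or empty outputs as failures.
    return 1.0 if len(text.strip()) > 20 else 0.0
```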

Evaluator Types

HoneyHive supports four evaluator types, differentiated by what runs the evaluation logic.
| Type | What runs the logic | Can run | Best for |
| --- | --- | --- | --- |
| Code | Deterministic Python | Client-side or server-side | Format checks, metrics, validation |
| LLM-as-judge | An LLM model | Server-side (or custom client-side) | Subjective quality, relevance, tone |
| Human | Domain experts | Server-side only | Edge cases, compliance, ground truth curation |
| Composite | Aggregation formula | Server-side only | Weighted quality indexes, tiered pass/fail gates |
For implementation details: Code | LLM | Human | Composite
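Composite evaluators combine other scores with an aggregation formula configured on the server. The idea can be sketched in plain Python; the metric names, weights, and 0.7 threshold below are invented for illustration:

```python
# Sketch of a weighted quality index with a pass/fail gate.
# Metric names, weights, and the threshold are illustrative assumptions.
def composite_score(scores, weights, threshold=0.7):
    total = sum(scores[name] * w for name, w in weights.items())
    return {"quality_index": total, "passed": total >= threshold}

result = composite_score(
    {"relevance": 1.0, "tone": 0.5, "format": 1.0},
    {"relevance": 0.5, "tone": 0.3, "format": 0.2},
)
# quality_index ≈ 0.85, passed = True
```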

Client-Side vs Server-Side

Evaluators can run in two places, each with a different interface and different tradeoffs. This is the most important architectural distinction in the evaluation system.

Client-side evaluators

Run in your environment during evaluate(). You define them as Python functions and pass them directly:
```python
def length_check(outputs, inputs, ground_truth):
    return 1.0 if len(outputs.get("answer", "")) > 50 else 0.0

result = evaluate(
    function=my_function,
    dataset=my_dataset,
    evaluators=[length_check],
    name="my-experiment"
)
```
Interface: (outputs, inputs, ground_truth) → score

| Argument | Contains |
| --- | --- |
| outputs | Return value of your function |
| inputs | The inputs dict from the datapoint |
| ground_truth | The ground_truth dict from the datapoint |
Use when: You need custom libraries, proprietary models, access to local resources, or are working with sensitive data that shouldn’t leave your environment.
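For example, a client-side evaluator can lean on local code so that sensitive text never leaves your environment. A sketch using a simple regex check for leaked email addresses (the pattern and scoring are illustrative, not a production PII detector):

```python
import re

# Hypothetical client-side evaluator: penalize outputs containing
# email addresses. Runs entirely in your own environment.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def no_pii(outputs, inputs, ground_truth):
    text = str(outputs.get("answer", ""))
    return 0.0 if EMAIL_RE.search(text) else 1.0
```

You would pass no_pii in the evaluators list exactly like length_check above.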

Server-side evaluators

Configured in the HoneyHive UI and run on HoneyHive’s infrastructure. They execute automatically on every matching trace, both from production and experiments, without any code changes.
```python
# Server-side evaluators run automatically on matching traces.
# You don't pass them to evaluate().
result = evaluate(
    function=my_function,
    dataset=my_dataset,
    name="my-experiment"
)
```
Interface: event dict (for Python evaluators) or {{ }} template syntax (for LLM evaluators)

| Property | Python access | LLM template access |
| --- | --- | --- |
| Outputs | event["outputs"]["result"] | {{ outputs.result }} |
| Inputs | event["inputs"]["query"] | {{ inputs.query }} |
| Ground truth | event["feedback"]["ground_truth"] | {{ feedback.ground_truth }} |
| Event type | event["event_type"] | {{ event_type }} |
| Event name | event["event_name"] | {{ event_name }} |
The property keys (like result, query) depend on how your functions are traced. Click Show Schema in the evaluator console to see available fields for your events.
Use when: You want consistent evaluation across all traces, zero-code-change monitoring, centralized management, or built-in version control.
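A server-side Python evaluator receives the event dict described above. A sketch of what such a function might look like; the result and ground_truth keys follow the table, but your actual event schema may differ, so check Show Schema:

```python
# Sketch of a server-side Python evaluator operating on an event dict.
# The "result" and "ground_truth" field names are assumptions based on
# the property table above.
def exact_match(event):
    predicted = event["outputs"].get("result", "")
    expected = event["feedback"].get("ground_truth", "")
    return 1.0 if predicted == expected else 0.0

# A hypothetical event, shaped like the table above.
sample_event = {
    "event_type": "session",
    "event_name": "classifier-v1",
    "inputs": {"query": "I was charged twice"},
    "outputs": {"result": "billing"},
    "feedback": {"ground_truth": "billing"},
}
```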

Choosing between them

| | Client-side | Server-side |
| --- | --- | --- |
| Where it runs | Your environment | HoneyHive infrastructure |
| When it runs | During evaluate() only | Every matching trace (production + experiments) |
| Setup | Define in code, pass to evaluate() | Configure once in HoneyHive UI |
| Data interface | (outputs, inputs, ground_truth) | event dict or {{ }} templates |
| Versioning | Your source control | Built-in version history with rollback |
| Latency | Synchronous | Asynchronous (post-ingestion) |
You can use both together. A common pattern: client-side evaluators for experiment-specific scoring, server-side evaluators for baseline checks (toxicity, format, PII) that run on all traces automatically.

Evaluation Scope

Evaluators can target different levels of your application.
| Scope | What it evaluates | How |
| --- | --- | --- |
| Session-level | End-to-end pipeline output | Pass evaluators to evaluate(), or set server-side evaluator filter to event_type: session |
| Span-level | Individual steps (LLM calls, retrieval, tools) | Call enrich_span(metrics={...}) inside traced functions, or filter server-side evaluators by event name |
For multi-step pipelines like RAG, combine both:
```python
from honeyhive import evaluate, trace, enrich_span

@trace
def retrieve(query):
    docs = search(query)
    enrich_span(metrics={"num_docs": len(docs)})
    return docs

@trace
def generate(docs, query):
    answer = call_llm(docs, query)
    enrich_span(metrics={"answer_length": len(answer)})
    return answer

def rag_pipeline(datapoint):
    query = datapoint["inputs"]["query"]
    docs = retrieve(query)
    return generate(docs, query)

def answer_quality(outputs, inputs, ground_truth):
    expected = ground_truth.get("answer", "")
    return 1.0 if expected.lower() in str(outputs).lower() else 0.0

result = evaluate(
    function=rag_pipeline,
    dataset=my_dataset,
    evaluators=[answer_quality],
    name="rag-eval"
)
```
After running, the dashboard shows answer_quality at the session level and num_docs, answer_length at individual span levels.

Human Review

Automated evaluators handle measurable dimensions, but some assessments need human judgment. HoneyHive provides two ways to add human evaluation to your experiments:

Review Mode

Open any experiment run and click Review Mode to annotate results directly. You can review the full session output or drill into any individual span within a trace, such as a specific sub-agent’s response or a retrieval step. Each span can be annotated independently.

Annotation Queues

For systematic review, create an annotation queue that filters specific events for targeted annotation. Queues can target full sessions or specific nested events (e.g., only the generate_response span in a RAG pipeline, or a particular sub-agent’s output). Events matching your filter criteria are routed to the queue automatically.

Both approaches use annotation fields defined by Human Evaluators, such as quality ratings, categorical labels, or free-text feedback. Create human evaluators first to configure what your team will assess.

Next Steps