HoneyHive Docs

Client-side evaluators run within your application, giving you real-time feedback during execution and full control over evaluation logic. You can use them via two main workflows:

Workflow	When to Use	How
Adding Metrics to Traces	Production monitoring, guardrails	`enrich_span()`, `enrich_session()`
Evaluator Functions for Experiments	Testing against datasets, CI/CD	Define functions, pass to `evaluate()`

Adding Metrics to Traces

Compute scores in your application code and attach them to traces for monitoring and analysis. Use cases: Format validation, safety checks, PII detection, latency tracking, relevance scores.

import os
from honeyhive import HoneyHiveTracer, trace, enrich_span

HoneyHiveTracer.init(api_key=os.getenv("HH_API_KEY"))

@trace
def generate_response(query):
    response = call_llm(query)
    
    # Add metrics to this span
    enrich_span(metrics={
        "response_length": len(response),
        "contains_pii": check_pii(response),
        "relevance_score": compute_relevance(query, response),
    })
    
    return response

To see where to initialize the tracer for your environment, including AWS Lambda and long-running servers, see Tracer Initialization.

Metrics appear in the HoneyHive dashboard for charting, alerting, and filtering.

For complete documentation on adding metrics to traces, see Custom Metrics.

Evaluator Functions for Experiments

Define scoring functions that run locally during evaluate() to score outputs against expected results.

Writing an Evaluator

Evaluators receive three arguments and return a score:

def my_evaluator(outputs, inputs, ground_truth):
    """
    Args:
        outputs: Return value from your function
        inputs: The inputs dict from the datapoint
        ground_truth: The ground_truth dict from the datapoint (expected/reference values to score against)
    
    Returns:
        A score (number, boolean, or string)
    """
    expected = ground_truth.get("expected", "")
    return 1.0 if outputs == expected else 0.0

Use ground_truth (singular) for both the datapoint field and the evaluator argument. ground_truths was the pre-1.0 SDK name.

Running Evaluators

Pass evaluator functions to evaluate():

import os
from honeyhive import evaluate

def accuracy(outputs, inputs, ground_truth):
    expected = ground_truth.get("intent", "")
    actual = outputs.get("intent", "")
    return 1.0 if expected == actual else 0.0

def my_classifier(datapoint):
    text = datapoint["inputs"]["text"]
    # Your classification logic
    return {"intent": classify(text)}

dataset = [
    {"inputs": {"text": "I need a refund"}, "ground_truth": {"intent": "billing"}},
    {"inputs": {"text": "App won't load"}, "ground_truth": {"intent": "technical"}},
]

result = evaluate(
    function=my_classifier,
    dataset=dataset,
    evaluators=[accuracy],
    api_key=os.getenv("HH_API_KEY"),
    name="intent-classifier-v1",
)

For a complete tutorial with real examples, see Run Your First Experiment.

Keep target function and evaluator names stable across runs. Cross-run comparison pairs metrics by metric name and traced function (event) name. Renaming the target function (e.g., baseline_classifier to improved_classifier) or the evaluator (e.g., accuracy to accuracy_v2) between runs makes the comparison view treat them as unrelated metrics, so improvements and regressions no longer pair up. Iterate by editing the function bodies in place and re-running under the same names; label the run itself via the name= argument to evaluate(). For more on cross-run comparisons, see Comparing Experiments.

Evaluating Multi-Step Pipelines

For pipelines with multiple steps, combine both approaches:

Session-level: Pass evaluators to evaluate() for overall scoring
Span-level: Use enrich_span() within traced functions for step-specific metrics

import os
from honeyhive import evaluate
from honeyhive import trace, enrich_span

# Session-level evaluator
def answer_quality(outputs, inputs, ground_truth):
    expected = ground_truth.get("answer", "")
    return 1.0 if expected.lower() in outputs.lower() else 0.0

@trace
def retrieve_docs(query):
    docs = search_database(query)
    # Span-level metric
    enrich_span(metrics={"num_docs": len(docs), "retrieval_score": 0.85})
    return docs

@trace  
def generate_answer(docs, query):
    answer = call_llm(docs, query)
    enrich_span(metrics={"answer_length": len(answer)})
    return answer

def rag_pipeline(datapoint):
    query = datapoint["inputs"]["query"]
    docs = retrieve_docs(query)
    return generate_answer(docs, query)

result = evaluate(
    function=rag_pipeline,
    dataset=my_dataset,
    evaluators=[answer_quality],  # Scores the final output
    api_key=os.getenv("HH_API_KEY"),
    name="rag-eval",
)

After running, you’ll see both:

answer_quality scores at the session level
retrieval_score, num_docs, answer_length at the span level

Next Steps

Custom Metrics

Full guide to adding metrics to traces

Run Your First Experiment

Complete tutorial with real examples

Server-Side Evaluators

Run evaluators on HoneyHive infrastructure

LLM-as-Judge

Use LLMs to evaluate outputs

Documentation Index

​Adding Metrics to Traces

​Evaluator Functions for Experiments

​Writing an Evaluator

​Running Evaluators

​Evaluating Multi-Step Pipelines

​Next Steps

Custom Metrics

Run Your First Experiment

Server-Side Evaluators

LLM-as-Judge

Adding Metrics to Traces

Evaluator Functions for Experiments

Writing an Evaluator

Running Evaluators

Evaluating Multi-Step Pipelines

Next Steps