Client-side evaluators run within your application, giving you real-time feedback during execution and full control over evaluation logic. You can use them via two main workflows:
| Workflow | When to Use | How |
| --- | --- | --- |
| Adding Metrics to Traces | Production monitoring, guardrails | `enrich_span()`, `enrich_session()` |
| Evaluator Functions for Experiments | Testing against datasets, CI/CD | Define functions, pass to `evaluate()` |

Adding Metrics to Traces

Compute scores in your application code and attach them to traces for monitoring and analysis. Common use cases include format validation, safety checks, PII detection, latency tracking, and relevance scoring.
import os
from honeyhive import HoneyHiveTracer, trace, enrich_span

HoneyHiveTracer.init(
    api_key=os.getenv("HH_API_KEY"),
    project=os.getenv("HH_PROJECT"),
)

@trace
def generate_response(query):
    response = call_llm(query)
    
    # Add metrics to this span
    enrich_span(metrics={
        "response_length": len(response),
        "contains_pii": check_pii(response),
        "relevance_score": compute_relevance(query, response),
    })
    
    return response
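The `check_pii` and `compute_relevance` helpers above are application-specific, not part of the HoneyHive SDK. As one illustration, here is a minimal regex-based sketch of a `check_pii` helper, assuming you only need to flag a few common patterns; production PII detection usually warrants a dedicated library or service:

```python
import re

# Hypothetical helper: flags common PII patterns. The pattern list is
# illustrative and far from exhaustive.
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),         # email addresses
    re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),  # US-style phone numbers
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),            # SSN-like sequences
]

def check_pii(text: str) -> bool:
    """Return True if any known PII pattern appears in the text."""
    return any(pattern.search(text) for pattern in PII_PATTERNS)
```

Because the metric is computed in your own code, you control exactly what "contains PII" means for your application.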
To see where to initialize the tracer for your environment, including AWS Lambda and long-running servers, see Tracer Initialization.
Metrics appear in the HoneyHive dashboard for charting, alerting, and filtering.
For complete documentation on adding metrics to traces, see Custom Metrics.

Evaluator Functions for Experiments

Define scoring functions that run locally during evaluate() to score outputs against expected results.

Writing an Evaluator

Evaluators receive three arguments and return a score:
def my_evaluator(outputs, inputs, ground_truth):
    """
    Args:
        outputs: Return value from your function
        inputs: The inputs dict from the datapoint
        ground_truth: The ground_truth dict from the datapoint
    
    Returns:
        A score (number, boolean, or string)
    """
    expected = ground_truth.get("expected", "")
    return 1.0 if outputs == expected else 0.0

Running Evaluators

Pass evaluator functions to evaluate():
import os
from honeyhive import evaluate

def accuracy(outputs, inputs, ground_truth):
    expected = ground_truth.get("intent", "")
    actual = outputs.get("intent", "")
    return 1.0 if expected == actual else 0.0

def my_classifier(datapoint):
    text = datapoint["inputs"]["text"]
    # Your classification logic
    return {"intent": classify(text)}

dataset = [
    {"inputs": {"text": "I need a refund"}, "ground_truth": {"intent": "billing"}},
    {"inputs": {"text": "App won't load"}, "ground_truth": {"intent": "technical"}},
]

result = evaluate(
    function=my_classifier,
    dataset=dataset,
    evaluators=[accuracy],
    api_key=os.getenv("HH_API_KEY"),
    project=os.getenv("HH_PROJECT"),
    name="intent-classifier-v1",
)
For a complete tutorial with real examples, see Run Your First Experiment.

Evaluating Multi-Step Pipelines

For pipelines with multiple steps, combine both approaches:
  • Session-level: Pass evaluators to evaluate() for overall scoring
  • Span-level: Use enrich_span() within traced functions for step-specific metrics
import os
from honeyhive import evaluate
from honeyhive import trace, enrich_span

# Session-level evaluator
def answer_quality(outputs, inputs, ground_truth):
    expected = ground_truth.get("answer", "")
    return 1.0 if expected.lower() in outputs.lower() else 0.0

@trace
def retrieve_docs(query):
    docs = search_database(query)
    # Span-level metric
    enrich_span(metrics={"num_docs": len(docs), "retrieval_score": 0.85})
    return docs

@trace
def generate_answer(docs, query):
    answer = call_llm(docs, query)
    enrich_span(metrics={"answer_length": len(answer)})
    return answer

def rag_pipeline(datapoint):
    query = datapoint["inputs"]["query"]
    docs = retrieve_docs(query)
    return generate_answer(docs, query)

result = evaluate(
    function=rag_pipeline,
    dataset=my_dataset,
    evaluators=[answer_quality],  # Scores the final output
    api_key=os.getenv("HH_API_KEY"),
    project=os.getenv("HH_PROJECT"),
    name="rag-eval",
)
After running, you’ll see both:
  • answer_quality scores at the session level
  • retrieval_score, num_docs, answer_length at the span level

Next Steps