Client-side evaluators run within your application, giving you real-time feedback during execution and full control over evaluation logic. You can use them via two main workflows:
| Workflow | When to Use | How |
| --- | --- | --- |
| Adding Metrics to Traces | Production monitoring, guardrails | `enrich_span()`, `enrich_session()` |
| Evaluator Functions for Experiments | Testing against datasets, CI/CD | Define functions, pass to `evaluate()` |

Adding Metrics to Traces

Compute scores in your application code and attach them to traces for monitoring and analysis. Common use cases include format validation, safety checks, PII detection, latency tracking, and relevance scoring.
import os
from honeyhive import HoneyHiveTracer, trace, enrich_span

HoneyHiveTracer.init(
    api_key=os.getenv("HH_API_KEY"),
    project=os.getenv("HH_PROJECT"),
)

@trace
def generate_response(query):
    response = call_llm(query)
    
    # Add metrics to this span
    enrich_span(metrics={
        "response_length": len(response),
        "contains_pii": check_pii(response),
        "relevance_score": compute_relevance(query, response),
    })
    
    return response
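The `check_pii` and `compute_relevance` helpers above are application-specific, not part of the HoneyHive SDK. As one illustration, here is a minimal regex-based sketch of a `check_pii` helper, assuming you only need to flag a few common patterns; production PII detection usually warrants a dedicated library or service:

```python
import re

# Hypothetical helper: flags common PII patterns. The pattern list is
# illustrative and far from exhaustive.
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),         # email addresses
    re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),  # US-style phone numbers
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),            # SSN-like sequences
]

def check_pii(text: str) -> bool:
    """Return True if any known PII pattern appears in the text."""
    return any(pattern.search(text) for pattern in PII_PATTERNS)
```

Because the metric is computed in your own code, you control exactly what "contains PII" means for your application.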
To see where to initialize the tracer for your environment, including AWS Lambda and long-running servers, see Tracer Initialization.
Metrics appear in the HoneyHive dashboard for charting, alerting, and filtering.
For complete documentation on adding metrics to traces, see Custom Metrics.

Evaluator Functions for Experiments

Define scoring functions that run locally during evaluate() to score outputs against expected results.

Writing an Evaluator

Evaluators receive three arguments and return a score:
def my_evaluator(outputs, inputs, ground_truth):
    """
    Args:
        outputs: Return value from your function
        inputs: The inputs dict from the datapoint
        ground_truth: The ground_truth dict from the datapoint
    
    Returns:
        A score (number, boolean, or string)
    """
    expected = ground_truth.get("expected", "")
    return 1.0 if outputs == expected else 0.0

Running Evaluators

Pass evaluator functions to evaluate():
import os
from honeyhive import evaluate

def accuracy(outputs, inputs, ground_truth):
    expected = ground_truth.get("intent", "")
    actual = outputs.get("intent", "")
    return 1.0 if expected == actual else 0.0

def my_classifier(datapoint):
    text = datapoint["inputs"]["text"]
    # Your classification logic
    return {"intent": classify(text)}

dataset = [
    {"inputs": {"text": "I need a refund"}, "ground_truth": {"intent": "billing"}},
    {"inputs": {"text": "App won't load"}, "ground_truth": {"intent": "technical"}},
]

result = evaluate(
    function=my_classifier,
    dataset=dataset,
    evaluators=[accuracy],
    api_key=os.getenv("HH_API_KEY"),
    project=os.getenv("HH_PROJECT"),
    name="intent-classifier-v1",
)
For a complete tutorial with real examples, see Run Your First Experiment.

Evaluating Multi-Step Pipelines

For pipelines with multiple steps, combine both approaches:
  • Session-level: Pass evaluators to evaluate() for overall scoring
  • Span-level: Use enrich_span() within traced functions for step-specific metrics
import os
from honeyhive import evaluate
from honeyhive import trace, enrich_span

# Session-level evaluator
def answer_quality(outputs, inputs, ground_truth):
    expected = ground_truth.get("answer", "")
    return 1.0 if expected.lower() in outputs.lower() else 0.0

@trace
def retrieve_docs(query):
    docs = search_database(query)
    # Span-level metric
    enrich_span(metrics={"num_docs": len(docs), "retrieval_score": 0.85})
    return docs

@trace
def generate_answer(docs, query):
    answer = call_llm(docs, query)
    enrich_span(metrics={"answer_length": len(answer)})
    return answer

def rag_pipeline(datapoint):
    query = datapoint["inputs"]["query"]
    docs = retrieve_docs(query)
    return generate_answer(docs, query)

result = evaluate(
    function=rag_pipeline,
    dataset=my_dataset,
    evaluators=[answer_quality],  # Scores the final output
    api_key=os.getenv("HH_API_KEY"),
    project=os.getenv("HH_PROJECT"),
    name="rag-eval",
)
After running, you’ll see both:
  • answer_quality scores at the session level
  • retrieval_score, num_docs, answer_length at the span level

Next Steps