Client-side evaluators run within your application, giving you real-time feedback during execution and full control over evaluation logic. You can use them via two main workflows:
| Workflow | When to Use | How |
| --- | --- | --- |
| Adding Metrics to Traces | Production monitoring, guardrails | enrich_span(), enrich_session() |
| Evaluator Functions for Experiments | Testing against datasets, CI/CD | Define functions, pass to evaluate() |

Adding Metrics to Traces

Compute scores in your application code and attach them to traces for monitoring and analysis. Typical use cases include format validation, safety checks, PII detection, latency tracking, and relevance scoring.
import os
from honeyhive import HoneyHiveTracer, trace, enrich_span

HoneyHiveTracer.init(api_key=os.getenv("HH_API_KEY"))

@trace
def generate_response(query):
    response = call_llm(query)
    
    # Add metrics to this span
    enrich_span(metrics={
        "response_length": len(response),
        "contains_pii": check_pii(response),
        "relevance_score": compute_relevance(query, response),
    })
    
    return response
To see where to initialize the tracer for your environment, including AWS Lambda and long-running servers, see Tracer Initialization.
Metrics appear in the HoneyHive dashboard for charting, alerting, and filtering.
For complete documentation on adding metrics to traces, see Custom Metrics.
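
The example above calls check_pii() without defining it. A minimal sketch of what such helpers might look like, assuming the response is a plain string; both functions are hypothetical illustrations and not part of the HoneyHive SDK, and the regular expressions are deliberately simple rather than production-grade PII detection:

```python
import json
import re

# Hypothetical helpers for illustration only; not part of the HoneyHive SDK.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def check_pii(text: str) -> bool:
    """Return True if the text appears to contain an email or a US-style phone number."""
    return bool(EMAIL_RE.search(text) or PHONE_RE.search(text))


def is_valid_json(text: str) -> bool:
    """Format validation: return True if the text parses as JSON."""
    try:
        json.loads(text)
    except (TypeError, ValueError):  # json.JSONDecodeError subclasses ValueError
        return False
    return True
```

Either return value can be passed straight into the metrics dict, for example enrich_span(metrics={"contains_pii": check_pii(response), "valid_json": is_valid_json(response)}).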

Evaluator Functions for Experiments

Define scoring functions that run locally during evaluate() to score outputs against expected results.

Writing an Evaluator

Evaluators receive three arguments and return a score:
def my_evaluator(outputs, inputs, ground_truth):
    """
    Args:
        outputs: Return value from your function
        inputs: The inputs dict from the datapoint
        ground_truth: The ground_truth dict from the datapoint (expected/reference values to score against)
    
    Returns:
        A score (number, boolean, or string)
    """
    expected = ground_truth.get("expected", "")
    return 1.0 if outputs == expected else 0.0
Use ground_truth (singular) for both the datapoint field and the evaluator argument. ground_truths was the pre-1.0 SDK name.
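
The docstring above allows a score to be a number, a boolean, or a string. A minimal sketch of one evaluator per return type, assuming outputs is a plain string and the datapoint stores its reference under an "expected" key as in the example above; the "max_words" input field is a hypothetical name introduced only for this illustration:

```python
def exact_match(outputs, inputs, ground_truth):
    """Boolean score: True when the output equals the expected value."""
    return outputs == ground_truth.get("expected", "")


def length_ratio(outputs, inputs, ground_truth):
    """Numeric score: output length relative to expected length, capped at 1.0."""
    expected = ground_truth.get("expected", "")
    if not expected:
        return 0.0
    return min(len(outputs) / len(expected), 1.0)


def verbosity_label(outputs, inputs, ground_truth):
    """String score: a categorical label based on a word budget."""
    max_words = inputs.get("max_words", 50)  # hypothetical input field
    return "too_long" if len(outputs.split()) > max_words else "ok"
```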

Running Evaluators

Pass evaluator functions to evaluate():
import os
from honeyhive import evaluate

def accuracy(outputs, inputs, ground_truth):
    expected = ground_truth.get("intent", "")
    actual = outputs.get("intent", "")
    return 1.0 if expected == actual else 0.0

def my_classifier(datapoint):
    text = datapoint["inputs"]["text"]
    # Your classification logic
    return {"intent": classify(text)}

dataset = [
    {"inputs": {"text": "I need a refund"}, "ground_truth": {"intent": "billing"}},
    {"inputs": {"text": "App won't load"}, "ground_truth": {"intent": "technical"}},
]

result = evaluate(
    function=my_classifier,
    dataset=dataset,
    evaluators=[accuracy],
    api_key=os.getenv("HH_API_KEY"),
    name="intent-classifier-v1",
)
For a complete tutorial with real examples, see Run Your First Experiment.
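
Before spending API calls, it can help to exercise the target function and evaluators locally. The sketch below mirrors the call pattern shown above (the function receives a datapoint, each evaluator receives outputs, inputs, and ground_truth), but it is not how evaluate() is implemented and it logs nothing to HoneyHive. dry_run and keyword_classifier are hypothetical names; the classifier is a keyword stand-in for my_classifier:

```python
from statistics import mean


def accuracy(outputs, inputs, ground_truth):
    expected = ground_truth.get("intent", "")
    actual = outputs.get("intent", "")
    return 1.0 if expected == actual else 0.0


def keyword_classifier(datapoint):
    """Hypothetical stand-in for my_classifier: keyword rules instead of a model."""
    text = datapoint["inputs"]["text"].lower()
    return {"intent": "billing" if "refund" in text else "technical"}


def dry_run(function, dataset, evaluators):
    """Call the function on each datapoint and average each evaluator's scores locally."""
    scores = {ev.__name__: [] for ev in evaluators}
    for datapoint in dataset:
        outputs = function(datapoint)
        for ev in evaluators:
            score = ev(outputs, datapoint["inputs"], datapoint["ground_truth"])
            scores[ev.__name__].append(score)
    # None signals "no datapoints" instead of raising on an empty list
    return {name: (mean(vals) if vals else None) for name, vals in scores.items()}


dataset = [
    {"inputs": {"text": "I need a refund"}, "ground_truth": {"intent": "billing"}},
    {"inputs": {"text": "App won't load"}, "ground_truth": {"intent": "technical"}},
]

summary = dry_run(keyword_classifier, dataset, [accuracy])
print(summary)
```

Once the local numbers look sensible, the same function, dataset, and evaluators can be passed to evaluate() unchanged.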
Keep target function and evaluator names stable across runs. Cross-run comparison pairs metrics by metric name and traced function (event) name. Renaming the target function (e.g., baseline_classifier to improved_classifier) or the evaluator (e.g., accuracy to accuracy_v2) between runs makes the comparison view treat them as unrelated metrics, so improvements and regressions no longer pair up. Iterate by editing the function bodies in place and re-running under the same names; label the run itself via the name= argument to evaluate(). For more on cross-run comparisons, see Comparing Experiments.
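
To see why renaming breaks pairing, consider a toy model of the rule described above, where each metric is keyed by (event name, metric name). This is an illustration of the idea only, not HoneyHive's comparison logic; pair_metrics and the score values are made up for the example:

```python
def pair_metrics(baseline, candidate):
    """Pair metrics across two runs by their (event_name, metric_name) key.

    Each run is a dict mapping (event_name, metric_name) -> score.
    Returns the paired scores and a sorted list of keys found in only one run.
    """
    shared = baseline.keys() & candidate.keys()
    paired = {key: (baseline[key], candidate[key]) for key in shared}
    unpaired = sorted(baseline.keys() ^ candidate.keys())
    return paired, unpaired


# Illustrative scores, not real results
run_a = {("my_classifier", "accuracy"): 0.70}
run_b_same_names = {("my_classifier", "accuracy"): 0.85}
run_b_renamed = {("improved_classifier", "accuracy_v2"): 0.85}

# Stable names: the two scores line up and can be compared
paired_ok, unpaired_ok = pair_metrics(run_a, run_b_same_names)

# Renamed function and evaluator: nothing pairs, both keys are orphaned
paired_bad, unpaired_bad = pair_metrics(run_a, run_b_renamed)
```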

Evaluating Multi-Step Pipelines

For pipelines with multiple steps, combine both approaches:
  • Session-level: Pass evaluators to evaluate() for overall scoring
  • Span-level: Use enrich_span() within traced functions for step-specific metrics
import os
from honeyhive import evaluate, trace, enrich_span

# Session-level evaluator
def answer_quality(outputs, inputs, ground_truth):
    expected = ground_truth.get("answer", "")
    return 1.0 if expected.lower() in outputs.lower() else 0.0

@trace
def retrieve_docs(query):
    docs = search_database(query)
    # Span-level metrics (retrieval_score is hardcoded here as a placeholder)
    enrich_span(metrics={"num_docs": len(docs), "retrieval_score": 0.85})
    return docs

@trace
def generate_answer(docs, query):
    answer = call_llm(docs, query)
    enrich_span(metrics={"answer_length": len(answer)})
    return answer

def rag_pipeline(datapoint):
    query = datapoint["inputs"]["query"]
    docs = retrieve_docs(query)
    return generate_answer(docs, query)

result = evaluate(
    function=rag_pipeline,
    dataset=my_dataset,  # datapoints shaped like {"inputs": {"query": ...}, "ground_truth": {"answer": ...}}
    evaluators=[answer_quality],  # Scores the final output
    api_key=os.getenv("HH_API_KEY"),
    name="rag-eval",
)
After running, you’ll see both:
  • answer_quality scores at the session level
  • retrieval_score, num_docs, answer_length at the span level
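
The retrieve_docs example above reports retrieval_score as a fixed 0.85. In practice the value would be computed per query. A minimal sketch, assuming the retrieved docs are plain strings and using simple term overlap as the measure; the retrieval_score and tokenize functions are hypothetical, and a real pipeline might use embedding similarity or ranking metrics instead:

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def retrieval_score(query, docs):
    """Mean fraction of query terms that appear in each retrieved doc (0.0 if none)."""
    query_terms = tokenize(query)
    if not query_terms or not docs:
        return 0.0
    per_doc = [len(query_terms & tokenize(doc)) / len(query_terms) for doc in docs]
    return sum(per_doc) / len(per_doc)


docs = [
    "Refunds are issued within 5 days",
    "Our office hours are 9 to 5",
]
score = retrieval_score("refunds days", docs)
```

Inside retrieve_docs, the computed value would replace the constant: enrich_span(metrics={"num_docs": len(docs), "retrieval_score": retrieval_score(query, docs)}).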

Next Steps

  • Custom Metrics: Full guide to adding metrics to traces
  • Run Your First Experiment: Complete tutorial with real examples
  • Server-Side Evaluators: Run evaluators on HoneyHive infrastructure
  • LLM-as-Judge: Use LLMs to evaluate outputs