Client-side evaluators run within your application, giving you real-time feedback during execution and full control over evaluation logic. You can use them via two main workflows:
| Workflow | When to Use | How |
| --- | --- | --- |
| Adding Metrics to Traces | Production monitoring, guardrails | enrich_span(), enrich_session() |
| Evaluator Functions for Experiments | Testing against datasets, CI/CD | Define functions, pass to evaluate() |
Adding Metrics to Traces
Compute scores in your application code and attach them to traces for monitoring and analysis.
Use cases: Format validation, safety checks, PII detection, latency tracking, relevance scores.
```python
import os
from honeyhive import HoneyHiveTracer, trace, enrich_span

HoneyHiveTracer.init(
    api_key=os.getenv("HH_API_KEY"),
    project=os.getenv("HH_PROJECT"),
)

@trace
def generate_response(query):
    response = call_llm(query)

    # Add metrics to this span
    enrich_span(metrics={
        "response_length": len(response),
        "contains_pii": check_pii(response),
        "relevance_score": compute_relevance(query, response),
    })
    return response
```
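The helpers called above (call_llm, check_pii, compute_relevance) are placeholders for your own logic. As one possibility, a minimal regex-based PII check might look like the following sketch; the patterns are illustrative, not exhaustive:

```python
import re

# Illustrative only: a real PII detector should cover far more patterns.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # US SSN-like numbers
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email addresses
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),       # credit-card-like digit runs
]

def check_pii(text: str) -> bool:
    """Return True if any PII-like pattern appears in the text."""
    return any(pattern.search(text) for pattern in PII_PATTERNS)
```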
To see where to initialize the tracer for your environment, including AWS Lambda and long-running servers, see Tracer Initialization.
Metrics appear in the HoneyHive dashboard for charting, alerting, and filtering.
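The workflow table above also lists enrich_session() for metrics that describe an entire session rather than a single span. Assuming enrich_session() accepts a metrics dict the way enrich_span() does, a minimal sketch (handle_conversation is a hypothetical entry point, and an active tracer/session is assumed):

```python
from honeyhive import enrich_session

def handle_conversation(messages):
    # Assumes HoneyHiveTracer.init() has already been called,
    # so there is an active session to enrich.
    replies = [generate_response(m) for m in messages]

    # Attach a session-wide metric (assumes enrich_session mirrors enrich_span)
    enrich_session(metrics={"num_turns": len(messages)})
    return replies
```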
For complete documentation on adding metrics to traces, see Custom Metrics.
Evaluator Functions for Experiments
Define scoring functions that run locally during evaluate() to compare outputs against expected results.
Writing an Evaluator
Evaluators receive three arguments and return a score:
```python
def my_evaluator(outputs, inputs, ground_truth):
    """
    Args:
        outputs: Return value from your function
        inputs: The inputs dict from the datapoint
        ground_truth: The ground_truth dict from the datapoint

    Returns:
        A score (number, boolean, or string)
    """
    expected = ground_truth.get("expected", "")
    return 1.0 if outputs == expected else 0.0
```
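Evaluators do not have to use every argument or return a float. For example, a boolean guardrail-style check could ignore ground_truth entirely; the 200-word limit below is an arbitrary illustrative threshold:

```python
def within_word_limit(outputs, inputs, ground_truth):
    # Boolean score: True if the output stays under an arbitrary word limit
    return len(str(outputs).split()) <= 200
```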
Running Evaluators
Pass evaluator functions to evaluate():
```python
import os
from honeyhive import evaluate

def accuracy(outputs, inputs, ground_truth):
    expected = ground_truth.get("intent", "")
    actual = outputs.get("intent", "")
    return 1.0 if expected == actual else 0.0

def my_classifier(datapoint):
    text = datapoint["inputs"]["text"]
    # Your classification logic
    return {"intent": classify(text)}

dataset = [
    {"inputs": {"text": "I need a refund"}, "ground_truth": {"intent": "billing"}},
    {"inputs": {"text": "App won't load"}, "ground_truth": {"intent": "technical"}},
]

result = evaluate(
    function=my_classifier,
    dataset=dataset,
    evaluators=[accuracy],
    api_key=os.getenv("HH_API_KEY"),
    project=os.getenv("HH_PROJECT"),
    name="intent-classifier-v1",
)
```
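Because evaluators accepts a list, you can score each datapoint on several dimensions in one run. A sketch that adds a second, hypothetical evaluator alongside accuracy (the experiment name is illustrative):

```python
def intent_present(outputs, inputs, ground_truth):
    # Hypothetical second metric: did the classifier return any intent at all?
    return 1.0 if outputs.get("intent") else 0.0

result = evaluate(
    function=my_classifier,
    dataset=dataset,
    evaluators=[accuracy, intent_present],  # both run on every datapoint
    api_key=os.getenv("HH_API_KEY"),
    project=os.getenv("HH_PROJECT"),
    name="intent-classifier-v2",
)
```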
Evaluating Multi-Step Pipelines
For pipelines with multiple steps, combine both approaches:
- Session-level: Pass evaluators to evaluate() for overall scoring
- Span-level: Use enrich_span() within traced functions for step-specific metrics
```python
import os
from honeyhive import evaluate, trace, enrich_span

# Session-level evaluator
def answer_quality(outputs, inputs, ground_truth):
    expected = ground_truth.get("answer", "")
    return 1.0 if expected.lower() in outputs.lower() else 0.0

@trace
def retrieve_docs(query):
    docs = search_database(query)
    # Span-level metric
    enrich_span(metrics={"num_docs": len(docs), "retrieval_score": 0.85})
    return docs

@trace
def generate_answer(docs, query):
    answer = call_llm(docs, query)
    enrich_span(metrics={"answer_length": len(answer)})
    return answer

def rag_pipeline(datapoint):
    query = datapoint["inputs"]["query"]
    docs = retrieve_docs(query)
    return generate_answer(docs, query)

result = evaluate(
    function=rag_pipeline,
    dataset=my_dataset,
    evaluators=[answer_quality],  # Scores the final output
    api_key=os.getenv("HH_API_KEY"),
    project=os.getenv("HH_PROJECT"),
    name="rag-eval",
)
```
After running, you’ll see both:
- answer_quality scores at the session level
- retrieval_score, num_docs, and answer_length at the span level
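In the pipeline above, retrieval_score is hardcoded to 0.85 as a placeholder. In practice you would compute it from the retrieved documents before passing it to enrich_span(); a crude token-overlap sketch (purely illustrative, not a HoneyHive API, and assuming docs is a list of strings):

```python
def score_retrieval(docs, query):
    """Fraction of query tokens that appear in at least one retrieved doc."""
    query_tokens = set(query.lower().split())
    if not query_tokens:
        return 0.0
    doc_text = " ".join(docs).lower()
    hits = sum(1 for token in query_tokens if token in doc_text)
    return hits / len(query_tokens)
```

With a helper like this, the span-level call becomes enrich_span(metrics={"num_docs": len(docs), "retrieval_score": score_retrieval(docs, query)}).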
Next Steps
- Custom Metrics: Full guide to adding metrics to traces
- Run Your First Experiment: Complete tutorial with real examples
- Server-Side Evaluators: Run evaluators on HoneyHive infrastructure
- LLM-as-Judge: Use LLMs to evaluate outputs