Python evaluators let you write custom evaluation logic that runs on HoneyHive’s infrastructure. Use them for format validation, metric calculations, or any programmatic assessment of your AI outputs.

Creating a Python Evaluator

  1. Navigate to the Evaluators tab in the HoneyHive console.
  2. Click Add Evaluator and select Python Evaluator.
HoneyHive’s server-side Python evaluators have access to Python’s standard library and packages including pandas, scikit-learn, jsonschema, sqlglot, and requests.
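For example, a format-validation evaluator can check that the model returned well-formed JSON with the fields your application expects. The sketch below uses only the standard library; the required keys and the sample event are illustrative, not part of HoneyHive's API:

```python
import json

# Hypothetical required keys -- adjust to your application's output format
REQUIRED_KEYS = {"answer", "confidence"}

def validate_output_format(event):
    """Return True if the model output is valid JSON containing the required keys."""
    try:
        payload = json.loads(event["outputs"]["content"])
    except (json.JSONDecodeError, TypeError):
        return False
    return isinstance(payload, dict) and REQUIRED_KEYS <= payload.keys()

# Sample event for illustration
event = {"outputs": {"content": '{"answer": "42", "confidence": 0.9}'}}
result = validate_output_format(event)
```

For stricter validation (nested structures, types, ranges), the same idea can be written against a schema with the bundled jsonschema package.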

Event Schema

Python evaluators operate on event objects representing spans in your traces.
| Property | Description | Example |
| --- | --- | --- |
| `inputs` | Input data for the event | `event["inputs"]["query"]` |
| `outputs` | Output data from the event | `event["outputs"]["content"]` |
| `feedback` | User feedback and ground truth | `event["feedback"]["ground_truth"]` |
| `metadata` | Additional event metadata | `event["metadata"]["model"]` |
| `event_type` | Type: `model`, `tool`, `chain`, or `session` | `event["event_type"]` |
| `event_name` | Name of the specific event | `event["event_name"]` |
Click Show Schema in the evaluator console to see all available properties for your events.
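To make the schema concrete, here is a rough sketch of what a model event might look like as a Python dict. The exact fields depend on your instrumentation, so treat the values below as placeholders:

```python
# Illustrative shape of a model event (fields vary with your instrumentation)
event = {
    "event_type": "model",
    "event_name": "generate_response",
    "inputs": {"query": "What is the capital of France?"},
    "outputs": {"content": "The capital of France is Paris."},
    "feedback": {"ground_truth": "Paris"},
    "metadata": {"model": "gpt-4o"},
}

# Properties are accessed with plain dict indexing
query = event["inputs"]["query"]
completion = event["outputs"]["content"]
```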

Evaluator Function

Define your evaluation logic in a Python function:
def check_unwanted_phrases(event):
    """Boolean: fail if the completion contains a canned refusal phrase."""
    unwanted_phrases = ["As an AI language model", "I'm sorry, but I can't", "I don't have personal opinions"]
    model_completion = event["outputs"]["content"]
    return not any(phrase.lower() in model_completion.lower() for phrase in unwanted_phrases)

result = check_unwanted_phrases(event)
Looking for ready-made examples? Check out our Python Evaluator Templates.
Resource limits: Python evaluators have a 1GB memory limit and 30-second timeout. Optimize your code to stay within these constraints.

Configuration

Event Filters

Filter which events this evaluator runs on by Event Type and Event Name. Use this to target specific spans in your pipeline (e.g., only model events named generate_response).

Return Type

  • Boolean: For true/false evaluations
  • Numeric: For scores or ratings (configure the scale, e.g., 1-5)
  • String: For categorical outputs
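As a sketch, the same response-length check could be expressed under each return type. The thresholds, labels, and sample event below are illustrative only:

```python
def is_nonempty(event):
    """Boolean: true/false check."""
    return bool(event["outputs"]["content"].strip())

def word_count_score(event):
    """Numeric: rate response length on a 1-5 scale (illustrative buckets)."""
    n = len(event["outputs"]["content"].split())
    if n < 10:
        return 1
    if n < 50:
        return 3
    return 5

def length_category(event):
    """String: categorical label for response length."""
    n = len(event["outputs"]["content"].split())
    return "short" if n < 10 else "medium" if n < 50 else "long"

# Each evaluator is configured with a single return type and yields one result
event = {"outputs": {"content": "Paris is the capital of France."}}
result = word_count_score(event)
```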

Passing Range

Define pass/fail criteria for your evaluator. Useful for CI builds and detecting failed test cases.

Advanced Settings

Expand to configure:
  • Requires Ground Truth: Enable if your evaluator needs feedback.ground_truth
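For instance, an evaluator with Requires Ground Truth enabled might do a normalized exact match against `feedback.ground_truth`. This is a sketch; your matching logic (token overlap, fuzzy match, etc.) may differ, and the sample event is illustrative:

```python
def exact_match(event):
    """Boolean: does the output match the labeled answer, ignoring case and whitespace?"""
    prediction = event["outputs"]["content"].strip().lower()
    ground_truth = event["feedback"]["ground_truth"].strip().lower()
    return prediction == ground_truth

# Sample event for illustration
event = {
    "outputs": {"content": "  Paris "},
    "feedback": {"ground_truth": "paris"},
}
result = exact_match(event)
```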
Click Create to save your evaluator.

Production Settings

After creating an evaluator, you can enable it for production traces from the Evaluators table:
  • Enabled: Toggle to run this evaluator on production traces (where source != evaluation)
  • Sampling %: When enabled, set a sampling percentage to control costs (e.g., 25% evaluates one in four events)

Using with Experiments

Server-side evaluators automatically run on all experiment traces that match your event filters. When you run evaluate(), your server-side evaluators score the results without any additional code.
from honeyhive import evaluate

# Server-side evaluators run automatically on matching events
result = evaluate(
    function=my_function,
    dataset=my_dataset,
    name="my-experiment"
)
# No need to pass an evaluators param; server-side evaluators are applied automatically