
Python evaluators let you write custom evaluation logic that runs on HoneyHive’s infrastructure. Use them for format validation, metric calculations, or any programmatic assessment of your AI outputs.

Creating a Python Evaluator

  1. Navigate to the Evaluators tab in the HoneyHive console.
  2. Click Add Evaluator and select Python Evaluator.
[Screenshot: HoneyHive Python evaluator creation interface showing the code editor]

Event Schema

Python evaluators operate on an event object representing a span in your traces. The following fields are available as top-level variables in your evaluator code:
  • event: The full event object (a dict containing all fields below)
  • inputs: Input data for the event
  • outputs: Output data from the event
  • feedback: User feedback and ground truth
  • metadata: Additional event metadata
  • metrics: Scores from other evaluators that have already run on this event (e.g. metrics.get("relevance"))
  • config: Configuration used for this event (model, hyperparameters, template)
  • event_type: Type of event: model, tool, chain, or session
  • event_name: Name of the specific event
  • event_id: Unique identifier for this event
  • session_id: Session this event belongs to
  • project_id: Project this event belongs to
  • source: Source of the event
  • start_time: Event start timestamp (Unix ms)
  • end_time: Event end timestamp (Unix ms)
  • duration: Event duration in milliseconds
  • error: Error message string if the event failed, otherwise None
  • user_properties: Custom user properties attached to the event
Click Show Schema in the evaluator console to see the actual shape of your events.
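For example, here is a sketch of reading a few of these variables inside an evaluator; the "content" and "relevance" keys are hypothetical and depend on the shape of your own events:
model_output = outputs.get("content", "")   # model completion text, if present
relevance = metrics.get("relevance")        # score from an evaluator that already ran on this event
is_slow = duration > 2000                   # duration is in milliseconds
had_error = error is not None               # error is None when the event succeeded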

Evaluator Function

Define your evaluation logic in a Python function. The function must take no arguments and access event data through the top-level variables:
def check_unwanted_phrases():
    unwanted_phrases = ["As an AI language model", "I'm sorry, but I can't"]
    model_completion = outputs["content"]
    return not any(phrase.lower() in model_completion.lower() for phrase in unwanted_phrases)
You can also assign directly to result:
result = len(outputs["content"].split()) > 10
If you assign to result, that value is used directly. Otherwise, HoneyHive calls the first callable it finds in your code, so define your main evaluator function first and put any helper functions after it; Python only resolves the helper names when the main function actually runs.
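A minimal sketch of that ordering (with hypothetical names), where the main function is defined first and a helper follows it:
def score_answer_length():
    # Defined first, so this is the callable HoneyHive picks up and runs.
    return word_count(outputs["content"]) > 10

def word_count(text):
    # Helper defined afterwards; its name is only looked up when
    # score_answer_length() actually executes, so this ordering works.
    return len(text.split())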
Looking for ready-made examples? Check out our Python Evaluator Templates.

Available Packages

The following packages are available for import in your evaluator code:
  • json: JSON parsing and serialization
  • re: Regular expressions
  • math, statistics: Numerical computations
  • collections: Specialized data structures
  • datetime: Date/time handling
  • string, itertools, functools, operator: Standard library utilities
  • pandas: DataFrames and data manipulation (in-memory only)
  • numpy: Numerical arrays and math
  • sklearn: Machine learning utilities (e.g. cosine similarity, metrics)
  • jsonschema: JSON schema validation
  • sqlglot: SQL parsing and validation
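As one way to combine these, here is a sketch evaluator that checks whether the model emitted valid JSON matching a schema; the schema and the "content" key are illustrative assumptions, not part of HoneyHive's API:
import json
import jsonschema

def check_output_is_valid_json():
    # Hypothetical schema: expects a JSON object with a string "answer" field.
    schema = {
        "type": "object",
        "properties": {"answer": {"type": "string"}},
        "required": ["answer"],
    }
    try:
        parsed = json.loads(outputs["content"])
        jsonschema.validate(instance=parsed, schema=schema)
        return True
    except (json.JSONDecodeError, jsonschema.ValidationError):
        return False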

Sandbox Restrictions

Python evaluators run in a sandboxed environment with these limits:
  • Code size: 4KB maximum
  • range() limit: range() is capped at 999 elements. For larger iterations, iterate over a list directly (e.g. outputs["content"].split()), which has no iteration limit; see the sketch after this list.
  • No file I/O: open() in write mode and package-level I/O functions (e.g. pd.read_csv, np.load) are blocked
  • No network access: HTTP requests and remote data fetching (e.g. sklearn.datasets.fetch_*) are not available
  • Import restrictions: Only the packages listed above can be imported
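For example, instead of indexing into the completion with range(), a sketch that iterates over the word list directly (the "content" key is an assumption about your output shape):
def average_word_length():
    # Iterating over the list avoids the 999-element range() cap;
    # iterating a list directly is not limited.
    words = outputs["content"].split()
    if not words:
        return 0
    return sum(len(word) for word in words) / len(words)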

Configuration

Event Filters

Filter which events this evaluator runs on using event type, event name, and additional property filters. See Event Filters for the full list of supported filter options and operators.

Return Type

  • Boolean: For true/false evaluations
  • Numeric: For scores or ratings (configure the scale, e.g., 1-5)
  • String: For categorical outputs
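For the numeric case, a sketch that returns a 1-5 rating based on response length (the thresholds are arbitrary example values):
def length_rating():
    # Returns an integer score between 1 and 5; thresholds are illustrative only.
    words = len(outputs["content"].split())
    if words < 10:
        return 1
    if words < 50:
        return 3
    return 5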

Passing Range

Define pass/fail criteria for your evaluator. Useful for CI builds and detecting failed test cases.

Advanced Settings

Expand to configure:
  • Requires Ground Truth: Enable if your evaluator needs feedback.ground_truth
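A minimal ground-truth check might look like the following sketch, assuming feedback is a dict whose ground_truth key holds the expected answer string:
def matches_ground_truth():
    # Assumes feedback["ground_truth"] holds the expected answer
    # (the field referenced above as feedback.ground_truth).
    expected = feedback.get("ground_truth", "")
    return outputs["content"].strip().lower() == expected.strip().lower()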
Click Create to save your evaluator.

Production Settings

After creating an evaluator, you can enable it for production traces from the Evaluators table:
  • Enabled: Toggle to run this evaluator on all traces that match your event filters
  • Sampling %: When enabled, set a sampling percentage to control costs. The default is 10% (one in ten matching events)

Using with Experiments

When enabled, server-side evaluators automatically run on all traces that match your event filters, including experiment traces. When you run evaluate(), your enabled server-side evaluators score the results without any additional code.
from honeyhive import evaluate

# Server-side evaluators run automatically on matching events
result = evaluate(
    function=my_function,
    dataset=my_dataset,
    name="my-experiment"
)
# No need to pass evaluators param - server-side evaluators are applied automatically

Troubleshooting

  • ImportError: Import of module 'X' is not allowed. Cause: the module is not in the allowed list. Fix: use only the available packages listed above.
  • PermissionError: file and network I/O is not allowed. Cause: the code attempted a file write or a network call. Fix: use in-memory operations only (e.g. df.to_dict() instead of df.to_csv("file.csv")).
  • Metric execution timed out. Cause: the code exceeded the execution timeout. Fix: optimize your logic or reduce the amount of data processed.
  • Code snippet exceeds maximum size. Cause: the code is over 4KB. Fix: simplify your evaluator or extract helper logic.
  • No function was defined and no 'result' was assigned. Cause: the code neither defines a function nor assigns to result. Fix: define a function or assign a value to result.
  • SyntaxError. Cause: the code has Python syntax errors. Fix: check your code in a local Python environment first.
  • Evaluator auto-disabled. Cause: 100 or more failures within 1 hour. Fix: resolve the underlying error, then re-enable from the Evaluators table. See auto-disable details.

Next Steps

  • Run Your First Experiment: Get started with experiments
  • Experiments Framework: Learn how experiments and evaluators work together