Python evaluators
How to create a custom Python evaluator in the HoneyHive console
In this guide, we’ll explore how to create custom Python evaluators by defining a simple assertion that checks whether the phrase “As an AI language model” is present in the output.
Creating the evaluator
UI Walkthrough
Use the following UI walkthrough alongside the guide below to create a custom Python evaluator in the HoneyHive console.
Navigate to the evaluator console
Navigate to the Evaluators tab in the left sidebar. Click Add Evaluator to create a new evaluator and select Python Evaluator.
Understand the schema
It’s important to familiarize yourself with HoneyHive’s event and session schema in order to successfully define Python evaluators.
The base unit of data in HoneyHive is called an event, which represents a span in a trace. A root event in a trace is of the type session, while all non-root events in a trace can be of 3 core types: model, tool, and chain.
- session: A root event used to group together multiple model, tool, and chain events into a single trace. This is achieved by having a common session_id across all children. Being a root event, a session does not have any parents.
- model events: Used to track the execution of any LLM requests.
- tool events: Used to track the execution of any deterministic functions like requests to vector DBs, requests to an external API, regex parsing, document reranking, and more.
- chain events: Used to group together multiple model and tool events into composable units that can be evaluated and monitored independently. Typical examples of chains include retrieval pipelines, post-processing pipelines, and more.
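To make the hierarchy concrete, below is a minimal sketch of how a session event and its child events might share a common session_id. The event names and the id value are illustrative, and only a few fields of the full event schema are shown.

session_event = {
    "event_type": "session",
    "event_name": "Support Chat",        # hypothetical name
    "session_id": "abc-123",
}

retrieval_event = {
    "event_type": "tool",
    "event_name": "Vector DB Lookup",    # hypothetical name
    "session_id": "abc-123",             # same session_id groups it into the trace
}

completion_event = {
    "event_type": "model",
    "event_name": "Model Completion",
    "session_id": "abc-123",             # same session_id groups it into the trace
}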
You can view the full event schema by clicking Show Schema in the evaluator console. For this particular example, we’ll be using the below event schema. Here, the model’s output to the user question is under outputs.content.
{
"event_type": "model",
"event_name": "Model Completion",
"inputs": {
"chat_history": [
{
"role": "system",
"content": "Imagine you are an OS expert"
},
{
"role": "user",
"content": "What causes deadlocks?"
}
]
},
"outputs": {
"role": "assistant",
"content": "Deadlocks in operating systems occur when two or more processes hold resources and each process is waiting for the other to release another resource, creating a cycle of dependencies that prevent any of them from proceeding. Several conditions must be present simultaneously for a deadlock to occur:\n\n1. **Mutual Exclusion**: At least one resource must be held in a non-shareable mode; that is, if a process is using the resource, other processes must be prevented from using it simultaneously.\n\n2. **Hold and Wait**: A process must be holding at least one resource and waiting to acquire additional resources that are currently being held by other processes.\n\n3. **No Preemption**: Resources cannot be forcibly removed from the processes holding them until the resources are used to completion and voluntarily released by the processes.\n\n4. **Circular Wait**: There must be a set of processes, {P1, P2, ..., Pn}, such that P1 is waiting for a resource held by P2, P2 is waiting for a resource held by P3, ..., and Pn is waiting for a resource held by P1, forming a circular chain of processes.\n\nDeadlocks can occur in various systems, including operating systems, database systems, and networking. They are a significant issue in concurrent programming, where multiple processes or threads operate simultaneously, often requiring access to shared resources.\n\nTo handle deadlocks, systems may use several strategies:\n\n- **Prevention**: Altering the way resources are requested to ensure that at least one of the necessary conditions for deadlock cannot hold.\n- **Avoidance**: Allowing the system to enter a deadlock state but ensuring that the system can recover either by using algorithms that ensure safe resource allocation or by imposing additional information about future requests.\n- **Detection and Recovery**: Permitting deadlocks to occur and then taking action once they are detected, which might involve killing processes or forcing them to release resources.\n- **Ignorance**: In some cases, especially when deadlocks are rare or the cost of prevention is high, a system might choose to ignore the possibility of deadlocks and deal with them on an ad hoc basis when they occur.\n\nUnderstanding and managing deadlocks is crucial for designing systems that are robust, efficient, and capable of handling concurrent operations without significant performance degradation or system failures."
}
}
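Inside an evaluator, this event arrives as a Python dictionary, so its fields can be read with ordinary key access. A quick sketch, assuming the schema above:

# Sketch: reading fields from the example event inside an evaluator
def inspect_event(event):
    user_question = event["inputs"]["chat_history"][-1]["content"]  # "What causes deadlocks?"
    model_answer = event["outputs"]["content"]                      # the assistant's completion
    return user_question, model_answer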
Set event filters
You can choose to compute your evaluator over a specific event type and event name, or over all sessions or a particular session name if you’re looking to evaluate properties that are spread across an entire trace.
In this example, we want this evaluator to be computed over all LLM requests, so we’ll select the Model event type and compute it over All Completions.
Define the evaluator function
Let’s start by defining and testing our evaluator. As stated earlier, we’ll define an assertion that checks whether the phrase “As an AI language model” is present in the output. See code below.
def language_model_assertion(event):
    # The model's completion text lives under outputs.content
    model_completion = event["outputs"]["content"]
    # True when the flagged phrase appears in the completion, False otherwise
    return "As an AI language model" in model_completion

result = language_model_assertion(event)
We’ll simply copy and paste the above code into the evaluator console. See below.
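If some events in your project might not populate outputs.content (for example, failed requests), a slightly more defensive variant of the same assertion could look like this sketch:

def language_model_assertion(event):
    # Fall back to an empty string if outputs or content is missing
    model_completion = (event.get("outputs") or {}).get("content") or ""
    # True when the flagged phrase appears in the completion
    return "As an AI language model" in model_completion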
Configuration and setup
Configure return type
Since our evaluator function uses a Boolean return type, we’ll configure it as Boolean. This will allow us to use aggregation functions like Percentage True when analyzing data.
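For intuition, Percentage True over a batch of Boolean results is simply the share of datapoints that returned True. A rough sketch of the computation (the results list is hypothetical):

results = [False, False, True, False]                 # hypothetical evaluator outputs
percentage_true = 100 * sum(results) / len(results)   # 25.0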
Configure passing range
Passing ranges let you detect which test cases failed in your evaluation. This is particularly useful for defining pass/fail criteria at the datapoint level in your CI builds.
Ideally, we don’t want the model to say “As an AI language model”, so we’ll configure the passing value as False. This allows us to automatically catch failures and any regressions.
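In a CI context, this passing range amounts to a per-datapoint check along these lines (a sketch; variable names are illustrative):

# Sketch: a datapoint passes only when the evaluator returns False,
# i.e. the completion does not contain "As an AI language model"
result = language_model_assertion(event)
assert result is False, "Regression: completion contains 'As an AI language model'"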
Enable online evaluations
Since this evaluator is computationally inexpensive, we can enable it to run in production by toggling Enable in production. This will ensure we catch any responses where the model mentions “As an AI language model”.
Validating the evaluator
Test against recent event
You can quickly test your evaluator with the built-in IDE, either by defining a datapoint to test against in the JSON editor or by retrieving recent events from your project. In this example, we’ll test the evaluator against our recent logs. See below.
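If you’d like to sanity-check the logic outside the console first, you can also run the same function locally against a dictionary shaped like the example event. A sketch (the content string is truncated here for brevity):

# Sketch: testing the evaluator locally against an event-shaped dict
sample_event = {
    "event_type": "model",
    "event_name": "Model Completion",
    "outputs": {
        "role": "assistant",
        "content": "Deadlocks in operating systems occur when two or more processes...",
    },
}

print(language_model_assertion(sample_event))  # False -> within the passing range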
Congratulations! Looks like the evaluator works as expected. You can now save this evaluator by pressing the Create button on the top right.