In this guide, we’ll explore how to create custom Python evaluators by defining a simple assertion that checks whether the phrase “As an AI language model” is present in the output.

Creating the evaluator

Navigate to the Evaluators tab in the left sidebar, click Add Evaluator to create a new evaluator, and select Python Evaluator.

Understand the schema

It’s important to familiarize yourself with HoneyHive’s event and session schema in order to successfully define Python evaluators.

The base unit of data in HoneyHive is called an event, which represents a span in a trace. A root event in a trace is of the type session, while all non-root events in a trace can be of 3 core types: model, tool, and chain.

All events have a parent-child relationship, except the session event, which, being the root event, has no parent.
  • session: A root event used to group together multiple model, tool, and chain events into a single trace. This is achieved by having a common session_id across all children.
  • model events: Used to track the execution of any LLM requests.
  • tool events: Used to track execution of any deterministic functions like requests to vector DBs, requests to an external API, regex parsing, document reranking, and more.
  • chain events: Used to group together multiple model and tool events into composable units that can be evaluated and monitored independently. Typical examples of chains include retrieval pipelines, post-processing pipelines, and more.
You can quickly explore the available event properties when creating an evaluator by clicking Show Schema in the evaluator console.
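
To make the parent-child relationship concrete, here’s a minimal sketch of how events in a single trace might relate to one another. The field values are illustrative, not taken from a real project:

# Illustrative only: a session root plus two children that share its session_id.
session_id = "6a1b2c3d-example"

session_event = {
    "event_type": "session",
    "event_name": "Support Chat",
    "session_id": session_id,
}

model_event = {
    "event_type": "model",
    "event_name": "Model Completion",
    "session_id": session_id,  # the shared session_id groups it into the trace above
}

tool_event = {
    "event_type": "tool",
    "event_name": "Vector DB Retrieval",
    "session_id": session_id,
}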

For this particular example, we’ll be using the event schema below. Here, the model’s output to the user question is under outputs.content.

{
  "event_type": "model",
  "event_name": "Model Completion",
  "inputs": {
    "chat_history": [
      {
        "role": "system",
        "content": "Imagine you are an OS expert"
      },
      {
        "role": "user",
        "content": "What causes deadlocks?"
      }
    ]
  },
  "outputs": {
    "role": "assistant",
    "content": "Deadlocks in operating systems occur when two or more processes hold resources and each process is waiting for the other to release another resource, creating a cycle of dependencies that prevent any of them from proceeding. Several conditions must be present simultaneously for a deadlock to occur:\n\n1. **Mutual Exclusion**: At least one resource must be held in a non-shareable mode; that is, if a process is using the resource, other processes must be prevented from using it simultaneously.\n\n2. **Hold and Wait**: A process must be holding at least one resource and waiting to acquire additional resources that are currently being held by other processes.\n\n3. **No Preemption**: Resources cannot be forcibly removed from the processes holding them until the resources are used to completion and voluntarily released by the processes.\n\n4. **Circular Wait**: There must be a set of processes, {P1, P2, ..., Pn}, such that P1 is waiting for a resource held by P2, P2 is waiting for a resource held by P3, ..., and Pn is waiting for a resource held by P1, forming a circular chain of processes.\n\nDeadlocks can occur in various systems, including operating systems, database systems, and networking. They are a significant issue in concurrent programming, where multiple processes or threads operate simultaneously, often requiring access to shared resources.\n\nTo handle deadlocks, systems may use several strategies:\n\n- **Prevention**: Altering the way resources are requested to ensure that at least one of the necessary conditions for deadlock cannot hold.\n- **Avoidance**: Allowing the system to enter a deadlock state but ensuring that the system can recover either by using algorithms that ensure safe resource allocation or by imposing additional information about future requests.\n- **Detection and Recovery**: Permitting deadlocks to occur and then taking action once they are detected, which might involve killing processes or forcing them to release resources.\n- **Ignorance**: In some cases, especially when deadlocks are rare or the cost of prevention is high, a system might choose to ignore the possibility of deadlocks and deal with them on an ad hoc basis when they occur.\n\nUnderstanding and managing deadlocks is crucial for designing systems that are robust, efficient, and capable of handling concurrent operations without significant performance degradation or system failures."
  }
}
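
To get a feel for navigating this schema, the snippet below pulls the user question and the model’s answer out of an event shaped like the one above (assuming event holds that dictionary):

# Assumes `event` is a dict shaped like the schema above.
user_question = event["inputs"]["chat_history"][-1]["content"]  # "What causes deadlocks?"
model_answer = event["outputs"]["content"]  # the assistant's full reply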

Set event filters

You can compute your evaluator over a specific event type and event name, or, if you’re evaluating properties that span an entire trace, over all sessions or a particular session name.

In this example, we want the evaluator to run over all LLM requests, so we’ll select the Model event type and compute it over All Completions.

Define the evaluator function

Let’s start by defining and testing our evaluator. As stated earlier, we’ll define an assertion that checks whether the phrase “As an AI language model” is present in the output. See code below.

def language_model_assertion(event):
    # Return True if the boilerplate phrase appears in the model's output, False otherwise
    model_completion = event["outputs"]["content"]
    return "As an AI language model" in model_completion

result = language_model_assertion(event)

We’ll simply copy and paste the above code into the evaluator console.

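If you later want to flag additional boilerplate phrases, a variant along these lines could work. This is only a sketch, and the phrase list is purely illustrative:

# Hypothetical extension: flag any of several boilerplate phrases.
BOILERPLATE_PHRASES = [
    "As an AI language model",
    "As a large language model",  # illustrative addition
]

def boilerplate_assertion(event):
    # Returns True if any listed phrase appears in the model's output
    model_completion = event["outputs"]["content"]
    return any(phrase in model_completion for phrase in BOILERPLATE_PHRASES)

result = boilerplate_assertion(event)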

Configuration and setup

Configure return type

Since our evaluator function uses a Boolean return type, we’ll configure it as Boolean. This will allow us to use aggregation functions like Percentage True when analyzing data.
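
Conceptually, a Percentage True aggregation reports the share of events where the evaluator returned True. Here’s a rough sketch over a list of Boolean results (not HoneyHive’s actual implementation):

# Sketch of a "Percentage True" aggregation over Boolean evaluator results.
results = [False, False, True, False]  # illustrative evaluator outputs
percentage_true = 100 * sum(results) / len(results)  # 25.0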

Configure passing range

Passing ranges let you detect which test cases failed in your evaluation. They’re particularly useful for defining pass/fail criteria at the datapoint level in your CI builds.

Ideally, the model should never say “As an AI language model”, so we’ll configure the passing value as False. This will allow us to automatically catch failures and any regressions.
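
At the datapoint level, the pass/fail check amounts to comparing each evaluator result against the configured passing value. A minimal sketch of how you might mirror that in a CI script (the names here are hypothetical):

PASSING_VALUE = False  # hypothetical constant mirroring the configured passing range

def datapoint_passes(result: bool) -> bool:
    # A datapoint passes only when the evaluator result matches the passing value.
    return result == PASSING_VALUE

failing_results = [r for r in [False, True, False] if not datapoint_passes(r)]  # [True]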

Enable online evaluations

Since this evaluator is computationally inexpensive, we can enable it to run in production by toggling Enable in production. This will ensure we catch any responses where the model mentions “As an AI language model”.

Validating the evaluator

Test against recent event

You can quickly test your evaluator in the built-in IDE by either defining a datapoint to test against in the JSON editor or retrieving recent events from your project. In this example, we’ll test the evaluator against our recent logs.

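If you’d rather sanity-check the function locally before pasting it into the console, a quick test like the following could work. The sample events are hand-written stand-ins for real logs:

# Hand-written stand-ins for real events, used to sanity-check the evaluator locally.
clean_event = {"outputs": {"content": "Deadlocks occur when processes wait on each other's resources."}}
flagged_event = {"outputs": {"content": "As an AI language model, I cannot answer that."}}

assert language_model_assertion(clean_event) is False
assert language_model_assertion(flagged_event) is True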

Congratulations! The evaluator works as expected. You can now save it by pressing the Create button in the top right.