LLM evaluators
Technical documentation for creating custom LLM evaluators in HoneyHive
LLM evaluators leverage large language models to assess the quality of AI-generated responses and other unstructured data operations (e.g., semantic retrieval) based on custom criteria.
Creating an LLM Evaluator
- Navigate to the Evaluators tab in the HoneyHive console.
- Click Add Evaluator and select LLM Evaluator.
Event Schema
The base unit of data in HoneyHive is called an event, which represents a span in a trace. A root event in a trace is of the type `session`, while all non-root events in a trace can be of 3 core types: `model`, `tool`, and `chain`.
- `session`: A root event used to group together multiple `model`, `tool`, and `chain` events into a single trace. This is achieved by having a common `session_id` across all children. Being a root event, a `session` event does not have any parents.
- `model` events: Used to track the execution of any LLM requests.
- `tool` events: Used to track the execution of any deterministic functions, like requests to vector DBs, requests to an external API, regex parsing, document reranking, and more.
- `chain` events: Used to group together multiple `model` and `tool` events into composable units that can be evaluated and monitored independently. Typical examples of chains include retrieval pipelines, post-processing pipelines, and more.
You can view the full event schema by clicking Show Schema in the evaluator console.

Evaluation Prompt
Define your evaluation prompt:
Use `{{}}` to reference event properties in your prompt.
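As an illustration, a hypothetical faithfulness evaluator prompt might look like the sketch below. The property paths inside the placeholders (`inputs.question`, `outputs.content`, and so on) are assumptions; use Show Schema to check which fields your events actually expose.

```python
# Hypothetical evaluation prompt; the property paths in the {{ }} placeholders
# are assumptions and should be matched to your own event schema.
FAITHFULNESS_PROMPT = """
You are grading whether an answer is faithful to the retrieved context.

Question: {{inputs.question}}
Retrieved context: {{inputs.context}}
Answer: {{outputs.content}}

Respond with true if every claim in the answer is supported by the retrieved
context, and false otherwise.
"""
```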
Configuration

Return Type
- Boolean: For true/false evaluations
- Numeric: For numeric scores or ratings
- String: For categorical evals or other objects
Passing Range
Passing ranges let you detect which test cases failed in your evaluation. This is particularly useful for defining pass/fail criteria at the datapoint level in your CI builds.
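Conceptually, a passing range turns each evaluator score into a per-datapoint pass/fail verdict. The sketch below is not HoneyHive's implementation, just an illustration of how a numeric evaluator with an assumed passing range of 7–10 could gate a CI build; the scores and range are hypothetical.

```python
# Illustration only: applying a hypothetical passing range of [7, 10] to numeric
# evaluator scores so a CI build fails when any datapoint falls outside it.
PASSING_RANGE = (7, 10)

scores = {"datapoint_1": 9, "datapoint_2": 4, "datapoint_3": 8}  # hypothetical results

failed = {dp: s for dp, s in scores.items()
          if not (PASSING_RANGE[0] <= s <= PASSING_RANGE[1])}
if failed:
    raise SystemExit(f"Evaluation failed for datapoints: {sorted(failed)}")
```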
Online Evaluation
Toggle to enable real-time evaluation in production. We define production as any traces where `source != evaluation` when initializing the tracer.
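For example, traces initialized with a source other than `evaluation` would be eligible for online evaluation. The snippet below follows the pattern from the HoneyHive Python tracing docs; treat the exact parameter names as assumptions and verify against the current SDK reference.

```python
from honeyhive import HoneyHiveTracer

# Traces initialized with a source other than "evaluation" count as production
# traffic for online evaluators. Parameter names follow the documented pattern
# of the HoneyHive Python SDK; verify against the current SDK reference.
HoneyHiveTracer.init(
    api_key="<HONEYHIVE_API_KEY>",
    project="<PROJECT_NAME>",
    source="production",  # anything other than "evaluation" is treated as production
)
```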
Enable sampling
Sampling allows us to run our evaluator over a smaller percentage of events from production. This helps minimize costs while still providing valuable insights about the performance of our application. We’ll set the sampling percentage to 25% in this example.
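Conceptually, a 25% sampling rate means roughly one in four eligible events gets scored, as in this illustrative (not actual) sketch:

```python
import random

# Illustration only: with a 25% sampling rate, roughly one in four eligible
# production events is selected for online evaluation.
def is_sampled(sampling_rate: float = 0.25) -> bool:
    return random.random() < sampling_rate
```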
Sampling only applies to events where `source` is not `evaluation` or `playground`, i.e. typically only production or staging environments. You cannot sample events when running offline evaluations.

Event Filters
You can choose to compute your evaluator over a specific event type and event name, or over all sessions or a particular session name if you’re looking to evaluate properties that are spread across an entire trace.
Validating the evaluator
LLM evaluators can be unreliable and need validation and alignment with your own judgement before you deploy them. You can quickly test your evaluator in the built-in IDE by either defining a datapoint to test against in the JSON editor, or retrieving the 5 most recent events from your project.
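For instance, you could paste a minimal test datapoint like the one below into the JSON editor. It is shown here as a Python dict that prints the JSON to paste; the field names mirror the hypothetical event sketch above and are assumptions, so align them with the properties your evaluation prompt references.

```python
import json

# A minimal, hypothetical test datapoint for the evaluator IDE's JSON editor.
# Field names mirror the event sketch above and are assumptions; match them to
# the properties your evaluation prompt actually references.
test_datapoint = {
    "inputs": {
        "question": "What is HoneyHive?",
        "context": "HoneyHive is an AI observability and evaluation platform.",
    },
    "outputs": {"content": "HoneyHive is an AI observability platform."},
}

print(json.dumps(test_datapoint, indent=2))  # paste the printed JSON into the editor
```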
Save your evaluator by clicking Create in the top-right corner.