LLM evaluators use large language models to assess the quality of AI-generated responses and other unstructured data operations (e.g., semantic retrieval) against custom criteria.

Creating an LLM Evaluator

  1. Navigate to the Evaluators tab in the HoneyHive console.
  2. Click Add Evaluator and select LLM Evaluator.

Event Schema

The base unit of data in HoneyHive is called an event, which represents a span in a trace. The root event in a trace is of type session, while all non-root events in a trace can be one of three core types: model, tool, and chain.

All events have a parent-child relationship, except the session event, which, being the root event, has no parent.
  • session: A root event used to group together multiple model, tool, and chain events into a single trace. This is achieved by having a common session_id across all children.
  • model events: Used to track the execution of any LLM requests.
  • tool events: Used to track the execution of deterministic functions such as vector DB requests, external API calls, regex parsing, document reranking, and more.
  • chain events: Used to group together multiple model and tool events into composable units that can be evaluated and monitored independently. Typical examples of chains include retrieval pipelines, post-processing pipelines, and more.
You can quickly explore the available event properties when creating an evaluator by clicking Show Schema in the evaluator console.
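
To make the schema concrete, here is a minimal sketch of what a model event might look like. Every field value below is a made-up example; use Show Schema to see the exact properties available in your project.

```python
# Illustrative model event; all values here are hypothetical examples.
example_model_event = {
    "event_type": "model",            # one of: session, model, tool, chain
    "event_name": "generate_answer",  # hypothetical span name
    "session_id": "7f3c...",          # shared by all events in the same trace
    "inputs": {
        "question": "What is HoneyHive?",
        "context": "HoneyHive is a platform for evaluating and monitoring LLM applications.",
    },
    "outputs": {
        "content": "HoneyHive helps teams trace, evaluate, and monitor their LLM apps.",
    },
}
```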

Evaluation Prompt

Define your evaluation prompt:

[Instruction]
Evaluate the AI assistant's answer based on:
1. Relevance to the question
2. Accuracy of information
3. Clarity and coherence
4. Completeness of the answer

Provide a brief explanation and rate the response on a scale of 1 to 5.

[Question]
{{inputs.question}}

[Context]
{{inputs.context}}

[AI Assistant's Answer]
{{outputs.content}}

[Evaluation]
Explanation:
Rating: [[X]]
Use {{}} to reference event properties in your prompt.
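
The substitution itself is handled by HoneyHive at evaluation time. Purely as an illustration of how a reference such as {{inputs.question}} resolves against an event (not the platform's actual templating code), consider this sketch:

```python
import re

def render_prompt(template: str, event: dict) -> str:
    # Illustrative only: resolve each {{inputs.question}}-style reference
    # against the event dictionary, mirroring the mapping described above.
    def resolve(match: re.Match) -> str:
        value = event
        for key in match.group(1).strip().split("."):
            value = value[key]  # "inputs.question" -> event["inputs"]["question"]
        return str(value)
    return re.sub(r"\{\{(.*?)\}\}", resolve, template)

# render_prompt("[Question]\n{{inputs.question}}", example_model_event)
# -> "[Question]\nWhat is HoneyHive?"
```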

Configuration

Return Type

  • Boolean: For true/false evaluations
  • Numeric: For numeric scores or ratings
  • String: For categorical evals or other objects
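
With the prompt above, a Numeric return type corresponds to the [[X]] rating the evaluator is asked to produce. As a purely illustrative sketch (not HoneyHive's implementation), extracting that rating from the evaluator's raw text might look like this:

```python
import re

def extract_rating(evaluation_text: str) -> int | None:
    # Illustrative: pull the 1-5 rating out of text that follows the
    # "Rating: [[X]]" convention used in the evaluation prompt above.
    match = re.search(r"Rating:\s*\[\[(\d)\]\]", evaluation_text)
    return int(match.group(1)) if match else None

# extract_rating("Explanation: concise and accurate.\nRating: [[4]]")  -> 4
```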

Passing Range

Passing ranges let you detect which test cases failed in your evaluation. This is particularly useful for defining pass/fail criteria at the datapoint level in your CI builds.
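
For example, with the 1-to-5 rating above, a passing range of 4 to 5 marks any datapoint rated below 4 as failed. A hypothetical sketch of that check (the range values are assumptions for illustration, not defaults):

```python
def passes(rating: int, passing_range: tuple[int, int] = (4, 5)) -> bool:
    # Hypothetical pass/fail check: a datapoint passes when its rating
    # falls inside the configured passing range (inclusive).
    low, high = passing_range
    return low <= rating <= high

# passes(5) -> True, passes(3) -> False
```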

Online Evaluation

Toggle to enable real-time evaluation in production. We define production as any trace whose source is set to anything other than evaluation when initializing the tracer.
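
For example, a minimal tracer initialization in the Python SDK might look like the sketch below (assuming HoneyHiveTracer.init accepts api_key, project, source, and session_name; the project and session names are hypothetical). Any source other than evaluation, such as production, makes the resulting traces eligible for online evaluation:

```python
from honeyhive import HoneyHiveTracer

# Minimal sketch, assuming these init parameters; names are placeholders.
HoneyHiveTracer.init(
    api_key="YOUR_HONEYHIVE_API_KEY",
    project="your-project-name",        # hypothetical project name
    source="production",                # anything other than "evaluation"
    session_name="qa-chatbot-session",  # hypothetical session name
)
```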

Enable sampling

Sampling runs your evaluator over a smaller percentage of production events. This helps minimize costs while still providing valuable insights into your application's performance. In this example, we'll set the sampling percentage to 25%.

Sampling only applies to events where source is not evaluation or playground, i.e. typically only production or staging environments. You cannot sample events when running offline evaluations.
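
Conceptually, the sampling decision behaves like the sketch below. This is purely illustrative of the rules described above, not HoneyHive's implementation:

```python
import random

def should_evaluate(source: str, sampling_rate: float = 0.25) -> bool:
    # Skip evaluation/playground traffic entirely, then evaluate roughly
    # sampling_rate (here 25%) of the remaining production/staging events.
    if source in {"evaluation", "playground"}:
        return False
    return random.random() < sampling_rate
```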

Event Filters

You can compute your evaluator over a specific event type and event name, or over all sessions (or a particular session name) if you want to evaluate properties that span an entire trace.

Validating the evaluator

LLM evaluators can be unreliable and need validation and alignment with your own judgement before you deploy them. You can quickly test your evaluator in the built-in IDE, either by defining the datapoint to test against in the JSON editor or by retrieving the 5 most recent events from your project to run the evaluator against.

Save your evaluator by clicking Create in the top right corner.