LLM evaluators use large language models to assess the quality of AI-generated responses and other unstructured data operations (e.g., semantic retrieval) against custom criteria.

Creating an LLM Evaluator

  1. Navigate to the Evaluators tab in the HoneyHive console.
  2. Click Add Evaluator and select LLM Evaluator.

Event Schema

The base unit of data in HoneyHive is called an event, which represents a span in a trace. The root event in a trace is of type session, while all non-root events in a trace can be one of three core types: model, tool, and chain.

All events have a parent-child relationship, except the session event, which, being the root event, has no parent.
  • session: A root event used to group together multiple model, tool, and chain events into a single trace. This is achieved by having a common session_id across all children.
  • model events: Used to track the execution of any LLM requests.
  • tool events: Used to track the execution of deterministic functions such as vector DB requests, external API calls, regex parsing, document reranking, and more.
  • chain events: Used to group together multiple model and tool events into composable units that can be evaluated and monitored independently. Typical examples of chains include retrieval pipelines, post-processing pipelines, and more.
You can quickly explore the available event properties when creating an evaluator by clicking Show Schema in the evaluator console.
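
To make the schema concrete, here is a minimal sketch of what a model event might look like. Every field value below is a made-up example; use Show Schema to see the exact properties available in your project.

```python
# Illustrative model event; all values here are hypothetical examples.
example_model_event = {
    "event_type": "model",            # one of: session, model, tool, chain
    "event_name": "generate_answer",  # hypothetical span name
    "session_id": "7f3c...",          # shared by all events in the same trace
    "inputs": {
        "question": "What is HoneyHive?",
        "context": "HoneyHive is a platform for evaluating and monitoring LLM applications.",
    },
    "outputs": {
        "content": "HoneyHive helps teams trace, evaluate, and monitor their LLM apps.",
    },
}
```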

Evaluation Prompt

Define your evaluation prompt:

[Instruction]
Evaluate the AI assistant's answer based on:
1. Relevance to the question
2. Accuracy of information
3. Clarity and coherence
4. Completeness of the answer

Provide a brief explanation and rate the response on a scale of 1 to 5.

[Question]
{{inputs.question}}

[Context]
{{inputs.context}}

[AI Assistant's Answer]
{{outputs.content}}

[Evaluation]
Explanation:
Rating: [[X]]
Use {{}} to reference event properties in your prompt.
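
The substitution itself is handled by HoneyHive at evaluation time. Purely as an illustration of how a reference such as {{inputs.question}} resolves against an event (not the platform's actual templating code), consider this sketch:

```python
import re

def render_prompt(template: str, event: dict) -> str:
    # Illustrative only: resolve each {{inputs.question}}-style reference
    # against the event dictionary, mirroring the mapping described above.
    def resolve(match: re.Match) -> str:
        value = event
        for key in match.group(1).strip().split("."):
            value = value[key]  # "inputs.question" -> event["inputs"]["question"]
        return str(value)
    return re.sub(r"\{\{(.*?)\}\}", resolve, template)

# render_prompt("[Question]\n{{inputs.question}}", example_model_event)
# -> "[Question]\nWhat is HoneyHive?"
```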

Configuration

Return Type

  • Boolean: For true/false evaluations
  • Numeric: For numeric scores or ratings
  • String: For categorical evals or other objects
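
With the prompt above, a Numeric return type corresponds to the [[X]] rating the evaluator is asked to produce. As a purely illustrative sketch (not HoneyHive's implementation), extracting that rating from the evaluator's raw text might look like this:

```python
import re

def extract_rating(evaluation_text: str) -> int | None:
    # Illustrative: pull the 1-5 rating out of text that follows the
    # "Rating: [[X]]" convention used in the evaluation prompt above.
    match = re.search(r"Rating:\s*\[\[(\d)\]\]", evaluation_text)
    return int(match.group(1)) if match else None

# extract_rating("Explanation: concise and accurate.\nRating: [[4]]")  -> 4
```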

Passing Range

Passing ranges let you detect which test cases failed in your evaluation. This is particularly useful for defining pass/fail criteria at the datapoint level in your CI builds.
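
For example, with the 1-to-5 rating above, a passing range of 4 to 5 marks any datapoint rated below 4 as failed. A hypothetical sketch of that check (the range values are assumptions for illustration, not defaults):

```python
def passes(rating: int, passing_range: tuple[int, int] = (4, 5)) -> bool:
    # Hypothetical pass/fail check: a datapoint passes when its rating
    # falls inside the configured passing range (inclusive).
    low, high = passing_range
    return low <= rating <= high

# passes(5) -> True, passes(3) -> False
```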

Online Evaluation

Toggle to enable real-time evaluation in production. We define production as any trace whose source is set to anything other than evaluation when initializing the tracer.
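
For example, a minimal tracer initialization in the Python SDK might look like the sketch below (assuming HoneyHiveTracer.init accepts api_key, project, source, and session_name; the project and session names are hypothetical). Any source other than evaluation, such as production, makes the resulting traces eligible for online evaluation:

```python
from honeyhive import HoneyHiveTracer

# Minimal sketch, assuming these init parameters; names are placeholders.
HoneyHiveTracer.init(
    api_key="YOUR_HONEYHIVE_API_KEY",
    project="your-project-name",        # hypothetical project name
    source="production",                # anything other than "evaluation"
    session_name="qa-chatbot-session",  # hypothetical session name
)
```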

Enable sampling

Sampling runs your evaluator over a smaller percentage of production events. This helps minimize costs while still providing valuable insights into your application's performance. In this example, we'll set the sampling percentage to 25%.

Sampling only applies to events where source is not evaluation or playground, i.e. typically only production or staging environments. You cannot sample events when running offline evaluations.
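
Conceptually, the sampling decision behaves like the sketch below. This is purely illustrative of the rules described above, not HoneyHive's implementation:

```python
import random

def should_evaluate(source: str, sampling_rate: float = 0.25) -> bool:
    # Skip evaluation/playground traffic entirely, then evaluate roughly
    # sampling_rate (here 25%) of the remaining production/staging events.
    if source in {"evaluation", "playground"}:
        return False
    return random.random() < sampling_rate
```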

Event Filters

You can compute your evaluator over a specific event type and event name, or over all sessions (or a particular session name) if you want to evaluate properties that span an entire trace.

Validating the evaluator

LLM evaluators can be unreliable and need validation and alignment with your own judgement before you deploy them. You can quickly test your evaluator in the built-in IDE, either by defining the datapoint to test against in the JSON editor or by retrieving the 5 most recent events from your project to run the evaluator against.

Save your evaluator by clicking Create in the top right corner.