Overview
An introduction to configuring evaluators with HoneyHive
What are evaluators?
Fundamentally, evaluators are validation and testing functions that programmatically test and score the performance of an LLM application, or of a specific component within your application.
Our evaluation framework lets you define multiple evaluators that measure your app's performance during development (against a reference dataset) or when monitoring traces in production. This allows you to test the effectiveness of your application across different scenarios during development and catch failure modes and errors in production.
Evaluator Types
HoneyHive supports three types of evaluators:
- Python Evaluators: Commonly used to configure unit tests with keyword assertions to check for specific syntax requirements. You can also use metrics like semantic similarity between ground truth and model responses, statistical measures like ROUGE, or even run code in containerized environments to check for code validity.
- LLM Evaluators: Use LLMs to evaluate subjective traits like accuracy or context quality. Examples include evaluators such as `Context Relevance`, `Answer Faithfulness`, `Coherence`, and more. These evaluators are useful for open-ended tasks like RAG and abstractive summarization, where no single correct answer may exist and scoring requires a higher degree of reasoning.
- Human Evaluators: Allow you to capture explicit feedback (ratings, binary, or free-form text) from domain experts to manually evaluate responses.
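As a hedged sketch, a Python evaluator of the keyword-assertion kind might look like the following; the function name, signature, and keyword list are illustrative assumptions, not HoneyHive's exact interface:

```python
# Hypothetical keyword-assertion unit test. The (output, ground_truth)
# signature and the REQUIRED_KEYWORDS list are assumptions for illustration.

REQUIRED_KEYWORDS = ["refund", "policy"]

def contains_required_keywords(output: str, ground_truth: str = "") -> bool:
    """Pass when the model response mentions every required keyword."""
    text = output.lower()
    return all(keyword in text for keyword in REQUIRED_KEYWORDS)
```

A `True` return from an evaluator like this would then be scored against the evaluator's passing range.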
Concepts
Passing Range
This indicates the desired range of values returned by an evaluator upon a successful run. This is used to calculate pass/fail metrics for each evaluation run.
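Conceptually, the pass/fail computation reduces to a range check; the sketch below assumes an illustrative range of [0.8, 1.0], which is not a HoneyHive default:

```python
# Sketch of how a passing range converts a numeric evaluator score into a
# pass/fail metric. The [0.8, 1.0] bounds are illustrative assumptions.

def is_passing(score: float, low: float = 0.8, high: float = 1.0) -> bool:
    """A run passes when the score falls inside the configured range."""
    return low <= score <= high
```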
Return Type
The type of value returned by the evaluator. We currently support `Numeric`, `Boolean`, and `String`.
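For illustration, here is one hypothetical evaluator per supported return type; the function names and logic are assumptions, not built-in evaluators:

```python
# Illustrative evaluators, one per supported return type.

def response_length(output: str) -> float:   # Numeric
    return float(len(output.split()))

def is_nonempty(output: str) -> bool:        # Boolean
    return bool(output.strip())

def length_bucket(output: str) -> str:       # String
    return "long" if len(output) > 100 else "short"
```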
Online Evaluators
This enables your evaluator to be automatically computed over all events where `source` is not `evaluation` or `playground`. This is useful if you’re looking to evaluate your production data. To enable this, toggle `Enable in production` in your evaluator settings.
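The eligibility rule above can be sketched as a simple predicate; the event dict shape and `source` field access are assumptions for illustration:

```python
# Sketch of the online-evaluator eligibility rule: score only events whose
# source is neither "evaluation" nor "playground".

def is_online_eligible(event: dict) -> bool:
    return event.get("source") not in ("evaluation", "playground")
```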
Enabling Sampling
For computationally heavy evaluators like LLM evaluators, you can configure your evaluator to run over a smaller percentage of events. Sampling only applies to events where `source` is not `evaluation` or `playground`, i.e. typically only production or staging environments. You cannot sample events when running offline evaluations.
Ground Truth
Ground truth refers to the ideal or expected response from your application. Some metrics, such as `Cosine Similarity`, require ground truth labels to compute and therefore may not be suitable for evaluating production data. Enabling this option ensures no metrics are computed in the absence of ground truth labels.
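To make the dependency concrete, here is a hedged sketch of a ground-truth-dependent metric: cosine similarity over bag-of-words term counts (a real evaluator would more likely compare embeddings). Skipping the computation when no label exists is modeled as returning `None`:

```python
import math
from collections import Counter
from typing import Optional

# Hypothetical cosine-similarity evaluator. Returning None when ground truth
# is missing mirrors skipping the metric rather than emitting a meaningless score.

def cosine_similarity_eval(output: str, ground_truth: Optional[str]) -> Optional[float]:
    if not ground_truth:
        return None  # no label: compute nothing
    a = Counter(output.lower().split())
    b = Counter(ground_truth.lower().split())
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```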
Event Filters
You can configure your evaluator to run over a particular type of event (LLM, Tool, Chain), or over a specific `session_name` or `event_name`. This is useful when separating out unit tests from end-to-end integration tests.
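An event filter can be thought of as a predicate like the one below; the field names (`event_type`, `event_name`) and the `synthesis` step are assumptions for illustration:

```python
# Sketch of an event filter: run the evaluator only on LLM events emitted by
# a specific named step.

def matches_filter(event: dict) -> bool:
    return event.get("event_type") == "LLM" and event.get("event_name") == "synthesis"
```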
Unit vs Integration Testing
Evaluators can be used to test an LLM application end-to-end (integration tests), or specific components within an application (unit tests). For example, a naive RAG application can be broken down into retrieval and synthesis steps, which can be evaluated independently using `Context Relevance` and `Answer Faithfulness` evaluators respectively.
We recommend configuring evaluators over a specific event if you’re running a unit test, or over the entire session trace if you’re running an integration test.
What’s next
Learn how to set up your own evaluators for your unique use case.