What are evaluators?

Fundamentally, evaluators are validation and testing functions that let you programmatically test and score the performance of an LLM application, or of a specific component within it.

Our evaluation framework lets you define multiple evaluators that can be used to evaluate your app's performance during development (against a reference dataset) or when monitoring traces in production. This allows you to test your application's effectiveness across different scenarios during development and to catch failure modes and errors in production.
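
Conceptually, an evaluator is just a scoring function applied to your application's inputs and outputs. The sketch below is illustrative only (the function and field names are assumptions, not HoneyHive's API): it runs a list of evaluator functions over a small reference dataset and collects their scores.

```python
# Illustrative sketch only -- function and field names are assumptions, not HoneyHive's API.
from typing import Callable

# A reference dataset: inputs paired with ground-truth answers.
reference_dataset = [
    {"input": "What is the capital of France?", "ground_truth": "Paris"},
]

def exact_match(output: str, ground_truth: str) -> bool:
    """A trivial evaluator: does the response match the expected answer?"""
    return output.strip().lower() == ground_truth.strip().lower()

def run_evaluation(app: Callable[[str], str], evaluators: list[Callable]) -> list[dict]:
    """Run every evaluator over every example in the reference dataset."""
    results = []
    for example in reference_dataset:
        output = app(example["input"])  # call your LLM application
        scores = {fn.__name__: fn(output, example["ground_truth"]) for fn in evaluators}
        results.append({"input": example["input"], "output": output, "scores": scores})
    return results
```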

Evaluator Types

Here are the three types of evaluators we support in HoneyHive:

  1. Python Evaluators: Commonly used to configure unit tests with keyword assertions that check for specific syntax requirements. You can also use metrics like semantic similarity between ground truth and model responses, statistical measures like ROUGE, or even run code in containerized environments to check for code validity. (A minimal sketch of a Python and an LLM evaluator follows this list.)
  2. LLM Evaluators: Use LLMs to evaluate subjective traits like accuracy or context quality. Examples include evaluators such as Context Relevance, Answer Faithfulness, and Coherence. These evaluators are useful for open-ended tasks like RAG and abstractive summarization, where no single correct answer may exist and scoring requires a higher degree of reasoning.
  3. Human Evaluators: Allow you to capture explicit feedback (ratings, binary scores, or free-form text) from domain experts to manually evaluate responses.
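
As a rough illustration of the first two types, here is a keyword-assertion Python evaluator and an LLM evaluator that asks a judge model to grade faithfulness. The prompt, model name, and 1-5 scale are assumptions chosen for the example, not built-in HoneyHive evaluators.

```python
# Illustrative sketches only -- the prompt, model name, and scale are assumptions.
from openai import OpenAI

def contains_citation(output: str) -> bool:
    """Python evaluator: keyword assertion that the response cites a source."""
    return "[source]" in output.lower()

def answer_faithfulness(output: str, context: str) -> int:
    """LLM evaluator: ask a judge model to rate faithfulness to the retrieved context (1-5)."""
    client = OpenAI()
    prompt = (
        "Rate from 1 (unfaithful) to 5 (fully faithful) how well the answer is "
        f"supported by the context.\n\nContext:\n{context}\n\nAnswer:\n{output}\n\n"
        "Reply with a single integer."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return int(response.choices[0].message.content.strip())
```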

Concepts

Passing Range

This is the range of values an evaluator is expected to return on a successful run, and it is used to calculate pass/fail metrics for each evaluation run. For example, a numeric evaluator scored between 0 and 1 might use a passing range of 0.8–1.0.

Return Type

The type of value returned by the evaluator. We currently support Numeric, Boolean, and String.
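
For illustration, the same basic check could be expressed with any of the three return types; the functions below are illustrative examples, not built-in evaluators.

```python
# Illustrative only: the same length check expressed with each supported return type.
def response_length(output: str) -> int:          # Numeric
    return len(output.split())

def is_concise(output: str) -> bool:              # Boolean
    return len(output.split()) <= 100

def verbosity_label(output: str) -> str:          # String
    return "concise" if len(output.split()) <= 100 else "verbose"
```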

Online Evaluators

This enables your evaluator to be computed automatically over all events whose source is not evaluation or playground, which is useful if you’re looking to evaluate your production data. To enable this, toggle Enable in production in your evaluator settings.

Enabling Sampling

For computationally heavy evaluators, such as LLM evaluators, you can configure your evaluator to run over a smaller percentage of events.

Sampling only applies to events where the source is not evaluation or playground, i.e. typically only production or staging environments. You cannot sample events when running offline evaluations.
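
Conceptually, sampling just means the evaluator runs on a random fraction of eligible events. The snippet below is a plain-Python illustration of a 10% sampling rate, not how HoneyHive implements sampling internally.

```python
# Illustrative only: sample roughly 10% of eligible events for evaluation.
import random

SAMPLE_RATE = 0.10

def should_evaluate(event: dict) -> bool:
    # Only non-evaluation, non-playground events are eligible for sampling.
    if event.get("source") in ("evaluation", "playground"):
        return False
    return random.random() < SAMPLE_RATE
```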

Ground Truth

Ground truth refers to the ideal or expected response from your application. Some metrics, such as Cosine Similarity, require ground truth labels to compute and may therefore not be suitable for evaluating production data. Enabling this option ensures that no metrics are computed in the absence of ground truth labels.
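
For example, a cosine-similarity evaluator only has something to compare against when a ground truth label exists. The sketch below uses TF-IDF vectors from scikit-learn purely as an illustration; embedding-based similarity works the same way.

```python
# Illustrative sketch: TF-IDF cosine similarity between response and ground truth.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similarity_to_ground_truth(output: str, ground_truth: str | None) -> float | None:
    if not ground_truth:
        return None  # skip the metric when no ground truth label is available
    vectors = TfidfVectorizer().fit_transform([output, ground_truth])
    return float(cosine_similarity(vectors[0], vectors[1])[0][0])
```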

Event Filters

You can configure your evaluator to run over a particular type of event (LLM, Tool, Chain), or over a specific session_name or event_name. This is useful when separating out unit tests from end-to-end integration tests.

Unit vs Integration Testing

Evaluators can be used to test an LLM application end-to-end (integration tests) or to test specific components within the application (unit tests). For example, a naive RAG application can be broken down into retrieval and synthesis steps, which can be evaluated independently using Context Relevance and Answer Faithfulness evaluators, respectively.

We recommend configuring evaluators over a specific event if you’re running a unit test, or over the entire session trace if you’re running an integration test.
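
As a rough sketch of the distinction, the first evaluator below scores just the retrieval step (a unit test), while the second scores the final session output against the original question (an integration test). Both functions and their heuristics are illustrative assumptions, not built-in evaluators.

```python
# Illustrative only: a unit-level check on the retrieval step vs. an end-to-end check.
def context_relevance(query: str, retrieved_chunks: list[str]) -> float:
    """Unit test for the retrieval step: fraction of chunks that mention a query term."""
    terms = set(query.lower().split())
    hits = sum(any(term in chunk.lower() for term in terms) for chunk in retrieved_chunks)
    return hits / len(retrieved_chunks) if retrieved_chunks else 0.0

def answer_covers_question(query: str, final_answer: str) -> bool:
    """Integration test over the full session: does the final answer mention any query term?"""
    return any(term in final_answer.lower() for term in query.lower().split())
```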

What’s next

Learn how to set up your own evaluators for your unique use case.