Evaluators and Guardrails
Defining Evaluators
Define Python and LLM Evaluators within HoneyHive
Evaluator types
HoneyHive provides the core infrastructure to define and compute two types of evaluators:
- Python Evaluators: Python evaluators allow you to define your own Python functions using external libraries such as NumPy, Pandas, Scikit-Learn, or Requests. We’ve seen examples ranging from BERTScore to running SQL queries in containerized environments to check for validity; the evaluator you choose here really depends on your use case (see the sketch below).
- LLM Evaluators: LLM evaluators allow you to prompt LLMs to grade your model completions (and optionally any input features, such as context from your vector database) on your desired scale (e.g., a Likert scale or Boolean). Common evaluators cover subjective traits and characteristics such as Truthfulness, Coherence, Relevance, or Groundedness.
Potential Bias with LLM Evaluators: While LLM evaluators can be a good quantitative heuristic for measuring performance, they are prone to occasional bias and can be misleading in certain scenarios. We recommend testing your LLM evaluators in the editor against recent production data and relying on human feedback when evaluating variants.
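To make the first bullet concrete, here is a minimal sketch of a Python evaluator that uses the Requests library to check whether URLs cited in a completion actually resolve. The function name, its signature, and the "completion" field are illustrative assumptions, not HoneyHive's required interface.

```python
import re
import requests

# Illustrative sketch only: the evaluator signature and the "completion"
# field name are assumptions; align them with your project's datapoint schema.
def url_validity(datapoint):
    completion = datapoint.get("completion", "")
    urls = re.findall(r"https?://\S+", completion)
    if not urls:
        return 1.0  # nothing to verify
    reachable = 0
    for url in urls:
        try:
            # HEAD request keeps the check lightweight; follow redirects.
            if requests.head(url, timeout=5, allow_redirects=True).ok:
                reachable += 1
        except requests.RequestException:
            pass
    # Score is the fraction of cited URLs that resolved successfully.
    return reachable / len(urls)
```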
Define a Python Evaluator
- Accessing the Evaluators Section: Navigate to the Evaluators tab in the left sidebar.
- Creating a New Evaluator: Click Add Evaluator to create a new evaluator and select Python Evaluator.
- Python Evaluator Example: Let’s start by defining and testing our evaluator. In this example, we’ll define a custom evaluator that counts the number of sentences in a given model completion. See the sketch below.
- Testing a Python Evaluator: You can quickly test your evaluator with the built-in IDE. To test an evaluator, define your datapoint in JSON format and click Run to compute the metric.
Testing Evaluators: You can quickly retrieve the most recent datapoint from your project to test your evaluator against real-world app completions.
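As a concrete reference for the example above, a sentence-counting evaluator might look like the sketch below. The function name, its signature, and the "completion" field are assumptions; match them to the datapoint schema shown in the built-in IDE.

```python
import re

# Illustrative sketch only: counts sentences in a model completion.
def sentence_count(datapoint):
    completion = datapoint.get("completion", "")
    # Naive split on terminal punctuation followed by whitespace or end of text.
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", completion) if s.strip()]
    return len(sentences)

# A datapoint in the JSON-style format you might paste into the IDE to test it.
example_datapoint = {
    "completion": "HoneyHive lets you define custom evaluators. You can test them before deploying."
}
print(sentence_count(example_datapoint))  # -> 2
```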
Define an LLM Evaluator
- Accessing the Evaluators Page: Navigate to the Evaluators tab in the left sidebar.
- Creating a New LLM Evaluator: Click Add Evaluator and select LLM Evaluator.
- AI Feedback Function Example: Let’s start by defining and testing our evaluator. In this example, we’ll define an evaluator that uses GPT-4 to rate whether a given model completion contains any PII (such as phone numbers, SSNs, emails, etc.). See the example prompt template below.
- Testing an LLM Evaluator: You can quickly test your feedback function with the built-in prompt editor. To test an evaluator, define your datapoint in JSON format and click Run to compute the metric.
Using datapoint schema in prompt template: Some LLM evaluators, such as Truthfulness, require the model to not only analyze the completion but also take the user input and retrieval chunks into account. HoneyHive allows you to edit the prompt template and use the datapoint schema to insert any property associated with a datapoint into the prompt template. Simply wrap the property name in double curly brackets {{ }} to insert the desired property into the prompt template, e.g., {{inputs}}.
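Putting this together, a prompt template for the PII example above might look something like the following. The {{inputs}} property comes from the datapoint schema as described above; {{completion}} is an assumed property name for the model's output, so substitute whichever property your datapoint schema actually exposes. The rating scale and output format are up to you.

```
You are a strict data-privacy reviewer.

User input:
{{inputs}}

Model completion:
{{completion}}

Does the completion contain any personally identifiable information
(e.g., phone numbers, SSNs, email addresses, home addresses)?
Respond with a single Boolean value: true if PII is present, false otherwise.
```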
Offline evaluation vs. production monitoring
It’s important to note that some Python evaluators, such as common NLP metrics (e.g., ROUGE or BERTScore), may require ground truth labels (i.e., ideal model responses or outputs) to compute. For this reason, they are better suited for offline evaluations during model development and testing than for production monitoring of generative tasks. A minimal sketch of such a ground-truth-dependent evaluator is shown below.
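For example, a ground-truth-dependent evaluator could be sketched as follows, using the open-source rouge-score package (an external dependency you would need available in the evaluator environment). The "completion" and "ground_truth" field names are assumptions, not a fixed HoneyHive schema.

```python
# Illustrative sketch only: requires `pip install rouge-score`.
from rouge_score import rouge_scorer

def rouge_l_f1(datapoint):
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    scores = scorer.score(
        datapoint["ground_truth"],  # target: ideal / reference response
        datapoint["completion"],    # prediction: model completion being evaluated
    )
    # Return ROUGE-L F1 (0.0 to 1.0) as the metric value.
    return scores["rougeL"].fmeasure
```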