HoneyHive provides the core infrastructure to define and compute two types of evaluators:
- Python Evaluators: Python evaluators allow you to define your own Python functions, using external libraries such as Requests. We’ve seen examples ranging from BERTScore to running SQL queries in containerized environments to check for validity; the evaluator you choose here really depends on your use case.
- LLM Evaluators: LLM evaluators allow you to prompt LLMs to grade your model completions (and optionally any input features such as context from your vector database) on your desired scale (eg: Likert Scale or Boolean). Common evaluators include subjective traits and characteristics such as Truthfulness, Coherence, Relevance or Groundedness.
Define a Python Evaluator
- Accessing the Evaluators Section: Navigate to the Evaluators tab in the left sidebar.
- Creating a New Evaluator: Click `Add Evaluator` to create a new evaluator, then select the Python evaluator type.
- Python Evaluator Example: Let’s start by defining and testing our evaluator. In this example, we’ll define a custom evaluator that counts the number of sentences in a given model completion. See below.
- Testing a Python Evaluator: You can quickly test your evaluator with the built-in IDE. To test an evaluator, define your datapoint in JSON format and click `Run` to compute the metric.
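As a sketch of the steps above, a sentence-counting evaluator might look like the following. The `completion` field name and the naive splitting logic are assumptions for illustration, not HoneyHive's exact evaluator signature:

```python
import json
import re

def sentence_count(datapoint):
    """Hypothetical Python evaluator: counts sentences in a model completion.

    Assumes the datapoint exposes the model output under a "completion" key;
    adjust the key to match your own datapoint schema.
    """
    completion = datapoint.get("completion", "")
    # Naive sentence splitting on ., ! or ? followed by whitespace or end-of-string.
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", completion) if s.strip()]
    return len(sentences)

# Simulate the built-in IDE test: define a datapoint as JSON, then run the evaluator.
datapoint = json.loads('{"completion": "The sky is blue. Water is wet! Is that all?"}')
print(sentence_count(datapoint))  # → 3
```

The same pattern applies to heavier evaluators: the function receives a datapoint dict and returns a numeric or Boolean metric value.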
Define an LLM Evaluator
- Accessing the Evaluators Page: Navigate to the Evaluators tab in the left sidebar.
- Creating a New LLM Evaluator: Click `Add Evaluator` and select the LLM evaluator type.
- AI Feedback Function Example: Let’s start by defining and testing our evaluator. In this example, we’ll define an evaluator that uses GPT-4 to rate whether a given model completion contains any PII (such as phone numbers, SSN, emails, etc.). See below.
- Testing an LLM Evaluator: You can quickly test your feedback function with the built-in prompt editor. To test an evaluator, define your datapoint in JSON format and click `Run` to compute the metric.
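Implemented by hand, such a PII evaluator could look roughly like the sketch below. The prompt wording, the `parse_verdict` helper, and the field names are assumptions for illustration, not HoneyHive's actual implementation; the OpenAI call requires the `openai` package and an `OPENAI_API_KEY` environment variable:

```python
# Sketch of an LLM evaluator that asks GPT-4 whether a completion contains PII
# and normalizes the answer into a Boolean metric value.

PII_PROMPT = """You are a strict data-privacy reviewer.
Does the following model completion contain any PII
(phone numbers, SSNs, email addresses, etc.)?
Answer with exactly one word: "true" or "false".

Completion:
{completion}
"""

def build_prompt(datapoint: dict) -> str:
    # Insert the datapoint's completion into the grading prompt.
    return PII_PROMPT.format(completion=datapoint["completion"])

def parse_verdict(raw: str) -> bool:
    # Map the model's one-word answer onto a Boolean scale.
    return raw.strip().lower().startswith("true")

def contains_pii(datapoint: dict) -> bool:
    from openai import OpenAI
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": build_prompt(datapoint)}],
        temperature=0,  # deterministic grading
    )
    return parse_verdict(response.choices[0].message.content)
```

Forcing a one-word "true"/"false" answer keeps parsing trivial; the same structure extends to Likert-scale grading by asking for a number and parsing an integer instead.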
Some evaluators, such as Truthfulness, require the model to analyze not only the completion but also the user input and retrieval chunks. HoneyHive allows you to edit the prompt template and use the datapoint schema to insert any property associated with a datapoint into the prompt template. Simply wrap the property name in double curly braces (e.g. `{{completion}}`) to insert the desired property into the prompt template.
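To make the placeholder behavior concrete, here is a minimal stand-in renderer; the `render_template` helper and the property names `input`, `context`, and `completion` are assumptions for illustration, and HoneyHive's actual rendering may differ:

```python
import re

def render_template(template: str, datapoint: dict) -> str:
    """Substitute {{property}} placeholders with fields from the datapoint.

    Unknown placeholders are left untouched rather than raising an error.
    """
    def replace(match):
        key = match.group(1)
        return str(datapoint.get(key, match.group(0)))
    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", replace, template)

# A Truthfulness-style template that pulls in the input and retrieval context.
template = (
    "User asked: {{input}}\n"
    "Retrieved context: {{context}}\n"
    "Model answered: {{completion}}\n"
    "Is the answer fully supported by the context? Answer true or false."
)
print(render_template(template, {
    "input": "What does HoneyHive compute?",
    "context": "HoneyHive computes Python and LLM evaluators.",
    "completion": "It computes two types of evaluators.",
}))
```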
Offline evaluation vs. production monitoring
It’s important to note that some Python evaluators like common NLP metrics (eg: ROUGE, BERTScore, etc.) may require ground truth labels (i.e., ideal model responses or outputs) to compute. For this reason, they are better suited for offline evaluations during model development and testing rather than production monitoring for generative tasks.
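A toy ROUGE-1-style recall metric makes the ground-truth dependency concrete. This is a simplified stand-in for a real implementation such as the `rouge-score` package, and the `completion` and `ground_truth` field names are assumptions; the point is that the metric simply cannot be computed without a reference answer:

```python
def unigram_recall(datapoint: dict) -> float:
    """Fraction of ground-truth tokens that also appear in the completion.

    Requires a "ground_truth" field, which is typically available in offline
    evaluation datasets but absent from live production traffic.
    """
    reference = datapoint["ground_truth"].lower().split()
    candidate = set(datapoint["completion"].lower().split())
    if not reference:
        return 0.0
    hits = sum(1 for token in reference if token in candidate)
    return hits / len(reference)

print(unigram_recall({
    "completion": "the cat sat on the mat",
    "ground_truth": "the cat sat on a mat",
}))  # → 5/6 ≈ 0.83
```

In production, where no `ground_truth` exists, reference-free evaluators (such as the LLM evaluators above) are the better fit.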