Evaluator types

HoneyHive provides the core infrastructure to define and compute two types of evaluators:

  1. Python Evaluators: Python evaluators allow you to define your own Python functions using external libraries such as NumPy, Pandas, scikit-learn, or Requests. We’ve seen examples ranging from BERTScore to SQL queries run in containerized environments to check validity; the right evaluator here really depends on your use case.
  2. LLM Evaluators: LLM evaluators allow you to prompt LLMs to grade your model completions (and optionally any input features, such as context from your vector database) on your desired scale (e.g., a Likert scale or Boolean). Common evaluators measure subjective traits such as Truthfulness, Coherence, Relevance, or Groundedness.
Potential Bias with LLM Evaluators: While LLM evaluators can be a useful quantitative heuristic for measuring performance, they are prone to occasional bias and can be misleading in certain scenarios. We recommend testing your LLM evaluators in the editor against recent production data and relying on human feedback when evaluating variants.

Define a Python Evaluator

  1. Accessing the Evaluators Section: Navigate to the Evaluators tab in the left sidebar.
  2. Creating a New Evaluator: Click Add Evaluator to create a new evaluator and select Python Evaluator.
  3. Python Evaluator Example: Let’s start by defining and testing our evaluator. In this example, we’ll define a custom evaluator that counts the number of sentences in a given model completion. See below.

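As a rough sketch, a sentence-count evaluator could be written as a plain Python function like the one below; the function name and the way the completion string is passed in are illustrative, so adapt them to the signature your project expects.

```python
import re

# Illustrative sketch of a sentence-count evaluator. The function name and
# the way the completion is passed in are assumptions; adapt them to the
# signature your HoneyHive project expects.
def sentence_count(completion: str) -> int:
    # Split on sentence-ending punctuation followed by whitespace,
    # then drop any empty fragments.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", completion.strip()) if s]
    return len(sentences)

# Example: returns 2
print(sentence_count("Paris is the capital of France. It is also its largest city."))
```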

  4. Testing a Python Evaluator: You can quickly test your evaluator with the built-in IDE. To test an evaluator, define your datapoint in JSON format and click Run to compute the metric.
Testing Evaluators: You can quickly retrieve the most recent datapoint from your project to test your evaluator against real-world app completions.
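For example, a test datapoint might look something like the following; the field names (inputs, completion) are illustrative and should match the schema of datapoints in your own project.

```json
{
  "inputs": {
    "question": "What is the capital of France?"
  },
  "completion": "The capital of France is Paris. It is also the country's largest city."
}
```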

Define an LLM Evaluator

  1. Accessing the Evaluators Page: Navigate to the Evaluators tab in the left sidebar.
  2. Creating a new LLM Evaluator: Click Add Evaluator and select LLM Evaluator.
  3. AI Feedback Function Example: Let’s start by defining and testing our evaluator. In this example, we’ll define an evaluator that uses GPT-4 to determine whether a given model completion contains any PII (such as phone numbers, SSNs, or email addresses). See below.

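As a rough illustration, the evaluator prompt could look something like the sketch below, graded on a Boolean scale; the exact wording, the {{completion}} property name, and the answer format are assumptions to adapt to your own configuration.

```text
You are a strict PII detector. Review the model completion below and decide
whether it contains any personally identifiable information, such as phone
numbers, SSNs, or email addresses.

Completion:
{{completion}}

Answer with a single word: true if the completion contains PII, false otherwise.
```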

  4. Testing an LLM Evaluator: You can quickly test your feedback function with the built-in prompt editor. To test an evaluator, define your datapoint in JSON format and click Run to compute the metric.
Using datapoint schema in prompt template: Some LLM evaluators, such as Truthfulness, require the model to analyze not only the completion but also the user input and retrieval chunks. HoneyHive lets you edit the prompt template and use the datapoint schema to insert any property associated with a datapoint. Simply wrap the property name in double curly brackets {{ }} to insert it into the prompt template. Example: {{inputs}}.
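For instance, a Truthfulness-style template might interpolate several datapoint properties at once; the property names below other than {{inputs}} are hypothetical and depend on your own datapoint schema.

```text
User input:
{{inputs}}

Retrieved context:
{{context}}

Completion to grade:
{{completion}}

Rate how well the completion is supported by the retrieved context on a 1-5 scale.
```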

Offline evaluation vs. production monitoring

It’s important to note that some Python evaluators, such as common NLP metrics (e.g., ROUGE, BERTScore), may require ground truth labels (i.e., ideal model responses or outputs) to compute. For this reason, they are better suited for offline evaluations during model development and testing than for production monitoring of generative tasks.
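To make that dependency concrete, here is a minimal sketch of a ground-truth-based evaluator (a simple token-overlap F1 rather than full ROUGE); the ground_truth value is assumed to be available on the datapoint, which is typically only the case for curated offline test sets.

```python
from collections import Counter

# Illustrative sketch: token-overlap F1 between the model completion and a
# ground truth reference. Field names are assumptions; a real ROUGE or
# BERTScore evaluator would call the corresponding library instead.
def overlap_f1(completion: str, ground_truth: str) -> float:
    pred = completion.lower().split()
    ref = ground_truth.lower().split()
    if not pred or not ref:
        return 0.0
    # Count tokens shared between prediction and reference (with multiplicity).
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Example usage with a curated reference answer:
print(overlap_f1("Paris is the capital of France", "The capital of France is Paris"))
```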