Evaluators are tests that compute scores or metrics to measure the quality of inputs and outputs for your AI application or specific steps within it. They serve as a crucial component for validating whether your models meet performance criteria and align with domain expertise. Whether you’re fine-tuning prompts, comparing different generative models, or monitoring production systems, evaluators help maintain high standards through systematic testing and measurement.

Key characteristics of HoneyHive evaluators

HoneyHive provides a flexible and comprehensive evaluation framework that can be adapted to various needs and scenarios:

Evaluation Scope

HoneyHive provides flexible granularity in evaluation, allowing you to:

  • Assess entire end-to-end pipelines
  • Evaluate individual steps within your application flow
  • Monitor specific components such as model calls, tool usage, or chain execution
  • Track and evaluate sessions that group multiple operations together
If you want to know how to log client-side evaluations on specific traces and spans, explore our tracing documentation.
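
For instance, a client-side metric can be attached to a span as it is logged. The sketch below assumes the tracer initialization and span-enrichment helpers described in the tracing documentation; exact names and parameters may differ in your SDK version, so treat it as illustrative rather than authoritative:

from honeyhive import HoneyHiveTracer, trace, enrich_span

# Assumed initialization call from the tracing docs; replace with your own project details.
HoneyHiveTracer.init(api_key="YOUR_HONEYHIVE_API_KEY", project="YOUR_PROJECT")

@trace
def answer_question(question: str) -> str:
    answer = "..."  # call your model here
    # Attach a client-side evaluation result to this span at logging time.
    enrich_span(metrics={"answer_length": len(answer)})
    return answer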

Development Stages

  • Offline Evaluation: Used during development and testing phases, including CI/CD pipelines and debugging sessions, where immediate results aren’t critical. In this stage, you can build test suites composed of carefully curated examples of the scenarios you wish to test, with or without ground truths.
  • Online Evaluation: Applied to production systems for real-time monitoring and quality assessment of live applications, enabling real-time quality monitoring, continuous validation of model outputs, and production guardrails and safety checks.
For an example of an offline evaluation with client-side evaluators, see how to run an experiment here.
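
As a rough sketch of that offline flow, the snippet below pairs a small curated dataset with a client-side evaluator and an experiment runner like the one covered in the linked guide. The `evaluate` entry point, its parameter names, and the dataset shape are assumptions here; consult the guide for the authoritative signature:

from honeyhive import evaluate, evaluator

@evaluator()
def exact_match(outputs, inputs, ground_truths):
    # Pass/fail check against the curated ground truth for this example.
    return outputs.strip().lower() == ground_truths.strip().lower()

def my_app(inputs, ground_truths):
    # The application logic under test; stubbed out for illustration.
    return "Paris"

# Assumed experiment-runner call; see the linked guide for how to pass
# your API key and project configuration.
evaluate(
    function=my_app,
    dataset=[{"inputs": {"question": "What is the capital of France?"},
              "ground_truths": "Paris"}],
    evaluators=[exact_match],
    name="offline-smoke-test",
)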

Implementation Methods

Evaluators can be implemented using three primary methods:

  • Python Code Evaluators: Custom functions that programmatically assess outputs based on specific criteria, such as format validation, content checks, or metric calculations.
  • LLM-Assisted Evaluators: Leverage language models to perform qualitative assessments, such as checking for coherence, relevance, or alignment with requirements.
  • Domain Expert (Human) Evaluators: Enable subject matter experts to provide direct feedback and assessments through the HoneyHive platform.
If you want to know more about how to set up server-side Python, LLM, or Human-based evaluators, please refer to the Python Evaluator, LLM Evaluator, and Human Annotation pages.

Execution Environment

  • Client-Side Execution: Evaluators run locally within your application environment, providing immediate feedback and integration with your existing infrastructure.
  • Server-Side Execution: Evaluators operate remotely on HoneyHive’s infrastructure, post-ingestion, offering scalability, centralized management, and versioning without impacting your application’s performance.
If you want to know more about the differences and pros/cons of client-side and server-side evaluators, refer to the Client-side vs Server-side Evaluators page.

Examples of HoneyHive Evaluators

Let’s explore some practical examples of evaluators and how they align with different implementation approaches and use cases.

Example 1: Simple Regex validation

import re
from honeyhive import evaluator

@evaluator()
def ssn_evaluator(outputs):
    """
    Detects potential Social Security Numbers in text.
    Looks for the pattern XXX-XX-XXXX where X is a digit.
    """
    ssn_pattern = r'(?!000|666|9\d{2})\d{3}-(?!00)\d{2}-(?!0000)\d{4}'
    return bool(re.search(ssn_pattern, outputs))

This simple regex-based evaluator demonstrates a lightweight validation that checks text for the presence of Social Security Numbers. Because it executes quickly and provides immediate feedback, it is ideal to implement as a client-side evaluator, and it can be used effectively in both offline testing scenarios and online production environments where real-time PII detection is crucial.
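
For a quick local sanity check, you can exercise the pattern on a couple of strings (assuming the @evaluator() decorator leaves the function directly callable; otherwise test the regex on its own):

# Hypothetical local checks, run outside HoneyHive:
print(ssn_evaluator("Customer SSN: 123-45-6789"))   # True  -> PII detected
print(ssn_evaluator("Order #555-12-0000 shipped"))  # False -> last group 0000 is excluded by the lookahead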

Example 2: Similarity Scoring with Ground Truth

@evaluator()
def sample_evaluator(outputs, inputs, ground_truths):
    """
    Calculate simple word overlap (Jaccard similarity).
    """
    output_words = set(outputs.lower().split())
    truth_words = set(ground_truths.lower().split())

    intersection = len(output_words & truth_words)
    union = len(output_words | truth_words)

    return intersection / union if union > 0 else 0.0

This Jaccard similarity evaluator computes word overlap between model outputs and ground truth references. Though computationally lightweight and suitable for client-side execution, its requirement for ground truth data can limit its use in certain production scenarios where such reference data might not be readily available.
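
To make the arithmetic concrete, consider a direct call with two short strings (again assuming the decorated function remains directly callable for local testing):

score = sample_evaluator(
    outputs="the cat sat on the mat",
    inputs=None,  # unused by this metric
    ground_truths="the cat sat on a mat",
)
# Unique output words: {the, cat, sat, on, mat} -> 5
# Unique truth words:  {the, cat, sat, on, a, mat} -> 6
# Intersection = 5, union = 6, so score = 5 / 6 ≈ 0.83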

Example 3: LLM-as-a-judge evaluation

[Instruction]
Please act as an impartial judge and evaluate the quality of the answer provided by an AI assistant based on the context provided below. Your evaluation should consider the mentioned criteria. Begin your evaluation by providing a short explanation on how the answer from the AI assistant performs relative to the provided context. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 5 by strictly following this format: "[[rating]]", for example: "Rating: [[5]]".

[Criteria]
The answer generated by the AI assistant should be faithful to the provided context and should not include information that isn't supported by the context.

[The Start of Provided Context]
{{ inputs.context }}
[The End of Provided Context]

[The Start of AI Assistant's Answer]
{{ outputs.content }}
[The End of AI Assistant's Answer]

[Evaluation With Rating]

This example showcases a prompt template for LLM-assisted evaluation, where another language model acts as a judge to assess the faithfulness of AI responses to a given context. Due to its complexity and computational requirements (requiring additional LLM API calls), this type of evaluator is best implemented as a server-side evaluator. It’s particularly valuable for qualitative assessments that require nuanced understanding, such as checking coherence, relevance, or alignment with specific criteria.
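
The strict "Rating: [[N]]" format in the prompt exists so the score can be extracted programmatically from the judge's free-text response. A minimal parsing sketch (the parse_rating helper is hypothetical, not part of the HoneyHive SDK):

import re

def parse_rating(judge_response: str):
    """Extract the 1-5 rating from a response formatted as 'Rating: [[N]]'."""
    match = re.search(r"\[\[([1-5])\]\]", judge_response)
    return int(match.group(1)) if match else None

parse_rating("The answer stays within the context. Rating: [[4]]")  # -> 4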

When deploying in production environments, careful consideration should be given to sampling rates to manage costs and computational load while maintaining statistically significant evaluation coverage.

Example 4: External API Model Evaluation

import requests

def sentiment_evaluator(outputs):
    """
    Evaluates text sentiment using Hugging Face's public sentiment analysis model.
    Returns a score between 0 and 1 indicating positive sentiment.
    """
    API_URL = "https://api-inference.huggingface.co/models/distilbert-base-uncased-finetuned-sst-2-english"

    try:
        response = requests.post(
            API_URL,
            headers={
                "Content-Type": "application/json"
            },
            json={"inputs": outputs},
            timeout=5
        )

        if response.status_code == 200:
            result = response.json()
            # Extract positive sentiment score
            for label_data in result[0]:
                if label_data['label'] == 'POSITIVE':
                    return label_data['score']
            return 0.0
        else:
            print(f"API call failed with status code: {response.status_code}")
            return None

    except requests.exceptions.RequestException as e:
        print(f"Request failed: {str(e)}")
        return None

This evaluator demonstrates integration with external API services, specifically using Hugging Face’s Inference API for sentiment analysis. Due to potential I/O blocking and latency considerations, this type of evaluator is best implemented as a server-side evaluator with appropriate error handling and timeout mechanisms. It’s particularly suitable for non-critical evaluation scenarios where metrics are generated post-ingestion for enrichment and debugging purposes.
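
One practical caveat: Hugging Face's hosted Inference API generally expects a bearer token, and anonymous requests may be rate-limited or rejected. In practice you would likely extend the request headers along these lines (the HF_TOKEN environment variable is illustrative):

import os

headers = {
    "Content-Type": "application/json",
    # Token-based auth for the hosted Inference API; anonymous calls may be throttled.
    "Authorization": f"Bearer {os.environ.get('HF_TOKEN', '')}",
}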