Evaluators are tests that measure the quality of inputs and outputs for your AI application or specific steps within it.
They serve as a crucial component for validating whether your models meet performance criteria and align with domain expertise.
Whether you’re refining prompts, comparing different generative models, or monitoring production systems, evaluators help maintain high standards through systematic testing and measurement.
Offline Evaluation: Used during development and testing phases, including CI/CD pipelines and debugging sessions, where immediate results aren’t critical. In this stage, you can build test suites composed of carefully curated examples of the scenarios you wish to test, with or without ground truths.
Online Evaluation: Applied to production systems for real-time monitoring and quality assessment of live applications, enabling continuous validation of model outputs as well as production guardrails and safety checks.
Evaluators can be implemented using three primary methods:
Python Code Evaluators: Custom functions that programmatically assess outputs based on specific criteria, such as format validation, content checks, or metric calculations (a minimal sketch follows this list).
LLM-Assisted Evaluators: Leverage language models to perform qualitative assessments, such as checking for coherence, relevance, or alignment with requirements (also sketched below).
Domain Expert (Human) Evaluators: Enable subject matter experts to provide direct feedback and assessments through the HoneyHive platform.
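For instance, a Python code evaluator can be an ordinary function that inspects an application’s inputs and outputs and returns a metric. The sketch below validates that a response is well-formed JSON with an expected schema; the function signature and the `required_keys` schema are illustrative assumptions rather than a prescribed HoneyHive interface.

```python
# Illustrative Python code evaluator: checks that a model response is valid
# JSON and contains the keys the application expects. The function name,
# signature, and required_keys schema are assumptions for this sketch; adapt
# them to however your evaluation harness passes inputs and outputs.
import json

def json_format_evaluator(outputs: dict, inputs: dict | None = None) -> dict:
    """Return a binary pass/fail metric plus a short explanation."""
    required_keys = {"answer", "sources"}  # assumed response schema
    try:
        parsed = json.loads(outputs.get("response", ""))
    except (TypeError, json.JSONDecodeError):
        return {"score": 0, "reason": "response is not valid JSON"}
    if not isinstance(parsed, dict):
        return {"score": 0, "reason": "response is not a JSON object"}
    missing = required_keys - parsed.keys()
    if missing:
        return {"score": 0, "reason": f"missing keys: {sorted(missing)}"}
    return {"score": 1, "reason": "response matches the expected schema"}
```

An LLM-assisted evaluator follows the same shape but delegates the judgment to a model. The sketch below uses the OpenAI client as the judge; the model name, prompt, and 1–5 scoring scale are assumptions, and any LLM client could play this role.

```python
# Illustrative LLM-assisted evaluator: asks a judge model to rate relevance on
# a 1-5 scale. The judge model and prompt are assumptions for this sketch.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def relevance_evaluator(question: str, answer: str) -> dict:
    """Return a 1-5 relevance score produced by a judge model."""
    judgment = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Rate from 1 to 5 how relevant the answer is to the question. "
                "Reply with a single digit.\n"
                f"Question: {question}\nAnswer: {answer}"
            ),
        }],
    )
    text = judgment.choices[0].message.content.strip()
    return {"relevance": int(text) if text.isdigit() else None}
```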
Evaluators can be run either locally (client-side) or remotely (server-side), each with its own set of advantages and use cases.
Comparison of Client-side and Server-side Evaluators
Client-Side Execution: Evaluators run locally within your application environment, providing immediate feedback and integration with your existing infrastructure.
Pros:
Quick validations and guardrails
Offline experiments and CI/CD pipelines
Real-time format checks and PII detection
Cons:
Limited by local resources and lacking centralized management.
Client-side evaluators are useful in a range of scenarios, from quick checks during experiments to inline production guardrails (a minimal guardrail sketch follows below). Refer to Client-side Evaluators to see how to use client-side evaluators in both tracing and experiments scenarios.
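For example, the “real-time format checks and PII detection” use case above can be as simple as a regular-expression check that runs inline with your application. The sketch below is purely illustrative: the patterns and helper names are assumptions, not part of the HoneyHive SDK, and production-grade PII detection usually requires more robust tooling.

```python
# Illustrative client-side guardrail: a regex-based PII check that runs inline
# before a response is returned to the user. The patterns and helper names are
# assumptions for this sketch.
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def contains_pii(text: str) -> bool:
    """Return True if the text appears to contain an email address or SSN."""
    return bool(EMAIL_RE.search(text) or SSN_RE.search(text))

def guard_response(response: str) -> str:
    """Withhold responses that appear to leak PII before they reach the user."""
    if contains_pii(response):
        return "[response withheld: possible PII detected]"
    return response
```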
HoneyHive provides flexible granularity in evaluation, allowing you to:
Assess entire end-to-end pipelines
Evaluate individual steps within your application flow
Monitor specific components such as model calls, tool usage, or chain execution
Track and evaluate sessions that group multiple operations together
Consider a scenario where you have a multi-step pipeline consisting of (a) a document retrieval step and (b) a response generation step. Using evaluators, you can define overall metrics that apply to the entire session through the enrich_session method.
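The sketch below illustrates this pattern. It assumes the HoneyHive Python SDK exposes HoneyHiveTracer.init and an enrich_session helper that accepts a metrics dictionary; the retrieve_documents and generate_response functions are hypothetical stand-ins for your own pipeline steps, and the exact signatures should be confirmed against the SDK reference.

```python
# Illustrative sketch of session-level enrichment for a two-step RAG pipeline.
# The import path, init arguments, and the metrics= keyword are assumptions
# based on typical HoneyHive SDK usage; check the SDK reference for specifics.
from honeyhive import HoneyHiveTracer, enrich_session

HoneyHiveTracer.init(
    api_key="YOUR_HONEYHIVE_API_KEY",  # placeholder credentials
    project="rag-pipeline",
    session_name="retrieval-qa-run",
)

def retrieve_documents(query: str) -> list[str]:
    """Hypothetical stand-in for step (a): document retrieval."""
    return ["doc about " + query]

def generate_response(query: str, docs: list[str]) -> str:
    """Hypothetical stand-in for step (b): response generation."""
    return f"Answer to '{query}' based on {len(docs)} document(s)."

def run_pipeline(query: str) -> str:
    docs = retrieve_documents(query)         # step (a)
    answer = generate_response(query, docs)  # step (b)

    # Attach metrics computed over the whole pipeline so they apply to the
    # session as a whole rather than to any individual step.
    enrich_session(
        metrics={
            "num_retrieved_docs": len(docs),
            "answer_length": len(answer),
        }
    )
    return answer
```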