Evaluators are tests that compute scores or metrics to measure the quality of inputs and outputs for your AI application or specific steps within it. They serve as a crucial component for validating whether your models meet performance criteria and align with domain expertise. Whether you’re fine-tuning prompts, comparing different generative models, or monitoring production systems, evaluators help maintain high standards through systematic testing and measurement.

Key characteristics of HoneyHive evaluators

HoneyHive provides a flexible and comprehensive evaluation framework that can be adapted to various needs and scenarios:

Evaluation Scope

HoneyHive provides flexible granularity in evaluation, allowing you to:

  • Assess entire end-to-end pipelines
  • Evaluate individual steps within your application flow
  • Monitor specific components such as model calls, tool usage, or chain execution
  • Track and evaluate sessions that group multiple operations together
If you want to know how to log client-side evaluations on specific traces and spans, explore our tracing documentation.
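
For instance, a client-side metric can be attached to a span as it is logged. The sketch below assumes the tracer initialization and span-enrichment helpers described in the tracing documentation; exact names and parameters may differ in your SDK version, so treat it as illustrative rather than authoritative:

from honeyhive import HoneyHiveTracer, trace, enrich_span

# Assumed initialization call from the tracing docs; replace with your own project details.
HoneyHiveTracer.init(api_key="YOUR_HONEYHIVE_API_KEY", project="YOUR_PROJECT")

@trace
def answer_question(question: str) -> str:
    answer = "..."  # call your model here
    # Attach a client-side evaluation result to this span at logging time.
    enrich_span(metrics={"answer_length": len(answer)})
    return answer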

Development Stages

  • Offline Evaluation: Used during development and testing phases, including CI/CD pipelines and debugging sessions, where immediate results aren’t critical. In this stage, you can build test suites composed of carefully curated examples of the scenarios you wish to test, with or without ground truths.
  • Online Evaluation: Applied to production systems for real-time monitoring and quality assessment of live applications, enabling real-time quality monitoring, continuous validation of model outputs, and production guardrails and safety checks.
For an example of an offline evaluation with client-side evaluators, see how to run an experiment here.
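
As a rough sketch of that offline flow, the snippet below pairs a small curated dataset with a client-side evaluator and an experiment runner like the one covered in the linked guide. The `evaluate` entry point, its parameter names, and the dataset shape are assumptions here; consult the guide for the authoritative signature:

from honeyhive import evaluate, evaluator

@evaluator()
def exact_match(outputs, inputs, ground_truths):
    # Pass/fail check against the curated ground truth for this example.
    return outputs.strip().lower() == ground_truths.strip().lower()

def my_app(inputs, ground_truths):
    # The application logic under test; stubbed out for illustration.
    return "Paris"

# Assumed experiment-runner call; see the linked guide for how to pass
# your API key and project configuration.
evaluate(
    function=my_app,
    dataset=[{"inputs": {"question": "What is the capital of France?"},
              "ground_truths": "Paris"}],
    evaluators=[exact_match],
    name="offline-smoke-test",
)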

Implementation Methods

Evaluators can be implemented using three primary methods:

  • Python Code Evaluators: Custom functions that programmatically assess outputs based on specific criteria, such as format validation, content checks, or metric calculations.
  • LLM-Assisted Evaluators: Leverage language models to perform qualitative assessments, such as checking for coherence, relevance, or alignment with requirements.
  • Domain Expert (Human) Evaluators: Enable subject matter experts to provide direct feedback and assessments through the HoneyHive platform.
If you want to know more about how to set up server-side Python, LLM, or Human-based evaluators, please refer to the Python Evaluator, LLM Evaluator, and Human Annotation pages.

Execution Environment

  • Client-Side Execution: Evaluators run locally within your application environment, providing immediate feedback and integration with your existing infrastructure.
  • Server-Side Execution: Evaluators operate remotely on HoneyHive’s infrastructure, post-ingestion, offering scalability, centralized management, and versioning without impacting your application’s performance.
If you want to know more about the differences and pros/cons of client-side and server-side evaluators, refer to the Client-side vs Server-side Evaluators page.

Examples of HoneyHive Evaluators

Let’s explore some practical examples of evaluators and how they align with different implementation approaches and use cases.

Example 1: Simple Regex validation

import re
from honeyhive import evaluator

@evaluator()
def ssn_evaluator(outputs):
    """
    Detects potential Social Security Numbers in text.
    Looks for the pattern XXX-XX-XXXX where X is a digit.
    """
    ssn_pattern = r'(?!000|666|9\d{2})\d{3}-(?!00)\d{2}-(?!0000)\d{4}'
    return bool(re.search(ssn_pattern, outputs))

This simple regex-based evaluator demonstrates a lightweight validation that checks text for the presence of Social Security Numbers. Because it executes quickly and provides immediate feedback, it is ideal to implement as a client-side evaluator, and it can be used effectively in both offline testing scenarios and online production environments where real-time PII detection is crucial.
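
For a quick local sanity check, you can exercise the pattern on a couple of strings (assuming the @evaluator() decorator leaves the function directly callable; otherwise test the regex on its own):

# Hypothetical local checks, run outside HoneyHive:
print(ssn_evaluator("Customer SSN: 123-45-6789"))   # True  -> PII detected
print(ssn_evaluator("Order #555-12-0000 shipped"))  # False -> last group 0000 is excluded by the lookahead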

Example 2: Similarity Scoring with Ground Truth

@evaluator()
def sample_evaluator(outputs, inputs, ground_truths):
    """
    Calculate simple word overlap (Jaccard similarity).
    """
    output_words = set(outputs.lower().split())
    truth_words = set(ground_truths.lower().split())

    intersection = len(output_words & truth_words)
    union = len(output_words | truth_words)

    return intersection / union if union > 0 else 0.0

This Jaccard similarity evaluator computes word overlap between model outputs and ground truth references. Though computationally lightweight and suitable for client-side execution, its requirement for ground truth data can limit its use in certain production scenarios where such reference data might not be readily available.
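
To make the arithmetic concrete, consider a direct call with two short strings (again assuming the decorated function remains directly callable for local testing):

score = sample_evaluator(
    outputs="the cat sat on the mat",
    inputs=None,  # unused by this metric
    ground_truths="the cat sat on a mat",
)
# Unique output words: {the, cat, sat, on, mat} -> 5
# Unique truth words:  {the, cat, sat, on, a, mat} -> 6
# Intersection = 5, union = 6, so score = 5 / 6 ≈ 0.83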

Example 3: LLM-as-a-judge evaluation

[Instruction]
Please act as an impartial judge and evaluate the quality of the answer provided by an AI assistant based on the context provided below. Your evaluation should consider the mentioned criteria. Begin your evaluation by providing a short explanation on how the answer from the AI assistant performs relative to the provided context. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 5 by strictly following this format: "[[rating]]", for example: "Rating: [[5]]".

[Criteria]
The answer generated by the AI assistant should be faithful to the provided context and should not include information that isn't supported by the context.

[The Start of Provided Context]
{{ inputs.context }}
[The End of Provided Context]

[The Start of AI Assistant's Answer]
{{ outputs.content }}
[The End of AI Assistant's Answer]

[Evaluation With Rating]

This example showcases a prompt template for LLM-assisted evaluation, where another language model acts as a judge to assess the faithfulness of AI responses to a given context. Due to its complexity and computational requirements (requiring additional LLM API calls), this type of evaluator is best implemented as a server-side evaluator. It’s particularly valuable for qualitative assessments that require nuanced understanding, such as checking coherence, relevance, or alignment with specific criteria.
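
The strict "Rating: [[N]]" format in the prompt exists so the score can be extracted programmatically from the judge's free-text response. A minimal parsing sketch (the parse_rating helper is hypothetical, not part of the HoneyHive SDK):

import re

def parse_rating(judge_response: str):
    """Extract the 1-5 rating from a response formatted as 'Rating: [[N]]'."""
    match = re.search(r"\[\[([1-5])\]\]", judge_response)
    return int(match.group(1)) if match else None

parse_rating("The answer stays within the context. Rating: [[4]]")  # -> 4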

When deploying in production environments, careful consideration should be given to sampling rates to manage costs and computational load while maintaining statistically significant evaluation coverage.

Example 4: External API Model Evaluation

import requests

def sentiment_evaluator(outputs):
    """
    Evaluates text sentiment using Hugging Face's public sentiment analysis model.
    Returns a score between 0 and 1 indicating positive sentiment.
    """
    API_URL = "https://api-inference.huggingface.co/models/distilbert-base-uncased-finetuned-sst-2-english"

    try:
        response = requests.post(
            API_URL,
            headers={
                "Content-Type": "application/json"
            },
            json={"inputs": outputs},
            timeout=5
        )

        if response.status_code == 200:
            result = response.json()
            # Extract positive sentiment score
            for label_data in result[0]:
                if label_data['label'] == 'POSITIVE':
                    return label_data['score']
            return 0.0
        else:
            print(f"API call failed with status code: {response.status_code}")
            return None

    except requests.exceptions.RequestException as e:
        print(f"Request failed: {str(e)}")
        return None

This evaluator demonstrates integration with external API services, specifically using Hugging Face’s Inference API for sentiment analysis. Due to potential I/O blocking and latency considerations, this type of evaluator is best implemented as a server-side evaluator with appropriate error handling and timeout mechanisms. It’s particularly suitable for non-critical evaluation scenarios where metrics are generated post-ingestion for enrichment and debugging purposes.
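
One practical caveat: Hugging Face's hosted Inference API generally expects a bearer token, and anonymous requests may be rate-limited or rejected. In practice you would likely extend the request headers along these lines (the HF_TOKEN environment variable is illustrative):

import os

headers = {
    "Content-Type": "application/json",
    # Token-based auth for the hosted Inference API; anonymous calls may be throttled.
    "Authorization": f"Bearer {os.environ.get('HF_TOKEN', '')}",
}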