HoneyHive provides a list of server-side evaluator templates that will help you get started with setting up evaluators for some of the most commonly used metrics for LLM applications.

In this document, we will cover how to properly set up tracing in your application to ensure the required information is captured in the expected format for server-side evaluators. Additionally, we will provide a detailed list of Python and LLM evaluator templates, complete with code examples and descriptions for each, to help you implement and customize them for your specific use case.

These templates provide ready-to-use examples. For detailed instructions on creating custom evaluators from scratch, see the Python Evaluators and LLM Evaluators documentation.

Setting Up Tracing for Server-side Evaluators

Server-side evaluators operate on event objects, so when instrumenting your application for sending traces to HoneyHive, you need to ensure the correct event properties are being captured and traced.

For example, suppose you want to set up a Python evaluator that requires both the model’s response and a provided ground truth, as well as an LLM evaluator that requires the model’s response and a provided context. In this case, you can wrap your model call within a function and enrich the event object with the necessary properties:

from honeyhive import enrich_span, trace
from openai import OpenAI

openai_client = OpenAI()

@trace
def generate_response(prompt, ground_truth, context):
    completion = openai_client.chat.completions.create(
        model="o3-mini",
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    # Attach the ground truth and retrieved context to the event so that
    # server-side evaluators can access them later
    enrich_span(feedback={"ground_truth": ground_truth},
                inputs={"context": context})

    return completion.choices[0].message.content

The traced function is automatically mapped to a chain event because it groups a model event within it. The chain event is named after the traced function.

When setting up an evaluator in HoneyHive for the example above, follow these steps:

  1. Select Filters
    • event type: chain
    • event name: generate_response
  2. Access properties
    • For Python Evaluators:
      • Access output content with event["outputs"]["result"]
      • Access ground truth with event["feedback"]["ground_truth"]
      • Access context with event["inputs"]["context"]
    • For LLM Evaluators:
      • Access output content with {{ outputs.result }}
      • Access ground truth with {{ feedback.ground_truth }}
      • Access context with {{ inputs.context }}

For instance, creating a custom Python evaluator that uses the output from the response along with the provided ground truth would look like this:
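
A minimal sketch, assuming the evaluator is written as a single Python function that receives the event object and returns a score; the exact-match check is purely illustrative:

def evaluator(event):
    output = event["outputs"]["result"]
    ground_truth = event["feedback"]["ground_truth"]
    # Return 1 for a normalized exact match, 0 otherwise
    return 1 if output.strip().lower() == ground_truth.strip().lower() else 0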

Similarly, creating a custom LLM evaluator that uses the response's output in combination with the provided context would look like this:
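
A minimal sketch of the evaluator prompt, using the template variables described above; the wording and the 1-5 scale are illustrative:

You are evaluating whether the response is supported by the provided context.

Context:
{{ inputs.context }}

Response:
{{ outputs.result }}

Rate how well the response is grounded in the context on a scale from 1 to 5,
where 5 means every claim in the response is supported by the context.
Respond with only the number.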

Python Evaluators

Remember to adjust the event attributes in the code to align with your setup, as demonstrated in the tracing section above.

Response length

Measures response verbosity by counting words. Useful for controlling output length and monitoring response size.
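
A minimal sketch of this evaluator, counting whitespace-separated words:

def evaluator(event):
    output = event["outputs"]["result"]
    # Count whitespace-separated words in the model output
    return len(output.split())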

Semantic Similarity

Measures semantic similarity between model output and ground truth using OpenAI embedding models.
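
A sketch assuming the OpenAI Python SDK and an API key are available in the evaluator environment; the embedding model choice is illustrative:

from openai import OpenAI

client = OpenAI()

def evaluator(event):
    output = event["outputs"]["result"]
    ground_truth = event["feedback"]["ground_truth"]
    # Embed both strings with an OpenAI embedding model
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=[output, ground_truth]
    )
    a = response.data[0].embedding
    b = response.data[1].embedding
    # Cosine similarity between the two embeddings
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)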

Levenshtein Distance

Calculates normalized Levenshtein distance between model output and ground truth. Returns a score between 0 and 1, where 1 indicates perfect match.
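
A self-contained sketch using a standard dynamic-programming edit distance, normalized by the longer string's length:

def evaluator(event):
    output = event["outputs"]["result"]
    ground_truth = event["feedback"]["ground_truth"]

    # Standard dynamic-programming Levenshtein distance (single-row variant)
    m, n = len(output), len(ground_truth)
    row = list(range(n + 1))
    for i in range(1, m + 1):
        prev, row[0] = row[0], i
        for j in range(1, n + 1):
            current = row[j]
            cost = 0 if output[i - 1] == ground_truth[j - 1] else 1
            row[j] = min(row[j] + 1,        # deletion
                         row[j - 1] + 1,    # insertion
                         prev + cost)       # substitution
            prev = current
    distance = row[n]

    # Normalize so that 1 means a perfect match and 0 means completely different
    max_len = max(m, n) or 1
    return 1 - distance / max_len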

ROUGE-L

Calculates ROUGE-L (Longest Common Subsequence) F1 score between generated and reference texts. Scores range from 0 to 1, with higher values indicating better alignment.
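
A self-contained sketch that computes the longest common subsequence over whitespace tokens and the corresponding F1 score:

def evaluator(event):
    candidate = event["outputs"]["result"].split()
    reference = event["feedback"]["ground_truth"].split()

    # Longest common subsequence length via dynamic programming
    m, n = len(candidate), len(reference)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if candidate[i - 1] == reference[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[m][n]

    if lcs == 0:
        return 0.0
    precision = lcs / m
    recall = lcs / n
    return 2 * precision * recall / (precision + recall)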

BLEU

Calculates BLEU score, measuring translation quality by comparing n-gram overlap between system output and reference text.
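
A sketch assuming NLTK is available in the evaluator environment; smoothing is applied so short outputs do not collapse to zero:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def evaluator(event):
    candidate = event["outputs"]["result"].split()
    reference = event["feedback"]["ground_truth"].split()
    # Smoothing avoids zero scores when higher-order n-grams have no overlap
    smoothing = SmoothingFunction().method1
    return sentence_bleu([reference], candidate, smoothing_function=smoothing)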

JSON Schema Validation

Validates JSON output against a predefined schema. Ideal for ensuring consistent API responses or structured data output.
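
A sketch assuming the jsonschema package is available; the schema itself is a placeholder to replace with your expected structure:

import json
from jsonschema import validate, ValidationError

# Placeholder schema; replace with the structure your application expects
SCHEMA = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "sources": {"type": "array", "items": {"type": "string"}}
    },
    "required": ["answer"]
}

def evaluator(event):
    try:
        payload = json.loads(event["outputs"]["result"])
        validate(instance=payload, schema=SCHEMA)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False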

SQL Parse Check

Validates SQL syntax using SQLGlot parser. Essential for database query generation and SQL-related applications.
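
A sketch assuming the sqlglot package is available in the evaluator environment:

import sqlglot
from sqlglot.errors import ParseError

def evaluator(event):
    query = event["outputs"]["result"]
    try:
        # Parsing succeeds only if the query is syntactically valid
        sqlglot.parse_one(query)
        return True
    except ParseError:
        return False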

Flesch Reading Ease

Calculates text readability score. Higher scores (0-100) indicate easier reading. Useful for ensuring content accessibility.
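
A sketch assuming the textstat package is available:

import textstat

def evaluator(event):
    output = event["outputs"]["result"]
    # Flesch Reading Ease: roughly 0-100, higher means easier to read
    return textstat.flesch_reading_ease(output)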

JSON Key Coverage

Analyzes completeness of JSON array outputs by checking for required fields. Returns count of missing fields.
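
A sketch in which the required field list is a placeholder; it expects the output to parse to a JSON array of objects:

import json

# Placeholder list of fields each object in the output array should contain
REQUIRED_FIELDS = ["id", "title", "summary"]

def evaluator(event):
    try:
        items = json.loads(event["outputs"]["result"])
    except json.JSONDecodeError:
        # Unparseable output: report every required field as missing
        return len(REQUIRED_FIELDS)

    missing = 0
    for item in items:
        for field in REQUIRED_FIELDS:
            if field not in item:
                missing += 1
    return missing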

Tokens per Second

Calculates token generation speed. Useful for performance monitoring and optimization.
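
A sketch only: the field names used for token counts and latency below are assumptions that depend on how your traces are instrumented, so inspect a traced event to confirm them:

def evaluator(event):
    # Field names are assumptions; check a traced event in your project to see
    # where token usage and latency are actually recorded
    completion_tokens = event["metadata"]["completion_tokens"]
    duration_ms = event["duration"]  # assumed to be in milliseconds
    if not duration_ms:
        return 0.0
    return completion_tokens / (duration_ms / 1000)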

Keywords Assertion

Checks for presence of required keywords in output. Useful for ensuring coverage of specific topics or requirements.
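
A minimal sketch with a placeholder keyword list:

# Placeholder keywords; replace with the terms the response must mention
REQUIRED_KEYWORDS = ["refund", "return policy"]

def evaluator(event):
    output = event["outputs"]["result"].lower()
    # True only if every required keyword appears in the output
    return all(keyword in output for keyword in REQUIRED_KEYWORDS)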

OpenAI Moderation Filter

Uses OpenAI Moderation API to check content safety. Returns true if content is flagged for review.
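
A sketch assuming the OpenAI Python SDK and an API key are available in the evaluator environment; the moderation model name is illustrative:

from openai import OpenAI

client = OpenAI()

def evaluator(event):
    output = event["outputs"]["result"]
    response = client.moderations.create(
        model="omni-moderation-latest",
        input=output
    )
    # True if any moderation category flagged the content
    return response.results[0].flagged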

External API Example

Template for external API integration. Demonstrates proper error handling and response processing.
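
A sketch using the requests package against a hypothetical scoring endpoint; replace the URL and payload with your own service:

import requests

def evaluator(event):
    output = event["outputs"]["result"]
    try:
        # Hypothetical scoring endpoint; replace with your own service
        response = requests.post(
            "https://api.example.com/score",
            json={"text": output},
            timeout=10,
        )
        response.raise_for_status()
        return response.json().get("score")
    except requests.RequestException:
        # Surface failures as None instead of raising inside the evaluator
        return None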

LLM Evaluator Templates

Remember to adjust the event attributes in the code to align with your setup, as demonstrated in the tracing section above.

Answer Faithfulness

Evaluates if the answer is faithful to the provided context in RAG systems

Answer Relevance

Evaluates if the answer is relevant to the user query
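
A sketch of a possible evaluator prompt; it assumes the user query is captured as an input on the traced event (for the tracing example above, {{ inputs.prompt }}), which depends on your instrumentation:

You are evaluating whether the response answers the user's query.

Query:
{{ inputs.prompt }}

Response:
{{ outputs.result }}

Rate the relevance of the response to the query on a scale from 1 to 5,
where 5 means the response directly and completely addresses the query.
Respond with only the number.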

Context Relevance

Evaluates if the retrieved context is relevant to the user query in RAG systems

Format Adherence

Evaluates if the response follows the required format and structure

Tool Usage

Evaluates if the AI assistant uses the correct tools appropriately

Intent Identification

Evaluates if the AI correctly identifies and addresses the user intent

Toxicity

Evaluates the response for harmful, toxic, or inappropriate content

Coherence

Evaluates if the response is logically structured and well-organized