Composite evaluators in HoneyHive allow you to combine results from multiple Python, LLM, and Human evaluators into a single comprehensive score. They are particularly useful for complex multi-step pipelines where you want to measure alignment or track progress over time across various evaluation criteria.

Creating a Composite Evaluator

  1. Navigate to the Metrics tab in the HoneyHive console.
  2. Select or create a new composite evaluator (e.g., “RAGComposite”).

Configuration

Event Filters

You can scope the evaluator to a specific event_type and event_name in your pipeline, including the root span (session).

Adding Evaluators

Add individual evaluators to your composite. Select from existing Python, LLM, or Human evaluators.

Aggregate Functions

Select one of the following aggregation methods:

Weighted average

Calculates the average of all component evaluator scores, taking into account their assigned weights.

Formula: Σ(score * weight) / Σ(weights)

Example:

  • Evaluator A (weight 2, score 4)
  • Evaluator B (weight 1, score 3)

Result: (4 * 2 + 3 * 1) / (2 + 1) = 3.67
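The arithmetic can be reproduced in a few lines of Python. This is a minimal illustrative sketch, not part of the HoneyHive SDK; the weighted_average helper is hypothetical:

```python
def weighted_average(scores, weights):
    # Σ(score * weight) / Σ(weights)
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

# Evaluator A (weight 2, score 4), Evaluator B (weight 1, score 3)
print(round(weighted_average([4, 3], [2, 1]), 2))  # 3.67
```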

Weighted sum

Sums the weighted scores of all component evaluators.

Formula: Σ(score * weight)

Example:

  • Evaluator A (weight 2, score 4)
  • Evaluator B (weight 1, score 3)

Result: (4 * 2) + (3 * 1) = 11
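Again as an illustrative sketch (the weighted_sum helper is hypothetical, not a HoneyHive API):

```python
def weighted_sum(scores, weights):
    # Σ(score * weight)
    return sum(s * w for s, w in zip(scores, weights))

# Evaluator A (weight 2, score 4), Evaluator B (weight 1, score 3)
print(weighted_sum([4, 3], [2, 1]))  # 11
```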

Hierarchical Highest True

This function is designed for boolean evaluators with associated priority levels. It walks the evaluators in priority order (rather than the order in which they are listed) and returns the priority of the deepest consecutive “true” result.

Process:

  1. Evaluators are first sorted by their priority (lower number indicates higher priority).
  2. Starting from the highest priority, the function counts consecutive “true” results until it encounters a “false”.
  3. The priority number of the last consecutive “true” result is returned as the score.

Example:

  • Evaluator A (Priority 1, result: True)
  • Evaluator C (Priority 2, result: True)
  • Evaluator B (Priority 3, result: False)
  • Evaluator D (Priority 4, result: True)

Result: 2 (the evaluators at priorities 1 and 2 were consecutively true, but priority 3 was false, so the score is the priority of the last consecutive “true” result, 2; the “true” at priority 4 is ignored because the chain was already broken)

This is particularly useful for evaluating hierarchical criteria where higher priority conditions must be met before considering lower priority ones. It allows for a nuanced assessment of how far down the priority list the evaluation succeeded before encountering a failure.
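A minimal Python sketch of this logic, assuming a score of 0 when the highest-priority evaluator already fails (the helper name is illustrative, not a HoneyHive API):

```python
def hierarchical_highest_true(results):
    # results: (priority, passed) pairs; a lower priority number means higher priority.
    score = 0  # assumed score when the top-priority evaluator is already false
    for priority, passed in sorted(results):  # scan from highest priority downward
        if not passed:
            break  # stop at the first "false"; later results are ignored
        score = priority
    return score

# Example from above: priorities 1 and 2 pass, 3 fails, 4 passes but is ignored.
print(hierarchical_highest_true([(1, True), (2, True), (3, False), (4, True)]))  # 2
```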

Minimum

Returns the minimum score among all component evaluators, regardless of their weights.

Example:

  • Evaluator A (score 4)
  • Evaluator B (score 3)
  • Evaluator C (score 5)

Result: 3

Maximum

Returns the maximum score among all component evaluators, regardless of their weights.

Example:

  • Evaluator A (score 4)
  • Evaluator B (score 3)
  • Evaluator C (score 5)

Result: 5
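Both Minimum and Maximum ignore weights and reduce the component scores directly; in Python terms:

```python
scores = [4, 3, 5]  # Evaluators A, B, C
print(min(scores))  # Minimum aggregation -> 3
print(max(scores))  # Maximum aggregation -> 5
```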

Usage Notes

  • There is no limit to the number of individual evaluators that can be included in a composite evaluator.
  • Weights for each component evaluator are set manually by the user.
  • Composite evaluators can combine results from different types of evaluators (Python, LLM, Human) in a single score.

Best Practices

  1. Choose an appropriate aggregation function based on your evaluation needs:
    • Use Weighted average or Weighted sum for a balanced overall score.
    • Use Hierarchical Highest True for sequential or dependent criteria.
    • Use Minimum or Maximum to focus on worst-case or best-case performance respectively.
  2. Carefully consider the weights assigned to each component evaluator to reflect their relative importance.
  3. When using Hierarchical Highest True, assign priorities to your evaluators based on their criticality to the overall evaluation.
  4. Regularly review and adjust your composite evaluators to ensure they accurately represent your evaluation criteria as your project evolves.
  5. Use composite evaluators to get a holistic view of your system’s performance, but also monitor individual evaluator scores for detailed insights.

By leveraging composite evaluators, you can create nuanced, multi-faceted evaluation metrics that provide a comprehensive view of your AI system’s performance across various dimensions.