HoneyHive Docs

Composite evaluators aggregate results from multiple Python, LLM, and Human evaluators into a single score. Use them to create holistic quality metrics that combine different evaluation criteria.

When to use composite evaluators:

Combining multiple quality dimensions into one score
Creating weighted quality indexes (e.g., accuracy + helpfulness + safety)
Building hierarchical pass/fail criteria (must pass A before B matters)
Tracking worst-case or best-case performance across evaluators

Creating a Composite Evaluator

Navigate to the Evaluators tab
Click Add Evaluator and select Composite Evaluator
Configure the aggregate function and select child evaluators

Child evaluators: Only evaluators with numeric, boolean, or categorical return types can be added. String evaluators and other composites are excluded.

Composite return type: Composites can only return Numeric or Boolean. When set to Boolean, Weighted Average and Weighted Sum are disabled.
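The eligibility rules above can be sketched as a small validation check. This is an illustrative sketch only, not HoneyHive's implementation or API: the function name `validate_composite`, the lowercase type strings, and the aggregate function identifiers are all hypothetical.

```python
# Hypothetical names throughout; this only mirrors the rules stated in the text.
ALLOWED_CHILD_TYPES = {"numeric", "boolean", "categorical"}
ALLOWED_COMPOSITE_TYPES = {"numeric", "boolean"}
DISABLED_FOR_BOOLEAN = {"weighted_average", "weighted_sum"}


def validate_composite(return_type, aggregate_function, child_return_types):
    """Return a list of rule violations (an empty list means the config is valid)."""
    errors = []
    if return_type not in ALLOWED_COMPOSITE_TYPES:
        errors.append(f"composite return type '{return_type}' is not allowed")
    if return_type == "boolean" and aggregate_function in DISABLED_FOR_BOOLEAN:
        errors.append(f"'{aggregate_function}' is disabled for boolean composites")
    for child_type in child_return_types:
        # String evaluators and nested composites are excluded as children.
        if child_type not in ALLOWED_CHILD_TYPES:
            errors.append(f"child return type '{child_type}' cannot be added")
    return errors


print(validate_composite("numeric", "weighted_average", ["numeric", "boolean"]))
print(validate_composite("boolean", "weighted_sum", ["string"]))
```

The first call prints an empty list; the second reports two violations, one for the disabled aggregate function and one for the string child.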

Configuration

Event Filters

Filter which events this composite evaluates using event type, event name, and additional property filters. The composite only aggregates child evaluator results from matching events. See Event Filters for the full list of supported filter options and operators.

Aggregate Function

Function	Use Case	Ignores Weights
Weighted Average	Balanced overall score	No
Weighted Sum	Cumulative importance	No
Hierarchical Highest True	Sequential pass/fail criteria	No (uses as priority)
Minimum	Worst-case performance	Yes
Maximum	Best-case performance	Yes

Child Evaluators

Select evaluators to include and set their weights. Browse by type: Python, LLM, or Human.

Aggregate Functions

Weighted Average

Calculates Σ(score × weight) / Σ(weights).

Evaluator	Weight	Score	Contribution
Accuracy	2	4	8
Clarity	1	3	3
Result			(8 + 3) / 3 = 3.67
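The calculation in the table can be reproduced with a few lines of Python. This is a minimal sketch of the formula only; the function name `weighted_average` and the `(score, weight)` pair layout are assumptions made for illustration, not part of the HoneyHive SDK.

```python
def weighted_average(results):
    """Compute sum(score * weight) / sum(weights) over (score, weight) pairs."""
    total_weight = sum(weight for _, weight in results)
    if total_weight == 0:
        # Zero-weight handling is an assumption; the docs do not specify it.
        raise ValueError("weights must not sum to zero")
    return sum(score * weight for score, weight in results) / total_weight


# (score, weight) for Accuracy and Clarity, taken from the table above.
children = [(4, 2), (3, 1)]
print(round(weighted_average(children), 2))  # 3.67
```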

Weighted Sum

Calculates Σ(score × weight).

Evaluator	Weight	Score	Contribution
Accuracy	2	4	8
Clarity	1	3	3
Result			8 + 3 = 11
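The same example expressed as code, as a minimal sketch of the formula. The function name `weighted_sum` and the `(score, weight)` pair layout are hypothetical and chosen for illustration.

```python
def weighted_sum(results):
    """Compute sum(score * weight) over (score, weight) pairs."""
    return sum(score * weight for score, weight in results)


# (score, weight) for Accuracy and Clarity, taken from the table above.
children = [(4, 2), (3, 1)]
print(weighted_sum(children))  # 11
```

Unlike Weighted Average, the result is not normalized by the total weight, so it grows as more child evaluators are added.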

Hierarchical Highest True

For boolean evaluators only. Returns the priority level of the highest consecutive true result, starting from priority 1. Useful for tiered pass/fail criteria where earlier checks must pass before later ones matter.

Evaluator	Priority (Weight)	Result
No PII	1	✓ True
Factually Correct	2	✓ True
Follows Guidelines	3	✗ False
Has Citations	4	✓ True

Result: 2. Priorities 1 and 2 passed consecutively and priority 3 failed, so the chain stops at 2. The pass at priority 4 does not count because it comes after a failure.

Use for tiered quality gates: basic safety checks at priority 1, correctness at 2, style at 3. The score tells you how far the response got before failing.
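The chain logic can be sketched as follows. This is an illustrative reading of the description above, not HoneyHive's implementation: the function name `hierarchical_highest_true` and the priority-to-result mapping are hypothetical, and returning 0 when priority 1 fails is an assumption the docs do not confirm.

```python
def hierarchical_highest_true(results):
    """results maps priority (int) -> bool; returns the last priority in the
    unbroken run of True results, walking priorities in ascending order."""
    highest = 0  # Assumed value when the first check fails.
    for priority in sorted(results):
        if not results[priority]:
            break  # The chain is broken; later passes are ignored.
        highest = priority
    return highest


# No PII, Factually Correct, Follows Guidelines, Has Citations from the table.
checks = {1: True, 2: True, 3: False, 4: True}
print(hierarchical_highest_true(checks))  # 2
```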

Minimum / Maximum

Returns the lowest or highest score among all child evaluators. Weights are ignored.
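A minimal sketch of both functions, reusing the Accuracy and Clarity scores from the earlier examples. The function names and the `(score, weight)` pair layout are hypothetical; weights are accepted but deliberately unused.

```python
def aggregate_minimum(results):
    """Return the lowest score among (score, weight) pairs; weights are ignored."""
    return min(score for score, _ in results)


def aggregate_maximum(results):
    """Return the highest score among (score, weight) pairs; weights are ignored."""
    return max(score for score, _ in results)


# (score, weight) for Accuracy and Clarity.
children = [(4, 2), (3, 1)]
print(aggregate_minimum(children))  # 3
print(aggregate_maximum(children))  # 4
```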

Related

Python Evaluators

Create code-based evaluators

LLM Evaluators

Use AI for qualitative assessment

Human Evaluators

Enable expert review workflows

Evaluators Introduction

Overview of all evaluator types
