Composite evaluators aggregate results from multiple Python, LLM, and Human evaluators into a single score. Use them to create holistic quality metrics that combine different evaluation criteria.
When to use composite evaluators:
- Combining multiple quality dimensions into one score
- Creating weighted quality indexes (e.g., accuracy + helpfulness + safety)
- Building hierarchical pass/fail criteria (must pass A before B matters)
- Tracking worst-case or best-case performance across evaluators
Creating a Composite Evaluator
- Navigate to the Evaluators tab
- Click Add Evaluator and select Composite Evaluator
- Configure the aggregate function and select child evaluators
Child evaluators: Only evaluators with numeric, boolean, or categorical return types can be added. String evaluators and other composites are excluded.
Composite return type: Composites can only return Numeric or Boolean. When set to Boolean, Weighted Average and Weighted Sum are disabled.
Configuration
Event Filters
Filter which events this composite evaluates by Event Type and Event Name. The composite only aggregates child evaluator results from matching events.
Aggregate Function
| Function | Use Case | Ignores Weights |
|---|---|---|
| Weighted Average | Balanced overall score | No |
| Weighted Sum | Cumulative importance | No |
| Hierarchical Highest True | Sequential pass/fail criteria | No (uses as priority) |
| Minimum | Worst-case performance | Yes |
| Maximum | Best-case performance | Yes |
Child Evaluators
Select evaluators to include and set their weights. Browse by type: Python, LLM, or Human.
Aggregate Functions
Weighted Average
Calculates Σ(score × weight) / Σ(weights).
| Evaluator | Weight | Score | Contribution |
|---|---|---|---|
| Accuracy | 2 | 4 | 8 |
| Clarity | 1 | 3 | 3 |
| Result | | | (8 + 3) / 3 = 3.67 |
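The calculation above can be sketched in Python (the function name and the `(score, weight)` pair format are illustrative, not part of the product's API):

```python
def weighted_average(results):
    """results: list of (score, weight) pairs from child evaluators."""
    total = sum(score * weight for score, weight in results)
    weight_sum = sum(weight for _, weight in results)
    return total / weight_sum

# The table above: Accuracy (score 4, weight 2) and Clarity (score 3, weight 1)
print(round(weighted_average([(4, 2), (3, 1)]), 2))  # → 3.67
```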
Weighted Sum
Calculates Σ(score × weight).
| Evaluator | Weight | Score | Contribution |
|---|---|---|---|
| Accuracy | 2 | 4 | 8 |
| Clarity | 1 | 3 | 3 |
| Result | | | 8 + 3 = 11 |
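A minimal sketch of the same example with Weighted Sum (illustrative names, same `(score, weight)` format as above):

```python
def weighted_sum(results):
    """results: list of (score, weight) pairs from child evaluators."""
    return sum(score * weight for score, weight in results)

# Accuracy contributes 4 × 2 = 8, Clarity contributes 3 × 1 = 3
print(weighted_sum([(4, 2), (3, 1)]))  # → 11
```

Unlike Weighted Average, the result is not normalized, so adding more child evaluators raises the ceiling of the score.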
Hierarchical Highest True
For boolean evaluators only. Returns the length of the unbroken run of true results starting at priority 1. Useful for tiered pass/fail criteria where earlier checks must pass before later ones matter.
| Evaluator | Priority (Weight) | Result |
|---|---|---|
| No PII | 1 | ✓ True |
| Factually Correct | 2 | ✓ True |
| Follows Guidelines | 3 | ✗ False |
| Has Citations | 4 | ✓ True |
Result: 2 (Priorities 1-2 passed consecutively, priority 3 failed, so the chain breaks at 2)
Use for tiered quality gates: basic safety checks at priority 1, correctness at 2, style at 3. The score tells you how far the response got before failing.
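The chain-breaking logic above amounts to counting the leading run of true results. A sketch, assuming child results arrive ordered by priority (function name is illustrative):

```python
def hierarchical_highest_true(results):
    """results: list of booleans ordered by priority (priority 1 first).
    Returns the highest priority reached before the first failure."""
    level = 0
    for passed in results:
        if not passed:
            break  # chain breaks at the first False
        level += 1
    return level

# Table above: No PII ✓, Factually Correct ✓, Follows Guidelines ✗, Has Citations ✓
print(hierarchical_highest_true([True, True, False, True]))  # → 2
```

Note that the later True at priority 4 does not count: once the chain breaks, higher priorities are ignored.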
Minimum / Maximum
Returns the lowest or highest score among all child evaluators. Weights are ignored.
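These two reduce to Python's built-ins over the child scores; the weight values play no role (a sketch, with an illustrative score list):

```python
# Child evaluator scores; any configured weights are ignored
scores = [4, 3, 5]

print(min(scores))  # Minimum: worst-case performance → 3
print(max(scores))  # Maximum: best-case performance → 5
```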