Composite evaluators aggregate results from multiple Python, LLM, and Human evaluators into a single score. Use them to create holistic quality metrics that combine different evaluation criteria.

When to use composite evaluators:
  • Combining multiple quality dimensions into one score
  • Creating weighted quality indexes (e.g., accuracy + helpfulness + safety)
  • Building hierarchical pass/fail criteria (must pass A before B matters)
  • Tracking worst-case or best-case performance across evaluators

Creating a Composite Evaluator

  1. Navigate to the Evaluators tab
  2. Click Add Evaluator and select Composite Evaluator
  3. Configure the aggregate function and select child evaluators
Child evaluators: Only evaluators with numeric, boolean, or categorical return types can be added. String evaluators and other composites are excluded.

Composite return type: Composites can only return Numeric or Boolean. When set to Boolean, Weighted Average and Weighted Sum are disabled.
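The pieces configured in the steps above can be pictured as a single structure. The sketch below is purely illustrative — the field names are hypothetical, not the platform's actual schema:

```python
# Hypothetical shape of a composite evaluator's configuration.
# Field names are illustrative only, not the platform's real schema.
composite_config = {
    "name": "Overall Quality",
    "aggregate_function": "weighted_average",  # one of the functions described below
    "return_type": "numeric",                  # Numeric or Boolean
    "event_filters": {"event_type": None, "event_name": None},
    "children": [
        # Child evaluators must return numeric, boolean, or categorical values;
        # string evaluators and other composites cannot be added.
        {"evaluator": "Accuracy", "weight": 2},
        {"evaluator": "Clarity", "weight": 1},
    ],
}
```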

Configuration

Event Filters

Filter which events this composite evaluates by Event Type and Event Name. The composite only aggregates child evaluator results from matching events.

Aggregate Function

| Function | Use Case | Ignores Weights |
| --- | --- | --- |
| Weighted Average | Balanced overall score | No |
| Weighted Sum | Cumulative importance | No |
| Hierarchical Highest True | Sequential pass/fail criteria | No (weights used as priority) |
| Minimum | Worst-case performance | Yes |
| Maximum | Best-case performance | Yes |

Child Evaluators

Select evaluators to include and set their weights. Browse by type: Python, LLM, or Human.

Aggregate Functions

Weighted Average

Calculates Σ(score × weight) / Σ(weights).
| Evaluator | Weight | Score | Contribution |
| --- | --- | --- | --- |
| Accuracy | 2 | 4 | 8 |
| Clarity | 1 | 3 | 3 |

Result: (8 + 3) / 3 = 3.67
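The calculation above can be sketched in a few lines of Python. The function name and the (score, weight) input format are illustrative, not the platform's API:

```python
def weighted_average(results):
    """Aggregate child scores as sum(score * weight) / sum(weights).

    `results` is a hypothetical list of (score, weight) pairs; the
    platform derives these from the configured child evaluators.
    """
    weighted_total = sum(score * weight for score, weight in results)
    weight_total = sum(weight for _, weight in results)
    return weighted_total / weight_total

# Accuracy: score 4, weight 2; Clarity: score 3, weight 1
print(round(weighted_average([(4, 2), (3, 1)]), 2))  # 3.67
```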

Weighted Sum

Calculates Σ(score × weight).
| Evaluator | Weight | Score | Contribution |
| --- | --- | --- | --- |
| Accuracy | 2 | 4 | 8 |
| Clarity | 1 | 3 | 3 |

Result: 8 + 3 = 11
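Weighted Sum is the same computation without the division. A minimal sketch, using the same illustrative (score, weight) input format as above:

```python
def weighted_sum(results):
    # Accumulate score * weight over all child evaluators -- no normalization.
    return sum(score * weight for score, weight in results)

# Accuracy: score 4, weight 2; Clarity: score 3, weight 1
print(weighted_sum([(4, 2), (3, 1)]))  # 11
```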

Hierarchical Highest True

For boolean evaluators only. Returns the priority level reached by the longest unbroken chain of true results, starting from priority 1. Useful for tiered pass/fail criteria where earlier checks must pass before later ones matter.
| Evaluator | Priority (Weight) | Result |
| --- | --- | --- |
| No PII | 1 | ✓ True |
| Factually Correct | 2 | ✓ True |
| Follows Guidelines | 3 | ✗ False |
| Has Citations | 4 | ✓ True |

Result: 2 (priorities 1-2 passed consecutively, priority 3 failed, so the chain breaks at 2)
Use for tiered quality gates: basic safety checks at priority 1, correctness at 2, style at 3. The score tells you how far the response got before failing.
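The chain-breaking behavior can be sketched as follows. The function name and the (priority, passed) input format are illustrative assumptions, not the platform's API:

```python
def hierarchical_highest_true(results):
    """Return the highest priority reached by an unbroken chain of True
    results, starting at priority 1.

    `results` is a hypothetical list of (priority, passed) pairs; the
    platform derives priorities from the child evaluators' weights.
    """
    level = 0
    for priority, passed in sorted(results):
        if priority == level + 1 and passed:
            level = priority
        else:
            break  # first gap or False result ends the chain
    return level

checks = [(1, True), (2, True), (3, False), (4, True)]
print(hierarchical_highest_true(checks))  # 2 -- chain breaks at priority 3
```

Note that "Has Citations" passing at priority 4 does not raise the score: once priority 3 fails, later results are ignored.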

Minimum / Maximum

Returns the lowest or highest score among all child evaluators. Weights are ignored.
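These two reduce to Python's built-in min and max over the child scores; weights play no part. A minimal sketch with illustrative function names:

```python
def aggregate_min(scores):
    # Worst-case performance across child evaluators; weights ignored.
    return min(scores)

def aggregate_max(scores):
    # Best-case performance across child evaluators; weights ignored.
    return max(scores)

scores = [4, 3, 9]
print(aggregate_min(scores), aggregate_max(scores))  # 3 9
```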