LLM evaluators use large language models to evaluate the quality of AI-generated responses based on custom criteria. They’re ideal for qualitative evaluations like coherence, relevance, faithfulness, and tone.

Creating an LLM Evaluator

  1. Navigate to the Evaluators tab in the HoneyHive console.
  2. Click Add Evaluator and select LLM Evaluator.
[Screenshot: LLM evaluator editor showing event filters, AI provider selection (OpenAI/gpt-4o), prompt editor with template syntax, and configuration options]
LLM evaluators use your configured AI provider. Configure keys under Provider Keys to use models from OpenAI, Anthropic, or other providers.

Event Schema

LLM evaluators operate on event objects from your traces. Use {{ }} syntax to reference event properties in your prompt.
Property   | Description                                    | Example
event_type | Type of event: model, tool, chain, or session  | {{ event_type }}
event_name | Name of the event or session                   | {{ event_name }}
inputs     | Input data (prompt, query, context, etc.)      | {{ inputs.question }}
outputs    | Output data (completion, response, etc.)       | {{ outputs.content }}
feedback   | User feedback and ground truth                 | {{ feedback.ground_truth }}
Click Show Schema in the evaluator console to explore all available event properties for your project.
For detailed event schema documentation and tracing setup, see Configuring Tracing for Server-Side Evaluators.
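To make the schema concrete, here is a hypothetical event payload showing how the properties in the table map to the `{{ }}` references. The field names follow the schema above; the nested values are invented for illustration:

```python
# A hypothetical event object, as an LLM evaluator would see it.
event = {
    "event_type": "model",  # model, tool, chain, or session
    "event_name": "answer_question",
    "inputs": {
        "question": "What is HoneyHive?",
        "context": "HoneyHive is an AI observability platform.",
    },
    "outputs": {
        "content": "HoneyHive helps teams evaluate LLM applications.",
    },
    "feedback": {
        "ground_truth": "HoneyHive is an observability platform.",
    },
}

# {{ inputs.question }} in a prompt resolves to:
print(event["inputs"]["question"])  # -> What is HoneyHive?
```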

Evaluation Prompt

Define your evaluation prompt using the {{ }} syntax to inject event data:
[Instruction]
Evaluate the AI assistant's answer based on:
1. Relevance to the question
2. Accuracy of information
3. Clarity and coherence
4. Completeness of the answer

Provide a brief explanation and rate the response on a scale of 1 to 5.

[Question]
{{ inputs.question }}

[Context]
{{ inputs.context }}

[AI Assistant's Answer]
{{ outputs.content }}

[Evaluation]
Explanation:
Rating: [[X]]
Use the [[X]] pattern for ratings. The evaluator automatically extracts the value inside the brackets.
Looking for ready-made examples? Check out our LLM Evaluator Templates.
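Conceptually, the console substitutes event properties into the prompt and then pulls the rating out of the `[[X]]` pattern. A minimal Python sketch of both steps (the actual template engine and extraction logic are assumptions; the function names are hypothetical):

```python
import re

def render(template: str, event: dict) -> str:
    """Substitute {{ dotted.path }} references with event values
    (a simplified stand-in for the console's template engine)."""
    def resolve(match: re.Match) -> str:
        value = event
        for key in match.group(1).strip().split("."):
            value = value[key]
        return str(value)
    return re.sub(r"\{\{(.+?)\}\}", resolve, template)

def extract_rating(evaluator_output: str):
    """Pull the value inside the first [[X]] pattern, mirroring the
    automatic extraction described above."""
    match = re.search(r"\[\[(.+?)\]\]", evaluator_output)
    return match.group(1) if match else None

prompt = render(
    "[Question]\n{{ inputs.question }}",
    {"inputs": {"question": "What is retrieval-augmented generation?"}},
)
print(prompt)
print(extract_rating("Explanation: Accurate and clear.\nRating: [[4]]"))  # -> 4
```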

Configuration

Return Type

  • Boolean: For true/false evaluations
  • Numeric: For scores or ratings (e.g., 1-5)
  • String: For categorical labels or text responses

Passing Range

Define the range of scores that counts as a passing evaluation. This is useful for gating CI/CD pipelines and identifying failed test cases.
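As an illustration, a numeric evaluator with a 3-5 passing range could gate a CI/CD run like this (a sketch; the helper name and scores are hypothetical):

```python
def passes(score: float, passing_min: float, passing_max: float) -> bool:
    """True when the score falls inside the configured passing range."""
    return passing_min <= score <= passing_max

# Hypothetical scores from a test run, checked against a 3-5 passing range.
scores = [4, 2, 5]
failed = [s for s in scores if not passes(s, passing_min=3, passing_max=5)]
print(failed)  # -> [2]
```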

Enabled

Toggle to run this evaluator on production traces. Production is defined as any trace where source != evaluation.

Sampling Percentage

Run your evaluator on a percentage of production events to manage costs. Set a sampling percentage (e.g., 25%) based on your event volume.
Sampling only applies to production traces (source is neither evaluation nor playground). Offline evaluations always run on 100% of datapoints.
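The sampling behavior described above can be sketched as follows (an illustration of the rule, not HoneyHive's actual implementation):

```python
import random

def should_run(source: str, sampling_percentage: float) -> bool:
    """Offline runs (source is evaluation or playground) always execute;
    production traces are sampled at the configured percentage."""
    if source in ("evaluation", "playground"):
        return True  # offline evaluations run on 100% of datapoints
    return random.random() * 100 < sampling_percentage
```

With a 25% sampling percentage, roughly one in four production events would trigger the evaluator, while every offline datapoint still runs.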

Event Filters

Use Set Up Filters to specify which events trigger this evaluator. Filters are combined with AND: an event must match every filter to be evaluated.

Preset Filters

Every evaluator includes two preset filters by default:
  • Event Type: Filter by model, tool, chain, or session
  • Event Name: Target a specific event name, or use “All” (e.g., “All Models”) to match any event of that type

Additional Filters

Click the + button to add filters on any event property. You can filter on any field available in your event schema, including nested properties using dot notation (e.g., inputs.question, metadata.model, outputs.content). Each filter consists of:
  • Field: Any property from the event schema
  • Operator: Depends on the field type (see below)
  • Value: The value to compare against
Operators by field type:
Field Type | Operators
String     | is, is not, contains, not contains, exists, not exists
Number     | is, is not, greater than, less than, exists, not exists
Boolean    | is, exists, not exists
Datetime   | is, is not, after, before, exists, not exists
Click Show Schema in the evaluator editor to browse all available event properties you can filter on.
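Putting the pieces together, the AND semantics and dot-notation field access can be sketched in Python (only a few string operators are shown; function names and the sample event are hypothetical):

```python
def get_field(event: dict, path: str):
    """Resolve a dot-notation path like 'inputs.question' in an event."""
    value = event
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value

def matches(event: dict, filters: list) -> bool:
    """Filters are ANDed: the event must satisfy every one."""
    for f in filters:
        value = get_field(event, f["field"])
        op, target = f["operator"], f.get("value")
        if op == "is" and value != target:
            return False
        if op == "contains" and (value is None or target not in value):
            return False
        if op == "exists" and value is None:
            return False
    return True

event = {"event_type": "model", "inputs": {"question": "What is RAG?"}}
filters = [
    {"field": "event_type", "operator": "is", "value": "model"},
    {"field": "inputs.question", "operator": "contains", "value": "RAG"},
]
print(matches(event, filters))  # -> True
```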

Next Steps

Python Evaluators

Create code-based evaluators for programmatic checks

Evaluator Templates

Ready-to-use LLM and Python evaluator templates

Run Experiments

Use evaluators in offline experiments

Human Annotation

Set up human review workflows