LLM evaluators use large language models to evaluate the quality of AI-generated responses based on custom criteria. They’re ideal for qualitative evaluations like coherence, relevance, faithfulness, and tone.

Creating an LLM Evaluator

  1. Navigate to the Evaluators tab in the HoneyHive console.
  2. Click Add Evaluator and select LLM Evaluator.
[Screenshot: LLM evaluator editor showing event filters, AI provider selection (OpenAI/gpt-4o), prompt editor with template syntax, and configuration options]
LLM evaluators use your configured AI provider. Set up your API keys on the Provider Keys page to use models from OpenAI, Anthropic, or other providers.

Event Schema

LLM evaluators operate on event objects from your traces. Use {{ }} syntax to reference event properties in your prompt.
| Property | Description | Example |
| --- | --- | --- |
| event_type | Type of event: model, tool, chain, or session | {{ event_type }} |
| event_name | Name of the event or session | {{ event_name }} |
| inputs | Input data (prompt, query, context, etc.) | {{ inputs.question }} |
| outputs | Output data (completion, response, etc.) | {{ outputs.content }} |
| feedback | User feedback and ground truth | {{ feedback.ground_truth }} |
Click Show Schema in the evaluator console to explore all available event properties for your project.
For detailed event schema documentation and tracing setup, see Configuring Tracing for Server-Side Evaluators.
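To make the {{ }} syntax concrete, here is a minimal sketch of how a template might be resolved against an event object. The `event` dict, its values, and the `render` helper are illustrative assumptions for this example, not part of the HoneyHive SDK:

```python
import re

# Illustrative event object; the fields follow the schema above, values are made up.
event = {
    "event_type": "model",
    "event_name": "qa_completion",
    "inputs": {"question": "What is the capital of France?", "context": "Geography FAQ"},
    "outputs": {"content": "The capital of France is Paris."},
    "feedback": {"ground_truth": "Paris"},
}

def render(template: str, event: dict) -> str:
    """Resolve {{ dotted.path }} references against the event object (sketch only)."""
    def lookup(match: re.Match) -> str:
        value = event
        for key in match.group(1).strip().split("."):
            value = value[key]  # walk nested dicts, e.g. inputs -> question
        return str(value)
    return re.sub(r"\{\{\s*(.*?)\s*\}\}", lookup, template)

print(render("Q: {{ inputs.question }} A: {{ outputs.content }}", event))
# -> Q: What is the capital of France? A: The capital of France is Paris.
```

The same dotted-path rule applies to any property shown by Show Schema, so {{ feedback.ground_truth }} would resolve to "Paris" for this event.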

Evaluation Prompt

Define your evaluation prompt using the {{ }} syntax to inject event data:
[Instruction]
Evaluate the AI assistant's answer based on:
1. Relevance to the question
2. Accuracy of information
3. Clarity and coherence
4. Completeness of the answer

Provide a brief explanation and rate the response on a scale of 1 to 5.

[Question]
{{ inputs.question }}

[Context]
{{ inputs.context }}

[AI Assistant's Answer]
{{ outputs.content }}

[Evaluation]
Explanation:
Rating: [[X]]
Use the [[X]] pattern for ratings. The evaluator automatically extracts the value inside the brackets.
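The extraction step can be pictured with a small regex sketch. This is an assumption about the behavior, not HoneyHive's actual parser; the `extract_rating` helper is hypothetical:

```python
import re

def extract_rating(text: str):
    """Pull the value inside the [[X]] pattern, e.g. 'Rating: [[4]]' -> 4.0 (sketch)."""
    match = re.search(r"\[\[\s*(\d+(?:\.\d+)?)\s*\]\]", text)
    return float(match.group(1)) if match else None

print(extract_rating("Explanation: concise and accurate.\nRating: [[4]]"))
# -> 4.0
```

Ending the prompt with `Rating: [[X]]` nudges the model to emit the bracketed value in a fixed position, which keeps extraction reliable.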
Looking for ready-made examples? Check out our LLM Evaluator Templates.

Configuration

Return Type

  • Boolean: For true/false evaluations
  • Numeric: For scores or ratings (e.g., 1-5)
  • String: For categorical labels or text responses

Passing Range

Define the range of scores that indicate a passing evaluation. Useful for CI/CD pipelines and identifying failed test cases.
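In a CI/CD pipeline, a passing range reduces each score to a pass/fail signal. A minimal sketch, assuming a numeric return type and an inclusive range (the `passes` helper and the (4, 5) range are illustrative, not a HoneyHive API):

```python
def passes(score: float, passing_range: tuple = (4, 5)) -> bool:
    """Return True when the evaluator score falls inside the inclusive passing range."""
    low, high = passing_range
    return low <= score <= high

print(passes(4.5))  # -> True
print(passes(2.0))  # -> False
```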

Enabled

Toggle this on to run the evaluator on production traces. Production is defined as any trace where source != evaluation.


Sampling Percentage

Run your evaluator on a percentage of production events to manage costs. Set a sampling percentage (e.g., 25%) based on your event volume.
Sampling only applies to production traces (source is not evaluation or playground). Offline evaluations always run on 100% of datapoints.
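The sampling rule described above can be sketched as a simple gate. This is an illustrative model of the behavior (the `should_evaluate` function is hypothetical), assuming production events are sampled uniformly at random:

```python
import random

def should_evaluate(source: str, sampling_percentage: float) -> bool:
    """Decide whether to run the evaluator on an event (sketch of the sampling rule)."""
    # Offline evaluations and playground runs are never sampled down.
    if source in ("evaluation", "playground"):
        return True
    # Production events are sampled at the configured percentage.
    return random.random() < sampling_percentage / 100

print(should_evaluate("evaluation", 25))  # -> True (offline always runs)
```

At 25%, roughly one in four production events would trigger the evaluator, cutting LLM cost proportionally.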

Event Filters

Use Set Up Filters to specify which events trigger this evaluator:
  • event_type: Filter by model, tool, chain, or session
  • Event Name: Target a specific event name or use “All” to match any

Next Steps