LLM evaluators use large language models to evaluate the quality of AI-generated responses based on custom criteria. They’re ideal for qualitative evaluations like coherence, relevance, faithfulness, and tone.

Creating an LLM Evaluator

  1. Navigate to the Evaluators tab in the HoneyHive console.
  2. Click Add Evaluator and select LLM Evaluator.
[Screenshot: LLM evaluator editor showing event filters, AI provider selection (OpenAI/gpt-4o), prompt editor with template syntax, and configuration options]
LLM evaluators use your configured AI provider. Set up your API keys on the Provider Keys page to use models from OpenAI, Anthropic, or other providers.

Event Schema

LLM evaluators operate on event objects from your traces. Use {{ }} syntax to reference event properties in your prompt.
| Property | Description | Example |
| --- | --- | --- |
| event_type | Type of event: model, tool, chain, or session | {{ event_type }} |
| event_name | Name of the event or session | {{ event_name }} |
| inputs | Input data (prompt, query, context, etc.) | {{ inputs.question }} |
| outputs | Output data (completion, response, etc.) | {{ outputs.content }} |
| feedback | User feedback and ground truth | {{ feedback.ground_truth }} |
Click Show Schema in the evaluator console to explore all available event properties for your project.
For detailed event schema documentation and tracing setup, see Configuring Tracing for Server-Side Evaluators.
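To make the {{ }} syntax concrete, here is a minimal sketch of how a template might be resolved against an event object. The `event` dict, its values, and the `render` helper are illustrative assumptions for this example, not part of the HoneyHive SDK:

```python
import re

# Illustrative event object; the fields follow the schema above, values are made up.
event = {
    "event_type": "model",
    "event_name": "qa_completion",
    "inputs": {"question": "What is the capital of France?", "context": "Geography FAQ"},
    "outputs": {"content": "The capital of France is Paris."},
    "feedback": {"ground_truth": "Paris"},
}

def render(template: str, event: dict) -> str:
    """Resolve {{ dotted.path }} references against the event object (sketch only)."""
    def lookup(match: re.Match) -> str:
        value = event
        for key in match.group(1).strip().split("."):
            value = value[key]  # walk nested dicts, e.g. inputs -> question
        return str(value)
    return re.sub(r"\{\{\s*(.*?)\s*\}\}", lookup, template)

print(render("Q: {{ inputs.question }} A: {{ outputs.content }}", event))
# -> Q: What is the capital of France? A: The capital of France is Paris.
```

The same dotted-path rule applies to any property shown by Show Schema, so {{ feedback.ground_truth }} would resolve to "Paris" for this event.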

Evaluation Prompt

Define your evaluation prompt using the {{ }} syntax to inject event data:
[Instruction]
Evaluate the AI assistant's answer based on:
1. Relevance to the question
2. Accuracy of information
3. Clarity and coherence
4. Completeness of the answer

Provide a brief explanation and rate the response on a scale of 1 to 5.

[Question]
{{ inputs.question }}

[Context]
{{ inputs.context }}

[AI Assistant's Answer]
{{ outputs.content }}

[Evaluation]
Explanation:
Rating: [[X]]
Use the [[X]] pattern for ratings. The evaluator automatically extracts the value inside the brackets.
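The extraction step can be pictured with a small regex sketch. This is an assumption about the behavior, not HoneyHive's actual parser; the `extract_rating` helper is hypothetical:

```python
import re

def extract_rating(text: str):
    """Pull the value inside the [[X]] pattern, e.g. 'Rating: [[4]]' -> 4.0 (sketch)."""
    match = re.search(r"\[\[\s*(\d+(?:\.\d+)?)\s*\]\]", text)
    return float(match.group(1)) if match else None

print(extract_rating("Explanation: concise and accurate.\nRating: [[4]]"))
# -> 4.0
```

Ending the prompt with `Rating: [[X]]` nudges the model to emit the bracketed value in a fixed position, which keeps extraction reliable.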
Looking for ready-made examples? Check out our LLM Evaluator Templates.

Configuration

Return Type

  • Boolean: For true/false evaluations
  • Numeric: For scores or ratings (e.g., 1-5)
  • String: For categorical labels or text responses

Passing Range

Define the range of scores that indicate a passing evaluation. Useful for CI/CD pipelines and identifying failed test cases.
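In a CI/CD pipeline, a passing range reduces each score to a pass/fail signal. A minimal sketch, assuming a numeric return type and an inclusive range (the `passes` helper and the (4, 5) range are illustrative, not a HoneyHive API):

```python
def passes(score: float, passing_range: tuple = (4, 5)) -> bool:
    """Return True when the evaluator score falls inside the inclusive passing range."""
    low, high = passing_range
    return low <= score <= high

print(passes(4.5))  # -> True
print(passes(2.0))  # -> False
```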

Enabled

Toggle this on to run the evaluator on production traces. Production is defined as any trace where source != evaluation.


Sampling Percentage

Run your evaluator on a percentage of production events to manage costs. Set a sampling percentage (e.g., 25%) based on your event volume.
Sampling only applies to production traces (source is not evaluation or playground). Offline evaluations always run on 100% of datapoints.
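The sampling rule described above can be sketched as a simple gate. This is an illustrative model of the behavior (the `should_evaluate` function is hypothetical), assuming production events are sampled uniformly at random:

```python
import random

def should_evaluate(source: str, sampling_percentage: float) -> bool:
    """Decide whether to run the evaluator on an event (sketch of the sampling rule)."""
    # Offline evaluations and playground runs are never sampled down.
    if source in ("evaluation", "playground"):
        return True
    # Production events are sampled at the configured percentage.
    return random.random() < sampling_percentage / 100

print(should_evaluate("evaluation", 25))  # -> True (offline always runs)
```

At 25%, roughly one in four production events would trigger the evaluator, cutting LLM cost proportionally.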

Event Filters

Use Set Up Filters to specify which events trigger this evaluator:
  • event_type: Filter by model, tool, chain, or session
  • Event Name: Target a specific event name or use “All” to match any

Next Steps