Online evaluations run your evaluators automatically on production traces after ingestion. This gives you continuous quality scores alongside your cost and latency metrics, without adding latency to your application.

How It Works

When you enable online evaluation for an evaluator, HoneyHive runs it asynchronously on incoming production traces:
  1. Your application sends traces to HoneyHive
  2. HoneyHive matches traces against your evaluator’s event filters
  3. Matching events are evaluated (subject to your sampling rate)
  4. Results appear as metrics in your dashboard and on individual traces
Online evaluations run only on production traces (where source is not evaluation or playground). Experiment traces are always evaluated at 100%.
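The flow above can be sketched as a small pipeline. This is a hypothetical illustration of the logic, not HoneyHive's actual internals — the event dict shape, the evaluator structure, and the field names are all assumptions for the sketch:

```python
import random

def process_trace(event, evaluator, sampling_rate):
    """Sketch of the online-evaluation flow for one ingested event."""
    # Only production traces are considered.
    if event.get("source") in ("evaluation", "playground"):
        return None
    # Step 2: the event must match the evaluator's event filters.
    if event.get("event_type") != evaluator["event_type"]:
        return None
    if evaluator["event_name"] != "All" and event.get("event_name") != evaluator["event_name"]:
        return None
    # Step 3: apply the sampling rate (0.25 evaluates ~25% of matches).
    if random.random() >= sampling_rate:
        return None
    # Step 4: run the evaluator; the result becomes a dashboard metric.
    return evaluator["fn"](event)

evaluator = {
    "event_type": "model",
    "event_name": "generate_response",
    "fn": lambda e: 1.0 if e.get("output") else 0.0,  # toy quality score
}
prod_event = {"source": "production", "event_type": "model",
              "event_name": "generate_response", "output": "Hello!"}
score = process_trace(prod_event, evaluator, sampling_rate=1.0)  # 1.0
```

Note that the sampling decision happens per matching event, so a 25% rate evaluates roughly a quarter of matches over time rather than a fixed slice.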

Enabling Online Evaluation

You can enable online evaluation on any server-side evaluator (Python or LLM):
1. Go to the Evaluators page

Navigate to the Evaluators tab in HoneyHive.
2. Create or select an evaluator

Create a new Python or LLM evaluator, or select an existing one. Configure event filters, return type, and your evaluation logic.
[Screenshot: LLM evaluator editor showing event filters (model type, OpenAI gpt-4o provider), an evaluation prompt with template syntax, sampling percentage, and return type configuration]
3. Enable the evaluator

Toggle the Enabled switch in the evaluators table. This tells HoneyHive to run this evaluator on matching production traces.
4. Set a sampling percentage

Set the Sampling percentage to control what fraction of matching events get evaluated (e.g., 25%). This controls cost for LLM-based evaluators at high volumes.
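For a Python evaluator, the evaluation logic configured in step 2 is a function over the event's fields. The exact signature and event schema HoneyHive passes are not shown here, so this is a hypothetical sketch of a numeric evaluator — adapt the field access to your project's actual schema:

```python
def response_completeness(event):
    """Hypothetical Python evaluator returning a 0-1 score.

    Assumes the model output is reachable as event["outputs"]["content"];
    this field path is an assumption, not HoneyHive's documented schema.
    """
    output = event.get("outputs", {}).get("content", "")
    if not output:
        return 0.0
    # Reward responses that are neither truncated nor rambling.
    return 1.0 if 20 <= len(output) <= 500 else 0.5

sample_event = {"outputs": {"content": "The capital of France is Paris."}}
score = response_completeness(sample_event)  # 1.0
```

Because evaluators like this are plain functions with no model calls, they are the kind you can afford to run at or near 100% sampling.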

Event Filters

Each evaluator has event filters that determine which traces it runs on:
  • Event type: Filter by model, tool, chain, or session
  • Event name: Target a specific named event, or use “All” to match any event of that type
For example, you might run a hallucination evaluator only on model events named generate_response, while running a tone evaluator on the full session.
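Conceptually, the filters act as a predicate over incoming events. A sketch, using a hypothetical dict structure (the real filters are configured in the HoneyHive UI, not in code):

```python
def filter_matches(filters, event):
    """Return True if an event satisfies an evaluator's event filters."""
    if event["event_type"] != filters["event_type"]:
        return False
    # "All" matches any event of that type.
    return filters["event_name"] in ("All", event["event_name"])

# The two evaluators from the example above, as filter configurations
hallucination_filters = {"event_type": "model", "event_name": "generate_response"}
tone_filters = {"event_type": "session", "event_name": "All"}

model_event = {"event_type": "model", "event_name": "generate_response"}
session_event = {"event_type": "session", "event_name": "chat_session"}
```

Here the hallucination evaluator matches only the named model event, while the tone evaluator matches any session event.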

Sampling

LLM-based evaluators incur model costs for every evaluation. At production scale, use sampling to control spend:
| Volume | Suggested sampling | Rationale |
| --- | --- | --- |
| < 1K events/day | 100% | Full coverage is affordable |
| 1K–10K events/day | 25–50% | Good signal with moderate cost |
| 10K+ events/day | 5–25% | Statistical significance with controlled spend |
Python evaluators are much cheaper to run than LLM evaluators. You can often run Python evaluators at 100% sampling even at high volumes.
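A quick way to pick a rate is to estimate monthly spend. The arithmetic is simple; the per-evaluation cost below is an assumed figure for illustration — substitute your judge model's real pricing:

```python
def monthly_eval_cost(events_per_day, sampling_rate, cost_per_eval_usd):
    """Estimate monthly LLM-evaluator spend at a given sampling rate."""
    return events_per_day * sampling_rate * cost_per_eval_usd * 30

# 10,000 matching events/day at 25% sampling, ~$0.002 per judge call (assumed)
cost = monthly_eval_cost(10_000, 0.25, 0.002)  # $150/month
```

Running the same estimate at 100% sampling quadruples the figure, which is why the suggested rates shrink as volume grows.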

Viewing Results

Online evaluation results are available in two places:
  • Dashboard charts: Select your evaluator as a metric in Custom Charts to track quality over time, group by properties, and set up alerts
  • Individual traces: Each evaluated trace shows its evaluator scores alongside inputs, outputs, and other metadata
[Screenshot: monitoring dashboard with charts for session duration, LLM call duration, token usage, and custom evaluator metrics such as Search Relevance and Agent Execution Quality]
You can also use the Discover view to build custom queries on evaluator scores, filter by source, and drill into individual events.
[Screenshot: Discover view charting the Search Relevance evaluator metric over time for a tool_search_web event, grouped by source]

Choosing Between Client-Side and Server-Side

| | Client-side | Server-side (online) |
| --- | --- | --- |
| Runs | In your application | On HoneyHive after ingestion |
| Latency impact | Adds to request time | None |
| Best for | Guardrails, format checks, PII detection | LLM-as-judge, complex quality scoring |
| Managed in | Your code | HoneyHive UI |
Use client-side evaluators for checks that need to happen during execution (guardrails, blocking unsafe responses). Use online evaluations for quality scoring that can happen asynchronously.
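For example, a client-side guardrail has to run inline, before the response reaches the user, which is why it belongs in your code rather than in an asynchronous online evaluator. A sketch of a blocking PII check (the regex is deliberately simplistic, for illustration only):

```python
import re

EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def guardrail_check(response_text):
    """Client-side check: block responses that leak an email address.

    Runs synchronously in the request path, so keep it cheap:
    regex checks, format validation, allow/deny lists.
    """
    return EMAIL_PATTERN.search(response_text) is None

blocked = not guardrail_check("Contact me at alice@example.com")  # True
```

An online evaluator could score the same trace for tone or hallucination minutes later without the user ever waiting on it.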

Next Steps