# Using Discover
How to use HoneyHive’s query builder interface to monitor performance and drive systematic improvements at scale.
## Understanding Your Data
HoneyHive’s data model captures a wealth of information. To effectively explore this data, understand these key components:
### Metrics
Metrics are the numerical heartbeat of your app. They’re what you’ll visualize in charts.
- **Usage Metrics**
  - `Request Volume`: Queries over time. Spot usage spikes or drops.
  - `Cost`: Direct expenses. See if that new feature is breaking the bank.
  - `Duration`: System latency. Because slow responses kill engagement.
- **Evaluators**
  - Definition: Your custom quality checks, either Python or LLM-based.
  - Requirements: Must return a `float` or `boolean` to be charted (a minimal sketch follows this list).
  - Examples:
    - `Keyword Presence` (boolean): “Does every product review mention the product?”
    - `Coherence Score` (float): “How logically sound are multi-turn conversations?”
- **User Feedback**
  - Definition: The voice of your users, quantified.
  - Requirements: `float` or `boolean` inputs.
  - Examples:
    - `Usefulness Rating` (float): “On a scale of 1-5, how useful was this response?”
    - `Used in Report` (boolean): “Did the user actually use this in their report?”
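To make the return-type requirement concrete, here is a minimal sketch of two custom Python evaluators, one boolean and one float. The event structure and field names (`inputs`, `outputs`) are illustrative assumptions, not HoneyHive's actual evaluator interface; the only requirement taken from the list above is that each evaluator returns a `float` or `boolean`.

```python
# Minimal sketch of two custom Python evaluators. The event structure and
# field names ("inputs", "outputs") are assumptions for illustration; the
# only hard requirement from the docs is the float/boolean return type.

def keyword_presence(event: dict) -> bool:
    """Boolean evaluator: does the generated review mention the product?"""
    product = event.get("inputs", {}).get("product_name", "")
    review = event.get("outputs", {}).get("text", "")
    return bool(product) and product.lower() in review.lower()

def coherence_score(event: dict) -> float:
    """Float evaluator: a toy stand-in for logical coherence.
    In practice this would typically be an LLM-based grader."""
    text = event.get("outputs", {}).get("text", "")
    sentences = [s for s in text.split(".") if s.strip()]
    # Toy heuristic: more complete sentences -> higher score, capped at 1.0.
    return min(len(sentences) / 5.0, 1.0)
```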
### Properties
Properties are the contextual gold of your data. All properties in the data model, such as `config`, `user properties`, `feedback`, `metrics`, and `metadata`, can be used to slice and dice your data in powerful ways.
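As an illustration, a single logged event with these property groups might look roughly like the dictionary below. The individual field names inside each group are assumptions for illustration; only the group names come from the data model above.

```python
# Illustrative shape of a logged event and the property groups named above.
# The fields inside each group are invented for illustration.
event = {
    "config": {"model": "gpt-4o", "temperature": 0.2},   # prompt/model settings
    "user_properties": {"user_tier": "premium"},         # who triggered the call
    "feedback": {"usefulness_rating": 4.0},              # quantified user feedback
    "metrics": {"coherence_score": 0.85},                # evaluator outputs
    "metadata": {"topic_category": "billing"},           # free-form context
}
```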
## Chart Types
HoneyHive offers three chart types, each zooming in on a different part of your LLM pipeline:
### Completion Charts
- Focus: Individual LLM calls.
- Key Metrics: `cost`, `duration`, `tokens`, `errors`, and any specified evaluators.
- Example:
  - Hypothesis: “Longer user messages cause more token waste.”
  - Test: Chart `Average Unused Output Tokens` grouped by `binned_input_length`.
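A grouping property like `binned_input_length` usually has to be computed and logged at request time. A minimal sketch of that binning step; the bin edges and the metadata field name are illustrative choices, not a HoneyHive convention:

```python
# Hypothetical helper that buckets user-message length at request time so
# charts can group by it later. Bin edges are arbitrary illustrative choices.
def bin_input_length(text: str) -> str:
    n = len(text)
    if n < 100:
        return "short (<100 chars)"
    if n < 500:
        return "medium (100-499 chars)"
    return "long (500+ chars)"

user_message = "Please summarize the attached quarterly report in detail."
metadata = {"binned_input_length": bin_input_length(user_message)}  # logged with the event
```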
### Session Charts
- Focus: Full user interactions and entire traces.
- Key Metrics: `User Turns`, `Session Duration`, `Avg User Rating`, `Agent Trajectory`.
- Example:
  - Hypothesis: “Agents start looping after `n` turns.”
  - Test: Chart `Agent Trajectory Evaluator` grouped by `Number of turns`.
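One way such an `Agent Trajectory Evaluator` could detect looping is by flagging repeated identical tool calls. A hypothetical sketch, assuming a session exposes its steps as a list of tool calls (the `steps`, `tool`, and `arguments` field names are assumptions for illustration):

```python
# Hypothetical loop detector for an Agent Trajectory-style evaluator.
def agent_trajectory_evaluator(session: dict) -> bool:
    """Returns True when the agent repeats an identical tool call,
    a simple proxy for "the agent is looping"."""
    seen = set()
    for step in session.get("steps", []):
        call = (step.get("tool"), str(step.get("arguments")))
        if call in seen:
            return True  # identical repeated call -> probable loop
        seen.add(call)
    return False
```

Charting the `Percentage True` of an evaluator like this, grouped by `Number of turns`, would show whether loops cluster past a certain turn count.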
### Event Charts
- Focus: Specific steps or tools.
- Key Metrics: `Retrieval Latency`, `Synthesis Quality`, `Tool Choice Accuracy`.
- Example:
  - Hypothesis: “Our reranker is the bottleneck in high-load scenarios.”
  - Test: Chart `99th Percentile Rerank Time` vs. `Requests per Minute`.
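If you want to sanity-check a tail-latency number like this offline, NumPy computes it directly. The latency values below are invented for illustration; in practice you would export them from your event data:

```python
import numpy as np

# Rerank latencies in milliseconds, invented for illustration.
rerank_ms = np.array([112, 98, 130, 1450, 104, 121, 99, 2210, 118, 107])

print(f"mean: {rerank_ms.mean():.0f} ms")              # looks tolerable
print(f"p99:  {np.percentile(rerank_ms, 99):.0f} ms")  # exposes the tail
```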
## Building Charts
### Choose Your Metric (What to Measure)
- Process: Pick a chart type, then a relevant metric.
- Real-world Usage:
  - Don’t just track `Request Volume`. Ask: “Is volume growing faster for paid or freemium users?”
  - Beyond `Cost`, ponder: “Is cost per successful session decreasing over time?”
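“Cost per successful session” is a derived metric: total spend amortized over the sessions that actually succeeded. A quick sketch of the arithmetic, with made-up session records:

```python
# "Cost per successful session": total spend divided by successful sessions.
# The record fields are made up for illustration.
sessions = [
    {"cost": 0.042, "success": True},
    {"cost": 0.013, "success": False},
    {"cost": 0.037, "success": True},
]

successes = sum(1 for s in sessions if s["success"])
total_cost = sum(s["cost"] for s in sessions)
print(f"cost per successful session: ${total_cost / successes:.3f}")
```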
### Apply Aggregation (How to Measure)
- Key Functions:
  - `Average`: Typical case. “What’s our usual response time?”
  - `99th Percentile`: Edge cases. “How bad does it get for our unluckiest users?”
  - `Percentage True`: For booleans. “What % of responses are factually correct?”
- Real-world Usage:
  - `Average` is good, but `Median` might better represent a skewed distribution.
  - Watch both `Average` and `99th Percentile` to catch issues averages hide.
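A small numerical illustration of why averages hide skew, and of how `Percentage True` works for booleans (all values are invented):

```python
import numpy as np

# A skewed latency distribution: most requests are fast, a few are very slow.
latencies = np.array([0.4, 0.5, 0.5, 0.6, 0.6, 0.7, 0.8, 9.0, 12.0])
print(f"average: {latencies.mean():.2f}s")      # inflated by the two outliers
print(f"median:  {np.median(latencies):.2f}s")  # the typical user experience

# Percentage True over a boolean evaluator column.
factually_correct = [True, True, False, True, True, False, True, True, True]
pct_true = 100 * sum(factually_correct) / len(factually_correct)
print(f"% factually correct: {pct_true:.0f}%")
```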
### Filter and Group (Segmenting Data)
- Filtering:
  - Syntax: `property operator value`, e.g. `industry == "finance"`.
  - Examples:
    - `topic_category != "smalltalk"` to focus on core use cases.
    - `embedding_model == "v2" AND date > model_switch_date` for before/after analysis.
- Grouping:
  - Syntax: Select one or more properties, e.g. `prompt_template`, `user_tier`.
  - Examples:
    - Group by `prompt_template` to see which prompts waste tokens.
    - Group by `user_tier` and `topic_category` to see if premium users ask harder questions.
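If you export events for offline analysis, the same filter-then-group logic maps naturally onto pandas. A sketch with illustrative column names mirroring the properties above (the export format is an assumption):

```python
import pandas as pd

# Illustrative export of events; columns mirror the properties above.
df = pd.DataFrame([
    {"topic_category": "billing",   "prompt_template": "v3", "unused_tokens": 120},
    {"topic_category": "smalltalk", "prompt_template": "v2", "unused_tokens": 40},
    {"topic_category": "billing",   "prompt_template": "v3", "unused_tokens": 95},
])

# Filter out smalltalk, then group by prompt template, as in the examples above.
core = df[df["topic_category"] != "smalltalk"]
print(core.groupby("prompt_template")["unused_tokens"].mean())
```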
## Example: Monitoring TPS across multiple models