How to use HoneyHive’s query builder interface to monitor performance and drive systematic improvements at scale.
source = "production"
)prompt_version
, model
, `user_tier)Request Volume
: Queries over time. Spot usage spikes or drops.Cost
: Direct expenses. See if that new feature is breaking the bank.Duration
: System latency. Because slow responses kill engagement.float
or boolean
to chart.Keyword Presence
(boolean): “Does every product review mention the product?”Coherence Score
(float): “How logically sound are multi-turn conversations?”float
or boolean
inputs.Usefulness Rating
(float): “On a scale of 1-5, how useful was this response?”Used in Report
(boolean): “Did the user actually use this in their report?”config
, user properties
, feedback
, metrics
, and metadata
can be used to slice and dice your data.
cost
, duration
, tokens
, errors
, and any specified evaluators.Average Unused Output Tokens
grouped by binned_input_length
.User Turns
, Session Duration
, Avg User Rating
, Agent Trajectory
.n
turns.”Agent Trajectory Evaluator
grouped by Number of turns
.Retrieval Latency
, Synthesis Quality
, Tool Choice Accuracy
.99th Percentile Rerank Time
vs. Requests per Minute
.Choose Your Metric (What to Measure)
Request Volume
. Ask: “Is volume growing faster for paid or freemium?”Cost
, ponder: “Is cost per successful session decreasing over time?”Apply Aggregation (How to Measure)
Average
: Typical case. “What’s our usual response time?”99th Percentile
: Edge cases. “How bad does it get for our unluckiest users?”Percentage True
: For booleans. “What % of responses are factually correct?”Average
is good, but Median
might better represent a skewed distribution.Average
and 99th Percentile
to catch issues averages hide.Filter and Group (Segmenting Data)
property operator value
. E.g., industry == "finance"
.topic_category != "smalltalk"
to focus on core use cases.embedding_model == "v2" AND date > model_switch_date
for before/after analysis.prompt_template
, user_tier
.prompt_template
to see which prompts waste tokens.user_tier
and topic_category
to see if premium users ask harder questions.