Understanding Your Data

HoneyHive’s data model captures a wealth of information. To explore it effectively, start with its two key components: metrics and properties.


Metrics

Metrics are the numerical heartbeat of your app. They’re what you’ll visualize in charts.

  1. Usage Metrics

    • Request Volume: Queries over time. Spot usage spikes or drops.
    • Cost: Direct expenses. See if that new feature is breaking the bank.
    • Duration: System latency. Because slow responses kill engagement.
  2. Evaluators

    • Definition: Your custom quality checks, either Python or LLM-based.
    • Requirements: Must return a float or a boolean to be charted.
    • Examples:
      • Keyword Presence (boolean): “Does every product review mention the product?”
      • Coherence Score (float): “How logically sound are multi-turn conversations?”
  3. User Feedback

    • Definition: The voice of your users, quantified.
    • Requirements: Feedback values must be floats or booleans to be charted.
    • Examples:
      • Usefulness Rating (float): “On a scale of 1-5, how useful was this response?”
      • Used in Report (boolean): “Did the user actually use this in their report?”
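
To make the float-or-boolean requirement concrete, here is a minimal sketch of two Python evaluators. The function names, signatures, and the `keyword_coverage` check are hypothetical illustrations, not HoneyHive's evaluator API; how an evaluator is registered with HoneyHive is not shown here.

```python
def keyword_presence(review: str, product_name: str) -> bool:
    """Boolean evaluator (hypothetical): True if the review mentions the product."""
    return product_name.lower() in review.lower()


def keyword_coverage(text: str, required_keywords: list[str]) -> float:
    """Float evaluator (hypothetical): fraction of required keywords found in the text."""
    if not required_keywords:
        return 0.0
    lowered = text.lower()
    hits = sum(1 for kw in required_keywords if kw.lower() in lowered)
    return hits / len(required_keywords)
```

Both return types can be charted: the boolean as a percentage of true results, the float as an average or percentile.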


Properties

Properties are the contextual gold of your data. All properties in the data model, such as config, user properties, feedback, metrics, and metadata, can be used to filter and group your data.

Metrics chart performance. Properties unveil the context behind that performance. Both are crucial for exploratory data analysis.

Chart Types

HoneyHive offers three chart types, each zooming in on different parts of your LLM pipeline:


[Figure: Example of monitoring TPS across multiple models]

Completion Charts

  • Focus: Individual LLM calls.
  • Key Metrics: cost, duration, tokens, errors, and any specified evaluators.
  • Examples:
    • Hypothesis: “Longer user messages cause more token waste.”
    • Test: Chart Average Unused Output Tokens grouped by binned_input_length.
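
The test above boils down to binning a numeric property and averaging a metric within each bin. The sketch below shows that computation offline in plain Python. The record fields `user_message` and `unused_output_tokens`, the 100-character bin width, and the function name are all assumptions made for illustration; HoneyHive's own definition of these properties is not given here.

```python
from collections import defaultdict
from statistics import mean


def average_unused_by_bin(records, width=100):
    """Average unused output tokens per input-length bin (bins are `width` chars wide)."""
    groups = defaultdict(list)
    for record in records:
        # Lower edge of the bin this message length falls into, e.g. 150 -> 100.
        lower = (len(record["user_message"]) // width) * width
        groups[lower].append(record["unused_output_tokens"])
    return {
        f"{lower}-{lower + width - 1}": mean(values)
        for lower, values in sorted(groups.items())
    }


# Hypothetical sample records.
sample_records = [
    {"user_message": "a" * 50, "unused_output_tokens": 10},
    {"user_message": "b" * 80, "unused_output_tokens": 30},
    {"user_message": "c" * 150, "unused_output_tokens": 100},
]
averages = average_unused_by_bin(sample_records)
```

If the averages rise from bin to bin, the data is consistent with the hypothesis; if they stay flat, look elsewhere.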

Session Charts

  • Focus: Full user interactions and entire traces.
  • Key Metrics: User Turns, Session Duration, Avg User Rating, Agent Trajectory.
  • Examples:
    • Hypothesis: “Agents start looping after n turns.”
    • Test: Chart Agent Trajectory Evaluator grouped by Number of turns.

Event Charts

  • Focus: Specific steps or tools.
  • Key Metrics: Retrieval Latency, Synthesis Quality, Tool Choice Accuracy.
  • Examples:
    • Hypothesis: “Our reranker is the bottleneck in high-load scenarios.”
    • Test: Chart 99th Percentile Rerank Time vs. Requests per Minute.

Building Charts


Choose Your Metric (What to Measure)

  • Process: Pick chart type, then a relevant metric.
  • Real-world Usage:
    • Don’t just track Request Volume. Ask: “Is volume growing faster for paid or freemium?”
    • Beyond Cost, ponder: “Is cost per successful session decreasing over time?”

Apply Aggregation (How to Measure)

  • Key Functions:
    • Average: Typical case. “What’s our usual response time?”
    • 99th Percentile: Edge cases. “How bad does it get for our unluckiest users?”
    • Percentage True: For booleans. “What % of responses are factually correct?”
  • Real-world Usage:
    • Average is a good default, but Median often represents a skewed distribution better.
    • Watch both Average and 99th Percentile to catch issues averages hide.
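
A small sketch of why these aggregations can disagree. It uses Python's standard `statistics` module plus a nearest-rank percentile; the percentile method HoneyHive uses is not stated, so treat `percentile` and `percentage_true` as illustrative helpers rather than the product's implementation.

```python
import math
from statistics import mean, median


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def percentage_true(flags):
    """Share of True values in a list of booleans, as a percentage."""
    return 100 * sum(flags) / len(flags)


# Hypothetical latencies in seconds: 98 fast requests and 2 very slow ones.
latencies = [0.2] * 98 + [9.0, 9.0]

avg_latency = mean(latencies)            # pulled upward by the two outliers
median_latency = median(latencies)       # the typical request
p99_latency = percentile(latencies, 99)  # the slow tail

factual = percentage_true([True, True, False, True])
```

Here the median stays at the typical 0.2 seconds, the average drifts upward, and only the 99th percentile exposes the 9-second tail.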

Filter and Group (Segmenting Data)

  • Filtering:
    • Syntax: property operator value. E.g., industry == "finance".
    • Examples:
      • topic_category != "smalltalk" to focus on core use cases.
      • embedding_model == "v2" AND date > model_switch_date for before/after analysis.
  • Grouping:
    • Syntax: Select one or more properties. E.g., prompt_template, user_tier.
    • Examples:
      • prompt_template to see which prompts waste tokens.
      • user_tier and topic_category to see if premium users ask harder questions.
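
To illustrate the semantics of filtering and grouping, the following sketch applies a `property operator value` filter and then groups by two properties over in-memory records. It is an offline approximation in plain Python, not HoneyHive's query engine; the helper names, the operator table, and the sample records are assumptions.

```python
import operator
from collections import defaultdict

# Comparison operators supported by this sketch.
OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
}


def apply_filter(records, prop, op, value):
    """Keep records where `record[prop] <op> value` holds."""
    compare = OPERATORS[op]
    return [r for r in records if compare(r[prop], value)]


def group_by(records, *props):
    """Group records by the tuple of values of the given properties."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[p] for p in props)].append(r)
    return dict(groups)


# Hypothetical sample records.
events = [
    {"user_tier": "premium", "topic_category": "billing"},
    {"user_tier": "premium", "topic_category": "smalltalk"},
    {"user_tier": "free", "topic_category": "billing"},
    {"user_tier": "premium", "topic_category": "billing"},
]

core = apply_filter(events, "topic_category", "!=", "smalltalk")
by_tier_and_topic = group_by(core, "user_tier", "topic_category")
counts = {key: len(rows) for key, rows in by_tier_and_topic.items()}
```

Filtering first and grouping second keeps the groups focused: the smalltalk record never reaches the per-tier counts.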