Our data model enables you to analyze specific steps and chains in your LLM pipeline independently of the full trace. This allows you to monitor individual parts of your pipeline (e.g., your vector database step) in isolation and calculate metrics such as Median User Rating per Session or P99 Retrieval Latency, giving you much more detailed and granular monitoring.
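As a minimal sketch of what these step-level aggregations compute, here is plain Python over hypothetical trace events (the record fields, session IDs, and a nearest-rank P99 are illustrative, not HoneyHive's internal schema):

```python
from statistics import median

# Hypothetical trace events: each belongs to a session and a named pipeline
# step, mirroring the step-level granularity described above.
events = [
    {"session_id": "s1", "step": "retrieval", "latency_ms": 120, "user_rating": 4},
    {"session_id": "s1", "step": "retrieval", "latency_ms": 480, "user_rating": None},
    {"session_id": "s2", "step": "retrieval", "latency_ms": 95,  "user_rating": 5},
]

def p99(values):
    """Nearest-rank 99th percentile."""
    ordered = sorted(values)
    rank = max(0, int(round(0.99 * len(ordered))) - 1)
    return ordered[rank]

retrieval_latencies = [e["latency_ms"] for e in events if e["step"] == "retrieval"]
ratings = [e["user_rating"] for e in events if e["user_rating"] is not None]

p99_retrieval_latency = p99(retrieval_latencies)  # tail latency of one step only
median_user_rating = median(ratings)              # rating metric, computed separately
```

The point is that the retrieval step's latency distribution is computed independently of any other step in the trace.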

We support three core chart types:

  1. Session charts: Helps you observe how users interact with your app over the course of a session and monitor key metrics such as Avg Number of User Turns, Avg Session Duration, Median User Rating per Session, and more.
  2. Completion charts: Helps you specifically monitor all LLM requests. This includes key metrics like cost, latency, token usage, API errors, and any specific evaluators you may have defined (e.g., Keyword Assertions, Answer Faithfulness, JSON Validation, etc.).
  3. Event charts: Helps you monitor specific chains or tool events of interest. Examples include independently monitoring reranking and synthesis steps in a RAG pipeline, monitoring Context Relevance across retrieved chunks to validate data quality, and more.

Customers building complex agents and RAG pipelines can optimize not just their prompts or models, but also subcomponents of their pipelines, such as chunking strategy, retrieval architecture, and tool use.


  1. Real-time Observation: Log in to HoneyHive to observe your LLM application’s performance metrics in real-time. The dashboard provides an intuitive interface to visualize various metrics and their trends.
  2. Metric Definition: Define the specific metric you want to visualize. HoneyHive supports standard out-of-the-box metrics, custom metrics, and user feedback. Standard metrics include Request Volume, Cost, and Latency. Any custom metrics that you previously defined and enabled in production can also be visualized here. User feedback is aggregated based on its return type; for example, you can select Accepted to track the percentage of requests that were accepted by end-users.
  3. Aggregation Functions: Choose the aggregation function that best suits your analysis. Common functions include Average, Sum, Percentage True, 99th Percentile and more. Selecting the right aggregation function helps you distill complex data into meaningful insights. HoneyHive automatically provides different aggregation functions for boolean and float return-type metrics.
  4. Data Filtering and Comparison: Segment your data with filters and group-bys. This allows you to focus on specific data slices based on user properties, custom metadata, or other relevant criteria. For example, you can filter by user_country or subscription_tier to perform cohort-level analysis. Any user properties or custom metadata you have logged are available here.
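Steps 2–4 above boil down to filter → group → aggregate. A plain-Python sketch over hypothetical completion logs (field names like `accepted` are illustrative; Percentage True is the aggregation applied to a boolean feedback field):

```python
from collections import defaultdict

# Hypothetical completion logs with user properties attached (see step 4).
logs = [
    {"user_country": "US", "subscription_tier": "pro",  "accepted": True},
    {"user_country": "US", "subscription_tier": "free", "accepted": False},
    {"user_country": "DE", "subscription_tier": "pro",  "accepted": True},
]

# Filter to one cohort, then group by a property and apply an aggregation
# (here: Percentage True over the boolean `accepted` feedback field).
pro_only = [log for log in logs if log["subscription_tier"] == "pro"]

by_country = defaultdict(list)
for log in pro_only:
    by_country[log["user_country"]].append(log["accepted"])

pct_accepted = {
    country: 100 * sum(vals) / len(vals) for country, vals in by_country.items()
}
```

Swapping the grouping key or the aggregation function yields any of the other chart configurations described above.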


Example: Monitoring TPS across multiple models
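A sketch of that example, assuming TPS here means tokens per second; the model names and record fields are illustrative, not values from a real deployment:

```python
from collections import defaultdict

# Hypothetical completion logs, one record per LLM request.
completions = [
    {"model": "gpt-4o",         "completion_tokens": 200, "duration_s": 2.0},
    {"model": "gpt-4o",         "completion_tokens": 300, "duration_s": 2.0},
    {"model": "claude-3-haiku", "completion_tokens": 400, "duration_s": 1.0},
]

tokens = defaultdict(float)
seconds = defaultdict(float)
for c in completions:
    tokens[c["model"]] += c["completion_tokens"]
    seconds[c["model"]] += c["duration_s"]

# Aggregate TPS per model: total tokens generated / total generation time.
tps_by_model = {model: tokens[model] / seconds[model] for model in tokens}
```

Grouping by `model` is what turns one TPS metric into a per-model comparison chart.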

Metric Types

HoneyHive supports various metric types for monitoring your LLM application’s performance:

  1. Usage Metrics: These include Request Volume, which measures the number of requests your application receives; Cost, which evaluates the expenses associated with running the application; and Duration, which assesses the span or trace duration and indicates system-level latency.
  2. Evaluators: All evaluators (Python, LLM, or Human) that you have defined and enabled in production can be analyzed here.
  3. User Feedback: You can visualize any user feedback fields that you have captured from your users in staging/production (as long as the return type is Float or Boolean) to analyze performance and user satisfaction.

Return Types: An evaluator is only available as a measurable metric if its return type is set to Float or Boolean. Evaluators with a String return type can only be used to group or filter charts.
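The return-type rule can be sketched as a simple partition (the evaluator names and registry shape below are hypothetical):

```python
# Hypothetical evaluator registry entries with declared return types.
evaluators = [
    {"name": "answer_faithfulness", "return_type": "float"},
    {"name": "json_valid",          "return_type": "boolean"},
    {"name": "detected_language",   "return_type": "string"},
]

# Float/boolean evaluators can be measured (aggregated into chart values);
# string evaluators are only usable as group-by or filter dimensions.
measurable = [e["name"] for e in evaluators if e["return_type"] in ("float", "boolean")]
filter_or_group_only = [e["name"] for e in evaluators if e["return_type"] == "string"]
```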

Additional Metadata

User Properties

User properties provide valuable insights into user behavior and preferences. Common examples include:

  1. user_id: A unique identifier for each user, helping you track individual user interactions.
  2. user_country: Allows you to analyze how different regions interact with your application.
  3. subscription_tier: Helps you understand the behavior of different user segments based on their subscription level.

Utilize these properties to perform cohort-level analysis, identifying trends and patterns among specific user groups.

Custom Metadata

Metadata offers flexibility in capturing additional information about user interactions. This arbitrary data can be passed with logged requests. Common examples include:

  1. Custom Tags: Tag requests with identifiers that hold significance within your application.
  2. Session Duration: Track how long users engage with your LLM application.
  3. Content Type: Categorize requests based on the type of content users are interacting with.

Leverage metadata to gain deeper insights into user interactions and tailor your LLM application accordingly.