Building Queries
Learn how to create custom queries to slice and dice your data.
Our data model enables you to analyze your traces independently from specific steps and chains in your LLM pipeline. This allows you to monitor specific parts of your pipeline (e.g., your vector database step) independently and calculate metrics such as `Median User Rating per Session` or `P99 Retrieval Latency`, providing much more detailed and granular monitoring.
We support three core chart types:

- Session charts: Help you observe how users interact with your app over the course of a session and monitor key metrics such as Avg Number of User Turns, Avg Session Duration, Median User Rating per Session, and more.
- Completion charts: Help you monitor all LLM requests, including key metrics like cost, latency, token usage, API errors, and any specific evaluators you may have defined (e.g., Keyword Assertions, Answer Faithfulness, JSON Validation, etc.).
- Event charts: Help you monitor specific chains or tool events of interest. Examples include independently monitoring reranking and synthesis steps in a RAG pipeline, monitoring Context Relevance across retrieved chunks to validate data quality, and more.
Customers building complex agents and RAG pipelines can optimize not just the prompt or their model, but also subcomponents in their pipelines such as their chunking strategy, retrieval architecture, tool use, and more.
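To make step-level analysis concrete, here is a minimal Python sketch that computes Median User Rating per Session and P99 Retrieval Latency from hypothetical flattened event records. The field names (`session_id`, `event_name`, `duration_ms`, `user_rating`) are illustrative assumptions, not a documented HoneyHive schema:

```python
from statistics import median

# Hypothetical flattened trace events, as might be exported from a
# tracing backend. Field names are illustrative only.
events = [
    {"session_id": "s1", "event_name": "retrieval", "duration_ms": 120},
    {"session_id": "s1", "event_name": "completion", "duration_ms": 900},
    {"session_id": "s1", "event_name": "session_end", "user_rating": 4},
    {"session_id": "s2", "event_name": "retrieval", "duration_ms": 310},
    {"session_id": "s2", "event_name": "session_end", "user_rating": 5},
]

def p99(values):
    """Nearest-rank 99th percentile (one common convention)."""
    ordered = sorted(values)
    index = max(0, round(0.99 * len(ordered)) - 1)
    return ordered[index]

# Median User Rating per Session: aggregate only session-level feedback.
ratings = [e["user_rating"] for e in events if "user_rating" in e]
median_rating = median(ratings)  # 4.5

# P99 Retrieval Latency: aggregate only the retrieval step,
# independently of the rest of the pipeline.
retrieval_latency = p99(
    [e["duration_ms"] for e in events if e["event_name"] == "retrieval"]
)  # 310
```

The key point is that each metric slices the same event stream by a different step, which is exactly what step-independent monitoring enables.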
Functionalities
- Real-time Observation: Log in to HoneyHive to observe your LLM application’s performance metrics in real-time. The dashboard provides an intuitive interface to visualize various metrics and their trends.
- Metric Definition: Define the specific metric you want to visualize. HoneyHive supports standard out-of-the-box metrics, custom metrics, and user feedback. Standard metrics include `Request Volume`, `Cost`, and `Latency`. Any custom metrics that you previously defined and enabled in production can also be visualized here. User feedback is aggregated based on its return type. For example, you can select `Accepted` to track the percentage of requests that were accepted by end-users.
- Aggregation Functions: Choose the aggregation function that best suits your analysis. Common functions include `Average`, `Sum`, `Percentage True`, `99th Percentile`, and more. Selecting the right aggregation function helps you distill complex data into meaningful insights. HoneyHive automatically provides different aggregation functions for `boolean` and `float` return-type metrics.
- Data Filtering and Comparison: Utilize the power of segmentation by using `filter` and `group by`. This allows you to focus on specific data slices based on user properties, custom metadata, or other relevant criteria. For example, you can filter by `user_country` or `subscription_tier` to perform cohort-level analysis. Any user properties or custom metadata can be found here.
Example: Monitoring TPS across multiple models
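One way to sketch this example in plain Python: group hypothetical completion records by model and compute tokens per second (TPS) for each. The record fields and model names are illustrative, not a HoneyHive export format:

```python
from collections import defaultdict

# Hypothetical completion records; fields are illustrative only.
completions = [
    {"model": "gpt-4o", "completion_tokens": 150, "duration_s": 3.0},
    {"model": "gpt-4o", "completion_tokens": 90, "duration_s": 1.5},
    {"model": "claude-3-5-sonnet", "completion_tokens": 200, "duration_s": 2.5},
]

# group by: model
tokens = defaultdict(float)
seconds = defaultdict(float)
for c in completions:
    tokens[c["model"]] += c["completion_tokens"]
    seconds[c["model"]] += c["duration_s"]

# TPS per model = total completion tokens / total generation time
tps_by_model = {m: tokens[m] / seconds[m] for m in tokens}
```

In the dashboard this corresponds to a Completion chart with a token-throughput metric, grouped by the model property.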
Metric Types
HoneyHive supports various metric types for monitoring your LLM application’s performance:
- Usage Metrics: These include `Request Volume`, which measures the number of requests your application receives; `Cost`, which evaluates the expenses associated with running the application; and `Duration`, which assesses the span or trace duration and indicates system-level latency.
- Evaluators: All evaluators (Python, LLM, or Human) that you have defined and enabled in production can be analyzed here. Evaluators can only be used for measuring if the return type is set as `float` or `boolean`; any evaluators with a `string` return type can only be used to group or filter charts.
- User Feedback: You can visualize any user feedback fields that you have captured from your users in staging/production (as long as the return type is `Float` or `Boolean`) to analyze performance and user satisfaction.

Set up LLM evaluators to analyze your production data: How to quickly set up LLM evaluators in HoneyHive.

Set up Python evaluators to analyze your production data: How to quickly set up Python evaluators in HoneyHive.
Additional Metadata
User Properties
User properties provide valuable insights into user behavior and preferences. Common examples include:
- `user_ID`: A unique identifier for each user, helping you track individual user interactions.
- `user_country`: Allows you to analyze how different regions interact with your application.
- `subscription_tier`: Helps you understand the behavior of different user segments based on their subscription level.
Utilize these properties to perform cohort-level analysis, identifying trends and patterns among specific user groups.
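A minimal sketch of cohort-level analysis, filtering by one user property and grouping by another. The records and field names are illustrative, not an exported HoneyHive dataset:

```python
from collections import defaultdict
from statistics import mean

# Illustrative request logs carrying user properties.
requests = [
    {"subscription_tier": "free", "user_country": "US", "latency_ms": 420},
    {"subscription_tier": "pro", "user_country": "US", "latency_ms": 180},
    {"subscription_tier": "pro", "user_country": "DE", "latency_ms": 210},
    {"subscription_tier": "free", "user_country": "DE", "latency_ms": 390},
]

# filter: keep only the "pro" cohort...
pro = [r for r in requests if r["subscription_tier"] == "pro"]

# ...then group by: user_country
by_country = defaultdict(list)
for r in pro:
    by_country[r["user_country"]].append(r["latency_ms"])

avg_latency = {country: mean(vals) for country, vals in by_country.items()}
```

The same filter-then-group-by shape underlies any cohort comparison you build in the chart editor.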
Custom Metadata
Metadata offers flexibility in capturing additional information about user interactions. Arbitrary key-value data can be attached to logged requests. Common examples include:
- Custom Tags: Tag requests with identifiers that hold significance within your application.
- Session Duration: Track how long users engage with your LLM application.
- Content Type: Categorize requests based on the type of content users are interacting with.
Leverage metadata to gain deeper insights into user interactions and tailor your LLM application accordingly.
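For illustration, here is a hypothetical log record with a metadata dict covering the examples above. None of the field names are a specific HoneyHive SDK schema; the point is that any metadata key later becomes available for chart filters and group-bys:

```python
# A hypothetical log record with arbitrary metadata attached at request
# time. Field names are illustrative only.
record = {
    "inputs": {"question": "What is our refund policy?"},
    "output": "Refunds are issued within 14 days...",
    "metadata": {
        "experiment_tag": "rag-v2",    # custom tag
        "session_duration_s": 312,     # session duration
        "content_type": "policy_doc",  # content type
    },
}

records = [record]

# Later, any metadata key can drive a filter or group-by:
policy_requests = [
    r for r in records if r["metadata"].get("content_type") == "policy_doc"
]
```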