HoneyHive: A Structured Evaluation Framework

HoneyHive introduces a structured evaluation framework that applies software engineering principles to LLM app development, facilitating structured evaluations analogous to unit tests and regression tests in traditional software.

You define input prompts and expected outputs, then evaluate model responses against your own custom metrics and AI feedback functions, allowing you to curate test suites specific to each use case.

How do we manage evaluation data?

There are two key components of evaluations:

  1. Evaluation Run - a group of sessions or events recorded with source evaluation and a unique run_id
  2. Comparison - a group of evaluation runs whose sessions/events are compared against each other by a shared datapoint_id
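The relationship between these identifiers can be sketched as follows. The record shapes and field names here (`run_id`, `datapoint_id`, `session_id`) mirror the terms above but are illustrative, not HoneyHive's actual API schema:

```python
from collections import defaultdict

# Illustrative event records; field names mirror the concepts above,
# not HoneyHive's actual export format.
events = [
    {"session_id": "s1", "run_id": "run-a", "datapoint_id": "dp-1"},
    {"session_id": "s2", "run_id": "run-a", "datapoint_id": "dp-2"},
    {"session_id": "s3", "run_id": "run-b", "datapoint_id": "dp-1"},
]

# An evaluation run groups sessions/events by run_id.
runs = defaultdict(list)
for e in events:
    runs[e["run_id"]].append(e["session_id"])

# A comparison lines up runs against each other by datapoint_id.
comparison = defaultdict(list)
for e in events:
    comparison[e["datapoint_id"]].append((e["run_id"], e["session_id"]))

print(runs["run-a"])       # all sessions recorded under run-a
print(comparison["dp-1"])  # run-a vs run-b outputs for the same datapoint
```

Grouping by `run_id` answers "how did this version do overall?", while grouping by `datapoint_id` answers "how did two versions do on the same input?".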

Evaluation Run

An evaluation run groups sessions or events together based on a common run_id.

In our interface, we summarize the metrics present on the session and all of its children.

In this interface, you can apply different aggregation functions over the metrics, filter for particular sessions, and step into the trace view for each run.
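As a rough sketch of what such aggregation looks like under the hood (the metric values and the set of aggregation functions below are illustrative):

```python
import statistics

# Illustrative per-session values for one metric (e.g. latency in ms)
# across an evaluation run.
latencies_ms = [120, 95, 310, 140, 88]

# Different aggregation functions applied over the same metric.
aggregations = {
    "mean": statistics.mean(latencies_ms),
    "median": statistics.median(latencies_ms),
    "max": max(latencies_ms),
}
print(aggregations)
```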


Comparison

A comparison groups evaluation runs together based on a common datapoint_id.

In this interface, you can compare outputs side by side, as well as annotate the runs with feedback and ground truth.

Our current comparison view is being completely revamped, so expect a lot of changes in the coming weeks.

How can you use evaluations in your workflow?

We have seen evaluations used in our platform in the following ways:

Unit Testing

Evaluations can be used to test individual components of your LLM app, such as a single model or prompt, using simple metrics. For example, you can:

  • Assert the model doesn’t say “As an AI language model” or other canned refusal phrases
  • Calculate readability scores such as coherence, clarity and others to ensure the model is producing readable text
  • Apply output structure checks such as JSON validation to ensure the model is producing the expected output structure
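Two of the checks above can be sketched as plain Python functions; the refusal phrase list is illustrative, and in practice these would run as metrics over your evaluation outputs:

```python
import json

# Illustrative phrase list; extend with whatever refusals matter for your app.
REFUSAL_PHRASES = ["as an ai language model", "i cannot help with"]

def check_no_refusal(output: str) -> bool:
    """Assert the output contains no canned refusal phrases."""
    lowered = output.lower()
    return not any(phrase in lowered for phrase in REFUSAL_PHRASES)

def check_valid_json(output: str) -> bool:
    """Output structure check: does the output parse as JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

print(check_no_refusal("Here is the summary you asked for."))  # True
print(check_valid_json('{"summary": "ok"}'))                   # True
```

Each function returns a boolean, which makes it easy to assert on per-datapoint, exactly like a unit test.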

Regression Testing

Evaluations can be used to test the overall performance of your LLM app. This can be done by running evaluations on a regular basis, such as daily or weekly, and comparing the results to a baseline. This can help you identify issues such as:

  • Changes in model performance over user queries it is expected to handle
  • Changes in vector database performance over user queries it is expected to handle
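A minimal version of comparing a run against a baseline might look like this; the function name and the 5% tolerance are illustrative choices, not part of HoneyHive's API:

```python
def regression_check(baseline_scores, new_scores, tolerance=0.05):
    """Flag a regression if the new run's average metric drops more
    than `tolerance` (relative) below the baseline run's average."""
    baseline_avg = sum(baseline_scores) / len(baseline_scores)
    new_avg = sum(new_scores) / len(new_scores)
    relative_drop = (baseline_avg - new_avg) / baseline_avg
    return relative_drop <= tolerance, baseline_avg, new_avg

# Per-datapoint quality scores from a baseline run and a new run.
ok, old_avg, new_avg = regression_check([0.9, 0.8, 0.85], [0.88, 0.79, 0.84])
print(ok)  # True: the new run is within 5% of the baseline
```

Run on a schedule, a check like this turns an evaluation run into an automated gate rather than a manual review.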

Quality Assurance

By sharing evaluations with domain experts in your team, you can get feedback on the quality of your LLM app. This can help you with:

  • Identifying and addressing bias in your LLM app
  • Validating vector database retrieval results for domain specific applications

Dataset Curation

Our evaluation UX also lets you add ground truth annotations and feedback. This helps teams with:

  • Collecting ground truth for fine-tuning your LLMs
  • Improving pre-existing evaluation datasets with more ground truth annotations
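As a sketch of the fine-tuning use case, annotated datapoints can be filtered down to those with ground truth and exported as chat-style JSONL. The record fields here are illustrative, not HoneyHive's export schema:

```python
import json

# Illustrative annotated evaluation records.
annotated = [
    {"input": "Summarize the report", "ground_truth": "A concise summary."},
    {"input": "Translate the memo", "ground_truth": None},  # not yet annotated
]

# Keep only datapoints with ground truth; emit chat-style fine-tuning rows.
finetune_rows = [
    {
        "messages": [
            {"role": "user", "content": r["input"]},
            {"role": "assistant", "content": r["ground_truth"]},
        ]
    }
    for r in annotated
    if r["ground_truth"]
]

jsonl = "\n".join(json.dumps(row) for row in finetune_rows)
print(jsonl)
```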

Collaborative Learning and Knowledge Sharing

HoneyHive fosters collaborative learning among development teams. You can easily share evaluation results, insights, and learnings, promoting collective improvement of LLM apps.

Getting Started

To start using HoneyHive for evaluating your LLM apps, refer to the following resources: