Why Evaluate

Developing production LLM apps comes with its own unique set of challenges. Here are some key ones to consider:

  1. Unpredictable Outputs: LLMs can produce different outputs for the same prompt, even when using the same temperature setting. Additionally, periodic changes in the underlying data and APIs can contribute to unpredictable results.
  2. Security: Production apps must be protected against prompt injection attacks and PII leakage, which requires precautions to prevent unauthorized manipulation of prompts and to safeguard user data.
  3. Bias: LLMs may contain inherent biases that can lead to unfair user experiences. It is crucial to identify and address these biases to ensure equitable outcomes for all users.
  4. Cost: Using state-of-the-art models can be expensive, particularly at scale. Evaluations help you select the right-sized model that meets your specific cost vs performance tradeoff.
  5. Latency: Real-time user experiences require fast response times. Evaluations help you strike a balance between latency and performance, enabling you to make informed decisions to help improve user experience.

Addressing these challenges requires rigorous testing and evaluation before shipping LLM apps to production. Evaluations uncover issues related to LLMs and provide valuable insights for making informed decisions, which can lead to alternative design choices, improved models or prompts, and other appropriate measures.

HoneyHive: A Structured Evaluation Framework

HoneyHive introduces a structured evaluation framework that applies software engineering principles to LLM app development, facilitating evaluations similar to the unit tests and regression tests you already rely on in traditional software projects.

You define input prompts and expected outputs, then evaluate model responses against your own custom metrics and AI feedback functions, allowing you to curate test suites specific to each use case.
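As a rough mental model of this pattern (the names below are hypothetical and do not reflect the actual HoneyHive SDK surface), a test suite is simply a set of test cases plus the metric functions you score responses with:

```python
# Illustrative sketch only: names are hypothetical, not the HoneyHive SDK API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestCase:
    inputs: dict          # values fed into the prompt template
    expected_output: str  # the output you expect for these inputs

# A custom metric is just a function scoring a model response for a test case.
def exact_match(response: str, case: TestCase) -> bool:
    return response.strip() == case.expected_output.strip()

def mentions_expected_facts(response: str, case: TestCase) -> bool:
    # Looser check: every word of the expected output appears in the response.
    return all(word.lower() in response.lower()
               for word in case.expected_output.split())

test_suite: list[TestCase] = [
    TestCase(inputs={"question": "What is our refund window?"},
             expected_output="30 days"),
]
metrics: list[Callable[[str, TestCase], bool]] = [exact_match, mentions_expected_facts]
```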

How do we manage evaluations data?

An evaluation in HoneyHive is a collection of pipeline runs with a summary computed over it.

  • results - a 2D array of results from each pipeline run, tracking the session ID for each test case ID and configuration ID pair (see the sketch after this list)
  • summary - aggregate metrics computed over the results, including pass/fail rates, averages, and other statistics
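
To make the shape concrete, here is a rough sketch of how such an evaluation record can be pictured; the field names and values are illustrative rather than the exact HoneyHive schema:

```python
# Simplified, hypothetical picture of an evaluation's data.
evaluation = {
    # results: one row per test case, one column per configuration,
    # each cell holding the session ID of that pipeline run.
    "results": [
        # config_a        config_b
        ["session_001",   "session_004"],   # test_case_1
        ["session_002",   "session_005"],   # test_case_2
        ["session_003",   "session_006"],   # test_case_3
    ],
    # summary: metrics aggregated over all runs.
    "summary": {
        "pass_rate": 0.83,
        "avg_latency_ms": 412,
        "avg_cost_usd": 0.0021,
    },
}
```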

In our UI, you can provide feedback and ground truth on runs, as well as write comments on the evaluation itself, making evaluation a collaborative effort across your team.

How can you use evaluations in your workflow?

We have seen evaluations used in our platform in the following ways:

Unit Testing

Evaluations can be used to test individual components of your LLM app, such as a single model or prompt, using simple metrics such as the following (a minimal sketch appears after this list):

  • Assert the model doesn’t say “As an AI language model” or other phrases that indicate it is refusing or failing to answer
  • Calculate readability scores such as coherence and clarity to ensure the model is producing readable text
  • Apply output structure checks such as JSON validation to ensure the model is producing the expected output structure
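
Several of these checks are simple enough to express directly as Python functions over the raw response string. The sketch below is a minimal illustration, not a specific HoneyHive API; the readability check in particular is a deliberately crude proxy:

```python
import json

REFUSAL_PHRASES = ["as an AI language model", "I cannot assist with"]

def no_refusal(response: str) -> bool:
    # Unit check: the model should not fall back to canned refusal phrases.
    lowered = response.lower()
    return not any(phrase.lower() in lowered for phrase in REFUSAL_PHRASES)

def readable(response: str, max_avg_sentence_len: int = 30) -> bool:
    # Crude readability proxy: average sentence length in words.
    sentences = [s for s in response.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    if not sentences:
        return False
    avg_len = sum(len(s.split()) for s in sentences) / len(sentences)
    return avg_len <= max_avg_sentence_len

def valid_json(response: str) -> bool:
    # Structure check: the output must parse as JSON.
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False
```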

Regression Testing

Evaluations can also be used to test the overall performance of your LLM app by running them on a regular basis, such as daily or weekly, and comparing the results to a baseline (see the sketch after this list). This helps you identify issues such as:

  • Regressions in model performance on the user queries it is expected to handle
  • Regressions in vector database retrieval performance on those same queries
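
As a rough illustration, a scheduled regression check can compare the latest evaluation summary against a stored baseline with some tolerance; the summary fields below are hypothetical, mirroring the sketch shown earlier:

```python
# Hypothetical summaries from a baseline evaluation and the latest run.
baseline = {"pass_rate": 0.90, "avg_latency_ms": 450}
current = {"pass_rate": 0.84, "avg_latency_ms": 610}

def detect_regressions(baseline: dict, current: dict,
                       pass_rate_tolerance: float = 0.02,
                       latency_tolerance_ms: float = 100) -> list[str]:
    # Flag metrics that moved beyond the allowed tolerance.
    issues = []
    if current["pass_rate"] < baseline["pass_rate"] - pass_rate_tolerance:
        issues.append(f"pass rate dropped: {baseline['pass_rate']:.2f} -> {current['pass_rate']:.2f}")
    if current["avg_latency_ms"] > baseline["avg_latency_ms"] + latency_tolerance_ms:
        issues.append(f"latency regressed: {baseline['avg_latency_ms']}ms -> {current['avg_latency_ms']}ms")
    return issues

for issue in detect_regressions(baseline, current):
    print("REGRESSION:", issue)
```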

Quality Assurance

By sharing evaluations with domain experts in your team, you can get feedback on the quality of your LLM app. This can help you with:

  • Identifying and addressing bias in your LLM app
  • Validating vector database retrieval results for domain specific applications

Dataset Curation

Our evaluation UX also lets you provide ground truth annotations and feedback. This helps teams with the following (a small export sketch follows this list):

  • Collecting ground truth for fine-tuning your LLMs
  • Improving pre-existing evaluation datasets with more ground truth annotations
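
For example, annotated runs exported from an evaluation can be converted into a fine-tuning dataset; the export format below is hypothetical and shown only to illustrate the workflow, not the exact HoneyHive export schema:

```python
import json

# Hypothetical annotated runs exported from an evaluation.
annotated_runs = [
    {"prompt": "Summarize the ticket: ...",
     "model_output": "Customer requests a refund.",
     "ground_truth": "Customer is asking for a refund on order #1234."},
]

# Convert annotated runs into a JSONL fine-tuning dataset,
# using the human-provided ground truth as the target completion.
with open("finetune_dataset.jsonl", "w") as f:
    for run in annotated_runs:
        record = {"prompt": run["prompt"], "completion": run["ground_truth"]}
        f.write(json.dumps(record) + "\n")
```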

Collaborative Learning and Knowledge Sharing

HoneyHive fosters collaborative learning among development teams. You can easily share evaluation results, insights, and learnings, promoting collective improvement of LLM apps.

Getting Started

To start using HoneyHive for evaluating your LLM apps, refer to the following resources: