When building LLM apps, performance evaluation is a critical aspect of the development process. Fundamentally, evaluators are validation and testing functions that allow you to programmatically test the quality, performance, or integrity of a model's output without having to manually review it. Evaluators play a vital role in assessing the effectiveness of your apps across different scenarios and ensuring they meet the desired cost, safety, and performance criteria.
HoneyHive lets you define custom evaluators (Python or LLM evaluators) and set up guardrails tailored to your use case, which can be used to evaluate performance in an offline setting or when monitoring live data from production.
This guide will walk you through the process of defining custom evaluators and guardrails in HoneyHive, covering how evaluators work within the platform and how they can be used to evaluate end-to-end LLM pipelines.
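To make the idea concrete, here is a minimal sketch of what a custom Python evaluator can look like: a plain function that receives a model output and returns a score or pass/fail result. The function name, signature, and return shape below are illustrative assumptions, not HoneyHive's actual evaluator API.

```python
# Hypothetical custom evaluator: checks that a model response falls within
# an acceptable word range. The signature and return format are illustrative
# assumptions, not HoneyHive's actual interface.

def response_length_evaluator(output: str, min_words: int = 5, max_words: int = 200) -> dict:
    """Score a model response based on its word count."""
    word_count = len(output.split())
    passed = min_words <= word_count <= max_words
    return {
        "score": 1.0 if passed else 0.0,
        "passed": passed,
        "details": f"response contains {word_count} words",
    }

# Example: evaluating a model output programmatically, with no manual review.
result = response_length_evaluator("The capital of France is Paris.")
print(result["passed"])  # True for this six-word response
```

Because the evaluator is just a function of the output, the same logic can run in an offline evaluation harness or against live production traces.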
Why define custom evaluators and guardrails
- Automated Testing: LLM outputs are subjective, hard to interpret, and often require manual review. Evaluators let you scale your evaluation process by designing functions that programmatically evaluate model outputs, without investing in costly manual review and annotation workflows.
- Tailored Evaluation: Every LLM app is unique, and out-of-the-box evaluators may not fully capture the specific requirements of your use case. Custom evaluators allow you to design evaluations that are aligned with your app's objectives, enabling more relevant and meaningful assessments.
- Holistic Assessment: LLM apps often involve complex pipelines, and evaluating the end-to-end performance is crucial. Custom evaluators enable you to assess the entire pipeline, considering various stages of data processing and model interactions.
- Cost and Safety Considerations: In production environments, monitoring the cost of LLM app usage and ensuring safety are paramount. Custom evaluators and guardrails can help you keep track of resource consumption and set up safety mechanisms to prevent undesirable outcomes.
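The cost and safety point above can be sketched as a simple guardrail: a check that runs on a candidate response before it is returned or logged. The token budget, blocklist, and function shape here are hypothetical placeholders, not HoneyHive defaults.

```python
# Hypothetical guardrail combining a cost ceiling with a safety blocklist.
# All names and thresholds below are illustrative assumptions.

BLOCKED_TERMS = {"ssn", "credit card number"}   # hypothetical safety blocklist
MAX_COMPLETION_TOKENS = 512                     # hypothetical cost ceiling

def guardrail(output: str, completion_tokens: int) -> tuple[bool, str]:
    """Return (allowed, reason) for a candidate model response."""
    if completion_tokens > MAX_COMPLETION_TOKENS:
        return False, "cost: completion exceeds token budget"
    lowered = output.lower()
    for term in BLOCKED_TERMS:
        if term in lowered:
            return False, f"safety: response mentions '{term}'"
    return True, "ok"

allowed, reason = guardrail("Here is a summary of your document.", completion_tokens=42)
print(allowed, reason)  # True ok
```

In production monitoring, a failing guardrail result could trigger an alert or block the response, while in offline evaluation it simply contributes to the pass/fail statistics.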
To start defining your own evaluators and guardrails in HoneyHive, refer to the following resources: