Running evaluations is a natural extension of HoneyHive's tracing capabilities. We recommend going through the tracing quickstart before proceeding with this guide.

What is an Evaluation Run?

An evaluation run in HoneyHive is a group of related sessions that share a common metadata.run_id field.

When an evaluation run is linked to a dataset, it can be compared against other runs on the same dataset.

This flexibility lets you compare the performance of your application across any configuration dimension you want to test: models, chunking strategies, vector databases, prompts, and so on.
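Conceptually, the grouping looks like the sketch below: every session logged as part of a run shares the same metadata.run_id value. All field values other than the shared run_id are illustrative.

```python
# Illustrative only: two sessions that belong to the same evaluation run
# because they share metadata.run_id. Other fields are example values.
sessions = [
    {
        "session_id": "session-001",
        "inputs": {"query": "What is our refund policy?"},
        "outputs": {"answer": "Refunds are issued within 30 days."},
        "metadata": {"run_id": "run-2024-06-01"},
    },
    {
        "session_id": "session-002",
        "inputs": {"query": "How do I reset my password?"},
        "outputs": {"answer": "Use the 'Forgot password' link on the login page."},
        "metadata": {"run_id": "run-2024-06-01"},
    },
]
```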

Running an evaluation

Prerequisites

  • You have already created a project in HoneyHive, as explained here.
  • You have an API key for your project, as explained here.
  • You have instrumented your application with the HoneyHive SDK, as explained here.

Expected Time: 5 minutes

Steps

Step 1: Set up evaluators

Evaluators can either run client-side, before events are logged, or be processed server-side after ingestion. For this guide, set up server-side evaluators that score the sessions and events produced by the evaluation run.
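Server-side evaluators are configured in the HoneyHive console and run automatically on ingested events. If you also want scoring logic in code, a client-side evaluator is simply a function that scores each datapoint. The sketch below is illustrative; its signature is an assumption, so check the SDK reference for the exact interface your SDK version expects.

```python
# Illustrative client-side evaluator that scores one datapoint's outputs.
# The (outputs, inputs) signature is an assumption; consult the HoneyHive
# SDK reference for the exact interface expected by your SDK version.
def answer_length_evaluator(outputs, inputs):
    """Return simple metrics describing the generated answer."""
    answer = (outputs or {}).get("answer", "")
    return {
        "answer_present": bool(answer),  # did the flow return an answer?
        "answer_length": len(answer),    # rough proxy for completeness
    }
```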

Step 2: Set up input data
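For a quick test, the input data can be a small in-memory list of datapoints, where each datapoint is the object your flow will receive. You can also link a dataset managed in HoneyHive so that runs on the same dataset can be compared against each other. The fields below are illustrative:

```python
# Illustrative input data: one dict per datapoint. Each dict becomes the
# input object passed to the function under evaluation.
dataset = [
    {"query": "What is our refund policy?", "product": "starter-plan"},
    {"query": "How do I cancel my subscription?", "product": "pro-plan"},
    {"query": "Do you offer student discounts?", "product": "starter-plan"},
]
```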

Step 3: Create the flow you want to evaluate

The function to evaluate should accept an input parameter. This parameter is an object that maps to each datapoint (or JSON object) passed to the evaluation.

The value returned by the function maps to the outputs field of each session in the evaluation.
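A minimal sketch of such a function is shown below. It accepts one datapoint as input and returns the value recorded as the session's outputs; the body is a placeholder for your real, HoneyHive-instrumented pipeline.

```python
# Illustrative flow under evaluation. `input` is one datapoint from the
# dataset above; the return value becomes the session's outputs field.
def answer_question(input):
    query = input["query"]

    # Placeholder for your actual instrumented pipeline
    # (retrieval, prompt construction, model call, post-processing, ...).
    answer = f"Stub answer for: {query}"

    return {"answer": answer}
```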

Step 4: Run the evaluation
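The sketch below ties the previous steps together using the SDK's evaluation entry point. The import and parameter names (evaluate, function, dataset, evaluators, api_key, project, name) are assumptions for illustration; verify the exact signature against the HoneyHive SDK reference for your language and version.

```python
import os

# Assumption: the HoneyHive SDK exposes an evaluation entry point that ties
# together the flow (step 3), the input data (step 2), and any client-side
# evaluators (step 1). Parameter names here are illustrative, not definitive.
from honeyhive import evaluate

evaluate(
    function=answer_question,              # the flow from step 3
    dataset=dataset,                        # the input data from step 2
    evaluators=[answer_length_evaluator],   # client-side evaluators, if any
    api_key=os.environ["HH_API_KEY"],       # your project API key
    project="my-project",                   # your HoneyHive project name
    name="quickstart-evaluation",           # label for this evaluation run
)
```

Each datapoint produces a session grouped under a new run_id, which is what the comparisons in the dashboard view described below are built on.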

Dashboard View

After the run completes, review the results in your HoneyHive dashboard to gain insight into your application's performance across different inputs. The dashboard provides a comprehensive view of the evaluation results and lets you compare performance across multiple runs.

Conclusion

By following these steps, you can set up and run evaluations with HoneyHive. This allows you to systematically test your LLM-based systems across various scenarios and collect performance data for analysis.

Sample files