Running Evaluations
Evaluate and benchmark your LLM apps programmatically with HoneyHive.
Introduction
In the following example, we are going to walk through how to log your pipeline runs to HoneyHive for benchmarking and sharing. For a complete overview of evaluations in HoneyHive, you can refer to our Evaluations Overview page.
For this quickstart tutorial, we will run a simple evaluation comparing two different gpt-3.5-turbo variants.
Warning
This is temporarily incompatible with Google Colab. Please run it in a .py file through the command line.
Define Evaluation Pipeline
To start, we need to define our pipeline. This function describes how each datapoint is run against a given configuration.
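As an illustration, a pipeline function might look like the sketch below. The function signature, the datapoint and config fields, and the OpenAI client usage are assumptions made for this example; adapt them to however your own pipeline consumes a datapoint and a model configuration.

```python
# Minimal pipeline sketch: run one datapoint against one model config.
# The datapoint/config shapes used here are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def pipeline(datapoint, config):
    # Fill the config's prompt template with fields from the datapoint
    prompt = config["prompt"].format(**datapoint)
    response = client.chat.completions.create(
        model=config["model"],             # e.g. "gpt-3.5-turbo"
        temperature=config["temperature"],
        messages=[{"role": "user", "content": prompt}],
    )
    # The completion text is what the evaluation will score and compare
    return response.choices[0].message.content
```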
Prepare A Dataset & Configurations
Now that the pipeline is defined, we can set up our offline evaluation. Begin by fetching a dataset to evaluate over, along with the configurations to compare.
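For instance, a small inline dataset and the two gpt-3.5-turbo variants being compared could be defined as follows. The field names and config keys are illustrative assumptions rather than a required schema; in practice you may fetch a saved dataset from the HoneyHive platform instead of defining one inline.

```python
# Illustrative dataset: each datapoint supplies the fields the prompt template expects.
dataset = [
    {"topic": "vector databases", "audience": "engineers"},
    {"topic": "prompt caching", "audience": "product managers"},
]

# Two gpt-3.5-turbo variants to benchmark against each other.
configs = [
    {
        "name": "concise-v1",
        "model": "gpt-3.5-turbo",
        "temperature": 0.2,
        "prompt": "Explain {topic} to {audience} in two sentences.",
    },
    {
        "name": "detailed-v1",
        "model": "gpt-3.5-turbo",
        "temperature": 0.7,
        "prompt": "Write a detailed overview of {topic} for {audience}.",
    },
]
```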
Running the Eval
Once you have instantiated your evaluation with honeyhive.eval(), execute it by calling the .using() method. Setting parallelize=True runs datapoints concurrently and can cut run time by roughly 10x.
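Putting the pieces together, kicking off the run might look like the sketch below. Only the honeyhive.eval() / .using() pattern and the parallelize flag come from this guide; the specific keyword arguments passed to each call are assumptions based on the pipeline, dataset, and configs defined above.

```python
import honeyhive

# Hedged sketch of launching the evaluation. The argument names below
# (name, pipeline, dataset, configs) are assumptions for illustration.
evaluation = honeyhive.eval(
    name="gpt-3.5-turbo prompt comparison",
    pipeline=pipeline,   # the function defined earlier
    dataset=dataset,
    configs=configs,
)

# parallelize=True fans datapoints out concurrently, which is where the
# roughly 10x time saving comes from. Whether the flag belongs here or on
# honeyhive.eval() itself is an assumption in this sketch.
evaluation.using(parallelize=True)
```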
Share & Collaborate
After running the evaluation, you will receive a URL that takes you to the evaluation interface in the HoneyHive platform.
From there, you can share it with other members of your team via email or by sharing the link directly.
Sharing Evaluations
How to collaborate on an evaluation in HoneyHive
From these discussions, we can gather more insights and then run further evaluations iteratively until we are ready to go to production.
Up Next
Log Requests & Feedback
How to quickly set up logging with HoneyHive.
Create Evaluation Metrics
How to set up metrics and run evaluations in HoneyHive.
API Reference Guide
Our reference guide on how to integrate the HoneyHive SDK and APIs with your application.
Prompt Engineering and Fine-Tuning Guides
Guides for prompt engineering and fine-tuning your models.