Logging Evaluation Runs
Introduction
In the following example, we are going to walk through how to log your pipeline runs to HoneyHive for benchmarking and sharing. For a complete overview of evaluations in HoneyHive, you can refer to our Evaluations Overview page.
Set up HoneyHive and get your API key
If you haven’t already done so, then the first thing you will need to do is create a HoneyHive project.
After creating the project, you can find your API key in the Settings page under Account.
Once you have created a HoneyHive project and retrieved your API key, you can start evaluating your custom pipeline.
Install the SDK & authenticate
We currently support a native Python SDK. For other languages, we encourage using HTTP request libraries to send requests.
!pip install honeyhive -q
Authenticate honeyhive and import all the packages required to execute your pipeline. Make sure to import time for tracking latency.
For this quickstart tutorial, we will do a simple evaluation comparing 2 different gpt-3.5-turbo variants.
import honeyhive
import openai
import time
# import any other vector databases, APIs and other model providers you might need
honeyhive.api_key = "HONEYHIVE_API_KEY"
openai.api_key = "OPENAI_API_KEY"
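If you prefer not to hardcode keys in a notebook, here is a small sketch that reads them from environment variables instead (it assumes you have exported HONEYHIVE_API_KEY and OPENAI_API_KEY in your shell beforehand):

import os

# read the keys from the environment rather than pasting them inline
honeyhive.api_key = os.environ["HONEYHIVE_API_KEY"]
openai.api_key = os.environ["OPENAI_API_KEY"]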
Initialize Evaluation
Configure your evaluation by pointing it to the project you have in HoneyHive. HoneyHiveEvaluation accepts the following parameters:

- project: a necessary field which decides which project to log the evaluation to
- name: a necessary field which sets the name for the evaluation
- description: an optional field to describe what kind of evaluation you are attempting
- dataset_name: an optional field to set a name for the dataset you ran the evaluation over
- metrics: an optional field to specify which HoneyHive-configured metrics you want to run
This tutorial’s evaluation is set to try different max_tokens settings and compute an AI feedback function Conciseness and a custom metric Number of Characters.
honeyhive_eval = HoneyHiveEvaluation(
    project="Email Writer App",
    name="Max Tokens Comparison",
    description="Finding best max tokens for OpenAI chat models",
    dataset_name="Test",
    metrics=["Conciseness", "Number of Characters"]
)
See the Metrics guide for how to create a metric in HoneyHive via the UI.
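Both metrics above are configured in HoneyHive and computed by the platform after runs are logged. Purely as an illustration of what the custom metric measures, and not the exact interface the Metrics UI expects, Number of Characters boils down to counting the characters in the completion:

# illustrative sketch of the "Number of Characters" custom metric
def number_of_characters(completion: str) -> int:
    # count every character in the final string response
    return len(completion)

number_of_characters("Hi team, here is a quick update...")  # -> 34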
Prepare A Dataset & Configurations
Now that the evaluation is configured, we can set up our offline evaluation. Begin by fetching a dataset to evaluate over.
dataset = [
    {"topic": "Test", "tone": "test"},
    {"topic": "AI", "tone": "neutral"}
]

# in case you have a saved dataset in HoneyHive
from honeyhive.sdk.datasets import get_dataset
dataset = get_dataset("Email Writer Samples")
For evaluation configurations, we can optionally set any arbitrary field. However, there are some reserved fields for the platform:
- name, version: these are the reserved fields to render the display name for this configuration
- model, provider, hyperparameters: these are reserved to render the correct model in the UI to allow others to fork their variants
- prompt_template, chat: similar to the previous fields, these are needed as well for consistent forking logic
config = {
    "name": "max_tokens_100",
    "model": "gpt-3.5-turbo",
    "provider": "openai",
    "hyperparameters": {
        "temperature": 0.5,
        "max_tokens": 100
    },
    "chat": [
        {
            "role": "system",
            "content": "You are a helpful assistant who helps people write emails.",
        },
        {
            "role": "user",
            "content": "Topic: {{topic}}\n\nTone: {{tone}}."
        }
    ]
}
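At run time, the {{topic}} and {{tone}} placeholders in the chat template are filled in from each datapoint; we use honeyhive.utils.fill_chat_template for this in the next step. As a rough, standalone illustration of that substitution (not the SDK's implementation):

import copy

# illustrative only: roughly what filling the chat template does
def fill_template_sketch(chat, data):
    filled = copy.deepcopy(chat)
    for message in filled:
        for field, value in data.items():
            message["content"] = message["content"].replace("{{" + field + "}}", str(value))
    return filled

fill_template_sketch(config["chat"], {"topic": "AI", "tone": "neutral"})
# the user message content becomes "Topic: AI\n\nTone: neutral."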
Run and Log Evaluation
We log our evaluation runs via the log_run function on the HoneyHiveEvaluation object. log_run accepts one input-output pair at a time and takes 4 parameters:

- config: a necessary field specifying the configuration dictionary for that run
- input: a necessary field specifying the datapoint dictionary for that run
- completion: a necessary field specifying the final string response at the end of the pipeline
- metrics: an optional field specifying any recorded metrics of interest for the run (e.g. cost, latency, etc.)
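Before looking at the parallelized version below, here is a minimal sequential sketch of the same loop; it reuses the config, dataset, and honeyhive_eval objects defined above and only records latency:

# sequential sketch: one OpenAI call and one log_run per datapoint
for data in dataset:
    messages = honeyhive.utils.fill_chat_template(config["chat"], data)

    start = time.time()
    openai_response = openai.ChatCompletion.create(
        model=config["model"],
        messages=messages,
        **config["hyperparameters"]
    )
    end = time.time()

    honeyhive_eval.log_run(
        config=config,
        input=data,
        completion=openai_response.choices[0].message.content,
        metrics={"latency": (end - start) * 1000}
    )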
# parallelized version of the evaluation run code
import concurrent.futures

def parallel_task(data, config):
    # fill the {{topic}} and {{tone}} placeholders with the datapoint's fields
    messages = honeyhive.utils.fill_chat_template(config["chat"], data)

    start = time.time()
    openai_response = openai.ChatCompletion.create(
        model=config["model"],
        messages=messages,
        **config["hyperparameters"]
    )
    end = time.time()

    honeyhive_eval.log_run(
        config=config,
        input=data,
        completion=openai_response.choices[0].message.content,
        metrics={
            "cost": honeyhive.utils.calculate_openai_cost(
                config["model"], openai_response.usage
            ),
            "latency": (end - start) * 1000,  # milliseconds
            "response_length": openai_response.usage["completion_tokens"],
            **openai_response["usage"]
        }
    )

# configuration variants to evaluate; we start with the single config defined above
configs = [config]

with concurrent.futures.ThreadPoolExecutor() as executor:
    futures = []
    for data in dataset:
        for config in configs:
            futures.append(executor.submit(parallel_task, data, config))
    for future in concurrent.futures.as_completed(futures):
        # do any further processing on the results if required
        pass
Run more configurations
Now we can tweak the config and re-run our pipeline while logging the new runs via log_run.
config = {
    # same configuration as above here
    "name": "max_tokens_400",  # give this variant its own display name
    "hyperparameters": {"temperature": 0.5, "max_tokens": 400},
}

# identical Evaluation Run code as above
for data in dataset:
    ...
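If you keep the original configuration around under its own variable, you can also derive the new variant with a shallow copy instead of retyping every field (a sketch; config_100 and config_400 are illustrative names):

# derive the second variant from the first instead of retyping it
config_400 = {
    **config_100,  # assumes the original config was kept as config_100
    "name": "max_tokens_400",
    "hyperparameters": {"temperature": 0.5, "max_tokens": 400},
}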
Log Comments & Publish
Finally, we can log any comments from what we have seen in the logs or while iterating on the pipeline and publish the evaluation.
honeyhive_eval.log_comment("Results are decent")
honeyhive_eval.finish()
Share & Collaborate
After running the evaluation, you will receive a URL taking you to the evaluation interface in the HoneyHive platform.
From there, you can share it with other members of your team via email or by directly sharing the link.
Sharing Evaluations: how to collaborate over an evaluation in HoneyHive.
From these discussions, we can gather more insights and iteratively run more evaluations until we are ready to go to production!
Up Next
Log Requests & Feedback
How to quickly set up logging with HoneyHive.
Create Evaluation Metrics
How to set up metrics and run evaluations in HoneyHive.
API Reference Guide
Our reference guide on how to integrate the HoneyHive SDK and APIs with your application.
Prompt Engineering and Fine-Tuning Guides
Guides for prompt engineering and fine-tuning your models.