Logging Evaluations via SDK
Get API key
After signing up on the app, you can find your API key in the Settings page under Account.
Install the SDK
pip install honeyhive -q
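Before initializing a run, import the SDK and authenticate with your API key. The snippet below is a minimal sketch: it assumes the package is imported as hh (matching the calls later in this guide) and that the key can be set via an api_key attribute read from a HONEYHIVE_API_KEY environment variable; check the SDK reference for the exact authentication mechanism your version uses.
import os
import honeyhive as hh

# Assumption: the key is exposed through an api_key attribute and stored in
# a HONEYHIVE_API_KEY environment variable; your SDK version may differ.
hh.api_key = os.environ["HONEYHIVE_API_KEY"]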
Initialize your evaluation run
Let’s start by running a quick experiment to test whether including certain instructions in the prompt actually affects outputs across two different prompt variants. For this example, we’ll compare a prompt with explicit instructions to output concise sales emails against another prompt with no such instructions. We’ll evaluate the results using the Conciseness and Hallucination metrics defined out-of-the-box in HoneyHive.
To get started, initialize your evaluation run.
# Your HoneyHive project and name are required fields in this function.
eval = hh.evaluations.init(
    project="Sandbox - Email Writer",
    name="Does saying concise improve conciseness",
    description="Tests whether prompt engineering can help make outputs more concise.",
    dataset_name="Test cases for sales emails",
    # Let's evaluate these prompts against HoneyHive's Conciseness and Hallucination score metrics
    metrics=["Conciseness", "Hallucination"]
)
Define test cases and pipeline configuration
Now that you’ve initialized the evaluation run, define your test cases and prompt-model configurations for this evaluation.
You can also add arbitrary fields such as chunk_size, chunk_overlap, vectorDB_index_name, etc. to your configs when running evaluations via the SDK. This can help you compare different vector databases, indexes, chunk sizes, and other variations in your LLM pipeline.
# Defining test cases
dataset = [
    {
        "topic": "HoneyHive's developer platform for evaluating and monitoring LLM apps",
        "tone": "Informative"
    },
    {
        "topic": "Using LLMOps tools to help Acme improve its Q&A chatbot",
        "tone": "Friendly"
    }
]
# Configs help you track different parameters across the versions you're evaluating.
# Required properties include model, provider, version, hyperparameters, and chat (for GPT 3.5 Turbo or 4) or prompt (for all other models)
configs = [
    {
        "model": "gpt-3.5-turbo",
        "provider": "openai",
        "version": "simple",
        "hyperparameters": {
            "temperature": 0.5,
            "max_tokens": 200
        },
        "chat": [
            {
                "role": "system",
                "content": "You are a helpful assistant who helps SDRs write sales emails."
            },
            {
                "role": "user",
                "content": "Topic: {{topic}}\n\nTone: {{tone}}."
            }
        ],
        # You can also add arbitrary config fields to include any metadata from your LLM pipeline. Let's include a few examples.
        "vector_db": "pinecone",
        "chunk_size": "250 tokens",
        "chunk_overlap": "10%"
    },
    {
        "model": "gpt-3.5-turbo",
        "provider": "openai",
        "version": "concise-instructions",
        "hyperparameters": {
            "temperature": 0.5,
            "max_tokens": 200
        },
        "chat": [
            {
                "role": "system",
                "content": "You are a helpful assistant who helps SDRs write sales emails. Write concise and personalized sales emails with a clear value proposition and one call-to-action. Avoid jargon, use power words sparingly, and have a well-defined follow-up strategy in place."
            },
            {
                "role": "user",
                "content": "Topic: {{topic}}\n\nTone: {{tone}}."
            }
        ],
        # You can also add arbitrary config fields to include any metadata from your LLM pipeline. Let's include a few examples.
        "vector_db": "pinecone",
        "chunk_size": "500 tokens",
        "chunk_overlap": "15%"
    }
]
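Both configs use {{topic}} and {{tone}} placeholders that are filled from each test case. As a quick illustration (not part of the pipeline, which uses hh.utils.fill_template in the next step), here is a minimal sketch that renders the first config's user message for the first test case, assuming simple double-curly-brace substitution:
# Sketch only: preview how the placeholders resolve for the first test case.
# The actual pipeline below uses hh.utils.fill_template for this.
preview = configs[0]["chat"][1]["content"]
for key, value in dataset[0].items():
    preview = preview.replace("{{" + key + "}}", value)
print(preview)
# Topic: HoneyHive's developer platform for evaluating and monitoring LLM apps
#
# Tone: Informative.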
Run your LLM pipelines
Now that you’ve defined your test cases, run your LLM pipelines to start logging evaluation results.
import time
import copy
import openai  # assumes the openai<1.0 client and an OPENAI_API_KEY set in your environment

runs = []

# Run the pipeline configs against the test cases
for data in dataset:
    data_run = []
    for config in configs:
        # Fill the {{topic}} and {{tone}} placeholders with this test case's values
        copy_chat = copy.deepcopy(config['chat'])
        for chat in copy_chat:
            chat['content'] = hh.utils.fill_template(
                chat['content'],
                data
            )
        # Call the model and time the request
        start = time.time()
        openai_response = openai.ChatCompletion.create(
            model=config["model"],
            messages=copy_chat,
            **config["hyperparameters"]
        )
        end = time.time()
        latency = (end - start) * 1000  # milliseconds
        data_run.append({
            "completion": openai_response.choices[0].message.content,
            "cost": hh.utils.calculate_openai_cost(config["model"], openai_response.usage),
            "latency": latency,
            "response_length": openai_response.usage["completion_tokens"]
        })
    runs.append(data_run)
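The resulting runs object is a nested list: one inner list per test case, with one result dict per config. A quick sketch to sanity-check the structure before logging, using only the fields built above:
# Sketch: runs[i][j] holds the output of config j on test case i.
for i, data_run in enumerate(runs):
    for j, result in enumerate(data_run):
        print(
            f"test case {i} | config '{configs[j]['version']}' | "
            f"{result['latency']:.0f} ms | {result['response_length']} completion tokens"
        )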
Log evaluation results in HoneyHive
Now that you’ve run your test cases across the two different configs, let’s log your evaluation run in HoneyHive.
hh.evaluations.log(
    evaluation=eval,
    configs=configs,
    inputs=dataset,
    run=runs,
    comments=[
        "Using only two test cases in this example. We should assemble a larger dataset for the next run."
    ]
)
Analyze results and share learnings
Congrats! You’ve successfully run your first evaluation with HoneyHive! You can now view and start analyzing your evaluation results.
From the results, it looks like adding instructions to make the outputs more concise not only improves conciseness but also reduces hallucination.