This example walks through how to log your pipeline runs to HoneyHive for benchmarking and sharing. For a complete overview of evaluations in HoneyHive, refer to our Evaluations Overview page.

Set up HoneyHive and get your API key

If you haven’t already done so, first create a HoneyHive project.

After creating the project, you can find your API key on the Settings page under Account.

Once you have created a HoneyHive project and have your API key, you can start tracing your custom pipeline.

Install the SDK & authenticate

We currently support a native Python SDK. For other languages, we encourage using HTTP request libraries to send requests.

!pip install honeyhive -q

Import honeyhive and all the packages required to execute your pipeline, then set your API keys. Make sure to import time for tracking latency.

For this quickstart tutorial, we will run a simple evaluation comparing two gpt-3.5-turbo variants.

import honeyhive
import openai
import time

# import any other vector databases, APIs and other model providers you might need
honeyhive.api_key = "HONEYHIVE_API_KEY"
openai.api_key = "OPENAI_API_KEY"
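Hard-coding keys is fine for a quick test, but you may prefer to read them from environment variables. The following is a minimal sketch, not part of either SDK: read_api_key is a hypothetical helper, and the variable names HONEYHIVE_API_KEY and OPENAI_API_KEY are assumptions chosen for illustration.

```python
import os


def read_api_key(var_name: str) -> str:
    """Return the API key stored in the environment variable `var_name`.

    Hypothetical helper for illustration. It raises if the variable is
    unset or empty, so a missing key fails early instead of mid-evaluation.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {var_name} is not set")
    return key


# Usage (assumed variable names):
# honeyhive.api_key = read_api_key("HONEYHIVE_API_KEY")
# openai.api_key = read_api_key("OPENAI_API_KEY")
```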

Initialize Evaluation

Configure your evaluation by pointing it to the project you have in HoneyHive. HoneyHiveEvaluation accepts the following parameters:

  1. project: required; the project to log the evaluation to
  2. name: required; the name of the evaluation
  3. description: optional; a description of what the evaluation is testing
  4. dataset_name: optional; a name for the dataset the evaluation ran over
  5. metrics: optional; the HoneyHive-configured metrics you want to run

This tutorial’s evaluation compares different max_tokens settings and computes two metrics: Conciseness, an AI feedback function, and Number of Characters, a custom metric.

honeyhive_eval = HoneyHiveEvaluation(
    project="Email Writer App",
    name="Max Tokens Comparison",
    description="Finding best max tokens for OpenAI chat models",
    metrics=["Conciseness", "Number of Characters"]
)

Note on Metrics: In this function, you can only define metrics that are configured for your project within HoneyHive. These metrics will automatically be computed across your evaluation harness upon ingestion. Learn more about how to define metrics and guardrails in HoneyHive.


[Figure: How to create a metric in HoneyHive via the UI]

Prepare A Dataset & Configurations

Now that the evaluation is configured, we can set up our offline evaluation. Begin by fetching a dataset to evaluate over.

Try to pick data as close to your production distribution as possible.

dataset = [
    {"topic": "Test", "tone": "test"},
    {"topic": "AI", "tone": "neutral"}
]

# in case you have a saved dataset in HoneyHive
from honeyhive.sdk.datasets import get_dataset
dataset = get_dataset("Email Writer Samples")
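Because each datapoint is later used to fill the prompt template, it can help to check up front that every datapoint carries the fields you expect. The following is a minimal sketch, assuming the topic and tone fields from the example above; find_incomplete_datapoints is a hypothetical helper, not part of the HoneyHive SDK.

```python
def find_incomplete_datapoints(dataset, required_fields):
    """Return (index, missing_fields) pairs for datapoints lacking a required field."""
    problems = []
    for index, datapoint in enumerate(dataset):
        missing = sorted(set(required_fields) - set(datapoint))
        if missing:
            problems.append((index, missing))
    return problems


# Illustrative data: the second datapoint omits "tone" on purpose.
sample_dataset = [
    {"topic": "Test", "tone": "test"},
    {"topic": "AI"},
]
problems = find_incomplete_datapoints(sample_dataset, ["topic", "tone"])
print(problems)
```

An empty list means every datapoint has the required fields.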

For evaluation configurations, we can optionally set any arbitrary field. However, there are some reserved fields for the platform:

  1. name, version: these are the reserved fields to render the display name for this configuration
  2. model, provider, hyperparameters: these are reserved to render the correct model in the UI to allow others to fork their variants
  3. prompt_template, chat: similar to the previous fields, these are needed as well for consistent forking logic
config = {
    "name": "max_tokens_100",
    "model": "gpt-3.5-turbo",
    "provider": "openai",
    "hyperparameters": {
        "temperature": 0.5,
        "max_tokens": 100
    },
    "chat": [
        {
            "role": "system",
            "content": "You are a helpful assistant who helps people write emails."
        },
        {
            "role": "user",
            "content": "Topic: {{topic}}\n\nTone: {{tone}}."
        }
    ]
}

Run and Log Evaluation

We log our evaluation runs via the log_run function on the HoneyHiveEvaluation object. log_run accepts one input-output pair at a time.

log_run accepts 4 parameters:

  1. config: required; the configuration dictionary for that run
  2. input: required; the datapoint dictionary for that run
  3. completion: required; the final string response at the end of the pipeline
  4. metrics: optional; any recorded metrics of interest for the run (e.g. cost, latency)
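To make these four arguments concrete, the sketch below assembles them for a single run without calling any external service. Everything in it is illustrative: fill_chat_template_local is a hypothetical stand-in showing {{field}} substitution (the SDK's own helper may behave differently), fake_model is a stub in place of a real model call, and num_characters is a metric name made up for this example.

```python
import re
import time


def fill_chat_template_local(chat, datapoint):
    """Replace {{field}} placeholders in each message with datapoint values.

    Illustrative stand-in only; unknown placeholders are left untouched.
    """
    def substitute(text):
        return re.sub(
            r"\{\{\s*(\w+)\s*\}\}",
            lambda match: str(datapoint.get(match.group(1), match.group(0))),
            text,
        )

    return [
        {"role": message["role"], "content": substitute(message["content"])}
        for message in chat
    ]


def fake_model(messages):
    """Stub model: echoes the last message instead of calling an API."""
    return "Draft email for -> " + messages[-1]["content"]


def build_run(config, datapoint, model_fn):
    """Return the keyword arguments for one logged run."""
    messages = fill_chat_template_local(config["chat"], datapoint)
    start = time.perf_counter()
    completion = model_fn(messages)
    latency_ms = (time.perf_counter() - start) * 1000
    return {
        "config": config,
        "input": datapoint,
        "completion": completion,
        "metrics": {"latency": latency_ms, "num_characters": len(completion)},
    }


example_config = {
    "name": "max_tokens_100",
    "chat": [{"role": "user", "content": "Topic: {{topic}}\n\nTone: {{tone}}."}],
}
run = build_run(example_config, {"topic": "AI", "tone": "neutral"}, fake_model)
print(run["completion"])
```

Given the parameter names listed above, the resulting dictionary could then be passed along as honeyhive_eval.log_run(**run).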
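Parallelize your evaluation runs whenever possible; this can reduce total evaluation time by up to 10x.

# parallelized version of the evaluation run code
import concurrent.futures

# configurations to compare; add more entries to evaluate them in one pass
configs = [config]

def parallel_task(data, config):
    # fill the chat template with the fields of this datapoint
    messages = honeyhive.utils.fill_chat_template(config["chat"], data)

    # time the model call to record latency
    start = time.time()
    openai_response = openai.ChatCompletion.create(
        model=config["model"],
        messages=messages,
        **config["hyperparameters"]
    )
    end = time.time()

    # log one input-output pair with the parameters described above
    honeyhive_eval.log_run(
        config=config,
        input=data,
        completion=openai_response.choices[0].message.content,
        metrics={
            "cost": honeyhive.utils.calculate_openai_cost(
                config["model"], openai_response.usage
            ),
            "latency": (end - start) * 1000,  # milliseconds
            "response_length": openai_response.usage["completion_tokens"],
        }
    )

with concurrent.futures.ThreadPoolExecutor() as executor:
    futures = []
    for data in dataset:
        for config in configs:
            futures.append(executor.submit(parallel_task, data, config))
    for future in concurrent.futures.as_completed(futures):
        # Do any further processing if required on the results
        future.result()  # re-raises any exception from the worker thread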
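If a single request fails, you usually do not want to lose the rest of the evaluation. The sketch below shows one way to collect results and errors separately using only the standard library. The names run_grid and flaky_task are hypothetical, the stub task stands in for the real parallel_task, and the configuration names are placeholders.

```python
import concurrent.futures
import itertools


def run_grid(task, dataset, configs, max_workers=4):
    """Run task(data, config) for every pair and split successes from failures."""
    results, errors = [], []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its arguments so failures can be reported.
        future_to_args = {
            executor.submit(task, data, config): (data, config)
            for data, config in itertools.product(dataset, configs)
        }
        for future in concurrent.futures.as_completed(future_to_args):
            data, config = future_to_args[future]
            try:
                results.append(future.result())
            except Exception as exc:  # keep going if a single run fails
                errors.append((data, config["name"], repr(exc)))
    return results, errors


def flaky_task(data, config):
    """Stub task: fails for one topic to demonstrate error collection."""
    if data["topic"] == "Test":
        raise ValueError("simulated API failure")
    return (data["topic"], config["name"])


demo_dataset = [{"topic": "Test", "tone": "test"}, {"topic": "AI", "tone": "neutral"}]
demo_configs = [{"name": "max_tokens_100"}, {"name": "max_tokens_400"}]
results, errors = run_grid(flaky_task, demo_dataset, demo_configs)
print(len(results), "succeeded;", len(errors), "failed")
```

Capping max_workers is a deliberate choice here: it keeps the number of concurrent requests below whatever rate limit your model provider enforces.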

Run more configurations

Now we can tweak the config and re-run our pipeline while logging the new runs via log_run.

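config = {
    **config,  # same configuration as above
    "name": "max_tokens_400",  # assumed display name for this variant
    "hyperparameters": {"temperature": 0.5, "max_tokens": 400},
}

# identical Evaluation Run code as above
for data in dataset:
    parallel_task(data, config)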
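When comparing several settings, it may be convenient to derive each variant from a base configuration rather than editing it by hand. The following is a minimal sketch, assuming the max_tokens_<n> naming pattern used in this tutorial; make_max_tokens_variants is a hypothetical helper, not part of the HoneyHive SDK.

```python
import copy


def make_max_tokens_variants(base_config, max_tokens_values):
    """Return one deep-copied config per max_tokens value, leaving the base untouched."""
    variants = []
    for max_tokens in max_tokens_values:
        variant = copy.deepcopy(base_config)
        variant["name"] = f"max_tokens_{max_tokens}"
        variant["hyperparameters"]["max_tokens"] = max_tokens
        variants.append(variant)
    return variants


base_config = {
    "name": "max_tokens_100",
    "model": "gpt-3.5-turbo",
    "provider": "openai",
    "hyperparameters": {"temperature": 0.5, "max_tokens": 100},
}
configs = make_max_tokens_variants(base_config, [100, 400])
print([c["name"] for c in configs])
```

The deep copy matters because hyperparameters is a nested dictionary; a shallow copy would let every variant share, and overwrite, the same max_tokens value.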

Log Comments & Publish

Finally, we can log comments on what we observed in the logs or while iterating on the pipeline, and then publish the evaluation.

honeyhive_eval.log_comment("Results are decent")

Share & Collaborate

After running the evaluation, you will receive a URL that takes you to the evaluation interface in the HoneyHive platform.

From there, you can share it with other members of your team via email or by directly sharing the link.


[Figure: Sharing Evaluations. How to collaborate over an evaluation in HoneyHive]

These discussions surface further insights, which we can use to run more evaluations iteratively until we are ready to go to production.
