HoneyHive Docs

Running experiments is a natural extension of HoneyHive's tracing capabilities. We recommend going through the tracing quickstart before proceeding with this guide.

Full code

Here’s a minimal example to get you started with experiments in HoneyHive:

Sample eval script

from honeyhive import evaluate, evaluator
import os
from openai import OpenAI
import random

openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Create the function to be evaluated
# inputs -> dictionary taken from the "inputs" field of each datapoint
# (optional) ground_truths -> dictionary taken from the "ground_truths" field of the same datapoint
def function_to_evaluate(inputs, ground_truths):
    completion = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"You are an expert analyst specializing in {inputs['product_type']} market trends."},
            {"role": "user", "content": f"Could you provide an analysis of the current market performance and consumer reception of {inputs['product_type']} in {inputs['region']}? Please include any notable trends or challenges specific to this region."}
        ]
    )

    # Output -> session output
    return completion.choices[0].message.content

dataset = [
    {
        "inputs": {
            "product_type": "electric vehicles",
            "region": "western europe",
            "time_period": "first half of 2023",
            "metric_1": "total revenue",
            "metric_2": "market share"
        },
        "ground_truths": {
            "response": "As of 2023, the electric vehicle (EV) market in Western Europe is experiencing significant growth, with the region maintaining its status as a global leader in EV adoption. [continue...]",
        }
    },
    {
        "inputs": {
            "product_type": "gaming consoles",
            "region": "north america",
            "time_period": "holiday season 2022",
            "metric_1": "units sold",
            "metric_2": "gross profit margin"
        },
        "ground_truths": {
            "response": "As of 2023, the gaming console market in North America is characterized by intense competition, steady consumer demand, and evolving trends influenced by technological advancements and changing consumer preferences. [continue...]",
        }
    },
    {
        "inputs": {
            "product_type": "smart home devices",
            "region": "australia and new zealand",
            "time_period": "fiscal year 2022-2023",
            "metric_1": "customer acquisition cost",
            "metric_2": "average revenue per user"
        },
        "ground_truths": {
            "response": "As of 2023, the market for smart home devices in Australia and New Zealand is experiencing robust growth, driven by increasing consumer interest in home automation and the enhanced convenience and security these devices offer. [continue...]",
        }
    },
]

@evaluator()
def sample_evaluator(outputs, inputs, ground_truths):
    # Placeholder logic: returns a random score from 1 to 5; replace with real scoring
    return random.randint(1, 5)

if __name__ == "__main__":
    # Run experiment
    evaluate(
        function=function_to_evaluate,       # Function to be evaluated
        api_key='<HONEYHIVE_API_KEY>',
        project='<HONEYHIVE_PROJECT>',
        name='Sample Experiment',
        dataset=dataset,                     # List of JSON datapoints defined in code
        evaluators=[sample_evaluator],       # Computes client-side metrics on each run
        server_url='<HONEYHIVE_SERVER_URL>'  # Optional; required only for self-hosted or dedicated deployments
    )

Running an experiment

Prerequisites

You have already created a project in HoneyHive, as explained here.
You have an API key for your project, as explained here.

Expected Time: 5 minutes

Steps

Setup input data

Let’s create our dataset by defining it directly in code as a list of JSON objects, each with an inputs field and an optional ground_truths field:

dataset = [
    {
        "inputs": {
            "product_type": "electric vehicles",
            "region": "western europe"   
        },
        "ground_truths": {
            "response": "As of 2023, the electric vehicle (EV) ... ",
        }
    },
    {
        "inputs": {
            "product_type": "gaming consoles",
            "region": "north america"
        },
        "ground_truths": {
            "response": "As of 2023, the gaming console market ... ",
        }
    },
    {
        "inputs": {
            "product_type": "smart home devices",
            "region": "australia and new zealand" 
        },
        "ground_truths": {
            "response": "As of 2023, the market for smart home devices in Australia and New Zealand ... ",
        }
    },
]

The inputs and ground_truths fields will be accessible in both the function we want to evaluate and the evaluator function, as we will see below.
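Because a misspelled key only surfaces once the experiment is running, it can help to check the dataset shape first. The sketch below is a hypothetical helper (validate_dataset is not part of the HoneyHive SDK) and assumes each datapoint is a dictionary with a required inputs dictionary and an optional ground_truths dictionary, as in the example above:

```python
def validate_dataset(dataset):
    """Return a list of problems found; an empty list means the shape looks right."""
    problems = []
    for i, datapoint in enumerate(dataset):
        if not isinstance(datapoint, dict):
            problems.append(f"datapoint {i}: expected a dict, got {type(datapoint).__name__}")
            continue
        if not isinstance(datapoint.get("inputs"), dict):
            problems.append(f"datapoint {i}: missing or non-dict 'inputs'")
        # Assumption: ground_truths is optional, but should be a dict when present
        if "ground_truths" in datapoint and not isinstance(datapoint["ground_truths"], dict):
            problems.append(f"datapoint {i}: 'ground_truths' should be a dict")
    return problems


sample = [
    {
        "inputs": {"product_type": "electric vehicles", "region": "western europe"},
        "ground_truths": {"response": "As of 2023, the electric vehicle (EV) ... "},
    },
    # Deliberate typo: 'input' instead of 'inputs'
    {"input": {"product_type": "gaming consoles", "region": "north america"}},
]

problems = validate_dataset(sample)
print(problems)
```

Running a check like this before calling evaluate keeps shape errors out of the experiment results.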

Define the function you want to evaluate

Define the function you want to evaluate. This can be arbitrarily complex, anywhere from a prompt or a simple retrieval pipeline, to an end-to-end multi-agent system:

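# inputs -> dictionary taken from the "inputs" field of each datapoint
# (optional) ground_truths -> dictionary taken from the "ground_truths" field of the same datapoint
def function_to_evaluate(inputs, ground_truths):

    # Replace this placeholder with your own logic (prompt, pipeline, or agent)
    result = f"Analysis of {inputs['product_type']} in {inputs['region']}"

    return result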

Important Note About Parameters

The function parameters are positional arguments and must be specified in this order:

inputs (first parameter): dictionary of parameters from your dataset
ground_truths (second parameter): optional ground truth dictionary

The value returned by the function maps to the outputs field of each trace in the experiment and is accessible to your evaluator function, as we will see below.
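To see the positional contract in isolation, you can dry-run the function locally before sending anything to HoneyHive. The sketch below is a stand-in written for illustration only: run_locally is a hypothetical helper, not how evaluate is implemented, and the function body returns a formatted string instead of calling a model:

```python
def run_locally(function, dataset):
    """Call `function` once per datapoint, passing inputs then ground_truths positionally."""
    results = []
    for datapoint in dataset:
        # Assumption: a missing ground_truths field is passed as an empty dict
        outputs = function(datapoint["inputs"], datapoint.get("ground_truths", {}))
        results.append({"inputs": datapoint["inputs"], "outputs": outputs})
    return results


def function_to_evaluate(inputs, ground_truths):
    # Stand-in for a real LLM call
    return f"Market analysis of {inputs['product_type']} in {inputs['region']}"


dataset = [
    {
        "inputs": {"product_type": "electric vehicles", "region": "western europe"},
        "ground_truths": {"response": "As of 2023, the electric vehicle (EV) ... "},
    },
    {"inputs": {"product_type": "gaming consoles", "region": "north america"}},
]

results = run_locally(function_to_evaluate, dataset)
for row in results:
    print(row["outputs"])
```

If the parameters were swapped in the function signature, the lookup on inputs would fail here, before any experiment is created.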

(Optional) Setup Evaluators

Define client-side evaluators in your code that run immediately after each experiment iteration. These evaluators have direct access to inputs, outputs, and ground truths, and run synchronously with your experiment.

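import random
from honeyhive import evaluator

@evaluator()
def sample_evaluator(outputs, inputs, ground_truths):
    # Placeholder logic: returns a random score from 1 to 5.
    # Replace with a real comparison of outputs against ground_truths.
    return random.randint(1, 5)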

Important Note About Evaluator Parameters

The evaluator parameters are positional arguments and must be specified in this order:

outputs (first parameter): the output returned by the evaluated function
inputs (second parameter): the original input dictionary
ground_truths (third parameter): the ground truth dictionary

For more complex multi-step pipelines, you can compute and log client-side evaluators on specific traces and spans directly in your experiment harness.
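The random score above is only a placeholder. As a rough illustration of a deterministic alternative, the sketch below scores the share of ground-truth words that also appear in the output. The name overlap_evaluator and the scoring rule are assumptions made for this example, not a HoneyHive built-in. It is written as a plain function so it runs without the SDK; in an experiment you would decorate it with @evaluator() as shown above:

```python
import re


def _tokens(text):
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_evaluator(outputs, inputs, ground_truths):
    """Fraction of ground-truth words that also appear in the output (0.0 to 1.0)."""
    reference = _tokens(ground_truths.get("response", ""))
    if not reference:
        # Nothing to compare against
        return 0.0
    candidate = _tokens(str(outputs))
    return len(reference & candidate) / len(reference)


score = overlap_evaluator(
    "The EV market in Western Europe is growing",
    {"product_type": "electric vehicles", "region": "western europe"},
    {"response": "EV market growing in Europe"},
)
print(score)
```

Word overlap is a crude proxy for quality, but it gives repeatable scores, which makes differences between runs easier to interpret than a random value.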

Run experiment

Finally, you can run your experiment with evaluate:

from honeyhive import evaluate
# your_module is a placeholder for wherever you defined these objects
from your_module import function_to_evaluate, dataset, sample_evaluator

if __name__ == "__main__":
    evaluate(
        function=function_to_evaluate,
        api_key='<HONEYHIVE_API_KEY>',
        project='<HONEYHIVE_PROJECT>',
        name='Sample Experiment',
        # To be passed for datasets managed in code
        dataset=dataset,
        # Evaluators run at the end of each execution; add more to this list as needed
        evaluators=[sample_evaluator],
        server_url='<HONEYHIVE_SERVER_URL>'  # Optional; required only for self-hosted or dedicated deployments
    )

If you are using a self-hosted or dedicated deployment, you also need to pass:

server_url: The private HoneyHive endpoint found in the Settings page in the HoneyHive app.
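One way to keep credentials out of source code and to pass server_url only when it is needed is to assemble the keyword arguments from environment variables. This is a minimal sketch: build_evaluate_kwargs is a hypothetical helper, and the variable names HONEYHIVE_API_KEY, HONEYHIVE_PROJECT, and HONEYHIVE_SERVER_URL simply mirror the placeholders above. It does not assume the SDK reads them on its own:

```python
import os


def build_evaluate_kwargs(env=os.environ):
    """Collect evaluate() settings from environment variables (names are assumptions)."""
    kwargs = {
        "api_key": env["HONEYHIVE_API_KEY"],
        "project": env["HONEYHIVE_PROJECT"],
        "name": env.get("HONEYHIVE_EXPERIMENT_NAME", "Sample Experiment"),
    }
    # Only self-hosted or dedicated deployments need server_url
    server_url = env.get("HONEYHIVE_SERVER_URL")
    if server_url:
        kwargs["server_url"] = server_url
    return kwargs


# Plain dictionaries stand in for the real environment in this demonstration
cloud = build_evaluate_kwargs({"HONEYHIVE_API_KEY": "key", "HONEYHIVE_PROJECT": "demo"})
self_hosted = build_evaluate_kwargs({
    "HONEYHIVE_API_KEY": "key",
    "HONEYHIVE_PROJECT": "demo",
    "HONEYHIVE_SERVER_URL": "https://honeyhive.example.internal",
})
print(sorted(cloud))
print(sorted(self_hosted))
```

The resulting dictionary can then be unpacked into the call, for example evaluate(function=function_to_evaluate, dataset=dataset, evaluators=[sample_evaluator], **build_evaluate_kwargs()).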

Dashboard View

Remember to review the results in your HoneyHive dashboard to gain insights into your model’s performance across different inputs. The dashboard provides a comprehensive view of the experiment results and performance across multiple runs.

Conclusion

By following these steps, you can set up and run experiments using HoneyHive. This allows you to systematically test your LLM-based systems across various scenarios and collect performance data for analysis.

Next Steps

If you are interested in a specific workflow, we recommend reading the walkthrough for the relevant product area.

Introduction to Evaluators

Learn how to evaluate and monitor your AI applications with HoneyHive’s flexible evaluation framework.

Comparing Experiments

Compare experiments side-by-side in HoneyHive to identify improvements, regressions, and optimize your workflows.

Running Experiments with HoneyHive's managed datasets

Run experiments using HoneyHive’s managed datasets, enabling centralized dataset management and version control.

Running Experiments with HoneyHive's server-side evaluators

Server-side evaluators are centralized, scalable, and versioned, making them ideal for resource-intensive or asynchronous tasks.
