Running experiments is a natural extension of the tracing capabilities of HoneyHive. We recommend going through the tracing quickstart before proceeding with this guide.
Here’s a minimal example to get you started with experiments in HoneyHive:
Sample eval script
from honeyhive import evaluate, evaluator
import os
from openai import OpenAI
import random

openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Create function to be evaluated
# inputs -> parameter to which datapoint or json value will be passed
# (optional) ground_truths -> ground truth value for the input
def function_to_evaluate(inputs, ground_truths):
    completion = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"You are an expert analyst specializing in {inputs['product_type']} market trends."},
            {"role": "user", "content": f"Could you provide an analysis of the current market performance and consumer reception of {inputs['product_type']} in {inputs['region']}? Please include any notable trends or challenges specific to this region."}
        ]
    )

    # Output -> session output
    return completion.choices[0].message.content

dataset = [
    {
        "inputs": {
            "product_type": "electric vehicles",
            "region": "western europe",
            "time_period": "first half of 2023",
            "metric_1": "total revenue",
            "metric_2": "market share"
        },
        "ground_truths": {
            "response": "As of 2023, the electric vehicle (EV) market in Western Europe is experiencing significant growth, with the region maintaining its status as a global leader in EV adoption. [continue...]",
        }
    },
    {
        "inputs": {
            "product_type": "gaming consoles",
            "region": "north america",
            "time_period": "holiday season 2022",
            "metric_1": "units sold",
            "metric_2": "gross profit margin"
        },
        "ground_truths": {
            "response": "As of 2023, the gaming console market in North America is characterized by intense competition, steady consumer demand, and evolving trends influenced by technological advancements and changing consumer preferences. [continue...]",
        }
    },
    {
        "inputs": {
            "product_type": "smart home devices",
            "region": "australia and new zealand",
            "time_period": "fiscal year 2022-2023",
            "metric_1": "customer acquisition cost",
            "metric_2": "average revenue per user"
        },
        "ground_truths": {
            "response": "As of 2023, the market for smart home devices in Australia and New Zealand is experiencing robust growth, driven by increasing consumer interest in home automation and the enhanced convenience and security these devices offer. [continue...]",
        }
    },
]

@evaluator()
def sample_evaluator(outputs, inputs, ground_truths):
    # Code here
    return random.randint(1, 5)

if __name__ == "__main__":
    # Run experiment
    evaluate(
        function=function_to_evaluate,  # Function to be evaluated
        api_key='<HONEYHIVE_API_KEY>',
        project='<HONEYHIVE_PROJECT>',
        name='Sample Experiment',
        dataset=dataset,  # to be passed for json_list
        evaluators=[sample_evaluator],  # to compute client-side metrics on each run
        server_url='<HONEYHIVE_SERVER_URL>'  # Optional / Required for self-hosted or dedicated deployments
    )
You have already created a project in HoneyHive, as explained here.
You have an API key for your project, as explained here.
Expected Time: 5 minutes
Steps
1
Set up input data
Let’s create our dataset by inputting data directly into our code using a list of JSON objects:
dataset = [
    {
        "inputs": {
            "product_type": "electric vehicles",
            "region": "western europe"
        },
        "ground_truths": {
            "response": "As of 2023, the electric vehicle (EV) ... ",
        }
    },
    {
        "inputs": {
            "product_type": "gaming consoles",
            "region": "north america"
        },
        "ground_truths": {
            "response": "As of 2023, the gaming console market ... ",
        }
    },
    {
        "inputs": {
            "product_type": "smart home devices",
            "region": "australia and new zealand"
        },
        "ground_truths": {
            "response": "As of 2023, the market for smart home devices in Australia and New Zealand ... ",
        }
    },
]
The inputs and ground_truths fields will be accessible in both the function we want to evaluate and the evaluator function, as we will see below.
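Because every datapoint is expected to carry an inputs dictionary and, optionally, a ground_truths dictionary, it can help to check the shape of the list before launching an experiment. The sketch below is a local sanity check only; validate_dataset is a hypothetical helper written for this guide, not part of the HoneyHive SDK, and it assumes ground_truths should be a dictionary whenever it is present.

```python
def validate_dataset(dataset):
    """Return a list of problems found; an empty list means the shape looks right."""
    problems = []
    for i, datapoint in enumerate(dataset):
        # Every datapoint needs an "inputs" dictionary.
        if not isinstance(datapoint.get("inputs"), dict):
            problems.append(f"datapoint {i}: 'inputs' must be a dictionary")
        # "ground_truths" is optional, but assumed to be a dictionary when present.
        ground_truths = datapoint.get("ground_truths", {})
        if not isinstance(ground_truths, dict):
            problems.append(f"datapoint {i}: 'ground_truths' must be a dictionary")
    return problems


sample_dataset = [
    {
        "inputs": {"product_type": "electric vehicles", "region": "western europe"},
        "ground_truths": {"response": "As of 2023, the electric vehicle (EV) ... "},
    },
    {"inputs": "gaming consoles"},  # deliberately malformed for illustration
]

print(validate_dataset(sample_dataset))
```

Running a check like this first keeps shape mistakes from surfacing halfway through an experiment run.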
2
Define the function you want to evaluate
Define the function you want to evaluate. This can be arbitrarily complex, anywhere from a prompt or a simple retrieval pipeline, to an end-to-end multi-agent system:
# inputs -> parameter to which datapoint or json value will be passed
# (optional) ground_truths -> ground truth values for the input
def function_to_evaluate(inputs, ground_truths):
    # Code here
    return result
Important Note About Parameters
The function parameters are positional arguments and must be specified in this order:
inputs (first parameter): dictionary of parameters from your dataset
ground_truths (second parameter): optional ground truth dictionary
The value returned by the function maps to the outputs field of each trace in the experiment and is accessible to your evaluator function, as we will see below.
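To see the positional convention in isolation, the following sketch calls a function once per datapoint with inputs first and ground_truths second and collects the return values. It is a local approximation for illustration only: dry_run is a hypothetical helper, the function body is a stand-in that makes no model call, and nothing here reflects how evaluate is implemented internally.

```python
def function_to_evaluate(inputs, ground_truths):
    # Stand-in for a real model call; ground_truths is accepted but unused here.
    return f"Analysis of {inputs['product_type']} in {inputs['region']}"


def dry_run(function, dataset):
    """Call `function` positionally on each datapoint and collect the outputs."""
    outputs = []
    for datapoint in dataset:
        # inputs is passed first, ground_truths second (empty dict if absent).
        outputs.append(function(datapoint["inputs"], datapoint.get("ground_truths", {})))
    return outputs


dataset = [
    {"inputs": {"product_type": "electric vehicles", "region": "western europe"}},
    {"inputs": {"product_type": "gaming consoles", "region": "north america"}},
]

for output in dry_run(function_to_evaluate, dataset):
    print(output)
```

A dry run of this kind is a cheap way to catch a swapped parameter order or a missing key in inputs before spending API calls.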
3
(Optional) Set up Evaluators
Define client-side evaluators in your code that run immediately after each experiment iteration. These evaluators have direct access to inputs, outputs, and ground truths, and run synchronously with your experiment.
@evaluator()
def sample_evaluator(outputs, inputs, ground_truths):
    # Code here
    import random
    return random.randint(1, 5)
Important Note About Evaluator Parameters
The evaluator parameters are positional arguments and must be specified in this order:
outputs (first parameter): the output returned by the evaluated function
inputs (second parameter): the original input dictionary
ground_truths (third parameter): the ground truth dictionary
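The random score in the sample evaluator is only a placeholder. As one possible replacement, the sketch below scores an output by the fraction of ground-truth words it contains. The function name ground_truth_overlap and the metric itself are illustrative assumptions, as is the "response" key, which follows the dataset in this guide. The function is shown as plain Python so it can be tested on its own; in an experiment it would presumably carry the @evaluator() decorator like the sample above.

```python
def ground_truth_overlap(outputs, inputs, ground_truths):
    """Fraction of distinct ground-truth words that also appear in the output."""
    # Parameter order follows the convention above: outputs, inputs, ground_truths.
    reference = set(ground_truths.get("response", "").lower().split())
    if not reference:
        # No ground truth to compare against.
        return 0.0
    candidate = set(str(outputs).lower().split())
    return len(reference & candidate) / len(reference)


score = ground_truth_overlap(
    "EV sales grew in Europe",
    {"product_type": "electric vehicles"},
    {"response": "ev sales grew"},
)
print(score)
```

Word overlap is a crude signal that ignores meaning and word order, so it is best treated as a deterministic baseline rather than a measure of answer quality.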
Finally, you can run your experiment with evaluate:
from honeyhive import evaluate
from your_module import function_to_evaluate

if __name__ == "__main__":
    evaluate(
        function=function_to_evaluate,
        api_key='<HONEYHIVE_API_KEY>',
        project='<HONEYHIVE_PROJECT>',
        name='Sample Experiment',
        # To be passed for datasets managed in code
        dataset=dataset,
        # Add evaluators to your trace at the end of each execution
        evaluators=[sample_evaluator, ...],
        server_url='<HONEYHIVE_SERVER_URL>'  # Optional / Required for self-hosted or dedicated deployments
    )
Remember to review the results in your HoneyHive dashboard to gain insights into your model’s performance across different inputs. The dashboard provides a comprehensive view of the experiment results and performance across multiple runs.
By following these steps, you can set up and run experiments using HoneyHive. This allows you to systematically test your LLM-based systems across various scenarios and collect performance data for analysis.
Here’s a minimal TypeScript example to get you started with experiments in HoneyHive:
Sample eval script
import { evaluate } from "honeyhive";
import { OpenAI } from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_KEY });

// Create function to be evaluated
// input -> parameter to which datapoint or json value will be passed
// ground_truths -> optional parameter - ground truth value
export async function functionToEvaluate(input: Record<string, any>, ground_truths: Record<string, any>) {
    try {
        const response = await openai.chat.completions.create({
            model: "gpt-4",
            messages: [
                {
                    role: 'system',
                    content: `You are an expert analyst specializing in ${input.product_type} market trends.`
                },
                {
                    role: 'user',
                    content: `Could you provide an analysis of the current market performance and consumer reception of ${input.product_type} in ${input.region}? Please include any notable trends or challenges specific to this region.`
                }
            ],
        });

        // Output -> session output
        return response.choices[0].message;
    } catch (error) {
        console.error('Error making GPT-4 call:', error);
        throw error;
    }
}

const dataset = [
    {
        "inputs": {
            "product_type": "electric vehicles",
            "region": "western europe"
        },
        "ground_truths": {
            "response": "As of 2023, the electric vehicle (EV) ... ",
        }
    },
    {
        "inputs": {
            "product_type": "gaming consoles",
            "region": "north america"
        },
        "ground_truths": {
            "response": "As of 2023, the gaming console market ... ",
        }
    },
    {
        "inputs": {
            "product_type": "smart home devices",
            "region": "australia and new zealand"
        },
        "ground_truths": {
            "response": "As of 2023, the market for smart home devices in Australia and New Zealand ... ",
        }
    }
]

// Sample evaluator that returns fixed metrics
function sampleEvaluator(outputs: any, inputs: Record<string, any>, ground_truths: Record<string, any>) {
    // Code here
    return {
        sample_metric: 0.5,
        sample_metric_2: true
    };
}

evaluate({
    function: functionToEvaluate, // Function to be evaluated
    apiKey: '<HONEYHIVE_API_KEY>',
    project: '<HONEYHIVE_PROJECT>',
    name: 'Sample Experiment',
    dataset: dataset, // to be passed for json_list
    evaluators: [sampleEvaluator], // to compute client-side metrics on each run
    serverUrl: '<HONEYHIVE_SERVER_URL>' // Optional / Required for self-hosted or dedicated deployments
})
You have already created a project in HoneyHive, as explained here.
You have an API key for your project, as explained here.
Expected Time: 5 minutes
Steps
1
Set up input data
Let’s create our dataset by inputting data directly into our code using a list of JSON objects:
const dataset = [
    {
        "inputs": {
            "product_type": "electric vehicles",
            "region": "western europe"
        },
        "ground_truths": {
            "response": "As of 2023, the electric vehicle (EV) ... ",
        }
    },
    {
        "inputs": {
            "product_type": "gaming consoles",
            "region": "north america"
        },
        "ground_truths": {
            "response": "As of 2023, the gaming console market ... ",
        }
    },
    {
        "inputs": {
            "product_type": "smart home devices",
            "region": "australia and new zealand"
        },
        "ground_truths": {
            "response": "As of 2023, the market for smart home devices in Australia and New Zealand ... ",
        }
    }
]
The fields under inputs in the dataset should match the fields that the function you evaluate reads.
2
Create the flow you want to evaluate
Define the function you want to evaluate in your experiment:
// Create function to be evaluated
export async function functionToEvaluate(input: Record<string, any>, ground_truths: Record<string, any>) {
    try {
        // your code here
        return result;
    } catch (error) {
        console.error('Error:', error);
        throw error;
    }
}
Important Note About Parameters
The function parameters are positional arguments and must be specified in this order:
inputs (first parameter): dictionary of parameters from your dataset
ground_truths (second parameter): optional ground truth dictionary
The value returned by the function maps to the outputs field of each run in the experiment.
3
(Optional) Set up Evaluators
Define client-side evaluators in your code that run immediately after each experiment iteration. These evaluators have direct access to inputs, outputs, and ground truths, and run synchronously with your experiment.
// input -> input defined above
// output -> output returned by the function
function sampleEvaluator(outputs: any, inputs: Record<string, any>, ground_truths: Record<string, any>) {
    // Code here
    // Each evaluator can return a dictionary of metrics
    return {
        sample_metric: 0.5,
        sample_metric_2: true
    };
}
Important Note About Evaluator Parameters
The evaluator parameters are positional arguments and must be specified in this order:
outputs (first parameter): the output returned by the evaluated function
inputs (second parameter): the original input dictionary
ground_truths (third parameter): the ground truth dictionary
import { evaluate } from "honeyhive";
import { functionToEvaluate } from "./your-module";

evaluate({
    function: functionToEvaluate, // Direct reference since signature matches
    apiKey: '<HONEYHIVE_API_KEY>',
    project: '<HONEYHIVE_PROJECT>',
    name: 'Sample Experiment',
    dataset: dataset, // to be passed for json_list
    evaluators: [sampleEvaluator], // Add evaluators to run at the end of each run
    serverUrl: '<HONEYHIVE_SERVER_URL>' // Optional / Required for self-hosted or dedicated deployments
})
Remember to review the results in your HoneyHive dashboard to gain insights into your model’s performance across different inputs. The dashboard provides a comprehensive view of the experiment results and performance across multiple runs.
By following these steps, you can set up and run experiments using HoneyHive. This allows you to systematically test your LLM-based systems across various scenarios and collect performance data for analysis.