
What You’ll Learn

By the end of this tutorial, you will:
  • Run experiments with evaluate()
  • Compare two prompts and measure which performs better
  • Create evaluators to automatically score outputs
  • View results and metrics in the HoneyHive dashboard
Time: ~10 minutes

Step 1: Setup

Install dependencies and configure your environment:
pip install "honeyhive>=1.0.0rc0" openai
import os
from openai import OpenAI
from honeyhive import evaluate

os.environ["HH_API_KEY"] = "your-honeyhive-api-key"
os.environ["HH_PROJECT"] = "my-project"
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"

client = OpenAI()
If you have existing code with HoneyHiveTracer.init(), you don't need it here; evaluate() handles tracing automatically. See Tracing Integration for details.

Step 2: Create Two Functions to Compare

Write two versions of an intent classifier - one vague, one structured:
def classify_vague(datapoint):
    text = datapoint["inputs"]["text"]
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a customer support assistant. Classify the customer's intent."},
            {"role": "user", "content": text}
        ],
        temperature=0
    )
    return {"intent": response.choices[0].message.content.strip().lower()}


def classify_structured(datapoint):
    text = datapoint["inputs"]["text"]
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"""Classify this customer support message into ONE category:
- billing: payment issues, invoices, charges, refunds
- technical: bugs, errors, how to use features
- account: login, password, profile, settings
- general: other questions, feedback

Reply with ONLY the category name.

Message: {text}
Category:"""}],
        temperature=0
    )
    return {"intent": response.choices[0].message.content.strip().lower()}
Key pattern: Each function receives a datapoint dict with inputs and returns an output dict.
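The contract can be exercised without any API call. Here is a minimal stub (the echo_intent function is hypothetical, just for illustration) showing the datapoint-in / dict-out shape:

```python
# A stub illustrating the datapoint-in / dict-out contract.
# No API call: it returns a canned intent so the shape is easy to see.
def echo_intent(datapoint):
    text = datapoint["inputs"]["text"]
    # A real function would send `text` to the model here.
    return {"intent": "billing"}

sample = {"inputs": {"text": "I was charged twice."}}
print(echo_intent(sample))  # {'intent': 'billing'}
```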

Step 3: Create Your Dataset

Define test cases with messages and expected intents:
dataset = [
    {
        "inputs": {"text": "I was charged twice for my subscription this month."},
        "ground_truth": {"intent": "billing"}
    },
    {
        "inputs": {"text": "The export button isn't working. Getting error code 500."},
        "ground_truth": {"intent": "technical"}
    },
    {
        "inputs": {"text": "I forgot my password and the reset email never arrived."},
        "ground_truth": {"intent": "account"}
    },
    {
        "inputs": {"text": "Just wanted to say your support team was amazing. Thanks!"},
        "ground_truth": {"intent": "general"}
    },
]
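Before running experiments, it can help to sanity-check that every datapoint has the keys your functions and evaluator rely on. This is plain Python, not a HoneyHive feature (check_dataset is a hypothetical helper):

```python
def check_dataset(dataset):
    """Raise if any datapoint is missing the keys used below."""
    for i, dp in enumerate(dataset):
        assert "text" in dp["inputs"], f"datapoint {i} missing inputs.text"
        assert "intent" in dp["ground_truth"], f"datapoint {i} missing ground_truth.intent"
    return len(dataset)

# Shown here with a one-item excerpt of the dataset above.
example = [
    {"inputs": {"text": "I was charged twice."}, "ground_truth": {"intent": "billing"}},
]
print(check_dataset(example))  # 1
```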

Step 4: Create an Evaluator

Evaluators score your function’s outputs against ground truth:
def intent_match(outputs, inputs, ground_truth):
    """Check if the classified intent matches expected."""
    actual = outputs.get("intent", "").lower()
    expected = ground_truth.get("intent", "").lower()
    
    # Substring match: a verbose output still scores 1.0
    # as long as it contains the expected category name
    return 1.0 if expected in actual else 0.0
Evaluator signature: (outputs, inputs, ground_truth) → returns a score (typically 0.0 to 1.0)
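Calling the evaluator by hand (redefined here so the snippet stands alone) shows why the substring check matters for verbose outputs:

```python
def intent_match(outputs, inputs, ground_truth):
    actual = outputs.get("intent", "").lower()
    expected = ground_truth.get("intent", "").lower()
    return 1.0 if expected in actual else 0.0

# Exact match scores 1.0
print(intent_match({"intent": "billing"}, {}, {"intent": "billing"}))  # 1.0
# A verbose output still scores 1.0 thanks to the substring check
print(intent_match({"intent": "this looks like a billing issue"}, {}, {"intent": "billing"}))  # 1.0
# Wrong category scores 0.0
print(intent_match({"intent": "technical"}, {}, {"intent": "billing"}))  # 0.0
```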

Step 5: Run Both Experiments

Run experiments with each prompt version:
result_vague = evaluate(
    function=classify_vague,
    dataset=dataset,
    evaluators=[intent_match],
    name="intent-vague-prompt"
)

result_structured = evaluate(
    function=classify_structured,
    dataset=dataset,
    evaluators=[intent_match],
    name="intent-structured-prompt"
)

print(f"Vague: {result_vague.run_id}")
print(f"Structured: {result_structured.run_id}")
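Conceptually, evaluate() runs every datapoint through your function, applies each evaluator, and aggregates the scores. A rough local sketch of that loop, using a hypothetical stub classifier instead of the real HoneyHive implementation:

```python
def run_locally(function, dataset, evaluator):
    """Toy version of an experiment run: score each datapoint, return the mean."""
    scores = []
    for dp in dataset:
        outputs = function(dp)
        scores.append(evaluator(outputs, dp["inputs"], dp["ground_truth"]))
    return sum(scores) / len(scores)

def always_billing(datapoint):  # stub standing in for a real classifier
    return {"intent": "billing"}

def intent_match(outputs, inputs, ground_truth):
    return 1.0 if ground_truth["intent"] in outputs.get("intent", "") else 0.0

mini = [
    {"inputs": {"text": "Charged twice."}, "ground_truth": {"intent": "billing"}},
    {"inputs": {"text": "Reset email never came."}, "ground_truth": {"intent": "account"}},
]
print(run_locally(always_billing, mini, intent_match))  # 0.5
```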

Step 6: View Results in Dashboard

Go to app.honeyhive.ai and open Experiments to see your results. The vague prompt produces verbose outputs, causing mismatches. The structured prompt produces clean single-word outputs that match exactly.
(Screenshot: vague prompt results showing 0.3 accuracy)
(Screenshot: structured prompt results showing 1.0 accuracy)
You can also compare runs programmatically with compare_runs(). See Compare Experiments for details.

Full Code

The complete script from this tutorial:

import os
from openai import OpenAI
from honeyhive import evaluate

os.environ["HH_API_KEY"] = "your-honeyhive-api-key"
os.environ["HH_PROJECT"] = "my-project"
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"

client = OpenAI()

def classify_vague(datapoint):
    text = datapoint["inputs"]["text"]
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a customer support assistant. Classify the customer's intent."},
            {"role": "user", "content": text}
        ],
        temperature=0
    )
    return {"intent": response.choices[0].message.content.strip().lower()}

def classify_structured(datapoint):
    text = datapoint["inputs"]["text"]
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"""Classify this customer support message into ONE category:
- billing: payment issues, invoices, charges, refunds
- technical: bugs, errors, how to use features
- account: login, password, profile, settings
- general: other questions, feedback

Reply with ONLY the category name.

Message: {text}
Category:"""}],
        temperature=0
    )
    return {"intent": response.choices[0].message.content.strip().lower()}

dataset = [
    {"inputs": {"text": "I was charged twice for my subscription this month."}, "ground_truth": {"intent": "billing"}},
    {"inputs": {"text": "The export button isn't working. Getting error code 500."}, "ground_truth": {"intent": "technical"}},
    {"inputs": {"text": "I forgot my password and the reset email never arrived."}, "ground_truth": {"intent": "account"}},
    {"inputs": {"text": "Just wanted to say your support team was amazing. Thanks!"}, "ground_truth": {"intent": "general"}},
]

def intent_match(outputs, inputs, ground_truth):
    return 1.0 if ground_truth.get("intent", "").lower() in outputs.get("intent", "").lower() else 0.0

result_vague = evaluate(function=classify_vague, dataset=dataset, evaluators=[intent_match], name="intent-vague-prompt")
result_structured = evaluate(function=classify_structured, dataset=dataset, evaluators=[intent_match], name="intent-structured-prompt")

print(f"Vague: {result_vague.run_id}")
print(f"Structured: {result_structured.run_id}")

What You Learned

  • Create datasets with inputs and ground truths for testing
  • Write evaluators that automatically score outputs
  • Run experiments with evaluate() to measure performance
  • Compare runs programmatically with compare_runs() or in the dashboard

What’s Next?