Initialize your evaluation run

Let’s start by running a quick experiment to test whether including certain instructions in the prompt actually affects outputs across two different prompt variants. For this example, we’ll compare a prompt with explicit instructions to output concise sales emails against a prompt with no such instructions. We’ll evaluate the results using the Conciseness metric provided out-of-the-box in HoneyHive.
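As a concrete illustration, the two variants might look like the snippet below. The exact wording is hypothetical, not the prompts used in this guide:

```python
# Two hypothetical prompt variants for the conciseness experiment.
# Variant A adds explicit conciseness instructions; Variant B omits them.
PROMPT_WITH_INSTRUCTIONS = (
    "Write a sales email for the product below. "
    "Keep it under 100 words and avoid filler phrases."
)
PROMPT_WITHOUT_INSTRUCTIONS = "Write a sales email for the product below."
```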

Define test cases and pipeline configuration

Now that you’ve initialized the evaluation run, define your test cases and prompt-model configurations for this evaluation.

HoneyHive allows you to log arbitrary metadata properties like chunk_size, chunk_overlap, vectorDB_index_name, etc. to your configs when running evaluations via the SDK. This can help you compare different vector databases, indexes, chunking sizes, and other variations in your LLM pipeline when running evaluations.
# Defining test cases
# must be a list of dictionaries 

dataset = [...]

# Configs help you track different parameters across the versions you're evaluating.
# Required properties include model, provider, version, hyperparameters, and chat (for GPT 3.5 Turbo or 4) or prompt (for all other models)

messages = [...]
configs = [
    {"model": "gpt-3.5-turbo", "chat": messages},
    {"model": "gpt-4-1106-preview", "chat": messages}
]

Run your LLM pipelines

Now that you’ve defined your test cases, run your LLM pipelines to start logging evaluation results.

We’ve provided a simple OpenAI Chat Completions example below; you can also run complex pipelines that use external tools and frameworks like Llamaindex, Langchain, Pinecone, Chroma, etc.

def pipeline_a(config, datapoint, tracer, metrics):

    # put your pipeline logic in here!

    with tracer.tool(...):
        ...  # e.g. retrieval or other pre-processing steps

    with tracer.model(...):
        response = completion(...)

    tracer.output = str(response)
    return tracer, metrics
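To see the shape of this callback without the HoneyHive SDK, here is a minimal runnable sketch. The `StubTracer` class is a stand-in for the real tracer object, and the pipeline body fakes the completion call; only the overall structure mirrors the example above:

```python
from contextlib import contextmanager

class StubTracer:
    """Stand-in for the SDK tracer; records span names only."""
    def __init__(self):
        self.spans = []
        self.output = None

    @contextmanager
    def tool(self, name):
        # Records a tool span, then yields control to the wrapped block.
        self.spans.append(("tool", name))
        yield

    @contextmanager
    def model(self, name):
        # Records a model span, then yields control to the wrapped block.
        self.spans.append(("model", name))
        yield

def pipeline_a(config, datapoint, tracer, metrics):
    with tracer.tool("retrieve_context"):
        context = f"docs for {datapoint['product']}"

    with tracer.model(config["model"]):
        response = f"Email about {context}"  # stand-in for a real completion call

    tracer.output = str(response)
    return tracer, metrics

tracer, metrics = pipeline_a(
    {"model": "gpt-3.5-turbo"},
    {"product": "CRM software"},
    StubTracer(),
    {},
)
```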

# Simply pass the function into your eval!

eval = hh.eval(
    pipeline=pipeline_a,   # argument names here are illustrative;
    dataset=dataset,       # see the SDK reference for the exact signature
    configs=configs
)

Analyze results and share learnings

Congrats! You’ve successfully run your first evaluation with HoneyHive! You can now view and start analyzing your evaluation results.

From the results, it looks like adding instructions for conciseness not only makes the outputs more concise, but also reduces hallucination.


While automated, LLM-based evaluation metrics can be helpful indicators, you should always try to review completions manually to validate outputs. With HoneyHive, you can easily invite domain experts to rate model completions and share relevant learnings with your entire team.