In the experiments Quickstart, you learned how to run an experiment using client-side evaluators executed directly within your application’s environment. This guide focuses on using server-side evaluators powered by HoneyHive’s infrastructure. Server-side evaluators are centralized, scalable, and versioned, which makes them particularly useful for resource-intensive or asynchronous evaluation tasks.

If you want to know more about the differences between client-side and server-side evaluators, refer to the Evaluators Documentation.

Full code

Below is a minimal example demonstrating how to run an experiment using server-side evaluators:
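
(The following is a sketch assembled from the steps that follow; the server-side evaluator itself is configured in the HoneyHive console in step 3, so it does not appear in the code. It assumes your OpenAI API key is available in the environment, and the <HONEYHIVE_API_KEY> and <HONEYHIVE_PROJECT> placeholders stand in for your own values.)

from openai import OpenAI
from honeyhive import evaluate

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Dataset abbreviated here; see step 1 for the full version
dataset = [
    {
        "inputs": {"product_type": "electric vehicles", "region": "western europe"},
        "ground_truths": {"response": "As of 2023, the electric vehicle (EV) ... "},
    },
    # ... more rows ...
]

def function_to_evaluate(inputs, ground_truths):
    completion = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"You are an expert analyst specializing in {inputs['product_type']} market trends."},
            {"role": "user", "content": f"Could you provide an analysis of the current market performance and consumer reception of {inputs['product_type']} in {inputs['region']}? Please include any notable trends or challenges specific to this region."}
        ]
    )
    return completion.choices[0].message.content

if __name__ == "__main__":
    evaluate(
        function=function_to_evaluate,
        hh_api_key='<HONEYHIVE_API_KEY>',
        hh_project='<HONEYHIVE_PROJECT>',
        name='Sample Experiment',
        dataset=dataset,
    )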

Running an experiment

Prerequisites

  • You have already created a project in HoneyHive, as explained here.
  • You have an API key for your project, as explained here.

Expected Time: 5 minutes

Steps

1

Set up input data

Let’s create our dataset by defining the data directly in code as a list of JSON objects:

dataset = [
    {
        "inputs": {
            "product_type": "electric vehicles",
            "region": "western europe",   
        },
        "ground_truths": {
            "response": "As of 2023, the electric vehicle (EV) ... ",
        }
    },
    {
        "inputs": {
            "product_type": "gaming consoles",
            "region": "north america",
        },
        "ground_truths": {
            "response": "As of 2023, the gaming console market ... ",
        }
    },
    {
        "inputs": {
            "product_type": "smart home devices",
            "region": "australia and new zealand", 
        },
        "ground_truths": {
            "response": "As of 2023, the market for smart home devices in Australia and New Zealand ... ",
        }
    },
]
The inputs and ground_truths fields will be accessible in both the function we want to evaluate and the evaluator function, as we will see below.
2

Create the flow you want to evaluate

Define the function you want to evaluate:

from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def function_to_evaluate(inputs, ground_truths):
    completion = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"You are an expert analyst specializing in {inputs['product_type']} market trends."},
            {"role": "user", "content": f"Could you provide an analysis of the current market performance and consumer reception of {inputs['product_type']} in {inputs['region']}? Please include any notable trends or challenges specific to this region."}
        ]
    )

    # The returned value becomes the session output for this run
    return completion.choices[0].message.content
  • inputs is a dictionary with the parameters used in your function, as defined in our dataset.
  • The value returned by the function maps to the outputs field of each run in the experiment and is accessible to your evaluator function, as sketched after this list.
  • ground_truths is an optional field and, as the name suggests, contains the ground truth for each set of inputs.
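
To make the data flow concrete, here is roughly what the experiment harness does for the first dataset row (a conceptual sketch, not the SDK’s literal internals):

# Conceptual sketch: one run of the experiment for dataset[0]
row = dataset[0]
outputs = function_to_evaluate(row["inputs"], row["ground_truths"])
# The returned string is stored as this run's `outputs` field and is made
# available to evaluators together with row["inputs"] and row["ground_truths"].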
3

Set up server-side evaluators

Let’s create a server-side Python evaluator that measures the length of the model’s response. This evaluator works specifically with events of type “model”, which represent LLM completions in your application:

  1. Navigate to the Evaluators tab in the HoneyHive console.
  2. Click Add Evaluator and select Python Evaluator.

You can find more information about server-side Python evaluators here.

When creating server-side evaluators, you’ll work with span attributes that are automatically passed to your evaluator function through the event dictionary parameter, such as inputs, outputs, or metadata. For our Response Length evaluator, we are interested in the model’s response, which we’ll access using the event["outputs"]["content"] path:

def metric_name(event):
    """
    Response Length Metric

    Counts the number of words in the model's output. Useful for measuring verbosity,
    controlling output length, and monitoring response size.

    Args:
        event (dict): Dictionary containing model output (and potentially other fields).
                      - event["outputs"]["content"] (str): The model's text output.

    Returns:
        int: The total number of words in the model's response.
    """
    model_response = event["outputs"]["content"]  # Replace this based on your specific event attributes

    # Split the response on whitespace and count the resulting words
    # Note: This is a simple implementation. Consider using NLTK or spaCy for more accurate tokenization.
    model_words = model_response.split()
    return len(model_words)

You can find more information on model events and their properties here.
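
Before saving the evaluator, it can help to sanity-check the logic with a mock event dictionary. The payload below is a minimal, hypothetical stand-in that only includes the outputs.content field used above:

# Mock event for a quick local check (hypothetical payload, only the field we use)
mock_event = {
    "outputs": {
        "content": "The EV market in Western Europe grew steadily through 2023."
    }
}

print(metric_name(mock_event))  # -> 10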

4

Run experiment

Finally, you can run your experiment with evaluate:

from honeyhive import evaluate

# function_to_evaluate and dataset are defined above (or imported from your own module)

if __name__ == "__main__":
    evaluate(
        function=function_to_evaluate,
        hh_api_key='<HONEYHIVE_API_KEY>',
        hh_project='<HONEYHIVE_PROJECT>',
        name='Sample Experiment',
        dataset=dataset,             # passed here because we use a code-managed dataset
        # You can also provide client-side evaluators if they are already set up.
        # evaluators=[sample_evaluator, ...]
    )
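
If you also want a client-side evaluator to run alongside the server-side one, it could look roughly like the sketch below. sample_evaluator is a hypothetical name, and its signature is assumed here to follow the experiments Quickstart (outputs, inputs, ground_truths); check that guide for the exact form.

# Hypothetical client-side evaluator (signature assumed per the experiments Quickstart)
def sample_evaluator(outputs, inputs, ground_truths):
    # Compare the model's word count against the ground-truth response's word count
    return abs(len(outputs.split()) - len(ground_truths["response"].split()))

You would then uncomment the evaluators argument above and pass evaluators=[sample_evaluator].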

Dashboard View

You should now be able to see the Response Length metric in your dashboard. Note that even though we didn’t pass any client-side evaluators to evaluate, the server-side evaluator configured in the console was still executed automatically on HoneyHive’s infrastructure.

Conclusion

By following these steps, you can set up and run experiments using HoneyHive’s server-side evaluators. This lets you systematically test your LLM-based systems across a range of scenarios and collect performance data for analysis, while keeping evaluator deployment, management, and versioning centralized and consistent across environments.