Using Server-Side Evaluators
Run experiments using server-side HoneyHive evaluators
In the experiments Quickstart, you learned how to run an experiment using client-side evaluators executed directly within your application’s environment. This guide focuses on utilizing server-side evaluators powered by HoneyHive’s infrastructure. Server-side evaluators offer several advantages, particularly for resource-intensive or asynchronous tasks, as they are centralized, scalable, and versioned.
If you want to know more about the differences between client-side and server-side evaluators, refer to the Evaluators Documentation.
Full code
Below is a minimal example demonstrating how to run an experiment using server-side evaluators:
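The sketch below assumes the `honeyhive` SDK's `evaluate` entry point and an OpenAI chat completion as the flow under test; the keyword argument names (`hh_api_key`, `hh_project`, `name`, `dataset`), the function name, and the model are illustrative, so confirm them against the SDK reference for your version:

```python
from honeyhive import evaluate
from openai import OpenAI

openai_client = OpenAI()

# Dataset defined inline as a list of JSON objects.
dataset = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "ground_truths": {"answer": "Paris"},
    },
    {
        "inputs": {"question": "Who wrote 'Pride and Prejudice'?"},
        "ground_truths": {"answer": "Jane Austen"},
    },
]

# The flow to evaluate: receives each row's `inputs` and `ground_truths`;
# the return value is stored as the run's `outputs`.
def answer_question(inputs, ground_truths):
    completion = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": inputs["question"]}],
    )
    return completion.choices[0].message.content

if __name__ == "__main__":
    # No client-side evaluators are passed; the server-side Response Length
    # evaluator configured in the HoneyHive console runs automatically.
    evaluate(
        function=answer_question,
        hh_api_key="<YOUR_HONEYHIVE_API_KEY>",
        hh_project="<YOUR_PROJECT_NAME>",
        name="Server-side Evaluators Quickstart",
        dataset=dataset,
    )
```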
Running an experiment
Prerequisites
- You have already created a project in HoneyHive, as explained here.
- You have an API key for your project, as explained here.
Expected Time: 5 minutes
Steps
Setup input data
Let’s create our dataset by inputting data directly into our code using a list of JSON objects:
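A small inline dataset could look like the following; the questions and answers are placeholder values, so substitute your own data:

```python
# Each row provides an `inputs` dictionary and an optional `ground_truths` dictionary.
dataset = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "ground_truths": {"answer": "Paris"},
    },
    {
        "inputs": {"question": "Who wrote 'Pride and Prejudice'?"},
        "ground_truths": {"answer": "Jane Austen"},
    },
]
```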
The `inputs` and `ground_truths` fields will be accessible in both the function we want to evaluate and the evaluator function, as we will see below.

Create the flow you want to evaluate
Define the function you want to evaluate:
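As a minimal sketch, assuming an OpenAI chat completion as the flow under test (the function name, model, and `question` field are illustrative):

```python
from openai import OpenAI

openai_client = OpenAI()

def answer_question(inputs, ground_truths):
    # `inputs` is the dictionary from the dataset row; the return value
    # becomes the run's `outputs` in the experiment.
    completion = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": inputs["question"]}],
    )
    return completion.choices[0].message.content
```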
- `inputs` is a dictionary with the parameters used in your function, as defined in our dataset.
- The value returned by the function maps to the `outputs` field of each run in the experiment and will be accessible to your evaluator function, as we will see below.
- `ground_truths` is an optional field and, as the name suggests, contains the ground truth for each set of inputs.
Setup Server-side Evaluators
Let’s create a server-side Python evaluator that will simply measure the length of the model’s response. This evaluator will specifically work with events of type “model”, which represent LLM completions in your application:
- Navigate to the Evaluators tab in the HoneyHive console.
- Click `Add Evaluator` and select `Python Evaluator`.
You can find more information about server-side Python evaluators here.
When creating server-side evaluators, you’ll work with span attributes that are automatically passed to your evaluator function through the `event` dictionary parameter, such as `inputs`, `outputs`, or `metadata`.

For our Response Length evaluator, we are interested in the model’s response, which we’ll access using the `event["outputs"]["content"]` path:
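A minimal sketch of the evaluator body you might paste into the console editor, assuming it expects a Python function that receives the `event` dictionary (the function name is illustrative):

```python
def evaluator(event):
    # `event` holds the span attributes for a "model" event;
    # the completion text lives at event["outputs"]["content"].
    return len(event["outputs"]["content"])
```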
You can find more information on model events and their properties here.
Run experiment
Finally, you can run your experiment with `evaluate`:
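A sketch of the call, reusing the dataset and function defined above; the keyword argument names for the API key, project, and experiment name are assumptions, so confirm them against the SDK reference:

```python
from honeyhive import evaluate

if __name__ == "__main__":
    # No client-side evaluators are passed; the server-side Response Length
    # evaluator runs on HoneyHive's infrastructure.
    evaluate(
        function=answer_question,
        hh_api_key="<YOUR_HONEYHIVE_API_KEY>",
        hh_project="<YOUR_PROJECT_NAME>",
        name="Server-side Evaluators Quickstart",
        dataset=dataset,
    )
```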
Dashboard View
You should now be able to see the `Response Length` metric in your dashboard. Note that even though we didn’t pass any local evaluators when running `evaluate`, our server-side evaluator was properly configured and executed.
Conclusion
By following these steps, you can set up and run experiments using server-side HoneyHive evaluators. This lets you systematically test your LLM-based systems across a range of scenarios and collect performance data for analysis, while keeping evaluator deployment, management, and versioning consistent and centralized across environments.