Client-Side Evaluators
Learn how to use client-side evaluators for both tracing and experiments
Client-side evaluators run within your application environment, providing immediate feedback and integration with your existing infrastructure.
Evaluators can be used in two ways:
- online: computing real-time metrics for live applications
- offline: scoring runs in controlled experimental environments
For online evaluation, HoneyHive enables you to log evaluation results directly alongside your traces at various stages of your pipeline. For offline evaluation, evaluators are most effective when used with HoneyHive’s evaluation harness, which is designed to run and manage experiments seamlessly.
Online Evaluation
Once tracing is set up for your application, performing client-side online evaluations is straightforward: it simply involves enriching your traces and spans with additional context through the metrics field. This field lets you pass any custom metric using any of the primitive data types. Metrics can be logged for any type of event and at every step of your pipeline.
For example, consider a Retrieval-Augmented Generation (RAG) scenario where you have evaluator functions implemented to assess: a) retrieval quality, b) model response generation, and c) overall pipeline performance.
These metrics can be seamlessly passed alongside your traces within each relevant span, as shown below:
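The snippet below is a minimal sketch, assuming the HoneyHive Python SDK's trace decorator together with the enrich_span and enrich_session helpers named in this section; the retriever, the LLM call, and the relevance_score metric are illustrative stand-ins, and exact import paths or init arguments may differ by SDK version.

```python
from honeyhive import HoneyHiveTracer, trace, enrich_span, enrich_session

HoneyHiveTracer.init(api_key="YOUR_API_KEY", project="YOUR_PROJECT")

def relevance_score(query, docs):
    # Hypothetical retrieval-quality metric: fraction of query words found in the docs.
    words = set(query.lower().split())
    text = " ".join(docs).lower()
    return sum(w in text for w in words) / max(len(words), 1)

@trace
def get_relevant_docs(query):
    docs = ["enrich_span attaches custom metrics to the current span."]  # stand-in retriever
    # Log retrieval-quality metrics on this span only.
    enrich_span(metrics={"retrieval_relevance": relevance_score(query, docs)})
    return docs

@trace
def generate_response(query, docs):
    response = f"Answer to '{query}' based on {len(docs)} document(s)."  # stand-in LLM call
    # Log generation metrics on this span only.
    enrich_span(metrics={"response_length": len(response), "num_context_docs": len(docs)})
    return response

def rag_pipeline(query):
    docs = get_relevant_docs(query)
    response = generate_response(query, docs)
    # Log metrics that apply to the entire session / pipeline run.
    enrich_session(metrics={"pipeline_completed": True})
    return response

rag_pipeline("How do I log client-side metrics alongside my traces?")
```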
In this example, enrich_span is used to add metrics on particular steps (get_relevant_docs and generate_response), while enrich_session is used to set metrics that apply to the entire session or pipeline run.
You can learn more about logging external evaluation results here.
Offline Experiments
You can also use client-side evaluators as part of your experiment runs. In an experiment setting, your evaluator has access to the outputs (as generated by the evaluated function) and to the inputs and ground truths (as defined in your dataset).
You should define your evaluators with the appropriate parameter signature: they can accept one parameter (outputs), two parameters (outputs, inputs), or three parameters (outputs, inputs, ground_truths), depending on what data your evaluation logic requires.
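As an illustration, the three shapes might look like the sketch below; the metric logic and the query/answer keys are assumptions about your datapoint structure.

```python
def non_empty_output(outputs):
    # One parameter: only the evaluated function's output.
    return len(outputs.strip()) > 0

def mentions_query(outputs, inputs):
    # Two parameters: the output plus the datapoint's inputs.
    return inputs["query"].lower() in outputs.lower()

def exact_match(outputs, inputs, ground_truths):
    # Three parameters: the output, the inputs, and the dataset's ground truth.
    return outputs.strip().lower() == ground_truths["answer"].strip().lower()
```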
By default, evaluation results are stored at the session level. The return values of evaluator functions should represent meaningful evaluation metrics, such as numerical scores, booleans, or other significant measurements.
You can use your evaluators to evaluate a target function in a controlled setting with curated datasets, like this:
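The sketch below reuses the evaluators defined above and assumes the evaluate entry point accepts a target function, a dataset, and a list of evaluators; the argument names, the (inputs, ground_truths) signature of the target function, and the dataset contents are illustrative and may differ by SDK version.

```python
from honeyhive import evaluate

# Illustrative in-memory dataset: each datapoint carries inputs and a ground truth.
dataset = [
    {
        "inputs": {"query": "What does enrich_span do?"},
        "ground_truths": {"answer": "It attaches metrics to the current span."},
    },
    {
        "inputs": {"query": "Where are experiment results stored by default?"},
        "ground_truths": {"answer": "At the session level."},
    },
]

def answer_question(inputs, ground_truths=None):
    # Target function under evaluation -- a stand-in for your real pipeline.
    return f"Stub answer for: {inputs['query']}"

evaluate(
    function=answer_question,               # target function run once per datapoint
    hh_api_key="YOUR_API_KEY",
    hh_project="YOUR_PROJECT",
    name="client-side-evaluators-demo",     # experiment name shown in the dashboard
    dataset=dataset,
    evaluators=[non_empty_output, mentions_query, exact_match],  # evaluators from the sketch above
)
```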
This will run the experiment over the datapoints contained in your dataset and run the evaluators on the target function's output for each datapoint.
For a complete explanation of running experiments, refer to the Experiments Quickstart Example.
Multi-step Evaluation in Experiment Runs
If your experiment involves complex, multi-step pipelines, you can log metrics either at the trace level or on a per-span level to gain more detailed insights.
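Here is a sketch under the same assumptions as the earlier snippets (the trace decorator, enrich_span, and the evaluate signature shown above), with retrieval and generation stubbed out and the evaluation harness assumed to handle tracer setup for each run:

```python
from honeyhive import evaluate, trace, enrich_span

def retrieval_relevance_evaluator(query, docs):
    # Span-level evaluator: crude keyword-overlap score for the retrieval step.
    words = set(query.lower().split())
    text = " ".join(docs).lower()
    return sum(w in text for w in words) / max(len(words), 1)

def consistency_evaluator(outputs, inputs, ground_truths):
    # Trace-level evaluator: does the final answer contain the expected ground truth?
    return ground_truths["answer"].lower() in outputs.lower()

@trace
def get_relevant_docs(query):
    docs = ["enrich_span attaches metrics to the current span."]  # stand-in retriever
    # Log the retrieval metric on this span, inside the retrieval step itself.
    enrich_span(metrics={"retrieval_relevance": retrieval_relevance_evaluator(query, docs)})
    return docs

@trace
def generate_response(query, docs):
    return f"Based on {len(docs)} document(s): {docs[0]}"  # stand-in LLM call

def rag_pipeline(inputs, ground_truths=None):
    docs = get_relevant_docs(inputs["query"])
    return generate_response(inputs["query"], docs)

evaluate(
    function=rag_pipeline,                  # multi-step target function
    hh_api_key="YOUR_API_KEY",
    hh_project="YOUR_PROJECT",
    name="multi-step-rag-experiment",
    dataset=[
        {
            "inputs": {"query": "What does enrich_span do?"},
            "ground_truths": {"answer": "attaches metrics to the current span"},
        }
    ],
    evaluators=[consistency_evaluator],     # trace-level evaluator applied to the final output
)
```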
In this example, we define two evaluators: consistency_evaluator for the main rag_pipeline function, and retrieval_relevance_evaluator for the document retrieval step. The first is passed directly to evaluate(), while the second is logged via enrich_span within the retrieval step itself.
After running this script, you should be able to see both metrics displayed in your Experiments dashboard.