Client-side evaluators run within your application environment, providing immediate feedback and integration with your existing infrastructure.

Evaluators can be used in two modes:

  • online: computing real-time metrics on live application traffic
  • offline: running controlled experiments against curated datasets

For online evaluation, HoneyHive enables you to log evaluation results directly alongside your traces at various stages of your pipeline. For offline evaluation, evaluators are most effective when used with HoneyHive’s evaluation harness, which is designed to run and manage experiments seamlessly.

Online Evaluation

Once tracing is set up for your application, client-side online evaluation is straightforward: you enrich your traces and spans with additional context using the metrics field. This field accepts any custom metric expressed with primitive data types (such as numbers and booleans) or nested objects. Metrics can be logged for any type of event and at every step of your pipeline.

For example, consider a Retrieval-Augmented Generation (RAG) scenario where you have evaluator functions implemented to assess: a) retrieval quality, b) model response generation, and c) overall pipeline performance.
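
For illustration, such evaluator functions might look like the following minimal sketches; the names and scoring heuristics here are hypothetical placeholders for your own scoring logic.

# Hypothetical evaluator sketches -- replace the bodies with your own scoring logic

def score_retrieval_quality(query, docs):
    # Fraction of retrieved documents that mention at least one query term
    terms = query.lower().split()
    hits = sum(any(term in doc.lower() for term in terms) for doc in docs)
    return hits / len(docs) if docs else 0.0

def check_response_citations(response):
    # Naive check for whether the generated answer references its sources
    return "study" in response.lower() or "%" in response

def score_pipeline(query, docs, response):
    # Aggregate measurements describing the overall pipeline run
    return {
        "num_retrieved_docs": len(docs),
        "query_length": len(query.split()),
    }

In the tracing example below, the corresponding metric values are hardcoded for brevity; in practice you would call functions like these and pass their results to enrich_span or enrich_session.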

These metrics can be seamlessly passed alongside your traces within each relevant span, as shown below:

from honeyhive import HoneyHiveTracer, trace, enrich_span, enrich_session


HoneyHiveTracer.init(
    api_key="my-api-key",
    project="my-project",
)

@trace
def get_relevant_docs(query):
    # Placeholder retrieval step; replace with your retriever or vector store lookup
    medical_docs = [
        "Regular exercise reduces diabetes risk by 30%. Daily walking is recommended.",
        "Studies show morning exercises have better impact on blood sugar levels."
    ]
    # Log a span-level metric (hardcoded here; typically the output of a retrieval evaluator)
    enrich_span(metrics={"retrieval_relevance": 0.5})
    return medical_docs

@trace
def generate_response(docs, query):
    prompt = f"Question: {query}\nContext: {docs}\nAnswer:"
    # Placeholder response; replace with your LLM call using `prompt`
    response = "This is a test response."
    # Log a span-level metric for the generation step
    enrich_span(metrics={"contains_citations": True})
    return response

@trace
def rag_pipeline(query):
    docs = get_relevant_docs(query)
    response = generate_response(docs, query)

    # Add session-level metrics
    enrich_session(metrics={
        "rag_pipeline": {
            "num_retrieved_docs": len(docs),
            "query_length": len(query.split())
        }
    })

    return docs, response

def main():
    query = "How does exercise affect diabetes?"
    retrieved_docs, generated_response = rag_pipeline(query)

if __name__ == "__main__":
    main()

In this example, enrich_span is used to add metrics to individual steps (get_relevant_docs and generate_response), while enrich_session sets metrics that apply to the entire session or pipeline run.

You can learn more about logging external evaluation results here.

Offline Experiments

You can also use client-side evaluators as part of your experiment runs. In an experiment setting, your evaluator has access to the outputs (as generated by the evaluated function), the inputs, and the ground truth (as defined in your dataset).

You should define your evaluators with the appropriate parameter signature: they can accept one parameter (outputs), two parameters (outputs, inputs), or three parameters (outputs, inputs, ground_truths), depending on what data your evaluation logic requires.

def sample_evaluator(outputs, inputs, ground_truths):
    # Code here
    return random.randint(1, 5)

By default, evaluation results are stored at the session level. Evaluator return values should represent meaningful evaluation metrics, such as numeric scores, booleans, or other quantitative measurements.
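
As a minimal sketch of these signature and return-type variants (the function names and scoring rules below are illustrative, not part of the HoneyHive SDK):

def length_check(outputs):
    # One parameter: numeric metric computed from the output alone
    return len(str(outputs).split())

def mentions_query(outputs, inputs):
    # Two parameters: boolean metric that also uses the datapoint inputs
    return inputs.get("query", "").lower() in str(outputs).lower()

def exact_match(outputs, inputs, ground_truths):
    # Three parameters: boolean metric comparing the output to the expected response
    return outputs == ground_truths.get("response")

Any of these can be passed in the evaluators list of evaluate(), as shown below.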

You can then run your evaluators against a target function in a controlled setting with curated datasets, like this:

    from honeyhive import evaluate

    import random

    def sample_evaluator(outputs, inputs, ground_truths):
        # Code here
        return random.randint(1, 5)


    # Function to be evaluated
    # inputs -> receives each datapoint's input values as a dict
    # (optional) ground_truths -> receives the ground truth values for that datapoint
    def function_to_evaluate(inputs, ground_truths):
        complete_prompt = f"You are an expert analyst specializing in {inputs['product_type']} market trends. Could you provide an analysis of the current market performance and consumer reception of {inputs['product_type']} in {inputs['region']}?"
        # Placeholder response; replace with your LLM call using `complete_prompt`
        response = "This is a test response."
        return response

    if __name__ == "__main__":
        # Run experiment
        evaluate(
            function = function_to_evaluate,               # Function to be evaluated
            hh_api_key = '<HONEYHIVE_API_KEY>',
            hh_project = '<HONEYHIVE_PROJECT>',
            name = 'Sample Experiment',
            dataset_id = '<DATASET_ID>',                      # this example assumes the existence of a managed dataset in HoneyHive
            evaluators=[sample_evaluator]                 # to compute client-side metrics on each run
        )

This will run the experiment over the datapoints contained in your dataset and apply your evaluators to the target function's output for each datapoint.

For a complete explanation of running experiments, refer to the Experiments Quickstart Example.

Multi-step Evaluation in Experiment Runs

If your experiment involves complex, multi-step pipelines, you can log metrics at the trace level, at the span level, or both, to gain more detailed insights.

In this example, we define two evaluators: consistency_evaluator for the main rag_pipeline function, and retrieval_relevance_evaluator for the document retrieval step. The first is passed directly to evaluate(), while the second is called inside the retrieval step and its result is logged with enrich_span.

from honeyhive import evaluate, evaluator, trace, enrich_span

def retrieval_relevance_evaluator(query, docs):
    # Placeholder scoring logic; replace with your own relevance computation
    avg_relevance = 0.5
    return avg_relevance

@evaluator()
def consistency_evaluator(outputs, inputs, ground_truths):
    # Placeholder scoring logic; replace with your own consistency check
    consistency_score = 0.66
    return consistency_score


@trace
def get_relevant_docs(query):
    retrieved_docs = [
        "Regular exercise reduces diabetes risk by 30%. Daily walking is recommended.",
        "Studies show morning exercises have better impact on blood sugar levels."
    ]
    retrieval_relevance = retrieval_relevance_evaluator(query, retrieved_docs)
    enrich_span(metrics={"retrieval_relevance": retrieval_relevance})
    return retrieved_docs

def generate_response(docs, query):
    prompt = f"Question: {query}\nContext: {docs}\nAnswer:"
    response = "This is a test response"
    return response

def rag_pipeline(inputs, ground_truths):
    query = inputs["query"]
    docs = get_relevant_docs(query)
    response = generate_response(docs, query)

    return response

dataset = [
    {
        "inputs": {
            "query": "How does exercise affect diabetes?",
        },
        "ground_truths": {
            "response": "Regular exercise reduces diabetes risk by 30%. Daily walking is recommended.",
        }
    },
]


if __name__ == "__main__":
    # Run experiment
    evaluate(
        function = rag_pipeline,               # Function to be evaluated
        hh_api_key = '<your-api-key>',
        hh_project = '<your-project-name>',
        name = 'Multi Step Evals',
        dataset = dataset,
        evaluators=[consistency_evaluator],                 # to compute client-side metrics on each run
    )

After running this script, you should be able to see both metrics displayed in your Experiments dashboard.

Next Steps