Client-side evaluators run within your application environment, providing immediate feedback and integration with your existing infrastructure. Evaluators can be utilized either:
a) as online evaluations, by enriching your traces and spans with custom metrics, or
b) as part of your experiments, to score a target function over a curated dataset.
Once tracing is set up for your application, performing client-side online evaluations becomes straightforward. It simply involves enriching your traces and spans with additional context using the metrics field. This field allows you to pass any custom metric, using any of the primary data types. Metrics can be logged for any type of event and at every step of your pipeline.

For example, consider a Retrieval-Augmented Generation (RAG) scenario where you have evaluator functions implemented to assess:
a) retrieval quality,
b) model response generation, and
c) overall pipeline performance.

These metrics can be seamlessly passed alongside your traces within each relevant span, as shown below:
```python
from honeyhive import HoneyHiveTracer, trace, enrich_span, enrich_session

HoneyHiveTracer.init(
    api_key="my-api-key",
    project="my-project",
)

@trace
def get_relevant_docs(query):
    medical_docs = [
        "Regular exercise reduces diabetes risk by 30%. Daily walking is recommended.",
        "Studies show morning exercises have better impact on blood sugar levels."
    ]
    enrich_span(metrics={"retrieval_relevance": 0.5})
    return medical_docs

@trace
def generate_response(docs, query):
    prompt = f"Question: {query}\nContext: {docs}\nAnswer:"
    response = "This is a test response."
    enrich_span(metrics={"contains_citations": True})
    return response

@trace
def rag_pipeline(query):
    docs = get_relevant_docs(query)
    response = generate_response(docs, query)
    # Add session-level metrics
    enrich_session(metrics={
        "rag_pipeline": {
            "num_retrieved_docs": len(docs),
            "query_length": len(query.split())
        }
    })
    return docs, response

def main():
    query = "How does exercise affect diabetes?"
    retrieved_docs, generated_response = rag_pipeline(query)

if __name__ == "__main__":
    main()
```
In this example, enrich_span is used to add metrics on particular steps (get_relevant_docs and generate_response), while enrich_session is used to set metrics that apply to the entire session or pipeline run.
You can learn more about logging external evaluation results here.
You can also use client-side evaluators as part of your experiments. In an experiment setting, your evaluator will have access to the outputs (as generated by the evaluated function), as well as the inputs and ground truths (as defined in your dataset).

You should define your evaluators with the appropriate parameter signature: they can accept one parameter (outputs), two parameters (outputs, inputs), or three parameters (outputs, inputs, ground_truths), depending on what data your evaluation logic requires.
```python
def sample_evaluator(outputs, inputs, ground_truths):
    # Code here
    return random.randint(1, 5)
```
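The shorter signatures follow the same pattern. The evaluators below are hypothetical examples with purely illustrative metrics:

```python
# Evaluator that only needs the generated outputs
def response_length_evaluator(outputs):
    return len(str(outputs))

# Evaluator that also inspects the inputs that produced the outputs
# (assumes the dataset provides a "query" field)
def mentions_query_evaluator(outputs, inputs):
    return inputs["query"].lower() in str(outputs).lower()
```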
By default, evaluation results are stored at the session level. The return values of evaluator functions should represent meaningful evaluation metrics, such as numerical scores, booleans, or other significant measurements.

You can use your evaluators to evaluate a target function in a controlled setting with curated datasets, like this:
```python
from honeyhive import evaluate
import random

def sample_evaluator(outputs, inputs, ground_truths):
    # Code here
    return random.randint(1, 5)

# Create function to be evaluated
# inputs -> parameter to which datapoint or json value will be passed
# (optional) ground_truths -> ground truth value for the input
def function_to_evaluate(inputs, ground_truths):
    complete_prompt = f"You are an expert analyst specializing in {inputs['product_type']} market trends. Could you provide an analysis of the current market performance and consumer reception of {inputs['product_type']} in {inputs['region']}?"
    response = "This is a test response."
    return response

if __name__ == "__main__":
    # Run experiment
    evaluate(
        function=function_to_evaluate,     # Function to be evaluated
        hh_api_key='<HONEYHIVE_API_KEY>',
        hh_project='<HONEYHIVE_PROJECT>',
        name='Sample Experiment',
        dataset_id='<DATASET_ID>',         # this example assumes the existence of a managed dataset in HoneyHive
        evaluators=[sample_evaluator]      # to compute client-side metrics on each run
    )
```
This will run the experiment with the datapoints contained in your dataset and run the evaluation on the target function's output for each datapoint.
If your experiment involves complex, multi-step pipelines, you can log metrics either at the trace level or on a per-span level to gain more detailed insights.

In this example, we define two evaluators: consistency_evaluator for the main rag_pipeline function, and retrieval_relevance_evaluator for the document retrieval step. The first is passed directly to evaluate(), while the second is computed inside the retrieval step and logged with enrich_span.
```python
from honeyhive import evaluate, evaluator
from honeyhive import trace, enrich_span

def retrieval_relevance_evaluator(query, docs):
    # code here
    avg_relevance = 0.5
    return avg_relevance

@evaluator()
def consistency_evaluator(outputs, inputs, ground_truths):
    # code here
    consistency_score = 0.66
    return consistency_score

@trace
def get_relevant_docs(query):
    retrieved_docs = [
        "Regular exercise reduces diabetes risk by 30%. Daily walking is recommended.",
        "Studies show morning exercises have better impact on blood sugar levels."
    ]
    retrieval_relevance = retrieval_relevance_evaluator(query, retrieved_docs)
    enrich_span(metrics={"retrieval_relevance": retrieval_relevance})
    return retrieved_docs

def generate_response(docs, query):
    prompt = f"Question: {query}\nContext: {docs}\nAnswer:"
    response = "This is a test response"
    return response

def rag_pipeline(inputs, ground_truths):
    query = inputs["query"]
    docs = get_relevant_docs(query)
    response = generate_response(docs, query)
    return response

dataset = [
    {
        "inputs": {
            "query": "How does exercise affect diabetes?",
        },
        "ground_truths": {
            "response": "Regular exercise reduces diabetes risk by 30%. Daily walking is recommended.",
        }
    },
]

if __name__ == "__main__":
    # Run experiment
    evaluate(
        function=rag_pipeline,              # Function to be evaluated
        hh_api_key='<your-api-key>',
        hh_project='<your-project-name>',
        name='Multi Step Evals',
        dataset=dataset,
        evaluators=[consistency_evaluator], # to compute client-side metrics on each run
    )
```
After running this script, you should be able to see both metrics displayed in your Experiments dashboard.
The same workflow applies when using the TypeScript SDK. As in the Python example above, the RAG pipeline's metrics are passed alongside your traces within each relevant span, as shown below:
```typescript
import {
  HoneyHiveTracer,
  traceTool,
  traceModel,
  traceChain,
  enrichSpan,
  enrichSession
} from "honeyhive";

// Keep interfaces used in the functions
interface MedicalDocument {
  docs: string[];
  response: string;
}

interface RagPipelineMetrics {
  num_retrieved_docs: number;
  query_length: number;
}

// Initialize tracer
// Ensure HH_API_KEY and HH_PROJECT are set in your environment
const tracer = await HoneyHiveTracer.init({
  sessionName: "online-client-evals",
  // apiKey and project will be picked from environment variables
});

// Define the getRelevantDocs function with traceTool
const getRelevantDocs = traceTool(function getRelevantDocs(
  query: string
): string[] {
  const medicalDocs = [
    "Regular exercise reduces diabetes risk by 30%. Daily walking is recommended.",
    "Studies show morning exercises have better impact on blood sugar levels."
  ];
  enrichSpan({ metrics: { retrieval_relevance: 0.5 } });
  return medicalDocs;
});

// Define generateResponse with traceModel (or traceTool if not an LLM call)
const generateResponse = traceModel(function generateResponse(
  docs: string[],
  query: string
): string {
  const prompt = `Question: ${query}\nContext: ${docs}\nAnswer:`;
  const response = "This is a test response.";
  enrichSpan({ metrics: { contains_citations: true } });
  return response;
});

// Define ragPipeline with traceChain
const ragPipeline = traceChain(function ragPipeline(
  query: string
): MedicalDocument {
  const docs = getRelevantDocs(query);
  const response = generateResponse(docs, query);
  enrichSession({
    metrics: {
      rag_pipeline: {
        num_retrieved_docs: docs.length,
        query_length: query.split(" ").length
      } as RagPipelineMetrics
    }
  });
  return { docs, response };
});

// --- Main Execution Logic ---
// Wrap the execution in tracer.trace() to establish context
await tracer.trace(async () => {
  const query = "How does exercise affect diabetes?";
  await ragPipeline(query); // Assuming ragPipeline might become async
});

// Don't forget to flush the tracer if your script exits immediately after
// await tracer.flush();
```
Previously, tracing and enrichment involved calling methods directly on the tracer instance (e.g., tracer.traceFunction(), tracer.enrichSpan()). While this pattern still works, it is now deprecated and will be removed in a future major version.

Please update your code to use the imported functions (traceTool, traceModel, traceChain, enrichSpan, enrichSession) along with the tracer.trace() wrapper as shown in the example above. This new approach simplifies usage within nested functions by not requiring the tracer instance to be passed around.

Example of the deprecated pattern:
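The sketch below is a minimal, illustrative reconstruction of the deprecated instance-method pattern, based only on the method names mentioned above (tracer.traceFunction(), tracer.enrichSpan()); the exact signatures may differ across SDK versions, so treat it as illustrative rather than a reference:

```typescript
import { HoneyHiveTracer } from "honeyhive";

const tracer = await HoneyHiveTracer.init({ sessionName: "online-client-evals" });

// Deprecated pattern (illustrative): wrapping and enrichment go through the tracer
// instance itself, so nested helpers need access to `tracer`. Signatures are assumed.
const getRelevantDocs = tracer.traceFunction()(function getRelevantDocs(query: string): string[] {
  tracer.enrichSpan({ metrics: { retrieval_relevance: 0.5 } });
  return ["..."];
});
```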
In this example, enrichSpan is used to add metrics on particular steps (getRelevantDocs and generateResponse), while enrichSession is used to set metrics that apply to the entire session or pipeline run.
You can also use client-side evaluators as part of your experiments. In an experiment setting, your evaluator will have access to the output (as generated by the evaluated function) and the input (as defined in your dataset).

You should define your evaluators with the appropriate parameter signature: they accept two parameters (input, output), where input contains the data passed to your function and output contains the result returned by your function.
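As a rough illustration, an evaluator with this two-parameter signature might look like the following sketch; the function name and the "query" input field are hypothetical, and the metric it returns is purely illustrative:

```typescript
// Illustrative client-side evaluator using the (input, output) signature described above.
// `input` is the datapoint passed to your function; `output` is what your function returned.
function sampleEvaluator(input: Record<string, unknown>, output: string): number {
  // Code here: return any meaningful metric (a numeric score, a boolean, etc.)
  const query = String(input["query"] ?? "");
  return output.toLowerCase().includes(query.toLowerCase()) ? 1 : 0;
}
```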
This will run the experiment with the datapoints contained in your dataset and run the evaluation on the target function's output for each datapoint.