Online Evaluation
Once tracing is set up for your application, client-side online evaluation is straightforward: you enrich your traces and spans with additional context through the metrics field. This field accepts any custom metric expressed with primitive data types, and metrics can be logged for any type of event at every step of your pipeline.
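Concretely, a single enrich_span call can attach numeric, boolean, string, or nested values to the current span. The snippet below is a minimal sketch with illustrative metric names, assuming the tracer has already been initialized and the call happens inside a traced function:
from honeyhive import trace, enrich_span

@trace
def grade_answer(answer):
    # Illustrative metrics; values may be numbers, booleans, strings, or nested objects
    enrich_span(metrics={
        "relevance_score": 0.82,
        "contains_citations": True,
        "review_status": "approved",
        "token_counts": {"prompt": 412, "completion": 128},
    })
    return answer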
For example, consider a Retrieval-Augmented Generation (RAG) scenario where you have evaluator functions implemented to assess:
a) retrieval quality,
b) model response generation, and
c) overall pipeline performance.
These metrics can be seamlessly passed alongside your traces within each relevant span, as shown below:
from honeyhive import HoneyHiveTracer, trace, enrich_span, enrich_session

HoneyHiveTracer.init(
    api_key="my-api-key",
    project="my-project",
)

@trace
def get_relevant_docs(query):
    medical_docs = [
        "Regular exercise reduces diabetes risk by 30%. Daily walking is recommended.",
        "Studies show morning exercises have better impact on blood sugar levels."
    ]
    enrich_span(metrics={"retrieval_relevance": 0.5})
    return medical_docs

@trace
def generate_response(docs, query):
    prompt = f"Question: {query}\nContext: {docs}\nAnswer:"
    response = "This is a test response."
    enrich_span(metrics={"contains_citations": True})
    return response

@trace
def rag_pipeline(query):
    docs = get_relevant_docs(query)
    response = generate_response(docs, query)
    # Add session-level metrics
    enrich_session(metrics={
        "rag_pipeline": {
            "num_retrieved_docs": len(docs),
            "query_length": len(query.split())
        }
    })
    return docs, response

def main():
    query = "How does exercise affect diabetes?"
    retrieved_docs, generated_response = rag_pipeline(query)

if __name__ == "__main__":
    main()
In this example, enrich_span is used to add metrics on particular steps (get_relevant_docs and generate_response), while enrich_session is used to set metrics that apply to the entire session or pipeline run.
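Metric values don't have to be hard-coded; you can compute them at runtime with your own client-side evaluator functions before attaching them. Below is a minimal sketch, where keyword_overlap is a hypothetical relevance heuristic rather than a HoneyHive API:
from honeyhive import trace, enrich_span

def keyword_overlap(query, docs):
    # Hypothetical heuristic: fraction of query words that appear in the retrieved docs
    query_words = set(query.lower().split())
    doc_words = set(" ".join(docs).lower().split())
    return len(query_words & doc_words) / max(len(query_words), 1)

@trace
def get_relevant_docs(query):
    docs = ["Regular exercise reduces diabetes risk by 30%. Daily walking is recommended."]
    enrich_span(metrics={"retrieval_relevance": keyword_overlap(query, docs)})
    return docs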
You can learn more about logging external evaluation results here.
Offline Experiments
You can also use client-side evaluators as part of your experiment sessions. In an experiment setting, your evaluator has access to the outputs (as generated by the evaluated function) as well as the inputs and ground truths (as defined in your dataset).
You should define your evaluators with the appropriate parameter signature: they can accept one parameter (outputs), two parameters (outputs, inputs), or three parameters (outputs, inputs, ground_truths), depending on what data your evaluation logic requires.
import random

def sample_evaluator(outputs, inputs, ground_truths):
    # Code here
    return random.randint(1, 5)
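Evaluators that need less context simply omit the trailing parameters. The sketch below shows the one- and two-parameter forms with illustrative scoring logic:
def verbosity_evaluator(outputs):
    # One-parameter form: only the evaluated function's output is available
    return len(str(outputs).split())

def mentions_region_evaluator(outputs, inputs):
    # Two-parameter form: the datapoint's inputs are available as well
    return inputs.get("region", "") in str(outputs)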
By default, evaluation results are stored at the session level. The return values of evaluator functions should represent meaningful evaluation metrics, such as numerical scores, booleans, or other significant measurements.
You can use your evaluators to evaluate a target function in a controlled setting with curated datasets, like this:
from honeyhive import evaluate
import random

def sample_evaluator(outputs, inputs, ground_truths):
    # Code here
    return random.randint(1, 5)

# Create function to be evaluated
# inputs -> parameter to which datapoint or json value will be passed
# (optional) ground_truths -> ground truth value for the input
def function_to_evaluate(inputs, ground_truths):
    complete_prompt = f"You are an expert analyst specializing in {inputs['product_type']} market trends. Could you provide an analysis of the current market performance and consumer reception of {inputs['product_type']} in {inputs['region']}?"
    response = "This is a test response."
    return response

if __name__ == "__main__":
    # Run experiment
    evaluate(
        function=function_to_evaluate,    # Function to be evaluated
        hh_api_key='<HONEYHIVE_API_KEY>',
        hh_project='<HONEYHIVE_PROJECT>',
        name='Sample Experiment',
        dataset_id='<DATASET_ID>',        # this example assumes the existence of a managed dataset in HoneyHive
        evaluators=[sample_evaluator]     # to compute client-side metrics on each run
    )
This will run the experiment over the datapoints contained in your dataset and run the evaluators on the target function's output for each datapoint.
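If you don't have a managed dataset in HoneyHive, evaluate() also accepts an inline dataset (as in the multi-step example below). A sketch reusing function_to_evaluate and sample_evaluator from the snippet above, with illustrative datapoints, might look like this:
# Illustrative inline dataset; each datapoint's "inputs" is passed to the evaluated function
dataset = [
    {
        "inputs": {"product_type": "electric vehicles", "region": "western europe"},
        "ground_truths": {"response": "Sample reference analysis."},
    },
]

evaluate(
    function=function_to_evaluate,
    hh_api_key='<HONEYHIVE_API_KEY>',
    hh_project='<HONEYHIVE_PROJECT>',
    name='Sample Experiment (inline dataset)',
    dataset=dataset,                 # passed instead of dataset_id
    evaluators=[sample_evaluator],
)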
Multi-step Evaluation in Experiment Runs
If your experiment involves complex, multi-step pipelines, you can log metrics either at the trace level or on a per-span level to gain more detailed insights.
In this example, we define two evaluators: consistency_evaluator for the main rag_pipeline function, and retrieval_relevance_evaluator for the document retrieval step. The first is passed directly to evaluate(), while the second is computed inside the retrieval step and attached to its span via enrich_span.
from honeyhive import evaluate, evaluator
from honeyhive import trace, enrich_span

def retrieval_relevance_evaluator(query, docs):
    # code here
    avg_relevance = 0.5
    return avg_relevance

@evaluator()
def consistency_evaluator(outputs, inputs, ground_truths):
    # code here
    consistency_score = 0.66
    return consistency_score

@trace
def get_relevant_docs(query):
    retrieved_docs = [
        "Regular exercise reduces diabetes risk by 30%. Daily walking is recommended.",
        "Studies show morning exercises have better impact on blood sugar levels."
    ]
    retrieval_relevance = retrieval_relevance_evaluator(query, retrieved_docs)
    enrich_span(metrics={"retrieval_relevance": retrieval_relevance})
    return retrieved_docs

def generate_response(docs, query):
    prompt = f"Question: {query}\nContext: {docs}\nAnswer:"
    response = "This is a test response"
    return response

def rag_pipeline(inputs, ground_truths):
    query = inputs["query"]
    docs = get_relevant_docs(query)
    response = generate_response(docs, query)
    return response

dataset = [
    {
        "inputs": {
            "query": "How does exercise affect diabetes?",
        },
        "ground_truths": {
            "response": "Regular exercise reduces diabetes risk by 30%. Daily walking is recommended.",
        }
    },
]

if __name__ == "__main__":
    # Run experiment
    evaluate(
        function=rag_pipeline,               # Function to be evaluated
        hh_api_key='<your-api-key>',
        hh_project='<your-project-name>',
        name='Multi Step Evals',
        dataset=dataset,
        evaluators=[consistency_evaluator],  # to compute client-side metrics on each run
    )
After running this script, you should be able to see both metrics displayed in your Experiments dashboard.
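If you want consistency_evaluator to use the dataset's ground truth rather than return a fixed score, a minimal sketch might compare the generated response against the reference response (the word-overlap heuristic is purely illustrative):
from honeyhive import evaluator

@evaluator()
def consistency_evaluator(outputs, inputs, ground_truths):
    # Illustrative heuristic: word overlap between the output and the reference response
    reference_words = set(ground_truths.get("response", "").lower().split())
    output_words = set(str(outputs).lower().split())
    if not reference_words:
        return 0.0
    return len(output_words & reference_words) / len(reference_words)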
Online Evaluation (TypeScript)
The same workflow applies when using the HoneyHive TypeScript SDK: once tracing is set up, you enrich your spans and the session with custom metrics at each step of your pipeline. The example below mirrors the RAG scenario above, with metrics for retrieval quality, response generation, and overall pipeline performance:
import { HoneyHiveTracer, traceTool, traceModel, traceChain, enrichSpan, enrichSession } from "honeyhive";

// Keep interfaces used in the functions
interface MedicalDocument {
    docs: string[];
    response: string;
}

interface RagPipelineMetrics {
    num_retrieved_docs: number;
    query_length: number;
}

// Initialize tracer
// Ensure HH_API_KEY and HH_PROJECT are set in your environment
const tracer = await HoneyHiveTracer.init({
    sessionName: "online-client-evals",
    // apiKey and project will be picked up from environment variables
});

// Define getRelevantDocs with traceTool
const getRelevantDocs = traceTool(function getRelevantDocs(
    query: string
): string[] {
    const medicalDocs = [
        "Regular exercise reduces diabetes risk by 30%. Daily walking is recommended.",
        "Studies show morning exercises have better impact on blood sugar levels."
    ];
    enrichSpan({
        metrics: { retrieval_relevance: 0.5 }
    });
    return medicalDocs;
});

// Define generateResponse with traceModel (or traceTool if not an LLM call)
const generateResponse = traceModel(function generateResponse(
    docs: string[],
    query: string
): string {
    const prompt = `Question: ${query}\nContext: ${docs}\nAnswer:`;
    const response = "This is a test response.";
    enrichSpan({
        metrics: { contains_citations: true }
    });
    return response;
});

// Define ragPipeline with traceChain
const ragPipeline = traceChain(function ragPipeline(
    query: string
): MedicalDocument {
    const docs = getRelevantDocs(query);
    const response = generateResponse(docs, query);
    enrichSession({
        metrics: {
            rag_pipeline: {
                num_retrieved_docs: docs.length,
                query_length: query.split(" ").length
            } as RagPipelineMetrics
        }
    });
    return { docs, response };
});

// --- Main Execution Logic ---
// Wrap the execution in tracer.trace() to establish context
await tracer.trace(async () => {
    const query = "How does exercise affect diabetes?";
    await ragPipeline(query); // Assuming ragPipeline might become async
});

// Don't forget to flush the tracer if your script exits immediately after
// await tracer.flush();
Previously, tracing and enrichment involved calling methods directly on the tracer instance (e.g., tracer.traceFunction(), tracer.enrichSpan()). While this pattern still works, it is now deprecated and will be removed in a future major version.
Please update your code to use the imported functions (traceTool, traceModel, traceChain, enrichSpan, enrichSession) along with the tracer.trace() wrapper, as shown in the example above. This new approach simplifies usage within nested functions by not requiring the tracer instance to be passed around.
Example of the deprecated pattern:
// OLD (DEPRECATED) PATTERN:
// const tracer = await HoneyHiveTracer.init({...});
// const getRelevantDocs = tracer.traceFunction()(function getRelevantDocs(...) { ... });
// tracer.enrichSpan({...});
// tracer.enrichSession({...});
In this example, enrichSpan is used to add metrics on particular steps (getRelevantDocs and generateResponse), while enrichSession is used to set metrics that apply to the entire session or pipeline run.
Offline Experiments (TypeScript)
You can also use client-side evaluators as part of your experiment sessions in the TypeScript SDK. In an experiment setting, your evaluator has access to the outputs (as generated by the evaluated function) and the inputs (as defined in your dataset).
You should define your evaluators with the two-parameter signature (input, output), where input contains the data passed to your function and output contains the result returned by your function.
import { evaluate } from "honeyhive"; // assumes evaluate is exported from the honeyhive package

interface MarketAnalysisInput {
    product_type: string;
    region: string;
}

interface MarketAnalysisOutput {
    content: string;
    role: string;
}

interface EvaluatorMetrics {
    sample_metric: number;
    sample_metric_2: boolean;
}

export async function functionToEvaluate(input: MarketAnalysisInput): Promise<MarketAnalysisOutput> {
    try {
        const dummyResponse: MarketAnalysisOutput = {
            content: `This is a simulated analysis of ${input.product_type} in ${input.region}.
            Market trends show significant growth with increasing consumer adoption.
            Regional challenges include supply chain constraints and regulatory considerations.`,
            role: "assistant"
        };
        return dummyResponse;
    } catch (error) {
        console.error('Error in function:', error);
        throw error;
    }
}

const dataset: MarketAnalysisInput[] = [
    {
        product_type: "electric vehicles",
        region: "western europe"
    },
    {
        product_type: "gaming consoles",
        region: "north america"
    }
];

function sampleEvaluator(input: MarketAnalysisInput, output: MarketAnalysisOutput): EvaluatorMetrics {
    return {
        sample_metric: 0.5,
        sample_metric_2: true
    };
}

evaluate({
    evaluationFunction: functionToEvaluate,
    hh_api_key: '<HONEYHIVE_API_KEY>',
    hh_project: '<HONEYHIVE_PROJECT>',
    name: 'Sample Experiment',
    dataset: dataset,
    evaluators: [sampleEvaluator],
    server_url: '<HONEYHIVE_SERVER_URL>'
});
This will run the experiment over the datapoints contained in your dataset and run the evaluators on the target function's output for each datapoint.