Reference documentation for the evaluate function
# evaluate

- `evaluate`: function for orchestrating evaluation runs
- `evaluator`: decorator for creating custom evaluation metrics
## evaluate Function

The `evaluate` function integrates with HoneyHive's tracing system to capture detailed telemetry about each evaluation run, including inputs, outputs, metrics, and runtime metadata.
A detailed explanation of how tracing works in Python can be found here.
**Parameters:**

- **Function under evaluation** (`Callable`): The function to evaluate. Must return a serializable output. This function will be executed for each datapoint in the dataset. Its parameters are positional arguments and must be specified in this order (see the sketch after this parameter list):
  - `inputs`: dictionary of inputs
  - `ground_truths` (optional): dictionary of ground truth values
- **API key** (`str`, optional): API key for authenticating with HoneyHive services. If not provided, falls back to the `HH_API_KEY` environment variable.
- **Project** (`str`, optional): Project identifier in HoneyHive. If not provided, falls back to the `HH_PROJECT` environment variable.
- **Run name** (`str`, optional): Identifier for this evaluation run. Used in HoneyHive's tracing and run management.
- **`dataset_id`** (`str`, optional): ID of an existing HoneyHive dataset to use for evaluation inputs. Mutually exclusive with `dataset`.
- **`dataset`** (`List[Dict[str, Any]]`, optional): List of input dictionaries to evaluate against. Each dictionary should have an `inputs` key and optionally a `ground_truths` key. An alternative to using a HoneyHive dataset through `dataset_id`; see the sketch after this parameter list.
- **Evaluators** (`List[Callable]`, optional): List of evaluator functions that process inputs and outputs to generate metrics. Each evaluator can be defined with 1, 2, or 3 positional parameters, in this order: `outputs`, `inputs` (optional), `ground_truths` (optional). Evaluators can be regular functions or functions decorated with `@evaluator` (which tracks additional settings and metadata). Example signatures appear in the sketch after this parameter list.
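To make the parameter descriptions above concrete, here is a minimal sketch (not taken from the HoneyHive docs) of the three pieces they describe: a function under evaluation taking positional `inputs` and `ground_truths` arguments, an inline `dataset` whose entries carry `inputs` and optional `ground_truths` keys, and evaluator functions written with one, two, and three parameters. All names and data are illustrative.

```python
from typing import Any, Dict, Optional

# Function under evaluation. Its positional parameters must be, in order:
# inputs, then (optionally) ground_truths. The body is a hypothetical stand-in
# for a real pipeline; it just echoes the question so the example is runnable.
def answer_question(inputs: Dict[str, Any], ground_truths: Optional[Dict[str, Any]] = None) -> str:
    question = inputs["question"]
    return f"Echo: {question}"  # must be a serializable output

# Inline dataset: each entry has an "inputs" key and, optionally, a "ground_truths" key.
dataset = [
    {"inputs": {"question": "What is 2 + 2?"}, "ground_truths": {"answer": "4"}},
    {"inputs": {"question": "What is the capital of France?"}, "ground_truths": {"answer": "Paris"}},
]

# Evaluators may take 1, 2, or 3 positional parameters (outputs, inputs, ground_truths):
def output_length(outputs):
    return len(str(outputs))

def mentions_question(outputs, inputs):
    return inputs["question"].lower() in str(outputs).lower()

def exact_match(outputs, inputs, ground_truths):
    return str(outputs).strip() == ground_truths["answer"]
```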
**Returns:**

An `EvaluationResult` object containing the results of the evaluation run.
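Continuing the sketch above, a run could then be launched roughly as follows. The import path and the keyword names `function`, `name`, `dataset`, and `evaluators` are assumptions based on the parameter descriptions on this page; the API key and project are supplied through the `HH_API_KEY` and `HH_PROJECT` environment variables, per the fallback behaviour described above.

```python
from honeyhive import evaluate

# answer_question, dataset, and the evaluator functions come from the sketch above.
# HH_API_KEY and HH_PROJECT are read from the environment when not passed explicitly.
result = evaluate(
    function=answer_question,  # assumed keyword for the function under evaluation
    name="qa-baseline",        # assumed keyword for the run identifier
    dataset=dataset,           # inline datapoints; pass dataset_id instead to use a hosted dataset
    evaluators=[output_length, mentions_question, exact_match],  # assumed keyword
)
# result is the EvaluationResult described above.
```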
## evaluator Decorator

The `evaluator` decorator provides a flexible way to wrap functions with evaluation logic, enabling result transformation, aggregation, validation, and repetition. It can be used both within formal experiments via `evaluate()` and as a standalone metric computation tool.
**Evaluator function parameters:**

- `outputs` (required): The model's output to evaluate. Type: `Any`, commonly `str`, `dict`, `list`, or custom output types.
- `inputs` (optional): Input context. Type: `Dict[str, Any]`.
- `ground_truth` (optional): Expected results. Type: `Dict[str, Any]`.
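As a rough illustration of standalone usage, the sketch below decorates a three-parameter metric and calls it directly, per the "standalone metric computation" usage described above. The import path and the bare-decorator form are assumptions; the data is illustrative.

```python
from honeyhive import evaluator

@evaluator
def answer_correctness(outputs, inputs, ground_truth):
    # 1.0 when the model output matches the expected answer, else 0.0.
    return float(str(outputs).strip().lower() == ground_truth["answer"].strip().lower())

# Standalone metric computation, outside of evaluate():
score = answer_correctness(
    "Paris",
    {"question": "What is the capital of France?"},
    {"answer": "Paris"},
)
print(score)  # 1.0 if the decorator returns the metric value unchanged
```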
**Settings:**

| Setting | Type | Description | Example |
|---|---|---|---|
| `transform` | `str` | Expression to transform the output value. Useful for mapping/filtering your output. | `"value * 2"` |
| `aggregate` | `str` | Expression to aggregate multiple results (when `repeat` > 1). | `"sum(values)"` |
| `checker` | `str` | Expression to validate results. | `"value in target"` |
| `target` | `str` or `list` | Target value for validation. | `[4, 5]` |
| `repeat` | `int` | Number of times to repeat the evaluation. | `3` |
| `weight` | `float` | Importance weight of the evaluator. | `0.5` |
| `asserts` | `bool` | Whether to apply the `assert` keyword to the final output. | `True` |
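The sketch below combines several of the settings above, assuming they are passed as keyword arguments to the decorator: the evaluator runs three times, each result is transformed, the transformed values are summed, and the sum is checked against a target, with `asserts=True` turning a failed check into an assertion error. The evaluator body, target values, and data fields are illustrative.

```python
from honeyhive import evaluator

@evaluator(
    repeat=3,                   # run the metric three times
    transform="value * 2",      # transform each individual result
    aggregate="sum(values)",    # combine the repeated, transformed results
    checker="value in target",  # validate the aggregated value
    target=[4, 5, 6],           # acceptable aggregated values (illustrative)
    weight=0.5,                 # importance weight of this evaluator
    asserts=True,               # raise an AssertionError when the check fails
)
def response_quality(outputs, inputs, ground_truth):
    # Illustrative 0/1 score: with the settings above, each run's score is doubled,
    # the three doubled scores are summed, and the sum must fall within `target`.
    return int(str(outputs).strip() != "")
```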
**Requirements:** `evaluate` requires either `dataset_id` or `dataset` to be provided.