Python
Reference documentation for the `evaluate` function
HoneyHive’s Python SDK provides a comprehensive evaluation framework for testing, tracing, and collecting metrics on AI model outputs, data processing pipelines, or any computational process requiring detailed performance analysis.
The framework consists of two main components:
- The `evaluate` function for orchestrating evaluation runs
- The `evaluator` decorator for creating custom evaluation metrics
The `evaluate` Function
The `evaluate` function integrates with HoneyHive’s tracing system to capture detailed telemetry about each evaluation run, including inputs, outputs, metrics, and runtime metadata.
A detailed explanation of how tracing works in Python can be found here.
Example Usage
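A minimal sketch of an evaluation run, assuming `evaluate` is importable from the top-level `honeyhive` package; the function and evaluator names below are placeholders, and the keyword names follow the parameter list later in this page, so confirm them against your installed SDK version:

```python
from honeyhive import evaluate

def answer_question(inputs, ground_truths=None):
    # Placeholder function under test; replace with your model or pipeline call.
    return f"You asked: {inputs['question']}"

def exact_match(outputs, inputs, ground_truths):
    # Plain-function evaluator: 1.0 if the output equals the expected answer.
    return 1.0 if outputs == ground_truths.get("answer") else 0.0

result = evaluate(
    function=answer_question,
    hh_api_key="<HH_API_KEY>",   # or set the HH_API_KEY environment variable
    hh_project="<HH_PROJECT>",   # or set the HH_PROJECT environment variable
    name="sample-eval-run",
    dataset=[
        {"inputs": {"question": "What is 2 + 2?"}, "ground_truths": {"answer": "4"}},
        {"inputs": {"question": "Capital of France?"}, "ground_truths": {"answer": "Paris"}},
    ],
    evaluators=[exact_match],
    max_workers=4,
)
```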
Function Signature
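The signature below is a sketch reconstructed from the parameter descriptions that follow; the exact parameter ordering and defaults may differ between SDK versions:

```python
from typing import Any, Callable, Dict, List, Optional

def evaluate(
    function: Callable,
    hh_api_key: Optional[str] = None,
    hh_project: Optional[str] = None,
    name: Optional[str] = None,
    suite: Optional[str] = None,
    dataset_id: Optional[str] = None,
    dataset: Optional[List[Dict[str, Any]]] = None,
    evaluators: Optional[List[Callable]] = None,
    max_workers: int = 10,
    verbose: bool = False,
    server_url: Optional[str] = None,
) -> "EvaluationResult":
    ...
```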
Parameters
Required Parameters
- `function` (`Callable`): The function to evaluate. Must return a serializable output. This function will be executed for each datapoint in the dataset. It accepts (see the sketch after this list):
  - `inputs`: dictionary of inputs
  - `ground_truths` (optional): dictionary of ground truth values
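A minimal function matching this contract might look like the following; the name and body are placeholders:

```python
from typing import Any, Dict, Optional

def generate_answer(inputs: Dict[str, Any], ground_truths: Optional[Dict[str, Any]] = None) -> str:
    # `inputs` is one datapoint's input dictionary; `ground_truths` is passed when available.
    # Replace the placeholder body with your model or pipeline call.
    return f"Answer to: {inputs['question']}"
```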
Optional Parameters
- `hh_api_key` (`str`, optional): API key for authenticating with HoneyHive services. If not provided, falls back to the `HH_API_KEY` environment variable.
- `hh_project` (`str`, optional): Project identifier in HoneyHive. If not provided, falls back to the `HH_PROJECT` environment variable.
- `name` (`str`, optional): Identifier for this evaluation run. Used in HoneyHive’s tracing and run management.
- `suite` (`str`, optional): Name of the evaluation suite. If not provided, uses the directory name of the calling script.
- `dataset_id` (`str`, optional): ID of an existing HoneyHive dataset to use for evaluation inputs. Mutually exclusive with `dataset` (see the example after this list).
- `dataset` (`List[Dict[str, Any]]`, optional): List of input dictionaries to evaluate against. Each dictionary should have an `inputs` key and optionally a `ground_truths` key. Alternative to using a HoneyHive dataset through `dataset_id`.
- `evaluators` (`List[Callable]`, optional): List of evaluator functions that process inputs and outputs to generate metrics. Each evaluator can be defined with 1, 2, or 3 parameters, in the order: `outputs`, `inputs` (optional), `ground_truths` (optional). Evaluators can be regular functions or functions decorated with `@evaluator` (which tracks additional settings and metadata).
- `max_workers` (`int`, default=10): Maximum number of concurrent workers for parallel evaluation.
- `verbose` (`bool`, default=False): Whether to print detailed logs during evaluation.
- `server_url` (`str`, optional): Custom server URL for the HoneyHive API.
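For example, to run against a dataset already stored in HoneyHive rather than an inline list, pass `dataset_id` instead of `dataset`. In this sketch the dataset ID is a placeholder and credentials are assumed to come from the `HH_API_KEY` and `HH_PROJECT` environment variables:

```python
from honeyhive import evaluate

def summarize(inputs, ground_truths=None):
    # Placeholder pipeline; replace with your own logic.
    return inputs["text"][:100]

def non_empty(outputs):
    # Single-argument evaluator: checks that the pipeline produced output.
    return bool(outputs)

# Credentials are read from HH_API_KEY / HH_PROJECT if not passed explicitly.
result = evaluate(
    function=summarize,
    name="nightly-regression",
    suite="qa-pipeline",
    dataset_id="<HONEYHIVE_DATASET_ID>",  # placeholder ID; mutually exclusive with `dataset`
    evaluators=[non_empty],
    verbose=True,
)
```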
Return Value
Returns an `EvaluationResult` object containing the evaluation run’s results and metadata.
The `evaluator` Decorator
The `evaluator` decorator provides a flexible way to wrap functions with evaluation logic, enabling result transformation, aggregation, validation, and repetition. It can be used both within formal experiments via `evaluate()` and as a standalone metric computation tool.
Example Usage
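A sketch of both usage modes, assuming `evaluator` is importable from the top-level `honeyhive` package; whether the decorator is applied bare (`@evaluator`) or called (`@evaluator(...)`) may depend on your SDK version, so verify before copying:

```python
from honeyhive import evaluator

@evaluator
def contains_citation(outputs):
    # Standalone metric: True if the response references a source.
    return "[source]" in str(outputs)

# Standalone use: call the decorated metric directly on a model output.
print(contains_citation("According to the paper [source], accuracy improved."))

# Within an experiment: pass it to evaluate(...) alongside your pipeline, e.g.
# evaluate(function=my_pipeline, dataset=[...], evaluators=[contains_citation])
```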
Function Signatures
Decorated functions must accept 1-3 arguments, in this order (see the sketch after this list):
- `outputs` (required): The model’s output to evaluate
  - Type: `Any` - commonly `str`, `dict`, `list`, or custom output types
  - Represents the output from the model or function being evaluated
  - Example: response text, generated code, classification results
- `inputs` (optional): Input context
  - Type: `Dict[str, Any]`
  - Contains the input data used to generate the outputs
- `ground_truth` (optional): Expected results
  - Type: `Dict[str, Any]`
  - Contains the correct or expected outputs for comparison
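The three accepted arities, sketched as plain Python functions with placeholder metric logic:

```python
def toxicity_score(outputs):
    # 1 argument: judge the output alone.
    return 0.0 if "offensive" not in str(outputs).lower() else 1.0

def is_on_topic(outputs, inputs):
    # 2 arguments: compare the output with the input context.
    return inputs["topic"].lower() in str(outputs).lower()

def exact_match(outputs, inputs, ground_truth):
    # 3 arguments: compare the output against the expected results.
    return outputs == ground_truth.get("answer")
```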
Settings
The decorator accepts the following settings, either at initialization or through configuration (see the example after the table):
| Setting | Type | Description | Example |
|---|---|---|---|
| `transform` | `str` | Expression to transform the output value. Useful for mapping / filtering your output. | `"value * 2"` |
| `aggregate` | `str` | Expression to aggregate multiple results (when `repeat` > 1) | `"sum(values)"` |
| `checker` | `str` | Expression to validate results | `"value in target"` |
| `target` | `str` or `list` | Target value for validation | `[4, 5]` |
| `repeat` | `int` | Number of times to repeat the evaluation | `3` |
| `weight` | `float` | Importance weight of the evaluator | `0.5` |
| `asserts` | `bool` | Whether to apply the `assert` keyword to the final output | `True` |
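As a sketch of how these settings might be combined on a toy metric — the keyword names mirror the table above, but passing them directly to `@evaluator(...)` is an assumption to verify against your SDK version:

```python
from honeyhive import evaluator

@evaluator(
    repeat=3,                   # run the metric three times per datapoint
    transform="value * 2",      # double each individual result
    aggregate="sum(values)",    # combine the repeated results into one value
    checker="value in target",  # validate the aggregated value
    target=[4, 5],              # aggregated values considered a pass
    weight=0.5,                 # relative importance of this metric
)
def parity_score(outputs):
    # Toy metric (0 or 1) used only to illustrate the settings above.
    return len(str(outputs)) % 2
```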
Return Value
The return value of an evaluator function should represent some form of evaluation metric, such as a score, a boolean, or any other meaningful measurement.
Technical Notes
- Execution Flow
  - Validates input parameters and credentials
  - Initializes a HoneyHive tracing session
  - Processes each datapoint in the dataset or local dataset list
  - Executes the evaluation function for each input
  - Runs evaluators on the function outputs
  - Collects and stores metrics
  - Returns evaluation metadata
- Dataset Processing
  - HoneyHive datasets are fetched via API and processed automatically
  - Local dataset lists (the `dataset` parameter) are assigned a generated external dataset ID using MD5 hashing; a sketch of this follows these notes
  - Datapoints are processed with concurrency controlled by `max_workers`
  - Supports partial completion on failure
- Tracing Integration
  - Automatically initializes `HoneyHiveTracer` for each evaluation
  - Captures:
    - Input parameters
    - Function outputs
    - Evaluator metrics
    - Runtime metadata
    - Error states
  - Links all sessions to a single evaluation run
- Error Management
  - Validates all required parameters before execution
  - Handles API communication errors gracefully
  - Preserves partial results on failure
  - Maintains evaluation run status
  - Logs detailed error information
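The external dataset ID generation mentioned under Dataset Processing can be pictured roughly as follows; the SDK’s exact serialization and hashing scheme may differ:

```python
import hashlib
import json

dataset = [
    {"inputs": {"question": "What is 2 + 2?"}, "ground_truths": {"answer": "4"}},
    {"inputs": {"question": "Capital of France?"}, "ground_truths": {"answer": "Paris"}},
]

# Fingerprint the local dataset list into a deterministic external dataset ID.
external_dataset_id = hashlib.md5(
    json.dumps(dataset, sort_keys=True).encode("utf-8")
).hexdigest()
print(external_dataset_id)
```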
Notes
- The evaluation framework requires either `dataset_id` or `dataset` to be provided
- HoneyHive credentials (API key and project) must be available either as parameters or environment variables
- Evaluator functions receive the function’s outputs (and optionally inputs and ground truths) and return a metric value
- All evaluation runs are automatically traced using `HoneyHiveTracer`