The evaluate function is a core utility that orchestrates automated evaluations through HoneyHive’s infrastructure. It provides systematic testing, tracing, and metrics-collection capabilities for any Python function, and is particularly useful for evaluating AI model outputs, data-processing pipelines, or any computational process that requires detailed performance analysis.

The evaluation framework integrates with HoneyHive’s tracing system to capture detailed telemetry about each evaluation run, including inputs, outputs, metrics, and runtime metadata. A detailed explanation of how tracing works in Python is available in the HoneyHive documentation on Python tracing.

Example Usage

from typing import Any, Dict

from honeyhive import evaluate

# Define the evaluation function
# (llm, calculate_quality, and measure_relevance are placeholders for your own code)
def test_llm_response(inputs: Dict[str, Any]) -> str:
    response = llm.generate(inputs["prompt"])
    return response

# Define evaluator
def quality_evaluator(inputs: Dict[str, Any], outputs: Any) -> Dict[str, float]:
    return {
        "response_quality": calculate_quality(outputs),
        "prompt_relevance": measure_relevance(inputs["prompt"], outputs)
    }

# Run evaluation
result = evaluate(
    function=test_llm_response,
    hh_api_key="your-api-key",
    hh_project="Project Name",
    name="LLM Quality Test",
    dataset_id="test_dataset_001",
    evaluators=[quality_evaluator]
)

print(f"Evaluation Run ID: {result['run_id']}")
print(f"Session IDs: {result['session_ids']}")

Function Signature

def evaluate(
    function: Callable[[Dict[str, Any]], Any],
    hh_api_key: Optional[str] = None,
    hh_project: Optional[str] = None,
    name: Optional[str] = None,
    dataset_id: Optional[str] = None,
    query_list: Optional[List[Dict[str, Any]]] = None,
    runs: Optional[int] = None,
    evaluators: Optional[List[Callable[[Dict[str, Any], Any], Dict[str, Any]]]] = None,
) -> Dict[str, Any]

Parameters

Required Parameters

  • function (Callable[[Dict[str, Any]], Any]): The evaluation function to be tested. Must accept a dictionary of inputs and return a serializable output. This function will be executed for each datapoint in the dataset or query list.
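
For illustration only: an evaluation function may return any serializable value, including a dictionary. The pipeline object below is a hypothetical placeholder, just like llm in the example above.

from typing import Any, Dict

# Illustrative sketch: "pipeline" stands in for whatever you are evaluating
def summarize_document(inputs: Dict[str, Any]) -> Dict[str, Any]:
    summary = pipeline.run(inputs["document"])
    return {
        "summary": summary,
        "summary_length": len(summary),
    }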

Optional Parameters

  • hh_api_key (str, optional): API key for authenticating with HoneyHive services. If not provided, falls back to the HH_API_KEY environment variable.

  • hh_project (str, optional): Project identifier in HoneyHive. If not provided, falls back to the HH_PROJECT environment variable.

  • name (str, optional): Identifier for this evaluation run. Used in HoneyHive’s tracing and run management.

  • dataset_id (str, optional): ID of an existing HoneyHive dataset to use for evaluation inputs. Mutually exclusive with query_list.

  • query_list (List[Dict[str, Any]], optional): List of input dictionaries to evaluate against. Alternative to using a HoneyHive dataset; mutually exclusive with dataset_id (see the example after this list).

  • runs (int, optional): Number of evaluation iterations to perform. Defaults to the length of the dataset or query list.

  • evaluators (List[Callable[[Dict[str, Any], Any], Dict[str, Any]]], optional): List of evaluator functions that process inputs and outputs to generate metrics. Each evaluator should return a dictionary of metrics.
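
As a sketch of the query_list path, the call below reuses test_llm_response from the example above; the queries and the length_evaluator metric are illustrative, not part of the API:

from typing import Any, Dict

from honeyhive import evaluate

queries = [
    {"prompt": "Summarize the plot of Hamlet in one sentence."},
    {"prompt": "Explain what a vector database is."},
]

def length_evaluator(inputs: Dict[str, Any], outputs: Any) -> Dict[str, Any]:
    # Any dictionary of metric name -> value is accepted
    return {"output_length": len(str(outputs))}

result = evaluate(
    function=test_llm_response,     # defined in the example above
    hh_api_key="your-api-key",
    hh_project="Project Name",
    name="Query List Run",
    query_list=queries,             # used instead of dataset_id
    evaluators=[length_evaluator],
)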

Return Value

Returns a dictionary containing evaluation metadata:

{
    "run_id": str,           # Unique identifier for this evaluation run
    "dataset_id": str,       # Dataset ID (HoneyHive or generated external ID)
    "session_ids": List[str],# List of individual evaluation session IDs
    "status": str           # Final status of the evaluation run
}
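
The keys above can be consumed directly; for example:

result = evaluate(...)  # as in the example above

print(f"Run {result['run_id']} finished with status: {result['status']}")
print(f"Dataset: {result['dataset_id']}")
for session_id in result["session_ids"]:
    print(f"  session: {session_id}")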

Technical Notes

  1. Execution Flow

    • Validates input parameters and credentials
    • Initializes HoneyHive tracing session
    • Processes dataset or query list sequentially
    • Executes evaluation function for each input
    • Runs evaluators on function outputs
    • Collects and stores metrics
    • Returns evaluation metadata
  2. Dataset Processing

    • HoneyHive datasets are fetched via API and processed automatically
    • Query lists are assigned a generated external dataset ID using MD5 hashing (a sketch follows these notes)
    • Each datapoint/query is processed sequentially
    • Supports partial completion on failure
  3. Tracing Integration

    • Automatically initializes HoneyHiveTracer for each evaluation
    • Captures:
      • Input parameters
      • Function outputs
      • Evaluator metrics
      • Runtime metadata
      • Error states
    • Links all sessions to a single evaluation run
  4. Error Management

    • Validates all required parameters before execution
    • Handles API communication errors gracefully
    • Preserves partial results on failure
    • Maintains evaluation run status
    • Logs detailed error information
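
The external dataset ID mentioned in note 2 can be pictured roughly as follows. This is only a sketch of the idea, not the SDK’s actual implementation; the serialization and ID format are assumptions:

import hashlib
import json

def external_dataset_id(query_list):
    # Hypothetical: hash a canonical serialization of the query list so that
    # the same inputs always map to the same external dataset ID
    payload = json.dumps(query_list, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()

print(external_dataset_id([{"prompt": "Explain what a vector database is."}]))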

Notes

  • The evaluation framework requires either dataset_id or query_list to be provided
  • HoneyHive credentials (API key and project) must be available either as parameters or as the HH_API_KEY and HH_PROJECT environment variables (see the example below)
  • Evaluator functions must accept both the inputs and the function’s outputs and return a dictionary of metrics
  • All evaluation runs are automatically traced using HoneyHiveTracer
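
If credentials are supplied through the environment rather than as parameters, hh_api_key and hh_project can be omitted from the call. The values below are placeholders and would normally be set outside the process (shell profile, CI secrets, etc.):

import os

from honeyhive import evaluate

os.environ["HH_API_KEY"] = "your-api-key"
os.environ["HH_PROJECT"] = "Project Name"

result = evaluate(
    function=test_llm_response,
    name="LLM Quality Test",
    dataset_id="test_dataset_001",
    evaluators=[quality_evaluator],
)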