The evaluate function is a core utility that orchestrates automated evaluations through HoneyHive’s infrastructure. It provides systematic testing, tracing, and metrics collection for any TypeScript/JavaScript function, with particular emphasis on AI model evaluation, data processing pipelines, and performance analysis.

The evaluation framework manages the complete lifecycle of an evaluation run, from initialization through execution to completion, while integrating with HoneyHive’s tracing system for comprehensive telemetry capture. A detailed explanation of how tracing works in TypeScript can be found here.

Example Usage

import { evaluate } from "honeyhive";

// Define evaluation function (llm is a placeholder for your model client)
const evaluationFunction = async (inputs: Record<string, any>) => {
    const response = await llm.generate(inputs.prompt);
    return response;
};

// Define evaluator (calculateQuality and measureRelevance are placeholder metric helpers)
const qualityEvaluator = (inputs: Record<string, any>, outputs: any) => {
    return {
        responseQuality: calculateQuality(outputs),
        promptRelevance: measureRelevance(inputs.prompt, outputs)
    };
};

// Run evaluation
const result = await evaluate({
    evaluationFunction,
    hh_api_key: "your-api-key",
    hh_project: "Project Name",
    name: "LLM Quality Test",
    dataset_id: "test_dataset_001",
    evaluators: [qualityEvaluator]
});

console.log(`Evaluation Run ID: ${result.run_id}`);
console.log(`Session IDs: ${result.session_ids}`);

Function Signature and Interfaces

async function evaluate(config: EvaluationConfig): Promise<EvaluationResult>;

interface EvaluationConfig {
    evaluationFunction: Function;
    hh_api_key: string;
    hh_project: string;
    name: string;
    dataset_id?: string;
    query_list?: Record<string, any>[];
    runs?: number;
    evaluators?: Function[];
}

interface EvaluationResult {
    run_id: string;
    dataset_id: string | undefined;
    session_ids: string[];
    status: Status;
}

Parameters

Required Parameters

  • evaluationFunction (Function): The function to evaluate. Must accept an inputs object and return a serializable output (see the sketch after this list).

  • hh_api_key (string): API key for authenticating with HoneyHive services.

  • hh_project (string): Project identifier in HoneyHive.

  • name (string): Identifier for this evaluation run.
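
The evaluation function's only contract is the one described above: it receives a single datapoint's inputs and returns a JSON-serializable value. A minimal typed sketch (llm is a placeholder model client, as in the example above):

// llm is a placeholder model client, as in the example above
declare const llm: { generate(prompt: string): Promise<string> };

// The contract: accept one datapoint's inputs, return any JSON-serializable value
const answerWithMetadata = async (inputs: Record<string, any>): Promise<{ answer: string; length: number }> => {
    const answer = await llm.generate(inputs.prompt);
    return { answer, length: answer.length };
};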

Optional Parameters

  • dataset_id (string, optional): ID of an existing HoneyHive dataset to use for evaluation inputs.

  • query_list (Record<string, any>[], optional): List of input objects to evaluate against. Alternative to using a dataset (see the sketch after this list).

  • runs (number, optional): Number of evaluation iterations. Defaults to dataset/query_list length.

  • evaluators (Function[], optional): List of evaluator functions that process inputs and outputs to generate metrics.
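
If no stored HoneyHive dataset exists, the same run can be driven by an inline query_list, optionally capped with runs. A minimal sketch reusing the evaluationFunction and qualityEvaluator from the example above (the API key and project name are placeholders):

const inlineResult = await evaluate({
    evaluationFunction,
    hh_api_key: "your-api-key",
    hh_project: "Project Name",
    name: "LLM Quality Test (inline inputs)",
    query_list: [
        { prompt: "Summarize the release notes" },
        { prompt: "Draft a follow-up email" }
    ],
    runs: 2, // optional; defaults to the query_list length
    evaluators: [qualityEvaluator]
});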

Return Value

Returns a Promise that resolves to an evaluation result object:

{
    run_id: string;                    // HoneyHive run identifier
    dataset_id: string | undefined;    // Dataset ID (HoneyHive dataset or generated for a query list)
    session_ids: string[];             // Individual evaluation session IDs
    status: Status;                    // Run status (e.g., "COMPLETED")
}
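
A typical way to consume this object is to confirm the run completed and then work through the per-iteration session IDs. A minimal sketch (the comparison against "COMPLETED" assumes the status value shown above):

if (String(result.status) === "COMPLETED") {
    console.log(`Run ${result.run_id} finished with ${result.session_ids.length} sessions`);
    result.session_ids.forEach((id, i) => console.log(`  iteration ${i}: session ${id}`));
} else {
    console.warn(`Run ${result.run_id} ended with status ${result.status}`);
}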

Technical Notes

  1. Execution Flow

    • Validates configuration requirements and credentials
    • Initializes evaluation state and HoneyHive client
    • Loads dataset (HoneyHive dataset or generates ID for query list)
    • Creates evaluation run in HoneyHive
    • For each iteration:
      • Retrieves input data from dataset or query list
      • Initializes HoneyHiveTracer for the iteration
      • Executes evaluation function with inputs
      • Runs evaluators on function outputs
      • Enriches trace with metadata and metrics
      • Collects session ID
    • Updates evaluation status to completed
    • Returns evaluation metadata (see the pseudocode sketch at the end of these technical notes)
  2. Dataset Processing

    • Supports both HoneyHive datasets and external query lists
    • Generates MD5 hashes for external datasets
    • Handles datapoint fetching and validation
    • Manages dataset linkage in traces
  3. Tracing Integration

    • Creates individual trace sessions per evaluation
    • Captures:
      • Input/output pairs
      • Evaluator metrics
      • Runtime metadata
      • Dataset linkage
    • Automatically flushes traces after each run
  4. Error Management

    • Validates configuration requirements
    • Handles API communication errors
    • Manages evaluator failures independently
    • Preserves partial results on failure
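
The pseudocode below sketches how these steps fit together. It is illustrative only: the types, the MD5-based ID, and the placeholder session IDs stand in for the framework's internals and are not the actual SDK implementation.

import { createHash } from "crypto";

// Illustrative sketch of one evaluation run (not the SDK's real implementation)
type SketchConfig = {
    evaluationFunction: (inputs: Record<string, any>) => any;
    evaluators?: ((inputs: Record<string, any>, outputs: any) => Record<string, any>)[];
    dataset_id?: string;
    query_list?: Record<string, any>[];
    runs?: number;
};

async function evaluateSketch(config: SketchConfig) {
    // 2. Dataset processing: an external query list is assigned a generated (MD5-based) dataset ID
    const datasetId = config.dataset_id
        ?? createHash("md5").update(JSON.stringify(config.query_list ?? [])).digest("hex");

    const sessionIds: string[] = [];
    const iterations = config.runs ?? config.query_list?.length ?? 0;

    for (let i = 0; i < iterations; i++) {
        // 1. Execution flow: retrieve inputs, run the function, then the evaluators
        const inputs = config.query_list?.[i] ?? {}; // or a datapoint fetched from the HoneyHive dataset

        // 3. Tracing: a HoneyHiveTracer session would be initialized here (omitted in this sketch)
        const outputs = await config.evaluationFunction(inputs);

        // 4. Error management: each evaluator is isolated so one failure does not abort the run
        const metrics: Record<string, any> = {};
        for (const evaluator of config.evaluators ?? []) {
            try {
                Object.assign(metrics, evaluator(inputs, outputs));
            } catch (err) {
                console.warn("Evaluator failed; continuing", err);
            }
        }

        // The trace would be enriched with inputs/outputs, metrics, and dataset linkage, then flushed;
        // its session ID is collected into the final result.
        sessionIds.push(`session-${i}`); // placeholder for the real session ID
    }

    return { dataset_id: datasetId, session_ids: sessionIds };
}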

Notes

  • Either dataset_id or query_list must be provided
  • External query lists are automatically assigned a dataset ID
  • Evaluator functions should handle both inputs and outputs
  • All evaluation runs are automatically traced using HoneyHiveTracer
  • Evaluation status is updated to reflect completion or failure