The evaluate function is a core utility designed to orchestrate automated evaluations through HoneyHive’s infrastructure. It provides systematic testing, tracing, and metrics collection capabilities for any TypeScript/JavaScript function, with particular emphasis on AI model evaluation, data processing pipelines, and performance analysis.

The evaluation framework manages the complete lifecycle of an evaluation run, from initialization through execution to completion, while integrating with HoneyHive’s tracing system for comprehensive telemetry capture. A detailed explanation of how tracing works in TypeScript can be found here.

Example Usage

import { evaluate } from "honeyhive";

// Define evaluation function
const evaluationFunction = async (inputs: Record<string, any>, ground_truths: Record<string, any>) => {
    const response = await llm.generate(inputs.prompt);
    return response;
};

// Define evaluator
const qualityEvaluator = (outputs: any, inputs: Record<string, any>, ground_truths: Record<string, any>) => {
    return {
        responseQuality: calculateQuality(outputs),
        promptRelevance: measureRelevance(inputs.prompt, outputs)
    };
};

// Run evaluation
const result = await evaluate({
    function: evaluationFunction,
    apiKey: "your-api-key",
    project: "Project Name",
    name: "LLM Quality Test",
    datasetId: "test_dataset_001",
    evaluators: [qualityEvaluator]
});

console.log(`Evaluation Run ID: ${result.runId}`);
console.log(`Session IDs: ${result.sessionIds}`);

Function Signature and Interfaces

async function evaluate(config: EvaluationConfig): Promise<EvaluationResult>;

interface EvaluationConfig {
    apiKey?: string | undefined;
    project?: string | undefined;
    name?: string | undefined;
    suite?: string | undefined;
    function?: ((...args: any[]) => any) | undefined;
    evaluators?: ((...args: any[]) => any)[] | undefined;
    dataset?: Dict<any>[] | undefined;
    datasetId?: string | undefined;
    maxWorkers?: number | undefined;
    runConcurrently?: boolean | undefined;
    serverUrl?: string | undefined;
    verbose?: boolean | undefined;
    disableHttpTracing?: boolean | undefined;
    metadata?: Dict<any> | undefined;
    instrumentModules?: Record<string, any> | undefined;
}

interface EvaluationResult {
    runId: string;
    datasetId: string | undefined;
    sessionIds: string[];
    status: Status;
    suite: string;
    stats: Dict<any>;
    data: Dict<any>;
}

Parameters

Required Parameters

  • function (Function): The function to evaluate. It receives positional arguments in this order: (1) an inputs object and (2) an optional ground truths object, and must return a serializable output.

  • apiKey (string): API key for authenticating with HoneyHive services.

  • project (string): Project identifier in HoneyHive.

  • name (string): Identifier for this evaluation run.

Optional Parameters

  • datasetId (string, optional): ID of an existing HoneyHive dataset to use for evaluation inputs.

  • dataset (Record<string, any>[], optional): List of input objects to evaluate against. Alternative to referencing a HoneyHive dataset via datasetId (see the example below).

  • evaluators (Function[], optional): List of evaluator functions used to generate metrics. Each evaluator receives positional arguments in this order: (1) outputs, (2) inputs, and (3) ground truths.

  • suite (string, optional): Name of the evaluation suite. If not provided, the directory name of the calling script is used.

  • maxWorkers (number, optional): Maximum number of concurrent workers for parallel evaluation. Defaults to 10.

  • runConcurrently (boolean, optional): Whether to run evaluations concurrently. Defaults to false.

  • serverUrl (string, optional): Custom server URL for the HoneyHive API.

  • verbose (boolean, optional): Whether to print detailed logs during evaluation. Defaults to false.

  • disableHttpTracing (boolean, optional): Whether to disable automatic HTTP request tracing. Defaults to false.

  • metadata (Record<string, any>, optional): Additional metadata to attach to the evaluation run.

  • instrumentModules (Record<string, any>, optional): Modules to instrument for automatic tracing.
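
For reference, here is a minimal sketch of running against an in-code dataset instead of a HoneyHive dataset. The dataset entries, metric name, and llm.generate call are illustrative placeholders (llm stands in for your own model client), and the concurrency settings are optional:

import { evaluate } from "honeyhive";

// Illustrative in-code dataset: each entry is an inputs object passed to the evaluated function.
const dataset = [
    { prompt: "Summarize the plot of Hamlet in one sentence." },
    { prompt: "Explain what a mutex is to a beginner." }
];

// Illustrative evaluator: ground truths may be undefined when the in-code dataset has no labels.
const lengthEvaluator = (outputs: any, inputs: Record<string, any>, ground_truths?: Record<string, any>) => {
    return { responseLength: String(outputs).length };
};

const result = await evaluate({
    function: async (inputs: Record<string, any>) => {
        // llm is a placeholder for your own model client, as in the Example Usage above
        const response = await llm.generate(inputs.prompt);
        return response;
    },
    apiKey: "your-api-key",
    project: "Project Name",
    name: "In-Code Dataset Test",
    dataset,
    evaluators: [lengthEvaluator],
    runConcurrently: true,  // optional: evaluate datapoints in parallel
    maxWorkers: 5           // optional: cap the number of concurrent workers
});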

Return Value

Returns a Promise that resolves to an evaluation result object:

{
    runId: string;           // HoneyHive run identifier
    datasetId: string | undefined;  // Dataset ID (HoneyHive or generated) 
    sessionIds: string[];    // Individual evaluation session IDs
    status: Status;         // Run status (e.g., "COMPLETED")
    suite: string;          // Name of the evaluation suite
    stats: Dict<any>;       // Statistics about the evaluation run
    data: Dict<any>;        // Additional data associated with the run
}
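
A caller will typically inspect a few of these fields once the run finishes. A minimal sketch, using the result object returned in the Example Usage above (the exact contents of stats and data depend on the run):

// Summarize the completed run
console.log(`Run ${result.runId} finished with status ${result.status} (suite: ${result.suite})`);
console.log(`Evaluated ${result.sessionIds.length} sessions`);
console.log("Run statistics:", result.stats);

if (result.datasetId) {
    console.log(`Dataset used: ${result.datasetId}`);
}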

Technical Notes

  1. Execution Flow

    • Validates configuration requirements and credentials
    • Initializes evaluation state and HoneyHive client
    • Loads dataset (HoneyHive dataset or generates ID for in-code datasets)
    • Creates evaluation run in HoneyHive
    • For each iteration:
      • Retrieves input data from dataset
      • Initializes HoneyHiveTracer for the iteration
      • Executes evaluation function with inputs
      • Runs evaluators on function outputs
      • Enriches trace with metadata and metrics
      • Collects session ID
    • Updates evaluation status to completed
    • Returns evaluation metadata
  2. Dataset Processing

    • Supports both HoneyHive datasets and external datasets
    • Generates MD5 hashes for external datasets
    • Handles datapoint fetching and validation
    • Manages dataset linkage in traces
  3. Tracing Integration

    • Creates individual trace sessions per evaluation
    • Captures:
      • Input/output pairs
      • Evaluator metrics
      • Runtime metadata
      • Dataset linkage
    • Automatically flushes traces after each run
  4. Error Management

    • Validates configuration requirements
    • Handles API communication errors
    • Manages evaluator failures independently (see the sketch below)
    • Preserves partial results on failure
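
Because evaluator failures are handled independently (item 4 above), an individual evaluator can also guard itself so that one malformed output does not drop its metrics entirely. A small illustrative sketch; the metric names are placeholders:

const safeLengthEvaluator = (outputs: any, inputs: Record<string, any>, ground_truths: Record<string, any>) => {
    try {
        const text = typeof outputs === "string" ? outputs : JSON.stringify(outputs);
        return { responseLength: text.length, withinLimit: text.length <= 2000 };
    } catch (error) {
        // Fall back to sentinel metrics so the rest of the run's data is still recorded
        return { responseLength: -1, withinLimit: false };
    }
};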

Notes

  • Either datasetId or dataset must be provided
  • External datasets are automatically assigned a generated dataset ID
  • Evaluator functions receive the outputs, inputs, and ground truths for each datapoint
  • All evaluation runs are automatically traced using HoneyHiveTracer
  • Evaluation status is updated to reflect completion or failure