HoneyHive’s Python SDK provides a comprehensive evaluation framework for testing, tracing, and collecting metrics on AI model outputs, data processing pipelines, or any computational process requiring detailed performance analysis.

The framework consists of two main components:

  1. The evaluate function for orchestrating evaluation runs
  2. The evaluator decorator for creating custom evaluation metrics

The evaluate Function

The evaluate function integrates with HoneyHive’s tracing system to capture detailed telemetry about each evaluation run, including inputs, outputs, metrics, and runtime metadata. For a detailed explanation of how tracing works in Python, see HoneyHive’s Python tracing documentation.

Example Usage

from typing import Any, Dict

from honeyhive import evaluate, evaluator

# Define evaluation function
def test_llm_response(inputs: Dict[str, Any]) -> str:
    response = llm.generate(inputs["prompt"])
    return response

@evaluator()
def quality_evaluator(outputs, inputs, ground_truths):
    # Your evaluation code here
    return True

# Create dataset
dataset = [
    {
        "inputs": {"prompt": "Explain quantum computing"},
        "ground_truths": {"ideal_response": "Quantum computing uses quantum bits..."}
    },
    {
        "inputs": {"prompt": "What is machine learning?"},
        "ground_truths": {"ideal_response": "Machine learning is..."}
    }
]

# Run evaluation
result = evaluate(
    function=test_llm_response,
    hh_api_key="your-api-key",
    hh_project="Project Name",
    name="LLM Quality Test",
    dataset=dataset,
    evaluators=[quality_evaluator]
)

# Export results to JSON
result.to_json()

Function Signature

def evaluate(
    function: Callable,
    hh_api_key: Optional[str] = None,
    hh_project: Optional[str] = None,
    name: Optional[str] = None,
    suite: Optional[str] = None,
    dataset_id: Optional[str] = None,
    dataset: Optional[List[Dict[str, Any]]] = None,
    evaluators: Optional[List[Any]] = None,
    max_workers: int = 10,
    verbose: bool = False,
    server_url: Optional[str] = None,
) -> EvaluationResult

Parameters

Required Parameters

  • function (Callable): The function to evaluate. It is executed once for each datapoint in the dataset and must return a serializable output. It accepts the following arguments (see the sketch after this list):
    • inputs: dictionary of inputs

    • ground_truths (optional): dictionary of ground truth values
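
For example, a function that accepts both arguments could look like the following minimal sketch (the function name and body are illustrative placeholders; a real function would typically call a model or pipeline here):

from typing import Any, Dict

def answer_question(inputs: Dict[str, Any], ground_truths: Dict[str, Any]) -> str:
    # inputs carries the datapoint's "inputs" dictionary
    prompt = inputs["prompt"]
    # ground_truths is passed when the datapoint defines it; it is read here
    # only to show that both arguments are available inside the function
    expected = ground_truths.get("ideal_response", "")
    # Stand-in for a real model call; any serializable value can be returned
    return f"Answering '{prompt}' (expected answer is {len(expected)} characters long)"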

Optional Parameters

  • hh_api_key (str, optional): API key for authenticating with HoneyHive services. If not provided, falls back to HH_API_KEY environment variable.

  • hh_project (str, optional): Project identifier in HoneyHive. If not provided, falls back to HH_PROJECT environment variable.

  • name (str, optional): Identifier for this evaluation run. Used in HoneyHive’s tracing and run management.

  • suite (str, optional): Name of the evaluation suite. If not provided, uses the directory name of the calling script.

  • dataset_id (str, optional): ID of an existing HoneyHive dataset to use for evaluation inputs. Mutually exclusive with dataset (see the example after this list).

  • dataset (List[Dict[str, Any]], optional): List of input dictionaries to evaluate against. Each dictionary should have an inputs key and optionally a ground_truths key. Alternative to using a HoneyHive dataset through dataset_id.

  • evaluators (List[Callable], optional): List of evaluator functions that process inputs and outputs to generate metrics. Each evaluator can be defined with 1, 2, or 3 parameters, in the order: outputs, inputs (optional), ground_truths (optional). Evaluators can be plain functions or functions decorated with @evaluator (which tracks additional settings and metadata).

  • max_workers (int, default=10): Maximum number of concurrent workers for parallel evaluation.

  • verbose (bool, default=False): Whether to print detailed logs during evaluation.

  • server_url (str, optional): Custom server URL for HoneyHive API.
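
As an example of how these options fit together, the following sketch runs an evaluation against a hosted HoneyHive dataset, taking credentials from environment variables instead of passing them explicitly (the dataset ID is a placeholder, and test_llm_response / quality_evaluator are the functions from the earlier example):

import os

from honeyhive import evaluate

# Credentials fall back to these environment variables when not passed directly
os.environ["HH_API_KEY"] = "your-api-key"
os.environ["HH_PROJECT"] = "Project Name"

result = evaluate(
    function=test_llm_response,                # defined in the earlier example
    name="LLM Quality Test (hosted dataset)",
    suite="llm_quality",
    dataset_id="<your-honeyhive-dataset-id>",  # placeholder; mutually exclusive with dataset
    evaluators=[quality_evaluator],
    max_workers=4,
    verbose=True,
)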

Return Value

Returns an EvaluationResult object containing:

@dataclass
class EvaluationResult:
    run_id: str                # Unique identifier for this evaluation run
    stats: Dict[str, Any]      # Statistics about the evaluation run
    dataset_id: str            # Dataset ID (HoneyHive or generated external ID)
    session_ids: list          # List of individual evaluation session IDs
    status: str                # Final status of the evaluation run
    suite: str                 # Name of the evaluation suite
    data: Dict[str, list]      # Evaluation data including inputs, outputs, and metrics

    def to_json(self):         # Method to export results to a JSON file
        # Exports data to a JSON file named after the suite
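
For instance, once a run finishes you can inspect these fields and export the results (a minimal sketch reusing the function, dataset, and evaluator from the earlier example):

result = evaluate(
    function=test_llm_response,
    dataset=dataset,
    evaluators=[quality_evaluator],
)

print(f"Run {result.run_id} finished with status: {result.status}")
print(f"Suite: {result.suite}, dataset: {result.dataset_id}")
print(f"Sessions traced: {len(result.session_ids)}")

# result.data is a Dict[str, list] of evaluation data (inputs, outputs, metrics)
for column, values in result.data.items():
    print(column, values[:3])  # preview the first few entries per column

# Write everything to a JSON file named after the suite
result.to_json()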

The evaluator Decorator

The evaluator decorator provides a flexible way to wrap functions with evaluation logic, enabling result transformation, aggregation, validation, and repetition. It can be used both within formal experiments via evaluate() and as a standalone metric computation tool.

Example Usage

from honeyhive import enrich_span, evaluator

# Basic evaluator with just outputs
@evaluator()
def length_evaluator(outputs):
    return len(outputs) / 100  # Score based on length

# Evaluator with outputs and inputs
@evaluator()
def relevance_evaluator(outputs, inputs):
    # Check if words from the user prompt appear in the output
    prompt_words = inputs["user_prompt"].lower().split()
    count = sum(word in outputs.lower() for word in prompt_words)
    return count / len(prompt_words)

# Evaluator with outputs, inputs, and ground truths
@evaluator()
def accuracy_evaluator(outputs, inputs, ground_truths):
    # Compare output to ground truth
    similarity = calculate_similarity(outputs, ground_truths["ideal_response"])
    return similarity

# Use directly to enrich a span with metrics
def generate_response(question):
    completion_content = create_completion(question)
    eval_result = relevance_evaluator(completion_content, {"user_prompt": question})
    enrich_span(metrics={"eval_result": eval_result})

# Use in evaluate()
evaluate(function=some_func, evaluators=[length_evaluator, relevance_evaluator, accuracy_evaluator])

Function Signatures

Decorated functions must accept 1-3 arguments in this order:

  1. outputs (required): The model’s output to evaluate

    • Type: Any - commonly str, dict, list, or custom output types
    • Represents the output from the model or function being evaluated
    • Example: Response text, generated code, classification results
  2. inputs (optional): Input context

    • Type: Dict[str, Any]
    • Contains the input data used to generate the outputs
    • Example:
      {
          "prompt": "What is the capital of France?",
          "temperature": 0.7,
          "max_tokens": 100
      }
      
  3. ground_truths (optional): Expected results

    • Type: Dict[str, Any]
    • Contains the correct or expected outputs for comparison
    • Example:
      {
          "answer": "Paris",
          "category": "geography"
      }
      

Settings

The decorator accepts the following settings, either through initialization or configuration (a combined example follows the list):

  • transform (str): Expression to transform the output value. Useful for mapping / filtering your output. Example: "value * 2"

  • aggregate (str): Expression to aggregate multiple results (when repeat > 1). Example: "sum(values)"

  • checker (str): Expression to validate results. Example: "value in target"

  • target (str or list): Target value for validation. Example: [4, 5]

  • repeat (int): Number of times to repeat the evaluation. Example: 3

  • weight (float): Importance weight of the evaluator. Example: 0.5

  • asserts (bool): Whether to apply the assert keyword to the final output. Example: True
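
As an illustration, the sketch below combines several of these settings; note that passing them as keyword arguments to @evaluator() is an assumption based on the initialization option mentioned above, and the scoring logic is a toy example:

from honeyhive import evaluator

# Assumed syntax: settings passed as keyword arguments at decoration time
@evaluator(repeat=3, aggregate="sum(values) / len(values)",
           target=[4, 5], checker="value in target", weight=0.5)
def rating_evaluator(outputs, inputs, ground_truths):
    # Toy 1-5 rating: reward outputs that mention the start of the expected answer
    return 5 if ground_truths["ideal_response"][:20].lower() in outputs.lower() else 2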

Return Value

An evaluator should return a value that represents an evaluation metric, such as a numeric score, a boolean pass/fail flag, or another meaningful measurement.

Technical Notes

  1. Execution Flow

    • Validates input parameters and credentials
    • Initializes HoneyHive tracing session
    • Resolves the dataset (hosted dataset_id or in-memory dataset list)
    • Executes evaluation function for each input
    • Runs evaluators on function outputs
    • Collects and stores metrics
    • Returns evaluation metadata
  2. Dataset Processing

    • HoneyHive datasets (referenced by dataset_id) are fetched via API and processed automatically
    • In-memory datasets passed via dataset are assigned a generated external dataset ID using MD5 hashing
    • Datapoints are processed with up to max_workers concurrent workers
    • Supports partial completion on failure
  3. Tracing Integration

    • Automatically initializes HoneyHiveTracer for each evaluation
    • Captures:
      • Input parameters
      • Function outputs
      • Evaluator metrics
      • Runtime metadata
      • Error states
    • Links all sessions to a single evaluation run
  4. Error Management

    • Validates all required parameters before execution
    • Handles API communication errors gracefully
    • Preserves partial results on failure
    • Maintains evaluation run status
    • Logs detailed error information
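
Because partial results are preserved, a calling script can check the run status and still export whatever was collected. A sketch (the exact status strings are not documented here, so the comparison value is an assumption):

result = evaluate(
    function=test_llm_response,
    dataset=dataset,
    evaluators=[quality_evaluator],
    verbose=True,
)

if result.status != "completed":  # assumed status value; adjust to what your runs report
    print(f"Run {result.run_id} ended with status {result.status}")
    print(f"Partial data collected for {len(result.session_ids)} sessions")

result.to_json()  # partial results are still exported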

Notes

  • The evaluation framework requires either dataset_id or dataset to be provided
  • HoneyHive credentials (API key and project) must be available either as parameters or environment variables
  • Evaluator functions accept outputs (and optionally inputs and ground_truths) and return a metric value such as a score or boolean
  • All evaluation runs are automatically traced using HoneyHiveTracer