HoneyHive helps you observe, evaluate, and iterate on AI applications. Its abstractions are designed for extensibility and reuse, and each concept is minimally opinionated.

Project

Everything in HoneyHive is organized by projects. A project is a logically-separated workspace to develop, evaluate, and monitor a specific AI agent or an end-to-end application leveraging one or multiple agents.

Observability

Session

A session is a collection of events that represent a single user interaction with your application. Sessions can trace a single agent execution or an end-to-end user conversation with multiple turns, depending on your configuration.

Event

An event tracks the execution of a specific operation in your application, along with inputs, outputs, metadata, and feedback. This is synonymous with a single span in a trace. Events have three types:
Type    Use Case
model   LLM API calls (OpenAI, Anthropic, etc.)
tool    External calls (vector DBs, APIs, functions)
chain   Logical groupings of multiple events
Events can be enriched with metrics (numeric scores like latency, cost, or custom evaluations), feedback (user ratings or corrections), metadata (custom key-value pairs), and user properties (user ID, tier, etc.). Full details on the wide-event data model can be found in Tracing Introduction.
Trace visualization showing nested events within a session
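To make the wide-event data model concrete, here is a minimal sketch of an event record in plain Python. The field names mirror the concepts above (inputs, outputs, metrics, feedback, metadata, user properties), but this is an illustration, not HoneyHive's actual SDK or schema.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Event:
    """Illustrative wide-event record; not HoneyHive's real schema."""
    event_type: str                  # "model", "tool", or "chain"
    inputs: dict[str, Any]
    outputs: dict[str, Any]
    metrics: dict[str, float] = field(default_factory=dict)        # latency, cost, eval scores
    feedback: dict[str, Any] = field(default_factory=dict)         # user ratings or corrections
    metadata: dict[str, Any] = field(default_factory=dict)         # custom key-value pairs
    user_properties: dict[str, Any] = field(default_factory=dict)  # user ID, tier, etc.


# A "model" event for a single LLM call, enriched after the fact:
call = Event(
    event_type="model",
    inputs={"prompt": "Summarize this support ticket"},
    outputs={"completion": "User cannot log in after password reset."},
)
call.metrics["latency_ms"] = 412.0
call.feedback["rating"] = "thumbs_up"
call.user_properties["tier"] = "enterprise"
```

Because every enrichment lives on the same record, one event can be filtered by user tier, aggregated by cost, or inspected alongside its raw inputs and outputs.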

Tracing Introduction

Data model, OpenTelemetry architecture, and context propagation.

Evaluation

Datapoint

A datapoint is an input-output pair (with optional ground truth and metadata) that represents a single test case. Datapoints can be created manually or saved directly from production traces.
Datapoint showing inputs, outputs, and linked trace
Each datapoint has a unique datapoint_id used to track it across experiments and comparisons. Datapoints link back to the events that generated them.
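As a sketch of the shape described above, a datapoint can be pictured as a small record keyed by its `datapoint_id`. The exact field names here are assumptions for illustration, not HoneyHive's export format.

```python
import uuid

# Illustrative only: an input-output pair with optional ground truth and
# metadata, tracked by a unique datapoint_id.
datapoint = {
    "datapoint_id": str(uuid.uuid4()),   # stable ID across experiments and comparisons
    "inputs": {"question": "What is the refund window?"},
    "outputs": {"answer": "30 days from delivery."},
    "ground_truth": "Refunds are accepted within 30 days of delivery.",
    "metadata": {"source": "production_trace"},
    "linked_event_ids": [],              # events that generated this datapoint
}
```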

Dataset

A dataset is a collection of datapoints used to run evaluations, compare model versions, or fine-tune custom models. Datasets can be exported and used programmatically in your CI pipelines. Learn more in Datasets.
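The CI use case can be sketched as follows: load an exported dataset, run each datapoint through your application, and fail the build on any mismatch. The JSONL layout shown here is an assumption for illustration; consult the Datasets docs for the actual export format.

```python
import io
import json

# Stand-in for a dataset export; in CI you would read the real exported file.
exported = io.StringIO(
    '{"datapoint_id": "dp-1", "inputs": {"q": "capital of France"}, "ground_truth": "Paris"}\n'
    '{"datapoint_id": "dp-2", "inputs": {"q": "capital of Japan"}, "ground_truth": "Tokyo"}\n'
)
dataset = [json.loads(line) for line in exported]


def app(inputs: dict) -> str:
    # Stand-in for your real application under test.
    return {"capital of France": "Paris", "capital of Japan": "Tokyo"}[inputs["q"]]


failures = [dp["datapoint_id"] for dp in dataset if app(dp["inputs"]) != dp["ground_truth"]]
assert not failures, f"regressions on: {failures}"
```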

Experiment Run

An experiment run executes your application against a dataset and scores the outputs with evaluators. Experiments track metrics across all datapoints, enabling you to compare different versions of your application.
Experiment results showing metrics aggregated across datapoints
You can apply aggregation functions, filter results, and drill into individual traces.
Regression comparison between two experiment runs
Two experiment runs can be compared when their sessions share a common datapoint_id in metadata.
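The comparison above amounts to a join on `datapoint_id`. As a minimal sketch (not HoneyHive's API), here is how two runs' per-datapoint scores can be joined to surface regressions:

```python
# Scores from two experiment runs, keyed by the shared datapoint_id.
run_a = {"dp-1": {"accuracy": 1.0}, "dp-2": {"accuracy": 1.0}}
run_b = {"dp-1": {"accuracy": 1.0}, "dp-2": {"accuracy": 0.0}}

# Join on datapoint_ids present in both runs; keep pairs where run B regressed.
regressions = {
    dp_id: (run_a[dp_id]["accuracy"], run_b[dp_id]["accuracy"])
    for dp_id in run_a.keys() & run_b.keys()
    if run_b[dp_id]["accuracy"] < run_a[dp_id]["accuracy"]
}
# regressions == {"dp-2": (1.0, 0.0)}
```

Sessions without a shared `datapoint_id` simply drop out of the join, which is why the ID must be recorded in both runs' metadata for the comparison to work.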

Evaluator

An evaluator is a function that scores your application’s outputs. Evaluators can be:
  • Python functions - Custom logic you define
  • LLM-as-judge - Use an LLM to assess quality
  • Human evaluation - Route to annotation queues
Python evaluator code in the HoneyHive editor
Evaluators run client-side (in your environment) or server-side (on HoneyHive’s infrastructure). Learn more in Evaluators.
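A client-side Python evaluator can be as simple as a function that maps an output (and optional ground truth) to a score. The signature below is illustrative; see the Evaluators docs for the exact interface HoneyHive expects.

```python
def exact_match(output: str, ground_truth: str) -> float:
    """Return 1.0 when the output matches the expected answer exactly (ignoring whitespace)."""
    return 1.0 if output.strip() == ground_truth.strip() else 0.0


score = exact_match("30 days", "30 days ")  # whitespace differences are ignored
```

Deterministic checks like this run cheaply on every datapoint; LLM-as-judge evaluators follow the same pattern but delegate the scoring decision to a model call.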

Evaluation Framework

Understand the evaluation philosophy and how datasets, experiments, and evaluators work together.

Prompt

A prompt is a versioned configuration for an LLM call. It includes the model name, provider, prompt template, and hyperparameters (temperature, tools, etc.).
Prompt editor showing template and configuration
The Playground lets you iterate on prompts and “vibe-check” models. Domain experts can independently improve prompts based on evaluation results, then deploy changes without engineering involvement. Learn more in Prompts.
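Conceptually, a prompt version bundles the model name, provider, template, and hyperparameters into one artifact. The sketch below uses a plain dict and `string.Template` for illustration; the field names are assumptions, not HoneyHive's actual prompt schema.

```python
from string import Template

# Illustrative prompt configuration, versioned as a single unit.
prompt_v2 = {
    "version": "v2",
    "provider": "openai",
    "model": "gpt-4o",
    "template": Template("Summarize the following ticket:\n$ticket"),
    "hyperparameters": {"temperature": 0.2, "max_tokens": 256},
}

rendered = prompt_v2["template"].substitute(ticket="Login fails after reset.")
```

Versioning the whole bundle means a domain expert can bump the temperature or reword the template and deploy `v3` without touching application code.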

Deep Dives