HoneyHive separates the Control Plane from the Data Plane so your application data (traces, evaluations, datasets) never touches the control plane infrastructure. This federated architecture is the foundation of HoneyHive’s security model and determines where your data lives.

Your data stays isolated

Trace and evaluation data is stored in the Data Plane, which has no shared database or credentials with the Control Plane.

You choose where it lives

Deploy the Data Plane in any AWS region, in your own cloud account, or on-premise. See Hosting Models.

Nothing changes when you scale

Move from shared to dedicated infrastructure without changing your SDK integration or workflows.

How It Works

HoneyHive runs as two independent planes:
  • Control Plane — handles authentication (SSO, SAML 2.0, email/password, MFA), role-based access control, and organization/workspace/project configuration. Stores organizational metadata in PostgreSQL. Has no access to your trace data.
  • Data Plane — handles trace ingestion, event enrichment, evaluation jobs, and the LLM Proxy. Operates on its own databases and message queues. Verifies access using short-lived, cryptographically signed tokens issued by the Control Plane.
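The token handshake between the planes can be sketched in a few lines. This is an illustrative stand-in only: it uses a shared-secret HMAC from the standard library for brevity, whereas the real system uses asymmetric ECDSA-signed JWTs verified against the Control Plane's JWKS endpoint (so the planes never share a secret). The claim fields and TTL are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time

# Stand-in secret; the real design uses ECDSA keypairs + a JWKS endpoint,
# so the Data Plane verifies with a public key and no secret is shared.
SECRET = b"demo-only-secret"

def issue_token(claims: dict, ttl_s: int = 300) -> str:
    """Control Plane side: sign short-lived claims."""
    body = dict(claims, exp=int(time.time()) + ttl_s)
    payload = base64.urlsafe_b64encode(json.dumps(body).encode()).decode()
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return payload + "." + sig

def verify_token(token: str) -> dict:
    """Data Plane side: check signature and expiry before serving a request."""
    payload, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise PermissionError("bad signature")
    claims = json.loads(base64.urlsafe_b64decode(payload))
    if claims["exp"] < time.time():
        raise PermissionError("token expired")
    return claims

token = issue_token({"workspace": "acme", "scope": "traces:write"})
claims = verify_token(token)
```

Because tokens are short-lived and verified locally, the Data Plane can authorize requests without ever querying a Control Plane database.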

Control Plane Services

The Control Plane manages authentication, authorization, and platform configuration. It has no access to your trace or evaluation data.
| Service | What it does |
| --- | --- |
| Backend API | REST API for authentication, RBAC, organization/workspace/project management, prompt templates, and alert configuration. Exposes a JWKS endpoint for Data Plane token verification. |
| Web UI | Next.js web application for all platform features. Communicates with both Control Plane and Data Plane APIs. |
| Controller | Orchestrates Control Plane and Data Plane coordination. Manages Data Plane lifecycle, stream routing, and identity bootstrap (ECDSA keypairs for cluster JWTs). Communicates with the Data Plane Controller via a bidirectional gRPC stream. |
| Writer Service | Consumes events from the NATS queue and writes them to ClickHouse. Handles buffering, batching, and real-time enrichment (session linking, metadata inheritance, computed fields). Includes retry logic with exponential backoff and a dead letter queue (S3) for failed writes. |
| Notification Service | Processes alert notifications and delivers them via email (SES), Slack, or webhooks. Supports scope-based routing and severity stages (critical, warning, resolution). |
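The Writer Service's retry-then-dead-letter behavior can be sketched as follows. This is a minimal illustration, not the actual implementation: `write_fn` stands in for the ClickHouse client, the in-memory `dlq` list stands in for the S3 dead letter queue, and the attempt count and delays are arbitrary.

```python
import time

def write_with_retry(batch, write_fn, dlq, max_attempts=4, base_delay=0.5):
    """Attempt a batch write; back off exponentially, then dead-letter.

    Returns True if the write eventually succeeded, False if the batch
    was sent to the dead letter queue.
    """
    for attempt in range(max_attempts):
        try:
            write_fn(batch)
            return True
        except ConnectionError:
            # Exponential backoff: base_delay, 2x, 4x, 8x, ...
            time.sleep(base_delay * (2 ** attempt))
    dlq.append(batch)  # stand-in for uploading the failed batch to S3
    return False

# Usage: a write that fails twice and then succeeds.
dlq: list = []
attempts = {"n": 0}

def flaky_write(batch):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("store unavailable")

write_with_retry([{"event_id": "e1"}], flaky_write, dlq, base_delay=0.01)
```

Dead-lettering failed batches instead of dropping them is what lets the pipeline recover events after a prolonged outage.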

Control Plane Data Stores

| Store | What it holds |
| --- | --- |
| PostgreSQL (cpdb) | User accounts, roles and permissions, organization/workspace/project hierarchy, prompt templates, evaluator configurations, alert definitions. |
| ClickHouse | Traces, spans, evaluation scores, session aggregates, event schemas. High-performance columnar database optimized for high-volume writes and analytical queries. Data encrypted at rest with configurable retention policies. |
| Object Storage (S3) | Dead letter queue for the Writer Service (failed write batches). |
| Redis | Session cache, rate limiting, and ephemeral state. |

Data Plane Services

The Data Plane processes and stores all application data. It verifies access using JWT tokens issued by the Control Plane via a JWKS endpoint — the two planes share no database or credentials.
| Service | What it does |
| --- | --- |
| Ingestion Service | Receives traces and spans from the HoneyHive SDK via OTLP-compatible HTTP and gRPC endpoints. Validates API keys, normalizes events, and publishes to NATS for downstream processing. Acknowledges receipt immediately to minimize client latency. |
| Backend API | REST API for Data Plane operations: datasets, datapoints, metrics, experiment runs, charts, provider secrets, and storage. Authenticates requests via JWT tokens or API keys. |
| Controller | Manages Data Plane lifecycle and communicates with the Control Plane Controller via a bidirectional gRPC stream. Reports health metrics and handles identity bootstrap. |
| Evaluation Service | Consumes events from the NATS queue and executes evaluators (Python, LLM-based, or custom). Publishes evaluation scores to the control plane event stream for persistence. Manages annotation queues and processes online evaluators configured for a project. |
| LLM Proxy | Routes LLM requests to AI providers via LiteLLM for Playground and LLM-based evaluators. Supports multiple providers (OpenAI, Anthropic, AWS Bedrock, Azure OpenAI, Google Vertex AI). Provider credentials are encrypted and scoped per workspace (see Provider Keys). |
| Python Metric Service | Executes user-defined Python metric code in a sandboxed environment with RestrictedPython. Supports common libraries (pandas, numpy, sklearn, jsonschema) with timeout protection and code size limits. |
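The Python Metric Service's sandboxing can be illustrated with a stdlib-only sketch. The real service uses RestrictedPython and enforces wall-clock timeouts; this sketch only shows the restricted-builtins and code-size-limit ideas, and the `metric(event)` contract and limit value are hypothetical.

```python
# Illustrative only: a real sandbox (e.g. RestrictedPython) does far more,
# including AST-level restrictions and wall-clock timeouts (e.g. a subprocess).
SAFE_BUILTINS = {"len": len, "min": min, "max": max,
                 "sum": sum, "abs": abs, "round": round}
MAX_CODE_BYTES = 10_000  # hypothetical code size limit

def run_metric(code: str, event: dict) -> float:
    """Compile and run user metric code with a tiny builtin surface.

    The user code is expected to define metric(event) -> number.
    With no __import__ in builtins, `import` statements fail outright.
    """
    if len(code.encode()) > MAX_CODE_BYTES:
        raise ValueError("metric code exceeds size limit")
    scope = {"__builtins__": SAFE_BUILTINS}
    exec(compile(code, "<metric>", "exec"), scope)
    return float(scope["metric"](event))

user_code = "def metric(event):\n    return len(event['outputs']['text'])"
score = run_metric(user_code, {"outputs": {"text": "hello"}})
```

Because `__import__` is absent from the builtin surface, even a simple `import os` in user code raises immediately rather than reaching the filesystem.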

Data Plane Data Stores

| Store | What it holds |
| --- | --- |
| PostgreSQL (dpdb) | Datasets, datapoints, metric definitions, metric versions, provider secrets (encrypted), chart configurations, experiment run metadata. |
| Object Storage (S3) | Trace data, large payloads, and long-term archival. Server-side encryption (SSE-KMS) with versioning for audit trails and lifecycle policies for cost optimization. |
| Redis | Caching and rate limiting. |
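The Redis-backed rate limiting mentioned above typically follows a fixed-window counter pattern (INCR plus EXPIRE per key and window). A minimal in-memory sketch of that pattern, with hypothetical limit and window values:

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """In-memory stand-in for the Redis INCR/EXPIRE rate-limiting pattern."""

    def __init__(self, limit: int, window_s: int = 60):
        self.limit = limit          # max requests per key per window
        self.window_s = window_s    # window length in seconds
        self.counts = defaultdict(int)

    def allow(self, api_key: str, now=None) -> bool:
        now = time.time() if now is None else now
        # Requests in the same window share a counter; a new window resets it,
        # just as EXPIRE would evict the Redis key.
        window = int(now // self.window_s)
        self.counts[(api_key, window)] += 1
        return self.counts[(api_key, window)] <= self.limit

limiter = FixedWindowLimiter(limit=100, window_s=60)
allowed = limiter.allow("hh_api_key_example")
```

In production the counter lives in Redis so every replica of a service enforces the same budget; the in-process dict here is only for illustration.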

Event Processing Pipeline

The ingestion pipeline is designed for high throughput, low latency, and at-least-once delivery — failed writes are retried and dead-lettered rather than silently dropped:
  1. Ingestion — the SDK sends traces to the Ingestion Service via OTLP-compatible HTTP or gRPC. The service validates API keys, normalizes incoming events, and publishes to encrypted NATS streams. Receipt is acknowledged immediately to minimize client latency.
  2. Writing and enrichment — the Writer Service pulls events from the CP NATS stream in batches. It enriches events in real time (session linking, metadata inheritance, computed fields) and writes them to ClickHouse. Failed batches are retried with exponential backoff; persistently failing events are sent to a dead letter queue on S3.
  3. Evaluation — the Evaluation Service consumes from the DP NATS stream and executes configured evaluators. Python metrics run in the sandboxed Python Metric Service. LLM-based evaluators route through the LLM Proxy. Scores are published to the CP NATS stream, where the Writer Service persists them to ClickHouse.
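The enrichment in step 2 can be sketched as a pure function over a batch. This is illustrative only: the field names (`session_id`, `start_ts`, `end_ts`, `metadata`) are hypothetical stand-ins for the real event schema.

```python
def enrich(events: list, sessions: dict) -> list:
    """Writer-style enrichment: link each event to its session,
    inherit session metadata, and add a computed field."""
    out = []
    for ev in events:
        session = sessions.get(ev.get("session_id"), {})
        enriched = {
            **ev,
            # Metadata inheritance: session-level keys fill gaps,
            # but event-level keys always win.
            "metadata": {**session.get("metadata", {}),
                         **ev.get("metadata", {})},
            # Computed field: duration derived from span timestamps.
            "duration_ms": ev["end_ts"] - ev["start_ts"],
        }
        out.append(enriched)
    return out

sessions = {"s1": {"metadata": {"env": "prod", "user": "u1"}}}
events = [{"session_id": "s1", "start_ts": 100, "end_ts": 250,
           "metadata": {"user": "u2"}}]
batch = enrich(events, sessions)
```

Doing this at write time means ClickHouse rows arrive fully denormalized, so analytical queries never need a join back to session records.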

Message Queues

HoneyHive uses NATS with JetStream for durable, at-least-once message delivery:
| Stream | Subjects | Purpose |
| --- | --- | --- |
| `events-stream` (CP NATS) | `events.>` | Trace and span events for the Writer Service |
| `notifications-stream` (CP NATS) | `notifications.>` | Alert notifications for the Notification Service |
| `evaluation-stream` (DP NATS) | `evaluation.>` | Evaluation tasks for the Evaluation Service |
In production, the Control Plane and Data Plane run separate NATS clusters. The CP NATS cluster uses TLS for external communication. The DP NATS cluster runs internally with no external access.
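The `>` in subjects like `events.>` is NATS's trailing wildcard: it matches one or more dot-separated tokens, while `*` matches exactly one. A small sketch of those matching rules, for reading the table above:

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """NATS subject matching: '*' matches exactly one token,
    '>' matches one or more trailing tokens."""
    p_toks, s_toks = pattern.split("."), subject.split(".")
    for i, tok in enumerate(p_toks):
        if tok == ">":
            return len(s_toks) > i  # '>' must consume at least one token
        if i >= len(s_toks):
            return False
        if tok != "*" and tok != s_toks[i]:
            return False
    return len(p_toks) == len(s_toks)

# 'events.>' catches every event subject, however deeply nested...
subject_matches("events.>", "events.session.abc123")  # matches
# ...but not the bare root token.
subject_matches("events.>", "events")                 # does not match
```

So a consumer bound to `events.>` receives every event regardless of how publishers subdivide the subject space beneath `events`.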

ClickHouse Schema

ClickHouse stores data in four primary tables:
| Table | Engine | Purpose |
| --- | --- | --- |
| `events` | ReplacingMergeTree | Traces, spans, and event data (24 columns) |
| `session_aggregates` | AggregatingMergeTree | Pre-computed session-level aggregations |
| `event_schemas` | AggregatingMergeTree | Schema tracking for event structure discovery |
| `project_event_details` | AggregatingMergeTree | Per-project event metadata and statistics |
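ReplacingMergeTree makes re-ingested or updated events safe under at-least-once delivery: during background merges, ClickHouse keeps only the latest version of each row per sorting key (until a merge runs, reads may still see duplicates unless queried with `FINAL`). A sketch of that dedup semantics, with hypothetical column names `event_id` and `updated_at`:

```python
def replacing_merge(rows: list, key: str = "event_id",
                    version: str = "updated_at") -> list:
    """Sketch of ReplacingMergeTree merge semantics: for rows sharing
    the same key, keep only the one with the highest version."""
    latest = {}
    for row in rows:
        k = row[key]
        # Later or equal versions replace earlier ones, as in a merge pass.
        if k not in latest or row[version] >= latest[k][version]:
            latest[k] = row
    return list(latest.values())

rows = [
    {"event_id": "e1", "updated_at": 1, "status": "pending"},
    {"event_id": "e1", "updated_at": 2, "status": "done"},  # redelivered update
    {"event_id": "e2", "updated_at": 1, "status": "ok"},
]
deduped = replacing_merge(rows)
```

This is why the pipeline can acknowledge and retry freely: a duplicate write converges to a single row rather than corrupting counts.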

Hosting Models

The federated architecture enables three hosting options. In all models, the Data Plane’s databases are fully separate from the Control Plane.
| Model | Control Plane | Data Plane | Best for |
| --- | --- | --- | --- |
| Multi-Tenant SaaS | Shared | Shared (AWS US-West-2) | Getting started, teams without strict data residency requirements |
| Dedicated Cloud | Shared (managed by HoneyHive) | Dedicated (your AWS region) | Regulated enterprises needing data residency or private networking |
| Self-Hosted | Your environment | Your environment | Organizations requiring complete infrastructure control |
Moving from Multi-Tenant SaaS to Dedicated Cloud or Self-Hosted increases physical isolation without changing how you use the platform — your SDK integration, dashboards, and workflows stay the same.

Data Residency

You control where your AI application data is stored:
| Requirement | Solution |
| --- | --- |
| US data residency | Multi-Tenant SaaS (AWS US-West-2) |
| EU data residency (GDPR) | Dedicated Cloud in an EU AWS region |
| Custom region | Dedicated Cloud in any AWS region worldwide |
| Full control | Self-Hosted on AWS, GCP, Azure, or on-premise |
For Dedicated Cloud and Self-Hosted customers, HoneyHive supports private connectivity via AWS PrivateLink and VPC Peering so trace data never traverses the public internet.

Reliability & Performance

High Availability

  • Multi-AZ deployment — services and databases distributed across multiple availability zones
  • Automatic failover — database and compute resources fail over automatically on failure
  • NATS clustering — 3-replica NATS clusters with JetStream for durable message delivery
  • Health checks — continuous monitoring with automatic recovery
  • Zero-downtime deployments — rolling updates ensure no interruption during platform upgrades

Scalability

  • Horizontal auto-scaling — Kubernetes HPA scales pods based on CPU and memory utilization
  • Independent scaling — Control Plane and Data Plane scale independently based on their respective workloads
  • Queue-based buffering — NATS decouples ingestion from processing, absorbing traffic spikes with at-least-once delivery guarantees
  • Batch processing — the Writer Service buffers and batches writes to ClickHouse for optimal throughput

Security

For encryption, network security, infrastructure details, and compliance certifications, see Security.