HoneyHive separates the Control Plane from the Data Plane so your application data (traces, evaluations, datasets) never touches the control plane infrastructure. This federated architecture is the foundation of HoneyHive’s security model and determines where your data lives.
- Your data stays isolated — trace and evaluation data is stored in the Data Plane, which shares no database or credentials with the Control Plane.
- You choose where it lives — deploy the Data Plane in any AWS region, in your own cloud account, or on-premise. See Hosting Models.
- Nothing changes when you scale — move from shared to dedicated infrastructure without changing your SDK integration or workflows.
How It Works
HoneyHive runs as two independent planes:

- Control Plane — handles authentication (SSO, SAML 2.0, email/password, MFA), role-based access control, and organization/workspace/project configuration. Stores organizational metadata in PostgreSQL. Has no access to your trace data.
- Data Plane — handles trace ingestion, event enrichment, evaluation jobs, and the LLM proxy. Operates on its own databases and message queues. Verifies access using short-lived, cryptographically signed tokens issued by the Control Plane.
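The cross-plane token check can be sketched in stdlib-only Python. This is an illustration, not HoneyHive's implementation: the real system signs tokens with ECDSA keypairs and publishes verification keys at a JWKS endpoint, whereas this sketch uses a shared HMAC secret, and the claim names are invented.

```python
import base64, hashlib, hmac, json, time

SECRET = b"demo-shared-secret"  # illustration only; the real flow uses ECDSA keypairs + JWKS

def _b64url(data: bytes) -> bytes:
    return base64.urlsafe_b64encode(data).rstrip(b"=")

def issue_token(org_id, ttl_s=300):
    """Control Plane side: mint a short-lived signed token."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps({"org": org_id, "exp": int(time.time()) + ttl_s}).encode())
    sig = _b64url(hmac.new(SECRET, header + b"." + payload, hashlib.sha256).digest())
    return b".".join([header, payload, sig]).decode()

def verify_token(token):
    """Data Plane side: check signature and expiry before serving any data."""
    header, payload, sig = token.encode().split(b".")
    expected = _b64url(hmac.new(SECRET, header + b"." + payload, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        raise PermissionError("bad signature")
    claims = json.loads(base64.urlsafe_b64decode(payload + b"=" * (-len(payload) % 4)))
    if claims["exp"] < time.time():
        raise PermissionError("token expired")
    return claims

claims = verify_token(issue_token("org_123"))
```

Because the tokens are short-lived and verified locally against published keys, the Data Plane never needs to call back into a shared database to authorize a request.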
Detailed service architecture
Control Plane Services
The Control Plane manages authentication, authorization, and platform configuration. It has no access to your trace or evaluation data.

| Service | What it does |
|---|---|
| Backend API | REST API for authentication, RBAC, organization/workspace/project management, prompt templates, and alert configuration. Exposes a JWKS endpoint for Data Plane token verification. |
| Web UI | Next.js web application for all platform features. Communicates with both Control Plane and Data Plane APIs. |
| Controller | Orchestrates Control Plane and Data Plane coordination. Manages Data Plane lifecycle, stream routing, and identity bootstrap (ECDSA keypairs for cluster JWTs). Communicates with Data Plane Controller via bidirectional gRPC stream. |
| Writer Service | Consumes events from the NATS queue and writes them to ClickHouse. Handles buffering, batching, and real-time enrichment (session linking, metadata inheritance, computed fields). Includes retry logic with exponential backoff and a dead letter queue (S3) for failed writes. |
| Notification Service | Processes alert notifications and delivers them via email (SES), Slack, or webhooks. Supports scope-based routing and severity stages (critical, warning, resolution). |
Control Plane Data Stores
| Store | What it holds |
|---|---|
| PostgreSQL (cpdb) | User accounts, roles and permissions, organization/workspace/project hierarchy, prompt templates, evaluator configurations, alert definitions. |
| ClickHouse | Traces, spans, evaluation scores, session aggregates, event schemas. High-performance columnar database optimized for high-volume writes and analytical queries. Data encrypted at rest with configurable retention policies. |
| Object Storage (S3) | Dead letter queue for the Writer Service (failed write batches). |
| Redis | Session cache, rate limiting, and ephemeral state. |
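Redis's rate-limiting role can be pictured with an in-memory stand-in. This is a hypothetical sketch, not HoneyHive's code: a production limiter would keep the counters in Redis (e.g. INCR plus EXPIRE) so every API replica shares one view of each client's usage.

```python
import time

class FixedWindowLimiter:
    """In-memory stand-in for a Redis-backed fixed-window rate limiter."""

    def __init__(self, limit, window_s):
        self.limit, self.window_s = limit, window_s
        self.counts = {}  # key -> (count, window_start)

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        count, start = self.counts.get(key, (0, now))
        if now - start >= self.window_s:  # window expired: start a fresh one
            count, start = 0, now
        if count >= self.limit:
            return False  # over budget for this window
        self.counts[key] = (count + 1, start)
        return True

limiter = FixedWindowLimiter(limit=2, window_s=60)
results = [limiter.allow("api_key_1") for _ in range(3)]  # → [True, True, False]
```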
Data Plane Services
The Data Plane processes and stores all application data. It verifies access using JWT tokens issued by the Control Plane via a JWKS endpoint — the two planes share no database or credentials.

| Service | What it does |
|---|---|
| Ingestion Service | Receives traces and spans from the HoneyHive SDK via OTLP-compatible HTTP and gRPC endpoints. Validates API keys, normalizes events, and publishes to NATS for downstream processing. Acknowledges receipt immediately to minimize client latency. |
| Backend API | REST API for Data Plane operations: datasets, datapoints, metrics, experiment runs, charts, provider secrets, and storage. Authenticates requests via JWT tokens or API keys. |
| Controller | Manages Data Plane lifecycle and communicates with the Control Plane Controller via bidirectional gRPC stream. Reports health metrics and handles identity bootstrap. |
| Evaluation Service | Consumes events from the NATS queue and executes evaluators (Python, LLM-based, or custom). Publishes evaluation scores to the control plane event stream for persistence. Manages annotation queues and processes online evaluators configured for a project. |
| LLM Proxy | Routes LLM requests to AI providers via LiteLLM for Playground and LLM-based evaluators. Supports multiple providers (OpenAI, Anthropic, AWS Bedrock, Azure OpenAI, Google Vertex AI). Provider credentials are encrypted and scoped per workspace (see Provider Keys). |
| Python Metric Service | Executes user-defined Python metric code in a sandboxed environment with RestrictedPython. Supports common libraries (pandas, numpy, sklearn, jsonschema) with timeout protection and code size limits. |
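The sandboxing idea behind the Python Metric Service can be sketched with the stdlib: strip the builtin namespace down to a safe allowlist and enforce a code size cap. The real service compiles code with RestrictedPython and adds process-level timeouts; the `metric(output)` convention and the limit value here are invented for illustration.

```python
MAX_CODE_BYTES = 10_000  # code size limit (value illustrative)
SAFE_BUILTINS = {"len": len, "sum": sum, "min": min, "max": max, "abs": abs, "round": round}

def run_metric(code, inputs, output):
    """Execute user-defined metric code with a stripped-down builtin set.
    Stdlib-only sketch of the sandboxing idea, not the service's actual mechanism."""
    if len(code.encode()) > MAX_CODE_BYTES:
        raise ValueError("metric code exceeds size limit")
    scope = {"__builtins__": SAFE_BUILTINS, **inputs}
    exec(compile(code, "<metric>", "exec"), scope)
    return scope["metric"](output)  # illustrative convention: user defines metric(output) -> score

score = run_metric(
    "def metric(output):\n    return min(1.0, len(output) / 100)",
    inputs={},
    output="The model answered concisely.",
)
```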
Data Plane Data Stores
| Store | What it holds |
|---|---|
| PostgreSQL (dpdb) | Datasets, datapoints, metric definitions, metric versions, provider secrets (encrypted), chart configurations, experiment run metadata. |
| Object Storage (S3) | Trace data, large payloads, and long-term archival. Server-side encryption (SSE-KMS) with versioning for audit trails and lifecycle policies for cost optimization. |
| Redis | Caching and rate limiting. |
Event Processing Pipeline
The ingestion pipeline is designed for high throughput, low latency, and zero data loss:

1. Ingestion — the SDK sends traces to the Ingestion Service via OTLP-compatible HTTP or gRPC. The service validates API keys, normalizes incoming events, and publishes to encrypted NATS streams. Receipt is acknowledged immediately to minimize client latency.
2. Writing and enrichment — the Writer Service pulls events from the CP NATS stream in batches. It enriches events in real time (session linking, metadata inheritance, computed fields) and writes them to ClickHouse. Failed batches are retried with exponential backoff; persistently failing events are sent to a dead letter queue on S3.
3. Evaluation — the Evaluation Service consumes from the DP NATS stream and executes configured evaluators. Python metrics run in the sandboxed Python Metric Service. LLM-based evaluators route through the LLM Proxy. Scores are published to the CP NATS stream, where the Writer Service persists them to ClickHouse.
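The failure handling in the writing step can be sketched as a small retry loop, assuming a generic `write` callable and a `dead_letter` sink standing in for ClickHouse and S3; the attempt count and delays are illustrative, not HoneyHive's actual settings.

```python
import time

def write_with_retry(batch, write, dead_letter, max_attempts=4, base_delay_s=0.5):
    """Retry a failed batch with exponential backoff, then dead-letter it.
    `write` stands in for the ClickHouse insert; `dead_letter` for the S3 DLQ."""
    for attempt in range(max_attempts):
        try:
            write(batch)
            return True
        except Exception:
            if attempt < max_attempts - 1:
                time.sleep(base_delay_s * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    dead_letter(batch)  # persistently failing batch goes to the dead letter queue
    return False
```

Because the dead letter queue preserves the full batch, failed writes can be replayed later rather than lost, which is what makes the "zero data loss" claim workable.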
Message Queues
HoneyHive uses NATS with JetStream for durable, at-least-once message delivery:

| Stream | Subjects | Purpose |
|---|---|---|
| events-stream (CP NATS) | events.> | Trace and span events for the Writer Service |
| notifications-stream (CP NATS) | notifications.> | Alert notifications for the Notification Service |
| evaluation-stream (DP NATS) | evaluation.> | Evaluation tasks for the Evaluation Service |
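The Subjects column uses NATS wildcard syntax: `>` matches one or more trailing tokens, so `events.>` covers every subject under `events.`. A minimal matcher shows the semantics (sketch only; NATS servers implement this natively):

```python
def subject_matches(pattern, subject):
    """NATS-style subject matching: tokens are dot-separated, '*' matches
    exactly one token, '>' matches one or more trailing tokens."""
    p_tokens, s_tokens = pattern.split("."), subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            return len(s_tokens) > i  # '>' must match at least one token
        if i >= len(s_tokens) or (p != "*" and p != s_tokens[i]):
            return False
    return len(p_tokens) == len(s_tokens)

assert subject_matches("events.>", "events.project_1.span")
```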
ClickHouse Schema
ClickHouse stores data in four primary tables:

| Table | Engine | Purpose |
|---|---|---|
| events | ReplacingMergeTree | Traces, spans, and event data (24 columns) |
| session_aggregates | AggregatingMergeTree | Pre-computed session-level aggregations |
| event_schemas | AggregatingMergeTree | Schema tracking for event structure discovery |
| project_event_details | AggregatingMergeTree | Per-project event metadata and statistics |
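What `session_aggregates` precomputes can be pictured as a rollup of span events into per-session counters. The field names below are invented for illustration; in ClickHouse, the AggregatingMergeTree engine maintains such rollups incrementally at insert time rather than recomputing them per query.

```python
from collections import defaultdict

def aggregate_sessions(events):
    """Roll span-level events up into per-session aggregates
    (illustrative fields, not the actual table schema)."""
    sessions = defaultdict(lambda: {"events": 0, "errors": 0, "total_latency_ms": 0})
    for e in events:
        agg = sessions[e["session_id"]]
        agg["events"] += 1
        agg["errors"] += 1 if e.get("error") else 0
        agg["total_latency_ms"] += e.get("latency_ms", 0)
    return dict(sessions)

aggs = aggregate_sessions([
    {"session_id": "s1", "latency_ms": 120},
    {"session_id": "s1", "latency_ms": 80, "error": "timeout"},
    {"session_id": "s2", "latency_ms": 40},
])
```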
Hosting Models
The federated architecture enables three hosting options. In all models, the Data Plane’s databases are fully separate from the Control Plane.

| Model | Control Plane | Data Plane | Best for |
|---|---|---|---|
| Multi-Tenant SaaS | Shared | Shared (AWS US-West-2) | Getting started, teams without strict data residency requirements |
| Dedicated Cloud | Shared (managed by HoneyHive) | Dedicated (your AWS region) | Regulated enterprises needing data residency or private networking |
| Self-Hosted | Your environment | Your environment | Organizations requiring complete infrastructure control |
Moving from Multi-Tenant SaaS to Dedicated Cloud or Self-Hosted increases physical isolation without changing how you use the platform — your SDK integration, dashboards, and workflows stay the same.
Data Residency
You control where your AI application data is stored:

| Requirement | Solution |
|---|---|
| US data residency | Multi-Tenant SaaS (AWS US-West-2) |
| EU data residency (GDPR) | Dedicated Cloud in an EU AWS region |
| Custom region | Dedicated Cloud in any AWS region worldwide |
| Full control | Self-Hosted on AWS, GCP, Azure, or on-premise |
Reliability & Performance
High Availability
- Multi-AZ deployment — services and databases distributed across multiple availability zones
- Automatic failover — database and compute resources fail over automatically on failure
- NATS clustering — 3-replica NATS clusters with JetStream for durable message delivery
- Health checks — continuous monitoring with automatic recovery
- Zero-downtime deployments — rolling updates ensure no interruption during platform upgrades
Scalability
- Horizontal auto-scaling — Kubernetes HPA scales pods based on CPU and memory utilization
- Independent scaling — Control Plane and Data Plane scale independently based on their respective workloads
- Queue-based buffering — NATS decouples ingestion from processing, absorbing traffic spikes with at-least-once delivery guarantees
- Batch processing — the Writer Service buffers and batches writes to ClickHouse for optimal throughput
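The batching behavior described above can be sketched as a buffer that flushes on whichever threshold is hit first, batch size or age; the thresholds and callback shape are illustrative, not the Writer Service's actual parameters.

```python
import time

class BatchBuffer:
    """Buffer events and flush when a size or an age threshold is reached,
    in the spirit of how the Writer Service batches ClickHouse writes."""

    def __init__(self, flush, max_batch=500, max_age_s=2.0):
        self.flush, self.max_batch, self.max_age_s = flush, max_batch, max_age_s
        self.buf, self.first_at = [], None

    def add(self, event, now=None):
        now = time.monotonic() if now is None else now
        if not self.buf:
            self.first_at = now  # age is measured from the oldest buffered event
        self.buf.append(event)
        if len(self.buf) >= self.max_batch or now - self.first_at >= self.max_age_s:
            self.flush(self.buf)
            self.buf, self.first_at = [], None

batches = []
buf = BatchBuffer(batches.append, max_batch=2, max_age_s=60)
buf.add({"event": "a"})
buf.add({"event": "b"})  # size threshold reached: one flush of two events
```

The age threshold bounds how stale a buffered event can get under low traffic, while the size threshold keeps writes large enough to be efficient under high traffic.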

