HoneyHive separates the Control Plane from the Data Plane so your application data (traces, evaluations, datasets) never touches the control plane infrastructure. This federated architecture is the foundation of HoneyHive’s security model and determines where your data lives.

Your data stays isolated

Trace and evaluation data is stored in the Data Plane, which has no shared database or credentials with the Control Plane.

You choose where it lives

Deploy the Data Plane in any AWS region, in your own cloud account, or on-premise. See Hosting Models.

Nothing changes when you scale

Move from shared to dedicated infrastructure without changing your SDK integration or workflows.

How It Works

HoneyHive runs as two independent planes:
  • Control Plane — handles authentication (SSO, SAML 2.0, email/password, MFA), role-based access control, and organization/workspace/project configuration. Stores organizational metadata in PostgreSQL. Has no access to your trace data.
  • Data Plane — handles trace ingestion, event enrichment, evaluation jobs, and the LLM Proxy. Operates on its own databases and message queues. Verifies access using short-lived, cryptographically signed tokens issued by the Control Plane.
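The token handshake between the planes can be sketched in a few lines. This is an illustrative stand-in only: it uses a shared-secret HMAC from the standard library for brevity, whereas the real system uses asymmetric ECDSA-signed JWTs verified against the Control Plane's JWKS endpoint (so the planes never share a secret). The claim fields and TTL are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time

# Stand-in secret; the real design uses ECDSA keypairs + a JWKS endpoint,
# so the Data Plane verifies with a public key and no secret is shared.
SECRET = b"demo-only-secret"

def issue_token(claims: dict, ttl_s: int = 300) -> str:
    """Control Plane side: sign short-lived claims."""
    body = dict(claims, exp=int(time.time()) + ttl_s)
    payload = base64.urlsafe_b64encode(json.dumps(body).encode()).decode()
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return payload + "." + sig

def verify_token(token: str) -> dict:
    """Data Plane side: check signature and expiry before serving a request."""
    payload, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise PermissionError("bad signature")
    claims = json.loads(base64.urlsafe_b64decode(payload))
    if claims["exp"] < time.time():
        raise PermissionError("token expired")
    return claims

token = issue_token({"workspace": "acme", "scope": "traces:write"})
claims = verify_token(token)
```

Because tokens are short-lived and verified locally, the Data Plane can authorize requests without ever querying a Control Plane database.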

Control Plane Services

The Control Plane manages authentication, authorization, and platform configuration. It has no access to your trace or evaluation data.
| Service | What it does |
| --- | --- |
| Backend API | REST API for authentication, RBAC, organization/workspace/project management, prompt templates, and alert configuration. Exposes a JWKS endpoint for Data Plane token verification. |
| Web UI | Next.js web application for all platform features. Communicates with both Control Plane and Data Plane APIs. |
| Controller | Orchestrates Control Plane and Data Plane coordination. Manages Data Plane lifecycle, stream routing, and identity bootstrap (ECDSA keypairs for cluster JWTs). Communicates with the Data Plane Controller via a bidirectional gRPC stream. |
| Writer Service | Consumes events from the NATS queue and writes them to ClickHouse. Handles buffering, batching, and real-time enrichment (session linking, metadata inheritance, computed fields). Includes retry logic with exponential backoff and a dead letter queue (S3) for failed writes. |
| Notification Service | Processes alert notifications and delivers them via email (SES), Slack, or webhooks. Supports scope-based routing and severity stages (critical, warning, resolution). |
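The Writer Service's retry-then-dead-letter behavior can be sketched as follows. This is a minimal illustration, not the actual implementation: `write_fn` stands in for the ClickHouse client, the in-memory `dlq` list stands in for the S3 dead letter queue, and the attempt count and delays are arbitrary.

```python
import time

def write_with_retry(batch, write_fn, dlq, max_attempts=4, base_delay=0.5):
    """Attempt a batch write; back off exponentially, then dead-letter.

    Returns True if the write eventually succeeded, False if the batch
    was sent to the dead letter queue.
    """
    for attempt in range(max_attempts):
        try:
            write_fn(batch)
            return True
        except ConnectionError:
            # Exponential backoff: base_delay, 2x, 4x, 8x, ...
            time.sleep(base_delay * (2 ** attempt))
    dlq.append(batch)  # stand-in for uploading the failed batch to S3
    return False

# Usage: a write that fails twice and then succeeds.
dlq: list = []
attempts = {"n": 0}

def flaky_write(batch):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("store unavailable")

write_with_retry([{"event_id": "e1"}], flaky_write, dlq, base_delay=0.01)
```

Dead-lettering failed batches instead of dropping them is what lets the pipeline recover events after a prolonged outage.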

Control Plane Data Stores

| Store | What it holds |
| --- | --- |
| PostgreSQL (cpdb) | User accounts, roles and permissions, organization/workspace/project hierarchy, prompt templates, evaluator configurations, alert definitions. |
| ClickHouse | Traces, spans, evaluation scores, session aggregates, event schemas. High-performance columnar database optimized for high-volume writes and analytical queries. Data encrypted at rest with configurable retention policies. |
| Object Storage (S3) | Dead letter queue for the Writer Service (failed write batches). |
| Redis | Session cache, rate limiting, and ephemeral state. |

Data Plane Services

The Data Plane processes and stores all application data. It verifies access using JWT tokens issued by the Control Plane via a JWKS endpoint — the two planes share no database or credentials.
| Service | What it does |
| --- | --- |
| Ingestion Service | Receives traces and spans from the HoneyHive SDK via OTLP-compatible HTTP and gRPC endpoints. Validates API keys, normalizes events, and publishes to NATS for downstream processing. Acknowledges receipt immediately to minimize client latency. |
| Backend API | REST API for Data Plane operations: datasets, datapoints, metrics, experiment runs, charts, provider secrets, and storage. Authenticates requests via JWT tokens or API keys. |
| Controller | Manages Data Plane lifecycle and communicates with the Control Plane Controller via a bidirectional gRPC stream. Reports health metrics and handles identity bootstrap. |
| Evaluation Service | Consumes events from the NATS queue and executes evaluators (Python, LLM-based, or custom). Publishes evaluation scores to the control plane event stream for persistence. Manages annotation queues and processes online evaluators configured for a project. |
| LLM Proxy | Routes LLM requests to AI providers via LiteLLM for Playground and LLM-based evaluators. Supports multiple providers (OpenAI, Anthropic, AWS Bedrock, Azure OpenAI, Google Vertex AI). Provider credentials are encrypted and scoped per workspace (see Provider Keys). |
| Python Metric Service | Executes user-defined Python metric code in a sandboxed environment with RestrictedPython. Supports common libraries (pandas, numpy, sklearn, jsonschema) with timeout protection and code size limits. |
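The Python Metric Service's sandboxing can be illustrated with a stdlib-only sketch. The real service uses RestrictedPython and enforces wall-clock timeouts; this sketch only shows the restricted-builtins and code-size-limit ideas, and the `metric(event)` contract and limit value are hypothetical.

```python
# Illustrative only: a real sandbox (e.g. RestrictedPython) does far more,
# including AST-level restrictions and wall-clock timeouts (e.g. a subprocess).
SAFE_BUILTINS = {"len": len, "min": min, "max": max,
                 "sum": sum, "abs": abs, "round": round}
MAX_CODE_BYTES = 10_000  # hypothetical code size limit

def run_metric(code: str, event: dict) -> float:
    """Compile and run user metric code with a tiny builtin surface.

    The user code is expected to define metric(event) -> number.
    With no __import__ in builtins, `import` statements fail outright.
    """
    if len(code.encode()) > MAX_CODE_BYTES:
        raise ValueError("metric code exceeds size limit")
    scope = {"__builtins__": SAFE_BUILTINS}
    exec(compile(code, "<metric>", "exec"), scope)
    return float(scope["metric"](event))

user_code = "def metric(event):\n    return len(event['outputs']['text'])"
score = run_metric(user_code, {"outputs": {"text": "hello"}})
```

Because `__import__` is absent from the builtin surface, even a simple `import os` in user code raises immediately rather than reaching the filesystem.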

Data Plane Data Stores

| Store | What it holds |
| --- | --- |
| PostgreSQL (dpdb) | Datasets, datapoints, metric definitions, metric versions, provider secrets (encrypted), chart configurations, experiment run metadata. |
| Object Storage (S3) | Trace data, large payloads, and long-term archival. Server-side encryption (SSE-KMS) with versioning for audit trails and lifecycle policies for cost optimization. |
| Redis | Caching and rate limiting. |
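The Redis-backed rate limiting mentioned above typically follows a fixed-window counter pattern (INCR plus EXPIRE per key and window). A minimal in-memory sketch of that pattern, with hypothetical limit and window values:

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """In-memory stand-in for the Redis INCR/EXPIRE rate-limiting pattern."""

    def __init__(self, limit: int, window_s: int = 60):
        self.limit = limit          # max requests per key per window
        self.window_s = window_s    # window length in seconds
        self.counts = defaultdict(int)

    def allow(self, api_key: str, now=None) -> bool:
        now = time.time() if now is None else now
        # Requests in the same window share a counter; a new window resets it,
        # just as EXPIRE would evict the Redis key.
        window = int(now // self.window_s)
        self.counts[(api_key, window)] += 1
        return self.counts[(api_key, window)] <= self.limit

limiter = FixedWindowLimiter(limit=100, window_s=60)
allowed = limiter.allow("hh_api_key_example")
```

In production the counter lives in Redis so every replica of a service enforces the same budget; the in-process dict here is only for illustration.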

Event Processing Pipeline

The ingestion pipeline is designed for high throughput, low latency, and at-least-once delivery — failed writes are retried and dead-lettered rather than silently dropped:
  1. Ingestion — the SDK sends traces to the Ingestion Service via OTLP-compatible HTTP or gRPC. The service validates API keys, normalizes incoming events, and publishes to encrypted NATS streams. Receipt is acknowledged immediately to minimize client latency.
  2. Writing and enrichment — the Writer Service pulls events from the CP NATS stream in batches. It enriches events in real time (session linking, metadata inheritance, computed fields) and writes them to ClickHouse. Failed batches are retried with exponential backoff; persistently failing events are sent to a dead letter queue on S3.
  3. Evaluation — the Evaluation Service consumes from the DP NATS stream and executes configured evaluators. Python metrics run in the sandboxed Python Metric Service. LLM-based evaluators route through the LLM Proxy. Scores are published to the CP NATS stream, where the Writer Service persists them to ClickHouse.
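The enrichment in step 2 can be sketched as a pure function over a batch. This is illustrative only: the field names (`session_id`, `start_ts`, `end_ts`, `metadata`) are hypothetical stand-ins for the real event schema.

```python
def enrich(events: list, sessions: dict) -> list:
    """Writer-style enrichment: link each event to its session,
    inherit session metadata, and add a computed field."""
    out = []
    for ev in events:
        session = sessions.get(ev.get("session_id"), {})
        enriched = {
            **ev,
            # Metadata inheritance: session-level keys fill gaps,
            # but event-level keys always win.
            "metadata": {**session.get("metadata", {}),
                         **ev.get("metadata", {})},
            # Computed field: duration derived from span timestamps.
            "duration_ms": ev["end_ts"] - ev["start_ts"],
        }
        out.append(enriched)
    return out

sessions = {"s1": {"metadata": {"env": "prod", "user": "u1"}}}
events = [{"session_id": "s1", "start_ts": 100, "end_ts": 250,
           "metadata": {"user": "u2"}}]
batch = enrich(events, sessions)
```

Doing this at write time means ClickHouse rows arrive fully denormalized, so analytical queries never need a join back to session records.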

Message Queues

HoneyHive uses NATS with JetStream for durable, at-least-once message delivery:
| Stream | Subjects | Purpose |
| --- | --- | --- |
| `events-stream` (CP NATS) | `events.>` | Trace and span events for the Writer Service |
| `notifications-stream` (CP NATS) | `notifications.>` | Alert notifications for the Notification Service |
| `evaluation-stream` (DP NATS) | `evaluation.>` | Evaluation tasks for the Evaluation Service |
In production, the Control Plane and Data Plane run separate NATS clusters. The CP NATS cluster uses TLS for external communication. The DP NATS cluster runs internally with no external access.
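The `>` in subjects like `events.>` is NATS's trailing wildcard: it matches one or more dot-separated tokens, while `*` matches exactly one. A small sketch of those matching rules, for reading the table above:

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """NATS subject matching: '*' matches exactly one token,
    '>' matches one or more trailing tokens."""
    p_toks, s_toks = pattern.split("."), subject.split(".")
    for i, tok in enumerate(p_toks):
        if tok == ">":
            return len(s_toks) > i  # '>' must consume at least one token
        if i >= len(s_toks):
            return False
        if tok != "*" and tok != s_toks[i]:
            return False
    return len(p_toks) == len(s_toks)

# 'events.>' catches every event subject, however deeply nested...
subject_matches("events.>", "events.session.abc123")  # matches
# ...but not the bare root token.
subject_matches("events.>", "events")                 # does not match
```

So a consumer bound to `events.>` receives every event regardless of how publishers subdivide the subject space beneath `events`.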

ClickHouse Schema

ClickHouse stores data in four primary tables:
| Table | Engine | Purpose |
| --- | --- | --- |
| `events` | ReplacingMergeTree | Traces, spans, and event data (24 columns) |
| `session_aggregates` | AggregatingMergeTree | Pre-computed session-level aggregations |
| `event_schemas` | AggregatingMergeTree | Schema tracking for event structure discovery |
| `project_event_details` | AggregatingMergeTree | Per-project event metadata and statistics |
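ReplacingMergeTree makes re-ingested or updated events safe under at-least-once delivery: during background merges, ClickHouse keeps only the latest version of each row per sorting key (until a merge runs, reads may still see duplicates unless queried with `FINAL`). A sketch of that dedup semantics, with hypothetical column names `event_id` and `updated_at`:

```python
def replacing_merge(rows: list, key: str = "event_id",
                    version: str = "updated_at") -> list:
    """Sketch of ReplacingMergeTree merge semantics: for rows sharing
    the same key, keep only the one with the highest version."""
    latest = {}
    for row in rows:
        k = row[key]
        # Later or equal versions replace earlier ones, as in a merge pass.
        if k not in latest or row[version] >= latest[k][version]:
            latest[k] = row
    return list(latest.values())

rows = [
    {"event_id": "e1", "updated_at": 1, "status": "pending"},
    {"event_id": "e1", "updated_at": 2, "status": "done"},  # redelivered update
    {"event_id": "e2", "updated_at": 1, "status": "ok"},
]
deduped = replacing_merge(rows)
```

This is why the pipeline can acknowledge and retry freely: a duplicate write converges to a single row rather than corrupting counts.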

Hosting Models

The federated architecture enables three hosting options. In all models, the Data Plane’s databases are fully separate from the Control Plane.
| Model | Control Plane | Data Plane | Best for |
| --- | --- | --- | --- |
| Multi-Tenant SaaS | Shared | Shared (AWS US-West-2) | Getting started, teams without strict data residency requirements |
| Dedicated Cloud | Shared (managed by HoneyHive) | Dedicated (your AWS region) | Regulated enterprises needing data residency or private networking |
| Self-Hosted | Your environment | Your environment | Organizations requiring complete infrastructure control |
Moving from Multi-Tenant SaaS to Dedicated Cloud or Self-Hosted increases physical isolation without changing how you use the platform — your SDK integration, dashboards, and workflows stay the same.

Data Residency

You control where your AI application data is stored:
| Requirement | Solution |
| --- | --- |
| US data residency | Multi-Tenant SaaS (AWS US-West-2) |
| EU data residency (GDPR) | Dedicated Cloud in an EU AWS region |
| Custom region | Dedicated Cloud in any AWS region worldwide |
| Full control | Self-Hosted on AWS, GCP, Azure, or on-premise |
For Dedicated Cloud and Self-Hosted customers, HoneyHive supports private connectivity via AWS PrivateLink and VPC Peering so trace data never traverses the public internet.

Reliability & Performance

High Availability

  • Multi-AZ deployment — services and databases distributed across multiple availability zones
  • Automatic failover — database and compute resources fail over automatically on failure
  • NATS clustering — 3-replica NATS clusters with JetStream for durable message delivery
  • Health checks — continuous monitoring with automatic recovery
  • Zero-downtime deployments — rolling updates ensure no interruption during platform upgrades

Scalability

  • Horizontal auto-scaling — Kubernetes HPA scales pods based on CPU and memory utilization
  • Independent scaling — Control Plane and Data Plane scale independently based on their respective workloads
  • Queue-based buffering — NATS decouples ingestion from processing, absorbing traffic spikes with at-least-once delivery guarantees
  • Batch processing — the Writer Service buffers and batches writes to ClickHouse for optimal throughput

Security

For encryption, network security, infrastructure details, and compliance certifications, see Security.