# Platform Architecture

> How HoneyHive's Control Plane and Data Plane architecture works.

HoneyHive separates the **Control Plane** from the **Data Plane** so your application data (traces, evaluations, datasets) never touches the Control Plane infrastructure. This federated architecture is the foundation of HoneyHive's security model and determines where your data lives.

<CardGroup cols={3}>
  <Card title="Your data stays isolated" icon="lock">
    Trace and evaluation data is stored in the Data Plane, which has no shared database or credentials with the Control Plane.
  </Card>

  <Card title="You choose where it lives" icon="globe">
    Deploy the Data Plane in any AWS region, in your own cloud account, or on-premises. See [Hosting Models](#hosting-models).
  </Card>

  <Card title="Nothing changes when you scale" icon="arrow-up-right">
    Move from shared to dedicated infrastructure without changing your SDK integration or workflows.
  </Card>
</CardGroup>

## How It Works

HoneyHive runs as two independent planes:

* **Control Plane** -- handles authentication (SSO, SAML 2.0, email/password, MFA), role-based access control, and organization/workspace/project configuration. Stores organizational metadata in PostgreSQL. Has no access to your trace data.
* **Data Plane** -- handles trace ingestion, event enrichment, evaluation jobs, and LLM proxy. Operates on its own databases and message queues. Verifies access using short-lived, cryptographically signed tokens issued by the Control Plane.

```mermaid theme={null}
graph LR
  subgraph users [" "]
    direction TB
    User["👤 User / Browser"]
    SDK["⚙️ SDK / API"]
  end

  subgraph cp ["Control Plane"]
    direction TB
    CPAuth["Auth & RBAC"]
    CPConfig["Org / Workspace / Project Config"]
    CPWrite["Event Writing & Enrichment"]
    CPStore[("Metadata DB &nbsp;&nbsp; Event Store")]
  end

  subgraph dp ["Data Plane"]
    direction TB
    DPIngest["Trace Ingestion"]
    DPEval["Evaluation Engine"]
    DPLLM["LLM Proxy"]
    DPStore[("Datasets &nbsp;&nbsp; Object Storage")]
  end

  User -->|"login, dashboards"| cp
  SDK -->|"traces, spans"| DPIngest
  DPIngest -->|"events"| CPWrite
  DPEval -->|"scores"| CPWrite
  cp -.-|"signed tokens (JWKS) · gRPC sync"| dp
```

<Accordion title="Detailed service architecture">
  ```mermaid theme={null}
  graph TB
    subgraph cp ["Control Plane"]
      CPBackend["Backend API"]
      CPFrontend["Web UI"]
      CPController["Controller"]
      CPWriter["Writer Service"]
      CPNotify["Notification Service"]
      CPDB[("PostgreSQL (cpdb)")]
      CPNATS["NATS (external, TLS)"]
      CH[("ClickHouse")]
      CPRedis["Redis"]
      CPBackend --> CPDB
      CPController --> CPDB
      CPController --> CPNATS
      CPWriter --> CPNATS
      CPWriter --> CH
      CPBackend --> CH
    end

    subgraph dp ["Data Plane"]
      DPIngest["Ingestion Service"]
      DPBackend["Backend API"]
      DPController["Controller"]
      DPEval["Evaluation Service"]
      DPLLM["LLM Proxy"]
      DPMetric["Python Metric Service"]
      DPDB[("PostgreSQL (dpdb)")]
      DPNATS["NATS (internal)"]
      DPRedis["Redis"]
      S3[("Object Storage")]
      DPIngest --> DPNATS
      DPEval --> DPNATS
      DPBackend --> DPDB
      DPController --> DPDB
    end

    User["User / Browser"] -->|"login, config"| CPFrontend
    CPFrontend --> CPBackend
    SDK["SDK / API"] -->|"traces, events"| DPIngest
    DPIngest -->|"events"| CPNATS
    cp -->|"signed tokens (JWKS)"| dp
    CPController <-->|"gRPC stream"| DPController
  ```
</Accordion>

## Control Plane Services

The Control Plane manages authentication, authorization, and platform configuration. It has no access to your trace or evaluation data.

| Service                  | What it does                                                                                                                                                                                                                                                                     |
| ------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Backend API**          | REST API for authentication, RBAC, organization/workspace/project management, prompt templates, and alert configuration. Exposes a JWKS endpoint for Data Plane token verification.                                                                                              |
| **Web UI**               | Next.js web application for all platform features. Communicates with both Control Plane and Data Plane APIs.                                                                                                                                                                     |
| **Controller**           | Orchestrates Control Plane and Data Plane coordination. Manages Data Plane lifecycle, stream routing, and identity bootstrap (ECDSA keypairs for cluster JWTs). Communicates with Data Plane Controller via bidirectional gRPC stream.                                           |
| **Writer Service**       | Consumes events from the NATS queue and writes them to ClickHouse. Handles buffering, batching, and real-time enrichment (session linking, metadata inheritance, computed fields). Includes retry logic with exponential backoff and a dead letter queue (S3) for failed writes. |
| **Notification Service** | Processes alert notifications and delivers them via email (SES), Slack, or webhooks. Supports scope-based routing and severity stages (critical, warning, resolution).                                                                                                           |

### Control Plane Data Stores

| Store                   | What it holds                                                                                                                                                                                                                 |
| ----------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **PostgreSQL (cpdb)**   | User accounts, roles and permissions, organization/workspace/project hierarchy, prompt templates, evaluator configurations, alert definitions.                                                                                |
| **ClickHouse**          | Traces, spans, evaluation scores, session aggregates, event schemas. High-performance columnar database optimized for high-volume writes and analytical queries. Data encrypted at rest with configurable retention policies. |
| **Object Storage (S3)** | Dead letter queue for the Writer Service (failed write batches).                                                                                                                                                              |
| **Redis**               | Session cache, rate limiting, and ephemeral state.                                                                                                                                                                            |

## Data Plane Services

The Data Plane processes and stores all application data. It verifies access using JWT tokens issued by the Control Plane via a JWKS endpoint -- the two planes share no database or credentials.
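
As a rough illustration of the token flow described above, the sketch below shows two checks a verifier typically runs before validating a signature: selecting the signing key from a JWKS document by `kid`, and rejecting expired tokens. The key ID, claim values, and the `precheck_token` helper are hypothetical and not taken from HoneyHive's implementation. Signature verification itself is deliberately omitted and would be handled by a JWT library.

```python
import base64
import json


def b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the padding that JWTs strip."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def b64url_encode(raw: bytes) -> str:
    """Encode bytes as unpadded base64url, as used in JWT segments."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def precheck_token(token: str, jwks: dict, now: float) -> dict:
    """Pick the signing key by `kid` and reject expired tokens.

    Assumption: this is only the pre-check. The signature is NOT verified
    here; a JWT library would do that against the returned key.
    """
    header_b64, payload_b64, _signature_b64 = token.split(".")
    header = json.loads(b64url_decode(header_b64))
    claims = json.loads(b64url_decode(payload_b64))

    keys_by_id = {key["kid"]: key for key in jwks["keys"]}
    key = keys_by_id.get(header.get("kid"))
    if key is None:
        raise ValueError("unknown signing key")
    if claims["exp"] <= now:
        raise ValueError("token expired")
    return {"key": key, "claims": claims}


# Hypothetical key set and token, for illustration only.
jwks = {"keys": [{"kid": "cp-key-1", "kty": "EC", "crv": "P-256"}]}
header = b64url_encode(json.dumps({"alg": "ES256", "kid": "cp-key-1"}).encode())
payload = b64url_encode(json.dumps({"sub": "user-123", "exp": 1_700_000_300}).encode())
token = f"{header}.{payload}.signature-placeholder"

result = precheck_token(token, jwks, now=1_700_000_000)
print(result["claims"]["sub"])
```

Because the verifier only needs the public key set, the two planes can validate tokens without sharing a database or credentials.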

| Service                   | What it does                                                                                                                                                                                                                                                                                           |
| ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **Ingestion Service**     | Receives traces and spans from the HoneyHive SDK via OTLP-compatible HTTP and gRPC endpoints. Validates API keys, normalizes events, and publishes to NATS for downstream processing. Acknowledges receipt immediately to minimize client latency.                                                     |
| **Backend API**           | REST API for Data Plane operations: datasets, datapoints, metrics, experiment runs, charts, provider secrets, and storage. Authenticates requests via JWT tokens or API keys.                                                                                                                          |
| **Controller**            | Manages Data Plane lifecycle and communicates with the Control Plane Controller via bidirectional gRPC stream. Reports health metrics and handles identity bootstrap.                                                                                                                                  |
| **Evaluation Service**    | Consumes events from the NATS queue and executes evaluators (Python, LLM-based, or custom). Publishes evaluation scores to the Control Plane event stream for persistence. Manages annotation queues and processes online evaluators configured for a project.                                         |
| **LLM Proxy**             | Routes LLM requests to AI providers via LiteLLM for Playground and LLM-based evaluators. Supports multiple providers (OpenAI, Anthropic, AWS Bedrock, Azure OpenAI, Google Vertex AI). Provider credentials are encrypted and scoped per workspace (see [Provider Keys](/v2/workspace/provider-keys)). |
| **Python Metric Service** | Executes user-defined Python metric code in a sandboxed environment with RestrictedPython. Supports common libraries (pandas, numpy, sklearn, jsonschema) with timeout protection and code size limits.                                                                                                |

### Data Plane Data Stores

| Store                   | What it holds                                                                                                                                                       |
| ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **PostgreSQL (dpdb)**   | Datasets, datapoints, metric definitions, metric versions, provider secrets (encrypted), chart configurations, experiment run metadata.                             |
| **Object Storage (S3)** | Trace data, large payloads, and long-term archival. Server-side encryption (SSE-KMS) with versioning for audit trails and lifecycle policies for cost optimization. |
| **Redis**               | Caching and rate limiting.                                                                                                                                          |

## Event Processing Pipeline

The ingestion pipeline is designed for high throughput, low latency, and zero data loss:

```mermaid theme={null}
graph LR
  SDK["SDK / API"] -->|"OTLP HTTP/gRPC"| Ingest["Ingestion Service"]
  Ingest -->|"validate, normalize"| CPNATS["CP NATS Queue"]
  Ingest -->|"evaluation events"| DPNATS["DP NATS Queue"]
  CPNATS --> Writer["Writer Service"]
  Writer -->|"batch write"| CH[("ClickHouse")]
  DPNATS --> Eval["Evaluation Service"]
  Eval -->|"scores"| CPNATS
  CH --> S3[("S3 Archival")]
```

1. **Ingestion** -- the SDK sends traces to the Ingestion Service via OTLP-compatible HTTP or gRPC. The service validates API keys, normalizes incoming events, and publishes to encrypted NATS streams. Receipt is acknowledged immediately to minimize client latency.

2. **Writing and enrichment** -- the Writer Service pulls events from the CP NATS stream in batches. It enriches events in real time (session linking, metadata inheritance, computed fields) and writes them to ClickHouse. Failed batches are retried with exponential backoff; persistently failing events are sent to a dead letter queue on S3.

3. **Evaluation** -- the Evaluation Service consumes from the DP NATS stream and executes configured evaluators. Python metrics run in the sandboxed Python Metric Service. LLM-based evaluators route through the LLM Proxy. Scores are published to the CP NATS stream, where the Writer Service persists them to ClickHouse.
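
The retry behavior in step 2 can be sketched as follows. This is a minimal illustration, not the Writer Service's actual code: the function name, the attempt limit, and the base delay are assumptions, and the `write`, `dead_letter`, and `sleep` callables stand in for ClickHouse, the S3 dead letter queue, and a real timer.

```python
from typing import Callable, List


def write_with_retry(
    batch: List[dict],
    write: Callable[[List[dict]], None],
    dead_letter: Callable[[List[dict]], None],
    sleep: Callable[[float], None],
    max_attempts: int = 4,    # assumed limit, for illustration
    base_delay: float = 0.5,  # assumed base delay in seconds
) -> bool:
    """Write a batch, backing off exponentially; dead-letter it if all attempts fail."""
    for attempt in range(max_attempts):
        try:
            write(batch)
            return True
        except Exception:
            if attempt == max_attempts - 1:
                break
            # The delay doubles after each failed attempt: 0.5s, 1s, 2s, ...
            sleep(base_delay * (2 ** attempt))
    dead_letter(batch)
    return False


# Usage with stand-ins: a writer that fails twice, then succeeds.
delays: List[float] = []
dead_lettered: List[List[dict]] = []
failures_left = [2]


def flaky_write(batch: List[dict]) -> None:
    if failures_left[0] > 0:
        failures_left[0] -= 1
        raise RuntimeError("simulated write failure")


ok = write_with_retry([{"event_id": "e1"}], flaky_write, dead_lettered.append, delays.append)
print(ok, delays, dead_lettered)
```

Passing the sleep function in as a parameter keeps the sketch testable without real waiting.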

### Message Queues

HoneyHive uses NATS with JetStream for durable, at-least-once message delivery:

| Stream                             | Subjects          | Purpose                                          |
| ---------------------------------- | ----------------- | ------------------------------------------------ |
| **events-stream** (CP NATS)        | `events.>`        | Trace and span events for the Writer Service     |
| **notifications-stream** (CP NATS) | `notifications.>` | Alert notifications for the Notification Service |
| **evaluation-stream** (DP NATS)    | `evaluation.>`    | Evaluation tasks for the Evaluation Service      |

In production, the Control Plane and Data Plane run separate NATS clusters. The CP NATS cluster uses TLS for external communication. The DP NATS cluster runs internally with no external access.
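
The subjects in the table use NATS wildcards: `>` matches one or more trailing tokens, and `*` matches exactly one token. The sketch below reimplements that matching rule in plain Python to show which subjects a stream such as `events.>` would capture. The concrete subject names (`events.project-1.span` and so on) are hypothetical examples, not HoneyHive's actual subject layout.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a NATS-style subject matches a pattern with `*` and `>`."""
    pattern_tokens = pattern.split(".")
    subject_tokens = subject.split(".")
    for index, token in enumerate(pattern_tokens):
        if token == ">":
            # `>` must be the last pattern token and needs at least one
            # remaining subject token to match.
            return index == len(pattern_tokens) - 1 and len(subject_tokens) > index
        if index >= len(subject_tokens):
            return False
        if token != "*" and token != subject_tokens[index]:
            return False
    return len(pattern_tokens) == len(subject_tokens)


# Hypothetical subjects, for illustration only.
print(subject_matches("events.>", "events.project-1.span"))
print(subject_matches("events.>", "evaluation.project-1"))
```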

### ClickHouse Schema

ClickHouse stores data in four primary tables:

| Table                       | Engine               | Purpose                                       |
| --------------------------- | -------------------- | --------------------------------------------- |
| **events**                  | ReplacingMergeTree   | Traces, spans, and event data (24 columns)    |
| **session\_aggregates**     | AggregatingMergeTree | Pre-computed session-level aggregations       |
| **event\_schemas**          | AggregatingMergeTree | Schema tracking for event structure discovery |
| **project\_event\_details** | AggregatingMergeTree | Per-project event metadata and statistics     |
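
ReplacingMergeTree collapses rows that share a sorting key, keeping the most recently inserted one when no version column is set. Combined with at-least-once delivery, this means a redelivered event can be absorbed rather than stored twice. The sketch below simulates that merge behavior in Python. The column names and the choice of `(project_id, event_id)` as the sorting key are assumptions made for illustration; the actual `events` schema is not shown in this page. ClickHouse applies this deduplication during background merges, so duplicates may be visible until a merge runs.

```python
from typing import Dict, Iterable, List, Tuple


def merge_replacing(rows: Iterable[dict], key_columns: Tuple[str, ...]) -> List[dict]:
    """Keep only the last-inserted row for each sorting key."""
    latest: Dict[tuple, dict] = {}
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        latest[key] = row  # a later insert replaces an earlier one
    return list(latest.values())


# Hypothetical rows: event e1 was delivered twice by the queue.
inserted = [
    {"project_id": "p1", "event_id": "e1", "duration_ms": 120},
    {"project_id": "p1", "event_id": "e2", "duration_ms": 80},
    {"project_id": "p1", "event_id": "e1", "duration_ms": 125},
]

merged = merge_replacing(inserted, ("project_id", "event_id"))
print(len(merged))
```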

## Hosting Models

The federated architecture enables three hosting options. In all models, the Data Plane's databases are fully separate from the Control Plane.

| Model                                  | Control Plane                 | Data Plane                  | Best for                                                           |
| -------------------------------------- | ----------------------------- | --------------------------- | ------------------------------------------------------------------ |
| [Multi-Tenant SaaS](/v2/setup/managed) | Shared                        | Shared (AWS US-West-2)      | Getting started, teams without strict data residency requirements  |
| [Dedicated Cloud](/v2/setup/dedicated) | Shared (managed by HoneyHive) | Dedicated (your AWS region) | Regulated enterprises needing data residency or private networking |
| [Self-Hosted](/v2/setup/self-hosted)   | Your environment              | Your environment            | Organizations requiring complete infrastructure control            |

<Info>
  Moving from Multi-Tenant SaaS to Dedicated Cloud or Self-Hosted increases physical isolation without changing how you use the platform -- your SDK integration, dashboards, and workflows stay the same.
</Info>

For self-hosted deployments, see [Infrastructure Requirements](/v2/setup/infrastructure-requirements) for supported dependency versions and required operators.

## Data Residency

You control where your AI application data is stored:

| Requirement              | Solution                                                               |
| ------------------------ | ---------------------------------------------------------------------- |
| US data residency        | [Multi-Tenant SaaS](/v2/setup/managed) (AWS US-West-2)                 |
| EU data residency (GDPR) | [Dedicated Cloud](/v2/setup/dedicated) in an EU AWS region             |
| Custom region            | [Dedicated Cloud](/v2/setup/dedicated) in any AWS region worldwide     |
| Full control             | [Self-Hosted](/v2/setup/self-hosted) on AWS, GCP, Azure, or on-premises |

For [Dedicated Cloud](/v2/setup/dedicated) and [Self-Hosted](/v2/setup/self-hosted) customers, HoneyHive supports private connectivity via AWS PrivateLink and VPC Peering so trace data never traverses the public internet. For detailed data flow diagrams, data classification, and retention controls in self-hosted deployments, see [Data Flow & Residency](/v2/setup/self-hosted/data-flow).

## Reliability & Performance

### High Availability

* **Multi-AZ deployment** -- services and databases distributed across multiple availability zones
* **Automatic failover** -- database and compute resources fail over automatically when a component fails
* **NATS clustering** -- 3-replica NATS clusters with JetStream for durable message delivery
* **Health checks** -- continuous monitoring with automatic recovery
* **Zero-downtime deployments** -- rolling updates ensure no interruption during platform upgrades

### Scalability

* **Horizontal auto-scaling** -- Kubernetes HPA scales pods based on CPU and memory utilization
* **Independent scaling** -- Control Plane and Data Plane scale independently based on their respective workloads
* **Queue-based buffering** -- NATS decouples ingestion from processing, absorbing traffic spikes with at-least-once delivery guarantees
* **Batch processing** -- the Writer Service buffers and batches writes to ClickHouse for optimal throughput
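
The buffering and batching idea in the last bullet can be sketched as a buffer that flushes when it is full or when its oldest event has waited too long. This is an illustrative sketch only: the `BatchBuffer` class, the size limit, and the age limit are assumptions, not the Writer Service's real parameters, and the `flush` callable stands in for a ClickHouse batch insert.

```python
from typing import Callable, List, Optional


class BatchBuffer:
    """Collect events and flush when the batch is full or too old."""

    def __init__(
        self,
        flush: Callable[[List[str]], None],
        clock: Callable[[], float],
        max_size: int = 3,             # assumed limit, for illustration
        max_age_seconds: float = 1.0,  # assumed limit, for illustration
    ) -> None:
        self._flush = flush
        self._clock = clock
        self._max_size = max_size
        self._max_age = max_age_seconds
        self._events: List[str] = []
        self._first_event_at: Optional[float] = None

    def add(self, event: str) -> None:
        if not self._events:
            self._first_event_at = self._clock()
        self._events.append(event)
        self._maybe_flush()

    def tick(self) -> None:
        """Call periodically so a partly filled batch still gets flushed."""
        self._maybe_flush()

    def _maybe_flush(self) -> None:
        if not self._events or self._first_event_at is None:
            return
        full = len(self._events) >= self._max_size
        stale = self._clock() - self._first_event_at >= self._max_age
        if full or stale:
            batch, self._events = self._events, []
            self._flush(batch)


# Usage with a fake clock so the example runs instantly.
now = [0.0]
flushed: List[List[str]] = []
buffer = BatchBuffer(flushed.append, lambda: now[0])

for event in ["a", "b", "c", "d"]:
    buffer.add(event)  # the third event fills the batch and triggers a flush
now[0] = 2.0
buffer.tick()          # "d" has waited longer than max_age_seconds
print(flushed)
```

Flushing on age as well as size bounds how long an event can sit in memory during quiet periods.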

## Security

For encryption, network security, infrastructure details, and compliance certifications, see [Security](/v2/setup/security).
