This document describes the architecture for HoneyHive hosted in AWS. Additional platform architecture documentation for Azure, GCP, or on-prem implementations, or for our Enterprise+ Federated version (deployed in your own cloud environment with physically separated data planes), is available upon request. Please contact your account executive or email support@honeyhive.ai for more information.
Overview
HoneyHive is a production-grade AI observability and evaluation platform built on enterprise-class infrastructure. Our architecture is designed to meet the stringent requirements of enterprise customers, including security, compliance, scalability, and reliability. The platform consists of three core components:
- Log Ingestion & Enrichment Pipeline - Real-time event processing with zero data loss
- Evaluation & Analysis Engine - Asynchronous job processing for offline evaluations
- Web Application & API - User-facing interfaces and programmatic access
Network Architecture

Infrastructure Overview
Our multi-tenant SaaS platform is hosted entirely within the AWS us-west-2 region. Our dedicated SaaS version can be hosted in any AWS region worldwide.
Edge & Network Layer
- Amazon Route 53: Global DNS routing with health checks and failover capabilities
- AWS Certificate Manager (ACM): Automated SSL/TLS certificate management for encrypted connections
- Application Load Balancer (ALB): Distributes incoming traffic across availability zones with automatic scaling
- VPC Architecture: Isolated Virtual Private Cloud with segregated public and private subnets across multiple availability zones
Security & Access Control
- AWS IAM Roles for Service Accounts (IRSA): Fine-grained permission management for Kubernetes pods without shared credentials
- AWS Secrets Manager: Centralized secrets management with automatic rotation
- AWS KMS: Customer-managed encryption keys for data-at-rest encryption
- AWS Firewall Manager: Centralized firewall rule management and DDoS protection
- NAT Gateway: Secure outbound internet access for private subnet resources
- VPC Internet Gateway: Controlled ingress for public-facing services
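To picture how the IRSA and Secrets Manager items above look from a service's point of view, the sketch below shows a pod fetching a database credential with boto3, relying on IRSA for temporary AWS credentials. The secret name and fields are hypothetical illustrations, not HoneyHive's actual configuration.

```python
import json

import boto3

# boto3 resolves AWS credentials automatically; inside EKS this happens via the
# pod's IRSA-provided web identity token, so no static keys are baked into the image.
secrets = boto3.client("secretsmanager", region_name="us-west-2")

# "prod/backend/postgres" is a hypothetical secret name used for illustration.
response = secrets.get_secret_value(SecretId="prod/backend/postgres")
db_config = json.loads(response["SecretString"])

print(db_config["host"], db_config["username"])
```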
Compute & Orchestration
- Amazon EKS (Elastic Kubernetes Service): Managed Kubernetes cluster with automatic updates and patches
  - Multi-AZ deployment for high availability
  - Auto-scaling groups for dynamic capacity management
  - Pod security policies and network policies enforced
  - AWS EKS cluster runs in private subnets with no direct internet exposure
Data Storage & Processing
- PostgreSQL on Amazon RDS:
  - Metadata storage for projects, configurations, and user management
  - Multi-AZ deployment with automatic failover
  - Encrypted at rest using AWS KMS
  - Automated backups with point-in-time recovery
  - Read replicas for performance optimization
- ClickHouse Instance:
  - High-performance columnar database for event storage
  - Customer events are encrypted at rest
  - Optimized for analytical queries on large datasets
  - Data retention policies configurable per customer
- Amazon S3:
  - Long-term log storage and archival
  - Server-side encryption (SSE-KMS)
  - Versioning enabled for audit trails
  - Lifecycle policies for cost optimization
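As a concrete sketch of the S3 controls listed above, the boto3 snippet below writes an archived log object under SSE-KMS and applies a lifecycle rule. The bucket name, KMS alias, object key, and retention windows are illustrative assumptions, not HoneyHive's actual settings.

```python
import boto3

s3 = boto3.client("s3")

# Archive a log batch with server-side encryption under a customer-managed KMS key.
s3.put_object(
    Bucket="honeyhive-log-archive",                       # hypothetical bucket
    Key="tenant-1234/2024/06/events-000001.jsonl.gz",     # hypothetical key layout
    Body=open("events-000001.jsonl.gz", "rb"),
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/honeyhive-archive",                 # hypothetical KMS alias
)

# A lifecycle rule of the kind described above: move archives to Glacier after
# 90 days and expire them once the retention window (here, 365 days) has passed.
s3.put_bucket_lifecycle_configuration(
    Bucket="honeyhive-log-archive",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-and-expire",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```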
Monitoring & Observability
- Amazon CloudWatch: Real-time monitoring, logging, and alerting
- AWS CloudTrail: Comprehensive audit logging for all AWS API calls
- ArgoCD for GitOps: Infrastructure-as-code with automated deployments and rollbacks
Message Queue
- NATS: High-performance message queue for asynchronous processing
  - TLS encryption for data in transit
  - At-least-once delivery guarantees
  - Isolated queues per tenant for data segregation
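The snippet below is a minimal sketch of the at-least-once pattern described above, using the nats-py client with JetStream: a published message is persisted and redelivered until a consumer acknowledges it. The server URL, stream, per-tenant subject naming, and durable consumer name are assumptions for illustration only.

```python
import asyncio

import nats


async def main():
    # TLS connection; the server address is a hypothetical internal endpoint.
    nc = await nats.connect("tls://nats.internal:4222")
    js = nc.jetstream()

    # Ensure a stream exists that captures per-tenant subjects (illustrative names).
    await js.add_stream(name="EVENTS", subjects=["events.*"])

    # Publish: JetStream persists the message until it is acknowledged downstream.
    await js.publish("events.tenant_1234", b'{"event_type": "model", "duration_ms": 420}')

    # Durable subscription; calling msg.ack() explicitly gives at-least-once
    # semantics, since unacked messages are redelivered.
    sub = await js.subscribe("events.tenant_1234", durable="enrichment")
    msg = await sub.next_msg(timeout=5)
    # ... process / enrich the event here ...
    await msg.ack()

    await nc.close()


asyncio.run(main())
```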
Data Flow Architecture

Request Flow
- API Gateway Layer
  - User requests enter through API Gateway with authentication and rate limiting
  - Elastic Load Balancer distributes traffic across multiple availability zones
  - TLS 1.2+ encryption enforced for all connections
- Kubernetes Service Mesh
  - EKS Load Balancer (Kube-system Namespace): Internal load balancing within the cluster
  - VPC Deployment Runner (Control Plane Namespace): Orchestrates deployment and service discovery
- Backend Services (Backend Namespace)
  - Backend Service: Handles API requests, authentication, and authorization
  - Connects to PostgreSQL RDS for metadata operations (prompts, datasets, configurations)
  - Implements tenant isolation at the application layer
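Application-layer tenant isolation, as mentioned in the last step above, typically means that every metadata query is scoped by the tenant resolved from the authenticated request. The sketch below is a hypothetical illustration of that pattern using psycopg2-style access; the schema and helper names are not HoneyHive's actual code.

```python
from dataclasses import dataclass


@dataclass
class AuthContext:
    # Populated after the API key / JWT is verified at the gateway layer.
    user_id: str
    tenant_id: str


def list_projects(conn, ctx: AuthContext):
    # Every metadata query is filtered by the caller's tenant_id, so one
    # tenant can never read another tenant's rows.
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, name FROM projects WHERE tenant_id = %s ORDER BY name",
            (ctx.tenant_id,),
        )
        return cur.fetchall()
```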
Event Processing Pipeline
The event processing pipeline is designed for high throughput, low latency, and zero data loss:
- Ingestion Service
  - Receives events from customer applications via SDK or API
  - Validates and normalizes incoming events
  - Publishes to NATS Encrypted Queue for downstream processing
  - Immediately acknowledges receipt to minimize client latency
- ClickHouse Data Layer (Data Layer Namespace)
  - Stores encrypted customer events with tenant isolation
  - Optimized for high-volume writes and analytical queries
  - Data encrypted at rest with customer-managed keys
- Enrichment Service
  - Consumes events from NATS queue
  - Performs real-time enrichment (session inheritance, metric calculations)
  - Updates event records with computed metadata
  - Triggers online evaluators if configured
- Evaluation Service
  - Processes offline evaluation jobs
  - Consumes from NATS Encrypted Queue
  - Executes customer-defined evaluators (Python, LLM-based, or custom)
  - Stores evaluation results back to ClickHouse
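For the evaluator step above, a customer-defined Python evaluator is conceptually a function that receives an event and returns scores to be written back alongside it. The signature, event shape, and return format below are illustrative assumptions, not the platform's actual evaluator contract.

```python
def response_length_evaluator(event: dict) -> dict:
    """Illustrative evaluator: scores an LLM completion on length and refusal.

    The event shape and return format are hypothetical; the platform defines
    its own evaluator interface.
    """
    completion = event.get("outputs", {}).get("completion", "")
    refused = completion.strip().lower().startswith("i cannot")
    return {
        "length_chars": len(completion),
        "refused": refused,
        "passed": not refused and len(completion) > 0,
    }
```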
Data Storage
- PostgreSQL RDS: Stores metadata including:
  - User accounts and permissions
  - Project configurations
  - Prompt templates and versions
  - Dataset definitions
  - Evaluator configurations
- ClickHouse: Stores telemetry data as wide events, including:
  - Traces and spans
  - Event logs
  - Evaluation scores
  - Aggregated metrics
  - Metadata and custom properties
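Because events are stored as wide rows in ClickHouse, typical analysis is a single aggregating query over the events table. The sketch below uses the clickhouse-connect client; the host, table, and column names are hypothetical stand-ins for the schema described above.

```python
import clickhouse_connect

# Host, table, and column names are illustrative assumptions.
client = clickhouse_connect.get_client(host="clickhouse.internal", secure=True)

result = client.query(
    """
    SELECT session_id,
           count() AS spans,
           avg(duration_ms) AS avg_duration_ms
    FROM events
    WHERE tenant_id = %(tenant_id)s
      AND event_time >= now() - INTERVAL 7 DAY
    GROUP BY session_id
    ORDER BY avg_duration_ms DESC
    LIMIT 20
    """,
    parameters={"tenant_id": "tenant-1234"},
)

for row in result.result_rows:
    print(row)
```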
Security & Compliance
Data Encryption
- At Rest: All data encrypted using AWS KMS with customer-managed keys
- In Transit: TLS 1.2+ for all network communications
- Application Layer: Additional encryption for sensitive customer data
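The application-layer encryption mentioned above is commonly implemented as envelope encryption: a per-record data key is generated under a KMS customer-managed key, the sensitive field is encrypted locally, and the wrapped key is stored alongside the ciphertext. The sketch below shows that general pattern with boto3 and the cryptography library; the key alias and payload are hypothetical, and this is an illustration of the pattern, not HoneyHive's exact implementation.

```python
import base64

import boto3
from cryptography.fernet import Fernet

kms = boto3.client("kms")

# Generate a per-record data key under a customer-managed CMK (alias is hypothetical).
data_key = kms.generate_data_key(KeyId="alias/honeyhive-app-data", KeySpec="AES_256")

# Encrypt the sensitive field locally; persist the ciphertext plus the wrapped key.
f = Fernet(base64.urlsafe_b64encode(data_key["Plaintext"]))
ciphertext = f.encrypt(b"user prompt containing sensitive data")
wrapped_key = data_key["CiphertextBlob"]  # stored alongside the ciphertext

# To decrypt later: kms.decrypt(CiphertextBlob=wrapped_key) recovers the data key.
```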
Network Security
- VPC Isolation: Customer VPC with private subnets for all data processing
- Security Groups: Strict ingress/egress rules limiting access to required ports only
- Network Policies: Kubernetes network policies enforce pod-to-pod communication restrictions
- AWS PrivateLink: Available for dedicated SaaS customers to establish private connectivity between their VPC and HoneyHive’s services without exposing traffic to the public internet
- External Secrets Store: Separates secrets from application code
Access Control
- IAM Roles: Service accounts use temporary credentials via IRSA
- RBAC: Kubernetes Role-Based Access Control for service permissions
- Least Privilege: Each service has minimal required permissions
- Multi-Factor Authentication: Available for all user accounts
Compliance
- SOC 2 Type II: Audited annually
- GDPR: Data residency and privacy controls
- HIPAA: Available for healthcare customers
Reliability & Performance
High Availability
- Multi-AZ Deployment: Services distributed across multiple availability zones
- Automatic Failover: Database and compute resources automatically fail over on failure
- Health Checks: Continuous monitoring with automatic recovery
- Zero-Downtime Deployments: Rolling updates with canary deployments
Scalability
- Horizontal Auto-Scaling: Kubernetes HPA scales pods based on CPU/memory utilization
- Vertical Scaling: Database and storage scale independently
- Queue-Based Architecture: NATS queue buffers traffic spikes

