This document describes the architecture for HoneyHive hosted in AWS. Additional platform architecture documentation for Azure, GCP, or on-prem implementations, or for our Enterprise+ Federated version (deployed in your own cloud environment with physically separated data planes), is available upon request. Please contact your account executive or email support@honeyhive.ai for more information.

Overview

HoneyHive is a production-grade AI observability and evaluation platform built on enterprise-class infrastructure. Our architecture is designed to meet the stringent requirements of enterprise customers, including security, compliance, scalability, and reliability. The platform consists of three core components:
  1. Log Ingestion & Enrichment Pipeline - Real-time event processing with zero data loss
  2. Evaluation & Analysis Engine - Asynchronous job processing for offline evaluations
  3. Web Application & API - User-facing interfaces and programmatic access

Network Architecture

Infrastructure Overview

Our multi-tenant SaaS platform is hosted entirely within AWS US-West-2. Our dedicated SaaS version can be hosted in any AWS region worldwide.

Edge & Network Layer

  • Amazon Route 53: Global DNS routing with health checks and failover capabilities
  • AWS Certificate Manager (ACM): Automated SSL/TLS certificate management for encrypted connections
  • Application Load Balancer (ALB): Distributes incoming traffic across availability zones with automatic scaling
  • VPC Architecture: Isolated Virtual Private Cloud with segregated public and private subnets across multiple availability zones

Security & Access Control

  • AWS IAM Roles for Service Accounts (IRSA): Fine-grained permission management for Kubernetes pods without shared credentials
  • AWS Secrets Manager: Centralized secrets management with automatic rotation
  • AWS KMS: Customer-managed encryption keys for data-at-rest encryption
  • AWS Firewall Manager: Centralized firewall rule management and DDoS protection
  • NAT Gateway: Secure outbound internet access for private subnet resources
  • VPC Internet Gateway: Controlled ingress for public-facing services

Compute & Orchestration

  • Amazon EKS (Elastic Kubernetes Service): Managed Kubernetes cluster with automatic updates and patches
    • Multi-AZ deployment for high availability
    • Auto-scaling groups for dynamic capacity management
    • Pod security policies and network policies enforced
    • The EKS cluster runs in private subnets with no direct internet exposure

Data Storage & Processing

  • PostgreSQL on Amazon RDS:
    • Metadata storage for projects, configurations, and user management
    • Multi-AZ deployment with automatic failover
    • Encrypted at rest using AWS KMS
    • Automated backups with point-in-time recovery
    • Read replicas for performance optimization
  • ClickHouse Instance:
    • High-performance columnar database for event storage
    • Customer events are encrypted at rest
    • Optimized for analytical queries on large datasets
    • Data retention policies configurable per customer
  • Amazon S3:
    • Long-term log storage and archival
    • Server-side encryption (SSE-KMS)
    • Versioning enabled for audit trails
    • Lifecycle policies for cost optimization

Monitoring & Observability

  • Amazon CloudWatch: Real-time monitoring, logging, and alerting
  • AWS CloudTrail: Comprehensive audit logging for all AWS API calls
  • ArgoCD for GitOps: Infrastructure-as-code with automated deployments and rollbacks

Message Queue

  • NATS: High-performance message queue for asynchronous processing
    • TLS encryption for data in transit
    • At-least-once delivery guarantees
    • Isolated queues per tenant for data segregation
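
At-least-once delivery means a message can arrive more than once, so consumers generally need to be idempotent. The following is a minimal sketch of that idea, assuming a per-tenant subject naming scheme and an event_id field on every message. It uses a toy in-memory queue rather than the NATS client API, and InMemoryQueue, tenant_subject, and IdempotentConsumer are hypothetical names chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class InMemoryQueue:
    """Toy stand-in for one tenant's queue; this is not the NATS client API."""
    pending: deque = field(default_factory=deque)

    def publish(self, message: dict) -> None:
        self.pending.append(message)

    def deliver(self, handler) -> None:
        """Deliver each pending message once; requeue it if the handler does not ack."""
        for _ in range(len(self.pending)):
            message = self.pending.popleft()
            acked = handler(message)
            if not acked:
                self.pending.append(message)  # redelivered on the next pass


def tenant_subject(tenant_id: str) -> str:
    # Hypothetical naming scheme: one subject (queue) per tenant.
    return f"events.{tenant_id}"


class IdempotentConsumer:
    """Processes each event id once, even if the queue redelivers it."""

    def __init__(self) -> None:
        self.seen_ids = set()
        self.processed = []
        self.fail_next = False  # simulates a crash after processing, before the ack

    def handle(self, message: dict) -> bool:
        if message["event_id"] not in self.seen_ids:
            self.seen_ids.add(message["event_id"])
            self.processed.append(message)
        if self.fail_next:
            self.fail_next = False
            return False  # work was done but the ack was lost, so the queue redelivers
        return True
```

If the acknowledgement is lost, the queue delivers the message a second time, and the consumer's record of seen ids keeps the event from being processed twice.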

Data Flow Architecture

Request Flow

  1. API Gateway Layer
    • User requests enter through API Gateway with authentication and rate limiting
    • Elastic Load Balancer distributes traffic across multiple availability zones
    • TLS 1.2+ encryption enforced for all connections
  2. Kubernetes Service Mesh
    • EKS Load Balancer (Kube-system Namespace): Internal load balancing within the cluster
    • VPC Deployment Runner (Control Plane Namespace): Orchestrates deployment and service discovery
  3. Backend Services (Backend Namespace)
    • Backend Service: Handles API requests, authentication, and authorization
    • Connects to PostgreSQL RDS for metadata operations (prompts, datasets, configurations)
    • Implements tenant isolation at the application layer
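
Application-layer tenant isolation usually means that every metadata query is scoped to the tenant derived from the caller's credentials, never from a value the caller supplies. The sketch below illustrates the pattern under stated assumptions: the API_KEYS mapping, the projects table, and the function names are hypothetical, and sqlite3 stands in for PostgreSQL so the example runs without a server.

```python
import sqlite3

# Hypothetical mapping from API key to tenant; a real service would verify
# credentials against its authentication system instead.
API_KEYS = {"key-acme": "acme", "key-globex": "globex"}


def authenticate(api_key: str) -> str:
    """Return the tenant id for an API key, or raise if the key is unknown."""
    try:
        return API_KEYS[api_key]
    except KeyError:
        raise PermissionError("unknown API key") from None


def list_projects(conn: sqlite3.Connection, api_key: str) -> list:
    """List project names visible to the caller's tenant only."""
    tenant_id = authenticate(api_key)
    rows = conn.execute(
        "SELECT name FROM projects WHERE tenant_id = ? ORDER BY name",
        (tenant_id,),
    ).fetchall()
    return [name for (name,) in rows]


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory database with rows for two tenants."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE projects (tenant_id TEXT, name TEXT)")
    conn.executemany(
        "INSERT INTO projects VALUES (?, ?)",
        [("acme", "chatbot"), ("acme", "search"), ("globex", "summarizer")],
    )
    return conn
```

Because the tenant id is bound as a query parameter after authentication, a request authenticated as one tenant cannot read another tenant's rows through this code path.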

Event Processing Pipeline

The event processing pipeline is designed for high throughput, low latency, and zero data loss:
  1. Ingestion Service
    • Receives events from customer applications via SDK or API
    • Validates and normalizes incoming events
    • Publishes to NATS Encrypted Queue for downstream processing
    • Immediately acknowledges receipt to minimize client latency
  2. ClickHouse Data Layer (Data Layer Namespace)
    • Stores encrypted customer events with tenant isolation
    • Optimized for high-volume writes and analytical queries
    • Data encrypted at rest with customer-managed keys
  3. Enrichment Service
    • Consumes events from NATS queue
    • Performs real-time enrichment (session inheritance, metric calculations)
    • Updates event records with computed metadata
    • Triggers online evaluators if configured
  4. Evaluation Service
    • Processes offline evaluation jobs
    • Consumes from NATS Encrypted Queue
    • Executes customer-defined evaluators (Python, LLM-based, or custom)
    • Stores evaluation results back to ClickHouse
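
The ingestion and enrichment stages above can be sketched as two decoupled functions joined by a queue. This is an illustrative sketch, not HoneyHive's implementation: the required fields, the duration_ms metric, and the reading of "session inheritance" as copying session-level properties onto each event are all assumptions, and Python's standard queue.Queue stands in for NATS.

```python
import queue

# Hypothetical minimal schema for an incoming event.
REQUIRED_FIELDS = ("event_id", "session_id", "start_ms", "end_ms")


def ingest(raw_event: dict, out_queue: queue.Queue) -> dict:
    """Validate, normalize, enqueue, and acknowledge without waiting for enrichment."""
    missing = [name for name in REQUIRED_FIELDS if name not in raw_event]
    if missing:
        return {"accepted": False, "error": f"missing fields: {missing}"}
    event = dict(raw_event)
    # Normalization step: lower-case the event type, defaulting to "generic".
    event["event_type"] = str(event.get("event_type", "generic")).lower()
    out_queue.put(event)
    return {"accepted": True, "event_id": event["event_id"]}


def enrich(event: dict, sessions: dict) -> dict:
    """Add computed metadata: a duration metric and fields inherited from the session."""
    enriched = dict(event)
    enriched["duration_ms"] = event["end_ms"] - event["start_ms"]
    # Assumed meaning of "session inheritance": copy session-level properties
    # onto the event unless the event already sets them.
    for key, value in sessions.get(event["session_id"], {}).items():
        enriched.setdefault(key, value)
    return enriched


def drain(in_queue: queue.Queue, sessions: dict) -> list:
    """Consume every queued event and return the enriched records."""
    results = []
    while not in_queue.empty():
        results.append(enrich(in_queue.get(), sessions))
    return results
```

The client receives its acknowledgement as soon as the event is queued, so enrichment latency does not add to request latency.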

Data Storage

  • PostgreSQL RDS: Stores metadata including:
    • User accounts and permissions
    • Project configurations
    • Prompt templates and versions
    • Dataset definitions
    • Evaluator configurations
  • ClickHouse: Stores telemetry data as wide events, including:
    • Traces and spans
    • Event logs
    • Evaluation scores
    • Aggregated metrics
    • Metadata and custom properties
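
A wide event keeps everything known about one event in a single row with many columns, which suits a columnar store. As a rough illustration, the sketch below flattens a nested event into dotted column names. The event fields and the flatten helper are hypothetical and do not reflect HoneyHive's actual schema.

```python
def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dotted column names, giving one wide row per event."""
    row = {}
    for key, value in record.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=f"{column}."))
        else:
            row[column] = value
    return row


# Hypothetical event; field names are illustrative only.
event = {
    "event_id": "e1",
    "trace_id": "t1",
    "event_type": "model",
    "metrics": {"latency_ms": 250, "relevance": 0.9},
    "metadata": {"model": "example-model", "user": {"plan": "pro"}},
}
wide_row = flatten(event)
```

Here wide_row contains columns such as metrics.latency_ms and metadata.user.plan alongside the top-level identifiers, so scores, metrics, and custom properties can be filtered and aggregated without joins.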

Security & Compliance

Data Encryption

  • At Rest: All data encrypted using AWS KMS with customer-managed keys
  • In Transit: TLS 1.2+ for all network communications
  • Application Layer: Additional encryption for sensitive customer data

Network Security

  • VPC Isolation: Customer VPC with private subnets for all data processing
  • Security Groups: Strict ingress/egress rules limiting access to required ports only
  • Network Policies: Kubernetes network policies enforce pod-to-pod communication restrictions
  • AWS PrivateLink: Available for dedicated SaaS customers to establish private connectivity between your VPC and HoneyHive’s services without exposing traffic to the public internet
  • External Secrets Store: Separates secrets from application code

Access Control

  • IAM Roles: Service accounts use temporary credentials via IRSA
  • RBAC: Kubernetes Role-Based Access Control for service permissions
  • Least Privilege: Each service has minimal required permissions
  • Multi-Factor Authentication: Available for all user accounts

Compliance

  • SOC 2 Type II: Audited annually
  • GDPR: Data residency and privacy controls
  • HIPAA: Available for healthcare customers

Reliability & Performance

High Availability

  • Multi-AZ Deployment: Services distributed across multiple availability zones
  • Automatic Failover: Database and compute resources fail over automatically on failure
  • Health Checks: Continuous monitoring with automatic recovery
  • Zero-Downtime Deployments: Rolling updates with canary deployments

Scalability

  • Horizontal Auto-Scaling: Kubernetes HPA scales pods based on CPU/memory utilization
  • Vertical Scaling: Database and storage scale independently
  • Queue-Based Architecture: NATS queue buffers traffic spikes
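
For reference, the Kubernetes Horizontal Pod Autoscaler computes its target as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), and skips scaling when that ratio is within a tolerance of 1.0 (0.1 by default). The sketch below reproduces the calculation. The utilization figures in the usage note are illustrative assumptions, not HoneyHive's configured thresholds, and the sketch omits details such as stabilization windows and min/max replica bounds.

```python
import math


def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float, tolerance: float = 0.1) -> int:
    """Apply the HPA scaling formula with its no-op tolerance band."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: do not scale
    return math.ceil(current_replicas * ratio)
```

For example, 4 pods averaging 90% CPU against a 60% target scale to 6 pods, while 4 pods at 63% stay at 4 because the ratio of 1.05 falls inside the tolerance band.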