This document describes the architecture for HoneyHive hosted in AWS. Additional platform architecture documentation for Azure, GCP, or on-prem implementations, or for our Enterprise+ Federated version (deployed in your own cloud environment with physically separated data planes), is available upon request. Please contact your account executive or email support@honeyhive.ai for more information.
Overview
HoneyHive is a production-grade AI observability and evaluation platform built on enterprise-class infrastructure. Our architecture is designed to meet the stringent requirements of enterprise customers, including security, compliance, scalability, and reliability. The platform consists of three core components:
- Log Ingestion & Enrichment Pipeline - Real-time event processing with zero data loss
- Evaluation & Analysis Engine - Asynchronous job processing for offline evaluations
- Web Application & API - User-facing interfaces and programmatic access
Network Architecture

Infrastructure Overview
Our multi-tenant SaaS platform is hosted entirely within the AWS us-west-2 region. Our dedicated SaaS version can be hosted in any AWS region worldwide.
Edge & Network Layer
- Amazon Route 53: Global DNS routing with health checks and failover capabilities
- AWS Certificate Manager (ACM): Automated SSL/TLS certificate management for encrypted connections
- Application Load Balancer (ALB): Distributes incoming traffic across availability zones with automatic scaling
- VPC Architecture: Isolated Virtual Private Cloud with segregated public and private subnets across multiple availability zones
Security & Access Control
- AWS IAM Roles for Service Accounts (IRSA): Fine-grained permission management for Kubernetes pods without shared credentials
- AWS Secrets Manager: Centralized secrets management with automatic rotation
- AWS KMS: Customer-managed encryption keys for data-at-rest encryption
- AWS Firewall Manager: Centralized firewall rule management and DDoS protection
- NAT Gateway: Secure outbound internet access for private subnet resources
- VPC Internet Gateway: Controlled ingress for public-facing services
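To picture how the IRSA and Secrets Manager items above look from a service's point of view, the sketch below shows a pod fetching a database credential with boto3, relying on IRSA for temporary AWS credentials. The secret name and fields are hypothetical illustrations, not HoneyHive's actual configuration.

```python
import json

import boto3

# boto3 resolves AWS credentials automatically; inside EKS this happens via the
# pod's IRSA-provided web identity token, so no static keys are baked into the image.
secrets = boto3.client("secretsmanager", region_name="us-west-2")

# "prod/backend/postgres" is a hypothetical secret name used for illustration.
response = secrets.get_secret_value(SecretId="prod/backend/postgres")
db_config = json.loads(response["SecretString"])

print(db_config["host"], db_config["username"])
```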
Compute & Orchestration
- Amazon EKS (Elastic Kubernetes Service): Managed Kubernetes cluster with automatic updates and patches
  - Multi-AZ deployment for high availability
  - Auto-scaling groups for dynamic capacity management
  - Pod security policies and network policies enforced
  - AWS EKS cluster runs in private subnets with no direct internet exposure
Data Storage & Processing
- PostgreSQL on Amazon RDS:
  - Metadata storage for projects, configurations, and user management
  - Multi-AZ deployment with automatic failover
  - Encrypted at rest using AWS KMS
  - Automated backups with point-in-time recovery
  - Read replicas for performance optimization
- ClickHouse Instance:
  - High-performance columnar database for event storage
  - Customer events are encrypted at rest
  - Optimized for analytical queries on large datasets
  - Data retention policies configurable per customer
- Amazon S3:
  - Long-term log storage and archival
  - Server-side encryption (SSE-KMS)
  - Versioning enabled for audit trails
  - Lifecycle policies for cost optimization
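As a concrete sketch of the S3 controls listed above, the boto3 snippet below writes an archived log object under SSE-KMS and applies a lifecycle rule. The bucket name, KMS alias, object key, and retention windows are illustrative assumptions, not HoneyHive's actual settings.

```python
import boto3

s3 = boto3.client("s3")

# Archive a log batch with server-side encryption under a customer-managed KMS key.
s3.put_object(
    Bucket="honeyhive-log-archive",                       # hypothetical bucket
    Key="tenant-1234/2024/06/events-000001.jsonl.gz",     # hypothetical key layout
    Body=open("events-000001.jsonl.gz", "rb"),
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/honeyhive-archive",                 # hypothetical KMS alias
)

# A lifecycle rule of the kind described above: move archives to Glacier after
# 90 days and expire them once the retention window (here, 365 days) has passed.
s3.put_bucket_lifecycle_configuration(
    Bucket="honeyhive-log-archive",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-and-expire",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```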
Monitoring & Observability
- Amazon CloudWatch: Real-time monitoring, logging, and alerting
- AWS CloudTrail: Comprehensive audit logging for all AWS API calls
- ArgoCD for GitOps: Infrastructure-as-code with automated deployments and rollbacks
Message Queue
- NATS: High-performance message queue for asynchronous processing
  - TLS encryption for data in transit
  - At-least-once delivery guarantees
  - Isolated queues per tenant for data segregation
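The snippet below is a minimal sketch of the at-least-once pattern described above, using the nats-py client with JetStream: a published message is persisted and redelivered until a consumer acknowledges it. The server URL, stream, per-tenant subject naming, and durable consumer name are assumptions for illustration only.

```python
import asyncio

import nats


async def main():
    # TLS connection; the server address is a hypothetical internal endpoint.
    nc = await nats.connect("tls://nats.internal:4222")
    js = nc.jetstream()

    # Ensure a stream exists that captures per-tenant subjects (illustrative names).
    await js.add_stream(name="EVENTS", subjects=["events.*"])

    # Publish: JetStream persists the message until it is acknowledged downstream.
    await js.publish("events.tenant_1234", b'{"event_type": "model", "duration_ms": 420}')

    # Durable subscription; calling msg.ack() explicitly gives at-least-once
    # semantics, since unacked messages are redelivered.
    sub = await js.subscribe("events.tenant_1234", durable="enrichment")
    msg = await sub.next_msg(timeout=5)
    # ... process / enrich the event here ...
    await msg.ack()

    await nc.close()


asyncio.run(main())
```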
Data Flow Architecture

Request Flow
- API Gateway Layer
  - User requests enter through API Gateway with authentication and rate limiting
  - Elastic Load Balancer distributes traffic across multiple availability zones
  - TLS 1.2+ encryption enforced for all connections
- Kubernetes Service Mesh
  - EKS Load Balancer (Kube-system Namespace): Internal load balancing within the cluster
  - VPC Deployment Runner (Control Plane Namespace): Orchestrates deployment and service discovery
- Backend Services (Backend Namespace)
  - Backend Service: Handles API requests, authentication, and authorization
  - Connects to PostgreSQL RDS for metadata operations (prompts, datasets, configurations)
  - Implements tenant isolation at the application layer
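Application-layer tenant isolation, as mentioned in the last step above, typically means that every metadata query is scoped by the tenant resolved from the authenticated request. The sketch below is a hypothetical illustration of that pattern using psycopg2-style access; the schema and helper names are not HoneyHive's actual code.

```python
from dataclasses import dataclass


@dataclass
class AuthContext:
    # Populated after the API key / JWT is verified at the gateway layer.
    user_id: str
    tenant_id: str


def list_projects(conn, ctx: AuthContext):
    # Every metadata query is filtered by the caller's tenant_id, so one
    # tenant can never read another tenant's rows.
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, name FROM projects WHERE tenant_id = %s ORDER BY name",
            (ctx.tenant_id,),
        )
        return cur.fetchall()
```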
Event Processing Pipeline
The event processing pipeline is designed for high throughput, low latency, and zero data loss:
- Ingestion Service
  - Receives events from customer applications via SDK or API
  - Validates and normalizes incoming events
  - Publishes to NATS Encrypted Queue for downstream processing
  - Immediately acknowledges receipt to minimize client latency
- ClickHouse Data Layer (Data Layer Namespace)
  - Stores encrypted customer events with tenant isolation
  - Optimized for high-volume writes and analytical queries
  - Data encrypted at rest with customer-managed keys
- Enrichment Service
  - Consumes events from NATS queue
  - Performs real-time enrichment (session inheritance, metric calculations)
  - Updates event records with computed metadata
  - Triggers online evaluators if configured
- Evaluation Service
  - Processes offline evaluation jobs
  - Consumes from NATS Encrypted Queue
  - Executes customer-defined evaluators (Python, LLM-based, or custom)
  - Stores evaluation results back to ClickHouse
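For the evaluator step above, a customer-defined Python evaluator is conceptually a function that receives an event and returns scores to be written back alongside it. The signature, event shape, and return format below are illustrative assumptions, not the platform's actual evaluator contract.

```python
def response_length_evaluator(event: dict) -> dict:
    """Illustrative evaluator: scores an LLM completion on length and refusal.

    The event shape and return format are hypothetical; the platform defines
    its own evaluator interface.
    """
    completion = event.get("outputs", {}).get("completion", "")
    refused = completion.strip().lower().startswith("i cannot")
    return {
        "length_chars": len(completion),
        "refused": refused,
        "passed": not refused and len(completion) > 0,
    }
```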
Data Storage
- PostgreSQL RDS: Stores metadata including:
  - User accounts and permissions
  - Project configurations
  - Prompt templates and versions
  - Dataset definitions
  - Evaluator configurations
- ClickHouse: Stores telemetry data as wide events, including:
  - Traces and spans
  - Event logs
  - Evaluation scores
  - Aggregated metrics
  - Metadata and custom properties
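Because events are stored as wide rows in ClickHouse, typical analysis is a single aggregating query over the events table. The sketch below uses the clickhouse-connect client; the host, table, and column names are hypothetical stand-ins for the schema described above.

```python
import clickhouse_connect

# Host, table, and column names are illustrative assumptions.
client = clickhouse_connect.get_client(host="clickhouse.internal", secure=True)

result = client.query(
    """
    SELECT session_id,
           count() AS spans,
           avg(duration_ms) AS avg_duration_ms
    FROM events
    WHERE tenant_id = %(tenant_id)s
      AND event_time >= now() - INTERVAL 7 DAY
    GROUP BY session_id
    ORDER BY avg_duration_ms DESC
    LIMIT 20
    """,
    parameters={"tenant_id": "tenant-1234"},
)

for row in result.result_rows:
    print(row)
```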
Security & Compliance
Data Encryption
- At Rest: All data encrypted using AWS KMS with customer-managed keys
- In Transit: TLS 1.2+ for all network communications
- Application Layer: Additional encryption for sensitive customer data
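The application-layer encryption mentioned above is commonly implemented as envelope encryption: a per-record data key is generated under a KMS customer-managed key, the sensitive field is encrypted locally, and the wrapped key is stored alongside the ciphertext. The sketch below shows that general pattern with boto3 and the cryptography library; the key alias and payload are hypothetical, and this is an illustration of the pattern, not HoneyHive's exact implementation.

```python
import base64

import boto3
from cryptography.fernet import Fernet

kms = boto3.client("kms")

# Generate a per-record data key under a customer-managed CMK (alias is hypothetical).
data_key = kms.generate_data_key(KeyId="alias/honeyhive-app-data", KeySpec="AES_256")

# Encrypt the sensitive field locally; persist the ciphertext plus the wrapped key.
f = Fernet(base64.urlsafe_b64encode(data_key["Plaintext"]))
ciphertext = f.encrypt(b"user prompt containing sensitive data")
wrapped_key = data_key["CiphertextBlob"]  # stored alongside the ciphertext

# To decrypt later: kms.decrypt(CiphertextBlob=wrapped_key) recovers the data key.
```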
Network Security
- VPC Isolation: Customer VPC with private subnets for all data processing
- Security Groups: Strict ingress/egress rules limiting access to required ports only
- Network Policies: Kubernetes network policies enforce pod-to-pod communication restrictions
- AWS PrivateLink: Available for dedicated SaaS customers to establish private connectivity between their VPC and HoneyHive’s services without exposing traffic to the public internet
- External Secrets Store: Separates secrets from application code
Access Control
- IAM Roles: Service accounts use temporary credentials via IRSA
- RBAC: Kubernetes Role-Based Access Control for service permissions
- Least Privilege: Each service has minimal required permissions
- Multi-Factor Authentication: Available for all user accounts
Compliance
- SOC 2 Type II: Audited annually
- GDPR: Data residency and privacy controls
- HIPAA: Available for healthcare customers
Reliability & Performance
High Availability
- Multi-AZ Deployment: Services distributed across multiple availability zones
- Automatic Failover: Database and compute resources automatically fail over on failure
- Health Checks: Continuous monitoring with automatic recovery
- Zero-Downtime Deployments: Rolling updates with canary deployments
Scalability
- Horizontal Auto-Scaling: Kubernetes HPA scales pods based on CPU/memory utilization
- Vertical Scaling: Database and storage scale independently
- Queue-Based Architecture: NATS queue buffers traffic spikes

