This guide covers ongoing operational procedures for self-hosted HoneyHive deployments. It is written for platform engineering teams responsible for maintaining the Data Plane infrastructure.
Most self-hosted deployments use a federated model where HoneyHive manages the Control Plane and you manage the Data Plane. This guide is written for that model. If you manage both planes, the same procedures apply to each. For architecture details, see Platform Architecture.

Upgrade path

Self-hosted HoneyHive has three independently versioned artifact layers. Each layer is upgraded separately.
| Artifact | Format | Version mechanism | Who initiates |
| --- | --- | --- | --- |
| Infrastructure | Terraform modules | Git ref pin (?ref=v1.0.0) | Customer |
| Application | Helm charts | ArgoCD sync from chart repo | Customer |
| Kubernetes | EKS cluster | AWS-managed upgrades | Customer |
Using HoneyHive’s Terraform modules is not required. You can provision the Data Plane infrastructure using any tooling, as long as the resources meet the minimum requirements described in Infrastructure Requirements.

Terraform module upgrades

For teams using HoneyHive Terraform modules, infrastructure is consumed via git references with explicit version pins:
module "honeyhive_vpc" {
  source = "git::https://github.com/honeyhiveai/honeyhive-terraform.git//hosting/aws/vpc?ref=v1.2.0"
}
To upgrade:
  1. Review the changelog provided by HoneyHive for the target version.
  2. Update the ?ref= tag in your Terraform configuration to the new version.
  3. Run terraform plan and review the diff for breaking changes or resource replacements.
  4. Apply in a maintenance window: terraform apply.
  5. Verify infrastructure health (EKS node status, RDS connectivity, security group rules).
Always run terraform plan before applying. Some upgrades may replace resources (e.g., node groups), which causes temporary capacity reduction. Coordinate with HoneyHive support for major version upgrades.
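The steps above can be sketched as a short shell session. The version numbers and the `main.tf` filename are illustrative; substitute your own configuration file and the release listed in HoneyHive's changelog.

```shell
# Pin the module to the new release (v1.2.0 -> v1.3.0 is illustrative)
sed -i 's/?ref=v1.2.0/?ref=v1.3.0/' main.tf

# Re-fetch modules at the new ref and preview the diff
terraform init -upgrade
terraform plan -out=upgrade.tfplan

# Inspect the plan for resource replacements before applying
terraform show upgrade.tfplan | grep -E 'must be replaced|forces replacement' || true

# Apply the reviewed plan during the maintenance window
terraform apply upgrade.tfplan
```

Saving the plan with `-out` and applying that exact plan file guarantees that what you reviewed is what gets applied, even if the remote state changes in between.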

Helm chart upgrades via ArgoCD

Application services are deployed through ArgoCD using a wave-ordered deployment strategy. The four deployment waves ensure dependencies are available before dependent services start:
| Wave | Contents | Purpose |
| --- | --- | --- |
| 1 | Core utilities | Secrets, config maps, service accounts |
| 2 | Observability stack | OpenTelemetry collectors, Prometheus |
| 3 | Data stores | PostgreSQL connections, ClickHouse, Redis, NATS |
| 4 | HoneyHive application services | DP services (ingestion, evaluation, backend, etc.) |
To upgrade application services:
  1. HoneyHive publishes a new Helm chart version with release notes.
  2. Update the target revision in your ArgoCD Application manifest.
  3. Run argocd app diff <app-name> to preview changes.
  4. Sync the application: argocd app sync <app-name>.
  5. ArgoCD deploys waves in order, waiting for health checks between waves.
  6. Monitor the rollout in the ArgoCD UI or with argocd app get <app-name>.
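Assuming the ArgoCD CLI is authenticated against your instance, the diff-sync-monitor loop might look like this; the application name honeyhive-dp is a placeholder for whatever your Application manifest is called.

```shell
APP=honeyhive-dp  # placeholder application name

# Preview what the new chart revision will change
argocd app diff "$APP"

# Trigger the sync; waves deploy in order with health checks between them
argocd app sync "$APP"

# Block until the application reports Synced and Healthy (10-minute timeout)
argocd app wait "$APP" --sync --health --timeout 600

# Inspect the final state
argocd app get "$APP"
```

Using `argocd app wait` in automation (e.g., a CI pipeline) gives you a nonzero exit code if any wave fails its health check, rather than requiring someone to watch the UI.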

Rollback procedures

ArgoCD maintains a history of deployed revisions. To roll back:
# List deployment history
argocd app history <app-name>

# Roll back to a previous revision
argocd app rollback <app-name> <revision-number>
For Terraform rollbacks, revert the ?ref= pin to the previous version and run terraform apply.
Database schema migrations are forward-only. If a Helm upgrade includes a migration, coordinate with HoneyHive support before rolling back to ensure data compatibility.

EKS cluster upgrades

EKS version upgrades follow standard AWS procedures and are independent of HoneyHive application upgrades. Refer to the AWS EKS version calendar for support timelines. Test upgrades in a non-production environment first. Coordinate with HoneyHive support if you have questions about Kubernetes version compatibility.
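A minimal sketch of the AWS-side upgrade, assuming a cluster named honeyhive-dp-eks and a managed node group named default (both placeholders); EKS only supports moving one minor version at a time.

```shell
# Bump the control plane one minor version (name and version are illustrative)
aws eks update-cluster-version \
  --name honeyhive-dp-eks \
  --kubernetes-version 1.30

# Wait for the control plane update to finish
aws eks wait cluster-active --name honeyhive-dp-eks

# Then roll each managed node group to the matching release
aws eks update-nodegroup-version \
  --cluster-name honeyhive-dp-eks \
  --nodegroup-name default
```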

Backup and disaster recovery

RDS (PostgreSQL)

Amazon RDS provides automated backups with point-in-time recovery (PITR):
| Setting | Recommended value |
| --- | --- |
| Automated backup retention | 7 days minimum |
| PITR granularity | 5-minute intervals (AWS default) |
| Multi-AZ | Enabled (automatic failover) |
| Encryption | Customer-managed KMS key |
To restore to a point in time:
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier honeyhive-dp-postgres \
  --target-db-instance-identifier honeyhive-dp-postgres-restored \
  --restore-time "YYYY-MM-DDTHH:MM:SSZ"
PITR creates a new RDS instance. After restoring, update your application configuration to point to the restored instance, then decommission the old one. Coordinate with HoneyHive support for any post-restore migration steps.
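Once the restore is initiated, you can wait for the new instance and fetch its endpoint for the application configuration update; the instance identifiers match the restore command above.

```shell
# Wait for the restored instance to become available
aws rds wait db-instance-available \
  --db-instance-identifier honeyhive-dp-postgres-restored

# Look up its new endpoint to point application configuration at it
aws rds describe-db-instances \
  --db-instance-identifier honeyhive-dp-postgres-restored \
  --query 'DBInstances[0].Endpoint.Address' \
  --output text
```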

S3 (Object storage)

Enable versioning and lifecycle policies on all HoneyHive S3 buckets:
| Control | Configuration |
| --- | --- |
| Versioning | Enabled. Protects against accidental deletion and overwrites |
| Lifecycle rules | Transition older versions to S3 Glacier after 30 days (configurable) |
| Replication | Cross-region replication optional for DR |
| Encryption | SSE-KMS with customer-managed key |
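If you manage the buckets outside Terraform, the versioning and lifecycle controls above can be applied directly with the AWS CLI; the bucket name is a placeholder and the 30-day transition matches the table's default.

```shell
BUCKET=honeyhive-dp-artifacts  # placeholder bucket name

# Turn on versioning
aws s3api put-bucket-versioning \
  --bucket "$BUCKET" \
  --versioning-configuration Status=Enabled

# Move noncurrent object versions to Glacier after 30 days
aws s3api put-bucket-lifecycle-configuration \
  --bucket "$BUCKET" \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "archive-old-versions",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "NoncurrentVersionTransitions": [
        {"NoncurrentDays": 30, "StorageClass": "GLACIER"}
      ]
    }]
  }'
```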

ClickHouse

ClickHouse runs as a StatefulSet on EKS and is not backed by a managed AWS service. HoneyHive provides pre-packaged Kubernetes jobs for backup and restore, enabled and scheduled through Helm values in your deployment configuration. The jobs handle creating snapshots, uploading them to S3, and restoring from snapshots. Contact HoneyHive support for the specific values and the recommended backup schedule for your deployment size.
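Once the backup jobs are enabled, you can verify and exercise them with standard kubectl commands. The CronJob name `clickhouse-backup` and the namespace below are hypothetical; the actual names come from your Helm values.

```shell
NS=honeyhive  # hypothetical namespace; check your deployment's values

# Confirm the scheduled backup job is installed and when it last ran
kubectl get cronjobs -n "$NS"

# Trigger an ad-hoc backup from the scheduled job definition
kubectl create job --from=cronjob/clickhouse-backup clickhouse-backup-manual -n "$NS"

# Follow the backup job's logs
kubectl logs -f job/clickhouse-backup-manual -n "$NS"
```

Running an ad-hoc job this way is also a cheap periodic test that the backup path (snapshot creation and S3 upload) actually works before you need it.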

Monitoring and observability

The self-hosted deployment includes OpenTelemetry (OTEL) collectors and Prometheus, deployed in wave 2 of the ArgoCD deployment. These provide metrics collection and instrumentation for all application services. We expect platform teams to use Prometheus annotations and OTEL exporters to push metrics and traces into whatever internal monitoring stack you already use (Datadog, Grafana Cloud, Splunk, New Relic, etc.). This approach integrates with your existing alerting and dashboarding workflows rather than requiring a parallel stack.
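Before wiring metrics into your own stack, it is worth confirming the wave-2 Prometheus is scraping the application services. The service and namespace names below are assumptions; adjust them to your deployment.

```shell
# Service/namespace names are assumptions; check your deployment's values
kubectl port-forward svc/prometheus 9090:9090 -n monitoring &

# Query scrape health for all targets; a value of 1 means the target is reachable
curl -s 'http://localhost:9090/api/v1/query?query=up'
```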

What HoneyHive instruments

  • Structured logging with request IDs across all services for ease of debugging
  • APM tracing with distributed trace context propagation
  • Metric instrumentation on hot paths (ingestion throughput, query latency, queue depth, error rates)

Dashboards and alerting

HoneyHive provides recommended dashboard configurations and a runbook for critical user flows upon request. The specifics vary based on your platform team's observability stack. Contact HoneyHive support to get the configuration for your monitoring platform.

Scaling

The deployment uses Karpenter for dynamic node provisioning and Kubernetes Horizontal Pod Autoscaler (HPA) for application-level scaling. Both are configurable through Helm values in your deployment. Karpenter automatically provisions right-sized nodes based on pending pod requirements. HPA scales individual services based on CPU and memory utilization thresholds. To adjust scaling parameters, override the relevant settings in your environment-specific values.yaml. Contact HoneyHive to discuss sizing recommendations for your specific workload and throughput requirements.
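To see the current scaling posture before changing any values, standard kubectl commands cover both layers. The namespace and service name below are assumptions specific to this sketch.

```shell
# Namespace and HPA names are assumptions; check your deployment's values
kubectl get hpa -n honeyhive

# Show scaling events and current utilization for one service
kubectl describe hpa ingestion -n honeyhive

# List the nodes Karpenter has provisioned (NodeClaim is the Karpenter v1 CRD)
kubectl get nodeclaims
```

Checking `kubectl describe hpa` output after changing thresholds in values.yaml confirms the new limits actually took effect after the ArgoCD sync.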

Incident response and support

Shared responsibility

The division of responsibility depends on your deployment model:
  • Federated (HoneyHive manages CP): HoneyHive operates the Control Plane end-to-end. You operate the Data Plane infrastructure, deploy application updates via ArgoCD, and handle L1/L2 triage. HoneyHive provides L3 support and root cause analysis.
  • Fully self-hosted (you manage CP + DP): You operate all infrastructure. HoneyHive provides support for application-level issues, upgrade guidance, and incident response assistance.

Support

Self-hosted deployments include support with response targets defined in your agreement. For incident response policies and compliance details, see the HoneyHive Trust Center.

Multi-Data-Plane management

Organizations that need multiple Data Planes (per region, business unit, or environment) can deploy additional instances using the same infrastructure modules.

Deploying an additional Data Plane

  1. Provision a new AWS account (recommended) or a new VPC in an existing account.
  2. Deploy infrastructure meeting the minimum requirements (using HoneyHive Terraform modules or your own tooling).
  3. Register the new Data Plane with the Control Plane. HoneyHive configures the CP side.
  4. Deploy application services via ArgoCD, using instance-specific values.yaml overrides.
Each Data Plane is fully independent: separate EKS cluster, RDS, S3, KMS keys, and observability stack. There is no shared state between Data Planes.
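One common way to keep Data Planes independent while reusing the same modules is a separate Terraform root (with its own state backend) per instance. The layout below is illustrative, not a HoneyHive-prescribed structure.

```shell
# Illustrative layout: one Terraform root per Data Plane, all pinned
# to the same module version:
#
#   data-planes/
#     us-east-1/   main.tf  (?ref=v1.2.0, us-east-1 backend/state)
#     eu-west-1/   main.tf  (?ref=v1.2.0, eu-west-1 backend/state)
#
# Each root has its own state, so Data Planes share nothing
cd data-planes/eu-west-1
terraform init
terraform plan
```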

Consistency across Data Planes

  • Pin all Data Planes to the same Terraform module and Helm chart versions to reduce operational complexity.
  • Use a central git repository for ArgoCD Application manifests, with environment-specific overlays per Data Plane.
  • Standardize monitoring and alerting configuration so dashboards and runbooks work across all instances.