This guide covers ongoing operational procedures for self-hosted HoneyHive deployments. It is written for platform engineering teams responsible for maintaining the Data Plane infrastructure.
Most self-hosted deployments use a federated model where HoneyHive manages the Control Plane and you manage the Data Plane. This guide is written for that model. If you manage both planes, the same procedures apply to each. For architecture details, see Platform Architecture.

Upgrade path

Self-hosted HoneyHive has three independently versioned artifact layers. Each layer is upgraded separately.
| Artifact | Format | Version mechanism | Who initiates |
| --- | --- | --- | --- |
| Infrastructure | Terraform modules | Git ref pin (?ref=v1.0.0) | Customer |
| Application | Helm charts | ArgoCD sync from chart repo | Customer |
| Kubernetes | EKS cluster | AWS-managed upgrades | Customer |
Using HoneyHive’s Terraform modules is not required. You can provision the Data Plane infrastructure using any tooling, as long as the resources meet the minimum requirements described in Infrastructure Requirements.

Terraform module upgrades

For teams using HoneyHive Terraform modules, infrastructure is consumed via git references with explicit version pins:
module "honeyhive_vpc" {
  source = "git::https://github.com/honeyhiveai/honeyhive-terraform.git//hosting/aws/vpc?ref=v1.2.0"
}
To upgrade:
  1. Review the changelog provided by HoneyHive for the target version.
  2. Update the ?ref= tag in your Terraform configuration to the new version.
  3. Run terraform plan and review the diff for breaking changes or resource replacements.
  4. Apply in a maintenance window: terraform apply.
  5. Verify infrastructure health (EKS node status, RDS connectivity, security group rules).
Always run terraform plan before applying. Some upgrades may replace resources (e.g., node groups), which causes temporary capacity reduction. Coordinate with HoneyHive support for major version upgrades.
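The steps above can be sketched as a short shell session. The version numbers and the `main.tf` filename are illustrative; substitute your own configuration file and the release listed in HoneyHive's changelog.

```shell
# Pin the module to the new release (v1.2.0 -> v1.3.0 is illustrative)
sed -i 's/?ref=v1.2.0/?ref=v1.3.0/' main.tf

# Re-fetch modules at the new ref and preview the diff
terraform init -upgrade
terraform plan -out=upgrade.tfplan

# Inspect the plan for resource replacements before applying
terraform show upgrade.tfplan | grep -E 'must be replaced|forces replacement' || true

# Apply the reviewed plan during the maintenance window
terraform apply upgrade.tfplan
```

Saving the plan with `-out` and applying that exact plan file guarantees that what you reviewed is what gets applied, even if the remote state changes in between.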

Helm chart upgrades via ArgoCD

Application services are deployed through ArgoCD using a wave-ordered deployment strategy. The four deployment waves ensure dependencies are available before dependent services start:
| Wave | Contents | Purpose |
| --- | --- | --- |
| 1 | Core utilities | Secrets, config maps, service accounts |
| 2 | Observability stack | OpenTelemetry collectors, Prometheus |
| 3 | Data stores | PostgreSQL connections, ClickHouse, Redis, NATS |
| 4 | HoneyHive application services | DP services (ingestion, evaluation, backend, etc.) |
To upgrade application services:
  1. HoneyHive publishes a new Helm chart version with release notes.
  2. Update the target revision in your ArgoCD Application manifest.
  3. Run argocd app diff <app-name> to preview changes.
  4. Sync the application: argocd app sync <app-name>.
  5. ArgoCD deploys waves in order, waiting for health checks between waves.
  6. Monitor the rollout in the ArgoCD UI or with argocd app get <app-name>.
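Assuming the ArgoCD CLI is authenticated against your instance, the diff-sync-monitor loop might look like this; the application name honeyhive-dp is a placeholder for whatever your Application manifest is called.

```shell
APP=honeyhive-dp  # placeholder application name

# Preview what the new chart revision will change
argocd app diff "$APP"

# Trigger the sync; waves deploy in order with health checks between them
argocd app sync "$APP"

# Block until the application reports Synced and Healthy (10-minute timeout)
argocd app wait "$APP" --sync --health --timeout 600

# Inspect the final state
argocd app get "$APP"
```

Using `argocd app wait` in automation (e.g., a CI pipeline) gives you a nonzero exit code if any wave fails its health check, rather than requiring someone to watch the UI.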

Rollback procedures

ArgoCD maintains a history of deployed revisions. To roll back:
# List deployment history
argocd app history <app-name>

# Roll back to a previous revision
argocd app rollback <app-name> <revision-number>
For Terraform rollbacks, revert the ?ref= pin to the previous version and run terraform apply.
Database schema migrations are forward-only. If a Helm upgrade includes a migration, coordinate with HoneyHive support before rolling back to ensure data compatibility.

EKS cluster upgrades

EKS version upgrades follow standard AWS procedures and are independent of HoneyHive application upgrades. Refer to the AWS EKS version calendar for support timelines. Test upgrades in a non-production environment first. Coordinate with HoneyHive support if you have questions about Kubernetes version compatibility.
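A minimal sketch of the AWS-side upgrade, assuming a cluster named honeyhive-dp-eks and a managed node group named default (both placeholders); EKS only supports moving one minor version at a time.

```shell
# Bump the control plane one minor version (name and version are illustrative)
aws eks update-cluster-version \
  --name honeyhive-dp-eks \
  --kubernetes-version 1.30

# Wait for the control plane update to finish
aws eks wait cluster-active --name honeyhive-dp-eks

# Then roll each managed node group to the matching release
aws eks update-nodegroup-version \
  --cluster-name honeyhive-dp-eks \
  --nodegroup-name default
```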

Backup and disaster recovery

RDS (PostgreSQL)

Amazon RDS provides automated backups with point-in-time recovery (PITR):
| Setting | Recommended value |
| --- | --- |
| Automated backup retention | 7 days minimum |
| PITR granularity | 5-minute intervals (AWS default) |
| Multi-AZ | Enabled (automatic failover) |
| Encryption | Customer-managed KMS key |
To restore to a point in time:
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier honeyhive-dp-postgres \
  --target-db-instance-identifier honeyhive-dp-postgres-restored \
  --restore-time "YYYY-MM-DDTHH:MM:SSZ"
PITR creates a new RDS instance. After restoring, update your application configuration to point to the restored instance, then decommission the old one. Coordinate with HoneyHive support for any post-restore migration steps.
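Once the restore is initiated, you can wait for the new instance and fetch its endpoint for the application configuration update; the instance identifiers match the restore command above.

```shell
# Wait for the restored instance to become available
aws rds wait db-instance-available \
  --db-instance-identifier honeyhive-dp-postgres-restored

# Look up its new endpoint to point application configuration at it
aws rds describe-db-instances \
  --db-instance-identifier honeyhive-dp-postgres-restored \
  --query 'DBInstances[0].Endpoint.Address' \
  --output text
```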

S3 (Object storage)

Enable versioning and lifecycle policies on all HoneyHive S3 buckets:
| Control | Configuration |
| --- | --- |
| Versioning | Enabled. Protects against accidental deletion and overwrites |
| Lifecycle rules | Transition older versions to S3 Glacier after 30 days (configurable) |
| Replication | Cross-region replication optional for DR |
| Encryption | SSE-KMS with customer-managed key |
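If you manage the buckets outside Terraform, the versioning and lifecycle controls above can be applied directly with the AWS CLI; the bucket name is a placeholder and the 30-day transition matches the table's default.

```shell
BUCKET=honeyhive-dp-artifacts  # placeholder bucket name

# Turn on versioning
aws s3api put-bucket-versioning \
  --bucket "$BUCKET" \
  --versioning-configuration Status=Enabled

# Move noncurrent object versions to Glacier after 30 days
aws s3api put-bucket-lifecycle-configuration \
  --bucket "$BUCKET" \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "archive-old-versions",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "NoncurrentVersionTransitions": [
        {"NoncurrentDays": 30, "StorageClass": "GLACIER"}
      ]
    }]
  }'
```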

ClickHouse

ClickHouse runs as a StatefulSet on EKS and is not backed by a managed AWS service. HoneyHive provides pre-packaged Kubernetes jobs for backup and restore, enabled and scheduled through Helm values in your deployment configuration. The jobs handle creating snapshots, uploading them to S3, and restoring from snapshots. Contact HoneyHive support for the specific values and the recommended backup schedule for your deployment size.
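Once the backup jobs are enabled, you can verify and exercise them with standard kubectl commands. The CronJob name `clickhouse-backup` and the namespace below are hypothetical; the actual names come from your Helm values.

```shell
NS=honeyhive  # hypothetical namespace; check your deployment's values

# Confirm the scheduled backup job is installed and when it last ran
kubectl get cronjobs -n "$NS"

# Trigger an ad-hoc backup from the scheduled job definition
kubectl create job --from=cronjob/clickhouse-backup clickhouse-backup-manual -n "$NS"

# Follow the backup job's logs
kubectl logs -f job/clickhouse-backup-manual -n "$NS"
```

Running an ad-hoc job this way is also a cheap periodic test that the backup path (snapshot creation and S3 upload) actually works before you need it.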

Monitoring and observability

The self-hosted deployment includes OpenTelemetry (OTEL) collectors and Prometheus, deployed in wave 2 of the ArgoCD deployment. These provide metrics collection and instrumentation for all application services. We expect platform teams to use Prometheus annotations and OTEL exporters to push metrics and traces into whatever internal monitoring stack you already use (Datadog, Grafana Cloud, Splunk, New Relic, etc.). This approach integrates with your existing alerting and dashboarding workflows rather than requiring a parallel stack.
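Before wiring metrics into your own stack, it is worth confirming the wave-2 Prometheus is scraping the application services. The service and namespace names below are assumptions; adjust them to your deployment.

```shell
# Service/namespace names are assumptions; check your deployment's values
kubectl port-forward svc/prometheus 9090:9090 -n monitoring &

# Query scrape health for all targets; a value of 1 means the target is reachable
curl -s 'http://localhost:9090/api/v1/query?query=up'
```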

What HoneyHive instruments

  • Structured logging with request IDs across all services for ease of debugging
  • APM tracing with distributed trace context propagation
  • Metric instrumentation on hot paths (ingestion throughput, query latency, queue depth, error rates)

Dashboards and alerting

HoneyHive provides recommended dashboard configurations and a runbook for critical user flows upon request. The specifics vary based on your platform team's observability stack. Contact HoneyHive support to get the configuration for your monitoring platform.

Scaling

The deployment uses Karpenter for dynamic node provisioning and Kubernetes Horizontal Pod Autoscaler (HPA) for application-level scaling. Both are configurable through Helm values in your deployment. Karpenter automatically provisions right-sized nodes based on pending pod requirements. HPA scales individual services based on CPU and memory utilization thresholds. To adjust scaling parameters, override the relevant settings in your environment-specific values.yaml. Contact HoneyHive to discuss sizing recommendations for your specific workload and throughput requirements.
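To see the current scaling posture before changing any values, standard kubectl commands cover both layers. The namespace and service name below are assumptions specific to this sketch.

```shell
# Namespace and HPA names are assumptions; check your deployment's values
kubectl get hpa -n honeyhive

# Show scaling events and current utilization for one service
kubectl describe hpa ingestion -n honeyhive

# List the nodes Karpenter has provisioned (NodeClaim is the Karpenter v1 CRD)
kubectl get nodeclaims
```

Checking `kubectl describe hpa` output after changing thresholds in values.yaml confirms the new limits actually took effect after the ArgoCD sync.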

Incident response and support

Shared responsibility

The division of responsibility depends on your deployment model:
  • Federated (HoneyHive manages CP): HoneyHive operates the Control Plane end-to-end. You operate the Data Plane infrastructure, deploy application updates via ArgoCD, and handle L1/L2 triage. HoneyHive provides L3 support and root cause analysis.
  • Fully self-hosted (you manage CP + DP): You operate all infrastructure. HoneyHive provides support for application-level issues, upgrade guidance, and incident response assistance.

Support

Self-hosted deployments include support with response targets defined in your agreement. For incident response policies and compliance details, see the HoneyHive Trust Center.

Multi-Data-Plane management

Organizations that need multiple Data Planes (per region, business unit, or environment) can deploy additional instances using the same infrastructure modules.

Deploying an additional Data Plane

  1. Provision a new AWS account (recommended) or a new VPC in an existing account.
  2. Deploy infrastructure meeting the minimum requirements (using HoneyHive Terraform modules or your own tooling).
  3. Register the new Data Plane with the Control Plane. HoneyHive configures the CP side.
  4. Deploy application services via ArgoCD, using instance-specific values.yaml overrides.
Each Data Plane is fully independent: separate EKS cluster, RDS, S3, KMS keys, and observability stack. There is no shared state between Data Planes.
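One common way to keep Data Planes independent while reusing the same modules is a separate Terraform root (with its own state backend) per instance. The layout below is illustrative, not a HoneyHive-prescribed structure.

```shell
# Illustrative layout: one Terraform root per Data Plane, all pinned
# to the same module version:
#
#   data-planes/
#     us-east-1/   main.tf  (?ref=v1.2.0, us-east-1 backend/state)
#     eu-west-1/   main.tf  (?ref=v1.2.0, eu-west-1 backend/state)
#
# Each root has its own state, so Data Planes share nothing
cd data-planes/eu-west-1
terraform init
terraform plan
```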

Consistency across Data Planes

  • Pin all Data Planes to the same Terraform module and Helm chart versions to reduce operational complexity.
  • Use a central git repository for ArgoCD Application manifests, with environment-specific overlays per Data Plane.
  • Standardize monitoring and alerting configuration so dashboards and runbooks work across all instances.