HoneyHive exposes Prometheus-format metrics on every application service. This page documents what’s exposed, how to make it discoverable by your existing collection stack, and the headline metrics for each functional area.

HoneyHive ships an internal OpenTelemetry Collector and a kube-prometheus-stack that handle in-cluster trace plumbing and bundled dashboards (see Operations Guide). Those components are sized for internal use and aren’t configured to forward hh_* metrics to your observability backend. Instead, the expected integration model is that your platform team’s existing metrics stack (an OpenTelemetry Collector, Prometheus, vmagent, or similar) scrapes HoneyHive’s services and forwards the metrics to wherever you collect telemetry today.

What’s exposed

Every HoneyHive application service exposes Prometheus text-format metrics at:
http://<service>.<namespace>.svc.cluster.local:9091/metrics
The five services most customers monitor:
| Service | Language | Covers | Metric prefix |
| --- | --- | --- | --- |
| cp-backend-service | TypeScript | Signup, org/workspace creation, API key creation, UI queries | hh_http_* (filter by route label) |
| cp-writer-service | Go | Trace storage: NATS to ClickHouse writes, DLQ | hh_writer_* |
| dp-ingestion-service | Go | SDK ingestion: events, batches, S3, NATS publish | hh_ingestion_* |
| dp-evaluation-service | TypeScript | Evaluation pipeline: queue depth, job outcomes, annotation lookups | hh_dp_evaluation_*, hh_http_* |
| dp-llmproxy-service | Python | LLM proxy: error rate, upstream latency | hh_http_* |
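All five services carry a service_name label on every metric they emit. In the examples on this page the label value uses underscores rather than hyphens (for example, service_name="cp_backend_service"), and the LLM proxy appears as dp_llm_proxy_service.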

Enable ServiceMonitor discovery

The Helm chart ships ServiceMonitor templates for each application service. They are gated by a single flag, so installs without the Prometheus Operator CRDs are unaffected. In your environment’s values.yaml (both control plane and data plane charts), set:
serviceMonitor:
  enabled: true
  # Namespace where your scraper looks for ServiceMonitors.
  # Typically matches the namespace running kube-prometheus-stack
  # or your OpenTelemetry Collector.
  namespace: monitoring
  # Labels your scraper's serviceMonitorSelector matches on.
  # For kube-prometheus-stack this is usually `release: <helm-release-name>`.
  labels:
    release: monitoring
  interval: 30s
  scrapeTimeout: 10s
Then helm upgrade as usual. Confirm the resources rendered:
kubectl get servicemonitor -n monitoring
# NAME                       AGE
# cp-backend-service         1m
# cp-controller-service      1m
# cp-notification-service    1m
# cp-writer-service          1m
# dp-backend-service         1m
# dp-evaluation-service      1m
# dp-ingestion-service       1m
# dp-llmproxy-service        1m
# dp-pythonmetric-service    1m
The five services listed in the table above are the ones most customers monitor; the others (controller, notification, dp-backend, pythonmetric) also expose /metrics endpoints and are useful for internal health visibility.
ServiceMonitor is a CRD from the Prometheus Operator. Most Prometheus-compatible scrapers, including vmagent and the OpenTelemetry Collector’s Prometheus receiver (via the target_allocator), can discover scrape targets through ServiceMonitors. If your stack does not use the Prometheus Operator CRDs, leave the flag off and scrape the :9091/metrics endpoints directly using your scraper’s native service discovery.
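If your scraper is your own kube-prometheus-stack install, the selector side of this match lives in that chart's values. A minimal sketch, assuming the release: monitoring label from the example above and that Prometheus should look for ServiceMonitors in every namespace. The key names belong to kube-prometheus-stack, not to the HoneyHive chart, so check them against the chart version you run:

```yaml
# values.yaml for kube-prometheus-stack (not the HoneyHive chart).
prometheus:
  prometheusSpec:
    # Select ServiceMonitors carrying the label set under
    # serviceMonitor.labels in the HoneyHive values.
    serviceMonitorSelector:
      matchLabels:
        release: monitoring
    # An empty selector matches ServiceMonitors in all namespaces.
    serviceMonitorNamespaceSelector: {}
```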

Pointing your collector at HoneyHive

Two integration shapes cover most environments:
  • Prometheus-Operator CRDs (recommended). If your OpenTelemetry Collector, Prometheus, or vmagent uses the Prometheus Operator’s ServiceMonitor CRD for scrape-target discovery, the resources you enabled above are picked up automatically once your scraper’s serviceMonitorSelector matches the labels in the HoneyHive values. For the OpenTelemetry Collector, this means running the target_allocator with prometheusCR.enabled: true.
  • Direct Prometheus scrape config. If you don’t use the Operator CRDs, point your collector’s prometheus receiver (or scraper of choice) at HoneyHive’s services directly. Each service is reachable at http://<service>.<namespace>.svc.cluster.local:9091/metrics (see the table above for service names). Kubernetes service discovery filtered by the app.kubernetes.io/* labels HoneyHive sets on each service works equally well.
Either way, the hh_* metrics will appear in whatever backend your collector forwards to: OTLP, Prometheus remote-write, Datadog, etc.
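For the direct-scrape shape, the following is a minimal sketch of an OpenTelemetry Collector configuration, not a definitive setup. It assumes the charts are installed in namespaces named control-plane and data-plane (as in the port-forward commands on this page), and the OTLP endpoint is a placeholder for your own backend:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: honeyhive          # hypothetical job name
          scrape_interval: 30s
          scrape_timeout: 10s
          metrics_path: /metrics
          static_configs:
            - targets:
                # Namespaces are assumptions; substitute your own.
                - cp-backend-service.control-plane.svc.cluster.local:9091
                - cp-writer-service.control-plane.svc.cluster.local:9091
                - dp-ingestion-service.data-plane.svc.cluster.local:9091
                - dp-evaluation-service.data-plane.svc.cluster.local:9091
                - dp-llmproxy-service.data-plane.svc.cluster.local:9091

exporters:
  otlphttp:
    # Hypothetical endpoint; point this at your telemetry backend.
    endpoint: https://otlp.example.com

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlphttp]
```

Static targets resolve to each Service's cluster IP, so when a service runs several replicas each scrape reaches only one of them. Kubernetes service discovery at the pod or endpoint level avoids that and is the better choice for multi-replica deployments.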

Headline metrics

Each service emits many internal metrics; the tables below list the handful most useful for dashboards and alerts. Where a service exports more than the headline set, its section ends with a command for listing the full set. All series carry a service_name label. Histograms expose _bucket, _count, and _sum series in the usual Prometheus convention; the OpenTelemetry Collector’s Prometheus receiver translates these to OTel histogram data points automatically.

HTTP requests (every service)

Every HoneyHive HTTP service emits the same two HTTP-level metrics via shared middleware. These are the foundation for request-rate, error-rate, and latency-SLO alerts on any endpoint:
| Metric | Type | Labels | Notes |
| --- | --- | --- | --- |
| hh_http_requests_total | Counter | method, route, status_code, service_name | Use status_code=~"5.." for error-rate alerts. |
| hh_http_request_duration_seconds | Histogram | method, route, service_name | Buckets: 5ms–10s. Use for latency SLOs. |
The route label is the Express route pattern (e.g. /v1/sessions/:session_id or /v1/alerts/:id), not the raw URL: dynamic segments stay parameterized, keeping cardinality bounded. Requests that don’t match any route are recorded as route="unmatched".
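As one way to turn the duration histogram into a latency SLO signal, the sketch below is a Prometheus rule file that records 99th-percentile request duration per service and route. The group name, the recorded series name, and the choice of p99 over a 5-minute window are illustrative assumptions:

```yaml
groups:
  - name: honeyhive-http   # hypothetical group name
    rules:
      # p99 request latency per service and route over the last 5 minutes.
      - record: service_route:hh_http_request_duration_seconds:p99_5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, service_name, route) (
              rate(hh_http_request_duration_seconds_bucket[5m])
            )
          )
```

The le label has to survive the aggregation, since histogram_quantile interpolates across bucket boundaries.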

Signup, org setup, and admin operations (cp-backend-service)

For account and workspace setup activity, filter hh_http_* by the following route values (under service_name="cp_backend_service"):
| Operation | Method | route value |
| --- | --- | --- |
| Signup / session creation | POST | /auth/session |
| User onboarding completion | POST | /v1/user/onboard |
| Scope creation (org / workspace / project) | POST | /v1/scopetree/ |
| Scope provision (data-plane bring-up) | POST | /v1/scopetree/provision |
| API key creation | POST | /v1/api_key/ |
| Events fetch (UI) | GET | /v1/events/ |
| Events index (UI search) | POST | /v1/events/search-ids |
Example: alert on signup error rate > 1% over 5 minutes:
sum(rate(hh_http_requests_total{service_name="cp_backend_service",route="/auth/session",status_code=~"5.."}[5m]))
  /
sum(rate(hh_http_requests_total{service_name="cp_backend_service",route="/auth/session"}[5m]))
> 0.01
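The expression above can be wrapped in a Prometheus alerting rule. A sketch, in which the group name, alert name, for duration, severity label, and annotation text are all placeholders to adapt to your own conventions:

```yaml
groups:
  - name: honeyhive-signup   # hypothetical group name
    rules:
      - alert: HoneyHiveSignupErrorRateHigh   # hypothetical alert name
        expr: |
          sum(rate(hh_http_requests_total{service_name="cp_backend_service",route="/auth/session",status_code=~"5.."}[5m]))
            /
          sum(rate(hh_http_requests_total{service_name="cp_backend_service",route="/auth/session"}[5m]))
          > 0.01
        # Assumed hold time: the condition must persist this long before firing.
        for: 5m
        labels:
          severity: page
        annotations:
          summary: Signup 5xx rate above 1% over 5 minutes
```

When the route receives no traffic the ratio is NaN or empty, so the rule stays silent rather than firing.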

Ingestion (dp-ingestion-service)

The ingestion service receives SDK events, buffers them, writes them to S3, and publishes to NATS for downstream processing. Headline metrics:
| Metric | Type | Labels | Use for |
| --- | --- | --- | --- |
| hh_ingestion_events_processed_total | Counter | operation | Ingestion throughput by operation (create, update, session, batch). |
| hh_ingestion_events_errored_total | Counter | operation, error_class | Ingestion failure rate. |
| hh_ingestion_processing_duration_seconds | Histogram | operation | End-to-end ingestion latency. |
| hh_ingestion_writer_buffer_depth | Gauge | (none) | Backpressure signal: items waiting to be flushed. |
| hh_ingestion_s3_operations_total | Counter | operation, result | S3 write health. |
| hh_ingestion_nats_published_total | Counter | stream, result | Downstream publish health. |
Additional metrics covering cache hit ratios, flush behavior, and per-operation S3 and NATS latency are also exported. To see the full list, port-forward to a pod and curl its metrics endpoint:
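# Terminal 1: forward the metrics port (blocks until interrupted).
kubectl port-forward -n data-plane svc/dp-ingestion-service 9091:9091
# Terminal 2: list the hh_ingestion_* series (comment lines start with '#', so they are already excluded).
curl -s localhost:9091/metrics | grep '^hh_ingestion_'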
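A failure-ratio alert can be built from the two event counters. This sketch assumes hh_ingestion_events_processed_total counts only successfully processed events, so the denominator adds both counters; if it already counts every attempt, drop the errored term from the denominator. The 1% threshold and all rule names are illustrative:

```yaml
groups:
  - name: honeyhive-ingestion   # hypothetical group name
    rules:
      - alert: HoneyHiveIngestionErrorRateHigh   # hypothetical alert name
        # Errored events as a fraction of all events, per operation.
        expr: |
          sum by (operation) (rate(hh_ingestion_events_errored_total[5m]))
            /
          (
            sum by (operation) (rate(hh_ingestion_events_processed_total[5m]))
            +
            sum by (operation) (rate(hh_ingestion_events_errored_total[5m]))
          )
          > 0.01
        for: 5m   # assumed hold time
        labels:
          severity: page
```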

Trace storage (cp-writer-service)

The writer service consumes from NATS and writes to ClickHouse. S3 writes happen in the ingestion service, not here. Headline metrics:
| Metric | Type | Labels | Use for |
| --- | --- | --- | --- |
| hh_writer_nats_messages_consumed_total | Counter | result | Write-path throughput. |
| hh_writer_clickhouse_write_duration_seconds | Histogram | table | ClickHouse write latency by table. |
| hh_writer_clickhouse_errors_total | Counter | table, error_class | ClickHouse write failure rate. |
| hh_writer_buffer_depth_records | Gauge | table | Backpressure per ClickHouse table. |
| hh_writer_dlq_records_total | Counter | table | Records sent to dead-letter; non-zero means data loss to investigate. |
Additional metrics covering NATS batching, buffer flush cadence, retries, and bisect-on-failure behavior are also exported. To see the full list:
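# Terminal 1: forward the metrics port (blocks until interrupted).
kubectl port-forward -n control-plane svc/cp-writer-service 9091:9091
# Terminal 2: list the hh_writer_* series (comment lines start with '#', so they are already excluded).
curl -s localhost:9091/metrics | grep '^hh_writer_'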
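Since any dead-lettered record is worth investigating, one option is an alert on growth of the DLQ counter. A sketch; the names, the 15-minute window, and the severity are assumptions:

```yaml
groups:
  - name: honeyhive-writer   # hypothetical group name
    rules:
      - alert: HoneyHiveWriterDLQRecords   # hypothetical alert name
        # Fires when the DLQ counter grew for any table in the last 15 minutes.
        expr: sum by (table) (increase(hh_writer_dlq_records_total[15m])) > 0
        labels:
          severity: ticket
        annotations:
          summary: "Records dead-lettered for table {{ $labels.table }}"
```

increase() needs at least two samples in the window, so the very first increment of a series that did not exist before may go unnoticed; a dashboard panel on the raw counter is a useful complement.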

Evaluation pipeline (dp-evaluation-service)

The evaluation service pulls jobs off a NATS work queue and processes them.
| Metric | Type | Labels | Use for |
| --- | --- | --- | --- |
| hh_dp_evaluation_nats_consumer_num_pending | Gauge | stream, consumer | Queue depth: jobs waiting to be picked up. |
| hh_dp_evaluation_nats_consumer_num_ack_pending | Gauge | stream, consumer | In-flight jobs: picked up but not yet acked. |
| hh_dp_evaluation_jobs_completed_total | Counter | result | Job outcomes. result="success" = ack’d; result="failure" = nak’d for retry. Incremented per delivery attempt, not per unique job. |
Example: alert on evaluation queue backlog sustained above 1000 pending for 5 minutes:
min_over_time(
  hh_dp_evaluation_nats_consumer_num_pending[5m]
) > 1000
Example: alert on evaluation job failure rate > 5% over 5 minutes:
sum(rate(hh_dp_evaluation_jobs_completed_total{service_name="dp_evaluation_service",result="failure"}[5m]))
  /
sum(rate(hh_dp_evaluation_jobs_completed_total{service_name="dp_evaluation_service"}[5m]))
> 0.05
Additional annotation-queue lookup metrics (cache hit/miss, filter errors, lookup latency) are also exported. To see the full list:
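# Terminal 1: forward the metrics port (blocks until interrupted).
kubectl port-forward -n data-plane svc/dp-evaluation-service 9091:9091
# Terminal 2: list the hh_dp_evaluation_* series (comment lines start with '#', so they are already excluded).
curl -s localhost:9091/metrics | grep '^hh_dp_evaluation_'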
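If you run the Prometheus Operator, the two evaluation alerts from this section can be shipped as a PrometheusRule resource. A sketch only: the resource name, namespace, alert names, severities, and the release: monitoring label (which has to match your Prometheus ruleSelector) are assumptions:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: honeyhive-evaluation   # hypothetical name
  namespace: monitoring
  labels:
    release: monitoring        # assumed to match your Prometheus ruleSelector
spec:
  groups:
    - name: honeyhive-evaluation
      rules:
        # Backlog stayed above 1000 pending jobs for the whole 5-minute window.
        - alert: HoneyHiveEvaluationBacklog
          expr: |
            min_over_time(
              hh_dp_evaluation_nats_consumer_num_pending[5m]
            ) > 1000
          labels:
            severity: ticket
        # More than 5% of delivery attempts were nak'd over 5 minutes.
        - alert: HoneyHiveEvaluationFailureRateHigh
          expr: |
            sum(rate(hh_dp_evaluation_jobs_completed_total{service_name="dp_evaluation_service",result="failure"}[5m]))
              /
            sum(rate(hh_dp_evaluation_jobs_completed_total{service_name="dp_evaluation_service"}[5m]))
            > 0.05
          labels:
            severity: page
```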

LLM proxy (dp-llmproxy-service)

The LLM proxy exposes only the standard hh_http_* family. LiteLLM’s native Prometheus instrumentation is intentionally disabled so provider keys and message content never appear in metric labels. Each upstream provider lives at a distinct route, so per-provider error rate and latency fall out of hh_http_* filtered by route.
Example: alert on proxy 5xx rate > 10% over 60 seconds:
sum(rate(hh_http_requests_total{service_name="dp_llm_proxy_service",status_code=~"5.."}[1m]))
  /
sum(rate(hh_http_requests_total{service_name="dp_llm_proxy_service"}[1m]))
> 0.10
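Because each provider sits behind its own route, grouping by route gives a per-provider view. A sketch of a recording rule; the group name, recorded series name, and 5-minute window are assumptions, and the actual route values depend on your deployment:

```yaml
groups:
  - name: honeyhive-llmproxy   # hypothetical group name
    rules:
      # Fraction of requests returning 5xx, per proxy route.
      - record: route:hh_http_requests:error_ratio_5m
        expr: |
          sum by (route) (rate(hh_http_requests_total{service_name="dp_llm_proxy_service",status_code=~"5.."}[5m]))
            /
          sum by (route) (rate(hh_http_requests_total{service_name="dp_llm_proxy_service"}[5m]))
```

Routes that have not returned any 5xx response have no numerator series, so they are simply absent from the result rather than reported as 0.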

Reach out to your HoneyHive support contact if you need help with dashboards, alert thresholds, or recipes for a specific backend.