hh_*
metrics to your observability backend.
The expected integration model for that is your platform team’s existing metrics
stack: an OpenTelemetry Collector, Prometheus, vmagent, or similar, scrapes
HoneyHive’s services and forwards to wherever you collect telemetry today.
What’s exposed
Every HoneyHive application service exposes Prometheus text-format metrics at :9091/metrics:
| Service | Language | Covers | Metric prefix |
|---|---|---|---|
| cp-backend-service | TypeScript | Signup, org/workspace creation, API key creation, UI queries | hh_http_* (filter by route label) |
| cp-writer-service | Go | Trace storage: NATS to ClickHouse writes, DLQ | hh_writer_* |
| dp-ingestion-service | Go | SDK ingestion: events, batches, S3, NATS publish | hh_ingestion_* |
| dp-evaluation-service | TypeScript | Evaluation pipeline: queue depth, job outcomes, annotation lookups | hh_dp_evaluation_*, hh_http_* |
| dp-llmproxy-service | Python | LLM proxy: error rate, upstream latency | hh_http_* |
Each service sets a service_name label on every metric it emits.
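A quick way to check that a service is exposing what you expect is to fetch its endpoint and list the hh_* metric names. The following is a minimal sketch, not HoneyHive tooling: the helper names are hypothetical, and the namespace you pass in depends on your install.

```python
import re

# Service names as listed in the table above.
SERVICES = [
    "cp-backend-service",
    "cp-writer-service",
    "dp-ingestion-service",
    "dp-evaluation-service",
    "dp-llmproxy-service",
]


def metrics_url(service: str, namespace: str) -> str:
    """Build the in-cluster metrics URL for one service (port 9091)."""
    return f"http://{service}.{namespace}.svc.cluster.local:9091/metrics"


def hh_metric_names(exposition_text: str) -> set[str]:
    """Return the hh_* metric names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line starts with the metric name, then labels or a value.
        match = re.match(r"[a-zA-Z_:][a-zA-Z0-9_:]*", line)
        if match and match.group(0).startswith("hh_"):
            names.add(match.group(0))
    return names
```

From a pod inside the cluster, you could fetch the text with `urllib.request.urlopen(metrics_url(service, namespace)).read().decode()` and pass it to `hh_metric_names`.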
Enable ServiceMonitor discovery
The Helm chart ships ServiceMonitor templates for each application service. They are
gated by a single flag so installs that don’t run the Prometheus Operator CRDs are
unaffected.
In your environment’s values.yaml (both control plane and data plane charts), enable the
ServiceMonitor flag. Then run helm upgrade as usual and confirm that the ServiceMonitor
resources rendered.
The rendered ServiceMonitors target the services’ /metrics endpoints and are useful for internal health visibility.
ServiceMonitor is a CRD from the
Prometheus Operator.
Most Prometheus-compatible scrapers, including vmagent and the OpenTelemetry
Collector’s Prometheus receiver (via the target_allocator), can discover scrape targets
through ServiceMonitors. If your stack does not use the Prometheus Operator CRDs, leave the
flag off and scrape the :9091/metrics endpoints directly using your scraper’s
native service discovery.
Pointing your collector at HoneyHive
Two integration shapes cover most environments:
- Prometheus-Operator CRDs (recommended). If your OpenTelemetry Collector, Prometheus, or vmagent uses the Prometheus Operator’s ServiceMonitor CRD for scrape-target discovery, the resources you enabled above are picked up automatically once your scraper’s serviceMonitorSelector matches the labels in the HoneyHive values. For OpenTelemetry Collector, this is the target_allocator with prometheusCR.enabled: true.
- Direct Prometheus scrape config. If you don’t use the Operator CRDs, point your collector’s prometheus receiver (or scraper of choice) at HoneyHive’s services directly. Each service is reachable at http://<service>.<namespace>.svc.cluster.local:9091/metrics (see the table above for service names). Kubernetes service discovery filtered by the app.kubernetes.io/* labels HoneyHive sets on each service works equally well.
hh_* metrics will appear in whatever backend your collector
forwards to: OTLP, Prometheus remote-write, Datadog, etc.
Headline metrics
Each service emits many internal metrics; the tables below list the handful most useful for dashboards and alerts. The full registry for each service is linked at the end of its section if you want to drill in. All series carry a service_name label. Histograms expose _bucket, _count,
and _sum series in the usual Prometheus convention; OTLP receivers translate
these to OTel histogram data points automatically.
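To make the _bucket convention concrete, the sketch below estimates a quantile from cumulative bucket counts using linear interpolation inside the matching bucket, which is roughly how PromQL's histogram_quantile treats classic histograms. The function name and the bucket bounds in the usage note are illustrative assumptions, not HoneyHive's actual bucket layout.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs.

    `buckets` mirrors the _bucket series: each count includes all smaller
    buckets, and the last upper bound is +Inf.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank and count > prev_count:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation between the bucket's lower and upper bounds.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

For example, with hypothetical buckets `[(0.005, 0), (0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]`, the 95th percentile lands halfway through the 0.5s to 1.0s bucket. The estimate is only as precise as the bucket boundaries, so pick SLO thresholds that line up with a boundary where you can.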
HTTP requests (every service)
Every HoneyHive HTTP service emits the same two HTTP-level metrics via shared middleware. These are the foundation for request-rate, error-rate, and latency-SLO alerts on any endpoint:
| Metric | Type | Labels | Notes |
|---|---|---|---|
| hh_http_requests_total | Counter | method, route, status_code, service_name | Use status_code=~"5.." for error-rate alerts. |
| hh_http_request_duration_seconds | Histogram | method, route, service_name | Buckets: 5ms–10s. Use for latency SLOs. |
The route label is the Express route pattern (e.g. /v1/sessions/:session_id
or /v1/alerts/:id), not the raw URL: dynamic segments stay parameterized,
keeping cardinality bounded. Requests that don’t match any route are recorded as
route="unmatched".
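As an illustration of how hh_http_requests_total turns into an error rate, the sketch below compares two scrapes of the counter and computes the 5xx share for one route. It is a simplified model of what a rate-based query does in your backend; the function name and the snapshot shape (a dict keyed by route and status_code) are assumptions made for the example.

```python
import re

# One scrape of hh_http_requests_total, reduced to (route, status_code) -> value.
Snapshot = dict[tuple[str, str], float]


def error_ratio(before: Snapshot, after: Snapshot, route: str) -> float:
    """Share of 5xx responses for `route` between two counter snapshots."""
    total = 0.0
    errors = 0.0
    for (sample_route, status), value in after.items():
        if sample_route != route:
            continue
        delta = value - before.get((sample_route, status), 0.0)
        if delta < 0:
            # The counter went backwards (service restart): count from zero.
            delta = value
        total += delta
        if re.fullmatch(r"5..", status):
            errors += delta
    return errors / total if total else 0.0
```

Because route holds the parameterized pattern rather than the raw URL, a single key such as /v1/sessions/:session_id covers every session, so the snapshot stays small no matter how many distinct URLs were requested.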
Signup, org setup, and admin operations (cp-backend-service)
For account and workspace setup activity, filter hh_http_* by the following
route values (under service_name="cp_backend_service"):
| Operation | Method | route value |
|---|---|---|
| Signup / session creation | POST | /auth/session |
| User onboarding completion | POST | /v1/user/onboard |
| Scope creation (org / workspace / project) | POST | /v1/scopetree/ |
| Scope provision (data-plane bring-up) | POST | /v1/scopetree/provision |
| API key creation | POST | /v1/api_key/ |
| Events fetch (UI) | GET | /v1/events/ |
| Events index (UI search) | POST | /v1/events/search-ids |
Ingestion (dp-ingestion-service)
The ingestion service receives SDK events, buffers them, writes them to S3, and
publishes to NATS for downstream processing. Headline metrics:
| Metric | Type | Labels | Use for |
|---|---|---|---|
| hh_ingestion_events_processed_total | Counter | operation | Ingestion throughput by operation (create, update, session, batch). |
| hh_ingestion_events_errored_total | Counter | operation, error_class | Ingestion failure rate. |
| hh_ingestion_processing_duration_seconds | Histogram | operation | End-to-end ingestion latency. |
| hh_ingestion_writer_buffer_depth | Gauge | (none) | Backpressure signal: items waiting to be flushed. |
| hh_ingestion_s3_operations_total | Counter | operation, result | S3 write health. |
| hh_ingestion_nats_published_total | Counter | stream, result | Downstream publish health. |
Trace storage (cp-writer-service)
The writer service consumes from NATS and writes to ClickHouse. S3 writes happen
in the ingestion service, not here. Headline metrics:
| Metric | Type | Labels | Use for |
|---|---|---|---|
| hh_writer_nats_messages_consumed_total | Counter | result | Write-path throughput. |
| hh_writer_clickhouse_write_duration_seconds | Histogram | table | ClickHouse write latency by table. |
| hh_writer_clickhouse_errors_total | Counter | table, error_class | ClickHouse write failure rate. |
| hh_writer_buffer_depth_records | Gauge | table | Backpressure per ClickHouse table. |
| hh_writer_dlq_records_total | Counter | table | Records sent to dead-letter; non-zero means data loss to investigate. |
Evaluation pipeline (dp-evaluation-service)
The evaluation service pulls jobs off a NATS work queue and processes them.
| Metric | Type | Labels | Use for |
|---|---|---|---|
| hh_dp_evaluation_nats_consumer_num_pending | Gauge | stream, consumer | Queue depth: jobs waiting to be picked up. |
| hh_dp_evaluation_nats_consumer_num_ack_pending | Gauge | stream, consumer | In-flight jobs: picked up but not yet acked. |
| hh_dp_evaluation_jobs_completed_total | Counter | result | Job outcomes. result="success" = ack’d; result="failure" = nak’d for retry. Incremented per delivery attempt, not per unique job. |
LLM proxy (dp-llmproxy-service)
The LLM proxy exposes only the standard hh_http_* family. LiteLLM’s native
Prometheus instrumentation is intentionally disabled so provider keys and message
content never appear in metric labels.
Each upstream provider lives at a distinct route, so per-provider error rate
and latency fall out of hh_http_* filtered by route. Example: alert on proxy
5xx rate > 10% over 60 seconds:
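One way to express that condition is a PromQL ratio of 5xx request rate to total request rate, grouped by route. The sketch below builds such an expression as a string so the service name, window, and threshold can be adjusted. The helper name is hypothetical, and the service_name value dp_llmproxy_service is an assumption by analogy with cp_backend_service above; check the label on your own series before using it.

```python
def proxy_5xx_alert_expr(
    service_name: str = "dp_llmproxy_service",  # assumed label value, verify it
    window: str = "1m",
    threshold: float = 0.10,
) -> str:
    """Build a PromQL expression that is true when the 5xx share exceeds threshold."""
    selector = f'service_name="{service_name}"'
    errors = (
        "sum by (route) (rate(hh_http_requests_total"
        f'{{{selector}, status_code=~"5.."}}[{window}]))'
    )
    total = (
        "sum by (route) (rate(hh_http_requests_total"
        f"{{{selector}}}[{window}]))"
    )
    return f"{errors} / {total} > {threshold}"
```

The returned string can be pasted into a Prometheus alerting rule or an equivalent monitor in your backend. A 1m rate window needs at least two samples inside it, so it only works with a scrape interval of 30 seconds or less; widen the window if you scrape less often.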
Reach out to your HoneyHive support contact if you need help with dashboards, alert thresholds, or recipes for a specific backend.

