HoneyHive Docs

HoneyHive exposes Prometheus-format metrics on every application service. This page documents what’s exposed, how to make it discoverable by your existing collection stack, and the headline metrics for each functional area.

HoneyHive ships an internal OpenTelemetry Collector and a kube-prometheus-stack that handle in-cluster trace plumbing and bundled dashboards (see Operations Guide). Those components are sized for internal use and aren’t configured to forward hh_* metrics to your observability backend. Instead, the expected integration model is that your platform team’s existing metrics stack (an OpenTelemetry Collector, Prometheus, vmagent, or similar) scrapes HoneyHive’s services and forwards the metrics to wherever you collect telemetry today.

What’s exposed

Every HoneyHive application service exposes Prometheus text-format metrics at:

http://<service>.<namespace>.svc.cluster.local:9091/metrics

The five services most customers monitor:

| Service | Language | Covers | Metric prefix |
| --- | --- | --- | --- |
| `cp-backend-service` | TypeScript | Signup, org/workspace creation, API key creation, UI queries | `hh_http_*` (filter by `route` label) |
| `cp-writer-service` | Go | Trace storage: NATS to ClickHouse writes, DLQ | `hh_writer_*` |
| `dp-ingestion-service` | Go | SDK ingestion: events, batches, S3, NATS publish | `hh_ingestion_*` |
| `dp-evaluation-service` | TypeScript | Evaluation pipeline: queue depth, job outcomes, annotation lookups | `hh_dp_evaluation_*`, `hh_http_*` |
| `dp-llmproxy-service` | Python | LLM proxy: error rate, upstream latency | `hh_http_*` |

All five services carry a service_name label on every metric they emit.

Enable ServiceMonitor discovery

The Helm chart ships ServiceMonitor templates for each application service. They are gated by a single flag, so clusters that don’t have the Prometheus Operator CRDs installed are unaffected. In your environment’s values.yaml (for both the control plane and data plane charts), set:

serviceMonitor:
  enabled: true
  # Namespace where your scraper looks for ServiceMonitors.
  # Typically matches the namespace running kube-prometheus-stack
  # or your OpenTelemetry Collector.
  namespace: monitoring
  # Labels your scraper's serviceMonitorSelector matches on.
  # For kube-prometheus-stack this is usually `release: <helm-release-name>`.
  labels:
    release: monitoring
  interval: 30s
  scrapeTimeout: 10s

Then run helm upgrade as usual and confirm that the resources rendered:

kubectl get servicemonitor -n monitoring
# NAME                       AGE
# cp-backend-service         1m
# cp-controller-service      1m
# cp-notification-service    1m
# cp-writer-service          1m
# dp-backend-service         1m
# dp-evaluation-service      1m
# dp-ingestion-service       1m
# dp-llmproxy-service        1m
# dp-pythonmetric-service    1m

The five services listed in the table above are the ones most customers monitor; the others (controller, notification, dp-backend, pythonmetric) also expose /metrics endpoints and are useful for internal health visibility.

ServiceMonitor is a CRD from the Prometheus Operator. Most Prometheus-compatible scrapers, including vmagent and the OpenTelemetry Collector’s Prometheus receiver (via the target_allocator), can discover scrape targets from ServiceMonitors. If your stack does not use the Prometheus Operator CRDs, leave the flag off and scrape the :9091/metrics endpoints directly using your scraper’s native service discovery.
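
On the scraper side, the selector has to match the labels set under serviceMonitor.labels in the HoneyHive values. As a minimal sketch, assuming kube-prometheus-stack is installed as a Helm release and that the label value `release: monitoring` from the values example above is the one in use (both are assumptions about your environment, not requirements), the corresponding Prometheus values might look like:

```yaml
# kube-prometheus-stack values (sketch; adjust to your own release).
prometheus:
  prometheusSpec:
    # Select ServiceMonitors that carry the label set under
    # serviceMonitor.labels in the HoneyHive values.
    serviceMonitorSelector:
      matchLabels:
        release: monitoring   # assumed label value
```

If the selector and the labels disagree, the ServiceMonitors still render but no scrape targets appear, so this is the first thing to check when hh_* series are missing.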

Pointing your collector at HoneyHive

Two integration shapes cover most environments:

Prometheus-Operator CRDs (recommended). If your OpenTelemetry Collector, Prometheus, or vmagent uses the Prometheus Operator’s ServiceMonitor CRD for scrape-target discovery, the resources you enabled above are picked up automatically once your scraper’s serviceMonitorSelector matches the labels in the HoneyHive values. For OpenTelemetry Collector, this is the target_allocator with prometheusCR.enabled: true.
Direct Prometheus scrape config. If you don’t use the Operator CRDs, point your collector’s prometheus receiver (or scraper of choice) at HoneyHive’s services directly. Each service is reachable at http://<service>.<namespace>.svc.cluster.local:9091/metrics (see the table above for service names). Kubernetes service discovery filtered by the app.kubernetes.io/* labels HoneyHive sets on each service works equally well.

Either way, the hh_* metrics will appear in whatever backend your collector forwards to: OTLP, Prometheus remote-write, Datadog, etc.
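
For the direct-scrape shape, a minimal Prometheus scrape configuration might look like the sketch below. It assumes the services run in namespaces named control-plane and data-plane and that each pod listens on port 9091 (so the discovered endpoint address ends in :9091); the job name is hypothetical. Adjust all three to your install.

```yaml
scrape_configs:
  - job_name: honeyhive            # hypothetical job name
    metrics_path: /metrics
    scrape_interval: 30s
    scrape_timeout: 10s
    kubernetes_sd_configs:
      - role: endpoints            # one target per pod behind each Service
        namespaces:
          names:
            - control-plane        # assumed namespace
            - data-plane           # assumed namespace
    relabel_configs:
      # Keep only endpoints on the metrics port (assumes the pod port is 9091).
      - source_labels: [__address__]
        regex: '.+:9091'
        action: keep
      # Keep only the five services from the table above.
      - source_labels: [__meta_kubernetes_service_name]
        regex: 'cp-backend-service|cp-writer-service|dp-ingestion-service|dp-evaluation-service|dp-llmproxy-service'
        action: keep
```

Endpoint-level discovery is used here rather than a static list of Service DNS names because a Service address reaches only one pod per scrape when a service runs several replicas. With the OpenTelemetry Collector, the same scrape_configs block goes under the prometheus receiver’s config key.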

Headline metrics

Each service emits many internal metrics; the tables below list the handful most useful for dashboards and alerts. Where a service exports additional metrics, its section ends with a command that lists the full registry if you want to drill in. All series carry a service_name label. Histograms expose _bucket, _count, and _sum series in the usual Prometheus convention; the OpenTelemetry Collector’s Prometheus receiver translates these to OTel histogram data points automatically.

HTTP requests (every service)

Every HoneyHive HTTP service emits the same two HTTP-level metrics via shared middleware. These are the foundation for request-rate, error-rate, and latency-SLO alerts on any endpoint:

| Metric | Type | Labels | Notes |
| --- | --- | --- | --- |
| `hh_http_requests_total` | Counter | `method`, `route`, `status_code`, `service_name` | Use `status_code=~"5.."` for error-rate alerts. |
| `hh_http_request_duration_seconds` | Histogram | `method`, `route`, `service_name` | Buckets: 5ms–10s. Use for latency SLOs. |
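
For latency SLOs, the histogram is usually reduced to a quantile per service and route. A minimal sketch as a Prometheus recording rule follows; the group name, the recorded series name, and the choice of p99 over a 5-minute window are illustrative assumptions, not part of HoneyHive’s configuration.

```yaml
groups:
  - name: honeyhive-http-latency          # hypothetical group name
    rules:
      # p99 request latency per service and route over the last 5 minutes.
      - record: service_route:hh_http_request_duration_seconds:p99_5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, service_name, route) (
              rate(hh_http_request_duration_seconds_bucket[5m])
            )
          )
```

Because the largest finite bucket is 10s, quantiles that fall in the overflow bucket are reported as 10s rather than their true value.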

The route label is the Express route pattern (e.g. /v1/sessions/:session_id or /v1/alerts/:id), not the raw URL: dynamic segments stay parameterized, keeping cardinality bounded. Requests that don’t match any route are recorded as route="unmatched".

Signup, org setup, and admin operations (`cp-backend-service`)

For account and workspace setup activity, filter hh_http_* by the following route values (under service_name="cp_backend_service"):

| Operation | Method | `route` value |
| --- | --- | --- |
| Signup / session creation | `POST` | `/auth/session` |
| User onboarding completion | `POST` | `/v1/user/onboard` |
| Scope creation (org / workspace / project) | `POST` | `/v1/scopetree/` |
| Scope provision (data-plane bring-up) | `POST` | `/v1/scopetree/provision` |
| API key creation | `POST` | `/v1/api_key/` |
| Events fetch (UI) | `GET` | `/v1/events/` |
| Events index (UI search) | `POST` | `/v1/events/search-ids` |

Example: alert on signup error rate > 1% over 5 minutes:

sum(rate(hh_http_requests_total{service_name="cp_backend_service",route="/auth/session",status_code=~"5.."}[5m]))
  /
sum(rate(hh_http_requests_total{service_name="cp_backend_service",route="/auth/session"}[5m]))
> 0.01
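
To load the expression above into Prometheus, it can be wrapped in an alerting rule. This is a sketch only: the group name, alert name, `for` duration, severity label, and annotation text are assumptions to adapt to your own alerting conventions.

```yaml
groups:
  - name: honeyhive-signup                  # hypothetical group name
    rules:
      - alert: HoneyHiveSignupErrorRateHigh # hypothetical alert name
        expr: |
          sum(rate(hh_http_requests_total{service_name="cp_backend_service",route="/auth/session",status_code=~"5.."}[5m]))
            /
          sum(rate(hh_http_requests_total{service_name="cp_backend_service",route="/auth/session"}[5m]))
          > 0.01
        for: 5m                             # assumed hold time to reduce flapping
        labels:
          severity: warning                 # assumed severity scheme
        annotations:
          summary: Signup 5xx error rate above 1% over 5 minutes
```

When there is no signup traffic in the window, the ratio has no value and the alert stays silent, which is usually the desired behavior for a ratio alert.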

Ingestion (`dp-ingestion-service`)

The ingestion service receives SDK events, buffers them, writes them to S3, and publishes to NATS for downstream processing. Headline metrics:

| Metric | Type | Labels | Use for |
| --- | --- | --- | --- |
| `hh_ingestion_events_processed_total` | Counter | `operation` | Ingestion throughput by operation (`create`, `update`, `session`, `batch`). |
| `hh_ingestion_events_errored_total` | Counter | `operation`, `error_class` | Ingestion failure rate. |
| `hh_ingestion_processing_duration_seconds` | Histogram | `operation` | End-to-end ingestion latency. |
| `hh_ingestion_writer_buffer_depth` | Gauge | (none) | Backpressure signal: items waiting to be flushed. |
| `hh_ingestion_s3_operations_total` | Counter | `operation`, `result` | S3 write health. |
| `hh_ingestion_nats_published_total` | Counter | `stream`, `result` | Downstream publish health. |

Additional metrics covering cache hit ratios, flush behavior, and per-operation S3 and NATS latency are also exported. To see the full list, port-forward to the service and curl its metrics endpoint:

kubectl port-forward -n data-plane svc/dp-ingestion-service 9091:9091
curl -s localhost:9091/metrics | grep '^hh_ingestion_' | grep -v '^#'
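
The headline ingestion metrics map onto two simple alerts: sustained errors and sustained backpressure. The rules below are a sketch; the alert names, the `for` durations, and in particular the buffer-depth threshold of 10000 are placeholder assumptions, since a sensible threshold depends on your traffic and flush settings.

```yaml
groups:
  - name: honeyhive-ingestion               # hypothetical group name
    rules:
      # Any sustained ingestion errors, broken out by operation and class.
      - alert: HoneyHiveIngestionErrors
        expr: |
          sum by (operation, error_class) (
            rate(hh_ingestion_events_errored_total[5m])
          ) > 0
        for: 10m                            # assumed hold time
      # Buffer has stayed above the threshold for the whole window.
      - alert: HoneyHiveIngestionBufferBacklog
        expr: |
          min_over_time(hh_ingestion_writer_buffer_depth[5m]) > 10000
        for: 5m                             # assumed hold time
```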

Trace storage (`cp-writer-service`)

The writer service consumes from NATS and writes to ClickHouse. S3 writes happen in the ingestion service, not here. Headline metrics:

| Metric | Type | Labels | Use for |
| --- | --- | --- | --- |
| `hh_writer_nats_messages_consumed_total` | Counter | `result` | Write-path throughput. |
| `hh_writer_clickhouse_write_duration_seconds` | Histogram | `table` | ClickHouse write latency by table. |
| `hh_writer_clickhouse_errors_total` | Counter | `table`, `error_class` | ClickHouse write failure rate. |
| `hh_writer_buffer_depth_records` | Gauge | `table` | Backpressure per ClickHouse table. |
| `hh_writer_dlq_records_total` | Counter | `table` | Records sent to dead-letter; any increase means possible data loss to investigate. |

Additional metrics covering NATS batching, buffer flush cadence, retries, and bisect-on-failure behavior are also exported. To see the full list:

kubectl port-forward -n control-plane svc/cp-writer-service 9091:9091
curl -s localhost:9091/metrics | grep '^hh_writer_' | grep -v '^#'
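
Because any dead-lettered record is worth investigating, the DLQ counter lends itself to an alert on any increase, alongside a recorded write-latency quantile for dashboards. The following is a sketch: the rule names, the 15-minute window, and the p99 choice are assumptions.

```yaml
groups:
  - name: honeyhive-writer                  # hypothetical group name
    rules:
      # Fire when any table has sent records to the dead-letter queue recently.
      - alert: HoneyHiveWriterDLQRecords
        expr: |
          sum by (table) (
            increase(hh_writer_dlq_records_total[15m])
          ) > 0
      # p99 ClickHouse write latency per table over the last 5 minutes.
      - record: table:hh_writer_clickhouse_write_duration_seconds:p99_5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, table) (
              rate(hh_writer_clickhouse_write_duration_seconds_bucket[5m])
            )
          )
```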

Evaluation pipeline (`dp-evaluation-service`)

The evaluation service pulls jobs off a NATS work queue and processes them.

| Metric | Type | Labels | Use for |
| --- | --- | --- | --- |
| `hh_dp_evaluation_nats_consumer_num_pending` | Gauge | `stream`, `consumer` | Queue depth: jobs waiting to be picked up. |
| `hh_dp_evaluation_nats_consumer_num_ack_pending` | Gauge | `stream`, `consumer` | In-flight jobs: picked up but not yet acked. |
| `hh_dp_evaluation_jobs_completed_total` | Counter | `result` | Job outcomes. `result="success"` = ack’d; `result="failure"` = nak’d for retry. Incremented per delivery attempt, not per unique job. |

Example: alert on evaluation queue backlog sustained above 1000 pending for 5 minutes:

min_over_time(
  hh_dp_evaluation_nats_consumer_num_pending[5m]
) > 1000

Example: alert on evaluation job failure rate > 5% over 5 minutes:

sum(rate(hh_dp_evaluation_jobs_completed_total{service_name="dp_evaluation_service",result="failure"}[5m]))
  /
sum(rate(hh_dp_evaluation_jobs_completed_total{service_name="dp_evaluation_service"}[5m]))
> 0.05

Additional annotation-queue lookup metrics (cache hit/miss, filter errors, lookup latency) are also exported. To see the full list:

kubectl port-forward -n data-plane svc/dp-evaluation-service 9091:9091
curl -s localhost:9091/metrics | grep '^hh_dp_evaluation_' | grep -v '^#'

LLM proxy (`dp-llmproxy-service`)

The LLM proxy exposes only the standard hh_http_* family. LiteLLM’s native Prometheus instrumentation is intentionally disabled so provider keys and message content never appear in metric labels. Each upstream provider lives at a distinct route, so per-provider error rate and latency fall out of hh_http_* filtered by route.

Example: alert on proxy 5xx rate > 10% over 60 seconds:

sum(rate(hh_http_requests_total{service_name="dp_llm_proxy_service",status_code=~"5.."}[1m]))
  /
sum(rate(hh_http_requests_total{service_name="dp_llm_proxy_service"}[1m]))
> 0.10
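
Since each upstream provider lives at a distinct route, grouping the same ratio by route gives a per-provider error rate. A sketch as a recording rule follows; the group name, the recorded series name, and the 5-minute window are assumptions.

```yaml
groups:
  - name: honeyhive-llmproxy                # hypothetical group name
    rules:
      # 5xx ratio per proxy route (one route per upstream provider).
      - record: route:hh_http_requests:error_ratio_5m
        expr: |
          sum by (route) (rate(hh_http_requests_total{service_name="dp_llm_proxy_service",status_code=~"5.."}[5m]))
            /
          sum by (route) (rate(hh_http_requests_total{service_name="dp_llm_proxy_service"}[5m]))
```

Routes with no 5xx responses in the window produce no series for this rule rather than a zero, so dashboards may need to treat missing values as healthy.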

Reach out to your HoneyHive support contact if you need help with dashboards, alert thresholds, or recipes for a specific backend.
