HoneyHive is the complete AI observability stack for tracing, evaluating, monitoring, and improving AI agents and applications.

Start Tracing

Instrument your first agent and capture traces in 5 minutes.
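
To make this concrete, here is a minimal sketch of what instrumentation can look like with the honeyhive Python SDK. It follows the quickstart pattern (an init call plus a trace decorator), but treat the exact argument names as assumptions and confirm them against the current SDK reference.

```python
import os
from honeyhive import HoneyHiveTracer, trace

# Initialize the tracer once at startup; spans are exported to your project.
HoneyHiveTracer.init(
    api_key=os.environ["HH_API_KEY"],  # assumed env var name
    project="my-first-agent",          # hypothetical project name
)

# Decorating a function records it as a span in the trace tree.
@trace
def answer(question: str) -> str:
    # ... call your LLM here ...
    return "42"

answer("What is the meaning of life?")
```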

Run Your First Evaluation

Set up an experiment and evaluate your agent programmatically.
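
As a hedged sketch of a programmatic experiment: the honeyhive SDK exposes an evaluate entry point, but the parameter names, dataset row shape, and evaluator signature below are assumptions, so check the experiments guide for the exact contract.

```python
from honeyhive import evaluate  # experiments entry point in the Python SDK

# The function under test: receives a dataset row's inputs, returns an output.
def my_agent(inputs):
    # Stand-in for a real LLM call using the row's inputs.
    return "4" if inputs["question"] == "2 + 2?" else "unknown"

# A simple evaluator; the signature HoneyHive expects is an assumption here.
def exact_match(output, inputs):
    return output == inputs.get("expected")

evaluate(
    function=my_agent,
    dataset=[{"question": "2 + 2?", "expected": "4"}],  # inline toy dataset
    evaluators=[exact_match],
    name="baseline-run",  # hypothetical experiment name
)
```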

The Workflow

HoneyHive follows an Evaluation-Driven Development (EDD) workflow, analogous to test-driven development (TDD) in software engineering, in which evaluation guides every stage of agent development.

1. Production: Observe and Evaluate

Instrument your application with distributed tracing to capture every interaction. Collect traces, user feedback, and quality metrics from production. Run online evals to surface edge cases at scale, and set up alerts to catch failures or metric drift.
Inspect every LLM call, tool invocation, and chain step in a structured execution log.
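
In code, attaching production signals to a live trace can look like the sketch below. The enrich_session helper follows the SDK's enrichment pattern and assumes tracing is already initialized; its name and keyword arguments are assumptions to verify against the SDK docs.

```python
from honeyhive import enrich_session  # assumed enrichment helper; verify name

# After a user rates a response, attach the signal to the active session/trace
# so it shows up alongside the spans in HoneyHive.
enrich_session(
    feedback={"rating": "thumbs_up"},  # free-form user-feedback payload
    metrics={"latency_ms": 420},       # custom quality or latency metrics
    metadata={"user_tier": "pro"},     # useful for filtering traces later
)
```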

2. Testing: Curate Datasets & Run Experiments

Turn failing production traces into curated test datasets. Run experiments to measure the impact of your changes, track regressions over time, and gate releases in CI.
Compare prompts, models, or configurations side-by-side to see which changes improve performance.
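
A framework-neutral sketch of an offline experiment with a release gate in CI; the toy dataset, the run_variant stand-in, and the pass-rate threshold are all illustrative rather than HoneyHive API.

```python
# Toy curated dataset distilled from failing production traces.
dataset = [
    {"question": "2 + 2?", "expected": "4"},
    {"question": "Capital of France?", "expected": "Paris"},
]

def run_variant(variant: str, question: str) -> str:
    """Stand-in for calling your agent with a given prompt/model config."""
    return {"2 + 2?": "4", "Capital of France?": "Paris"}[question]

# Score each variant on the same dataset so results are comparable.
for variant in ("prompt-v1", "prompt-v2"):
    passed = sum(
        run_variant(variant, row["question"]) == row["expected"]
        for row in dataset
    )
    rate = passed / len(dataset)
    print(f"{variant}: {rate:.0%} passed")
    # Gate the release in CI: fail the build on regression.
    assert rate >= 0.9, f"{variant} regressed below the 90% bar"
```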

3. Development: Iterate on Prompts

Use evaluation results to guide changes. Iterate on prompts, test new models, and optimize your application based on what the data shows. Validate changes against curated datasets before deploying.
Test prompt variations and model configurations with instant feedback before committing to code.

4. Repeat: Continuous Improvement

Deploy improvements and continue the cycle. Each iteration builds on production data, creating a flywheel of improvement that makes your AI systems more reliable over time.

Platform Capabilities

Core features across the development lifecycle:

Tracing

Capture and visualize every step of your AI application with distributed tracing.

Experiments & Datasets

Test changes with offline experiments and curated datasets before deploying.

Monitoring & Alerting

Track metrics with dashboards and get alerts when quality degrades.

Online Evaluations

Run automated evals on production traces to catch issues early.

Annotation Queues

Collect expert feedback and turn it into labeled datasets.

Prompt Management

Version and manage prompts across UI and code.

Open Standards, Open Ecosystem

HoneyHive is built on OpenTelemetry, so it works across models, frameworks, and runtimes with no vendor lock-in.

Model Agnostic

Works with OpenAI, Anthropic, Bedrock, open-source models, and more.

Framework Agnostic

Native support for LangChain, CrewAI, Google ADK, AWS Strands, and more.

Runtime Agnostic

Trace any runtime: AWS Lambda, Kubernetes, Bedrock AgentCore, and more.

Bring Your Own Instrumentor

HoneyHive supports the official OTel GenAI semantic conventions, as well as OpenLLMetry and OpenInference.
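
Because ingestion is OpenTelemetry-native, any instrumentor that emits these conventions can be pointed at HoneyHive. The sketch below uses the standard OTel Python SDK with attribute names from the official GenAI semantic conventions; the OTLP endpoint and auth header are placeholders, not HoneyHive's real values, so substitute the ingest URL from the docs.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Export spans over OTLP/HTTP. Endpoint and header values are placeholders.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            endpoint="https://<honeyhive-otlp-endpoint>/v1/traces",  # placeholder
            headers={"authorization": "Bearer <HH_API_KEY>"},        # placeholder
        )
    )
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("my-agent")

# Attribute names below follow the official OTel GenAI semantic conventions.
with tracer.start_as_current_span("chat gpt-4o") as span:
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    span.set_attribute("gen_ai.usage.input_tokens", 128)
    span.set_attribute("gen_ai.usage.output_tokens", 256)
```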

Hosting Options

Multi-Tenant SaaS

Fully managed. Get started in minutes.

Dedicated Cloud

Single-tenant environment managed by our team.

Self-Hosted

Deploy in your VPC for full control and compliance.

Additional Resources

API Reference

REST API documentation for custom integrations.

SDK Documentation

Python SDK guides for advanced use cases.

Invite Your Team

Add teammates and configure role-based access control.

Integrations

Connect with OpenAI, Anthropic, LangChain, and more.