Keep your HoneyHive resources (evaluators, datasets, datapoints, experiment runs) checked into your repo as YAML or JSON, and apply them with the HoneyHive CLI. The CLI publishes a JSON Schema for every command, so the file format stays in lockstep with the public API. This guide shows how to lay out a .honeyhive/ directory, discover each resource’s schema, and roll out changes from the terminal or CI.
The CLI is @honeyhive/cli. See Install & Quickstart to install it, and the full command reference for every namespace.

Why config as code

  • Reviewable: changes to an evaluator’s Python code or an LLM prompt go through the same PR review as the rest of your app.
  • Reproducible: the resource definitions live next to the code they evaluate, pinned to a commit.
  • Portable: the same files apply to staging and production projects by swapping HH_API_URL and HH_API_KEY.
  • Agent-friendly: every command exposes its JSON Schema, so coding agents like Cursor and Claude Code can author and validate files without guessing.

Directory layout

A common convention is to keep one file per resource under .honeyhive/:
.honeyhive/
├── evaluators/
│   ├── keyword-check.yaml
│   └── relevance-llm.yaml
├── datasets/
│   └── qa-eval-set.yaml
└── datapoints/
    ├── q1.yaml
    └── q2.yaml
The directory names map one-to-one to CLI namespaces:
Folder         CLI namespace           API resource
evaluators/    honeyhive metrics       Metrics
datasets/      honeyhive datasets      Datasets
datapoints/    honeyhive datapoints    Datapoints
experiments/   honeyhive experiments   Experiments
“Evaluator” and “metric” are the same resource. The product UI and docs call them evaluators; the API and CLI call them metrics.

Discover the schema

Every CLI command that takes arguments supports two read-only flags:
  • --show-file-schema, which prints the JSON Schema for the full request object (the exact shape --filename accepts).
  • --show-argument-schema <flag-name>, which prints the JSON Schema for a single argument. Pass the kebab-case flag name without the leading --.
Both write pure JSON to stdout and never call the API.
# Schema for a new evaluator file
honeyhive metrics create --show-file-schema

# Schema for just the `criteria` field
honeyhive metrics create --show-argument-schema criteria
The output is plain JSON Schema. Pipe it to jq, save it to disk, or hand it to a coding agent so it can scaffold a valid file:
honeyhive metrics create --show-file-schema > .honeyhive/schemas/metric.schema.json
File schemas use the API field names (snake_case or camelCase, matching the public OpenAPI spec) rather than the CLI’s --kebab-case flag names. The CLI’s --filename flag passes the parsed file straight through to the API with no field translation.
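For example, a single field is addressed by its kebab-case flag name on the CLI even though it appears in snake_case inside the file. A quick sketch, assuming metrics create exposes a flag for the needs_ground_truth field shown later in this guide:
# kebab-case on the CLI...
honeyhive metrics create --show-argument-schema needs-ground-truth
# ...snake_case (needs_ground_truth) in the file you pass via --filename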

Define an evaluator

Python evaluators are functions that score events on the server. To define one as a file, capture the same fields you would set in the Evaluators UI:
.honeyhive/evaluators/keyword-check.yaml
name: keyword-check
type: PYTHON
return_type: boolean
needs_ground_truth: false
description: Checks whether the response mentions the word "honey".
criteria: |
  def keyword_check():
      return "honey" in outputs["content"].lower()
LLM evaluators follow the same pattern. The prompt goes in criteria, and the model is selected with model_provider and model_name. return_type is float for numeric scores, boolean for pass/fail, string for free-form, or categorical when paired with a categories list:
.honeyhive/evaluators/relevance-llm.yaml
name: relevance-llm
type: LLM
return_type: float
model_provider: openai
model_name: gpt-4o
sampling_percentage: 25
description: Rates how well the answer addresses the question, 1-5.
criteria: |
  [Instruction]
  Rate the assistant's answer for relevance to the question on a scale of 1 to 5.

  [Question]
  {{ inputs.question }}

  [Answer]
  {{ outputs.content }}

  [Evaluation]
  Rating: [[X]]
To see every field the API accepts (categories, thresholds, child metrics for composites, event filters, etc.), inspect the schema:
honeyhive metrics create --show-file-schema | jq '.properties | keys'

Apply a file

Pass --filename (or -f) to send the entire file as the request body. The CLI picks the parser from the file extension and accepts .yaml, .yml, .json, and .jsonc (comments and trailing commas are allowed in both .json and .jsonc).
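Because comments and trailing commas are accepted, an evaluator file can also be written as JSONC; a sketch equivalent to the YAML example above:
.honeyhive/evaluators/keyword-check.jsonc
{
  // Same fields as the YAML version; this comment and the trailing comma are allowed.
  "name": "keyword-check",
  "type": "PYTHON",
  "return_type": "boolean",
  "needs_ground_truth": false,
  "description": "Checks whether the response mentions the word \"honey\".",
  "criteria": "def keyword_check():\n    return \"honey\" in outputs[\"content\"].lower()\n",
}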
# Create the evaluator
honeyhive metrics create --filename .honeyhive/evaluators/keyword-check.yaml
# {
#   "inserted": true,
#   "metric_id": "01KRJB6SX9YA4J51NRFT6M27RC"
# }
The response shape varies per namespace (e.g. metrics create returns a flat {inserted, metric_id}, while datasets create returns {result: {insertedId}}). Run the command once and pipe through jq to see the exact shape, or consult the CLI reference for that namespace.

The response includes the assigned metric_id. The cleanest way to make subsequent applies idempotent is to write the ID back into the YAML itself as a top-level metric_id field, since metrics update reads it directly from the file body. A .honeyhive/state.json lockfile is a reasonable alternative when the YAML is generated by another tool and shouldn’t be mutated.

To update an existing evaluator, include metric_id in the file and call metrics update:
.honeyhive/evaluators/keyword-check.yaml
metric_id: 01KRJB6SX9YA4J51NRFT6M27RC
name: keyword-check
type: PYTHON
return_type: boolean
description: Checks whether the response mentions the word "honey" or "hive".
criteria: |
  def keyword_check():
      content = outputs["content"].lower()
      return "honey" in content or "hive" in content
honeyhive metrics update --filename .honeyhive/evaluators/keyword-check.yaml
The same --filename flow works for datasets create, datasets update, datapoints create, datapoints update, and every other namespace. Run --show-file-schema on the command you want to use to see the exact shape.
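For example, to list the top-level fields a dataset file accepts:
honeyhive datasets create --show-file-schema | jq '.properties | keys'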

Define a dataset

A dataset definition only needs a name, an optional description, and the datapoint IDs it should include:
.honeyhive/datasets/qa-eval-set.yaml
name: qa-eval-set
description: Question/answer pairs used for the relevance evaluator.
datapoints: []
Apply it the same way:
DATASET_ID=$(honeyhive datasets create \
  --filename .honeyhive/datasets/qa-eval-set.yaml \
  | jq -r '.result.insertedId')
For each datapoint, define inputs and ground truth and link it to the dataset on create:
.honeyhive/datapoints/q1.yaml
inputs:
  question: What is the capital of France?
ground_truth:
  answer: Paris
metadata:
  external_id: q1
linked_datasets:
  - "01KRJB7WD8E2H4M9X3K2Y7Q1A5"  # paste the dataset_id printed by the create above
honeyhive datapoints create --filename .honeyhive/datapoints/q1.yaml
Both sides of the dataset/datapoint relationship are writable: datasets create accepts an initial datapoints: [<id>...] array, and datapoints create accepts linked_datasets: [<id>...]. For the config-as-code flow, link from the datapoint side as shown above: it keeps the dataset YAML stable across runs and follows the natural creation order when starting from scratch (create the empty dataset first, then the datapoints that link to it).
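Putting the two steps together, a small loop can stamp the captured dataset ID into each datapoint file before creating it. A sketch, assuming yq v4 and the DATASET_ID variable captured above:
# Link every datapoint file to the dataset created earlier
for file in .honeyhive/datapoints/*.yaml; do
  base=$(mktemp)
  tmp="$base.yaml"   # extension must match the parser
  yq ".linked_datasets = [\"$DATASET_ID\"]" "$file" > "$tmp"
  honeyhive datapoints create --filename "$tmp"
  rm -f "$base" "$tmp"
done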
This page covers the shape of your dataset as code. For keeping the contents of a dataset in sync with an external source (S3, a database, an internal tool), see Sync from External Sources. The two patterns combine well: define the dataset metadata as code, then populate datapoints from an external system.

A simple sync script

The CLI does not ship with a single honeyhive apply command yet, so most teams wrap their .honeyhive/ directory in a small script. The script below upserts every evaluator under .honeyhive/evaluators/, tracking IDs in a checked-in .honeyhive/state.json lockfile. Commit the lockfile alongside your YAML so collaborators and CI share the same IDs; the alternative is to inline metric_id into each YAML on create and skip the lockfile entirely.
sync-evaluators.sh
#!/usr/bin/env bash
# Usage: HH_API_KEY=... bash sync-evaluators.sh
set -euo pipefail

STATE_FILE=".honeyhive/state.json"
[ -f "$STATE_FILE" ] || echo '{"evaluators": {}}' > "$STATE_FILE"

for file in .honeyhive/evaluators/*.yaml; do
  name=$(yq '.name' "$file")
  existing_id=$(jq -r --arg n "$name" '.evaluators[$n] // ""' "$STATE_FILE")

  if [ -n "$existing_id" ]; then
    # Update in place; the CLI reads metric_id from the file body.
    # The .yaml suffix is appended manually so this works on macOS
    # (BSD mktemp has no --suffix); clean up both temp files.
    base=$(mktemp)
    tmp="$base.yaml"
    yq ". + {\"metric_id\": \"$existing_id\"}" "$file" > "$tmp"
    honeyhive metrics update --filename "$tmp" > /dev/null
    rm -f "$base" "$tmp"
    echo "Updated $name ($existing_id)"
  else
    new_id=$(honeyhive metrics create --filename "$file" | jq -r '.metric_id')
    tmp=$(mktemp)
    jq --arg n "$name" --arg id "$new_id" \
      '.evaluators[$n] = $id' "$STATE_FILE" > "$tmp"
    mv "$tmp" "$STATE_FILE"
    echo "Created $name ($new_id)"
  fi
done
The CLI requires the file extension to match the parser (.yaml, .yml, .json, or .jsonc), which is why the update branch appends .yaml to mktemp’s output before writing.
Use --verbose (or HH_VERBOSE=true) when debugging to log the resolved API URL and masked key for each invocation. This makes it obvious whether a CI job is hitting staging or production.
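For example, to check which environment a sync run resolves to before trusting its output:
# Logs the resolved API URL and a masked key for each CLI call
HH_VERBOSE=true bash sync-evaluators.sh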

Run from CI

Once .honeyhive/ is in version control, applying changes is a single CLI step. A minimal GitHub Actions job:
.github/workflows/honeyhive-sync.yml
name: Sync HoneyHive resources

on:
  push:
    branches: [main]
    paths: [".honeyhive/**"]

jobs:
  sync:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install HoneyHive CLI
        # Pin to a specific release so CI runs are reproducible.
        run: curl -fsSL https://github.com/honeyhiveai/honeyhive-cli/releases/download/v1.0.0/install.sh | sh
      - name: Apply evaluators
        env:
          HH_API_KEY: ${{ secrets.HH_API_KEY }}
        run: bash sync-evaluators.sh
For staging/production parity, pass different HH_API_KEY and HH_API_URL values per environment without changing any file under .honeyhive/.
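For example, a staging job can reuse the exact same step with different credentials; the secret and variable names below are illustrative:
      - name: Apply evaluators (staging)
        env:
          HH_API_KEY: ${{ secrets.HH_STAGING_API_KEY }}
          HH_API_URL: ${{ vars.HH_STAGING_API_URL }}
        run: bash sync-evaluators.sh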

Validate before applying

To validate that a file matches the API’s schema before pushing changes, dump the schema with --show-file-schema and run the file through a JSON Schema validator (ajv, check-jsonschema, etc.). For example, with check-jsonschema:
honeyhive metrics create --show-file-schema > /tmp/metric.schema.json
check-jsonschema --schemafile /tmp/metric.schema.json .honeyhive/evaluators/*.yaml
This catches typos, missing required fields, and invalid enum values locally, before the request reaches the API.
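The same pattern works for any namespace. For datasets, for example:
honeyhive datasets create --show-file-schema > /tmp/dataset.schema.json
check-jsonschema --schemafile /tmp/dataset.schema.json .honeyhive/datasets/*.yaml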

Use with coding agents

The schema introspection flags are designed for AI coding agents. An agent that wants to add a new evaluator can:
  1. Run honeyhive metrics create --show-file-schema to get the JSON Schema.
  2. Generate a YAML file that conforms to it, placed under .honeyhive/evaluators/.
  3. Apply it with honeyhive metrics create --filename ....
See AI Coding Agents for the pre-built HoneyHive Skills that bundle this workflow into agent-friendly slash commands.

HoneyHive CLI

Install the CLI and run your first commands.

CLI Command Reference

Every namespace, command, and flag in one place.

Sync Datasets from External Sources

Keep dataset contents in sync with S3, databases, or internal tools.

Python and LLM Evaluators

Background on the evaluator model the YAML files describe.