Organization Templates let platform teams define a standard set of evaluators and monitoring charts that automatically populate across new projects. Instead of each team configuring observability from scratch, every new project starts with the resources your organization has standardized on.
Organization Templates are only available on the Enterprise plan.
How templates work
Templates are configured via a YAML manifest in Settings > Organization > Templates. The manifest has two parts:

- Template definitions - Reusable blueprints for evaluators and charts
- Project templates - Which definitions to apply when a new project is created
Manifest structure
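For orientation, here is a minimal sketch of the manifest layout. The top-level keys follow the sections described below; the definition names and the metrics/charts keys inside the project template are illustrative assumptions, not a fixed schema.

```yaml
template_definitions:
  metric:
    # Evaluator blueprints, keyed by a name you choose (name is illustrative)
    answer-relevance:
      type: MODEL
      description: Rates how well the answer addresses the question
  chart:
    # Chart blueprints, keyed by a name you choose (name is illustrative)
    request-volume:
      metric: count
      func: sum
      bucketing: hour

project_templates:
  # Assumed shape: a template listing the definitions to apply to new projects
  default:
    metrics:
      - answer-relevance
    charts:
      - request-volume
```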
Evaluator definitions
Define evaluators under template_definitions.metric. HoneyHive supports four evaluator types:
| Type | YAML type value | Key fields | Use case |
|---|---|---|---|
| Human | HUMAN | criteria, scale | Domain expert annotation |
| Python | CUSTOM | code_snippet | Programmatic checks, format validation |
| LLM | MODEL | prompt | Qualitative assessments via AI |
| Composite | COMPOSITE | aggregation_function, details | Aggregate multiple evaluator scores |
Common fields
Every evaluator definition supports these fields:

| Field | Description |
|---|---|
| type | Evaluator type: HUMAN, CUSTOM, MODEL, or COMPOSITE |
| description | Short description of what the evaluator measures |
| enabled_in_prod | Whether the evaluator runs on production traces |
| needs_ground_truth | Whether ground truth data is required |
| return_type | Data type: string, float, or boolean |
| threshold | Passing range with min and max, or null for no threshold |
| filters.filterArray | Event filters that control which events trigger this evaluator |
| sampling_percentage | Percentage of production events to evaluate (1-100) |
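As an illustration, a hypothetical evaluator definition using these common fields might look like the following; the definition name and the empty filterArray are placeholders, and the exact filter shape may differ in your manifest.

```yaml
template_definitions:
  metric:
    response-completeness:        # hypothetical name
      type: MODEL
      description: Checks whether the response fully addresses the user question
      enabled_in_prod: true
      needs_ground_truth: false
      return_type: boolean
      threshold: null             # or a passing range with min and max
      filters:
        filterArray: []           # empty list: no event filters applied
      sampling_percentage: 10     # evaluate 10% of production events
      # ...plus type-specific fields (prompt, code_snippet, etc.) covered below
```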
Type-specific fields
Human evaluators
| Field | Description |
|---|---|
| criteria | The evaluation question shown to reviewers |
| scale | Numeric scale upper bound (e.g., 5 for 1-5 rating). null for non-numeric types. |
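A sketch of a human evaluator definition, nested under template_definitions.metric as described above; the name and wording are illustrative.

```yaml
helpfulness:                      # hypothetical name
  type: HUMAN
  description: Domain expert rating of response helpfulness
  return_type: float
  enabled_in_prod: false
  criteria: How helpful is this response for the end user?
  scale: 5                        # 1-5 rating; use null for non-numeric criteria
```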
Python evaluators
| Field | Description |
|---|---|
| code_snippet | Python function that receives an event dict and returns a value |
Python evaluators can use pandas, scikit-learn, jsonschema, sqlglot, and requests.
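A sketch of a Python evaluator; the definition name, the function name inside code_snippet, and the event dict fields it reads are assumptions for illustration.

```yaml
valid-json:                       # hypothetical name
  type: CUSTOM
  description: Checks that the model output parses as JSON
  return_type: boolean
  enabled_in_prod: true
  code_snippet: |
    import json

    def evaluate(event):
        # event field names here are illustrative, not a guaranteed schema
        try:
            json.loads(event["outputs"]["content"])
            return True
        except (KeyError, TypeError, ValueError):
            return False
```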
LLM evaluators
| Field | Description |
|---|---|
| prompt | Evaluation prompt using {{ }} syntax to reference event properties |
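A sketch of an LLM evaluator; the event properties referenced inside the {{ }} placeholders are assumptions for illustration.

```yaml
answer-relevance:                 # hypothetical name
  type: MODEL
  description: Rates how well the answer addresses the question
  return_type: float
  threshold:
    min: 3
    max: 5
  prompt: |
    Rate from 1 to 5 how well the response answers the question.
    Question: {{ inputs.question }}
    Response: {{ outputs.content }}
    Reply with a single number.
```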
Composite evaluators
| Field | Description |
|---|---|
| aggregation_function | How to combine scores: weighted_average, weighted_sum, min, max, hierarchical_highest_true |
| details | List of child evaluators with metric_name and weight |
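A sketch of a composite evaluator that combines the hypothetical evaluators from the earlier examples; the names and weights are illustrative.

```yaml
overall-quality:                  # hypothetical name
  type: COMPOSITE
  description: Weighted blend of relevance and format checks
  return_type: float
  aggregation_function: weighted_average
  details:
    - metric_name: answer-relevance
      weight: 0.7
    - metric_name: valid-json
      weight: 0.3
```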
Chart definitions
Define charts under template_definitions.chart. Each chart specifies what to measure, how to aggregate, and how to filter.
| Field | Description |
|---|---|
| metric | What to measure: count, duration, or a dotted path like metadata.total_tokens |
| func | Aggregation: sum, avg, cumsum, min, max, p50, p95, p99 |
| bucketing | Time bucket: minute, hour, day, week, month |
| dateRange.relative | Default time range: 1d, 7d, 30d |
| groupBy | Optional. Group results by a field (e.g., event_name) |
| query | Filters with field, value, type, and operator (is, is not, contains, exists) |
Example chart definitions
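For illustration, two hypothetical chart definitions; the chart names, filter values, and the list shape of query are assumptions based on the fields above.

```yaml
template_definitions:
  chart:
    token-usage:                  # hypothetical name
      metric: metadata.total_tokens
      func: sum
      bucketing: day
      dateRange:
        relative: 7d
      groupBy: event_name
    p95-latency:                  # hypothetical name
      metric: duration
      func: p95
      bucketing: hour
      dateRange:
        relative: 1d
      query:
        - field: event_type       # illustrative filter
          value: model
          type: string
          operator: is
```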
Project templates
The project_templates section lists which definitions are applied when a project is created. Reference definitions by name:
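For example, a project template referencing the hypothetical definitions above might look like this; the template name and the metrics/charts keys are assumptions for illustration.

```yaml
project_templates:
  default:                        # hypothetical template name
    metrics:
      - answer-relevance
      - overall-quality
    charts:
      - token-usage
      - p95-latency
```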

