Organization Templates let platform teams define a standard set of evaluators and monitoring charts that automatically populate across new projects. Instead of each team configuring observability from scratch, every new project starts with the resources your organization has standardized on.
Organization Templates are only available on the Enterprise plan.

How templates work

Templates are configured via a YAML manifest in Settings > Organization > Templates. The manifest has two parts:
  1. Template definitions - Reusable blueprints for evaluators and charts
  2. Project templates - Which definitions to apply when a new project is created
When someone creates a new project, HoneyHive reads the manifest and creates the listed resources automatically.

Manifest structure

```yaml
template_definitions:
  metric:
    # ... evaluator definitions
  chart:
    # ... chart definitions

config:
  merge_strategy: "replace" # or "merge"

project_templates:
  metric:
    - # evaluator names from definitions above
  chart:
    - # chart names from definitions above
```

Evaluator definitions

Define evaluators under template_definitions.metric. HoneyHive supports four evaluator types:
| Type | YAML `type` value | Key fields | Use case |
|---|---|---|---|
| Human | `HUMAN` | `criteria`, `scale` | Domain expert annotation |
| Python | `CUSTOM` | `code_snippet` | Programmatic checks, format validation |
| LLM | `MODEL` | `prompt` | Qualitative assessments via AI |
| Composite | `COMPOSITE` | `aggregation_function`, `details` | Aggregate multiple evaluator scores |

Common fields

Every evaluator definition supports these fields:
| Field | Description |
|---|---|
| `type` | Evaluator type: `HUMAN`, `CUSTOM`, `MODEL`, or `COMPOSITE` |
| `description` | Short description of what the evaluator measures |
| `enabled_in_prod` | Whether the evaluator runs on production traces |
| `needs_ground_truth` | Whether ground truth data is required |
| `return_type` | Data type: `string`, `float`, or `boolean` |
| `threshold` | Passing range with `min` and `max`, or `null` for no threshold |
| `filters.filterArray` | Event filters that control which events trigger this evaluator |
| `sampling_percentage` | Percentage of production events to evaluate (1-100) |

Type-specific fields

Human (`HUMAN`)

| Field | Description |
|---|---|
| `criteria` | The evaluation question shown to reviewers |
| `scale` | Numeric scale upper bound (e.g., `5` for a 1-5 rating); `null` for non-numeric types |
```yaml
Rating:
  type: "HUMAN"
  description: "How would you rate this overall?"
  enabled_in_prod: true
  needs_ground_truth: false
  return_type: "float"
  threshold:
    min: 3
    max: 5
  filters:
    filterArray: []
  sampling_percentage: 100
  criteria: "How would you rate this overall?"
  scale: 5
```
Python (`CUSTOM`)

| Field | Description |
|---|---|
| `code_snippet` | Python function that receives an event dict and returns a value |

The function has access to Python's standard library and packages including pandas, scikit-learn, jsonschema, sqlglot, and requests.
```yaml
Format Check:
  type: "CUSTOM"
  description: "Checks output is valid JSON"
  enabled_in_prod: true
  needs_ground_truth: false
  return_type: "boolean"
  threshold: null
  filters:
    filterArray: []
  sampling_percentage: 100
  code_snippet: |
    import json
    def evaluate(event):
        try:
            json.loads(event["outputs"]["content"])
            return True
        except (json.JSONDecodeError, KeyError):
            return False
    result = evaluate(event)
```
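A snippet like this can be sanity-checked locally before it goes into the manifest. The sketch below runs the same function against hand-built event dicts; the event shape is an assumption based on the example above, so verify it against your actual traces.

```python
import json

def evaluate(event):
    # Same logic as the Format Check snippet: valid JSON output -> True.
    try:
        json.loads(event["outputs"]["content"])
        return True
    except (json.JSONDecodeError, KeyError):
        return False

# Hand-built events mimicking the assumed trace shape.
valid_event = {"outputs": {"content": '{"answer": "yes"}'}}
broken_event = {"outputs": {"content": "not json"}}

print(evaluate(valid_event))   # True
print(evaluate(broken_event))  # False
```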
LLM (`MODEL`)

| Field | Description |
|---|---|
| `prompt` | Evaluation prompt using `{{ }}` syntax to reference event properties |
```yaml
Answer Relevance:
  type: "MODEL"
  description: "Rates how relevant the answer is to the question"
  enabled_in_prod: true
  needs_ground_truth: false
  return_type: "float"
  threshold:
    min: 3
    max: 5
  filters:
    filterArray: []
  sampling_percentage: 25
  prompt: |
    Evaluate the AI assistant's answer for relevance to the question.
    Rate on a scale of 1 to 5.

    [Question]
    {{ inputs.question }}

    [Answer]
    {{ outputs.content }}

    Rating: [[X]]
```
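The `{{ }}` references are resolved server-side against the event's properties. As an illustration only, the substitution behaves roughly like the sketch below; the `render` helper and the event shape are assumptions for this example, not HoneyHive APIs.

```python
import re

def render(prompt, event):
    """Resolve {{ dotted.path }} placeholders against a nested event dict."""
    def lookup(match):
        value = event
        for key in match.group(1).split("."):
            value = value[key]
        return str(value)
    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", lookup, prompt)

event = {
    "inputs": {"question": "What is the capital of France?"},
    "outputs": {"content": "Paris."},
}
print(render("[Question]\n{{ inputs.question }}\n[Answer]\n{{ outputs.content }}", event))
```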
Composite (`COMPOSITE`)

| Field | Description |
|---|---|
| `aggregation_function` | How to combine scores: `weighted_average`, `weighted_sum`, `min`, `max`, `hierarchical_highest_true` |
| `details` | List of child evaluators with `metric_name` and `weight` |
```yaml
Overall Quality:
  type: "COMPOSITE"
  description: "Weighted average of relevance and format checks"
  enabled_in_prod: true
  needs_ground_truth: false
  return_type: "float"
  threshold:
    min: 3
    max: 5
  filters:
    filterArray: []
  sampling_percentage: 100
  aggregation_function: "weighted_average"
  details:
    - metric_name: "Answer Relevance"
      weight: 2
    - metric_name: "Format Check"
      weight: 1
```
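For intuition, a `weighted_average` over the two children above works out as follows. This is a hand computation with example child scores, not the server implementation:

```python
# Example child scores; the weights mirror the details list above.
scores = {"Answer Relevance": 4.0, "Format Check": 1.0}
weights = {"Answer Relevance": 2, "Format Check": 1}

total_weight = sum(weights.values())
weighted_avg = sum(scores[name] * w for name, w in weights.items()) / total_weight
print(weighted_avg)  # (4.0 * 2 + 1.0 * 1) / 3 = 3.0
```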

Chart definitions

Define charts under template_definitions.chart. Each chart specifies what to measure, how to aggregate, and how to filter.
| Field | Description |
|---|---|
| `metric` | What to measure: `count`, `duration`, or a dotted path like `metadata.total_tokens` |
| `func` | Aggregation: `sum`, `avg`, `cumsum`, `min`, `max`, `p50`, `p95`, `p99` |
| `bucketing` | Time bucket: `minute`, `hour`, `day`, `week`, `month` |
| `dateRange.relative` | Default time range: `1d`, `7d`, `30d` |
| `groupBy` | Optional. Group results by a field (e.g., `event_name`) |
| `query` | Filters with `field`, `value`, `type`, and `operator` (`is`, `is not`, `contains`, `exists`) |
```yaml
"Daily Session Count":
  metric: "count"
  func: "sum"
  bucketing: "day"
  dateRange:
    relative: "7d"
  query:
    - field: "event_type"
      value: "session"
      type: "string"
      operator: "is"

"Average LLM Call Duration":
  metric: "duration"
  func: "avg"
  bucketing: "hour"
  dateRange:
    relative: "7d"
  query:
    - field: "event_type"
      value: "model"
      type: "string"
      operator: "is"

"Cumulative Total Tokens Usage":
  metric: "metadata.total_tokens"
  func: "cumsum"
  bucketing: "day"
  dateRange:
    relative: "7d"
  query:
    - field: "event_type"
      value: "model"
      type: "string"
      operator: "is"

"Average Grouped Event Duration":
  metric: "duration"
  func: "avg"
  groupBy: "event_name"
  bucketing: "hour"
  dateRange:
    relative: "7d"
  query:
    - field: "event_type"
      value: "session"
      type: "string"
      operator: "is not"
```

Project templates

The project_templates section lists which definitions are applied when a project is created. Reference definitions by name:
```yaml
project_templates:
  metric:
    - Rating
    - Format Check
    - Answer Relevance
  chart:
    - "Daily Session Count"
    - "Average LLM Call Duration"
    - "Cumulative Total Tokens Usage"
```
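Putting the pieces together, a minimal complete manifest reuses the names defined earlier. The definitions are abbreviated here for space; each would carry the full set of fields shown in the sections above.

```yaml
template_definitions:
  metric:
    Format Check:
      type: "CUSTOM"
      # ... remaining evaluator fields as defined above
  chart:
    "Daily Session Count":
      metric: "count"
      # ... remaining chart fields as defined above

config:
  merge_strategy: "replace"

project_templates:
  metric:
    - Format Check
  chart:
    - "Daily Session Count"
```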