Online evaluations let you define domain-specific metrics that are computed asynchronously over your logs.

We encourage using Sampling to limit the costs associated with model-graded evaluations at production scale.
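To illustrate the idea behind sampling, here is a minimal sketch that evaluates only a fixed fraction of logs. It is not the platform's own implementation: the function name `should_evaluate`, the `log_id` string, and the `sample_rate` parameter are all hypothetical, and hashing the ID is just one common way to make the decision repeatable.

```python
import hashlib


def should_evaluate(log_id: str, sample_rate: float) -> bool:
    """Decide whether a log falls inside the sampled fraction.

    Hypothetical helper: hashing the ID means the same log always gets
    the same decision, regardless of when or where it is checked.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(log_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Keep the log if its bucket falls below the sampled share of the range.
    return bucket < sample_rate * 2**64


# Example: run the (expensive) model-graded evaluator on roughly 10% of logs.
log_ids = [f"log-{i}" for i in range(1000)]
sampled = [log_id for log_id in log_ids if should_evaluate(log_id, 0.1)]
```

With a 10% rate, a model-graded evaluator would run on about one log in ten, so its cost scales with the sample rather than with total traffic.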

Model-graded Evaluators

  • What: LLM functions scoring semantic qualities.
  • Why: Measure tone, creativity, persuasiveness—things usage metrics miss.
  • How: Create LLM Evaluators
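
As a rough sketch of what a model-graded evaluator does, the example below builds a grading prompt from a log's output, sends it to a model, and parses a 1 to 5 score from the reply. Everything here is an assumption for illustration: the `log` dictionary with an `"output"` key, the prompt wording, and the `call_model` callable, which stands in for whichever LLM client you use.

```python
import re
from typing import Callable, Optional

# Hypothetical grading prompt; real evaluators would tune this wording.
PROMPT_TEMPLATE = (
    "Rate the tone of the following response on a scale from 1 (hostile) "
    "to 5 (friendly). Reply with a single digit.\n\n"
    "Response:\n{output}"
)


def grade_tone(log: dict, call_model: Callable[[str], str]) -> Optional[int]:
    """Ask a model to score the tone of a log's output.

    `call_model` is a placeholder for your own LLM client: it takes a
    prompt string and returns the model's text reply.
    """
    prompt = PROMPT_TEMPLATE.format(output=log.get("output") or "")
    reply = call_model(prompt)
    # Accept the first standalone digit between 1 and 5 in the reply.
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None


# Usage with a stubbed model, so the sketch runs without network access.
score = grade_tone({"output": "Happy to help!"}, lambda prompt: "Score: 5")
```

Returning `None` when no valid score is found keeps unparseable replies out of the aggregated metric instead of silently counting them as a low score.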

Python Evaluators

  • What: Code-defined metrics for precise or complex measurements.
  • Why: Compute linguistic metrics, domain-specific scores, etc.
  • How: Create Python Evaluators
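
A code-defined metric can be as small as a single function over a log. The sketch below assumes, hypothetically, that a log is a dictionary with an `"output"` string; the function names and the 50-word limit are illustrative, not part of any documented API.

```python
def word_count(log: dict) -> int:
    """Count whitespace-separated words in the log's output.

    Assumes a hypothetical log shape with an "output" key; a missing
    or empty output counts as zero words.
    """
    output = log.get("output") or ""
    return len(output.split())


def within_length_limit(log: dict, max_words: int = 50) -> bool:
    """Boolean metric: True when the output stays within `max_words`."""
    return word_count(log) <= max_words


# Usage on a sample log.
sample_log = {"output": "Thanks for reaching out. Your order has shipped."}
length = word_count(sample_log)
concise = within_length_limit(sample_log)
```

Numeric metrics like `word_count` can be averaged over time, while boolean ones like `within_length_limit` are convenient as pass/fail rates.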