Client-side evaluators run within your application, giving you real-time feedback during execution and full control over evaluation logic. You can use them via two main workflows:
Compute scores in your application code and attach them to traces for monitoring and analysis.
Use cases: format validation, safety checks, PII detection, latency tracking, relevance scores.
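As a rough illustration of the "compute scores in your application code" step, the sketch below implements two of the use cases listed above (format validation and a very coarse PII check) as plain Python functions. The names `is_valid_json`, `contains_email`, and `score_response` are hypothetical and not part of the SDK, and the email pattern is deliberately simplistic. Attaching the resulting dict to a trace is SDK-specific and is not shown here.

```python
import json
import re

# Hypothetical helpers for illustration; none of these names come from the SDK.
# A deliberately simple pattern: it catches common email shapes, not all of them.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def is_valid_json(text: str) -> bool:
    """Format validation: True if the text parses as JSON."""
    try:
        json.loads(text)
    except (TypeError, ValueError):
        # json.JSONDecodeError is a subclass of ValueError.
        return False
    return True


def contains_email(text: str) -> bool:
    """Coarse PII check: True if something email-like appears in the text."""
    return EMAIL_RE.search(text) is not None


def score_response(text: str) -> dict:
    """Collect the scores into a plain dict keyed by metric name."""
    return {
        "valid_json": is_valid_json(text),
        "contains_email": contains_email(text),
    }
```

Because the scores are ordinary Python values, they can be computed inline during a request and logged or attached wherever your tracing setup expects metrics.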
Evaluators receive three arguments and return a score:
```python
def my_evaluator(outputs, inputs, ground_truth):
    """
    Args:
        outputs: Return value from your function
        inputs: The inputs dict from the datapoint
        ground_truth: The ground_truth dict from the datapoint
            (expected/reference values to score against)

    Returns:
        A score (number, boolean, or string)
    """
    expected = ground_truth.get("expected", "")
    return 1.0 if outputs == expected else 0.0
```
Use ground_truth (singular) for both the datapoint field and the evaluator argument. ground_truths was the pre-1.0 SDK name.
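To show the three-argument signature returning a graded score instead of a strict match, here is a minimal sketch of a keyword-coverage evaluator. The `keywords` field on `ground_truth` and the sample datapoint are assumptions made for this example, not fields the SDK requires. The evaluator is called directly so its behavior can be checked without running an experiment.

```python
def keyword_coverage(outputs, inputs, ground_truth):
    """Return the fraction of expected keywords found in the output text."""
    # "keywords" is a hypothetical ground_truth field chosen for this sketch.
    keywords = ground_truth.get("keywords", [])
    if not keywords:
        return 0.0  # nothing to score against
    text = str(outputs).lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords)


# Hypothetical datapoint with the singular ground_truth field.
datapoint = {
    "inputs": {"question": "What is the capital of France?"},
    "ground_truth": {"keywords": ["Paris", "France"]},
}

score = keyword_coverage(
    "Paris is the capital.",
    datapoint["inputs"],
    datapoint["ground_truth"],
)
```

Here only one of the two keywords appears in the output, so `score` is 0.5. The `inputs` argument is unused in this evaluator but stays in the signature so the function matches the shape described above.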
Keep target function and evaluator names stable across runs. Cross-run comparison pairs metrics by metric name and traced function (event) name. Renaming the target function (e.g., baseline_classifier to improved_classifier) or the evaluator (e.g., accuracy to accuracy_v2) between runs makes the comparison view treat them as unrelated metrics, so improvements and regressions no longer pair up. Iterate by editing the function bodies in place and re-running under the same names; label the run itself via the name= argument to evaluate(). For more on cross-run comparisons, see Comparing Experiments.
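The effect of renaming can be pictured with a toy model. The sketch below is not the platform's comparison logic; it only assumes, as described above, that metrics are paired on the combination of event name and metric name. The function `pair_metrics`, the run dictionaries, and the score values are all made up for illustration.

```python
def pair_metrics(baseline: dict, candidate: dict) -> dict:
    """Pair scores that share the same (event name, metric name) key."""
    shared = baseline.keys() & candidate.keys()
    return {key: (baseline[key], candidate[key]) for key in sorted(shared)}


# Illustrative scores only, keyed by (event name, metric name).
run_a = {("classifier", "accuracy"): 0.72}

# Same names as run_a: the function body changed, the names did not.
run_b_same = {("classifier", "accuracy"): 0.81}

# Both the target function and the evaluator were renamed.
run_b_renamed = {("improved_classifier", "accuracy_v2"): 0.81}

paired = pair_metrics(run_a, run_b_same)
unpaired = pair_metrics(run_a, run_b_renamed)
```

With stable names, `paired` holds one entry that places 0.72 next to 0.81. With renamed functions, `unpaired` is empty: nothing lines up, which is why the comparison view can no longer show an improvement or a regression.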