In this guide, we’ll explore how to create custom human evaluators by defining a simple Conciseness rating on a Likert scale (1-5), which allows users to manually grade outputs against custom evaluation criteria.

Creating the evaluator

Navigate to the Evaluators tab in the left sidebar. Click Add Evaluator to create a new evaluator and select Human Evaluator.

Define the criteria

Here, you can provide a description of the evaluation criteria to your annotators. This description will be available to any reviewers when evaluating individual events or sessions.

It is often a good idea to include examples of “dos and don’ts” so your annotators can better understand how to grade the outputs; a sketch of such examples follows the criteria list below.

For this example, we will provide a detailed set of criteria for grading Conciseness. See below.

1. Relevance: Is the response directly related to the prompt without unnecessary details?
2. Clarity: Is the message clear and easily understandable?
3. Word Economy: Are unnecessary words, phrases, or sentences eliminated?
4. Precision: Does the response use precise language without being vague?
5. Elimination of Filler: Are redundant or filler words removed?
6. Logical Flow: Does the response follow a logical sequence without unnecessary jumps?
7. Brevity vs. Completeness: Is the response concise while still covering all necessary points?
8. Consistency: Does the response maintain consistent conciseness throughout?
9. Engagement: Does the response keep the reader's interest despite its brevity?
10. Overall Impact: Does the response effectively convey the message concisely?
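
To help calibrate annotators, it can help to pair the criteria with a concrete “do and don’t”. The sketch below shows one such hypothetical pair with illustrative ratings; the prompt, responses, and scores are made up for this guide and are not produced by the platform.

```python
# Hypothetical calibration examples to share with annotators alongside the
# criteria above. Prompts, responses, ratings, and reasons are illustrative only.
calibration_examples = [
    {
        "prompt": "What time zone is New York in?",
        "response": "New York is in the Eastern Time Zone (ET).",
        "rating": 5,  # direct and precise, no filler
        "reason": "Answers the question immediately with no unnecessary detail.",
    },
    {
        "prompt": "What time zone is New York in?",
        "response": (
            "That's a great question! Time zones can be a little confusing. "
            "New York, a large city on the east coast of the United States, "
            "observes the Eastern Time Zone, though daylight saving time also "
            "comes into play for part of the year..."
        ),
        "rating": 2,  # buries the answer in filler and background
        "reason": "Relevant, but padded with filler and unnecessary context.",
    },
]
```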

We’ll simply copy and paste the criteria above into the evaluator console. See below.

[Screenshot: the Conciseness criteria entered in the Human Evaluator console]

Configuration and setup

Configure return type

We currently support the following return types:

  1. Numeric: Allows grading on a scale of 1-n, where n is the Rating Scale
  2. Binary: Allows grading on a 👍/👎 scale
  3. Notes: Allows providing free-form text as feedback

Since our evaluator in this example needs to return a value between 1 and 5, we’ll configure it as Numeric and set the Rating Scale to 5.
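
For reference, the choices made in this step boil down to a couple of settings. The sketch below summarizes them as a plain Python dict; the field names are hypothetical and are not the platform’s actual configuration schema.

```python
# Illustrative summary of the evaluator configured in this guide.
# Field names are hypothetical, not the platform's actual schema.
conciseness_evaluator_config = {
    "name": "Conciseness",
    "kind": "human",            # graded manually by annotators
    "return_type": "numeric",   # alternatives: "binary", "notes"
    "rating_scale": 5,          # numeric grades range from 1 to 5
}
```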

Configure passing range

Passing ranges make it easy to detect which test cases failed in your evaluation.

Ideally, we’d want the model to score a 4 or 5, so we’ll configure the passing range as 4 to 5. This allows us to account for slight errors in judgment by the evaluator.
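
Conceptually, the passing range is just an inclusive bounds check on the rating an annotator assigns. A minimal sketch, using a hypothetical helper name:

```python
def passes(rating: int, low: int = 4, high: int = 5) -> bool:
    """Return True if a human rating falls inside the passing range (inclusive)."""
    return low <= rating <= high

# With a passing range of 4 to 5, a rating of 4 passes and a rating of 3 fails.
assert passes(4) and not passes(3)
```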

Using the evaluator

Evaluating logs

All evaluators defined here will automatically appear in the trace view under the session object. You can simply provide feedback on each session or on any events/spans within a session. See below.

[Screenshot: providing human evaluator feedback from the trace view]
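
Conceptually, each piece of feedback you leave is tied to a session or event and follows the evaluator’s return type. The sketch below shows what such a record might contain; the field names and IDs are hypothetical and do not reflect the platform’s schema.

```python
# Hypothetical shape of a single human-evaluation record attached to an event
# or session in the trace view. Field names and IDs are illustrative only.
human_feedback = {
    "session_id": "sess_123",   # the session being reviewed
    "event_id": "evt_456",      # optional: a specific event/span within it
    "evaluator": "Conciseness",
    "rating": 4,                # numeric return type, scale 1-5
    "comment": "Clear and direct, with one redundant sentence.",
}
```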