# Client-Side Evaluators

> Run evaluation logic in your application code

Client-side evaluators run inside your application, so scores are computed during execution and you keep full control over the evaluation logic. There are two main workflows:

| Workflow                                | When to Use                       | How                                    |
| --------------------------------------- | --------------------------------- | -------------------------------------- |
| **Adding Metrics to Traces**            | Production monitoring, guardrails | `enrich_span()`, `enrich_session()`    |
| **Evaluator Functions for Experiments** | Testing against datasets, CI/CD   | Define functions, pass to `evaluate()` |

***

## Adding Metrics to Traces

Compute scores in your application code and attach them to traces for monitoring and analysis.

**Use cases:** Format validation, safety checks, PII detection, latency tracking, relevance scores.

```python theme={null}
import os
from honeyhive import HoneyHiveTracer, trace, enrich_span

HoneyHiveTracer.init(
    api_key=os.getenv("HH_API_KEY"),
    project=os.getenv("HH_PROJECT"),
)

@trace
def generate_response(query):
    # call_llm, check_pii, and compute_relevance are placeholders for your own functions
    response = call_llm(query)

    # Add metrics to this span
    enrich_span(metrics={
        "response_length": len(response),
        "contains_pii": check_pii(response),
        "relevance_score": compute_relevance(query, response),
    })

    return response
```
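
The `check_pii` and `compute_relevance` calls in the example above are left undefined. As a minimal sketch, assuming simple regex and word-overlap heuristics are good enough for a first pass, they might look like the following. Both helpers are hypothetical and not part of the HoneyHive SDK; production PII detection usually needs a dedicated library or service.

```python
import re

# Hypothetical helpers for illustration only; not part of the HoneyHive SDK.
_EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
_PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")


def check_pii(text: str) -> bool:
    """Return True if the text appears to contain an email or a US-style phone number."""
    return bool(_EMAIL.search(text) or _PHONE.search(text))


def compute_relevance(query: str, response: str) -> float:
    """Fraction of distinct query words that also appear in the response (0.0 to 1.0)."""
    query_words = set(re.findall(r"\w+", query.lower()))
    response_words = set(re.findall(r"\w+", response.lower()))
    if not query_words:
        return 0.0  # avoid dividing by zero on an empty query
    return len(query_words & response_words) / len(query_words)
```

Because both helpers return plain Python values (a boolean and a float), their results can be placed directly in the `metrics` dict.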

<Tip>
  To see where to initialize the tracer for your environment, including AWS Lambda and long-running servers, see [Tracer Initialization](/v2/tracing/tracer-initialization).
</Tip>

Metrics appear in the HoneyHive dashboard for charting, alerting, and filtering.

<Note>
  For complete documentation on adding metrics to traces, see [Custom Metrics](/v2/tracing/client-side-evals).
</Note>

***

## Evaluator Functions for Experiments

Define scoring functions that run locally during `evaluate()` to score outputs against expected results.

### Writing an Evaluator

Evaluators receive three arguments and return a score:

```python theme={null}
def my_evaluator(outputs, inputs, ground_truth):
    """
    Args:
        outputs: Return value from your function
        inputs: The inputs dict from the datapoint
        ground_truth: The ground_truth dict from the datapoint
    
    Returns:
        A score (number, boolean, or string)
    """
    expected = ground_truth.get("expected", "")
    return 1.0 if outputs == expected else 0.0
```
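
Since evaluators are ordinary Python functions, you can call them directly to check their behavior before passing them to `evaluate()`. The sketch below shows one evaluator for each return type mentioned in the docstring above. The function names, the `max_chars` input key, and the sample datapoint are illustrative assumptions.

```python
def exact_match(outputs, inputs, ground_truth):
    """Numeric score: 1.0 when the output equals the expected value."""
    return 1.0 if outputs == ground_truth.get("expected", "") else 0.0


def within_length_limit(outputs, inputs, ground_truth):
    """Boolean score: True when the output fits the limit given in the inputs."""
    # "max_chars" is a hypothetical key; 280 is an arbitrary default.
    return len(outputs) <= inputs.get("max_chars", 280)


def length_bucket(outputs, inputs, ground_truth):
    """String score: a categorical label for the output length."""
    n = len(outputs)
    return "short" if n < 50 else "medium" if n < 200 else "long"


# Call each evaluator by hand on a sample datapoint.
sample_inputs = {"question": "What is the capital of France?", "max_chars": 10}
sample_ground_truth = {"expected": "Paris"}

scores = {
    fn.__name__: fn("Paris", sample_inputs, sample_ground_truth)
    for fn in (exact_match, within_length_limit, length_bucket)
}
print(scores)
```

Checks like these fit naturally into a unit test suite, so evaluator bugs surface before an experiment run.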

### Running Evaluators

Pass evaluator functions to `evaluate()`:

```python theme={null}
import os
from honeyhive import evaluate

def accuracy(outputs, inputs, ground_truth):
    expected = ground_truth.get("intent", "")
    actual = outputs.get("intent", "")
    return 1.0 if expected == actual else 0.0

def my_classifier(datapoint):
    text = datapoint["inputs"]["text"]
    # classify() is a placeholder for your own classification logic
    return {"intent": classify(text)}

dataset = [
    {"inputs": {"text": "I need a refund"}, "ground_truth": {"intent": "billing"}},
    {"inputs": {"text": "App won't load"}, "ground_truth": {"intent": "technical"}},
]

result = evaluate(
    function=my_classifier,
    dataset=dataset,
    evaluators=[accuracy],
    api_key=os.getenv("HH_API_KEY"),
    project=os.getenv("HH_PROJECT"),
    name="intent-classifier-v1",
)
```

<Note>
  For a complete tutorial with real examples, see [Run Your First Experiment](/v2/introduction/experiments-quickstart).
</Note>
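
To sanity-check a function and its evaluators without an API key, you can run a local dry run that calls each evaluator with the same three arguments described above. This is only a sketch: `dry_run` and `stub_classifier` are hypothetical names, and the loop does not reproduce what `evaluate()` does internally (no tracing, no upload of results).

```python
def accuracy(outputs, inputs, ground_truth):
    expected = ground_truth.get("intent", "")
    actual = outputs.get("intent", "")
    return 1.0 if expected == actual else 0.0


def stub_classifier(datapoint):
    # Hypothetical keyword rule standing in for a real model call.
    text = datapoint["inputs"]["text"].lower()
    return {"intent": "billing" if "refund" in text else "technical"}


dataset = [
    {"inputs": {"text": "I need a refund"}, "ground_truth": {"intent": "billing"}},
    {"inputs": {"text": "App won't load"}, "ground_truth": {"intent": "technical"}},
]


def dry_run(function, dataset, evaluators):
    """Run the function on each datapoint and score it with every evaluator."""
    rows = []
    for datapoint in dataset:
        outputs = function(datapoint)
        rows.append({
            ev.__name__: ev(outputs, datapoint["inputs"], datapoint["ground_truth"])
            for ev in evaluators
        })
    return rows


rows = dry_run(stub_classifier, dataset, [accuracy])
mean_accuracy = sum(row["accuracy"] for row in rows) / len(rows)
print(rows, mean_accuracy)
```

Once the local scores look sensible, pass the same function and evaluators to `evaluate()` to record the run in HoneyHive.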

### Evaluating Multi-Step Pipelines

For pipelines with multiple steps, combine both approaches:

* **Session-level:** Pass evaluators to `evaluate()` for overall scoring
* **Span-level:** Use `enrich_span()` within traced functions for step-specific metrics

```python theme={null}
import os
from honeyhive import evaluate, trace, enrich_span

# Session-level evaluator
def answer_quality(outputs, inputs, ground_truth):
    expected = ground_truth.get("answer", "")
    # Guard against an empty expected answer: "" is a substring of every string
    return 1.0 if expected and expected.lower() in outputs.lower() else 0.0

@trace
def retrieve_docs(query):
    docs = search_database(query)  # placeholder for your retrieval logic
    # Span-level metrics (retrieval_score is hard-coded here for illustration)
    enrich_span(metrics={"num_docs": len(docs), "retrieval_score": 0.85})
    return docs

@trace
def generate_answer(docs, query):
    answer = call_llm(docs, query)  # placeholder for your LLM call
    enrich_span(metrics={"answer_length": len(answer)})
    return answer

def rag_pipeline(datapoint):
    query = datapoint["inputs"]["query"]
    docs = retrieve_docs(query)
    return generate_answer(docs, query)

result = evaluate(
    function=rag_pipeline,
    dataset=my_dataset,  # list of {"inputs": {"query": ...}, "ground_truth": {"answer": ...}} dicts
    evaluators=[answer_quality],  # Scores the final output
    api_key=os.getenv("HH_API_KEY"),
    project=os.getenv("HH_PROJECT"),
    name="rag-eval",
)
```

After the run completes, you'll see both:

* `answer_quality` scores at the session level
* `retrieval_score`, `num_docs`, `answer_length` at the span level
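
The substring check in `answer_quality` gives full credit to any output that merely contains the expected text, however long the output is. If you want partial credit instead, one option is a token-overlap F1 score. The sketch below is an illustrative alternative, not a HoneyHive built-in; `answer_token_f1` is a hypothetical name, and it assumes `outputs` is a string and `ground_truth` has an `answer` key, as in the pipeline example.

```python
import re
from collections import Counter


def _tokens(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def answer_token_f1(outputs, inputs, ground_truth):
    """Token-level F1 between the pipeline output and the expected answer."""
    expected = _tokens(ground_truth.get("answer", ""))
    actual = _tokens(outputs)
    if not expected or not actual:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it
    # appears on both sides.
    overlap = sum((Counter(expected) & Counter(actual)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(actual)
    recall = overlap / len(expected)
    return 2 * precision * recall / (precision + recall)
```

It has the same `(outputs, inputs, ground_truth)` signature as `answer_quality`, so it can be added to the `evaluators` list alongside it.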

***

## Next Steps

<CardGroup cols={2}>
  <Card title="Custom Metrics" icon="chart-line" href="/v2/tracing/client-side-evals">
    Full guide to adding metrics to traces
  </Card>

  <Card title="Run Your First Experiment" icon="flask" href="/v2/introduction/experiments-quickstart">
    Complete tutorial with real examples
  </Card>

  <Card title="Server-Side Evaluators" icon="server" href="/v2/evaluators/python">
    Run evaluators on HoneyHive infrastructure
  </Card>

  <Card title="LLM-as-Judge" icon="robot" href="/v2/evaluators/llm">
    Use LLMs to evaluate outputs
  </Card>
</CardGroup>
