# End-to-End: Multi-Agent Tracing and Evaluation

> Build a multi-agent customer support bot with Google ADK, add HoneyHive tracing, and run evaluations end to end. Follow a cookbook from agents to quality.

Build a multi-agent customer support system using [Google ADK](https://google.github.io/adk-docs/) and progressively add HoneyHive observability -- from auto-tracing through enrichment, custom spans, and evaluation.

**What you'll build**: A coordinator agent that routes customer queries to specialist sub-agents (billing and technical support), with full tracing and an evaluation pipeline.

Clone the cookbook repo to run it yourself. Looking for more runnable examples? Browse the full [cookbook collection on GitHub](https://github.com/honeyhiveai/cookbook).

**Prerequisites**:

* Python 3.11+
* A [HoneyHive account](https://app.us.honeyhive.ai) (grab your API key from the dashboard)
* A [Google AI API key](https://aistudio.google.com/apikey)
* An [OpenAI API key](https://platform.openai.com/api-keys) (for the evaluation step only)

## Setup

```bash theme={null}
git clone https://github.com/honeyhiveai/cookbook.git
cd cookbook/google-adk-cookbook
pip install -r requirements.txt
```

Create a `.env` file:

```bash theme={null}
HH_API_KEY=your-honeyhive-api-key
GOOGLE_API_KEY=your-google-ai-api-key
OPENAI_API_KEY=your-openai-key  # only needed for evaluate.py
```

## Walkthrough

The app has three agents: a coordinator that routes queries, and two specialists with tools.
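The specialists' tools can be ordinary Python functions: ADK accepts plain callables in `tools=[...]` and uses the function name, type hints, and docstring to describe each tool to the model. The sketch below is a minimal, hypothetical version of the two tools named in the agent definitions that follow (`lookup_billing` and `search_knowledge_base`). The parameters, return shapes, and in-memory data are assumptions for illustration, not the cookbook's actual implementation.

```python
# Hypothetical stand-ins for the cookbook's tools; all records and fields are made up.
_BILLING_RECORDS = {
    "customer_42": {"balance": 24.50, "refund_status": "pending"},
}

_KB_ARTICLES = [
    {
        "title": "Export fails with error 500",
        "body": "Retry after clearing filters; large exports can time out.",
    },
    {
        "title": "Resetting your password",
        "body": "Use the 'Forgot password' link on the sign-in page.",
    },
]


def lookup_billing(customer_id: str) -> dict:
    """Look up the balance and refund status for a customer.

    Args:
        customer_id: The customer's account identifier.
    """
    record = _BILLING_RECORDS.get(customer_id)
    if record is None:
        return {"status": "error", "message": f"No billing record for {customer_id}"}
    return {"status": "success", **record}


def search_knowledge_base(query: str) -> dict:
    """Search support articles for text matching the query.

    Args:
        query: Free-text description of the technical issue.
    """
    # Ignore very short tokens so words like "a" or "to" do not match everything.
    terms = [t for t in query.lower().split() if len(t) > 2]
    matches = [
        article
        for article in _KB_ARTICLES
        if any(t in (article["title"] + " " + article["body"]).lower() for t in terms)
    ]
    return {"status": "success", "results": matches}
```

Returning a dict with a `status` key is a convention chosen for this sketch, so the model can tell a failed lookup from an empty result.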
```python theme={null}
from google.adk.agents import Agent

# lookup_billing and search_knowledge_base are tool functions defined
# elsewhere in the cookbook.
billing_agent = Agent(
    name="billing_agent",
    model="gemini-2.0-flash",
    description="Handles billing inquiries including account balances, "
                "recent charges, invoice questions, and refund status.",
    instruction="You are a billing support specialist...",
    tools=[lookup_billing],
)

technical_agent = Agent(
    name="technical_agent",
    model="gemini-2.0-flash",
    description="Handles technical support including product bugs, "
                "feature questions, error messages, and troubleshooting.",
    instruction="You are a technical support specialist...",
    tools=[search_knowledge_base],
)

# The coordinator has no tools of its own; it only delegates to sub-agents.
coordinator = Agent(
    name="customer_support",
    model="gemini-2.0-flash",
    description="Customer support coordinator that routes queries to specialists.",
    instruction="Route billing questions to billing_agent, "
                "technical issues to technical_agent. Always delegate.",
    sub_agents=[billing_agent, technical_agent],
)
```

The coordinator uses ADK's native [LLM delegation](https://google.github.io/adk-docs/agents/multi-agents/) -- the model reads each sub-agent's `description` and decides where to route.

**Adding HoneyHive tracing takes 4 lines.** Initialize the tracer and instrumentor before running your agents:

```python highlight={3-4,6-9,11} theme={null}
import os

from honeyhive import HoneyHiveTracer
from openinference.instrumentation.google_adk import GoogleADKInstrumentor

tracer = HoneyHiveTracer.init(
    api_key=os.getenv("HH_API_KEY"),
    session_name="customer-support",
)

GoogleADKInstrumentor().instrument(tracer_provider=tracer.provider)
```

Run it:

```bash theme={null}
python main.py
```

In HoneyHive, you'll see the full trace hierarchy: coordinator routing decisions, sub-agent delegation, LLM calls, and tool executions -- all captured automatically with no changes to your agent code.
Auto-tracing captures the agent mechanics, but you also need business context to make traces useful in production: which user made the request, which environment it ran in, and which version of your app served it.

```python highlight={1-8} theme={null}
tracer.enrich_session(
    user_properties={
        "user_id": "customer_42",
        "plan": "enterprise",
    },
    metadata={
        "environment": "production",
        "app_version": "2.1.0",
    },
)
```

This attaches to the session, so every trace in this session carries the context. In HoneyHive you can now:

* **Filter** traces by user, plan, or environment
* **Search** for all traces from a specific customer
* **Compare** behavior across app versions

The instrumentor captures ADK internals automatically. For your own code that wraps around agent calls -- input validation, formatting, orchestration -- use the `@trace()` decorator:

```python highlight={1,3} theme={null}
from honeyhive import trace

@trace()
def preprocess_query(raw_query: str) -> str:
    """Normalize and validate customer input before sending to the agent."""
    cleaned = raw_query.strip()
    if len(cleaned) < 3:
        return "Could you please provide more details about your question?"
    return cleaned
```

This creates a span in the trace for `preprocess_query`, so you can see exactly how long input processing takes alongside the agent spans.

**When to add custom spans vs rely on auto-instrumentation:**

| Use auto-instrumentation for      | Use `@trace()` for                           |
| --------------------------------- | -------------------------------------------- |
| LLM calls, tool calls, agent runs | Input validation, output formatting          |
| Anything inside the ADK framework | Database lookups, API calls to your services |
| Token usage, latency per call     | Business logic that wraps agent calls        |

Once your agent is traced, you want to know: is it actually good? The `evaluate.py` script runs the agent against a test dataset and measures quality.
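As a mental model for what such a script does, the plain-Python loop below runs a function over a dataset, applies evaluators with an `(outputs, inputs, ground_truth)` signature, and averages the scores. It is a sketch under stated assumptions, not HoneyHive's implementation: it makes no HoneyHive calls, uploads nothing, and creates no traces. `fake_agent`, `routing_matches`, `run_local_eval`, and the two-item dataset are hypothetical names and data introduced for illustration.

```python
from statistics import mean

# Toy dataset: each datapoint has "inputs" and "ground_truth".
dataset = [
    {
        "inputs": {"query": "I was charged twice this month."},
        "ground_truth": {"category": "billing"},
    },
    {
        "inputs": {"query": "The export button gives me error 500."},
        "ground_truth": {"category": "technical"},
    },
]


def fake_agent(inputs: dict) -> dict:
    """Hypothetical stand-in for the real agent: routes on a keyword."""
    query = inputs["query"].lower()
    category = "billing" if "charged" in query or "refund" in query else "technical"
    return {"category": category, "response": f"Routed to {category} support."}


def routing_matches(outputs: dict, inputs: dict, ground_truth: dict) -> float:
    """Deterministic evaluator: 1.0 if the routed category matches, else 0.0."""
    return 1.0 if outputs["category"] == ground_truth["category"] else 0.0


def run_local_eval(function, dataset, evaluators) -> dict:
    """Run `function` on every datapoint and average each evaluator's score."""
    scores = {ev.__name__: [] for ev in evaluators}
    for datapoint in dataset:
        outputs = function(datapoint["inputs"])
        for ev in evaluators:
            score = ev(outputs, datapoint["inputs"], datapoint["ground_truth"])
            scores[ev.__name__].append(score)
    return {name: mean(values) for name, values in scores.items()}


summary = run_local_eval(fake_agent, dataset, [routing_matches])
print(summary)
```

A deterministic evaluator like `routing_matches` is cheap to run and easy to debug, which makes it a useful complement to LLM-judged scores when ground-truth labels are available.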
**Define a dataset** of customer queries with expected categories:

```python theme={null}
dataset = [
    {
        "inputs": {"query": "I was charged $24.50 but I thought that was refunded?"},
        "ground_truth": {"category": "billing"},
    },
    {
        "inputs": {"query": "The export button gives me error 500."},
        "ground_truth": {"category": "technical"},
    },
    # ... 8 queries total
]
```

**Define evaluators** -- functions that score each response:

```python theme={null}
def response_quality(outputs, inputs, ground_truth):
    """Use an LLM to judge whether the agent response is helpful."""
    # Calls GPT-4o-mini to score the response 0.0 - 1.0
    # based on accuracy, helpfulness, and tone
    ...


def correct_routing(outputs, inputs, ground_truth):
    """Check if the query was routed to the right specialist."""
    # LLM judge that verifies the right specialist handled the query
    ...
```

**Run the experiment:**

```python theme={null}
from honeyhive import evaluate

result = evaluate(
    function=run_support_agent,
    dataset=dataset,
    evaluators=[response_quality, correct_routing],
    name="customer-support-eval",
)
```

```bash theme={null}
python evaluate.py
```

HoneyHive's `evaluate()` runs your agent against every datapoint, applies each evaluator, and uploads the results. You can view them in the Experiments UI to see scores per query, aggregate metrics, and compare across runs.

## Next steps

* Configure tracing for serverless, web servers, and Kubernetes
* Trace agents across service boundaries
* Deep dive into HoneyHive's evaluation framework
* LangChain, LangGraph, Strands, CrewAI, and more