In this tutorial, you will learn how to run an experiment to evaluate a multi-step LLM application. We will demonstrate this by implementing a Retrieval-Augmented Generation (RAG) pipeline that uses MongoDB for document retrieval and OpenAI for response generation. By the end of this guide, you will have evaluated a working RAG pipeline, assessing its ability to retrieve relevant documents and generate consistent responses using metrics such as retrieval relevance and response consistency.

The key steps covered in this tutorial include:

  1. Setting up a vector database in MongoDB for semantic search.
  2. Defining a sample dataset with inputs and corresponding ground truth values.
  3. Establishing evaluators to calculate similarity metrics for both document retrieval and response generation stages.
  4. Implementing the RAG Pipeline, which includes document retrieval and response generation stages.
  5. Running a comprehensive experiment using HoneyHive’s evaluation framework and analyzing the results.

You can view the complete code for this tutorial here:

Overview

For this tutorial, we will use the example of a medical/health question answering application.

Let’s go through the main components of this example by splitting it into two parts: the RAG pipeline we wish to evaluate, and the evaluators used to assess its performance.

RAG Pipeline

The pipeline consists of the following steps:

  • Document Retrieval: Using MongoDB’s vector search capabilities, we retrieve the most relevant documents for a given query.
  • Response Generation: Using OpenAI’s API, we generate a response based on the retrieved documents and the query.

Evaluators

  • Retrieval Evaluator: This evaluator assesses the retrieval process by measuring the semantic similarity between the query and the retrieved documents.
  • Response Evaluator: This evaluator measures the semantic similarity between the model’s final response and the provided ground truth for each query.
  • Pipeline Evaluator: This evaluator generates basic metrics related to the overall pipeline, such as the number of retrieved documents and the query length.

Overview of the RAG Pipeline to be evaluated

In the document retrieval phase, we will compute semantic similarity scores using sentence embeddings. These embeddings will be generated using the all-MiniLM-L6-v2 model from the sentence-transformers library.
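As a quick standalone illustration (separate from the pipeline code we build later), the snippet below computes the similarity between one of our dataset queries and one of the sample articles; values closer to 1 indicate stronger semantic similarity:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode a query and a document into 384-dimensional embeddings
query_vec = model.encode("How does exercise affect diabetes?").reshape(1, -1)
doc_vec = model.encode("Regular exercise reduces diabetes risk by 30%.").reshape(1, -1)

# Cosine similarity ranges over [-1, 1]; higher means the texts are semantically closer
print(cosine_similarity(query_vec, doc_vec)[0][0])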

Prerequisites

To run this tutorial, make sure you have the following prerequisites in place:

  • A MongoDB Atlas Cluster set up and ready to use.
  • An OpenAI API key for model response generation.
  • A HoneyHive project already created, as outlined here.
  • An API key for your HoneyHive project, as explained here.

Setting Up the Environment

First, let’s install all the required libraries:

pip install pymongo python-dotenv sentence-transformers scikit-learn openai honeyhive

Then, we initialize the necessary components, including MongoDB, OpenAI, and the SentenceTransformer model for embedding generation.
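A minimal initialization sketch is shown below, assuming credentials are kept in a .env file. The MONGODB_URI variable name and the exact HoneyHive import paths are assumptions and may differ depending on your setup and SDK version; the database and collection names match the ones used throughout this tutorial:

import os
from typing import Dict, List

import numpy as np
from dotenv import load_dotenv
from openai import OpenAI
from pymongo import MongoClient
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# HoneyHive tracing and evaluation helpers (import paths may vary by SDK version)
from honeyhive import evaluate, evaluator, trace, enrich_span, enrich_session

load_dotenv()

# MongoDB Atlas connection (assumes MONGODB_URI is set in your .env file)
mongo_client = MongoClient(os.getenv("MONGODB_URI"))
collection = mongo_client["medical_db"]["articles"]

# OpenAI client for response generation (reads OPENAI_API_KEY from the environment)
openai_client = OpenAI()

# Sentence embedding model used for indexing, retrieval, and the evaluators (384-dim vectors)
model = SentenceTransformer("all-MiniLM-L6-v2")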

In this example, our MongoDB collection is preloaded with sample medical articles:

    {
        "title": "Exercise and Diabetes",
        "content": "Regular exercise reduces diabetes risk by 30%. Studies show that engaging in moderate physical activity for at least 30 minutes daily can help regulate blood sugar levels. Daily walking is particularly recommended for diabetes prevention.",
    },
    {
        "title": "Morning Exercise Benefits",
        "content": "Studies show morning exercises have better impact on blood sugar levels. Research indicates that working out before breakfast can improve insulin sensitivity and help with weight management.",
    },
    {
        "title": "Diet and Diabetes",
        "content": "A balanced diet rich in fiber and low in refined carbohydrates can help prevent diabetes. Whole grains, vegetables, and lean proteins are essential components of a diabetes-prevention diet.",
    }

This guide assumes you have:

  1. A MongoDB Atlas cluster set up
  2. A database named “medical_db” with a collection named “articles”
  3. A vector search index named “vector_index” on the “articles” collection, defined as follows (you can also create it programmatically, as sketched after this list):
    {
      "fields": [
        {
          "numDimensions": 384,
          "path": "embedding",
          "similarity": "cosine",
          "type": "vector"
        }
      ]
    }
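If you prefer to create the vector search index from code rather than through the Atlas UI, a sketch along these lines should work. It assumes PyMongo 4.7 or newer (for the type="vectorSearch" argument) and the collection object initialized earlier; Atlas builds the index asynchronously, so it may take a short while before it is queryable:

from pymongo.operations import SearchIndexModel

# Programmatic equivalent of the index definition above (assumes PyMongo 4.7+)
index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "numDimensions": 384,
                "path": "embedding",
                "similarity": "cosine",
                "type": "vector",
            }
        ]
    },
    name="vector_index",
    type="vectorSearch",
)
collection.create_search_index(model=index_model)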
    

If you haven’t set up these prerequisites, please refer to MongoDB Atlas’ documentation, or feel free to follow along with your pre-existing vector DB or external retrieval system!
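The setup_mongodb() helper called in the main block at the end of this tutorial is not listed in full here. A minimal sketch, assuming sample_articles holds the article dicts shown above and that model and collection are the objects initialized earlier, might look like this:

def setup_mongodb():
    """Embeds the sample medical articles and loads them into MongoDB (sketch)"""
    # Start from a clean collection so reruns don't insert duplicates
    collection.delete_many({})

    # Store each article together with its 384-dimensional embedding, matching
    # the "embedding" path configured in the vector search index
    docs = [
        {**article, "embedding": model.encode(article["content"]).tolist()}
        for article in sample_articles  # assumed to be the list of articles shown above
    ]
    collection.insert_many(docs)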

Implementing the RAG Pipeline

Let’s build the actual RAG pipeline. Our main function is rag_pipeline, which calls get_relevant_docs followed by generate_response.

@trace
def get_relevant_docs(query: str, top_k: int = 2):
    """Retrieves relevant documents from MongoDB using semantic search"""
    # Compute query embedding
    query_embedding = model.encode(query).tolist()

    # Search for similar documents using vector similarity
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "path": "embedding",
                "queryVector": query_embedding,
                "numCandidates": top_k * 2,  # Search through more candidates for better results
                "limit": top_k
            }
        }
    ]

    try:
        results = list(collection.aggregate(pipeline))
        retrieved_docs = [doc["content"] for doc in results]
        retrieved_embeddings = [doc["embedding"] for doc in results]
        retrieval_relevance = retrieval_relevance_evaluator(query_embedding, retrieved_embeddings)
        enrich_span(metrics={"retrieval_relevance": retrieval_relevance})

        return retrieved_docs
    except Exception as e:
        print(f"Search error: {e}")
        # Fallback to basic find if vector search fails
        return [doc["content"] for doc in collection.find().limit(top_k)]

@trace
def generate_response(docs: List[str], query: str):
    """Generates response using OpenAI model"""
    # Join the retrieved documents into a single context block for the prompt
    context = "\n\n".join(docs)
    prompt = f"Question: {query}\nContext: {context}\nAnswer:"
    completion = openai_client.chat.completions.create(
        model="o3-mini",
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    return completion.choices[0].message.content

def rag_pipeline(inputs: Dict, ground_truths: Dict) -> str:
    """Complete RAG pipeline that retrieves docs and generates response"""
    query = inputs["query"]
    docs = get_relevant_docs(query)
    response = generate_response(docs, query)

    enrich_session(metrics={
        "rag_pipeline": {
            "num_retrieved_docs": len(docs),
            "query_length": len(query.split())   
        }
    })
    return response

Note the enrich_span and enrich_session calls in the example above: these are the points where the code enriches our traces with span-level and session-level metrics using HoneyHive’s enrichment methods.

Creating the Dataset

Let’s define our sample dataset with the desired inputs and associated ground_truths:

dataset = [
    {
        "inputs": {
            "query": "How does exercise affect diabetes?",
        },
        "ground_truths": {
            "response": "Regular exercise reduces diabetes risk by 30%. Daily walking is recommended.",
        }
    },
    {
        "inputs": {
            "query": "What are the benefits of morning exercise?",
        },
        "ground_truths": {
            "response": "Morning exercise has better impact on blood sugar levels.",
        }
    },
    {
        "inputs": {
            "query": "What is the best diet for diabetes?",
        },
        "ground_truths": {
            "response": "A balanced diet rich in fiber and low in refined carbohydrates is recommended.",
        }
    },
    {
        "inputs": {
            "query": "What is the best way to manage stress?",
        },
        "ground_truths": {
            "response": "Regular exercise, a balanced diet, and adequate sleep are effective ways to manage stress.",
        }
    },
    {
        "inputs": {
            "query": "How do sleep patterns affect mental health?",
        },
        "ground_truths": {
            "response": "Sleep patterns significantly impact mental well-being. Poor sleep can lead to increased anxiety and depression risks.",
        }
    },
]

Notice that our dataset includes some questions that are not covered by the documents in our vector database, such as the questions about sleep patterns and stress management. In this simplified example, the gap is easy to spot by inspection; in real-world scenarios, it can be much harder to identify.

Let’s see if this is reflected in our evaluation results at the end of this tutorial.

Defining the Evaluators

For the retrieval relevance evaluator, we calculate the cosine similarity between the query and each retrieved document. The final metric is the average of these similarity scores.

For the response consistency evaluator, we assess the semantic similarity between the generated output and the ground truth. This helps determine how closely the model’s response aligns with the expected answer.


def retrieval_relevance_evaluator(query_embedding: List[float], retrieved_embeddings: List[List[float]]) -> float:
    """Evaluates the relevance of retrieved documents to the query"""
    try:
        similarities = cosine_similarity([query_embedding], retrieved_embeddings)[0]
    except Exception as e:
        print(f"Error: {e}")
        return 0.0

    # Return average similarity
    return float(np.mean(similarities))


@evaluator()
def consistency_evaluator(outputs: str, inputs: Dict[str, str], ground_truths: Dict[str, str]) -> float:
    """Evaluates consistency between outputs and ground truths"""
    output_embeddings = model.encode(outputs).reshape(1, -1)  # Reshape to 2D array
    truth_embeddings = model.encode(ground_truths["response"]).reshape(1, -1)  # Reshape to 2D array

    # Calculate cosine similarity between outputs and ground truths
    similarities = cosine_similarity(output_embeddings, truth_embeddings)

    # Return average similarity
    return float(np.mean(similarities))

Running the Experiment

Finally, we run the experiment using HoneyHive’s evaluate function, passing it the rag_pipeline function, the dataset, and the evaluator defined above.

if __name__ == "__main__":
    # Setup MongoDB with sample data
    setup_mongodb()

    # Run experiment
    evaluate(
        function=rag_pipeline,
        hh_api_key=os.getenv('HONEYHIVE_API_KEY'),
        hh_project=os.getenv('HONEYHIVE_PROJECT'),
        name='MongoDB RAG Pipeline Evaluation',
        dataset=dataset,
        evaluators=[consistency_evaluator],
    )

In this tutorial, metrics are logged in three different ways: the response consistency evaluator is the main evaluator and is passed directly to the evaluate harness, along with the function under test, rag_pipeline; the retrieval relevance metric is logged with enrich_span, since it belongs to the get_relevant_docs span; and the pipeline metrics are logged with enrich_session, since they describe the overall session.

Results and Insights

After running the experiment, you can view the results in the Experiments page in HoneyHive:

The Experiments Dashboard

For the retrieval step, we observe that some queries resulted in low retrieval relevance. Examining the Evaluation Summary on the left, we also notice that the average response consistency (0.73) is higher than the average retrieval relevance (0.41). Let’s take a closer look at the distribution of these metrics:

Response Consistency - Distribution

Retrieval Relevance - Distribution

This suggests that while the model’s responses are generally on-topic, they may not always be grounded in the source of truth—particularly for the two examples with retrieval relevance scores below 0.25. Let’s drill down into one of these examples:

Low retrieval relevance data point

Here, we can identify the root cause: the queries about stress management and sleep patterns had low retrieval relevance because the vector database lacked relevant documents on those topics.

Conclusion

By following this tutorial, you’ve built a multi-step RAG pipeline, integrated it with MongoDB and OpenAI, and evaluated its performance using HoneyHive. Explore the results further to uncover valuable insights and optimize your pipeline!