Multi-Step Experiments
Learn to evaluate multi-step LLM applications with component-level metrics
In this tutorial, you will learn how to run an experiment to evaluate a multi-step LLM application. We will demonstrate this by implementing a Retrieval-Augmented Generation (RAG) pipeline that uses MongoDB for document retrieval and OpenAI for response generation. By the end of this guide, you will have evaluated the pipeline’s ability to retrieve relevant documents and generate consistent responses, using metrics such as retrieval relevance and response consistency.
The key steps covered in this tutorial include:
- Setting up a vector database in MongoDB for semantic search.
- Defining a sample dataset with inputs and corresponding ground truth values.
- Establishing evaluators to calculate similarity metrics for both document retrieval and response generation stages.
- Implementing the RAG Pipeline, which includes document retrieval and response generation stages.
- Running a comprehensive experiment using HoneyHive’s evaluation framework and analyzing the results.
Overview
For this tutorial, we will use the example of a medical/health question answering application.
Let’s go through the main components of this example by splitting it into two parts: the RAG pipeline we wish to evaluate, and the evaluators used to assess its performance.
RAG Pipeline
The pipeline consists of the following steps:
- Document Retrieval: Using MongoDB’s vector search capabilities, we retrieve the most relevant documents for a given query.
- Response Generation: Using OpenAI’s API, we generate a response based on the retrieved documents and the query.
Evaluators
- Retrieval Evaluator: This evaluator assesses the retrieval process by measuring the semantic similarity between the query and the retrieved documents.
- Response Evaluator: This evaluator measures the semantic similarity between the model’s final response and the provided ground truth for each query.
- Pipeline Evaluator: This evaluator generates basic metrics related to the overall pipeline, such as the number of retrieved documents and the query length.
Overview of the RAG Pipeline to be evaluated
In the document retrieval phase, we will compute semantic similarity scores using sentence embeddings. These embeddings will be generated using the all-MiniLM-L6-v2 model from the sentence-transformers library.
Prerequisites
To be able to run this tutorial, make sure you have the following prerequisites in place:
- A MongoDB Atlas Cluster set up and ready to use.
- An OpenAI API key for model response generation.
- A HoneyHive project already created, as outlined here.
- An API key for your HoneyHive project, as explained here.
Setting Up the Environment
First, let’s install all the required libraries:
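A minimal install command, assuming the pipeline uses pymongo, openai, sentence-transformers, and the honeyhive SDK:

```bash
pip install pymongo openai sentence-transformers honeyhive
```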
Then, we initialize the necessary components, including MongoDB, OpenAI, and the SentenceTransformer model for embedding generation.
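A minimal setup sketch, assuming connection details are provided via the MONGODB_URI and OPENAI_API_KEY environment variables (the variable names are illustrative):

```python
import os

from openai import OpenAI
from pymongo import MongoClient
from sentence_transformers import SentenceTransformer

# MongoDB Atlas: database "medical_db", collection "articles" (see below)
mongo_client = MongoClient(os.environ["MONGODB_URI"])
collection = mongo_client["medical_db"]["articles"]

# OpenAI client used for response generation
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Sentence embedding model used for indexing, retrieval, and the evaluators
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
```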
In this example, our MongoDB collection is preloaded with sample medical articles:
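If you need to seed such a collection yourself, here is a minimal sketch; the document schema (a text field plus an embedding field) and the articles are placeholders:

```python
sample_articles = [
    {"text": "Regular aerobic exercise can lower blood pressure and improve cardiovascular health."},
    {"text": "A balanced diet rich in fruits and vegetables supports immune function."},
    {"text": "Type 2 diabetes management typically combines diet, exercise, and medication."},
]

# Embed each article with the same model used at query time, then insert
for article in sample_articles:
    article["embedding"] = embedding_model.encode(article["text"]).tolist()

collection.insert_many(sample_articles)
```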
This guide assumes you have:
- A MongoDB Atlas cluster set up
- A database named “medical_db” with a collection named “articles”
- A vector search index named “vector_index” configured on the “articles” collection with the following configuration:
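For reference, an Atlas Vector Search index definition consistent with the schema sketched above might look like the following; the "embedding" path is an assumption, and 384 dimensions matches the output size of all-MiniLM-L6-v2:

```json
{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 384,
      "similarity": "cosine"
    }
  ]
}
```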
If you haven’t set up these prerequisites, please refer to MongoDB Atlas’ documentation, or feel free to follow along with your pre-existing vector DB or external retrieval system!
Implementing the RAG Pipeline
Let’s build the actual RAG pipeline. Our main function will be rag_pipeline, which will call get_relevant_docs followed by generate_response.
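A sketch of the pipeline, building on the clients defined during setup. It assumes HoneyHive’s trace decorator and a metrics keyword argument on enrich_span/enrich_session, that the evaluate harness passes each datapoint’s inputs (and ground truths) to rag_pipeline, and that retrieval_relevance is the helper defined in the evaluators section below; treat the exact signatures and the model choice as assumptions:

```python
from honeyhive import trace, enrich_span, enrich_session


@trace
def get_relevant_docs(query: str, k: int = 3) -> list[str]:
    """Retrieve the k most relevant documents via Atlas Vector Search."""
    query_embedding = embedding_model.encode(query).tolist()
    results = collection.aggregate([
        {
            "$vectorSearch": {
                "index": "vector_index",
                "path": "embedding",
                "queryVector": query_embedding,
                "numCandidates": 50,
                "limit": k,
            }
        }
    ])
    docs = [doc["text"] for doc in results]

    # Span-level metric: how relevant the retrieved documents are to the query
    # (retrieval_relevance is defined under "Defining the Evaluators" below)
    enrich_span(metrics={"retrieval_relevance": retrieval_relevance(query, docs)})
    return docs


@trace
def generate_response(query: str, docs: list[str]) -> str:
    """Generate an answer grounded in the retrieved documents."""
    context = "\n".join(docs)
    completion = openai_client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": "Answer the question using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return completion.choices[0].message.content


def rag_pipeline(inputs: dict, ground_truths: dict | None = None) -> str:
    """End-to-end pipeline; receives each datapoint’s inputs (and ground truths) from the harness."""
    query = inputs["query"]
    docs = get_relevant_docs(query)
    response = generate_response(query, docs)

    # Session-level metrics describing the overall pipeline run
    enrich_session(metrics={
        "num_retrieved_docs": len(docs),
        "query_length": len(query.split()),
    })
    return response
```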
Note how the example above enriches our traces with session- and span-level metrics using HoneyHive’s enrich_session and enrich_span methods.
Creating the dataset
Let’s define our sample dataset with the desired inputs and associated ground_truths:
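A small illustrative dataset in the shape we assume the evaluate harness expects (a list of datapoints, each with an inputs dict and a ground_truths dict); the questions and answers are placeholders:

```python
dataset = [
    {
        "inputs": {"query": "How does exercise affect blood pressure?"},
        "ground_truths": {"answer": "Regular aerobic exercise helps lower blood pressure."},
    },
    {
        "inputs": {"query": "What lifestyle changes help manage type 2 diabetes?"},
        "ground_truths": {"answer": "Diet, regular exercise, and prescribed medication help manage type 2 diabetes."},
    },
    {
        "inputs": {"query": "How can I improve my sleep patterns?"},
        "ground_truths": {"answer": "A consistent sleep schedule and limiting screens before bed improve sleep quality."},
    },
    {
        "inputs": {"query": "What are effective ways to manage stress?"},
        "ground_truths": {"answer": "Mindfulness, regular exercise, and adequate rest help manage stress."},
    },
]
```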
Notice that in our dataset, we have some questions that are not covered by the examples in our vector database, like questions about sleep patterns and stress management. In this simplified example, this can be easily detected. However, in real scenarios, this could be harder to identify.
Let’s see if this is reflected in our evaluation results at the end of this tutorial.
Defining the Evaluators
For the retrieval relevance evaluator, we calculate the cosine similarity between the query and each retrieved document. The final metric is the average of these similarity scores.
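A possible implementation using the cosine similarity utility from sentence-transformers; this is the retrieval_relevance helper referenced from get_relevant_docs above:

```python
from sentence_transformers import util


def retrieval_relevance(query: str, docs: list[str]) -> float:
    """Average cosine similarity between the query and each retrieved document."""
    if not docs:
        return 0.0
    query_embedding = embedding_model.encode(query)
    doc_embeddings = embedding_model.encode(docs)
    similarities = util.cos_sim(query_embedding, doc_embeddings)  # shape: (1, len(docs))
    return float(similarities.mean())
```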
For the response consistency evaluator, we assess the semantic similarity between the generated output and the ground truth. This helps determine how closely the model’s response aligns with the expected answer.
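A sketch of this evaluator, assuming the harness calls client-side evaluators with the function’s outputs plus the datapoint’s inputs and ground_truths (the argument names are assumptions):

```python
def response_consistency(outputs, inputs, ground_truths):
    """Cosine similarity between the generated response and the expected answer."""
    response_embedding = embedding_model.encode(str(outputs))
    truth_embedding = embedding_model.encode(ground_truths["answer"])
    return float(util.cos_sim(response_embedding, truth_embedding))
```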
Running the Experiment
Finally, we run the experiment over our dataset using HoneyHive’s evaluate function.
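A sketch of the experiment run; the hh_api_key, hh_project, and name parameters are assumptions about the harness’s configuration options, so adjust them to match your setup:

```python
from honeyhive import evaluate

evaluate(
    function=rag_pipeline,              # the pipeline under evaluation
    dataset=dataset,                    # inputs + ground truths defined above
    evaluators=[response_consistency],  # main evaluator, run on each output
    hh_api_key=os.environ["HH_API_KEY"],
    hh_project="rag-evaluation-demo",   # replace with your HoneyHive project name
    name="RAG pipeline experiment",
)
```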
In this tutorial, we are logging metrics in three different ways: the response consistency evaluator is the main evaluator and is passed directly to the evaluate harness, along with the function to be evaluated, rag_pipeline. The retrieval relevance metric is logged with enrich_span, as it relates to the get_relevant_docs span, whereas the pipeline evaluator metrics are logged with enrich_session, because they describe the overall session.
Results and Insights
After running the experiment, you can view the results on the Experiments page in HoneyHive:
The Experiments Dashboard
For the retrieval step, we observe that some queries resulted in low retrieval relevance. Examining the Evaluation Summary on the left, we also notice that the average response consistency (0.73) is higher than the average retrieval relevance (0.41). Let’s take a closer look at the distribution of these metrics:
Response Consistency - Distribution
Retrieval Relevance - Distribution
This suggests that while the model’s responses are generally on-topic, they may not always be grounded in the source of truth—particularly for the two examples with retrieval relevance scores below 0.25. Let’s drill down into one of these examples:
Low retrieval relevance data point
Here, we identify the root cause: in this example, queries about stress and sleep disorders had low retrieval relevance because the vector database lacked relevant documents on these topics.
Conclusion
By following this tutorial, you’ve built a multi-step RAG pipeline, integrated it with MongoDB and OpenAI, and evaluated its performance using HoneyHive. Explore the results further to uncover valuable insights and optimize your pipeline!