- Setting up a vector database in MongoDB for semantic search.
- Defining a sample dataset with inputs and corresponding ground truth values.
- Establishing evaluators to calculate similarity metrics for both document retrieval and response generation stages.
- Implementing the RAG Pipeline, which includes document retrieval and response generation stages.
- Running a comprehensive experiment using HoneyHive’s evaluation framework and analyzing the results.
Complete Example
Overview
For this tutorial, we will use the example of a medical/health question answering application. Let’s go through the main components of this example by splitting it into two parts: the RAG pipeline we wish to evaluate and the evaluators used to assess its performance.

RAG Pipeline
The pipeline consists of the following steps:

- Document Retrieval: Using MongoDB’s vector search capabilities, we retrieve the most relevant documents for a given query.
- Response Generation: Using OpenAI’s API, we generate a response based on the retrieved documents and the query.
Evaluators
- Retrieval Evaluator: This evaluator assesses the retrieval process by measuring the semantic similarity between the query and the retrieved documents.
- Response Evaluator: This evaluator measures the semantic similarity between the model’s final response and the provided ground truth for each query.
- Pipeline Evaluator: This evaluator generates basic metrics related to the overall pipeline, such as the number of retrieved documents and the query length.

[Figure: Overview of the RAG pipeline to be evaluated]
Prerequisites
To be able to run this tutorial, make sure you have the following prerequisites in place:

- A MongoDB Atlas cluster set up and ready to use.
- An OpenAI API key for model response generation.
- A HoneyHive project already created, as outlined here.
- An API key for your HoneyHive project, as explained here.
Setting Up the Environment
First, let’s install all the required libraries:
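The exact requirements may vary with your setup, but something like the following covers the pieces used in this tutorial (the HoneyHive SDK, PyMongo, the OpenAI client, and NumPy for the similarity math):

```bash
pip install honeyhive pymongo openai numpy
```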
Setup and initializations
This guide assumes you have:
- A MongoDB Atlas cluster set up
- A database named “medical_db” with a collection named “articles”
- A vector search index named “vector_index” configured on the “articles” collection with the following configuration:
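As a sketch, the client setup and index definition might look like the following. The connection string, the `embedding` field name, and the 1536-dimensional cosine index (matching OpenAI’s `text-embedding-3-small` model) are assumptions; adapt them to your own cluster and embedding model:

```python
# Minimal setup sketch: Mongo client plus the "vector_index" definition.
# The URI and index parameters below are placeholders/assumptions.
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

# Connect to the Atlas cluster and the tutorial's database/collection.
client = MongoClient("YOUR_MONGODB_URI")
collection = client["medical_db"]["articles"]

# Vector search index over the field that stores each article's embedding.
index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "embedding",     # field holding the document embeddings
                "numDimensions": 1536,   # must match your embedding model
                "similarity": "cosine",
            }
        ]
    },
    name="vector_index",
    type="vectorSearch",
)
collection.create_search_index(model=index_model)
```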
Implementing the RAG Pipeline
Let’s build the actual RAG pipeline. Our main function will be `rag_pipeline`, which will call `get_relevant_docs` followed by `generate_response`. Along the way, we attach evaluator metrics to the trace using HoneyHive’s `enrich_session` and `enrich_span` methods.
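Here is a minimal sketch of that pipeline, assuming the setup above. The model names, prompt wording, and the `text` document field are illustrative; the `@trace` decorator creates a span for each step so that `enrich_span` can attach metrics to it:

```python
# A sketch of the RAG pipeline, not the tutorial's exact code.
from honeyhive import trace, enrich_span, enrich_session
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_embedding(text):
    # Embed text with the same model used to index the collection.
    response = openai_client.embeddings.create(
        model="text-embedding-3-small", input=text
    )
    return response.data[0].embedding

@trace
def get_relevant_docs(query, top_k=3):
    # Retrieve the most relevant articles via Atlas vector search.
    cursor = collection.aggregate([
        {
            "$vectorSearch": {
                "index": "vector_index",
                "path": "embedding",
                "queryVector": get_embedding(query),
                "numCandidates": 100,
                "limit": top_k,
            }
        }
    ])
    docs = [doc["text"] for doc in cursor]  # assumes articles store a "text" field
    # Log the retrieval evaluator metric on this span
    # (retrieval_relevance is defined under "Defining the Evaluators" below).
    enrich_span(metrics={"retrieval_relevance": retrieval_relevance(query, docs)})
    return docs

@trace
def generate_response(query, docs):
    # Generate an answer grounded in the retrieved documents.
    context = "\n\n".join(docs)
    completion = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Answer using only this context:\n\n{context}\n\nQuestion: {query}",
        }],
    )
    return completion.choices[0].message.content

def rag_pipeline(inputs, ground_truths=None):
    query = inputs["query"]
    docs = get_relevant_docs(query)
    # Session-level pipeline metrics (document count, query length).
    enrich_session(metrics={"num_retrieved_docs": len(docs), "query_length": len(query)})
    return generate_response(query, docs)
```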
Creating the dataset
Let’s define our sample dataset with the desired `inputs` and associated `ground_truths`:
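The question/answer pairs below are illustrative stand-ins rather than the tutorial’s exact data; what matters is the structure, with each entry carrying `inputs` and `ground_truths`:

```python
# Illustrative sample dataset (hypothetical Q/A pairs).
dataset = [
    {
        "inputs": {"query": "What are common symptoms of type 2 diabetes?"},
        "ground_truths": {
            "response": "Increased thirst, frequent urination, fatigue, and blurred vision."
        },
    },
    {
        "inputs": {"query": "How does regular exercise affect blood pressure?"},
        "ground_truths": {
            "response": "Regular aerobic exercise can lower resting blood pressure."
        },
    },
    # Deliberately outside the vector database's coverage (see the note below):
    {
        "inputs": {"query": "How do sleep patterns change with age?"},
        "ground_truths": {
            "response": "Older adults tend to sleep less deeply and wake more often."
        },
    },
    {
        "inputs": {"query": "What are effective techniques for managing stress?"},
        "ground_truths": {
            "response": "Mindfulness, breathing exercises, and regular physical activity."
        },
    },
]
```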
Notice that our dataset contains some questions that are not covered by the examples in our vector database, like the questions about sleep patterns and stress management. In this simplified example, the gap is easy to detect; in real scenarios, it can be much harder to identify. Let’s see if this is reflected in our evaluation results at the end of this tutorial.
Defining the Evaluators
For the retrieval relevance evaluator, we calculate the cosine similarity between the query and each retrieved document; the final metric is the average of these similarity scores. For the response consistency evaluator, we assess the semantic similarity between the generated output and the ground truth, which helps determine how closely the model’s response aligns with the expected answer.
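A sketch of both evaluators, under the same assumptions as the pipeline above (embeddings via the `get_embedding` helper, and a `ground_truths` dict with a `response` field):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieval_relevance(query, docs):
    # Average query-document cosine similarity over the retrieved set;
    # called inside get_relevant_docs and logged there via enrich_span.
    query_emb = get_embedding(query)
    sims = [cosine_similarity(query_emb, get_embedding(doc)) for doc in docs]
    return sum(sims) / len(sims) if sims else 0.0

def response_consistency(outputs, inputs, ground_truths):
    # Semantic similarity between the generated answer and the ground truth;
    # passed to HoneyHive's evaluate harness as a client-side evaluator.
    return cosine_similarity(
        get_embedding(outputs),
        get_embedding(ground_truths["response"]),
    )
```

Note the split: `retrieval_relevance` is computed inside the `get_relevant_docs` span, while `response_consistency` runs after the pipeline returns, via the `evaluate` harness below.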
Running the Experiment

Finally, we define a dataset and run the experiment using HoneyHive’s `evaluate` function. We pass our dataset to the `evaluate` harness, along with the function to be evaluated, `rag_pipeline`. The retrieval evaluator metric is logged using `enrich_span`, as it relates to the `get_relevant_docs` span, whereas the pipeline evaluator metrics are logged with `enrich_session`, because they relate to the overall session.
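Putting it together, the launch looks roughly like this; the parameter names follow HoneyHive’s Python SDK (check the current SDK docs if they have changed), and the experiment name is an arbitrary label:

```python
from honeyhive import evaluate

evaluate(
    function=rag_pipeline,              # the pipeline under evaluation
    hh_api_key="YOUR_HONEYHIVE_API_KEY",
    hh_project="YOUR_PROJECT_NAME",
    name="MongoDB RAG Experiment",      # arbitrary experiment label
    dataset=dataset,
    evaluators=[response_consistency],  # client-side evaluator defined above
)
```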
Results and Insights
After running the experiment, you can view the results on the Experiments page in HoneyHive:
[Figure: The Experiments Dashboard]

[Figure: Response Consistency - Distribution]

[Figure: Retrieval Relevance - Distribution]

[Figure: Low retrieval relevance data point]