By Adrian Araya

How to Evaluate Retrieval Augmented Generation (RAG) Systems

Figure 1. Architecture of a basic RAG system

Retrieval Augmented Generation

In recent years, Retrieval Augmented Generation (RAG) systems have taken the spotlight in the world of AI and natural language processing. A RAG system is a hybrid model that combines retrieval and generation, pulling relevant information from a large knowledge base and using it to generate natural, informative answers. This mix of retrieving data and crafting responses makes RAG systems a powerful tool for applications like customer support, virtual assistants, and research aides. 


However, as these systems become more popular and widely implemented, an essential question arises: how can we ensure they're working effectively? 


Evaluating RAG systems is challenging because they need to provide accurate information while sounding natural and engaging. In this blog, we’ll tackle RAG evaluation in two parts: 


  1. Assessing the performance of the Information Retrieval component to ensure the returned chunks are relevant to the query

  2. Evaluating the complete RAG system, which includes answer generation.


For both parts, different metrics are used, which we’ll dive into as we go along.


Q&A Dataset


To evaluate a RAG system, it’s essential to start by creating a Q&A dataset. The metrics we’ll use for evaluation rely on having a structured dataset with clear questions and corresponding answers.


The structure of this dataset is simple: each entry contains a question paired with its corresponding answer. Importantly, both questions and answers must be based solely on the knowledge base/documents that the RAG system uses.
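
For reference, a minimal sketch of what such a dataset could look like in Python (the qa_dataset variable, field names, and placeholder answers are purely illustrative):

# Minimal Q&A dataset sketch: each entry pairs a question with an answer
# written strictly from the content of the knowledge base.
qa_dataset = [
    {
        "question": "In gitflow, develop must have an initial commit, how can I create it?",
        "answer": "<answer written strictly from the gitflow material in the knowledge base>",
    },
    {
        "question": "How to update Jetpack 5 (JP5) to Jetpack 6 (JP6)?",
        "answer": "<answer written strictly from the Jetpack upgrade material in the knowledge base>",
    },
]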


Carefully crafted questions and answers are highly recommended to ensure they align well with the available information. Remember, the system can only rely on this specific content to generate responses. Here are some guidelines to consider when building the dataset:


  • Use Relevant Keywords: Make sure to include as many relevant keywords as possible in each question.

    • ❌ Example: "Develop must have an initial commit, how can I create it?"

    • ✅ Example: "In gitflow, develop must have an initial commit, how can I create it?"


  • Abbreviations: If abbreviations are common, consider including both the full term and abbreviation.

    • ❌ Example: "How to update JP5 to JP6?"

    • ✅ Example: "How to update Jetpack 5 (JP5) to Jetpack 6 (JP6)?"


  • Match Answer Formatting: If the answer follows a specific format, phrase the question to prompt a similar structure.

    • Example answer: "X is an ..., for example: ..."

    • Corresponding question: "What is X, give me one example."


The dataset size is up to the evaluator, but keep in mind that additional fields will be needed for the Information Retrieval metrics, which requires manual effort; details on this process are explained next.


Information Retrieval Evaluation


The most common metrics for evaluating the retriever component in a RAG system are precision@k, recall@k, F1@k score, and MRR (Mean Reciprocal Rank). Here’s how each metric is calculated:


Precision@k


Precision can be understood as the capacity of the retriever to discriminate between relevant and irrelevant chunks. For instance, if we set k = 10, this metric evaluates how many of those top 10 chunks are relevant to the question.

This metric and recall@k depend on how the retriever returns chunks for a question. The most basic implementation is to choose a fixed k (say 5), independent of the number of available chunks, and evaluate the system with that in mind. With this approach, the precision formula is defined as follows:

Precision@k = (Number of relevant chunks in the top k results) / k

Process for Calculating Precision@k:


  1. For each question, request the most relevant chunks from the retriever.

  2. Manually assess how many of these chunks are relevant based on the corresponding answer. In this context, "relevant" means the chunk contains information that could help answer the question.

  3. Once you’ve identified the relevant chunks among the top k, apply the formula and save the result. Remember, if your system uses a similarity threshold instead of a fixed k, the value of k will vary per question.

  4. Repeat the process for all questions and calculate the average.


Example:


Suppose we have a retrieval system, and for a specific query, the system returns the top 5 chunks (k = 5). Upon reviewing these 5 chunks, we find that 3 of them are actually relevant to the query. Using the Precision@k formula:

Precision@5 = 3 / 5 = 0.6

In this case, the precision would be 0.6, or 60%, meaning that 60% of the top 5 retrieved chunks are relevant to the query.
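
As a minimal sketch, assuming you have already recorded a relevance judgment for each retrieved chunk (the relevance_flags structure below is purely illustrative), the average precision@k could be computed like this:

def precision_at_k(relevance_flags):
    # relevance_flags[i] holds one boolean per retrieved chunk for question i
    # (True if the chunk was judged relevant to that question's answer).
    per_question = [sum(flags) / len(flags) for flags in relevance_flags if flags]
    return sum(per_question) / len(per_question)

# Example: 2 questions; the first matches the worked example above (3/5 relevant).
flags = [
    [True, False, True, False, True],  # precision@5 = 0.6
    [True, True, False],               # threshold-based retriever returned 3 chunks
]
print(precision_at_k(flags))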


Important Considerations for Precision@k:


If the information needed to answer the question is only found in a single chunk, then using a fixed k (for example, k=10) would result in a precision of 1/10, or 10%. This happens because only one out of the ten retrieved chunks is actually relevant. However, this does not mean the system is performing poorly; rather, it highlights that precision may not be the best metric for assessing the retriever’s performance. Recall is generally more suitable, as it emphasizes capturing as many relevant chunks as possible—an essential goal since, in the full RAG implementation, the LLM will filter out irrelevant chunks.

 

Recall@k


Recall can be understood as the capacity of the retriever to find all relevant chunks. For instance, if we set k = 10, this metric evaluates how many of all the chunks relevant to the question appear among the top 10 retrieved results, emphasizing the system’s ability to capture all possible useful information. The value of k is determined the same way as for precision@k. The recall formula is:

Recall@k = (Number of relevant chunks in the top k results) / (Total number of relevant chunks for the question in the knowledge base)

Process for Calculating Recall@k:


The numerator is calculated the same way as for precision@k: count the relevant chunks among the top k retrieved. The denominator is the total number of relevant chunks that exist for the question across the entire knowledge base, which requires the manual review described in the considerations below. Once you calculate recall for all questions, compute the average.


Example:


Suppose we have a retrieval system, and for a specific query, there are a total of 8 relevant chunks in the database. The system retrieves the top 5 chunks (k = 5), out of which 4 are relevant. Using the Recall@k formula:

Recall@5 = 4 / 8 = 0.5

In this case, the recall would be 0.5, or 50%, which means the system was able to retrieve 50% of all the relevant chunks within the top 5 retrieved results.
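
A matching sketch for recall@k additionally needs the total number of relevant chunks that exist in the knowledge base for each question, obtained through the manual review described above (both structures below are illustrative):

def recall_at_k(relevance_flags, total_relevant):
    # relevance_flags[i]: relevance judgments for the chunks retrieved for question i.
    # total_relevant[i]:  number of relevant chunks in the whole knowledge base.
    per_question = [
        sum(flags) / total if total > 0 else 0.0
        for flags, total in zip(relevance_flags, total_relevant)
    ]
    return sum(per_question) / len(per_question)

# Example from the text: 4 relevant chunks retrieved, 8 relevant chunks in total.
print(recall_at_k([[True, True, True, False, True]], [8]))  # 0.5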


Important Considerations for Recall@k:


Calculating recall is more time-consuming because determining the "Total number of relevant items" can be challenging, especially for large knowledge bases. For this metric, access to all chunks is required, and each chunk must be assessed for relevance to each question. For instance, if your knowledge base has 100 documents, each split into 10 chunks, you’d have 1000 chunks to review per question.

 

F1 Score@k


F1 Score@k can be understood as a balanced measure that combines both precision and recall. It provides a single score to evaluate the retriever’s performance by considering both its accuracy in retrieving relevant chunks and its ability to capture all relevant information. For a given k, it essentially tells us how well the system retrieves relevant chunks without missing important ones. This metric combines precision and recall with the formula:

F1@k = 2 × (Precision@k × Recall@k) / (Precision@k + Recall@k)

Once precision and recall are obtained for each question, apply this formula to get the F1 score@k for each question, and then calculate the average.


Example:


Suppose we have a retrieval system where, for a given query, the Precision@k is 0.6 (meaning 60% of the top k chunks retrieved are relevant), and the Recall@k is 0.75 (meaning 75% of all relevant chunks were retrieved within the top k results). Using the F1 Score formula:

F1@k = 2 × (0.6 × 0.75) / (0.6 + 0.75) = 0.9 / 1.35 ≈ 0.67

In this case, the F1 Score is approximately 0.67, or 67%. This score represents a balance between the retriever’s ability to find relevant chunks (recall) and its accuracy in keeping irrelevant chunks out of the top k results (precision).
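
Given per-question precision and recall values, the F1 score@k is a one-line computation; a small helper could look like this:

def f1_at_k(precision, recall):
    # Harmonic mean of precision@k and recall@k for a single question.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f1_at_k(0.6, 0.75), 2))  # ~0.67, matching the example above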

 

MRR (Mean Reciprocal Rank)


MRR can be understood as a measure of how well the retriever ranks relevant chunks. It specifically evaluates the position of the first relevant chunk in the list of retrieved results. A higher MRR indicates that relevant information appears closer to the top of the list, which is crucial for efficient retrieval. The formula for MRR is:

MRR = (1 / |Q|) × Σ (1 / rank_i), summed over all questions i = 1, …, |Q|

Where |Q| is the total number of questions, and rank_i is the position of the first relevant chunk in the list of chunks retrieved for question i.


Steps to Calculate MRR:


  1. For each question, request the top k chunks from the retriever.

  2. Based on the corresponding answer, identify the rank of the first relevant chunk (rank_i). Here, "rank" refers to the position of the first relevant chunk in the retrieved list. For example, if the first relevant chunk appears as the 3rd item in the list, then rank_i = 3. If no relevant chunks are found, the result for that question should be 0, and you can skip to the next question.

  3. For each question, calculate 1/rank_i.

  4. Sum the results and divide by the total number of questions.


Example:


Suppose we have a retrieval system evaluating 3 different queries (|Q| = 3), and we want to calculate the MRR based on the ranks of the first relevant chunk for each query:


  • For Query 1, the first relevant chunk is at rank 2.

  • For Query 2, the first relevant chunk is at rank 3.

  • For Query 3, the first relevant chunk is at rank 1.


Using the MRR formula:

MRR = (1/3) × (1/2 + 1/3 + 1/1)

Calculating each term:

MRR = (1/3) × (0.5 + 0.333 + 1.0) = (1/3) × 1.833 ≈ 0.611

So, the MRR for these queries is approximately 0.611. This score indicates the system's effectiveness in ranking relevant chunks near the top across multiple queries.
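
A minimal MRR sketch, reusing the same kind of per-question relevance judgments (questions with no relevant chunk retrieved contribute 0 to the sum):

def mean_reciprocal_rank(relevance_flags):
    # relevance_flags[i] is the ordered list of relevance judgments for the
    # chunks retrieved for question i.
    reciprocal_ranks = []
    for flags in relevance_flags:
        if True in flags:
            reciprocal_ranks.append(1 / (flags.index(True) + 1))  # 1-based rank
        else:
            reciprocal_ranks.append(0.0)  # no relevant chunk retrieved
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Example from the text: first relevant chunk at ranks 2, 3 and 1.
flags = [
    [False, True, False],
    [False, False, True],
    [True, False, False],
]
print(round(mean_reciprocal_rank(flags), 3))  # ~0.611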


Interpreting Metrics Results:


  • Precision@k indicates the proportion of relevant chunks retrieved within the top k results, highlighting the retriever’s ability to prioritize relevant information when returning a limited set of chunks.

  • Recall@k emphasizes the ability of the information retrieval system to capture as many relevant chunks (from the total of real relevant chunks) as possible in the top results, which is important for tasks where comprehensiveness is key.

  • The F1 score balances precision and recall, giving a single measure of retrieval performance.

  • MRR shows how effectively the system orders relevant chunks, prioritizing those that contain the answer. A high MRR suggests that the most relevant chunks are ranked at the top, indicating that the embeddings and the metric used by our RAG system to retrieve chunks are working well.


As noted, these metrics do not directly use the dataset answers in the calculation formulas but instead rely on them in the analysis to determine if a chunk is relevant. In the case of metrics for evaluating the full RAG system, these answers are directly incorporated into the formulas. The next section will detail this RAG Evaluation process.

 

RAG Evaluation


To evaluate the entire RAG system, one popular set of metrics is RAGAS. An open-source implementation is available in the Python library ragas. Below, we’ll go over each of these metrics and how to calculate them, followed by a Python code example to demonstrate their usage.


Faithfulness


This metric assesses how accurately the generated answer reflects the information contained in the provided contexts (chunks). The formula is:

Faithfulness = (Number of claims in the generated answer supported by the contexts) / (Total number of claims in the generated answer)

In this formula:


  • "Claims" refers to statements or pieces of information included in the generated answer.

  • The numerator represents the number of claims in the generated answer that can be supported by the contexts.

  • The denominator represents the total number of claims in the generated answer.


This metric ensures that the system’s response is based on real and relevant information from the context.


Example:


Suppose we have a generated answer from a RAG system with 5 claims in total:


  • Claim 1: Supported by the context

  • Claim 2: Supported by the context

  • Claim 3: Not supported by the context

  • Claim 4: Supported by the context

  • Claim 5: Not supported by the context


Out of the 5 claims, 3 are verifiable and supported by the provided context, while 2 are not.

Using the Faithfulness Score formula:

Faithfulness = 3 / 5 = 0.6

In this case, the Faithfulness Score is 0.6 or 60%, indicating that 60% of the claims made in the generated answer are supported by the context provided.
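
In ragas, an LLM extracts the claims from the answer and checks each one against the retrieved contexts; the final score is simply the supported-claim ratio. A sketch of that last step, with hand-labeled verdicts standing in for the LLM's judgments:

def faithfulness_score(claim_supported):
    # Fraction of claims in the generated answer that the contexts support.
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)

# Example from the text: 3 of the 5 claims are supported by the context.
print(faithfulness_score([True, True, False, True, False]))  # 0.6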

 

Answer Relevancy


This metric measures how relevant the generated answer is to the question asked. In ragas, an LLM generates a set of artificial questions from the answer, and the metric is the average cosine similarity between the embeddings of those generated questions and the embedding of the original question. The formula is:

Answer Relevancy = (1 / N) × Σ cos(E_g_i, E_o), summed over i = 1, …, N

Where:

  • N is the number of artificial questions generated from the answer.

  • E_g_i is the embedding of the i-th generated question.

  • E_o is the embedding of the original question.


This metric helps verify that the answer aligns with the intent of the question.
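
As a rough illustration of the averaging step (ragas performs the question generation and embedding internally; the toy vectors and the answer_relevancy_score helper below are purely hypothetical):

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_relevancy_score(original_question_emb, generated_question_embs):
    # Mean cosine similarity between the original question and the questions
    # generated back from the answer (embeddings assumed precomputed).
    sims = [cosine_similarity(original_question_emb, g) for g in generated_question_embs]
    return sum(sims) / len(sims)

# Toy 3-dimensional embeddings; real embedding vectors are much larger.
q = np.array([0.1, 0.9, 0.2])
generated = [np.array([0.15, 0.85, 0.25]), np.array([0.0, 1.0, 0.1])]
print(round(answer_relevancy_score(q, generated), 3))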

 

Semantic Similarity


This metric measures how closely the generated answer resembles the reference answer (ground truth), ensuring that both share the same meaning. The formula is:

Semantic Similarity = cos(V_generated, V_reference) = (V_generated · V_reference) / (||V_generated|| × ||V_reference||)

In this formula:


  • V_generated represents the vector of the generated answer.

  • V_reference represents the vector of the correct answer (ground truth).


This metric evaluates whether the system is generating an answer that is semantically consistent with the expected answer.

 

Correctness


This metric assesses whether the generated answer is factually accurate when compared with the reference answer (ground truth). The formula used for this metric is a claim-level F1 score:

F1 = TP / (TP + 0.5 × (FP + FN))

Where:


  • "Claims" here refer to factual statements made in the generated answer.

  • TP (True Positives) represents the claims in the generated answer that also appear in the ground truth.

  • FP (False Positives) are the claims in the generated answer that are not supported by the ground truth.

  • FN (False Negatives) are the claims present in the ground truth that were not included in the generated answer.


A high F1 Score in Correctness indicates that the generated answer is factually accurate with respect to the reference answer.


Example:


Suppose we have the following information for a generated answer evaluated for correctness:


  • True Positives (TP): 6 (claims in the answer that match the ground truth)

  • False Positives (FP): 2 (claims in the answer that are not supported by the ground truth)

  • False Negatives (FN): 3 (claims in the ground truth that were not included in the answer)


Using the F1 Score formula provided:

F1 = 6 / (6 + 0.5 × (2 + 3)) = 6 / 8.5 ≈ 0.706

So, the F1 Score for answer correctness in this example is approximately 0.706, or 70.6%. This score reflects a balance between including accurate information (TP) and avoiding incorrect claims (FP) or missing claims (FN).
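
The same arithmetic as a small helper (in ragas, the TP, FP, and FN counts come from an LLM-based comparison of the answer's claims against the ground truth):

def correctness_f1(tp, fp, fn):
    # Claim-level F1 score: F1 = TP / (TP + 0.5 * (FP + FN)).
    if tp == 0:
        return 0.0
    return tp / (tp + 0.5 * (fp + fn))

print(round(correctness_f1(6, 2, 3), 3))  # ~0.706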

 

Using the RAGAS Library for RAG Evaluation


Now that we’ve outlined each metric, let’s see how we can use the ragas library to obtain these metrics. The following example code was tested with ragas==0.1.20. To install the library, run:

pip install ragas==0.1.20

Important: By default, this library uses OpenAI models (an LLM and embeddings) for its calculations, so you’ll need to create a .env file in the same directory as the script with your OpenAI API key:

OPENAI_API_KEY=<your-key>

Replace <your-key> with your actual OpenAI API key.


Here’s the example script:

from datasets import Dataset
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_similarity,
    answer_correctness,
)
from ragas import evaluate
from dotenv import load_dotenv

# Load OPENAI_API_KEY env var
load_dotenv()

data_samples = {
    "question": ["How do I configure a PostgreSQL database connection in Python?"],
    "answer": [
        "To configure a PostgreSQL database connection in Python, start by installing the `psycopg2` library. Then, create a connection string with the required details such as database name, user, password, host, and port. Finally, use `psycopg2.connect()` to establish the connection."
    ],
    "contexts": [
        [
            "To set up a PostgreSQL connection in Python, first install the `psycopg2` package, which provides the PostgreSQL adapter. The connection string should include parameters like `dbname`, `user`, `password`, `host`, and `port`. A typical connection string example is: `dbname='mydb' user='myuser' password='mypassword' host='localhost' port='5432'`. Once configured, use `psycopg2.connect()` to establish a connection, and be sure to close it when done.",
            "When working with PostgreSQL in Python, connection handling can be enhanced by specifying additional parameters in the connection string, such as connection timeout or SSL mode. Ensure that the database user has the necessary privileges and that the PostgreSQL server allows connections from the client IP address.",
            "Troubleshooting PostgreSQL connections involves checking the connection string, verifying database permissions, and ensuring that the server firewall allows connections on port 5432. Use tools like `psql` to test the connection directly, and verify that the PostgreSQL server is running and accessible from the client machine."
        ]
    ],
    "ground_truth": [
        "Install `psycopg2`, then create a connection string for PostgreSQL with database credentials and host details, and use `psycopg2.connect()` to establish the connection."
    ]
}


dataset = Dataset.from_dict(data_samples)

score = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, answer_similarity, answer_correctness],
)

print("score:", score)

Running this script produced the following score:

score: {'faithfulness': 1.0000, 'answer_relevancy': 0.9967, 'answer_similarity': 0.9381, 'answer_correctness': 0.9845}

Step-by-Step to Evaluate All Questions:


  1. Create a New Dataset: Start with the initial Q&A dataset and add the fields "contexts", "answer", and "ground_truth".

    • contexts: List of chunks returned by the retriever for each question.

    • answer: Answer generated by the RAG system for each question.

    • ground_truth: The correct answer from the original dataset.

  2. Load Data into data_samples: Follow the structure in the example script, ensuring each list entry is aligned by question. For instance, the first element in question aligns with the first elements in answer, contexts, and ground_truth. A sketch of this assembly step is shown below.
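
As a rough sketch, assuming your retriever and generator are exposed as the hypothetical functions retrieve_chunks() and generate_answer(), the data_samples dictionary could be assembled like this:

# retrieve_chunks(question) -> list[str] and generate_answer(question, chunks) -> str
# are hypothetical stand-ins for your own RAG system's retriever and generator.
data_samples = {"question": [], "answer": [], "contexts": [], "ground_truth": []}

for entry in qa_dataset:  # the Q&A dataset built earlier (question/answer pairs)
    chunks = retrieve_chunks(entry["question"])
    answer = generate_answer(entry["question"], chunks)
    data_samples["question"].append(entry["question"])
    data_samples["contexts"].append(chunks)
    data_samples["answer"].append(answer)
    data_samples["ground_truth"].append(entry["answer"])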


Interpreting the Results


  • Faithfulness: A high Faithfulness score shows that the answer is directly based on the retrieved contexts, ensuring alignment with the source material.

  • Answer Relevancy: A high Answer Relevancy score indicates that the generated answer is closely related to the question, making it likely to address the user’s query directly.

  • Semantic Similarity: This score reveals how similar the generated answer is to the ground truth, ensuring that the answer preserves the core meaning of the ideal response.

  • Correctness: A high Correctness score confirms that the generated answer is factually accurate when compared with the reference answer, suggesting reliability in the response.

 

Ready to Elevate Your RAG System’s Performance? Contact Us!


In this blog, we’ve walked through the essential steps for evaluating a Retrieval-Augmented Generation system, from creating a well-structured Q&A dataset to assessing both the retriever and the generated answers with advanced metrics. Do you need support with your RAG project or want to implement similar evaluation strategies? We’re here to help! Reach out to us at support@ridgerun.ai, and let’s discuss how we can work together to take your project to the next level.

