RAG System Building Exercise

Introduction

Hello Students! 👋 Are you ready to dive into the exciting world of Retrieval-Augmented Generation (RAG) systems? This exercise will guide you through constructing your very own RAG system, combining the power of information retrieval with state-of-the-art language models. Let’s embark on this learning adventure together! First, make sure to download the repository from this link: coling_rag_exercise

What is a RAG System?

A Retrieval-Augmented Generation (RAG) system is a powerful AI architecture that combines the strengths of large language models with external knowledge retrieval. Here’s how it works:

[Figure: RAG system architecture (image taken from BentoML)]

This approach allows the system to access and utilize vast amounts of up-to-date information without the need to retrain the entire model, leading to more accurate and informed responses.

Getting Started

This code was developed with Python 3.11, so we recommend using Python 3.11 for this exercise to avoid compatibility issues.

To begin your RAG system journey, you have two options for setting up your environment: venv or conda. We recommend conda for its flexibility.

Option 1: Using venv (if you have Python 3.11 already installed in your system)

  1. Create a new Python environment:

    python -m venv rag_env
  2. Activate the environment:

    source rag_env/bin/activate   # macOS/Linux
    rag_env\Scripts\activate      # Windows
  3. Install the required packages:

    pip install -r requirements.txt

Option 2: Using Conda

  1. If you don’t have Conda installed, download and install Miniconda or Anaconda.

  2. Create a new Conda environment with Python 3.11:

    conda create -n rag_env python=3.11
  3. Activate the Conda environment:

    conda activate rag_env
  4. Install the required packages:

    pip install -r requirements.txt

Adding your .env file

Please add a .env file in the root directory of the project with the following content, so that you can use the OpenAI API. The API key will be provided during class:

    OPENAI_KEY=<your_openai_key>
    LOCAL=True
    QUICK_DEMO=True

LOCAL lets you run the embeddings locally if True and on OpenAI’s servers if False. QUICK_DEMO lets you run the code with a smaller dataset for faster results if True, and with the full dataset if False.
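If you want to sanity-check that your .env file is picked up, here is a minimal sketch using the python-dotenv package (an assumption on our side; check requirements.txt), with the variable names taken from the file above:

    import os

    from dotenv import load_dotenv

    load_dotenv()  # reads .env from the project root

    OPENAI_KEY = os.getenv("OPENAI_KEY")
    LOCAL = os.getenv("LOCAL", "True") == "True"            # embed locally or via OpenAI
    QUICK_DEMO = os.getenv("QUICK_DEMO", "True") == "True"  # small vs. full dataset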

Provided Data

In the data folder of this project, you’ll find the PDF files of all EPFL legal documents.

Feel free to explore these files to understand the kind of data your system will be working with.

Customizing Your Data

While we’ve provided sample data to get you started, we encourage you to experiment with your own documents and questions! Feel free to replace the provided files with your own text documents and relevant questions. This will allow you to test your RAG system on a wider range of topics and scenarios. To use your own data, simply replace the files in the data folder with your own documents.

Preliminaries on Packages

The three new packages we will be using in this practical session are sqlite3, faiss, and langchain.

sqlite3

SQLite is a library that implements a small, fast SQL database engine. If you do not know how an SQL database works, do not worry: for the parts of the exercise where you are asked to implement an SQL query, simply ask the TAs for help! The main idea is to use this library as a database to store our documents and their embeddings.

If you are curious, here is a TLDR on how an SQL database works, generated by ChatGPT:

An SQL database works like a digital filing cabinet where data is stored in organized tables. Each table is like a spreadsheet, with rows and columns. Here’s the super simple breakdown:

  - Tables: Think of a table as a single file in a cabinet. Each table holds information about one topic (like “customers” or “orders”). The columns in the table represent different types of information about that topic (like “name,” “email,” or “order date”), while each row holds a unique entry (like a single customer or one specific order).
  - SQL Language: SQL (Structured Query Language) is like a set of instructions you use to interact with this filing cabinet. You can ask it to:
    - Get data with SELECT (like “Show me all the customers in New York”).
    - Add data with INSERT (like “Add a new customer to the list”).
    - Update data with UPDATE (like “Change a customer’s email address”).
    - Delete data with DELETE (like “Remove an old order”).
  - Relationships: You can connect tables to each other. For example, you might connect the “customers” table to the “orders” table to see which customer made each order. This is called a relationship, and it helps keep things organized and easy to find.

In short, an SQL database stores data in tables, and SQL is the language you use to interact with that data: searching, adding, updating, or connecting pieces of information as needed.
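To make this concrete, here is a minimal sqlite3 sketch of the kind of table you might use to store chunks and their embeddings. The table and column names are illustrative, not the ones main.py expects:

    import sqlite3

    # Hypothetical schema: one row per chunk, with the embedding
    # serialized as bytes.
    conn = sqlite3.connect("documents.db")
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS chunks (
            id INTEGER PRIMARY KEY,
            document TEXT,
            chunk_text TEXT,
            embedding BLOB
        )
    """)

    # Insert a chunk and read it back.
    cur.execute(
        "INSERT INTO chunks (document, chunk_text, embedding) VALUES (?, ?, ?)",
        ("lex_1.pdf", "Article 1: ...", b"\x00\x01"),  # placeholder values
    )
    conn.commit()
    for row in cur.execute("SELECT id, chunk_text FROM chunks"):
        print(row)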

FAISS

FAISS (Facebook AI Similarity Search) is an open-source library developed by Facebook AI Research (FAIR) that provides efficient algorithms for searching and clustering high-dimensional data such as document embeddings. We will use this library to find the most relevant document to a user query.
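Here is a tiny, self-contained demo of the basic FAISS cycle (build an index, add vectors, search). The dimension and data are arbitrary toy choices:

    import faiss
    import numpy as np

    d = 384                                            # embedding dimension (arbitrary)
    xb = np.random.random((100, d)).astype("float32")  # "document" embeddings
    xq = np.random.random((1, d)).astype("float32")    # "query" embedding

    index = faiss.IndexFlatL2(d)  # exact search with L2 distance
    index.add(xb)                 # add the document embeddings

    distances, indices = index.search(xq, 3)  # 3 nearest neighbours
    print(indices)  # positions of the closest vectors in xb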

LangChain

LangChain is a framework for developing applications powered by large language models (LLMs). In particular, LangChain makes it easy to create chat sessions with LLM agents like ChatGPT: it conveniently stores the chat history and provides simple ways to create prompt templates, without the overhead of coding all this yourself!

Your Task

Your mission is to complete the missing parts of the RAG system written in main.py. While you could fill in the file directly, we highly recommend following the detailed instructions in each subsection here. Note that there are often several ways to implement these functions (sometimes there isn’t a clear right or wrong), so feel free to choose which functionality to include.

Running the Code

To run the main script to test your implementation:

    python main.py

Tip: Each time you implement a certain portion of the task, try testing that function only.

1. Implementing efficient text chunking

In this task, you will fill in the process_pdf and chunk_document functions.

As shown in the illustration above, to build a document database that can be easily queried, we first need to vectorize the documents. To turn a document into a vector, we first chunk it into discrete parts that can be easily processed.

  1. First implement process_pdf to chunk the document. For now, you can just implement the first two TODOs.
  2. Then go to chunk_document and complete all the TODOs (one possible baseline is sketched below).
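If you are unsure where to start, a common baseline is fixed-size chunking with overlap. The sketch below is one possible approach, not the required one; the real chunk_document in main.py may take different parameters:

    def chunk_document(text, chunk_size=500, overlap=50):
        """Split text into fixed-size character chunks with overlap,
        so content cut at a boundary still appears intact in one chunk."""
        chunks = []
        start = 0
        while start < len(text):
            chunks.append(text[start:start + chunk_size])
            start += chunk_size - overlap
        return chunks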

2. Creating and managing embeddings

In this task, you will fill in the process_pdf, embed_chunks and process_and_store_chunks functions.

Next step is to store and update chunk embeddings.

  1. Go back to process_pdf and fill in the last TODO, which makes the call to process_and_store_chunks.
  2. Then go to the embed_chunks function and complete the first TODO. You can implement the OpenAI embedding method after you get it working with the local model (a sketch of the local path follows below).
  3. Finally, go to the process_and_store_chunks function and complete all the TODOs.
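As a reference, here is a minimal sketch of the local embedding path using sentence-transformers; the model name is an assumption, and the OpenAI branch is left for later, as step 2 suggests:

    from sentence_transformers import SentenceTransformer

    def embed_chunks(chunks, local=True):
        """Return one embedding vector per chunk (local path only)."""
        if local:
            model = SentenceTransformer("all-MiniLM-L6-v2")  # small, fast local model
            return model.encode(chunks)
        raise NotImplementedError("Add the OpenAI embedding call here.")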

3. Building a retrieval system using FAISS

In this task, you will fill in the create_faiss_index function.

An index is like a reference or a map that helps you find things quickly.

Imagine you have a huge list of items, and you want to find something in that list. Without an index, you would have to go through the whole list one by one. But with an index, you can go directly to the item you’re looking for, saving time.

In the context of FAISS and similar systems, an index is a structure that stores data in a way that allows for fast search. Instead of comparing every item to the one you’re looking for, the index helps you jump to the closest matches quickly.

In the next step, to match a query with a document, we implement the FAISS index using the L2 distance.

  1. After the embeddings are retrieved, create the FAISS index using the embedding dimension and the L2 distance.
  2. Then add the actual embeddings to the index (see the sketch below).
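As a reference point, a minimal version of these two steps could look like the sketch below; the exact signature in main.py may differ:

    import faiss
    import numpy as np

    def create_faiss_index(embeddings):
        """Build an L2 index from a list of equal-length vectors."""
        matrix = np.asarray(embeddings, dtype="float32")
        index = faiss.IndexFlatL2(matrix.shape[1])  # step 1: index with the right dimension
        index.add(matrix)                           # step 2: add the embeddings
        return index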

4. Integrating the retrieval system with a language model using LangChain

In this task, you will fill in the search_engine, search_tool, and run_agent_conversation functions, as well as the global lines between these functions.

Finally, to prompt the model with the retrieved document, we use LangChain, which builds the conversation by combining the relevant document and the prompt text for us!

  1. First you need to fill in the TODO in search_engine. This function takes a question, finds the most relevant chunk in the database and returns it so you can enhance the question you pass to the LLM agent. Follow the docstring specs + comments to understand what the function needs to do and return.

  2. Then implement the search_tool function. This function simply uses the search engine to find relevant information and format it in a human-readable way. It’s quite short, because the goal is to provide a function to LangChain, as shown right below the search_tool definition.

  3. Then fill in the TODOs between the search_tool and run_agent_conversation functions, which create the LangChain tool and agent objects.

  4. Finally, fill in run_agent_conversation to implement the conversation loop (an orientation sketch follows below).
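Because LangChain’s API has changed considerably across versions, take the following only as an orientation sketch of how the pieces fit together. The tool name, agent type, and model settings are assumptions; follow the TODOs in main.py for the exact interface:

    import os

    from langchain.agents import AgentType, Tool, initialize_agent
    from langchain.chat_models import ChatOpenAI
    from langchain.memory import ConversationBufferMemory

    # Wrap your search_tool function as a LangChain Tool.
    tools = [
        Tool(
            name="DocumentSearch",
            func=search_tool,  # the function you implemented above
            description="Finds the most relevant document chunk for a question.",
        )
    ]

    llm = ChatOpenAI(openai_api_key=os.getenv("OPENAI_KEY"))
    memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
    agent = initialize_agent(
        tools,
        llm,
        agent=AgentType.CHAT_CONVERSATIONAL_REACT_DESCRIPTION,
        memory=memory,
    )

    def run_agent_conversation():
        # Minimal conversation loop: read a question, let the agent answer.
        while True:
            question = input("You: ")
            if question.lower() in {"quit", "exit"}:
                break
            print("Agent:", agent.run(question))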

And voilà! You have completed your RAG pipeline.

Additional Tips

Remember, the journey is just as important as the destination. Don’t hesitate to experiment, ask questions, and learn from both successes and failures.

We’re excited to see what you’ll create! Happy coding, and may your RAG system retrieve and generate with excellence! 🚀📚🤖

Going Further (Optional)

Congratulations on building your basic RAG system! However, the journey doesn’t end here. There are several ways to improve and extend your system for better performance and user experience. Here are some areas to consider:

1. Streaming Responses

Our current implementation displays the entire message at once, which can lead to long wait times for users. Implementing a chunk-by-chunk response system using LangChain’s streaming capabilities can significantly improve the user experience. This allows users to start reading the response while the rest is still being generated.
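As a pointer, this is roughly what the callback-based approach looked like in older LangChain versions (newer versions also expose a .stream() method on the model); treat it as an assumption to verify against your installed version:

    from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
    from langchain.chat_models import ChatOpenAI

    # With streaming=True, tokens are printed as they arrive instead of
    # after the full answer is generated.
    llm = ChatOpenAI(
        streaming=True,
        callbacks=[StreamingStdOutCallbackHandler()],
    )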

2. Improving the Retrieval Process

The retrieval part of our RAG system can be enhanced in several ways, for example with smarter chunking strategies, stronger embedding models, or reranking of the retrieved chunks.

3. Fine-tuning and Domain Adaptation

Consider fine-tuning your language model on domain-specific data to improve its performance on your particular use case. This can lead to more accurate and relevant responses.

4. Implementing Feedback Mechanisms

Add a way for users to provide feedback on the system’s responses. This can help you identify areas for improvement and potentially implement a learning mechanism to enhance the system over time.

5. Exploring Multi-modal RAG

If your use case involves not just text but also images or other types of data, consider extending your RAG system to handle multiple modalities.

Remember, building a RAG system is an iterative process. Each of these improvements opens up new possibilities and challenges. Don’t hesitate to experiment and push the boundaries of what your system can do!

We’re excited to see how you’ll take your RAG system to the next level. Keep exploring, keep innovating, and most importantly, keep learning! 🚀🧠💡