This project demonstrates a Retrieval-Augmented Generation (RAG) system using hierarchical chunking to improve question-answering performance on a classic novel. The goal is to efficiently retrieve relevant context from a text and use it to enhance the answers generated by a Large Language Model (LLM).
We utilize H.G. Wells's 'The Time Machine' from Project Gutenberg (ID 35) as our primary text. This book serves as a representative dataset for evaluating the RAG system's ability to extract information and answer questions from a narrative. The objective is to demonstrate how hierarchical chunking and a vector database can provide more accurate and contextually relevant answers compared to a baseline LLM without RAG.
To set up and run this project, follow these steps:
- Python Environment: Ensure you have Python 3.8+ installed. It's recommended to use a virtual environment.

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
  ```
- Install Dependencies: Install all required Python packages.

  ```bash
  pip install -U "fsspec[http]==2024.6.1" "gcsfs==2024.6.1" "protobuf<6"
  pip install -U transformers accelerate safetensors huggingface_hub
  pip install -U qdrant-client==1.9.1 sentence-transformers==3.2.1
  pip install -U sacrebleu==2.4.2 rouge-score==0.1.2 tiktoken==0.7.0 psutil==6.0.0 datasets==3.0.1
  ```
- Hugging Face Token: Log in to the Hugging Face Hub, as it's required for downloading the Gemma model. You will be prompted to enter your token if the `HF_TOKEN` environment variable is not set.

  ```python
  import os
  from getpass import getpass
  from huggingface_hub import login

  HF_TOKEN = os.getenv("HF_TOKEN")
  if HF_TOKEN:
      login(token=HF_TOKEN, add_to_git_credential=False)
  else:
      print("Paste your HF token (starts with hf_):")
      HF_TOKEN = getpass()
      login(token=HF_TOKEN, add_to_git_credential=False)
  ```
- Download Data: The script will automatically download 'The Time Machine' from Project Gutenberg and the NarrativeQA metadata.
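For reference, a minimal sketch of the download step is shown below. The Project Gutenberg mirror URL and the output file name are assumptions; the notebook performs this step automatically and may fetch the text differently.

```python
# Minimal sketch of the data download. The Gutenberg mirror URL and output file
# name are assumptions; the notebook handles this step automatically.
import requests

GUTENBERG_URL = "https://www.gutenberg.org/cache/epub/35/pg35.txt"  # The Time Machine (ID 35)

resp = requests.get(GUTENBERG_URL, timeout=30)
resp.raise_for_status()

with open("the_time_machine_raw.txt", "w", encoding="utf-8") as f:
    f.write(resp.text)

print(f"Downloaded {len(resp.text):,} characters")
```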
This project follows a structured approach to RAG (illustrative code sketches for the main steps appear after this list):
- Data Preprocessing: The chosen book is downloaded and cleaned to remove Project Gutenberg headers/footers and normalize line breaks.
- NarrativeQA Filter: Relevant QA pairs for 'The Time Machine' are extracted from the NarrativeQA dataset to form the evaluation set.
- Hierarchical Chunking: The cleaned text is broken down into parent and child chunks for efficient retrieval. This strategy aims to capture both broad context and specific details.
  - Parent Chunk Size (PARENT_SZ): 1200 tokens
  - Parent Overlap (PARENT_OV): 150 tokens
  - Child Chunk Size (CHILD_SZ): 300 tokens
  - Child Overlap (CHILD_OV): 50 tokens
- Embeddings: Each child chunk is converted into a vector embedding using the `sentence-transformers/all-MiniLM-L6-v2` model.
- Qdrant Indexing: The child chunk embeddings are stored in an on-disk Qdrant vector database for fast similarity search.
- Retrieval Logic: Given a question, the system queries the Qdrant index to find the most relevant child chunks. These child chunks then map back to their parent chunks to assemble a rich context for the LLM.
  - Number of Child Hits (K_CHILD): 6
  - Number of Top Parents for Context (TOP_PARENTS): 3
  - Context Budget: 1400 tokens
- RAG Pipeline: The retrieved context and the question are fed to the `google/gemma-3-1b-it` model to generate an answer.
- Baseline Evaluation: Answers are also generated by the `google/gemma-3-1b-it` model directly, without any RAG context, to serve as a baseline.
- Evaluation Metrics: The generated answers from both the baseline and RAG approaches are evaluated using BLEU-4 and ROUGE-L scores against the ground-truth answers from NarrativeQA.
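The sketches below illustrate how these steps could be implemented; they are approximations under stated assumptions, not the notebook's exact code. First, data preprocessing: stripping the Project Gutenberg boilerplate and normalizing line breaks, continuing from the download sketch above. The start/end marker patterns follow the usual Gutenberg conventions and may need adjusting.

```python
# Sketch of the cleaning step: strip Project Gutenberg boilerplate and normalize
# line breaks. The marker regexes follow the usual Gutenberg conventions; adjust
# them if the downloaded edition uses different wording.
import re

def clean_gutenberg(raw_text):
    start = re.search(r"\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*", raw_text)
    end = re.search(r"\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*", raw_text)
    body = raw_text[start.end():end.start()] if start and end else raw_text

    # Re-join hard-wrapped lines within paragraphs; keep blank lines as paragraph breaks.
    paragraphs = [re.sub(r"\s+", " ", p).strip() for p in body.split("\n\n")]
    return "\n\n".join(p for p in paragraphs if p)

with open("the_time_machine_raw.txt", encoding="utf-8") as f:
    clean_text = clean_gutenberg(f.read())
```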
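NarrativeQA filtering: one way to build the evaluation set is to look up the document entry for 'The Time Machine' in the NarrativeQA metadata and keep only its QA pairs. The CSV file names and column names below (and the use of pandas) are assumptions based on the public NarrativeQA release and may need adjusting.

```python
# Sketch of building the evaluation set from the NarrativeQA metadata. The file
# names and columns (documents.csv / qaps.csv, wiki_title, document_id, question,
# answer1) are assumptions based on the public NarrativeQA release.
import pandas as pd

docs = pd.read_csv("documents.csv")
qaps = pd.read_csv("qaps.csv")

# Find the document id(s) for 'The Time Machine' and keep only their QA pairs.
tm_ids = docs[docs["wiki_title"].str.contains("Time Machine", case=False, na=False)]["document_id"]
eval_set = qaps[qaps["document_id"].isin(tm_ids)]

questions = eval_set["question"].tolist()
references = eval_set["answer1"].tolist()  # NarrativeQA provides multiple reference answers per question
print(f"{len(questions)} QA pairs for 'The Time Machine'")
```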
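Hierarchical chunking: the sketch below builds parent chunks and then splits each parent into overlapping child chunks using the configured sizes. Counting tokens with tiktoken's `cl100k_base` encoding is an assumption; the notebook may use a different tokenizer.

```python
# Sketch of parent/child chunking with the configured sizes. `clean_text` comes
# from the cleaning sketch above; the tokenizer choice is an assumption.
import tiktoken

PARENT_SZ, PARENT_OV = 1200, 150  # parent chunk size / overlap, in tokens
CHILD_SZ, CHILD_OV = 300, 50      # child chunk size / overlap, in tokens

enc = tiktoken.get_encoding("cl100k_base")

def split_tokens(tokens, size, overlap):
    # Slide a window of `size` tokens, stepping by `size - overlap` so that
    # consecutive chunks share `overlap` tokens of context.
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

def hierarchical_chunks(text):
    tokens = enc.encode(text)
    chunks = []
    for parent_id, parent_tokens in enumerate(split_tokens(tokens, PARENT_SZ, PARENT_OV)):
        chunks.append({
            "parent_id": parent_id,
            "parent_text": enc.decode(parent_tokens),
            "children": [enc.decode(c) for c in split_tokens(parent_tokens, CHILD_SZ, CHILD_OV)],
        })
    return chunks

parents = hierarchical_chunks(clean_text)
```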
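Embedding, indexing, and retrieval: child chunks are embedded with all-MiniLM-L6-v2 and stored in an on-disk Qdrant collection; at query time the top child hits are mapped back to their parents, which are assembled into a token-budgeted context. The collection name, the parent-scoring rule (summing child similarity scores), and the budget handling are assumptions.

```python
# Sketch of indexing child chunks in Qdrant and retrieving parent context for a
# question. The collection name and parent-scoring rule are assumptions; `parents`
# and `enc` come from the chunking sketch above.
from collections import defaultdict

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

K_CHILD, TOP_PARENTS, CONTEXT_BUDGET = 6, 3, 1400

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
client = QdrantClient(path="qdrant_data")  # on-disk storage
client.recreate_collection(
    collection_name="time_machine_children",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),  # MiniLM-L6-v2 is 384-d
)

# Index every child chunk, remembering which parent it belongs to.
points, next_id = [], 0
for parent in parents:
    for child in parent["children"]:
        points.append(PointStruct(
            id=next_id,
            vector=embedder.encode(child).tolist(),
            payload={"parent_id": parent["parent_id"], "text": child},
        ))
        next_id += 1
client.upsert(collection_name="time_machine_children", points=points)

def retrieve_context(question):
    hits = client.search(
        collection_name="time_machine_children",
        query_vector=embedder.encode(question).tolist(),
        limit=K_CHILD,
    )
    # Rank parents by the summed similarity of their retrieved children.
    parent_scores = defaultdict(float)
    for hit in hits:
        parent_scores[hit.payload["parent_id"]] += hit.score
    best_parents = sorted(parent_scores, key=parent_scores.get, reverse=True)[:TOP_PARENTS]

    # Assemble parent texts until the token budget is exhausted.
    pieces, used = [], 0
    for pid in best_parents:
        tokens = enc.encode(parents[pid]["parent_text"])[: CONTEXT_BUDGET - used]
        if not tokens:
            break
        pieces.append(enc.decode(tokens))
        used += len(tokens)
    return "\n\n".join(pieces)
```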
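Answer generation and scoring: both the RAG and baseline answers come from `google/gemma-3-1b-it` and are scored with sacrebleu and rouge-score. The prompt wording, greedy decoding, and per-example ROUGE-L averaging are assumptions; the notebook's exact settings may differ.

```python
# Sketch of answer generation (RAG vs. baseline) and evaluation. The prompt format,
# decoding settings, and metric configuration are assumptions. `retrieve_context`,
# `questions`, and `references` come from the sketches above.
import sacrebleu
from rouge_score import rouge_scorer
from transformers import pipeline

generator = pipeline("text-generation", model="google/gemma-3-1b-it", device_map="auto")

def answer(question, context=None, max_new_tokens=64):
    prompt = (f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
              if context else f"Question: {question}\nAnswer:")
    out = generator(prompt, max_new_tokens=max_new_tokens, do_sample=False,
                    return_full_text=False)
    return out[0]["generated_text"].strip()

def evaluate(predictions, refs):
    # Corpus-level BLEU-4 (sacrebleu) and mean per-example ROUGE-L F1 (rouge-score).
    bleu = sacrebleu.corpus_bleu(predictions, [refs]).score
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = sum(scorer.score(r, p)["rougeL"].fmeasure
                  for p, r in zip(predictions, refs)) / len(predictions)
    return {"BLEU-4": bleu, "ROUGE-L": rouge_l}

baseline_preds = [answer(q) for q in questions]
rag_preds = [answer(q, context=retrieve_context(q)) for q in questions]

print("Baseline:", evaluate(baseline_preds, references))
print("RAG:     ", evaluate(rag_preds, references))
```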
The evaluation results highlight the benefits of using RAG with hierarchical chunking:
| Approach | BLEU-4 | ROUGE-L |
|---|---|---|
| Baseline | 0.11 | 5.11 |
| RAG | 0.21 | 8.81 |
These metrics show that the RAG implementation roughly doubles both BLEU-4 and ROUGE-L relative to the standalone LLM, producing more accurate and contextually relevant answers.
Further details are explained in the report.