This project demonstrates a Retrieval-Augmented Generation (RAG) system using hierarchical chunking to improve question-answering performance on a classic novel. The goal is to efficiently retrieve relevant context from a text and use it to enhance the answers generated by a Large Language Model (LLM).
We utilize H.G. Wells's 'The Time Machine' from Project Gutenberg (ID 35) as our primary text. This book serves as a representative dataset for evaluating the RAG system's ability to extract information and answer questions from a narrative. The objective is to demonstrate how hierarchical chunking and a vector database can provide more accurate and contextually relevant answers compared to a baseline LLM without RAG.
To set up and run this project, follow these steps:
- Python Environment: Ensure you have Python 3.8+ installed. It's recommended to use a virtual environment.

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
  ```
- Install Dependencies: Install all required Python packages.

  ```bash
  pip install -U "fsspec[http]==2024.6.1" "gcsfs==2024.6.1" "protobuf<6"
  pip install -U transformers accelerate safetensors huggingface_hub
  pip install -U qdrant-client==1.9.1 sentence-transformers==3.2.1
  pip install -U sacrebleu==2.4.2 rouge-score==0.1.2 tiktoken==0.7.0 psutil==6.0.0 datasets==3.0.1
  ```
- Hugging Face Token: Log in to the Hugging Face Hub, as it's required for downloading the Gemma model. You will be prompted to enter your token if the `HF_TOKEN` environment variable is not set.

  ```python
  import os
  from getpass import getpass
  from huggingface_hub import login

  HF_TOKEN = os.getenv("HF_TOKEN")
  if HF_TOKEN:
      login(token=HF_TOKEN, add_to_git_credential=False)
  else:
      print("Paste your HF token (starts with hf_):")
      HF_TOKEN = getpass()
      login(token=HF_TOKEN, add_to_git_credential=False)
  ```
- Download Data: The script will automatically download 'The Time Machine' from Project Gutenberg and the NarrativeQA metadata.
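For reference, a minimal sketch of the download step is shown below. The Project Gutenberg mirror URL and the output file name are assumptions; the notebook performs this step automatically and may fetch the text differently.

```python
# Minimal sketch of the data download. The Gutenberg mirror URL and output file
# name are assumptions; the notebook handles this step automatically.
import requests

GUTENBERG_URL = "https://www.gutenberg.org/cache/epub/35/pg35.txt"  # The Time Machine (ID 35)

resp = requests.get(GUTENBERG_URL, timeout=30)
resp.raise_for_status()

with open("the_time_machine_raw.txt", "w", encoding="utf-8") as f:
    f.write(resp.text)

print(f"Downloaded {len(resp.text):,} characters")
```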
This project follows a structured approach to RAG (illustrative code sketches for the main steps appear after this list):
- Data Preprocessing: The chosen book is downloaded and cleaned to remove Project Gutenberg headers/footers and normalize line breaks.
- NarrativeQA Filter: Relevant QA pairs for 'The Time Machine' are extracted from the NarrativeQA dataset to form the evaluation set.
- Hierarchical Chunking: The cleaned text is broken down into parent and child chunks for efficient retrieval. This strategy aims to capture both broad context and specific details.
  - Parent Chunk Size (PARENT_SZ): 1200 tokens
  - Parent Overlap (PARENT_OV): 150 tokens
  - Child Chunk Size (CHILD_SZ): 300 tokens
  - Child Overlap (CHILD_OV): 50 tokens
- Embeddings: Each child chunk is converted into a vector embedding using the `sentence-transformers/all-MiniLM-L6-v2` model.
- Qdrant Indexing: The child chunk embeddings are stored in an on-disk Qdrant vector database for fast similarity search.
- Retrieval Logic: Given a question, the system queries the Qdrant index to find the most relevant child chunks. These child chunks then map back to their parent chunks to assemble a rich context for the LLM.
  - Number of Child Hits (K_CHILD): 6
  - Number of Top Parents for Context (TOP_PARENTS): 3
  - Context Budget: 1400 tokens
- RAG Pipeline: The retrieved context and the question are fed to the `google/gemma-3-1b-it` model to generate an answer.
- Baseline Evaluation: Answers are also generated by the `google/gemma-3-1b-it` model directly, without any RAG context, to serve as a baseline.
- Evaluation Metrics: The generated answers from both the baseline and RAG approaches are evaluated using BLEU-4 and ROUGE-L scores against the ground-truth answers from NarrativeQA.
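The sketches below illustrate how these steps could be implemented; they are approximations under stated assumptions, not the notebook's exact code. First, data preprocessing: stripping the Project Gutenberg boilerplate and normalizing line breaks, continuing from the download sketch above. The start/end marker patterns follow the usual Gutenberg conventions and may need adjusting.

```python
# Sketch of the cleaning step: strip Project Gutenberg boilerplate and normalize
# line breaks. The marker regexes follow the usual Gutenberg conventions; adjust
# them if the downloaded edition uses different wording.
import re

def clean_gutenberg(raw_text):
    start = re.search(r"\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*", raw_text)
    end = re.search(r"\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*", raw_text)
    body = raw_text[start.end():end.start()] if start and end else raw_text

    # Re-join hard-wrapped lines within paragraphs; keep blank lines as paragraph breaks.
    paragraphs = [re.sub(r"\s+", " ", p).strip() for p in body.split("\n\n")]
    return "\n\n".join(p for p in paragraphs if p)

with open("the_time_machine_raw.txt", encoding="utf-8") as f:
    clean_text = clean_gutenberg(f.read())
```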
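NarrativeQA filtering: one way to build the evaluation set is to look up the document entry for 'The Time Machine' in the NarrativeQA metadata and keep only its QA pairs. The CSV file names and column names below (and the use of pandas) are assumptions based on the public NarrativeQA release and may need adjusting.

```python
# Sketch of building the evaluation set from the NarrativeQA metadata. The file
# names and columns (documents.csv / qaps.csv, wiki_title, document_id, question,
# answer1) are assumptions based on the public NarrativeQA release.
import pandas as pd

docs = pd.read_csv("documents.csv")
qaps = pd.read_csv("qaps.csv")

# Find the document id(s) for 'The Time Machine' and keep only their QA pairs.
tm_ids = docs[docs["wiki_title"].str.contains("Time Machine", case=False, na=False)]["document_id"]
eval_set = qaps[qaps["document_id"].isin(tm_ids)]

questions = eval_set["question"].tolist()
references = eval_set["answer1"].tolist()  # NarrativeQA provides multiple reference answers per question
print(f"{len(questions)} QA pairs for 'The Time Machine'")
```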
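Hierarchical chunking: the sketch below builds parent chunks and then splits each parent into overlapping child chunks using the configured sizes. Counting tokens with tiktoken's `cl100k_base` encoding is an assumption; the notebook may use a different tokenizer.

```python
# Sketch of parent/child chunking with the configured sizes. `clean_text` comes
# from the cleaning sketch above; the tokenizer choice is an assumption.
import tiktoken

PARENT_SZ, PARENT_OV = 1200, 150  # parent chunk size / overlap, in tokens
CHILD_SZ, CHILD_OV = 300, 50      # child chunk size / overlap, in tokens

enc = tiktoken.get_encoding("cl100k_base")

def split_tokens(tokens, size, overlap):
    # Slide a window of `size` tokens, stepping by `size - overlap` so that
    # consecutive chunks share `overlap` tokens of context.
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

def hierarchical_chunks(text):
    tokens = enc.encode(text)
    chunks = []
    for parent_id, parent_tokens in enumerate(split_tokens(tokens, PARENT_SZ, PARENT_OV)):
        chunks.append({
            "parent_id": parent_id,
            "parent_text": enc.decode(parent_tokens),
            "children": [enc.decode(c) for c in split_tokens(parent_tokens, CHILD_SZ, CHILD_OV)],
        })
    return chunks

parents = hierarchical_chunks(clean_text)
```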
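Embedding, indexing, and retrieval: child chunks are embedded with all-MiniLM-L6-v2 and stored in an on-disk Qdrant collection; at query time the top child hits are mapped back to their parents, which are assembled into a token-budgeted context. The collection name, the parent-scoring rule (summing child similarity scores), and the budget handling are assumptions.

```python
# Sketch of indexing child chunks in Qdrant and retrieving parent context for a
# question. The collection name and parent-scoring rule are assumptions; `parents`
# and `enc` come from the chunking sketch above.
from collections import defaultdict

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

K_CHILD, TOP_PARENTS, CONTEXT_BUDGET = 6, 3, 1400

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
client = QdrantClient(path="qdrant_data")  # on-disk storage
client.recreate_collection(
    collection_name="time_machine_children",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),  # MiniLM-L6-v2 is 384-d
)

# Index every child chunk, remembering which parent it belongs to.
points, next_id = [], 0
for parent in parents:
    for child in parent["children"]:
        points.append(PointStruct(
            id=next_id,
            vector=embedder.encode(child).tolist(),
            payload={"parent_id": parent["parent_id"], "text": child},
        ))
        next_id += 1
client.upsert(collection_name="time_machine_children", points=points)

def retrieve_context(question):
    hits = client.search(
        collection_name="time_machine_children",
        query_vector=embedder.encode(question).tolist(),
        limit=K_CHILD,
    )
    # Rank parents by the summed similarity of their retrieved children.
    parent_scores = defaultdict(float)
    for hit in hits:
        parent_scores[hit.payload["parent_id"]] += hit.score
    best_parents = sorted(parent_scores, key=parent_scores.get, reverse=True)[:TOP_PARENTS]

    # Assemble parent texts until the token budget is exhausted.
    pieces, used = [], 0
    for pid in best_parents:
        tokens = enc.encode(parents[pid]["parent_text"])[: CONTEXT_BUDGET - used]
        if not tokens:
            break
        pieces.append(enc.decode(tokens))
        used += len(tokens)
    return "\n\n".join(pieces)
```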
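Answer generation and scoring: both the RAG and baseline answers come from `google/gemma-3-1b-it` and are scored with sacrebleu and rouge-score. The prompt wording, greedy decoding, and per-example ROUGE-L averaging are assumptions; the notebook's exact settings may differ.

```python
# Sketch of answer generation (RAG vs. baseline) and evaluation. The prompt format,
# decoding settings, and metric configuration are assumptions. `retrieve_context`,
# `questions`, and `references` come from the sketches above.
import sacrebleu
from rouge_score import rouge_scorer
from transformers import pipeline

generator = pipeline("text-generation", model="google/gemma-3-1b-it", device_map="auto")

def answer(question, context=None, max_new_tokens=64):
    prompt = (f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
              if context else f"Question: {question}\nAnswer:")
    out = generator(prompt, max_new_tokens=max_new_tokens, do_sample=False,
                    return_full_text=False)
    return out[0]["generated_text"].strip()

def evaluate(predictions, refs):
    # Corpus-level BLEU-4 (sacrebleu) and mean per-example ROUGE-L F1 (rouge-score).
    bleu = sacrebleu.corpus_bleu(predictions, [refs]).score
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = sum(scorer.score(r, p)["rougeL"].fmeasure
                  for p, r in zip(predictions, refs)) / len(predictions)
    return {"BLEU-4": bleu, "ROUGE-L": rouge_l}

baseline_preds = [answer(q) for q in questions]
rag_preds = [answer(q, context=retrieve_context(q)) for q in questions]

print("Baseline:", evaluate(baseline_preds, references))
print("RAG:     ", evaluate(rag_preds, references))
```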
The evaluation results highlight the benefits of using RAG with hierarchical chunking:
| Approach | BLEU-4 | ROUGE-L |
|---|---|---|
| Baseline | 0.11 | 5.11 |
| RAG | 0.21 | 8.81 |
These metrics show that the RAG implementation roughly doubles both BLEU-4 and ROUGE-L relative to the standalone LLM, producing more accurate and contextually relevant answers.
Further details are explained in the report.