Advanced RAG: Hierarchical Retrieval from Literature on Constrained Resources

This project demonstrates a Retrieval-Augmented Generation (RAG) system using hierarchical chunking to improve question-answering performance on a classic novel. The goal is to efficiently retrieve relevant context from a text and use it to enhance the answers generated by a Large Language Model (LLM).

Chosen Book: 'The Time Machine' (Gutenberg ID 35)

We utilize H.G. Wells's 'The Time Machine' from Project Gutenberg (ID 35) as our primary text. This book serves as a representative dataset for evaluating the RAG system's ability to extract information and answer questions from a narrative. The objective is to demonstrate how hierarchical chunking and a vector database can provide more accurate and contextually relevant answers compared to a baseline LLM without RAG.

Setup Instructions

To set up and run this project, follow these steps:

  1. Python Environment: Ensure you have Python 3.8+ installed. It's recommended to use a virtual environment.

    python -m venv venv
    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
  2. Install Dependencies: Install all required Python packages.

    pip install -U "fsspec[http]==2024.6.1" "gcsfs==2024.6.1" "protobuf<6"
    pip install -U transformers accelerate safetensors huggingface_hub
    pip install -U qdrant-client==1.9.1 sentence-transformers==3.2.1
    pip install -U sacrebleu==2.4.2 rouge-score==0.1.2 tiktoken==0.7.0 psutil==6.0.0 datasets==3.0.1
  3. Hugging Face Token: Log in to the Hugging Face Hub; a token is required to download the Gemma model. If the HF_TOKEN environment variable is not set, you will be prompted to paste your token.

    import os
    from getpass import getpass

    from huggingface_hub import login

    HF_TOKEN = os.getenv("HF_TOKEN")
    if not HF_TOKEN:
        print("Paste your HF token (starts with hf_):")
        HF_TOKEN = getpass()
    login(token=HF_TOKEN, add_to_git_credential=False)
  4. Download Data: The script automatically downloads 'The Time Machine' from Project Gutenberg together with the NarrativeQA metadata (a sketch of this step follows the list).
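
A minimal sketch of what the download and cleanup step does, assuming the standard Project Gutenberg plain-text URL for ID 35 and the usual `*** START/END OF ... ***` marker lines; the script's exact URL and normalization rules may differ:

    import re
    from urllib.request import urlopen

    # Plain-text edition of 'The Time Machine' (Gutenberg ID 35); URL assumed.
    GUTENBERG_URL = "https://www.gutenberg.org/cache/epub/35/pg35.txt"
    raw = urlopen(GUTENBERG_URL).read().decode("utf-8")

    # Strip the Project Gutenberg header/footer between the marker lines.
    start = re.search(r"\*\*\* START OF.*?\*\*\*", raw)
    end = re.search(r"\*\*\* END OF.*?\*\*\*", raw)
    body = raw[start.end():end.start()] if start and end else raw

    # Normalize line breaks: unify newlines, collapse blank runs,
    # then unwrap hard-wrapped lines while keeping paragraph breaks.
    text = re.sub(r"\r\n?", "\n", body)
    text = re.sub(r"\n{3,}", "\n\n", text)
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text).strip()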

Step-by-Step Guide

This project follows a structured approach to RAG:

  1. Data Preprocessing: The chosen book is downloaded and cleaned to remove Project Gutenberg headers/footers and normalize line breaks.
  2. NarrativeQA Filter: Relevant QA pairs for 'The Time Machine' are extracted from the NarrativeQA dataset to form the evaluation set.
  3. Hierarchical Chunking: The cleaned text is split into overlapping parent and child chunks (see the chunking sketch after this list), so that parents capture broad context while children pinpoint specific details.
    • Parent Chunk Size (PARENT_SZ): 1200 tokens
    • Parent Overlap (PARENT_OV): 150 tokens
    • Child Chunk Size (CHILD_SZ): 300 tokens
    • Child Overlap (CHILD_OV): 50 tokens
  4. Embeddings: Each child chunk is encoded into a vector with the sentence-transformers/all-MiniLM-L6-v2 model (sketched below, together with indexing).
  5. Qdrant Indexing: The child chunk embeddings are stored in an on-disk Qdrant vector database for fast similarity search.
  6. Retrieval Logic: Given a question, the system queries the Qdrant index for the most relevant child chunks, then maps those children back to their parent chunks to assemble a richer context for the LLM (see the retrieval sketch below).
    • Number of Child Hits (K_CHILD): 6
    • Number of Top Parents for Context (TOP_PARENTS): 3
    • Context Budget (tokens): 1400
  7. RAG Pipeline: The retrieved context and the question are fed to the google/gemma-3-1b-it model to generate an answer (see the generation sketch below).
  8. Baseline Evaluation: Answers are also generated by the google/gemma-3-1b-it model directly, without any RAG context, to serve as a baseline.
  9. Evaluation Metrics: Answers from both the baseline and the RAG pipeline are scored with BLEU-4 and ROUGE-L against the ground-truth answers from NarrativeQA (a scoring sketch closes this section).
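
The sketches below are illustrative, not the project's exact code. First, hierarchical chunking (step 3): a sliding token window produces parents, and each parent is re-windowed into children that record their parent's ID. The tiktoken cl100k_base tokenizer and the `window` helper are assumptions; `text` is the cleaned book from the setup sketch.

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # tokenizer assumed for budgeting

    def window(tokens, size, overlap):
        """Yield overlapping windows of `size` tokens sharing `overlap` tokens."""
        step = size - overlap
        for start in range(0, max(len(tokens) - overlap, 1), step):
            yield tokens[start:start + size]

    PARENT_SZ, PARENT_OV = 1200, 150
    CHILD_SZ, CHILD_OV = 300, 50

    tokens = enc.encode(text)
    parents, children = [], []
    for p_id, p_toks in enumerate(window(tokens, PARENT_SZ, PARENT_OV)):
        parents.append({"id": p_id, "text": enc.decode(p_toks)})
        for c_toks in window(p_toks, CHILD_SZ, CHILD_OV):
            children.append({"parent_id": p_id, "text": enc.decode(c_toks)})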
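
Embedding and indexing (steps 4-5) with the pinned sentence-transformers and qdrant-client versions; the collection name and payload fields here are illustrative, not the project's actual identifiers:

    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, PointStruct, VectorParams
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    vectors = embedder.encode([c["text"] for c in children],
                              normalize_embeddings=True)

    client = QdrantClient(path="qdrant_db")  # on-disk storage, no server needed
    client.recreate_collection(
        collection_name="time_machine_children",  # illustrative name
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    )
    client.upsert(
        collection_name="time_machine_children",
        points=[
            PointStruct(id=i, vector=v.tolist(),
                        payload={"parent_id": c["parent_id"], "text": c["text"]})
            for i, (c, v) in enumerate(zip(children, vectors))
        ],
    )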
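
Retrieval (step 6), reusing the names above and the listed parameters (K_CHILD=6, TOP_PARENTS=3, 1400-token budget). Summing child scores per parent is an assumption about how parents are ranked:

    from collections import defaultdict

    def retrieve_context(question, k_child=6, top_parents=3, budget=1400):
        q_vec = embedder.encode(question, normalize_embeddings=True).tolist()
        hits = client.search(collection_name="time_machine_children",
                             query_vector=q_vec, limit=k_child)

        # Aggregate child similarity scores per parent; keep the best parents.
        scores = defaultdict(float)
        for h in hits:
            scores[h.payload["parent_id"]] += h.score
        best = sorted(scores, key=scores.get, reverse=True)[:top_parents]

        # Concatenate parent texts, truncating to stay within the token budget.
        pieces, used = [], 0
        for p_id in best:
            p_toks = enc.encode(parents[p_id]["text"])[:budget - used]
            pieces.append(enc.decode(p_toks))
            used += len(p_toks)
            if used >= budget:
                break
        return "\n\n".join(pieces)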
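
Generation (steps 7-8): the RAG and baseline runs differ only in whether retrieved context is placed in the prompt. The prompt wording below is illustrative:

    from transformers import pipeline

    generator = pipeline("text-generation", model="google/gemma-3-1b-it",
                         device_map="auto")

    def answer(question, context=None):
        if context:  # RAG: ground the answer in the retrieved parent chunks
            user = ("Answer using only the context below.\n\n"
                    f"Context:\n{context}\n\nQuestion: {question}")
        else:        # baseline: the model answers from parametric knowledge
            user = question
        out = generator([{"role": "user", "content": user}],
                        max_new_tokens=128, do_sample=False)
        return out[0]["generated_text"][-1]["content"]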
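
Scoring (step 9) with the pinned sacrebleu and rouge-score packages; averaging per-example ROUGE-L F1 is an assumption about how the reported number is aggregated. Usage would be, e.g., `evaluate(rag_answers, gold_answers)`.

    import sacrebleu
    from rouge_score import rouge_scorer

    def evaluate(predictions, references):
        # Corpus-level BLEU-4 with one reference per prediction.
        bleu = sacrebleu.corpus_bleu(predictions, [references])
        # Mean per-example ROUGE-L F1, scaled to 0-100.
        scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
        rouge = sum(scorer.score(ref, pred)["rougeL"].fmeasure
                    for pred, ref in zip(predictions, references))
        return {"BLEU-4": bleu.score,
                "ROUGE-L": 100 * rouge / len(predictions)}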

Results Summary

The evaluation results highlight the benefits of using RAG with hierarchical chunking:

    Approach    BLEU-4    ROUGE-L
    Baseline    0.11      5.11
    RAG         0.21      8.81

On both metrics the RAG pipeline roughly doubles the baseline's score, indicating that retrieved context makes the generated answers measurably closer to the NarrativeQA reference answers.

Detailed Analysis

A more detailed analysis is provided in the accompanying report.
