GitHub Issues to RAG Extractor

A universal command-line tool to extract issues and comments from any public GitHub repository and prepare them for Retrieval-Augmented Generation (RAG) pipelines. Perfect for building knowledge bases, documentation systems, and AI-powered support tools from any open-source project.

🛠️ Development Notes

LLM-Assisted Development: This project started as a weekend experiment in coding with Large Language Models. It combines human direction with AI assistance from ChatGPT o3-mini-high, Claude Sonnet 4, and Claude Opus 4, and uses MCP (Model Context Protocol) for filesystem operations. It demonstrates how modern AI tools can accelerate software development while maintaining code quality through comprehensive testing.

Tools used:

🧠 A bit of human brain and hands
🤖 ChatGPT o3-mini-high (OpenAI)
🔮 Claude Sonnet 4 (Anthropic)
⚡ Claude Opus 4 (Anthropic)
🗂️ MCP for filesystem operations

✨ Features

🔍 Issue extraction with date-range and label filtering
💬 Comment extraction for each issue with full metadata
📝 Individual Markdown files for each issue (perfect for RAG indexing)
🔄 Exponential backoff and retry for GitHub API rate limits
⏯️ Resume capability after interruptions
📊 Structured JSON outputs split into files ≤4 MiB (Cloudflare-compatible)
🧩 Text chunking (~2000 characters) with metadata manifest
📁 Organized output structure in /results directory
✅ Comprehensive unit tests for all components
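
The retry behaviour listed above can be illustrated with a minimal sketch. This is not the code in `src/utils.py`; the function name `retry_with_backoff` and all of its parameters are hypothetical, and the doubling schedule with a small jitter is just one common way to implement exponential backoff.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is reached.

    The wait doubles after each failure (base_delay, 2 * base_delay, ...),
    is capped at max_delay, and gets up to 10% random jitter added.
    Hypothetical helper, not the project's actual implementation.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay + random.uniform(0, delay * 0.1))
```

Passing `sleep` as a parameter keeps the helper easy to unit-test, since a test can record the requested delays instead of actually waiting.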

🗂️ Output Structure

results/
├── json/                          # All JSON data
│   ├── all_issues.json           # Raw issues data
│   ├── all_comments.json         # Raw comments data  
│   ├── merged_*.json             # Combined issue+comments (chunked)
│   └── corpus_*.json             # Text chunks for RAG indexing
├── markdown/                      # Individual issue files
│   ├── 0001-issue-title.md       # Issue #1 with metadata & comments
│   ├── 0002-another-issue.md     # Issue #2 with metadata & comments
│   └── ...
└── manifest.csv                  # Index mapping chunks to files
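
As a rough illustration of how numbered files such as `merged_*.json` could be kept under a size limit, the sketch below groups records into batches whose JSON encoding stays within a byte budget. The function name `split_by_size` is hypothetical, and the project may split its files differently.

```python
import json


def split_by_size(records, max_bytes=4 * 1024 * 1024):
    """Group records into batches whose JSON encoding fits in max_bytes.

    Sizes assume each batch is later written with
    json.dumps(batch, ensure_ascii=False) and default separators.
    A single record larger than max_bytes still gets its own batch.
    """
    batches, current, size = [], [], 2  # 2 bytes for the enclosing "[]"
    for record in records:
        encoded = len(json.dumps(record, ensure_ascii=False).encode("utf-8"))
        extra = encoded + (2 if current else 0)  # ", " between items
        if current and size + extra > max_bytes:
            batches.append(current)
            current, size, extra = [], 2, encoded
        current.append(record)
        size += extra
    if current:
        batches.append(current)
    return batches
```

Each batch can then be written to its own numbered file. Measuring the UTF-8 byte length rather than the character count matters because issue text often contains non-ASCII characters.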

📂 Repository Structure

├── LICENSE                       # MIT License
├── README.md                     # This file
├── requirements.txt              # Python dependencies
├── config.yaml                   # Configuration file
├── run.py                        # Simple entry point
├── src/                          # Source code
│   ├── cli.py                    # CLI interface
│   ├── extractors.py             # Issue extraction logic
│   ├── comments.py               # Comment extraction logic
│   ├── rag_prep.py               # RAG preparation & chunking
│   ├── markdown_generator.py     # Markdown file generation
│   ├── output_manager.py         # Output directory management
│   ├── utils.py                  # Utilities & retry logic
│   └── state/                    # Persistent state files
└── tests/                        # Unit tests
    ├── test_extractors.py
    ├── test_comments.py
    ├── test_rag_prep.py
    └── ...

🚀 Quick Start

1. Clone & Install

git clone https://github.com/yanchuk/github-issues-rag.git
cd github-issues-rag
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt

2. Configure (Optional)

Edit config.yaml or rely on interactive prompts:

# GitHub repository (leave blank for interactive prompt)
repo: "owner/repo"  # e.g., "microsoft/vscode", "facebook/react", "tensorflow/tensorflow"
token: ""  # Personal access token (or set via prompt)

# Filters for issue extraction  
filters:
  start_date: "2023-01-01"        # Optional: YYYY-MM-DD
  end_date: "2023-12-31"          # Optional: YYYY-MM-DD  
  labels: "bug,enhancement"       # Optional: comma-separated

# RAG preparation settings
chunk_size: 2000                  # Characters per text chunk
max_file_size: 4194304           # 4 MiB per output file
log_level: "INFO"
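
To make the `chunk_size` setting concrete, here is a minimal sketch of character-based chunking with one manifest row per chunk. The names `chunk_text` and `build_manifest_rows`, the whitespace-aware splitting, and the manifest column names are assumptions for illustration, not the behaviour of `src/rag_prep.py`.

```python
def chunk_text(text, chunk_size=2000):
    """Split text into chunks of at most chunk_size characters.

    Prefers to break at the last space or newline inside the window so
    words are not cut in half; falls back to a hard cut otherwise.
    """
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            split = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if split > start:
                end = split
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        start = end
    return chunks


def build_manifest_rows(issue_number, source_file, chunks):
    """Return one manifest row per chunk (column names are illustrative)."""
    return [
        {"chunk_id": f"{issue_number}-{i}", "issue": issue_number,
         "file": source_file, "chars": len(chunk)}
        for i, chunk in enumerate(chunks)
    ]
```

Rows in this shape could be written to `manifest.csv` with `csv.DictWriter` from the standard library.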

3. Run the Pipeline

# Clean start (removes previous state)
python run.py --clean

# Resume from previous run
python run.py --resume

# With CLI arguments (overrides config.yaml)
python run.py --repo "microsoft/vscode" --labels "bug" --start-date "2024-01-01"

# Extract from popular open-source projects
python run.py --repo "facebook/react" --labels "bug,enhancement"
python run.py --repo "tensorflow/tensorflow" --start-date "2024-01-01"
python run.py --repo "kubernetes/kubernetes" --labels "kind/bug,priority/high"
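
For readers curious how `--resume` and `--clean` might work underneath, the following sketch persists progress to a small JSON file. The state layout (`next_page`, `done_issues`) and the function names are hypothetical; the actual format of the files under `src/state/` is not documented here.

```python
import json
import os
import tempfile


def load_state(path):
    """Return the saved state, or a fresh one if no state file exists."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    return {"next_page": 1, "done_issues": []}


def save_state(path, state):
    """Write state via a temp file so an interruption cannot truncate it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)  # atomic rename on the same filesystem


def clear_state(path):
    """Remove the state file, which is what a clean start would need."""
    if os.path.exists(path):
        os.remove(path)
```

Saving after every completed page means an interrupted run loses at most one page of work when it is resumed.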

4. Outputs

After completion, check the results/ directory:

ls -la results/
├── json/                    # Raw and processed JSON data
├── markdown/                # Individual issue .md files  
└── manifest.csv            # RAG indexing manifest

Example Markdown Output:

# Fix memory leak in extension host

## Issue Metadata
- **Issue Number:** #12345
- **State:** CLOSED  
- **Author:** @username
- **Created:** January 15, 2024 at 10:30 AM UTC
- **Labels:** `bug` `performance` 
- **Assignees:** @maintainer

## Issue Description
The extension host process consumes excessive memory...

## Comments (3)
### Comment 1
**Author:** @contributor  
**Posted:** January 16, 2024 at 2:15 PM UTC

I can reproduce this issue with the following steps...
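
A file name like `0001-issue-title.md` and a metadata header like the one above could be produced along these lines. This is a sketch only: the helper names and the keys of the `issue` dictionary are assumptions, not the interface of `src/markdown_generator.py`.

```python
import re


def issue_filename(number, title, max_slug_len=50):
    """Build a name like '0001-issue-title.md' from a number and title."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    slug = slug[:max_slug_len].rstrip("-") or "untitled"
    return f"{number:04d}-{slug}.md"


def render_issue(issue):
    """Render the title, metadata block, and description of one issue.

    The dictionary keys used here are illustrative.
    """
    labels = " ".join(f"`{name}`" for name in issue.get("labels", []))
    lines = [
        f"# {issue['title']}",
        "",
        "## Issue Metadata",
        f"- **Issue Number:** #{issue['number']}",
        f"- **State:** {issue['state'].upper()}",
        f"- **Author:** @{issue['author']}",
        f"- **Labels:** {labels}",
        "",
        "## Issue Description",
        issue.get("body") or "",
    ]
    return "\n".join(lines)
```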

⚙️ CLI Options

python run.py [OPTIONS]

Options:
  --repo TEXT        GitHub repository (owner/repo)
  --token TEXT       GitHub personal access token  
  --start-date TEXT  Filter issues created after YYYY-MM-DD
  --end-date TEXT    Filter issues created before YYYY-MM-DD
  --labels TEXT      Comma-separated labels to filter
  --resume           Resume from last saved state
  --clean            Remove all state and start fresh
  --help             Show this message and exit
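
The options above map naturally onto a standard-library parser. The sketch below uses `argparse` purely for illustration; the project's `src/cli.py` may use a different library. Treating `--resume` and `--clean` as mutually exclusive, validating dates, and splitting labels into a list are all assumptions.

```python
import argparse
from datetime import date, datetime


def parse_date(value):
    """Parse YYYY-MM-DD, raising an argparse-friendly error otherwise."""
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected YYYY-MM-DD, got {value!r}")


def parse_labels(value):
    """Turn 'bug, enhancement' into ['bug', 'enhancement']."""
    return [label.strip() for label in value.split(",") if label.strip()]


def build_parser():
    parser = argparse.ArgumentParser(prog="run.py")
    parser.add_argument("--repo", help="GitHub repository (owner/repo)")
    parser.add_argument("--token", help="GitHub personal access token")
    parser.add_argument("--start-date", type=parse_date)
    parser.add_argument("--end-date", type=parse_date)
    parser.add_argument("--labels", type=parse_labels, default=[])
    mode = parser.add_mutually_exclusive_group()  # assumption: not combinable
    mode.add_argument("--resume", action="store_true")
    mode.add_argument("--clean", action="store_true")
    return parser
```

Validating dates at parse time means a typo such as `2024-13-01` fails immediately instead of after the first API call.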

🧪 Testing

# Run all tests
pytest

# Run with verbose output
pytest -v

# Run specific test file
pytest tests/test_extractors.py

🔧 Configuration

Environment Variables

export GITHUB_TOKEN="your_token_here"  # Alternative to interactive input
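
One way a tool could choose between a CLI argument, `config.yaml`, the `GITHUB_TOKEN` environment variable, and an interactive prompt is sketched below. The function name `resolve_token` and the precedence order are assumptions; the README only states that CLI arguments override `config.yaml`.

```python
import getpass
import os


def resolve_token(cli_token=None, config_token=None, env=None,
                  prompt=getpass.getpass):
    """Return the first non-empty token, prompting only as a last resort.

    Assumed order: CLI argument, config file, GITHUB_TOKEN, then prompt.
    """
    env = os.environ if env is None else env
    for candidate in (cli_token, config_token, env.get("GITHUB_TOKEN")):
        if candidate and candidate.strip():
            return candidate.strip()
    return prompt("GitHub token: ").strip()
```

Using `getpass.getpass` for the prompt keeps the token from being echoed to the terminal.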

GitHub Token Requirements

Scope needed: repo (for private repos) or no scopes (for public repos)
Rate limits: 5,000 requests/hour for authenticated users (60 requests/hour without a token)
Get a token: GitHub Settings > Developer settings > Personal access tokens
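
GitHub's REST API reports the remaining quota in the `x-ratelimit-remaining` and `x-ratelimit-reset` response headers, the latter as a UTC epoch timestamp. The sketch below shows how a client might turn those headers into a wait time. The function name `seconds_until_reset` is hypothetical and this is not necessarily how the tool handles rate limits.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how long to wait before the next request, in seconds.

    Returns 0.0 while requests remain in the current window or when the
    rate-limit headers are absent.
    """
    lowered = {key.lower(): value for key, value in headers.items()}
    remaining = int(lowered.get("x-ratelimit-remaining", 1))
    if remaining > 0:
        return 0.0
    reset_at = int(lowered.get("x-ratelimit-reset", 0))
    now = time.time() if now is None else now
    return max(0.0, float(reset_at - now))
```

Header names are lower-cased before lookup because HTTP header names are case-insensitive and client libraries differ in how they expose them.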

📋 Use Cases

Documentation websites - Convert GitHub issues into searchable knowledge bases
Support chatbots - Train AI models on historical issue resolutions
Project analysis - Analyze issue trends, common problems, and solutions
RAG applications - Build context-aware AI assistants for your projects
Research - Study open-source project evolution and community interactions

🤝 Contributing

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Make your changes and add tests
Run tests (pytest)
Commit changes (git commit -m 'Add amazing feature')
Push to branch (git push origin feature/amazing-feature)
Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🔗 Related Projects

GitHub API Documentation
RAG (Retrieval-Augmented Generation)
LangChain - Framework for LLM applications
ChromaDB - Vector database for RAG

⭐ Star this repo if it helped you build something awesome!
