A universal command-line tool to extract issues and comments from any public GitHub repository and prepare them for Retrieval-Augmented Generation (RAG) pipelines. Perfect for building knowledge bases, documentation systems, and AI-powered support tools from any open-source project.
LLM-Assisted Development: This project was developed as a weekend experiment in coding with Large Language Models. The development process used a combination of human creativity and AI assistance from ChatGPT o3-mini-high, Claude Sonnet 4, Claude Opus 4, and MCP (Model Context Protocol) for filesystem operations. It demonstrates how modern AI tools can accelerate software development while maintaining code quality through comprehensive testing.
Tools used:
- A bit of human brain and hands
- ChatGPT o3-mini-high (OpenAI)
- Claude Sonnet 4 (Anthropic)
- Claude Opus 4 (Anthropic)
- MCP for filesystem operations
Features:
- Issue extraction with date-range and label filtering
- Comment extraction for each issue with full metadata
- Individual Markdown files for each issue (perfect for RAG indexing)
- Exponential backoff and retry for GitHub API rate limits
- Resume capability after interruptions
- Structured JSON outputs split into files ≤4 MiB (Cloudflare-compatible)
- Text chunking (~2000 characters) with metadata manifest
- Organized output structure in the /results directory
- Comprehensive unit tests for all components
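To make the chunking feature above concrete, here is a minimal sketch that splits text into pieces of at most 2000 characters, preferring whitespace boundaries, and records one manifest row per chunk. The function names and the manifest fields (`chunk_id`, `source`, `start`, `length`) are hypothetical and are not taken from the project's own code.

```python
from typing import Dict, List


def chunk_text(text: str, chunk_size: int = 2000) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Breaks after the last whitespace inside the window when one exists,
    otherwise cuts hard at chunk_size.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks: List[str] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Look for a whitespace break point inside the current window.
            split_at = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if split_at > start:
                end = split_at + 1  # keep the whitespace with the left chunk
        chunks.append(text[start:end])
        start = end
    return chunks


def build_manifest(source: str, chunks: List[str]) -> List[Dict[str, object]]:
    """Describe each chunk by its source file and character offset (assumed schema)."""
    rows: List[Dict[str, object]] = []
    offset = 0
    for index, chunk in enumerate(chunks):
        rows.append(
            {
                "chunk_id": f"{source}#{index}",
                "source": source,
                "start": offset,
                "length": len(chunk),
            }
        )
        offset += len(chunk)
    return rows
```

Because every chunk keeps its trailing whitespace, joining the chunks reproduces the original text exactly, which makes the offsets in the manifest easy to verify.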
results/
├── json/                     # All JSON data
│   ├── all_issues.json       # Raw issues data
│   ├── all_comments.json     # Raw comments data
│   ├── merged_*.json         # Combined issue+comments (chunked)
│   └── corpus_*.json         # Text chunks for RAG indexing
├── markdown/                 # Individual issue files
│   ├── 0001-issue-title.md   # Issue #1 with metadata & comments
│   ├── 0002-another-issue.md # Issue #2 with metadata & comments
│   └── ...
└── manifest.csv              # Index mapping chunks to files
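The wildcard JSON files in this layout are split so that no single file exceeds the 4 MiB limit. One way such splitting could work is sketched below: records are grouped greedily so that each group's JSON array encoding stays within a byte budget. The function name `split_by_size` is hypothetical, and the project's actual splitting logic may differ.

```python
import json
from typing import Any, Iterable, Iterator, List

MAX_FILE_SIZE = 4 * 1024 * 1024  # 4 MiB = 4194304 bytes


def split_by_size(
    records: Iterable[Any], max_bytes: int = MAX_FILE_SIZE
) -> Iterator[List[Any]]:
    """Yield lists of records whose compact JSON array encoding fits in max_bytes.

    A single record larger than max_bytes is still emitted, alone in its own batch.
    """
    batch: List[Any] = []
    size = 2  # the enclosing "[" and "]"
    for record in records:
        encoded = len(json.dumps(record, ensure_ascii=False).encode("utf-8"))
        extra = encoded + (2 if batch else 0)  # ", " separator between items
        if batch and size + extra > max_bytes:
            yield batch
            batch, size, extra = [], 2, encoded
        batch.append(record)
        size += extra
    if batch:
        yield batch
```

Each yielded batch can then be written with `json.dump(batch, fh, ensure_ascii=False)`; the size accounting assumes those default separators and no indentation.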
├── LICENSE                   # MIT License
├── README.md                 # This file
├── requirements.txt          # Python dependencies
├── config.yaml               # Configuration file
├── run.py                    # Simple entry point
├── src/                      # Source code
│   ├── cli.py                # CLI interface
│   ├── extractors.py         # Issue extraction logic
│   ├── comments.py           # Comment extraction logic
│   ├── rag_prep.py           # RAG preparation & chunking
│   ├── markdown_generator.py # Markdown file generation
│   ├── output_manager.py     # Output directory management
│   ├── utils.py              # Utilities & retry logic
│   └── state/                # Persistent state files
└── tests/                    # Unit tests
    ├── test_extractors.py
    ├── test_comments.py
    ├── test_rag_prep.py
    └── ...
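The tree lists `utils.py` as the home of the retry logic. As an illustration of exponential backoff in general, and not a copy of that module, the sketch below retries a callable and doubles the delay after each failure, with a small random jitter. The name `retry_with_backoff` and all parameters are assumptions; a real client would probably retry only on rate-limit or transient network errors rather than on every exception.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it succeeds, waiting base_delay * 2**attempt between tries."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Up to 10% jitter so parallel clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the function easy to unit-test, since a test can record the requested delays instead of actually waiting.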
git clone https://github.com/yanchuk/github-issues-rag.git
cd github-issues-rag
python3 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
Edit config.yaml or rely on interactive prompts:
# GitHub repository (leave blank for interactive prompt)
repo: "owner/repo" # e.g., "microsoft/vscode", "facebook/react", "tensorflow/tensorflow"
token: "" # Personal access token (or set via prompt)
# Filters for issue extraction
filters:
  start_date: "2023-01-01" # Optional: YYYY-MM-DD
  end_date: "2023-12-31" # Optional: YYYY-MM-DD
  labels: "bug,enhancement" # Optional: comma-separated
# RAG preparation settings
chunk_size: 2000 # Characters per text chunk
max_file_size: 4194304 # 4 MiB per output file
log_level: "INFO"
# Clean start (removes previous state)
python run.py --clean
# Resume from previous run
python run.py --resume
# With CLI arguments (overrides config.yaml)
python run.py --repo "microsoft/vscode" --labels "bug" --start-date "2024-01-01"
# Extract from popular open-source projects
python run.py --repo "facebook/react" --labels "bug,enhancement"
python run.py --repo "tensorflow/tensorflow" --start-date "2024-01-01"
python run.py --repo "kubernetes/kubernetes" --labels "kind/bug,priority/high"
After completion, check the results/ directory:
ls -la results/
├── json/        # Raw and processed JSON data
├── markdown/    # Individual issue .md files
└── manifest.csv # RAG indexing manifest
Example Markdown Output:
# Fix memory leak in extension host
## Issue Metadata
- **Issue Number:** #12345
- **State:** CLOSED
- **Author:** @username
- **Created:** January 15, 2024 at 10:30 AM UTC
- **Labels:** `bug` `performance`
- **Assignees:** @maintainer
## Issue Description
The extension host process consumes excessive memory...
## Comments (3)
### Comment 1
**Author:** @contributor
**Posted:** January 16, 2024 at 2:15 PM UTC
I can reproduce this issue with the following steps...
python run.py [OPTIONS]
Options:
  --repo TEXT        GitHub repository (owner/repo)
  --token TEXT       GitHub personal access token
  --start-date TEXT  Filter issues created after YYYY-MM-DD
  --end-date TEXT    Filter issues created before YYYY-MM-DD
  --labels TEXT      Comma-separated labels to filter
  --resume           Resume from last saved state
  --clean            Remove all state and start fresh
  --help             Show this message and exit
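The date and label options above can be pictured as a simple filter over issue records. The sketch below is an illustration, not the project's extractor: it assumes issue dictionaries shaped like GitHub REST API responses (`created_at` as an ISO 8601 string, `labels` as a list of objects with a `name`), treats both date bounds as inclusive, and requires every listed label to be present. The function names are hypothetical.

```python
from datetime import date
from typing import Any, Dict, Iterable, List, Optional


def parse_labels(raw: str) -> List[str]:
    """Turn 'bug, enhancement' into ['bug', 'enhancement'], dropping empty items."""
    return [label.strip() for label in raw.split(",") if label.strip()]


def filter_issues(
    issues: Iterable[Dict[str, Any]],
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
    labels: str = "",
) -> List[Dict[str, Any]]:
    """Keep issues created within [start_date, end_date] that carry all labels."""
    start = date.fromisoformat(start_date) if start_date else None
    end = date.fromisoformat(end_date) if end_date else None
    wanted = set(parse_labels(labels))
    selected = []
    for issue in issues:
        # Timestamps look like "2024-01-15T10:30:00Z"; the date is the first 10 characters.
        created = date.fromisoformat(issue["created_at"][:10])
        if start and created < start:
            continue
        if end and created > end:
            continue
        names = {label["name"] for label in issue.get("labels", [])}
        if wanted and not wanted <= names:  # assumption: all labels must match
            continue
        selected.append(issue)
    return selected
```

For example, `filter_issues(issues, start_date="2024-01-01", labels="bug")` would correspond to the options `--start-date "2024-01-01" --labels "bug"`.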
# Run all tests
pytest
# Run with verbose output
pytest -v
# Run specific test file
pytest tests/test_extractors.py
export GITHUB_TOKEN="your_token_here" # Alternative to interactive input
- Scope needed: repo (for private repos) or no scopes (for public repos)
- Rate limits: ~5000 requests/hour for authenticated users
- Get a token: GitHub Settings > Developer settings > Personal access tokens
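As a rough sketch of how a token and the rate limit might be handled in code, the helpers below build request headers from `GITHUB_TOKEN` and compute how long to pause once the quota is exhausted. They rely on GitHub's `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers (the latter is a UTC epoch timestamp in seconds). The function names are hypothetical, and the header lookup assumes a plain dictionary with exactly this capitalization.

```python
import os
import time
from typing import Dict, Mapping, Optional


def build_headers(token: Optional[str] = None) -> Dict[str, str]:
    """Return request headers, adding Authorization only when a token is available."""
    token = token or os.environ.get("GITHUB_TOKEN", "")
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers


def seconds_until_reset(
    response_headers: Mapping[str, str], now: Optional[float] = None
) -> float:
    """Return how long to wait before the next request; 0.0 if quota remains."""
    remaining = int(response_headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0
    reset_at = float(response_headers.get("X-RateLimit-Reset", "0"))
    current = time.time() if now is None else now
    return max(0.0, reset_at - current)
```

A caller could sleep for `seconds_until_reset(response.headers)` before retrying, provided the HTTP library exposes headers under these names.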
Use cases:
- Documentation websites - Convert GitHub issues into searchable knowledge bases
- Support chatbots - Train AI models on historical issue resolutions
- Project analysis - Analyze issue trends, common problems, and solutions
- RAG applications - Build context-aware AI assistants for your projects
- Research - Study open-source project evolution and community interactions
Contributing:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes and add tests
- Run tests (pytest)
- Commit changes (git commit -m 'Add amazing feature')
- Push to branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- GitHub API Documentation
- RAG (Retrieval-Augmented Generation)
- LangChain - Framework for LLM applications
- ChromaDB - Vector database for RAG
⭐ Star this repo if it helped you build something awesome!