yanchuk/github-issues-rag

GitHub Issues to RAG Extractor

A universal command-line tool to extract issues and comments from any public GitHub repository and prepare them for Retrieval-Augmented Generation (RAG) pipelines. Perfect for building knowledge bases, documentation systems, and AI-powered support tools from any open-source project.

🛠️ Development Notes

LLM-Assisted Development: This project was developed as a weekend experiment in coding with Large Language Models. The development process used a combination of human creativity and AI assistance from ChatGPT o3-mini-high, Claude Sonnet 4, Claude Opus 4, and MCP (Model Context Protocol) for filesystem operations. It demonstrates how modern AI tools can accelerate software development while maintaining code quality through comprehensive testing.

Tools used:

  • 🧠 A bit of human brain and hands
  • 🤖 ChatGPT o3-mini-high (OpenAI)
  • 🔮 Claude Sonnet 4 (Anthropic)
  • ⚡ Claude Opus 4 (Anthropic)
  • 🗂️ MCP for filesystem operations

✨ Features

  • 🔍 Issue extraction with date-range and label filtering
  • 💬 Comment extraction for each issue with full metadata
  • 📝 Individual Markdown files for each issue (perfect for RAG indexing)
  • 🔄 Exponential backoff and retry for GitHub API rate limits
  • ⏯️ Resume capability after interruptions
  • 📊 Structured JSON outputs split into files ≤4 MiB (Cloudflare-compatible)
  • 🧩 Text chunking (~2000 characters) with metadata manifest
  • 📁 Organized output structure in /results directory
  • ✅ Comprehensive unit tests for all components
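
The retry behaviour listed above can be illustrated with a minimal sketch. This is not the code in utils.py; the function name `retry_with_backoff` and its parameters (`max_attempts`, `base_delay`, `max_delay`) are hypothetical, chosen only to show the general exponential-backoff pattern.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is exhausted.

    Hypothetical helper: names and defaults are illustrative assumptions,
    not taken from the project's source.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The delay doubles on each attempt (1s, 2s, 4s, ...), capped at max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Up to 10% random jitter keeps parallel clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

Passing `sleep` as a parameter is a design choice that lets tests record the delays instead of actually waiting.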

🗂️ Output Structure

results/
├── json/                          # All JSON data
│   ├── all_issues.json           # Raw issues data
│   ├── all_comments.json         # Raw comments data
│   ├── merged_*.json             # Combined issue+comments (chunked)
│   └── corpus_*.json             # Text chunks for RAG indexing
├── markdown/                      # Individual issue files
│   ├── 0001-issue-title.md       # Issue #1 with metadata & comments
│   ├── 0002-another-issue.md     # Issue #2 with metadata & comments
│   └── ...
└── manifest.csv                  # Index mapping chunks to files
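
One way the size-limited `merged_*.json` and `corpus_*.json` files could be produced is sketched below. It is an illustration only: the function name `split_by_size` is hypothetical, and the project's real splitting logic may differ. The sketch groups records so that each batch, serialized as a JSON array, stays within the byte limit.

```python
import json


def split_by_size(records, max_bytes=4 * 1024 * 1024):
    """Group records into batches whose JSON encoding fits within max_bytes.

    Hypothetical helper for illustration. A single record larger than
    max_bytes still gets its own (oversized) batch.
    """
    batches, current = [], []
    current_size = 2  # bytes for the enclosing "[" and "]"
    for record in records:
        encoded = len(json.dumps(record, ensure_ascii=False).encode("utf-8"))
        # Every record after the first in a batch also costs a ", " separator.
        extra = encoded + (2 if current else 0)
        if current and current_size + extra > max_bytes:
            batches.append(current)
            current, current_size = [], 2
            extra = encoded
        current.append(record)
        current_size += extra
    if current:
        batches.append(current)
    return batches
```

Each batch can then be written with `json.dump(batch, fh, ensure_ascii=False)` to a numbered file; the size accounting above assumes the default separators and no indentation.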

📂 Repository Structure

├── LICENSE                       # MIT License
├── README.md                     # This file
├── requirements.txt              # Python dependencies
├── config.yaml                   # Configuration file
├── run.py                        # Simple entry point
├── src/                          # Source code
│   ├── cli.py                    # CLI interface
│   ├── extractors.py             # Issue extraction logic
│   ├── comments.py               # Comment extraction logic
│   ├── rag_prep.py               # RAG preparation & chunking
│   ├── markdown_generator.py     # Markdown file generation
│   ├── output_manager.py         # Output directory management
│   ├── utils.py                  # Utilities & retry logic
│   └── state/                    # Persistent state files
└── tests/                        # Unit tests
    ├── test_extractors.py
    ├── test_comments.py
    ├── test_rag_prep.py
    └── ...

🚀 Quick Start

1. Clone & Install

git clone https://github.com/yanchuk/github-issues-rag.git
cd github-issues-rag
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt

2. Configure (Optional)

Edit config.yaml or rely on interactive prompts:

# GitHub repository (leave blank for interactive prompt)
repo: "owner/repo"  # e.g., "microsoft/vscode", "facebook/react", "tensorflow/tensorflow"
token: ""  # Personal access token (or set via prompt)

# Filters for issue extraction  
filters:
  start_date: "2023-01-01"        # Optional: YYYY-MM-DD
  end_date: "2023-12-31"          # Optional: YYYY-MM-DD  
  labels: "bug,enhancement"       # Optional: comma-separated

# RAG preparation settings
chunk_size: 2000                  # Characters per text chunk
max_file_size: 4194304           # 4 MiB per output file
log_level: "INFO"
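
To make the `chunk_size` setting concrete, here is a minimal chunking sketch. It assumes a simple strategy (fixed character windows that prefer to break at whitespace); the actual logic in rag_prep.py is not shown in this README and may work differently. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=2000):
    """Split text into pieces of at most chunk_size characters.

    Illustrative sketch only: breaks at the last space or newline inside
    each window when one exists, otherwise cuts at the window edge.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Back up to the last whitespace inside the window, if any.
            split_at = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if split_at > start:
                end = split_at + 1
        chunks.append(text[start:end])
        start = end
    return chunks
```

Because no characters are dropped, joining the chunks reproduces the original text, which makes it straightforward to map a chunk back to its source issue in a manifest.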

3. Run the Pipeline

# Clean start (removes previous state)
python run.py --clean

# Resume from previous run
python run.py --resume

# With CLI arguments (overrides config.yaml)
python run.py --repo "microsoft/vscode" --labels "bug" --start-date "2024-01-01"

# Extract from popular open-source projects
python run.py --repo "facebook/react" --labels "bug,enhancement"
python run.py --repo "tensorflow/tensorflow" --start-date "2024-01-01"
python run.py --repo "kubernetes/kubernetes" --labels "kind/bug,priority/high"

4. Outputs

After completion, check the results/ directory:

ls -la results/
├── json/                    # Raw and processed JSON data
├── markdown/                # Individual issue .md files
└── manifest.csv            # RAG indexing manifest

Example Markdown Output:

# Fix memory leak in extension host

## Issue Metadata
- **Issue Number:** #12345
- **State:** CLOSED  
- **Author:** @username
- **Created:** January 15, 2024 at 10:30 AM UTC
- **Labels:** `bug` `performance` 
- **Assignees:** @maintainer

## Issue Description
The extension host process consumes excessive memory...

## Comments (3)
### Comment 1
**Author:** @contributor  
**Posted:** January 16, 2024 at 2:15 PM UTC

I can reproduce this issue with the following steps...

⚙️ CLI Options

python run.py [OPTIONS]

Options:
  --repo TEXT        GitHub repository (owner/repo)
  --token TEXT       GitHub personal access token  
  --start-date TEXT  Filter issues created after YYYY-MM-DD
  --end-date TEXT    Filter issues created before YYYY-MM-DD
  --labels TEXT      Comma-separated labels to filter
  --resume           Resume from last saved state
  --clean            Remove all state and start fresh
  --help             Show this message and exit
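
For readers who want to see how such an option set maps to code, the sketch below mirrors the documented flags with the standard-library `argparse`. The CLI library that src/cli.py actually uses is not stated in this README, so treat `build_parser` and `parse_labels` as hypothetical names in an illustrative stand-in.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="run.py")
    parser.add_argument("--repo", help="GitHub repository (owner/repo)")
    parser.add_argument("--token", help="GitHub personal access token")
    parser.add_argument("--start-date", help="Filter issues created after YYYY-MM-DD")
    parser.add_argument("--end-date", help="Filter issues created before YYYY-MM-DD")
    parser.add_argument("--labels", help="Comma-separated labels to filter")
    parser.add_argument("--resume", action="store_true",
                        help="Resume from last saved state")
    parser.add_argument("--clean", action="store_true",
                        help="Remove all state and start fresh")
    return parser


def parse_labels(raw):
    """Turn "bug, enhancement" into ["bug", "enhancement"]; empty input gives []."""
    return [label.strip() for label in (raw or "").split(",") if label.strip()]
```

For example, `build_parser().parse_args(["--repo", "microsoft/vscode", "--labels", "bug"])` yields a namespace with `repo="microsoft/vscode"` and `labels="bug"`, which `parse_labels` then converts to a list.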

🧪 Testing

# Run all tests
pytest

# Run with verbose output
pytest -v

# Run specific test file
pytest tests/test_extractors.py

🔧 Configuration

Environment Variables

export GITHUB_TOKEN="your_token_here"  # Alternative to interactive input
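
A token can therefore come from several places: the `--token` flag, the `GITHUB_TOKEN` environment variable, `config.yaml`, or an interactive prompt. The sketch below shows one possible lookup order. The precedence used here (CLI, then environment, then config, then prompt) is an assumption, as is the function name `resolve_token`; the README only states that CLI arguments override config.yaml.

```python
import getpass
import os


def resolve_token(cli_token=None, config_token=None, env=os.environ,
                  prompt=getpass.getpass):
    """Return the first non-empty token found.

    Assumed precedence (not confirmed by the project): CLI flag,
    GITHUB_TOKEN environment variable, config file, interactive prompt.
    """
    for candidate in (cli_token, env.get("GITHUB_TOKEN"), config_token):
        if candidate and candidate.strip():
            return candidate.strip()
    # getpass hides the typed token so it does not echo to the terminal.
    return prompt("GitHub token: ").strip()
```

Reading the token from the environment or a hidden prompt keeps it out of shell history, which a `--token` argument on the command line does not.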

GitHub Token Requirements

📋 Use Cases

  • Documentation websites - Convert GitHub issues into searchable knowledge bases
  • Support chatbots - Train AI models on historical issue resolutions
  • Project analysis - Analyze issue trends, common problems, and solutions
  • RAG applications - Build context-aware AI assistants for your projects
  • Research - Study open-source project evolution and community interactions

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes and add tests
  4. Run tests (pytest)
  5. Commit changes (git commit -m 'Add amazing feature')
  6. Push to branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🔗 Related Projects


⭐ Star this repo if it helped you build something awesome!
