tarskisworld/chatgpt-rag

chatgpt-rag

A collection of Python scripts and Bash wrappers for extracting, sampling, and summarizing data from large HTML and JSON files. It is designed for LLM data pipelines, research, and rapid data exploration.

Features

  • Extract readable text samples from large HTML files
  • Summarize HTML tag structure and visible content
  • Summarize and sample large JSON files
  • Chunk large JSON arrays into manageable JSONL files
  • Thin Bash wrappers for easy CLI usage
  • Output and data directories are excluded from version control via .gitignore
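As an illustration of the JSON chunking feature, the following is a minimal sketch, not the repository's actual script. The function name `write_jsonl_chunks`, the `chunk_NNNNN.jsonl` file naming, and the default chunk size are assumptions made for this example. It accepts any iterable of records, so the source can be a plain list or a streaming parser.

```python
import json
from itertools import islice
from pathlib import Path
from typing import Any, Iterable, List


def write_jsonl_chunks(records: Iterable[Any], out_dir: str, chunk_size: int = 1000) -> List[Path]:
    """Write records to numbered JSONL files with at most chunk_size records each.

    Hypothetical helper for illustration; names and file layout are assumptions.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    iterator = iter(records)
    written: List[Path] = []
    index = 0
    while True:
        # Pull the next chunk lazily so the full input never sits in memory.
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            break
        path = out / f"chunk_{index:05d}.jsonl"
        with path.open("w", encoding="utf-8") as fh:
            for record in chunk:
                # One JSON document per line is the JSONL convention.
                fh.write(json.dumps(record, ensure_ascii=False) + "\n")
        written.append(path)
        index += 1
    return written
```

For a large file whose top level is a JSON array, the records could come from `ijson.items(fh, "item")` with `fh` opened in binary mode, which yields one array element at a time. Note that ijson returns non-integer numbers as `decimal.Decimal` by default, so `json.dumps` may need a `default=` handler in that case.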

Requirements

  • Python 3.8+
  • Packages: pandas, ijson, bs4 (installed from PyPI as beautifulsoup4), lxml, chardet
  • Bash (for wrapper scripts)
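A quick way to confirm that the listed packages are importable before running any script is sketched below. The helper name `missing_modules` is hypothetical and is not part of the repository. It relies only on the standard library's `importlib.util.find_spec`.

```python
from importlib.util import find_spec
from typing import Iterable, List

# Import names taken from the requirements list above.
REQUIRED_MODULES = ("pandas", "ijson", "bs4", "lxml", "chardet")


def missing_modules(names: Iterable[str] = REQUIRED_MODULES) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

The check uses import names, so `bs4` appears here even though the package is installed as `beautifulsoup4`.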

Directory Structure

chatgpt-rag/
├── data/                # Input data (ignored by git)
│   └── raw_data/
├── outputs/             # All script outputs (ignored by git)
│   ├── html_summary/
│   ├── json_chunks/
│   ├── json_summary/
│   ├── parsed_conversations/
│   └── ...
├── scripts/
│   ├── python/          # Python scripts
│   └── bash/            # Bash wrappers
├── USAGE_GUIDE.md       # Detailed usage instructions
├── README.md            # Project overview (this file)
├── .gitignore           # Excludes data/ and outputs/
└── ...
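Because data/ and outputs/ are ignored by git, a fresh clone will not contain them. The sketch below recreates the directories named in the tree above. The function name `ensure_layout` is an assumption for illustration, and the scripts themselves may already create their own output directories.

```python
from pathlib import Path
from typing import List

# Output subdirectories named in the directory tree above.
OUTPUT_SUBDIRS = ("html_summary", "json_chunks", "json_summary", "parsed_conversations")


def ensure_layout(root: str) -> List[Path]:
    """Create data/raw_data and the outputs/ subdirectories under root if missing."""
    base = Path(root)
    targets = [base / "data" / "raw_data"]
    targets += [base / "outputs" / name for name in OUTPUT_SUBDIRS]
    for target in targets:
        # exist_ok makes the call safe to repeat on an existing checkout.
        target.mkdir(parents=True, exist_ok=True)
    return targets
```

Calling `ensure_layout(".")` from the project root would create any directories that do not exist yet and leave existing ones untouched.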

Usage

See USAGE_GUIDE.md for detailed instructions and examples for each script and wrapper.

Quick Start

  1. Place your input files in data/raw_data/.
  2. Run the desired wrapper script from the project root, e.g.:
    bash scripts/bash/html_extract_sample.sh data/raw_data/yourfile.html outputs/sample.txt
  3. Outputs will appear in the outputs/ directory.
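To give a sense of what extracting a readable text sample from HTML involves, here is a minimal standard-library sketch. It is not the code behind html_extract_sample.sh, which according to the requirements uses bs4 and lxml. The class `VisibleTextParser`, the function `extract_text_sample`, and the `max_chars` parameter are all hypothetical names chosen for this example.

```python
from html.parser import HTMLParser
from typing import List


class VisibleTextParser(HTMLParser):
    """Collect text nodes while ignoring the contents of script and style tags."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            # Collapse runs of whitespace so the sample stays compact.
            text = " ".join(data.split())
            if text:
                self.parts.append(text)


def extract_text_sample(html: str, max_chars: int = 2000) -> str:
    """Return up to max_chars of visible text, one text node per line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)[:max_chars]
```

This version reads the whole document as a string, so very large files would need to be fed to the parser in pieces, and files with unknown encodings would need detection first (the requirements list chardet, presumably for that purpose).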

Contributing

Pull requests and issues are welcome! Please ensure your code is well-documented and tested.

License

See LICENSE for details.
