chatgpt-rag

A collection of Python scripts and Bash wrappers for extracting, sampling, and summarizing data from large HTML and JSON files. Designed for LLM data pipelines, research, and rapid data exploration.

Features

Extract readable text samples from large HTML files
Summarize HTML tag structure and visible content
Summarize and sample large JSON files
Chunk large JSON arrays into manageable JSONL files
Thin Bash wrappers for easy CLI usage
Output and data directories are excluded from version control via .gitignore
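
To illustrate the JSON chunking feature, here is a minimal sketch of splitting a stream of records into fixed-size JSONL files. It is not the repository's own script: the function names (chunked, write_jsonl_chunks), the chunk_NNNN.jsonl naming scheme, and the default chunk size are all assumptions made for this example, and it uses only the standard library.

```python
import json
from itertools import islice
from pathlib import Path
from typing import Iterable, Iterator, List


def chunked(records: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield successive lists of at most `size` records (hypothetical helper)."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def write_jsonl_chunks(
    records: Iterable[dict],
    out_dir: str,
    chunk_size: int = 1000,   # assumed default, not taken from the repo
    prefix: str = "chunk",    # assumed file-name prefix
) -> List[Path]:
    """Write records to numbered JSONL files, one JSON object per line."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, batch in enumerate(chunked(records, chunk_size)):
        path = out / f"{prefix}_{i:04d}.jsonl"
        with path.open("w", encoding="utf-8") as fh:
            for rec in batch:
                fh.write(json.dumps(rec, ensure_ascii=False) + "\n")
        paths.append(path)
    return paths
```

Because write_jsonl_chunks accepts any iterable, only one chunk is held in memory at a time. For a file too large to load at once, the records could come from a streaming parser; with ijson (listed under Requirements), ijson.items(fh, "item") iterates over the elements of a top-level JSON array. Whether the repository's script works this way is not confirmed here.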

Requirements

Python 3.8+
Python packages: pandas, ijson, beautifulsoup4 (imported as bs4), lxml, chardet
Bash (for wrapper scripts)

Directory Structure

chatgpt-rag/
├── data/                # Input data (ignored by git)
│   └── raw_data/
├── outputs/             # All script outputs (ignored by git)
│   ├── html_summary/
│   ├── json_chunks/
│   ├── json_summary/
│   ├── parsed_conversations/
│   └── ...
├── scripts/
│   ├── python/          # Python scripts
│   └── bash/            # Bash wrappers
├── USAGE_GUIDE.md       # Detailed usage instructions
├── README.md            # Project overview (this file)
├── .gitignore           # Excludes data/ and outputs/
└── ...

Usage

See USAGE_GUIDE.md for detailed instructions and examples for each script and wrapper.

Quick Start

Place your input files in data/raw_data/.

Run the desired wrapper script from the project root, e.g.:

bash scripts/bash/html_extract_sample.sh data/raw_data/yourfile.html outputs/sample.txt

Outputs will appear in the outputs/ directory.
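
For readers who want a feel for what a text-sampling step like the one above involves, here is a minimal sketch in Python. It is not the code behind html_extract_sample.sh, which presumably relies on bs4 and lxml given the Requirements list; this version uses only the standard library, and the names VisibleTextParser and extract_sample and the 500-character default are hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self.parts.append(text)


def extract_sample(html: str, max_chars: int = 500) -> str:
    """Return up to max_chars of visible text, one text node per line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)[:max_chars]
```

This sketch takes the whole document as one string, so it is only suitable for illustration; for very large files the same parser can be fed in pieces by calling feed() repeatedly on successive reads.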

Contributing

Pull requests and issues are welcome! Please ensure your code is well-documented and tested.

License

See LICENSE for details.
