A collection of robust scripts and wrappers for extracting, sampling, and summarizing data from large HTML and JSON files. Designed for LLM data pipelines, research, and rapid data exploration.
- Extract readable text samples from large HTML files
- Summarize HTML tag structure and visible content
- Summarize and sample large JSON files
- Chunk large JSON arrays into manageable JSONL files
- Thin Bash wrappers for easy CLI usage
- Output and data directories are excluded from version control via .gitignore
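
As an illustration of the JSON chunking feature, here is a minimal sketch of splitting a stream of records into fixed-size JSONL files. It is not the project's actual script: the function name `chunk_to_jsonl`, the `chunk_NNNNN.jsonl` file naming, and the default chunk size are all assumptions made for this example.

```python
import json
from itertools import islice
from pathlib import Path
from typing import Any, Iterable, List


def chunk_to_jsonl(items: Iterable[Any], out_dir: str, chunk_size: int = 1000) -> List[Path]:
    """Write items to numbered JSONL files holding at most chunk_size records each.

    Hypothetical helper for illustration; names and file layout are assumptions.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    iterator = iter(items)
    written: List[Path] = []
    index = 0
    while True:
        # Pull only one chunk into memory at a time.
        batch = list(islice(iterator, chunk_size))
        if not batch:
            break
        path = out / f"chunk_{index:05d}.jsonl"
        with path.open("w", encoding="utf-8") as fh:
            for record in batch:
                # One JSON document per line, as JSONL requires.
                fh.write(json.dumps(record, ensure_ascii=False) + "\n")
        written.append(path)
        index += 1
    return written
```

Because the function accepts any iterable, a large top-level JSON array could be streamed into it with `ijson.items(f, "item")` on a file opened in binary mode, so the whole array never has to be loaded at once. Note that ijson yields `Decimal` for non-integer numbers by default, which `json.dumps` cannot serialize; passing `use_float=True` (available in recent ijson versions) is one way around that.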
- Python 3.8+
- Packages: pandas, ijson, bs4, lxml, chardet
- Bash (for wrapper scripts)
chatgpt-rag/
├── data/ # Input data (ignored by git)
│ └── raw_data/
├── outputs/ # All script outputs (ignored by git)
│ ├── html_summary/
│ ├── json_chunks/
│ ├── json_summary/
│ ├── parsed_conversations/
│ └── ...
├── scripts/
│ ├── python/ # Python scripts
│ └── bash/ # Bash wrappers
├── USAGE_GUIDE.md # Detailed usage instructions
├── README.md # Project overview (this file)
├── .gitignore # Excludes data/ and outputs/
└── ...
See USAGE_GUIDE.md for detailed instructions and examples for each script and wrapper.
- Place your input files in data/raw_data/.
- Run the desired wrapper script from the project root, e.g.:
bash scripts/bash/html_extract_sample.sh data/raw_data/yourfile.html outputs/sample.txt
- Outputs will appear in the outputs/ directory.
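
If you have many input files, the steps above can be scripted. The following is a rough sketch, not part of the project: it assumes the wrapper takes an input path and an output path exactly as in the example command, and the `<name>_sample.txt` output naming, along with the helper names `build_command` and `run_all`, is invented for this illustration.

```python
import subprocess
from pathlib import Path
from typing import List

# Wrapper path taken from the example command; run from the project root.
WRAPPER = "scripts/bash/html_extract_sample.sh"


def build_command(html_path: Path, out_dir: Path) -> List[str]:
    """Build the wrapper invocation for one HTML file.

    The "<stem>_sample.txt" output name is an assumed convention.
    """
    out_file = out_dir / f"{html_path.stem}_sample.txt"
    return ["bash", WRAPPER, html_path.as_posix(), out_file.as_posix()]


def run_all(raw_dir: str = "data/raw_data", out_dir: str = "outputs") -> None:
    """Run the wrapper once per .html file found in raw_dir."""
    for html_path in sorted(Path(raw_dir).glob("*.html")):
        # check=True stops the batch on the first failing file.
        subprocess.run(build_command(html_path, Path(out_dir)), check=True)
```

Calling `run_all()` from the project root would process every `.html` file in data/raw_data/ in sorted order and stop at the first wrapper failure.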
Pull requests and issues are welcome! Please ensure your code is well-documented and tested.
See LICENSE for details.