Llama Web is a content transformation tool that converts website content into clean, formatted markdown. It parses pages with BeautifulSoup4 and uses a Llama model, run through Ollama, to format the result. It targets modern, JavaScript-heavy websites and produces portable markdown output.
- Smart Web Scraping:
  - Handles dynamic content and modern web layouts
  - Intelligent element filtering and content prioritization
  - Clean removal of ads, popups, and navigation elements
- AI-Powered Processing:
  - Uses Llama 3.2 3B model for context-aware formatting
  - Semantic structure preservation
  - Intelligent heading hierarchy detection
- Advanced Content Transformation:
  - Maintains document structure
  - Preserves important formatting
  - Handles complex nested lists and tables
- Error Handling & Resilience:
  - Robust against malformed HTML
  - Graceful fallback for JavaScript-heavy sites
  - Connection error management
- Core Technologies:
  - Python 3.x
  - BeautifulSoup4 for HTML parsing
  - Ollama for AI processing
  - LangChain for text splitting
  - Sentence Transformers for content analysis
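
To make the "removal of ads, popups, and navigation elements" feature concrete, here is a minimal sketch of how such filtering could be done with BeautifulSoup4. It is not taken from `llama_web/main.py`: the function name `extract_main_text` and the `NOISE_TAGS` list are hypothetical, and the real tool may filter on different criteria.

```python
from bs4 import BeautifulSoup

# Tags that rarely carry article content. This set is an illustrative
# assumption, not the list the project actually uses.
NOISE_TAGS = ["script", "style", "nav", "header", "footer", "aside", "form"]


def extract_main_text(html: str) -> str:
    """Return the visible text of a page with common boilerplate tags removed."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop every noise tag together with its children.
    for tag in soup.find_all(NOISE_TAGS):
        tag.decompose()

    # Split the remaining text into lines and discard the empty ones.
    lines = (line.strip() for line in soup.get_text("\n").splitlines())
    return "\n".join(line for line in lines if line)


sample = (
    "<html><body><nav>Menu</nav><h1>Title</h1>"
    "<p>Body text.</p><script>track()</script></body></html>"
)
print(extract_main_text(sample))
```

The cleaned text would then be what gets passed on to the model for markdown formatting.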
- Ensure Python 3.x is installed
- Clone the Llama Web repository
- Install dependencies:

      pip install -r requirements.txt

- Navigate to the project directory
- Run the script:

      python llama_web/main.py --url "your-url-here"

      # Convert news article with Llama Web
      python llama_web/main.py --url "https://news.website.com/article"

      # Process documentation with custom output
      python llama_web/main.py --url "https://docs.python.org/3/" --output docs.md

- News articles and blog posts
- Documentation pages
- Product descriptions
- Academic content
- Forum threads and discussions
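
The `--url` and `--output` flags used in the commands earlier could be parsed along these lines. This is only a sketch using the standard library's `argparse`: the helper name `build_parser`, the help texts, and the behavior when `--output` is omitted are assumptions, and the actual `main.py` may handle its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the two flags shown in the usage examples."""
    parser = argparse.ArgumentParser(
        prog="llama_web",
        description="Convert a web page to markdown.",
    )
    parser.add_argument("--url", required=True, help="Page to convert")
    # Assumption: no --output means the caller decides where the result goes
    # (for example, printing it to stdout).
    parser.add_argument("--output", default=None, help="Markdown file to write")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(
    ["--url", "https://docs.python.org/3/", "--output", "docs.md"]
)
print(args.url, args.output)
```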
- Advanced URL parsing and handling
- Batch processing of multiple URLs
- Custom markdown templates
- Configuration profiles for different content types
- PDF to Markdown conversion
- Image caption extraction and formatting
- Video transcript processing
- Social media thread conversion
- Export to AsciiDoc
- Support for LaTeX output
- Wiki markup conversion
- Custom templating system
- Multi-language support
- Content summarization
- Automatic citation formatting
- SEO-optimized output
- Headless browser integration for JavaScript-heavy sites
- API endpoint for remote conversion
- Batch processing capabilities
- Custom formatting rules engine
Contributions are welcome! Areas we're particularly interested in:
- Additional output formats
- Performance optimizations
- New content source handlers
- AI model improvements
Please be mindful of website rate limits when using this tool. Consider implementing delays between requests for bulk processing.
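
One simple way to add such delays is to pause between consecutive conversions. The sketch below assumes a hypothetical `convert` callable that turns one URL into markdown; Llama Web does not necessarily expose a function with this shape, and the two-second default is an arbitrary illustrative value.

```python
import time
from typing import Callable, Dict, Iterable


def process_politely(
    urls: Iterable[str],
    convert: Callable[[str], str],
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Convert each URL in turn, pausing between consecutive requests."""
    results: Dict[str, str] = {}
    for index, url in enumerate(urls):
        # No pause before the first request, only between requests.
        if index > 0:
            sleep(delay_seconds)
        results[url] = convert(url)
    return results


if __name__ == "__main__":
    # Stand-in converter so the sketch runs without network access.
    def fake_convert(url: str) -> str:
        return f"# Converted {url}"

    pages = ["https://example.com/a", "https://example.com/b"]
    print(process_politely(pages, fake_convert, delay_seconds=0.1))
```

Passing `sleep` as a parameter keeps the function easy to test, since a test can substitute a recorder for the real pause.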
MIT License - See LICENSE file for details
bs4 llama/
├── llama_web/
│   ├── __init__.py
│   └── main.py
├── requirements.txt
└── README.md