MeanFishy00/bs4-llama

Llama Web: AI-Powered Web to Markdown Converter

Description

Llama Web converts dynamic website content into clean, formatted Markdown using a Llama model and BeautifulSoup4. It is aimed at modern, JavaScript-heavy websites and produces portable Markdown output.

Key Features

  • Smart Web Scraping:
    • Handles dynamic content and modern web layouts
    • Intelligent element filtering and content prioritization
    • Clean removal of ads, popups, and navigation elements
  • AI-Powered Processing:
    • Uses the Llama 3.2 3B model for context-aware formatting
    • Semantic structure preservation
    • Intelligent heading hierarchy detection
  • Advanced Content Transformation:
    • Maintains document structure
    • Preserves important formatting
    • Handles complex nested lists and tables
  • Error Handling & Resilience:
    • Robust against malformed HTML
    • Graceful fallback for JavaScript-heavy sites
    • Connection error management
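
The element filtering described above can be illustrated with a minimal BeautifulSoup4 sketch. The tag list `NOISE_TAGS` and the helper `extract_main_text` are hypothetical names chosen for this example; the filter rules Llama Web actually applies are not documented in this README.

```python
from bs4 import BeautifulSoup

# Hypothetical list of tags treated as non-content; the project's real
# filter rules are not documented here.
NOISE_TAGS = ["script", "style", "nav", "header", "footer", "aside", "form", "iframe"]


def extract_main_text(html: str) -> str:
    """Return visible text with common non-content elements removed."""
    # html.parser is lenient, so unclosed tags do not raise.
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(NOISE_TAGS):
        tag.extract()  # detach the element and its children from the tree
    # Prefer <main> or <article> when present, else fall back to the whole page.
    root = soup.find("main") or soup.find("article") or soup
    return root.get_text(separator="\n", strip=True)


sample = (
    "<html><body><nav>Menu</nav>"
    "<article><h1>Title</h1><p>Body text</article>"
    "<footer>Ads</footer></body></html>"
)
print(extract_main_text(sample))
```

The sample deliberately leaves a `<p>` unclosed to show that the lenient parser still yields usable text.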

Technical Stack

  • Core Technologies:
    • Python 3.x
    • BeautifulSoup4 for HTML parsing
    • Ollama for AI processing
    • LangChain for text splitting
    • Sentence Transformers for content analysis
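
As a rough illustration of how the splitting and AI steps might fit together, the sketch below chunks text and builds a request for a local Ollama server. It uses a plain-Python overlap splitter as a stand-in for LangChain's text splitter, and it assumes the model tag `llama3.2:3b`, Ollama's default local address, and a made-up prompt; none of these are confirmed by this README.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "llama3.2:3b"  # assumed tag for the Llama 3.2 3B model


def split_text(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks that overlap by `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


def build_request(chunk: str) -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request for one chunk."""
    payload = {
        "model": MODEL,
        # Hypothetical prompt wording, for illustration only.
        "prompt": "Convert the following page text to clean Markdown:\n\n" + chunk,
        "stream": False,
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def to_markdown(chunk: str) -> str:
    """Send one chunk to a running Ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(chunk), timeout=120) as resp:
        return json.loads(resp.read())["response"]


print(split_text("abcdefghij", chunk_size=4, overlap=1))
```

Only `to_markdown` needs a running Ollama server; the splitter and request builder can be exercised offline.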

Installation

  1. Ensure Python 3.x is installed
  2. Clone the Llama Web repository
  3. Install dependencies:
pip install -r requirements.txt

Usage

  1. Navigate to the project directory
  2. Run the script:
python llama_web/main.py --url "your-url-here"

Advanced Usage Examples

# Convert news article with Llama Web
python llama_web/main.py --url "https://news.website.com/article"

# Process documentation with custom output
python llama_web/main.py --url "https://docs.python.org/3/" --output docs.md
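
The two flags used in the commands above could be parsed along these lines. This is a sketch, not the project's actual `main.py`: the `build_parser` helper, the help strings, and the behavior when `--output` is omitted are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Create a parser for the --url and --output flags shown above."""
    parser = argparse.ArgumentParser(
        prog="llama_web", description="Convert a web page to Markdown."
    )
    parser.add_argument("--url", required=True, help="page to convert")
    # Assumption: when --output is omitted, the Markdown is printed to stdout.
    parser.add_argument("--output", default=None, help="file to write the Markdown to")
    return parser


args = build_parser().parse_args(
    ["--url", "https://docs.python.org/3/", "--output", "docs.md"]
)
print(args.url, args.output)
```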

Supported Content Types

  • News articles and blog posts
  • Documentation pages
  • Product descriptions
  • Academic content
  • Forum threads and discussions

Future Enhancements

Llama Web Core Features

  • Advanced URL parsing and handling
  • Batch processing of multiple URLs
  • Custom markdown templates
  • Configuration profiles for different content types

Additional Content Types

  • PDF to Markdown conversion
  • Image caption extraction and formatting
  • Video transcript processing
  • Social media thread conversion

Format Extensions

  • Export to AsciiDoc
  • Support for LaTeX output
  • Wiki markup conversion
  • Custom templating system

AI Improvements

  • Multi-language support
  • Content summarization
  • Automatic citation formatting
  • SEO-optimized output

Technical Roadmap

  • Headless browser integration for JavaScript-heavy sites
  • API endpoint for remote conversion
  • Batch processing capabilities
  • Custom formatting rules engine

Contributing

Contributions are welcome! Areas we're particularly interested in:

  • Additional output formats
  • Performance optimizations
  • New content source handlers
  • AI model improvements

Note on Rate Limiting

Please be mindful of website rate limits when using this tool. Consider implementing delays between requests for bulk processing.
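
One simple way to add such delays is a small throttle that enforces a minimum interval between requests. The `Throttle` class, the 0.2-second interval, and the example URLs below are illustrative assumptions, not part of Llama Web.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._last = None  # monotonic time of the previous request, if any

    def wait(self) -> float:
        """Sleep just long enough to honor the interval; return seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last = time.monotonic()
        return slept


throttle = Throttle(min_interval=0.2)  # hypothetical delay; tune per site
urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
for url in urls:
    throttle.wait()
    print("would fetch", url)  # replace with the real conversion call
```

The first call returns immediately; later calls sleep only for whatever part of the interval has not already elapsed.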

License

MIT License. See the LICENSE file for details.

Project Structure

bs4-llama/
├── llama_web/
│   ├── __init__.py
│   └── main.py
├── requirements.txt
└── README.md
