PDF Extractor

Cross-platform desktop application for extracting content from PDF documents using IBM Docling's AI-powered document understanding.

Features

- Drag & Drop Interface: Simply drag PDF files into the application
- Multiple Export Formats:
  - JSON (structured document data)
  - Markdown (clean text output)
  - CSV/Excel (extracted tables)
  - HTML (web-viewable format)
- AI-Powered Extraction: Uses IBM Docling for intelligent layout detection and table extraction
- Cross-Platform: Native apps for Windows and macOS
- GPU Acceleration: Automatic CUDA/MPS support for faster processing

Download

Download the latest release for your platform from the Releases page.

Platform	File	Description
Windows (Installer)	`PDF_Extractor_Setup_x.x.x.exe`	Standard Windows installer
Windows (Portable)	`PDF_Extractor_Portable.zip`	No installation required
macOS (Intel)	`PDF_Extractor_macOS.dmg`	For Intel-based Macs
macOS (Apple Silicon)	`PDF_Extractor_macOS_ARM.dmg`	For M1/M2/M3/M4 Macs

Note: On first run, the app downloads AI models (~300MB). This only happens once.
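
Which file to pick from the table depends on the operating system and CPU architecture. The following is a minimal sketch of that mapping using Python's standard `platform` module. The helper name `release_asset` is hypothetical and not part of the project; the file names are copied from the table, with the installer's `x.x.x` version placeholder left as-is.

```python
import platform

# File names copied from the download table; "x.x.x" is the version placeholder.
ASSETS = {
    ("Windows", "installer"): "PDF_Extractor_Setup_x.x.x.exe",
    ("Windows", "portable"): "PDF_Extractor_Portable.zip",
    ("Darwin", "x86_64"): "PDF_Extractor_macOS.dmg",
    ("Darwin", "arm64"): "PDF_Extractor_macOS_ARM.dmg",
}


def release_asset(system=None, machine=None, portable=False):
    """Return the release file name for a platform (hypothetical helper)."""
    system = system or platform.system()      # "Windows" or "Darwin" (macOS)
    machine = machine or platform.machine()   # "arm64" on Apple Silicon
    if system == "Windows":
        key = ("Windows", "portable" if portable else "installer")
    else:
        key = (system, machine)
    try:
        return ASSETS[key]
    except KeyError:
        raise ValueError(f"No release listed for {system}/{machine}") from None
```

Calling `release_asset()` with no arguments inspects the current machine; passing `system` and `machine` explicitly is useful when preparing a download for someone else.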

Installation

Windows - Installer (Recommended)

1. Download PDF_Extractor_Setup_x.x.x.exe from the latest release
2. Run the installer (if Windows SmartScreen appears, click "More info" → "Run anyway")
3. Follow the setup wizard
4. Launch PDF Extractor from the Start Menu or desktop shortcut

Windows - Portable Version

The portable version requires no installation and can run from any folder or USB drive.

1. Download PDF_Extractor_Portable.zip from the latest release
2. Extract the zip file to any folder (e.g., C:\Apps\PDF Extractor\)
3. Double-click PDF Extractor.exe to run the application
4. First run only: if Windows SmartScreen shows "Windows protected your PC":
   - Click "More info"
   - Click "Run anyway"

Important: Keep the _internal folder in the same location as PDF Extractor.exe - the application needs it to run.

macOS

1. Download the appropriate DMG for your Mac:
   - Intel Macs: PDF_Extractor_macOS.dmg
   - Apple Silicon (M1/M2/M3/M4): PDF_Extractor_macOS_ARM.dmg
2. Open the DMG file
3. Drag PDF Extractor to your Applications folder
4. Launch from Applications or Spotlight

Note: If you see an "App is damaged" or "unidentified developer" warning, see Troubleshooting below.

Building from Source

Prerequisites

Python 3.10 or higher
Git
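
Both prerequisites can be checked before cloning. This is a small sketch using only the standard library; the function name `check_prerequisites` is illustrative and does not exist in the repository.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)  # minimum version from the prerequisites list


def check_prerequisites():
    """Report whether Python is new enough and Git is on PATH."""
    return {
        "python_ok": sys.version_info[:2] >= MIN_PYTHON,
        "git_ok": shutil.which("git") is not None,
    }


print(check_prerequisites())
```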

Quick Start (Development)

# Clone the repository
git clone https://github.com/danribes/pdf_xtractor.git
cd pdf_xtractor

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run the application
python src/main.py

Building Distributables

Windows

# Build executable only
scripts\build_windows.bat

# Build executable + installer
scripts\build_windows.bat installer

Output:

dist\PDF Extractor.exe - Standalone executable
dist\PDF_Extractor_Setup_1.0.0.exe - Installer (requires Inno Setup)

macOS

# Build .app bundle
./scripts/build_mac.sh

# Build .app + DMG installer
./scripts/build_mac.sh dmg

# Build Universal binary (Intel + Apple Silicon)
./scripts/build_mac.sh universal dmg

Output:

dist/PDF Extractor.app - Application bundle
dist/PDF_Extractor_1.0.0.dmg - DMG installer

Bundling AI Models (Offline Distribution)

For distribution without requiring internet on first run:

# Download models to local directory
python scripts/download_models.py

# Then build normally - models will be included
./scripts/build_mac.sh dmg  # on Windows: scripts\build_windows.bat

This increases the app size by ~300MB but allows fully offline usage.

Project Structure

pdf_xtractor/
├── src/
│   ├── main.py              # Application entry point
│   ├── gui.py               # PySide6 desktop interface
│   ├── converter.py         # Docling processing logic
│   └── config.py            # Configuration management
├── build/
│   ├── pdfextractor.spec    # PyInstaller configuration
│   ├── installer_windows.iss # Inno Setup script
│   └── version_info.txt     # Windows version metadata
├── scripts/
│   ├── build_windows.bat    # Windows build script
│   ├── build_mac.sh         # macOS build script
│   ├── download_models.py   # Pre-download AI models
│   └── create_icons.py      # Generate app icons
├── assets/
│   ├── icon.ico             # Windows icon
│   ├── icon.icns            # macOS icon
│   └── icon.png             # Reference icon
├── .github/
│   └── workflows/
│       └── build.yml        # CI/CD for automated builds
├── requirements.txt
└── README.md

Export Format Details

Format	Method	Use Case
JSON	`export_to_dict()`	Full document hierarchy for developers
Markdown	`export_to_markdown()`	Clean text for LLMs or documentation
CSV/Excel	`table.export_to_dataframe()`	Structured data for analysis
HTML	`export_to_html()`	Visualizing the document in a browser
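
The methods in the table belong to the document object that Docling returns after a conversion. The sketch below shows how they might be combined to write every format for one PDF. It is an illustration under assumptions, not the code in `src/converter.py`: the helper names `output_paths` and `export_all` and the output file layout are hypothetical. Depending on the Docling release, `export_to_dataframe()` may also accept the parent document as an argument.

```python
import json
from pathlib import Path


def output_paths(pdf_path, out_dir):
    """Map each export format to an output file (hypothetical layout)."""
    stem = Path(pdf_path).stem
    out = Path(out_dir)
    return {
        "json": out / f"{stem}.json",
        "markdown": out / f"{stem}.md",
        "html": out / f"{stem}.html",
    }


def export_all(pdf_path, out_dir):
    """Convert one PDF with Docling and write each format from the table."""
    # Imported lazily so output_paths() works without Docling installed.
    from docling.document_converter import DocumentConverter

    paths = output_paths(pdf_path, out_dir)
    Path(out_dir).mkdir(parents=True, exist_ok=True)

    doc = DocumentConverter().convert(pdf_path).document

    paths["json"].write_text(json.dumps(doc.export_to_dict()), encoding="utf-8")
    paths["markdown"].write_text(doc.export_to_markdown(), encoding="utf-8")
    paths["html"].write_text(doc.export_to_html(), encoding="utf-8")

    # One CSV file per detected table, via a pandas DataFrame.
    stem = Path(pdf_path).stem
    for i, table in enumerate(doc.tables, start=1):
        csv_path = Path(out_dir) / f"{stem}_table_{i}.csv"
        table.export_to_dataframe().to_csv(csv_path, index=False)
    return paths
```

A call such as `export_all("report.pdf", "out")` would write `out/report.json`, `out/report.md`, `out/report.html`, and one CSV per table. The first conversion may take longer because the models are downloaded.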

Code Signing & Notarization

Windows

For distribution, sign your executable with a code signing certificate:

signtool sign /f certificate.pfx /p password /fd SHA256 /tr http://timestamp.digicert.com /td SHA256 "dist\PDF Extractor.exe"

macOS

For distribution outside the App Store:

# Sign the app
codesign --deep --force --options runtime --timestamp --sign "Developer ID Application: Your Name (TEAM_ID)" "dist/PDF Extractor.app"

# Create signed DMG
codesign --sign "Developer ID Application: Your Name (TEAM_ID)" "dist/PDF_Extractor_1.0.0.dmg"

# Notarize
xcrun notarytool submit dist/PDF_Extractor_1.0.0.dmg \
    --apple-id "your@email.com" \
    --team-id "TEAM_ID" \
    --password "app-specific-password" \
    --wait

# Staple the notarization
xcrun stapler staple "dist/PDF_Extractor_1.0.0.dmg"

CI/CD with GitHub Actions

The project includes automated builds via GitHub Actions. To create a release:

1. Tag a version: git tag v1.0.0
2. Push the tag: git push origin v1.0.0
3. GitHub Actions will build for all platforms
4. Download artifacts from the draft release

You can also manually trigger a build from the Actions tab.

Troubleshooting

"App is damaged" on macOS

This happens with unsigned apps downloaded from the internet. Clear the app's extended attributes, which include the quarantine flag:

xattr -cr "/Applications/PDF Extractor.app"

Models fail to download

If you are behind a firewall, set HF_HOME to the directory that should hold the models, then pre-download them:

export HF_HOME=/path/to/models
python scripts/download_models.py
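
To confirm where the models will be looked up, it can help to print the cache root that Hugging Face tooling resolves. This sketch assumes the models are fetched through Hugging Face libraries, as the HF_HOME variable suggests; the helper name `hf_cache_root` is hypothetical.

```python
import os
from pathlib import Path


def hf_cache_root(env=None):
    """Return the directory Hugging Face tooling uses as its cache root."""
    env = os.environ if env is None else env
    custom = env.get("HF_HOME")
    if custom:
        return Path(custom).expanduser()
    # Default when HF_HOME is unset. XDG_CACHE_HOME, if set, changes
    # this location; that case is ignored here for brevity.
    return Path.home() / ".cache" / "huggingface"


root = hf_cache_root()
print(f"Model cache root: {root} (exists: {root.exists()})")
```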

GPU not detected

Ensure you have the correct PyTorch version for your GPU:

# For NVIDIA CUDA
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

# For Apple Silicon (MPS)
pip install torch torchvision  # MPS support is automatic
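
After reinstalling PyTorch, a quick check shows which accelerator it can see. This is a minimal sketch built on PyTorch's public availability checks; the function name `pick_device` is illustrative, and how the application itself selects a device is not shown in this README.

```python
def pick_device():
    """Return "cuda", "mps", or "cpu" depending on what PyTorch can see."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed in this environment
    if torch.cuda.is_available():
        return "cuda"  # NVIDIA GPU with a matching CUDA build
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"  # Apple Silicon GPU
    return "cpu"


print(pick_device())
```

If this prints `cpu` on a machine with a supported GPU, the installed PyTorch build is the likely cause.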

Requirements

Python 3.10+
docling >= 2.5.0
PySide6 >= 6.6.0
pandas >= 2.0.0
PyInstaller >= 6.0.0 (for building)
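
To see which of these packages are present in the active environment, the standard library's `importlib.metadata` can report installed versions. This is a sketch only; the helper name `installed_versions` is hypothetical, and comparing against the minimum versions is left to the reader.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the requirements list.
REQUIRED = ["docling", "PySide6", "pandas", "PyInstaller"]


def installed_versions(names=REQUIRED):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions().items():
    print(f"{name}: {ver or 'not installed'}")
```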

License

MIT License

Acknowledgments

IBM Docling - Document understanding AI
PySide6 - Qt for Python
PyInstaller - Python application bundling
