poster2json

Convert scientific posters (PDF/images) into structured JSON metadata using Large Language Models.

License: MIT | Python 3.10+

Quick Start

Installation

Option 1: pip install from GitHub

pip install git+https://github.com/fairdataihub/poster2json.git

Option 2: Clone and install

git clone https://github.com/fairdataihub/poster2json.git
cd poster2json
pip install -e .

Option 3: Requirements only

git clone https://github.com/fairdataihub/poster2json.git
cd poster2json
pip install -r requirements.txt

Basic Usage

# If installed via pip
poster2json --annotation-dir "./posters" --output-dir "./output"

# Or run directly
python poster_extraction.py --annotation-dir "./posters" --output-dir "./output"
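For batch jobs it can be convenient to drive the CLI from Python. The following is a minimal sketch that assumes the `poster2json` command is on your `PATH`; the helper names `build_command` and `run_extraction` are hypothetical and not part of the package. Only the two flags shown above are used.

```python
import subprocess
from pathlib import Path


def build_command(poster_dir: str, output_dir: str) -> list[str]:
    """Build the argument list for the poster2json CLI invocation shown above."""
    return [
        "poster2json",
        "--annotation-dir", poster_dir,
        "--output-dir", output_dir,
    ]


def run_extraction(poster_dir: str, output_dir: str) -> int:
    """Run the CLI and return its exit code (assumes poster2json is installed)."""
    # Creating the output directory up front is a precaution, not a documented requirement.
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    return subprocess.run(build_command(poster_dir, output_dir)).returncode
```

Calling `run_extraction("./posters", "./output")` mirrors the first command above; a non-zero return value indicates the CLI reported a failure.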

Docker (Recommended for Windows)

docker compose up --build

See Docker Setup for detailed instructions including Windows/WSL2 support.

How It Works

PDF/Image → Raw Text Extraction → LLM JSON Structuring → Structured JSON
                ↓                         ↓
           [pdfalto]              [Llama 3.1 8B]
           [Qwen2-VL]             Fine-tuned for posters
  1. PDF files → Processed via pdfalto for layout-aware text extraction
  2. Image files → Processed via Qwen2-VL-7B vision-language model
  3. All files → Structured into JSON by fine-tuned Llama 3.1 8B
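
The routing in steps 1 and 2 can be sketched as a dispatch on file extension. This is an illustration only, not the project's actual code: the function name `choose_extractor`, the returned labels, and the list of image suffixes are assumptions, since the README only says "PDF/images".

```python
from pathlib import Path

# Assumed image suffixes; the README does not list the supported image formats.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def choose_extractor(path: str) -> str:
    """Return which raw-text extractor a poster file would be routed to."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "pdfalto"    # layout-aware text extraction for PDFs
    if suffix in IMAGE_SUFFIXES:
        return "qwen2-vl"   # vision-language model for image posters
    raise ValueError(f"Unsupported poster file type: {path}")
```

Whichever extractor is chosen, its raw text would then go to the same JSON structuring step (step 3).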

Output Format

Output conforms to the poster-json-schema:

{
  "$schema": "https://posters.science/schema/v0.1/poster_schema.json",
  "creators": [{"name": "LastName, FirstName", "givenName": "FirstName", "familyName": "LastName", "affiliation": ["Institution"]}],
  "titles": [{"title": "Poster Title"}],
  "posterContent": {
    "sections": [
      {"sectionTitle": "Abstract", "sectionContent": "..."},
      {"sectionTitle": "Methods", "sectionContent": "..."}
    ]
  },
  "imageCaptions": [{"captions": ["Figure 1.", "Description"]}],
  "tableCaptions": [{"captions": ["Table 1.", "Description"]}]
}
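
To consume this output downstream, a small reader can pull out the fields shown above. This is a sketch under the assumption that every record follows the example's shape; `summarize_poster` is a hypothetical helper, and missing keys are tolerated rather than treated as schema errors.

```python
import json


def summarize_poster(record: dict) -> dict:
    """Collect the title, creator names, and section titles from one output record."""
    titles = [t.get("title", "") for t in record.get("titles", [])]
    creators = [c.get("name", "") for c in record.get("creators", [])]
    sections = record.get("posterContent", {}).get("sections", [])
    return {
        "title": titles[0] if titles else None,
        "creators": creators,
        "section_titles": [s.get("sectionTitle", "") for s in sections],
    }


# A trimmed copy of the example record above.
example = json.loads("""
{
  "creators": [{"name": "LastName, FirstName", "affiliation": ["Institution"]}],
  "titles": [{"title": "Poster Title"}],
  "posterContent": {
    "sections": [
      {"sectionTitle": "Abstract", "sectionContent": "..."},
      {"sectionTitle": "Methods", "sectionContent": "..."}
    ]
  }
}
""")

summary = summarize_poster(example)
print(summary["title"])           # Poster Title
print(summary["section_titles"])  # ['Abstract', 'Methods']
```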

System Requirements

| Requirement | Specification |
|-------------|---------------|
| GPU | CUDA-capable, ≥16GB VRAM |
| RAM | ≥32GB recommended |
| Python | 3.10+ |
| OS | Linux, macOS, Windows (via Docker/WSL2) |
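
A preflight check against these numbers might look like the sketch below. The function `check_requirements` is hypothetical; it takes the measured values as arguments instead of probing the hardware, so it assumes you obtain VRAM and RAM figures yourself (for example from `nvidia-smi`).

```python
import sys

MIN_PYTHON = (3, 10)
MIN_VRAM_GB = 16          # from the requirements listed above
RECOMMENDED_RAM_GB = 32   # listed as "recommended", so reported as a warning


def check_requirements(python_version, vram_gb: float, ram_gb: float) -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks pass."""
    problems = []
    if tuple(python_version[:2]) < MIN_PYTHON:
        problems.append("Python 3.10 or newer is required")
    if vram_gb < MIN_VRAM_GB:
        problems.append(f"GPU has {vram_gb}GB VRAM; at least {MIN_VRAM_GB}GB is required")
    if ram_gb < RECOMMENDED_RAM_GB:
        problems.append(f"warning: {ram_gb}GB RAM is below the recommended {RECOMMENDED_RAM_GB}GB")
    return problems


# Example: check the running interpreter against a 24GB GPU and 64GB of RAM.
for problem in check_requirements(sys.version_info, vram_gb=24, ram_gb=64):
    print(problem)
```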

API Server

# Start the API
python api.py

# POST a poster file
curl -X POST http://localhost:8000/extract -F "file=@poster.pdf"
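
The same request can be sent from Python using only the standard library. This sketch mirrors the `curl` call above (multipart form field `file`, endpoint `/extract` on port 8000). The helper names are hypothetical, and it assumes the server replies with a JSON body, which is not confirmed here; see the API Reference for the actual response format.

```python
import json
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field: str, filename: str, data: bytes, boundary: str) -> bytes:
    """Encode a single file as a multipart/form-data request body."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail


def extract_poster(path: str, url: str = "http://localhost:8000/extract") -> dict:
    """POST a poster file to the running API and parse the reply as JSON (assumed)."""
    boundary = uuid.uuid4().hex
    file_path = Path(path)
    body = build_multipart("file", file_path.name, file_path.read_bytes(), boundary)
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server started via `python api.py`, `extract_poster("poster.pdf")` would send the same request as the `curl` command.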

Documentation

| Document | Description |
|----------|-------------|
| Installation Guide | Detailed setup instructions |
| Docker Setup | Docker deployment & Windows support |
| Architecture | Technical details & methodology |
| Evaluation | Validation metrics & results |
| API Reference | REST API documentation |

Project Structure

poster2json/
├── poster_extraction.py    # Main extraction pipeline
├── api.py                  # Flask REST API
├── requirements.txt        # Python dependencies
├── Dockerfile              # Container build
├── docker-compose.yml      # Docker orchestration
├── docs/                   # Documentation
├── example_posters/        # Sample poster files
└── test_results/           # Validation outputs

Performance

Validation Results: 10/10 (100%) passing on test set

| Metric | Score | Threshold |
|--------|-------|-----------|
| Word Capture | 0.96 | ≥0.75 |
| ROUGE-L | 0.89 | ≥0.75 |
| Number Capture | 0.93 | ≥0.75 |
| Field Proportion | 0.99 | 0.50–2.00 |
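
The pass/fail rule implied by these thresholds can be written as a short check. This is a sketch, not the project's validation code: the dictionary keys and the function `failing_metrics` are hypothetical, and it assumes a poster passes only when every metric is within its threshold. How each metric is computed is described in the Evaluation document and is not reproduced here.

```python
# (lower bound, upper bound); None means the metric has no upper bound.
THRESHOLDS = {
    "word_capture": (0.75, None),
    "rouge_l": (0.75, None),
    "number_capture": (0.75, None),
    "field_proportion": (0.50, 2.00),
}


def failing_metrics(scores: dict) -> list[str]:
    """Return the names of metrics that fall outside their threshold."""
    failures = []
    for name, (low, high) in THRESHOLDS.items():
        value = scores[name]
        if value < low or (high is not None and value > high):
            failures.append(name)
    return failures


# The scores reported above.
reported = {
    "word_capture": 0.96,
    "rouge_l": 0.89,
    "number_capture": 0.93,
    "field_proportion": 0.99,
}
print(failing_metrics(reported))  # → []
```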

License

MIT License - see LICENSE for details.

Citation

Part of the FAIR Data Innovations Hub posters.science project.

Contributing

Contributions welcome! Please open an issue to discuss proposed changes.