poster2json

Convert scientific posters (PDF/images) into structured JSON metadata using Large Language Models.

Quick Start

Installation

Option 1: pip install from GitHub

pip install git+https://github.com/fairdataihub/poster2json.git

Option 2: Clone and install

git clone https://github.com/fairdataihub/poster2json.git
cd poster2json
pip install -e .

Option 3: Requirements only

git clone https://github.com/fairdataihub/poster2json.git
cd poster2json
pip install -r requirements.txt

Basic Usage

# If installed via pip
poster2json --annotation-dir "./posters" --output-dir "./output"

# Or run directly
python poster_extraction.py --annotation-dir "./posters" --output-dir "./output"
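
For batch jobs it can be convenient to call the same command from a Python script. The following is a minimal sketch, assuming only the two flags shown above and that the `poster2json` entry point is on `PATH`; the helper names `build_command` and `run_extraction` are hypothetical and not part of the package.

```python
import shlex
import subprocess


def build_command(poster_dir: str, output_dir: str) -> list[str]:
    """Build the CLI invocation shown above as an argument list."""
    return [
        "poster2json",
        "--annotation-dir", poster_dir,
        "--output-dir", output_dir,
    ]


def run_extraction(poster_dir: str, output_dir: str) -> int:
    """Run the CLI and return its exit code (needs poster2json on PATH)."""
    completed = subprocess.run(build_command(poster_dir, output_dir), check=False)
    return completed.returncode


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(build_command("./posters", "./output")))
```

Passing the arguments as a list avoids shell quoting problems when directory names contain spaces.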

Docker (Recommended for Windows)

docker compose up --build

See Docker Setup for detailed instructions including Windows/WSL2 support.

How It Works

PDF/Image → Raw Text Extraction → LLM JSON Structuring → Structured JSON
                ↓                         ↓
           [pdfalto]              [Llama 3.1 8B]
           [Qwen2-VL]             Fine-tuned for posters

PDF files → Processed via pdfalto for layout-aware text extraction
Image files → Processed via Qwen2-VL-7B vision-language model
All files → Structured into JSON by fine-tuned Llama 3.1 8B
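
The routing described above can be sketched as a dispatch on file extension. This is an illustration only: the extension sets and the function names `choose_extractor` and `plan_pipeline` are assumptions, and the real pipeline in `poster_extraction.py` may decide differently.

```python
from pathlib import Path

# Assumed extension sets; the formats poster2json actually accepts may differ.
PDF_SUFFIXES = {".pdf"}
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def choose_extractor(path: str) -> str:
    """Return the name of the raw-text extraction stage for a poster file."""
    suffix = Path(path).suffix.lower()
    if suffix in PDF_SUFFIXES:
        return "pdfalto"      # layout-aware text extraction for PDFs
    if suffix in IMAGE_SUFFIXES:
        return "Qwen2-VL"     # vision-language model for images
    raise ValueError(f"Unsupported poster file type: {path}")


def plan_pipeline(path: str) -> list[str]:
    """List the stages a file passes through; every file ends at the LLM."""
    return [choose_extractor(path), "Llama 3.1 8B"]


if __name__ == "__main__":
    for name in ("poster.pdf", "poster.png"):
        print(name, "->", " -> ".join(plan_pipeline(name)))
```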

Output Format

Output conforms to the poster-json-schema:

{
  "$schema": "https://posters.science/schema/v0.1/poster_schema.json",
  "creators": [{"name": "LastName, FirstName", "givenName": "FirstName", "familyName": "LastName", "affiliation": ["Institution"]}],
  "titles": [{"title": "Poster Title"}],
  "posterContent": {
    "sections": [
      {"sectionTitle": "Abstract", "sectionContent": "..."},
      {"sectionTitle": "Methods", "sectionContent": "..."}
    ]
  },
  "imageCaptions": [{"captions": ["Figure 1.", "Description"]}],
  "tableCaptions": [{"captions": ["Table 1.", "Description"]}]
}
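
As a rough illustration of how downstream code might read this output, the sketch below pulls the title, author names, and section titles out of a document shaped like the example above. The `summarize` helper and the `EXPECTED_KEYS` tuple are hypothetical; which fields the schema actually requires is not stated here, so treat the key check as an assumption.

```python
import json

# Assumed for illustration; consult the poster-json-schema for the real rules.
EXPECTED_KEYS = ("creators", "titles", "posterContent")


def summarize(poster: dict) -> dict:
    """Collect the title, author names, and section titles from one record."""
    missing = [key for key in EXPECTED_KEYS if key not in poster]
    if missing:
        raise KeyError(f"Missing expected keys: {missing}")
    return {
        "title": poster["titles"][0]["title"],
        "authors": [creator["name"] for creator in poster["creators"]],
        "sections": [
            section["sectionTitle"]
            for section in poster["posterContent"]["sections"]
        ],
    }


# A trimmed copy of the example output above.
example = json.loads("""
{
  "creators": [{"name": "LastName, FirstName", "affiliation": ["Institution"]}],
  "titles": [{"title": "Poster Title"}],
  "posterContent": {
    "sections": [
      {"sectionTitle": "Abstract", "sectionContent": "..."},
      {"sectionTitle": "Methods", "sectionContent": "..."}
    ]
  }
}
""")

if __name__ == "__main__":
    print(summarize(example))
```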

System Requirements

Requirement   Specification
GPU           CUDA-capable, ≥16 GB VRAM
RAM           ≥32 GB recommended
Python        3.10+
OS            Linux, macOS, Windows (via Docker/WSL2)

API Server

# Start the API
python api.py

# POST a poster file
curl -X POST http://localhost:8000/extract -F "file=@poster.pdf"
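
The same upload can be made from Python using only the standard library. This is a sketch under assumptions: it mirrors the curl call above (form field `file`, endpoint `http://localhost:8000/extract`), the helper names are hypothetical, and because the response format is not described here the function returns the raw bytes.

```python
import os
import urllib.request
import uuid


def build_multipart(field: str, filename: str, payload: bytes,
                    content_type: str = "application/pdf") -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; return (body, Content-Type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def extract_poster(path: str,
                   url: str = "http://localhost:8000/extract") -> bytes:
    """POST a poster file to the running API and return the raw response."""
    with open(path, "rb") as handle:
        body, header = build_multipart("file", os.path.basename(path), handle.read())
    request = urllib.request.Request(
        url, data=body, method="POST", headers={"Content-Type": header}
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Calling `extract_poster("poster.pdf")` requires the API server to be running locally.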

Documentation

Document             Description
Installation Guide   Detailed setup instructions
Docker Setup         Docker deployment & Windows support
Architecture         Technical details & methodology
Evaluation           Validation metrics & results
API Reference        REST API documentation

Project Structure

poster2json/
├── poster_extraction.py    # Main extraction pipeline
├── api.py                  # Flask REST API
├── requirements.txt        # Python dependencies
├── Dockerfile              # Container build
├── docker-compose.yml      # Docker orchestration
├── docs/                   # Documentation
├── example_posters/        # Sample poster files
└── test_results/           # Validation outputs

Performance

Validation Results: 10/10 (100%) passing on test set

Metric             Score   Threshold
Word Capture       0.96    ≥0.75
ROUGE-L            0.89    ≥0.75
Number Capture     0.93    ≥0.75
Field Proportion   0.99    0.50–2.00
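
To make the pass criteria concrete, the sketch below checks the scores from the table against their thresholds. It is an illustration only: the `passes` helper is hypothetical, and how `validation.py` computes or aggregates these metrics is not described here.

```python
# Scores and thresholds copied from the table above.
SCORES = {
    "Word Capture": 0.96,
    "ROUGE-L": 0.89,
    "Number Capture": 0.93,
    "Field Proportion": 0.99,
}

# (lower bound, upper bound); None means the metric has no upper bound.
THRESHOLDS = {
    "Word Capture": (0.75, None),
    "ROUGE-L": (0.75, None),
    "Number Capture": (0.75, None),
    "Field Proportion": (0.50, 2.00),
}


def passes(metric: str, score: float) -> bool:
    """Return True if a score falls inside the metric's allowed range."""
    low, high = THRESHOLDS[metric]
    return score >= low and (high is None or score <= high)


if __name__ == "__main__":
    for metric, score in SCORES.items():
        print(f"{metric}: {score:.2f} {'pass' if passes(metric, score) else 'fail'}")
```

Field Proportion is the only metric with an upper bound, so a score above 2.00 fails just as a score below 0.50 does.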

License

MIT License - see LICENSE for details.

Citation

Part of the FAIR Data Innovations Hub posters.science project.

Contributing

Contributions welcome! Please open an issue to discuss proposed changes.
