Convert scientific posters (PDF/images) into structured JSON metadata using Large Language Models.
Option 1: pip install from GitHub
pip install git+https://github.com/fairdataihub/poster2json.git

Option 2: Clone and install
git clone https://github.com/fairdataihub/poster2json.git
cd poster2json
pip install -e .

Option 3: Requirements only
git clone https://github.com/fairdataihub/poster2json.git
cd poster2json
pip install -r requirements.txt

# If installed via pip
poster2json --annotation-dir "./posters" --output-dir "./output"
# Or run directly
python poster_extraction.py --annotation-dir "./posters" --output-dir "./output"

docker compose up --build

See Docker Setup for detailed instructions including Windows/WSL2 support.
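The command-line invocation above can also be scripted. The following is a minimal sketch, not part of the project: the helper names `build_command` and `run_extraction` are hypothetical, and only the two flags shown above (`--annotation-dir` and `--output-dir`) are used. Creating the output directory beforehand is an assumption made for safety; the tool may already do this itself.

```python
import subprocess
import sys
from pathlib import Path


def build_command(annotation_dir: str, output_dir: str, installed: bool = True) -> list[str]:
    """Build the argument list for one extraction run.

    installed=True uses the `poster2json` entry point (pip install);
    installed=False runs `poster_extraction.py` from a cloned checkout.
    """
    entry = ["poster2json"] if installed else [sys.executable, "poster_extraction.py"]
    return entry + ["--annotation-dir", annotation_dir, "--output-dir", output_dir]


def run_extraction(annotation_dir: str, output_dir: str, installed: bool = True) -> int:
    """Run the extraction as a subprocess and return its exit code."""
    # Assumption: pre-creating the output directory is harmless.
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    completed = subprocess.run(build_command(annotation_dir, output_dir, installed))
    return completed.returncode
```

Calling `run_extraction("./posters", "./output")` mirrors the first command above; pass `installed=False` from inside a cloned checkout.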
PDF/Image → Raw Text Extraction → LLM JSON Structuring → Structured JSON
                   ↓                        ↓
              [pdfalto]              [Llama 3.1 8B]
              [Qwen2-VL]         Fine-tuned for posters
- PDF files → Processed via pdfalto for layout-aware text extraction
- Image files → Processed via Qwen2-VL-7B vision-language model
- All files → Structured into JSON by fine-tuned Llama 3.1 8B
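The routing described in the list above can be sketched as a small dispatch on file extension. This is an illustration only, not the project's code: the function names `choose_extractor` and `plan_pipeline` are hypothetical, and the set of accepted image extensions is an assumption.

```python
from pathlib import Path

# Assumed extension sets; the project may accept a different list.
PDF_EXTENSIONS = {".pdf"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def choose_extractor(path: str) -> str:
    """Return the name of the raw-text extractor for a poster file."""
    suffix = Path(path).suffix.lower()
    if suffix in PDF_EXTENSIONS:
        return "pdfalto"  # layout-aware text extraction for PDFs
    if suffix in IMAGE_EXTENSIONS:
        return "Qwen2-VL-7B"  # vision-language model for images
    raise ValueError(f"Unsupported poster file type: {path}")


def plan_pipeline(path: str) -> list[str]:
    """List the two stages a file passes through, in order."""
    # Every file ends at the same JSON-structuring model.
    return [choose_extractor(path), "Llama 3.1 8B"]
```

For example, `plan_pipeline("poster.pdf")` yields the pdfalto stage followed by the Llama 3.1 8B structuring stage.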
Output conforms to the poster-json-schema:
{
"$schema": "https://posters.science/schema/v0.1/poster_schema.json",
"creators": [{"name": "LastName, FirstName", "givenName": "FirstName", "familyName": "LastName", "affiliation": ["Institution"]}],
"titles": [{"title": "Poster Title"}],
"posterContent": {
"sections": [
{"sectionTitle": "Abstract", "sectionContent": "..."},
{"sectionTitle": "Methods", "sectionContent": "..."}
]
},
"imageCaptions": [{"captions": ["Figure 1.", "Description"]}],
"tableCaptions": [{"captions": ["Table 1.", "Description"]}]
}

| Requirement | Specification |
|---|---|
| GPU | CUDA-capable, ≥16GB VRAM |
| RAM | ≥32GB recommended |
| Python | 3.10+ |
| OS | Linux, macOS, Windows (via Docker/WSL2) |
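Once a machine meeting these requirements has produced output, the JSON format shown earlier can be read with ordinary JSON tooling. The sketch below is hedged: `summarize_poster` is a hypothetical helper, and treating `creators`, `titles`, and `posterContent` as required is an assumption made for illustration, not a statement about what the schema mandates.

```python
import json

# Assumption: these top-level keys are checked for illustration only.
EXPECTED_KEYS = ("creators", "titles", "posterContent")

EXAMPLE = """
{
  "$schema": "https://posters.science/schema/v0.1/poster_schema.json",
  "creators": [{"name": "LastName, FirstName", "givenName": "FirstName",
                "familyName": "LastName", "affiliation": ["Institution"]}],
  "titles": [{"title": "Poster Title"}],
  "posterContent": {
    "sections": [
      {"sectionTitle": "Abstract", "sectionContent": "..."},
      {"sectionTitle": "Methods", "sectionContent": "..."}
    ]
  }
}
"""


def summarize_poster(raw: str) -> dict:
    """Parse one output document and pull out title, creators, and section titles."""
    record = json.loads(raw)
    missing = [key for key in EXPECTED_KEYS if key not in record]
    if missing:
        raise KeyError(f"Missing top-level fields: {missing}")
    sections = record["posterContent"].get("sections", [])
    return {
        "title": record["titles"][0]["title"] if record["titles"] else None,
        "creators": [creator["name"] for creator in record["creators"]],
        "section_titles": [section["sectionTitle"] for section in sections],
    }


summary = summarize_poster(EXAMPLE)
print(summary["title"], summary["section_titles"])
```

For full validation against the published schema, a JSON Schema validator would be the more thorough choice; this sketch only checks for the presence of a few keys.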
# Start the API
python api.py
# POST a poster file
curl -X POST http://localhost:8000/extract -F "file=@poster.pdf"

| Document | Description |
|---|---|
| Installation Guide | Detailed setup instructions |
| Docker Setup | Docker deployment & Windows support |
| Architecture | Technical details & methodology |
| Evaluation | Validation metrics & results |
| API Reference | REST API documentation |
poster2json/
├── poster_extraction.py # Main extraction pipeline
├── api.py # Flask REST API
├── requirements.txt # Python dependencies
├── Dockerfile # Container build
├── docker-compose.yml # Docker orchestration
├── docs/ # Documentation
├── example_posters/ # Sample poster files
└── test_results/ # Validation outputs
Validation Results: 10/10 (100%) passing on test set
| Metric | Score | Threshold |
|---|---|---|
| Word Capture | 0.96 | ≥0.75 |
| ROUGE-L | 0.89 | ≥0.75 |
| Number Capture | 0.93 | ≥0.75 |
| Field Proportion | 0.99 | 0.50–2.00 |
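To make the table above concrete, here is a minimal sketch of one of the metrics and of the pass/fail check. It uses the textbook ROUGE-L definition (F1 over the longest common subsequence of whitespace tokens); the project's own tokenization and ROUGE variant may differ, and the names `lcs_length`, `rouge_l_f1`, and `passes` are hypothetical. The acceptance ranges are copied from the table.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L as the harmonic mean of LCS-based precision and recall."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    lcs = lcs_length(ref_tokens, cand_tokens)
    if lcs == 0:
        return 0.0
    recall = lcs / len(ref_tokens)
    precision = lcs / len(cand_tokens)
    return 2 * precision * recall / (precision + recall)


# Acceptance ranges from the table: (lower bound, upper bound or None).
THRESHOLDS = {
    "Word Capture": (0.75, None),
    "ROUGE-L": (0.75, None),
    "Number Capture": (0.75, None),
    "Field Proportion": (0.50, 2.00),
}


def passes(metric: str, score: float) -> bool:
    """Check a score against the acceptance range for its metric."""
    low, high = THRESHOLDS[metric]
    return score >= low and (high is None or score <= high)
```

Note that Field Proportion is the only metric with an upper bound, so a score above 2.00 fails even though higher is better for the other three.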
MIT License - see LICENSE for details.
Part of the FAIR Data Innovations Hub posters.science project.
Contributions welcome! Please open an issue to discuss proposed changes.