OpenContracts (Demo)
Open source document intelligence. Self-hosted, AI-powered, and built for teams who need to own their data.
OpenContracts is an AGPL-3.0 licensed platform for document analysis, annotation, and collaboration. It combines document management with AI-powered analysis tools, discussion threads, and structured data extraction.
- Document Processing — Upload PDFs and text files, automatically extract structure with ML-based parsers
- Annotation & Analysis — Highlight, label, and analyze documents with custom annotation schemas
- AI Agents — Chat with documents using configurable AI assistants that can search and analyze content
- Collaboration — Threaded discussions with @mentions, voting, and moderation at corpus and document levels
- Data Extraction — Extract structured data from hundreds of documents using agent-powered queries
- Version Control — Track document changes, restore previous versions, soft delete with recovery
- Organize documents into collections (Corpuses) with folder hierarchies
- Fine-grained permissions with public/private visibility controls
- Document versioning with full history and restore capability
- Bulk upload and batch operations
- Pluggable parser architecture supporting multiple backends:
  - Docling — ML-based structure extraction
  - NLM-Ingest — Layout-aware parsing
  - Text/Markdown — Simple text extraction
- Automatic vector embeddings for semantic search (powered by pgvector)
- Structural annotation extraction (headers, paragraphs, tables)
- Multi-page annotation support
- Custom label schemas with validation
- Relationship mapping between annotations
- Import/export in standard formats
- Built on PydanticAI for structured LLM interactions
- Configurable AI agents with tool access (search, document loading, annotation queries)
- Real-time streaming responses via WebSocket
- Conversation history with context management
- Threaded discussions at global, corpus, and document levels
- @mentions for documents, corpuses, and AI agents
- Upvoting/downvoting with reputation tracking
- Thread pinning, locking, and moderation controls
- User profiles with activity feeds and statistics
- Badges and achievements for community engagement
- Leaderboards showing top contributors
- Define extraction schemas with multiple question types
- Run extractions across document collections
- Review and validate extracted data in grid view
- Export results in structured formats
Browse the full documentation at jsv4.github.io/OpenContracts or in the repo:
| Guide | Description |
|---|---|
| Quick Start | Get running with Docker in minutes |
| Key Concepts | Core workflows and terminology |
| PDF Data Format | How text maps to PDF coordinates |
| LLM Framework | PydanticAI integration and agents |
| Vector Stores | Semantic search architecture |
| Pipeline Overview | Parser and embedder system |
| Custom Extractors | Build your own data extraction tasks |
| v3.0.0.b3 Release Notes | Latest features and migration guide |
OpenContracts uses a standardized format for representing text and layout on PDF pages, enabling portable annotations across tools.
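As a rough illustration of what mapping text to page coordinates can look like, the sketch below models a page as a list of tokens with bounding boxes, loosely in the spirit of the PAWLS token layout credited in this README. The names used here (`Token`, `Page`, `tokens_in_box`, `annotation_text`) and the sample coordinates are assumptions made for illustration, not the OpenContracts schema; the PDF Data Format guide describes the actual format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Token:
    """One word on a page with its bounding box (hypothetical field names)."""
    text: str
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


@dataclass(frozen=True)
class Page:
    """A page's dimensions plus its tokens in reading order."""
    index: int
    width: float
    height: float
    tokens: Tuple[Token, ...]


def tokens_in_box(page: Page, left: float, top: float,
                  right: float, bottom: float) -> List[Token]:
    """Return tokens whose boxes lie fully inside the given rectangle."""
    return [
        t for t in page.tokens
        if t.x >= left and t.y >= top
        and t.x + t.width <= right and t.y + t.height <= bottom
    ]


def annotation_text(page: Page, box: Tuple[float, float, float, float]) -> str:
    """Join the text of every token covered by an annotation box."""
    return " ".join(t.text for t in tokens_in_box(page, *box))


# Illustrative page: two tokens on one line, one token on the next line.
page = Page(index=0, width=612.0, height=792.0, tokens=(
    Token("Governing", 72.0, 100.0, 52.0, 12.0),
    Token("Law", 128.0, 100.0, 22.0, 12.0),
    Token("Section", 72.0, 130.0, 40.0, 12.0),
))

print(annotation_text(page, (70.0, 95.0, 160.0, 115.0)))  # prints "Governing Law"
```

Because an annotation in this sketch is only a rectangle over token boxes, any tool that reads the same page layout can recover the annotated text, which is the portability idea in general terms.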
The modular pipeline supports custom parsers, embedders, and thumbnail generators.
Each component inherits from a base class with a defined interface:
- Parsers — Extract text and structure from documents
- Embedders — Generate vector embeddings for search
- Thumbnailers — Create document previews
See the pipeline documentation for details on creating custom components.
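To make the base-class idea concrete, here is a minimal sketch of how a pluggable parser could be structured. Everything in it (`BaseParser`, `PlainTextParser`, `REGISTRY`, `register`, `parse_document`, and the shape of the returned dictionary) is hypothetical and chosen for illustration; the real base classes and their method signatures are defined in the OpenContracts pipeline documentation and may differ.

```python
from abc import ABC, abstractmethod
from typing import Dict, Tuple, Type


class BaseParser(ABC):
    """Hypothetical parser interface: raw bytes in, structured text out."""

    # MIME types this parser claims to handle (assumed convention).
    supported_types: Tuple[str, ...] = ()

    @abstractmethod
    def parse(self, data: bytes) -> Dict[str, object]:
        """Return the extracted text and structure for one document."""


class PlainTextParser(BaseParser):
    """Simple text extraction: split on blank lines to get paragraphs."""

    supported_types = ("text/plain", "text/markdown")

    def parse(self, data: bytes) -> Dict[str, object]:
        text = data.decode("utf-8")
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        return {"text": text, "paragraphs": paragraphs}


# Hypothetical registry mapping a MIME type to the parser class that handles it.
REGISTRY: Dict[str, Type[BaseParser]] = {}


def register(parser_cls: Type[BaseParser]) -> Type[BaseParser]:
    """Register a parser class under each MIME type it supports."""
    for mime in parser_cls.supported_types:
        REGISTRY[mime] = parser_cls
    return parser_cls


def parse_document(data: bytes, mime_type: str) -> Dict[str, object]:
    """Look up the parser for a MIME type and run it."""
    try:
        parser_cls = REGISTRY[mime_type]
    except KeyError:
        raise ValueError(f"No parser registered for {mime_type!r}") from None
    return parser_cls().parse(data)


register(PlainTextParser)
result = parse_document(b"Title\n\nFirst paragraph.\n\nSecond paragraph.", "text/plain")
print(result["paragraphs"])  # prints ['Title', 'First paragraph.', 'Second paragraph.']
```

The same pattern would extend to embedders and thumbnailers: one abstract base class per component type, with concrete implementations selected by configuration rather than hard-coded.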
git clone https://github.com/JSv4/OpenContracts.git
cd OpenContracts
docker compose -f local.yml up
In production, run migrations before starting services:
# Apply database migrations
docker compose -f production.yml --profile migrate up migrate
# Start services
docker compose -f production.yml up -d
The migration service runs once to avoid race conditions and ensures all tables are created before dependent services start.
OpenContracts collects anonymous usage data to guide development priorities. We collect:
- Installation events (unique installation ID)
- Feature usage statistics (analyzer runs, extracts created)
- Aggregate counts (documents, users, queries)
We do not collect document contents, extracted data, user identities, or query contents.
Disable with TELEMETRY_ENABLED=False in your settings.
Currently supported:
- PDF (full layout and annotation support)
- Text-based formats (plaintext, Markdown)
Coming soon: DOCX viewing and annotation powered by Docxodus, an open source in-browser Word document viewer. This will enable the same annotation and analysis workflows for Word documents that currently exist for PDFs.
This project builds on work from:
- AllenAI PAWLS — PDF annotation data format and concepts
- NLMatics nlm-ingestor — Document parsing pipeline
The data extraction grid UI draws inspiration from NLMatics' innovative approach to document querying.
AGPL-3.0 — See LICENSE for details.





