Enterprise-grade and API-first LLM workspace for unstructured documents, including data extraction, redaction, rights management, prompt playground, and more!

Open-Source-Legal/OpenContracts

OpenContracts

OpenContracts (Demo)

Open source document intelligence. Self-hosted, AI-powered, and built for teams who need to own their data.


Backend: CI/CD, codecov
Meta: code style (black), types (Mypy), imports (isort), license (AGPL-3.0)

What is OpenContracts?

OpenContracts is an AGPL-3.0 licensed platform for document analysis, annotation, and collaboration. It combines document management with AI-powered analysis tools, discussion threads, and structured data extraction.

Core Capabilities

  • Document Processing — Upload PDFs and text files, automatically extract structure with ML-based parsers
  • Annotation & Analysis — Highlight, label, and analyze documents with custom annotation schemas
  • AI Agents — Chat with documents using configurable AI assistants that can search and analyze content
  • Collaboration — Threaded discussions with @mentions, voting, and moderation at corpus and document levels
  • Data Extraction — Extract structured data from hundreds of documents using agent-powered queries
  • Version Control — Track document changes, restore previous versions, soft delete with recovery

Quick Look

Document Annotation

PDF Processing

Text Format Support

Txt Processing

Structured Data Extraction

Data Grid

Custom Analytics

Analyzer Annotations


Features

Document Management

  • Organize documents into collections (Corpuses) with folder hierarchies
  • Fine-grained permissions with public/private visibility controls
  • Document versioning with full history and restore capability
  • Bulk upload and batch operations

Parsing & Processing

  • Pluggable parser architecture supporting multiple backends:
    • Docling — ML-based structure extraction
    • NLM-Ingest — Layout-aware parsing
    • Text/Markdown — Simple text extraction
  • Automatic vector embeddings for semantic search (powered by pgvector)
  • Structural annotation extraction (headers, paragraphs, tables)
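As a rough illustration of what embedding-based semantic search does, the sketch below ranks a few toy vectors by cosine distance, the quantity pgvector's `<=>` operator computes. The vectors, labels, and function names are invented for this example; it is not OpenContracts code, and a real deployment would run the ranking inside PostgreSQL rather than in Python.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query, embeddings, k=2):
    """Return the k labels whose vectors are closest to the query."""
    ranked = sorted(
        embeddings,
        key=lambda label: cosine_distance(query, embeddings[label]),
    )
    return ranked[:k]


# Toy 3-dimensional vectors (hypothetical); real embeddings are much larger.
EMBEDDINGS = {
    "indemnification clause": [0.9, 0.1, 0.0],
    "termination clause": [0.1, 0.9, 0.0],
    "governing law clause": [0.0, 0.2, 0.9],
}

if __name__ == "__main__":
    print(nearest([1.0, 0.0, 0.0], EMBEDDINGS))
```

A smaller distance means the stored text is semantically closer to the query, so the search returns the first entries of the sorted list.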

Annotation Tools

  • Multi-page annotation support
  • Custom label schemas with validation
  • Relationship mapping between annotations
  • Import/export in standard formats

AI & LLM Integration

  • Built on PydanticAI for structured LLM interactions
  • Configurable AI agents with tool access (search, document loading, annotation queries)
  • Real-time streaming responses via WebSocket
  • Conversation history with context management

Collaboration (New in v3.0.0.b3)

  • Threaded discussions at global, corpus, and document levels
  • @mentions for documents, corpuses, and AI agents
  • Upvoting/downvoting with reputation tracking
  • Thread pinning, locking, and moderation controls
  • User profiles with activity feeds and statistics
  • Badges and achievements for community engagement
  • Leaderboards showing top contributors

Data Extraction

  • Define extraction schemas with multiple question types
  • Run extractions across document collections
  • Review and validate extracted data in grid view
  • Export results in structured formats

Documentation

Browse the full documentation at jsv4.github.io/OpenContracts or in the repo:

Guide                     Description
Quick Start               Get running with Docker in minutes
Key Concepts              Core workflows and terminology
PDF Data Format           How text maps to PDF coordinates
LLM Framework             PydanticAI integration and agents
Vector Stores             Semantic search architecture
Pipeline Overview         Parser and embedder system
Custom Extractors         Build your own data extraction tasks
v3.0.0.b3 Release Notes   Latest features and migration guide

Architecture

Data Format

OpenContracts uses a standardized format for representing text and layout on PDF pages, enabling portable annotations across tools:

Data Format

Processing Pipeline

The modular pipeline supports custom parsers, embedders, and thumbnail generators:

Pipeline Diagram

Each component inherits from a base class with a defined interface:

  • Parsers — Extract text and structure from documents
  • Embedders — Generate vector embeddings for search
  • Thumbnailers — Create document previews

See the pipeline documentation for details on creating custom components.
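The base-class pattern described above might look roughly like the following sketch. Every name here (`BaseParser`, `ParsedDocument`, `PlainTextParser`, `pick_parser`) is hypothetical and chosen for illustration; the actual OpenContracts interfaces are defined in the pipeline documentation and may differ.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class ParsedDocument:
    """Hypothetical parser output: raw text plus labelled structure."""
    text: str
    structure: list = field(default_factory=list)


class BaseParser(ABC):
    """Hypothetical base class; not the actual OpenContracts interface."""

    supported_extensions: tuple = ()

    @abstractmethod
    def parse(self, filename: str, content: bytes) -> ParsedDocument:
        """Extract text and structure from raw file bytes."""

    def can_handle(self, filename: str) -> bool:
        # str.endswith accepts a tuple of suffixes; an empty tuple matches nothing.
        return filename.lower().endswith(self.supported_extensions)


class PlainTextParser(BaseParser):
    """Minimal parser that treats blank-line-separated blocks as paragraphs."""

    supported_extensions = (".txt", ".md")

    def parse(self, filename: str, content: bytes) -> ParsedDocument:
        text = content.decode("utf-8")
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        structure = [("paragraph", p) for p in paragraphs]
        return ParsedDocument(text=text, structure=structure)


def pick_parser(filename: str, parsers: list) -> BaseParser:
    """Return the first registered parser that accepts the file."""
    for parser in parsers:
        if parser.can_handle(filename):
            return parser
    raise ValueError(f"No parser registered for {filename}")
```

Under this design, adding a new backend means subclassing the base class and registering an instance; the code that calls `parse` does not need to change.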


Deployment

Quick Start (Development)

git clone https://github.com/JSv4/OpenContracts.git
cd OpenContracts
docker compose -f local.yml up

Production

Run migrations before starting services:

# Apply database migrations
docker compose -f production.yml --profile migrate up migrate

# Start services
docker compose -f production.yml up -d

The migration service runs once, which avoids race conditions and ensures all tables exist before dependent services start.


Telemetry

OpenContracts collects anonymous usage data to guide development priorities. We collect:

  • Installation events (unique installation ID)
  • Feature usage statistics (analyzer runs, extracts created)
  • Aggregate counts (documents, users, queries)

We do not collect document contents, extracted data, user identities, or query contents.

Disable telemetry by setting TELEMETRY_ENABLED=False in your settings.
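How the flag is read is not described here. As a minimal sketch, assuming the value arrives as an environment variable string, a boolean setting such as `TELEMETRY_ENABLED` could be parsed as shown below. The `env_bool` helper and its accepted spellings are assumptions made for illustration, not part of OpenContracts.

```python
import os

_TRUE_VALUES = {"1", "true", "yes", "on"}
_FALSE_VALUES = {"0", "false", "no", "off"}


def env_bool(name, default, environ=None):
    """Parse a boolean flag from an environment mapping (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in _TRUE_VALUES:
        return True
    if value in _FALSE_VALUES:
        return False
    raise ValueError(f"{name} must be a boolean-like value, got {raw!r}")


if __name__ == "__main__":
    # Telemetry stays on unless explicitly disabled, matching the opt-out wording above.
    print(env_bool("TELEMETRY_ENABLED", default=True))
```

Rejecting unrecognized values, rather than silently treating them as true or false, makes a typo in an opt-out setting fail loudly instead of leaving telemetry on.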


Supported Formats

Currently supported:

  • PDF (full layout and annotation support)
  • Text-based formats (plaintext, Markdown)

Coming soon: DOCX viewing and annotation powered by Docxodus, an open source in-browser Word document viewer. This will enable the same annotation and analysis workflows for Word documents that currently exist for PDFs.


Acknowledgements

This project builds on work from:

The data extraction grid UI draws inspiration from NLMatics' innovative approach to document querying:

NLMatics Grid


License

AGPL-3.0 — See LICENSE for details.