Foss Generator

Synthetic Causal Training Data Generator for LLM Fine-Tuning

Part of the Sovereign Causal Graph research project.

Overview

Foss Generator produces synthetic chain-of-thought training data from domain ontologies. It is written in pure Python, has no external dependencies, and makes no API calls.

Performance: roughly 100,000 samples in about 10 seconds (on the order of 10,000 samples per second) on an Apple Silicon M-series machine.

flowchart LR
    A[Domain Ontology] --> B[Causal Mechanisms]
    B --> C[Template Engine]
    C --> D[Style Variants]
    D --> E[Quality Scoring]
    E --> F[Training Data]

    style A fill:#3b82f6,color:#fff
    style F fill:#10b981,color:#fff

Features

16 Industry Domains: Pharma, Finance, Cybersecurity, Healthcare, Energy, and more
12 Output Styles: Academic, executive, regulatory, technical, etc.
Quality Scoring: Automatic detection of quantification patterns in the generated text
Multiple Exports: JSONL, HuggingFace-compatible, CSV
Custom Ontologies: Load your own domain definitions
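
The quality-scoring feature is described only as "quantification pattern detection", so the exact rules are not documented here. As a rough illustration of the idea, the sketch below scores a text by how many kinds of numeric evidence it contains. The pattern list, the function name `quantification_score`, and the scoring formula are all assumptions made for this example, not the generator's actual implementation.

```python
import re

# Hypothetical pattern families for numeric evidence.
# The real scorer's rules are not documented; these are illustrative only.
QUANT_PATTERNS = [
    # Percentage range, e.g. "70-90%" or "70 to 90 %"
    re.compile(r"\d+(?:\.\d+)?\s*(?:-|to)\s*\d+(?:\.\d+)?\s*%"),
    # Single percentage, e.g. "58%"
    re.compile(r"\d+(?:\.\d+)?\s*%"),
    # Duration, e.g. "12 months"
    re.compile(r"\b\d+\s+(?:days?|weeks?|months?|years?)\b"),
]


def quantification_score(text: str) -> float:
    """Return the fraction of pattern families that match at least once."""
    hits = sum(1 for pattern in QUANT_PATTERNS if pattern.search(text))
    return hits / len(QUANT_PATTERNS)
```

Under these assumed patterns, a sentence such as "Typical revenue drop: 70-90% within 12 months." matches all three families, while a sentence with no numbers scores zero.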

Installation

git clone https://github.com/DT-Foss/foss-generator.git
cd foss-generator

No dependencies are required; the generator runs on pure Python 3.8 or later.

Usage

Generate samples for a specific domain:

python foss_generator.py --domain pharma --count 10000

Generate samples for all domains:

python foss_generator.py --all-domains --count 5000

Use a custom ontology:

python foss_generator.py --custom my_ontology.json --count 10000
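
The ontology schema is not specified in this README, so the sketch below does not try to validate it. It only confirms that a file such as `my_ontology.json` parses as JSON and reports its top-level keys, which can catch syntax errors before a long generation run. The helper name `summarize_json_file` is hypothetical and is not part of `foss_generator.py`.

```python
import json
from pathlib import Path


def summarize_json_file(path):
    """Parse a JSON file and return (top-level type name, sorted keys).

    Keys are returned only when the top-level value is an object;
    otherwise the key list is empty. Malformed input raises
    json.JSONDecodeError.
    """
    with Path(path).open(encoding="utf-8") as handle:
        data = json.load(handle)
    keys = sorted(data) if isinstance(data, dict) else []
    return type(data).__name__, keys
```

Calling `summarize_json_file("my_ontology.json")` returns a tuple like `("dict", [...])` for a JSON object, which can then be compared by eye against the structure used in `domain_ontologies.py`.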

Select an output format:

python foss_generator.py --domain finance --count 5000 --format huggingface
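
If only a JSONL file is at hand, it can be converted to CSV after the fact with the standard library. The sketch below assumes one JSON object per line with `prompt` and `completion` keys, as in the Output Sample section. The function name `jsonl_to_csv` and the file layout are assumptions for illustration and may differ from what the generator's own CSV export writes.

```python
import csv
import json


def jsonl_to_csv(jsonl_path, csv_path, fields=("prompt", "completion")):
    """Convert a JSONL file into a CSV with one column per field.

    Returns the number of records written. Keys not listed in `fields`
    are ignored; missing keys are written as empty strings.
    """
    count = 0
    with open(jsonl_path, encoding="utf-8") as src, \
         open(csv_path, "w", encoding="utf-8", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=list(fields), extrasaction="ignore")
        writer.writeheader()
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines between records
            writer.writerow(json.loads(line))
            count += 1
    return count
```

Opening the CSV with `newline=""` lets the `csv` module quote the multi-line `completion` text correctly, so embedded line breaks survive a round trip.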

Architecture

flowchart TB
    subgraph Input
        O[Domain Ontologies<br/>16 Industries]
        C[Custom Ontology<br/>JSON]
    end

    subgraph Generator
        M[Mechanism Selector]
        T[Template Engine<br/>12 Styles]
        Q[Quality Scorer]
    end

    subgraph Output
        J[JSONL]
        H[HuggingFace]
        V[CSV]
    end

    O --> M
    C --> M
    M --> T
    T --> Q
    Q --> J
    Q --> H
    Q --> V

    style O fill:#3b82f6,color:#fff
    style C fill:#3b82f6,color:#fff
    style Q fill:#f59e0b,color:#fff
    style J fill:#10b981,color:#fff
    style H fill:#10b981,color:#fff
    style V fill:#10b981,color:#fff

Output Sample

{
  "prompt": "Trace the causal pathway from patent cliff approaching to revenue decline.",
  "completion": "**Causal Analysis: patent cliff approaching → revenue decline**\n\n**1. Trigger Identification**\nThe causal sequence initiates with patent cliff approaching.\n\n**2. Mechanistic Pathway**\nGeneric competition preparation triggers pricing pressure and market share erosion.\n\n**3. Quantitative Evidence**\nTypical revenue drop: 70-90% within 12 months.\n\n**4. Confidence Assessment**\nThis causal relationship demonstrates very_high confidence."
}
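
Before fine-tuning on a generated file, it can be useful to confirm that every record has the two fields shown above. The sketch below assumes a JSONL file with one object per line; the function name `load_samples` and the strictness of the checks are choices made for this example rather than behavior of the generator.

```python
import json

REQUIRED_KEYS = ("prompt", "completion")


def load_samples(path):
    """Read a JSONL file and return its records as a list of dicts.

    Raises ValueError if any record lacks a non-empty string value
    for "prompt" or "completion". Blank lines are skipped.
    """
    samples = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue
            record = json.loads(line)
            for key in REQUIRED_KEYS:
                value = record.get(key) if isinstance(record, dict) else None
                if not isinstance(value, str) or not value:
                    raise ValueError(f"line {line_number}: missing or empty {key!r}")
            samples.append(record)
    return samples
```

Failing fast with the offending line number makes it easier to locate a truncated or hand-edited record in a large file.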

Available Domains

| Domain | Mechanisms | Description |
|--------|------------|-------------|
| `pharma` | 18 | Drug development, clinical trials, regulatory |
| `finance` | 18 | Risk management, trading, compliance |
| `cybersecurity` | 15 | Threat detection, incident response |
| `healthcare` | 16 | Clinical workflows, patient safety |
| `energy` | 15 | Grid operations, renewable integration |
| `manufacturing` | 14 | Supply chain, quality control |
| `legal` | 12 | Contract analysis, litigation |
| `insurance` | 14 | Underwriting, claims processing |
| `aviation` | 12 | Safety systems, maintenance |
| `telecom` | 12 | Network operations, capacity |
| `biotech` | 15 | Research, development pipelines |
| `mining` | 12 | Extraction, environmental compliance |
| `agriculture` | 12 | Crop management, supply chain |
| `real_estate` | 12 | Valuation, market dynamics |
| `defense` | 10 | Procurement, systems integration |
| `maritime` | 10 | Shipping, port operations |

Causal Chain Structure

flowchart LR
    T[Trigger Event] --> M[Mechanism]
    M --> O[Outcome]
    O --> Q[Quantification]

    style T fill:#ef4444,color:#fff
    style M fill:#f59e0b,color:#fff
    style O fill:#10b981,color:#fff
    style Q fill:#8b5cf6,color:#fff

Each generated sample follows this causal structure:

Trigger: Initiating condition (e.g., "Phase III trial completion")
Mechanism: Causal pathway (e.g., "FDA review process activation")
Outcome: Terminal effect (e.g., "drug approval or rejection")
Quantification: Measured evidence (e.g., "58% approval rate")
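
The four-part structure above can be sketched as a small data type plus a renderer that reproduces the layout of the Output Sample. The class `CausalChain`, the function `render_sample`, and the default confidence value are hypothetical names and choices for this illustration; the generator's real template engine supports 12 styles and is likely organized differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CausalChain:
    """One trigger -> mechanism -> outcome -> quantification chain."""
    trigger: str
    mechanism: str
    outcome: str
    quantification: str
    confidence: str = "high"  # hypothetical default, not taken from the source


def render_sample(chain: CausalChain) -> dict:
    """Render a chain into a prompt/completion pair shaped like the Output Sample."""
    prompt = f"Trace the causal pathway from {chain.trigger} to {chain.outcome}."
    completion = (
        f"**Causal Analysis: {chain.trigger} → {chain.outcome}**\n\n"
        f"**1. Trigger Identification**\n"
        f"The causal sequence initiates with {chain.trigger}.\n\n"
        f"**2. Mechanistic Pathway**\n{chain.mechanism}\n\n"
        f"**3. Quantitative Evidence**\n{chain.quantification}\n\n"
        f"**4. Confidence Assessment**\n"
        f"This causal relationship demonstrates {chain.confidence} confidence."
    )
    return {"prompt": prompt, "completion": completion}
```

Keeping the chain as structured data and rendering it last means the same chain could be passed through several style templates without re-deriving its content.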

Research

This generator is part of the Sovereign Causal Graph research on LLM-free scientific knowledge extraction. The project reports the following:

Foss Hallucination Gate: 14-step validation pipeline
Grade A/B extraction rate: 4.6% from raw PDF content
Validated by: Pieter Wuille (SIPA), Bitcoin Core maintainer

License

MIT License - see LICENSE

Author

David Tom Foss
