Synthetic causal training data generator for LLM fine-tuning. 16 industry domains, 200+ mechanisms, ~100K samples in 10 seconds. Pure Python.

Foss Generator

Synthetic Causal Training Data Generator for LLM Fine-Tuning

Part of the Sovereign Causal Graph research project.

Overview

Foss Generator creates high-quality synthetic Chain-of-Thought training data from domain ontologies. Pure Python, no external dependencies, no API calls required.

Performance: ~100,000 samples in ~10 seconds on Apple Silicon M-series.

flowchart LR
    A[Domain Ontology] --> B[Causal Mechanisms]
    B --> C[Template Engine]
    C --> D[Style Variants]
    D --> E[Quality Scoring]
    E --> F[Training Data]

    style A fill:#3b82f6,color:#fff
    style F fill:#10b981,color:#fff

Features

  • 16 Industry Domains: Pharma, Finance, Cybersecurity, Healthcare, Energy, and more
  • 12 Output Styles: Academic, executive, regulatory, technical, etc.
  • Quality Scoring: Automatic detection of quantification patterns in generated samples
  • Multiple Exports: JSONL, HuggingFace-compatible, CSV
  • Custom Ontologies: Load your own domain definitions
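
The README does not describe how quality scoring is implemented. As a rough illustration of what "quantification pattern detection" could look like, the sketch below scores a text by how many kinds of numeric evidence it contains. The pattern list and the `quantification_score` function are assumptions made for this example, not the generator's actual scorer.

```python
import re

# Hypothetical pattern families; the real scorer's rules are not documented here.
QUANT_PATTERNS = [
    re.compile(r"\d+(?:\.\d+)?\s*%"),                          # percentages, e.g. "58%"
    re.compile(r"\d+(?:\.\d+)?\s*-\s*\d+(?:\.\d+)?"),          # numeric ranges, e.g. "70-90"
    re.compile(r"\b\d+\s+(?:days?|weeks?|months?|years?)\b"),  # durations, e.g. "12 months"
]


def quantification_score(text: str) -> float:
    """Return the fraction of pattern families found in the text (0.0 to 1.0)."""
    hits = sum(1 for pattern in QUANT_PATTERNS if pattern.search(text))
    return hits / len(QUANT_PATTERNS)


print(quantification_score("Typical revenue drop: 70-90% within 12 months."))
print(quantification_score("Revenue declines sharply."))
```

A score like this could be used to filter out samples that make a causal claim without any numeric support.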

Installation

git clone https://github.com/DT-Foss/foss-generator.git
cd foss-generator

No dependencies required - pure Python 3.8+.

Usage

Generate samples for a specific domain:

python foss_generator.py --domain pharma --count 10000

Generate samples for all domains:

python foss_generator.py --all-domains --count 5000

Use a custom ontology:

python foss_generator.py --custom my_ontology.json --count 10000
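
The schema expected by `--custom` is not documented in this README, so the structure below is only a guess modeled on the trigger, mechanism, outcome, and quantification fields described under Causal Chain Structure. Every key name, the `logistics` domain, and the `build_ontology` helper are hypothetical; check the repository's built-in ontologies for the real format before relying on it.

```python
import json


def build_ontology() -> dict:
    """Build a hypothetical custom ontology; all key names are assumptions."""
    return {
        "domain": "logistics",  # illustrative domain, not one of the 16 built-ins
        "mechanisms": [
            {
                "trigger": "port congestion",
                "mechanism": "vessel queueing delays container turnaround",
                "outcome": "late deliveries",
                # Placeholder on purpose: supply a sourced figure of your own.
                "quantification": "REPLACE WITH A SOURCED FIGURE",
                "confidence": "medium",
            }
        ],
    }


# Serialize to the JSON text that would be saved as my_ontology.json.
ontology_json = json.dumps(build_ontology(), indent=2)
print(ontology_json)
```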

Choose an output format:

python foss_generator.py --domain finance --count 5000 --format huggingface
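
To generate several domains in one go from a script, the command line can be assembled programmatically. The sketch below uses only the flags shown above (`--domain`, `--count`, `--format`) and assumes `foss_generator.py` is in the current directory. The `build_command` helper is a name introduced for this example.

```python
import sys
from typing import List, Optional


def build_command(domain: str, count: int, output_format: Optional[str] = None) -> List[str]:
    """Assemble the argument list for one generator run."""
    cmd = [sys.executable, "foss_generator.py", "--domain", domain, "--count", str(count)]
    if output_format is not None:
        cmd += ["--format", output_format]
    return cmd


# Print the commands instead of running them; to execute one, pass the list
# to subprocess.run(cmd, check=True).
for domain in ["pharma", "finance"]:
    print(" ".join(build_command(domain, 5000, "huggingface")))
```

Building the argument list separately from executing it makes the batch easy to inspect or log before any samples are generated.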

Architecture

flowchart TB
    subgraph Input
        O[Domain Ontologies<br/>16 Industries]
        C[Custom Ontology<br/>JSON]
    end

    subgraph Generator
        M[Mechanism Selector]
        T[Template Engine<br/>12 Styles]
        Q[Quality Scorer]
    end

    subgraph Output
        J[JSONL]
        H[HuggingFace]
        V[CSV]
    end

    O --> M
    C --> M
    M --> T
    T --> Q
    Q --> J
    Q --> H
    Q --> V

    style O fill:#3b82f6,color:#fff
    style C fill:#3b82f6,color:#fff
    style Q fill:#f59e0b,color:#fff
    style J fill:#10b981,color:#fff
    style H fill:#10b981,color:#fff
    style V fill:#10b981,color:#fff

Output Sample

{
  "prompt": "Trace the causal pathway from patent cliff approaching to revenue decline.",
  "completion": "**Causal Analysis: patent cliff approaching → revenue decline**\n\n**1. Trigger Identification**\nThe causal sequence initiates with patent cliff approaching.\n\n**2. Mechanistic Pathway**\nGeneric competition preparation triggers pricing pressure and market share erosion.\n\n**3. Quantitative Evidence**\nTypical revenue drop: 70-90% within 12 months.\n\n**4. Confidence Assessment**\nThis causal relationship demonstrates very_high confidence."
}
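
Assuming the JSONL export writes one such object per line with `prompt` and `completion` keys (as the sample suggests), records can be read back with the standard library alone. The `read_samples` helper is a name introduced for this sketch, and the in-memory stream stands in for a generated file.

```python
import io
import json


def read_samples(stream):
    """Yield (prompt, completion) pairs from a JSONL stream, skipping blank lines."""
    for line in stream:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield record["prompt"], record["completion"]


# In-memory stand-in for a generated .jsonl file (completion shortened).
demo = io.StringIO(
    json.dumps(
        {
            "prompt": "Trace the causal pathway from patent cliff approaching to revenue decline.",
            "completion": "**1. Trigger Identification**\nThe causal sequence initiates with patent cliff approaching.",
        }
    )
    + "\n"
)

pairs = list(read_samples(demo))
print(len(pairs), "sample(s) loaded")
```

For a real file, replace `demo` with `open(path, encoding="utf-8")`.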

Available Domains

| Domain | Mechanisms | Description |
|---|---|---|
| pharma | 18 | Drug development, clinical trials, regulatory |
| finance | 18 | Risk management, trading, compliance |
| cybersecurity | 15 | Threat detection, incident response |
| healthcare | 16 | Clinical workflows, patient safety |
| energy | 15 | Grid operations, renewable integration |
| manufacturing | 14 | Supply chain, quality control |
| legal | 12 | Contract analysis, litigation |
| insurance | 14 | Underwriting, claims processing |
| aviation | 12 | Safety systems, maintenance |
| telecom | 12 | Network operations, capacity |
| biotech | 15 | Research, development pipelines |
| mining | 12 | Extraction, environmental compliance |
| agriculture | 12 | Crop management, supply chain |
| real_estate | 12 | Valuation, market dynamics |
| defense | 10 | Procurement, systems integration |
| maritime | 10 | Shipping, port operations |
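
As a quick consistency check, the mechanism counts in the table above can be tallied against the "200+ mechanisms" figure in the project description. The dictionary simply copies the table; it is not read from the generator itself.

```python
# Mechanism counts copied from the Available Domains table.
MECHANISMS_PER_DOMAIN = {
    "pharma": 18, "finance": 18, "cybersecurity": 15, "healthcare": 16,
    "energy": 15, "manufacturing": 14, "legal": 12, "insurance": 14,
    "aviation": 12, "telecom": 12, "biotech": 15, "mining": 12,
    "agriculture": 12, "real_estate": 12, "defense": 10, "maritime": 10,
}

total_mechanisms = sum(MECHANISMS_PER_DOMAIN.values())
print(f"{len(MECHANISMS_PER_DOMAIN)} domains, {total_mechanisms} mechanisms")
# prints "16 domains, 217 mechanisms"
```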

Causal Chain Structure

flowchart LR
    T[Trigger Event] --> M[Mechanism]
    M --> O[Outcome]
    O --> Q[Quantification]

    style T fill:#ef4444,color:#fff
    style M fill:#f59e0b,color:#fff
    style O fill:#10b981,color:#fff
    style Q fill:#8b5cf6,color:#fff

Each generated sample follows this causal structure:

  • Trigger: Initiating condition (e.g., "Phase III trial completion")
  • Mechanism: Causal pathway (e.g., "FDA review process activation")
  • Outcome: Terminal effect (e.g., "drug approval or rejection")
  • Quantification: Measured evidence (e.g., "58% approval rate")
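
The four parts map naturally onto a small record type. The sketch below is one way to model a chain and render it into a prompt/completion pair shaped like the Output Sample. The `CausalChain` class and its `to_sample` method are illustrative assumptions, not the generator's internal data model, and the rendering is simplified to a single style.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CausalChain:
    trigger: str         # initiating condition
    mechanism: str       # causal pathway
    outcome: str         # terminal effect
    quantification: str  # measured evidence

    def to_sample(self) -> dict:
        """Render the chain as a prompt/completion pair in one fixed style."""
        prompt = f"Trace the causal pathway from {self.trigger} to {self.outcome}."
        completion = "\n\n".join(
            [
                f"**Causal Analysis: {self.trigger} → {self.outcome}**",
                f"**1. Trigger Identification**\nThe causal sequence initiates with {self.trigger}.",
                f"**2. Mechanistic Pathway**\n{self.mechanism}.",
                f"**3. Quantitative Evidence**\n{self.quantification}.",
            ]
        )
        return {"prompt": prompt, "completion": completion}


# Values taken from the examples in the list above.
chain = CausalChain(
    trigger="Phase III trial completion",
    mechanism="FDA review process activation",
    outcome="drug approval or rejection",
    quantification="58% approval rate",
)
sample = chain.to_sample()
print(sample["prompt"])
```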

Research

This generator is part of the Sovereign Causal Graph research on LLM-free scientific knowledge extraction. The project reports the following:

  • Foss Hallucination Gate: 14-step validation pipeline
  • Grade A/B extraction rate: 4.6% from raw PDF content
  • Validated by: Pieter Wuille (SIPA), Bitcoin Core maintainer

License

MIT License - see LICENSE

Author

David Tom Foss
