Synthetic Causal Training Data Generator for LLM Fine-Tuning
Part of the Sovereign Causal Graph research project.
Foss Generator creates high-quality synthetic Chain-of-Thought training data from domain ontologies. Pure Python, no external dependencies, no API calls required.
Performance: ~100,000 samples in ~10 seconds on Apple Silicon M-series.
flowchart LR
A[Domain Ontology] --> B[Causal Mechanisms]
B --> C[Template Engine]
C --> D[Style Variants]
D --> E[Quality Scoring]
E --> F[Training Data]
style A fill:#3b82f6,color:#fff
style F fill:#10b981,color:#fff
- 16 Industry Domains: Pharma, Finance, Cybersecurity, Healthcare, Energy, and more
- 12 Output Styles: Academic, executive, regulatory, technical, etc.
- Quality Scoring: Automatic quantification pattern detection
- Multiple Exports: JSONL, HuggingFace-compatible, CSV
- Custom Ontologies: Load your own domain definitions
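
The "Quality Scoring" feature is described only as quantification pattern detection, so the following is a minimal sketch of one way such a scorer could work, not the project's actual implementation. The names `QUANT_PATTERNS` and `quantification_score`, and the three regular expressions, are assumptions made for illustration.

```python
import re

# Hypothetical pattern families; the generator's real rules are not documented here.
QUANT_PATTERNS = [
    re.compile(r"\d+(?:\.\d+)?\s*%"),                  # percentages such as "58%"
    re.compile(r"\d+(?:\.\d+)?\s*-\s*\d+(?:\.\d+)?"),  # numeric ranges such as "70-90"
    re.compile(r"\b\d+\s+(?:days?|weeks?|months?|years?)\b", re.IGNORECASE),  # durations
]


def quantification_score(text: str) -> float:
    """Return the fraction of pattern families found in text (0.0 to 1.0)."""
    hits = sum(1 for pattern in QUANT_PATTERNS if pattern.search(text))
    return hits / len(QUANT_PATTERNS)
```

Under these assumptions, a completion that contains a percentage, a range, and a time horizon scores highest, while purely qualitative text scores zero. A real scorer would likely weight patterns differently.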
git clone https://github.com/DT-Foss/foss-generator.git
cd foss-generator

No dependencies required - pure Python 3.8+.
python foss_generator.py --domain pharma --count 10000
python foss_generator.py --all-domains --count 5000
python foss_generator.py --custom my_ontology.json --count 10000
python foss_generator.py --domain finance --count 5000 --format huggingface

flowchart TB
subgraph Input
O[Domain Ontologies<br/>16 Industries]
C[Custom Ontology<br/>JSON]
end
subgraph Generator
M[Mechanism Selector]
T[Template Engine<br/>12 Styles]
Q[Quality Scorer]
end
subgraph Output
J[JSONL]
H[HuggingFace]
V[CSV]
end
O --> M
C --> M
M --> T
T --> Q
Q --> J
Q --> H
Q --> V
style O fill:#3b82f6,color:#fff
style C fill:#3b82f6,color:#fff
style Q fill:#f59e0b,color:#fff
style J fill:#10b981,color:#fff
style H fill:#10b981,color:#fff
style V fill:#10b981,color:#fff
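
The output stage of the diagram can be approximated with the standard library alone. The sketch below assumes each sample is a dict with `prompt` and `completion` keys; the helper names `to_jsonl` and `to_csv` are hypothetical and are not taken from `foss_generator.py`. The HuggingFace-compatible layout is not specified in this README, so it is left out.

```python
import csv
import io
import json


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


def to_csv(records, fieldnames=("prompt", "completion")):
    """Serialize records as CSV with a header row; embedded newlines are quoted."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

JSON Lines keeps multi-line completions on a single physical line because `json.dumps` escapes newlines, which makes the file easy to stream. CSV relies on quoting instead, so readers must use a real CSV parser rather than splitting on line breaks.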
{
"prompt": "Trace the causal pathway from patent cliff approaching to revenue decline.",
"completion": "**Causal Analysis: patent cliff approaching → revenue decline**\n\n**1. Trigger Identification**\nThe causal sequence initiates with patent cliff approaching.\n\n**2. Mechanistic Pathway**\nGeneric competition preparation triggers pricing pressure and market share erosion.\n\n**3. Quantitative Evidence**\nTypical revenue drop: 70-90% within 12 months.\n\n**4. Confidence Assessment**\nThis causal relationship demonstrates very_high confidence."
}

| Domain | Mechanisms | Description |
|---|---|---|
| pharma | 18 | Drug development, clinical trials, regulatory |
| finance | 18 | Risk management, trading, compliance |
| cybersecurity | 15 | Threat detection, incident response |
| healthcare | 16 | Clinical workflows, patient safety |
| energy | 15 | Grid operations, renewable integration |
| manufacturing | 14 | Supply chain, quality control |
| legal | 12 | Contract analysis, litigation |
| insurance | 14 | Underwriting, claims processing |
| aviation | 12 | Safety systems, maintenance |
| telecom | 12 | Network operations, capacity |
| biotech | 15 | Research, development pipelines |
| mining | 12 | Extraction, environmental compliance |
| agriculture | 12 | Crop management, supply chain |
| real_estate | 12 | Valuation, market dynamics |
| defense | 10 | Procurement, systems integration |
| maritime | 10 | Shipping, port operations |

flowchart LR
T[Trigger Event] --> M[Mechanism]
M --> O[Outcome]
O --> Q[Quantification]
style T fill:#ef4444,color:#fff
style M fill:#f59e0b,color:#fff
style O fill:#10b981,color:#fff
style Q fill:#8b5cf6,color:#fff
Each generated sample follows this causal structure:
- Trigger: Initiating condition (e.g., "Phase III trial completion")
- Mechanism: Causal pathway (e.g., "FDA review process activation")
- Outcome: Terminal effect (e.g., "drug approval or rejection")
- Quantification: Measured evidence (e.g., "58% approval rate")
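
The four-part structure above maps naturally onto a small record type. The sketch below is an illustration under stated assumptions: the `CausalMechanism` class, the `render_sample` function, and the `confidence` field are hypothetical names, and the template text is copied from the example record shown earlier rather than from the generator's source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CausalMechanism:
    """One trigger -> mechanism -> outcome chain with its quantification."""
    trigger: str
    mechanism: str
    outcome: str
    quantification: str
    confidence: str = "high"  # hypothetical default label


def render_sample(m: CausalMechanism) -> dict:
    """Render a mechanism into a prompt/completion pair (one fixed style)."""
    prompt = f"Trace the causal pathway from {m.trigger} to {m.outcome}."
    completion = "\n\n".join([
        f"**Causal Analysis: {m.trigger} → {m.outcome}**",
        f"**1. Trigger Identification**\nThe causal sequence initiates with {m.trigger}.",
        f"**2. Mechanistic Pathway**\n{m.mechanism}",
        f"**3. Quantitative Evidence**\n{m.quantification}",
        f"**4. Confidence Assessment**\nThis causal relationship demonstrates {m.confidence} confidence.",
    ])
    return {"prompt": prompt, "completion": completion}


example = CausalMechanism(
    trigger="patent cliff approaching",
    mechanism="Generic competition preparation triggers pricing pressure and market share erosion.",
    outcome="revenue decline",
    quantification="Typical revenue drop: 70-90% within 12 months.",
    confidence="very_high",
)
sample = render_sample(example)
```

Keeping the mechanism data separate from the template is what would allow the same chain to be rendered in several output styles; only `render_sample` would need a per-style variant.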
This generator is part of the Sovereign Causal Graph research on LLM-free scientific knowledge extraction. The project reports:
- Foss Hallucination Gate: 14-step validation pipeline
- Grade A/B extraction rate: 4.6% from raw PDF content
- Validated by: Pieter Wuille (sipa), Bitcoin Core maintainer
MIT License - see LICENSE
David Tom Foss