Simple is a bioinformatics pipeline for mapping EMS-induced point mutations using bulk segregant analysis. This Dockerized version provides an easy-to-use, reproducible environment for running the analysis.
Important
Create a Working Directory
You must create a folder containing your FastQ files before running the pipeline. The pipeline will create all output in a timestamped subdirectory within this folder.
If you have cloned this repository, just use the data and runs directories.
Otherwise create them inside a clean directory.
The pipeline automatically detects input files using flexible name patterns. File names must contain:
- "mut" for mutant bulk samples
- "wt" for wild-type bulk samples
- "R1" or "R2" to indicate read direction
- Any FastQ extension: .fastq, .fq, .fastq.gz, .fq.gz, etc.
File Structure and File Naming
# 1. Create your analysis folder
mkdir my_mutant_analysis
cd my_mutant_analysis
# 2. Create the data folder for your FastQ files and the runs folder for output
mkdir -p data runs
# 3. Copy your FastQ files to the data folder
# Your folder should now look like this:
ls -la data/
# mut.R1.fq.gz mut.R2.fq.gz wt.R1.fq.gz wt.R2.fq.gz
my_analysis/
├── data/
│ ├── mut.R1.fq.gz # Mutant bulk, read 1 (compressed)
│ ├── mut.R2.fq.gz # Mutant bulk, read 2 (compressed)
│ ├── wt.R1.fastq # Wild-type bulk, read 1 (uncompressed)
│ └── wt.R2.fastq # Wild-type bulk, read 2 (uncompressed)
└── runs/ # Created automatically by pipeline
my_analysis/
├── data/
│ ├── root_mutant.mut.R1.fq.gz
│ ├── root_mutant.mut.R2.fq.gz
│ ├── root_mutant.wt.R1.fq.gz
│ └── root_mutant.wt.R2.fq.gz
└── runs/ # Created automatically by pipeline
my_analysis/
├── data/
│ ├── sample_mut.R1.fastq.gz
│ ├── sample_mut.R2.fastq.gz
│ ├── sample_wt.R1.fastq.gz
│ └── sample_wt.R2.fastq.gz
└── runs/ # Created automatically by pipeline
simple-fork/
├── data/ # Example data folder (rename to your analysis name)
│ ├── mut.R1.fq.gz
│ ├── mut.R2.fq.gz
│ ├── wt.R1.fq.gz
│ └── wt.R2.fq.gz
├── scripts/ # Pipeline scripts
├── programs/ # Bioinformatics tools
└── Dockerfile
Invalid names:
- ❌ mutant.fastq - Missing "R1" or "R2" identifier
- ❌ wildtype.fastq - Missing "wt" and an "R1" or "R2" identifier
- ❌ mut_1.fastq - Should be mut.R1.fastq
- ❌ wt_1.fastq - Should be wt.R1.fastq
- ❌ Files without "mut" or "wt" in the name
- ❌ Files without "R1" or "R2" in the name
Valid names:
- mut.R1.fq.gz, mut.R2.fq.gz, wt.R1.fq.gz, wt.R2.fq.gz
- root_mutant.mut.R1.fastq, root_mutant.mut.R2.fastq, root_mutant.wt.R1.fastq, root_mutant.wt.R2.fastq
- sample_mut.R1.fq, sample_mut.R2.fq, sample_wt.R1.fq, sample_wt.R2.fq
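The naming rules can be checked locally before starting a run. The sketch below is a hypothetical helper, not part of the pipeline, and the pipeline's own matching logic may differ. It assumes that "mut", "wt", "R1" and "R2" appear as tokens delimited by ".", "_" or "-", which is stricter than a plain substring test and lets a name such as root_mutant.wt.R1.fq.gz be read as wild type.

```bash
#!/usr/bin/env bash
# classify_fastq is a hypothetical helper (not shipped with the pipeline).
# It prints "<bulk> <read>" for a conforming name, or "invalid" otherwise.
# ASSUMPTION: tokens are bounded by '.', '_' or '-' (or the start of the name).
classify_fastq() {
  local name bulk="" read_dir=""
  local bulk_re='(^|[._-])(mut|wt)([._-])'
  local read_re='(^|[._-])(R1|R2)([._-])'
  name=$(basename "$1")

  # "root_mutant" alone does not count as "mut": a delimiter must follow the token.
  if [[ $name =~ $bulk_re ]]; then
    bulk=${BASH_REMATCH[2]}
  fi
  if [[ $name =~ $read_re ]]; then
    read_dir=${BASH_REMATCH[2]}
  fi

  if [[ -z $bulk || -z $read_dir ]]; then
    echo "invalid"
    return 1
  fi
  echo "$bulk $read_dir"
}
```

Calling the function on each file in data/ in a loop catches names such as mut_1.fastq before Docker is started.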
Before configuring, check what species are available:
# View available species using Docker
docker run --rm ghcr.io/andraghetti/simple cat /app/scripts/data_base.txt | grep -v "^#" | awk '{print $1}'
Available species include:
- Arabidopsis_thaliana
- Oryza_sativa_Japonica
- Zea_mays
- Solanum_lycopersicum
- Drosophila_melanogaster
- caenorhabditis_elegans
- danio_rerio
- Saccharomyces_cerevisiae
Copy the name of your species exactly as it is listed. You will pass it to the --species option in the next step.
Example: --species Drosophila_melanogaster
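Because the species name has to match a listed entry, it can help to test the value before launching a long run. The following is a minimal sketch under the assumption that matching is exact and case-sensitive; species_available is a hypothetical helper name, not a pipeline command.

```bash
#!/usr/bin/env bash
# species_available is a hypothetical helper (not part of the pipeline).
# It reads one species name per line from standard input and succeeds only
# if the argument matches a whole line exactly (ASSUMPTION: case-sensitive).
species_available() {
  grep -Fxq -- "$1"
}

# Possible usage with the listing command shown above (requires Docker):
#   docker run --rm ghcr.io/andraghetti/simple cat /app/scripts/data_base.txt \
#     | grep -v "^#" | awk '{print $1}' \
#     | species_available Zea_mays && echo "species found"
```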
The pipeline accepts command line arguments that override the default values. You can customize the analysis by passing parameters to the Docker container:
Available Parameters:
- --line LINE_NAME: Line name (default: EMS)
- --mutation TYPE: Mutation type: recessive or dominant (default: recessive)
- --species SPECIES: Species name (default: Arabidopsis_thaliana)
- --cpu-cores CORES: CPU cores: auto or number (default: auto)
- --memory MEMORY: Java memory allocation (default: auto)
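The parameters above can be assembled into a command line by a small wrapper. The sketch below is hypothetical and only prints the command for review instead of running it. The function name simple_cmd and its positional arguments are assumptions. The flags, defaults, image name and mount points are the ones documented here.

```bash
#!/usr/bin/env bash
# simple_cmd is a hypothetical wrapper (not part of the pipeline).
# It prints a docker run command built from three positional arguments:
#   $1 line name, $2 mutation type, $3 species (defaults as documented above).
simple_cmd() {
  local line=${1:-EMS} mutation=${2:-recessive} species=${3:-Arabidopsis_thaliana}

  # Only the two documented mutation types are accepted.
  case $mutation in
    recessive|dominant) ;;
    *) echo "mutation must be recessive or dominant" >&2; return 1 ;;
  esac

  # %q quotes the working directory so that paths with spaces stay intact.
  printf 'docker run --rm -v %q/data:/app/data -v %q/runs:/app/runs %s --line %s --mutation %s --species %s\n' \
    "$PWD" "$PWD" "ghcr.io/andraghetti/simple:latest" "$line" "$mutation" "$species"
}
```

Printing the command first makes it easy to confirm the mounts and options before committing to a full run.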
# Pull the pre-built image from GitHub Container Registry
docker pull ghcr.io/andraghetti/simple:latest
# Run the pipeline with default settings
docker run --rm \
-v $(pwd)/data:/app/data \
-v $(pwd)/runs:/app/runs \
ghcr.io/andraghetti/simple:latest
Alternative example with all the configurable options:
# Run the pipeline with custom parameters
docker run --rm \
-v $(pwd)/data:/app/data \
-v $(pwd)/runs:/app/runs \
ghcr.io/andraghetti/simple:latest \
--line root_mutant \
--mutation recessive \
--species Arabidopsis_thaliana \
--cpu-cores 8 \
--memory 16g
# Build the image locally (works with regular docker build)
docker build -t simple-pipeline .
# Run the pipeline with default settings
docker run --rm \
-v $(pwd)/data:/app/data \
-v $(pwd)/runs:/app/runs \
simple-pipeline
After completion, check the runs/ directory for your results:
- runs/run-YYYYMMDD_HHMMSS/output/YOUR_LINE_NAME.candidates.txt - Candidate mutations
- runs/run-YYYYMMDD_HHMMSS/output/YOUR_LINE_NAME.allSNPs.txt - All SNPs for plotting
- runs/run-YYYYMMDD_HHMMSS/output/YOUR_LINE_NAME.Rplot_*.pdf - Manhattan plots
- runs/run-YYYYMMDD_HHMMSS/output/snpEff_summary.html - snpEff annotation summary
- runs/run-YYYYMMDD_HHMMSS/output/snpEff_genes.txt - snpEff gene annotations
- runs/run-YYYYMMDD_HHMMSS/output/log.txt - Complete execution log
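Since every run writes to its own timestamped folder, locating the most recent one is a common first step. The sketch below assumes the run-YYYYMMDD_HHMMSS naming shown above, so that alphabetical order equals chronological order. latest_run is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# latest_run is a hypothetical helper (not part of the pipeline).
# It prints the newest run-* directory under the given runs folder
# (default: ./runs) and fails if none exists.
latest_run() {
  local runs_dir=${1:-runs} newest="" d
  for d in "$runs_dir"/run-*/; do
    [ -d "$d" ] || continue   # skips the unexpanded pattern when nothing matches
    newest=${d%/}             # glob results are sorted, so the last match is the newest
  done
  [ -n "$newest" ] || return 1
  echo "$newest"
}

# Possible usage: show the candidate file of the latest run for line "EMS"
#   cat "$(latest_run runs)/output/EMS.candidates.txt"
```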
- Reference Preparation: Downloads and indexes reference genome
- Read Alignment: Maps reads using BWA with optimized parameters
- BAM Processing: Sorts, fixes mates, and marks duplicates
- Variant Calling: Uses GATK HaplotypeCaller for variant discovery
- Variant Annotation: Annotates variants with snpEff
- Candidate Selection: Identifies candidate genes based on mutation type
- Create analysis folder and add files:
# Create your analysis folder
mkdir root_analysis
cd root_analysis
mkdir -p data runs
# Add your FastQ files to data/ (copy or download them there)
# Files: root_mutant.mut.R1.fq.gz, root_mutant.mut.R2.fq.gz
#        root_mutant.wt.R1.fq.gz, root_mutant.wt.R2.fq.gz
- Run the pipeline:
docker run --rm \
-v $(pwd)/data:/app/data \
-v $(pwd)/runs:/app/runs \
ghcr.io/andraghetti/simple:latest \
--line root_mutant \
--mutation recessive \
--species Arabidopsis_thaliana
- Check results: Look in runs/run-YYYYMMDD_HHMMSS/output/root_mutant.candidates.txt for candidate genes affecting root development
- Pipeline fails immediately: Check the file naming convention in your data directory
  - Files must contain "mut" or "wt" AND "R1" or "R2"
  - Examples: mut.R1.fq.gz, wt.R1.fastq, line_mut.R1.fq.gz
- Out of memory errors: Reduce the --memory parameter (e.g., use --memory 8g)
- Slow performance: Ensure --cpu-cores auto for maximum performance
- Missing reference: Check that --species matches an entry in data_base.txt
- Invalid arguments: Use --help to see available options
- No files found: Verify that the files are in the data/ directory and follow the naming pattern
- Permission errors: Ensure Docker has read/write access to your analysis directory
- Check the log files in runs/run-TIMESTAMP/output/
- Verify file naming convention matches exactly
- Ensure Docker has sufficient resources allocated
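When checking the log, a quick keyword scan can point to the failing step. The sketch below is a hypothetical helper. The format of log.txt is not documented here, so the keywords "error", "exception" and "fail" are guesses and may produce false positives or miss the real cause.

```bash
#!/usr/bin/env bash
# log_errors is a hypothetical helper (not part of the pipeline).
# It prints the last N (default 20) log lines that contain a suspicious
# keyword, each prefixed with its line number.
# ASSUMPTION: problems show up as "error", "exception" or "fail" (any case).
log_errors() {
  local log=$1 count=${2:-20}
  [ -f "$log" ] || { echo "log not found: $log" >&2; return 2; }
  grep -inE 'error|exception|fail' "$log" | tail -n "$count"
}

# Possible usage:
#   log_errors runs/run-TIMESTAMP/output/log.txt
```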
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this pipeline in your research, please cite the original Simple paper and acknowledge this optimized version.