-
Notifications
You must be signed in to change notification settings - Fork 6
Adding UMI consensus flow to the pipeline #134
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Conversation
v2.0.6 Release PR
nvnieuwk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are some comments, the overall implementation is executed pretty well. Good job!
Can you also make sure that the versions for every process are exported and added to ch_versions in the main flow?
subworkflows/local/consensus/main.nf
Outdated
| .branch { _meta, _fq1, _fq2, umi_flag -> | ||
| umi_in_readname: umi_flag == true | ||
| umi_in_seq: umi_flag == false | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you use a single boolean to determine how the flow should be, you can use if statements instead of .branch which is a bit more readable here. e.g.
if(umi_in_readname) {
// Do the flow if the UMI is in readname
} else {
// Do the flow if the UMI is not in the readname
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rather like the use of branch here.
As a whole, I'd consider moving all settings (e.g. umi_flag and umi_in_readname) to the metadata map.
In practice, the inputs to the workflow will most likely be a mix of UMI enabled and non UMI samples, and we need to be able to handle this usecase
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@matthdsm It's feasible to add settings to the metadata map and change the workflow based on it. But my question is how to distinguish samples into UMI samples and non UMI samples, if samples are the mix of those two. Do you have any suggestion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could add an extra field to the samplesheet where the pipeline user can specify which files have UMIs and which files don't
nvnieuwk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good! Feel free to implement the next step 🥳
|
Hi @Hwanseok-Jeong, Would you be able to add a mermaid type diagram with a grand view of how you envision the workflow and integration of the consensus pipeline? https://github.blog/developer-skills/github/include-diagrams-markdown-files-mermaid/ Thanks! |
subworkflows/local/consensus/main.nf
Outdated
| def ch_bwa_index = Channel.value(file(params.genomes.GRCh38.bwamem)) | ||
| def ch_fasta = Channel.value(file(params.genomes.GRCh38.fasta)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GRCh38 is not the only supported organism in the pipeline. You should try and get the references dynamically based on the organism in the meta data. You can have a look at other parts of the pipeline to get some inspiration for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nor is BWA the only supported aligner
workflows/preprocessing.nf
Outdated
| 'LB': meta.library ?: "", | ||
| 'PL': meta.platform ?: rg.PL, | ||
| 'ID': meta.readgroup ?: rg.ID | ||
| 'ID': (meta.readgroup ?: rg.ID ?: meta.id ?: samplename) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reason for this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick context: when I ran the UMI test FASTQs from assets/fastq_umi.yml, bwa mem died with
"[E::bwa_set_rg] no ID within the read group line" — neither meta.readgroup nor rg.ID was set.
I tried adding readgroup in the YAML, but the schema ignores it (Nextflow warns and drops it), so the @rg still had no ID.
I added a safe fallback for @rg:ID:
meta.readgroup ?: rg.ID ?: meta.id ?: samplename
If an ID already exists we keep it; only when it’s missing do we fall back to meta.id, then the sample name. This fixes the UMI test case and I think it doesn’t change behavior for datasets that already provide an ID.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@matthdsm you know more about this, do you like this solution?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rg.ID should always be set. If not, I think there might be something wrong with the headers of the fastq records.
The metadata fields are set here
Also, the entire subworkflow should be moved to after line 227, since now you're only working with the explicit fastq inputs and ignoring the fastq's from the demultiplex subworkflow.
Workflow Grand View with UMI Consensusflowchart TD
A[Raw FASTQ] --> B{Input Type}
B -->|Flowcell| C[BCL Demultiplex]
B -->|FASTQ| D[FASTQ Input]
C --> E[ch_input_fastq]
D --> E[ch_input_fastq]
E --> F{UMI Consensus?}
F -->|No| H[Original Reads]
F -->|Yes| G[Enter UMI Consensus]
subgraph SWF [UMI Consensus Subworkflow]
G1[Group Reads by UMI]
G2[Call Consensus Reads]
G3[Filter Consensus Reads]
G1 --> G2 --> G3
end
G --> G1
SWF --> I[Consensus Reads]
H --> I
I --> J[QC & Trimming : fastp]
J --> K[Alignment and CRAM/BAM Conversion]
K --> L[Coverage Analysis]
K --> M[BAM QC]
L --> N[MultiQC]
M --> N
|
nvnieuwk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it might be good here to try and reuse as much of the code as possible. There's already a subworkflow in this pipeline that handles the alignment of with all supported tools. It's less of a burden to maintain the pipeline if we only need to update the alignment in one place
Mistakenly committed file
…ng/preprocessing into InternUZGent/shared
subworkflows/local/consensus/main.nf
Outdated
| def sample = meta.samplename ?: UUID.randomUUID().toString() | ||
| def new_bam = file("${sample}.mapped.bam") | ||
| bam.copyTo(new_bam) | ||
| tuple(meta, new_bam) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems unnecessary, the samplename is a required value in the samplesheet so this will always be set. I'm also not sure what the reason is to copy the BAM file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FGBIO_ZIPPERBAMS made an error:
-[nf-cmgg/preprocessing] Pipeline completed with errors-
ERROR ~ Error executing process > 'NFCMGG_PREPROCESSING:PREPROCESSING:CONSENSUS:FGBIO_ZIPPERBAMS (1)'
Caused by:
Process NFCMGG_PREPROCESSING:PREPROCESSING:CONSENSUS:FGBIO_ZIPPERBAMS input file name collision -- There are multiple input files for each of the following file names: umi_sample2.bam
Container:
community.wave.seqera.io/library/fgbio:2.5.21--368dab1b4f308243
Tip: you can replicate the issue by changing to the process work dir and entering the command bash .command.run
-- Check '.nextflow.log' file for details
and I tried to distinguish file name of mapped bam from uBAM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are better ways to handle this than to copy the whole BAM file. Can you try and find a way to make sure the name of the uBAM file is always different from the start? (so from when the file has been created)
subworkflows/local/consensus/main.nf
Outdated
| ch_fasta_by_meta, | ||
| ch_dict_by_meta |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You will probably have to patch the FGBIO_ZIPPERBAMS module to make sure all files are on the same input tuple. Otherwise you might get sample mixing later on which should be avoided at all cost. You can just change the code of the module and run nf-core modules patch fgbio/zipperbams to patch this specific module
subworkflows/local/consensus/main.nf
Outdated
| // 1.3: Mapped BAM => Grouped BAM | ||
|
|
||
| // 1.3: Mapped BAM => Grouped BAM | ||
| def ch_strategy = Channel.value( (params.umi_group_strategy ?: 'Adjacency') as String ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parameter defaults should be set in the nextflow.config file. This will ensure readability of the pipeline by keeping all defaults in the same place
subworkflows/local/consensus/main.nf
Outdated
| def valid_strategies = ['identity', 'edit', 'adjacency', 'paired'] | ||
| def umi_strategy = (params.umi_group_strategy ?: 'adjacency').toLowerCase() | ||
| if ( !valid_strategies.contains(umi_strategy) ) { | ||
| exit 1, "Invalid value for --umi_group_strategy: '${params.umi_group_strategy}'. Allowed values: ${valid_strategies.join(', ')}" | ||
| } | ||
| def ch_strategy = Channel.value(umi_strategy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| def valid_strategies = ['identity', 'edit', 'adjacency', 'paired'] | |
| def umi_strategy = (params.umi_group_strategy ?: 'adjacency').toLowerCase() | |
| if ( !valid_strategies.contains(umi_strategy) ) { | |
| exit 1, "Invalid value for --umi_group_strategy: '${params.umi_group_strategy}'. Allowed values: ${valid_strategies.join(', ')}" | |
| } | |
| def ch_strategy = Channel.value(umi_strategy) | |
| def ch_strategy = Channel.value(params.umi_strategy) |
You don't need to check this here. You can add allowed options to the parameter in the schema. (Use nf-core schema build, go to that parameter and set the allowed options for that parameter). This way the validation will happens before the pipeline actually starts
| workflow.out.ubam.collect { it[1].name }, | ||
| workflow.out.grouped_bam.collect { it[1].name }, | ||
| workflow.out.filtered_ubam.collect { it[1].name }, | ||
| workflow.out.consensus_bam.collect { it[1].name }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are probably testing the name because of md5sum mismatches? You can use the nft-bam nf-test plugin to improve BAM/CRAM tests. I think you can use the .getReadsMD5() method here to only get the md5sum of the reads, those should be stable
tests/config/nf-test.config
Outdated
| } | ||
|
|
||
| plugins = [ | ||
| "nft-bam" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You will need to specify a version
| import nf.test.NFTest | ||
| import nf.test.bam.BamFile | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| import nf.test.NFTest | |
| import nf.test.bam.BamFile |
This import is done automatically
tests/config/nf-test.config
Outdated
| plugins = [ | ||
| "nft-bam" | ||
| ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| plugins = [ | |
| "nft-bam" | |
| ] |
nvnieuwk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some documentation comments, @matthdsm do you also want to give a review on this PR?
README.md
Outdated
|
|
||
| The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with docker containers making installation trivial and results highly reproducible. | ||
|
|
||
| The pipeline also supports Unique Molecular Identifier (UMI) data. If your samplesheet includes a `umi_type` column (`seq` or `readname`), UMI-aware preprocessing is enabled automatically; rows with `umi_type=none` are processed as usual. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| The pipeline also supports Unique Molecular Identifier (UMI) data. If your samplesheet includes a `umi_type` column (`seq` or `readname`), UMI-aware preprocessing is enabled automatically; rows with `umi_type=none` are processed as usual. | |
| The pipeline also supports Unique Molecular Identifier (UMI) data. If your samplesheet includes a `umi_type` column (`seq` or `readname`), UMI-aware preprocessing is enabled automatically. Rows with no `umi_type` specified will be processed as non-UMI sequencing data. |
I thought this was a bit confusing
| UMI processing (only for rows with `umi_type`): | ||
| - Extract UMI from read sequence (`seq`) or read name (`readname`) | ||
| - Group reads by UMI (fgbio GroupReadsByUmi) | ||
| - Call molecular consensus (fgbio CallMolecularConsensusReads) and filter (fgbio FilterConsensusReads) | ||
| - Re-align filtered consensus reads with BWA-MEM (`-Y`), then sort/index | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you guys also add this to the metro map if you have some time left? You can find tutorials for this here: https://nf-co.re/docs/guidelines/graphic_design/overview
| 6. Alignment QC using [`samtools flagstat`](http://www.htslib.org/doc/samtools-flagstat.html), [`samtools stats`](http://www.htslib.org/doc/samtools-stats.html), [`samtools idxstats`](http://www.htslib.org/doc/samtools-idxstats.html) and [`picard CollecHsMetrics`](https://broadinstitute.github.io/picard/command-line-overview.html#CollectHsMetrics), [`picard CollectWgsMetrics`](https://broadinstitute.github.io/picard/command-line-overview.html#CollectWgsMetrics), [`picard CollectMultipleMetrics`](https://broadinstitute.github.io/picard/command-line-overview.html#CollectMultipleMetrics) | ||
| 7. QC aggregation using [`multiqc`](https://multiqc.info/) | ||
|
|
||
|  |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can just update the old one
Description
This PR integrates a UMI consensus subworkflow into the nf-cmgg/preprocessing pipeline.
The new flow supports multiple UMI vendors, performs UMI extraction, consensus read generation, and UMI-aware deduplication following fgbio best practices.