Skip to content

Conversation

@Hwanseok-Jeong
Copy link

Description
This PR integrates a UMI consensus subworkflow into the nf-cmgg/preprocessing pipeline.
The new flow supports multiple UMI vendors, performs UMI extraction, consensus read generation, and UMI-aware deduplication following fgbio best practices.

@nvnieuwk nvnieuwk changed the base branch from main to dev August 13, 2025 14:34
@nvnieuwk nvnieuwk self-requested a review August 13, 2025 14:40
Copy link
Member

@nvnieuwk nvnieuwk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are some comments, the overall implementation is executed pretty well. Good job!

Can you also make sure that the versions for every process are exported and added to ch_versions in the main flow?

Comment on lines 31 to 34
.branch { _meta, _fq1, _fq2, umi_flag ->
umi_in_readname: umi_flag == true
umi_in_seq: umi_flag == false
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you use a single boolean to determine how the flow should be, you can use if statements instead of .branch which is a bit more readable here. e.g.

if(umi_in_readname) {
  // Do the flow if the UMI is in readname
} else {
  // Do the flow if the UMI is not in the readname
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rather like the use of branch here.

As a whole, I'd consider moving all settings (e.g. umi_flag and umi_in_readname) to the metadata map.
In practice, the inputs to the workflow will most likely be a mix of UMI enabled and non UMI samples, and we need to be able to handle this usecase

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@matthdsm It's feasible to add settings to the metadata map and change the workflow based on it. But my question is how to distinguish samples into UMI samples and non UMI samples, if samples are the mix of those two. Do you have any suggestion?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could add an extra field to the samplesheet where the pipeline user can specify which files have UMIs and which files don't

Copy link
Member

@nvnieuwk nvnieuwk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! Feel free to implement the next step 🥳

@matthdsm
Copy link
Member

matthdsm commented Aug 20, 2025

Hi @Hwanseok-Jeong,

Would you be able to add a mermaid type diagram with a grand view of how you envision the workflow and integration of the consensus pipeline?

https://mermaid.js.org/intro/

https://github.blog/developer-skills/github/include-diagrams-markdown-files-mermaid/

Thanks!

Comment on lines 70 to 71
def ch_bwa_index = Channel.value(file(params.genomes.GRCh38.bwamem))
def ch_fasta = Channel.value(file(params.genomes.GRCh38.fasta))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GRCh38 is not the only supported organism in the pipeline. You should try and get the references dynamically based on the organism in the meta data. You can have a look at other parts of the pipeline to get some inspiration for this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nor is BWA the only supported aligner

'LB': meta.library ?: "",
'PL': meta.platform ?: rg.PL,
'ID': meta.readgroup ?: rg.ID
'ID': (meta.readgroup ?: rg.ID ?: meta.id ?: samplename)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason for this change?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick context: when I ran the UMI test FASTQs from assets/fastq_umi.yml, bwa mem died with
"[E::bwa_set_rg] no ID within the read group line" — neither meta.readgroup nor rg.ID was set.
I tried adding readgroup in the YAML, but the schema ignores it (Nextflow warns and drops it), so the @rg still had no ID.

I added a safe fallback for @rg:ID:
meta.readgroup ?: rg.ID ?: meta.id ?: samplename

If an ID already exists we keep it; only when it’s missing do we fall back to meta.id, then the sample name. This fixes the UMI test case and I think it doesn’t change behavior for datasets that already provide an ID.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@matthdsm you know more about this, do you like this solution?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rg.ID should always be set. If not, I think there might be something wrong with the headers of the fastq records.

The metadata fields are set here

Also, the entire subworkflow should be moved to after line 227, since now you're only working with the explicit fastq inputs and ignoring the fastq's from the demultiplex subworkflow.

@Hwanseok-Jeong
Copy link
Author

Hwanseok-Jeong commented Aug 20, 2025

Workflow Grand View with UMI Consensus

flowchart TD
    A[Raw FASTQ] --> B{Input Type}
    B -->|Flowcell| C[BCL Demultiplex]
    B -->|FASTQ| D[FASTQ Input]

    C --> E[ch_input_fastq]
    D --> E[ch_input_fastq]

    E --> F{UMI Consensus?}
    F -->|No| H[Original Reads]
    F -->|Yes| G[Enter UMI Consensus]

    subgraph SWF [UMI Consensus Subworkflow]
        G1[Group Reads by UMI]
        G2[Call Consensus Reads]
        G3[Filter Consensus Reads]
        G1 --> G2 --> G3
    end

    G --> G1
    SWF --> I[Consensus Reads]
    H --> I

    I --> J[QC & Trimming : fastp]
    J --> K[Alignment and CRAM/BAM Conversion]
    K --> L[Coverage Analysis]
    K --> M[BAM QC]
    L --> N[MultiQC]
    M --> N
Loading

Copy link
Member

@nvnieuwk nvnieuwk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be good here to try and reuse as much of the code as possible. There's already a subworkflow in this pipeline that handles the alignment of with all supported tools. It's less of a burden to maintain the pipeline if we only need to update the alignment in one place

Comment on lines 80 to 83
def sample = meta.samplename ?: UUID.randomUUID().toString()
def new_bam = file("${sample}.mapped.bam")
bam.copyTo(new_bam)
tuple(meta, new_bam)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems unnecessary, the samplename is a required value in the samplesheet so this will always be set. I'm also not sure what the reason is to copy the BAM file?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FGBIO_ZIPPERBAMS made an error:


-[nf-cmgg/preprocessing] Pipeline completed with errors-
ERROR ~ Error executing process > 'NFCMGG_PREPROCESSING:PREPROCESSING:CONSENSUS:FGBIO_ZIPPERBAMS (1)'

Caused by:
Process NFCMGG_PREPROCESSING:PREPROCESSING:CONSENSUS:FGBIO_ZIPPERBAMS input file name collision -- There are multiple input files for each of the following file names: umi_sample2.bam

Container:
community.wave.seqera.io/library/fgbio:2.5.21--368dab1b4f308243

Tip: you can replicate the issue by changing to the process work dir and entering the command bash .command.run

-- Check '.nextflow.log' file for details


and I tried to distinguish file name of mapped bam from uBAM.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are better ways to handle this than to copy the whole BAM file. Can you try and find a way to make sure the name of the uBAM file is always different from the start? (so from when the file has been created)

Comment on lines 96 to 97
ch_fasta_by_meta,
ch_dict_by_meta
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You will probably have to patch the FGBIO_ZIPPERBAMS module to make sure all files are on the same input tuple. Otherwise you might get sample mixing later on which should be avoided at all cost. You can just change the code of the module and run nf-core modules patch fgbio/zipperbams to patch this specific module

// 1.3: Mapped BAM => Grouped BAM

// 1.3: Mapped BAM => Grouped BAM
def ch_strategy = Channel.value( (params.umi_group_strategy ?: 'Adjacency') as String )
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parameter defaults should be set in the nextflow.config file. This will ensure readability of the pipeline by keeping all defaults in the same place

Comment on lines 99 to 104
def valid_strategies = ['identity', 'edit', 'adjacency', 'paired']
def umi_strategy = (params.umi_group_strategy ?: 'adjacency').toLowerCase()
if ( !valid_strategies.contains(umi_strategy) ) {
exit 1, "Invalid value for --umi_group_strategy: '${params.umi_group_strategy}'. Allowed values: ${valid_strategies.join(', ')}"
}
def ch_strategy = Channel.value(umi_strategy)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def valid_strategies = ['identity', 'edit', 'adjacency', 'paired']
def umi_strategy = (params.umi_group_strategy ?: 'adjacency').toLowerCase()
if ( !valid_strategies.contains(umi_strategy) ) {
exit 1, "Invalid value for --umi_group_strategy: '${params.umi_group_strategy}'. Allowed values: ${valid_strategies.join(', ')}"
}
def ch_strategy = Channel.value(umi_strategy)
def ch_strategy = Channel.value(params.umi_strategy)

You don't need to check this here. You can add allowed options to the parameter in the schema. (Use nf-core schema build, go to that parameter and set the allowed options for that parameter). This way the validation will happens before the pipeline actually starts

Comment on lines 65 to 68
workflow.out.ubam.collect { it[1].name },
workflow.out.grouped_bam.collect { it[1].name },
workflow.out.filtered_ubam.collect { it[1].name },
workflow.out.consensus_bam.collect { it[1].name },
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are probably testing the name because of md5sum mismatches? You can use the nft-bam nf-test plugin to improve BAM/CRAM tests. I think you can use the .getReadsMD5() method here to only get the md5sum of the reads, those should be stable

}

plugins = [
"nft-bam"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You will need to specify a version

Comment on lines 1 to 3
import nf.test.NFTest
import nf.test.bam.BamFile

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
import nf.test.NFTest
import nf.test.bam.BamFile

This import is done automatically

Comment on lines 18 to 20
plugins = [
"nft-bam"
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
plugins = [
"nft-bam"
]

Copy link
Member

@nvnieuwk nvnieuwk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some documentation comments, @matthdsm do you also want to give a review on this PR?

README.md Outdated

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with docker containers making installation trivial and results highly reproducible.

The pipeline also supports Unique Molecular Identifier (UMI) data. If your samplesheet includes a `umi_type` column (`seq` or `readname`), UMI-aware preprocessing is enabled automatically; rows with `umi_type=none` are processed as usual.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The pipeline also supports Unique Molecular Identifier (UMI) data. If your samplesheet includes a `umi_type` column (`seq` or `readname`), UMI-aware preprocessing is enabled automatically; rows with `umi_type=none` are processed as usual.
The pipeline also supports Unique Molecular Identifier (UMI) data. If your samplesheet includes a `umi_type` column (`seq` or `readname`), UMI-aware preprocessing is enabled automatically. Rows with no `umi_type` specified will be processed as non-UMI sequencing data.

I thought this was a bit confusing

Comment on lines 31 to 36
UMI processing (only for rows with `umi_type`):
- Extract UMI from read sequence (`seq`) or read name (`readname`)
- Group reads by UMI (fgbio GroupReadsByUmi)
- Call molecular consensus (fgbio CallMolecularConsensusReads) and filter (fgbio FilterConsensusReads)
- Re-align filtered consensus reads with BWA-MEM (`-Y`), then sort/index

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you guys also add this to the metro map if you have some time left? You can find tutorials for this here: https://nf-co.re/docs/guidelines/graphic_design/overview

6. Alignment QC using [`samtools flagstat`](http://www.htslib.org/doc/samtools-flagstat.html), [`samtools stats`](http://www.htslib.org/doc/samtools-stats.html), [`samtools idxstats`](http://www.htslib.org/doc/samtools-idxstats.html) and [`picard CollecHsMetrics`](https://broadinstitute.github.io/picard/command-line-overview.html#CollectHsMetrics), [`picard CollectWgsMetrics`](https://broadinstitute.github.io/picard/command-line-overview.html#CollectWgsMetrics), [`picard CollectMultipleMetrics`](https://broadinstitute.github.io/picard/command-line-overview.html#CollectMultipleMetrics)
7. QC aggregation using [`multiqc`](https://multiqc.info/)

![metro map](docs/images/metro_map.png)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just update the old one

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants