Adding UMI consensus flow to the pipeline #134

Hwanseok-Jeong · 2025-08-13T14:30:43Z

Description
This PR integrates a UMI consensus subworkflow into the nf-cmgg/preprocessing pipeline.
The new flow supports multiple UMI vendors, performs UMI extraction, consensus read generation, and UMI-aware deduplication following fgbio best practices.

v2.0.6 Release PR

nvnieuwk

Here are some comments. Overall, the implementation is executed pretty well. Good job!

Can you also make sure that the versions for every process are exported and added to ch_versions in the main flow?

nvnieuwk · 2025-08-19T13:04:51Z

subworkflows/local/consensus/main.nf

+            .branch { _meta, _fq1, _fq2, umi_flag ->
+                umi_in_readname: umi_flag == true
+                umi_in_seq: umi_flag == false
+            }


Since a single boolean determines which path the flow takes, you can use an if statement instead of .branch, which is a bit more readable here, e.g.

if (umi_in_readname) {
    // Run the flow for UMIs in the read name
} else {
    // Run the flow for UMIs that are not in the read name
}

I rather like the use of branch here.

As a whole, I'd consider moving all settings (e.g. umi_flag and umi_in_readname) to the metadata map.
In practice, the inputs to the workflow will most likely be a mix of UMI-enabled and non-UMI samples, and we need to be able to handle this use case.

@matthdsm It's feasible to add settings to the metadata map and change the workflow based on them. My question is how to tell UMI samples from non-UMI samples when the input is a mix of the two. Do you have any suggestions?

You could add an extra field to the samplesheet where the pipeline user can specify which files have UMIs and which files don't
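
A minimal sketch of how such a samplesheet field could drive the routing, assuming it ends up in the meta map as `umi_type` with the values `seq`, `readname` or `none`. The helper name `umiRoute` and the route labels are made up for illustration and are not part of the pipeline.

```groovy
// Hypothetical helper: pick a route for one sample from its meta map.
// A missing or unrecognised umi_type falls back to the non-UMI route.
def umiRoute(Map meta) {
    def type = (meta.umi_type ?: 'none').toString().toLowerCase()
    switch (type) {
        case 'seq':      return 'umi_in_seq'
        case 'readname': return 'umi_in_readname'
        default:         return 'no_umi'
    }
}

// Inside a workflow this could feed `.branch`, for example:
// ch_input_fastq.branch { meta, reads ->
//     umi_in_seq:      umiRoute(meta) == 'umi_in_seq'
//     umi_in_readname: umiRoute(meta) == 'umi_in_readname'
//     no_umi:          true
// }
```

Because the decision is made per sample, a samplesheet mixing UMI and non-UMI rows needs no global flag.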

nvnieuwk

Looking good! Feel free to implement the next step 🥳

matthdsm · 2025-08-20T08:39:33Z

Hi @Hwanseok-Jeong,

Would you be able to add a mermaid type diagram with a grand view of how you envision the workflow and integration of the consensus pipeline?

https://mermaid.js.org/intro/

https://github.blog/developer-skills/github/include-diagrams-markdown-files-mermaid/

Thanks!

nvnieuwk · 2025-08-20T09:25:55Z

subworkflows/local/consensus/main.nf

+        def ch_bwa_index = Channel.value(file(params.genomes.GRCh38.bwamem))
+        def ch_fasta     = Channel.value(file(params.genomes.GRCh38.fasta))


GRCh38 is not the only supported organism in the pipeline. You should get the references dynamically, based on the organism in the metadata. Have a look at other parts of the pipeline for inspiration.

Nor is BWA the only supported aligner.
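
One way this lookup could look, as a sketch only. It assumes the genome key travels in the meta map under a hypothetical `genome_build` field and that `params.genomes` keeps the shape seen in the diff (`fasta` plus one index entry per aligner, such as `bwamem`). The helper name `referencesFor` is also made up.

```groovy
// Hypothetical helper: resolve the reference files for one sample.
// `genomes` stands in for params.genomes; `aligner` is the index key, e.g. 'bwamem'.
def referencesFor(Map genomes, Map meta, String aligner) {
    def entry = genomes[meta.genome_build]
    if (entry == null) {
        throw new IllegalArgumentException("No references configured for genome '${meta.genome_build}'")
    }
    def index = entry[aligner]
    if (index == null) {
        throw new IllegalArgumentException("No '${aligner}' index configured for genome '${meta.genome_build}'")
    }
    return [fasta: entry.fasta, index: index]
}

// Possible use inside a channel operator:
// ch_reads.map { meta, reads -> [meta, reads, referencesFor(params.genomes, meta, 'bwamem')] }
```

Failing early on a missing genome or index keeps a misconfigured sample from silently aligning against the wrong reference.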

nvnieuwk · 2025-08-20T09:26:28Z

workflows/preprocessing.nf

                    'LB': meta.library ?: "",
                    'PL': meta.platform ?: rg.PL,
-                    'ID': meta.readgroup ?: rg.ID
+                    'ID': (meta.readgroup ?: rg.ID ?: meta.id ?: samplename)


What's the reason for this change?

Quick context: when I ran the UMI test FASTQs from assets/fastq_umi.yml, bwa mem died with
"[E::bwa_set_rg] no ID within the read group line" — neither meta.readgroup nor rg.ID was set.
I tried adding readgroup in the YAML, but the schema ignores it (Nextflow warns and drops it), so the @RG line still had no ID.

I added a safe fallback for the @RG ID:
meta.readgroup ?: rg.ID ?: meta.id ?: samplename

If an ID already exists we keep it; only when it's missing do we fall back to meta.id, then the sample name. This fixes the UMI test case, and I don't expect it to change behavior for datasets that already provide an ID.

@matthdsm you know more about this, do you like this solution?

rg.ID should always be set. If not, I think there might be something wrong with the headers of the fastq records.

The metadata fields are set here

Also, the entire subworkflow should be moved to after line 227, since it currently only processes the explicit FASTQ inputs and ignores the FASTQs from the demultiplex subworkflow.

Hwanseok-Jeong · 2025-08-20T09:48:31Z

Workflow Grand View with UMI Consensus

flowchart TD
    A[Raw FASTQ] --> B{Input Type}
    B -->|Flowcell| C[BCL Demultiplex]
    B -->|FASTQ| D[FASTQ Input]

    C --> E[ch_input_fastq]
    D --> E[ch_input_fastq]

    E --> F{UMI Consensus?}
    F -->|No| H[Original Reads]
    F -->|Yes| G[Enter UMI Consensus]

    subgraph SWF [UMI Consensus Subworkflow]
        G1[Group Reads by UMI]
        G2[Call Consensus Reads]
        G3[Filter Consensus Reads]
        G1 --> G2 --> G3
    end

    G --> G1
    SWF --> I[Consensus Reads]
    H --> I

    I --> J[QC & Trimming : fastp]
    J --> K[Alignment and CRAM/BAM Conversion]
    K --> L[Coverage Analysis]
    K --> M[BAM QC]
    L --> N[MultiQC]
    M --> N

nvnieuwk

I think it would be good to reuse as much of the existing code as possible. There's already a subworkflow in this pipeline that handles alignment with all supported tools. The pipeline is less of a burden to maintain if we only need to update the alignment in one place.

nvnieuwk · 2025-08-21T13:30:43Z

subworkflows/local/consensus/main.nf

+            def sample = meta.samplename ?: UUID.randomUUID().toString()
+            def new_bam = file("${sample}.mapped.bam")
+            bam.copyTo(new_bam)
+            tuple(meta, new_bam)


This seems unnecessary: samplename is a required value in the samplesheet, so it will always be set. I'm also not sure why the BAM file is copied.

FGBIO_ZIPPERBAMS failed with this error:

-[nf-cmgg/preprocessing] Pipeline completed with errors-
ERROR ~ Error executing process > 'NFCMGG_PREPROCESSING:PREPROCESSING:CONSENSUS:FGBIO_ZIPPERBAMS (1)'

Caused by:
Process NFCMGG_PREPROCESSING:PREPROCESSING:CONSENSUS:FGBIO_ZIPPERBAMS input file name collision -- There are multiple input files for each of the following file names: umi_sample2.bam

Container:
community.wave.seqera.io/library/fgbio:2.5.21--368dab1b4f308243

Tip: you can replicate the issue by changing to the process work dir and entering the command bash .command.run

-- Check '.nextflow.log' file for details

So I tried to give the mapped BAM a different file name from the uBAM.

There are better ways to handle this than to copy the whole BAM file. Can you try and find a way to make sure the name of the uBAM file is always different from the start? (so from when the file has been created)
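
One common way to do this is a prefix override in the module configuration. The fragment below is a sketch that assumes the uBAM comes from an nf-core style module honouring `task.ext.prefix`. The process selector `FGBIO_FASTQTOBAM` and the file location are assumptions, not taken from this PR.

```groovy
// conf/modules.config (fragment; selector and file name are assumptions)
process {
    withName: '.*:CONSENSUS:FGBIO_FASTQTOBAM' {
        // Give the unmapped BAM its own suffix from the moment it is created,
        // so it can never share a file name with the aligned BAM downstream
        ext.prefix = { "${meta.id}.unmapped" }
    }
}
```

With distinct names at creation time, the input file name collision in FGBIO_ZIPPERBAMS should not occur and no copy is needed.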

nvnieuwk · 2025-08-21T13:32:49Z

subworkflows/local/consensus/main.nf

+            ch_fasta_by_meta,
+            ch_dict_by_meta


You will probably have to patch the FGBIO_ZIPPERBAMS module to make sure all files are on the same input tuple. Otherwise you might get sample mixing later on, which should be avoided at all costs. You can just change the code of the module and run nf-core modules patch fgbio/zipperbams to patch this specific module.
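
A sketch of how the separate channels could be combined into one tuple per sample before the patched module is called. The toy channels and file names are placeholders. It assumes every channel emits `[meta, file]` with identical meta maps for the same sample.

```groovy
// Standalone toy workflow; each item is [meta, file name]
workflow {
    ch_mapped_bam    = Channel.of([[id: 's1'], 's1.mapped.bam'],   [[id: 's2'], 's2.mapped.bam'])
    ch_unmapped_bam  = Channel.of([[id: 's2'], 's2.unmapped.bam'], [[id: 's1'], 's1.unmapped.bam'])
    ch_fasta_by_meta = Channel.of([[id: 's1'], 'ref.fa'],   [[id: 's2'], 'ref.fa'])
    ch_dict_by_meta  = Channel.of([[id: 's1'], 'ref.dict'], [[id: 's2'], 'ref.dict'])

    // join matches items on their first element (meta), so emission order does not matter;
    // the fail* options abort the run instead of silently dropping or duplicating samples
    ch_mapped_bam
        .join(ch_unmapped_bam,  failOnMismatch: true, failOnDuplicate: true)
        .join(ch_fasta_by_meta, failOnMismatch: true, failOnDuplicate: true)
        .join(ch_dict_by_meta,  failOnMismatch: true, failOnDuplicate: true)
        .view()   // each item: [meta, mapped_bam, unmapped_bam, fasta, dict]
}
```

The patched module would then take a single input along the lines of `tuple val(meta), path(mapped), path(unmapped), path(fasta), path(dict)`; that exact signature is an assumption.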

nvnieuwk · 2025-08-22T08:06:52Z

subworkflows/local/consensus/main.nf

+    // 1.3: Mapped BAM => Grouped BAM
+
+        // 1.3: Mapped BAM => Grouped BAM
+        def ch_strategy = Channel.value( (params.umi_group_strategy ?: 'Adjacency') as String )


Parameter defaults should be set in the nextflow.config file. This keeps the pipeline readable by holding all defaults in one place.
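
For illustration, the default could move to the config like this. The value is copied from the fallback in the diff above, and the comment wording is mine.

```groovy
// nextflow.config (fragment)
params {
    // UMI grouping strategy; same default the subworkflow used as its inline fallback
    umi_group_strategy = 'Adjacency'
}
```

The subworkflow can then read `params.umi_group_strategy` directly, without the `?:` fallback.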

nvnieuwk · 2025-08-22T08:42:18Z

subworkflows/local/consensus/main.nf

+        def valid_strategies = ['identity', 'edit', 'adjacency', 'paired']
+        def umi_strategy = (params.umi_group_strategy ?: 'adjacency').toLowerCase()
+        if ( !valid_strategies.contains(umi_strategy) ) {
+            exit 1, "Invalid value for --umi_group_strategy: '${params.umi_group_strategy}'. Allowed values: ${valid_strategies.join(', ')}"
+        }
+        def ch_strategy = Channel.value(umi_strategy)


Suggested change

-        def valid_strategies = ['identity', 'edit', 'adjacency', 'paired']
-        def umi_strategy = (params.umi_group_strategy ?: 'adjacency').toLowerCase()
-        if ( !valid_strategies.contains(umi_strategy) ) {
-            exit 1, "Invalid value for --umi_group_strategy: '${params.umi_group_strategy}'. Allowed values: ${valid_strategies.join(', ')}"
-        }
-        def ch_strategy = Channel.value(umi_strategy)
+        def ch_strategy = Channel.value(params.umi_strategy)

You don't need to check this here. You can add the allowed options to the parameter in the schema: run nf-core schema build, go to that parameter and set its allowed options. This way the validation happens before the pipeline actually starts.

nvnieuwk · 2025-08-25T07:17:38Z

tests/subworkflows/local/consensus/main.nf.test

+                        workflow.out.ubam.collect { it[1].name },
+                        workflow.out.grouped_bam.collect { it[1].name },
+                        workflow.out.filtered_ubam.collect { it[1].name },
+                        workflow.out.consensus_bam.collect { it[1].name },


You are probably testing the name because of md5sum mismatches? You can use the nft-bam nf-test plugin to improve BAM/CRAM tests. I think you can use the .getReadsMD5() method here to get the md5sum of the reads only, which should be stable.
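
A sketch of what that assertion might look like, assuming the nft-bam plugin is loaded and the `consensus_bam` output emits `[meta, bam]` tuples as in the test above. It is a fragment meant to sit inside an existing test case, not a complete test file.

```groovy
// Inside the `then { ... }` block of the nf-test case
assert snapshot(
    // Checksum of the reads only, so header differences (paths, versions, command lines)
    // do not invalidate the snapshot
    workflow.out.consensus_bam.collect { bam(it[1]).getReadsMD5() }
).match()
```

The same pattern could be repeated for the other BAM outputs instead of comparing file names.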

nvnieuwk · 2025-08-25T08:03:34Z

tests/config/nf-test.config

 }
+
+plugins = [
+  "nft-bam"


You will need to specify a version
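
For reference, nf-test loads plugins with a `load "name@version"` line inside a `plugins` block. The version below is only an example; check the nft-bam releases for the one you want to pin.

```groovy
// tests/config/nf-test.config (fragment)
config {
    plugins {
        // Example version only; pin whichever nft-bam release the tests were written against
        load "nft-bam@0.3.0"
    }
}
```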

nvnieuwk · 2025-08-25T08:03:54Z

tests/subworkflows/local/consensus/main.nf.test

+import nf.test.NFTest
+import nf.test.bam.BamFile
+


Suggested change

-import nf.test.NFTest
-import nf.test.bam.BamFile

This import is done automatically

nvnieuwk · 2025-08-25T13:48:08Z

tests/config/nf-test.config

+plugins = [
+  "nft-bam"
+]


Suggested change

plugins = [
  "nft-bam"
]

nvnieuwk

Some documentation comments, @matthdsm do you also want to give a review on this PR?

nvnieuwk · 2025-08-27T07:24:01Z

README.md


 The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with docker containers making installation trivial and results highly reproducible.

+The pipeline also supports Unique Molecular Identifier (UMI) data. If your samplesheet includes a `umi_type` column (`seq` or `readname`), UMI-aware preprocessing is enabled automatically; rows with `umi_type=none` are processed as usual.


Suggested change

-The pipeline also supports Unique Molecular Identifier (UMI) data. If your samplesheet includes a `umi_type` column (`seq` or `readname`), UMI-aware preprocessing is enabled automatically; rows with `umi_type=none` are processed as usual.
+The pipeline also supports Unique Molecular Identifier (UMI) data. If your samplesheet includes a `umi_type` column (`seq` or `readname`), UMI-aware preprocessing is enabled automatically. Rows with no `umi_type` specified will be processed as non-UMI sequencing data.

I thought this was a bit confusing

nvnieuwk · 2025-08-27T07:25:59Z

README.md

+UMI processing (only for rows with `umi_type`):
+- Extract UMI from read sequence (`seq`) or read name (`readname`)
+- Group reads by UMI (fgbio GroupReadsByUmi)
+- Call molecular consensus (fgbio CallMolecularConsensusReads) and filter (fgbio FilterConsensusReads)
+- Re-align filtered consensus reads with BWA-MEM (`-Y`), then sort/index
+


Can you guys also add this to the metro map if you have some time left? You can find tutorials for this here: https://nf-co.re/docs/guidelines/graphic_design/overview

nvnieuwk · 2025-08-27T14:24:20Z

README.md

 6. Alignment QC using [`samtools flagstat`](http://www.htslib.org/doc/samtools-flagstat.html), [`samtools stats`](http://www.htslib.org/doc/samtools-stats.html), [`samtools idxstats`](http://www.htslib.org/doc/samtools-idxstats.html) and [`picard CollectHsMetrics`](https://broadinstitute.github.io/picard/command-line-overview.html#CollectHsMetrics), [`picard CollectWgsMetrics`](https://broadinstitute.github.io/picard/command-line-overview.html#CollectWgsMetrics), [`picard CollectMultipleMetrics`](https://broadinstitute.github.io/picard/command-line-overview.html#CollectMultipleMetrics)
 7. QC aggregation using [`multiqc`](https://multiqc.info/)

-![metro map](docs/images/metro_map.png)


You can just update the old one

matthdsm and others added 2 commits June 23, 2025 12:48

Merge pull request nf-cmgg#132 from nf-cmgg/dev

e3021ca

v2.0.6 Release PR

fgbio module added

2cfcc0e

nvnieuwk changed the base branch from main to dev August 13, 2025 14:34

nvnieuwk self-requested a review August 13, 2025 14:40

Hwanseok-Jeong and others added 4 commits August 14, 2025 09:57

Update consensus workflow

51ca1e7

main.nf, trying to run tests but still working on it

eff5743

Update on the current situation (schema, config, main, preprocessing)

eeb1d31

module update

a958e10

nvnieuwk reviewed Aug 19, 2025

correction based on feedback

6cb4bb1

nvnieuwk reviewed Aug 19, 2025

nvnieuwk reviewed Aug 19, 2025

nvnieuwk reviewed Aug 19, 2025

2nd correction

34809ea

nvnieuwk reviewed Aug 19, 2025

Step 1.2 ubam => Mapped BAM

8176911

nvnieuwk reviewed Aug 20, 2025

Change readgroup ID handling, and alignment reference and method

2a815de

nvnieuwk reviewed Aug 20, 2025

Hwanseok-Jeong and others added 7 commits August 20, 2025 16:46

samplesheet extra field: umi_type (readname/seq/none)

55b35d6

Correction: alignment with subworkflow

2b4c13f

Delete assets/fastq.yml

a3b4475

Mistakenly committed file

Merge branch 'InternUZGent/shared' of https://github.com/Hwanseok-Jeo…

be8cd6e

…ng/preprocessing into InternUZGent/shared

umi_type into meta

8815f0f

deleted umi params

405653d

successful run until step 1.2

b150d31

nvnieuwk reviewed Aug 21, 2025

Hwanseok-Jeong and others added 4 commits August 21, 2025 16:39

zipperbam input fixed

4f2c54b

changing ubam output file name for successful run of zipperbam

1c8ff25

Step 1.3 Mapped BAM => Grouped BAM

ea903eb

modules.json update (manual)

7b79dc6

nvnieuwk reviewed Aug 22, 2025

params.umi_group_strategy added

53e778e

nvnieuwk reviewed Aug 22, 2025

Hwanseok-Jeong and others added 4 commits August 22, 2025 10:50

param.umi_group_strategy update

2fe0d46

2(b).1 finished

0266496

Step 2(b).2: Consensus Filtered uBam -> Consensus Mapped & Filtered BAM

be65aa7

nf-test script - 1st Attempt

8246e39

nvnieuwk reviewed Aug 25, 2025

nf-test script 2nd attempt

2f4ad6f

nvnieuwk reviewed Aug 25, 2025

nvnieuwk reviewed Aug 25, 2025

nf-test with snapshot

06f2da8

nvnieuwk reviewed Aug 25, 2025

Hwanseok-Jeong and others added 5 commits August 25, 2025 20:35

nf-test with correct snapshot

3e34992

nf-test.config: added params

b5f32ad

consensus_bam -> cram (for further integration into pipeline)

ba11f49

UMI samples integrated with non-UMI samples

e4394e0

Update on documentation

b8bdafa

nvnieuwk reviewed Aug 27, 2025

Hwanseok-Jeong and others added 2 commits August 27, 2025 10:27

Potential parameters added as comments (feedback by Toon)

d78154f

Documentation corection and metro map update

12d2e23

nvnieuwk reviewed Aug 27, 2025
