Add run_dbcan screening for the CAZyme (carbohydrate active enzyme) and CGC (CAZyme Gene Cluster) annotation #483

HaidYi · 2025-07-02T00:16:24Z

PR checklist

Close #481.

The main changes include:

Like the other screening tools, run_dbcan screening gets a dedicated subworkflow (subworkflows/dbcan.nf).
Added an annotation step that generates the .gff files, and added aliases for the current modules (e.g., PYRODIGAL_GFF). As a result, the input gbk column may also take a gff file. Feel free to change this part, as it may need some tweaks to fit both the pipeline and the documentation.
Other utilities:
- ci/cd, testing profiles for dbcan, module.config, etc.
- documents: readme and output

Changes needed from the maintainers:

Add the changelog for this change in the next release version.
Add the dbcan screening step in the schematic workflow.

nf-core-bot · 2025-07-02T00:17:00Z

Warning

Newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 3.3.1.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.

jasmezz

What a great addition! @HaidYi I really appreciate your effort, your PR is really clear and on point. Thank you very much for this contribution. During review I directly pushed some minor changes to your fork.

Some other comments we could consider:

Thinking about renaming the new dbcan subworkflow to cazyme. This would be more in line with previous naming, i.e. subworkflow names tell the purpose, not the tool.
- This would include changing the output dir in modules.config to ${params.outdir}/cazyme/cazyme_annotation, ${params.outdir}/cazyme/cgc, ${params.outdir}/cazyme/substrate
- file tree in output docs
- test names
- nextflow_schema.json ...
The database download takes very long because of the low download rate (>2 GB at a rate of ~1 MB/s). That is too long for the test profiles; we need to create a smaller database somehow...
Adding manual dbCAN database download (via bioconda) to the respective section in usage docs.
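
For illustration, the output directory rename proposed above could look roughly like the following modules.config fragment. This is a sketch, not the PR's actual code: RUNDBCAN_EASYCGC is assumed to be the process name of the nf-core rundbcan_easycgc module, and params.publish_dir_mode and the saveAs rule are assumed from the usual nf-core template conventions.

```groovy
// conf/modules.config (fragment, sketch only)
// Assumptions: the process name RUNDBCAN_EASYCGC, params.publish_dir_mode,
// and the versions.yml filter follow common nf-core conventions.
process {
    withName: RUNDBCAN_EASYCGC {
        publishDir = [
            // Directory named after the purpose (cazyme), not the tool (dbcan)
            path: { "${params.outdir}/cazyme/cgc" },
            mode: params.publish_dir_mode,
            // Do not publish the per-process versions file
            saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
        ]
    }
}
```

The same pattern would apply to the cazyme_annotation and substrate output directories.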

jasmezz · 2025-07-10T12:37:06Z

conf/test_preannotated_dbcan.config

+    dbcan_skip_cgc             = true   // skip cgc as .gbk is used
+    dbcan_skip_substrate       = true   // skip substrate as .gbk is used


If we want to be able to run the complete CAZyme subworkflow with pre-annotated .gff files while also providing pre-annotated .gbk files for other subworkflows, we need an additional (optional) column in the samplesheet.

jasmezz · 2025-07-10T13:22:09Z

docs/output.md

+    - `*_dbCAN_hmm_results.tsv`: TSV file containing the detailed dbCAN HMM results for CAZyme annotation.
+    - `*_dbCANsub_hmm_results.tsv`: TSV file containing the detailed dbCAN subfamily results for CAZyme annotation.
+    - `*_diamond.out`: TSV file containing the detailed dbCAN diamond results for CAZyme annotation.
+  - `cgc`


Many of the files of the cgc and substrate section seem duplicated. Maybe we don't need to store those which are created in the cazyme step already? Can control this in modules.config (e.g. see RGI_MAIN entry).

@jasmezz Thank you for reviewing the code. I will revise it based on your comments.

jfy133

Really good first PR @HaidYi ! Clean and pretty much all of my comments are sort of minor/just polishing

Some additional things to my direct comments:

Missing citations.md update
Missing the how to cite/methods text in this file: https://github.com/HaidYi/funcscan/blob/0cad8f95c553b3cdd3a59c34a0db107bd6df14f4/subworkflows/local/utils_nfcore_funcscan_pipeline/main.nf#L174
Missing metromap update (but we can probably do this before release)
Missing nf-test test and snapshots for the new tests

jfy133 · 2025-07-15T06:43:40Z

conf/test_cazyme_pyrodigal.config

@@ -0,0 +1,34 @@
+/*


The test should be for all cazyme screening tools, so I would rename accordingly for 'future proofing'

This is a good idea for leaving a placeholder for other cazyme screening developers.

I'm not sure I follow...

Basically, what I mean is that this should be test_cazyme_pyrodigal, not test_dbcan_pyrodigal!

jfy133 · 2025-07-15T06:45:11Z

conf/test_preannotated_dbcan.config

+    run_bgc_screening          = false
+    run_cazyme_screening       = true
+
+    dbcan_skip_cgc             = true   // Skip cgc annotation as .gbk (not .gff) is provided in samplesheet


We should probably add gff files!

You can generate them from a normal funcscan run, and make a PR against the funcscan branch of nf-core/test-datasets that contains the files and an updated samplesheet for the next funcscan version.

Yes, currently the cazyme screening can only use the .gff files in the pipeline. To use the pre-annotated one, I generated the .gff files from pyrodigal. The PR can be found at nf-core/test-datasets#1683.

Can this be updated now you have the file?

jfy133 · 2025-07-15T06:46:43Z

docs/output.md

 |   ├── deepbgc/
 |   ├── gecco/
 |   └── hmmsearch/
+├── dbcan/


The top level should be the molecule/gene type (i.e., cazyme), then a subdirectory with each tool (in this case dbcan), and within that each of the different output directories

jfy133 · 2025-07-15T06:48:37Z

docs/output.md

+
+- `dbcan/`
+  - `cazyme`
+    - `*_overview.tsv`: TSV file containing the results of dbCAN CAZyme annotation


You're missing the <sample.id> sample subdirectory underneath the tool name (according to your modules.config)

jfy133 · 2025-07-15T06:55:00Z

subworkflows/local/cazyme.nf

+        .join(ch_gffs_for_rundbcan)
+        .multiMap { meta, faa, gff ->
+            faa: [meta, faa]
+            gff: [meta, gff, 'prodigal']


Is the gff always from prodigal? Or is this a dummy value?

Refer to the module description: https://nf-co.re/modules/rundbcan_easycgc/. If the GFF is generated in the pipeline, the type is always prodigal. But if a pre-annotated GFF is provided, it could be NCBI_prok, JGI, NCBI_euk, or prodigal, which complicates things. An easier way is to define a CLI parameter for this option, but that makes it hard to handle mixed types in one batch without modifying the samplesheet.
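
One way to handle the mixed case is to carry the GFF type per sample in the meta map rather than hard-coding it. The following is a minimal, standalone sketch under that assumption: meta.gff_type is a hypothetical key (for example, filled from a samplesheet column), the channel name ch_faa_gff is made up, and the file names are placeholder strings.

```groovy
// Sketch only: meta.gff_type is a hypothetical per-sample key and
// the file names are placeholder strings, not real inputs.
workflow {
    ch_faa_gff = Channel.of(
        [[id: 'sample1', gff_type: 'prodigal'],  'sample1.faa', 'sample1.gff'],
        [[id: 'sample2', gff_type: 'NCBI_prok'], 'sample2.faa', 'sample2.gff']
    )

    // Split each [meta, faa, gff] item into two named outputs
    ch_for_cgc = ch_faa_gff.multiMap { meta, faa, gff ->
        faa: [meta, faa]
        gff: [meta, gff, meta.gff_type]   // per-sample type instead of a fixed 'prodigal'
    }

    ch_for_cgc.gff.view()
}
```

With this layout, a batch can mix pipeline-generated and pre-annotated GFF files without a global parameter.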

HaidYi · 2025-07-17T03:58:29Z

@jfy133 Thank you for the comments and suggestions. I will fix all the problems one by one. As I don't want this PR to break other screening steps, I will do more comprehensive testing, which may take more time. I will let you know when I have fixed all the issues.

jfy133 · 2025-08-29T06:50:48Z

@jasmezz and I would like to go through once more to give you the ✔️ given this is a big (and exciting) extension to the pipeline!

jfy133 · 2025-08-29T06:51:02Z

(I can't today, but will try to find time next week)

jfy133

Getting close @HaidYi ! The main thing is to already move the GFF type into the samplesheet, and add the actual test snapshots of your new configs (and add the cazyme subworkflow to the 'default' test.config)

Feel free to ping me on slack if you need anything :)


HaidYi · 2025-09-28T18:12:55Z

@jfy133 Please review it again. All the issues you raised have been resolved in the current code.

jfy133 · 2025-10-04T19:21:10Z

> @jfy133 Please review it again. All the issues you raised have already been resolved in the current codes.

Thanks @HaidYi! I just made a release so you have a clean slate and thus a 'clean release' once this PR is in.

I will resolve the conflicts next week and do the review :)

jfy133 · 2025-10-08T09:13:41Z

@nf-core-bot fix linting

jfy133

We are almost there I think @HaidYi !

A few minor code things, and I noticed a few comments that are still outstanding from previous reviews.

A major thing though: you need to add yourself as a contributor! Please add a changelog entry, and add yourself to the main README list of people and the manifest section of the nextflow.config!

I think the PR is mature enough that, on my next round of review, it may just make sense for me to make changes directly rather than making you do it yourself, as I think they will all be very minor (such as fixing the annoying RO crate linting error). If you're OK with it, could you add me as a collaborator to your fork, so that I have push rights for my next review? How does that sound?

Also note I've finished travelling now for a while so should be much faster to respond and review - I will check each Wednesday morning :)

jfy133 · 2025-10-08T09:09:11Z

subworkflows/local/cazyme.nf

+    gffs    // tuple val(meta), path(ANNOTATION_ANNOTATION_TOOL.out.gff)
+
+    main:
+


Suggested change

jfy133 · 2025-10-08T09:16:54Z

workflows/funcscan.nf

+    // Add gff_type to meta for cazyme screening
+    if ((params.run_cazyme_screening && !params.cazyme_skip_dbcan && (!params.dbcan_skip_cgc || !params.dbcan_skip_substrate)) && params.annotation_tool in ['pyrodigal', 'prodigal', 'prokka', 'bakta']) {
+      ch_new_annotation_short.map { meta, fasta, faa, gff, gbk ->
+          def new_meta = meta + [gff_type: 'prodigal']  // Only Use 'prodigal' as dbcan does not distinguish 'pyrodigal' and 'prodigal' 


But what if the annotation is from prokka or bakta 🤔 , this would be the wrong type of GFF, right?

jfy133 · 2025-10-08T09:24:37Z

workflows/funcscan.nf

+      ch_new_annotation_short.map { meta, fasta, faa, gff, gbk ->
+          def new_meta = meta + [gff_type: 'prodigal']  // Only Use 'prodigal' as dbcan does not distinguish 'pyrodigal' and 'prodigal' 
+          [new_meta, fasta, faa, gff, gbk]
+      }.set { ch_new_annotation_short }


We do not use set; please switch to = channel assignment.

However, what is the purpose of this? It looks like you're 'overwriting' ch_new_annotation_short with itself. I would rather use a separate channel for clarity, as in

    if (xyz) { ch_new_annotation_for_mixing = ch_new_annotation_short.map{} ... } else { ch_new_annotation_for_mixing = ch_new_annotation_short } ch_prepped_input -= ch_new_annotation_for_mixing

Or something like this
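
Spelled out as a standalone script, the separate-channel pattern suggested here could look like the sketch below. The condition is simplified to a single parameter, the input tuple uses placeholder strings, and ch_new_annotation_for_mixing is the name proposed in the comment above; none of this is the PR's final code.

```groovy
// Sketch only: simplified condition and placeholder inputs.
params.run_cazyme_screening = true

workflow {
    ch_new_annotation_short = Channel.of(
        [[id: 'sample1'], 'sample1.fasta', 'sample1.faa', 'sample1.gff', 'sample1.gbk']
    )

    if (params.run_cazyme_screening) {
        // Derive a new channel with = assignment instead of .set {} on the same name
        ch_new_annotation_for_mixing = ch_new_annotation_short.map { meta, fasta, faa, gff, gbk ->
            [meta + [gff_type: 'prodigal'], fasta, faa, gff, gbk]
        }
    } else {
        // No extra meta needed: pass the channel through unchanged
        ch_new_annotation_for_mixing = ch_new_annotation_short
    }

    ch_new_annotation_for_mixing.view()
}
```

Keeping ch_new_annotation_short untouched means any other consumer of that channel still sees the original meta map.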

jfy133 · 2025-10-08T09:28:20Z

conf/test_preannotated_dbcan.config

This file should be test_preannotated_cazyme

jfy133 · 2025-10-08T09:29:41Z

workflows/funcscan.nf

+                !file.isEmpty()
+            },
+            ch_prepped_input.gffs
+        )


Suggested change

-        )
+        )
+        ch_versions = ch_versions.mix(CAZYME.out.versions)

jfy133 · 2025-10-08T09:30:05Z

nextflow.config

+    test_cazyme_pyrodigal {
+        includeConfig 'conf/test_cazyme_pyrodigal.config'
+    }
+    test_preannotated_dbcan {


Suggested change

-    test_preannotated_dbcan {
+    test_preannotated_cazyme {

jfy133 · 2025-10-08T09:30:14Z

nextflow.config

+        includeConfig 'conf/test_cazyme_pyrodigal.config'
+    }
+    test_preannotated_dbcan {
+        includeConfig 'conf/test_preannotated_dbcan.config'


Suggested change

-        includeConfig 'conf/test_preannotated_dbcan.config'
+        includeConfig 'conf/test_preannotated_cazyme.config'

jfy133 · 2025-10-08T09:30:26Z

nextflow_schema.json

+                },
+                "dbcan_skip_cgc": {
+                    "type": "boolean",
+                    "description": "Skip CGC during the dbCAN screening.",


Still missing

HaidYi and others added 7 commits June 30, 2025 19:22

Add run_dbcan screening

6353679

fix missing gffs

15f2ef5

split dbcan results by meta.id

d5df4a1

rm constraints of annotation tool

f049e2f

add test config for rundbcan

8289bdb

add test profile for rundbcan in ci

d8af5e9

add dbcan in the refs

0a5e505

HaidYi self-assigned this Jul 2, 2025

HaidYi requested review from Darcy220606, jasmezz and jfy133 as code owners July 2, 2025 00:16

HaidYi added the enhancement Improvement for existing functionality label Jul 2, 2025

HaidYi mentioned this pull request Jul 2, 2025

Add rundbcan for the CAZyme (carbohydrate active enzyme) and CGC (CAZyme Gene Cluster) annotation #481

Open

Suggestions from code review

01a573a

jasmezz reviewed Jul 10, 2025

HaidYi added 5 commits July 14, 2025 23:18

rm duplicate outputs

5c5ec66

add manual dbCAN database download

9fd005c

rename DBCAN to CAZYME

ea4b852

add gff column in samplesheet

62623a5

change run_dbcan_screening to run_cazyme_screening

0cad8f9

jfy133 reviewed Jul 15, 2025

HaidYi and others added 4 commits July 16, 2025 19:24

add missing identifier

b76e3a2

Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>

add missing identifier

0f5863a

Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>

add missing conda

f2d79d5

Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>

fix typo

625ced4

Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>

HaidYi added 3 commits July 16, 2025 23:01

re-organize the outdir structure of cazyme screening

58273f1

add citation

a638f32

add cazyme_skip_dbcan param

a5d692b

jfy133 reviewed Sep 3, 2025

HaidYi and others added 19 commits September 18, 2025 09:57

only list top view

cce04b2

Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>

Update docs/output.md

ddd51c1

Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>

Update docs/output.md

b31feb6

Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>

Update docs/output.md

36c22d3

Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>

Update docs/output.md

59385f9

Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>

Update docs/output.md

2a6544e

Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>

Update docs/output.md

2dbe952

Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>

add a column: gff_type in samplesheet

01fb374

rm dbcan_gff_type parameter

13b82ab

add option for using local dbcan db

3af937f

filter samples for dbcan cgc/substrate if no gff_type provided in samplesheet

c28f049

add cazyme to toolCitationText

796b96d

Update docs/output.md

6de2005

Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>

update the profile name

5f9b432

add cazyme_screening to default test

e114734

add test_cazyme_pyrodigal test

6ee4dd7

add cazyme_dbcan_db to params

18ba885

fix bug

161d37d

add gff_type in meta for cazyme screening

d505ea6

HaidYi requested a review from jfy133 September 28, 2025 21:46

Merge branch 'dev' into rundbcan

69d5133

[automated] Fix code linting

f5ed73e

jfy133 reviewed Oct 8, 2025

HaidYi removed the enhancement Improvement for existing functionality label Oct 10, 2025
