+
Skip to content

Conversation

HaidYi
Copy link

@HaidYi HaidYi commented Jul 2, 2025

PR checklist

Close #481.

The main changes include:

  • Like other screening tools, added a dedicated subworkflow (subworkflows/dbcan.nf) for the support of run_dbcan screening.
  • Added the annotation step for generating the .gff files and added the alias of the current modules (e.g., PYRODIGAL_GFF). So, the input gbk column may also use gff file as input. Feel free to change this part as it may need some tweaks considering the both the pipeline and the document.
  • Other utilities:
    • ci/cd, testing profiles for dbcan, module.config, etc.
    • documents: readme and output

Things that are needed the changes from the maintainer:

  • Add the changelog for this change in the next release version.
  • Add the dbcan screening step in the schematic workflow.

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/funcscan branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core pipelines lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@HaidYi HaidYi self-assigned this Jul 2, 2025
@HaidYi HaidYi added the enhancement Improvement for existing functionality label Jul 2, 2025
@nf-core-bot
Copy link
Member

nf-core-bot commented Jul 2, 2025

Warning

Newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 3.3.1.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.

Copy link
Collaborator

@jasmezz jasmezz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What a great addition! @HaidYi I really appreciate your effort, your PR is really clear and on point. Thank you very much for this contribution. During review I directly pushed some minor changes to your fork.

Some other comments we could consider:

  • Thinking about renaming the new dbcan subworkflow to cazyme. This would be more in line with previous naming, i.e. subworkflow names tell the purpose, not the tool.
    • This would include changing the output dir in modules.config to ${params.outdir}/cazyme/cazyme_annotation, ${params.outdir}/cazyme/cgc, ${params.outdir}/cazyme/substrate
    • file tree in output docs
    • test names
    • nextflow_schema.json ...
  • The database download takes very long because of low download rate (>2 GB at at rate of ~ 1 MB/s). That is too long for the test profiles; we need to create a smaller database somehow...
  • Adding manual dbCAN database download (via bioconda) to the respective section in usage docs.

Comment on lines 35 to 36
dbcan_skip_cgc = true // skip cgc as .gbk is used
dbcan_skip_substrate = true // skip substrate as .gbk is used
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to be able to run the complete CAZyme subworkflow with pre-annotated .gff files while also providing pre-annotated .gbk files for other subworkflows, we need an additional (optional) column in the samplesheet.

docs/output.md Outdated
- `*_dbCAN_hmm_results.tsv`: TSV file containing the detailed dbCAN HMM results for CAZyme annotation.
- `*_dbCANsub_hmm_results.tsv`: TSV file containing the detailed dbCAN subfamily results for CAZyme annotation.
- `*_diamond.out`: TSV file containing the detailed dbCAN diamond results for CAZyme annotation.
- `cgc`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many of the files of the cgc and substrate section seem duplicated. Maybe we don't need to store those which are created in the cazyme step already? Can control this in modules.config (e.g. see RGI_MAIN entry).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jasmezz Thank you for reviewing the codes. I will revise it based on your comments.

Copy link
Member

@jfy133 jfy133 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really good first PR @HaidYi ! Clean and pretty much all of my comments are sort of minor/just polishing

Some additional things to my direct comments:

@@ -0,0 +1,34 @@
/*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test should be for all cazyme screening tools, so I would rename accordingly for 'future proofing'

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good idea for leaving a placeholder for other cazyme screening developers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I follow...

Basicaally what I mean this should be: test_cazyme_pyrodigal not test_dbcan_pyrodigal!

run_bgc_screening = false
run_cazyme_screening = true

dbcan_skip_cgc = true // Skip cgc annotation as .gbk (not .gff) is provided in samplesheet
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably add gff files!

You can generate them from a normal funcscan fun, and make a PR against teh funscan branch of nf-core/testdatasets, which has the files and an updated samplesheet for the next funcscan version

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, currently the cazyme screening can only use the .gff files in the pipeline. To use the pre-annotated one, I generated the .gff files from pyrodigal. The PR can be found at nf-core/test-datasets#1683.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be updated now you have the file?

docs/output.md Outdated
| ├── deepbgc/
| ├── gecco/
| └── hmmsearch/
├── dbcan/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The top level should be the molecule/gene type (i.e., cazyme), then a subdirectory with each tool (in this case dbcan), and within that each of the different output directories

docs/output.md Outdated

- `dbcan/`
- `cazyme`
- `*_overview.tsv`: TSV file containing the results of dbCAN CAZyme annotation
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're missing the <sample.id> sample subdirectory underneath the tool name (accoeding to your modules.confg)

.join(ch_gffs_for_rundbcan)
.multiMap { meta, faa, gff ->
faa: [meta, faa]
gff: [meta, gff, 'prodigal']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the gff always from prodigal? Or is this a dummy value?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refer to the module description: https://nf-co.re/modules/rundbcan_easycgc/. If it's the generated in the pipeline, it is always the prodigal. But if it's provided using the pre-annotated one, then it could be either NCBI_prok, JGI, NCBI_euk or prodigal. This makes things complicated. An easier way is to define a parameter in the cli for this option but it's kind of hard to deal with the mixed case in a batch without doing the modifications in the samplesheet.

HaidYi and others added 4 commits July 16, 2025 19:24
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
@HaidYi
Copy link
Author

HaidYi commented Jul 17, 2025

@jfy133 Thank you for the comments and suggestions. I will fix all the problems one-by-one. As I don't want this PR corrupt other screening steps, I will do a more comprehensive testing, which may take more time. I will let you know when I fix all the issues.

@jfy133
Copy link
Member

jfy133 commented Aug 29, 2025

@jasmezz and I would like to go through once more to give you the ✔️ given this is a big (And exciting) extension to the pipeline!

@jfy133
Copy link
Member

jfy133 commented Aug 29, 2025

(I can't today, but will try to find time next week(

Copy link
Member

@jfy133 jfy133 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting close @HaidYi ! The main thing is to already move the GFF type into the samplesheet, and add the actual test snapshots of your new configs (and add the cazyme subworkflow to the 'default' test.config)

Feel free to ping me on slack if you need anything :)

@@ -0,0 +1,34 @@
/*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I follow...

Basicaally what I mean this should be: test_cazyme_pyrodigal not test_dbcan_pyrodigal!

HaidYi and others added 19 commits September 18, 2025 09:57
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
@HaidYi
Copy link
Author

HaidYi commented Sep 28, 2025

@jfy133 Please review it again. All the issues you raised have already been resolved in the current codes.

@HaidYi HaidYi requested a review from jfy133 September 28, 2025 21:46
@jfy133
Copy link
Member

jfy133 commented Oct 4, 2025

@jfy133 Please review it again. All the issues you raised have already been resolved in the current codes.

Thank @HaidYi ! I just made a release so you have a clean slate and thus a 'clean release' once this PR is in.

I will resolve the conflicts next week and do the review :)

@jfy133
Copy link
Member

jfy133 commented Oct 8, 2025

@nf-core-bot fix linting

Copy link
Member

@jfy133 jfy133 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are almost there I think @HaidYi !

A few minor code things, and I noticed a few comments that are still outstanding from previous reviews.

A major thing though: you need to add yourself a as contributor! Please add a changelog entry, and add yourself to the main README list of people and the manifest section of the nextflow.config!

I think we are at maturity of the PR that on my next round of review, I may just make sense for me to directly make changes rather than making you have to do it yourself as I think they they will all be very minor (each fixing the annjoying RO crate linting error) - if you're OK with it, could you add me as a collaborator to your fork, and then I can have push changes rights for my next review? How does that sound?

Also note I've finished travelling now for a while so should be much faster to respond and review - I will check each Wednesday morning :)

gffs // tuple val(meta), path(ANNOTATION_ANNOTATION_TOOL.out.gff)

main:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

// Add gff_type to meta for cazyme screening
if ((params.run_cazyme_screening && !params.cazyme_skip_dbcan && (!params.dbcan_skip_cgc || !params.dbcan_skip_substrate)) && params.annotation_tool in ['pyrodigal', 'prodigal', 'prokka', 'bakta']) {
ch_new_annotation_short.map { meta, fasta, faa, gff, gbk ->
def new_meta = meta + [gff_type: 'prodigal'] // Only Use 'prodigal' as dbcan does not distinguish 'pyrodigal' and 'prodigal'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But what if the annotation is from prokka or bakta 🤔 , this would be the wrong type of GFF, right?

ch_new_annotation_short.map { meta, fasta, faa, gff, gbk ->
def new_meta = meta + [gff_type: 'prodigal'] // Only Use 'prodigal' as dbcan does not distinguish 'pyrodigal' and 'prodigal'
[new_meta, fasta, faa, gff, gbk]
}.set { ch_new_annotation_short }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not use set, please switch to = channel assignment.

However, what is the purpose of this, it looks like you're 'over-writing' ch_new_annotation_short the itself self, I would rather use a separate channel for clarity, as in

if (xyz) { 
    ch_new_annotation_for_mixing = ch_new_annotation_short.map{} ...
} else 
    ch_new_annotation_for_mixing = ch_new_annotation_short
}

ch_prepped_input -= ch_new_annotaiton_for_mixing

Or something like this

run_bgc_screening = false
run_cazyme_screening = true

dbcan_skip_cgc = true // Skip cgc annotation as .gbk (not .gff) is provided in samplesheet
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be updated now you have the file?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file should be test_preannotated_cazyme

!file.isEmpty()
},
ch_prepped_input.gffs
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
)
)
ch_versions = ch_versions.mix(CAZYME.out.versions)

test_cazyme_pyrodigal {
includeConfig 'conf/test_cazyme_pyrodigal.config'
}
test_preannotated_dbcan {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
test_preannotated_dbcan {
test_preannotated_cazyme {

includeConfig 'conf/test_cazyme_pyrodigal.config'
}
test_preannotated_dbcan {
includeConfig 'conf/test_preannotated_dbcan.config'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
includeConfig 'conf/test_preannotated_dbcan.config'
includeConfig 'conf/test_preannotated_cazyme.config'

},
"dbcan_skip_cgc": {
"type": "boolean",
"description": "Skip CGC during the dbCAN screening.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still missing

@HaidYi HaidYi removed the enhancement Improvement for existing functionality label Oct 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants

点击 这是indexloc提供的php浏览器服务,不要输入任何密码和下载