-
Notifications
You must be signed in to change notification settings - Fork 33
Add run_dbcan screening for the CAZyme (carbohydrate active enzyme) and CGC (CAZyme Gene Cluster) annotation #483
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Conversation
Warning Newer version of the nf-core template is available. Your pipeline is using an old version of the nf-core template: 3.3.1. For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What a great addition! @HaidYi I really appreciate your effort, your PR is really clear and on point. Thank you very much for this contribution. During review I directly pushed some minor changes to your fork.
Some other comments we could consider:
- Thinking about renaming the new
dbcan
subworkflow tocazyme
. This would be more in line with previous naming, i.e. subworkflow names tell the purpose, not the tool.- This would include changing the output dir in
modules.config
to${params.outdir}/cazyme/cazyme_annotation
,${params.outdir}/cazyme/cgc
,${params.outdir}/cazyme/substrate
- file tree in output docs
- test names
- nextflow_schema.json ...
- This would include changing the output dir in
- The database download takes very long because of low download rate (>2 GB at at rate of ~ 1 MB/s). That is too long for the test profiles; we need to create a smaller database somehow...
- Adding manual dbCAN database download (via bioconda) to the respective section in usage docs.
conf/test_preannotated_dbcan.config
Outdated
dbcan_skip_cgc = true // skip cgc as .gbk is used | ||
dbcan_skip_substrate = true // skip substrate as .gbk is used |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to be able to run the complete CAZyme subworkflow with pre-annotated .gff files while also providing pre-annotated .gbk files for other subworkflows, we need an additional (optional) column in the samplesheet.
docs/output.md
Outdated
- `*_dbCAN_hmm_results.tsv`: TSV file containing the detailed dbCAN HMM results for CAZyme annotation. | ||
- `*_dbCANsub_hmm_results.tsv`: TSV file containing the detailed dbCAN subfamily results for CAZyme annotation. | ||
- `*_diamond.out`: TSV file containing the detailed dbCAN diamond results for CAZyme annotation. | ||
- `cgc` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Many of the files of the cgc
and substrate
section seem duplicated. Maybe we don't need to store those which are created in the cazyme
step already? Can control this in modules.config
(e.g. see RGI_MAIN entry).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jasmezz Thank you for reviewing the codes. I will revise it based on your comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really good first PR @HaidYi ! Clean and pretty much all of my comments are sort of minor/just polishing
Some additional things to my direct comments:
- Missing
citations.md
update - Missing the how to cite/methods text in this file: https://github.com/HaidYi/funcscan/blob/0cad8f95c553b3cdd3a59c34a0db107bd6df14f4/subworkflows/local/utils_nfcore_funcscan_pipeline/main.nf#L174
- Missing metromap update (but we can probably do this before release)
- Missing nf-test test and snapshots for the new tests
@@ -0,0 +1,34 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test should be for all cazyme screening tools, so I would rename accordingly for 'future proofing'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good idea for leaving a placeholder for other cazyme screening developers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I follow...
Basicaally what I mean this should be: test_cazyme_pyrodigal
not test_dbcan_pyrodigal
!
run_bgc_screening = false | ||
run_cazyme_screening = true | ||
|
||
dbcan_skip_cgc = true // Skip cgc annotation as .gbk (not .gff) is provided in samplesheet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably add gff files!
You can generate them from a normal funcscan fun, and make a PR against teh funscan
branch of nf-core/testdatasets, which has the files and an updated samplesheet for the next funcscan version
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, currently the cazyme screening can only use the .gff
files in the pipeline. To use the pre-annotated one, I generated the .gff
files from pyrodigal
. The PR can be found at nf-core/test-datasets#1683.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be updated now you have the file?
docs/output.md
Outdated
| ├── deepbgc/ | ||
| ├── gecco/ | ||
| └── hmmsearch/ | ||
├── dbcan/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The top level should be the molecule/gene type (i.e., cazyme
), then a subdirectory with each tool (in this case dbcan
), and within that each of the different output directories
docs/output.md
Outdated
|
||
- `dbcan/` | ||
- `cazyme` | ||
- `*_overview.tsv`: TSV file containing the results of dbCAN CAZyme annotation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're missing the <sample.id>
sample subdirectory underneath the tool name (accoeding to your modules.confg
)
subworkflows/local/cazyme.nf
Outdated
.join(ch_gffs_for_rundbcan) | ||
.multiMap { meta, faa, gff -> | ||
faa: [meta, faa] | ||
gff: [meta, gff, 'prodigal'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the gff always from prodigal? Or is this a dummy value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refer to the module description: https://nf-co.re/modules/rundbcan_easycgc/. If it's the generated in the pipeline, it is always the prodigal
. But if it's provided using the pre-annotated one, then it could be either NCBI_prok
, JGI
, NCBI_euk
or prodigal
. This makes things complicated. An easier way is to define a parameter in the cli for this option but it's kind of hard to deal with the mixed case in a batch without doing the modifications in the samplesheet.
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
@jfy133 Thank you for the comments and suggestions. I will fix all the problems one-by-one. As I don't want this PR corrupt other screening steps, I will do a more comprehensive testing, which may take more time. I will let you know when I fix all the issues. |
@jasmezz and I would like to go through once more to give you the ✔️ given this is a big (And exciting) extension to the pipeline! |
(I can't today, but will try to find time next week( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Getting close @HaidYi ! The main thing is to already move the GFF type into the samplesheet, and add the actual test snapshots of your new configs (and add the cazyme subworkflow to the 'default' test.config
)
Feel free to ping me on slack if you need anything :)
@@ -0,0 +1,34 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I follow...
Basicaally what I mean this should be: test_cazyme_pyrodigal
not test_dbcan_pyrodigal
!
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
@jfy133 Please review it again. All the issues you raised have already been resolved in the current codes. |
@nf-core-bot fix linting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are almost there I think @HaidYi !
A few minor code things, and I noticed a few comments that are still outstanding from previous reviews.
A major thing though: you need to add yourself a as contributor! Please add a changelog entry, and add yourself to the main README list of people and the manifest section of the nextflow.config
!
I think we are at maturity of the PR that on my next round of review, I may just make sense for me to directly make changes rather than making you have to do it yourself as I think they they will all be very minor (each fixing the annjoying RO crate linting error) - if you're OK with it, could you add me as a collaborator to your fork, and then I can have push changes rights for my next review? How does that sound?
Also note I've finished travelling now for a while so should be much faster to respond and review - I will check each Wednesday morning :)
gffs // tuple val(meta), path(ANNOTATION_ANNOTATION_TOOL.out.gff) | ||
|
||
main: | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
workflows/funcscan.nf
Outdated
// Add gff_type to meta for cazyme screening | ||
if ((params.run_cazyme_screening && !params.cazyme_skip_dbcan && (!params.dbcan_skip_cgc || !params.dbcan_skip_substrate)) && params.annotation_tool in ['pyrodigal', 'prodigal', 'prokka', 'bakta']) { | ||
ch_new_annotation_short.map { meta, fasta, faa, gff, gbk -> | ||
def new_meta = meta + [gff_type: 'prodigal'] // Only Use 'prodigal' as dbcan does not distinguish 'pyrodigal' and 'prodigal' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But what if the annotation is from prokka
or bakta
🤔 , this would be the wrong type of GFF, right?
ch_new_annotation_short.map { meta, fasta, faa, gff, gbk -> | ||
def new_meta = meta + [gff_type: 'prodigal'] // Only Use 'prodigal' as dbcan does not distinguish 'pyrodigal' and 'prodigal' | ||
[new_meta, fasta, faa, gff, gbk] | ||
}.set { ch_new_annotation_short } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not use set
, please switch to =
channel assignment.
However, what is the purpose of this, it looks like you're 'over-writing' ch_new_annotation_short
the itself self, I would rather use a separate channel for clarity, as in
if (xyz) {
ch_new_annotation_for_mixing = ch_new_annotation_short.map{} ...
} else
ch_new_annotation_for_mixing = ch_new_annotation_short
}
ch_prepped_input -= ch_new_annotaiton_for_mixing
Or something like this
run_bgc_screening = false | ||
run_cazyme_screening = true | ||
|
||
dbcan_skip_cgc = true // Skip cgc annotation as .gbk (not .gff) is provided in samplesheet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be updated now you have the file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file should be test_preannotated_cazyme
!file.isEmpty() | ||
}, | ||
ch_prepped_input.gffs | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
) | |
) | |
ch_versions = ch_versions.mix(CAZYME.out.versions) |
test_cazyme_pyrodigal { | ||
includeConfig 'conf/test_cazyme_pyrodigal.config' | ||
} | ||
test_preannotated_dbcan { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test_preannotated_dbcan { | |
test_preannotated_cazyme { |
includeConfig 'conf/test_cazyme_pyrodigal.config' | ||
} | ||
test_preannotated_dbcan { | ||
includeConfig 'conf/test_preannotated_dbcan.config' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
includeConfig 'conf/test_preannotated_dbcan.config' | |
includeConfig 'conf/test_preannotated_cazyme.config' |
}, | ||
"dbcan_skip_cgc": { | ||
"type": "boolean", | ||
"description": "Skip CGC during the dbCAN screening.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still missing
PR checklist
Close #481.
The main changes include:
subworkflows/dbcan.nf
) for the support of run_dbcan screening..gff
files and added the alias of the current modules (e.g.,PYRODIGAL_GFF
). So, the inputgbk
column may also usegff
file as input. Feel free to change this part as it may need some tweaks considering the both the pipeline and the document.module.config
, etc.readme
andoutput
Things that are needed the changes from the maintainer:
nf-core pipelines lint
).nextflow run . -profile test,docker --outdir <OUTDIR>
).nextflow run . -profile debug,test,docker --outdir <OUTDIR>
).docs/usage.md
is updated.docs/output.md
is updated.CHANGELOG.md
is updated.README.md
is updated (including new tool citations and authors/contributors).