-
Notifications
You must be signed in to change notification settings - Fork 29
Add a samtools split task #328
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Ah, directory isn't in 1.0, I was looking at https://docs.openwdl.org/language-guide/variables.html |
rhpvorderman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks very good! Just a couple of nitpicks.
samtools.wdl
Outdated
| String? unaccountedPath | ||
| String filenameFormat = "%!.%." | ||
| String outputFormat = "bam" | ||
| Boolean writeIndex = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend setting this to true unless you are very sure that these split bams are not going to be used by tool that requires indexing upstream or are not the end product. (IGV and GATK will only accept indexed files for instance).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment on output: alternatively just always index to keep things simpler. For further context: indexing tasks are a pain to write because of file localization. You have to make sure that file and index are in the same folder. MiniWDL and cromwell will run jobs in isolated folders so you have to juggle files around using links or similar in indexing tasks (or worse, copy the data). The easiest way to index is to always index in producer tasks rather than have separate indexing tasks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense, very reasonable.
| String filenameFormat = "%!.%." | ||
| String outputFormat = "bam" | ||
| Boolean writeIndex = false | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could add a Int compressionLevel = 1 parameter and use --output-fmt level=~{compressionLevel} in the command section.
Some thing to be aware of is that most bioinformatics formats use DEFLATE. Either as some form of gzip (bgzip = blocked gzip, concatenated gzip blocks) or in some custom format. Level 5 is chosen by default in samtools. IMNSHO this was a very bad design choice. Level 5 is multiple times slower than level 1 for only 20% smaller filesize. In fact, this makes most bioinformatics tools behave like compression tools with a bioinformatic side effect. I am not interested in heavy compression for intermediate files. For BAM files, if you need compression, CRAM is always the better option. So there is no reason to compress BAM with anything more than level 1.
Additionaly Intel's ISA-L project has a DEFLATE implementation that decompresses 2x faster and compresses 5x (!!!!!) faster at level 1 while being completely compatible. It does not support levels higher than 3. So even if the project supports vastly faster runtimes, these are not enabled if the compression level is too high and zlib is used as a fallback.
Sorry something of a pet peeve of mine. The impact is quite huge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q.E.D
$ /usr/bin/time samtools split --output-fmt-option level=1 ~/test/HG002_20230424_1302_3H_PAO89685_2264ba8c_hac_simplex_downsampled.bam
171.40user 44.85system 3:38.60elapsed 98%CPU (0avgtext+0avgdata 4360maxresident)k
18047512inputs+21207544outputs (1major+495minor)pagefaults 0swaps
$ /usr/bin/time samtools split ~/test/HG002_20230424_1302_3H_PAO89685_2264ba8c_hac_simplex_downsampled.bam
280.31user 44.25system 5:27.19elapsed 99%CPU (0avgtext+0avgdata 4596maxresident)k
15075128inputs+19728672outputs (11major+503minor)pagefaults 0swaps
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jeez that's a big different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes and 21207544 / 19728672 = 1.07. So the level 1 file is just 7% bigger. Presumably because this is ONT data. Due to the 32KB window size on gzip Illumina compresses better than ONT data.
Checklist
parameter_metawas added/updated (if required).