Add a samtools split task #328

hexylena · 2025-03-07T10:23:10Z

Checklist

Pull request details were added to CHANGELOG.md.
Documentation was updated (if required).
parameter_meta was added/updated (if required).
Submodule branches are on develop or a tagged commit.

hexylena · 2025-03-07T10:26:59Z

Ah, the Directory type isn't in WDL 1.0; I was looking at https://docs.openwdl.org/language-guide/variables.html

rhpvorderman

Looks very good! Just a couple of nitpicks.

rhpvorderman · 2025-03-07T11:10:57Z

samtools.wdl

+        String? unaccountedPath
+        String filenameFormat = "%!.%."
+        String outputFormat = "bam"
+        Boolean writeIndex = false


I recommend setting this to true unless you are very sure that these split BAMs will not be used by a downstream tool that requires an index and are not the end product (IGV and GATK, for instance, only accept indexed files).

See the comment on the output: alternatively, just always index to keep things simpler. For further context: indexing tasks are a pain to write because of file localization. You have to make sure that the file and its index are in the same folder. miniwdl and Cromwell run jobs in isolated folders, so in indexing tasks you have to juggle files around using links or similar (or worse, copy the data). The easiest approach is to always index in producer tasks rather than have separate indexing tasks.
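To illustrate the localization juggling described above, here is a minimal sketch, assuming a POSIX filesystem where symlinks are allowed. The helper name colocate, the folder names, and the placeholder file contents are all hypothetical; this is not code from the PR, only a picture of what a separate indexing task ends up doing.

```python
import os
import tempfile
from pathlib import Path


def colocate(bam: Path, index: Path, workdir: Path) -> tuple[Path, Path]:
    """Symlink a BAM and its index side by side in workdir (no data is copied)."""
    workdir.mkdir(parents=True, exist_ok=True)
    bam_link = workdir / bam.name
    # Name the index after the BAM so tools looking for "<bam>.bai" can find it.
    index_link = workdir / (bam.name + ".bai")
    os.symlink(bam.resolve(), bam_link)
    os.symlink(index.resolve(), index_link)
    return bam_link, index_link


# Simulate an engine that localized the two inputs into different folders.
root = Path(tempfile.mkdtemp())
(root / "in1").mkdir()
(root / "in2").mkdir()
bam = root / "in1" / "sample.bam"
bai = root / "in2" / "sample.bam.bai"
bam.write_bytes(b"placeholder BAM bytes")      # not a real BAM file
bai.write_bytes(b"placeholder index bytes")    # not a real index

bam_link, index_link = colocate(bam, bai, root / "work")
print(bam_link.parent == index_link.parent)
```

Indexing inside the producer task avoids this step entirely, because the file and its index are written next to each other in the first place.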

That makes sense, very reasonable.

rhpvorderman · 2025-03-07T11:27:33Z

samtools.wdl

+        String filenameFormat = "%!.%."
+        String outputFormat = "bam"
+        Boolean writeIndex = false
+


You could add an Int compressionLevel = 1 parameter and use --output-fmt-option level=~{compressionLevel} in the command section.

Something to be aware of is that most bioinformatics formats use DEFLATE, either as some form of gzip (bgzip = blocked gzip, concatenated gzip blocks) or in some custom format. Level 5 is chosen by default in samtools. IMNSHO this was a very bad design choice. Level 5 is multiple times slower than level 1 for a file that is only 20% smaller. In fact, this makes most bioinformatics tools behave like compression tools with a bioinformatic side effect. I am not interested in heavy compression for intermediate files. For BAM files, if you need compression, CRAM is always the better option. So there is no reason to compress BAM with anything more than level 1.
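The trade-off between DEFLATE levels can be tried out with a small sketch using Python's standard zlib module. The input here is synthetic pseudo-random bases, not real sequencing data, and zlib wraps DEFLATE in the zlib container rather than gzip or BGZF, so the numbers it prints are only indicative and will not match the samtools figures.

```python
import random
import time
import zlib

# Synthetic stand-in for sequence data: reproducible pseudo-random bases.
rng = random.Random(0)
raw = "".join(rng.choice("ACGT") for _ in range(500_000)).encode("ascii")

results = {}
for level in (1, 5):
    start = time.perf_counter()
    compressed = zlib.compress(raw, level)
    elapsed = time.perf_counter() - start
    # Either level yields a stream that any DEFLATE decoder reads back unchanged.
    assert zlib.decompress(compressed) == raw
    results[level] = (len(compressed), elapsed)
    print(f"level {level}: {len(compressed)} bytes in {elapsed:.3f} s")
```

The compression level only changes how hard the encoder searches for matches; the decoder is the same for every level, which is why lowering the level is safe for downstream tools.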

Additionally, Intel's ISA-L project has a DEFLATE implementation that decompresses 2x faster and compresses 5x (!!!!!) faster at level 1 while being completely compatible. It does not support levels higher than 3. So even if a project supports ISA-L's vastly faster runtimes, they are not enabled when the compression level is too high, and zlib is used as a fallback.

Sorry, this is something of a pet peeve of mine. The impact is quite huge.

Q.E.D

$ /usr/bin/time samtools split --output-fmt-option level=1 ~/test/HG002_20230424_1302_3H_PAO89685_2264ba8c_hac_simplex_downsampled.bam
171.40user 44.85system 3:38.60elapsed 98%CPU (0avgtext+0avgdata 4360maxresident)k
18047512inputs+21207544outputs (1major+495minor)pagefaults 0swaps

$ /usr/bin/time samtools split ~/test/HG002_20230424_1302_3H_PAO89685_2264ba8c_hac_simplex_downsampled.bam
280.31user 44.25system 5:27.19elapsed 99%CPU (0avgtext+0avgdata 4596maxresident)k
15075128inputs+19728672outputs (11major+503minor)pagefaults 0swaps

Jeez, that's a big difference.

Yes, and 21207544 / 19728672 = 1.07, so the level 1 output is just 7% bigger. Presumably that is because this is ONT data: due to gzip's 32 KB window size, Illumina data compresses better than ONT data.
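For reference, the ratios can be recomputed from the two /usr/bin/time reports quoted above. This is only a small sketch: the helper to_seconds and the dictionary layout are illustrative, while the numbers are copied from the reports.

```python
def to_seconds(elapsed: str) -> float:
    """Convert an m:ss.ss elapsed string from /usr/bin/time into seconds."""
    minutes, seconds = elapsed.split(":")
    return int(minutes) * 60 + float(seconds)


# Values copied from the two runs: level 1 versus the samtools default level.
level1 = {"user": 171.40, "elapsed": to_seconds("3:38.60"), "outputs": 21207544}
default = {"user": 280.31, "elapsed": to_seconds("5:27.19"), "outputs": 19728672}

# The "outputs" counters are in the same unit for both runs, so the ratio is unit-free.
size_ratio = level1["outputs"] / default["outputs"]
elapsed_ratio = default["elapsed"] / level1["elapsed"]
user_ratio = default["user"] / level1["user"]

print(f"level 1 output / default output: {size_ratio:.2f}")
print(f"default elapsed / level 1 elapsed: {elapsed_ratio:.2f}")
print(f"default user CPU / level 1 user CPU: {user_ratio:.2f}")
```

On these numbers the default level takes roughly one and a half times the wall-clock time of level 1 in exchange for the 7% size difference.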

hexylena added 2 commits March 7, 2025 11:21

Add a samtools split task

319501e

Register in changelog

60dcef7

hexylena added 2 commits March 7, 2025 11:27

Directory not yet available

4030091

Must be defined

8a0de27

hexylena force-pushed the samtools-split branch from 33f04f7 to 8a0de27 Compare March 7, 2025 10:29

noticed in wdl-aid that only these are permitted

b70891c

rhpvorderman requested changes Mar 7, 2025

View reviewed changes

hexylena added 8 commits March 7, 2025 13:01

Add compression level parameter, defaulting to 1

1ec8855

default to indexing

153db04

Remove control of output format

1522785

include indexes

2bba90e

write index is non-optional

bd4a856

make subdirectory as well

be0aabe

emits csi extension instead

10e83c1

missing threads

6ebf7cd

rhpvorderman approved these changes Mar 11, 2025

View reviewed changes

rhpvorderman added 2 commits March 11, 2025 17:06

Merge branch 'develop' into samtools-split

4267c2b

Merge branch 'develop' into samtools-split

d2df344

rhpvorderman merged commit cd579bf into biowdl:develop Mar 28, 2025
1 check passed
