这是indexloc提供的服务,不要输入任何密码
Skip to content

Conversation

@hexylena
Copy link
Contributor

@hexylena hexylena commented Mar 7, 2025

Checklist

  • Pull request details were added to CHANGELOG.md.
  • Documentation was updated (if required).
  • parameter_meta was added/updated (if required).
  • Submodule branches are on develop or a tagged commit.

@hexylena
Copy link
Contributor Author

hexylena commented Mar 7, 2025

Ah, directory isn't in 1.0, I was looking at https://docs.openwdl.org/language-guide/variables.html

Copy link
Contributor

@rhpvorderman rhpvorderman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very good! Just a couple of nitpicks.

samtools.wdl Outdated
String? unaccountedPath
String filenameFormat = "%!.%."
String outputFormat = "bam"
Boolean writeIndex = false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend setting this to true unless you are very sure that these split bams are not going to be used by tool that requires indexing upstream or are not the end product. (IGV and GATK will only accept indexed files for instance).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment on output: alternatively just always index to keep things simpler. For further context: indexing tasks are a pain to write because of file localization. You have to make sure that file and index are in the same folder. MiniWDL and cromwell will run jobs in isolated folders so you have to juggle files around using links or similar in indexing tasks (or worse, copy the data). The easiest way to index is to always index in producer tasks rather than have separate indexing tasks.

Copy link
Contributor Author

@hexylena hexylena Mar 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, very reasonable.

String filenameFormat = "%!.%."
String outputFormat = "bam"
Boolean writeIndex = false

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could add a Int compressionLevel = 1 parameter and use --output-fmt level=~{compressionLevel} in the command section.

Some thing to be aware of is that most bioinformatics formats use DEFLATE. Either as some form of gzip (bgzip = blocked gzip, concatenated gzip blocks) or in some custom format. Level 5 is chosen by default in samtools. IMNSHO this was a very bad design choice. Level 5 is multiple times slower than level 1 for only 20% smaller filesize. In fact, this makes most bioinformatics tools behave like compression tools with a bioinformatic side effect. I am not interested in heavy compression for intermediate files. For BAM files, if you need compression, CRAM is always the better option. So there is no reason to compress BAM with anything more than level 1.

Additionaly Intel's ISA-L project has a DEFLATE implementation that decompresses 2x faster and compresses 5x (!!!!!) faster at level 1 while being completely compatible. It does not support levels higher than 3. So even if the project supports vastly faster runtimes, these are not enabled if the compression level is too high and zlib is used as a fallback.

Sorry something of a pet peeve of mine. The impact is quite huge.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q.E.D

$ /usr/bin/time samtools split --output-fmt-option level=1 ~/test/HG002_20230424_1302_3H_PAO89685_2264ba8c_hac_simplex_downsampled.bam 
171.40user 44.85system 3:38.60elapsed 98%CPU (0avgtext+0avgdata 4360maxresident)k
18047512inputs+21207544outputs (1major+495minor)pagefaults 0swaps
$ /usr/bin/time samtools split ~/test/HG002_20230424_1302_3H_PAO89685_2264ba8c_hac_simplex_downsampled.bam 
280.31user 44.25system 5:27.19elapsed 99%CPU (0avgtext+0avgdata 4596maxresident)k
15075128inputs+19728672outputs (11major+503minor)pagefaults 0swaps

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jeez that's a big different.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes and 21207544 / 19728672 = 1.07. So the level 1 file is just 7% bigger. Presumably because this is ONT data. Due to the 32KB window size on gzip Illumina compresses better than ONT data.

@rhpvorderman rhpvorderman merged commit cd579bf into biowdl:develop Mar 28, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants