这是indexloc提供的服务,不要输入任何密码
Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,7 @@ version 6.0.0-dev
+ Use softlinks to localise the database for centrifuge.
+ Added the FastqFilter task.
+ Added a new input `revcomp` to cutadapt to set the `--revcomp` flag, defaults to `false`.
+ New samtools task: split.
+ Update `bedtools.Intersect` to support `-wa`, `-wb`, and `-s` flags.

version 5.2.0
Expand Down
58 changes: 58 additions & 0 deletions samtools.wdl
Original file line number Diff line number Diff line change
Expand Up @@ -514,6 +514,64 @@ task Sort {
}
}

task Split {
input {
File inputBam
String outputPath
String? unaccountedPath
String filenameFormat = "%!.%."

Int compressionLevel = 1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could add a Int compressionLevel = 1 parameter and use --output-fmt level=~{compressionLevel} in the command section.

Some thing to be aware of is that most bioinformatics formats use DEFLATE. Either as some form of gzip (bgzip = blocked gzip, concatenated gzip blocks) or in some custom format. Level 5 is chosen by default in samtools. IMNSHO this was a very bad design choice. Level 5 is multiple times slower than level 1 for only 20% smaller filesize. In fact, this makes most bioinformatics tools behave like compression tools with a bioinformatic side effect. I am not interested in heavy compression for intermediate files. For BAM files, if you need compression, CRAM is always the better option. So there is no reason to compress BAM with anything more than level 1.

Additionaly Intel's ISA-L project has a DEFLATE implementation that decompresses 2x faster and compresses 5x (!!!!!) faster at level 1 while being completely compatible. It does not support levels higher than 3. So even if the project supports vastly faster runtimes, these are not enabled if the compression level is too high and zlib is used as a fallback.

Sorry something of a pet peeve of mine. The impact is quite huge.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q.E.D

$ /usr/bin/time samtools split --output-fmt-option level=1 ~/test/HG002_20230424_1302_3H_PAO89685_2264ba8c_hac_simplex_downsampled.bam 
171.40user 44.85system 3:38.60elapsed 98%CPU (0avgtext+0avgdata 4360maxresident)k
18047512inputs+21207544outputs (1major+495minor)pagefaults 0swaps
$ /usr/bin/time samtools split ~/test/HG002_20230424_1302_3H_PAO89685_2264ba8c_hac_simplex_downsampled.bam 
280.31user 44.25system 5:27.19elapsed 99%CPU (0avgtext+0avgdata 4596maxresident)k
15075128inputs+19728672outputs (11major+503minor)pagefaults 0swaps

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jeez that's a big different.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes and 21207544 / 19728672 = 1.07. So the level 1 file is just 7% bigger. Presumably because this is ONT data. Due to the 32KB window size on gzip Illumina compresses better than ONT data.

Int threads = 1
String memory = "1GiB"
Int timeMinutes = 1 + ceil(size(inputBam, "GiB") * 2)
String dockerImage = "quay.io/biocontainers/samtools:1.16.1--h6899075_1"
}

command {
set -e
mkdir -p "~{outputPath}/rg/"
samtools split \
--output-fmt bam \
--output-fmt-option level=~{compressionLevel} \
-f "~{outputPath}/rg/~{filenameFormat}" \
~{"-u " + unaccountedPath} \
--threads ~{threads} \
--write-index \
~{inputBam}
}

output {
Array[File] splitBam = glob(outputPath + "/rg/*.bam")
Array[File] splitBamIndex = glob(outputPath + "/rg/*.bam.csi")
File? unaccounted = unaccountedPath
}

runtime {
cpu: threads
memory: memory
docker: dockerImage
time_minutes: timeMinutes
}

parameter_meta {
# inputs
inputBam: {description: "The bam file to split.", category: "required"}
outputPath: {description: "Directory to store output bams", category: "required"}

# Optional parameters
unaccountedPath: {description: "The location to write reads to which are not detected as being part of an existing read group.", category: "common"}
filenameFormat: {description: "Format of the filename, the following tokens can be used: %% a literal % sign, %* basename, %# @RG index, %! @RG ID, %. filename extension for output format", category: "common"}
compressionLevel: {description: "Set compression level when writing gz or bgzf fastq files.", category: "advanced"}

# outputs
splitBam: {description: "BAM file split by read groups"}
splitBamIndex: {description: "BAM indexes"}
unaccounted: {description: "Reads with no RG tag or an unrecognised RG tag."}
}
}

task Tabix {
input {
File inputFile
Expand Down
Loading