Use more efficient drop-in replacements for compression #265

@MaxJa4

Is your feature request related to a problem?

Currently, compress/gzip is the default compression option for compatibility reasons, since the much faster zstd is not compatible with the gzip file format.
The author of the zstd library we use (klauspost) also maintains other compression libraries that are optimized versions of the standard library implementations.

I tested these two drop-in replacements for compress/gzip:

  • github.com/klauspost/compress/gzip -> optimized gzip implementation that still conforms to the gzip specification
  • github.com/klauspost/pgzip -> parallelized version of the optimized gzip, and therefore also compatible
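
To illustrate what "drop-in" would mean in practice, here is a minimal sketch of an in-memory gzip round trip. It uses the standard library so it runs without extra dependencies; the assumption is that changing only the import path to one of the packages listed above leaves the rest of the code untouched. The helper names `compress` and `decompress` are hypothetical and not taken from this project.

```go
package main

import (
	"bytes"
	"fmt"
	"io"

	// Standard library implementation. Assumption: since the packages above
	// are described as drop-in replacements, swapping this path for
	// "github.com/klauspost/compress/gzip", or aliasing
	// gzip "github.com/klauspost/pgzip", should be the only change needed.
	"compress/gzip"
)

// compress gzips data in memory at the given compression level.
func compress(data []byte, level int) ([]byte, error) {
	var buf bytes.Buffer
	zw, err := gzip.NewWriterLevel(&buf, level)
	if err != nil {
		return nil, err
	}
	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	// Close flushes pending data and writes the gzip footer.
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decompress reads a complete gzip stream back into memory.
func decompress(data []byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

func main() {
	in := bytes.Repeat([]byte("backup data "), 1000)
	packed, err := compress(in, 4)
	if err != nil {
		panic(err)
	}
	out, err := decompress(packed)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d -> %d bytes, round trip ok: %v\n", len(in), len(packed), bytes.Equal(in, out))
}
```

Because the output is a regular gzip stream in every case, archives written by one implementation should remain readable by the others.
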
Test setup

HW: i7 13700KF, 32GB RAM, PCIe 4.0 NVMe SSD with ~7 GB/s read (sequential)
SW: Win10, Go v1.21.0, klauspost pkgs v1.16.7
Parameters: Compression level = 4 for all (default for zstd)
Test data used: Full Silesia Corpus contents
Methodology: Read file (tar'ed test data), then compress and decompress in memory (no file writing)
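
For anyone who wants to reproduce the numbers, the methodology above could be sketched roughly as follows. This is not the exact benchmark code used for the results; it uses the standard library gzip and synthetic input so it runs as-is. The `result` type and `measure` function are hypothetical names, and a real run would load the tar'ed corpus (for example via `os.ReadFile`) and swap in the package under test.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"time"
)

// result holds the three figures reported per package.
type result struct {
	ratio      float64       // compressed size divided by original size
	compress   time.Duration // wall time for in-memory compression
	decompress time.Duration // wall time for in-memory decompression
}

// measure compresses and decompresses data entirely in memory and times both.
func measure(data []byte, level int) (result, error) {
	var r result
	var buf bytes.Buffer

	start := time.Now()
	zw, err := gzip.NewWriterLevel(&buf, level)
	if err != nil {
		return r, err
	}
	if _, err := zw.Write(data); err != nil {
		return r, err
	}
	if err := zw.Close(); err != nil {
		return r, err
	}
	r.compress = time.Since(start)
	r.ratio = float64(buf.Len()) / float64(len(data))

	start = time.Now()
	zr, err := gzip.NewReader(bytes.NewReader(buf.Bytes()))
	if err != nil {
		return r, err
	}
	out, err := io.ReadAll(zr)
	if err != nil {
		return r, err
	}
	r.decompress = time.Since(start)

	// Sanity check so a broken round trip never produces a "fast" result.
	if !bytes.Equal(out, data) {
		return r, fmt.Errorf("round trip mismatch")
	}
	return r, nil
}

func main() {
	// Synthetic stand-in for the corpus; real input would be the tar file.
	data := bytes.Repeat([]byte("some moderately repetitive test data\n"), 100000)
	r, err := measure(data, 4)
	if err != nil {
		panic(err)
	}
	fmt.Printf("size %.2f%%, compress %v, decompress %v\n", r.ratio*100, r.compress, r.decompress)
}
```

A single timed run like this is noisy; averaging several runs would give more stable figures.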

Results

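| Package            | File size after compression | Compression | Decompression |
|--------------------|-----------------------------|-------------|---------------|
| Gzip original      | 33.54%                      | 2152ms      | 863.1ms       |
| Gzip klauspost     | 34.29%                      | 1249ms      | 711.4ms       |
| PGzip klauspost    | 34.37%                      | 246.5ms     | 674.0ms       |
| Zstd for reference | 31.89%                      | 672.0ms     | 198.6ms       |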

Describe the solution you'd like

To let existing .tar.gz / gzip setups benefit from much faster compression (~1.7x with Gzip, ~8.7x with PGzip) and slightly faster decompression while preserving compatibility, I suggest using one of the optimized gzip implementations as a drop-in replacement.
Note that in an automated/unsupervised backup scenario it might not be desirable to use all available CPU power for compression, as PGzip's parallelism does (the degree of parallelism can be adjusted, though), since a backup run could affect live production workloads. The optimized Gzip is still single-threaded afaik.
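
One way to leave headroom for production workloads would be to cap the number of parallel compression workers at a fraction of the available cores. The sketch below only computes that cap; `backupWorkers` and the 50% fraction are hypothetical choices for illustration. To my knowledge pgzip's Writer exposes `SetConcurrency(blockSize, blocks int)`, where the result could be passed as `blocks`, but treat that wiring as an assumption to verify against the pgzip documentation.

```go
package main

import (
	"fmt"
	"runtime"
)

// backupWorkers returns how many parallel compression workers to allow,
// given the CPU count and the fraction of cores a backup may use.
// Fractions outside (0, 1] fall back to using all cores, and the result
// is never lower than 1.
func backupWorkers(cpus int, fraction float64) int {
	if fraction <= 0 || fraction > 1 {
		fraction = 1
	}
	n := int(float64(cpus) * fraction)
	if n < 1 {
		n = 1
	}
	return n
}

func main() {
	workers := backupWorkers(runtime.NumCPU(), 0.5)
	// Assumption: with pgzip this value would be handed to the writer,
	// e.g. zw.SetConcurrency(1<<20, workers), before the first Write.
	fmt.Printf("using %d of %d cores for compression\n", workers, runtime.NumCPU())
}
```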
While we're at it, maybe it makes sense to introduce three compression presets: fast, default (level 4 as used here, and the default for all packages afaik), and best. The user can then prioritize downtime (fast preset) or storage size/cost (best preset). If no preset is set, default would be used, as is the case now.
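
A rough sketch of how such presets might map to gzip levels is shown below. The `Preset` type, the `gzipLevel` function, and the choice of `gzip.BestSpeed` and `gzip.BestCompression` for fast and best are assumptions for illustration only; the level 4 default is taken from the test setup above. A zstd mapping would need its own table, since its level scale differs.

```go
package main

import (
	"compress/gzip"
	"fmt"
)

// Preset is a hypothetical user-facing compression preset name.
type Preset string

const (
	PresetFast    Preset = "fast"
	PresetDefault Preset = "default"
	PresetBest    Preset = "best"
)

// gzipLevel translates a preset into a gzip compression level.
// An empty preset behaves like "default", so unset configs keep working.
func gzipLevel(p Preset) (int, error) {
	switch p {
	case "", PresetDefault:
		return 4, nil // level used in the benchmark above
	case PresetFast:
		return gzip.BestSpeed, nil // assumption: prioritize short backup runs
	case PresetBest:
		return gzip.BestCompression, nil // assumption: prioritize storage size
	default:
		return 0, fmt.Errorf("unknown compression preset %q", p)
	}
}

func main() {
	for _, p := range []Preset{"", PresetFast, PresetDefault, PresetBest} {
		level, err := gzipLevel(p)
		if err != nil {
			panic(err)
		}
		fmt.Printf("preset %q -> gzip level %d\n", p, level)
	}
}
```

Returning an error for unknown names keeps a typo in the configuration from silently falling back to a different level.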

If you see the need for more testing, feel free to suggest more.
