这是indexloc提供的服务,不要输入任何密码
Skip to content

Why doesn’t ClickHouse compression exceed ~3x on IPv6 , ports and DateTime columns with ZSTD and Delta codecs? #84424

@Lidhoria

Description

@Lidhoria

Hello,

I’m working with a ClickHouse table containing six columns — IPv6 addresses (srcip, dstip), DateTime fields (flowstart, flowend), and port numbers (srcport, dstport). I’ve applied what I believe are efficient compression codecs, including:

ZSTD(22) on IPv6 and port columns

Delta and DoubleDelta codecs on the DateTime fields

I have a MergeTree table with the following columns and codecs applied:

srcip and dstip as IPv6 with ZSTD(22) codec

flowstart as DateTime with Delta, ZSTD(22) codecs

flowend as DateTime with DoubleDelta, ZSTD(22) codecs

srcport and dstport as UInt16 with ZSTD(22) codec

The table is partitioned by hour on flowstart and ordered by flowstart.

Despite these optimizations, when I check compression statistics from system.parts_columns, the overall compression ratio is around 3x:

sql
Copy
Edit
SELECT
round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS compression_factor_x
FROM system.parts_columns
WHERE database = 'table' AND table = 'flows';
I expected a higher compression factor (10x-30x), especially given the use of high compression levels and delta codecs on DateTime fields.

My dataset consists of about 1 million rows with a mix of IPs, timestamps, and port numbers.

I consistently get only about 3x compression. Given the codecs applied, I expected a significantly higher compression ratio (e.g., 10x or more).

Could someone please help me understand why the compression factor is limited here? Are there other codecs, table structures, or best practices I should consider to improve compression on these types of columns?

Question

I consistently get only about 3x compression. Given the codecs applied, I expected a significantly higher compression ratio (e.g., 10x or more).

Could someone please help me understand why the compression factor is limited here? Are there other codecs, table structures, or best practices I should consider to improve compression on these types of columns?

Metadata

Metadata

Assignees

No one assigned

    Labels

    questionQuestion?st-need-infoWe need extra data to continue (waiting for response)

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions