-
Notifications
You must be signed in to change notification settings - Fork 7.5k
Description
Hello,
I’m working with a ClickHouse table containing six columns — IPv6 addresses (srcip, dstip), DateTime fields (flowstart, flowend), and port numbers (srcport, dstport). I’ve applied what I believe are efficient compression codecs, including:
ZSTD(22) on IPv6 and port columns
Delta and DoubleDelta codecs on the DateTime fields
I have a MergeTree table with the following columns and codecs applied:
srcip and dstip as IPv6 with ZSTD(22) codec
flowstart as DateTime with Delta, ZSTD(22) codecs
flowend as DateTime with DoubleDelta, ZSTD(22) codecs
srcport and dstport as UInt16 with ZSTD(22) codec
The table is partitioned by hour on flowstart and ordered by flowstart.
Despite these optimizations, when I check compression statistics from system.parts_columns, the overall compression ratio is around 3x:
sql
Copy
Edit
SELECT
round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS compression_factor_x
FROM system.parts_columns
WHERE database = 'table' AND table = 'flows';
I expected a higher compression factor (10x-30x), especially given the use of high compression levels and delta codecs on DateTime fields.
My dataset consists of about 1 million rows with a mix of IPs, timestamps, and port numbers.
I consistently get only about 3x compression. Given the codecs applied, I expected a significantly higher compression ratio (e.g., 10x or more).
Could someone please help me understand why the compression factor is limited here? Are there other codecs, table structures, or best practices I should consider to improve compression on these types of columns?
Question
I consistently get only about 3x compression. Given the codecs applied, I expected a significantly higher compression ratio (e.g., 10x or more).
Could someone please help me understand why the compression factor is limited here? Are there other codecs, table structures, or best practices I should consider to improve compression on these types of columns?