Feature request: fast inflate that discards data, to find length #860
Replies: 18 comments
-
It would still need to partially reconstruct the decompressed data to guard against corrupted or invalid input, but copying the decompressed data to the user-supplied output buffer could be skipped to improve speed.
-
It seems like that could already be accomplished with inflateBack().
-
Calling a callback function would probably be even slower than using a normal optimized memcpy(), because callback functions need their own stack frame.
-
We can check to see if the callback is NULL.
-
@nmoinvaz That would actually make it slightly faster when the callback is NULL, even though it introduces a conditional jump. I just don't know how big a speed penalty it would incur if the callback is non-NULL.
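A minimal sketch of the NULL-callback idea being discussed (the helper names `deliver`, `out_cb`, and `count_cb` are hypothetical, not zlib-ng API): when no callback is registered, the copy/dispatch is skipped entirely and only the byte count is maintained.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical output hook: return nonzero to abort, as inflateBack() does. */
typedef int (*out_cb)(void *desc, const unsigned char *buf, unsigned len);

/* Deliver a chunk of decompressed output, or merely account for it when
   cb is NULL. Returns the number of bytes accepted, 0 on abort. */
static unsigned deliver(out_cb cb, void *desc,
                        const unsigned char *buf, unsigned len) {
    if (cb != NULL && cb(desc, buf, len) != 0)
        return 0;            /* callback asked us to stop */
    return len;              /* NULL callback: bytes are counted, not copied */
}

/* Example callback that tallies the bytes it is handed. */
static int count_cb(void *desc, const unsigned char *buf, unsigned len) {
    (void)buf;
    *(unsigned long *)desc += len;
    return 0;
}
```

The extra branch is what the trade-off above refers to: one well-predicted conditional jump per output chunk in exchange for skipping the copy when the caller only wants lengths.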
-
Yeah, it is definitely a trade-off. It ends up making things a bit faster for one scenario and a bit slower for another. Either way, I think @joshtriplett should be able to use inflateBack().
-
@mtl1979 wrote:
That's not quite what I'm looking for. If I need to validate the data, I can do a full inflate; the optimization of avoiding the final copy might be nice, but generally if I'm validating the data I'll also want to do something else with it, such as hash it, so I need the data anyway. In this case, I really do just want to know the length of the compressed data stream, so that I can skip past it and start processing the next record.
I'd like to avoid not just the overhead of the final copy, but also the overhead of the internal copies to process backreferences. If it'd be trivial to do, reconstructing the match-length and summing up the uncompressed length would be nice. Beyond that, I'd rather skip any checks for the validity of the deflate stream, if they'd add any overhead at all.
-
@nmoinvaz I had a quick glimpse at inflateBack(), and no-op'ing the output callback would require more than just checking that the output callback is non-NULL, as the code blindly writes to the output buffer until it's full...
-
@joshtriplett I think the easiest way would be to duplicate inflateBack() and strip out anything that writes data to the output window. Some lines would need to be modified to make sure the input pointer is advanced even after the copying and the advancing of the output pointer are removed. I'm not sure what the best way is to test that the modified code actually works as expected, except linking the code against git...
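To illustrate the kind of edit being described (these helpers are my own illustration, not inflateBack()'s actual inner loop): a back-reference of length `len` is normally copied byte-by-byte through the window, and the skip variant has to keep the same position bookkeeping while dropping the memory traffic.

```c
#include <assert.h>

/* Copying variant, in the spirit of inflateBack()'s core: resolve a
   back-reference of `len` bytes at distance `dist` through a circular
   window of size `wsize` (a power of two), returning the new position. */
static unsigned long emit_match_copy(unsigned char *window, unsigned wsize,
                                     unsigned long pos, unsigned dist,
                                     unsigned len) {
    while (len--) {
        window[pos % wsize] = window[(pos - dist) % wsize];
        pos++;
    }
    return pos;
}

/* Skip variant: identical position bookkeeping, no memory traffic.
   This is the part that must stay in lockstep with the copying code. */
static unsigned long emit_match_skip(unsigned long pos, unsigned len) {
    return pos + len;
}
```

The point is that both variants must advance the stream position identically; only the stores into the window disappear.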
-
@mtl1979 I'd be happy to give that a try and benchmark it. (I'm sure there's a more maintainable way to avoid the code duplication, but it'll help to have numbers for how much this could help first.)
-
@joshtriplett We have used various tricks to avoid code duplication elsewhere in the code, but for initial testing it is easier to just duplicate the code... When there are just two alternatives, it doesn't make sense to use macros to hide the differences, especially when the code already uses a lot of macros...
-
Some preliminary results using a large (914MB compressed, 2592MB uncompressed) DEFLATE stream:
Not as much of a speedup as I'd expected, but still fast enough to be worthwhile, and it seems quite likely that it could be optimized further. Here's the WIP patch/hack: zlib-ng-skipout.patch.txt
Thoughts?
Also, in the course of working on this, I found an unrelated optimization, which I'll send a PR for.
-
I am noticing that in
Also, I am wondering: what is the size of the input buffer given to inflateBack()?
Nice work on the PR. I will try and do some performance testing on it soon.
-
@nmoinvaz I gave inflateBack all of the input data at once, and made the input callback a no-op.
-
@joshtriplett I have done some benchmarking on the PR. Do you plan on trying to take out the copying of data to
-
@nmoinvaz I didn't touch the
-
I ran across this in the zlib FAQ, which may be related/useful:
... Alternatively, you can scan a deflate stream once to generate an index, and then use that index for random access.
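The index the FAQ alludes to is fleshed out in zlib's examples/zran.c. The core idea is a list of access points, each pairing a compressed offset with an uncompressed offset plus the 32K of history needed to restart inflation there via inflateSetDictionary(). A rough sketch of the data structure and lookup (field and function names are mine, not zran.c's):

```c
#include <assert.h>
#include <stddef.h>

#define WINSIZE 32768U  /* DEFLATE's maximum back-reference distance */

/* One point in the stream where inflation can be restarted. */
typedef struct {
    unsigned long comp_off;        /* position in the compressed stream  */
    unsigned long uncomp_off;      /* corresponding uncompressed offset  */
    unsigned char window[WINSIZE]; /* history to prime inflateSetDictionary() */
} access_point;

/* Find the last access point at or before the wanted uncompressed
   offset, or NULL if the offset precedes every point. */
static const access_point *find_point(const access_point *idx, size_t n,
                                      unsigned long want) {
    const access_point *best = NULL;
    for (size_t i = 0; i < n; i++)
        if (idx[i].uncomp_off <= want)
            best = &idx[i];
    return best;
}
```

Building such an index still requires one full decompression pass, which is exactly the pass this feature request wants to make cheaper.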
-
I repeated performance tests related to deflate decompression performance on Apple Silicon (MacBook Air) using the system-provided
See this comment for more details.
To my mind, the efficient handling of decompressing small streams is vital to further improving on
It is surprising to me how git can be this fast, given that it appears to only use zlib without any special handling.
Edit: This morning I woke up realising that
-
Some file formats, such as the git packfile format, have a series of DEFLATE-compressed streams within them, and don't include lengths that would allow skipping over them to the next piece of data. Git's index file format provides those offsets, but when creating that index file, you have to walk over the entire pack file, inflating each object sequentially, in order to find where the next object starts. That's a major bottleneck in some common git operations.
Would it be possible to have a function in zlib-ng that would do the fastest possible inflate of a DEFLATE-compressed stream without actually reconstructing the decompressed data? This could just decode the block headers, decode any dynamic Huffman tables, skip over literal symbols and match/distance pairs, and look for the end-of-block symbol. No decompressed data, no window of data for backreferences, no copy operations, just looking for the end of each block as fast as possible.
If the resulting function ended up being substantially faster, would that be a reasonable thing to add?