-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Software pipeline for ZSTD_compressBlock_fast_dictMatchState (+5-6% compression speed) #3086
Conversation
0.1% ratio loss for ~5% speed win |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this looks really solid! Well done.
I left some comments and questions, which we can discuss sometime next week.
As a minor coding style comment : This may seem a "minor event" (just one keyword difference), but it's not : in most cases, bugs involve mutable variables. When a variable is This is a balance though. I'm not saying that mutable variables are forbidden. They are necessary. But because they add a "burden cost" to maintenance, adding one should be pondered with caution, and additional efforts to keep them In this PR, I only found 2 cases where variables could be made |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Updated PR summary with results from full benchmarking suite and discussion of good and bad future directions for this work. |
Summary
Using similar techniques as #2749, improves compression speed by 5-6% on a dataset of HTML headers between 0 and 8KB (the dataset has ratio 2.32 w/o dictionary, 3.41 with dictionary). This PR doesn't specifically address the cold dictionary scenario, which is something I'd like to explore in the future, but it does improve cold dictionary performance (as well as hot dictionary performance, see plots below).
Plan to merge
@Cyan4973 , @felixhandte , and I are in agreement regarding which measurements are required to merge this change. Because the full suite of measurements will take some work to produce, I am only including two experiments for now. Once I get all nits and feedback resolved, and have a final version of the code ready to merge, I will do the full suite and post results in this thread.
Ratio regression
This change regresses compression ratio by 0.1% on the html8K dataset (4350234 bytes -> 4354095 bytes). This is because we no longer search the very last position. It may be possible to fix this without killing the perf win -- however I think @felixhandte struggled to do this for noDict, so I didn't attempt it for now. Let me know if this regression is a blocker.EDIT: 6cba817 fixes the ratio regression (at least on html8K) with no performance cost! Thanks @felixhandte for pointing out this fix.
It is still possible that this PR will cause zstd not to search the last position -- this would occur in a situation where the matchfinder is in accelerated search mode when it hits the end of the input. However, across the entire html8K dataset, this change in the parsing strategy did not regress compressed size by even a single byte. So there is essentially no ratio tradeoff with this PR.
Preliminary results
Cold dictionary
Hot dictionary
Final results
I benchmarked on a Intel(R) Xeon(R) D-2191A CPU @ 1.60GHz CentOS machine with core isolation and turbo disabled. I measured all combinations of the following variables:
I modified the largeNbDicts benchmark to append the max speed over 10 iterations to a CSV file. I ran each scenario 10 times interleaved (bash script) to eliminate bias from background processes competing for main memory / L3 bandwidth. I then used the bootstrap method to compute confidence intervals for the percent increase in speed,
(optimized / dev) - 1
(python script).Summary of results
Regressions
All scenarios are speed-positive except github-users with 4K dict. The latter scenarios have small regressions, between 0.2% and 1.6% depending on dict temperature. This is due to the extremely high compression ratio on github-users (which reduced the win for all github-users dict sizes). Every time a match is found, the pipeline has to reset, killing the wins from pipelining when there are no matches found. Most real-world datasets have much lower compression ratio, and most real-world dictionaries are much larger than 4K. So the regression only affects an extremely unrealistic use-case.
Again: html8K was speed-positive across all scenarios. github-users with 32K and 110K dicts were speed-positive across all scenarios. This regression is only for github-users with 4K dict size.
Wins
All html8K scenarios are 5%+ wins on gcc. For html8K with 32K dictionary size, the gcc wins are 7%+ for all temperatures. The largest win I measured was
9.8% (± 0.2%)
on html_8K/cold/110K with gcc.clang is speed positive across all scenarios, but wins are more like 1-3% (this makes sense, since I only measured gcc in my exploration).
github-users wins for 32K dict and 110K dict are also around 1-3%. As discussed above, this is due to the extremely high compression ratio of github-users, which is not representative of most real-world data.
Full data
Rejected optimizations
Future work
I did not try deeper pipelines, or combining pipelining with loop unrolling (as we have in noDict). It is completely possible that either of those directions could lead to bigger wins. Since I hit the noDict wins with a less complex pipeline (and even exceeded them in some important scenarios), I did not try these more complex optimizations.
Conclusion
There are big wins for html8K, and every realistic scenario I tested is speed-positive. Even for an unrealistic worst-case scenario (github-hot-4K-gcc), the speed regression was only 1.6%. The ratio regression has been addressed. This PR is ready to merge!