dec: adjust seqSymbol load on aarch64 #3141
Conversation
```diff
@@ -1170,9 +1170,19 @@ FORCE_INLINE_TEMPLATE seq_t
 ZSTD_decodeSequence(seqState_t* seqState, const ZSTD_longOffset_e longOffsets)
 {
     seq_t seq;
+#if defined(__aarch64__) && (defined(__GNUC__) && !defined(__clang__))
+    ZSTD_seqSymbol llDInfoS, mlDInfoS, ofDInfoS;
```
Maintenance:
Please explain in a code comment why this code block exists. Something equivalent to your explanation text in the PR will be good enough. The intention is for future maintainers to understand why this code block exists.
Good suggestion. Will update. Thanks.
Updated.
It's a pity that the code needs to be made more obscure to get better performance. Not sure if @terrelln wants to chime in, since it's an ARM optimization PR?

Why are you gating out clang?
clang on x86-64 also benefits from this patch by 2%. So please also take the copy approach if `defined(__clang__) || defined(__aarch64__)`, assuming you haven't measured a perf regression on aarch64 with clang.
```c
    ZSTD_seqSymbol llDInfoS, mlDInfoS, ofDInfoS;
    ZSTD_seqSymbol* const llDInfo = &llDInfoS;
    ZSTD_seqSymbol* const mlDInfo = &mlDInfoS;
    ZSTD_seqSymbol* const ofDInfo = &ofDInfoS;
    ZSTD_memcpy(llDInfo, seqState->stateLL.table + seqState->stateLL.state, sizeof(ZSTD_seqSymbol));
    ZSTD_memcpy(mlDInfo, seqState->stateML.table + seqState->stateML.state, sizeof(ZSTD_seqSymbol));
    ZSTD_memcpy(ofDInfo, seqState->stateOffb.table + seqState->stateOffb.state, sizeof(ZSTD_seqSymbol));
```
Does this give you the same codegen & performance? If so, I'd prefer this code over the `memcpy()` version; it is cleaner.
Suggested change:

```diff
-    ZSTD_seqSymbol llDInfoS, mlDInfoS, ofDInfoS;
-    ZSTD_seqSymbol* const llDInfo = &llDInfoS;
-    ZSTD_seqSymbol* const mlDInfo = &mlDInfoS;
-    ZSTD_seqSymbol* const ofDInfo = &ofDInfoS;
-    ZSTD_memcpy(llDInfo, seqState->stateLL.table + seqState->stateLL.state, sizeof(ZSTD_seqSymbol));
-    ZSTD_memcpy(mlDInfo, seqState->stateML.table + seqState->stateML.state, sizeof(ZSTD_seqSymbol));
-    ZSTD_memcpy(ofDInfo, seqState->stateOffb.table + seqState->stateOffb.state, sizeof(ZSTD_seqSymbol));
+    ZSTD_seqSymbol const llDInfoS = seqState->stateLL.table[seqState->stateLL.state];
+    ZSTD_seqSymbol const mlDInfoS = seqState->stateML.table[seqState->stateML.state];
+    ZSTD_seqSymbol const ofDInfoS = seqState->stateOffb.table[seqState->stateOffb.state];
+    ZSTD_seqSymbol* const llDInfo = &llDInfoS;
+    ZSTD_seqSymbol* const mlDInfo = &mlDInfoS;
+    ZSTD_seqSymbol* const ofDInfo = &ofDInfoS;
```
> clang on x86-64 also benefits from this patch by 2%. So please also take the copy approach if `defined(__clang__) || defined(__aarch64__)`, assuming you haven't measured a perf regression on aarch64 with clang.
Thanks for helping validate on x86, @terrelln. I tested the PR with both GCC-11 and Clang-13, and didn't observe the boost with Clang on either x86 or Arm. That's why I gated out clang.
I ran tests with more different Clang versions when you mentioned the change on x86. Here is the result I got on Xeon-5218 and Arm N1:
- Xeon-5218 (change with this PR)
| | Clang-11 | Clang-12 | Clang-13 | Clang-14 |
|---|---|---|---|---|
| dickens | -9.220% | 4.343% | -4.091% | -3.180% |
| mozilla | -5.432% | 1.582% | 0.829% | -1.509% |
| mr | -7.304% | 4.282% | -2.242% | -2.188% |
| nci | -6.520% | -0.238% | 1.944% | 0.482% |
| ooffice | -6.873% | 2.524% | 0.442% | 0.259% |
| osdb | -5.778% | 2.268% | -2.300% | -1.800% |
| reymont | -8.662% | 2.638% | -4.276% | -3.209% |
| samba | -7.967% | 1.613% | -2.012% | -0.796% |
| sao | -5.087% | 3.238% | -1.249% | -0.842% |
| webster | -5.776% | 1.850% | -1.919% | -10.305% |
| x-ray | -5.929% | 4.491% | -1.834% | -2.442% |
| xml | -8.442% | 0.708% | -1.194% | 0.823% |
- Arm N1 (change with this PR)
| | Clang-11 | Clang-12 | Clang-13 | Clang-14 |
|---|---|---|---|---|
| dickens | -1.868% | 0.544% | 0.172% | 0.172% |
| mozilla | -0.838% | 0.773% | 0.156% | 0.048% |
| mr | -0.406% | 0.472% | 0.333% | 0.450% |
| nci | -1.080% | 0.866% | 0.394% | -0.273% |
| ooffice | -1.612% | 0.556% | 0.377% | 0.174% |
| osdb | -1.299% | 0.628% | 0.218% | 0.100% |
| reymont | -0.338% | 0.661% | 0.138% | 0.159% |
| samba | 0.407% | 0.458% | 0.091% | 0.183% |
| sao | -0.222% | 0.630% | 0.400% | 0.130% |
| webster | -0.899% | 0.529% | 0.009% | 0.064% |
| x-ray | -0.487% | 0.722% | 0.148% | 0.409% |
| xml | -0.313% | 1.065% | 0.250% | 0.071% |
So from the data, only Clang-12 (noticeable uplift on x86) could be added. Please let me know if you'd prefer this way. Thanks.
> Does this give you the same codegen & performance? If so, I'd prefer this code over the `memcpy()` version; it is cleaner.
This doesn't give the same performance. It actually generates the same number of load operations as the original. You may refer to the codegen here: https://www.godbolt.org/z/5nc1rK5j3
Forgot to mention that all these tests were with level-8.
ZSTD_seqSymbol is a structure with a total width of 64 bits, so it can be loaded in one operation and its fields extracted by simple shifting or bit-extracting on aarch64. GCC doesn't recognize this and generates more unnecessary ldr/ldrb/ldrh operations that cause a performance drop. With this change, a 2%~4% uplift is observed on silesia and 2.5%~6% on cantrbry at level 8 on Arm N1.

Signed-off-by: Jun He <[email protected]>
Change-Id: I7748909204cf78a17eb9d4f2333692d53239daa8
Thanks for the PR @JunHe77!