
dec: adjust seqSymbol load on aarch64 #3141

Merged: 1 commit, merged into facebook:dev on Jun 9, 2022
Conversation

@JunHe77 (Contributor) commented May 23, 2022

ZSTD_seqSymbol is a structure 64 bits wide in total, so on aarch64 it can be loaded in a single operation and its fields extracted by simple shift or bit-extract instructions.
GCC doesn't recognize this and generates additional unnecessary ldr/ldrb/ldrh operations that cause a performance drop.
With this change, a 2%-4% uplift is observed on silesia and a 2.5%-6% uplift on cantrbry at level 8 on an Arm N1.

Signed-off-by: Jun He [email protected]
Change-Id: I7748909204cf78a17eb9d4f2333692d53239daa8

@@ -1170,9 +1170,19 @@ FORCE_INLINE_TEMPLATE seq_t
ZSTD_decodeSequence(seqState_t* seqState, const ZSTD_longOffset_e longOffsets)
{
seq_t seq;
#if defined(__aarch64__) && (defined(__GNUC__) && !defined(__clang__))
ZSTD_seqSymbol llDInfoS, mlDInfoS, ofDInfoS;
@Cyan4973 (Contributor) commented May 24, 2022:
Maintenance:
Please explain in a code comment why this code block exists.
Something equivalent to your explanation in the PR description would be good enough.

The intention is for future maintainers to understand why this code block exists.

@JunHe77 (Contributor, Author) replied:

Good suggestion. Will update. Thanks.

@JunHe77 (Contributor, Author) replied:

Updated.

@Cyan4973 (Contributor):

It's a pity that the code needs to be made more obscure to get better performance, but we know such cases unfortunately do exist.

Not sure if @terrelln wants to chime in, since it's an ARM optimization PR?

@terrelln (Contributor):

Why are you gating on !defined(__clang__)? Did you find a performance drop on clang, or did you just not measure it?

@terrelln (Contributor) left a comment:

clang on x86-64 also benefits from this patch by 2%. So please also take the copy approach if defined(__clang__) || defined(__aarch64__), assuming you haven't measured a perf regression on aarch64 with clang.

Comment on lines +1174 to +1188
ZSTD_seqSymbol llDInfoS, mlDInfoS, ofDInfoS;
ZSTD_seqSymbol* const llDInfo = &llDInfoS;
ZSTD_seqSymbol* const mlDInfo = &mlDInfoS;
ZSTD_seqSymbol* const ofDInfo = &ofDInfoS;
ZSTD_memcpy(llDInfo, seqState->stateLL.table + seqState->stateLL.state, sizeof(ZSTD_seqSymbol));
ZSTD_memcpy(mlDInfo, seqState->stateML.table + seqState->stateML.state, sizeof(ZSTD_seqSymbol));
ZSTD_memcpy(ofDInfo, seqState->stateOffb.table + seqState->stateOffb.state, sizeof(ZSTD_seqSymbol));
@terrelln (Contributor) commented:

Does this give you the same codegen & performance? If so, I'd prefer this code over the memcpy() version, it is cleaner.

Suggested change
ZSTD_seqSymbol llDInfoS, mlDInfoS, ofDInfoS;
ZSTD_seqSymbol* const llDInfo = &llDInfoS;
ZSTD_seqSymbol* const mlDInfo = &mlDInfoS;
ZSTD_seqSymbol* const ofDInfo = &ofDInfoS;
ZSTD_memcpy(llDInfo, seqState->stateLL.table + seqState->stateLL.state, sizeof(ZSTD_seqSymbol));
ZSTD_memcpy(mlDInfo, seqState->stateML.table + seqState->stateML.state, sizeof(ZSTD_seqSymbol));
ZSTD_memcpy(ofDInfo, seqState->stateOffb.table + seqState->stateOffb.state, sizeof(ZSTD_seqSymbol));
ZSTD_seqSymbol const llDInfoS = seqState->stateLL.table[seqState->stateLL.state];
ZSTD_seqSymbol const mlDInfoS = seqState->stateML.table[seqState->stateML.state];
ZSTD_seqSymbol const ofDInfoS = seqState->stateOffb.table[seqState->stateOffb.state];
ZSTD_seqSymbol* const llDInfo = &llDInfoS;
ZSTD_seqSymbol* const mlDInfo = &mlDInfoS;
ZSTD_seqSymbol* const ofDInfo = &ofDInfoS;

@JunHe77 (Contributor, Author) replied:

> clang on x86-64 also benefits from this patch by 2%. So please also take the copy approach if defined(__clang__) || defined(__aarch64__), assuming you haven't measured a perf regression on aarch64 with clang.

Thanks for helping validate on x86, @terrelln. I tested the PR with both GCC-11 and Clang-13, and didn't observe the boost with Clang on either x86 or Arm. That's why I gated out clang.
After you mentioned the change on x86, I ran tests with more Clang versions. Here are the results on Xeon-5218 and Arm N1:

• Xeon-5218 (change with this PR)

File      Clang-11   Clang-12   Clang-13   Clang-14
dickens   -9.220%     4.343%    -4.091%    -3.180%
mozilla   -5.432%     1.582%     0.829%    -1.509%
mr        -7.304%     4.282%    -2.242%    -2.188%
nci       -6.520%    -0.238%     1.944%     0.482%
ooffice   -6.873%     2.524%     0.442%     0.259%
osdb      -5.778%     2.268%    -2.300%    -1.800%
reymont   -8.662%     2.638%    -4.276%    -3.209%
samba     -7.967%     1.613%    -2.012%    -0.796%
sao       -5.087%     3.238%    -1.249%    -0.842%
webster   -5.776%     1.850%    -1.919%   -10.305%
x-ray     -5.929%     4.491%    -1.834%    -2.442%
xml       -8.442%     0.708%    -1.194%     0.823%
• Arm N1 (change with this PR)

File      Clang-11   Clang-12   Clang-13   Clang-14
dickens   -1.868%     0.544%     0.172%     0.172%
mozilla   -0.838%     0.773%     0.156%     0.048%
mr        -0.406%     0.472%     0.333%     0.450%
nci       -1.080%     0.866%     0.394%    -0.273%
ooffice   -1.612%     0.556%     0.377%     0.174%
osdb      -1.299%     0.628%     0.218%     0.100%
reymont   -0.338%     0.661%     0.138%     0.159%
samba      0.407%     0.458%     0.091%     0.183%
sao       -0.222%     0.630%     0.400%     0.130%
webster   -0.899%     0.529%     0.009%     0.064%
x-ray     -0.487%     0.722%     0.148%     0.409%
xml       -0.313%     1.065%     0.250%     0.071%

So from the data, only Clang-12 (noticeable uplift on x86) could be added. Please let me know if you'd prefer that. Thanks.

> Does this give you the same codegen & performance? If so, I'd prefer this code over the memcpy() version, it is cleaner.

This doesn't give the same performance. It actually generates the same number of load operations as the original code. You can see the codegen here: https://www.godbolt.org/z/5nc1rK5j3

@JunHe77 (Contributor, Author) replied:

Forgot to mention: all these tests were run at level 8.

@terrelln (Contributor) commented Jun 9, 2022:

Thanks for the PR @JunHe77!

@terrelln terrelln merged commit 3b1bd91 into facebook:dev Jun 9, 2022
@Cyan4973 Cyan4973 mentioned this pull request Feb 9, 2023