Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dec: adjust seqSymbol load on aarch64 #3141

Merged
merged 1 commit into from
Jun 9, 2022
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 18 additions & 0 deletions lib/decompress/zstd_decompress_block.c
Original file line number Diff line number Diff line change
Expand Up @@ -1170,9 +1170,27 @@ FORCE_INLINE_TEMPLATE seq_t
ZSTD_decodeSequence(seqState_t* seqState, const ZSTD_longOffset_e longOffsets)
{
seq_t seq;
/*
* ZSTD_seqSymbol is a structure with a total of 64 bits wide. So it can be
* loaded in one operation and extracted its fields by simply shifting or
* bit-extracting on aarch64.
* GCC doesn't recognize this and generates more unnecessary ldr/ldrb/ldrh
* operations that cause performance drop. This can be avoided by using this
* ZSTD_memcpy hack.
*/
#if defined(__aarch64__) && (defined(__GNUC__) && !defined(__clang__))
ZSTD_seqSymbol llDInfoS, mlDInfoS, ofDInfoS;
Copy link
Contributor

@Cyan4973 Cyan4973 May 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maintenance :
Please explain in code comment why this code block exists.
Something equivalent to your explanation text in the PR will be good enough.

The intention is for future maintainers to understand why this code block exist.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion. Will update. Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

ZSTD_seqSymbol* const llDInfo = &llDInfoS;
ZSTD_seqSymbol* const mlDInfo = &mlDInfoS;
ZSTD_seqSymbol* const ofDInfo = &ofDInfoS;
ZSTD_memcpy(llDInfo, seqState->stateLL.table + seqState->stateLL.state, sizeof(ZSTD_seqSymbol));
ZSTD_memcpy(mlDInfo, seqState->stateML.table + seqState->stateML.state, sizeof(ZSTD_seqSymbol));
ZSTD_memcpy(ofDInfo, seqState->stateOffb.table + seqState->stateOffb.state, sizeof(ZSTD_seqSymbol));
Comment on lines +1182 to +1188
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this give you the same codegen & performance? If so, I'd prefer this code over the memcpy() version, it is cleaner.

Suggested change
ZSTD_seqSymbol llDInfoS, mlDInfoS, ofDInfoS;
ZSTD_seqSymbol* const llDInfo = &llDInfoS;
ZSTD_seqSymbol* const mlDInfo = &mlDInfoS;
ZSTD_seqSymbol* const ofDInfo = &ofDInfoS;
ZSTD_memcpy(llDInfo, seqState->stateLL.table + seqState->stateLL.state, sizeof(ZSTD_seqSymbol));
ZSTD_memcpy(mlDInfo, seqState->stateML.table + seqState->stateML.state, sizeof(ZSTD_seqSymbol));
ZSTD_memcpy(ofDInfo, seqState->stateOffb.table + seqState->stateOffb.state, sizeof(ZSTD_seqSymbol));
ZSTD_seqSymbol const llDInfoS = seqState->stateLL.table[seqState->stateLL.state];
ZSTD_seqSymbol const mlDInfoS = seqState->stateML.table[seqState->stateML.state];
ZSTD_seqSymbol const ofDInfoS = seqState->stateOffb.table[seqState->stateOffb.state];
ZSTD_seqSymbol* const llDInfo = &llDInfoS;
ZSTD_seqSymbol* const mlDInfo = &mlDInfoS;
ZSTD_seqSymbol* const ofDInfo = &ofDInfoS;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang on x86-64 also benefits from this patch by 2%. So please also take the copy approach if defined(__clang__) || defined(__aarch64__), assuming you haven't measured a perf regression on aarch64 with clang.

Thanks for help validation on x86, @terrelln. I tested the PR with both GCC-11 and Clang-13, and didn't observe the boost with Clang on both x86 and Arm. That's why I gated clang.
I ran tests with more different Clang versions when you mentioned the change on x86. Here is the result I got on Xeon-5218 and Arm N1:

  • Xeon-5218 (change with this PR)
Clang-11 Clang-12 Clang-13 Clang-14
dickens -9.220% 4.343% -4.091% -3.180%
mozilla -5.432% 1.582% 0.829% -1.509%
mr -7.304% 4.282% -2.242% -2.188%
nci -6.520% -0.238% 1.944% 0.482%
ooffice -6.873% 2.524% 0.442% 0.259%
osdb -5.778% 2.268% -2.300% -1.800%
reymont -8.662% 2.638% -4.276% -3.209%
samba -7.967% 1.613% -2.012% -0.796%
sao -5.087% 3.238% -1.249% -0.842%
webster -5.776% 1.850% -1.919% -10.305%
x-ray -5.929% 4.491% -1.834% -2.442%
xml -8.442% 0.708% -1.194% 0.823%
  • Arm N1 (change with this PR)
Clang-11 Clang-12 Clang-13 Clang-14
dickens -1.868% 0.544% 0.172% 0.172%
mozilla -0.838% 0.773% 0.156% 0.048%
mr -0.406% 0.472% 0.333% 0.450%
nci -1.080% 0.866% 0.394% -0.273%
ooffice -1.612% 0.556% 0.377% 0.174%
osdb -1.299% 0.628% 0.218% 0.100%
reymont -0.338% 0.661% 0.138% 0.159%
samba 0.407% 0.458% 0.091% 0.183%
sao -0.222% 0.630% 0.400% 0.130%
webster -0.899% 0.529% 0.009% 0.064%
x-ray -0.487% 0.722% 0.148% 0.409%
xml -0.313% 1.065% 0.250% 0.071%

So from the data only Clang-12 (noticeable uplift on x86) could be added. Pls let me know if you'd prefer this way. Thanks.

Does this give you the same codegen & performance? If so, I'd prefer this code over the memcpy() version, it is cleaner.

This doesn't give the same performance. It actually same number of load operations as the original one. You may refer the codegen here: https://www.godbolt.org/z/5nc1rK5j3

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to mention that all these tests were with level-8.

#else
const ZSTD_seqSymbol* const llDInfo = seqState->stateLL.table + seqState->stateLL.state;
const ZSTD_seqSymbol* const mlDInfo = seqState->stateML.table + seqState->stateML.state;
const ZSTD_seqSymbol* const ofDInfo = seqState->stateOffb.table + seqState->stateOffb.state;
#endif
seq.matchLength = mlDInfo->baseValue;
seq.litLength = llDInfo->baseValue;
{ U32 const ofBase = ofDInfo->baseValue;
Expand Down