dec: adjust seqSymbol load on aarch64 #3141
Conversation
```diff
@@ -1170,9 +1170,19 @@ FORCE_INLINE_TEMPLATE seq_t
 ZSTD_decodeSequence(seqState_t* seqState, const ZSTD_longOffset_e longOffsets)
 {
     seq_t seq;
+#if defined(__aarch64__) && (defined(__GNUC__) && !defined(__clang__))
+    ZSTD_seqSymbol llDInfoS, mlDInfoS, ofDInfoS;
```
Maintenance:
Please explain in a code comment why this code block exists. Something equivalent to your explanation text in the PR will be good enough. The intention is for future maintainers to understand why this code block exists.
Good suggestion. Will update. Thanks.
Updated.
It's a pity that the code needs to be made more obscure to get better performance. Not sure if @terrelln wants to chime in, since it's an ARM optimization PR?

Why are you gating out clang?
clang on x86-64 also benefits from this patch by 2%. So please also take the copy approach if `defined(__clang__) || defined(__aarch64__)`, assuming you haven't measured a perf regression on aarch64 with clang.
```c
    ZSTD_seqSymbol llDInfoS, mlDInfoS, ofDInfoS;
    ZSTD_seqSymbol* const llDInfo = &llDInfoS;
    ZSTD_seqSymbol* const mlDInfo = &mlDInfoS;
    ZSTD_seqSymbol* const ofDInfo = &ofDInfoS;
    ZSTD_memcpy(llDInfo, seqState->stateLL.table + seqState->stateLL.state, sizeof(ZSTD_seqSymbol));
    ZSTD_memcpy(mlDInfo, seqState->stateML.table + seqState->stateML.state, sizeof(ZSTD_seqSymbol));
    ZSTD_memcpy(ofDInfo, seqState->stateOffb.table + seqState->stateOffb.state, sizeof(ZSTD_seqSymbol));
```
Does this give you the same codegen & performance? If so, I'd prefer this code over the `memcpy()` version; it is cleaner.
Suggested change:

```diff
-    ZSTD_seqSymbol llDInfoS, mlDInfoS, ofDInfoS;
-    ZSTD_seqSymbol* const llDInfo = &llDInfoS;
-    ZSTD_seqSymbol* const mlDInfo = &mlDInfoS;
-    ZSTD_seqSymbol* const ofDInfo = &ofDInfoS;
-    ZSTD_memcpy(llDInfo, seqState->stateLL.table + seqState->stateLL.state, sizeof(ZSTD_seqSymbol));
-    ZSTD_memcpy(mlDInfo, seqState->stateML.table + seqState->stateML.state, sizeof(ZSTD_seqSymbol));
-    ZSTD_memcpy(ofDInfo, seqState->stateOffb.table + seqState->stateOffb.state, sizeof(ZSTD_seqSymbol));
+    ZSTD_seqSymbol const llDInfoS = seqState->stateLL.table[seqState->stateLL.state];
+    ZSTD_seqSymbol const mlDInfoS = seqState->stateML.table[seqState->stateML.state];
+    ZSTD_seqSymbol const ofDInfoS = seqState->stateOffb.table[seqState->stateOffb.state];
+    ZSTD_seqSymbol* const llDInfo = &llDInfoS;
+    ZSTD_seqSymbol* const mlDInfo = &mlDInfoS;
+    ZSTD_seqSymbol* const ofDInfo = &ofDInfoS;
```
> clang on x86-64 also benefits from this patch by 2%. So please also take the copy approach if `defined(__clang__) || defined(__aarch64__)`, assuming you haven't measured a perf regression on aarch64 with clang.
Thanks for helping validate on x86, @terrelln. I tested the PR with both GCC-11 and Clang-13, and didn't observe the boost with Clang on either x86 or Arm. That's why I gated out clang.
I ran tests with more different Clang versions when you mentioned the change on x86. Here is the result I got on Xeon-5218 and Arm N1:
- Xeon-5218 (change with this PR)
| | Clang-11 | Clang-12 | Clang-13 | Clang-14 |
|---|---|---|---|---|
| dickens | -9.220% | 4.343% | -4.091% | -3.180% |
| mozilla | -5.432% | 1.582% | 0.829% | -1.509% |
| mr | -7.304% | 4.282% | -2.242% | -2.188% |
| nci | -6.520% | -0.238% | 1.944% | 0.482% |
| ooffice | -6.873% | 2.524% | 0.442% | 0.259% |
| osdb | -5.778% | 2.268% | -2.300% | -1.800% |
| reymont | -8.662% | 2.638% | -4.276% | -3.209% |
| samba | -7.967% | 1.613% | -2.012% | -0.796% |
| sao | -5.087% | 3.238% | -1.249% | -0.842% |
| webster | -5.776% | 1.850% | -1.919% | -10.305% |
| x-ray | -5.929% | 4.491% | -1.834% | -2.442% |
| xml | -8.442% | 0.708% | -1.194% | 0.823% |
- Arm N1 (change with this PR)
| | Clang-11 | Clang-12 | Clang-13 | Clang-14 |
|---|---|---|---|---|
| dickens | -1.868% | 0.544% | 0.172% | 0.172% |
| mozilla | -0.838% | 0.773% | 0.156% | 0.048% |
| mr | -0.406% | 0.472% | 0.333% | 0.450% |
| nci | -1.080% | 0.866% | 0.394% | -0.273% |
| ooffice | -1.612% | 0.556% | 0.377% | 0.174% |
| osdb | -1.299% | 0.628% | 0.218% | 0.100% |
| reymont | -0.338% | 0.661% | 0.138% | 0.159% |
| samba | 0.407% | 0.458% | 0.091% | 0.183% |
| sao | -0.222% | 0.630% | 0.400% | 0.130% |
| webster | -0.899% | 0.529% | 0.009% | 0.064% |
| x-ray | -0.487% | 0.722% | 0.148% | 0.409% |
| xml | -0.313% | 1.065% | 0.250% | 0.071% |
So from the data, only Clang-12 (noticeable uplift on x86) could be added. Please let me know if you'd prefer this way. Thanks.
> Does this give you the same codegen & performance? If so, I'd prefer this code over the `memcpy()` version; it is cleaner.
This doesn't give the same performance. It actually generates the same number of load operations as the original. You may refer to the codegen here: https://www.godbolt.org/z/5nc1rK5j3
Forgot to mention that all these tests were with level-8.
ZSTD_seqSymbol is a structure with a total width of 64 bits, so it can be loaded in one operation and its fields extracted by simple shifting or bit-extracting on aarch64. GCC doesn't recognize this and generates more unnecessary ldr/ldrb/ldrh operations that cause a performance drop. With this change, a 2%~4% uplift is observed on silesia and 2.5%~6% on cantrbry at level 8 on Arm N1.

Signed-off-by: Jun He <[email protected]>
Change-Id: I7748909204cf78a17eb9d4f2333692d53239daa8
Thanks for the PR @JunHe77!