common: apply two stage copy to aarch64 #3145

JunHe77 · 2022-05-26T06:41:54Z

On aarch64, ZSTD_wildcopy uses a simple loop to do
16B-based memory copy. There is an existing optimized
two-stage copy that can achieve better performance.
Applying it to aarch64 gives an observed ~1% uplift
on the silesia corpus.

Signed-off-by: Jun He [email protected]
Change-Id: Ic1253308e7a8a7df2d08963ba544e086c81ce8be

Cyan4973 · 2022-06-02T17:07:54Z

Regarding aarch64, could you provide some details about the exact type of CPU / SoC this patch has been benchmarked on?

JunHe77 · 2022-06-03T04:37:24Z

Hi @Cyan4973, the patch has been benchmarked on Arm N1/A72/A57 platforms, with a similar uplift observed.

Cyan4973 · 2022-06-06T22:51:49Z

I can't remember why this code was added here.

It could be that, with aarch64 being merely an instruction set under which so many different SoCs are built, maybe some of them (non N1) would prefer the first loop.
However, it's pretty hard to confirm / test (couldn't find a test platform where this holds true).

From what I can see, the second formulation just separates the first branch from later ones, so that it can have its own statistics (as opposed to being merged with other loop iterations). Such a construction is expected to be rather good in the context of wildcopy, essentially distinguishing small copies from larger ones. This should translate into almost always better performance, except maybe for specific systems with rather poor branch predictors.

So I'm gonna make an educated guess here and state that this PR tends to improve the situation, on top of simplifying it by removing a weird and poorly documented corner case.
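For readers following the thread, the two loop shapes being compared can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the actual zstd source: the names `copy16`, `wildcopy_simple`, and `wildcopy_two_stage` are hypothetical, and the sketch assumes non-overlapping buffers with at least 32 bytes of slack after `length`, since both forms may read and write past the requested end.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy exactly 16 bytes. */
static void copy16(unsigned char *dst, const unsigned char *src) {
    memcpy(dst, src, 16);
}

/* Single-loop form: every 16-byte step shares the same loop branch,
 * so short and long copies feed the same branch statistics.
 * May write up to 15 bytes past dst + length. */
static void wildcopy_simple(unsigned char *dst, const unsigned char *src,
                            size_t length) {
    unsigned char *const end = dst + length;
    do {
        copy16(dst, src);
        dst += 16;
        src += 16;
    } while (dst < end);
}

/* Two-stage form: the first 16 bytes are copied up front and guarded by
 * their own branch, which separates short copies (<= 16 bytes) from long
 * ones. Longer copies continue in a loop unrolled to 32 bytes per pass.
 * May write up to 31 bytes past dst + length. */
static void wildcopy_two_stage(unsigned char *dst, const unsigned char *src,
                               size_t length) {
    unsigned char *const end = dst + length;
    copy16(dst, src);
    if (length <= 16) return;   /* stage 1: short copy finishes here */
    dst += 16;
    src += 16;
    do {                        /* stage 2: remaining bytes, 32 per pass */
        copy16(dst, src); dst += 16; src += 16;
        copy16(dst, src); dst += 16; src += 16;
    } while (dst < end);
}
```

Both functions produce the same first `length` bytes in the destination; the difference lies only in how the branches are laid out, which is the property discussed in the comment above.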

JunHe77 · 2022-06-07T08:19:40Z

Thank you @Cyan4973 for the review. I checked the log of that change (969ba4f) but could not find its context. It looks like it was designed for compression. In my benchmarks here, I found no compression regression after removing the aarch64-specific part on N1/A72/A57.

terrelln · 2022-06-09T20:38:20Z

Thanks for the PR @JunHe77!

facebook-github-bot added the CLA Signed label May 26, 2022

embg self-assigned this Jun 1, 2022

terrelln self-assigned this Jun 2, 2022

Cyan4973 approved these changes Jun 6, 2022

embg removed their assignment Jun 9, 2022

terrelln merged commit 3b915cd into facebook:dev Jun 9, 2022

Cyan4973 mentioned this pull request Feb 9, 2023

release v1.5.4 #3487

Merged

JunHe77 deleted the wildcopy branch March 12, 2023 07:15
