Using templates to force unrolling and removing branches from hot loops #736

DiamonDinoia · 2025-10-16T03:35:47Z

This PR use templates to avoid branching and unrolls hot loops.

Performance comparison on my laptop:

In 2D and 3D there's no difference when I ran the tests many times it fluctuates one or another. In 1D the unrolling is a clear winner.

With AVX512 the number of instructions halves so we might just need to tune the unrolling in the future.

This PR is in preparation for the new version of xsimd. My plan is to tweak the unrolling factor once we update to that version.

lu1and10 · 2025-10-16T04:43:08Z

This PR use templates to avoid branching and unrolls hot loops.

Great, thanks! I guess this is with avx2 machine. Could you share the benchmark setup? I would like to run the same on rome and genoa nodes for once.

DiamonDinoia · 2025-10-16T14:54:47Z

I did taskset -c 1 ./spreadtestndall 1 1e8 1e2 1 && taskset -c 1 ./spreadtestndall 2 1e8 1e3 1 && taskset -c 1 ./spreadtestndall 3 1e7 1e5 1 5 times, averaged the results pts/s non spread pts/s and plotted them

lu1and10 · 2025-10-18T01:16:51Z

Since 1D is the interesting part of comparing compiler's loop unrolling vs this PR's manually force loop unrolling, I tested 1D spreading on FI cluster's nodes: Rome, Genoa and Icelake with gcc/13.3.0.

Each pts/s data is the average of 10 runs of export OMP_NUM_THREADS=1; taskset -c 1 ./spreadtestndall 1 1e8 xxx 1 where xxx is $N = 1e2,1e4,1e7,1e8$ and fixed $M = 1e8$.

Pasting the result plots here for future reference, the unrolling factor may need to be tweaked for different CPUs as the current PR's force unrolling factor runs better on Marco's laptop.

Rome:

Genoa:

Icelake:

ahbarnett · 2025-10-20T17:49:26Z

Hi guys, Thanks for trying that out Marco, and the detailed benchmarks, both of you. It's looking like we should not pull this in yet, since it's no significant change on workstations, apart from w=3 in 1D it slows down. Also it adds a large chunk of custom templating in utils.h that I can't understand, and just adds complexity. We have to keep simplicity as a goal here, unless there are significant speed gains across architectures. Best, Alex

…

On Fri, Oct 17, 2025 at 9:17 PM Libin Lu ***@***.***> wrote: *lu1and10* left a comment (flatironinstitute/finufft#736) <#736 (comment)> Since 1D is the interesting part of comparing compiler's loop unrolling vs this PR's manually force loop unrolling, I tested 1D spreading on FI cluster's nodes: Rome, Genoa and Icelake with gcc/13.3.0. Each pts/s data is the average of 10 runs of export OMP_NUM_THREADS=1; taskset -c 1 ./spreadtestndall 1 1e8 xxx 1 where xxx is $N = 1e2,1e4,1e7,1e8$ and fixed $M = 1e8$. Pasting the result plots here for future reference, the unrolling factor may need to be tweaked for different CPUs as the current PR's force unrolling factor runs better on Marco's laptop. Rome: compare_master_pr_rome.png (view on web) <https://github.com/user-attachments/assets/adc973f9-399b-40b0-ad23-5c9b50843f77> Genoa: compare_master_pr_genoa.png (view on web) <https://github.com/user-attachments/assets/17ef4635-971a-4ab4-bf91-af2bd53d51c1> Icelake: compare_master_pr_icelake.png (view on web) <https://github.com/user-attachments/assets/12f436ee-ada1-41c6-929d-4f132f621a86> — Reply to this email directly, view it on GitHub <#736 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACNZRSXGQOU35GO7AERZYOL3YGIJVAVCNFSM6AAAAACJKKUR4SVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTIMJXGY3DENZRG4> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

-- *-------------------------------------------------------------------~^`^~._.~' |\ Alex Barnett Center for Computational Mathematics, Flatiron Institute | \ http://users.flatironinstitute.org/~ahb 646-876-5942

DiamonDinoia · 2025-10-20T18:59:15Z

Hi Alex,

There are performance improvements that requires this PR. I have not integrated all the changes here to keep this PR lean and self contained. We happy to discuss this in person in more detail.

Listing here for context to the GH users:

the new xsimd version will reduce register utilization which benefit unrolling
reduced register spills in general
experiment with compiler flags that handle register usage -fira-*
it seems that forcing unrolling benefits GPU greatly with reduced spills
it is possible to tune UNROLL to be consistent across compiler and reduce performance variability

Thanks,
Marco

DiamonDinoia requested a review from lu1and10 October 16, 2025 03:36

Using templates to force unrolling and removing branches from hot loops

491e690

DiamonDinoia force-pushed the feat/template-unrolling branch from ad86ffa to 491e690 Compare October 16, 2025 15:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Using templates to force unrolling and removing branches from hot loops #736

Using templates to force unrolling and removing branches from hot loops #736

Uh oh!

DiamonDinoia commented Oct 16, 2025

Uh oh!

lu1and10 commented Oct 16, 2025

Uh oh!

DiamonDinoia commented Oct 16, 2025

Uh oh!

lu1and10 commented Oct 18, 2025

Uh oh!

ahbarnett commented Oct 20, 2025 via email

Uh oh!

DiamonDinoia commented Oct 20, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Using templates to force unrolling and removing branches from hot loops #736

Are you sure you want to change the base?

Using templates to force unrolling and removing branches from hot loops #736

Uh oh!

Conversation

DiamonDinoia commented Oct 16, 2025

Uh oh!

lu1and10 commented Oct 16, 2025

Uh oh!

DiamonDinoia commented Oct 16, 2025

Uh oh!

lu1and10 commented Oct 18, 2025

Uh oh!

ahbarnett commented Oct 20, 2025 via email

Uh oh!

DiamonDinoia commented Oct 20, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants