-
Notifications
You must be signed in to change notification settings - Fork 94
Using templates to force unrolling and removing branches from hot loops #736
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Using templates to force unrolling and removing branches from hot loops #736
Conversation
Great, thanks! I guess this is with avx2 machine. Could you share the benchmark setup? I would like to run the same on rome and genoa nodes for once. |
|
I did |
ad86ffa to
491e690
Compare
|
Since 1D is the interesting part of comparing compiler's loop unrolling vs this PR's manually force loop unrolling, I tested 1D spreading on FI cluster's nodes: Rome, Genoa and Icelake with gcc/13.3.0. Each pts/s data is the average of 10 runs of Pasting the result plots here for future reference, the unrolling factor may need to be tweaked for different CPUs as the current PR's force unrolling factor runs better on Marco's laptop. |
|
Hi guys,
Thanks for trying that out Marco, and the detailed benchmarks, both of you.
It's looking like we should not pull this in yet, since it's no
significant change on workstations, apart from w=3 in 1D it slows down.
Also it adds a large chunk of custom templating in utils.h that I can't
understand, and just adds complexity.
We have to keep simplicity as a goal here, unless there are significant
speed gains across architectures.
Best, Alex
…On Fri, Oct 17, 2025 at 9:17 PM Libin Lu ***@***.***> wrote:
*lu1and10* left a comment (flatironinstitute/finufft#736)
<#736 (comment)>
Since 1D is the interesting part of comparing compiler's loop unrolling vs
this PR's manually force loop unrolling, I tested 1D spreading on FI
cluster's nodes: Rome, Genoa and Icelake with gcc/13.3.0.
Each pts/s data is the average of 10 runs of export OMP_NUM_THREADS=1;
taskset -c 1 ./spreadtestndall 1 1e8 xxx 1 where xxx is $N =
1e2,1e4,1e7,1e8$ and fixed $M = 1e8$.
Pasting the result plots here for future reference, the unrolling factor
may need to be tweaked for different CPUs as the current PR's force
unrolling factor runs better on Marco's laptop.
Rome:
compare_master_pr_rome.png (view on web)
<https://github.com/user-attachments/assets/adc973f9-399b-40b0-ad23-5c9b50843f77>
Genoa:
compare_master_pr_genoa.png (view on web)
<https://github.com/user-attachments/assets/17ef4635-971a-4ab4-bf91-af2bd53d51c1>
Icelake:
compare_master_pr_icelake.png (view on web)
<https://github.com/user-attachments/assets/12f436ee-ada1-41c6-929d-4f132f621a86>
—
Reply to this email directly, view it on GitHub
<#736 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACNZRSXGQOU35GO7AERZYOL3YGIJVAVCNFSM6AAAAACJKKUR4SVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTIMJXGY3DENZRG4>
.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
--
*-------------------------------------------------------------------~^`^~._.~'
|\ Alex Barnett Center for Computational Mathematics, Flatiron Institute
| \ http://users.flatironinstitute.org/~ahb 646-876-5942
|
|
Hi Alex, There are performance improvements that requires this PR. I have not integrated all the changes here to keep this PR lean and self contained. We happy to discuss this in person in more detail. Listing here for context to the GH users:
Thanks, |



This PR use templates to avoid branching and unrolls hot loops.
Performance comparison on my laptop:



In 2D and 3D there's no difference when I ran the tests many times it fluctuates one or another. In 1D the unrolling is a clear winner.
With AVX512 the number of instructions halves so we might just need to tune the unrolling in the future.
This PR is in preparation for the new version of xsimd. My plan is to tweak the unrolling factor once we update to that version.