@DiamonDinoia
Collaborator

This PR uses templates to avoid branching and unrolls hot loops.
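To illustrate the "templates to avoid branching" idea, here is a minimal C++17 sketch. It is not the code in this PR: `accumulate`, `accumulate_dispatch`, and the `Scale` flag are hypothetical names chosen for the example. The runtime flag is tested once outside the hot loop, and the loop body itself is selected at compile time with `if constexpr`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical kernel (not from the PR): add w[i] to out[i], optionally scaled.
// Scale is a template parameter, so `if constexpr` keeps only one of the two
// statements in each instantiation and the loop body contains no flag test.
template <bool Scale>
void accumulate(const std::vector<double>& w, double factor,
                std::vector<double>& out) {
  assert(w.size() == out.size());
  for (std::size_t i = 0; i < w.size(); ++i) {
    if constexpr (Scale) {
      out[i] += factor * w[i];
    } else {
      out[i] += w[i];
    }
  }
}

// The runtime flag is resolved once here, outside the hot loop.
inline void accumulate_dispatch(bool scale, const std::vector<double>& w,
                                double factor, std::vector<double>& out) {
  if (scale) {
    accumulate<true>(w, factor, out);
  } else {
    accumulate<false>(w, factor, out);
  }
}
```

Calling `accumulate_dispatch(flag, w, factor, out)` costs a single branch per call, at the price of one instantiation of the kernel per flag value.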

Performance comparison on my laptop:
[three benchmark plots]

In 2D and 3D there is no consistent difference: over many runs, the result fluctuates one way or the other. In 1D the unrolling is a clear winner.

With AVX512 the number of instructions halves, so we might only need to tune the unrolling factor in the future.

This PR is in preparation for the new version of xsimd. My plan is to tweak the unrolling factor once we update to that version.

@DiamonDinoia DiamonDinoia requested a review from lu1and10 October 16, 2025 03:36
@lu1and10
Member

> This PR use templates to avoid branching and unrolls hot loops.

Great, thanks! I guess this is on an AVX2 machine. Could you share the benchmark setup? I would like to run the same benchmark on Rome and Genoa nodes.

@DiamonDinoia
Collaborator Author

I ran `taskset -c 1 ./spreadtestndall 1 1e8 1e2 1 && taskset -c 1 ./spreadtestndall 2 1e8 1e3 1 && taskset -c 1 ./spreadtestndall 3 1e7 1e5 1` 5 times, averaged the reported pts/s (non-spread pts/s), and plotted the averages.

@DiamonDinoia DiamonDinoia force-pushed the feat/template-unrolling branch from ad86ffa to 491e690 Compare October 16, 2025 15:00
@lu1and10
Member

Since 1D is the interesting case for comparing the compiler's loop unrolling against this PR's manually forced loop unrolling, I tested 1D spreading on the FI cluster's Rome, Genoa, and Icelake nodes with gcc/13.3.0.

Each pts/s data point is the average of 10 runs of `export OMP_NUM_THREADS=1; taskset -c 1 ./spreadtestndall 1 1e8 xxx 1`, where xxx is $N = 1e2, 1e4, 1e7, 1e8$ and $M = 1e8$ is fixed.

I am pasting the result plots here for future reference. The unrolling factor may need to be tweaked for different CPUs, as the current PR's forced unrolling factor runs better on Marco's laptop.

Rome: [plot: compare_master_pr_rome]
Genoa: [plot: compare_master_pr_genoa]
Icelake: [plot: compare_master_pr_icelake]

@ahbarnett
Collaborator

ahbarnett commented Oct 20, 2025 via email

@DiamonDinoia
Collaborator Author

Hi Alex,

There are performance improvements that require this PR. I have not integrated all the changes here, to keep this PR lean and self-contained. Happy to discuss this in person in more detail.

Listing them here for context for GitHub users:

  1. the new xsimd version will reduce register utilization, which benefits unrolling
  2. reduced register spills in general
  3. experimenting with the compiler flags that affect register usage (`-fira-*`)
  4. forcing unrolling seems to benefit the GPU greatly, with reduced spills
  5. UNROLL can be tuned to be consistent across compilers and to reduce performance variability
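As a rough illustration of the last point, here is one way to make the unroll factor a compile-time parameter in C++17. This is a sketch only, not the PR's implementation: `unroll`, `unroll_impl`, and `sum_unrolled` are hypothetical names, and the summation loop stands in for a real hot loop. The fold expression expands the body `UNROLL` times in the source itself, so the expansion does not rely on each compiler's unrolling heuristics, although the optimizer still decides the final code.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Expands body(0), body(1), ..., body(N-1) at compile time via a fold
// expression over the comma operator.
template <typename F, std::size_t... I>
inline void unroll_impl(F&& body, std::index_sequence<I...>) {
  (body(I), ...);
}

template <std::size_t N, typename F>
inline void unroll(F&& body) {
  unroll_impl(body, std::make_index_sequence<N>{});
}

// Hypothetical hot loop (not from the PR): sum n doubles, UNROLL elements per
// trip, using UNROLL independent accumulators, then a scalar remainder loop.
template <std::size_t UNROLL>
double sum_unrolled(const double* x, std::size_t n) {
  static_assert(UNROLL > 0, "UNROLL must be positive");
  assert(x != nullptr || n == 0);
  double acc[UNROLL] = {};
  std::size_t i = 0;
  for (; i + UNROLL <= n; i += UNROLL) {
    unroll<UNROLL>([&](std::size_t k) { acc[k] += x[i + k]; });
  }
  double total = 0.0;
  unroll<UNROLL>([&](std::size_t k) { total += acc[k]; });
  for (; i < n; ++i) total += x[i];  // elements left over after full trips
  return total;
}
```

With this shape, trying a different factor is a one-token change at the call site, for example `sum_unrolled<4>(x, n)` versus `sum_unrolled<8>(x, n)`, which makes per-CPU tuning easy to benchmark.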

Thanks,
Marco
