
feat: cache efficient CPU kernel #23

Merged
ryan-berger merged 27 commits into main from rberger/cpu-clean on Jan 17, 2026

Conversation

ryan-berger (Contributor) commented Dec 21, 2025

The benchmark viewshed difference is almost imperceptible. This PR's viewshed is in green:

[image: benchmark viewshed comparison]

ryan-berger requested a review from tombh on December 21, 2025 02:53
ryan-berger (Contributor, Author) commented:

@tombh instead of a command line option, I am just conditionally compiling the vector length into the binary.

I don't think it actually makes sense to add an option that depends on the architecture; I think we just want to pick one.

I always want the fastest version supported by my architecture. Adding an option just confuses things and doesn't actually help me test, since I can conditionally compile these very easily anyway by adjusting Rust flags.

Let me know if you have any issues. I've also fixed a lot of the clippy lints, although many of those fixes just disable the lint and provide a reason.
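
A minimal sketch of that compile-time selection, assuming an illustrative `LANES` constant and feature gates (not necessarily this PR's actual code):

```rust
// Illustrative only: pick a lane count from whichever target features are
// enabled at compile time (e.g. by building with
// `RUSTFLAGS="-C target-cpu=native"`). The name `LANES` and the exact widths
// are assumptions for this sketch.
#[cfg(target_feature = "avx512f")]
pub const LANES: usize = 16; // 512-bit registers: 16 x f32

#[cfg(all(target_feature = "avx2", not(target_feature = "avx512f")))]
pub const LANES: usize = 8; // 256-bit registers: 8 x f32

#[cfg(not(any(target_feature = "avx512f", target_feature = "avx2")))]
pub const LANES: usize = 4; // SSE/NEON fallback: 4 x f32
```

Because the choice happens at compile time, there is no runtime flag: the binary simply gets the widest kernel the build target supports.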

tombh (Collaborator) commented Dec 21, 2025

There was just one tiny change I needed to get it to compile. But the heatmap seems to be messed up:

[image: total_surfaces heatmap]

That's for the Cardiff benchmark, but other .bt files suffer similarly.

I made PR #24 to add ARM building/testing, but it suddenly seems unhappy about the lints. I don't understand why your PR is fine but that one isn't. The lint messages seem correct, but it's as if Clippy is only now seeing all the avx512f-gated code.

ryan-berger force-pushed the rberger/cpu-clean branch 2 times, most recently from f5b0c4b to 91a580e on January 9, 2026 06:43
tombh (Collaborator) commented:

What do you think about ignoring this file in .gitignore? Are world runs harder without it? Or can we just add a line in Atlas that does the same thing?

ryan-berger (Contributor, Author) commented:

I personally prefer keeping it, only because we are "JIT"ing the program, so to speak (not in the traditional sense of the word), on the native machines before running it. If we were releasing a binary tool for anyone to use (i.e. publishing build artifacts), then we would want to disable it, since the binary would be built for whatever architecture the build machine happens to have, which would blow up in our faces.

ryan-berger (Contributor, Author) commented:

And to answer your question, world runs are a bit harder without it, but it could just be a line we add to Atlas with the correct RUSTFLAGS when we provision.

tombh (Collaborator) commented:

Or another approach is to both gitignore it and put a copy elsewhere in the repo. Then all Atlas has to do is move it to the right place.


ryan-berger commented Jan 10, 2026 via email


ryan-berger commented Jan 10, 2026 via email


tombh commented Jan 10, 2026

> I couldn't get main to recognize the module, and the examples I've read/some docs show it as a sibling:
> https://doc.rust-lang.org/rust-by-example/mod/split.html

Oh, I never knew that. And what was the issue with putting the definitions in main.rs?

> Yes... Unfortunately it's quite a few features. Generic const exprs and portable simd

It's not a problem at all, it's just good to comment in rust-toolchain.toml which features it is enabling.
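
For context, a minimal sketch of what those two nightly features enable, using a hypothetical lane-generic helper rather than the project's actual code:

```rust
// The crate-level attributes below are what the nightly toolchain makes
// available; `sum_lanes` is a made-up example of why they are wanted.
#![feature(portable_simd)]       // std::simd: vector types generic over lane count
#![feature(generic_const_exprs)] // const expressions such as `N * UNROLL` in types

use std::simd::{LaneCount, Simd, SupportedLaneCount};

// One kernel, compiled for any supported vector length N.
fn sum_lanes<const N: usize>(v: Simd<f32, N>) -> f32
where
    LaneCount<N>: SupportedLaneCount,
{
    v.to_array().iter().sum()
}

fn main() {
    let v = Simd::<f32, 8>::splat(1.0);
    assert_eq!(sum_lanes(v), 8.0);
}
```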

ryan-berger (Contributor, Author) commented:

> Oh, I never knew that. And what was the issue with putting the definitions in main.rs?

I bet that's probably the issue, yes. I'm not terribly concerned about that for this PR, but we could clean it up for sure.

ryan-berger requested a review from tombh on January 12, 2026 01:27
ryan-berger force-pushed the rberger/cpu-clean branch 2 times, most recently from aed1763 to 81cdf51 on January 17, 2026 22:54
ryan-berger and others added 9 commits January 17, 2026 15:24
* add inclusive prefix sum code

adds an inclusive prefix sum kernel that is generic over unroll factor and vector length

* only calculate data for items within the TVS' radius

* add filling in of elevations into kernel
This reduces the distance searched in every line of sight by one elevation. This is more accurate and consistent with other approaches.

No changes to viewsheds in tests or benchmarks. But total surfaces are
reduced.
* tests: Integrate viewshed tests into CPU kernel

Search for TODO@ryan for remaining tasks.

* inline both calls

* fix more rotation issues

* enable ring data feature for benchmark

* fmt

* fix some lints, reformat

* pass all unit tests, add in refraction

* put test above all on default vector length

* fix cfg blocks

* fix prefix max carry through

* fix formatting

* reignore tests

* fix non-sse build

* remove unnecessary carry through

---------

Co-authored-by: Ryan Berger <ryanbberger@gmail.com>
Because the TVS is only valid within a certain distance from the center, this
adds a better distance calculation which also helps with rasterization
ryan-berger and others added 17 commits January 17, 2026 15:29
Remove the old _CMP_GE which was used for the exclusive prefix sum code, as it was causing quite a few bugs _only_ in the AVX-512 kernel
The Vulkan kernel tallies total surface area against itself, creating a quadratic surface area (a nested sum: for each i < n it re-adds everything from j < i) rather than a linear one
Everything is just accumulated in `self.total_surfaces`.
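
A rough illustration of the two patterns, with made-up names (`ring_surfaces`); it sketches the shape of the bug, not the actual Vulkan or CPU code:

```rust
// Buggy pattern (quadratic): the running total is folded back into the tally
// on every step, so earlier surfaces get counted again and again.
fn tally_quadratic(ring_surfaces: &[f64]) -> f64 {
    let mut running = 0.0;
    let mut total = 0.0;
    for &s in ring_surfaces {
        running += s;
        total += running; // re-adds everything accumulated so far
    }
    total
}

// Intended pattern (linear): a single accumulator, as in `self.total_surfaces`.
fn tally_linear(ring_surfaces: &[f64]) -> f64 {
    ring_surfaces.iter().sum()
}
```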
They are so similar now that we should expect them to always produce
the benchmark viewshed within 1% difference.
Just an excuse to bump the CI cache for the Rust tests.
This is just about the edge case of handling equally long lines from
different angles. We always want the first angle to find the line to
win. The CPU kernel does this already. But the Vulkan kernel had
problems because of how forward and backward lines take it in turns per
sector.
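
A tiny sketch of that tie-break rule, with hypothetical names; the point is only that a strictly-greater comparison lets the first angle that found the line win:

```rust
// (angle_index, line_length) pairs, with candidates arriving in angle order.
// Using `>` rather than `>=` means an equally long line found at a later
// angle never replaces the one found first.
fn keep_longest(current: (usize, f32), candidate: (usize, f32)) -> (usize, f32) {
    if candidate.1 > current.1 {
        candidate
    } else {
        current
    }
}
```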
Currently we use the cargo toml option to guarantee good performance
on x86 machines, which is especially needed for world runs. We will
recommend it elsewhere, but make sure it is enabled on all Turin workers
After some thought about how the L1 cache behaves in our line of sight algorithm, this commit calculates all the angles and writes them into a buffer, and then an unrolled prefix max is calculated on top of that.

It ends up being much quicker on my i9-9900K, offering about a 20% speedup, and machines with larger L1 caches are expected to do even better.
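
A scalar sketch of that two-pass shape, assuming made-up names (`angles`, `prefix_max`) and a simplified angle formula; the real kernel is SIMD and also handles curvature and refraction:

```rust
// Pass 1: write every angle along the line of sight into one contiguous buffer.
// Pass 2: run an unrolled inclusive prefix max over that buffer.
// Both passes touch the same small buffer, so the working set stays in L1.
fn line_of_sight_prefix_max(elevations: &[f32], distances: &[f32]) -> Vec<f32> {
    let angles: Vec<f32> = elevations
        .iter()
        .zip(distances)
        .map(|(&e, &d)| e / d) // simplified stand-in for the real angle calculation
        .collect();

    let mut prefix_max = vec![f32::NEG_INFINITY; angles.len()];
    let mut carry = f32::NEG_INFINITY;
    let mut i = 0;

    let mut chunks = angles.chunks_exact(4);
    for chunk in &mut chunks {
        // Manually unrolled by 4; a SIMD version does the same per vector.
        carry = carry.max(chunk[0]); prefix_max[i] = carry;
        carry = carry.max(chunk[1]); prefix_max[i + 1] = carry;
        carry = carry.max(chunk[2]); prefix_max[i + 2] = carry;
        carry = carry.max(chunk[3]); prefix_max[i + 3] = carry;
        i += 4;
    }
    for &a in chunks.remainder() {
        carry = carry.max(a);
        prefix_max[i] = carry;
        i += 1;
    }
    prefix_max
}
```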
ryan-berger merged commit d326e9a into main on Jan 17, 2026
7 checks passed
ryan-berger deleted the rberger/cpu-clean branch on January 17, 2026 23:36