Use AcceleratedKernels.mapreduce in max_scaled_speed and integrate_via_indices#2882

Merged
ranocha merged 15 commits into main from vc/ak on Apr 3, 2026

@vchuravy
Member

Demonstrate how to use AcceleratedKernels.

#2590 (comment)

There are several more places where we would need to do this,
but this is the crux of it.

fixes: #2823

@github-actions
Contributor

Review checklist

This checklist is meant to assist creators of PRs (to let them know what reviewers will typically look for) and reviewers (to guide them in a structured review process). Items do not need to be checked explicitly for a PR to be eligible for merging.

Purpose and scope

  • The PR has a single goal that is clear from the PR title and/or description.
  • All code changes represent a single set of modifications that logically belong together.
  • No more than 500 lines of code are changed or there is no obvious way to split the PR into multiple PRs.

Code quality

  • The code can be understood easily.
  • Newly introduced names for variables etc. are self-descriptive and consistent with existing naming conventions.
  • There are no redundancies that can be removed by simple modularization/refactoring.
  • There are no leftover debug statements or commented code sections.
  • The code adheres to our conventions and style guide, and to the Julia guidelines.

Documentation

  • New functions and types are documented with a docstring or top-level comment.
  • Relevant publications are referenced in docstrings (see example for formatting).
  • Inline comments are used to document longer or unusual code sections.
  • Comments describe intent ("why?") and not just functionality ("what?").
  • If the PR introduces a significant change or new feature, it is documented in NEWS.md with its PR number.

Testing

  • The PR passes all tests.
  • New or modified lines of code are covered by tests.
  • New or modified tests run in less than 10 seconds.

Performance

  • There are no type instabilities or memory allocations in performance-critical parts.
  • If the PR intent is to improve performance, before/after time measurements are posted in the PR.

Verification

  • The correctness of the code was verified using appropriate tests.
  • If new equations/methods are added, a convergence test has been run and the results
    are posted in the PR.

Created with ❤️ by the Trixi.jl community.

@vchuravy vchuravy force-pushed the vc/ak branch 2 times, most recently from f39512c to 203e311 Compare March 25, 2026 10:08
@codecov

codecov bot commented Mar 25, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.08%. Comparing base (ce719f3) to head (5a13214).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2882      +/-   ##
==========================================
+ Coverage   96.76%   97.08%   +0.32%     
==========================================
  Files         610      610              
  Lines       47500    47515      +15     
==========================================
+ Hits        45960    46128     +168     
+ Misses       1540     1387     -153     
Flag Coverage Δ
unittests 97.08% <100.00%> (+0.32%) ⬆️


@vchuravy vchuravy marked this pull request as ready for review March 25, 2026 12:49
@vchuravy vchuravy changed the title Use AcceleratedKernels.mapreduce max_scaled_speed Use AcceleratedKernels.mapreduce in max_scaled_speed Mar 25, 2026
@sloede
Member

sloede commented Mar 25, 2026

Thanks for this PR with a proof of concept. I'd hold off with merging this to #2590 and do this at a later stage, so as not to further delay the merge of #2590 to main.

@vchuravy vchuravy changed the title Use AcceleratedKernels.mapreduce in max_scaled_speed Use AcceleratedKernels.mapreduce in max_scaled_speed and integrate_via_indices Mar 25, 2026
Base automatically changed from feature-gpu-offloading to main March 26, 2026 16:48
@vchuravy vchuravy requested a review from benegee March 26, 2026 16:55
vchuravy and others added 6 commits March 27, 2026 08:29
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Member

@ranocha ranocha left a comment

Thanks! Do you have some benchmark results showing the impact of this?

Co-authored-by: Valentin Churavy <v.churavy@gmail.com>
@vchuravy
Member Author

One of the "annoying" things here is that mapreduce implies a contraction to a scalar; thus, at the end of the computation, we have to move data from the device to the host. There has been some discussion before to use an AsyncNumber type, but that wouldn't help here since we immediately use the values.
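The contraction to a scalar described above can be sketched on the CPU. This is a minimal illustration, not the code from this PR: it uses `Base.mapreduce` as a stand-in for `AcceleratedKernels.mapreduce` (whose exact call signature is not shown in this thread), and the arrays `speeds` and `inverse_lengths`, the `cfl` value, and the simplified step-size formula are all assumptions made for the example.

```julia
# Hypothetical per-element data (illustrative names, not Trixi.jl's):
# a wave-speed estimate and an inverse length scale for each element.
speeds = [1.5, 0.25, 3.0, 2.0]
inverse_lengths = [2.0, 8.0, 1.0, 4.0]

# Map each element index to its scaled speed, then reduce with `max`.
# `init` seeds the reduction; the result is one scalar on the host.
max_scaled_speed = mapreduce(i -> speeds[i] * inverse_lengths[i], max,
                             eachindex(speeds); init = 0.0)

# The scalar is used right away to choose the time step (simplified
# formula, assumed for illustration). Because the value is needed
# immediately, deferring the device-to-host transfer would gain nothing.
cfl = 0.5
dt = cfl / max_scaled_speed
```

With device arrays, the same map-then-reduce shape would run as a single kernel, and only the final scalar would cross from device to host.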

@benegee benegee mentioned this pull request Mar 29, 2026
18 tasks
@ranocha
Member

ranocha commented Mar 30, 2026

What is the status of this PR? Is it basically ready except for formatting issues and failing tests due to the OrdinaryDiffEqSDIRK problem?

@vchuravy vchuravy requested a review from ranocha April 2, 2026 09:22
@vchuravy
Member Author

vchuravy commented Apr 2, 2026

> Is it basically ready

Yeah, ready from my side.

@ranocha
Member

ranocha commented Apr 2, 2026

Can you please show some benchmarks comparing this to the current version on main?

@vchuravy
Member Author

vchuravy commented Apr 2, 2026

Using the profiler to look at how time is being spent: the previous implementation used a KernelAbstractions kernel (2.244 ms) plus a GPUArrays mapreduce (5 µs). The AcceleratedKernels implementation takes 2.274 ms (only one kernel launch).

Using the NVTX ranges from #2908
Before:

│    1.16% │   10.71 ms │     5 │   2.14 ms ± 0.01   (  2.13 ‥ 2.16)    │ Trixi.calculate dt       │

After:

│    1.09% │   10.74 ms │     5 │   2.15 ms ± 0.04   (  2.09 ‥ 2.2)     │ Trixi.calculate dt       │

I think I mistakenly said during our weekly meeting that the previous calculate_dt was host-based, but that is only true for the other code I rewrote as part of this PR.

@ranocha
Member

ranocha commented Apr 2, 2026

So it's within the error bars of having the same performance, right? Is it worth the additional dependency, and is it expected to be better in the long term?

@vchuravy
Member Author

vchuravy commented Apr 2, 2026

> Is it worth the additional dependency and expected to be better in the long term?

It allows us to use one consistent code pattern for reductions, so I would say the additional dependency is worth it.
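What "one consistent code pattern for reductions" might look like can be sketched as follows. This is only an illustration under stated assumptions: `reduce_over_elements` is a hypothetical helper, `Base.mapreduce` stands in for `AcceleratedKernels.mapreduce`, and the `weights` and `values` arrays are made-up data; none of these names come from the PR.

```julia
# Hypothetical helper: every reduction goes through the same
# map-then-reduce call, differing only in the mapped function `f`,
# the reduction operator `op`, and the seed value `init`.
reduce_over_elements(f, op, n; init) = mapreduce(f, op, 1:n; init = init)

# Made-up data: quadrature-style weights and nodal values.
weights = [0.5, 1.0, 0.5]
values = [2.0, 4.0, 6.0]

# A sum-type reduction, in the spirit of integrate_via_indices.
integral = reduce_over_elements(i -> weights[i] * values[i], +, 3; init = 0.0)

# A max-type reduction, in the spirit of max_scaled_speed.
peak = reduce_over_elements(i -> abs(values[i]), max, 3; init = 0.0)
```

Both reductions share one call shape, so swapping the backend implementation would only touch the helper.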

Member

@ranocha ranocha left a comment

Thanks!

@ranocha ranocha enabled auto-merge (squash) April 2, 2026 15:05
@ranocha ranocha disabled auto-merge April 3, 2026 07:24
@ranocha ranocha enabled auto-merge (squash) April 3, 2026 16:27
@ranocha ranocha merged commit 9fc914f into main Apr 3, 2026
41 checks passed
@ranocha ranocha deleted the vc/ak branch April 3, 2026 17:22
Development

Successfully merging this pull request may close these issues.

GPU-compatible reduction of speeds in step size computation

5 participants