fix: Fix race condition in CUDA Simulator #690
The prior fix scanned each module's entire globals dict under lock on every run, and all modules shared a lock. This update only scans the globals dict on first entry for a module. Additionally, each module has its own lock, so a thread holding the lock in one module doesn't affect the launch of a thread for a function in another module.
Greptile Summary
Implemented per-module locking with reference counting to fix race condition in
Confidence Score: 4/5
Important Files Changed
Pull request overview
This PR fixes a race condition in the CUDA simulator's swapped_cuda_module context manager that occurs when multiple simulated threads simultaneously call device functions from the same Python module. The fix implements per-module locks and reference counting to ensure thread-safe swapping and restoration of module globals.
Key Changes:
- Added per-module locking mechanism using `defaultdict(threading.Lock)` to synchronize access to module globals
- Implemented reference counting to track the number of active threads using the fake CUDA module in each Python module
- Modified the swap logic so only the first entering thread performs the swap, and only the last exiting thread performs the restoration
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Additional Comments (1)
- numba_cuda/numba/cuda/simulator/kernelapi.py, line 493-504: logic: race condition: `defaultdict` access is not thread-safe when key doesn't exist. When multiple threads simultaneously call device functions from the same module for the first time, they race at line 504. Threads can create different lock objects for the same `gid`, defeating the per-module locking.
1 file reviewed, 1 comment
The previous commit introduced a global lock on creating the per-module lock. This prevented concurrent creation of locks, which could leave one thread holding a different lock from the others and modifying `fn_globs` or `_swap_refcount` in a race with other threads. Implementing this exposed yet another race: a thread could delete the lock from `_globals_locks` while another thread was already waiting at the entrance to the first `with lock:` statement. There is no need to delete a module's lock during runtime, so this commit simply removes the `del _globals_locks[gid]` statement.
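The two points in this commit message (create each per-module lock under a global guard, and never delete it) can be sketched as follows. This is a minimal illustration, not the code in `kernelapi.py`: `get_module_lock` and `_globals_locks_guard` are hypothetical names, and `_globals_locks` is modelled here as a plain dict.

```python
import threading

# Hypothetical names; the real module's identifiers may differ.
_globals_locks = {}                      # gid -> threading.Lock, never deleted
_globals_locks_guard = threading.Lock()  # serialises creation of per-module locks


def get_module_lock(gid):
    """Return the single lock for `gid`, creating it at most once."""
    with _globals_locks_guard:
        lock = _globals_locks.get(gid)
        if lock is None:
            lock = threading.Lock()
            _globals_locks[gid] = lock
    return lock


# Every thread asking for the same gid receives the same lock object.
results = []
workers = [
    threading.Thread(target=lambda: results.append(get_module_lock(42)))
    for _ in range(8)
]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(len({id(lock) for lock in results}))  # 1
```

Because entries are never removed, a thread that has already fetched a lock cannot end up waiting on an object that another thread has since replaced; the cost is one retained lock per module that uses the simulator.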
/ok to test 1d3a392
Additional Comments (1)
- numba_cuda/numba/cuda/simulator/kernelapi.py, line 531: logic: missing cleanup for `_globals_locks[gid]`. The lock itself is never deleted, causing memory leak for each unique module that uses the simulator.
1 file reviewed, 1 comment
/ok to test 8fb2b52
gmarkall left a comment
Thanks for the PR! I can't seem to find a way to hit the race with your reproducer, but the theory you outlined and the fix seem to make sense, so I'm comfortable with it.
I pushed a small fix so that the format check passes, and I'll set this to merge once the CI run is done.
1 file reviewed, 1 comment
    # Last thread to leave the context restores real cuda
    if _swap_refcount[gid] == 0:
        fn_globs.update(_swap_orig.pop(gid))
        del _swap_refcount[gid]
logic: memory leak: _globals_locks[gid] never cleaned up
Suggested change:
    - del _swap_refcount[gid]
    + del _swap_refcount[gid]
    + del _globals_locks[gid]
- Add Python 3.14 to the wheel publishing matrix (NVIDIA#750)
- feat: swap out internal device array usage with `StridedMemoryView` (NVIDIA#703)
- Fix max block size computation in `forall` (NVIDIA#744)
- Fix prologue debug line info pointing to decorator instead of def line (NVIDIA#746)
- Fix kernel return type in DISubroutineType debug metadata (NVIDIA#745)
- Fix missing line info in Jupyter notebooks (NVIDIA#742)
- Fix: Pass correct flags to linker when debugging in the presence of LTOIR code (NVIDIA#698)
- chore(deps): add cuda-pathfinder to pixi deps (NVIDIA#741)
- fix: enable flake8-bugbear lints and fix found problems (NVIDIA#708)
- fix: Fix race condition in CUDA Simulator (NVIDIA#690)
- ci: run tests in parallel (NVIDIA#740)
- feat: users can pass `shared_memory_carveout` to @cuda.jit (NVIDIA#642)
- Fix compatibility with NumPy 2.4: np.trapz and np.in1d removed (NVIDIA#739)
- Pass the -numba-debug flag to libnvvm (NVIDIA#681)
- ci: remove rapids containers from conda ci (NVIDIA#737)
- Use `pathfinder` for dynamic libraries (NVIDIA#308)
- CI: Add CUDA 13.1 testing support (NVIDIA#705)
- Adding `pixi run test` and `pixi run test-par` support (NVIDIA#724)
- Disable per-PR nvmath tests + follow same test practice (NVIDIA#723)
- chore(deps): regenerate pixi lockfile (NVIDIA#722)
- Fix DISubprogram line number to point to function definition line (NVIDIA#695)
- revert: chore(dev): build pixi using rattler (NVIDIA#713) (NVIDIA#719)
- [feat] Initial version of the Numba CUDA GDB pretty-printer (NVIDIA#692)
- chore(dev): build pixi using rattler (NVIDIA#713)
- build(deps): bump the actions-monthly group across 1 directory with 8 updates (NVIDIA#704)
Description
There is a race condition in the CUDA simulator, specifically in the `swapped_cuda_module` context manager.

I use the simulator for quick-running CI to avoid using up precious free GPU minutes. Occasionally, I get this error:
It is raised from a different thread each time. The error arose more commonly after I began allocating arrays in a small helper function in its own module. The error is similar to the one raised in numba/numba#1844.
Each thread in the simulator is a `threading.Thread` object, so they share memory. Every time a device function is called, it is wrapped in this context manager:

numba-cuda/numba_cuda/numba/cuda/simulator/kernelapi.py, lines 494 to 509 in aff41e9
Race:
Thread A and Thread B are executing device functions in the same Python module. They don't need to be the same function. They must be in a separate file from the kernel definition, as the kernel replaces references on entry, runs all threads, and restores only after all threads have exited.
`orig = {}` and `repl = {}`, as no references to cuda exist in its `__globals__` dict. Thread B yields. `cuda.local.array`, and sees replaced reference to numba.cuda. `local` is not imported as part of numba.cuda when `NUMBA_ENABLE_CUDASIM==1`, so the error is thrown.
MWE
The Gist below contains a script that reliably causes the error on my machine. It takes ~200s to hit the race on my machine, typically, so I have not added it to the test suite. It does seem to fail faster on xdist, but it has a very long runtime when it doesn't fail.
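Since the real race takes minutes to hit, the interleaving can also be replayed deterministically on a toy model. The sketch below is not the simulator's code: `REAL_CUDA` and `FAKE_CUDA` are stand-in objects, `naive_swap` is a hypothetical scan-swap-restore context manager, and the ordering of the steps (A enters, B enters, A exits while B is still running) is an assumption about how the two threads interleave.

```python
from contextlib import contextmanager

REAL_CUDA = object()  # stand-in for the real `numba.cuda` module
FAKE_CUDA = object()  # stand-in for the simulator's fake module


@contextmanager
def naive_swap(fn_globs):
    # Scan for references to the real module, swap them, restore on exit.
    orig = {k: v for k, v in fn_globs.items() if v is REAL_CUDA}
    fn_globs.update({k: FAKE_CUDA for k in orig})
    try:
        yield
    finally:
        fn_globs.update(orig)


# One globals dict, shared by every device function in the module.
module_globals = {"cuda": REAL_CUDA}

ctx_a = naive_swap(module_globals)
ctx_b = naive_swap(module_globals)

ctx_a.__enter__()                 # "Thread A": swaps cuda -> fake
ctx_b.__enter__()                 # "Thread B": nothing left to swap, orig == {}
ctx_a.__exit__(None, None, None)  # "Thread A": restores the real module
b_sees_real_module = module_globals["cuda"] is REAL_CUDA  # B is still inside
ctx_b.__exit__(None, None, None)

print(b_sees_real_module)  # True
```

In this model the second entrant is left looking at the real module while it is still inside its own context, which is the situation the description attributes the error to.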
Reproducer
Place all three files in the same directory, and run `cudasim_race_mwe`.
Fix
This PR implements a per-module lock and reference count, so that the first thread to enter the context for a module replaces cuda -> fake_cuda, and the last thread to exit restores fake_cuda -> cuda. There may be a performance hit for simulated kernels with many device function calls from many modules, but this should be small, as all threads except for the first to enter and the last to exit perform a single integer comparison and increment/decrement an integer counter under the lock. The short "benchmark" run in the MWE did not change duration between the patched and unpatched versions on my machine.