
Conversation

@ccam80
Contributor

@ccam80 ccam80 commented Jan 4, 2026

Description

There is a race condition in the CUDA simulator, specifically in the swapped_cuda_module context manager.

I use the simulator for quick-running CI to avoid using up precious free GPU minutes. Occasionally, I get this error:

AttributeError: tid=[0, 13, 0] ctaid=[0, 0, 0]: module 'numba.cuda' has no attribute 'local'

It is raised from a different thread each time. The error arose more commonly after I began allocating arrays in a small helper function in its own module. The error is similar to the one raised in numba/numba#1844.

Each thread in the simulator is a threading.Thread object, so they share memory. Every time a device function is called, it is wrapped in this context manager:

@contextmanager
def swapped_cuda_module(fn, fake_cuda_module):
    from numba import cuda
    fn_globs = fn.__globals__
    # get all globals that is the "cuda" module
    orig = dict((k, v) for k, v in fn_globs.items() if v is cuda)
    # build replacement dict
    repl = dict((k, fake_cuda_module) for k, v in orig.items())
    # replace
    fn_globs.update(repl)
    try:
        yield
    finally:
        # revert
        fn_globs.update(orig)

Race:

Thread A and Thread B are executing device functions defined in the same Python module. They don't need to be the same function, but they must be in a separate file from the kernel definition, as the kernel replaces references in its own module on entry, runs all threads, and restores them only after all threads have exited.

  1. Thread A launches, swaps numba.cuda for fake_cuda in the module's __globals__, and yields.
  2. Thread B launches and gets orig = {} and repl = {}, since Thread A has already replaced every reference to cuda in the shared __globals__ dict. Thread B yields.
  3. Thread A exits, replacing fake_cuda with numba.cuda.
  4. Thread B calls e.g. cuda.local.array and sees the restored reference to numba.cuda. local is not imported as part of numba.cuda when NUMBA_ENABLE_CUDASIM==1, so the error is thrown.

MWE

The Gist below contains a script that reliably causes the error on my machine. It typically takes ~200s to hit the race, so I have not added it to the test suite. It does seem to fail faster under xdist, but it has a very long runtime when it doesn't fail.

Reproducer
Place all three files in the same directory, and run cudasim_race_mwe.
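The Gist contents are not reproduced here. As a purely illustrative sketch (not the author's MWE; file, module, and function names are hypothetical), a setup of roughly this shape exercises the race under NUMBA_ENABLE_CUDASIM=1: a device function that touches cuda.local.array lives in a helper module separate from the kernel, and the kernel is launched repeatedly with many threads.

# helpers.py (hypothetical): device function in its own module
from numba import cuda, float32

@cuda.jit(device=True)
def fill_local(out, i):
    # cuda.local is resolved through this module's globals, which is the
    # dict that swapped_cuda_module swaps on every device function call
    tmp = cuda.local.array(4, float32)
    for j in range(4):
        tmp[j] = i + j
    out[i] = tmp[0]

# run_mwe.py (hypothetical): kernel in a different module, launched repeatedly
import numpy as np
from numba import cuda
from helpers import fill_local

@cuda.jit
def kernel(out):
    i = cuda.grid(1)
    if i < out.size:
        fill_local(out, i)

if __name__ == "__main__":
    out = np.zeros(256, dtype=np.float32)
    # Each simulated thread is a threading.Thread, so repeated launches
    # eventually interleave badly enough to raise the AttributeError above.
    for _ in range(10_000):
        kernel[4, 64](out)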

Fix

This PR implements a per-module lock and reference count, so that the first thread to enter the context for a module replaces cuda -> fake_cuda, and the last thread to exit restores fake_cuda -> cuda. There may be a performance hit for simulated kernels with many device function calls from many modules, but it should be small: all threads except the first entrant and the last to exit perform only a single integer comparison and an increment/decrement of an integer counter under the lock. The short "benchmark" run in the MWE did not change duration between the patched and unpatched versions on my machine.

@contextmanager
def swapped_cuda_module(fn, fake_cuda_module):
    from numba import cuda
    fn_globs = fn.__globals__
    gid = id(fn_globs)

    # Use a per-module lock to avoid cross-locking other modules
    lock = _globals_locks[gid]

    with lock:
        # Scan and replace globals with fake module on first entrance only
        if _swap_refcount[gid] == 0:
            orig = {k: v for k, v in fn_globs.items() if v is cuda}
            _swap_orig[gid] = orig
            for k in orig:
                fn_globs[k] = fake_cuda_module

        # Increment the reference counter on every entrance
        _swap_refcount[gid] += 1
    try:
        yield
    finally:
        with lock:
            # Decrement the "threads using fake CUDA in this module" counter on exit
            _swap_refcount[gid] -= 1

            # Last thread to leave the context restores real cuda
            if _swap_refcount[gid] == 0:
                fn_globs.update(_swap_orig.pop(gid))
                del _swap_refcount[gid]
                del _globals_locks[gid]
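The snippet above relies on module-level bookkeeping that is not shown. A minimal sketch of those definitions, using the names from the snippet and the review comments (the exact definitions in the PR may differ):

import threading
from collections import defaultdict

# One lock per module, keyed by id(fn.__globals__)
_globals_locks = defaultdict(threading.Lock)

# Number of simulator threads currently inside the swap for each module
_swap_refcount = defaultdict(int)

# Original {name: numba.cuda} globals entries, restored by the last thread out
_swap_orig = {}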

ccam80 added 2 commits January 4, 2026 20:38
The prior fix scanned each module's entire globals dict under lock on every run, and all modules shared a lock. This update only scans the globals dict on first entry for a module. Additionally, each module has its own lock, so a thread holding the lock in one module doesn't affect the launch of a thread for a function in another module.
Copilot AI review requested due to automatic review settings January 4, 2026 23:38
@copy-pr-bot

copy-pr-bot bot commented Jan 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps
Contributor

greptile-apps bot commented Jan 4, 2026

Greptile Summary

Implemented per-module locking with reference counting to fix race condition in swapped_cuda_module context manager. The fix prevents threads from interfering with each other's module swaps by tracking how many threads are using each module's globals and only performing the swap on first entry and restoration on last exit.

  • Added _locks_register_lock to safely create per-module locks
  • Added _globals_locks defaultdict to store per-module locks
  • Added _swap_refcount to track number of active threads per module
  • Added _swap_orig to store original module references
  • Modified swapped_cuda_module to use reference counting pattern

Confidence Score: 4/5

  • Safe to merge after fixing the lock cleanup - the core race condition fix is sound
  • The threading logic correctly addresses the race condition with proper locking and reference counting, but there's a memory leak where _globals_locks[gid] is never deleted when refcount reaches zero
  • numba_cuda/numba/cuda/simulator/kernelapi.py needs the lock cleanup fix at line 534

Important Files Changed

Filename: numba_cuda/numba/cuda/simulator/kernelapi.py
Overview: Added per-module locking and reference counting to fix race condition in context manager, but missing cleanup of lock registry causing memory leak

Contributor

Copilot AI left a comment

Pull request overview

This PR fixes a race condition in the CUDA simulator's swapped_cuda_module context manager that occurs when multiple simulated threads simultaneously call device functions from the same Python module. The fix implements per-module locks and reference counting to ensure thread-safe swapping and restoration of module globals.

Key Changes:

  • Added per-module locking mechanism using defaultdict(threading.Lock) to synchronize access to module globals
  • Implemented reference counting to track the number of active threads using the fake CUDA module in each Python module
  • Modified the swap logic so only the first entering thread performs the swap, and only the last exiting thread performs the restoration


Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Contributor

@greptile-apps greptile-apps bot left a comment

Additional Comments (1)

  1. numba_cuda/numba/cuda/simulator/kernelapi.py, lines 493-504

    Logic: race condition: defaultdict access is not thread-safe when the key doesn't exist.

    When multiple threads simultaneously call device functions from the same module for the first time, they race at line 504. Threads can create different lock objects for the same gid, defeating the per-module locking.

1 file reviewed, 1 comment


ccam80 added 3 commits January 6, 2026 12:04
The previous commit introduced a global lock on creating the per-module lock. This prevents concurrent lock creation from giving one thread a different lock from the others, which would let it modify `fn_globs` or `_swap_refcount` in a race with other threads. Implementing this exposed yet another race: a thread could delete the lock from `_globals_locks` while another thread was already waiting at the entrance to the first `with lock:` statement. There is no need to delete a module's lock during runtime, so this commit simply removes the `del _globals_locks[gid]` statement.
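A minimal sketch of the registration pattern this commit message describes, assuming `_locks_register_lock` is a plain module-level threading.Lock and using a hypothetical helper name (the PR's actual code may differ):

import threading

_locks_register_lock = threading.Lock()
_globals_locks = {}  # per-module locks, keyed by id(fn.__globals__)

def _get_module_lock(gid):
    # Serialise lock creation so every thread entering the context for a given
    # module ends up holding the same lock object; the lock is never deleted,
    # so no thread can be left waiting on a stale lock.
    with _locks_register_lock:
        lock = _globals_locks.get(gid)
        if lock is None:
            lock = threading.Lock()
            _globals_locks[gid] = lock
        return lock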
@gmarkall gmarkall added the 3 - Ready for Review Ready for review by team label Jan 21, 2026
@gmarkall
Contributor

/ok to test 1d3a392

Contributor

@greptile-apps greptile-apps bot left a comment


Additional Comments (1)

  1. numba_cuda/numba/cuda/simulator/kernelapi.py, line 531

    Logic: missing cleanup for _globals_locks[gid] - the lock itself is never deleted, causing a memory leak for each unique module that uses the simulator

1 file reviewed, 1 comment


@gmarkall
Contributor

/ok to test 8fb2b52

Contributor

@gmarkall gmarkall left a comment


Thanks for the PR! I can't seem to find a way to hit the race with your reproducer, but the theory you outlined and the fix seem to make sense, so I'm comfortable with it.

I pushed a small fix so that the format check passes, and I'll set this to merge once the CI run is done.

@gmarkall gmarkall enabled auto-merge (squash) January 22, 2026 13:42
Contributor

@greptile-apps greptile-apps bot left a comment


1 file reviewed, 1 comment


# Last thread to leave the context restores real cuda
if _swap_refcount[gid] == 0:
    fn_globs.update(_swap_orig.pop(gid))
    del _swap_refcount[gid]
Contributor


Logic: memory leak: _globals_locks[gid] is never cleaned up

Suggested change:
-    del _swap_refcount[gid]
+    del _swap_refcount[gid]
+    del _globals_locks[gid]

@gmarkall gmarkall added 4 - Waiting on CI Waiting for a CI run to finish successfully and removed 3 - Ready for Review Ready for review by team labels Jan 22, 2026
@gmarkall gmarkall merged commit d41b90d into NVIDIA:main Jan 22, 2026
103 of 105 checks passed
gmarkall added a commit to gmarkall/numba-cuda that referenced this pull request Jan 27, 2026
- Add Python 3.14 to the wheel publishing matrix (NVIDIA#750)
- feat: swap out internal device array usage with `StridedMemoryView` (NVIDIA#703)
- Fix max block size computation in `forall` (NVIDIA#744)
- Fix prologue debug line info pointing to decorator instead of def line (NVIDIA#746)
- Fix kernel return type in DISubroutineType debug metadata (NVIDIA#745)
- Fix missing line info in Jupyter notebooks (NVIDIA#742)
- Fix: Pass correct flags to linker when debugging in the presence of LTOIR code (NVIDIA#698)
- chore(deps): add cuda-pathfinder to pixi deps (NVIDIA#741)
- fix: enable flake8-bugbear lints and fix found problems (NVIDIA#708)
- fix: Fix race condition in CUDA Simulator (NVIDIA#690)
- ci: run tests in parallel (NVIDIA#740)
- feat: users can pass `shared_memory_carveout` to @cuda.jit (NVIDIA#642)
- Fix compatibility with NumPy 2.4: np.trapz and np.in1d removed (NVIDIA#739)
- Pass the -numba-debug flag to libnvvm (NVIDIA#681)
- ci: remove rapids containers from conda ci (NVIDIA#737)
- Use `pathfinder` for dynamic libraries (NVIDIA#308)
- CI: Add CUDA 13.1 testing support (NVIDIA#705)
- Adding `pixi run test` and `pixi run test-par` support (NVIDIA#724)
- Disable per-PR nvmath tests + follow same test practice (NVIDIA#723)
- chore(deps): regenerate pixi lockfile (NVIDIA#722)
- Fix DISubprogram line number to point to function definition line (NVIDIA#695)
- revert: chore(dev): build pixi using rattler (NVIDIA#713) (NVIDIA#719)
- [feat] Initial version of the Numba CUDA GDB pretty-printer (NVIDIA#692)
- chore(dev): build pixi using rattler (NVIDIA#713)
- build(deps): bump the actions-monthly group across 1 directory with 8 updates (NVIDIA#704)
@gmarkall gmarkall mentioned this pull request Jan 27, 2026
kkraus14 pushed a commit that referenced this pull request Jan 28, 2026

