68 changes: 68 additions & 0 deletions LAUNCH-CONFIG-CODEX-PROMPT.md
@@ -0,0 +1,68 @@
# Codex Prompt: Launch Config Sensitive (LC-S) work for cuda.coop

You are resuming launch-config work in `numba-cuda` to support cuda.coop
single-phase. The cache invalidation change for `numba_cuda.__version__` is
already in a separate PR; do not redo that here.

## Repos / worktrees
- `numba-cuda` (current worktree): `/home/trentn/src/280-launch-config-v2`
- branch: `280-launch-config-v2`
- `numba-cuda` main baseline: `/home/trentn/src/numba-cuda-main`
- cuda.coop repo: `/home/trentn/src/cccl/python/cuda_cccl`
- see `SINGLE-PHASE-*.md` for context

## Current local state (numba-cuda)
Run:
- `git status -sb`
- `git diff`

Expected (uncommitted) changes in this worktree:
- `numba_cuda/numba/cuda/compiler.py`
- CUDABackend sets `state.metadata["launch_config_sensitive"] = True`
when the active launch config is explicitly marked.
- `numba_cuda/numba/cuda/dispatcher.py`
- `_LaunchConfiguration` adds explicit API:
`mark_kernel_as_launch_config_sensitive()`, `get_kernel_launch_config_sensitive()`,
`is_kernel_launch_config_sensitive()`.
- `scripts/bench-launch-overhead.py`
- import compatibility for `numba.cuda.core.config` vs `numba.core.config`.
- Untracked: `PR.md`, `tags` (clean up before commit).
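The explicit LC-S API and the `CUDABackend` hook described above can be modeled with a minimal self-contained sketch. Only the three helper names and the `state.metadata` key come from this document; the class body, the `_lcs` attribute, and the `SimpleNamespace` stand-in for the compiler state are illustrative assumptions, not the actual numba-cuda implementation:

```python
from types import SimpleNamespace


class LaunchConfiguration:
    """Simplified stand-in for numba-cuda's _LaunchConfiguration LC-S API."""

    def __init__(self):
        self._lcs = False  # assumed internal flag, not the real attribute name

    def mark_kernel_as_launch_config_sensitive(self):
        # Explicitly flag this launch config as launch-config sensitive.
        self._lcs = True

    def get_kernel_launch_config_sensitive(self):
        return self._lcs

    def is_kernel_launch_config_sensitive(self):
        return self._lcs


def backend_hook(state, active_config):
    # Mirrors the described CUDABackend behavior: copy the explicit mark
    # into compile metadata so downstream consumers can observe it.
    if active_config is not None and active_config.is_kernel_launch_config_sensitive():
        state.metadata["launch_config_sensitive"] = True


state = SimpleNamespace(metadata={})
cfg = LaunchConfiguration()
cfg.mark_kernel_as_launch_config_sensitive()
backend_hook(state, cfg)
print(state.metadata["launch_config_sensitive"])  # True
```

The point of the hook is that rewrites only ever call the public `mark_...` method; the metadata write happens in exactly one place.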

## What is already implemented
- TLS-based launch-config capture in C extension, exposed via
`numba_cuda/numba/cuda/launchconfig.py`.
- Dispatcher plumbing for LC-S (per-config specialization + cache keys + `.lcs` marker).
- Tests for LC-S recompile + cache coverage.
- Docs updated for launch-config introspection.
- In cccl: `cuda/coop/_rewrite.py` now marks LC-S when accessing launch config.
It calls `mark_kernel_as_launch_config_sensitive()` when available, with
fallback to `state.metadata["launch_config_sensitive"] = True`.
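The compatibility fallback described above is a standard duck-typing pattern: prefer the explicit API when the installed numba-cuda provides it, otherwise write the metadata flag directly. A hedged sketch (the config classes here are stubs for illustration, not the real cuda.coop rewrite code):

```python
def mark_lcs(launch_config, state_metadata):
    # Use the explicit API when available; fall back to the metadata flag
    # for older numba-cuda versions that lack the helper.
    mark = getattr(launch_config, "mark_kernel_as_launch_config_sensitive", None)
    if callable(mark):
        mark()
    else:
        state_metadata["launch_config_sensitive"] = True


class NewStyleConfig:
    """Stub with the explicit API."""

    def __init__(self):
        self.marked = False

    def mark_kernel_as_launch_config_sensitive(self):
        self.marked = True


class OldStyleConfig:
    """Stub without the explicit API."""


meta_new, meta_old = {}, {}
new_cfg = NewStyleConfig()
mark_lcs(new_cfg, meta_new)           # takes the explicit-API path
mark_lcs(OldStyleConfig(), meta_old)  # falls back to metadata
print(new_cfg.marked, meta_old)  # True {'launch_config_sensitive': True}
```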

## Open decisions / tasks
1. **Explicit LC-S API decision: keep**
- `_LaunchConfiguration` explicit LC-S API is retained.
- Compiler hook in `CUDABackend` uses this API to set metadata.
- cccl rewrite is updated to use the API when available.

2. **Run CUDA tests on a GPU**
- `pixi run -e cu-12-9-py312 pytest testing --pyargs numba.cuda.tests.cudapy.test_launch_config_sensitive -k launch_config_sensitive`
- `pixi run -e cu-12-9-py312 pytest testing --pyargs numba.cuda.tests.cudapy.test_caching -k launch_config_sensitive`
- Status: both passing on GPU in this worktree.

3. **Validate disk-cache behavior across processes**
- Ensure `.lcs` marker + launch-config cache keying behave correctly.
- Status: covered by
`LaunchConfigSensitiveCachingTest.test_launch_config_sensitive_cache_keys`
in `test_caching.py` (passes, includes separate-process verification).

4. **Audit launch paths**
- Confirm all kernel launch paths go through `CUDADispatcher.call()`.
- Status: Python launch paths in `dispatcher.py` verified.

5. **Commit / cleanup**
- Remove untracked `PR.md` and `tags`.
- Prepare commit(s) for the launch-config work.

## Notes
- If you need to re-run the overhead micro-benchmark, see `LAUNCH-CONFIG.md`.
- Update `LAUNCH-CONFIG-TODO.md` with any new decisions or test results.
58 changes: 58 additions & 0 deletions LAUNCH-CONFIG-TODO.md
@@ -0,0 +1,58 @@
# Launch Config Sensitive (LC-S) plumbing

Last updated: 2026-02-19

## Current status (summary)
- Launch-config TLS capture exists in C extension and is exposed via
`numba_cuda/numba/cuda/launchconfig.py` (current/ensure/capture helpers).
- Dispatcher plumbing for LC-S is implemented:
- `_Kernel` captures `launch_config_sensitive` from compile metadata.
- `CUDADispatcher` tracks LC-S and routes to per-launch-config sub-dispatchers.
- Disk cache includes a launch-config key and LC-S marker file (`.lcs`).
- Tests added:
- `numba_cuda/numba/cuda/tests/cudapy/test_launch_config_sensitive.py`
- `numba_cuda/numba/cuda/tests/cudapy/cache_launch_config_sensitive_usecases.py`
- `numba_cuda/numba/cuda/tests/cudapy/test_caching.py` LC-S coverage
- Docs updated: `docs/source/reference/kernel.rst`.
- In cccl, `cuda/coop/_rewrite.py` now marks LC-S when accessing launch config.
It uses the explicit LaunchConfiguration API when available, with fallback to
`state.metadata["launch_config_sensitive"] = True` for compatibility.
- CUDA tests have been run on GPU in this worktree and pass:
- `pixi run -e cu-12-9-py312 pytest testing --pyargs numba.cuda.tests.cudapy.test_launch_config_sensitive -k launch_config_sensitive`
- `pixi run -e cu-12-9-py312 pytest testing --pyargs numba.cuda.tests.cudapy.test_caching -k launch_config_sensitive`
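The cache-keying behavior summarized above can be illustrated with a simplified model: for LC-S kernels the launch configuration participates in the cache key, and an on-disk `.lcs` marker tells a fresh process the kernel is launch-config keyed before it compiles anything. The key layout and file naming below are assumptions for illustration; the real index format is internal to numba-cuda:

```python
import tempfile
from pathlib import Path


def cache_key(argtypes, launch_config, lcs):
    # Base key: argument types only. For LC-S kernels, fold the launch
    # configuration in so each grid/block/sharedmem shape gets its own entry.
    key = (tuple(argtypes),)
    if lcs:
        key += (launch_config["griddim"], launch_config["blockdim"],
                launch_config["sharedmem"])
    return key


# Model of the ".lcs" marker file: its presence signals that cached
# entries for this kernel are keyed by launch configuration.
cache_dir = Path(tempfile.mkdtemp())
(cache_dir / "kernel.lcs").touch()

k1 = cache_key(["i4[:]"], {"griddim": (1,), "blockdim": (1,), "sharedmem": 0}, lcs=True)
k2 = cache_key(["i4[:]"], {"griddim": (2,), "blockdim": (1,), "sharedmem": 0}, lcs=True)
print(k1 != k2)  # True: different launch configs map to separate entries
```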

## Local working tree state (numba-cuda)
- Branch: `280-launch-config-v2`
- Modified files (uncommitted):
- `numba_cuda/numba/cuda/compiler.py`
- `numba_cuda/numba/cuda/dispatcher.py`
- `scripts/bench-launch-overhead.py`
- Untracked: `PR.md`, `tags`

## New (uncommitted) LC-S API work
- `_LaunchConfiguration` gains explicit helpers:
- `mark_kernel_as_launch_config_sensitive()`
- `get_kernel_launch_config_sensitive()`
- `is_kernel_launch_config_sensitive()`
- `CUDABackend` sets metadata when the launch config is explicitly marked.
This provides an official path to mark LC-S without poking at `state.metadata`
directly from rewrites.

## Remaining TODO
1. **Cleanup**
- Remove or handle untracked `PR.md` and `tags` before committing.

## Completed checks (2026-02-19)
- **Cross-process disk-cache behavior**
- Verified by:
`pixi run -e cu-12-9-py312 pytest testing --pyargs numba.cuda.tests.cudapy.test_caching -k launch_config_sensitive`
- `LaunchConfigSensitiveCachingTest.test_launch_config_sensitive_cache_keys`
exercises cache reuse in a separate process and passed.
- **Launch path audit**
- Python launch paths in `dispatcher.py` all route through
`CUDADispatcher.call()`: `__getitem__()` -> `configure()` ->
`_LaunchConfiguration.__call__()` -> `call()`, plus `ForAll.__call__()`.
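The audited funnel can be sketched as a toy dispatcher in which every launch path, including `ForAll`, reaches a single `call()` choke point. Class shapes and the `forall` block-size arithmetic below are illustrative only, not the real `dispatcher.py` code:

```python
class ToyDispatcher:
    def __init__(self):
        self.calls = []

    def __getitem__(self, args):
        griddim, blockdim = args
        return self.configure(griddim, blockdim)

    def configure(self, griddim, blockdim, stream=0, sharedmem=0):
        return ToyLaunchConfiguration(self, griddim, blockdim, stream, sharedmem)

    def forall(self, ntasks):
        return ToyForAll(self, ntasks)

    def call(self, args, griddim, blockdim, stream, sharedmem):
        # Single choke point: every launch path ends up here.
        self.calls.append((griddim, blockdim, stream, sharedmem))


class ToyLaunchConfiguration:
    def __init__(self, dispatcher, griddim, blockdim, stream, sharedmem):
        self.dispatcher = dispatcher
        self.griddim, self.blockdim = griddim, blockdim
        self.stream, self.sharedmem = stream, sharedmem

    def __call__(self, *args):
        self.dispatcher.call(args, self.griddim, self.blockdim,
                             self.stream, self.sharedmem)


class ToyForAll:
    def __init__(self, dispatcher, ntasks, tpb=128):
        self.dispatcher, self.ntasks, self.tpb = dispatcher, ntasks, tpb

    def __call__(self, *args):
        blocks = (self.ntasks + self.tpb - 1) // self.tpb
        self.dispatcher.configure(blocks, self.tpb)(*args)


d = ToyDispatcher()
d[1, 1](None)        # __getitem__ -> configure -> __call__ -> call
d.forall(256)(None)  # ForAll.__call__ -> configure -> __call__ -> call
print(len(d.calls))  # 2
```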

## Notes
- Separate PR for cache invalidation on `numba_cuda.__version__` is already
pushed; do not re-implement here.
104 changes: 104 additions & 0 deletions LAUNCH-CONFIG.md
@@ -0,0 +1,104 @@
# Launch Config Benchmarking

This repo includes lightweight benchmarking scaffolding to quantify CUDA kernel
launch overhead across three launch-config implementations (baseline, old
contextvar branch, and the new v2 branch).

## Status / Next Steps (Launch Config Work)
- LC-S plumbing is implemented in `dispatcher.py` and supporting files.
- CUDA LC-S tests have been run on GPU in this branch and are passing.
- There are uncommitted changes in `compiler.py`, `dispatcher.py`, and
`scripts/bench-launch-overhead.py` that add an explicit LC-S API on
`_LaunchConfiguration` and a compiler hook to honor it.
- cccl rewrite integration now uses the explicit LC-S API with a fallback to
metadata for compatibility.
- Cross-process disk-cache behavior is covered by LC-S caching tests and passes.
- See `LAUNCH-CONFIG-TODO.md` for a detailed handoff checklist.

## What’s Included

### 1) `scripts/bench-launch-overhead.py`
A focused micro-benchmark that measures launch overhead (us/launch) for kernels
with 0..4 arguments, using a 1x1 launch. It:
- warms up each kernel
- runs `loops` iterations per kernel (default: 100k for 0–3 args, 10k for 4 args)
- repeats the measurement (default: 7 repeats)
- reports mean/stdev and deltas vs the first repo
- optionally writes JSON output

The benchmark is designed to compare multiple repos (or worktrees) in the same
Python environment.
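The measurement loop boils down to warmup, timed inner loops, and repeated statistics. A self-contained sketch of that shape, with loop counts scaled down and a no-op callable standing in for a kernel launch (the function name and defaults here are assumptions, not the script's actual interface):

```python
import statistics
import time


def measure_overhead(launch, loops=1000, repeats=7, warmup=10):
    # Warm up so one-time costs (compilation, lazy init) don't skew timing.
    for _ in range(warmup):
        launch()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(loops):
            launch()
        elapsed = time.perf_counter() - start
        samples.append(elapsed / loops * 1e6)  # microseconds per launch
    return statistics.mean(samples), statistics.stdev(samples)


def noop():
    pass


mean_us, stdev_us = measure_overhead(noop)
print(f"{mean_us:.3f} +/- {stdev_us:.3f} us/launch")
```

Deltas between repos then fall out as `other_mean - baseline_mean` with the percentage relative to the baseline mean, as in the example output below.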

### 2) `scripts/bench-against.py`
A helper that compares benchmarks between two git refs using a temporary
worktree, running the pixi benchmark tasks before and after.

### 3) Pixi tasks
Defined in `pixi.toml`:
- `bench-launch-overhead`: runs `scripts/bench-launch-overhead.py`
- `bench`: pytest benchmark suite (`numba.cuda.tests.benchmarks`)
- `benchcmp`: compare benchmark results from `bench`
- `bench-against`: runs `scripts/bench-against.py`

## Recommended Usage (Three-Way Compare)

Assuming you have three working trees for:
- **baseline** (main or a baseline ref)
- **contextvar** (old implementation)
- **v2** (new implementation)

Run the launch-overhead micro-benchmark:

```bash
pixi run -e cu-12-9-py312 bench-launch-overhead \
--repo baseline=/path/to/numba-cuda-main \
--repo contextvar=/path/to/numba-cuda-contextvar \
--repo v2=/home/trentn/src/280-launch-config-v2
```

Notes:
- The script will `pip install -e` each repo by default. Use `--no-install`
if you have already installed them and want to skip reinstalling.
- Use `--python` to point at a specific interpreter if needed.
- Use `--loops` to override the default loop counts, e.g. `--loops 200000,200000,200000,200000,20000`.
- Use `--output results.json` to persist the results.

## Example Output

```
Launch overhead (us/launch):
args baseline contextvar v2
0 4.10 +/- 0.05 6.20 +/- 0.06 4.50 +/- 0.04
1 4.40 +/- 0.05 6.60 +/- 0.06 4.80 +/- 0.05
...

Deltas vs baseline:
args contextvar v2
0 2.10 (+51.2%) 0.40 (+9.8%)
1 2.20 (+50.0%) 0.40 (+9.1%)
...
```

## Benchmark Suite (Broader Coverage)

For more extensive benchmark coverage (not just launch overhead), use:

```bash
pixi run -e cu-12-9-py312 bench
```

To compare two git refs using a temporary worktree:

```bash
pixi run -e cu-12-9-py312 bench-against HEAD~ HEAD
```

This runs `bench` on the baseline ref and `benchcmp` on the proposed ref.

## Notes / Constraints

- Benchmarks require a real GPU (CUDA simulator is rejected).
- The micro-benchmark intentionally keeps kernels trivial to isolate launch
overhead.
- The three-way comparison is the most direct way to capture the relative
overhead introduced by launch-config state management.
39 changes: 39 additions & 0 deletions docs/source/reference/kernel.rst
@@ -57,6 +57,45 @@ This is similar to launch configuration in CUDA C/C++:
.. note:: The order of ``stream`` and ``sharedmem`` is reversed in Numba
   compared to CUDA C/C++.

Launch configuration introspection (advanced)
---------------------------------------------

The current launch configuration is available during compilation triggered by
kernel launches. This can be useful for debugging or for extensions that need
to observe how kernels are configured.

.. note:: The capture is compile-time only. If the kernel is already compiled
   for the given argument types, the captured config may remain ``None``.

.. code-block:: python

   from numba import cuda
   from numba.cuda import launchconfig

   @cuda.jit
   def f(x):
       x[0] = 1

   arr = cuda.device_array(1, dtype="i4")
   with launchconfig.capture_compile_config(f) as capture:
       f[1, 1](arr)  # first launch triggers compilation

   cfg = capture["config"]
   print(cfg.griddim, cfg.blockdim, cfg.sharedmem)

Configured kernels also expose pre-launch callbacks for lightweight
instrumentation:

.. code-block:: python

   cfg = f[1, 1]

   def log_launch(kernel, cfg):
       print(cfg.griddim, cfg.blockdim)

   cfg.pre_launch_callbacks.append(log_launch)
   cfg(arr)

Dispatcher objects also provide several utility methods for inspection and
creating a specialized instance:

1 change: 1 addition & 0 deletions docs/source/user/index.rst
@@ -11,6 +11,7 @@ User guide
.. toctree::

installation.rst
release_notes.rst
kernels.rst
memory.rst
device-functions.rst