68 changes: 68 additions & 0 deletions LAUNCH-CONFIG-CODEX-PROMPT.md
@@ -0,0 +1,68 @@
# Codex Prompt: Launch Config Sensitive (LC-S) work for cuda.coop

You are resuming launch-config work in `numba-cuda` to support cuda.coop
single-phase. The cache invalidation change for `numba_cuda.__version__` is
already in a separate PR; do not redo that here.

## Repos / worktrees
- `numba-cuda` (current worktree): `/home/trentn/src/280-launch-config-v2`
- branch: `280-launch-config-v2`
- `numba-cuda` main baseline: `/home/trentn/src/numba-cuda-main`
- cuda.coop repo: `/home/trentn/src/cccl/python/cuda_cccl`
- see `SINGLE-PHASE-*.md` for context

## Current local state (numba-cuda)
Run:
- `git status -sb`
- `git diff`

Expected (uncommitted) changes in this worktree:
- `numba_cuda/numba/cuda/compiler.py`
- CUDABackend sets `state.metadata["launch_config_sensitive"] = True`
when the active launch config is explicitly marked.
- `numba_cuda/numba/cuda/dispatcher.py`
- `_LaunchConfiguration` adds explicit API:
`mark_kernel_as_launch_config_sensitive()`, `get_kernel_launch_config_sensitive()`,
`is_kernel_launch_config_sensitive()`.
- `scripts/bench-launch-overhead.py`
- import compatibility for `numba.cuda.core.config` vs `numba.core.config`.
- Untracked: `PR.md`, `tags` (clean up before commit).
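The explicit LC-S API and the `CUDABackend` hook described above can be modeled with a minimal self-contained sketch. Only the three helper names and the `state.metadata` key come from this document; the class body, the `_lcs` attribute, and the `SimpleNamespace` stand-in for the compiler state are illustrative assumptions, not the actual numba-cuda implementation:

```python
from types import SimpleNamespace


class LaunchConfiguration:
    """Simplified stand-in for numba-cuda's _LaunchConfiguration LC-S API."""

    def __init__(self):
        self._lcs = False  # assumed internal flag, not the real attribute name

    def mark_kernel_as_launch_config_sensitive(self):
        # Explicitly flag this launch config as launch-config sensitive.
        self._lcs = True

    def get_kernel_launch_config_sensitive(self):
        return self._lcs

    def is_kernel_launch_config_sensitive(self):
        return self._lcs


def backend_hook(state, active_config):
    # Mirrors the described CUDABackend behavior: copy the explicit mark
    # into compile metadata so downstream consumers can observe it.
    if active_config is not None and active_config.is_kernel_launch_config_sensitive():
        state.metadata["launch_config_sensitive"] = True


state = SimpleNamespace(metadata={})
cfg = LaunchConfiguration()
cfg.mark_kernel_as_launch_config_sensitive()
backend_hook(state, cfg)
print(state.metadata["launch_config_sensitive"])  # True
```

The point of the hook is that rewrites only ever call the public `mark_...` method; the metadata write happens in exactly one place.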

## What is already implemented
- TLS-based launch-config capture in C extension, exposed via
`numba_cuda/numba/cuda/launchconfig.py`.
- Dispatcher plumbing for LC-S (per-config specialization + cache keys + `.lcs` marker).
- Tests for LC-S recompile + cache coverage.
- Docs updated for launch-config introspection.
- In cccl: `cuda/coop/_rewrite.py` now marks LC-S when accessing launch config.
It calls `mark_kernel_as_launch_config_sensitive()` when available, with
fallback to `state.metadata["launch_config_sensitive"] = True`.
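The compatibility fallback described above is a standard duck-typing pattern: prefer the explicit API when the installed numba-cuda provides it, otherwise write the metadata flag directly. A hedged sketch (the config classes here are stubs for illustration, not the real cuda.coop rewrite code):

```python
def mark_lcs(launch_config, state_metadata):
    # Use the explicit API when available; fall back to the metadata flag
    # for older numba-cuda versions that lack the helper.
    mark = getattr(launch_config, "mark_kernel_as_launch_config_sensitive", None)
    if callable(mark):
        mark()
    else:
        state_metadata["launch_config_sensitive"] = True


class NewStyleConfig:
    """Stub with the explicit API."""

    def __init__(self):
        self.marked = False

    def mark_kernel_as_launch_config_sensitive(self):
        self.marked = True


class OldStyleConfig:
    """Stub without the explicit API."""


meta_new, meta_old = {}, {}
new_cfg = NewStyleConfig()
mark_lcs(new_cfg, meta_new)           # takes the explicit-API path
mark_lcs(OldStyleConfig(), meta_old)  # falls back to metadata
print(new_cfg.marked, meta_old)  # True {'launch_config_sensitive': True}
```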

## Open decisions / tasks
1. **Explicit LC-S API decision: keep**
- `_LaunchConfiguration` explicit LC-S API is retained.
- Compiler hook in `CUDABackend` uses this API to set metadata.
- cccl rewrite is updated to use the API when available.

2. **Run CUDA tests on a GPU**
- `pixi run -e cu-12-9-py312 pytest testing --pyargs numba.cuda.tests.cudapy.test_launch_config_sensitive -k launch_config_sensitive`
- `pixi run -e cu-12-9-py312 pytest testing --pyargs numba.cuda.tests.cudapy.test_caching -k launch_config_sensitive`
- Status: both passing on GPU in this worktree.

3. **Validate disk-cache behavior across processes**
- Ensure `.lcs` marker + launch-config cache keying behave correctly.
- Status: covered by
`LaunchConfigSensitiveCachingTest.test_launch_config_sensitive_cache_keys`
in `test_caching.py` (passes, includes separate-process verification).

4. **Audit launch paths**
- Confirm all kernel launch paths go through `CUDADispatcher.call()`.
- Status: Python launch paths in `dispatcher.py` verified.

5. **Commit / cleanup**
- Remove untracked `PR.md` and `tags`.
- Prepare commit(s) for the launch-config work.

## Notes
- If you need to re-run the overhead micro-benchmark, see `LAUNCH-CONFIG.md`.
- Update `LAUNCH-CONFIG-TODO.md` with any new decisions or test results.
58 changes: 58 additions & 0 deletions LAUNCH-CONFIG-TODO.md
@@ -0,0 +1,58 @@
# Launch Config Sensitive (LC-S) plumbing

Last updated: 2026-02-19

## Current status (summary)
- Launch-config TLS capture exists in C extension and is exposed via
`numba_cuda/numba/cuda/launchconfig.py` (current/ensure/capture helpers).
- Dispatcher plumbing for LC-S is implemented:
- `_Kernel` captures `launch_config_sensitive` from compile metadata.
- `CUDADispatcher` tracks LC-S and routes to per-launch-config sub-dispatchers.
- Disk cache includes a launch-config key and LC-S marker file (`.lcs`).
- Tests added:
- `numba_cuda/numba/cuda/tests/cudapy/test_launch_config_sensitive.py`
- `numba_cuda/numba/cuda/tests/cudapy/cache_launch_config_sensitive_usecases.py`
- `numba_cuda/numba/cuda/tests/cudapy/test_caching.py` LC-S coverage
- Docs updated: `docs/source/reference/kernel.rst`.
- In cccl, `cuda/coop/_rewrite.py` now marks LC-S when accessing launch config.
It uses the explicit LaunchConfiguration API when available, with fallback to
`state.metadata["launch_config_sensitive"] = True` for compatibility.
- CUDA tests have been run on GPU in this worktree and pass:
- `pixi run -e cu-12-9-py312 pytest testing --pyargs numba.cuda.tests.cudapy.test_launch_config_sensitive -k launch_config_sensitive`
- `pixi run -e cu-12-9-py312 pytest testing --pyargs numba.cuda.tests.cudapy.test_caching -k launch_config_sensitive`
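The cache-keying behavior summarized above can be illustrated with a simplified model: for LC-S kernels the launch configuration participates in the cache key, and an on-disk `.lcs` marker tells a fresh process the kernel is launch-config keyed before it compiles anything. The key layout and file naming below are assumptions for illustration; the real index format is internal to numba-cuda:

```python
import tempfile
from pathlib import Path


def cache_key(argtypes, launch_config, lcs):
    # Base key: argument types only. For LC-S kernels, fold the launch
    # configuration in so each grid/block/sharedmem shape gets its own entry.
    key = (tuple(argtypes),)
    if lcs:
        key += (launch_config["griddim"], launch_config["blockdim"],
                launch_config["sharedmem"])
    return key


# Model of the ".lcs" marker file: its presence signals that cached
# entries for this kernel are keyed by launch configuration.
cache_dir = Path(tempfile.mkdtemp())
(cache_dir / "kernel.lcs").touch()

k1 = cache_key(["i4[:]"], {"griddim": (1,), "blockdim": (1,), "sharedmem": 0}, lcs=True)
k2 = cache_key(["i4[:]"], {"griddim": (2,), "blockdim": (1,), "sharedmem": 0}, lcs=True)
print(k1 != k2)  # True: different launch configs map to separate entries
```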

## Local working tree state (numba-cuda)
- Branch: `280-launch-config-v2`
- Modified files (uncommitted):
- `numba_cuda/numba/cuda/compiler.py`
- `numba_cuda/numba/cuda/dispatcher.py`
- `scripts/bench-launch-overhead.py`
- Untracked: `PR.md`, `tags`

## New (uncommitted) LC-S API work
- `_LaunchConfiguration` gains explicit helpers:
- `mark_kernel_as_launch_config_sensitive()`
- `get_kernel_launch_config_sensitive()`
- `is_kernel_launch_config_sensitive()`
- `CUDABackend` sets metadata when the launch config is explicitly marked.
This provides an official path to mark LC-S without poking at `state.metadata`
directly from rewrites.

## Remaining TODO
1. **Cleanup**
- Remove or handle untracked `PR.md` and `tags` before committing.

## Completed checks (2026-02-19)
- **Cross-process disk-cache behavior**
- Verified by:
`pixi run -e cu-12-9-py312 pytest testing --pyargs numba.cuda.tests.cudapy.test_caching -k launch_config_sensitive`
- `LaunchConfigSensitiveCachingTest.test_launch_config_sensitive_cache_keys`
exercises cache reuse in a separate process and passed.
- **Launch path audit**
- Python launch paths in `dispatcher.py` all route through
`CUDADispatcher.call()`: `__getitem__()` -> `configure()` ->
`_LaunchConfiguration.__call__()` -> `call()`, plus `ForAll.__call__()`.
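The audited funnel can be sketched as a toy dispatcher in which every launch path, including `ForAll`, reaches a single `call()` choke point. Class shapes and the `forall` block-size arithmetic below are illustrative only, not the real `dispatcher.py` code:

```python
class ToyDispatcher:
    def __init__(self):
        self.calls = []

    def __getitem__(self, args):
        griddim, blockdim = args
        return self.configure(griddim, blockdim)

    def configure(self, griddim, blockdim, stream=0, sharedmem=0):
        return ToyLaunchConfiguration(self, griddim, blockdim, stream, sharedmem)

    def forall(self, ntasks):
        return ToyForAll(self, ntasks)

    def call(self, args, griddim, blockdim, stream, sharedmem):
        # Single choke point: every launch path ends up here.
        self.calls.append((griddim, blockdim, stream, sharedmem))


class ToyLaunchConfiguration:
    def __init__(self, dispatcher, griddim, blockdim, stream, sharedmem):
        self.dispatcher = dispatcher
        self.griddim, self.blockdim = griddim, blockdim
        self.stream, self.sharedmem = stream, sharedmem

    def __call__(self, *args):
        self.dispatcher.call(args, self.griddim, self.blockdim,
                             self.stream, self.sharedmem)


class ToyForAll:
    def __init__(self, dispatcher, ntasks, tpb=128):
        self.dispatcher, self.ntasks, self.tpb = dispatcher, ntasks, tpb

    def __call__(self, *args):
        blocks = (self.ntasks + self.tpb - 1) // self.tpb
        self.dispatcher.configure(blocks, self.tpb)(*args)


d = ToyDispatcher()
d[1, 1](None)        # __getitem__ -> configure -> __call__ -> call
d.forall(256)(None)  # ForAll.__call__ -> configure -> __call__ -> call
print(len(d.calls))  # 2
```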

## Notes
- Separate PR for cache invalidation on `numba_cuda.__version__` is already
pushed; do not re-implement here.
104 changes: 104 additions & 0 deletions LAUNCH-CONFIG.md
@@ -0,0 +1,104 @@
# Launch Config Benchmarking

This repo includes lightweight benchmarking scaffolding to quantify CUDA kernel
launch overhead across three launch-config implementations (baseline, old
contextvar branch, and the new v2 branch).

## Status / Next Steps (Launch Config Work)
- LC-S plumbing is implemented in `dispatcher.py` and supporting files.
- CUDA LC-S tests have been run on GPU in this branch and are passing.
- There are uncommitted changes in `compiler.py`, `dispatcher.py`, and
`scripts/bench-launch-overhead.py` that add an explicit LC-S API on
`_LaunchConfiguration` and a compiler hook to honor it.
- cccl rewrite integration now uses the explicit LC-S API with a fallback to
metadata for compatibility.
- Cross-process disk-cache behavior is covered by LC-S caching tests and passes.
- See `LAUNCH-CONFIG-TODO.md` for a detailed handoff checklist.

## What’s Included

### 1) `scripts/bench-launch-overhead.py`
A focused micro-benchmark that measures launch overhead (us/launch) for kernels
with 0..4 arguments, using a 1x1 launch. It:
- warms up each kernel
- runs `loops` iterations per kernel (default: 100k for 0–3 args, 10k for 4 args)
- repeats the measurement (default: 7 repeats)
- reports mean/stdev and deltas vs the first repo
- optionally writes JSON output

The benchmark is designed to compare multiple repos (or worktrees) in the same
Python environment.
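The measurement loop boils down to warmup, timed inner loops, and repeated statistics. A self-contained sketch of that shape, with loop counts scaled down and a no-op callable standing in for a kernel launch (the function name and defaults here are assumptions, not the script's actual interface):

```python
import statistics
import time


def measure_overhead(launch, loops=1000, repeats=7, warmup=10):
    # Warm up so one-time costs (compilation, lazy init) don't skew timing.
    for _ in range(warmup):
        launch()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(loops):
            launch()
        elapsed = time.perf_counter() - start
        samples.append(elapsed / loops * 1e6)  # microseconds per launch
    return statistics.mean(samples), statistics.stdev(samples)


def noop():
    pass


mean_us, stdev_us = measure_overhead(noop)
print(f"{mean_us:.3f} +/- {stdev_us:.3f} us/launch")
```

Deltas between repos then fall out as `other_mean - baseline_mean` with the percentage relative to the baseline mean, as in the example output below.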

### 2) `scripts/bench-against.py`
A helper that compares benchmarks between two git refs using a temporary
worktree, running the pixi benchmark tasks before and after.

### 3) Pixi tasks
Defined in `pixi.toml`:
- `bench-launch-overhead`: runs `scripts/bench-launch-overhead.py`
- `bench`: pytest benchmark suite (`numba.cuda.tests.benchmarks`)
- `benchcmp`: compare benchmark results from `bench`
- `bench-against`: runs `scripts/bench-against.py`

## Recommended Usage (Three-Way Compare)

Assuming you have three working trees for:
- **baseline** (main or a baseline ref)
- **contextvar** (old implementation)
- **v2** (new implementation)

Run the launch-overhead micro-benchmark:

```bash
pixi run -e cu-12-9-py312 bench-launch-overhead \
--repo baseline=/path/to/numba-cuda-main \
--repo contextvar=/path/to/numba-cuda-contextvar \
--repo v2=/home/trentn/src/280-launch-config-v2
```

Notes:
- The script will `pip install -e` each repo by default. Use `--no-install`
if you have already installed them and want to skip reinstalling.
- Use `--python` to point at a specific interpreter if needed.
- Use `--loops` to override the default loop counts, e.g. `--loops 200000,200000,200000,200000,20000`.
- Use `--output results.json` to persist the results.

## Example Output

```
Launch overhead (us/launch):
args baseline contextvar v2
0 4.10 +/- 0.05 6.20 +/- 0.06 4.50 +/- 0.04
1 4.40 +/- 0.05 6.60 +/- 0.06 4.80 +/- 0.05
...

Deltas vs baseline:
args contextvar v2
0 2.10 (+51.2%) 0.40 (+9.8%)
1 2.20 (+50.0%) 0.40 (+9.1%)
...
```

## Benchmark Suite (Broader Coverage)

For more extensive benchmark coverage (not just launch overhead), use:

```bash
pixi run -e cu-12-9-py312 bench
```

To compare two git refs using a temporary worktree:

```bash
pixi run -e cu-12-9-py312 bench-against HEAD~ HEAD
```

This runs `bench` on the baseline ref and `benchcmp` on the proposed ref.

## Notes / Constraints

- Benchmarks require a real GPU (CUDA simulator is rejected).
- The micro-benchmark intentionally keeps kernels trivial to isolate launch
overhead.
- The three-way comparison is the most direct way to capture the relative
overhead introduced by launch-config state management.
39 changes: 39 additions & 0 deletions docs/source/reference/kernel.rst
@@ -57,6 +57,45 @@ This is similar to launch configuration in CUDA C/C++:
.. note:: The order of ``stream`` and ``sharedmem`` is reversed in Numba
   compared to CUDA C/C++.

Launch configuration introspection (advanced)
---------------------------------------------

The current launch configuration is available during compilation triggered by
kernel launches. This can be useful for debugging or for extensions that need
to observe how kernels are configured.

.. note:: The capture is compile-time only. If the kernel is already compiled
   for the given argument types, the captured config may remain ``None``.

.. code-block:: python

   from numba import cuda
   from numba.cuda import launchconfig

   @cuda.jit
   def f(x):
       x[0] = 1

   arr = cuda.device_array(1, dtype="i4")
   with launchconfig.capture_compile_config(f) as capture:
       f[1, 1](arr)  # first launch triggers compilation

   cfg = capture["config"]
   print(cfg.griddim, cfg.blockdim, cfg.sharedmem)

Configured kernels also expose pre-launch callbacks for lightweight
instrumentation:

.. code-block:: python

   cfg = f[1, 1]

   def log_launch(kernel, cfg):
       print(cfg.griddim, cfg.blockdim)

   cfg.pre_launch_callbacks.append(log_launch)
   cfg(arr)

Dispatcher objects also provide several utility methods for inspection and
creating a specialized instance:

1 change: 1 addition & 0 deletions docs/source/user/index.rst
@@ -11,6 +11,7 @@ User guide
.. toctree::

installation.rst
release_notes.rst
kernels.rst
memory.rst
device-functions.rst