Conversation

@kaeun97
Contributor

@kaeun97 kaeun97 commented Dec 3, 2025

This adds set_shared_memory_carveout using cuFuncSetAttribute requested here.

I do want to confirm this design with @gmarkall (and it would be nice to chat about it). An alternative is to combine this with cache_config in some way. However, I think separating the two makes more sense, mainly because it makes the intent clearer for the user (cache_config sets preference hints, while set_shared_memory_carveout sets explicit percentages).
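For illustration, the intent split can be sketched with a stand-in class (all names and semantics here are hypothetical and follow the description above; this is not the actual numba-cuda Function implementation):

```python
# Stand-in sketch of the intent split: cache_config gives a categorical
# *hint* the driver may ignore, while set_shared_memory_carveout makes an
# *explicit* percentage request.

class FunctionSketch:
    def __init__(self):
        self.cache_preference = None   # advisory hint, or None if unset
        self.shared_carveout = -1      # -1 means "driver default"

    def cache_config(self, prefer_shared=False, prefer_l1=False,
                     prefer_equal=False):
        # Preference hint: categorical, advisory only.
        if prefer_shared:
            self.cache_preference = "prefer_shared"
        elif prefer_l1:
            self.cache_preference = "prefer_l1"
        elif prefer_equal:
            self.cache_preference = "prefer_equal"

    def set_shared_memory_carveout(self, carveout):
        # Explicit request: percentage of the unified L1/shared storage
        # to devote to shared memory, or -1 to restore the driver default.
        if not (-1 <= carveout <= 100):
            raise ValueError("carveout must be between -1 and 100")
        self.shared_carveout = carveout
```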

@copy-pr-bot

copy-pr-bot bot commented Dec 3, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@kaeun97 kaeun97 changed the title feat: add set_shared-memory_carveout feat: add set_shared_memory_carveout Dec 3, 2025
@kaeun97 kaeun97 force-pushed the kaeun97/add-set-shared-memory-carveout branch from d4d96c4 to 81d16af Compare December 3, 2025 01:45
@gmarkall
Contributor

gmarkall commented Dec 3, 2025

/ok to test

@copy-pr-bot

copy-pr-bot bot commented Dec 3, 2025

/ok to test

@gmarkall, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@gmarkall
Contributor

gmarkall commented Dec 3, 2025

/ok to test 1697fc7

@gmarkall
Contributor

gmarkall commented Dec 3, 2025

Thanks for the PR! I'll circle back to this when I get a chance (hopefully soon). Happy to chat about the design as well.

@gmarkall gmarkall added the 3 - Ready for Review Ready for review by team label Dec 3, 2025
@gmarkall
Contributor

gmarkall commented Dec 4, 2025

/ok to test 93a149e

Contributor

@gmarkall gmarkall left a comment


> Do want to confirm with @gmarkall about this design (and would be nice to chat about it).

I think the design for the part of the problem that this PR solves looks great - I can think of no improvement.

> An alternative is to combine this with cache_config in some way. However, I do think separating the two makes more sense mainly because it makes the intent clearer for the user (cache_config sets preference hints while set_shared_memory_carveout sets explicit percentages).

I agree, keeping it separate makes sense to me too.

I'm happy to chat if you have unresolved questions, but I'm also happy for this to be merged - let me know how you'd like to proceed.

@gmarkall gmarkall added 4 - Waiting on author Waiting for author to respond to review and removed 3 - Ready for Review Ready for review by team labels Dec 4, 2025
@kaeun97 kaeun97 force-pushed the kaeun97/add-set-shared-memory-carveout branch from 93a149e to d91f8f4 Compare December 8, 2025 21:36
@kaeun97
Contributor Author

kaeun97 commented Dec 8, 2025

@gmarkall Thank you for the review! Good to hear that the design makes sense. Just wanted to make sure that I am interpreting the description from the issue properly :) Would be nice to re-run CI and merge this.

@kaeun97 kaeun97 marked this pull request as ready for review December 8, 2025 21:40
@greptile-apps
Contributor

greptile-apps bot commented Dec 8, 2025

Greptile Overview

Greptile Summary

This PR adds set_shared_memory_carveout functionality to configure the shared memory carveout percentage for CUDA kernels using cuFuncSetAttribute. The implementation:

  • Adds abstract method to Function base class and concrete implementations in both CtypesFunction and CudaPythonFunction
  • Validates carveout values are in the range -1 to 100 (where -1 means default)
  • Includes comprehensive test coverage for valid/invalid values and kernel execution
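The range check described in the summary could be sketched as follows (the helper name is hypothetical; the -1/0..100 semantics are as stated above, with -1 meaning the driver default):

```python
# Hypothetical helper mirroring the described validation.

def check_carveout(carveout):
    # Valid values: -1 (use the driver default) through 100 (devote all
    # of the unified L1/shared storage to shared memory).
    if not (-1 <= carveout <= 100):
        raise ValueError(
            "carveout must be between -1 and 100, got %r" % (carveout,)
        )
    return carveout
```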

Key Issue Found:

  • The CtypesFunction implementation calls driver.cuFuncSetAttribute(), but this function is not defined in API_PROTOTYPES in drvapi.py. While the default code path uses CudaPythonFunction (which accesses the API via cuda.bindings), the CtypesFunction path will fail if ever used. This should be addressed by either removing the dormant ctypes code path or adding the missing API prototype.

Confidence Score: 3/5

  • This PR is generally safe to merge with one critical issue that needs attention
  • Score reflects that while the main implementation path (CudaPythonFunction) appears correct and well-tested, there's a logical error in the CtypesFunction implementation that calls an undefined API function. This would cause runtime failures if the ctypes code path is ever used. The default code path uses CudaPythonFunction, so this may not impact most users immediately, but it's a correctness issue that should be addressed.
  • Pay special attention to numba_cuda/numba/cuda/cudadrv/driver.py - the CtypesFunction implementation references an undefined API function

Important Files Changed

File Analysis

Filename Score Overview
numba_cuda/numba/cuda/cudadrv/driver.py 3/5 Added set_shared_memory_carveout method to Function classes with validation, but CtypesFunction implementation references undefined API function
numba_cuda/numba/cuda/tests/cudadrv/test_cuda_driver.py 5/5 Comprehensive test added covering valid/invalid carveout values and kernel execution after configuration

Contributor

@greptile-apps greptile-apps bot left a comment


Additional Comments (1)

  1. numba_cuda/numba/cuda/cudadrv/driver.py, line 2400 (link)

    logic: cuFuncSetAttribute is not defined in API_PROTOTYPES in drvapi.py. If CtypesFunction is ever used, this will fail with CudaDriverError: Driver missing function: cuFuncSetAttribute. While the default code path uses CudaPythonFunction, the CtypesFunction path should either be removed or have the API prototype added to drvapi.py:

2 files reviewed, 1 comment


@gmarkall gmarkall added 4 - Waiting on reviewer Waiting for reviewer to respond to author and removed 4 - Waiting on author Waiting for author to respond to review labels Dec 9, 2025
@gmarkall
Contributor

gmarkall commented Dec 9, 2025

> 1. `numba_cuda/numba/cuda/cudadrv/driver.py`, line 2400 ([link](/nvidia/numba-cuda/blob/683e774faebe7254a870818c648c556f24554cf5/numba_cuda/numba/cuda/cudadrv/driver.py#L2400))
>    **logic:** `cuFuncSetAttribute` is not defined in `API_PROTOTYPES` in `drvapi.py`. If `CtypesFunction` is ever used, this will fail with `CudaDriverError: Driver missing function: cuFuncSetAttribute`. While the default code path uses `CudaPythonFunction`, the `CtypesFunction` path should either be removed or have the API prototype added to `drvapi.py`.

This is dead code, so it won't be hit. We need to remove it at some point.
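If the prototype route were ever taken instead, the missing entry might look roughly like this. This is a sketch only: the signature comes from the CUDA driver API (`CUresult cuFuncSetAttribute(CUfunction hfunc, CUfunction_attribute attrib, int value)`), and the tuple layout assumes the `(return type, argument types...)` convention used by `API_PROTOTYPES` in `drvapi.py`:

```python
from ctypes import c_int, c_void_p

# drvapi.py represents driver handles as opaque pointers.
cu_function = c_void_p

# Hypothetical API_PROTOTYPES entry for the missing function:
# CUresult cuFuncSetAttribute(CUfunction hfunc,
#                             CUfunction_attribute attrib, int value)
API_PROTOTYPES = {
    "cuFuncSetAttribute": (c_int, cu_function, c_int, c_int),
}
```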

@gmarkall
Contributor

gmarkall commented Dec 9, 2025

/ok to test 683e774

@gmarkall gmarkall merged commit 2d151db into NVIDIA:main Dec 9, 2025
141 of 142 checks passed
@gmarkall
Contributor

gmarkall commented Dec 9, 2025

Many thanks @kaeun97!

gmarkall added a commit to gmarkall/numba-cuda that referenced this pull request Dec 17, 2025
- Fix NVIDIA#624: Accept Numba IR nodes in all places Numba-CUDA IR nodes are expected (NVIDIA#643)
- Fix Issue NVIDIA#588: separate compilation of NVVM IR modules when generating debuginfo (NVIDIA#591)
- feat: allow printing nested tuples (NVIDIA#667)
- build(deps): bump actions/setup-python from 5.6.0 to 6.1.0 (NVIDIA#655)
- build(deps): bump actions/upload-artifact from 4 to 5 (NVIDIA#652)
- Test RAPIDS 25.12 (NVIDIA#661)
- Do not manually set DUMP_ASSEMBLY in `nvjitlink` tests (NVIDIA#662)
- feat: add print support for int64 tuples (NVIDIA#663)
- Only run dependabot monthly and open fewer PRs (NVIDIA#658)
- test: fix bogus `self` argument to `Context` (NVIDIA#656)
- Fix false negative NRT link decision when NRT was previously toggled on (NVIDIA#650)
- Add support for dependabot (NVIDIA#647)
- refactor: cull dead linker objects (NVIDIA#649)
- Migrate numba-cuda driver to use cuda.core.launch API (NVIDIA#609)
- feat: add set_shared_memory_carveout (NVIDIA#629)
- chore: bump version in pixi.toml (NVIDIA#641)
- refactor: remove devicearray code to reduce complexity (NVIDIA#600)
@gmarkall gmarkall mentioned this pull request Dec 17, 2025
gmarkall added a commit that referenced this pull request Dec 17, 2025
- Capture global device arrays in kernels and device functions (#666)
- Fix #624: Accept Numba IR nodes in all places Numba-CUDA IR nodes are expected (#643)
- Fix Issue #588: separate compilation of NVVM IR modules when generating debuginfo (#591)
- feat: allow printing nested tuples (#667)
- build(deps): bump actions/setup-python from 5.6.0 to 6.1.0 (#655)
- build(deps): bump actions/upload-artifact from 4 to 5 (#652)
- Test RAPIDS 25.12 (#661)
- Do not manually set DUMP_ASSEMBLY in `nvjitlink` tests (#662)
- feat: add print support for int64 tuples (#663)
- Only run dependabot monthly and open fewer PRs (#658)
- test: fix bogus `self` argument to `Context` (#656)
- Fix false negative NRT link decision when NRT was previously toggled on (#650)
- Add support for dependabot (#647)
- refactor: cull dead linker objects (#649)
- Migrate numba-cuda driver to use cuda.core.launch API (#609)
- feat: add set_shared_memory_carveout (#629)
- chore: bump version in pixi.toml (#641)
- refactor: remove devicearray code to reduce complexity (#600)
ZzEeKkAa added a commit to ZzEeKkAa/numba-cuda that referenced this pull request Jan 8, 2026
v0.23.0
