Conversation

@rparolin (Collaborator) commented Sep 29, 2025

Skip IPC mempool tests on WSL

  • Skip IPC mempool tests on WSL using pytest.mark.skipif(IS_WSL, ...).

On WSL2, cuMemPoolCreate with POSIX handle returns CUDA_ERROR_INVALID_VALUE despite capability flags indicating support. This is a platform/driver limitation, not a bug in our code.

I confirmed the failures are driver-related rather than a bug in our code. The host is WSL2 (kernel “microsoft-standard-WSL2”) with NVIDIA driver 581.15 (CUDA 13.0). Device attributes report mempools and POSIX FD handle support, yet a minimal, direct driver repro calling cuMemPoolCreate with handleTypes=CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR consistently returns CUDA_ERROR_INVALID_VALUE, while the same call with handleTypes=CU_MEM_HANDLE_TYPE_NONE succeeds.

# Minimal direct-driver repro of IPC mempool creation behavior

try:
    from cuda.bindings import driver
except ImportError:
    from cuda import cuda as driver

driver.cuInit(0)

loc = driver.CUmemLocation()
loc.type = driver.CUmemLocationType.CU_MEM_LOCATION_TYPE_DEVICE
loc.id = 0

def create_pool(handle_type):
    props = driver.CUmemPoolProps()
    props.allocType = driver.CUmemAllocationType.CU_MEM_ALLOCATION_TYPE_PINNED
    props.handleTypes = handle_type
    props.location = loc
    props.maxSize = 2_097_152  # 2 MiB
    props.win32SecurityAttributes = 0
    props.usage = 0
    # cuda-python returns a (CUresult, result) tuple from each driver call
    return driver.cuMemPoolCreate(props)

print("POSIX_FD:", create_pool(driver.CUmemAllocationHandleType.CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR)[0].name)
print("NONE    :", create_pool(driver.CUmemAllocationHandleType.CU_MEM_HANDLE_TYPE_NONE)[0].name)

Output:

POSIX_FD: CUDA_ERROR_INVALID_VALUE
NONE    : CUDA_SUCCESS

Because this reproduces outside our code with the same CUmemPoolProps we set, the issue lies in the driver/runtime path under WSL2 (consistent with known IPC limitations there), not our implementation. Therefore, we skip IPC mempool tests on WSL to keep the suite portable and green, while leaving the tests enabled on native Linux and other environments where the driver accepts IPC pools.

@copy-pr-bot (Contributor) commented Sep 29, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@rparolin rparolin marked this pull request as draft September 29, 2025 15:19
@rparolin rparolin requested a review from leofang September 29, 2025 15:42
@rparolin (Collaborator, Author) commented Sep 29, 2025

Wow, you both jumped on this PR fast. LOL. I had to double check to make sure I didn't mark this as ready by mistake.

@rparolin rparolin requested a review from cpcloud September 29, 2025 15:56
@leofang (Member) left a comment


In general we avoid skipping tests in cuda.core based only on the platform or other hard-coded conditions. What we should do is query the driver and see whether a functionality is supported. The reason is that the driver can be updated to gain new capabilities, and hard-coded checks like this would not keep up.

@rparolin could you check what the returned values are on WSL?

>>> dev = Device()
>>> dev.properties.mempool_supported_handle_types
9       
>>> dev.properties.handle_type_posix_file_descriptor_supported
True
>>> dev.properties.handle_type_win32_handle_supported
False
>>> dev.properties.handle_type_win32_kmt_handle_supported
False

It could be that our existing skip condition is not sufficient:

@pytest.fixture(scope="function")
def ipc_device():
    """Obtains a device suitable for IPC-enabled mempool tests, or skips."""
    # Check if IPC is supported on this platform/device
    device = Device()
    device.set_current()
    if not device.properties.memory_pools_supported:
        pytest.skip("Device does not support mempool operations")
    # Note: Linux specific. Once Windows support for IPC is implemented, this
    # test should be updated.
    if not device.properties.handle_type_posix_file_descriptor_supported:
        pytest.skip("Device does not support IPC")
    return device

cpcloud previously approved these changes Sep 29, 2025
@rwgk (Collaborator) commented Sep 29, 2025

> could you check what the returned values are on WSL?

This is what I'm seeing on my Windows 11 24H2 WSL2 Ubuntu 24.04 workstation:

>>> dev = Device()
>>> dev.properties.mempool_supported_handle_types
0
>>> dev.properties.handle_type_posix_file_descriptor_supported
True
>>> dev.properties.handle_type_win32_handle_supported
False
>>> dev.properties.handle_type_win32_kmt_handle_supported
False

Using current cuda-python main (at commit dbde2b4).

Ubuntu side:

(WslLocalCudaVenv) rwgk-win11.localdomain:~/forked/cuda-python $ nvidia-smi
Mon Sep 29 11:21:40 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.02              Driver Version: 581.15         CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A6000               On  |   00000000:C1:00.0 Off |                  Off |
| 30%   29C    P8             20W /  300W |    1652MiB /  49140MiB |      3%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Windows side:

PS C:\Users\rgrossekunst> nvidia-smi
Mon Sep 29 11:22:45 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 581.15                 Driver Version: 581.15         CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A6000             WDDM  |   00000000:C1:00.0 Off |                  Off |
| 30%   31C    P8             42W /  300W |    1729MiB /  49140MiB |      3%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

@leofang (Member) commented Sep 29, 2025

>>> dev.properties.mempool_supported_handle_types
0
>>> dev.properties.handle_type_posix_file_descriptor_supported
True

mempool_supported_handle_types seems to be correct (it is a bitmask encoding the supported handle types), and handle_type_posix_file_descriptor_supported is where it went south. @Andy-Jost let's check whether we messed up the implementation of handle_type_posix_file_descriptor_supported, or whether the WSL driver is being inconsistent here.
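For illustration, a bitmask check like the one described could look as follows. The bit values are assumed to match the CUDA driver's CUmemAllocationHandleType enum (POSIX_FILE_DESCRIPTOR = 0x1, WIN32 = 0x2, WIN32_KMT = 0x4, FABRIC = 0x8), so the native-Linux mask of 9 above would decode as POSIX FD plus fabric handles, while the WSL2 mask of 0 would decode as no supported handle types:

```python
# Assumed CUmemAllocationHandleType bit values (from the CUDA driver API):
POSIX_FILE_DESCRIPTOR = 0x1
WIN32 = 0x2
WIN32_KMT = 0x4
FABRIC = 0x8


def handle_type_supported(mask: int, handle_type: int) -> bool:
    # A handle type is supported iff its bit is set in the mask.
    return bool(mask & handle_type)


# Native Linux report above: mask = 9 = POSIX_FILE_DESCRIPTOR | FABRIC
assert handle_type_supported(9, POSIX_FILE_DESCRIPTOR) is True
# WSL2 report above: mask = 0, so no handle type should be reported as supported
assert handle_type_supported(0, POSIX_FILE_DESCRIPTOR) is False
```

If handle_type_posix_file_descriptor_supported were derived from the mask this way, it could not return True when the mask is 0, which is why the two properties disagreeing points at the implementation or the driver.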

@rparolin (Collaborator, Author) commented

/ok to test 57ead02

@rparolin (Collaborator, Author) commented

/ok to test 0a9be40

@rparolin (Collaborator, Author) commented

/ok to test f7deec9

@rparolin (Collaborator, Author) commented Oct 8, 2025

/ok to test 90590ec

@leofang (Member) left a comment


@kkraus14 Sorry, but I think making this an installable package is very poor UX. It means every new contributor would stumble during their first test run and have to learn that there is a separate package that must be installed locally first.

Also, by doing so we need to touch the CI workflows so that it is installed before testing. We seem to be shooting ourselves in the foot without justifiable gain?

@rparolin (Collaborator, Author) commented Oct 8, 2025

> @kkraus14 Sorry, but I think making this an installable package is very poor UX. It means every new contributor would stumble during their first test run and have to learn that there is a separate package that must be installed locally first.
>
> Also, by doing so we need to touch the CI workflows so that it is installed before testing. We seem to be shooting ourselves in the foot without justifiable gain?

I had similar thoughts, but figured it could be mitigated by our script/run_tests.sh script, which can auto-install for you. My tendency is to have my build/dev environment tooling script away all the complexity so others don't step on landmines, but I haven't developed an intuition for what is [un]familiar to Python programmers yet.

@rparolin (Collaborator, Author) commented Oct 8, 2025

/ok to test cd6b714

@leofang (Member) commented Oct 8, 2025

The CI shows what I noted earlier:

> Also, by doing so we need to touch the CI workflows so that it is installed before testing.

@rparolin (Collaborator, Author) commented Oct 8, 2025

/ok to test baac405

@leofang (Member) commented Oct 8, 2025

/ok to test bbe82c8

@leofang (Member) commented Oct 8, 2025

/ok to test 7726e05

@rparolin rparolin enabled auto-merge (squash) October 8, 2025 22:15
@leofang leofang added this to the cuda.core beta 7 milestone Oct 8, 2025
@leofang (Member) commented Oct 8, 2025

I pushed commit bbe82c8 so that we don't get blocked by this package discussion. If the helper module is installed as a package, it's used. Otherwise, we find it via relative path. We can revisit this discussion later and find a way to avoid friction with local development and CI testing.

@kkraus14 (Collaborator) commented Oct 8, 2025

I agree that packaging this as a separate package is a poor developer UX. Maybe we should introduce a cuda.core.testing module or something similar?

Regardless, we can do that in a follow up PR.

@leofang leofang disabled auto-merge October 9, 2025 00:00
@leofang leofang merged commit 7028804 into main Oct 9, 2025
71 checks passed
@leofang leofang deleted the rparolin/skip_ipc_on_wsl branch October 9, 2025 00:01
@github-actions bot commented Oct 9, 2025

Doc Preview CI
Preview removed because the pull request was closed or merged.


Labels

  • cuda.core: Everything related to the cuda.core module
  • enhancement: Any code-related improvements
  • P1: Medium priority - Should do
  • test: Improvements or additions to tests

8 participants