[NVSHMEM] Extend CUDA backend to compile and link TIR modules with NVSHMEM #18093

Kathryn-cat · 2025-06-25T18:37:53Z

This PR enables native NVSHMEM compilation support.

tqchen · 2025-06-25T18:47:27Z

CMakeLists.txt

    message(FATAL_ERROR "Cannot find NVSHMEM, USE_NVSHMEM=" ${USE_NVSHMEM})
  endif()
+  set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -I${NVSHMEM_INCLUDE_DIR} -L${NVSHMEM_LIB_DIR}")
+  set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -I${NVSHMEM_INCLUDE_DIR} -L${NVSHMEM_LIB_DIR}")


Use the CMake primitive target_include_directories instead of setting CMAKE_CXX_FLAGS and CMAKE_CUDA_FLAGS.

tqchen · 2025-06-25T18:48:33Z

python/tvm/runtime/disco/session.py

            The loaded VM module.
        """
-        if device is None:
-            device = Device(device_type=0, device_id=0)


is this change intentional?

Yes. In the latest DLPack, the DLDeviceType enum values run from 1 to 17, so Device(device_type=0, device_id=0) would raise an unrecognized-device-type error. Since that value was only meant to indicate a null device, I replaced the subsequent usage with the Optional<Device> type; see the changes to UseDefaultDeviceIfNone.
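To illustrate the idea of replacing a sentinel device with an optional, here is a minimal C++ sketch. The `Device` struct and the two-argument signature of `UseDefaultDeviceIfNone` are assumptions made for illustration; the actual TVM types and signature may differ.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-in for a DLPack-style device; not TVM's actual Device class.
struct Device {
  int device_type;  // DLPack device types start at 1 (kDLCPU); 0 is not a valid value
  int device_id;
};

// Hypothetical helper named after UseDefaultDeviceIfNone from the PR discussion.
// An empty optional means "no device given", so no invalid sentinel is needed.
inline Device UseDefaultDeviceIfNone(const std::optional<Device>& device,
                                     const Device& default_device) {
  return device.has_value() ? *device : default_device;
}
```

Calling `UseDefaultDeviceIfNone(std::nullopt, default_device)` returns the default, while passing a concrete `Device` returns that device unchanged.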

tqchen · 2025-06-25T18:50:17Z

please include one testcase that tests the basic functionality such as just calling the get worker id

jinhongyii · 2025-06-25T19:20:02Z

src/runtime/contrib/nvshmem/init.cc

+  CUmodule mod = static_cast<CUmodule>(cuModule);
+  auto status = nvshmemx_init_status();
+  // The NVSHMEM library must have completed device initialization prior to
+  // nvshmemx_cumodule_init. If not, we skip the cumodule initialization.


If the device is not initialized, we should return with an error.

The design here is to enable NVSHMEM compilation and linking broadly for every kernel, including kernels whose NVSHMEM context is not initialized and that do not use NVSHMEM.

In that case, nvshmemx_init_status() is used to check whether nvshmemx_cumodule_init needs to be called. If the device is not initialized, we just skip nvshmemx_cumodule_init.
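The skip-if-uninitialized behavior described above can be sketched as follows. This is not the code from init.cc: the `InitStatus` enum, the `MaybeInitModule` name, and the injected callbacks are hypothetical stand-ins for nvshmemx_init_status and nvshmemx_cumodule_init, so that the sketch compiles without NVSHMEM installed.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>

// Hypothetical stand-in for the status reported by nvshmemx_init_status();
// the real library defines its own constants.
enum class InitStatus { kNotInitialized, kBootstrapped, kInitialized };

// Returns true if the per-module init hook was invoked, false if it was skipped.
// `query_status` stands in for nvshmemx_init_status and `module_init` for
// nvshmemx_cumodule_init; both are injected so the sketch runs without NVSHMEM.
inline bool MaybeInitModule(const std::function<InitStatus()>& query_status,
                            const std::function<int()>& module_init) {
  if (query_status() < InitStatus::kInitialized) {
    // Device-side NVSHMEM state is not ready, so the module is assumed not to
    // use NVSHMEM and the per-module initialization is skipped.
    return false;
  }
  if (module_init() != 0) {
    throw std::runtime_error("per-module NVSHMEM initialization failed");
  }
  return true;
}
```

The trade-off raised in the review still applies: skipping silently keeps non-NVSHMEM kernels working, whereas returning an error would surface a missing initialization earlier.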

spectrometerHBH · 2025-06-25T19:36:40Z

src/target/source/codegen_cuda.cc

  decl_stream << "#define TVM_ENABLE_L2_PREFETCH 0\n";
  decl_stream << "#endif\n";

-  decl_stream << "\n#ifdef _WIN32\n";


Why change this?

This is because NVSHMEM contains #include <cstdint>, which conflicts with the original #define int64_t long long and could lead to a CUDA compilation error. The #define semantics are quite error-prone, so I removed it and just use using instead.
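A small C++ sketch of why type aliases are safer than macros here. It is an illustration of the general technique, not the code emitted by codegen_cuda.cc; the `sketch` namespace, the alias names, and `WidenAndAdd` are assumptions for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// A macro such as `#define int64_t long long` is plain token substitution: if it
// is defined before <cstdint> is included, it also rewrites the declarations
// inside that header. Type aliases are ordinary declarations, so they coexist
// with the standard header.
namespace sketch {  // hypothetical namespace, used only to avoid name clashes
using uint = unsigned int;
using uchar = unsigned char;
using ushort = unsigned short;

static_assert(sizeof(std::int64_t) == 8, "int64_t is provided by <cstdint>");
static_assert(std::is_same<uchar, unsigned char>::value, "alias names the same type");

// Mixes the short aliases with the standard fixed-width type.
inline std::int64_t WidenAndAdd(uint a, ushort b) {
  return static_cast<std::int64_t>(a) + static_cast<std::int64_t>(b);
}
}  // namespace sketch
```

For example, `sketch::WidenAndAdd(4000000000u, 65535)` performs the addition in 64 bits, so the sum does not wrap around as it would in a 32-bit unsigned type.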

Kathryn-cat · 2025-06-25T20:44:14Z

please include one testcase that tests the basic functionality such as just calling the get worker id

Added a test case under test_nvshmem.py

Kathryn-cat · 2025-06-25T20:50:57Z

@tvm-bot rerun

github-actions · 2025-06-25T20:55:56Z

Failed to re-run CI in https://github.com/apache/tvm/actions/runs/15886908209

Traceback (most recent call last):
  File "/home/runner/work/tvm/tvm/ci/scripts/github/github_tvmbot.py", line 591, in comment_failure
    raise item
  File "/home/runner/work/tvm/tvm/ci/scripts/github/github_tvmbot.py", line 697, in run
    pr.rerun_jenkins_ci()
  File "/home/runner/work/tvm/tvm/ci/scripts/github/github_tvmbot.py", line 550, in rerun_jenkins_ci
    post(url, auth=("tvm-bot", TVM_BOT_JENKINS_TOKEN))
  File "/home/runner/work/tvm/tvm/ci/scripts/jenkins/git_utils.py", line 53, in post
    with request.urlopen(req, data) as response:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/urllib/request.py", line 215, in urlopen
    return opener.open(url, data, timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/urllib/request.py", line 515, in open
    response = self._open(req, data)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/urllib/request.py", line 532, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/urllib/request.py", line 492, in _call_chain
    result = func(*args)
             ^^^^^^^^^^^
  File "/usr/lib/python3.12/urllib/request.py", line 1392, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/urllib/request.py", line 1348, in do_open
    r = h.getresponse()
        ^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/http/client.py", line 1428, in getresponse
    response.begin()
  File "/usr/lib/python3.12/http/client.py", line 331, in begin
    version, status, reason = self._read_status()
                              ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/http/client.py", line 300, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

Kathryn-cat · 2025-06-25T21:44:29Z

@tvm-bot rerun

github-actions · 2025-06-25T21:49:28Z

Failed to re-run CI in https://github.com/apache/tvm/actions/runs/15887848436

tqchen · 2025-06-26T12:30:25Z

@tvm-bot rerun

init

c3d74bb

tqchen requested changes Jun 25, 2025

tqchen reviewed Jun 25, 2025

jinhongyii reviewed Jun 25, 2025

spectrometerHBH reviewed Jun 25, 2025

Kathryn-cat added 2 commits June 25, 2025 16:01

address comments

8ed4cb0

test case

35a5767

Kathryn-cat requested a review from tqchen June 25, 2025 23:53

Kathryn-cat added 2 commits June 26, 2025 15:13

fix

43baac4

fix

5a6a36a

tqchen approved these changes Jun 26, 2025

tqchen merged commit a40f73f into apache:main Jun 26, 2025
10 checks passed

ysh329 mentioned this pull request Jul 16, 2025

[Release] v0.21.0 Release Candidate Notes #18150

Closed

ShiboXing pushed a commit to ShiboXing/tvm that referenced this pull request Aug 10, 2025

[NVSHMEM] Extend CUDA backend to compile and link TIR modules with NV…

c2af0d8

…SHMEM (apache#18093)
