Contributor

@Kathryn-cat Kathryn-cat commented Jun 25, 2025

This PR enables the native NVSHMEM compilation support.

CMakeLists.txt Outdated
message(FATAL_ERROR "Cannot find NVSHMEM, USE_NVSHMEM=" ${USE_NVSHMEM})
endif()
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -I${NVSHMEM_INCLUDE_DIR} -L${NVSHMEM_LIB_DIR}")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -I${NVSHMEM_INCLUDE_DIR} -L${NVSHMEM_LIB_DIR}")
Member

@tqchen tqchen Jun 25, 2025

Use the CMake primitive target_include_directories instead of setting CMAKE_CXX_FLAGS and CMAKE_CUDA_FLAGS.

Contributor Author

Changed.

The loaded VM module.
"""
if device is None:
    device = Device(device_type=0, device_id=0)
Member

Is this change intentional?

Contributor Author

@Kathryn-cat Kathryn-cat Jun 25, 2025

Yes. In the latest DLPack, DLDeviceType has enum values 1 to 17, so the value Device(device_type=0, device_id=0) would raise an unrecognized-device-type error. Since it was only meant to indicate a null value, I replaced the subsequent usage with an Optional<Device> type; see the changes to UseDefaultDeviceIfNone.
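To illustrate the idea of replacing an out-of-range sentinel with an explicit optional, here is a minimal Python sketch. It is not TVM's code: the `DeviceType` subset, the `Device` dataclass, and `use_default_device_if_none` are hypothetical stand-ins written only for this example, loosely mirroring the `Optional<Device>` and `UseDefaultDeviceIfNone` names mentioned above.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class DeviceType(IntEnum):
    """Illustrative subset of device type codes; 0 is deliberately not a member."""
    CPU = 1
    CUDA = 2


@dataclass(frozen=True)
class Device:
    """Hypothetical device record that validates its type code on construction."""
    device_type: int
    device_id: int

    def __post_init__(self) -> None:
        # Raises ValueError for codes outside the enum, so Device(0, 0)
        # can no longer be used as a "null" sentinel.
        DeviceType(self.device_type)


def use_default_device_if_none(device: Optional[Device], default: Device) -> Device:
    """Return the caller's device, or the default when no device was given."""
    return default if device is None else device
```

With this shape, "no device" is expressed as `None` rather than as a device value whose type code is invalid, so the validation in the constructor never has to special-case a sentinel.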

Member

tqchen commented Jun 25, 2025

Please include one test case that exercises the basic functionality, such as simply getting the worker id.

CUmodule mod = static_cast<CUmodule>(cuModule);
auto status = nvshmemx_init_status();
// The NVSHMEM library must have completed device initialization prior to
// nvshmemx_cumodule_init. If not, we skip the cumodule initialization.
Contributor

If the device is not initialized, we should return with an error.

Contributor Author

The design here is to enable NVSHMEM compilation and linking broadly for every kernel, including kernels whose NVSHMEM context is not initialized and that do not use NVSHMEM at all.

In that case, nvshmemx_init_status() is used to check whether nvshmemx_cumodule_init needs to be called. If NVSHMEM has not completed device initialization, we simply skip nvshmemx_cumodule_init.
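The two positions in this thread (skip silently versus return an error) can be sketched side by side. The following Python is only a model of the control flow, not the actual C++ in this PR and not the NVSHMEM API: `InitStatus`, `maybe_init_module`, and the `strict` flag are hypothetical names, and the status ordering is an assumption made for the example.

```python
from enum import IntEnum
from typing import Callable


class InitStatus(IntEnum):
    """Hypothetical status codes, ordered from least to most initialized."""
    NOT_INITIALIZED = 0
    BOOTSTRAPPED = 1
    INITIALIZED = 2


def maybe_init_module(
    module: object,
    get_status: Callable[[], int],
    init_module: Callable[[object], None],
    strict: bool = False,
) -> bool:
    """Run per-module initialization only when the runtime is ready.

    Returns True if init_module was called, False if it was skipped.
    With strict=True, an uninitialized runtime raises instead of skipping.
    """
    if get_status() < InitStatus.INITIALIZED:
        if strict:
            raise RuntimeError("runtime is not initialized")
        return False  # module does not need the runtime; skip quietly
    init_module(module)
    return True
```

The default (`strict=False`) corresponds to the design described above, where every module is linked against NVSHMEM but only modules loaded after initialization get the per-module setup. The `strict=True` path corresponds to the reviewer's suggestion.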

decl_stream << "#define TVM_ENABLE_L2_PREFETCH 0\n";
decl_stream << "#endif\n";

decl_stream << "\n#ifdef _WIN32\n";
Contributor

Why change this?

Contributor Author

It's because NVSHMEM contains #include <cstdint>, which conflicts with the original #define int64_t long long and can lead to CUDA compilation errors. The #define approach is quite error-prone, so I removed it and used a using declaration instead.

@Kathryn-cat
Contributor Author

please include one testcase that tests the basic functionality such as just calling the get worker id

Added a test case under test_nvshmem.py

@Kathryn-cat
Contributor Author

@tvm-bot rerun

@github-actions
Contributor

Failed to re-run CI in https://github.com/apache/tvm/actions/runs/15886908209

Traceback (most recent call last):
  File "/home/runner/work/tvm/tvm/ci/scripts/github/github_tvmbot.py", line 591, in comment_failure
    raise item
  File "/home/runner/work/tvm/tvm/ci/scripts/github/github_tvmbot.py", line 697, in run
    pr.rerun_jenkins_ci()
  File "/home/runner/work/tvm/tvm/ci/scripts/github/github_tvmbot.py", line 550, in rerun_jenkins_ci
    post(url, auth=("tvm-bot", TVM_BOT_JENKINS_TOKEN))
  File "/home/runner/work/tvm/tvm/ci/scripts/jenkins/git_utils.py", line 53, in post
    with request.urlopen(req, data) as response:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/urllib/request.py", line 215, in urlopen
    return opener.open(url, data, timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/urllib/request.py", line 515, in open
    response = self._open(req, data)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/urllib/request.py", line 532, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/urllib/request.py", line 492, in _call_chain
    result = func(*args)
             ^^^^^^^^^^^
  File "/usr/lib/python3.12/urllib/request.py", line 1392, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/urllib/request.py", line 1348, in do_open
    r = h.getresponse()
        ^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/http/client.py", line 1428, in getresponse
    response.begin()
  File "/usr/lib/python3.12/http/client.py", line 331, in begin
    version, status, reason = self._read_status()
                              ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/http/client.py", line 300, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

@Kathryn-cat
Contributor Author

@tvm-bot rerun

@github-actions
Contributor

Failed to re-run CI in https://github.com/apache/tvm/actions/runs/15887848436

http.client.RemoteDisconnected: Remote end closed connection without response

@Kathryn-cat Kathryn-cat requested a review from tqchen June 25, 2025 23:53
Member

tqchen commented Jun 26, 2025

@tvm-bot rerun

@tqchen tqchen merged commit a40f73f into apache:main Jun 26, 2025
10 checks passed
ShiboXing pushed a commit to ShiboXing/tvm that referenced this pull request Aug 10, 2025