Integration of GUSLI plugin #494

Open

danielhe-nvidia wants to merge 2 commits into ai-dynamo:main from danielhe-nvidia:main

Contributor

danielhe-nvidia commented Jun 23, 2025

Change-Id: I05e7b35eb3076142b4dcc1f58a68865d5a5962c0

What?

Review only: do not merge, as the standalone GUSLI repository does not exist yet.
Support transfer of 1 element in a list.
Support transfer of N elements in a list, with or without an extra scatter-gather element.
Support the above to multiple block devices in one transfer.

Why?

Allow using GUSLI access (to a server or bdevs) via NIXL.

pull-request-size bot added the size/XXL label


github-actions bot added the external-contribution label

danielhe-nvidia force-pushed the main branch from f03072d to baa76b9 Compare

June 24, 2025 11:07

w1ldptr requested changes


src/plugins/meson.build (outdated, resolved)
src/plugins/gusli/meson.build (outdated, resolved)
src/plugins/gusli/gusli_plugin.cpp (resolved)
src/plugins/gusli/gusli_plugin.cpp (outdated, resolved)
src/plugins/gusli/gusli_backend.h (outdated, resolved)
src/plugins/gusli/gusli_backend.cpp (outdated, resolved)
src/plugins/gusli/gusli_backend.cpp (outdated, resolved)

src/plugins/gusli/gusli_backend.cpp Outdated

    
                __LOG_RETERR(NIXL_ERR_INVALID_PARAM, "missing SGL, or SGL too small 0x%lx[b]", local[0].len);
            }
        } else {
            unsigned i = (has_sgl_mem ? 1 : 0); // If supplied sgl, can't use it for now, just ignore it

Contributor

w1ldptr Jun 25, 2025

If I'm reading this correctly the code here assumes that the first descriptor is "special" memory to be used internally by Gusli for optimization when '-sgl' custom param is set. @vvenkates27 what do you think about exposing such API toggle?

src/plugins/gusli/gusli_backend.cpp (outdated, resolved)

src/plugins/gusli/gusli_backend.cpp Outdated

    
        std::vector<class nixlGusliBackendReqH> child; // Array of sub completions, possibly a tree, though tested only 2 level io.
        static void completion_cb(nixlGusliBackendReqH *c) {
            __LOG_IO(c, "_done, rv=%d", c->io.get_error());
            c->pollable_async_rv = c->io.get_error();

Contributor

w1ldptr Jun 25, 2025

Just to double check: This callback is always called from the same thread that initialized the request handle?

Contributor Author

danielhe-nvidia Jul 6, 2025

No, by definition. It is called from an internal GUSLI thread. However, there is a cancellation point: destroying the IO calls cancel, which guarantees that the callback will not arrive afterwards.
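The pattern described in this reply can be sketched as follows. This is a minimal illustration, not the plugin's code: `Request`, `submit`, `poll`, `cancel` and `kInProgress` are hypothetical names, a `std::thread` plays the role of the internal GUSLI thread, and joining that thread stands in for whatever cancellation GUSLI actually performs when the IO is destroyed.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

constexpr int kInProgress = 1; // hypothetical "still running" marker

// Hypothetical stand-in for a request handle such as nixlGusliBackendReqH.
class Request {
public:
    Request() : rv_(kInProgress) {}
    // Destroying the handle goes through cancel(), so no callback can
    // touch the object after it is gone.
    ~Request() { cancel(); }

    // Starts a worker that plays the role of the internal GUSLI thread.
    // The sketch supports a single submit per handle.
    void submit(int io_error) {
        worker_ = std::thread([this, io_error] { completionCb(this, io_error); });
    }

    // Polled from the thread that owns the handle.
    int poll() const { return rv_.load(std::memory_order_acquire); }

    // Stand-in for the cancellation point: once it returns, the callback
    // has either already run or will never run.
    void cancel() {
        if (worker_.joinable()) worker_.join();
    }

private:
    // Runs on the worker thread, not on the thread that created the handle.
    static void completionCb(Request *r, int io_error) {
        r->rv_.store(io_error, std::memory_order_release);
    }

    std::atomic<int> rv_;
    std::thread worker_;
};
```

Because the result is written on one thread and read on another, the sketch stores it in a `std::atomic<int>`; a plain `int` field would be a data race under the same usage.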

w1ldptr mentioned this pull request

DOCS: Add Backend Developer Guide #515

Merged

danielhe-nvidia force-pushed the main branch from baa76b9 to 0f97db8 Compare

July 3, 2025 10:31

w1ldptr requested changes


src/plugins/gusli/meson.build (outdated, resolved)
src/plugins/gusli/gusli_backend.h (outdated, resolved)
src/plugins/gusli/gusli_backend.cpp (outdated, resolved)
src/plugins/gusli/gusli_backend.cpp (resolved)
src/plugins/gusli/gusli_backend.cpp (outdated, resolved)
src/plugins/gusli/gusli_backend.cpp (outdated, resolved)
src/plugins/gusli/gusli_backend.cpp (resolved)
src/plugins/gusli/gusli_backend.cpp (outdated, resolved)
src/plugins/gusli/gusli_backend.cpp (outdated, resolved)
src/plugins/gusli/gusli_backend.h (outdated, resolved)

danielhe-nvidia force-pushed the main branch 2 times, most recently from 81896b2 to 54cab9a Compare

July 8, 2025 13:14

w1ldptr requested changes


src/plugins/gusli/gusli_backend.cpp (outdated, resolved)

src/plugins/gusli/gusli_backend.cpp

    
                    std::stoi(backParams->at("max_num_simultaneous_requests"));
            if (backParams->count("config_file") > 0)
                gusli_params.config_file = backParams->at("config_file").c_str();
        }

Contributor

w1ldptr Jul 14, 2025

Extract gusli::global_clnt_context::init_params into a helper function...

src/plugins/gusli/gusli_backend.cpp Outdated

    
            if (backParams->count("config_file") > 0)
                gusli_params.config_file = backParams->at("config_file").c_str();
        }
        lib_ = std::make_unique<gusli::global_clnt_raii>(gusli_params);

Contributor

w1ldptr Jul 14, 2025

... and move this initialization into the initializer list by passing the return value of the new helper to it directly instead of using a temporary. With this, you won't even need a smart pointer anymore, since you could just make lib_ a by-value class field.
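Taken together, the two comments above could look roughly like the sketch below. All types here are hypothetical stand-ins, since the real definitions are not shown in this thread: `InitParams` for `gusli::global_clnt_context::init_params`, `ClientRaii` for `gusli::global_clnt_raii`, `ParamMap` for the NIXL custom-parameter map, and `Engine` for `nixlGusliEngine`. Only the two parameter keys come from the snippet; the default value and the use of `std::string` for `config_file` are assumptions.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for gusli::global_clnt_context::init_params.
struct InitParams {
    int max_num_simultaneous_requests = 256; // assumed default, not from the PR
    std::string config_file;
};

// Hypothetical stand-in for gusli::global_clnt_raii.
struct ClientRaii {
    explicit ClientRaii(const InitParams &p) : params(p) {}
    InitParams params;
};

using ParamMap = std::map<std::string, std::string>;

// Helper that builds the init parameters from the optional custom params.
InitParams makeInitParams(const ParamMap *back_params) {
    InitParams p;
    if (!back_params) return p;
    auto it = back_params->find("max_num_simultaneous_requests");
    if (it != back_params->end()) p.max_num_simultaneous_requests = std::stoi(it->second);
    it = back_params->find("config_file");
    if (it != back_params->end()) p.config_file = it->second;
    return p;
}

// Hypothetical stand-in for nixlGusliEngine.
class Engine {
public:
    // The helper's return value feeds the member directly in the initializer list.
    explicit Engine(const ParamMap *back_params) : lib_(makeInitParams(back_params)) {}

    const InitParams &params() const { return lib_.params; }

private:
    ClientRaii lib_; // by-value member, so no std::unique_ptr is needed
};
```

With the member held by value, the constructor body no longer has to build a temporary and then allocate, and the member can never be null.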

src/plugins/gusli/gusli_backend.cpp

    
        if (rv == gusli::connect_rv::C_OK) return NIXL_SUCCESS;
        if (rv == gusli::connect_rv::C_NO_DEVICE) return NIXL_ERR_NOT_FOUND;
        if (rv == gusli::connect_rv::C_WRONG_ARGUMENTS) return NIXL_ERR_INVALID_PARAM;
        return NIXL_ERR_BACKEND;

Contributor

w1ldptr Jul 14, 2025

Use switch here.
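A sketch of the same mapping written as a switch. The enums below are local stand-ins so the example compiles on its own: the enumerator names are taken from the snippet above, but the numeric values and the extra `C_OTHER_FAILURE` case are invented for illustration, and `toNixlStatus` is a hypothetical function name.

```cpp
#include <cassert>

// Local stand-ins for gusli::connect_rv and nixl_status_t.
// Numeric values and C_OTHER_FAILURE are assumptions, not the real definitions.
enum class connect_rv { C_OK, C_NO_DEVICE, C_WRONG_ARGUMENTS, C_OTHER_FAILURE };

enum nixl_status_t {
    NIXL_SUCCESS = 0,
    NIXL_ERR_INVALID_PARAM = -1,
    NIXL_ERR_BACKEND = -2,
    NIXL_ERR_NOT_FOUND = -3
};

// Same mapping as the if-chain, expressed as a switch.
nixl_status_t toNixlStatus(connect_rv rv) {
    switch (rv) {
    case connect_rv::C_OK:
        return NIXL_SUCCESS;
    case connect_rv::C_NO_DEVICE:
        return NIXL_ERR_NOT_FOUND;
    case connect_rv::C_WRONG_ARGUMENTS:
        return NIXL_ERR_INVALID_PARAM;
    default:
        // Any other connect result is reported as a generic backend error.
        return NIXL_ERR_BACKEND;
    }
}
```

Dropping the `default` label and returning `NIXL_ERR_BACKEND` after the switch instead would let compilers that warn on unhandled enumerators flag any value added to the enum later.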

src/plugins/gusli/gusli_backend.cpp

    
        assert(lib_->BREAKING_VERSION == 1);
    }
    nixlGusliEngine::~nixlGusliEngine() {}

Contributor

w1ldptr Jul 14, 2025

Remove this and use "=default" in the header instead.
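As a small illustration of this suggestion, a destructor declared as defaulted in the header replaces the empty out-of-line definition. `BackendEngine` and `GusliEngine` are hypothetical stand-ins for the NIXL base class and `nixlGusliEngine`; that the real base class has a virtual destructor is an assumption.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for the NIXL backend base class.
class BackendEngine {
public:
    virtual ~BackendEngine() = default;
};

// Hypothetical stand-in for nixlGusliEngine.
class GusliEngine : public BackendEngine {
public:
    // Declared in the header; no empty `GusliEngine::~GusliEngine() {}` in the .cpp.
    ~GusliEngine() override = default;
};

// The defaulted destructor still overrides the virtual base destructor.
static_assert(std::has_virtual_destructor<GusliEngine>::value,
              "GusliEngine must keep a virtual destructor");
```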

src/plugins/gusli/gusli_backend.cpp (outdated, resolved)

src/plugins/gusli/gusli_backend.cpp

    
    nixl_status_t
    nixlGusliEngine::deregisterMem(nixlBackendMD *_md) {
        nixlGusliMemReq *md = (nixlGusliMemReq *)_md;
        if (!md) __LOG_RETERR(NIXL_ERR_INVALID_PARAM, "md==null");

Contributor

w1ldptr Jul 14, 2025

This is redundant. Validate function arg instead, rename auto_deleter to md and use it to access the metadata in a safe manner.
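One way to read this comment: check the incoming argument first, then hand ownership straight to a `std::unique_ptr` named `md` and use that for all further access. The sketch below assumes that reading. `BackendMD`, `GusliMemReq`, `Status` and the `released_dev` out-parameter are hypothetical stand-ins for illustration, not the plugin's types.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-ins for nixlBackendMD, nixlGusliMemReq and nixl_status_t.
struct BackendMD {
    virtual ~BackendMD() = default;
};

struct GusliMemReq : BackendMD {
    int dev_id = 0;
};

enum Status { SUCCESS = 0, ERR_INVALID_PARAM = -1 };

// released_dev is an illustrative out-parameter so the effect can be observed.
Status deregisterMem(BackendMD *raw_md, int *released_dev = nullptr) {
    // Validate the function argument itself, before any cast.
    if (!raw_md) return ERR_INVALID_PARAM;

    // Take ownership immediately; the metadata is freed on every return path.
    std::unique_ptr<GusliMemReq> md(static_cast<GusliMemReq *>(raw_md));

    // All later access goes through the owning pointer.
    if (released_dev) *released_dev = md->dev_id;
    return SUCCESS;
}
```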

src/plugins/gusli/gusli_backend.cpp (outdated, resolved)
src/plugins/gusli/gusli_backend.cpp (resolved)

src/plugins/gusli/gusli_backend.cpp

    
    }
    }; // namespace
    nixlGusliEngine::nixlGusliEngine(const nixlBackendInitParams *nixlInit)

Contributor

w1ldptr Jul 14, 2025

Please consult the code style guide regarding variable naming conventions.

w1ldptr mentioned this pull request

clang-tidy: Add config to enforce naming style #575

Merged

aranadive self-requested a review

August 4, 2025 20:43

danielhe-nvidia requested a review from a team as a code owner

August 5, 2025 13:00

w1ldptr reviewed


test/unit/plugins/gusli/nixl_gusli_test.cpp

    
            }
            return true;
        }
    };

Contributor

w1ldptr Aug 7, 2025

A lot of code in this file is a re-implementation of infrastructure that already exists in test/gtest/plugins or of the primitives provided by the Gtest library. Please integrate the test cases from here into one of the existing Gtest suites so they can be validated by CI from day one.

Contributor

barneuman commented Aug 19, 2025

Can you please add your test to the CI (it needs to be added to .gitlab/test_cpp.sh)?
Also, please add an explanation to the README on how to run with the server.

Contributor Author

danielhe-nvidia commented Aug 25, 2025

@barneuman
Will do.
Regarding your request to add an explanation to the README on how to run with the server: you should NOT do it via the NIXL unit-test executable. NIXL is a client. GUSLI has example code showing how to write servers, with classes and unit tests.

The correct way to test with a server is to download and install SPDK, then compile the GUSLI example server app, which loads the GUSLI server. The NIXL unit test then connects to this server (via the GUSLI client). I don't recommend that the NIXL team do this, as it requires installing SPDK in the NIXL container, which takes a long time and is unrelated to NIXL.
An intermediate approach is to skip SPDK and use the GUSLI sample server app without it (the server writes to a file or to a RAM block device). As in the previous approach, the NIXL unit test would need to launch this server executable, connect to it, and kill it. In other words, the NIXL GUSLI client executable cannot run by itself; it needs an external script to compile and launch the server before NIXL performs the first memory registration for the first IO in its unit test.
The easiest but weakest approach is to compile both the GUSLI server and client into the NIXL unit test and have the executable call fork(), continuing as the server in one process and as the client in the other. This is the easiest way to test, since there is no need to coordinate externally between two executables, but it requires NIXL to link with the GUSLI server, which is incorrect. Moreover, NIXL is heavily C++-oriented code, and I am not sure that std:: is guaranteed to work well with fork(). AFAIK it was designed for thread safety, but fork() may cause memory corruption (for example, the constructor of a static object runs once, but after fork() its destructor runs twice, once in each process).
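The shape of the fork()-based approach can be sketched with plain POSIX calls. This is only an illustration of the process layout: a pipe stands in for the real GUSLI client/server connection, no GUSLI or NIXL code is involved, and `runForkedLoopback` is a hypothetical name. The child leaves through `_exit()`, which skips static destructors and `atexit` handlers, one common way to sidestep the double-destructor concern mentioned above.

```cpp
#include <cassert>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Returns true if the parent (client role) received the byte written by
// the child (server role) and the child exited cleanly.
bool runForkedLoopback() {
    int fds[2];
    if (pipe(fds) != 0) return false;

    const pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return false;
    }

    if (pid == 0) {
        // Child: plays the server role and signals readiness over the pipe.
        close(fds[0]);
        const char ready = 'R';
        const ssize_t written = write(fds[1], &ready, 1);
        close(fds[1]);
        // _exit() does not run static destructors or atexit handlers,
        // so objects constructed before fork() are not torn down twice.
        _exit(written == 1 ? 0 : 1);
    }

    // Parent: plays the client role and waits for the server's signal.
    close(fds[1]);
    char got = 0;
    const ssize_t n = read(fds[0], &got, 1);
    close(fds[0]);

    int status = 0;
    if (waitpid(pid, &status, 0) != pid) return false;
    return n == 1 && got == 'R' && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```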

Naturally, I tested GUSLI <--> NIXL with each of the three approaches above, but I don't know how the NIXL team will eventually decide to integrate the NIXL unit tests.
With the plugin, I wrote unit tests only for serverless GUSLI (the GUSLI client .so has a special loopback mode in which the client emulates the server). This mode exists for testing and for accessing legacy block devices, but for NIXL it exercises 100% of the plugin functionality, so NIXL developers don't need to care whether the GUSLI server is real or emulated: the NIXL GUSLI plugin code gets full coverage in the unit tests without a GUSLI server.

Note: The application that acts as the backend server (loading the GUSLI server library), be it SPDK-based, NVMeshUM, a user-space NVMe driver, etc., should not be part of NIXL and should be managed by an external team. For example, the SPDK team should be in charge of running the SPDK server, restarting it after a crash, and configuring it to give access to specific block devices and not to others. That is definitely not the responsibility of the NIXL or GUSLI teams. The situation is equivalent to real NVMe disks: the NIXL team should not be responsible for inserting new disks when the existing ones run out of space, or for managing which local disks should be used for IO and which (like the OS disk) should not. This external hardware configuration is managed outside of NIXL.


          Integration of GUSLI plugin

1d184fa

Change-Id: I05e7b35eb3076142b4dcc1f58a68865d5a5962c0
Signed-off-by: danielhsh <[email protected]>

danielhe-nvidia force-pushed the main branch from 452a7a2 to 1d184fa Compare

September 11, 2025 07:50


          Merge branch 'main' into main

28157e7

Signed-off-by: Adit Ranadive <[email protected]>

aranadive mentioned this pull request

GUSLI Backend Integration for NIXL #887

Merged

Contributor

aranadive commented Oct 13, 2025

Closing this PR since #887 was merged


Labels

external-contribution size/XXL