@danielhe-nvidia
Contributor

Change-Id: I05e7b35eb3076142b4dcc1f58a68865d5a5962c0

What?

Review only: don't merge, as the standalone GUSLI repository does not exist yet
Support transfer of 1 element in a list
Support transfer of N elements in a list, with or without an extra scatter-gather element
Support the above to multiple block devices in one transfer

Why?

Allow access to GUSLI (a server or bdevs) via NIXL

@copy-pr-bot

copy-pr-bot bot commented Jun 23, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

👋 Hi danielhe-nvidia! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

__LOG_RETERR(NIXL_ERR_INVALID_PARAM, "missing SGL, or SGL too small 0x%lx[b]", local[0].len);
}
} else {
unsigned i = (has_sgl_mem ? 1 : 0); // If supplied sgl, can't use it for now, just ignore it
Contributor

If I'm reading this correctly the code here assumes that the first descriptor is "special" memory to be used internally by Gusli for optimization when '-sgl' custom param is set. @vvenkates27 what do you think about exposing such API toggle?

std::vector<class nixlGusliBackendReqH> child; // Array of sub completions, possibly a tree, though tested only 2 level io.
static void completion_cb(nixlGusliBackendReqH *c) {
__LOG_IO(c, "_done, rv=%d", c->io.get_error());
c->pollable_async_rv = c->io.get_error();
Contributor

Just to double check: This callback is always called from the same thread that initialized the request handle?

Contributor Author

No, by definition it is not. It will be called from an internal GUSLI thread. However, there is a cancellation point: when the IO is destroyed it calls cancel, which guarantees that the callback will not arrive afterwards.
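Since the callback runs on a different thread than the one polling the request, the status field it writes needs a thread-safe handoff. A minimal sketch of one way to do that with `std::atomic` follows; `RequestHandle`, `ReqStatus`, and `runOnce` are hypothetical names for illustration, not the plugin's actual types, and the `join()` only stands in for the cancel-on-destroy guarantee described above.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for a request handle; not the plugin's real type.
enum class ReqStatus : int { InProgress = 0, Done = 1, Failed = -1 };

struct RequestHandle {
    // Written by the backend's completion thread, read by the polling thread.
    std::atomic<ReqStatus> status{ReqStatus::InProgress};

    // Runs on a backend-owned thread, so it only publishes the result.
    static void completionCb(RequestHandle *req, int error) {
        req->status.store(error == 0 ? ReqStatus::Done : ReqStatus::Failed,
                          std::memory_order_release);
    }

    // Called by the thread that posted the request.
    ReqStatus poll() const { return status.load(std::memory_order_acquire); }
};

// Simulates one IO whose completion arrives from a different thread.
inline ReqStatus runOnce(int error) {
    RequestHandle req;
    assert(req.poll() == ReqStatus::InProgress);
    std::thread backend([&req, error] { RequestHandle::completionCb(&req, error); });
    backend.join(); // stands in for "no callback can arrive after cancel/destroy"
    return req.poll();
}
```

The release/acquire pair ensures that whatever the callback wrote before publishing the status is visible to the poller once it observes a final state.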

@danielhe-nvidia danielhe-nvidia force-pushed the main branch 2 times, most recently from 81896b2 to 54cab9a Compare July 8, 2025 13:14
std::stoi(backParams->at("max_num_simultaneous_requests"));
if (backParams->count("config_file") > 0)
gusli_params.config_file = backParams->at("config_file").c_str();
}
Contributor

Extract gusli::global_clnt_context::init_params into a helper function...

if (backParams->count("config_file") > 0)
gusli_params.config_file = backParams->at("config_file").c_str();
}
lib_ = std::make_unique<gusli::global_clnt_raii>(gusli_params);
Contributor

... and move this initialization into the initializer list by passing the return value of the new helper to it directly instead of using a temporary. With this, you won't even need a smart pointer anymore, since you could just make lib_ a by-value class field.
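The two suggestions above could look roughly like the sketch below. All types here (`InitParams`, `ClientLib`, `Engine`, `makeInitParams`) and the default values are hypothetical stand-ins, not the real GUSLI or NIXL classes; the sketch also stores `config_file` as a `std::string` rather than a raw pointer, which is an assumption made for simplicity.

```cpp
#include <map>
#include <string>

// Hypothetical stand-ins for the GUSLI parameter and library types.
struct InitParams {
    int max_num_simultaneous_requests = 256; // illustrative default only
    std::string config_file;
};

struct ClientLib {
    explicit ClientLib(const InitParams &p) : params(p) {}
    InitParams params;
};

using ParamMap = std::map<std::string, std::string>;

// Builds the parameter struct in one place so the constructor can consume it directly.
inline InitParams makeInitParams(const ParamMap *custom) {
    InitParams p;
    if (!custom) return p;
    auto it = custom->find("max_num_simultaneous_requests");
    if (it != custom->end()) p.max_num_simultaneous_requests = std::stoi(it->second);
    it = custom->find("config_file");
    if (it != custom->end()) p.config_file = it->second;
    return p;
}

class Engine {
public:
    // The helper's return value feeds the member initializer; no temporary, no smart pointer.
    explicit Engine(const ParamMap *custom) : lib_(makeInitParams(custom)) {}
    const InitParams &params() const { return lib_.params; }

private:
    ClientLib lib_; // by-value member
};
```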

if (rv == gusli::connect_rv::C_OK) return NIXL_SUCCESS;
if (rv == gusli::connect_rv::C_NO_DEVICE) return NIXL_ERR_NOT_FOUND;
if (rv == gusli::connect_rv::C_WRONG_ARGUMENTS) return NIXL_ERR_INVALID_PARAM;
return NIXL_ERR_BACKEND;
Contributor

Use switch here.
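A sketch of the switch-based mapping being asked for, using hypothetical enums (`ConnectRv`, `Status`) in place of `gusli::connect_rv` and `nixl_status_t`, whose full value sets are not shown in this thread:

```cpp
// Hypothetical mirrors of the two enums involved; the real ones may have more values.
enum class ConnectRv { Ok, NoDevice, WrongArguments, Other };
enum class Status { Success, NotFound, InvalidParam, Backend };

// Translates a connection result into a status code in a single switch.
inline Status toStatus(ConnectRv rv) {
    switch (rv) {
    case ConnectRv::Ok:
        return Status::Success;
    case ConnectRv::NoDevice:
        return Status::NotFound;
    case ConnectRv::WrongArguments:
        return Status::InvalidParam;
    default:
        return Status::Backend; // any unmapped result is reported as a backend error
    }
}
```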

assert(lib_->BREAKING_VERSION == 1);
}

nixlGusliEngine::~nixlGusliEngine() {}
Contributor

Remove this and use "=default" in the header instead.
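For reference, a defaulted destructor declared in the header might look like the sketch below; `BackendEngine` and `GusliLikeEngine` are hypothetical names, and whether the real base class has a virtual destructor is an assumption here.

```cpp
#include <type_traits>

// Hypothetical base standing in for a backend engine interface.
class BackendEngine {
public:
    virtual ~BackendEngine() = default;
};

class GusliLikeEngine : public BackendEngine {
public:
    GusliLikeEngine() = default;
    // Declared in the header; no empty destructor body is needed in the .cpp file.
    ~GusliLikeEngine() override = default;
};
```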

nixl_status_t
nixlGusliEngine::deregisterMem(nixlBackendMD *_md) {
nixlGusliMemReq *md = (nixlGusliMemReq *)_md;
if (!md) __LOG_RETERR(NIXL_ERR_INVALID_PARAM, "md==null");
Contributor

This is redundant. Validate function arg instead, rename auto_deleter to md and use it to access the metadata in a safe manner.
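One possible reading of this suggestion, sketched with hypothetical types (`BackendMD`, `MemReq`, `Status`, `liveCount`): check the incoming argument once, then let a `std::unique_ptr` named `md` own the object and serve as the only way to access it. The surrounding original code is not visible in this thread, so the details are assumptions.

```cpp
#include <memory>

// Hypothetical stand-ins for the backend metadata types.
struct BackendMD {
    virtual ~BackendMD() = default;
};

// Counts live MemReq objects so the ownership transfer can be observed.
inline int &liveCount() {
    static int n = 0;
    return n;
}

struct MemReq : BackendMD {
    explicit MemReq(int id) : dev_id(id) { ++liveCount(); }
    ~MemReq() override { --liveCount(); }
    int dev_id;
};

enum class Status { Success, InvalidParam };

inline Status deregisterMem(BackendMD *raw) {
    if (!raw) return Status::InvalidParam; // validate the function argument itself
    // md takes ownership and frees the metadata on every return path.
    std::unique_ptr<MemReq> md(static_cast<MemReq *>(raw));
    (void)md->dev_id; // metadata is accessed only through the owning pointer
    return Status::Success;
}
```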

}
}; // namespace

nixlGusliEngine::nixlGusliEngine(const nixlBackendInitParams *nixlInit)
Contributor

Please consult the code style guide regarding variable naming conventions.

@aranadive aranadive self-requested a review August 4, 2025 20:43
@danielhe-nvidia danielhe-nvidia requested a review from a team as a code owner August 5, 2025 13:00
}
return true;
}
};
Contributor

A lot of code in this file is a re-implementation of infrastructure that already exists in test/gtest/plugins or of the primitives provided by the Gtest library. Please integrate the test cases from here into one of the existing Gtest suites so they can be validated by CI from day one.

@barneuman
Contributor

Can you please add your test to the CI (it needs to be added to .gitlab/test_cpp.sh)?
Also, please add an explanation to the README on how to run with the server.

@danielhe-nvidia
Contributor Author

danielhe-nvidia commented Aug 25, 2025

@barneuman
Will do,
Regarding your request to add an explanation to the README on how to run with the server: you should NOT do it via the NIXL unit-test executable. NIXL is a client. GUSLI has example code showing how to write servers, with classes and unit tests.

  1. The correct way to test with a server is to download SPDK and install it, then compile the GUSLI example server app, which loads the GUSLI server. The NIXL unit test will then connect to this server (via the gusli client). I don't recommend that NIXL people do this, as it requires an SPDK installation in the NIXL container, which is long and really unrelated to NIXL.

  2. A middle-ground approach is to skip SPDK and use the GUSLI sample server app without SPDK (the server writes to a file or to a RAM block device). Much like in the previous approach, the NIXL unit test would need to launch this server executable, connect to it, and kill it. In other words, the NIXL gusli client executable will not be able to run by itself; it needs some kind of external script to compile and launch the server before NIXL does the first register mem of the first IO in its unit test.

  3. The easiest but weakest approach is to compile both the GUSLI server and client into the NIXL unit test and have the executable call fork(), continuing as both server and client. This is the easiest way to test, with no need to coordinate externally between 2 executables, but it does require NIXL to link with the GUSLI server, which is incorrect. Moreover, NIXL is heavily C++-oriented code and I am not sure that std:: is guaranteed to work well with fork(). AFAIK it was developed for thread safety, but fork() may cause memory corruption (like the constructor of a static class being called once while, after fork(), the destructor is called twice).
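To illustrate the fork() concern in approach 3, here is a minimal POSIX sketch in which the child plays the "server" role and leaves through `_exit()`, so that static destructors and atexit handlers inherited from the parent image do not run a second time in the child. `runInChild` and `fakeServer` are hypothetical helpers and have nothing to do with the actual GUSLI server code.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Runs childMain in a forked child and returns the child's exit code, or -1 on failure.
inline int runInChild(int (*childMain)()) {
    const pid_t pid = fork();
    if (pid < 0) return -1; // fork failed
    if (pid == 0) {
        // Child: _exit() skips atexit handlers and static destructors.
        _exit(childMain());
    }
    // Parent: wait for the child and decode its exit status.
    int wstatus = 0;
    if (waitpid(pid, &wstatus, 0) != pid) return -1;
    return WIFEXITED(wstatus) ? WEXITSTATUS(wstatus) : -1;
}

// Placeholder for whatever the forked "server" side would do.
inline int fakeServer() { return 7; }
```

This only sidesteps the double-destructor issue; it does not address the other objection raised above, namely that the unit test would have to link against the server library.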

Naturally, I tested GUSLI <--> NIXL with each of the above 3 approaches, but I don't know how the NIXL team will eventually decide to integrate NIXL unit tests.
With the plugin I wrote unit tests only for serverless GUSLI (the gusli client .so has a special loopback mode in which the client emulates the server). That mode exists for testing purposes and for accessing legacy block devices, but for NIXL it tests 100% of the plugin functionality, so NIXL developers don't really need to care whether the GUSLI server is real or emulated: the NIXL-gusli-plugin code gets full coverage in the unit tests without a gusli server.

Note: The application that acts as the backend server (loads the GUSLI server library), be it SPDK-based, NVMeshUM, a user-space NVMe driver, etc., should not be part of NIXL and should be managed by an external team. For example, the SPDK team should be in charge of keeping the SPDK server running, restarting it upon a crash, and configuring it to give access to specific block devices and not to others. It is definitely not the responsibility of the NIXL or GUSLI teams. This is equivalent to real NVMe disks: the NIXL team should not be responsible for inserting new disks when existing ones run out of space, or for managing which local disks should be used for IO and which (like the OS disk) should not. This external configuration of hardware is managed outside of NIXL.

Change-Id: I05e7b35eb3076142b4dcc1f58a68865d5a5962c0
Signed-off-by: danielhsh <[email protected]>
Signed-off-by: Adit Ranadive <[email protected]>
@aranadive
Contributor

Closing this PR since #887 was merged
