Contributor

@anton-nayshtut anton-nayshtut commented Oct 9, 2025

This PR introduces a Linux AIO plugin for the POSIX backend.

Linux AIO, although only available on Linux, generally performs much better than POSIX AIO, which glibc implements in user space with helper threads.

This patch implements a Linux AIO backend and integrates it into the NIXL build system so it is only built when the platform supports it.

The PR also introduces the LINUXAIO API parameter to nixlbench.

Here are the nixlbench results (run with --storage_enable_direct on an NVMe drive, SAMSUNG MZPLJ1T6HBJR-00007):

AIO

----------------------------------------------------------------------------------------------------------------------------------------------------------------

Block Size (B)      Batch Size     B/W (GB/Sec)   Avg Lat. (us)  Avg Prep (us)  P99 Prep (us)  Avg Post (us)  P99 Post (us)  Avg Tx (us)    P99 Tx (us)
----------------------------------------------------------------------------------------------------------------------------------------------------------------
4096                1              0.165290       24.8           35.0           35.0           1.3            2.0            23.5           47.0
8192                1              0.289647       28.3           51.0           51.0           1.4            2.0            26.9           48.0
16384               1              0.449769       36.4           65.0           65.0           1.7            4.0            34.6           166.0
32768               1              0.847057       38.7           60.0           60.0           1.5            3.0            37.1           49.0
65536               1              1.393177       47.0           59.0           59.0           1.5            2.0            45.4           55.0
131072              1              1.976728       66.3           59.0           59.0           1.5            3.0            64.7           189.0
262144              1              1.984314       132.1          69.0           69.0           1.7            6.0            130.3          255.0
524288              1              2.681113       195.5          61.0           61.0           1.9            7.0            193.6          528.0
1048576             1              2.706190       387.5          61.0           61.0           1.8            5.0            385.6          835.0
2097152             1              2.683741       781.4          61.0           61.0           2.3            7.0            778.0          1609.0
4194304             1              2.621102       1600.2         59.0           59.0           2.1            6.0            1597.0         2184.0
8388608             1              2.647123       3169.0         21.0           21.0           2.0            6.0            3166.6         5939.0
16777216            1              2.680175       6259.7         60.0           60.0           2.1            8.0            6256.5         10677.0
33554432            1              2.617827       12817.7        61.0           61.0           2.2            13.0           12814.3        17233.0
67108864            1              2.586386       25947.0        60.0           60.0           3.8            25.0           25942.0        31822.0

URING

----------------------------------------------------------------------------------------------------------------------------------------------------------------

Block Size (B)      Batch Size     B/W (GB/Sec)   Avg Lat. (us)  Avg Prep (us)  P99 Prep (us)  Avg Post (us)  P99 Post (us)  Avg Tx (us)    P99 Tx (us)
----------------------------------------------------------------------------------------------------------------------------------------------------------------
4096                1              0.208734       19.6           117.0          117.0          3.3            19.0           16.2           160.0
8192                1              0.418782       19.6           83.0           83.0           1.9            5.0            17.6           168.0
16384               1              0.774555       21.2           72.0           72.0           2.1            6.0            19.0           169.0
32768               1              1.312282       25.0           58.0           58.0           2.5            6.0            22.4           184.0
65536               1              1.736099       37.7           63.0           63.0           3.1            8.0            34.5           314.0
131072              1              2.148162       61.0           57.0           57.0           4.4            9.0            56.5           334.0
262144              1              2.804096       93.5           64.0           64.0           7.3            15.0           86.1           333.0
524288              1              2.636124       198.9          68.0           68.0           12.8           28.0           186.0          526.0
1048576             1              2.656291       394.8          69.0           69.0           21.0           56.0           373.6          1053.0
2097152             1              2.867947       731.2          59.0           59.0           23.9           81.0           706.1          1458.0
4194304             1              2.642940       1587.0         166.0          166.0          2.7            6.0            1581.3         2471.0
8388608             1              2.629579       3190.1         48.0           48.0           2.5            5.0            3186.5         7507.0
16777216            1              2.634350       6368.6         50.0           50.0           2.8            17.0           6364.7         7024.0
33554432            1              2.595498       12927.9        81.0           81.0           2.6            5.0            12923.7        23889.0
67108864            1              2.589345       25917.3        82.0           82.0           3.4            12.0           25911.9        34322.0

LINUXAIO

----------------------------------------------------------------------------------------------------------------------------------------------------------------

Block Size (B)      Batch Size     B/W (GB/Sec)   Avg Lat. (us)  Avg Prep (us)  P99 Prep (us)  Avg Post (us)  P99 Post (us)  Avg Tx (us)    P99 Tx (us)
----------------------------------------------------------------------------------------------------------------------------------------------------------------
4096                1              0.103491       39.6           123.0          123.0          0.2            1.0            0.2            1.0
8192                1              0.202123       40.5           131.0          131.0          0.2            1.0            3.7            1.0
16384               1              0.407086       40.2           174.0          174.0          0.2            1.0            2.4            1.0
32768               1              0.839628       39.0           436.0          436.0          0.7            2.0            0.6            2.0
65536               1              2.037954       32.2           110.0          110.0          0.2            1.0            0.2            1.0
131072              1              4.011799       32.7           103.0          103.0          0.2            1.0            0.2            1.0
262144              1              5.590394       46.9           122.0          122.0          0.2            1.0            0.2            1.0
524288              1              13.309550      39.4           101.0          101.0          0.2            1.0            0.2            1.0
1048576             1              26.005428      40.3           103.0          103.0          0.2            1.0            0.3            1.0
2097152             1              4.383999       478.4          121.0          121.0          2.0            109.0          4.8            291.0
4194304             1              6.484923       646.8          112.0          112.0          2.9            166.0          8.7            539.0
8388608             1              13.066368      642.0          116.0          116.0          4.6            275.0          16.8           1048.0
16777216            1              26.129505      642.1          102.0          102.0          9.2            565.0          34.5           2161.0
33554432            1              52.217702      642.6          102.0          102.0          17.7           1095.0         83.2           5233.0
67108864            1              84.701161      792.3          275.0          275.0          41.5           2603.0         252.2          15875.0
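
As a rough cross-check of the tables above, the B/W column appears consistent with block size times batch size divided by average latency, taking 1 GB as 10^9 bytes. That relation is inferred from the numbers here, not taken from the nixlbench source, and the helper name below is hypothetical. A minimal sketch:

```cpp
#include <cassert>
#include <cstdint>

// Assumed relation (inferred from the tables, not from the nixlbench source):
// bandwidth = bytes moved per iteration / average latency, with 1 GB = 1e9 bytes.
double bandwidth_gb_per_sec(std::uint64_t block_size_bytes,
                            std::uint64_t batch_size,
                            double avg_latency_us) {
    assert(avg_latency_us > 0.0); // a zero latency would divide by zero
    const double bytes =
        static_cast<double>(block_size_bytes) * static_cast<double>(batch_size);
    const double seconds = avg_latency_us * 1e-6;
    return bytes / seconds / 1e9;
}
```

For the first AIO row, `bandwidth_gb_per_sec(4096, 1, 24.8)` gives about 0.165 GB/s, in line with the reported 0.165290; a small difference is expected because the latency column is rounded.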

copy-pr-bot bot commented Oct 9, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions bot commented Oct 9, 2025

👋 Hi anton-nayshtut! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

ovidiusm previously approved these changes Oct 9, 2025
@anton-nayshtut anton-nayshtut force-pushed the antonn/linux_aio branch 2 times, most recently from 4842d05 to b018b40 Compare October 9, 2025 13:19
@ovidiusm
Contributor

ovidiusm commented Oct 9, 2025

/build

@anton-nayshtut
Contributor Author

/build

@ovidiusm , I see that it needs a rebase. Could you please advise on the correct procedure? Should I rebase and push manually?

rt_dep = cpp.find_library('rt', required: true)
thread_dep = dependency('threads')

# Check for libaio (for POSIX plugin and test)

In a separate PR we should fix the build's mistaken assumption that libaio is required for POSIX AIO. It isn't: POSIX AIO needs no symbol test or extra library because it is built into glibc. Your Linux AIO additions here are correct, though.

Contributor Author

I know. I wanted to keep that change in a separate PR, as you suggested.

if (ret != num_ios_to_submit) {
    if (ret < 0) {
        NIXL_ERROR << absl::StrFormat("linux_aio submit failed: %s", nixl_strerror(-ret));
    } else {

Partial submission isn't a failure. We should handle it by trying again later.

Contributor Author

I completely agree. However, the current API seems to be built on the premise that submission is all or nothing, so for now I followed that premise rather than break it by adding a retry mechanism to functions that aren't intended for it but are called repeatedly until success, such as checkCompleted.

Generally speaking, I think we need to make the API aware of retries and partial submissions.

Please advise.

Contributor

How large would a batch have to be in practice for a full submission to be rejected?

Contributor

I'm asking because we have a code freeze on Monday. Do you think you can merge this new API as is by then and fix the partial submission issue in a separate PR, since a fix would touch code that is out of scope for this PR? Or do you think it is best to postpone everything and merge in a future version?

We are currently switching to a major/bugfix release cadence, so 0.7.0 would be a good candidate for this feature and 0.7.1 for the fix, unless you think the code will be broken and prefer not to ship it now. IIUC, requests are currently rejected if the client app tries to write or read too many descriptors at once, which may be acceptable if the limit is high and users have not reported such issues with the current implementation.

Contributor Author

@anton-nayshtut anton-nayshtut Oct 10, 2025

It behaves the same way as, for example, UringQueue, which submits IOs using the io_uring_submit API. That API also returns the number of submission queue entries submitted.

That’s why I implemented the Linux AIO plugin in the same way. It adds functionality and doesn’t make things worse.

So, I think we can add it.

That said, I believe we must adjust the API to be aware of possible partial submissions ASAP.

Contributor

I don't think we need an API change for partial submission; we can handle it entirely inside the plugin implementation. That said, we can address this for all POSIX IOs together in a separate PR.

Contributor Author

I don't think we need an API change for partial submission; we can handle it entirely inside the plugin implementation. That said, we can address this for all POSIX IOs together in a separate PR.

Could you please elaborate on your suggestion? As far as I understand, the submit() API is only called once, and then checkCompleted() is called repeatedly in a loop until the transfer is finished. Is that correct? If so, how would you recommend implementing partial submission handling? Should the plugin keep sending the rest of the data in checkCompleted() until it’s all done, or do you have something else in mind?
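
One way to read the suggestion of handling partial submission inside the plugin is sketched below: remember how many IOs have been accepted and try to submit the remainder each time the transfer is polled. This is only an illustration under stated assumptions. `PartialSubmitter`, `Status`, and `SubmitFn` are hypothetical names rather than NIXL types, and the callback stands in for a submit call such as io_submit, which (in the libaio convention) returns the number of requests accepted or a negative errno value.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical status codes for this sketch; they are not NIXL's real values.
enum class Status { Success, InProgress, Error };

// Hypothetical helper that remembers how many IOs have been accepted so far
// and resubmits the remainder on later calls.
class PartialSubmitter {
public:
    // Stand-in for io_submit(): given the index of the first unsubmitted IO
    // and the number remaining, returns how many were accepted, or a
    // negative errno value on failure.
    using SubmitFn = std::function<long(std::size_t first, std::size_t count)>;

    PartialSubmitter(std::size_t total, SubmitFn submit_fn)
        : total_(total), submit_fn_(std::move(submit_fn)) {}

    // Tries to submit whatever is still pending. Safe to call repeatedly,
    // for example once at submit time and again on every completion poll.
    Status submitPending() {
        if (submitted_ == total_) return Status::Success;
        const long ret = submit_fn_(submitted_, total_ - submitted_);
        if (ret == -EAGAIN) return Status::InProgress; // queue full, retry later
        if (ret < 0) return Status::Error;             // hard failure
        assert(static_cast<std::size_t>(ret) <= total_ - submitted_);
        submitted_ += static_cast<std::size_t>(ret);
        return submitted_ == total_ ? Status::Success : Status::InProgress;
    }

    std::size_t submitted() const { return submitted_; }

private:
    std::size_t total_;
    std::size_t submitted_ = 0;
    SubmitFn submit_fn_;
};
```

In this shape, submit() would call submitPending() once and checkCompleted() would call it again before reaping completions, so the caller sees no API change. Whether that fits the backend's threading and lifetime rules is a question for the follow-up PR.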


ios_to_submit[idx] = nullptr; // Mark as completed

if (events[i].res < 0) {

There's no way to report that a single operation failed? The whole backend simply fails? This is less a comment on your patch than on the entire POSIX backend, I guess.

Contributor Author

Yeah, it seems so, unfortunately.

Contributor

In NIXL, if the entire transfer is not complete, it's considered a failure. If users want more visibility, they can split their transfers into more requests. Everything is async.

@ovidiusm
Contributor

ovidiusm commented Oct 9, 2025

/build

@ovidiusm , I see that it needs a rebase. Could you please advise on the correct procedure? Should I rebase and push manually?

There are no conflicts, so either a normal merge or a rebase should just work. You can also click the "Update branch" button and pull; it does a merge.

If URING is not supported, the getQueueType() API now returns the
correct return value (queue_t::UNSUPPORTED).

The patch also cleans up the code by removing an unnecessary ifdef.

Signed-off-by: Anton Nayshtut <[email protected]>

Linux AIO, although only available on Linux, is known for much better
performance than POSIX AIO.

This patch implements a Linux AIO backend and integrates it into the
NIXL build system so it is only built when the platform supports it.

Signed-off-by: Anton Nayshtut <[email protected]>

This patch adds support for a POSIX LINUXAIO API using the Linux AIO
plugin.

Signed-off-by: Anton Nayshtut <[email protected]>

@anton-nayshtut
Contributor Author

/build

@ovidiusm , I see that it needs a rebase. Could you please advise on the correct procedure? Should I rebase and push manually?

There are no conflicts, so either a normal merge or a rebase should just work. You can also click the "Update branch" button and pull; it does a merge.

Done! Thanks!

"--posix_api_type",
type=str,
- help="API type for POSIX operations [AIO, URING] (only used with POSIX backend",
+ help="API type for POSIX operations [AIO, URING, LINUXAIO] (only used with POSIX backend",

Contributor

Can Linux AIO be made default, is it available with manylinux pip builds? @aranadive ?

Contributor Author

@anton-nayshtut anton-nayshtut Oct 10, 2025

Can Linux AIO be made default, is it available with manylinux pip builds? @aranadive ?

It can. As @benlwalker rightly mentioned in one of his comments, the current POSIX AIO dependency is incorrect. NIXL currently assumes that libaio is required for POSIX AIO, while in reality POSIX AIO is part of glibc.

We're going to fix this dependency in a separate PR. In the meantime, because the build currently ties POSIX AIO to libaio, the fact that POSIX AIO is available in the manylinux build means that libaio is available there too. This means we can make Linux AIO the default option.
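
If Linux AIO becomes the default, the selection logic could look roughly like the sketch below: prefer Linux AIO when it was built, fall back to POSIX AIO, and report an unsupported type otherwise. Only `queue_t::UNSUPPORTED` and the API names AIO, URING, and LINUXAIO come from this PR; the other enumerators, the `Availability` struct, and both functions are hypothetical.

```cpp
#include <string>

// queue_t::UNSUPPORTED appears in the PR; the other enumerators are assumed.
enum class queue_t { AIO, URING, LINUXAIO, UNSUPPORTED };

// Hypothetical availability flags. In a real build these would come from the
// build system, for example a define set when libaio is found.
struct Availability {
    bool posix_aio;
    bool uring;
    bool linux_aio;
};

// Maps a user-supplied API name to a queue type. Reports UNSUPPORTED when the
// name is unknown or the corresponding queue was not built.
queue_t queueTypeFromName(const std::string &name, const Availability &avail) {
    if (name == "AIO") return avail.posix_aio ? queue_t::AIO : queue_t::UNSUPPORTED;
    if (name == "URING") return avail.uring ? queue_t::URING : queue_t::UNSUPPORTED;
    if (name == "LINUXAIO") return avail.linux_aio ? queue_t::LINUXAIO : queue_t::UNSUPPORTED;
    return queue_t::UNSUPPORTED;
}

// Picks a default: Linux AIO if it was built, otherwise POSIX AIO.
queue_t defaultQueueType(const Availability &avail) {
    if (avail.linux_aio) return queue_t::LINUXAIO;
    if (avail.posix_aio) return queue_t::AIO;
    return queue_t::UNSUPPORTED;
}
```

Keeping the fallback in one function means a build without libaio still gets a working default instead of failing at startup.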

        return NIXL_SUCCESS;
    }

    struct io_event events[32];

Contributor

Why specifically 32 events?

Contributor Author

It's up to 32 events per call. There's no specific reason for that number; we can pick any other.
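
To illustrate why the array size is not critical: a fixed-size array only caps how many completions are reaped per call, and repeated polling drains the rest. The sketch below also follows the all-or-nothing rule discussed earlier, where one failed IO fails the whole transfer. `Event`, `GetEventsFn`, `Status`, and `pollCompletions` are hypothetical stand-ins; the real code uses struct io_event and the Linux AIO completion call.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Stand-in for the kernel's struct io_event, reduced to the one field the
// sketch needs.
struct Event {
    long res; // bytes transferred, or a negative errno value
};

// Stand-in for the completion call: fills up to max_events entries and
// returns how many were written (0 when nothing is ready), or a negative
// errno value on failure.
using GetEventsFn = std::function<int(Event *events, int max_events)>;

constexpr int kMaxEventsPerPoll = 32; // arbitrary batch size, as in the PR

// Hypothetical status codes for this sketch; they are not NIXL's real values.
enum class Status { Success, InProgress, Error };

// Polls once for completions. `remaining` is the number of IOs still in
// flight and is decremented for each successfully completed event.
Status pollCompletions(const GetEventsFn &get_events, std::size_t &remaining) {
    Event events[kMaxEventsPerPoll];
    const int n = get_events(events, kMaxEventsPerPoll);
    if (n < 0) return Status::Error;
    for (int i = 0; i < n; ++i) {
        if (events[i].res < 0) return Status::Error; // one failed IO fails the transfer
        assert(remaining > 0); // more completions than submissions would be a bug
        --remaining;
    }
    return remaining == 0 ? Status::Success : Status::InProgress;
}
```

With 40 IOs in flight, the first poll reaps at most 32 and reports that the transfer is still in progress; a later poll reaps the remaining 8. A larger array reduces the number of calls but does not change the outcome.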
