Skip to content

Conversation

2dm
Copy link

@2dm 2dm commented Sep 26, 2025

What?

Verify that all enqueued work has completed.

Why?

There is a race between the agent destructor and jobs on commQueue since the background thread might finish without checking for existing work in the queue. In particular, the missed job is probably invalidateLocalMD that is called right before, which will keep the MD intact and prevent the re-addition of that rank later.

How?

  1. Stop enqueueing
  2. Check queue is empty (the thread is processing or completed the last work that was enqueued)
  3. Only then set commThreadStop

@2dm 2dm requested a review from a team as a code owner September 26, 2025 18:56
Copy link

copy-pr-bot bot commented Sep 26, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Copy link

👋 Hi 2dm! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

@mkhazraee
Copy link
Contributor

/ok to test 458e800

@mkhazraee
Copy link
Contributor

/build

@mkhazraee
Copy link
Contributor

/ok to test 7a7cdb7

@mkhazraee
Copy link
Contributor

/build

@ovidiusm
Copy link
Contributor

ovidiusm commented Oct 7, 2025

/build

@ovidiusm
Copy link
Contributor

ovidiusm commented Oct 7, 2025

/ok to test 99297c6

@tstamler
Copy link
Contributor

tstamler commented Oct 7, 2025

/build

@2dm 2dm force-pushed the destructor-race branch from a9cd35e to 1082028 Compare October 8, 2025 02:27
@aranadive
Copy link
Contributor

/ok to test 4387963

@aranadive
Copy link
Contributor

/build

@ovidiusm
Copy link
Contributor

ovidiusm commented Oct 8, 2025

/build

1 similar comment
@ovidiusm
Copy link
Contributor

ovidiusm commented Oct 9, 2025

/build


nixlAgent::~nixlAgent() {
if (data && (data->useEtcd || data->config.useListenThread)) {
data->agentShutdown = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC agentShutdown is accessed from different threads, it should be atomic

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ovidiusm Same for commThreadStop then?

@ovidiusm
Copy link
Contributor

ovidiusm commented Oct 9, 2025

I still see 2 tests failing with SIGPIPE during cleanup, so I would try to stress test this locally to make sure there is no issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants