Add op status in RAS when comm is not ok #1642

Open
gangxie112 opened this issue Mar 18, 2025 · 1 comment

Comments

@gangxie112

RAS is a great feature for diagnosing silent hangs. With it, we can get much closer to the root cause. But it still lacks detailed status for the blocked op, such as the op type and the post, send, and done fields of the sub.
Are there plans to add these in the next release?

Thanks,
Gang

@kiskra-nvidia
Member

In the recently released NCCL 2.26.2, RAS was extended with separate counters for each collective operation type. Further extensions to RAS to provide more detailed information on the communicator status are on the roadmap, but it may be some time before they are implemented (unlikely to be ready by the next release).
