Skip to content

Conversation

@Lunderberg
Copy link
Contributor

Prior to this commit, allocations performed by ncclCommInitRank had no corresponding call to ncclCommDestroy. While ncclCommDestroy does occur in the CCLThreadLocalContext::Clear method, there are no calls into this method. On worker processes, the failure to call ncclCommDestroy typically had little effect. Any destruction would occur shortly before the process closes, and so resources would be reclaimed by the OS when the process terminates.

However, worker0 of a Disco session is a separate thread, rather than a separate process. While this allows it to easily receive data from the controller thread, resources allocated by worker0 are not reclaimed by the OS until the entire process terminates. As a result, the CCLThreadLocalContext leaked GPU memory, as the ncclCommInitRank call at the start of each
tvm.runtime.disco.ProcessSession was never de-allocated. The increase in GPU memory usage was about 1 gigabyte for each ProcessSession.

This commit updates CCLThreadLocalContext to have a destructor that calls the Clear method. For worker0, this is called when the thread is joined to the main thread.

Prior to this commit, allocations performed by `ncclCommInitRank` had
no corresponding call to `ncclCommDestroy`.  While `ncclCommDestroy`
does occur in the `CCLThreadLocalContext::Clear` method, there are no
calls into this method.  On worker processes, the failure to call
`ncclCommDestroy` typically had little effect.  Any destruction would
occur shortly before the process closes, and so resources would be
reclaimed by the OS when the process terminates.

However, worker0 of a Disco session is a separate thread, rather than
a separate process.  While this allows it to easily receive data from
the controller thread, resources allocated by worker0 are not
reclaimed by the OS until the entire process terminates.  As a result,
the `CCLThreadLocalContext` leaked GPU memory, as the
`ncclCommInitRank` call at the start of each
`tvm.runtime.disco.ProcessSession` was never de-allocated.  The
increase in GPU memory usage was about 1 gigabyte for each
`ProcessSession`.

This commit updates `CCLThreadLocalContext` to have a destructor that
calls the `Clear` method.  For worker0, this is called when the thread
is joined to the main thread.
@vinx13 vinx13 merged commit 0984e97 into apache:main Jun 12, 2024
@Lunderberg Lunderberg deleted the nccl_cleanup_ccl_resources branch June 13, 2024 11:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants