fix: memory optimizations for Nemotron12B 12k seqlen DPO training #926
Conversation
terrykong left a comment:
Thanks for improving the performance @ybgao-nvidia!
Wait ... where is it disabled by default?
wangshangsam left a comment:
Some small nits, but otherwise LGTM.
wangshangsam left a comment:
Thanks @ybgao-nvidia! @pjin-nvidia @bxyu-nvidia FYI
Corresponding fix in Automodel: NVIDIA-NeMo/Automodel#391
What does this PR do?
Memory optimizations
This PR applies memory optimizations that allow single-node (8xH100) training of the Nemotron 12B model with a sequence length of 12288.
We need the following optimizations to make the 12k context work:
- checkpointing additional layers
- a smaller `max_split_size` setting for the CUDA caching allocator
The additional checkpointed layers provide a significant decrease in peak memory usage with minimal performance impact. However, enabling a smaller `max_split_size` in the allocator does increase the step time slightly. The collated performance results are below:
[Table of collated performance results; the reported values include 66.56 and 73.24]
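As a rough illustration of what checkpointing additional layers means, here is a minimal sketch (not the code touched by this PR; `blocks`, `hidden_states`, and `num_checkpointed` are hypothetical names) using PyTorch's activation checkpointing:

```python
from torch.utils.checkpoint import checkpoint

def forward_with_checkpointing(blocks, hidden_states, num_checkpointed):
    """Run transformer blocks, recomputing activations for the first few during backward."""
    for i, block in enumerate(blocks):
        if i < num_checkpointed:
            # This block's activations are discarded after the forward pass and recomputed
            # during backward, trading extra compute for lower peak memory.
            hidden_states = checkpoint(block, hidden_states, use_reentrant=False)
        else:
            hidden_states = block(hidden_states)
    return hidden_states
```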
Removal of `configure_expandable_segments`
Furthermore, the current implementation of `configure_expandable_segments` does not actually perform its intended function. It first queries `torch.cuda.get_device_properties(0).major`, which initializes `torch`, including the memory allocator. The subsequent assignment to the environment variable therefore does not affect the allocator; the `torch.cuda.memory._set_allocator_settings` function should be used instead. However, setting expandable segments has only a minimal effect on peak memory usage while causing a large performance overhead (from 20s to 80s per training iteration).
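For illustration, a minimal sketch of the ordering issue and the workaround described above (an assumption-level sketch, not the removed helper itself; `_set_allocator_settings` is a private API whose behaviour may vary between PyTorch releases):

```python
import os
import torch

# Querying device properties initializes torch's CUDA state, including the caching allocator.
major = torch.cuda.get_device_properties(0).major

# Setting the environment variable afterwards is too late: the already-initialized
# allocator does not re-read PYTORCH_CUDA_ALLOC_CONF.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# Updating the live allocator requires the (private) settings API instead.
torch.cuda.memory._set_allocator_settings("expandable_segments:True")
```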
We have deleted the function and the related invocations and tests to keep the runtime behaviour consistent. Should the need arise to set expandable segments, the user should instead do so via the `env_vars` in the recipe configuration, as sketched below.
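For example, a hypothetical override along these lines; the exact recipe schema is not shown in this PR, so the key layout below is an assumption:

```python
# Hypothetical sketch: pass the allocator setting through the recipe's env_vars so it
# reaches worker processes before CUDA initializes. Key names are assumptions.
recipe_overrides = {
    "env_vars": {
        "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
    },
}
```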
Minor fixes for config schema
Some tweaks are made so that config validation passes:
- make the `tensorboard` field of the logger optional
Issues
This PR resolves #848.
Usage
It is recommended to run DPO training with `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:64` to reduce allocator fragmentation.
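As a minimal sketch of applying this setting programmatically (exporting it in the shell before launching training works equally well), assuming a standalone script rather than the repo's actual entry point:

```python
import os

# PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator initializes,
# so it must be set before any CUDA allocation happens in the process.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:64")

import torch

x = torch.zeros(1, device="cuda")  # the first allocation now uses max_split_size_mb=64
```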