feat: Context Parallelism + Sequence Packing in Megatron + Dtensor (#651)
Closed
Changes from all commits (159 commits)
097de0d Fixed ~100 type errors (SahilJain314)
21afaf7 fixed 50 more type issues (SahilJain314)
f38ab81 lint (SahilJain314)
e5296bc Added mypy config, fixed 50 more type errors, and updated old typing.… (SahilJain314)
a9cee8f Updated pyproject (SahilJain314)
89cf9cf Down to 100 errors (SahilJain314)
e8160c3 Update pyproject (SahilJain314)
6a494b9 Down to 50 errors (SahilJain314)
8a95964 Added testing doc (SahilJain314)
3b0bfbc Fixed missing import (SahilJain314)
588cfda Fixed tokenizer type (SahilJain314)
bc07e76 Down to 50 errors (SahilJain314)
72e4819 fixed 150 strict mypy errors (SahilJain314)
9bb8d83 lint (SahilJain314)
7376384 Fixed another 100 strict typing mypy errors (SahilJain314)
b3f5098 Fixed another 100 strict typing mypy errors (down to 130) (SahilJain314)
e8b3552 Brought non-strict errors down to 18 (SahilJain314)
7aa1f44 Brought non-strict errors down further (SahilJain314)
958f538 Fixed pynvml test type (SahilJain314)
6a14aec feat: support mcore extra (megatron + tron) (terrykong)
17f5136 typo (terrykong)
b17e163 all good (terrykong)
5b3b96d pin pre-commit (terrykong)
48f8ffc undo (terrykong)
8aa6ff8 move submodule stuff into comment until (terrykong)
c5f4305 ok (terrykong)
fdc6024 rmove this round (terrykong)
5c22f20 fix (terrykong)
23c0943 dockerignore too (terrykong)
7d2d8dc Moved all actors to using NamedSharding-based distribution instead of… (SahilJain314)
9cd192b oops forgot named_sharding files (SahilJain314)
9952110 lint (SahilJain314)
e599e76 Merge remote-tracking branch 'origin/main' into sahilj/type_fix (SahilJain314)
c917363 Updated with tot merge (SahilJain314)
a843293 Updated configure_generation_config (SahilJain314)
11b2f42 updated uv lock (SahilJain314)
d72dfcb original uv lock (SahilJain314)
18aee25 updated uv lock (SahilJain314)
a9c285b original uv (SahilJain314)
b678022 Merge remote-tracking branch 'origin/main' into sahilj/type_fix (SahilJain314)
3b241df updated uv lock (SahilJain314)
40c6275 Pushed coordinate finding into the NamedSharding (SahilJain314)
aacb783 Unit test failure (SahilJain314)
c477048 Added tests and fixed types (SahilJain314)
c0043c1 Added mypy to ci (just a warning rn) (SahilJain314)
e3f98ea Merge remote-tracking branch 'origin/sahilj/type_fix' into sahilj/nam… (SahilJain314)
76d510a try setting max_jobs really low (terrykong)
3058683 Merge remote-tracking branch 'origin' into sahilj/named_sharding (SahilJain314)
c38fcb4 Merge branch 'tk/megatron-extra' into sahilj/megatron_tot (SahilJain314)
3f9667f Added Megatron (SahilJain314)
4b82ffd Megatron fixes (SahilJain314)
e36935d tot bugfixes (SahilJain314)
36d41ba Updated git module (SahilJain314)
9678444 Added kwargs to save checkpoint (SahilJain314)
e5bcd1e Fixed checkpointing Megatron (SahilJain314)
f9be255 Don't bother with rng restore (SahilJain314)
42800e6 Fixed metric logging (SahilJain314)
dc7920b lint (SahilJain314)
f88bbc3 Updated nemo patch (SahilJain314)
8910bd4 Updated patch (SahilJain314)
84bd407 Fixed memory offloading for parameter and grad buffers (SahilJain314)
2271229 fix: Don't call state_dict in loop + dtype fix (#445) (yfw)
b08ab1c Enable dyanmic batching (SahilJain314)
bbf8de7 Fixed pp bug (SahilJain314)
41a542c lint (SahilJain314)
6b6355e Merge remote-tracking branch 'origin' into sahilj/megatron_tot (SahilJain314)
8ad9565 Fixed merge artifact (SahilJain314)
f8f55a8 Fixes for tests (SahilJain314)
1688ff2 Fixed dynamic batching and improved memory usage (SahilJain314)
af9f767 default expandable segments on (SahilJain314)
678caf7 Added basic sequence packing (SahilJain314)
ec9b174 Added basic sequence packing (SahilJain314)
72ff438 Fixed PP with sequence packing (SahilJain314)
7142efa Updated Megatron patch (SahilJain314)
ce21205 Remove custom_fsdp mentions (SahilJain314)
7b3cc97 Bump ray (SahilJain314)
6608e62 Added a 70b config with megatron (SahilJain314)
e0002c4 Merge branch 'sahilj/megatron_packed' of github.com:NVIDIA/NeMo-RL in… (SahilJain314)
6a24f56 Sequence packing (ahmadki)
04859b7 checkpoint (ahmadki)
7592a35 revert some packing changes (ahmadki)
6c57e22 initial fix for different micro batch lengths (ahmadki)
051b329 benchmark configs (ahmadki)
fbbd77a minor fixes (ahmadki)
88919cf grpo configs (ahmadki)
4fb888d grpo fixes so code would run (ahmadki)
1da878e loss function fixes, added packing strategy to policy (ahmadki)
db26d63 SFT config/API cleanup (ahmadki)
5598f50 debug mode on (ahmadki)
c1b072f new packing (ahmadki)
06de588 cleanup (ahmadki)
29a0cae Merge branch 'main' into ahmadki/dev/sequence_packing_2 (ahmadki)
34bca2e Merge branch 'sahilj/megatron_packed' into ahmadki/dev/sequence_packi… (ahmadki)
41972d7 implemented MFFD as a "SequencePacker", moved it to bin packing algor… (ahmadki)
ab47ed1 logging cleanup (ahmadki)
35e0421 made packing algorithms naming more clear (ahmadki)
e2a3375 more code cleanup (ahmadki)
099ccce Merge branch 'main' into ahmadki/dev/sequence_packing_2 (ahmadki)
1c8cc46 reduce amount of diff with main (ahmadki)
76094e0 reduce amount of diff with main 2 (ahmadki)
fd75847 added back flash-attn dependency (ahmadki)
773c6db cleanup and config alignments (ahmadki)
8f64913 Merge branch 'main' into ahmadki/dev/sequence_packing_2 (ahmadki)
61804e4 config alignments, configs for new implementation (ahmadki)
67c2958 generic get_packer (ahmadki)
ad31fce config syntax cleanup (ahmadki)
e5811c2 moved dtensor sequence packing functions into hf common (ahmadki)
9f35db8 typed flash attention kwargs (ahmadki)
8674b96 dropped database based seq packing (ahmadki)
948096d typo (ahmadki)
c3c8b66 unified loss_fn for seq packing (ahmadki)
ff27c79 config organization (ahmadki)
106ef8c removed debug configs (ahmadki)
3b1b444 more config cleanup (ahmadki)
dc3e2d6 removed PackedDataset (ahmadki)
84be3b8 Merge branch 'main' into ahmadki/sequence_packing (ahmadki)
fe5e0e1 aligned NeMo git submodule with main (ahmadki)
6f9e203 Merge branch 'main' into ahmadki/sequence_packing (SahilJain314)
d81a24d Merge branch 'main' into ahmadki/sequence_packing (SahilJain314)
b4c56b6 Lint fix (SahilJain314)
222dc8a Load AutoModelForCausalLM weight in FP32 (ahmadki)
a9ec7cd Merge branch 'main' into ahmadki/sequence_packing (ahmadki)
049e193 Merge branch 'main' into ahmadki/sequence_packing (ahmadki)
03208bc Updating seq packing algo to modified ffd (SahilJain314)
3764f77 Enabling sequence packing by default for megatron (SahilJain314)
bd2a393 Critical sequence packing fixes for Megatron (SahilJain314)
f2db981 Init CP (no pp (SahilJain314)
c0d2898 Fixed CP + PP (SahilJain314)
0543108 Cleanup (SahilJain314)
2b1b4b5 Fixed unit tests (SahilJain314)
24cf74f lint (SahilJain314)
da84d6f copyright (SahilJain314)
e14e41b copyright (SahilJain314)
790d803 Merge branch 'main' into sahilj/cp-rebase (SahilJain314)
3e22ec2 bugfix (SahilJain314)
75e2465 PR fixes (SahilJain314)
3dadbf2 Merge remote-tracking branch 'origin' into sahilj/cp-rebase (SahilJain314)
508fdf4 PR Fixes (SahilJain314)
9c2b013 Update nemo_rl/models/policy/__init__.py (SahilJain314)
8b9ed7e Update tests/unit/data/packing/test_algorithms.py (SahilJain314)
b386ab3 Lint, also adding Ahmad as Coauthor (SahilJain314)
11e3b7d Fixed dtensor sequence packing (SahilJain314)
a4416f2 Merge remote-tracking branch 'origin/main' into sahilj/cp-rebase (SahilJain314)
9ee7bf5 Fixed NeMo commit merge (SahilJain314)
1e95143 feat: Enable CP during get_logprobs for dtensor worker. (#678) (joyang-nv)
5e497ce Try unit fix (SahilJain314)
62a6f01 fix: remove unnecessary ray initialization since it's handled at the … (terrykong)
4500cde Unit fix (SahilJain314)
36f67ac docs: update converter path in README. (#672) (xxman-google)
75e2f69 fix: make mcore lr scheduler configuration consistent with dtensor (#… (ashors1)
d431685 fix: fix mcore LR increment (#685) (ashors1)
f9ef28f fix: upgrade datasets to fix squad download (#692) (ashors1)
cc31642 fix: Megatron config updates to avoid OOM (#687) (ashors1)
df53dbc fix: fix lr scheduler for config that was missed in #681 (#693) (ashors1)
e93284d fix: Fix gemma models broken by HF update (#676) (yfw)
172dd0a chore: add CP+SP (sequence parallel) assertion in DTensor worker (#689) (yuki-97)
9083a2e Lint (SahilJain314)
db08e14 Fixed generation test (SahilJain314)
75bb1b2 revert conftest (SahilJain314)
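Several commits above ("implemented MFFD as a 'SequencePacker'…", "Updating seq packing algo to modified ffd") refer to first-fit-decreasing bin packing for grouping variable-length sequences into fixed token budgets. The sketch below is an illustrative, hypothetical implementation of the plain FFD idea, not the PR's actual `SequencePacker` code; the function name and signature are invented for this example.

```python
def pack_sequences_ffd(seq_lens, capacity):
    """First-fit-decreasing packing: visit sequences longest-first and
    place each into the first bin with room, opening a new bin if none
    fits. Returns a list of bins, each a list of sequence indices."""
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i], reverse=True)
    bins, loads = [], []  # parallel lists: member indices, current token load
    for i in order:
        if seq_lens[i] > capacity:
            raise ValueError(f"sequence {i} exceeds capacity {capacity}")
        for b in range(len(bins)):
            if loads[b] + seq_lens[i] <= capacity:
                bins[b].append(i)
                loads[b] += seq_lens[i]
                break
        else:  # no existing bin had room
            bins.append([i])
            loads.append(seq_lens[i])
    return bins

# With a token budget of 10, five sequences pack into three bins:
# lengths 8 and 2 share a bin, 5 and 4 share a bin, 3 gets its own.
print(pack_sequences_ffd([8, 5, 4, 3, 2], capacity=10))  # [[0, 4], [1, 2], [3]]
```

FFD is a common baseline here because it is simple and guarantees bin usage within a constant factor of optimal; the "modified" variant in the PR presumably adds constraints (e.g. divisibility requirements from parallelism) on top of this skeleton.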
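The context-parallelism commits ("Init CP (no pp", "Fixed CP + PP", "Enable CP during get_logprobs for dtensor worker") shard each sequence across CP ranks. Megatron-style CP typically splits a sequence into 2*cp_size chunks and gives rank r the r-th and mirror-image chunks, so causal-attention work is balanced across ranks. The helper below is a hypothetical index-level sketch of that load-balanced layout, not code from this PR.

```python
def cp_shard_indices(seq_len, cp_size, rank):
    """Load-balanced context-parallel sharding: cut the sequence into
    2*cp_size equal chunks; rank r takes chunk r plus the mirrored chunk
    (2*cp_size - 1 - r), pairing a cheap early chunk with an expensive
    late one under causal attention."""
    assert seq_len % (2 * cp_size) == 0, "seq_len must divide into 2*cp_size chunks"
    chunk = seq_len // (2 * cp_size)
    mirror = 2 * cp_size - 1 - rank
    first = list(range(rank * chunk, (rank + 1) * chunk))
    second = list(range(mirror * chunk, (mirror + 1) * chunk))
    return first + second

# For a length-8 sequence on cp_size=2:
print(cp_shard_indices(8, 2, rank=0))  # [0, 1, 6, 7]
print(cp_shard_indices(8, 2, rank=1))  # [2, 3, 4, 5]
```

Note how every token appears on exactly one rank, and the rank holding the earliest tokens also holds the latest ones; this is the scheme the CP+sequence-packing fixes above have to preserve once multiple packed sequences share one buffer.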