Issues: microsoft/DeepSpeed
#6772 [BUG] [Fix-Suggested] ZeRO Stage 3 Overwrites Module ID Attribute Causing Incorrect Expert Placement on GPUs
Labels: bug, training. Opened Nov 20, 2024 by traincheck-team.
#6771 [BUG] [Fix-Suggested] Checkpoint Inconsistency When Freezing Model Parameters Before deepspeed.initialize
Opened Nov 20, 2024 by traincheck-team.
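The checkpoint-inconsistency report above concerns parameters frozen before the model is handed to deepspeed.initialize. A minimal sketch of that ordering, with a hypothetical two-layer model; the deepspeed.initialize call itself is left commented since it needs a launcher environment, and ds_config is an assumed config dict:

```python
import torch

# Hypothetical model: freeze the first layer, train the second.
model = torch.nn.Sequential(
    torch.nn.Linear(8, 8),
    torch.nn.Linear(8, 2),
)

# Freeze *before* engine construction -- the ordering the issue discusses.
for p in model[0].parameters():
    p.requires_grad = False

# Pass only the still-trainable parameters to the engine.
trainable = [p for p in model.parameters() if p.requires_grad]

# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=trainable, config=ds_config)
```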
#6770 [BUG] [Fix-Suggested] KeyError in stage_1_and_2.py Due to Optimizer-Model Parameter Mismatch
Opened Nov 20, 2024 by traincheck-team.
#6767 [BUG] clip_grad_norm for zero_optimization mode is not working
Labels: bug, training. Opened Nov 20, 2024 by chengmengli06.
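For context on the clip_grad_norm report above: under ZeRO, gradients are partitioned across ranks, so calling torch.nn.utils.clip_grad_norm_ on the model's parameters only sees the local shard. DeepSpeed instead clips inside the engine when "gradient_clipping" is set in the config. A sketch (the batch size is illustrative):

```python
# Sketch of a DeepSpeed config enabling engine-side gradient clipping.
# With ZeRO the engine clips the *global* gradient norm across all
# partitions, which manual torch-level clipping cannot see.
ds_config = {
    "train_batch_size": 32,      # illustrative value
    "gradient_clipping": 1.0,    # max global gradient norm
    "zero_optimization": {"stage": 2},
}
```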
#6756 [BUG] NCCL operation timeout when training with deepspeed_zero3_offload or deepspeed_zero3 on RTX 4090
Labels: bug, training. Opened Nov 18, 2024 by MLS2021.
#6752 Some demos on how to configure offloading tensors to an NVMe device
Opened Nov 15, 2024 by niebowen666.
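The NVMe-offload demo request above maps to DeepSpeed's ZeRO-3 offload settings. A sketch of such a config; the "/local_nvme" mount path is an assumption and should point at a fast local SSD:

```python
# Sketch of a ZeRO-3 config offloading parameters and optimizer state
# to NVMe via the offload_param / offload_optimizer sections.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/local_nvme",   # assumed mount point
            "pin_memory": True,
        },
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/local_nvme",
        },
    },
}
```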
#6747 Model Checkpoint docs are incorrectly rendered on deepspeed.readthedocs.io
Labels: bug, documentation. Opened Nov 12, 2024 by akeshet.
#6744 Is DeepSpeed-Domino compatible with other parallel strategies?
Opened Nov 12, 2024 by Andy666G.
#6743 [BUG] max_grad_norm has no effect
Labels: bug, compression. Opened Nov 12, 2024 by yiyepiaoling0715.
#6729 GPU memory is not released after deleting tensors in optimizer.bit16groups
Opened Nov 8, 2024 by wheresmyhair.
#6727 [BUG] Any clue about the MFU drop?
Labels: bug, training. Opened Nov 8, 2024 by SeunghyunSEO.
#6725 [BUG] [ROCm] Fine-tuning DeepSeek-Coder-V2-Lite-Instruct with 8 MI300X GPUs results in c10::DistBackendError
Labels: bug, rocm, training. Opened Nov 8, 2024 by nikhil-tensorwave.
#6719 [BUG] ZeRO-3 for torch.compile with compiled_autograd when running LayerNorm
Labels: bug, training. Opened Nov 6, 2024 by yitingw1.
#6718 [BUG] DeepSpeed accuracy issue with torch.compile if the activation checkpoint function is not compiler-disabled
Labels: bug, training. Opened Nov 6, 2024 by jerrychenhf.
#6713 [BUG] Issue with ZeRO optimization for Llama-2-7b fine-tuning on Intel GPUs
Labels: bug, training. Opened Nov 5, 2024 by molang66.
#6709 "__nv_bfloat162" has already been defined
Labels: install, windows. Opened Nov 4, 2024 by wolfljj.
#6708 [REQUEST] Some questions about DeepSpeed sequence parallelism
Labels: enhancement. Opened Nov 4, 2024 by yingtongxiong.
#6701 [REQUEST] Non-element-wise Optimizer Compatibility
Labels: enhancement. Opened Nov 2, 2024 by Triang-jyed-driung.
#6699 How can I convert ZeRO-0 DeepSpeed weights into an fp32 model checkpoint?
Labels: enhancement. Opened Nov 1, 2024 by liming-ai.
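For context on the conversion question above: for ZeRO stages 1-3, DeepSpeed ships a consolidation helper that merges partitioned shards into a single fp32 state dict; whether the same path covers ZeRO-0 (which does not partition optimizer state) is what the issue asks. A non-executable sketch of the standard route, with "ckpt_dir" as a hypothetical checkpoint directory:

```python
# Standard consolidation path for ZeRO 1-3 checkpoints (sketch only,
# requires a deepspeed installation and a real checkpoint directory):
#
# from deepspeed.utils.zero_to_fp32 import (
#     get_fp32_state_dict_from_zero_checkpoint,
# )
# state_dict = get_fp32_state_dict_from_zero_checkpoint("ckpt_dir")
# torch.save(state_dict, "pytorch_model.bin")
#
# Equivalently, DeepSpeed writes a standalone zero_to_fp32.py script
# into each checkpoint directory, usable from the command line:
#
#   python zero_to_fp32.py ckpt_dir pytorch_model.bin
```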
#6691 [BUG] Universal Checkpoint Conversion: Resumed Training Behaves as If Model Initialized from Scratch
Labels: bug, training. Opened Oct 30, 2024 by purefall.