Issues: microsoft/DeepSpeed
Issues list
[BUG] the input variables may be changed to scalars when use activation checkpoint
bug
Something isn't working
training
#6969
opened Jan 23, 2025 by
zhangvia
[BUG] z3+compile+gradient checkpoint uses more memory
bug
Something isn't working
training
#6966
opened Jan 22, 2025 by
oraluben
[BUG] deepspeed fails with torch 2.5 due to module._parameters is a dict, no longer a OrderedDict
bug
Something isn't working
training
#6961
opened Jan 20, 2025 by
skydoorkai
Is "Hierarchical All-to-all" feat available in current version?
#6957
opened Jan 16, 2025 by
GalanPei
[REQUEST] FPDT backward test
enhancement
New feature or request
#6955
opened Jan 16, 2025 by
YizhouZ
[REQUEST] Pipeline Parallelism support multi optimizer to train
enhancement
New feature or request
#6951
opened Jan 15, 2025 by
whcjb
[BUG] model(**input) cannot use under zero stage 3.
bug
Something isn't working
training
#6949
opened Jan 14, 2025 by
MarkDeng1
[Roadmap] DeepSpeed Roadmap Q1 2025
roadmap
Roadmap direction for DeepSpeed
#6946
opened Jan 13, 2025 by
loadams
5 tasks
DeepSpeed Installation Fails During Docker Build (NVML Initialization Issue)
#6945
opened Jan 13, 2025 by
asdfry
[BUG] deepspeed.initialize changes the output of Llama model
bug
Something isn't working
training
#6929
opened Jan 7, 2025 by
Ktakuya332C
Multi node multi gpu distributed load
enhancement
New feature or request
#6927
opened Jan 6, 2025 by
rastinrastinii
[BUG]Zero++ training failed
bug
Something isn't working
training
#6926
opened Jan 6, 2025 by
HelloWorld506
[REQUEST] Deepspeed Inference Supports VL (vision language) model
enhancement
New feature or request
#6917
opened Dec 26, 2024 by
ethen8181
[BUG] Cannot access local variable 'locations' where it is not associated with a value
bug
Something isn't working
training
#6913
opened Dec 25, 2024 by
Guodanding
[BUG]Convergence Issue: Training BERT for Embedding with Zero2 and 3 as compared to Torchrun
bug
Something isn't working
training
#6911
opened Dec 24, 2024 by
dawnik17
[BUG] RuntimeError: The size of tensor a (2048) must match the size of tensor b (1024) at non-singleton dimension 2
bug
Something isn't working
deepspeed-chat
Related to DeepSpeed-Chat
#6910
opened Dec 24, 2024 by
Lowlowlowlowlowlow
[REQUEST] Support for XLA/TPU
enhancement
New feature or request
#6901
opened Dec 21, 2024 by
radna0
prterun noticed that process rank 7 with PID 0 on node gpu0304 exited on signal 6 (Aborted).
#6896
opened Dec 19, 2024 by
fabiogeraci
Using zero3 on multiple nodes is slow
bug
Something isn't working
training
#6889
opened Dec 18, 2024 by
HelloWorld506
How can DeepSpeed be configured to prevent the merging of parameter groups
#6878
opened Dec 16, 2024 by
CLL112