Issues: pytorch/torchtitan
Mitigation to HuggingFace Trainer
enhancement
New feature or request
#824
opened Feb 6, 2025 by
huyiwen
WARNING - When using FSDP, it's recommended to enable config.force_recompute_fp8_weight_in_bwd.
module: fsdp
question
Further information is requested
#821
opened Feb 5, 2025 by
c0g
HSDP causes loss instability
module: fsdp
question
Further information is requested
#813
opened Jan 31, 2025 by
apkumar
debug model training hangs on NVIDIA B200 with >1 GPU
bug
Something isn't working
module: c10d
#810
opened Jan 28, 2025 by
vkuzo
Loss metrics dramatically change after resuming from checkpoint
bug
Something isn't working
enhancement
New feature or request
module: checkpoint
release_blocking
Issues that are blocking the milestone / release completion
Gradient Scaling With Pipeline Parallelism
module: pipelining
question
Further information is requested
#803
opened Jan 24, 2025 by
windsornguyen
should we have an extension point for model transforms out of tree?
enhancement
New feature or request
#790
opened Jan 15, 2025 by
vkuzo
[Bug] Unexpected performance drop with float8 training + compiling only nn.Linear layers + using selective per op AC
bug
Something isn't working
#786
opened Jan 10, 2025 by
danielvegamyhre
Why use RowwiseParallel for nn.Embedding instead of ColwiseParallel?
question
Further information is requested
#785
opened Jan 10, 2025 by
corey-lambda
BUG: early_step_in_backward with pipeline parallelism and len(model_parts) > 1
bug
Something isn't working
#777
opened Jan 7, 2025 by
cassanof
PP hangs when pipeline_parallel_microbatches < pipeline_parallel_degree
bug
Something isn't working
#775
opened Jan 6, 2025 by
cassanof
ZBVZeroBubble error
bug
Something isn't working
module: pipelining
#774
opened Jan 3, 2025 by
hhaAndroid
PP InterleavedZeroBubble schedule shows low TPS and high memory usage
bug
Something isn't working
module: pipelining
release_blocking
Issues that are blocking the milestone / release completion
FSDP 2 doesn't pad tensors?
question
Further information is requested
#764
opened Dec 29, 2024 by
cassanof
Checkpoint conversion
module: checkpoint
question
Further information is requested
#758
opened Dec 20, 2024 by
MaxiBoether
[question]can't disable CP for specific (unsupported) SDPA op
enhancement
New feature or request
module: context parallel
#757
opened Dec 20, 2024 by
FindDefinition
Any plans to support DPO training?
enhancement
New feature or request
#756
opened Dec 20, 2024 by
xs1997zju
JobConfig does not support typing
enhancement
New feature or request
#753
opened Dec 18, 2024 by
greeneggsandyaml
Model init with HuggingFace model
bug
Something isn't working
question
Further information is requested
#743
opened Dec 16, 2024 by
neeldani
Low bit Optimizers & FA-3
bug
Something isn't working
question
Further information is requested
#742
opened Dec 16, 2024 by
asahni04
using fsdp2 wrapper Flux(text to image) model , gradient is inconsistent with fsdp1
question
Further information is requested
#734
opened Dec 13, 2024 by
yanmj0601
Issue: Loss Discrepancy Between FSDP1 and FSDP2 with AdamW Optimizer
question
Further information is requested
#724
opened Dec 9, 2024 by
Teng-xu