Is it correct to set up fsdp for a machine (V100) that does not support bf16? #274
Comments
Yes, it seems correct!
OK, thank you. I also want to ask about another problem: the main thread uses more memory than the other threads and overflows. How should I solve it? Do you have any suggestions?
(Sent by email in reply to Li's message of 09/14/2023 12:10: "yes it seems correct!")
I think you can refer to this link to see if you can do something. I know that you may be able to set
I have seen some code doing this before, but I have not used it myself. Maybe you should do some searching on
Thank you for the suggestions, I will try them.
I tried setting device_map to 'auto', 'balanced', 'balanced_low_0' and 'sequential' in turn. Unfortunately, each one still runs out of memory on 3 V100s (with the ViT unfrozen). In comparison, I think balanced_low_0 might work if I have enough cards. I will try it further if I get 4 V100s.
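For anyone hitting the same overflow on the first GPU, one thing that may be worth trying alongside device_map is an explicit per-device memory cap. The snippet below is only a sketch: it assumes the model is loaded through a Hugging Face `from_pretrained` call, which accepts a `max_memory` dict together with `device_map`. The helper name `build_max_memory` and the GiB figures are hypothetical and not taken from this thread.

```python
def build_max_memory(num_gpus, per_gpu_gib, gpu0_gib, cpu_gib=None):
    """Build a max_memory dict that gives GPU 0 a smaller budget than the rest.

    Hypothetical helper: the keys are GPU indices (plus an optional "cpu" entry)
    and the values are strings such as "16GiB".
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    budget = {i: f"{per_gpu_gib}GiB" for i in range(num_gpus)}
    # Leave headroom on GPU 0, which tends to carry extra load.
    budget[0] = f"{gpu0_gib}GiB"
    if cpu_gib is not None:
        # Optional CPU offload budget.
        budget["cpu"] = f"{cpu_gib}GiB"
    return budget


# Illustrative numbers only; adjust to the actual card size.
max_memory = build_max_memory(num_gpus=3, per_gpu_gib=28, gpu0_gib=16)
print(max_memory)

# Intended use (not run here; needs transformers and accelerate installed,
# and `ModelClass` / `checkpoint_path` are placeholders):
# model = ModelClass.from_pretrained(
#     checkpoint_path,
#     device_map="balanced_low_0",
#     max_memory=max_memory,
# )
```

Whether this is enough for an unfrozen ViT on 3 cards is not something the sketch can answer; it only shows how the cap is expressed.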
compute_environment: LOCAL_MACHINE
distributed_type: no
downcast_bf16: false
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
main_process_port: 20687
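The `mixed_precision: fp16` and `downcast_bf16: false` entries fit the hardware in question: as far as I know, native bf16 needs CUDA compute capability 8.0 (Ampere) or newer, while the V100 reports 7.0. A minimal sketch of that decision is below. The function name `pick_mixed_precision` is hypothetical, and in a real script the capability tuple would come from `torch.cuda.get_device_capability()`.

```python
def pick_mixed_precision(capability):
    """Return "bf16" for compute capability 8.0+ and "fp16" otherwise.

    `capability` is a (major, minor) tuple, the shape returned by
    torch.cuda.get_device_capability(). The 8.0 threshold is an assumption
    based on bf16 support starting with Ampere GPUs.
    """
    major, _minor = capability
    return "bf16" if major >= 8 else "fp16"


print(pick_mixed_precision((7, 0)))  # V100-class card
print(pick_mixed_precision((8, 0)))  # A100-class card
```

The returned string matches the values accepted by the `mixed_precision` field in the config above.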