torchrun not identified #1
Comments
Okay, I'll fix that.
You can try the latest commit. I removed the argument
Thanks for your effort. I tried it but got an error.
Unfortunately, I don't have a multi-card machine on hand to test with at the moment. Please make sure that the machine itself is able to run both commands at the same time. If it turns out that they can run simultaneously when launched standalone, but multiple GPUs can't be scheduled when using
I think these two DDP commands cannot be run at the same time.
Running them one after another can be implemented with a simple shell script, and a similar demo is shown in my another repo:
This may be because the two commands are running at the same time, resulting in OOM.
The new commit has an issue: it only runs properly if I specify the GPU number. In the previous commit, even if I didn't specify the GPU number, Runit could still dynamically select a GPU for my command.
Indeed, this is a newly introduced bug. I think I understand what the original problem was. Line 63 in 056b1cf
but by default only a single GPU can be used. This leads to incompatibility with multi-card programs like
I'm a little confused here. Do you mean that there is currently a workaround for running several multi-card programs from one script?
In the current version, multi-card support is not available for the time being.
Thanks. It seems that for now we can only use it for single-card programs.
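For readers who need multi-card jobs in the meantime, one possible approach is to reserve several GPU ids per command and expose them to the child process through `CUDA_VISIBLE_DEVICES`. The following is only a minimal sketch under that assumption, not this tool's implementation: `GpuPool`, `run_job`, and `run_all` are hypothetical names, and it uses a thread pool that launches subprocesses rather than the tool's own scheduling code.

```python
import os
import subprocess
import threading
from concurrent.futures import ThreadPoolExecutor


class GpuPool:
    """Hands out GPU ids and blocks until enough of them are free."""

    def __init__(self, gpu_ids):
        self._free = list(gpu_ids)
        self._total = len(self._free)
        self._cond = threading.Condition()

    def acquire(self, count):
        if count > self._total:
            raise ValueError(f"requested {count} GPUs, only {self._total} exist")
        with self._cond:
            # Take all ids in one step so two jobs never deadlock
            # while each holds only part of what it needs.
            self._cond.wait_for(lambda: len(self._free) >= count)
            taken, self._free = self._free[:count], self._free[count:]
            return taken

    def release(self, ids):
        with self._cond:
            self._free.extend(ids)
            self._cond.notify_all()


def run_job(pool, command, num_gpus):
    """Run one command with `num_gpus` reserved GPUs visible to it."""
    ids = pool.acquire(num_gpus)
    try:
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=",".join(map(str, ids)))
        result = subprocess.run(command, env=env, capture_output=True, text=True)
        return ids, result.returncode, result.stdout
    finally:
        pool.release(ids)


def run_all(jobs, gpu_ids, max_workers=4):
    """jobs is a list of (command, num_gpus) pairs; results keep the input order."""
    pool = GpuPool(gpu_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(run_job, pool, cmd, n) for cmd, n in jobs]
        return [f.result() for f in futures]
```

With this sketch, a two-card job would be submitted as `run_all([(["torchrun", "--nproc-per-node=2", "main.py"], 2)], gpu_ids=[0, 1])`, where `main.py` stands in for the real training script. Two such jobs on a two-card machine would then run one after the other instead of competing for memory.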
I recently refactored this tool based on a process pool and a cross-process communication manager! More details can be found in Feel free to use it and give feedback.
@lartpang oh that's so great bro. I will try it once I am available (I've been busy writing papers recently). This tool is really useful for my work :)
Hi, thank you for your effort. This is a really good tool!
![image](https://private-user-images.githubusercontent.com/142638691/306395340-f044fb02-dfcf-4396-a536-4fa6a12288b1.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk3MjY2NDQsIm5iZiI6MTczOTcyNjM0NCwicGF0aCI6Ii8xNDI2Mzg2OTEvMzA2Mzk1MzQwLWYwNDRmYjAyLWRmY2YtNDM5Ni1hNTM2LTRmYTZhMTIyODhiMS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjE2JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIxNlQxNzE5MDRaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1kMzBmY2EwZjI2YTFkNDQ0NjM5MmJhMjY3Y2I4Yzg3MmZkYTJhZTViOTk1ZmIzYjJhYzEwM2RmYzk1MmJiNjk0JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.uzY3tjloVuZWTw77V0yVjGoATXqZARx5rOILtNco_1g)
I tried to use it for DDP, but the command was not recognized.
I'm wondering if you could add support for it.
This is the command in config.txt:
torchrun --nproc-per-node=2 /home/jwq/Compressor/main.py