PyTorch example failed #7
Comments
Thanks for reporting. Just to confirm: were you trying to do single-machine training?
Yes, I am going to start with the single-machine example.
Did you set "export DMLC_NUM_SERVER=1" or not? I know the comment says the value does not matter; I just want to confirm.
@bobzhuyb Thanks, after I set the environment variable, the program runs normally! It seems the value does matter for this example, so the documentation needs to be updated.
Hmm, another issue arises: how do I exit the training process cleanly? Ctrl+C only kills the program on one GPU and leaves the other GPUs occupied.
Thanks for letting us know. We will update the documents (actually, I think we should update the code: non-distributed mode should not check that value at all). Ctrl+C kills the main process; I am not sure why the child processes are not killed. You can run "ps -ef", find those child processes, and kill them. We have always been killing the whole Docker container, so we never encountered this problem.
Thank you for the report. @bobzhuyb I will remove the irrelevant env check.
Describe the bug
Following the instructions in step-by-step-tutorials.md, I failed to run the example.
To Reproduce
Steps to reproduce the behavior:
The error messages are attached below
Environment (please complete the following information):