Improve stability of remote training service. #2474

squirrelsc · 2020-05-22T07:15:09Z

Killing a GPU collector on Windows may fail for an unknown reason. Because the GPU collector is designed as a single-instance process, a collector that survives the kill prevents the GPU collectors of subsequent experiments from starting. This fix mitigates the issue.

Remove the single-instance check from the GPU collector, so one machine can now run multiple GPU collectors.
Add a killSelf parameter to the killChildProcesses command, which is convenient in some cases.
Use the -Force parameter of the Stop-Process command to make the kill operation more reliable.
Add stop detection to the GPU tail loop, so that it can stop earlier when killing the GPU collector is slow.
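
As a rough illustration of two of the changes above, here is a minimal Python sketch. It is not the code in this PR, which drives PowerShell's Stop-Process on the remote machine. The names `pids_to_kill`, `tail_until_stopped`, `kill_self`, and the process-tree dictionary are hypothetical, chosen only to show the idea: kill descendants before the parent, include the parent only when asked, and let the tail loop check a stop flag on every iteration.

```python
import threading
from typing import Callable, Dict, List


def pids_to_kill(children: Dict[int, List[int]], root: int, kill_self: bool) -> List[int]:
    """Return PIDs in kill order: descendants first, then the root if kill_self is set.

    `children` maps a PID to its direct child PIDs (a hypothetical snapshot
    of the process tree, not something this PR defines).
    """
    order: List[int] = []

    def visit(pid: int) -> None:
        for child in children.get(pid, []):
            visit(child)
        # The root itself is only included when the caller asks for it.
        if pid != root or kill_self:
            order.append(pid)

    visit(root)
    return order


def tail_until_stopped(read_line: Callable[[], str],
                       handle: Callable[[str], None],
                       stop: threading.Event,
                       poll_seconds: float = 0.01) -> int:
    """Forward lines to `handle` until `stop` is set; return how many were handled."""
    handled = 0
    while not stop.is_set():
        line = read_line()
        if line:
            handle(line)
            handled += 1
        else:
            # Nothing new yet: wait briefly, but wake up at once if stop is set.
            stop.wait(poll_seconds)
    return handled
```

Checking the stop flag in the loop condition means the loop can end on its own schedule even if the kill of the underlying collector process is slow to take effect.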

Minor changes

The experiment may turn to an error state when stopping, and I have no idea how that happens. Add information to the assert to show what status it is in.
Clean up the IT server cache to prevent the disk from filling up.
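
For the assert change, the idea is simply to put the observed status into the failure message. A minimal sketch, assuming a hypothetical helper `assert_status` and illustrative status strings (neither is taken from this PR):

```python
from typing import Iterable


def assert_status(actual: str, expected: Iterable[str]) -> None:
    """Fail with a message that names the status actually observed."""
    allowed = sorted(expected)
    assert actual in allowed, (
        f"unexpected experiment status {actual!r}, expected one of {allowed}"
    )
```

With this in place, a failing test run reports which status the experiment ended in instead of a bare assertion error.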


JSong-Jia and others added 4 commits May 20, 2020 10:42

Update README.md (microsoft#2465)

41d02d8

Stablalize remote service

9ceaaa9

And allow multiple gpu collector to run.

Revert "Update README.md (microsoft#2465)"

0946bbd

This reverts commit 41d02d8.

stopping gpu tail command earlier

30a1ab0

squirrelsc requested review from SparkSnail, chicm-ms and QuanluZhang May 22, 2020 07:15

squirrelsc added 3 commits May 22, 2020 17:09

clean up server cache to prevent disk full

f1489b9

remvoe tmp clean up, as there is another PR fixing it.

ff65257

remove multiline

7f162d6

chicm-ms approved these changes May 25, 2020


SparkSnail approved these changes May 25, 2020


SparkSnail merged commit be09f11 into microsoft:v1.6 May 25, 2020
