This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Improve stability of remote training service. #2474

Merged (7 commits) on May 25, 2020

Conversation

squirrelsc
Member

@squirrelsc squirrelsc commented May 22, 2020

Killing a GPU collector on Windows may fail for unknown reasons. Because the GPU collector was designed as a single-instance process, a collector that fails to exit prevents the GPU collectors of subsequent experiments from starting. This fix mitigates the issue.

  1. Remove the single-instance check of the GPU collector, so one machine can now run multiple GPU collectors.
  2. Add a killSelf parameter to the killChildProcesses command, which is convenient in some cases.
  3. Use the -Force parameter of the Stop-Process command to make the kill operation more reliable.
  4. Add stop detection to the GPU tail loop, so that it can stop earlier when killing the GPU collector is slow.
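
To illustrate item 4, here is a minimal sketch of a tail loop that checks a stop flag on every iteration. It is not the code from this PR: the function name `tail_until_stopped`, the `stop_event` flag, and the `poll_interval` parameter are all hypothetical, chosen only to show the idea of exiting the loop without waiting for the collector process to be killed.

```python
import threading
from typing import Iterator, TextIO


def tail_until_stopped(stream: TextIO,
                       stop_event: threading.Event,
                       poll_interval: float = 0.1) -> Iterator[str]:
    """Yield new lines from `stream` until `stop_event` is set.

    Hypothetical helper: the names and signature are assumptions made
    for this sketch and are not taken from the PR.
    """
    # Check the stop flag before every read, so the loop ends promptly
    # even if the process writing to the stream is slow to die.
    while not stop_event.is_set():
        line = stream.readline()
        if line:
            yield line.rstrip("\n")
        else:
            # No new data yet: wait for the poll interval, but wake up
            # immediately if the stop flag is set in the meantime.
            stop_event.wait(poll_interval)
```

A caller would set `stop_event` when the experiment is stopping; the loop then returns at its next check instead of depending on the kill operation finishing first.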

Minor changes

  1. I found that the experiment may turn to an error state when stopping, but I have no idea how this happens. Add the status to the assert message to show which status it is in.
  2. Clean up the IT server cache to prevent the disk from filling up.

@SparkSnail SparkSnail merged commit be09f11 into microsoft:v1.6 May 25, 2020

4 participants