Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure workers get killed on unregister call #942

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

maheshambule
Copy link
Contributor

@maheshambule maheshambule commented Aug 24, 2020

Issue #, if available:

The orphan processes get created when you fire multiple register and unregister calls on same model one after another.
The orphan worker processes hogs the system memory.

Description of changes:

  1. Send sigterm to Main Worker Thread from frontend by using destroy call instead of destroyForcefully.
  2. Handle sigterm and kill all the child workers and current process.
  3. Add SIGCHLD handler to handle zombie processes. Code taken from here: maaquib@6ed099a#diff-efa92912588641e6a3b20d8900316be2R167

Testing done:

To run CI tests on your changes refer README.md

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@maheshambule maheshambule changed the title Ensure workers get killed on unregsiter call Ensure workers get killed on unregister call Aug 24, 2020
@dhanainme
Copy link
Contributor

@maheshambule , Can you please do a small write regarding the testing that was done for this PR.

Please add some context / behavior before the fix / behavior after the fix with steps & logs for both the cases.

Copy link

@ayushsengupta1991 ayushsengupta1991 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested that the worker processes actually get destroyed on an unregister call.

@kastman
Copy link

kastman commented Dec 1, 2021

I'm relatively sure I'm seeing this behavior as well - memory used on invocation/handle doesn't seem to be garbage collected and I'm quickly hitting 100% even on relatively large 3 x 18xlarge instances in the context of Sagemaker multi-model endpoints. The changes from @maheshambule were approved but never merged - is that because it needed documentation? @ayushsengupta1991 @dhanainme I'm happy to write something if it would get the code accepted upstream

@kastman
Copy link

kastman commented Dec 1, 2021

(The CI failure isn't accessible anymore or I'd go in and comment.) Hoping to get this bumped / helped. Thanks,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants