improve process is alive detection #1361

Open
elfenpiff opened this issue May 18, 2022 · 4 comments · Fixed by #2056
Labels
enhancement (New feature), refactoring (Refactor code without adding features), technical debt (unclean code and design flaws)

Comments

@elfenpiff
Contributor

elfenpiff commented May 18, 2022

Brief feature description

Under high CPU load, the heartbeat thread may not send its heartbeats within the required time frame. RouDi then cleans up all resources of the application that missed the heartbeat, although the application may still be running and go on to use resources that have already been deleted.
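To make the failure mode concrete, here is a minimal sketch of a purely timeout-based liveness check. The names `HeartbeatRecord` and `isConsideredDead` and the timeout values are hypothetical and not taken from iceoryx; the point is only that such a check looks at the age of the last heartbeat and never at the process itself.

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical monitor-side record of the last heartbeat received from an application.
struct HeartbeatRecord {
    Clock::time_point lastHeartbeat;
};

// Purely timeout-based liveness: only the age of the last heartbeat is checked,
// never whether the process is actually still running.
bool isConsideredDead(const HeartbeatRecord& record, Clock::time_point now,
                      std::chrono::milliseconds timeout) {
    return now - record.lastHeartbeat > timeout;
}
```

If the heartbeat thread is starved for longer than `timeout`, `isConsideredDead` returns true although the process is alive, which is exactly the false positive that triggers the premature cleanup.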

The solution should be as efficient as possible and, if possible, avoid context switches and sending messages.
One approach could be to use getpgid, which returns the process group ID of a given PID. If no process with that PID exists, the call fails. Since PIDs can be reused, we would need to couple this with the process runtime or creation time; together they identify a specific process, so we can check whether that process is still alive.
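A minimal sketch of the getpgid idea, assuming Linux, where the process start time can be read from field 22 (`starttime`, in clock ticks since boot) of `/proc/<pid>/stat`. `ProcessIdentity`, `pidExists`, `processStartTime`, `identify`, and `isSameProcessAlive` are hypothetical names, and this is not how the issue was eventually fixed. Reading `/proc` is itself a file access, so it is not free of system calls.

```cpp
#include <cerrno>
#include <fstream>
#include <optional>
#include <sstream>
#include <string>
#include <sys/types.h>
#include <unistd.h>

// PID plus start time: a PID alone is ambiguous because it can be reused.
struct ProcessIdentity {
    pid_t pid;
    unsigned long long startTime;  // clock ticks since boot (Linux /proc)
};

// Cheap existence check: getpgid() fails with ESRCH if no such PID exists.
bool pidExists(pid_t pid) {
    if (getpgid(pid) != -1) {
        return true;
    }
    return errno != ESRCH;  // e.g. EPERM still implies that the PID exists
}

// Linux-specific: reads field 22 (starttime) from /proc/<pid>/stat.
std::optional<unsigned long long> processStartTime(pid_t pid) {
    std::ifstream stat("/proc/" + std::to_string(pid) + "/stat");
    std::string line;
    if (!std::getline(stat, line)) {
        return std::nullopt;
    }
    // Field 2 (comm) is wrapped in parentheses and may contain spaces,
    // so parsing starts after the last ')'.
    const auto commEnd = line.rfind(')');
    if (commEnd == std::string::npos) {
        return std::nullopt;
    }
    std::istringstream rest(line.substr(commEnd + 1));
    std::string field;
    // Tokens after ')' start at field 3, so field 22 is the 20th token.
    for (int i = 0; i < 20; ++i) {
        if (!(rest >> field)) {
            return std::nullopt;
        }
    }
    return std::stoull(field);
}

// Captured once, e.g. when an application registers.
std::optional<ProcessIdentity> identify(pid_t pid) {
    if (!pidExists(pid)) {
        return std::nullopt;
    }
    const auto start = processStartTime(pid);
    if (!start) {
        return std::nullopt;
    }
    return ProcessIdentity{pid, *start};
}

// Alive means: the PID exists AND it still belongs to the same process.
bool isSameProcessAlive(const ProcessIdentity& id) {
    if (!pidExists(id.pid)) {
        return false;
    }
    const auto current = processStartTime(id.pid);
    return current.has_value() && *current == id.startTime;
}
```

One way to use this would be to record `identify(pid)` at registration and call `isSameProcessAlive()` only after a missed heartbeat, so the extra cost is paid only on the suspicious path.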

Relates

#1380

elfenpiff added the enhancement, refactoring, and technical debt labels May 18, 2022
@mossmaurice
Contributor

@elfenpiff This is related to both #611 and #620. We should follow RAII for the resources of the app. I suppose a hierarchical structure as sketched in the .puml would allow easier handling of the resources in shared memory.

@elfenpiff
Contributor Author

@mossmaurice Only loosely related. The problem here is not the handling of shared memory resources.

RouDi wrongly assumes that an application has died because the high CPU load prevented the application from sending its heartbeat within the required time frame.

@elBoberido
Member

@elfenpiff I think we could use a pipe or a stream socket. AFAIK, when the writing end of a pipe or stream socket is closed, the process holding the receiving end gets a POLLHUP from poll.
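A minimal sketch of the pipe idea on Linux. To keep the demo self-contained, the monitored process is a forked child; unrelated processes such as RouDi and an application would instead need a named pipe or a Unix domain stream socket to share the connection. `writerHasHungUp`, `spawnMonitoredChild`, and `killAndReap` are hypothetical helpers. Because the kernel closes all file descriptors of a process when it exits, the read end reports POLLHUP no matter how the process died.

```cpp
#include <cerrno>
#include <csignal>
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Monitor side: true if every holder of the write end is gone (POLLHUP on the
// read end). A timeout of 0 makes the check non-blocking.
bool writerHasHungUp(int readFd, int timeoutMs) {
    pollfd pfd{};
    pfd.fd = readFd;
    pfd.events = POLLIN;
    int ret;
    do {
        ret = poll(&pfd, 1, timeoutMs);
    } while (ret == -1 && errno == EINTR);
    if (ret <= 0) {
        return false;  // timeout or error: no hang-up observed
    }
    return (pfd.revents & POLLHUP) != 0;
}

// Demo helper: forks a child that holds the write end until it is killed.
// Returns the child's PID and stores the parent's read end in readFd.
pid_t spawnMonitoredChild(int& readFd) {
    int fds[2];
    if (pipe(fds) == -1) {
        return -1;
    }
    const pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        close(fds[0]);  // the child keeps only the write end
        for (;;) {
            pause();    // stay alive until a signal terminates the child
        }
    }
    close(fds[1]);      // the parent must drop its own copy of the write end
    readFd = fds[0];
    return pid;
}

// Demo helper: simulates a crash of the monitored process and reaps it.
void killAndReap(pid_t pid) {
    kill(pid, SIGKILL);
    waitpid(pid, nullptr, 0);
}
```

POLLHUP only appears once every copy of the write end is closed, so the monitor must close its own copy (as the parent does here) and the application must not leak the descriptor into other processes. Unlike a heartbeat, this signal does not depend on the application being scheduled in time.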

elfenpiff added a commit to ApexAI/iceoryx that referenced this issue May 23, 2022
…essage handle thread, remove isMonitored and handle it directly in roudi

Signed-off-by: Christian Eltzschig <[email protected]>
elfenpiff added a commit to ApexAI/iceoryx that referenced this issue Jun 3, 2022
…n background threads as well as the keepalive background thread of applications. Deactivated roudi monitoring as well

Signed-off-by: Christian Eltzschig <[email protected]>
elfenpiff added 2 commits to ApexAI/iceoryx that referenced this issue Jun 3, 2022
elfenpiff added a commit to ApexAI/iceoryx that referenced this issue Jun 3, 2022
…hread and always disable monitoring in roudi

Signed-off-by: Christian Eltzschig <[email protected]>
@qclzdh

qclzdh commented Jun 20, 2022

Some information from my side about the monitoring mode:
When the CPU load is high, there is a good chance that the "keepalivemsg" cannot be sent to RouDi within PROCESS_KEEP_ALIVE_TIMEOUT, so we use "posix::FileLock::create(runtimeName);" to check whether the process has really died.
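A minimal sketch of the lock-file idea using `flock()`, assuming the application takes an exclusive lock on a per-runtime file at startup and holds it until it exits. Whether `posix::FileLock` is built on `flock()` is an assumption here; `acquireAliveLock`, `checkLockOwner`, and `Liveness` are hypothetical names. Because the kernel drops the lock when the process dies, the monitor can tell a busy application from a dead one by trying to take the lock itself.

```cpp
#include <cerrno>
#include <fcntl.h>
#include <string>
#include <sys/file.h>
#include <unistd.h>

enum class Liveness { Alive, Dead, Unknown };

// Application side: take an exclusive lock and keep the descriptor open for the
// whole lifetime of the process. Returns the descriptor, or -1 on failure.
int acquireAliveLock(const std::string& path) {
    const int fd = open(path.c_str(), O_RDWR | O_CREAT | O_CLOEXEC, 0644);
    if (fd == -1) {
        return -1;
    }
    if (flock(fd, LOCK_EX | LOCK_NB) == -1) {
        close(fd);  // someone else already holds the lock
        return -1;
    }
    return fd;
}

// Monitor side: if we can take the lock, its previous owner is gone.
Liveness checkLockOwner(const std::string& path) {
    const int fd = open(path.c_str(), O_RDWR | O_CLOEXEC);
    if (fd == -1) {
        return Liveness::Unknown;  // e.g. the lock file does not exist
    }
    Liveness result = Liveness::Unknown;
    if (flock(fd, LOCK_EX | LOCK_NB) == 0) {
        flock(fd, LOCK_UN);  // release immediately; this was only a probe
        result = Liveness::Dead;
    } else if (errno == EWOULDBLOCK) {
        result = Liveness::Alive;  // the lock is still held
    }
    close(fd);
    return result;
}
```

Locks obtained through separate `open()` calls conflict even within one process, so closing the application's descriptor can stand in for process death in a single-process demo. Opening with `O_CLOEXEC` keeps exec'd child processes from inheriting the descriptor and thereby holding the lock after the application has died.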

elBoberido added a commit that referenced this issue Oct 15, 2023
elBoberido added 4 commits that referenced this issue Oct 23, 2023
elBoberido added 8 commits that referenced this issue Oct 25, 2023
elBoberido added 4 commits that referenced this issue Oct 30, 2023
elBoberido added a commit that referenced this issue Nov 1, 2023
…ory-for-process-alive-detection

iox-#1361 Use shared memory for process alive detection
@elBoberido elBoberido linked a pull request Nov 1, 2023 that will close this issue