Fix startup with failing configuration #26126
Pinging @elastic/agent (Team:Agent)
💚 Build Succeeded
💚 Flaky test report: Tests succeeded.
    procState = ps
case <-a.bgContext.Done():
    a.Stop()
    return
Just curious: is there any connection between a.bgContext and ctx? Does ctx have to be cancelled when a.bgContext is cancelled?
bgContext is like a cancellation token passed from the top of the app and cancelled on exit or unenroll, so the agent cleans up and backs up everything gracefully. We want to avoid shutting down the agent without any cleanup, as this may turn out to be problematic.
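For readers unfamiliar with the pattern, here is a minimal sketch of a watcher loop that reacts to a background context being cancelled. The `app` type, its `stopped` channel, and `runDemo` are hypothetical stand-ins for illustration, not the agent's actual types.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// app is a hypothetical stand-in for the agent's application wrapper.
type app struct {
	bgContext context.Context // cancelled from the top of the program on exit or unenroll
	stopped   chan struct{}   // closed once Stop has run (illustration only)
}

// Stop performs cleanup; in this sketch it only records that it ran.
func (a *app) Stop() {
	close(a.stopped)
}

// watch waits for either a state update or cancellation of bgContext.
func (a *app) watch(states <-chan string) {
	for {
		select {
		case s := <-states:
			fmt.Println("state:", s)
		case <-a.bgContext.Done():
			// Clean up instead of just dropping the process on the floor.
			a.Stop()
			return
		}
	}
}

// runDemo cancels the background context and reports whether Stop ran in time.
func runDemo() bool {
	ctx, cancel := context.WithCancel(context.Background())
	a := &app{bgContext: ctx, stopped: make(chan struct{})}
	go a.watch(make(chan string))
	cancel()
	select {
	case <-a.stopped:
		return true
	case <-time.After(time.Second):
		return false
	}
}

func main() {
	fmt.Println("stopped cleanly:", runDemo())
}
```

The point of the design is that cancellation triggers the same `Stop` path as a normal shutdown, so cleanup is never skipped.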
    if a.state.ProcessInfo != proc {
        // kill original process if possible
        if proc != nil && proc.Process != nil {
            _ = proc.Process.Kill()
shall we wait or check for other errors?
this is best effort
    // was already stopped by Stop, do not restart
    if a.state.Status == state.Stopped {
        return
Are we missing an `a.appLock.Unlock()` here?
true
Checking the complete function body, we always call Unlock on each exit path. Better to add a deferred Unlock right after the Lock, to reduce the chance of introducing new deadlocks in the future.
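A minimal sketch of the deferred-unlock suggestion, assuming a trimmed-down, hypothetical `application` type (the field names other than `appLock` are invented for illustration):

```go
package main

import (
	"fmt"
	"sync"
)

type status int

const (
	running status = iota
	stopped
)

// application is a hypothetical, trimmed-down version of the type under review.
type application struct {
	appLock  sync.Mutex
	status   status
	restarts int
}

// restart returns early when the app was already stopped. Because Unlock is
// deferred right after Lock, every exit path releases the lock.
func (a *application) restart() bool {
	a.appLock.Lock()
	defer a.appLock.Unlock()

	// was already stopped by Stop, do not restart
	if a.status == stopped {
		return false
	}
	a.restarts++
	return true
}

func main() {
	a := &application{status: stopped}
	fmt.Println("restarted:", a.restart())
	// A second call would block forever if the early return had leaked the lock.
	fmt.Println("restarted:", a.restart())
}
```

With an explicit Unlock per exit path, a newly added early return is easy to get wrong; the deferred form removes that class of mistake.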
This pull request is now in conflicts. Could you fix it? 🙏
    }

    // send stop signal to request stop
    proc.Process.Signal(os.Interrupt)
Does this work on Windows? I thought process.Info had each OS implementation that is why it was added. Should we just move this logic into the Stop() function of process.Info?
You're right, Windows does not implement sending an interrupt. Using the Stop function instead.
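To illustrate the portability concern: Go's `os.Process.Signal` returns an error for `os.Interrupt` on Windows, so a stop helper needs a fallback. The sketch below is an assumption about what such a helper could look like; `stopProcess`, `signaler`, `fakeProc`, and `errUnsupported` are hypothetical names and do not reflect the actual implementation of `Stop()` in `process.Info`.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// signaler is the subset of *os.Process that this sketch needs.
type signaler interface {
	Signal(os.Signal) error
	Kill() error
}

// Compile-time check that *os.Process satisfies the interface.
var _ signaler = (*os.Process)(nil)

// errUnsupported simulates the error returned where interrupt is not implemented.
var errUnsupported = errors.New("interrupt not supported")

// stopProcess asks the process to stop gracefully and falls back to Kill
// when the interrupt cannot be delivered.
func stopProcess(p signaler) error {
	if err := p.Signal(os.Interrupt); err != nil {
		return p.Kill()
	}
	return nil
}

// fakeProc is a test double that records whether Kill was called.
type fakeProc struct {
	signalErr error
	killed    bool
}

func (f *fakeProc) Signal(os.Signal) error { return f.signalErr }
func (f *fakeProc) Kill() error            { f.killed = true; return nil }

func main() {
	// Simulate a platform where interrupt is unsupported, as on Windows.
	p := &fakeProc{signalErr: errUnsupported}
	err := stopProcess(p)
	fmt.Println("killed:", p.killed, "err:", err)
}
```

Keeping this logic behind one function means callers do not need to know which operating system they run on.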
blakerouse
left a comment
With the latest change to using Stop this looks good.
Not sure if the 7.13.2 label helps, but I would definitely like to see this backported to 7.13.x. @michalpristas
Fix startup with failing configuration (elastic#26126)
This reverts commit 5a294a4.
* master: (26 commits)
  Report total and free CPU for vSphere virtual machines (elastic#26167)
  [filebeat] Add preserve_original_event option to o365audit input (elastic#26273)
  Change xml processor names in script processor to match convention (elastic#26263)
  [Oracle] Fixing default values for paths in config template (elastic#26276)
  Add more ECS fields to logs (elastic#25998)
  [Heartbeat] Fix broken invocation of synth package (elastic#26228)
  rename sqs file name (elastic#26227)
  Populate the agent action result if there is no matching action handlers (elastic#26152)
  Add ISO8601 as supported timestamp type (elastic#25564)
  Move Filebeat azure module to GA (elastic#26168)
  Filebeat azure module pipeline fixes and changes (elastic#26148)
  libbeat: monitor version (elastic#26214)
  Add new parser to filestream input: container (elastic#26115)
  [Metricbeat] Add state_statefulset replicas.ready (elastic#26088)
  Disable test processors system test for windows 10 (elastic#26216)
  Fix startup with failing configuration (elastic#26126)
  Remove 32 bits version of Elastic Agent. (elastic#25708)
  Chane fleetmode detection to ony use management.enabled (elastic#26180)
  Make `filestream` input GA (elastic#26127)
  libbeat/idxmgmt/ilm: fix alias creation (elastic#26146)
  ...
What does this PR do?
What is going on is a somewhat odd race.
Basically, we start fine, but trouble comes on restart.
On restart we apply the stored config and start MB (`mb1`) with a failing configuration. MB does not exit; it just reports the Failed state because it cannot apply `system/load` on Windows. At this point a 10-second timer for recovery is started; if MB does not recover, we plan a restart.
Then we pull the config from Fleet and decide MB should be started.
We start MB `mb2` while `mb1` is still running. We start it because the check looks for terminal statuses, and Failed is one of them.
`mb2` starts and reports Running, so the timer for killing `mb1` is stopped, because MB is reported as running and the status does not differentiate between processes.
`mb2` fails on a `data.path` conflict with `mb1`.
The watcher detects the stopped Metricbeat and `mb3` is started. `mb1` is still running, but we no longer have any information about `mb1` anywhere.
`mb3` fails on the same thing `mb2` failed on and exits.
`mb1` is still running in the Failed state, and we try `mbX` over and over again.

This PR adds some closers to the watcher, adds some checks on Failure termination, and passes proc so that the watcher and the terminator kill the process they were designed to kill.
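The idea of tying a watcher to one specific process can be sketched as follows. This is a simplified illustration under stated assumptions, not the code from this PR: `procInfo`, `app`, `start`, and `onExit` are hypothetical names, and killing is modelled as setting a flag.

```go
package main

import (
	"fmt"
	"sync"
)

// procInfo is a hypothetical handle for one started process.
type procInfo struct {
	pid    int
	killed bool
}

// app tracks which process is the current one.
type app struct {
	mu      sync.Mutex
	current *procInfo
	nextPID int
	starts  int
}

// startLocked replaces the current process; the caller must hold mu.
func (a *app) startLocked() *procInfo {
	if a.current != nil {
		a.current.killed = true // best-effort kill of the previous process
	}
	a.nextPID++
	a.starts++
	a.current = &procInfo{pid: a.nextPID}
	return a.current
}

// start launches a new process and makes it the current one.
func (a *app) start() *procInfo {
	a.mu.Lock()
	defer a.mu.Unlock()
	return a.startLocked()
}

// onExit is called by the watcher of proc when proc exits. It restarts only
// if proc is still the current process, so a stale watcher cannot trigger
// another start. It reports whether a restart happened.
func (a *app) onExit(proc *procInfo) bool {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.current != proc {
		return false
	}
	a.startLocked()
	return true
}

func main() {
	a := &app{}
	mb1 := a.start()
	mb2 := a.start() // mb1 is killed rather than left running untracked
	fmt.Println("mb1 killed:", mb1.killed)
	fmt.Println("stale watcher restarts:", a.onExit(mb1))
	fmt.Println("current watcher restarts:", a.onExit(mb2))
}
```

Comparing process handles rather than a shared status is what prevents an event from one process being attributed to another.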
While this fix works, I don't like the whole approach where we handle start/stop/restart from 4-5 places that can conflict with each other. I would like to see this redesigned. But I've spent 5 days chasing this and need some social distancing from the topic, so this fix is OK for me at the moment.
How to test:
How it behaved before the fix: we had 3 Metricbeats, 2 with a stable PID (one for monitoring, one being `mb1`), and a third MB process that kept changing PID (crash-restart loop).
Why is it important?
Fixes #25829
Checklist
CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.