Description
Summary
If an LRP has already crashed once and then crashes again more than 5 minutes later, no crash event is reported for the later crash.
Steps to Reproduce
- clone and push spring-music app
- run
while true; do cf ssh spring-music -c "kill -9 \$(pidof java)" ; sleep 600; done
- check cf events and see a sequence of
audit.app.ssh-authorized [email protected] index: 0
audit.app.ssh-authorized [email protected] index: 0
audit.app.ssh-authorized [email protected] index: 0
audit.app.ssh-authorized [email protected] index: 0
audit.app.ssh-authorized [email protected] index: 0
instead of
app.crash spring-music index: 0, reason: CRASHED, cell_id: 4581257e-2
audit.app.ssh-authorized [email protected] index: 0
app.crash spring-music index: 0, reason: CRASHED, cell_id: 27ee7e73-b
audit.app.ssh-authorized [email protected] index: 0
app.crash spring-music index: 0, reason: CRASHED, cell_id: 4581257e-2
Diego repo
Environment Details
- diego-release version or other BOSH releases you have deployed - Diego v2.66.2
Possible Causes or Fixes (optional)
The reason seems to be that after the initial crash, BBS records in the DB that the "crash_count" for this LRP is 1 (or more in the case of frequent subsequent crashes, but single sporadic crashes result in crash_count = 1).
Then on subsequent crashes this code in actual_lrp_db
var newCrashCount int32
if latestChangeTime > models.CrashResetTimeout && actualLRP.State == models.ActualLRPStateRunning {
newCrashCount = 1
} else {
newCrashCount = actualLRP.CrashCount + 1
}
will set newCrashCount = 1 again, and later, in actual_lrp_event_calculator.go/generateUnclaimedInstanceEvents, the check
if after.CrashCount > before.CrashCount {
events = append(events, models.NewActualLRPCrashedEvent(before, after))
}
will not append the NewActualLRPCrashedEvent because both CrashCounts are equal.
Not quite sure how to fix it properly, but this change in actual_lrp_lifecycle_controller/CrashActualLRP
lrps[0].CrashCount = after.CrashCount - 1;
definitely works around the issue.
Since we are already in CrashActualLRP, we only need to ensure that the crash event is sent. Setting the CrashCount of the before LRP to after.CrashCount - 1 seems to be enough to pass the check.
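A small sketch of why that adjustment is sufficient, again using hypothetical stand-ins (`lrp`, `crashEventNeeded`, `forceCrashDelta`) rather than the real BBS types: once the before count is forced to one less than the after count, the strict comparison always holds.

```go
package main

import "fmt"

// lrp is a hypothetical stand-in carrying only the crash count.
type lrp struct{ CrashCount int32 }

// crashEventNeeded mirrors the comparison in generateUnclaimedInstanceEvents.
func crashEventNeeded(before, after lrp) bool {
	return after.CrashCount > before.CrashCount
}

// forceCrashDelta applies the proposed workaround to a copy of before.
func forceCrashDelta(before, after lrp) lrp {
	before.CrashCount = after.CrashCount - 1
	return before
}

func main() {
	// State after a sporadic second crash: both counts are 1.
	before := lrp{CrashCount: 1}
	after := lrp{CrashCount: 1}

	fmt.Println("without workaround:", crashEventNeeded(before, after))
	fmt.Println("with workaround:", crashEventNeeded(forceCrashDelta(before, after), after))
}
```

The sketch only demonstrates the comparison; whether overwriting the before count has side effects elsewhere in the controller is not covered here.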