BBS doesn't report new "crash" events for LRPs crashed > 5 mins ago #643

@vlast3k

Summary

If an LRP has already crashed once and then crashes again more than 5 minutes later, no crash event is reported for the later crash.

Steps to Reproduce

  • clone and push spring-music app
  • while true; do cf ssh spring-music -c "kill -9 \$(pidof java)" ; sleep 600; done
  • check cf events and see a sequence of
audit.app.ssh-authorized   [email protected]   index: 0
audit.app.ssh-authorized   [email protected]   index: 0
audit.app.ssh-authorized   [email protected]   index: 0
audit.app.ssh-authorized   [email protected]   index: 0
audit.app.ssh-authorized   [email protected]   index: 0

instead of

app.crash                  spring-music     index: 0, reason: CRASHED, cell_id: 4581257e-2
audit.app.ssh-authorized   [email protected]   index: 0
app.crash                  spring-music     index: 0, reason: CRASHED, cell_id: 27ee7e73-b
audit.app.ssh-authorized   [email protected]   index: 0
app.crash                  spring-music     index: 0, reason: CRASHED, cell_id: 4581257e-2

Diego repo

Environment Details

  • diego-release version or other BOSH releases you have deployed - Diego v2.66.2

Possible Causes or Fixes (optional)

The reason seems to be that, after the initial crash, BBS records in the DB that the "crash_count" for this LRP is 1 (or more, in the case of frequent subsequent crashes). Single sporadic crashes, however, leave crash_count = 1.

Then on subsequent crashes this code in actual_lrp_db

		var newCrashCount int32
		if latestChangeTime > models.CrashResetTimeout && actualLRP.State == models.ActualLRPStateRunning {
			newCrashCount = 1
		} else {
			newCrashCount = actualLRP.CrashCount + 1
		}

will set newCrashCount = 1 again, because more than CrashResetTimeout has passed and the LRP was running. Later, in actual_lrp_event_calculator.go/generateUnclaimedInstanceEvents, the check

	if after.CrashCount > before.CrashCount {
		events = append(events, models.NewActualLRPCrashedEvent(before, after))
	}

will not append the NewActualLRPCrashedEvent because both CrashCounts are equal.

I am not quite sure how to fix this properly, but this change in actual_lrp_lifecycle_controller/CrashActualLRP

lrps[0].CrashCount = after.CrashCount - 1

definitely mitigates the issue.
Since we are already in CrashActualLRP, we only need to ensure that the crash event is sent. Setting the CrashCount of the "before" LRP to after.CrashCount - 1 seems to be enough to pass the check.
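
The effect of the workaround can be sketched in isolation as follows. This is only an illustration under assumptions: `lrp`, `emitsCrashEvent`, and `forceCrashEvent` are hypothetical names that do not exist in Diego, and the real controller operates on `models.ActualLRP` values rather than this reduced struct.

```go
package main

import "fmt"

// lrp is a hypothetical, stripped-down stand-in for models.ActualLRP.
type lrp struct {
	CrashCount int32
}

// emitsCrashEvent mirrors the comparison quoted from
// generateUnclaimedInstanceEvents.
func emitsCrashEvent(before, after lrp) bool {
	return after.CrashCount > before.CrashCount
}

// forceCrashEvent applies the proposed workaround to a copy of the
// "before" snapshot: its CrashCount becomes after.CrashCount - 1, so the
// comparison above always succeeds.
func forceCrashEvent(before, after lrp) lrp {
	before.CrashCount = after.CrashCount - 1
	return before
}

func main() {
	// State after a crash whose counter was reset to 1.
	before := lrp{CrashCount: 1}
	after := lrp{CrashCount: 1}

	fmt.Println(emitsCrashEvent(before, after))                         // false
	fmt.Println(emitsCrashEvent(forceCrashEvent(before, after), after)) // true
}
```

One consequence worth noting: because the crashed event is built from both `before` and `after`, the "before" LRP in the emitted event would carry the adjusted count rather than the one stored in the DB.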

Additional Text Output, Screenshots, contextual information (optional)
