Add failed task metric to help with cell health monitoring #9

andrew-edgar · 2016-11-18T12:40:46Z

Added an additional metric to the rep that increments a counter on each failed Task.

This will help with monitoring the status of cells and identifying which ones may have issues with staging failures.

We will also be submitting a documentation change in diego-release/docs/metrics.md to correspond to this change.

Thanks!

Signed-off-by: Vadim Raskin <[email protected]>

cfdreddbot · 2016-11-18T12:40:48Z

Hey andrew-edgar!

Thanks for submitting this pull request! I'm here to inform the recipients of the pull request that you and the commit authors have already signed the CLA.

cf-gitbot · 2016-11-18T12:40:49Z

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/134582045

The labels on this github issue will be updated when the story is started.

Emit metrics for started and succeeded tasks

andrew-edgar · 2016-11-28T09:09:41Z

Updated our change to add start and success metrics as well, so that we can have cell-based metrics on the percentage of successful tasks (staging).

emalm · 2016-12-06T09:04:07Z

Thanks, @andrew-edgar. These metrics seem fine to add. I was considering whether they would be better to add in the BBS controller logic with a cell-id tag, since that's a more authoritative and central location for these metrics, but I can see the value and simplicity in having them come from each cell individually. I'm also wary of having the metric emissions interleaved with the BBS/executor interactions in the task processor, as there are now three different collaborators whose calls must be ordered correctly, but I will let the pair evaluating the PR comment.

Best,
Eric

jenspinney

See the inline comments for details.

jenspinney · 2016-12-12T19:09:13Z

generator/internal/task_processor.go

+	} else if container.RunResult.Failed {
+		// When we have done BBS.CompleteTask on a failed Task increment the counter
+		// It will be incremented on every call to FailTask in the p.failTask() method
+		TasksFailed.Increment()


As a cleanup note, the 'else' part here is unnecessary. An 'if' by itself will suffice.

But as a more substantial question, we're not sure why CompleteTask needs to be called before checking the container.RunResult.Failed. Is it possible for the CompleteTask method itself to set RunResult.Failed from True to False? If that is possible, then we would get a TaskSucceeded and a TaskFailed metric for the same task, which seems strange.

On the other hand, if it's not possible for the RunResult.Failed to be changed by the CompleteTask call to the BBS, then this check on line 148 seems to belong on line 136 instead, to keep the code more readable.

Ah, we now understand the full logic. Even though we read the comment, it wasn't immediately clear that this extra TasksFailed.Increment() is needed in this particular spot to avoid duplicating the TasksFailed.Increment() in p.failTask().

For what is essentially pretty simple logic, it took us a lot longer than it should have to understand exactly what is going on. I think the basic problem with the way this is implemented is that it breaks the principle of having a consistent level of abstraction per function depth. If the metric is being incremented at function depth X, I don't also expect it to be incremented at function depth X+1, and that's where our surprise came from.
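
To make the two increment sites concrete, here is a minimal sketch of the control flow as we read it. This is not the PR's code: `processor`, `counter`, and the `bbsCallFails` flag are hypothetical stand-ins, and the sketch assumes that the branch preceding the `else if` is the one that ends up calling `failTask()`.

```go
package main

// counter is a hypothetical stand-in for the metric counter type.
type counter struct{ n int }

func (c *counter) Increment() { c.n++ }

// processor is a simplified, hypothetical version of the task processor.
type processor struct {
	tasksFailed counter
	// bbsCallFails simulates the BBS CompleteTask call returning an error.
	bbsCallFails bool
}

// failTask sits one level below completeTask (depth X+1) and
// increments the counter on every call.
func (p *processor) failTask() {
	p.tasksFailed.Increment()
}

// completeTask (depth X) increments the counter directly only when it
// did not already go through failTask, which is why the else matters.
func (p *processor) completeTask(runFailed bool) {
	if p.bbsCallFails {
		// Assumed error path: the task is failed via failTask,
		// which already counts it.
		p.failTask()
	} else if runFailed {
		// The BBS call succeeded but the run itself failed:
		// count it here, since failTask was not called.
		p.tasksFailed.Increment()
	}
}
```

Under these assumptions, turning the `else if` into a plain `if` would count a task twice whenever the BBS call fails and the run result is also marked failed.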

While we understand the merits of implementing this on the rep, we think this could be much simpler if it were implemented in the BBS with a cell-id tag, like @ematpl mentioned. Then the logic shows up in just a couple of places, and the level of abstraction in the code is consistent. So, basically:

In the Task start handler (or whatever the equivalent is), increment the TasksStarted counter.

In the CompleteTask handler, check the RunResult.Failed property. If true, increment TasksFailed; if false, increment TasksSucceeded.

In the FailTask handler, increment the TasksFailed counter.

One detail to watch out for: Not sure off the top of my head whether a failed task also goes through the CompleteTask handler, so you'll want to double check this.
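
The three steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: `taskMetrics` and its methods are hypothetical names, not BBS APIs, the counters are kept in memory per cell ID rather than emitted through a real metrics client, and the sketch does not settle the open question of whether a failed task also passes through the CompleteTask handler.

```go
package main

import "sync"

// taskMetrics is a hypothetical per-cell counter store. A real
// implementation would emit metrics tagged with the cell ID instead.
type taskMetrics struct {
	mu        sync.Mutex
	started   map[string]int
	succeeded map[string]int
	failed    map[string]int
}

func newTaskMetrics() *taskMetrics {
	return &taskMetrics{
		started:   map[string]int{},
		succeeded: map[string]int{},
		failed:    map[string]int{},
	}
}

// StartTask would be called from the task start handler.
func (m *taskMetrics) StartTask(cellID string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.started[cellID]++
}

// CompleteTask would be called from the CompleteTask handler and
// branches on the reported run result.
func (m *taskMetrics) CompleteTask(cellID string, runFailed bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if runFailed {
		m.failed[cellID]++
		return
	}
	m.succeeded[cellID]++
}

// FailTask would be called from the FailTask handler.
func (m *taskMetrics) FailTask(cellID string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.failed[cellID]++
}

// SuccessPercent reports the share of finished tasks that succeeded
// on a cell. The second return value is false if nothing has finished.
func (m *taskMetrics) SuccessPercent(cellID string) (float64, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	finished := m.succeeded[cellID] + m.failed[cellID]
	if finished == 0 {
		return 0, false
	}
	return 100 * float64(m.succeeded[cellID]) / float64(finished), true
}
```

Each counter is touched in exactly one handler, so every increment sits at the same level of abstraction. If a failed task turns out to reach both the FailTask and CompleteTask handlers, this sketch would count it twice, so that path needs to be checked first.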

~ @jenspinney && @bdshroyer

bdshroyer · 2017-01-05T16:27:26Z

@andrew-edgar Any updates on this?

andrew-edgar · 2017-01-12T11:10:16Z

So would you prefer that I close this and submit a new PR on the BBS to implement the changes there?

jfmyers9 · 2017-01-12T20:33:05Z

@andrew-edgar That would be best. Thanks.

Add failed task metric to help with cell health monitoring

6ac71b1

Signed-off-by: Vadim Raskin <[email protected]>

cf-gitbot added the unscheduled label Nov 18, 2016

andrew-edgar mentioned this pull request Nov 18, 2016

Update metrics docs related to PR cloudfoundry/diego-release#226

Closed

vvraskin and others added 2 commits November 28, 2016 09:55

Emit metrics for started and succeeded tasks

722db0a

Merge pull request #2 from vvraskin/master

5849c78

Emit metrics for started and succeeded tasks

cf-gitbot added in progress scheduled unscheduled and removed unscheduled in progress scheduled labels Dec 6, 2016

jenspinney suggested changes Dec 12, 2016

cf-gitbot added scheduled unscheduled and removed in progress scheduled labels Dec 12, 2016

jfmyers9 closed this Jan 12, 2017

cf-gitbot added delivered in progress and removed unscheduled delivered labels Jan 12, 2017

cf-gitbot added accepted and removed in progress labels Jan 12, 2017
