Skip to content

Conversation

@andrew-edgar
Copy link
Contributor

Added an additional metric to the rep to increment a counter on each failed Task

This will help monitoring the status of cells and which may have issues with staging failures.

We will be submitting a documentation change in diego-release/docs/metrics.md to correspond to this change as well

Thanks!

Signed-off-by: Vadim Raskin [email protected]

@cfdreddbot
Copy link

Hey andrew-edgar!

Thanks for submitting this pull request! I'm here to inform the recipients of the pull request that you and the commit authors have already signed the CLA.

@cf-gitbot
Copy link

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/134582045

The labels on this github issue will be updated when the story is started.

@andrew-edgar
Copy link
Contributor Author

Updated our change to add a start and success metrics as well to be able to have cell based metrics on percent successful tasks (staging)

@emalm
Copy link
Contributor

emalm commented Dec 6, 2016

Thanks, @andrew-edgar. These metrics seem fine to add. I was considering whether they would be better to add in the BBS controller logic with a cell-id tag, since that's a more authoritative and central location for these metrics, but I can see the value and simplicity in having them come from each cell individually. I'm also wary of having the metric emissions interleaved with the BBS/executor interactions in the task processor, as there are now 3 different collaborators to ensure are ordered correctly, but I will let the pair evaluating the PR comment.

Best,
Eric

Copy link
Contributor

@jenspinney jenspinney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the inline comments for details.

} else if container.RunResult.Failed {
// When we have done BBS.CompleteTask on a failed Task increment the counter
// It will be incremented on every call to FailTask in the p.failTask() method
TasksFailed.Increment()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a cleanup note, the 'else' part here is unnecessary. An 'if' by itself will suffice.

But as a more substantial question, we're not sure why CompleteTask needs to be called before checking the container.RunResult.Failed. Is it possible for the CompleteTask method itself to set RunResult.Failed from True to False? If that is possible, then we would get a TaskSucceeded and a TaskFailed metric for the same task, which seems strange.

On the other hand, if it's not possible for the RunResult.Failed to be changed by the CompleteTask call to the BBS, then this check on line 148 seems to belong on line 136 instead, to keep the code more readable.

Copy link
Contributor

@jenspinney jenspinney Dec 12, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, we understand now the full logic. Even though we read the comment, it wasn't immediately clear that the reason for doing this extra TasksFailed.Increment() was needed in this particular spot in order to avoid duplication with the TasksFailed.Increment() in p.failTask()

For what is essentially pretty simple logic, it took us a lot longer to understand exactly what what is going on than it should have, and I think the basic problem with the way this is implemented is that it breaks the priniciple of having a consistent level of abstraction per function depth. If the metric is being incremented at function depth X, I don't also expect it to be incremented at function depth X+1, and that's where our surprise came from.

While we understand the merits of implementing this on the rep, we think this could be much simpler if it were implemented in the BBS with a cell-id tag, like @ematpl mentioned. Then the logic shows up in just a couple places, and the level of abstraction in the code is consistent. So, basically:

  • In the Task start handler (or whatever the equivalent is), increment the TasksStarted counter
  • In the CompleteTask handler, check the RunResult.Failed property. If true, increment TasksFailed, if false, increment TasksSucceeded.
  • In the FailTask handler, increment the TasksFailed counter.

One detail to watch out for: Not sure off the top of my head whether a failed task also goes through the CompleteTask handler, so you'll want to double check this.

~ @jenspinney && @bdshroyer

@bdshroyer
Copy link

@andrew-edgar Any updates on this?

@andrew-edgar
Copy link
Contributor Author

So would you prefer that I close this and submit a new PR on the BBS to implement the changes there?

@jfmyers9
Copy link
Contributor

@andrew-edgar That would be best. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants