-
Notifications
You must be signed in to change notification settings - Fork 41
Add failed task metric to help with cell health monitoring #9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Vadim Raskin <[email protected]>
|
Hey andrew-edgar! Thanks for submitting this pull request! I'm here to inform the recipients of the pull request that you and the commit authors have already signed the CLA. |
|
We have created an issue in Pivotal Tracker to manage this: https://www.pivotaltracker.com/story/show/134582045 The labels on this github issue will be updated when the story is started. |
Emit metrics for started and succeeded tasks
|
Updated our change to add a start and success metrics as well to be able to have cell based metrics on percent successful tasks (staging) |
|
Thanks, @andrew-edgar. These metrics seem fine to add. I was considering whether they would be better to add in the BBS controller logic with a cell-id tag, since that's a more authoritative and central location for these metrics, but I can see the value and simplicity in having them come from each cell individually. I'm also wary of having the metric emissions interleaved with the BBS/executor interactions in the task processor, as there are now 3 different collaborators to ensure are ordered correctly, but I will let the pair evaluating the PR comment. Best, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the inline comments for details.
| } else if container.RunResult.Failed { | ||
| // When we have done BBS.CompleteTask on a failed Task increment the counter | ||
| // It will be incremented on every call to FailTask in the p.failTask() method | ||
| TasksFailed.Increment() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a cleanup note, the 'else' part here is unnecessary. An 'if' by itself will suffice.
But as a more substantial question, we're not sure why CompleteTask needs to be called before checking the container.RunResult.Failed. Is it possible for the CompleteTask method itself to set RunResult.Failed from True to False? If that is possible, then we would get a TaskSucceeded and a TaskFailed metric for the same task, which seems strange.
On the other hand, if it's not possible for the RunResult.Failed to be changed by the CompleteTask call to the BBS, then this check on line 148 seems to belong on line 136 instead, to keep the code more readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, we understand now the full logic. Even though we read the comment, it wasn't immediately clear that the reason for doing this extra TasksFailed.Increment() was needed in this particular spot in order to avoid duplication with the TasksFailed.Increment() in p.failTask()
For what is essentially pretty simple logic, it took us a lot longer to understand exactly what what is going on than it should have, and I think the basic problem with the way this is implemented is that it breaks the priniciple of having a consistent level of abstraction per function depth. If the metric is being incremented at function depth X, I don't also expect it to be incremented at function depth X+1, and that's where our surprise came from.
While we understand the merits of implementing this on the rep, we think this could be much simpler if it were implemented in the BBS with a cell-id tag, like @ematpl mentioned. Then the logic shows up in just a couple places, and the level of abstraction in the code is consistent. So, basically:
- In the Task start handler (or whatever the equivalent is), increment the TasksStarted counter
- In the CompleteTask handler, check the RunResult.Failed property. If true, increment TasksFailed, if false, increment TasksSucceeded.
- In the FailTask handler, increment the TasksFailed counter.
One detail to watch out for: Not sure off the top of my head whether a failed task also goes through the CompleteTask handler, so you'll want to double check this.
~ @jenspinney && @bdshroyer
|
@andrew-edgar Any updates on this? |
|
So would you prefer that I close this and submit a new PR on the BBS to implement the changes there? |
|
@andrew-edgar That would be best. Thanks. |
Added an additional metric to the rep to increment a counter on each failed Task
This will help monitoring the status of cells and which may have issues with staging failures.
We will be submitting a documentation change in diego-release/docs/metrics.md to correspond to this change as well
Thanks!
Signed-off-by: Vadim Raskin [email protected]