job-list: make job stats consistent to job results #5048
4a40e97 to
c04ee4f
Compare
|
A downside to this proposal is that Just wondering if maybe it would be better to just call out in the |
|
I think that I updated the code for IMO I think I'd like for the RPC of job-list stats and job-list "listing" to be consistent to each other. How tools choose to display that data is sort of their own choosing? An alternative might be for Your note on clarification in the manpage is worthwhile and we can add that too. As well as add a fix for #5111 while we're at it. |
c04ee4f to
5014274
Compare
|
re-pushed, adding a |
I was on the fence on this one.
On one hand, it makes very good sense to match only the jobs reported as FAILED in the failed count in these stats. On the other hand, all the use cases seem to combine canceled+timeout+failed in the reported failed count, which means as far as stats go, combining all unsuccessful jobs into a failed count is the expected result.
However, since this change affects only the underlying payload of the stats response, and since it is easier for end users to calculate a total failed count by adding the 3 different failed statuses than to get the FAILED jobs by subtracting, I think I am ok with this approach.
Made some suggestions inline though. We should make sure @garlick agrees before merging.
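To illustrate the adding-versus-subtracting point above, here is a minimal sketch. It assumes the stats payload exposes separate failed, timeout, and canceled counts. The key names are borrowed from the `sum->stats` fields in the flux-top diff and are an assumption, not the confirmed RPC schema; the helper names are hypothetical.

```python
# Hypothetical stats payload; key names mirror the sum->stats fields in the
# flux-top change and are assumptions, not the confirmed RPC schema.
example_stats = {"failed": 2, "timeout": 1, "canceled": 3}


def total_unsuccessful(stats):
    """Combine the three unsuccessful results into one 'failed' style count."""
    return stats["failed"] + stats["timeout"] + stats["canceled"]


def failed_only_from_combined(combined_failed, stats):
    """The reverse direction, needed if the payload only had a combined count."""
    return combined_failed - stats["timeout"] - stats["canceled"]


combined = total_unsuccessful(example_stats)
print(combined)
print(failed_only_from_combined(combined, example_stats))
```

With separate counts in the payload, a tool that wants the combined number only needs the first helper; the second is shown for contrast.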
| self.active = self.total - self.inactive | ||
|
|
||
| # Special case, this class wants total failed jobs, not the | ||
| # division that is normally spliced up. |
Not sure what the "division that is normally spliced up" means here. Maybe something clearer like
This class reports the total number of unsuccessful jobs in the 'failed' attribute,
not just the count of jobs that ran to completion with nonzero exit code
(If I've understood the intent of the comment)
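For context, a rough sketch of how the suggested comment might sit in a stats wrapper. `JobStatsSketch` and the payload keys are hypothetical stand-ins for illustration, not the actual flux class or RPC response; only the `active` computation is taken from the diff above.

```python
class JobStatsSketch:
    """Illustrative stand-in for a stats wrapper; not the real flux class."""

    def __init__(self, payload):
        # 'payload' is a hypothetical dict; its keys are assumed for this sketch.
        self.total = payload["total"]
        self.inactive = payload["inactive"]
        self.active = self.total - self.inactive
        # This class reports the total number of unsuccessful jobs in the
        # 'failed' attribute, not just the count of jobs that ran to
        # completion with nonzero exit code.
        self.failed = payload["failed"] + payload["timeout"] + payload["canceled"]


stats = JobStatsSketch(
    {"total": 10, "inactive": 8, "failed": 2, "timeout": 1, "canceled": 3}
)
print(stats.active, stats.failed)
```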
|
|
||
| if (sum->show_details) { | ||
| - int failed = sum->stats.failed; | ||
| + int failed = sum->stats.failed + sum->stats.timeout + sum->stats.canceled; |
As in the JobStats class, maybe a comment stating here that flux top reports the total count of unsuccessful jobs in the failed stats? A minor suggestion since I guess it is pretty obvious.
doc/man1/flux-jobs.rst
Outdated
| running jobs, updated every 2 seconds. | ||
|
|
||
| Note that all job failures, including canceled and timedout jobs | ||
| are collectively listed as "failed" in ``--stats``. |
Maybe s/listed/counted/?
doc/man1/flux-jobs.rst
Outdated
| ``--stats-only`` is used. | ||
|
|
||
| Note that all job failures, including canceled and timeout jobs | ||
| are collectively listed as "failed" in ``--stats-only``. |
Maybe s/listed/counted/?
doc/man1/flux-jobs.rst
Outdated
|
|
||
| After a job has finished and is in the INACTIVE state, it can be | ||
| - marked with one of three possible results: COMPLETED, FAILED, | ||
| + marked with one of four possible results: COMPLETED, FAILED, |
Suggestion to prevent future doc bugs:
| - marked with one of four possible results: COMPLETED, FAILED, | |
| + marked with one of the possible results: COMPLETED, FAILED, |
5014274 to
2277f92
Compare
|
I just realized I was not testing the entire proposed change when I commented earlier. 🤦 If the tools are still reporting counts as before then I have no issue! |
2277f92 to
3b71ea4
Compare
|
Re-pushed, fixing up per the comments above. If I don't hear from anyone, I'll set MWP. thanks. |
3b71ea4 to
eda8525
Compare
|
removing MWP temporarily, want #5112 to be merged first to fix mergify |
Problem: job-list stats counts all non-successful jobs as failures.
This can be a bit confusing because job results return jobs as
"failed", "canceled", or "timeout". So the total job failures via
"job results" is different than the job failures via job-list stats.
Make the job-list stat counts consistent with job results. Update
tests and users of job-list stats in flux-top, flux-jobs, and
t2260-job-list.t.
Fixes flux-framework#5029
Problem: The flux-jobs(1) --stats and --stats-only output count
all job failures ("failed", "timeout", "canceled") as a single
statistic and output that as "failed". But this may not be obvious
because job results are listed as completed, failed, timeout, and
canceled.
Add a clarification under the --stats and --stats-only descriptions.
Problem: flux-jobs(1) says there are three possible results, but
if we count COMPLETED, FAILED, CANCELED, and TIMEOUT, that's four!
Update the text to not depend on the number of results.
Fixes flux-framework#5111
eda8525 to
a77fcb3
Compare
Codecov Report
@@ Coverage Diff @@
## master #5048 +/- ##
===========================================
- Coverage 83.12% 60.14% -22.98%
===========================================
Files 453 436 -17
Lines 77655 72992 -4663
===========================================
- Hits 64549 43900 -20649
- Misses 13106 29092 +15986
|
Problem: job-list stats counts all non-successful jobs as failures. This can be a bit confusing because job results return jobs as "failed", "canceled", or "timeout". So the total job failures via "job results" is different than the job failures via job-list stats.
Make the job-list stat counts consistent with job results. Update tests and users of job-list stats in flux-top, flux-jobs, and t2260-job-list.t.
This PR is built on top of #5031