Skip to content

Conversation

@chu11
Copy link
Member

@chu11 chu11 commented Mar 30, 2023

Problem: The way that job-list stats counts failures is that all non-successful jobs are failures. This can be a bit confusing b/c job results return jobs as "failed", "canceled", or "timeout". So the total job failures via "job results" is different than the job failures via job-list stats.

Problem: Make the job-list stat counts consistent to job results. Update tests and users of job-list stats in flux-top, flux-jobs, and t2260-job-list.t.

This PR is built on top of #5031

@chu11 chu11 force-pushed the issue5029_job_list_stats_consistency branch 2 times, most recently from 4a40e97 to c04ee4f Compare April 1, 2023 18:40
@garlick
Copy link
Member

garlick commented Apr 24, 2023

A downside to this proposal is that flux top and flux jobs --stats-only "failed" counts will differ, and the latter won't represent TIMEOUT or CANCELED jobs in any of the summary statistics. That seems a bit confusing too?

Just wondering if maybe it would be better to just call out in the flux-jobs(1) man page that the "failed" summary statistic includes jobs that ended with FAILED, TIMEOUT, or CANCELED job results?

@chu11
Copy link
Member Author

chu11 commented Apr 24, 2023

I think that I updated the code for flux top and flux jobs --stats-only to be consistent to each other. Did you mean flux job stats? The latter is the raw stats coming straight from the job-list stats RPC, but flux jobs --stats-only collates it into output it wants to display.

IMO I think I'd like for the RPC of job-list stats and job-list "listing" to be consistent to each other. How tools choose to display that data is sort of their own choosing? An alternative might be for flux-jobs --stats-only to display timeout and canceled jobs separately, but I don't think we want to change output format at this point? Edit: another idea, instead of outputting "failed" we could output "unsuccessful" or some alternate word?

Your note on clarification in the manpage is worthwhile and we can add that too. As well as add a fix for #5111 while we're at it.

@chu11 chu11 force-pushed the issue5029_job_list_stats_consistency branch from c04ee4f to 5014274 Compare April 24, 2023 15:06
@chu11
Copy link
Member Author

chu11 commented Apr 24, 2023

re-pushed, adding a flux-jobs(1) clarification per comment above, and stuck a fix for #5111 on top as well

Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was on the fence on this one.

On one hand, it makes very good sense to match only the jobs reported as FAILED in the failed count in these stats. On the other hand, all the use cases seem to combine canceled+timeout+failed in the reported failed count, which means as far as stats go, combining all unsuccessful jobs into a failed count is the expected result.

However, since this change affects only the underlying payload of the stats response, and since it easier for end users to calculate a total failed count by adding the 3 different failed statuses than to get the FAILED jobs by subtracting, I think I am ok with this approach.

Made some suggestions inline though. We should make sure @garlick agrees before merging.

self.active = self.total - self.inactive

# Special case, this class wants total failed jobs, not the
# division that is normally spliced up.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what the "division that is normally spliced up" means here. Maybe something clearer like

This class reports the total number of unsuccessful jobs in the 'failed' attribute,
not just the count of jobs that ran to completion with nonzero exit code

(If I've understood the intent of the comment)


if (sum->show_details) {
int failed = sum->stats.failed;
int failed = sum->stats.failed + sum->stats.timeout + sum->stats.canceled;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in the JobStats class, maybe a comment stating here that flux top reports the total count of unsuccessful jobs in the failed stats? A minor suggestion since I guess it is pretty obvious.

running jobs, updated every 2 seconds.

Note that all job failures, including canceled and timedout jobs
are collectively listed as "failed" in ``--stats``.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe s/listed/counted/?

``--stats-only`` is used.

Note that all job failures, including canceled and timeout jobs
are collectively listed as "failed" in ``--stats-only``.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe s/listed/counted/?


After a job has finished and is in the INACTIVE state, it can be
marked with one of three possible results: COMPLETED, FAILED,
marked with one of four possible results: COMPLETED, FAILED,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion to prevent future doc bugs:

Suggested change
marked with one of four possible results: COMPLETED, FAILED,
marked with one of the possible results: COMPLETED, FAILED,

@chu11 chu11 force-pushed the issue5029_job_list_stats_consistency branch from 5014274 to 2277f92 Compare April 24, 2023 15:16
@garlick
Copy link
Member

garlick commented Apr 24, 2023

I just realized i was not testing the entire proposed change when I commented earlier. 🤦 If the tools are still reporting counts as before then I have no issue!

@chu11 chu11 force-pushed the issue5029_job_list_stats_consistency branch from 2277f92 to 3b71ea4 Compare April 24, 2023 16:29
@chu11
Copy link
Member Author

chu11 commented Apr 24, 2023

Re-pushed, fixing up per the comments above. If I don't hear from anyone, I'll set MWP. thanks.

@chu11 chu11 force-pushed the issue5029_job_list_stats_consistency branch from 3b71ea4 to eda8525 Compare April 24, 2023 18:18
@chu11
Copy link
Member Author

chu11 commented Apr 24, 2023

removing MWP temporarily, want #5112 to be merged first to fix mergify

chu11 added 3 commits April 24, 2023 15:07
Problem: The way that job-list stats counts failures is that all non-successful
jobs are failures.  This can be a bit confusing b/c job results return jobs as
"failed", "canceled", or "timeout".  So the total job failures via "job results"
is different than the job failures via job-list stats.

Problem: Make the job-list stat counts consistent to job results.  Update tests
and users of job-list stats in flux-top, flux-jobs, and t2260-job-list.t.

Fixes flux-framework#5029
Problem: The flux-jobs(1) --stats and --stats-only output count
all job failures ("failed", "timeout", "canceled") as a single
statistic and output that as "failed".  But this may not be obvious
because job results are listed as completed, failed, timeout, and
canceled.

Add a clarification under the --stats and --stats-only descriptions.
Problem: flux-jobs(1) says there are three possible results, but
if we count COMPLETED, FAILED, CANCELED, and TIMEOUT, that's four!

Update the text to not depend on the number of results.

Fixes flux-framework#5111
@chu11 chu11 force-pushed the issue5029_job_list_stats_consistency branch from eda8525 to a77fcb3 Compare April 24, 2023 22:07
@mergify mergify bot merged commit 665d6ee into flux-framework:master Apr 24, 2023
@codecov
Copy link

codecov bot commented Apr 24, 2023

Codecov Report

Merging #5048 (a77fcb3) into master (920edbb) will decrease coverage by 22.98%.
The diff coverage is 66.66%.

@@             Coverage Diff             @@
##           master    #5048       +/-   ##
===========================================
- Coverage   83.12%   60.14%   -22.98%     
===========================================
  Files         453      436       -17     
  Lines       77655    72992     -4663     
===========================================
- Hits        64549    43900    -20649     
- Misses      13106    29092    +15986     
Impacted Files Coverage Δ
src/modules/job-list/stats.c 26.58% <50.00%> (-47.64%) ⬇️
src/bindings/python/flux/job/stats.py 97.36% <100.00%> (+0.14%) ⬆️

... and 307 files with indirect coverage changes

@chu11 chu11 deleted the issue5029_job_list_stats_consistency branch April 24, 2023 23:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants