
Fix crash in spans when time() is not monotonic #7960

Merged: 2 commits merged into dask:main from the spans_neg_delta branch on Jul 5, 2023

Conversation

@crusaderky (Collaborator) commented on Jul 4, 2023

Fix crash:

  File "/opt/coiled/env/lib/python3.11/site-packages/distributed/spans.py", line 328, in cumulative_worker_metrics
    unknown_seconds = max(0.0, self.active_cpu_seconds - known_seconds)
                               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/coiled/env/lib/python3.11/site-packages/distributed/spans.py", line 439, in active_cpu_seconds
    return sum((t1 - t0) * nthreads for t0, t1, nthreads in self.nthreads_intervals)
                                                            ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/coiled/env/lib/python3.11/site-packages/distributed/spans.py", line 410, in nthreads_intervals
    is_active_t, is_active_flag = zip(*self._active_timeseries())
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/coiled/env/lib/python3.11/site-packages/distributed/spans.py", line 382, in _active_timeseries
    assert delta > 0
           ^^^^^^^^^
AssertionError

This was triggered when two subsequent calls to time() were not strictly monotonically increasing.
No unit tests, as I'm not too eager to monkey-patch time() just to reproduce this.
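
For illustration, a minimal sketch of the failure mode and a tolerant alternative. This is toy code, not the actual patch to distributed/spans.py: summing deltas between consecutive wall-clock timestamps while asserting that each delta is strictly positive breaks as soon as time() returns the same value twice or steps backwards (e.g. after an NTP adjustment), whereas clamping non-positive deltas to zero does not.

    import time

    def cumulative_elapsed(timestamps: list[float]) -> float:
        # Toy illustration only, not the code from distributed/spans.py:
        # sum the deltas between consecutive wall-clock timestamps, tolerating
        # pairs where time() did not move forward.
        total = 0.0
        for t0, t1 in zip(timestamps, timestamps[1:]):
            delta = t1 - t0
            # Wall clocks may be stepped backwards (NTP, manual adjustment) or
            # return identical values on coarse-resolution platforms, so delta
            # can be zero or negative; clamp instead of asserting delta > 0.
            total += max(0.0, delta)
        return total

    print(cumulative_elapsed([time.time() for _ in range(5)]))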

@crusaderky requested a review from fjetter as a code owner on July 4, 2023 12:20
@crusaderky self-assigned this on Jul 4, 2023
@ntabris (Contributor) commented on Jul 4, 2023

Mostly just curious, but if the problem is that time() isn't monotonic, why isn't the solution to use a monotonic clock?

@crusaderky (Collaborator, Author) replied:

> Mostly just curious, but if the problem is that time() isn't monotonic, why isn't the solution to use a monotonic clock?

Because time.monotonic() is not a wall clock; it's just a count of seconds from an arbitrary reference point (roughly, process or system start).
On Windows we already tamper with time.time() due to its low resolution. In theory, we could do the same on Linux to create a monotonic wall clock; in practice, it would be ugly as hell (see distributed.metrics._WindowsTime), slow, and very bug-prone.
So I feel it's much better to just work within the constraints of the standard UNIX syscalls.
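
To spell out the distinction (plain standard-library behaviour, not Distributed code): time.time() is a wall clock anchored to the Unix epoch, so its values are meaningful timestamps but may jump backwards when the system clock is adjusted, while time.monotonic() never decreases but counts from an unspecified reference point, so it cannot be used as a timestamp.

    import time

    wall = time.time()       # seconds since the Unix epoch: a real timestamp,
                             # but may jump backwards if the system clock is adjusted
    mono = time.monotonic()  # never decreases, but measured from an unspecified
                             # reference point, so the absolute value is meaningless

    print(f"wall clock:      {wall:.6f}")
    print(f"monotonic clock: {mono:.6f}")

    # time.monotonic() is only suitable for measuring intervals within one process:
    start = time.monotonic()
    time.sleep(0.1)
    print(f"elapsed: {time.monotonic() - start:.3f}s")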

@github-actions bot (Contributor) commented on Jul 4, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

20 files ±0   20 suites ±0   11h 29m 50s ⏱️ +38m 22s
3 704 tests ±0:   3 596 passed +4,   106 skipped ±0,   2 failed -4
35 826 runs +1 019:   34 076 passed +971,   1 748 skipped +52,   2 failed -4

For more details on these failures, see this check.

Results for commit 0d8ee46. ± Comparison against base commit a47cb0a.

♻️ This comment has been updated with latest results.

Comment on lines 381 to 382
# Note: in case of identical timestamps, there may be loops after sorting
# where n_active < 0
Reviewer (Member) commented:
IIUC, child.enqueued <= child.stop. How about sorting by timestamp only, i.e. events.sort(key=lambda event: event[0])? This should maintain relative ordering and avoid n_active < 0. (It took me a moment to understand why n_active would ever drop below 0 here.)
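
A toy illustration of why a stable sort by timestamp alone avoids the negative running count. The (timestamp, ±1) event encoding below is assumed for the sketch, not taken from spans.py:

    # Each child span is assumed to emit a (timestamp, +1) event when enqueued
    # and a (timestamp, -1) event when it stops, appended in that order.
    t = 100.0
    events = [
        (t, +1),      # child A enqueued
        (t, -1),      # child A stopped within the same time() tick
        (t, +1),      # child B enqueued
        (t + 1, -1),  # child B stopped
    ]

    def running_counts(evs):
        n_active, counts = 0, []
        for _, delta in evs:
            n_active += delta
            counts.append(n_active)
        return counts

    # Sorting by the full (timestamp, delta) tuple puts -1 before +1 at
    # identical timestamps, so the running count dips below zero:
    print(running_counts(sorted(events)))                              # [-1, 0, 1, 0]
    # Sorting by timestamp only is stable and keeps enqueued-before-stop order:
    print(running_counts(sorted(events, key=lambda event: event[0])))  # [1, 0, 1, 0]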

@crusaderky (Collaborator, Author) replied:
Done

@hendrikmakait (Member) left a review:

LGTM!

@hendrikmakait merged commit 566fd1f into dask:main on Jul 5, 2023
@crusaderky deleted the spans_neg_delta branch on July 6, 2023 17:50
3 participants