
Annotation-less P2P shuffling #7801

Merged: 28 commits into dask:main on May 11, 2023

Conversation

@hendrikmakait (Member) commented Apr 25, 2023

Closes #7716 by avoiding annotations altogether.

Core idea:

  • Instead of propagating information through annotations, we optimistically schedule output tasks on any worker, identify the correct worker, and reschedule the task onto that worker (a minimal sketch of this flow follows the checklist below).

Assumption:

  • Rescheduling each task at most once will not add significant overhead.

Open questions:

  • How do we deal with pre-existing worker restrictions that do not match the output worker we have chosen for the given partition?

Checklist:

  • Tests added / passed
  • Passes pre-commit run --all-files
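
The following is a minimal, self-contained sketch of that flow. The names (assign_output_worker, on_output_task_started) and the round-robin mapping are hypothetical, purely for illustration; this is not the distributed scheduler API.

from __future__ import annotations

def assign_output_worker(partition: int, workers: list[str]) -> str:
    """Deterministically map an output partition to a worker (illustrative round-robin)."""
    return workers[partition % len(workers)]

def on_output_task_started(partition: int, current_worker: str, workers: list[str]) -> str | None:
    """Return the worker to pin the task to if it landed on the wrong worker.

    The scheduler would then reschedule the task onto the returned worker;
    each output task is rescheduled at most once.
    """
    correct = assign_output_worker(partition, workers)
    if current_worker != correct:
        return correct  # pin the task to `correct` and reschedule it there
    return None  # already on the right worker; run it here

workers = ["tcp://w1", "tcp://w2", "tcp://w3"]
assert on_output_task_started(4, "tcp://w1", workers) == "tcp://w2"
assert on_output_task_started(4, "tcp://w2", workers) is None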

@github-actions bot (Contributor) commented Apr 25, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

25 files (±0)   25 suites (±0)   15h 59m 22s duration (+1h 7m 8s)
3,597 tests (+16): 3,489 passed (+20), 105 skipped (±0), 3 failed (-4)
44,210 runs (+1,110): 42,101 passed (+1,046), 2,105 skipped (+67), 4 failed (-3)

For more details on these failures, see this check.

Results for commit b276e79, compared against base commit 9d90792.

This pull request removes 14 and adds 30 tests. Note that renamed tests count towards both.
distributed.shuffle.tests.test_merge ‑ test_basic_merge[inner]
distributed.shuffle.tests.test_merge ‑ test_basic_merge[left]
distributed.shuffle.tests.test_merge ‑ test_basic_merge[outer]
distributed.shuffle.tests.test_merge ‑ test_basic_merge[right]
distributed.shuffle.tests.test_merge ‑ test_merge[inner]
distributed.shuffle.tests.test_merge ‑ test_merge[left]
distributed.shuffle.tests.test_merge ‑ test_merge[outer]
distributed.shuffle.tests.test_merge ‑ test_merge[right]
distributed.shuffle.tests.test_rechunk ‑ test_raise_on_fuse_optimization
distributed.shuffle.tests.test_rechunk ‑ test_raise_on_lost_annotation
…
distributed.shuffle.tests.test_merge ‑ test_basic_merge[all-inner]
distributed.shuffle.tests.test_merge ‑ test_basic_merge[all-left]
distributed.shuffle.tests.test_merge ‑ test_basic_merge[all-outer]
distributed.shuffle.tests.test_merge ‑ test_basic_merge[all-right]
distributed.shuffle.tests.test_merge ‑ test_basic_merge[none-inner]
distributed.shuffle.tests.test_merge ‑ test_basic_merge[none-left]
distributed.shuffle.tests.test_merge ‑ test_basic_merge[none-outer]
distributed.shuffle.tests.test_merge ‑ test_basic_merge[none-right]
distributed.shuffle.tests.test_merge ‑ test_basic_merge[some-inner]
distributed.shuffle.tests.test_merge ‑ test_basic_merge[some-left]
…

♻️ This comment has been updated with latest results.

@mrocklin (Member):

Cc @rjzamora and @d-v-b so that they're aware of this possibility

@hendrikmakait (Member, Author):

cc @wence-

@hendrikmakait (Member, Author):

Here are the results from an A/B test on this; it looks like there is no significant performance impact:
[Screenshot: A/B test results, 2023-04-27 16:54]

Comment on lines 131 to 132
if shuffle.run_id != run_id:
    raise RuntimeError()
Member:

I think we should also check whether the shuffle this run_id points to is still valid. We don't want stale requests to modify the restrictions.

Member Author:

We should probably document this somewhere, but if it's in ShuffleSchedulerExtension.states, it's an active and valid shuffle instance. Otherwise, it would have been dropped from there.
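
A minimal sketch of the combined check discussed here, assuming a hypothetical states mapping and ShuffleState class (the real ShuffleSchedulerExtension keeps richer state); it only illustrates rejecting requests both when the shuffle has been dropped and when the run_id is stale.

from dataclasses import dataclass

@dataclass
class ShuffleState:
    """Hypothetical stand-in for the scheduler-side state of one shuffle run."""
    run_id: int

def get_active_run(states: dict[str, ShuffleState], shuffle_id: str, run_id: int) -> ShuffleState:
    """Reject requests that reference a removed shuffle or a stale run."""
    state = states.get(shuffle_id)
    if state is None:
        # Dropped from `states` means it is no longer an active, valid shuffle.
        raise RuntimeError(f"Shuffle {shuffle_id!r} is no longer active")
    if state.run_id != run_id:
        raise RuntimeError(f"Stale run_id {run_id} for shuffle {shuffle_id!r}")
    return state

states = {"shuffle-abc": ShuffleState(run_id=2)}
assert get_active_run(states, "shuffle-abc", 2).run_id == 2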

part, workers, npartitions
)
for part in parts_out:
# TODO: How do we deal with pre-existing worker restrictions?
Member:

Why can't this logic be the same as before when it comes to existing worker restrictions?

Member Author:

I've adjusted the logic. The only unhandled situation I can imagine is a task being fused with a follow-up task that has worker restrictions of its own; in this case, we may ignore those. A fix would be to add a shuffle-internal mechanism for transferring the output partition to another worker.

@fjetter (Member) commented Apr 28, 2023

I think this all makes sense. I wonder if we still want to keep the annotations path around for the "happy case" where HLGs and annotations just work.

@wence- (Contributor) left a comment:

A few minor queries.

@@ -133,6 +133,7 @@ def test_raise_on_fuse_optimization():
dd.shuffle.shuffle(df, "x", shuffle="p2p")


@pytest.mark.xfail()
Contributor:

This test doesn't make sense any more after these changes, right? So the xfail is wrong and it should be deleted?

Member Author:

Yup, this is still work in progress; I've adjusted the test.

@@ -161,6 +161,7 @@ def test_raise_on_fuse_optimization():
rechunk(x, chunks=new, method="p2p")


@pytest.mark.xfail()
Contributor:

This is also no longer expected to raise an error, so the xfail is a bit of a misnomer, I think?

Member Author:

Yup, this is still work in progress; I've adjusted the test.

distributed/shuffle/_scheduler_extension.py (outdated; review thread resolved)
@hendrikmakait (Member, Author) commented May 3, 2023

How do we deal with pre-existing worker restrictions that do not match the output worker we have chosen for the given partition?

If the worker restrictions are applied to the shuffle, they apply to all tasks. Thus, we can use the worker restrictions of the barrier task to bootstrap our set of valid workers. Consequently, the restrictions on the output task and the set of valid workers should match.

As things stand right now, I cannot think of another scenario that would add worker restrictions to tasks. With task fusion, annotations are still getting lost (dask/dask#7036). If this changes, we need to revisit this.
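
An illustrative sketch of that bootstrapping; pick_output_workers and the worker addresses are made-up names, not the actual extension code.

def pick_output_workers(all_workers: list[str], barrier_restrictions: set[str]) -> list[str]:
    """Valid output workers: the barrier's worker restrictions if present, else every worker."""
    if barrier_restrictions:
        return [w for w in all_workers if w in barrier_restrictions]
    return list(all_workers)

assert pick_output_workers(
    ["tcp://w1", "tcp://w2", "tcp://w3"], {"tcp://w1", "tcp://w3"}
) == ["tcp://w1", "tcp://w3"]
assert pick_output_workers(["tcp://w1", "tcp://w2"], set()) == ["tcp://w1", "tcp://w2"]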

raise RuntimeError(
    f"Barrier task with key {key!r} does not exist. This may be caused by "
    "task fusion during graph generation. Please let us know that you ran "
    "into this by leaving a comment at distributed#7816."
Member Author:

XREF: #7816

@@ -238,10 +207,15 @@ async def test_rechunk_4d(c, s, *ws):
    new = ((10,),) * 4
    x2 = rechunk(x, chunks=new, method="p2p")
    assert x2.chunks == new
    assert np.all(await c.compute(x2) == a)
    # FIXME: distributed#7816
Member Author:

XREF: #7816

@hendrikmakait changed the title from "[RFC] Annotation-less P2P shuffling" to "Annotation-less P2P shuffling" on May 3, 2023
@hendrikmakait marked this pull request as ready for review on May 3, 2023 15:28
@fjetter (Member) left a comment:

Implementation looks good. There are a couple of questions around testing we should address before merging.

distributed/shuffle/_scheduler_extension.py (outdated; review thread resolved)
barrier = self.scheduler.tasks[barrier_key(id)]

if barrier.worker_restrictions:
    workers = list(barrier.worker_restrictions)
Member:

The implicit assumption here is that the barrier and output tasks are guaranteed to have the same restrictions. I suggest documenting this because it is a non-trivial conclusion, and depending on how future versions of fusion work, it may not even be true indefinitely.

Member Author:

I've added a docstring.

Comment on lines +52 to +93
class ShuffleAnnotationChaosPlugin(SchedulerPlugin):
    #: Rate at which the plugin randomly drops shuffle annotations
    rate: float
    scheduler: Scheduler | None
    seen: set

    def __init__(self, rate: float):
        self.rate = rate
        self.scheduler = None
        self.seen = set()

    async def start(self, scheduler: Scheduler) -> None:
        self.scheduler = scheduler

    def transition(
        self,
        key: str,
        start: TaskStateState,
        finish: TaskStateState,
        *args: Any,
        **kwargs: Any,
    ) -> None:
        assert self.scheduler
        if finish != "waiting":
            return
        if not key.startswith("shuffle-barrier-"):
            return
        if key in self.seen:
            return

        self.seen.add(key)

        barrier = self.scheduler.tasks[key]

        if self._flip():
            barrier.annotations.pop("shuffle", None)
        for dt in barrier.dependents:
            if self._flip():
                dt.annotations.pop("shuffle", None)

    def _flip(self) -> bool:
        return random.random() < self.rate
Member:

I'm a bit torn about this. This is testing properties of the current implementation that I consider mostly accidental / nice-to-haves, but not hard requirements.

  • The implementation uses the annotations once during execution of the first shuffle task. Afterwards the annotations can be entirely forgotten.
  • It also tests that a partially annotated graph can be understood

I don't think any of these requirements are actually necessary, but this test now sets them as implicit requirements.
For instance, I think it is a fair assumption to say that either all tasks or no tasks are annotated. This is generally how fusion and our graph manipulation work. The fact that the implementation can handle a mixture is nice but not required.
I think it's also reasonable to say that annotations might be read during runtime and not just initially. I don't know when or how this would be useful, but I don't see why we should restrict ourselves to the current behavior.

I wouldn't want future development to be slowed down if they break these properties / because these tests fail.

@hendrikmakait what are your thoughts about this?

Member Author:

Generally, I'm also not a huge fan of this, but I think it is the most defensive way of testing that we do not rely on annotations. Let me disentangle your statement.

This is testing properties of the current implementation that I consider mostly accidental / nice to haves but not hard requirements.

This is true, but it is also the implementation of annotations that is broken. If it were not broken, we would not need to fix this.

For instance, I think it is a fair assumption to say that either all tasks or no tasks are annotated. This is generally how fusion and our graph manipulation works. The fact that the implementation can handle a mixture is nice but not required.

At the moment, I do not feel confident to make this claim. I suppose it's true, but I don't know what's broken with annotations at the moment. Also, this feels like an implementation-specific assumption. I could see fusion dropping annotations just for fused tasks if one does not pay attention. I would like to not have to test this, but here we are.

I think it's also reasonably to say that annotations might be read during runtime and not just initially. I don't know when or how this is useful but I don't see why we should restrict ourselves to the current behavior.

I don't see where we restrict ourselves here. We are stripping the barrier and its dependents of annotations once they arrive on the scheduler's state machine. This makes no assumption about when they will be read. The only assumption this makes is that all the ways that annotations are broken will strip them before the tasks arrive at the scheduler. I think this is a fair assumption.

I wouldn't want future development to be slowed down if they break these properties / because these tests fail.

Agreed.

Member:

At the moment, I do not feel confident to make this claim

Fair enough. It's true that it's hard to extrapolate what happens in the future.

I could see fusion dropping annotations just for fused tasks if one does not pay attention.

Indeed. My point is rather that if fusion happens, it should affect all output keys. However, this is also a generalization that may not be true.

I think it's also reasonable to say that annotations might be read during runtime and not just initially. I don't know when or how this would be useful, but I don't see why we should restrict ourselves to the current behavior.

I don't see where we restrict ourselves here.

Sorry, I see how my statement is not very clear. This thing mutates annotations as soon as the barrier is transitioned to a specific state. If we already read the annotations before that point, this modification will not have any effect.
If I'm not mistaken, this is also what happens in the current implementation: once the first transfer task is running, we evaluate the annotations and fix the mappings. This transition hook is only executed later, i.e., the mutated annotations will not have an effect on any of this.
If we later changed the behavior such that the annotations are evaluated later or multiple times, we would suddenly hit this test code, which may no longer make sense.

This is all a bit academic. I will not block merging because of this.

I wouldn't want future development to be slowed down if they break these properties / because these tests fail.

Agreed.

What I meant to say is that red tests often discourage people from changing something, and once this becomes legacy code, the next generation of developers may not fully understand whether this is required or just nice-to-have behavior.
I raised my concerns. If future developers run into this, they can follow the breadcrumbs to this thread =)

Comment on lines 918 to 921
await n.process.process.kill()
block_event.set()
with pytest.raises(RuntimeError):
    await fut
await block_event.set()

await fut
Member:

If this is now able to recover, we should add a couple of additional asserts:

  1. Assert that there is indeed a shuffle task on the dead worker
  2. Assert that the output result is what is expected

Member:

I'm surprised that this change allows us to rerun. Do you understand why this works? You also reduced the data / number of partitions, which makes me nervous that the now-dead worker just didn't run any shuffle tasks.

Member Author:

TL;DR: Things now work as they should have worked from the beginning.

I didn't understand why this test failed at first, but after some digging, I figured it out: this is a positive side effect of correctly cleaning up worker restrictions when removing a shuffle on the scheduler. Previously, some output tasks would remain pinned to the failed worker; since that worker isn't around anymore, workers trying to send data to it would fail to connect.

Member:

thanks for digging in

@j-bennet (Contributor) left a comment:

I don't know enough about annotations and what they are used for, but a simpler workflow is almost always better, and this looks simpler.

I would be interested to see whether this extra scheduling work (scheduling a task to a worker that may not be the right one, then having to reschedule) leads to any slowdowns in realistically large workflows.

distributed/reschedule.py (outdated; review thread resolved)
distributed/shuffle/_scheduler_extension.py (review thread resolved)
# This may occur if multiple barriers share the same output task,
# e.g. in a hash join.
return
ts.annotations["shuffle_original_restrictions"] = ts.worker_restrictions.copy()
Contributor:

Dummy question, sorry. The problem you're fixing is tasks occasionally losing annotations. AFAIK this applies to all annotations, not just shuffle annotations, and your ShuffleAnnotationChaosPlugin only kills the shuffle annotation. Is there a possibility that in a real-world scenario, this new shuffle_original_restrictions will be lost too?

Member Author:

Based on what we've seen, the current assumption is that annotations get lost before the tasks make it to the scheduler.
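
To make the scheduler-side bookkeeping concrete, here is a hedged sketch of the save/restore idea; TaskStub, pin_output_task, and unpin_output_task are illustrative names, and only the "shuffle_original_restrictions" key mirrors the quoted diff.

class TaskStub:
    """Minimal stand-in for a scheduler-side task state."""

    def __init__(self, worker_restrictions: set[str]):
        self.worker_restrictions = set(worker_restrictions)
        self.annotations: dict = {}

def pin_output_task(ts: TaskStub, output_worker: str) -> None:
    # Remember pre-existing restrictions on the scheduler so they survive even
    # if client-side annotations were lost before the graph was submitted.
    ts.annotations["shuffle_original_restrictions"] = ts.worker_restrictions.copy()
    ts.worker_restrictions = {output_worker}

def unpin_output_task(ts: TaskStub) -> None:
    # Restore the original restrictions when the shuffle is removed.
    original = ts.annotations.pop("shuffle_original_restrictions", None)
    if original is not None:
        ts.worker_restrictions = original

ts = TaskStub({"tcp://w1", "tcp://w2"})
pin_output_task(ts, "tcp://w3")
assert ts.worker_restrictions == {"tcp://w3"}
unpin_output_task(ts)
assert ts.worker_restrictions == {"tcp://w1", "tcp://w2"}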

@fjetter (Member) left a comment:

Good to go once CI is green-ish

@hendrikmakait (Member, Author):

It's not exactly green, but test failures appear to be unrelated, so I'll merge this in.

@hendrikmakait merged commit 21b70be into dask:main on May 11, 2023
milesgranger pushed a commit to milesgranger/distributed that referenced this pull request May 15, 2023
@mrocklin (Member):

@rjzamora my guess is that p2p shuffling should be doable in dask-expr now.

@rjzamora (Member):

my guess is that p2p shuffling should be doable in dask-expr now.

Nice! Thanks for this, @hendrikmakait!

Successfully merging this pull request may close these issues: [Tracking] Annotations-related issues with P2P