-
Thank you for writing this up @gjoseph92. To me, this is nicely written.
-
Agreed, thanks for writing this Gabe 😄

A couple of initial comments:

First, Tornado, which is used in the TCP case, does buffer reading and writing. Agree it's weird that there is an asymmetry between `read` & `write`. I suspect our profiling tools are not telling the whole story. As mentioned, it would be nice to push as much reading and writing through Tornado as possible (to avoid passing through all these Python layers repeatedly). This was the idea behind PR (#4506), which preps all the frames to write before writing and reads all the frames into one big buffer. Yet as these results show, issues remain even when doing that. Ben and I have seen the same when using call graphs (interestingly this effect is just as pronounced in the call graphs for us before and after disabling GC).

One thought is to replace Tornado with something else to see if it helps (though I suspect this will be a fair amount of surgery). A particular thought is to use `asyncio` for this. One consequence of using `asyncio` (particularly methods called on the loop like [create_server](https://docs.python.org/3/library/asyncio-eventloop.html#asyncio.loop.create_server)) is that if we enable `uvloop` we immediately benefit from the very efficient Cython in `uvloop` and the fast libuv event library (for example, here is [uvloop's create_server](https://github.com/MagicStack/uvloop/blob/c808a663b297bb2aee745523c277d6fafecebbeb/uvloop/loop.pyx#L1622)). This would let us replace a lot of Python layers with C layers on top of a high-performance event library. This would benefit both the Scheduler and the Workers.

As to `new_task`, am not actually surprised we spend a lot of time there (it gets called a lot!). So ~2% seems pretty good given this. This function is already fairly well optimized and is being called through a C API. Am not expecting us to get much more out of it. Though at some point we would want to refactor `SchedulerState` (including this method) into a separate object that the Scheduler holds (as opposed to inherits from), which would eliminate the virtual table call overhead (though this is pretty minuscule).

It sounds like you have had better luck than Ben and me at running larger-scale graphs, which is great! Will need to look more closely at what the results show for `update_graph_hlg`. Actually expect there's still a fair bit left on the table in that function. Started to do some work in PR (#4418), which I'm happy to resurrect. While I suspected this would eventually matter, it wasn't something Ben and I could demonstrate (at least not at the scale we were running at).

One last thought from a big-picture perspective. When we started this project of optimizing the Scheduler, the slow part was building, sending, and scheduling the graph. IOW there was a long pause when sending a large chunk of work off for computation. This is largely not an issue anymore (though as noted there may be some remaining work in `update_graph_hlg`). We then were spending comparable time in `read` and `write` on the Scheduler (and this dominated other things). It is worth noting that the underlying `socket.send` and `socket.recv` calls [in Tornado](https://github.com/tornadoweb/tornado/blob/5eb7bb8efb4e1be8a2cec1744c857a8dadd43af9/tornado/iostream.py#L1138-L1152) will [release the GIL in CPython](https://github.com/python/cpython/blob/f2f55e7f03d332fd43bc665a86d585a79c3b3ed4/Modules/socketmodule.c#L885-L887). So if we send lots of very small messages quickly, we will suffer from GIL contention. On the Scheduler we have largely solved this (as these results also show) by batching lots of small messages into larger messages sent less often, so fewer writes (#4451) (#4526). Is batching being done on the Workers (maybe @gforsyth?)? If not, is that doable? IOW `read` and `handle_task_finished` taking comparable time may just be a function of lots of small messages from Workers that result in lots of context switching for the Scheduler, slowing it down.
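For concreteness, a minimal sketch of what the uvloop route could look like (plain `asyncio.start_server` with `uvloop.install()`; the port and echo handler below are placeholders, not the actual distributed comm layer):

```python
import asyncio
import uvloop

# Swap in uvloop's event loop policy; after this, loop.create_server and
# friends go through uvloop's Cython/libuv implementation instead of the
# pure-Python asyncio one.
uvloop.install()

async def handle(reader: asyncio.StreamReader, writer: asyncio.StreamWriter):
    # Echo a single message back; this just stands in for a real comm handler.
    data = await reader.read(65536)
    writer.write(data)
    await writer.drain()
    writer.close()
    await writer.wait_closed()

async def main():
    # asyncio.start_server calls loop.create_server underneath, so with
    # uvloop installed this is the uvloop create_server linked above.
    server = await asyncio.start_server(handle, "127.0.0.1", 8788)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```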
-
As mentioned in the very-long-profiling-issue, `socket.send` may be a red herring. It's a place where Python releases the GIL, so other activities/threads/processes can claim that time. That time looks like it's being spent in `socket.send` from our thread's perspective, but actually other stuff is happening. There was some work in the very-long-profiling-issue around looking at the distribution of times, which backed up this theory.
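One way to see that effect in isolation (a toy sketch; the busy thread below just stands in for GIL-holding scheduler work): time `socket.send` on a local socket pair with and without another thread hogging the GIL. The syscall itself stays cheap, but under contention many calls appear to take milliseconds, and that inflated time is what a sampling profiler attributes to `send`.

```python
import socket
import statistics
import threading
import time

def timed_sends(n=500, payload=b"x" * 1024):
    # Time socket.send on a connected local pair; recv right away so the
    # send buffer never fills and the send syscall itself stays cheap.
    a, b = socket.socketpair()
    durations = []
    for _ in range(n):
        t0 = time.perf_counter()
        a.send(payload)
        durations.append(time.perf_counter() - t0)
        b.recv(len(payload))
    a.close()
    b.close()
    return durations

# Baseline: nothing else competing for the GIL.
quiet = timed_sends()

# Now with a CPU-bound thread holding the GIL (stand-in for transition work).
stop = threading.Event()

def busy():
    x = 0
    while not stop.is_set():
        x += 1

threading.Thread(target=busy, daemon=True).start()
contended = timed_sends()
stop.set()

for name, d in [("quiet", quiet), ("contended", contended)]:
    print(f"{name:>9}: median {statistics.median(d) * 1e6:.0f} µs, "
          f"max {max(d) * 1e6:.0f} µs")
```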
That being said, I'm still excited about the possibility of switching down to asyncio and seeing if it helps, if that's a relatively cheap thing to do.
It seems like we might next want to look at what happens when we first receive graphs and try to accelerate that.
-
Opening as a discussion to share preliminary results, since there isn't one clear issue to resolve yet.

I've been profiling the scheduler while running a scheduler-bound workload on a large, multi-node cluster. This is corollary work to https://github.com/quasiben/dask-scheduler-performance. The main differences are 1) the workers are all separate machines with real network latencies between them, and 2) I'm profiling with py-spy to get a more complete temporal picture of what's happening than the dask profiler. I'm using https://github.com/gjoseph92/scheduler-profilers for this, which others might find useful.

If you want just one profile to look at, I'd use `cython-nogc.json`. You can see all the results (and code) here. Here are a few interesting discoveries so far:
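For reference, the bare-bones version of attaching py-spy to a scheduler looks roughly like this (a sketch assuming a connected `Client` and shell access to the scheduler host; the address is a placeholder, and the actual runs here went through scheduler-profilers):

```python
import os
from distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder address

# run_on_scheduler executes a function inside the scheduler process,
# so this returns the scheduler's PID.
scheduler_pid = client.run_on_scheduler(os.getpid)

# Then, on the scheduler host, record a speedscope-format profile at 100 Hz:
print(f"py-spy record --pid {scheduler_pid} --rate 100 "
      f"--format speedscope -o scheduler-profile.json")
```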
Garbage collection is an issue!

Look at cython and notice the ~1.5 sec pauses every ~10 sec (the time units on speedscope are samples; py-spy is sampling at 100 Hz). In cython-nogc (make sure to select `MainThread`), garbage collection is disabled on the scheduler and those pauses disappear. I think you can see this in the task stream as well: look at the cython performance report, compared to the cython-nogc performance report.

Most importantly, turning off GC gave a ~13% improvement without py-spy running: see cython-gc-no-py-spy and cython-nogc-no-py-spy. It gave a 37% performance improvement with py-spy running, which is odd: py-spy does seem to slow things down by maybe ~40%, but I don't know why it would have an outsized effect on garbage collection (benfred/py-spy#389 could be worth us contributing to?).

Obviously turning off GC isn't the solution here, but we should investigate what's creating the reference cycles that require garbage collection. Mostly, it's very important to be aware of for profiling: if you look at "what function takes up the most time in this profile", but that function happened to be active while GC paused the world, it'll skew results significantly, and you might end up refactoring something that won't actually make a difference. For that reason, I'm just using `nogc` profiles from here on.

xref quasiben/dask-scheduler-performance#142
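As a concrete baseline, here's roughly what "GC disabled on the scheduler" means, plus a cheap way to see the pauses without a profiler (a sketch assuming a connected `Client`; not the exact mechanism used for these runs):

```python
import gc
import time
from distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder address

def instrument_gc():
    # Runs inside the scheduler process: log how long each stop-the-world
    # collection takes (output goes to the scheduler's stdout/logs).
    start = [0.0]

    def callback(phase, info):
        if phase == "start":
            start[0] = time.perf_counter()
        else:  # phase == "stop"
            pause_ms = (time.perf_counter() - start[0]) * 1000
            print(f"gc gen {info['generation']}: {pause_ms:.1f} ms, "
                  f"collected {info['collected']} objects")

    gc.callbacks.append(callback)

client.run_on_scheduler(instrument_gc)

# The blunt experiment from the profiles above: turn the collector off
# entirely in the scheduler process.
client.run_on_scheduler(gc.disable)
```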
`update_graph_hlg` takes nontrivial time

For large graphs (this case is ~200k tasks), generating the low-level graph and adding it to scheduler state takes up a sizable portion of total runtime. In this profile (cython-gil-nogc), check out the left-heavy view of `MainThread` (see the speedscope docs if you're not familiar). `update_graph_hlg` takes 19% of total time, which is on par with `read`s (~18%) and a bit less than `handle_task_finished` (27%). If you double-click on `update_graph_hlg`, it's interesting that `Scheduler.update_graph`, `__dask_distributed_unpack__`, and `order` all take about a third. It's surprising to me that `SchedulerState_new_task` is 2.1% of total runtime. Also looks like we're spending 2.1% of runtime on dask/dask#7650.

I imagine there is some room to optimize all of these parts, though it may be tricky since `__dask_distributed_unpack__` and `order` aren't in distributed (so Cythonizing them, which might be pretty effective, is harder).
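To get a local feel for the `order` slice specifically, one can time `dask.order.order` on a materialized graph of comparable size (a sketch; the array expression below is just a stand-in for the real workload):

```python
import time
import dask.array as da
from dask.order import order

# Build something with a large task count; smaller chunks => more tasks.
x = da.random.random((100_000, 100_000), chunks=(1_000, 1_000))  # 100x100 chunks
y = (x + x.T) - x.mean(axis=0)

# Materialize the low-level graph, which is part of what update_graph_hlg
# ends up doing on the scheduler when it unpacks a HighLevelGraph.
dsk = dict(y.__dask_graph__())
print(f"{len(dsk)} tasks")

t0 = time.perf_counter()
priorities = order(dsk)  # static prioritization of every task
print(f"dask.order.order took {time.perf_counter() - t0:.2f}s "
      f"for {len(priorities)} keys")
```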
We're spending significant time on transitions

In cython-gil-nogc, `handle_task_finished` is the largest contributor to runtime (27%). Nearly all of that is spent in `_transition`. I haven't tried with #4650 yet, but I should.
Moderate time in `read` (+ blocking the event loop)

See the left-heavy view of cython-nogc (`MainThread`): `read`-related stuff seems to be ~15% of runtime. `write`-ish stuff seems to be ~6%.

Oddly, getting the header (`frames_nbytes = await stream.read_bytes(fmt_size)`) is taking the same amount of time as reading all the frames (`n = await stream.read_into(frames)`): 5%. Since this should be 8 bytes vs. lots, you'd expect it to be much faster, so likely there's some issue with buffers there (related to #4506?). I wonder how much of the stream SSL has to read and decrypt just to get those 8 bytes? Hopefully this would be buffered and reused in subsequent reads, so maybe this isn't a real problem, and just something that shows up in profiling yet gets amortized out in real life.

Also, that `frames_nbytes` read in `comm/tcp.py` takes 5% of runtime, yet `read` in `ssl.py` takes only 2.6% of runtime. So it seems there's a bit of overhead traversing the many layers of the Python networking stack (including 1.5% of runtime in `_asyncio_Future___init__`?!). Lots of prior discussion around this in #4443. I think it's worth looking carefully at tornado's iostream for inefficiencies.

To me, `read` just feels like a particularly wasteful place to be spending time, since it feels like we ought to be able to pass that work off to something that doesn't block the event loop, and let us use our precious event loop cycles for application-level logic that can't be done concurrently (like scheduling tasks in transitions). There's no particular reason why we should have to hold off on task-scheduling work until SSL is done writing a buffer, whereas there are big reasons why we can't have two `transition` threads running at once. Even if we can't make `read` any faster, I wonder if we could make it more parallel.
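One way to check whether the 8-byte header read is genuinely expensive or just where the profiler catches the coroutine waiting is to time the two awaits of the framing pattern in isolation (a toy sketch with asyncio streams and no TLS; the real path is tornado's `read_bytes`/`read_into`):

```python
import asyncio
import struct
import time

# Length-prefixed framing, similar in spirit to comm/tcp.py: an 8-byte
# header carrying the payload size, then the payload itself.
HEADER = struct.Struct("<Q")

async def serve(reader: asyncio.StreamReader, writer: asyncio.StreamWriter):
    payload = b"x" * (4 * 1024 * 1024)
    writer.write(HEADER.pack(len(payload)) + payload)
    await writer.drain()
    writer.close()

async def main():
    server = await asyncio.start_server(serve, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    reader, writer = await asyncio.open_connection("127.0.0.1", port)

    t0 = time.perf_counter()
    header = await reader.readexactly(HEADER.size)  # the tiny "frames_nbytes" read
    t1 = time.perf_counter()
    payload = await reader.readexactly(HEADER.unpack(header)[0])  # the bulk read
    t2 = time.perf_counter()

    print(f"header read:  {(t1 - t0) * 1e6:.0f} µs")
    print(f"payload read: {(t2 - t1) * 1e6:.0f} µs ({len(payload)} bytes)")

    writer.close()
    server.close()
    await server.wait_closed()

asyncio.run(main())
```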
cc @jakirkham @mrocklin @jrbourbeau @quasiben @rjzamora and surely more; please tag others who would find this relevant!