Replies: 1 comment
-
Hi guys, I see that this forum is not much frequented, but what I have is more user feedback than a question, so I will try here rather than on StackOverflow [WARNING: this is going to be long]. I am somebody who is experimenting with dask right now and trying to assess its viability for our software, i.e. the OpenQuake engine, which is the leading open source tool for seismic hazard and risk analysis. The engine is a mature tool - we reached our 10th anniversary last October - used by many governments and private organizations. As such, moving from the current task distribution mechanism to dask.distributed is something to be carefully planned; for sure it will take a lot of time and I am taking just the initial steps.

The architecture of the engine is such that one can trivially swap one distribution mechanism for another, provided that it supports a simple map-reduce interface. The requirement of keeping the distribution mechanism swappable means that I cannot use much of dask, just a map-reduce with a trivial task graph. I have already tested dask on some of the calculations that we run and it works, even if not perfectly.

Just to give you some perspective, a seismic calculation works by taking as input a few gigabytes of Python objects (the seismic sources) and sending them to the workers, which generate hundreds of gigabytes of numpy arrays; such arrays are sent back to the master node, combined and then stored into an HDF5 file. Then there are multiple post-processing phases where data from the HDF5 file is read by the workers and analyzed, then the results are sent back and stored again. This is a very common and ordinary workflow in scientific calculations (a rough sketch of the pattern is given a few paragraphs below). We currently use a cluster with 320 cores, in the range of what I call Middle Performance Computing (more than a server, less than a supercomputer).

For historical reasons - the developers who made the decisions were Django developers - the chosen mechanism for task distribution was celery coupled with rabbitmq. At the time I was working for another company - incidentally, we were using the Grid Engine and python-drmaa, which worked pretty well. Eight years ago I switched jobs and I immediately noticed that celery was the wrong tool for our use case, but even now nearly all users are still using celery, even though it is now deprecated. Just to give you an idea of the inertia and legacy involved. I became aware of dask in 2015, but at the time it was not mature enough. Over the years it improved, but we still could not use it because we thought it lacked an essential feature, i.e. the ability to kill the calculation of one user without killing the calculations of other users running concurrently [now we have changed our architecture and that feature is not needed anymore].

In the summer of 2018 the situation with celery/rabbitmq had become unmanageable: the issue was one of data transfer. As should be obvious to anyone (except the original developers), if you try to transfer 100 GB of data with rabbitmq you have to pay a price, and it was not only a matter of performance: in practice, calculations could not be run because we were running out of disk space in the mnesia directory. rabbitmq is indeed a wonderful piece of software which is meant to be resilient against failures, therefore it has to store task information on disk. The point is that we never ever needed the resilience features of rabbitmq, so we were paying a high price for nothing.
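To make the workflow above concrete, here is a rough sketch of that map-reduce pattern written against dask.distributed (which is what I am evaluating); compute_hazard, the scheduler address and the HDF5 layout are made-up stand-ins, not the engine's actual code:

```python
# Rough sketch of the scatter/compute/gather workflow described above.
# compute_hazard, the scheduler address and the HDF5 layout are hypothetical.
import h5py
import numpy
from dask.distributed import Client, as_completed

def compute_hazard(sources):
    # stand-in for the real task: turn a group of seismic sources
    # into a (large) numpy array
    return numpy.zeros(1000)

def run(source_groups):  # source_groups: a list of picklable Python objects
    client = Client("tcp://scheduler-host:8786")
    futures = client.map(compute_hazard, source_groups)
    order = {fut: i for i, fut in enumerate(futures)}
    with h5py.File("calc.hdf5", "w") as h5:
        dset = h5.create_dataset("hazard", (len(source_groups), 1000))
        # combine/store the arrays on the master node as they arrive
        for fut, array in as_completed(futures, with_results=True):
            dset[order[fut]] = array
    client.close()
```

A scatter/compute/gather loop with a trivial task graph like this is really all the engine asks of a distribution backend.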
Thus, we had to invent various workarounds. Two ideas helped a lot:

This solved most of the issues we had with rabbitmq. Then in the summer of 2018, after studying a bit of the ZeroMQ book, I saw that with a trivial task ventilator we could remove the need for celery entirely. So I did that, experimentally, expecting a dramatic failure for large calculations. After all, I have zero competence in network programming - my background is in Physics - and I did not expect the textbook exercise to scale to a cluster of 320 cores while transferring 1 TB of data. Instead, contrary to my expectations, it worked extremely well and now I have effectively replaced 50,000+ lines of celery plus who-knows-how-many lines of Erlang with 250 lines of ZeroMQ or so, with better performance, a lot less memory and zero disk space. I must say I am impressed with ZeroMQ.

So, all is well? Yes and no. Yes, because in two years of heavy usage we had zero issues with the current system, even under heavy load and when running out of memory. No, in the sense that it is an ad-hoc solution that feels like a hack. In particular the task ventilator is not a scheduler, it just sends a few tasks per core (the sketch below shows roughly what it looks like). When the tasks are all of the same size this is not an issue, but if the tasks have very different sizes and there is an unlucky core that gets a set of particularly slow tasks, then you have to wait for it. There is no logic for re-routing tasks from busy cores to free cores. Also, I do not want to write that logic, since I feel that it is not my job and that somebody else has already done it. In particular, you guys at dask. Since for most of our calculations the task distribution is not that bad (I did a lot of work to ensure that the engine does not produce particularly slow tasks), this is not a pressing issue.
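For the curious, the ventilator is essentially the PUSH/PULL pipeline from the ZeroMQ guide; the sketch below gives the flavour (ports, addresses and the pickled payload are made up, not the engine's actual code):

```python
# Rough sketch of a textbook ZeroMQ task ventilator; all details are hypothetical.
import pickle
import zmq

def ventilator(tasks, task_port=5557, result_port=5558):
    """Fan (func, args) tasks out to the workers and collect the results back."""
    ctx = zmq.Context()
    sender = ctx.socket(zmq.PUSH)
    sender.bind("tcp://*:%d" % task_port)
    receiver = ctx.socket(zmq.PULL)
    receiver.bind("tcp://*:%d" % result_port)
    for task in tasks:
        sender.send(pickle.dumps(task))
    return [pickle.loads(receiver.recv()) for _ in tasks]

def worker(master_host, task_port=5557, result_port=5558):
    """Run once per core: pull a task, execute it, push the result back."""
    ctx = zmq.Context()
    receiver = ctx.socket(zmq.PULL)
    receiver.connect("tcp://%s:%d" % (master_host, task_port))
    sender = ctx.socket(zmq.PUSH)
    sender.connect("tcp://%s:%d" % (master_host, result_port))
    while True:
        func, args = pickle.loads(receiver.recv())
        sender.send(pickle.dumps(func(*args)))
```

The weakness I mentioned is visible here: PUSH hands tasks out round-robin as they are sent, so once a task has been queued on a busy core there is no way to steal it back for a free one.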
Still, as a New Year resolution, I decided to give dask another try. Compared to three years ago (the last time we tried dask), this time everything worked smoothly and in a short time I was able to get calculations running. Then I noticed two things:

I guess dask is doing a lot more than I need and that has a price. In particular our calculations are memory intensive and dask is logging thousands of warnings, probably because I did not set the memory limits. Also I see that dask is resubmitting tasks that died (I had 3 erred tasks out of ~1600, probably due to out-of-memory errors): this must also have a price.

So, here is my user feedback: there are applications that really care about performance and memory occupation. For such applications you should give a way to disable features in exchange for performance/memory. For instance, the dashboard is spectacular and I like it a lot (especially the flame graphs); however, I would be willing to lose it in exchange for more performance and less memory usage. After all, the engine is 10 years old, so it already has internal ways to monitor its performance: the flame graphs, while very nice and convenient to use, are not giving me any information that I did not have. Also, I really would like to turn off the nanny (if a task dies, I want the whole calculation to die, not to keep running) but it seems that I cannot, having set --nprocs=-1. It is entirely possible that I have missed things in the documentation (the PDF version is 200 pages long), but still I wanted to give you a "first impressions" kind of feedback, which is probably useful.

I also see that you have made a huge effort on the documentation compared to a few years ago, and that there are also a lot of YouTube videos. The problem now is that there is too much of it! ;-) I do not think that you can do much about the documentation (it is already good); instead, if you want to work on something, I would suggest looking at ways to reduce the memory footprint, both of the workers and of the scheduler. There are people that really care about that. We have 768 GB of cumulative RAM on our cluster of 6 machines and still we run out of memory all the time. Notice that this is by design: for our scientists it is preferable to run out of memory after 10 minutes (then they can change the parameters and reduce the calculation) rather than waiting days for a calculation and having to kill it manually and restart it with different parameters. Since most of our calculations are research calculations (nobody ever did them before), the normal case is that they are too big to run and have to be tweaked. Therefore we are always at the memory limits, and paying a penalty of 25% in RAM usage is a bit too much.

Thanks for the patience to read until the end,
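P.S. For reference, what I think I was looking for would be spelled roughly as follows with the dask-worker command line; this is only a sketch based on my reading of the docs, and the exact flag names may differ between releases:

```bash
# sketch only: flag spellings and defaults may differ between dask releases

# give each worker an explicit memory limit, which should quiet the memory warnings
dask-worker tcp://scheduler-host:8786 --nprocs 8 --nthreads 1 --memory-limit 4GB

# dropping the nanny seems to require one single-process worker per core,
# since as far as I can tell --nprocs > 1 needs the nanny
for i in $(seq 1 8); do
  dask-worker tcp://scheduler-host:8786 --nthreads 1 --no-nanny --memory-limit 4GB &
done
```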
-
👋 Welcome!
We’re using Discussions as a place to connect with other members of our community. We hope that you:
- Ask questions you're wondering about.
- Share ideas.
- Engage with other community members.
- Welcome others and are open-minded. Remember that this is a community we build together 💪.
To get started, comment below with an introduction of yourself and tell us about what you do with this community.