[dask] Object store fills up too quickly in simple processing script #11565
Comments
cc @clarkzinzow
Thanks for raising this issue @richardliaw. For more context @clarkzinzow, I'm finding that even with a dataset 1/12 the size (157M), this error is occurring when attempting to write to Parquet (but not, interestingly, when calling ...). Let me know if you need more info from me. This is currently a blocker for me, so happy to help in any way I can.
I take that back, I'm actually seeing this for ...
In my attempt to reproduce, I'm getting a lot of segfaults, not sure what the cause is yet. I'll keep investigating. Let me know if you are able to create a simpler reproduction; that might help me see through a lot of the noise here.
Hmm interesting, computing this without using the Dask-on-Ray scheduler crashed my dev VM! Maybe there's an underlying Dask or Parquet issue at work here. Are you able to run this workload to completion with a non-Ray scheduler? How large of a machine are you running this on?
Yes, I can run without the Ray scheduler. I'm using a MacBook Pro with 16GB of RAM.
I'm able to run this script on a MBP with 64GB of RAM. I can reliably produce failure with:
This produces a
Which is similar to Travis' output of:
@tgaddair @richardliaw, which versions of Ray, Dask, and Pandas are y'all on?
Thanks @clarkzinzow for continuing to look into this!
thanks a bunch @clarkzinzow!
I'm still working on getting to the bottom of the segfaults on Parquet table writing that I keep hitting, which is concerning in and of itself. I'll try to poke at this a bit more tomorrow.
I'm assuming that both of y'all have ...
@clarkzinzow nope didn't know that was a thing! trying it out now.
@richardliaw Interesting, not sure why the pyarrow Parquet implementation is segfaulting for me and not you. What version of pyarrow are you using?
though maybe we bundle it internally?
Not anymore! 😁
ah nice! Let me know what other information you need to help debug your segfault.
I'll probably end up opening a separate issue for the segfaults since it seems to be specific to using the pyarrow Parquet writer on Linux. Given that I have a workaround (using the fastparquet writer instead), I'm going to ignore the segfaults and keep investigating the object store OOMs so we can hopefully unblock @tgaddair's work.
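For reference, a minimal sketch of that workaround, switching Dask's Parquet writer from the default pyarrow engine to fastparquet (the paths are placeholders, and `fastparquet` must be installed):

```python
import dask.dataframe as dd

# Placeholder paths; the relevant part is engine="fastparquet", which sidesteps
# the pyarrow Parquet writer that was segfaulting on Linux.
df = dd.read_csv("data/*.csv")
df.to_parquet("output_parquet/", engine="fastparquet")
```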
Hey @clarkzinzow, sorry for the delayed response. I'm actually using ...
I'm using pyarrow ...
@tgaddair Sweet, thanks! And if ...
So, despite the source dataset being only 2GiB, I think that the working set for this load + repartition + merge is fundamentally large due to Dask's partitioned dataframe implementation (not due to the implementation of the Dask-on-Ray scheduler), and that this results in a failure when using the Ray scheduler because the working set is limited to the allocated object store memory instead of the memory capacity of the entire machine. Running this sans the Ray scheduler yields a peak memory utilization of 20GiB, a lot of which is a working set of intermediate task outputs that must be materialized in memory concurrently. If you have allocated less than 6GiB to the object store, then you can expect to see OOMs since that object store memory budget can't meet the peak demand. Until the underlying bloated memory utilization is fixed, you could try allocating e.g. 14GiB of memory to Ray and 7GiB to the object store in particular via `GB = 1024 ** 3; ray.init(_memory=14*GB, object_store_memory=7*GB)` to see if you have better luck, or try running this on a cloud instance or a set of cloud instances with more memory available. If I'm able to confirm that there are some unnecessary dataframe copies happening in Dask's dataframe implementation, then I'll open an issue on the Dask repo, and I'll double-check to make sure that the object store is freeing unused intermediate task outputs at the expected times.
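As a concrete illustration, here is a minimal sketch of that suggested memory budget combined with the Dask-on-Ray scheduler; it assumes the `ray_dask_get` scheduler from `ray.util.dask`, and the figures are just the ones suggested above, not tuned values:

```python
import dask
import ray
from ray.util.dask import ray_dask_get

GB = 1024 ** 3

# Budget suggested above: 14 GiB for Ray overall, 7 GiB of it for the object store.
ray.init(_memory=14 * GB, object_store_memory=7 * GB)

# Route Dask collection computations through the Dask-on-Ray scheduler.
dask.config.set(scheduler=ray_dask_get)
```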
@clarkzinzow thanks a bunch for this thorough evaluation! So from what I understand, the same memory utilization occurs on both workloads, but the difference is that Dask does not have "memory limits" while Ray enforces memory limits. Is that correct?
@richardliaw That's my current theory, yes! I do want to find out why the working set is so large during repartitioning and merging, and whether intermediate objects are being garbage collected in a timely manner (which I currently think they are, just want to double-check). Hopefully I'll have some time tomorrow to dig further and open up some issues.
What is the problem?
We get an OOM even though the working set is much smaller than expected.
Reproduction (REQUIRED)
Using the Netflix dataset: https://www.kaggle.com/netflix-inc/netflix-prize-data
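A hypothetical sketch of the kind of workload described in this thread (load, repartition, merge, and write to Parquet under the Dask-on-Ray scheduler); the paths, column name, and partition count are placeholders, not the original reproduction script:

```python
import dask
import dask.dataframe as dd
import ray
from ray.util.dask import ray_dask_get

ray.init()
dask.config.set(scheduler=ray_dask_get)

# Placeholder inputs standing in for the Netflix prize data.
ratings = dd.read_csv("ratings/*.csv")   # e.g. user_id, movie_id, rating, date
movies = dd.read_csv("movies.csv")       # e.g. movie_id, year, title

# Repartition and merge: the steps identified in the thread as driving the large working set.
ratings = ratings.repartition(npartitions=100)
merged = ratings.merge(movies, on="movie_id", how="left")

# Writing out materializes the intermediate outputs and is where the object store OOM appears.
merged.to_parquet("merged_parquet/")
```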
cc @tgaddair