Xarray operations (e.g., preprocess) running locally (post open_mfdataset) instead of on Dask distributed cluster #8913

lbesnard · 2024-10-28T06:49:16Z

lbesnard
Oct 28, 2024

I have a Python script running from my personal laptop (Machine A).

This script starts a Coiled cluster with Dask distributed. The NetCDF files are then opened together with xr.open_mfdataset.

Things go a bit funny with open_mfdataset. (A bit off topic for this conversation, but if I call open_mfdataset within a class, my preprocess function needs to live outside the class as a standalone function, otherwise I get serialization issues with partial... Anyway, back to the main problem.)
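A minimal sketch of the standalone-function pattern mentioned above, assuming the extra arguments are bound with functools.partial. The variable names (`TEMP`, `PSAL`) and the `required_vars` parameter are hypothetical, and a plain dict stands in for a dataset so the snippet runs without xarray:

```python
from functools import partial


def preprocess(ds, required_vars=(), fill_value=float("nan")):
    """Return a copy of `ds` with any missing variables added.

    Works on any mapping-like dataset that has `.copy()`: a plain dict here,
    but an xarray.Dataset also supports `in`, item assignment and `.copy()`.
    """
    ds = ds.copy()
    for name in required_vars:
        if name not in ds:
            # Placeholder value so every file exposes the same variables.
            ds[name] = fill_value
    return ds


# Because `preprocess` is defined at module level, it is pickled by reference,
# and a partial of it only needs to carry the bound arguments along.
preprocess_fn = partial(preprocess, required_vars=("TEMP", "PSAL"))

# Stand-in for one opened file that is missing the PSAL variable.
fixed = preprocess_fn({"TEMP": 1.0})
print(sorted(fixed))
```

The resulting `preprocess_fn` is what would be passed as `preprocess=` to xr.open_mfdataset. A bound method would drag the whole instance along when serialized, which is one plausible source of the errors described.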

open_mfdataset uses a preprocess function. This function simply creates missing variables to ensure the dataset is consistent. However, when the preprocess function is run on the cluster, my local machine A's network bandwidth usage goes through the roof. It looks like all the NetCDF files being opened via Dask on the remote cluster are also being downloaded to my local machine (my laptop is suddenly downloading 10Gb+ of data). I also noticed a lot of local memory usage. To me this doesn't make any sense, as any computation should run on the remote cluster, yet this is the behaviour I'm experiencing. I've managed to reproduce the problem by simplifying my preprocess function to the extreme. Even with something as simple as the function below, I encounter the same behaviour:

def preprocess(ds):
    return ds

I by no means have an amazing understanding of dask/zarr/xarray, so it's very likely I'm not following best practices (even though I've been looking for them).

I often run into cases where my local machine crashes because all of its memory is used up when I process too many files at once. This defeats the purpose of a remote cluster.

I didn't open an issue, as it's a bit of a weird one and I'm not sure whether it's a dask/xarray/coiled issue or a me issue, but I'm seeking some help/advice, and I'd like to know whether anyone has noticed this before.

cheers

lbesnard · 2024-10-29T06:34:17Z

lbesnard
Oct 29, 2024
Author

As a reproducible example, see
https://gist.github.com/lbesnard/97bdf0b4af9fa340e8ef47aa20b3cc93

where the preprocess_xarray function does nothing. When running this example, I can see that all the data files are being downloaded to my local machine, and that this function doesn't run on the cluster.

With this slightly different gist, local memory is used, which is "normal" since preprocess_xarray is not running on the cluster:
https://gist.github.com/lbesnard/0732b7572983ea5ede5e8d7bcbe809fc

0 replies

RichardScottOZ · 2024-11-02T03:26:57Z

RichardScottOZ
Nov 2, 2024

You are not using the cluster client to do anything...so code is being run locally?

1 reply

lbesnard Nov 2, 2024
Author

Sorry, I'm not sure I'm following you. I'm using the cluster to open NC files with open_mfdataset. I can see the Dask dashboard processing data during the opening. As soon as the files have been opened remotely, that's when my local machine's memory seems to be used.

RichardScottOZ · 2024-11-02T05:43:00Z

RichardScottOZ
Nov 2, 2024

https://github.com/heliocloud-data/science-tutorials/blob/main/S3-Dask-Demo.ipynb

2 replies

lbesnard Nov 10, 2024
Author

I got some help from Coiled, thanks @phofl

For reference, it turns out that the issue is related to s3fs and has already been lodged here:
fsspec/filesystem_spec#1747

The solution is to use this obscure option in s3fs:
default_file_cache=None

It sounds a bit insane that not many people are experiencing this issue, as it means that using a Dask cluster with remote NetCDF files is useless: the bottleneck becomes the machine that launches the code.
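The option name above is quoted as written in the thread. In the s3fs releases I am aware of, the constructor parameters that control read caching are `default_cache_type` and `default_fill_cache`, so the sketch below assumes that `default_cache_type="none"` is the setting being referred to; check the linked issue for the exact spelling. The helper names and the `paths` argument are hypothetical.

```python
def s3_options(anon=False):
    """Keyword arguments for s3fs.S3FileSystem with read caching turned off."""
    return {
        "anon": anon,
        # Assumption: "none" selects fsspec's no-op cache (no read-ahead
        # buffering) for every file opened through this filesystem.
        "default_cache_type": "none",
    }


def open_remote_netcdf(paths, preprocess=None, anon=False):
    """Open NetCDF files on S3 lazily; `paths` is a list of s3:// URLs."""
    # Imported lazily so that s3_options() above has no third-party dependencies.
    import s3fs
    import xarray as xr

    fs = s3fs.S3FileSystem(**s3_options(anon))
    files = [fs.open(p) for p in paths]
    return xr.open_mfdataset(
        files, engine="h5netcdf", parallel=True, preprocess=preprocess
    )
```

Watching local network traffic with and without the cache setting is probably the quickest way to confirm whether it has any effect for a given set of files.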

Answer selected by lbesnard

phofl Nov 11, 2024
Collaborator

This pattern doesn't show up for all files (for example, I only observed it for 4 of your files, not for all of them), but yes, I fully agree that this is very unfortunate.

lbesnard · 2024-11-02T07:17:18Z

lbesnard
Nov 2, 2024
Author

So you're suggesting using client.map?

All the documentation I've read suggests that once a client is set up, as in my example, all open_mfdataset operations will use the cluster/client, especially with the parallel=True option

see

parallel (bool, default: False) – If True, the open and preprocess steps of this function will be performed in parallel using dask.delayed. Default is False.

https://docs.xarray.dev/en/stable/generated/xarray.open_mfdataset.html

or other coiled doc such as
https://www.coiled.io/#w-tabs-0-data-w-pane-2

import xarray as xr
import coiled
  
cluster = coiled.Cluster(
    n_workers=500,
    region="us-west-2",
    worker_memory="64 GiB",
    spot_policy="spot",
)
client = cluster.get_client()

ds = xr.open_mfdataset(
    "s3://mybucket/all-data.zarr",
    engine="zarr", parallel=True,
)

which is the most basic example, and not much different from mine.
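For context, a rough sketch of what the `parallel=True` description quoted above amounts to: each open and preprocess call is wrapped in dask.delayed and the whole batch is computed in one go. `open_in_parallel` and `open_one` are hypothetical names, not xarray internals, and the toy functions stand in for real file opening.

```python
def open_in_parallel(paths, open_one, preprocess=None):
    """Run open_one (and optionally preprocess) on every path as dask tasks."""
    import dask  # imported lazily; requires dask to be installed

    tasks = []
    for path in paths:
        ds = dask.delayed(open_one)(path)
        if preprocess is not None:
            ds = dask.delayed(preprocess)(ds)
        tasks.append(ds)
    # compute() executes the tasks on whichever scheduler is active and
    # returns the results to the process that called it.
    (results,) = dask.compute(tasks)
    return results


def toy_open(path):
    """Stand-in for opening a file: returns a small dict instead of a dataset."""
    return {"path": path}


def toy_preprocess(ds):
    """Stand-in preprocess step that tags the opened object."""
    return {**ds, "preprocessed": True}


if __name__ == "__main__":
    print(open_in_parallel(["a.nc", "b.nc"], toy_open, toy_preprocess))
```

Note the last step: with a distributed client active, the opened objects are sent from the workers back to the machine running the script, so anything attached to them travels over the network too.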

0 replies

RichardScottOZ · 2024-11-03T02:56:41Z

RichardScottOZ
Nov 3, 2024

I have never used Coiled, but yes: if things are happening locally and not on the cluster as expected, try explicitly sending those functions/commands to the cluster and see if the behaviour changes?

0 replies
