Reducing complexity of cc.querying.getvar #147

Open

angus-g opened this issue Aug 5, 2019 · 8 comments

angus-g (Collaborator) commented Aug 5, 2019

This is a bit of a follow-up and a collection of ideas from our meeting last week regarding cc.querying.getvar. In its current state, getvar is responsible for a lot of things:

  • interrogating the database to determine which files contain the desired variable
  • restricting those files based on the number of files and the start/end times of the data they contain
  • removing non-existent files from the database
  • offsetting or altering the units of the time dimension

The last of these is mostly unnecessary thanks to recent improvements in and around xarray regarding times. We can (and should) safely pull it out, exposing it through a separate function.

This leaves us with the core task, which is really just to return an appropriate xarray Dataset. For this we only need minimal arguments: variable, experiment, session (for the workflow we've set up), and optionally ncfile for disambiguation at the moment; any kwargs can be passed directly through to open_mfdataset. Note that ideally we don't even need the start/end times or number-of-files filters: the latter is an optimisation, and the former is an optimisation that still requires .sel() to be called on the resulting Dataset anyway.
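For illustration, a call against this reduced interface might look something like the sketch below; the variable name, ncfile and chunks values are made up, and session is assumed to already exist:

import cosima_cookbook as cc

ds = cc.querying.getvar(
    variable='temp',                                      # hypothetical variable name
    experiment='01deg_jra55v13_ryf8485_spinup6_000-413',
    session=session,                                      # existing database session
    ncfile='ocean.nc',                                    # optional disambiguation (hypothetical filename)
    chunks={'time': 1},                                   # extra kwargs pass straight through to open_mfdataset
)
ds = ds.sel(time=slice('1985-01', '1986-01'))             # time selection is left to the caller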

So this brings up two points: firstly, we're quite close to mimicking Intake with this reduced API; secondly, are those optimisations really required? open_mfdataset() has a parallel=True flag which makes it essentially do the following:

import dask
import xarray as xr

open_ = dask.delayed(xr.open_dataset)
datasets = [open_(p) for p in paths]         # one delayed open per file
(datasets,) = dask.compute(datasets)         # open all the files in parallel
combined = xr.combine_by_coords(datasets)    # then combine into a single Dataset

In my benchmarking, using the 01deg_jra55v13_ryf8485_spinup6_000-413 experiment with 414 files, it takes about 20s on a VDI node to perform the open_mfdataset call. In this case, it's pretty important to have a proper dask client (dask.distributed.Client, or the command-line equivalent): the default threaded scheduler performs horribly because of GIL contention between threads, and the multiprocessing scheduler isn't smart enough to optimise serialisation of data back to the main process. I will note that the slow part here is the number of files, not the size of the experiment: the 1deg_core_nyf_spinup_A experiment has 1212 files and takes around 2 minutes to open.
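For reference, getting a distributed client is as simple as the following (called with no arguments, this starts a local cluster of worker processes):

from dask.distributed import Client

client = Client()   # local cluster of worker processes, sidestepping GIL contention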

I think it's worth removing the time offsetting and database modification from getvar. Under the hood, we can split out a function that queries the file paths from the database, which getvar then calls, and which can additionally be used for database maintenance if required. This makes getvar essentially a wrapper over open_mfdataset that ensures we open files in a performant way (with parallel and preprocess); a rough sketch of this split follows below. For the time being, the other queries to whittle down the files should stay: first restrict based on the start/end times, then restrict to a total number of files if required.
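Something along these lines (function and argument names here are illustrative only, and the database query itself is stubbed out):

import xarray as xr

def ncfiles_for_variable(variable, experiment, session, ncfile=None):
    """Query the database for the paths of files containing `variable` (stub)."""
    raise NotImplementedError("database query goes here; also reusable for maintenance")

def getvar(variable, experiment, session, ncfile=None, **kwargs):
    """Thin wrapper over open_mfdataset that opens the queried files performantly."""
    paths = ncfiles_for_variable(variable, experiment, session, ncfile)
    return xr.open_mfdataset(
        paths,
        parallel=True,                        # delayed opens, computed together
        preprocess=lambda d: d[[variable]],   # keep only the requested variable
        combine='by_coords',
        **kwargs,
    )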

aekiss (Collaborator) commented Aug 5, 2019

OK, sounds good. Thanks for checking the runtimes. Was 2min for 1deg_core_nyf_spinup_A done with a dask client?

angus-g (Collaborator, Author) commented Aug 5, 2019

Yes, it was much worse with the multiprocessing scheduler.

jmunroe (Collaborator) commented Aug 7, 2019

The cookbook API predates Intake (and other features like the parallelization in open_mfdataset) but would have used them if they had existed. I agree that we should be moving towards converging this API with those other tools.

In addition to variable and experiment, I think we need to add the concept of frequency to disambiguate variables.
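For example (the frequency keyword and its value format here are purely hypothetical):

temp_monthly = getvar(variable='temp', experiment=expt, session=session, frequency='1 monthly')
temp_daily = getvar(variable='temp', experiment=expt, session=session, frequency='1 daily')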

Regarding session, I wonder if we should have a default session that, once created, gets reused by other calls. This is similar to how the dask.distributed.Client is used by xarray. A singleton design pattern might be good for this (see the sketch below).
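A minimal sketch of that, assuming the session is built with the existing cc.database.create_session() (the default_session name is made up):

import cosima_cookbook as cc

_DEFAULT_SESSION = None

def default_session():
    """Create the session on first use, then reuse it for all subsequent calls."""
    global _DEFAULT_SESSION
    if _DEFAULT_SESSION is None:
        _DEFAULT_SESSION = cc.database.create_session()
    return _DEFAULT_SESSION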

angus-g (Collaborator, Author) commented Sep 17, 2019

Looks like there's a fast path for open_mfdataset now (see pydata/xarray#1823), which may eliminate the need to restrict the number of files we try to open.

AndyHoggANU (Collaborator) commented

Hi @angus-g - just wondering if we have anything yet that will allow the database-building procedure to either (a) eliminate files from the database if they have been removed from the directory, and/or (b) re-index files that have changed (say, if a file initially wasn't collated properly and needed to be re-done).

At the moment, the only method I have for this is making a new database, which is a bit time-consuming for the large-scale default database.

angus-g (Collaborator, Author) commented Jan 17, 2020

Nothing exists as yet to actually do those things, but I think we know enough in the database to implement that. I’ll have a look into it!

aidanheerdegen (Collaborator) commented

Note that #184 fixes (a) above, @AndyHoggANU: by default, indexing will auto-prune missing files. Detecting changed files would require checking the modification timestamp against the timestamp recorded when the file was last indexed; after that most recent PR, this logic should be pretty easy to add here (a sketch follows after the link):

https://github.com/COSIMA/cosima-cookbook/blob/master/cosima_cookbook/database.py#L339
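For (b), the check itself could be as simple as this sketch, where last_indexed is assumed to be a Unix timestamp recorded at indexing time (the actual column name may differ):

import os

def needs_reindex(path, last_indexed):
    """True if the file on disk has been modified since it was last indexed."""
    return os.path.getmtime(path) > last_indexed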

AndyHoggANU (Collaborator) commented

👍
