Reducing complexity of cc.querying.getvar #147

Open

angus-g opened this issue Aug 5, 2019 · 8 comments

angus-g (Collaborator) commented Aug 5, 2019

This is a bit of a follow-up and a collection of ideas from our meeting last week regarding cc.querying.getvar. In its current state, getvar is responsible for a lot of things:

  • interrogating the database to determine which files contain the desired variable
  • restricting those files based on the number of files and the start/end times of the data they contain
  • removing non-existent files from the database
  • offsetting or altering the units of the time dimension

The last of these is mostly unnecessary thanks to recent improvements in and around xarray regarding times. We can (and should) safely pull it out, exposing it through a separate function.

This leaves us with the core task, which is really just to return an appropriate xarray Dataset. For this we only need minimal arguments: variable, experiment, session (for the workflow we've set up), and optionally ncfile for disambiguation at the moment; any kwargs can be passed directly through to open_mfdataset. Note that ideally we don't even need the start/end times or number-of-files filters: the latter is an optimisation, and the former is an optimisation that still requires .sel() to be called on the resulting Dataset anyway.
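For illustration, a call against this reduced interface might look something like the sketch below; the variable name, ncfile and chunks values are made up, and session is assumed to already exist:

import cosima_cookbook as cc

ds = cc.querying.getvar(
    variable='temp',                                      # hypothetical variable name
    experiment='01deg_jra55v13_ryf8485_spinup6_000-413',
    session=session,                                      # existing database session
    ncfile='ocean.nc',                                    # optional disambiguation (hypothetical filename)
    chunks={'time': 1},                                   # extra kwargs pass straight through to open_mfdataset
)
ds = ds.sel(time=slice('1985-01', '1986-01'))             # time selection is left to the caller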

So this brings up two points: firstly, we're quite close to mimicking Intake with this reduced API; secondly, are those optimisations really required? open_mfdataset() has a parallel=True flag which makes it essentially do the following:

import dask
import xarray as xr

open_ = dask.delayed(xr.open_dataset)
datasets = [open_(p) for p in paths]         # one delayed open per file
(datasets,) = dask.compute(datasets)         # open all the files in parallel
combined = xr.combine_by_coords(datasets)    # then combine into a single Dataset

In my benchmarking, using the 01deg_jra55v13_ryf8485_spinup6_000-413 experiment with 414 files, it takes about 20s on a VDI node to perform the open_mfdataset call. In this case, it's pretty important to have a proper dask client (dask.distributed.Client, or the command-line equivalent): the default threaded scheduler performs horribly because of GIL contention between threads, and the multiprocessing scheduler isn't smart enough to optimise serialisation of data back to the main process. I will note that the slow part here is the number of files, not the size of the experiment: the 1deg_core_nyf_spinup_A experiment has 1212 files and takes around 2 minutes to open.
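For reference, getting a distributed client is as simple as the following (called with no arguments, this starts a local cluster of worker processes):

from dask.distributed import Client

client = Client()   # local cluster of worker processes, sidestepping GIL contention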

I think it's worth removing the time offsetting and database modification from getvar. Under the hood, we can split out a function that queries the file paths from the database, which getvar then calls, and which can additionally be used for database maintenance if required. This makes getvar essentially a wrapper over open_mfdataset that ensures we open files in a performant way (with parallel and preprocess); a rough sketch of this split follows below. For the time being, the other queries to whittle down the files should stay: first restrict based on the start/end times, then restrict to a total number of files if required.
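Something along these lines (function and argument names here are illustrative only, and the database query itself is stubbed out):

import xarray as xr

def ncfiles_for_variable(variable, experiment, session, ncfile=None):
    """Query the database for the paths of files containing `variable` (stub)."""
    raise NotImplementedError("database query goes here; also reusable for maintenance")

def getvar(variable, experiment, session, ncfile=None, **kwargs):
    """Thin wrapper over open_mfdataset that opens the queried files performantly."""
    paths = ncfiles_for_variable(variable, experiment, session, ncfile)
    return xr.open_mfdataset(
        paths,
        parallel=True,                        # delayed opens, computed together
        preprocess=lambda d: d[[variable]],   # keep only the requested variable
        combine='by_coords',
        **kwargs,
    )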

aekiss (Collaborator) commented Aug 5, 2019

OK, sounds good. Thanks for checking the runtimes. Was 2min for 1deg_core_nyf_spinup_A done with a dask client?

angus-g (Collaborator, Author) commented Aug 5, 2019

Yes, it was much worse with the multiprocessing scheduler.

jmunroe (Collaborator) commented Aug 7, 2019

The cookbook API predates Intake (and other features like the parallelization in open_mfdataset) but would have used them if they had existed. I agree that we should be moving towards converging this API with those other tools.

In addition to variable and experiment, I think we need to add the concept of frequency to disambiguate variables.
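For example (the frequency keyword and its value format here are purely hypothetical):

temp_monthly = getvar(variable='temp', experiment=expt, session=session, frequency='1 monthly')
temp_daily = getvar(variable='temp', experiment=expt, session=session, frequency='1 daily')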

Regarding session, I wonder if we should have a default session that, once created, gets reused by other calls. This is similar to how the dask.distributed.Client is used by xarray. A singleton design pattern might be good for this (see the sketch below).
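A minimal sketch of that, assuming the session is built with the existing cc.database.create_session() (the default_session name is made up):

import cosima_cookbook as cc

_DEFAULT_SESSION = None

def default_session():
    """Create the session on first use, then reuse it for all subsequent calls."""
    global _DEFAULT_SESSION
    if _DEFAULT_SESSION is None:
        _DEFAULT_SESSION = cc.database.create_session()
    return _DEFAULT_SESSION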

angus-g (Collaborator, Author) commented Sep 17, 2019

Looks like there's a fast path for open_mfdataset now (see pydata/xarray#1823), which may eliminate the need to restrict the number of files we try to open.

AndyHoggANU (Collaborator) commented

Hi @angus-g - just wondering if we have anything yet that will allow the database-building procedure to either (a) eliminate files from the database if they have been removed from the directory, and/or (b) re-index files that have changed (say, if a file initially wasn't collated properly and needed to be re-done).

At the moment, the only method I have for this is making a new database, which is a bit time-consuming for the large-scale default database.

angus-g (Collaborator, Author) commented Jan 17, 2020

Nothing exists as yet to actually do those things, but I think we know enough in the database to implement that. I’ll have a look into it!

aidanheerdegen (Collaborator) commented

Note that #184 fixes (a) above, @AndyHoggANU: by default, indexing will auto-prune missing files. Detecting changed files would require checking the modification timestamp against the timestamp recorded when the file was last indexed; after that most recent PR, this logic should be pretty easy to add here (a sketch follows after the link):

https://github.com/COSIMA/cosima-cookbook/blob/master/cosima_cookbook/database.py#L339
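For (b), the check itself could be as simple as this sketch, where last_indexed is assumed to be a Unix timestamp recorded at indexing time (the actual column name may differ):

import os

def needs_reindex(path, last_indexed):
    """True if the file on disk has been modified since it was last indexed."""
    return os.path.getmtime(path) > last_indexed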

AndyHoggANU (Collaborator) commented

👍
