Reducing complexity of cc.querying.getvar #147
Comments
OK, sounds good. Thanks for checking the runtimes. Was the 2 min figure with the distributed scheduler?
Yes, it was much worse with the multiprocessing scheduler.
The cookbook API predates Intake (and other features like the parallelization in open_mfdataset), but would have used them if they had existed. I agree that we should be moving towards convergence of the API with those other tools.
Looks like there's a fast path for
Hi @angus-g - just wondering if we have anything yet which will allow the database building procedure to either (a) eliminate files from the database if they have been removed from the directory, and/or (b) re-index files that have changed (say if a file initially wasn't collated properly and needed to be re-done). At the moment, the only method I have for this is making a new database, which is a bit time-consuming for the large-scale default database.
Nothing exists as yet to actually do those things, but I think the database holds enough information to implement them. I'll have a look into it!
Note that #184 fixes (a) above, @AndyHoggANU. By default, indexing will auto-prune missing files. Detecting changed files would require checking each file's modification timestamp against the timestamp recorded when it was last indexed; after that most recent PR, this logic should be pretty easy to add here.
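As a rough illustration of that check (the names here are hypothetical, not the cookbook's actual schema), the comparison could look something like this:

```python
import os
from datetime import datetime, timezone

def needs_reindex(path, indexed_at):
    """Return True if `path` was modified after it was last indexed.

    `indexed_at` is assumed to be a timezone-aware datetime recorded by
    the indexer; the real database may store this differently.
    """
    mtime = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
    return mtime > indexed_at

def stale_files(rows):
    """Given (path, indexed_at) pairs from the index, return files to re-index."""
    return [path for path, indexed_at in rows if needs_reindex(path, indexed_at)]
```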
👍
This is a bit of a follow-up, and a collection of ideas after our meeting last week regarding `cc.querying.getvar`. In its current state, getvar is responsible for a lot of things: querying the database for matching file paths, whittling the result down by start/end times and by a number of files, opening the files through `open_mfdataset`, modifying the database along the way, and offsetting the time axis. The last of these is mostly unnecessary due to recent improvements in and around xarray regarding times. We can safely (and should) pull this out, while exposing it through a separate function.
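As a very rough sketch of what such a standalone function could look like (the name and offset handling are assumptions, not an existing cookbook function):

```python
import datetime

def offset_times(ds, offset):
    """Return `ds` with `offset` added to its time coordinate.

    Hypothetical helper, kept separate from getvar; `offset` is e.g. a
    datetime.timedelta applied to an existing `time` coordinate.
    """
    return ds.assign_coords(time=ds["time"] + offset)

# Hypothetical usage on a Dataset `ds` returned by getvar:
# ds = offset_times(ds, datetime.timedelta(days=365))
```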
This leaves us with the core task, which is really just to return an appropriate xarray Dataset. For this, we only need minimal arguments: `variable`, `experiment`, and `session` (for the workflow we've set up), and optionally `ncfile` for disambiguation at the moment; any `kwargs` can be passed directly through to `open_mfdataset`. Note that ideally, we don't even need the start/end times or number-of-files filters: the latter is an optimisation, and the former is an optimisation that still requires `.sel()` to be called on the resultant Dataset anyway.

So, this brings up two points: firstly, we're quite close to mimicking Intake with this reduced API. Secondly, are those optimisations really required?
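To make that reduced API concrete, here is a minimal sketch of what getvar could boil down to; `_ncfiles_for_variable` is a hypothetical stand-in for the database query, and the preprocess step is only indicative:

```python
import xarray as xr

def getvar(variable, experiment, session, ncfile=None, **kwargs):
    """Open `variable` from `experiment` as an xarray Dataset (sketch only)."""
    # Hypothetical helper: the database query returning matching file paths.
    paths = _ncfiles_for_variable(variable, experiment, session, ncfile=ncfile)

    # Keep only the requested variable from each file.
    def preprocess(ds):
        return ds[[variable]]

    return xr.open_mfdataset(paths, preprocess=preprocess,
                             parallel=True, combine="by_coords", **kwargs)

# Time selection is then left to the caller, e.g.:
# ds = getvar("temp", "01deg_jra55v13_ryf8485_spinup6_000-413", session)
# ds = ds.sel(time=slice("1990-01-01", "1999-12-31"))
```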
`open_mfdataset()` has a `parallel=True` flag, which essentially makes it wrap each per-file `open_dataset` (and `preprocess`) call in `dask.delayed` and compute them all at once before combining the results.
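Roughly, and much simplified relative to xarray's actual implementation, that amounts to something like the sketch below, which is also why the choice of dask scheduler matters so much:

```python
import dask
import xarray as xr

def open_parallel(paths, preprocess=None, **kwargs):
    """Simplified sketch of what open_mfdataset(..., parallel=True) does."""
    open_ = dask.delayed(xr.open_dataset)
    datasets = [open_(p, **kwargs) for p in paths]
    if preprocess is not None:
        datasets = [dask.delayed(preprocess)(ds) for ds in datasets]
    # Every open (and preprocess) is a delayed task, so the whole list is
    # evaluated in parallel on whichever dask scheduler is active.
    datasets = dask.compute(*datasets)
    return xr.combine_by_coords(datasets)
```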
In my benchmarking, using the `01deg_jra55v13_ryf8485_spinup6_000-413` experiment with 414 files, it takes about 20s on a VDI node to perform the `open_mfdataset` call. In this case, it's pretty important to have a proper dask client (`dask.distributed.Client`, or the command-line equivalent). Using the default threaded scheduler performs horribly because of GIL contention between threads, and the multiprocessing scheduler isn't smart enough to optimise serialisation of data back to the main process. I will note that the slow part here is the number of files, not the size of the experiment: the `1deg_core_nyf_spinup_A` experiment has 1212 files, and takes around 2 minutes to open.

I think it's worth removing the time offsetting and database modification from getvar. Under the hood, we can split out a function that queries the file paths from the database, which getvar calls and which can additionally be used for database maintenance if required. This makes getvar essentially a wrapper over
`open_mfdataset` that ensures we open in a performant way (with `parallel` and `preprocess`). For the time being, the other queries to whittle down the files should stay: first, restrict based on the start/end times, then restrict to a total number of files if required.
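As a sketch of how those remaining filters might sit in front of the wrapper (the tuple layout of the query results, and the keep-the-last-n behaviour, are assumptions for illustration):

```python
def restrict_files(files, start_time=None, end_time=None, n_files=None):
    """Whittle down candidate files before handing them to open_mfdataset.

    `files` is assumed to be a list of (path, file_start, file_end) tuples
    returned by the database query; that layout is hypothetical.
    """
    if start_time is not None:
        files = [f for f in files if f[2] >= start_time]
    if end_time is not None:
        files = [f for f in files if f[1] <= end_time]
    if n_files is not None:
        # Assumption: a file-count restriction keeps only the last n_files.
        files = files[-n_files:]
    return [path for path, _, _ in files]
```

getvar would then pass the surviving paths straight to `open_mfdataset` as described above.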