Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot open coordinate variables from Intake-ESM datastores #660

Open
dougiesquire opened this issue Mar 5, 2024 · 3 comments
Open

Cannot open coordinate variables from Intake-ESM datastores #660

dougiesquire opened this issue Mar 5, 2024 · 3 comments

Comments

@dougiesquire
Copy link
Contributor

Description

I'd like to include coordinate variables in my Intake-ESM datastore so that they can be searched for and opened like any other variable. However, Intake-ESM currently only allows opening the xarray data_vars from an asset. If I try to open a coordinate variable from my datastore I am returned an empty xarray Dataset.

The obvious fix is to edit the line in the link above to include any variable in ds, but there may be a reason I don't understand for why things are done how they are currently.

What I Did

To reproduce requires a multi-variable catalog that includes a coordinate variable in the lists of variables. E.g. one can edit the first row of tests/sample-catalogs/multi-variable-catalog.csv to include "TLAT" in the list of variables. Then:

import intake

cat = intake.open_esm_datastore(
    "tests/sample-catalogs/multi-variable-catalog.json",
    columns_with_iterables=["variable"]
)

cat.search(variable="TLAT").to_dask()

returns a Dataset that does not contain "TLAT"

Version information: output of intake_esm.show_versions()

INSTALLED VERSIONS

cftime: 1.6.3
dask: 2024.2.1
fastprogress: 1.0.3
fsspec: 2024.2.0
gcsfs: 2024.2.0
intake: 0.7.0
intake_esm: 2023.6.14.post31+g141425b
netCDF4: 1.6.5
pandas: 2.2.1
requests: 2.31.0
s3fs: 2024.2.0
xarray: 2024.2.0
zarr: 2.17.0

@andersy005
Copy link
Member

andersy005 commented Mar 5, 2024

👋🏽 @dougiesquire,

the primary reason why we are using ds.data_vars instead of ds.variables is that we don't want xarray to perform unnecessary and expensive checks on the coordinates when merging or concatenating datasets. In the past, merging and concatenating model data produced on different machines often failed due to roundoff errors in coordinates, which was a considerable issue. by using data_vars(), we were able to bypass these unnecessary coordinate comparisons (e.g. when concatenating two variables from same model but with compatible grids) in xarray. however, I'm not sure if this is still necessary.

So, I'm in favor of trying to relax this requirement to see what happens. another option would be to introduce a configurable setting that allows users to turn on the use of ds.variables in place of ds.data_vars. this latter approach is likely better if we aim to ensure backward compatibility.

@dougiesquire
Copy link
Contributor Author

Hello and thanks @andersy005! I could well be misunderstanding, but I think using ds.variables would only change things for users who have included coordinate variables in their catalog (and presumably want to be able to open them). I don’t understand why there would be any performance implications: variables is constructed only from variables included in the catalog, so I don’t think any additional coordinates should be loaded unless they are explicitly asked for from the catalog. FWIW, changing to ds.variables doesn’t appear to make the tests run any slower.

If you agree, I advocate that we don’t add the additional complexity of making this change configurable.

@mgrover1
Copy link
Collaborator

@dougiesquire - checking back in here, are you interested in making a pull request with the fix you mentioned?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants