Catalog search on derived variable fails if the variable column is iterable. #684

Closed
rbeucher opened this issue Nov 1, 2024 · 3 comments · Fixed by #685

rbeucher (Collaborator) commented Nov 1, 2024

Description

Searching a catalog for a derived variable fails when the variable column is iterable (that is, listed in columns_with_iterables).

What I Did

Example csv

model_id,ensemble_id,model_institution_id,model_id,experiment_id,frequency_id,realm_id,member_id,variable_id,file
exp1,CMIP6,CSIRO,ACCESS-ESM1-5,historical,mon,atm,r1i1p1,"['tasmax']",file1
exp1,CMIP6,CSIRO,ACCESS-ESM1-5,historical,mon,atm,r1i1p1,"['pr', 'tas']",file2
exp2,CMIP6,CSIRO,ACCESS-ESM1-5,historical,mon,atm,r1i1p1,"['pr', 'tasmin']",file3
exp3,CMIP6,CSIRO,ACCESS-ESM1-5,historical,mon,atm,r1i1p1,"['tasmax']",file4
exp4,CMIP6,CSIRO,ACCESS-ESM1-5,historical,mon,atm,r1i1p1,"['pr']",file5
exp5,CMIP6,CSIRO,ACCESS-ESM1-5,historical,mon,atm,r1i1p1,"['prsn']",file6
exp6,CMIP6,CSIRO,ACCESS-ESM1-5,historical,mon,atm,r1i1p1,"['tasmax']",file7
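
For readers unfamiliar with iterable columns: the variable_id cells above are stringified Python lists, which have to be parsed back into sequences before they can be searched. A minimal sketch of that parsing in plain pandas, assuming ast.literal_eval as the converter and using a shortened, hypothetical two-column version of the CSV (this illustrates the idea and is not a copy of intake-esm's loading code):

```python
import ast
import io

import pandas as pd

# Shortened, hypothetical version of the example CSV (two columns only).
csv_text = '''file,variable_id
file1,"['tasmax']"
file2,"['pr', 'tas']"
'''

# Parse each variable_id cell from its string form into a real Python list.
df = pd.read_csv(
    io.StringIO(csv_text),
    converters={"variable_id": ast.literal_eval},
)

parsed = df["variable_id"].tolist()
print(parsed)
```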

catalog.json

{
    "esmcat_version": "0.1.0",
    "assets": {
        "column_name": "file",
        "format": "netcdf"
    },
    "aggregation_control": {
        "variable_column_name": "variable_id",
        "groupby_attrs": [
            "model_id",
            "realm_id",
            "frequency_id"
        ],
        "aggregations": [
            {
                "type": "join_new",
                "attribute_name": "member_id"
            },
            {
                "type": "union",
                "attribute_name": "variable_id"
            }
        ]
    },
    "attributes": [],
    "catalog_file": "test.csv"
}

Running the following code raises an error:

import intake
import intake_esm

dvr = intake_esm.DerivedVariableRegistry()

@dvr.register(variable="myvar", query={"variable_id": "pr"})
def calc_myvar(ds):
    return ds.pr

cat = intake.open_esm_datastore("catalog.json", columns_with_iterables=["variable_id"], registry=dvr)
cat.search(variable_id=["myvar"]).df
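
For reference, the dependency query {"variable_id": "pr"} would be expected to select the files whose variable list contains pr. The sketch below reproduces that selection in plain pandas, using the file and variable pairs from the example CSV stored as tuples. It is independent of intake-esm's own search implementation, and the membership test is an assumption about the intended semantics:

```python
import pandas as pd

# File / variable pairs taken from the example CSV, stored as tuples.
df = pd.DataFrame(
    {
        "file": ["file1", "file2", "file3", "file4", "file5", "file6", "file7"],
        "variable_id": [
            ("tasmax",),
            ("pr", "tas"),
            ("pr", "tasmin"),
            ("tasmax",),
            ("pr",),
            ("prsn",),
            ("tasmax",),
        ],
    }
)

# Keep rows whose variable tuple contains "pr" as an exact element,
# so "prsn" does not match.
mask = df["variable_id"].apply(lambda variables: "pr" in variables)
matching_files = df.loc[mask, "file"].tolist()
print(matching_files)
```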

I believe the issue comes from:

File intake_esm/core.py:425, in esm_datastore.search(self, require_all_on, **query)
    419                 derived_cat_subset[key] = value
    421 if derivedcat_results:
    422     # Merge results from the main and the derived catalogs
    423     esmcat_results = (
    424         pd.concat([esmcat_results, *derivedcat_results])
--> 425         .drop_duplicates()
    426         .reset_index(drop=True)
    427     )
    429 cat = self.__class__({'esmcat': self.esmcat.dict(), 'df': esmcat_results})
    430 cat.esmcat.catalog_file = None  # Don't save the catalog file
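
The exception message itself is not included above. One failure mode consistent with this traceback, offered as an assumption rather than a confirmed diagnosis, is that the concatenated frame holds list values in the iterable column: pandas has to hash every cell in drop_duplicates, and lists are unhashable. A minimal sketch in plain pandas, with hypothetical data, showing the failure and a tuple-based workaround:

```python
import pandas as pd

# Hypothetical frame with two identical rows and a list-valued column.
with_lists = pd.DataFrame(
    {"file": ["file5", "file5"], "variable_id": [["pr"], ["pr"]]}
)

# drop_duplicates hashes each cell, which fails for lists.
try:
    with_lists.drop_duplicates()
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Converting the lists to tuples makes the cells hashable again.
hashable = with_lists.assign(variable_id=with_lists["variable_id"].map(tuple))
deduplicated = hashable.drop_duplicates().reset_index(drop=True)
print(deduplicated)
```

Whether the fix in #685 takes this route is not stated in this thread.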

Version information: output of intake_esm.show_versions()

import intake_esm

intake_esm.show_versions()
INSTALLED VERSIONS
------------------

cftime: 1.6.4
dask: 2024.10.0
fastprogress: 1.0.3
fsspec: 2024.10.0
gcsfs: None
intake: 0.7.0
intake_esm: 2024.2.6
netCDF4: 1.7.2
pandas: 2.2.3
requests: 2.32.3
s3fs: None
xarray: 2024.10.0
zarr: 2.18.3

rbeucher (Collaborator, Author) commented Nov 1, 2024

@charles-turner-1 FYI

charles-turner-1 (Collaborator) commented

@rbeucher will take a look.

rbeucher (Collaborator, Author) commented Nov 1, 2024

I am looking at it now. Will let you know how it goes.
