Add a DerivedCatalog object to deal with derived variables #357

Open
mgrover1 opened this issue Aug 17, 2021 · 2 comments

@mgrover1
Collaborator

mgrover1 commented Aug 17, 2021

Similar to the development in esds-funnel, we think it would be useful to be able to add "derived variables" to a catalog, accessible via an API similar to this:

DerivedCatalog.add_variable(intake_esm_catalog, variable='TEMP_100m', dependent_variable=['TEMP'])
@mgrover1
Collaborator Author

The result (after adding an SST variable) would be:
[Screenshot from 2021-08-17, not preserved]

@andersy005
Member

andersy005 commented Oct 13, 2021

I took a stab at this. My current approach is similar to Matt's in that I'm keeping track of each derived variable's info in a registry attached to the intake_esm catalog object via a .derivedcat attribute.

Initially, this derivedcat registry is empty:

In [1]: import intake, intake_esm

In [2]: cat = intake.open_esm_datastore("./tests/sample-collections/catalog-dict-records.json")

In [4]: cat.unique()
Out[4]: 
component                                                       [atm]
frequency                                                     [daily]
experiment                                                      [20C]
variable                             [FLNS, FLNSC, FLUT, FSNS, FSNSC]
path                [s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNS...
derived_variable                                                   []
dtype: object

The user can register their derivation function via a decorator.

In [5]: @intake_esm.register_derived_variable(varname="FOO", required=[{'variable': "TEMP", "component": "ocn"}])
   ...: def func(ds):
   ...:     return ds.TEMP + 1
   ...: 

The user should be able to validate the derived catalog at any time via:

In [9]: cat.validate_derivedcat()
Looks good!

This validation method looks like:
        # Check every registered derived variable against the ESM catalog columns.
        for key, entry in self.derivedcat.items():
            for req in entry.required:
                # Every key in a requirement must be a column of the catalog.
                for col in req:
                    if col not in self.esmcat.df.columns:
                        raise ValueError(
                            f"{key} requires {col} to be in the ESM catalog columns: {self.esmcat.df.columns.tolist()}"
                        )
                # Every requirement must name the variable it depends on.
                if self.esmcat.aggregation_control.variable_column_name not in req:
                    raise ValueError(
                        f"Variable derivation requires *{self.esmcat.aggregation_control.variable_column_name}* to be in the dictionary of requirements: {req}"
                    )
        # Reached only when no requirement raised an error.
        print('Looks good!')
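The same two checks can be tried outside the catalog class. The sketch below is a standalone approximation under stated assumptions: `validate_requirements` is a hypothetical helper name, the registry is reduced to a plain dict mapping each derived variable to its list of requirement dicts, and the catalog columns are passed in as a list instead of being read from `esmcat.df`.

```python
from typing import Dict, List


def validate_requirements(
    derived: Dict[str, List[Dict[str, str]]],
    columns: List[str],
    variable_column_name: str = "variable",
) -> None:
    """Raise ValueError if any requirement cannot be matched to the catalog."""
    for key, requirements in derived.items():
        for req in requirements:
            # Check 1: every requirement key must be a catalog column.
            for col in req:
                if col not in columns:
                    raise ValueError(
                        f"{key} requires {col} to be in the catalog columns: {columns}"
                    )
            # Check 2: the requirement must say which variable it needs.
            if variable_column_name not in req:
                raise ValueError(
                    f"{key} must include {variable_column_name!r} in its requirements: {req}"
                )


columns = ["component", "frequency", "experiment", "variable", "path"]
validate_requirements(
    {"FOO": [{"variable": "TEMP", "component": "ocn"}]}, columns
)
```

Note that both checks only look at column names. Neither verifies that a row with `variable == "TEMP"` and `component == "ocn"` actually exists in the catalog.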

Operations like nunique() and unique() merge the information from both the main (base) catalog and the derived variable registry:

In [6]: cat.unique()
Out[6]: 
component                                                       [atm]
frequency                                                     [daily]
experiment                                                      [20C]
variable                             [FLNS, FLNSC, FLUT, FSNS, FSNSC]
path                [s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNS...
derived_variable                                                [FOO]
dtype: object

In [8]: cat.derivedcat
Out[8]: {'FOO': DerivedVariable(func=<function func at 0x1072dc310>, required=[{'variable': 'TEMP', 'component': 'ocn'}])}
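One way to picture the merge behind `unique()` is sketched below. This is an assumption about the behaviour, not the actual intake_esm code: `merged_unique` is a hypothetical helper, and the base catalog is modelled as a list of row dicts instead of a DataFrame.

```python
from typing import Dict, Iterable, List


def merged_unique(
    records: List[Dict[str, str]], derived_names: Iterable[str]
) -> Dict[str, List[str]]:
    """Collect unique values per base-catalog column, then add the derived names."""
    summary: Dict[str, List[str]] = {}
    for record in records:
        for column, value in record.items():
            values = summary.setdefault(column, [])
            if value not in values:  # keep first-seen order, drop repeats
                values.append(value)
    # The derived registry contributes one extra pseudo-column.
    summary["derived_variable"] = sorted(derived_names)
    return summary


records = [
    {"component": "atm", "variable": "FLNS"},
    {"component": "atm", "variable": "FLUT"},
]
summary = merged_unique(records, ["FOO"])
```

With an empty registry the extra `derived_variable` entry is simply an empty list, which matches the first `cat.unique()` output in this comment.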
  • Is this API good enough?
  • How should .search() work?
    1. Should we return subsets of both the main (base) catalog and the derived catalog, or
    2. should we keep the derived catalog intact, i.e. return the subset of the base catalog plus everything in the derived catalog?
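To make the two `.search()` options concrete, here is a rough sketch of both. Everything in it is hypothetical: the function names, the list-of-dicts catalog, and in particular the reading of option 1 as "keep a derived variable only if each of its requirements still matches at least one row of the subset".

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]
Registry = Dict[str, List[Row]]  # derived name -> list of requirement dicts


def search_base(records: List[Row], **query: str) -> List[Row]:
    """Return rows whose columns equal every key/value pair in the query."""
    return [r for r in records if all(r.get(k) == v for k, v in query.items())]


def search_subset_both(
    records: List[Row], derived: Registry, **query: str
) -> Tuple[List[Row], Registry]:
    """Option 1: subset the base catalog and prune the derived registry."""
    subset = search_base(records, **query)
    kept = {
        name: reqs
        for name, reqs in derived.items()
        # Keep only if every requirement is still satisfiable in the subset.
        if all(search_base(subset, **req) for req in reqs)
    }
    return subset, kept


def search_keep_derived(
    records: List[Row], derived: Registry, **query: str
) -> Tuple[List[Row], Registry]:
    """Option 2: subset the base catalog, leave the derived registry intact."""
    return search_base(records, **query), dict(derived)


records = [
    {"variable": "TEMP", "component": "ocn"},
    {"variable": "FLNS", "component": "atm"},
]
derived = {"FOO": [{"variable": "TEMP", "component": "ocn"}]}
```

Searching for `component="atm"` drops `FOO` under option 1, because its `TEMP` requirement is no longer in the subset, while option 2 still lists `FOO` even though it can no longer be computed from the returned rows.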

Cc @matt-long, @kmpaul, @mgrover1

@andersy005 andersy005 linked a pull request Oct 14, 2021 that will close this issue
7 tasks
@andersy005 andersy005 removed a link to a pull request Oct 14, 2021
7 tasks