
Did a package ever come out of this hackathon? #13

Open
hdrake opened this issue Feb 17, 2022 · 11 comments

Comments

@hdrake

hdrake commented Feb 17, 2022

Hi! I am writing some code to do similar things, i.e. subsample a high-resolution MITgcm simulation, and don't want to reinvent the wheel. Two questions:

  1. Did this ever result in a standalone package being developed (or is there a similar package that does this already)?
  2. If not, is it alright if I use some code from here to make one?
@dhruvbalwada
Collaborator

This never made it to a standalone package. Part of the reason is that we never thought of it.

Also, if you look through the notebooks (e.g. https://github.com/oceanhackweek/ohw21-proj-model-subsampling/blob/main/Application2LLC.ipynb), you will see that xarray's interp function does almost everything we need (cell 128). Most of the extra code is really just there to take care of the mapping between lat/lon and this particular MITgcm simulation's index-based dimensions.

Do you think this still requires a package of its own? If so, will the package just work for MITgcm?
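The core trick the notebooks rely on can be sketched with synthetic data (this is an illustrative example, not code from the hackathon repo): passing coordinates that share a new dimension to interp gives pointwise interpolation along a track rather than a full grid product.

```python
# Illustrative sketch (synthetic data): xarray's interp does the core
# of the subsampling when the indexers share a new dimension.
import numpy as np
import xarray as xr

# Fake gridded model output: THETA(time, lat, lon)
theta = xr.DataArray(
    np.random.rand(4, 21, 21),
    coords={'time': np.arange(4.0),
            'lat': np.linspace(-10, 10, 21),
            'lon': np.linspace(0, 20, 21)},
    dims=('time', 'lat', 'lon'),
    name='THETA',
)

# A "survey" track: coordinate arrays sharing the new dimension 'casts'
track_lon = xr.DataArray([1.0, 5.0, 9.0], dims='casts')
track_lat = xr.DataArray([-2.0, 0.0, 3.0], dims='casts')
track_time = xr.DataArray([0.5, 1.5, 2.5], dims='casts')

# Pointwise linear interpolation: the result has dims ('casts',),
# one value per cast, not a (lon, lat, time) cube.
sampled = theta.interp(lon=track_lon, lat=track_lat, time=track_time)
print(sampled.dims)  # ('casts',)
```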

@kdrushka
Collaborator

kdrushka commented Feb 17, 2022

Quick answer: I have been developing it here: https://github.com/kdrushka/oceanliner
Would love to collaborate on this if you're interested. Or you can just poach code if you find it useful :)
I'm headed out of the office for the long weekend but happy to discuss next week!

(most up-to-date and functional code is in the testing folder ...)

@dhruvbalwada
Collaborator

This looks nice; I had not realized that you were working on this.
I have starred it to see how it goes. :)

@kdrushka
Collaborator

kdrushka commented Feb 17, 2022

Thanks! Indeed, the xarray interp function does all the heavy lifting. I have mainly been working on the user inputs and outputs.

It is written for model output that has been converted to netCDF (while retaining the original variable names) but could easily be adapted to other formats (e.g., binary files).

@hdrake
Author

hdrake commented Feb 18, 2022

Thanks @dhruvbalwada and @kdrushka! What I had in mind was something in between both of your suggestions, i.e. a very light-weight wrapper for

survey = xr.Dataset(
    dict(
        lon=xr.DataArray(lon, dims='casts'),
        lat=xr.DataArray(lat, dims='casts'),
        time=xr.DataArray(time, dims='casts'),
    )
)
ds['THETA'].interp(survey)

that would robustly handle various use cases and provide some basic utilities.

Even if this code snippet isn't complicated, it's 5 lines longer than it should be and will have to be modified for every use case! I imagine something flexible enough to handle various permutations of common arguments like:

from xsample import sample
sample(ds['THETA'], ds_survey['THETA'], inner=['XC', 'YC', 'time'])  # for xr.Dataset with dimension
sample(ds[['THETA', 'SALT']], df_survey, inner=['XC', 'YC', 'Z', 'time'], outer='point')  # for pd.DataFrame
sample(ds['UVEL'], {'XG': XG, 'YC': YC}, outer='mooring')  # for np.array

where outer just refers to the name of the broadcast dimension; it would default to the dimension of the xr.Dataset or, if the input is a pd.DataFrame, a list, or a dict, it would broadcast to a new dimension cast with integer indices.
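The proposed wrapper could be sketched roughly as follows. Note that xsample does not exist; sample, inner, and outer are the hypothetical names from the comment above, and the synthetic field stands in for real model output.

```python
# Illustrative sketch of the hypothetical sample(da, targets, inner, outer)
# wrapper proposed above. Not a real package.
import numpy as np
import xarray as xr


def sample(da, targets, inner, outer='casts'):
    """Interpolate da at the coordinate values named in inner, taken from
    targets (an xr.Dataset, pd.DataFrame, or dict of array-likes), and
    broadcast the results along a single new dimension named outer."""
    coords = {
        name: xr.DataArray(np.asarray(targets[name]), dims=outer)
        for name in inner
    }
    return da.interp(coords)


# Synthetic gridded field to exercise the wrapper
da = xr.DataArray(
    np.random.rand(5, 11, 11),
    coords={'time': np.arange(5.0),
            'YC': np.linspace(-5, 5, 11),
            'XC': np.linspace(0, 10, 11)},
    dims=('time', 'YC', 'XC'),
)

# A dict of array-likes broadcasts to the new 'point' dimension
result = sample(
    da,
    {'XC': [1.5, 2.5], 'YC': [0.0, 1.0], 'time': [0.5, 1.5]},
    inner=['XC', 'YC', 'time'],
    outer='point',
)
print(result.dims)  # ('point',)
```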

@hdrake
Author

hdrake commented Feb 18, 2022

Anecdotally, I am even having trouble getting this to compute() for my large MITgcm dataset, which I have to chunk in both time and depth. The performance is much worse than it should be, I think because of an inefficient dask task graph under the hood... Did you run into any of these issues?
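One common mitigation for this kind of slowdown (sketched here on synthetic in-memory data; whether it helps the real chunked dataset is untested) is to subset the model field to a bounding box around the survey before calling interp, so the task graph only touches the needed region.

```python
# Sketch of a workaround: slice a padded bounding box around the track
# before interpolating, so interp only sees the relevant region.
import numpy as np
import xarray as xr

theta = xr.DataArray(
    np.random.rand(10, 100, 100),
    coords={'time': np.arange(10.0),
            'lat': np.linspace(-50, 50, 100),
            'lon': np.linspace(0, 99, 100)},
    dims=('time', 'lat', 'lon'),
)

track_lon = xr.DataArray([10.0, 12.0, 14.0], dims='casts')
track_lat = xr.DataArray([0.0, 1.0, 2.0], dims='casts')
track_time = xr.DataArray([2.5, 3.5, 4.5], dims='casts')

pad = 1.0  # margin so linear interpolation has neighbors on both sides
subset = theta.sel(
    lon=slice(float(track_lon.min()) - pad, float(track_lon.max()) + pad),
    lat=slice(float(track_lat.min()) - pad, float(track_lat.max()) + pad),
    time=slice(float(track_time.min()) - pad, float(track_time.max()) + pad),
)
sampled = subset.interp(lon=track_lon, lat=track_lat, time=track_time)
```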

@dhruvbalwada
Collaborator

I personally did not dig into the optimal behavior much when we were working on this. But I do remember wondering about it, especially for datasets from the LLC4320 run, though I don't remember what the chunking was. Maybe @kdrushka knows more, since she has been working with the SWOT crossover cutouts?

Do you have a notebook with what you are trying to do? If you post a notebook as a gist showing the chunking etc. for your particular example, I can take a look if I get some time, or ask people around me who are much better at task graphs.

@iuryt
Member

iuryt commented Feb 19, 2022

> Thanks @dhruvbalwada and @kdrushka! What I had in mind was something in between both of your suggestions, i.e. a very light-weight wrapper [...] that would robustly handle various use cases and provide some basic utilities.

I like your suggestion and believe it could be a nice solution. @kdrushka and @dhruvbalwada, should we create a separate issue to tackle this implementation?
Depending on how it works out, we could also think about publishing it on pip and conda-forge.

@kdrushka
Collaborator

@hdrake, I like your suggestion - this is much simpler than what I'm doing.

I am also struggling with memory, in part because I am still new to Python/dask, so I am mostly guessing at how to optimize the code. That said, it runs (albeit slowly) on the LLC4320 cutouts I'm working with.

@hdrake
Author

hdrake commented Feb 26, 2022

> I like your suggestion and believe that this could be a nice solution for that. @kdrushka and @dhruvbalwada Should we create a separate issue to tackle this implementation? Depending on how this works out, we could also think on publishing it on pip and conda-forge.

I agree it could be worth putting on pip / conda-forge if it works out, but in that case it should be extremely lightweight: as few dependencies as possible, no notebooks or figures in the repo to keep it small, etc.

I'm a bit swamped with OSM this next week but will return to this in March.

@kdrushka
Collaborator

> I agree it could be worth putting on pip / conda-forge if it works out, but in that case it should be extremely lightweight. As few dependencies as possible, no notebooks or figures in the Repo to keep it small, etc. I'm a bit swamped with OSM this next week but will return to this in March.

Sure! Let me know if/how I can help.
