Proposed Recipes for ERA5 #92
Comments
I was just talking about this with @spencerahill. We also had a meeting with ECMWF about this last spring. It is a really big job. |
Yes. Context: 3-yr NSF CLD grant starting hopefully within next month or two, 6 mo / yr my time and hopefully next year a graduate student. We're doing wavenumber-frequency spectral analysis of energy transports in low latitudes using ERA5. So that requires 6-hourly or higher resolution, up to a dozen or so vertically defined variables. Many TBs. My default plan was to just use the CDSAPI to download it to local cluster but talking w/ Ryan sounds like this could plug into pangeo efforts nicely! Which would be fun for me, having been mostly watching from the sidelines for a few years now :) |
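For a rough sense of what "many TBs" means here, the following back-of-envelope sketch estimates the uncompressed size of 6-hourly pressure-level fields. It assumes the 0.25 degree grid (721 x 1440 points), 37 pressure levels, and 4-byte floats; the function name and defaults are illustrative only, and files delivered by the CDS are typically packed or compressed, so real downloads may be smaller.

```python
def era5_uncompressed_bytes(n_vars, n_years, steps_per_day=4,
                            n_lat=721, n_lon=1440, n_levels=37,
                            bytes_per_value=4):
    """Rough uncompressed size in bytes (illustrative estimate only).

    Assumes a 0.25 degree lat/lon grid, 37 pressure levels and float32
    values; leap days are ignored.
    """
    values_per_field = n_lat * n_lon * n_levels
    n_steps = steps_per_day * 365 * n_years
    return n_vars * n_steps * values_per_field * bytes_per_value


# A dozen 6-hourly, vertically resolved variables for a single year
size = era5_uncompressed_bytes(n_vars=12, n_years=1)
print(f"{size / 1e12:.2f} TB per year for 12 variables")
```

Under these assumptions a dozen variables come to a few terabytes per year, so a multi-decade record quickly reaches tens of terabytes.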
It is! But this is the crew to do it! We're in a similar position to you, @spencerahill. We are currently using the AWS public dataset but we're about to outgrow the offerings there. We could pull data from the CDSAPI to our own cloud bucket but that runs counter to the mission here.
@rabernat - Any takeaways you can share? |
Here are some notes from Baudouin Raoult after our meeting last April
That was sufficiently intimidating that we tabled the discussion for the time being. Now that Pangeo Forge is farther along, and we have people who are interested in working on it, I think we can pick it up again. |
Exciting! A few questions/comments:
|
To clarify, what Baudouin was proposing was that we go around the CDS API and talk directly to MARS, their internal archival system. |
As a heads up, there is a full copy of the Copernicus 0.25 degree version of the ERA-5 Reanalysis dataset maintained at NCAR under the following dataset collections, which are preserved as time-series NetCDF-4 files: These may be easier to access and stage to public cloud resources unless you need the raw spherical harmonics and reduced gaussian grids at model level resolution, which are only available through ECMWF MARS. You can also access the NCAR maintained datasets by direct read from NCAR HPC systems as an NSF funded researcher. https://www2.cisl.ucar.edu/user-support/allocations/university-allocations |
Cross-reference: #22 |
Question (which may reveal just how little I've worked with and understand the cloud): would it be useful to this effort to have some nontrivial chunk of the ERA5 data downloaded to the cluster at Columbia (berimbau) we're using for our project, to subsequently be uploaded to the cloud? My big concern w/r/t tying my project's science tightly to this pangeo-ERA5 effort is our project's science potentially getting held up, maybe in a big way, if there end up being unforeseen delays etc. in getting the data onto the cloud. Whereas I already have a functional pipeline for downloading the ERA5 data I'll need directly to that cluster via the CDS API, as well as the computational power I'll need at least for the preliminary analyses. So, in this scenario, I'd start downloading the data I need basically right away to our cluster, and then once on the pangeo side things are ready I could upload from our cluster to the cloud. The upside for pangeo of this direct transfer from us would be no waiting on the CDS system queue etc. Thoughts? |
Short answer: no, it would not be particularly useful for you to manually download data and store it on your cluster. That is sort of the status quo that we are trying to escape with Pangeo Forge. The problems with that workflow are
The goal with Pangeo Forge is to develop fully automated, reproducible pipelines for building datasets. However, I recognize the tension here: you want to move forward with your science, and you need ERA5 data to do so. You can't wait a year for us to sort out all of these issues. Here is an intermediate plan that I think might balance the two objectives. You should write a Pangeo Forge recipe to build the dataset you need on berimbau. The process of writing this recipe will be very useful for the broader effort. Note that this won't be possible until pangeo-forge/pangeo-forge-recipes#245 is done, since that will be required to get data from the CDS API. @alxmrs has also been working on an "ERA5 downloader" tool which could be very useful here. Alex, is that released yet? So a possible order of operations would be:
|
Exactly. Thanks @rabernat. That all makes sense. I'm subscribed to the relevant pangeo repos now and in particular will keep an eye on pangeo-forge/pangeo-forge-recipes#245. And going through the docs/tutorials sounds like a great task when I'm procrastinating on a review / revision / etc. in the near-ish term future. |
@spencerahill, once you've started on your recipe, please feel free to @ me in a comment here with any questions. The documentation is far from exhaustive, so don't be discouraged if there's something that doesn't make sense. I'll make sure any questions you have get answered, and we can use any insights we gain to improve the official docs. |
Excellent, thanks! Also IIRC you are at least sometimes in-person at Lamont(?) If so would be fun to meet+chat in person too |
Hey Ryan! The release is in progress – I have just submitted the codebase for internal approval. Not sure about the ETA, since we are in late December. Usually, this last part of the process takes about 1 week. As soon as it's public, I will post about it here. |
Sadly I'm rarely there as I work out of a home office in California. Even if you sail through the recipe development process without any issues, I'd love to set aside some time to catch up over video either way. 😊 |
I'm happy to announce that the aforementioned tools to help download ERA5 data are public, as of today! Please check them out. @spencerahill @jhamman, I'm happy to answer any questions you have along the way. CC: @rabernat |
There are currently a few subsets of the ERA5 dataset on cloud storage (example) but none are complete or updated regularly. It won't be a trivial recipe to implement with Pangeo Forge, but it would be a good stretch goal to support such a dataset.
Source Dataset
cdsapi or cdstoolbox APIs
Transformation / Alignment / Merging
Most likely, the best way to access and arrange the data is in 1-day chunks, concatenating along the time dimension. Given the large user pool for this dataset, I would suggest this recipe does as little data processing as possible.
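As a minimal sketch of the 1-day chunking idea (not a Pangeo Forge recipe), the snippet below generates one request per day, ordered along time so the resulting files could be concatenated in sequence. The request keys follow the CDS request form as I understand it at the time of this thread and may differ in newer versions of the service; the helper name `daily_requests` and the target file naming are hypothetical.

```python
from datetime import date, timedelta

# CDS dataset name for ERA5 on pressure levels
DATASET = "reanalysis-era5-pressure-levels"


def daily_requests(start, end, variables, levels):
    """Yield (target_filename, request_dict) for each day in [start, end].

    Hypothetical helper: one request per day, all 24 hourly steps.
    """
    day = start
    while day <= end:
        request = {
            "product_type": "reanalysis",
            "variable": list(variables),
            "pressure_level": [str(level) for level in levels],
            "year": f"{day.year:04d}",
            "month": f"{day.month:02d}",
            "day": f"{day.day:02d}",
            "time": [f"{hour:02d}:00" for hour in range(24)],
            "format": "netcdf",  # assumed key name; check the current CDS form
        }
        yield f"era5_{day:%Y%m%d}.nc", request
        day += timedelta(days=1)


reqs = list(daily_requests(date(2020, 2, 27), date(2020, 3, 1),
                           ["temperature"], [500, 850]))
for target, req in reqs:
    print(target, req["year"], req["month"], req["day"])

# Fetching is left out so the sketch runs offline. With credentials set up,
# each chunk could be retrieved with:
#   cdsapi.Client().retrieve(DATASET, req, target)
```

Keeping each request to a single day leaves the data untransformed and makes a failed download cheap to retry, at the cost of many more requests in the CDS queue.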
Output Dataset
One (or more?) Zarr stores. Hourly data for all available variables, all pressure levels, etc.