Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define STAC metadata structure #1

Closed
hrodmn opened this issue May 24, 2024 · 19 comments
Closed

Define STAC metadata structure #1

hrodmn opened this issue May 24, 2024 · 19 comments
Assignees
Labels
enhancement New feature or request

Comments

@hrodmn
Copy link
Collaborator

hrodmn commented May 24, 2024

Background

The NOAA HRRR data (Planetary Computer catalog entry, NOAA product page) is a continuously updated forecast data product. The data are currently housed in Azure but are not yet cataloged. Development Seed is working on improving the accessibility of the data for analytical applications.

Data structure

  • There are two regions: CONUS and Alaska
  • Every hour, new hourly forecasts are generated for many atmospheric attributes for each region
    • All hours (00-23) get an 18-hour forecast
    • Hours 00, 06, 12, 18 get a 48-hour forecast
    • One of the products (subh) gets 15 minute forecasts (four per hour per attribute)
  • The forecasts are broken up into 4 products,
  • An hourly forecast for all variables from a product is uploaded to Azure storage in the GRIB2 format
  • Each GRIB2 file has hundreds to thousands of variables
  • Each .grib2 file is accompanied by a .grib2.idx which has variable-level metadata including the starting byte for the data in that variable (useful for making range requests instead of reading the entire file) and some other descriptive metadata
    • the .grib2.idx files could save us from working with the full .grib2 file when generating the STAC metadata

Goal

Generate STAC metadata that supports these use-cases:

  1. Queries to get the hrefs for the GRIB files that match some set of conditions (reference time, forecast time, specific variable, etc)
    • Bonus points if these STAC items can be useful for producing the datacube metadata!
  2. A single(?) STAC item that represents a data cube (VirtualiZarr/kerchunk) of the entire dataset that can be queried/analyzed directly with xarray
    • one item per region?

Relevant links

Proposed solution

  • Follow the patterns from the ECMWF dataset because it will make the HRRR metadata consistent with similar forecast products in the Planetary Computer Catalog.
    • This approach is appealing because it does not require opening any GRIB2 files to generate the metadata (we will be needing to process a lot of these items) and because we might be able to use the kerchunk logic for our work on the big datacube asset.
  • Follow the NOAA GEFS collection's use of the forecast extension instead of hrrr:reference_time
  • use the forecast time (not model run time) to set the item datetime property
  • make separate items for each variable and forecast hour
  • TODO: spec out datacube Item metadata

Example STAC item metadata for a GRIB2 Item:

...
    {
      "id": "noaa-hrrr-20240524-t00-prs-FH12",
      "bbox": [...],
      "type": "Feature",
      "links": [...],
      "assets": {
        "data": {
          "href": "https://noaahrrr.blob.core.windows.net/hrrr/hrrr.20240524/conus/hrrr.t00z.wrfprsf12.grib2",
          "type": "application/wmo-GRIB2",
          "roles": [
            "data"
          ],
          "raster:bands": [
            {"data_type": "float64", ...}, # more information about each band
          ]
        },
        "index": {
          "href": "https://noaahrrr.blob.core.windows.net/hrrr/hrrr.20240524/conus/hrrr.t00z.wrfprsf12.grib2.index",
          "type": "application/x-ndjson",
          "roles": [
            "index"
          ]
        }
      },
      "geometry": {...},
      "collection": "noaa-hrrr",
      "properties": {
        "datetime": "2024-05-24T12:00:00Z",
        "forecast:reference_datetime": "2024-05-24T00:00:00Z",
        "forecast:horizon": "PT12H",
        "hrrr:region": "conus",
        "hrrr:cc": "12",
        "hrrr:type": "prs"
      },
      "stac_extensions": [
        "https://stac-extensions.github.io/raster/v1.1.0/schema.json",
        "https://stac-extensions.github.io/projection/v1.0.0/schema.json",
        "https://stac-extensions.github.io/timestamps/v1.0.0/schema.json",
        "https://stac-extensions.github.io/forecast/v1.0.0/schema.json"
      ],
      "stac_version": "1.0.0"
    },
...

Questions

cc @zacharyDez @abarciauskas-bgse @sharkinsspatial @vincentsarago @TomAugspurger

@hrodmn hrodmn added the enhancement New feature or request label May 24, 2024
@TomAugspurger
Copy link
Member

Thanks @hrodmn. A few quick suggestions that shouldn't be taken too seriously

Each GRIB2 file has hundreds to thousands of bands

This was probably the thing I struggled with the most when working with ecmwf-forecast. If we take building data cubes as our end goal (which I think is reasonable), it's not always obvious what the relationship is between a list of grib files and one or more data cubes (xarray Datasets). cfgrib has some conventions for building a list of xarray Datasets out of a STAC item, which might be worth following. But in general, a single GRIB file might map to multiple Datasets, which makes things messy (not sure if it applies to the HRRR files).

How should we handle multiple timestamps within a single GRIB2 asset for the subh variable?

Can you give a bit more detail on this? (But in general, my recommendation will probably be "do whatever cfgrib does")

Do we need to open a connection to the GRIB2 file to get specific GRIB band-level metadata or can we assume that it is constant across all files for a given variable?

I'm generally OK with trusting the provider's documentation around how things are organized, but it would be good to have an easy way to validate that the STAC items are correct.

Would the kerchunk:indices property be useful for our datacube work?

I'm not sure yet. I think there's a larger discussion to be had about where Kerchunk metadata lives (briefly, it might go in STAC metadata (and from the stac-geoparquet exports be available in parquet for bulk queries) or it might go in sidecar JSON / parquet files.) If possible, I would defer this initially, while we work out the best way to do that.

@hrodmn hrodmn self-assigned this May 28, 2024
@hrodmn
Copy link
Collaborator Author

hrodmn commented May 28, 2024

Thanks for your suggestions, @TomAugspurger!

Each GRIB2 file has hundreds to thousands of bands

This was probably the thing I struggled with the most when working with ecmwf-forecast. If we take building data cubes as our end goal (which I think is reasonable), it's not always obvious what the relationship is between a list of grib files and one or more data cubes (xarray Datasets). cfgrib has some conventions for building a list of xarray Datasets out of a STAC item, which might be worth following. But in general, a single GRIB file might map to multiple Datasets, which makes things messy (not sure if it applies to the HRRR files).

Thanks for the cfgrib suggestion, I'll take a look!

How should we handle multiple timestamps within a single GRIB2 asset for the subh variable?

Can you give a bit more detail on this? (But in general, my recommendation will probably be "do whatever cfgrib does")

The difference between subh and the rest of the variables is that, in addition to the data variable dimension, there is a time dimension represented in the bands within a single GRIB2 file (all other variables represent a single forecast time + reference time combination). cfgrib handles this by adding a step dimension to the xarray.Dataset with 15 minute intervals. I am not sure why it does that instead of using the existing time dimension, but I guess it would allow you to do analysis on a time series of e.g. step 3 (45 minute) forecasts. It does make it a little trickier to combine multiple subh files for analysis, i.e. concatenating the 15-minute interval forecasts across multiple hours.

Would the kerchunk:indices property be useful for our datacube work?

I'm not sure yet. I think there's a larger discussion to be had about where Kerchunk metadata lives (briefly, it might go in STAC metadata (and from the stac-geoparquet exports be available in parquet for bulk queries) or it might go in sidecar JSON / parquet files.) If possible, I would defer this initially, while we work out the best way to do that.

Good call. For now I could just add the datacube extension to the individual STAC assets so that the bands within each GRIB are documented.

@abarciauskas-bgse @sharkinsspatial After thinking about it some more, I think it might make sense to declare the separate variables as assets within a single item rather than creating separate items (or collections) for each variable. This seems intuitive to me since the groups of GRIB files (one for each variable) represent a single forecast hour + cycle run hour and we could explain a lot about the structure of the whole collection using the item_assets extension. Would that approach make the tiling work more difficult?

@hrodmn
Copy link
Collaborator Author

hrodmn commented Jun 4, 2024

After hacking on the STAC metadata in #3 and after much discussion with @sharkinsspatial and @abarciauskas-bgse I have a few more thoughts and a few alternative approaches, each with a set of pros and cons.

Here are some more details about the HRRR data that make it challenging to produce useful STAC metadata:

  • Within a single .grib file, all of the variables cannot be read into a single xarray dataset
    • not all variables share the same dimensions
    • all variables have x/y, some have atmospheric level (e.g. surface), some have height above ground, etc
  • Within a single product (e.g. sfc), the number of variables present in a .grib file varies according to the forecast hour
    • this means that we cannot reasonably include collection-level datacube extension metadata since the list of variables among .grib files within a single product is not consistent
  • Even if we do fill out the datacube metadata, we do not yet have a way to specify which exact coordinates in each dimension have data for any given variable such that the metadata can be used to constrain an interactive application to a valid list of combinations of variable, forecast_time, and level

Option 1: many collections

sharkinspatial proposed this as an option that would be most efficient on the STAC side:

  1. Create separate collections for each product x forecast hour group (either FH00-01 or FH2-48) combination
    • this is how NOAA breaks down the inventory of the data catalog: e.g. FH00-01 FH02-48
  2. For each collection, add collection-level datacube extension metadata

Pros:

  • Don't need to stuff datacube metadata all the way down in the items or assets, just store it once in the collection objects since all items within a collection have assets with the same structure
  • Downstream applications (like dynamic tiling services) could use the datacube metadata to formlate queries to the raw data for visualization or analysis
  • This is consistent with other large-scale STAC collections with similar between-item asset inconsistencies (e.g. HLS sentinel vs landsat)

Cons:

  • A .grib file does not completely fill out a datacube, i.e. not all combinations of all dimensions have data for a given variable
    • we would need some way to declare which combinations exist (this is not a part of the datacube spec yet)
  • The separation of data into multiple collections can impose a barrier to users who want to explore the full dataset

Option 2: one collection, all coherent datacubes listed as assets within items

  1. One collection with one item per cycle run x forecast hour combination
  2. Set up the list of datacubes that cfgrib (and Herbie) can read as assets within each STAC item.
  • This would mean tucking the datacube metadata along with some special properties into the asset metadata that could be passed to cfgrib or Herbie to selectively download/read the specified datacube from the full .grib files.
  • Many assets per .grib file

When you read a .grib file with cfgrib.open_datasets, it returns a list of ~dozens of datacubes. This is because some of the variables in the .grib file can be combined into coherent datacubes.

!wget https://noaahrrr.blob.core.windows.net/hrrr/hrrr.20240530/conus/hrrr.t06z.wrfsfcf00.grib2 -O /tmp/hrrr.20240530__conus__hrrr.t06z.wrfsfcf00.grib2

import cfgrib

local_file = "/tmp/hrrr.20240530__conus__hrrr.t06z.wrfsfcf00.grib2"

cfgrib_datasets = cfgrib.open_datasets(
    local_file,
)

This returns a list of 47 xarray datasets with one or more data variables. Many have two dimensions (x and y), but some have a third dimension like heightAboveGround.

It would not be typical to store references to internal layers of a .grib file as assets, but it is a natural way to access the data.

Pros:

  • Shortens the ramp for a user to send a query to a the STAC and load the results into xarray
  • Can use the item-assets extension to make the full list of datacubes viewable at the collection-level
    • We still have the same issue as before with inconsistent variables between items, but could list all possible datacubes/assets in item-assets then limit the list to the available assets in each item.
  • Variable discovery should be easier if each datacube has a reasonable description

Cons:

  • This seems very specific to xarray as an application which is not something I have seen in a STAC collection
  • There will be hundreds of assets in every item
  • Storing all of the datacube metadata at the asset-level is relatively inefficient

If anyone has any strong reactions to these options I would love to hear them!

@sharkinsspatial
Copy link

@hrodmn in relation to

we would need some way to declare which combinations exist (this is not a part of the datacube spec yet)

I think a possible approach here might be adding another custom property to the datacube variable's with the following structure

{
   "dimension_domains": {
       "level": [
           "entire atmosphere",
           "cloud_top",
           "surface"
      ]
   }
}

where each dimension_domains key must be a key in cube:dimensions.

I'll try to crack open some notebooks today to investigate but it also appears that when applying specific level filters with Herbie that it seems to coerce the level coordinate name to one that is specific to the selected value (isobaricInhPa in this case). Have you noticed this behavior?

@hrodmn
Copy link
Collaborator Author

hrodmn commented Jun 10, 2024

I am back with an example of the "many collections" metadata!

For the entire HRRR dataset, there would be a single collection per region x product x forecast hour set combination (16 total collections), each with a datacube specification at the collection-level.

Here is what the collection metadata look like for one combination (conus x sfc x FH00-01):

{
  "type": "Collection",
  "id": "noaa-hrrr-conus-sfc-fh02-48",
  "stac_version": "1.0.0",
  "description": "The NOAA HRRR is a real-time 3km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-hour period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh (RAP) system.",
  "links": [
    {
      "rel": "license",
      "href": "https://creativecommons.org/licenses/by/4.0/",
      "type": "text/html",
      "title": "CC-BY-4.0 license"
    },
    {
      "rel": "documentation",
      "href": "https://rapidrefresh.noaa.gov/hrrr/",
      "type": "text/html",
      "title": "NOAA HRRR documentation"
    }
  ],
  "stac_extensions": [
    "https://stac-extensions.github.io/item-assets/v1.0.0/schema.json",
    "https://stac-extensions.github.io/datacube/v2.2.0/schema.json"
  ],
  "item_assets": {
    "grib": {
      "type": "application/wmo-GRIB2",
      "roles": [
        "data"
      ],
      "title": "2D Surface Levels",
      "description": "2D Surface Level forecast data as a grib2 file. Subsets of the data can be loaded using the provided byte range."
    },
    "index": {
      "type": "application/x-ndjson",
      "roles": [
        "index"
      ],
      "title": "Index file",
      "description": "The index file contains information on each message within the GRIB2 file."
    }
  },
  "cube:dimensions": {
    "x": {
      "type": "spatial",
      "reference_system": "PROJCS[\"unknown\",GEOGCS[\"unknown\",DATUM[\"unknown\",SPHEROID[\"unknown\",6371229,0]],PRIMEM[\"Greenwich\",0,AUTHORITY[\"EPSG\",\"8901\"]],UNIT[\"degree\",0.0174532925199433,AUTHORITY[\"EPSG\",\"9122\"]]],PROJECTION[\"Lambert_Conformal_Conic_2SP\"],PARAMETER[\"latitude_of_origin\",38.5],PARAMETER[\"central_meridian\",-97.5],PARAMETER[\"standard_parallel_1\",38.5],PARAMETER[\"standard_parallel_2\",38.5],PARAMETER[\"false_easting\",0],PARAMETER[\"false_northing\",0],UNIT[\"metre\",1,AUTHORITY[\"EPSG\",\"9001\"]],AXIS[\"Easting\",EAST],AXIS[\"Northing\",NORTH]]",
      "extent": [
        -2697520.142521929,
        2696479.857478071
      ],
      "axis": "x"
    },
    "y": {
      "type": "spatial",
      "reference_system": "PROJCS[\"unknown\",GEOGCS[\"unknown\",DATUM[\"unknown\",SPHEROID[\"unknown\",6371229,0]],PRIMEM[\"Greenwich\",0,AUTHORITY[\"EPSG\",\"8901\"]],UNIT[\"degree\",0.0174532925199433,AUTHORITY[\"EPSG\",\"9122\"]]],PROJECTION[\"Lambert_Conformal_Conic_2SP\"],PARAMETER[\"latitude_of_origin\",38.5],PARAMETER[\"central_meridian\",-97.5],PARAMETER[\"standard_parallel_1\",38.5],PARAMETER[\"standard_parallel_2\",38.5],PARAMETER[\"false_easting\",0],PARAMETER[\"false_northing\",0],UNIT[\"metre\",1,AUTHORITY[\"EPSG\",\"9001\"]],AXIS[\"Easting\",EAST],AXIS[\"Northing\",NORTH]]",
      "extent": [
        -1587306.152556665,
        1586693.847443335
      ],
      "axis": "y"
    },
    "level": {
      "type": "atmospheric level",
      "description": "The atmospheric level for which the forecast is applicable, e.g. surface, top of atmosphere, 100 m above ground, etc.",
      "values": [
        "0 m underground",
        "0-1000 m above ground",
        "0-3000 m above ground",
        "0-500 m above ground",
        "0-6000 m above ground",
        "0.1 sigma level",
        "0.5-0.8 sigma layer",
        "0C isotherm",
        "1 m above ground",
        "10 m above ground",
        "100-1000 mb above ground",
        "1000 m above ground",
        "1000 mb",
        "1000-0 m above ground",
        "180-0 mb above ground",
        "2 m above ground",
        "2000-0 m above ground",
        "250 mb",
        "253 K level",
        "255-0 mb above ground",
        "261 K level - 256 K level",
        "263 K level",
        "300 mb",
        "3000-0 m above ground",
        "4000 m above ground",
        "500 mb",
        "500-1000 mb",
        "5000-2000 m above ground",
        "700 mb",
        "8 m above ground",
        "80 m above ground",
        "850 mb",
        "90-0 mb above ground",
        "925 mb",
        "boundary layer cloud layer",
        "cloud base",
        "cloud ceiling",
        "cloud top",
        "entire atmosphere",
        "entire atmosphere (considered as a single layer)",
        "equilibrium level",
        "high cloud layer",
        "highest tropospheric freezing level",
        "level of adiabatic condensation from sfc",
        "level of free convection",
        "low cloud layer",
        "mean sea level",
        "middle cloud layer",
        "surface",
        "top of atmosphere"
      ]
    },
    "forecast_time": {
      "type": "temporal",
      "description": "The time horizon for which the forecast is applicable.",
      "values": [
        "0-1 day acc fcst",
        "0-10 hour acc fcst",
        "0-11 hour acc fcst",
        "0-12 hour acc fcst",
        "0-13 hour acc fcst",
        "0-14 hour acc fcst",
        "0-15 hour acc fcst",
        "0-16 hour acc fcst",
        "0-17 hour acc fcst",
        "0-18 hour acc fcst",
        "0-19 hour acc fcst",
        "0-2 day acc fcst",
        "0-2 hour acc fcst",
        "0-20 hour acc fcst",
        "0-21 hour acc fcst",
        "0-22 hour acc fcst",
        "0-23 hour acc fcst",
        "0-25 hour acc fcst",
        "0-26 hour acc fcst",
        "0-27 hour acc fcst",
        "0-28 hour acc fcst",
        "0-29 hour acc fcst",
        "0-3 hour acc fcst",
        "0-30 hour acc fcst",
        "0-31 hour acc fcst",
        "0-32 hour acc fcst",
        "0-33 hour acc fcst",
        "0-34 hour acc fcst",
        "0-35 hour acc fcst",
        "0-36 hour acc fcst",
        "0-37 hour acc fcst",
        "0-38 hour acc fcst",
        "0-39 hour acc fcst",
        "0-4 hour acc fcst",
        "0-40 hour acc fcst",
        "0-41 hour acc fcst",
        "0-42 hour acc fcst",
        "0-43 hour acc fcst",
        "0-44 hour acc fcst",
        "0-45 hour acc fcst",
        "0-46 hour acc fcst",
        "0-47 hour acc fcst",
        "0-5 hour acc fcst",
        "0-6 hour acc fcst",
        "0-7 hour acc fcst",
        "0-8 hour acc fcst",
        "0-9 hour acc fcst",
        "1-2 hour acc fcst",
        "1-2 hour ave fcst",
        "1-2 hour max fcst",
        "1-2 hour min fcst",
        "10 hour fcst",
        "10-11 hour acc fcst",
        "10-11 hour ave fcst",
        "10-11 hour max fcst",
        "10-11 hour min fcst",
        "11 hour fcst",
        "11-12 hour acc fcst",
        "11-12 hour ave fcst",
        "11-12 hour max fcst",
        "11-12 hour min fcst",
        "12 hour fcst",
        "12-13 hour acc fcst",
        "12-13 hour ave fcst",
        "12-13 hour max fcst",
        "12-13 hour min fcst",
        "13 hour fcst",
        "13-14 hour acc fcst",
        "13-14 hour ave fcst",
        "13-14 hour max fcst",
        "13-14 hour min fcst",
        "14 hour fcst",
        "14-15 hour acc fcst",
        "14-15 hour ave fcst",
        "14-15 hour max fcst",
        "14-15 hour min fcst",
        "15 hour fcst",
        "15-16 hour acc fcst",
        "15-16 hour ave fcst",
        "15-16 hour max fcst",
        "15-16 hour min fcst",
        "16 hour fcst",
        "16-17 hour acc fcst",
        "16-17 hour ave fcst",
        "16-17 hour max fcst",
        "16-17 hour min fcst",
        "17 hour fcst",
        "17-18 hour acc fcst",
        "17-18 hour ave fcst",
        "17-18 hour max fcst",
        "17-18 hour min fcst",
        "18 hour fcst",
        "18-19 hour acc fcst",
        "18-19 hour ave fcst",
        "18-19 hour max fcst",
        "18-19 hour min fcst",
        "19 hour fcst",
        "19-20 hour acc fcst",
        "19-20 hour ave fcst",
        "19-20 hour max fcst",
        "19-20 hour min fcst",
        "2 hour fcst",
        "2-3 hour acc fcst",
        "2-3 hour ave fcst",
        "2-3 hour max fcst",
        "2-3 hour min fcst",
        "20 hour fcst",
        "20-21 hour acc fcst",
        "20-21 hour ave fcst",
        "20-21 hour max fcst",
        "20-21 hour min fcst",
        "21 hour fcst",
        "21-22 hour acc fcst",
        "21-22 hour ave fcst",
        "21-22 hour max fcst",
        "21-22 hour min fcst",
        "22 hour fcst",
        "22-23 hour acc fcst",
        "22-23 hour ave fcst",
        "22-23 hour max fcst",
        "22-23 hour min fcst",
        "23 hour fcst",
        "23-24 hour acc fcst",
        "23-24 hour ave fcst",
        "23-24 hour max fcst",
        "23-24 hour min fcst",
        "24 hour fcst",
        "24-25 hour acc fcst",
        "24-25 hour ave fcst",
        "24-25 hour max fcst",
        "24-25 hour min fcst",
        "25 hour fcst",
        "25-26 hour acc fcst",
        "25-26 hour ave fcst",
        "25-26 hour max fcst",
        "25-26 hour min fcst",
        "26 hour fcst",
        "26-27 hour acc fcst",
        "26-27 hour ave fcst",
        "26-27 hour max fcst",
        "26-27 hour min fcst",
        "27 hour fcst",
        "27-28 hour acc fcst",
        "27-28 hour ave fcst",
        "27-28 hour max fcst",
        "27-28 hour min fcst",
        "28 hour fcst",
        "28-29 hour acc fcst",
        "28-29 hour ave fcst",
        "28-29 hour max fcst",
        "28-29 hour min fcst",
        "29 hour fcst",
        "29-30 hour acc fcst",
        "29-30 hour ave fcst",
        "29-30 hour max fcst",
        "29-30 hour min fcst",
        "3 hour fcst",
        "3-4 hour acc fcst",
        "3-4 hour ave fcst",
        "3-4 hour max fcst",
        "3-4 hour min fcst",
        "30 hour fcst",
        "30-31 hour acc fcst",
        "30-31 hour ave fcst",
        "30-31 hour max fcst",
        "30-31 hour min fcst",
        "31 hour fcst",
        "31-32 hour acc fcst",
        "31-32 hour ave fcst",
        "31-32 hour max fcst",
        "31-32 hour min fcst",
        "32 hour fcst",
        "32-33 hour acc fcst",
        "32-33 hour ave fcst",
        "32-33 hour max fcst",
        "32-33 hour min fcst",
        "33 hour fcst",
        "33-34 hour acc fcst",
        "33-34 hour ave fcst",
        "33-34 hour max fcst",
        "33-34 hour min fcst",
        "34 hour fcst",
        "34-35 hour acc fcst",
        "34-35 hour ave fcst",
        "34-35 hour max fcst",
        "34-35 hour min fcst",
        "35 hour fcst",
        "35-36 hour acc fcst",
        "35-36 hour ave fcst",
        "35-36 hour max fcst",
        "35-36 hour min fcst",
        "36 hour fcst",
        "36-37 hour acc fcst",
        "36-37 hour ave fcst",
        "36-37 hour max fcst",
        "36-37 hour min fcst",
        "37 hour fcst",
        "37-38 hour acc fcst",
        "37-38 hour ave fcst",
        "37-38 hour max fcst",
        "37-38 hour min fcst",
        "38 hour fcst",
        "38-39 hour acc fcst",
        "38-39 hour ave fcst",
        "38-39 hour max fcst",
        "38-39 hour min fcst",
        "39 hour fcst",
        "39-40 hour acc fcst",
        "39-40 hour ave fcst",
        "39-40 hour max fcst",
        "39-40 hour min fcst",
        "4 hour fcst",
        "4-5 hour acc fcst",
        "4-5 hour ave fcst",
        "4-5 hour max fcst",
        "4-5 hour min fcst",
        "40 hour fcst",
        "40-41 hour acc fcst",
        "40-41 hour ave fcst",
        "40-41 hour max fcst",
        "40-41 hour min fcst",
        "41 hour fcst",
        "41-42 hour acc fcst",
        "41-42 hour ave fcst",
        "41-42 hour max fcst",
        "41-42 hour min fcst",
        "42 hour fcst",
        "42-43 hour acc fcst",
        "42-43 hour ave fcst",
        "42-43 hour max fcst",
        "42-43 hour min fcst",
        "43 hour fcst",
        "43-44 hour acc fcst",
        "43-44 hour ave fcst",
        "43-44 hour max fcst",
        "43-44 hour min fcst",
        "44 hour fcst",
        "44-45 hour acc fcst",
        "44-45 hour ave fcst",
        "44-45 hour max fcst",
        "44-45 hour min fcst",
        "45 hour fcst",
        "45-46 hour acc fcst",
        "45-46 hour ave fcst",
        "45-46 hour max fcst",
        "45-46 hour min fcst",
        "46 hour fcst",
        "46-47 hour acc fcst",
        "46-47 hour ave fcst",
        "46-47 hour max fcst",
        "46-47 hour min fcst",
        "47 hour fcst",
        "47-48 hour acc fcst",
        "47-48 hour ave fcst",
        "47-48 hour max fcst",
        "47-48 hour min fcst",
        "48 hour fcst",
        "5 hour fcst",
        "5-6 hour acc fcst",
        "5-6 hour ave fcst",
        "5-6 hour max fcst",
        "5-6 hour min fcst",
        "6 hour fcst",
        "6-7 hour acc fcst",
        "6-7 hour ave fcst",
        "6-7 hour max fcst",
        "6-7 hour min fcst",
        "7 hour fcst",
        "7-8 hour acc fcst",
        "7-8 hour ave fcst",
        "7-8 hour max fcst",
        "7-8 hour min fcst",
        "8 hour fcst",
        "8-9 hour acc fcst",
        "8-9 hour ave fcst",
        "8-9 hour max fcst",
        "8-9 hour min fcst",
        "9 hour fcst",
        "9-10 hour acc fcst",
        "9-10 hour ave fcst",
        "9-10 hour max fcst",
        "9-10 hour min fcst"
      ]
    }
  },
  "cube:variables": {
    ...
    "ASNOW": {
      "dimensions": [
        "x",
        "y",
        "level",
        "forecast_time"
      ],
      "type": "data",
      "description": "Total Snowfall",
      "unit": "m",
      "dimension_domains": {
        "level": [
          "surface"
        ],
        "forecast_time": [
          "0-2 hour acc fcst",
          "0-3 hour acc fcst",
          "0-4 hour acc fcst",
          "0-5 hour acc fcst",
          "0-6 hour acc fcst",
          "0-7 hour acc fcst",
          "0-8 hour acc fcst",
          "0-9 hour acc fcst",
          "0-10 hour acc fcst",
          "0-11 hour acc fcst",
          "0-12 hour acc fcst",
          "0-13 hour acc fcst",
          "0-14 hour acc fcst",
          "0-15 hour acc fcst",
          "0-16 hour acc fcst",
          "0-17 hour acc fcst",
          "0-18 hour acc fcst",
          "0-19 hour acc fcst",
          "0-20 hour acc fcst",
          "0-21 hour acc fcst",
          "0-22 hour acc fcst",
          "0-23 hour acc fcst",
          "0-1 day acc fcst",
          "0-25 hour acc fcst",
          "0-26 hour acc fcst",
          "0-27 hour acc fcst",
          "0-28 hour acc fcst",
          "0-29 hour acc fcst",
          "0-30 hour acc fcst",
          "0-31 hour acc fcst",
          "0-32 hour acc fcst",
          "0-33 hour acc fcst",
          "0-34 hour acc fcst",
          "0-35 hour acc fcst",
          "0-36 hour acc fcst",
          "0-37 hour acc fcst",
          "0-38 hour acc fcst",
          "0-39 hour acc fcst",
          "0-40 hour acc fcst",
          "0-41 hour acc fcst",
          "0-42 hour acc fcst",
          "0-43 hour acc fcst",
          "0-44 hour acc fcst",
          "0-45 hour acc fcst",
          "0-46 hour acc fcst",
          "0-47 hour acc fcst",
          "0-2 day acc fcst"
        ]
      }
    },
    ...
  }

The forecast_time, and level dimensions map to columns in the .grib2.idx files and can be concatenated with the variable name to form the search strings that we can use to subset data using Herbie, e.g. `search="ASNOW:surface:0-45 acc fcst:".

The main issue that I can see is that each specific item will only cover a specific subset of the forecast_time dimension, but I think we can write an application that knows the rules well enough.

@sharkinsspatial @abarciauskas-bgse how does this look to you? I think we could come up with a pretty simple application that takes this metadata to populate Herbie instances for easily reading subsets of data with xarray.

@hrodmn
Copy link
Collaborator Author

hrodmn commented Jun 11, 2024

Another possible approach would be to treat the individual "bands" in the GRIB files as assets within the items. We could keep the base .grib and .grib.idx assets and add assets that represent the byte range either as asset-level properties or directly in the href using /vsisubfile, e.g. /vsisubfile/57675323_1224134,/vsicurl/https://noaa-hrrr-bdp-pds.s3.amazonaws.com/hrrr.20210103/conus/hrrr.t12z.wrfsfcf06.grib2.

It would not require a very complicated application to concatenate the same assets from different items along a time dimension in xarray and it could be used without installing eccodes or cfgrib.

Here is a gist showing how you can load tiny subsets of data into xarray without any downloading logic using /vsisubfile and /vsicurl: https://gist.github.com/hrodmn/278e50aa94ef4748cdc8e51d4db3b354/8e480f513930c807ef599b506f889e1536a9166f

To make it easier to combine assets from a single item into a datacube, we could add an asset-level property that indicates which group or datacube each asset can be added to, but maybe this would just be a separate datacube asset within each item.

@TomAugspurger
Copy link
Member

Another possible approach would be to treat the individual "bands" in the GRIB files as assets within the items.

FWIW, over time I've become more of a "purist" that assets in STAC items should refer to actual assets, and that things like this example extracting a variable from a file are better off in an extension on the item or asset. For example, as an extension on the assets:

"assets": {
"data": {
    "href": "...",
    "variables": {
        "DPT": {
           "offset": "...",
           "length": "...",
    }
}
}

That leaves the assets as plain links to files, and a (not very complicated, as you say) client application could pick out the extra metadata if they want to construct the efficient range queries.

@hrodmn
Copy link
Collaborator Author

hrodmn commented Jun 13, 2024

@TomAugspurger thanks for the suggestion. I tried the slices as assets approach in #5 and it seems a bit unwieldy.

Here is a snippet from the example output in #5

  "assets": {
    "grib": {
      "href": "https://noaahrrr.blob.core.windows.net/hrrr/hrrr.20240510/conus/hrrr.t12z.wrfsfcf00.grib2",
      "type": "application/wmo-GRIB2",
      "title": "2D Surface Levels",
      "description": "2D Surface Level forecast data as a grib2 file. Subsets of the data can be loaded using the provided byte range.",
      "roles": [
        "data"
      ]
    },
    "index": {
      "href": "https://noaahrrr.blob.core.windows.net/hrrr/hrrr.20240510/conus/hrrr.t12z.wrfsfcf00.grib2.idx",
      "type": "application/x-ndjson",
      "title": "Index file",
      "description": "The index file contains information on each message within the GRIB2 file.",
      "roles": [
        "index"
      ]
    },
    "REFC__entire_atmosphere__analysis": {
      "href": "/vsisubfile/0_238534,/vsicurl/https://noaahrrr.blob.core.windows.net/hrrr/hrrr.20240510/conus/hrrr.t12z.wrfsfcf00.grib2",
      "type": "application/wmo-GRIB2",
      "title": "REFC - entire atmosphere - analysis",
      "hrrr:forecast_layer_type": "analysis",
      "roles": [
        "data"
      ]
    },
    "RETOP__cloud_top__analysis": {
      "href": "/vsisubfile/238534_138642,/vsicurl/https://noaahrrr.blob.core.windows.net/hrrr/hrrr.20240510/conus/hrrr.t12z.wrfsfcf00.grib2",
      "type": "application/wmo-GRIB2",
      "title": "RETOP - cloud top - analysis",
      "hrrr:forecast_layer_type": "analysis",
      "roles": [
        "data"
      ]
    }
  }

I like how a user could take the href and immediately read the data without any application code, but it does seem a little unwieldy. Your idea for an asset-level extension property that holds all of the subset info sounds good to me.

@abarciauskas-bgse
Copy link
Contributor

I just caught up on these comments and had some questions which I just discussed with @hrodmn

  • Above it was stated "Within a single product (e.g. sfc), the number of variables present in a .grib file varies according to the forecast hour" - how do the variables vary and which forecast hours are different?
    • Answer to which forecast hours are different: The variation is between forecast hours 00-01 and 02-48. In other words, 00-01 have the same inventory and 02-48 have the same inventory.
    • Variable differences: It appears the forecast valid and level fields are different and I see that there are a few extra variables when I look at the "Number of records" on the inventory file pages (it looks like the difference is always by 3), but it's hard to tell from the page what those extra variables are.
    • We think that one difference is that 02-48 includes "accumulations".
  • The product inventory calls the field "forecast valid" https://www.nco.ncep.noaa.gov/pmb/products/hrrr/hrrr.t00z.wrfprsf00.grib2.shtml. In Define STAC metadata structure #1 (comment) the example includes the forecast_time dimension - if I understand this dimension correctly, "forecast_valid" conveys more about what the dimension represents, but I would go with whatever the dimension name is in the actual dataset.
  • Each STAC item will represent a single grib file. I asked if each STAC item would only have 1 possible value for "forecast valid", e.g. all files with FH == 02 have forecast_valid = 2 hrs. But is not the case, there are different values.

We could still benefit from having some answers to these questions from a HRRR data provider or expert:

  • What is the detail of the difference between 00-01 and 02-48?
  • What exactly is the definition of "analysis" value for the forecast valid field?

@TomAugspurger
Copy link
Member

One more high-level question (since I haven't had time to review the PR yet), related to data cubes, STAC items, and chunk manifests / virtual Zarr.

IIUC, by storing the byte range offsets of each message in https://github.com/developmentseed/noaa-hrrr/blob/01a335d258b40a2aa9cb2dc073cdb5d7bc3c3e75/examples/hrrr-conus-sfc-2024-05-10T12-FH2/hrrr-conus-sfc-2024-05-10T12-FH2.json#L59-L95, it should be relatively easy (and fast) to form a datacube from a STAC search. Given

  1. A list of STAC items with those offsets
  2. Some kind of specification for which subset of messages to load (similar to Herbie's search thing)

We should be able to make a datacube by

  1. Building a ChunkManifest (or is it a ChunkKey? and maybe multiple of these, if you're going to combine multiple data variables from the same asset?) for each item
  2. Concatenating these into a ChunkManifestArray through time

This seems pretty useful. It costs extra time when creating the STAC metadata (to extract the offsets) and extra space for storage and transmission of the STAC items. Do folks think the benefit is worth it?

Assuming that's a good idea, I'm struggling a bit with how to expose these datacubes through STAC. Some options I see:

  1. I think Henry or Aimee suggested putting the possible "datacube specifications" somewhere on the STAC collection, which seems like a good idea to me. Then (in pseudo-code) this could be
client = pystac_client.Client.open(...)
collection= client.get_collection("noaa-hrr")
items = client.search(collections=collection.id, datetime=[start, end])
search = collection.datacubes["ASNOW:surface:0-45 acc fcst:"]  # get the search string

datasets = [ChunkThing.from_asset(item.assets["data"], search=search) for item in items]
ds = xr.concat(datasets)

We could work on higher-level wrappers, but the idea is that the STAC metadata has all the information needed to combine Items into a datacube.

  1. How does this relate to storing these chunk manifests in a compressed format like parquet? Perhaps the STAC collection could contain a dictionary of "datacube names" linking to Parquet datasets with the same manifest data?

Sorry to the ramble-y post. Still working through this.

@hrodmn
Copy link
Collaborator Author

hrodmn commented Jun 14, 2024

@TomAugspurger thanks for your thoughts.

We could work on higher-level wrappers, but the idea is that the STAC metadata has all the information needed to combine Items into a datacube.

Your vision for an application that can use the collection-level datacube schema and the sub-asset metadata to quickly construct an xarray is exactly what I think we want to be able to build.

This seems pretty useful. It costs extra time when creating the STAC metadata (to extract the offsets) and extra space for storage and transmission of the STAC items. Do folks think the benefit is worth it?

In this case, I am not very concerned about the overhead for STAC metadata generation and storage because we are not dealing with that many items. For any forecast hour there are just two items (conus + alaska). For a whole day's worth of forecasts there are 848 items.

After a lot of trial and error over the last few weeks, here is an outline of an approach that I think we should take for v0.1:

Collection(s)

  • One collection per product
    • The products are different enough that this makes sense. This also aligns with how NOAA segments the data in the documentation.
    • Keep data from FH00-01 and FH02-48 in the same collection. We could separate them since the use-cases for the data in each of the forecast hour sets are different (FH00-01: retrospective analysis of actual conditions, FH02-48: forecasts from current conditions).
  • Add datacube extension to the collection to provide high-level visibility of the dimensions, coordinates, and variables available within a collection.
    • For this to actually make sense, we might need to translate some of the coordinate values from the GRIB files into values that make sense in a datacube context.
    • I am thinking specifically of forecast_valid coordinate values like '2 hour fcst' or '0-1 day fcst' which contain information relative to the cycle_run_hour that is already captured in the item datetime property. I propose that we define some key/value pairs like 'forecast_layer_type': 'cumulative_summary' and 'forecast_statistic': 'maximum' and 'timedelta_start_minutes': 0, 'timedelta_end_minutes': 1440 to describe for a forecast_valid value like '0-1 day forecast'.
    • I don't love the idea of modifying the metadata in a substantial way, so maybe this is a bad idea, but it would facilitate more useful datacubes. I think this is basically what cfgrib does to create the list of coherent xarray.Datasets.

Items

  • One item per region x cycle run hour x forecast hour
    • to get the most recent forecast for a given hour, filter down to items where item.datetime == forecast_time, then sort in descending order by forecast:reference_time.
    • to get a retrospective look at actual conditions, filter down to items where item.properties["forecast:horizon"] == "PT0H"
  • Two assets per item: grib and idx

grib:layers extension to the grib assets

  • When I tried this out originally I set it up with a key format of {variable}__{level}__{forecast_layer_type}

          "grib:layers": {
            "REFC__entire_atmosphere__point_in_time": {
              "description": "Composite reflectivity",
              "unit": "dB",
              "grib_message": 1,
              "start_byte": 0,
              "byte_size": 345203,
              "variable": "REFC",
              "level": "entire atmosphere",
              "forecast_valid": "2 hour fcst",
              "forecast_layer_type": "point_in_time",
              "end_timedelta": 7200.0
            },
            "RETOP__cloud_top__point_in_time": {
              "description": "Echo Top",
              "unit": "m",
              "grib_message": 2,
              "start_byte": 345203,
              "byte_size": 103748,
              "variable": "RETOP",
              "level": "cloud top",
              "forecast_valid": "2 hour fcst",
              "forecast_layer_type": "point_in_time",
              "end_timedelta": 7200.0
            }
        }
  • I could also see how it would be useful to nest the dimensions somehow rather than encoding everything into a single key.

  • It is very easy to construct the dsn for the specific byte range into a single string that GDAL can use:

    def format_vsi_path(grib_asset: Asset, grib_layer_key: str):
        # yank the layer metadata from grib:layers
        layer = grib_asset.extra_fields["grib:layers"][grib_layer_key]
        
        # format the virtual file path for the byte range for this variable
        return f"/vsisubfile/{layer['start_byte']}_{layer['byte_size']},/vsicurl/{grib_asset.href}"
  • Maybe we can try this out for HRRR then upstream it to a proper STAC extension

Questions

  • Put CONUS and Alaska in the same collection or different collections?
  • Put forecast hour sets (e.g. FH00-01 / FH02-48) in separate collections?
  • Punt on forecast_valid value refactoring (forecast_layer_type, forecast_statistic, timedelta_start_minutes, etc) until later?
  • The subh product has GRIB files with a time dimension in its layers. We could split those out into separate Items

@hrodmn
Copy link
Collaborator Author

hrodmn commented Jun 17, 2024

There is a simpler way to do targeted reads from GRIB files with GDAL's driver. You can use GDAL's vrt connection string syntax and the GRIB message indices like bands in a vrt:// path. By default, GDAL uses the accompanying .grib.idx file to do the /vsisubfile range requests for us.
OSGeo/gdal#10214 (comment)

def format_vrt_path(grib_asset: Asset, grib_layer_key: str):
    layer = grib_asset.extra_fields["grib:layers"][grib_layer_key]

    return f"vrt:///vsicurl/{grib_asset.href}?bands={layer['grib_message']}"

where layer['grib_message'] is the row in the .grib2.idx file for the layer that you want to load.

@TomAugspurger
Copy link
Member

Put CONUS and Alaska in the same collection or different collections?

I could go either way, but lean towards the same collection (assuming the only difference between the two is the AOI covered. If there are differences in the "schema" then Alaska can go on its own)

Put forecast hour sets (e.g. FH00-01 / FH02-48) in separate collections?

I'm not sure, but I lean towards the same collection? I see your comment about the different use-cases, which is compelling. Can those two use cases be supported by including some condition on the forecast hour set in the query that gets the list of STAC items?

Punt on forecast_valid value refactoring

I don't have strong thoughts. Seems fine to defer this.

The subh product has GRIB files with a time dimension in its layers. We could split those out into separate Items

Happy to split out "problematic" items into their own collections if needed. But just to confirm: is this subh product in the same GRIB2 file as the other productions? If so, it will feel a bit strange to have the same asset cataloged by multiple STAC items... What effect does the presence of this subh product have on the STAC metadata?

@hrodmn
Copy link
Collaborator Author

hrodmn commented Jun 17, 2024

Put CONUS and Alaska in the same collection or different collections?

I could go either way, but lean towards the same collection (assuming the only difference between the two is the AOI covered. If there are differences in the "schema" then Alaska can go on its own)

The only substantial difference between Alaska and CONUS is that Alaska only gets a fresh model run every three hours while CONUS gets a fresh model run every hour. In addition to the non-overlapping spatial extents for the two regions, I added an item-level property noaa-hrrr:region which makes it possible to do a simple filter at STAC query time.

Put forecast hour sets (e.g. FH00-01 / FH02-48) in separate collections?

I'm not sure, but I lean towards the same collection? I see your comment about the different use-cases, which is compelling. Can those two use cases be supported by including some condition on the forecast hour set in the query that gets the list of STAC items?

The item-level property forecast:horizon can be used to apply this filter and we could easily add a noaa-hrrr:forecast_hour_set property with values ["FH00-01", "FH02-48"]. My instinct is to keep them in the same collection and let the user apply whatever filter they want since, in addition to the two use-cases I outlined above, there are probably use-cases where you would want to readily combine data across forecast hour sets.

The subh product has GRIB files with a time dimension in its layers. We could split those out into separate Items

Happy to split out "problematic" items into their own collections if needed. But just to confirm: is this subh product in the same GRIB2 file as the other productions? If so, it will feel a bit strange to have the same asset cataloged by multiple STAC items... What effect does the presence of this subh product have on the STAC metadata?

The way I have it set up right now, every product is in a separate collection. The reason I would consider splitting the subh GRIB layers across multiple items is to facilitate loading higher-resolution time series datasets. Without this modification, users and applications will have to do some work to parse the GRIB metadata further and split the 15-minute interval layers into the time dimension.

@hrodmn
Copy link
Collaborator Author

hrodmn commented Jun 17, 2024

re:CONUS / Alaska discussion:
I don't know how we would specify the datacube metadata if we put Alaska and CONUS in the same collection since they use different projections. Specifically, I don't know how we would specify the x and y dimensions since the coordinate values can't be combined without transforming them to some common CRS.

If the datacube extension metadata is a priority, we will probably need to store CONUS and Alaska in separate collections.

@abarciauskas-bgse
Copy link
Contributor

I'm not the STAC expert here but it seems like having a distinct spatial extents and different projections is good enough reason to put Alaska and CONUS into separate collections.

Cool to learn about the vrt:// syntax!

@hrodmn
Copy link
Collaborator Author

hrodmn commented Jun 21, 2024

Thanks all for your feedback and guidance on the best structure - I think we landed on a workable format for the HRRR data.

@hrodmn hrodmn closed this as completed Jun 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants