opening STAC elements with assets pointing to kerchunk files #34
FWIW, I'm having some reservations about embedding Kerchunk metadata in STAC items. I'll post over in #32. For the media types stuff, AFAIK it's very challenging to officially register one. We might be able to use asset roles to indicate that the asset at that link (which might be JSON, parquet, ...) is intended to be consumed by Kerchunk.
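A minimal sketch of that role-based idea. The `"references"` role name is purely hypothetical (nothing is standardized for this yet), and plain dicts shaped like a STAC item's `"assets"` object are used to keep the example self-contained:

```python
# Sketch of the asset-roles idea from the comment above. The "references"
# role name is a made-up placeholder: no role is standardized for this yet.
def kerchunk_assets(assets: dict) -> dict:
    """Filter an item's assets down to the ones meant to be read by Kerchunk.

    ``assets`` maps asset keys to dicts shaped like a STAC item's "assets"
    object (with optional "roles" and "type" entries).
    """
    return {
        key: asset
        for key, asset in assets.items()
        if "references" in asset.get("roles", [])
    }


item_assets = {
    "refs": {"href": "refs.json", "type": "application/json", "roles": ["references"]},
    "data": {"href": "data.nc", "type": "application/netcdf", "roles": ["data"]},
}
print(list(kerchunk_assets(item_assets)))  # -> ['refs']
```

The point of filtering on roles rather than media type is that the payload itself can stay plain JSON or parquet.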
Yeah I saw that when I was working on #32! That's great to see!!
So far xpystac has been using a combination of … (lines 74 to 76 in 342c197).
Obviously the …
Yeah I think the work would be that, when passed an item, we should look for an asset that has the specific … (lines 88 to 96 in 342c197).
The main issue I have with relying on the kerchunk library directly is the dependencies, which I don't think you need if you are just reading.
I don't see why not. xpystac doesn't currently try to do anything with collections, so it should be fine to just look for a kerchunk asset in there.
oops Tom beat me to it 😄
wow, it looks like I didn't read the readme or the code closely enough! Looks like …
That makes sense to me. The only question I'd have is: are those sufficient? In other words, do we think that someone might want to use …? (Maybe I'm overthinking this, and we can just fix it if anything like this occurs.)
For reference, which dependency of …? That said, as it is, the mapper creation is hidden, which was the main point of this issue, so the only argument for switching to …
I added that because I've seen @TomAugspurger has some …
lol what an understatement :)
Yeah I suspect that we can just try to treat the …
I was thinking …
xpystac is always happy to operate on an asset (even a collection asset), but if you are suggesting that we could accept a collection and just figure out if there is a zarr asset, I guess we'd need to know exactly which roles we are expecting the asset to have.
Sounds good to me, but let's check. @m-mohr, what do you think? Does that sound like something we should be doing? Or do you know a better way (or someone who does)?
I guess so. I had imagined we could just filter collection-level assets by media type / role as we do with items, but that was before realizing that …
For items yeah, but for collections there isn't really a behavior defined yet. So I think we can just look for the zarr media_type and read it if found.
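A minimal sketch of that "look for the zarr media_type" idea, using a tiny stand-in class instead of `pystac.Asset` so the example stays self-contained. The media type string matches what is seen on zarr assets in the wild (e.g. on Planetary Computer), but treat it as an assumption rather than an official registration:

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Minimal stand-in for pystac.Asset, just enough for the sketch.
@dataclass
class Asset:
    href: str
    media_type: Optional[str] = None
    roles: List[str] = field(default_factory=list)

# Media type used on zarr assets in the wild; an assumption, not a registration.
ZARR_MEDIA_TYPE = "application/vnd+zarr"

def find_zarr_assets(assets: dict) -> dict:
    """Return the collection-level assets that look like zarr stores."""
    return {
        key: asset
        for key, asset in assets.items()
        if asset.media_type == ZARR_MEDIA_TYPE
    }

assets = {
    "zarr-abfs": Asset("abfs://container/store.zarr", media_type=ZARR_MEDIA_TYPE),
    "thumbnail": Asset("https://example.com/thumb.png", media_type="image/png"),
}
print(list(find_zarr_assets(assets)))  # -> ['zarr-abfs']
```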
Ideally, bring this up in the next community call :-) I don't have the time right now to read through the long issue to get the context, sorry. @keewis
of course! Where can I find more about that? Naively searching for "stac community call" does not yield any information on when and where to join.
@keewis Good question, I thought there would be information available, but it seems like there isn't yet. If you join https://groups.google.com/g/stac-community you should get an invite for the bi-weekly calls afaik.
I started looking into opening the zarr asset if the user passes a collection. I was using Planetary Computer as an example, and I think the trouble is that on any given collection there might be several different collection-level zarr or kerchunk assets. For instance, https://planetarycomputer.microsoft.com/api/stac/v1/collections/nasa-nex-gddp-cmip6 has 17 kerchunk assets. So I am kind of thinking that the current implementation, requiring the user to supply an asset directly to xpystac, might be the best we can do.
FWIW, don't place too much weight on how I did those. That was my first time using Kerchunk. I was writing up a post explaining my reasoning behind the collection-level assets, but found I couldn't make sense of it :) I suspect that if I were redoing that dataset today, I would move, for example, the collection-level asset at …. I think the best collection-level asset, if feasible, would be a set of Kerchunk references that concatenates all of these item-level references into a single Dataset (or DataTree? I'm not sure.)
Ah ok, that is helpful context. I think I agree that it feels conceptually like an item should be readable without the user needing to pick and choose assets.
Hi everyone, I'm jumping into the discussion trying to figure out how to create and read meaningful STAC Collections based on Kerchunk'ed netCDFs. Here is a sample of the code I'm using; the url refers to the sample Item I created.

```python
import pystac_client
import odc.stac
import json
import xpystac
import xarray as xr

url_1 = "https://gist.githubusercontent.com/clausmichele/28efa0007731044db3a7752da2164fe0/raw/1cba235038f0aa20e16675a863224a4f3ab79e4a/CERRA-20010101000000_20011231000000.json"

stac_api = pystac_client.stac_api_io.StacApiIO()
stac_dict_1 = json.loads(stac_api.read_text(url_1))
item_1 = stac_api.stac_object_from_dict(stac_dict_1)

ds1 = xr.open_dataset(item_1, engine="stac")
ds1
```

Do you have suggestions on how to proceed? This code would work with my current data structure:

```python
import pystac_client
import odc.stac
import json
import xpystac
import xarray as xr

url_1 = "https://gist.githubusercontent.com/clausmichele/28efa0007731044db3a7752da2164fe0/raw/1cba235038f0aa20e16675a863224a4f3ab79e4a/CERRA-20010101000000_20011231000000.json"
url_2 = "https://gist.githubusercontent.com/clausmichele/6b78a70ef153c4c841401ec0b7d2b75f/raw/e0d2f307b1f8caef7ec19ae68b8100fb7d5f25dd/CERRA-20020101000000_20021231000000.json"

stac_api = pystac_client.stac_api_io.StacApiIO()
stac_dict_1 = json.loads(stac_api.read_text(url_1))
item_1 = stac_api.stac_object_from_dict(stac_dict_1)
stac_dict_2 = json.loads(stac_api.read_text(url_2))
item_2 = stac_api.stac_object_from_dict(stac_dict_2)

items = [item_1, item_2]
datasets_list = []
for item in items:
    for asset in item.assets:
        print(item.assets[asset])
        if item.assets[asset].href.endswith(".json"):
            data = xr.open_dataset(
                "reference://",
                engine="zarr",
                decode_coords="all",
                backend_kwargs={
                    "storage_options": {"fo": item.assets[asset].href},
                    "consolidated": False,
                },
                chunks={},
            ).to_dataarray(dim="bands")
            datasets_list.append(data)

# Need to create one Item per time/netCDF
data = xr.combine_by_coords(datasets_list, combine_attrs="drop_conflicts")
data
```
Thanks for taking the time to write that up @clausmichele, that is an interesting example! For kerchunk, xpystac expects one asset pointing to a kerchunk file; there is no logic for one data variable in each asset of an item. For this reason, you might be better off keeping the combination logic in your code and using the kerchunk engine directly:

```python
import pystac
import xarray as xr

url_1 = "https://gist.githubusercontent.com/clausmichele/28efa0007731044db3a7752da2164fe0/raw/1cba235038f0aa20e16675a863224a4f3ab79e4a/CERRA-20010101000000_20011231000000.json"
url_2 = "https://gist.githubusercontent.com/clausmichele/6b78a70ef153c4c841401ec0b7d2b75f/raw/e0d2f307b1f8caef7ec19ae68b8100fb7d5f25dd/CERRA-20020101000000_20021231000000.json"

item_1 = pystac.read_file(url_1)
item_2 = pystac.read_file(url_2)
items = [item_1, item_2]

datasets_list = []
for item in items:
    for asset in item.assets.values():
        print(asset)
        if asset.href.endswith(".json"):
            data = xr.open_dataset(asset.href, engine="kerchunk", chunks={})
            datasets_list.append(data)

data = xr.combine_by_coords(datasets_list, combine_attrs="drop_conflicts")
data
```

Of course you could do the same thing with xpystac, but you need to make sure that the assets carry a media type and roles that xpystac recognizes:

```python
import pystac
import xarray as xr

url_1 = "https://gist.githubusercontent.com/clausmichele/28efa0007731044db3a7752da2164fe0/raw/1cba235038f0aa20e16675a863224a4f3ab79e4a/CERRA-20010101000000_20011231000000.json"
url_2 = "https://gist.githubusercontent.com/clausmichele/6b78a70ef153c4c841401ec0b7d2b75f/raw/e0d2f307b1f8caef7ec19ae68b8100fb7d5f25dd/CERRA-20020101000000_20021231000000.json"

item_1 = pystac.read_file(url_1)
item_2 = pystac.read_file(url_2)
items = [item_1, item_2]

datasets_list = []
for item in items:
    for asset in item.assets.values():
        print(asset)
        if asset.href.endswith(".json"):
            asset.media_type = "application/json"
            asset.roles = ["index"]
            data = xr.open_dataset(asset, engine="stac", chunks={})
            datasets_list.append(data)

data = xr.combine_by_coords(datasets_list, combine_attrs="drop_conflicts")
data
```

**Possible future work**

I can imagine a world in which xpystac builds up a little bit of its own stacking logic so that you could stack zarr or kerchunk assets -- which I don't think are supported by stackstac or odc-stac. That would make it so (once you have fixed up your media types and roles) you could do:

```python
data = xr.open_dataset(items, engine="stac", stacking_library="xpystac")
```

I wrote up a little proof of concept for this, but I'm not 100% sure it's a good idea to put very much logic into xpystac. Maybe the thing to do would be to get stackstac and odc-stac to support kerchunk themselves.
This returned an error for me:
Probably it's unreleased functionality? Anyway, thanks for the suggestions!
hmm yeah, might not be released, but glad it worked!
`kerchunk` somewhat recently included an `open_dataset` engine. This allows getting rid of the somewhat confusing and certainly hard-to-remember "create a fsspec mapper and feed it to zarr" pattern that was necessary before.

I believe we can make use of this to get `xpystac` to transparently open reference files from STAC items / collections. Note that this would be different from the proposal in #32, which aims to store the references directly in STAC (as far as I can tell, #32 would avoid the additional download step, but wouldn't be able to support enormous reference data or file formats other than JSON).

To make this happen, I think we would need to:

1. construct (and register?) a proper mime type for reference files
2. have `xpystac` detect the mime type and open the references using the pattern above
3. have `xpystac` deal with collection-level assets

I can help with 2, but I don't really have an idea of how to construct (and register?) a proper mime type. 3 would need some discussion, but should be easy to implement.
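Step 2 could start out as a small dispatch helper like the sketch below. The kerchunk media type string is invented for illustration (none is registered), and the file-extension fallback mirrors the `.endswith(".json")` heuristic used elsewhere in this thread:

```python
from typing import Optional

# Hypothetical media type for kerchunk reference files: nothing official
# is registered, so this string is an assumption for the sketch.
KERCHUNK_MEDIA_TYPE = "application/vnd.kerchunk+json"

def pick_engine(href: str, media_type: Optional[str] = None) -> str:
    """Guess which xarray engine should open an asset (step 2 above)."""
    if media_type == KERCHUNK_MEDIA_TYPE or href.endswith((".json", ".parquet")):
        # would become: xr.open_dataset(href, engine="kerchunk", ...)
        return "kerchunk"
    if media_type == "application/vnd+zarr":
        # would become: xr.open_dataset(href, engine="zarr", ...)
        return "zarr"
    # fall back to xpystac's generic handling
    return "stac"

print(pick_engine("CERRA-2001.json"))                     # -> kerchunk
print(pick_engine("store.zarr", "application/vnd+zarr"))  # -> zarr
```

Once a real mime type exists, the extension fallback could be dropped and detection would rest entirely on asset metadata.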
cc @dwest77a, @TomAugspurger