example #2
Conversation
cc #1
Wow that was fast! 😲 Amazing work Martin!
I don't see much of a way around this.
Agreed: the […]. There is no automatic chunking of single files in […]. In my original siphon-to-xarray example, I loaded many individual OpenDAP data files into a single xarray dataset via […].
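(For readers unfamiliar with that pattern, here is a minimal sketch of loading several OpenDAP endpoints into one dataset with xarray's open_mfdataset; the URLs are hypothetical and this is not necessarily the exact call the original example used.)

```python
import xarray as xr

# Hypothetical OpenDAP endpoints exposed by a THREDDS server
urls = [
    "https://thredds.example.org/thredds/dodsC/model/run_2019011300.nc",
    "https://thredds.example.org/thredds/dodsC/model/run_2019011312.nc",
]

# open_mfdataset opens each remote file lazily, concatenates them along
# their shared dimension, and returns a dask-backed dataset
ds = xr.open_mfdataset(urls, chunks={"time": 10})
```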
I think that's fine.
👏
Agreed, thredds is probably more generic and recognizable.
This is great! I agree with @rabernat that calling this intake-thredds might be clearer. Also, I will second @rabernat's question about the to_dask() method. I believe that some of @martindurant's thoughts on this can be found here: intake/intake-xarray#26.
This is cool! Going to check out your code when I get a chance.
However, I'm not sure about use in practice: my experience when hitting OpenDAP servers with open_mfdataset and subsetting with multiple dask workers is that the THREDDS server dies or starts to time out. Usually I end up making a mirror(!)
Hoping to do some comparisons between NetCDF on THREDDS, MinIO S3 and Amazon S3 via FUSE in the coming months.
Wondering about any more recent results from https://github.com/pangeo-data/storage-benchmarks
I have had the opposite experience. Well-configured TDS servers seem to be able to handle many simultaneous requests. I have some experiments in this binder:
This project was abandoned by the intern who was working on it and is not going anywhere. However, I have done some of my own benchmarking. Here is throughput from an ESGF THREDDS server running in Google Cloud, accessed in parallel via xarray / dask [throughput plot]. Here is the same access pattern using zarr reading directly from cloud storage [throughput plot]. You can see that the direct approach gets orders of magnitude higher throughput. More here: https://speakerdeck.com/rabernat/cloud-native-climate-data-with-zarr-and-xarray
Is there a way that I could have known (attached metadata or something) that this was the final level and that the entries below it formed a coordinate grid?
With intake/intake#229, the following syntax does work:
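(Purely as a hypothetical illustration of the kind of chained catalogue access being discussed, with invented entry names; the exact syntax enabled by intake/intake#229 is an assumption here, not a quote from this PR.)

```python
import intake

# Hypothetical top-level catalogue; the entry names below are invented
cat = intake.open_catalog("catalog.yaml")

# Walking nested catalogues one entry at a time...
source = cat["Forecast Model Data"]["GFS Quarter Degree"]["Best"]

# ...or, assuming multi-key item access is supported, in a single step
source = cat["Forecast Model Data", "GFS Quarter Degree", "Best"]

ds = source.to_dask()  # lazily open the selected source
```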
This looks really cool! I'd echo the sentiment about […]. In siphon, I think we're looking at adding some syntax for walking through the catalog more simply (Unidata/siphon#263). Not sure if any similar ideas are applicable to the intake world.
@dopplershift, those ideas for walking the catalog will work as-is with the code here, except that "cat" is an Intake catalogue (but every instance has the siphon cat as an attribute too). Since the names are not valid Python identifiers, you cannot use tab-completion, but if they were you could.
@martindurant IPython has hooks that also enable completion of dictionary keys within [].
@dopplershift, no, I was not saying that. Do you know how IPython fetches the set of potential completions?
Looks like you can define _ipython_key_completions_.
^ Done in master
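(For reference, a minimal sketch of that IPython hook on a hypothetical container class; the names are invented and this is not the code that was merged to master.)

```python
class EntryContainer:
    """Hypothetical stand-in for a catalogue holding named entries."""

    def __init__(self, entries):
        self._entries = dict(entries)

    def __getitem__(self, key):
        return self._entries[key]

    def _ipython_key_completions_(self):
        # IPython calls this to offer completions when you type cat["<TAB>
        return list(self._entries)


cat = EntryContainer({"Forecast Model Data": object(), "Radar Data": object()})
# In an IPython session, typing cat["For<TAB> now completes the full key.
```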
@andersy005, can we change the name of this repo to intake-thredds? I'll change the name of the Python package in this PR, and I think we should merge this close to what is here, so that people can try it, and we can iterate on ideas like @rabernat's about merging several like sources using xarray.
@martindurant, this is done.
Thanks @andersy005. I have globally renamed things within the repo, including places where it probably doesn't matter. What do you think remains to be done in this PR? I'd be keen to merge sooner rather than later, so that we can get something up and usable for experimentation.
This looks good, @martindurant! Are you planning on adding tests to this PR? If not, that's also fine. We can merge this and iterate on tests in future PRs.
I do not know where to go for tests; it doesn't seem like a good idea to test against live THREDDS servers, which may change or go down without notice.
Sounds good. I am going to merge this. By the way, I added you as an Admin to the repo.
For siphon, we’ve used vcrpy as a way to record web requests to e.g. THREDDS servers and play them back for testing purposes.
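(For context, a minimal sketch of the record/replay pattern vcrpy provides; the cassette path, test name, and URL are hypothetical, not siphon's actual test setup.)

```python
import requests
import vcr

# The first run records the HTTP exchange into the cassette file;
# subsequent runs replay it without touching the network.
@vcr.use_cassette("tests/cassettes/thredds_catalog.yaml")
def test_fetch_catalog():
    resp = requests.get("https://thredds.example.edu/thredds/catalog.xml")
    assert resp.status_code == 200
```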
I have used vcrpy for gcsfs, and find it an immense pain to work with!
An alternative may be to use pydap to start up a lightweight OpenDAP server. Pydap also serves THREDDS metadata. Since it is pure Python, it can be launched in a range of ways that are compatible with testing environments.
Thanks @rabernat, that sounds like the best option then.
@martindurant It’s interesting that you feel that way. In the scheme of things that cause me pain in testing and maintaining a CI system, vcrpy doesn’t crack my top 20. Would love to know more (but don’t want to belabor the point). As far as pydap is concerned, you’re now introducing a third-party package here to stand in as a mock for TDS, one whose goal isn’t to be a TDS, just to serve THREDDS-compatible catalogs. I feel like this has the potential to be shaking out issues in pydap rather than testing intake-thredds. Just $0.02 from someone who’s not signing up for more work. 😉
!! It may well be that my vcr setup for gcsfs is not as intended, but I first came across it in the context of adlfs (azure-datalake-storage), and the collaborators there claimed to know vcr well, and I mostly copied their prescription. Perhaps imperfectly.
It’s entirely possible our use of vcrpy is too simplistic to encounter the pain points.
Works:
Notes:
The […]() syntax is an artefact of having reference names that aren't valid Python identifiers. It is certainly possible to get rid of the parentheses, and maybe to include the chain as [.., ..].
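(Purely as a hypothetical sketch of the navigation pattern described under "Works:", with assumed class, URL, and entry names rather than the actual code in this PR.)

```python
import intake_thredds  # the package renamed in this PR

# Hypothetical THREDDS catalogue URL; class name is an assumption
cat = intake_thredds.ThreddsCatalog(
    "https://thredds.example.edu/thredds/catalog.xml"
)

# Names that aren't valid Python identifiers are reached by item access;
# the trailing () instantiates each nested sub-catalogue (the artefact
# mentioned in the notes above)
sub = cat["Forecast Model Data"]()["GFS Quarter Degree"]()

# The proposal is to drop the parentheses and perhaps chain the keys,
# e.g. cat["Forecast Model Data", "GFS Quarter Degree"]
ds = sub["Best"]().to_dask()  # lazily load the selected source with dask
```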