update streaming docs to recommend fsspec (#1575)
bendichter authored Oct 20, 2022
1 parent fdb0297 commit d50e372
Showing 1 changed file with 46 additions and 31 deletions.
77 changes: 46 additions & 31 deletions docs/gallery/advanced_io/streaming.py
@@ -11,7 +11,7 @@
using the dandi API library.
Getting the location of the file on DANDI
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-----------------------------------------
The :py:class:`~dandi.dandiapi.DandiAPIClient` can be used to get the S3 URL of any NWB file stored in the DANDI
Archive. If you have not already, install the latest release of the ``dandi`` package.
@@ -34,36 +34,8 @@
s3_url = asset.get_content_url(follow_redirects=1, strip_query=True)
Streaming Method 1: ROS3
~~~~~~~~~~~~~~~~~~~~~~~~
ROS3 is one of the supported methods for reading data from a remote store. ROS3 stands for "read only S3" and is a
driver created by the HDF5 Group that allows HDF5 to read HDF5 files stored remotely in S3 buckets. Using this method
requires that your HDF5 library is installed with the ROS3 driver enabled. This is not the default configuration,
so you will need to make sure you install the right version of ``h5py`` that has this advanced configuration enabled.
You can install HDF5 with the ROS3 driver from `conda-forge <https://conda-forge.org/>`_ using ``conda``. You may
first need to uninstall a currently installed version of ``h5py``.
.. code-block:: bash
pip uninstall h5py
conda install -c conda-forge "h5py>=3.2"
Now instantiate a :py:class:`~pynwb.NWBHDF5IO` object with the S3 URL and specify the driver as "ros3". This
will download metadata about the file from the S3 bucket to memory. The values of datasets are accessed lazily,
just like when reading an NWB file stored locally. So, slicing into a dataset will require additional time to
download the sliced data (and only the sliced data) to memory.
.. code-block:: python
from pynwb import NWBHDF5IO
with NWBHDF5IO(s3_url, mode='r', load_namespaces=True, driver='ros3') as io:
nwbfile = io.read()
print(nwbfile)
print(nwbfile.acquisition['lick_times'].time_series['lick_left_times'].data[:])
Streaming Method 2: fsspec
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Streaming Method 1: fsspec
--------------------------
fsspec is a data streaming approach that is quite flexible and has several performance advantages. This library
creates a virtual filesystem for remote stores. With this approach, a virtual file is created for the file and
the virtual filesystem layer takes care of requesting data from the S3 bucket whenever data is
@@ -113,6 +85,49 @@
The S3 backend, in particular, may provide additional functionality for accessing data on DANDI. See the
`fsspec documentation on known implementations <https://filesystem-spec.readthedocs.io/en/latest/api.html?highlight=S3#other-known-implementations>`_
for a full updated list of supported store formats.
Streaming Method 2: ROS3
------------------------
ROS3 is one of the supported methods for reading data from a remote store. ROS3 stands for "read only S3" and is a
driver created by the HDF5 Group that allows HDF5 to read HDF5 files stored remotely in S3 buckets. Using this method
requires that your HDF5 library is installed with the ROS3 driver enabled. This is not the default configuration,
so you will need to make sure you install the right version of ``h5py`` that has this advanced configuration enabled.
You can install HDF5 with the ROS3 driver from `conda-forge <https://conda-forge.org/>`_ using ``conda``. You may
first need to uninstall a currently installed version of ``h5py``.
.. code-block:: bash
pip uninstall h5py
conda install -c conda-forge "h5py>=3.2"
Now instantiate a :py:class:`~pynwb.NWBHDF5IO` object with the S3 URL and specify the driver as "ros3". This
will download metadata about the file from the S3 bucket to memory. The values of datasets are accessed lazily,
just like when reading an NWB file stored locally. So, slicing into a dataset will require additional time to
download the sliced data (and only the sliced data) to memory.
.. code-block:: python
from pynwb import NWBHDF5IO
with NWBHDF5IO(s3_url, mode='r', load_namespaces=True, driver='ros3') as io:
nwbfile = io.read()
print(nwbfile)
print(nwbfile.acquisition['lick_times'].time_series['lick_left_times'].data[:])
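Because the ROS3 driver is not part of the default ``h5py`` build, it can help to confirm that it is present before
opening a file. The following is a minimal sketch, not part of the original tutorial: the helper name
``ros3_available`` is hypothetical, and the check assumes that ``h5py.registered_drivers()`` lists ``"ros3"`` only
when the driver was compiled in.

```python
def ros3_available():
    """Return True if the installed h5py build appears to include the ROS3 driver.

    Hypothetical helper for illustration; returns False when h5py is missing.
    """
    try:
        import h5py
    except ImportError:
        return False
    # Assumption: "ros3" is registered only when HDF5 was built with ROS3 support.
    return "ros3" in h5py.registered_drivers()


if ros3_available():
    print("ROS3 driver found: driver='ros3' can be passed to NWBHDF5IO.")
else:
    print("ROS3 driver not found: install h5py from conda-forge as described above.")
```

If the check fails, reinstalling ``h5py`` from conda-forge as shown above is the suggested fix.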
Which streaming method to choose?
---------------------------------
fsspec has several advantages over ROS3:
1. fsspec is easier to install.
2. fsspec supports caching, which will dramatically speed up repeated requests for the
same region of data.
3. fsspec automatically retries when S3 fails to return.
4. fsspec works with other storage backends.
5. fsspec works with other types of files.
6. In our hands, fsspec is faster out-of-the-box.
For these reasons, we recommend fsspec for most Python users.
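The caching advantage in the list above can be sketched as follows. This is an illustrative sketch under stated
assumptions rather than a definitive implementation: the function name ``read_lick_times_cached`` and the
``nwb-cache`` folder are hypothetical, the dataset path is the one used in the ROS3 example, and the HTTP backend of
fsspec is assumed to be installed along with ``h5py`` and ``pynwb``.

```python
def read_lick_times_cached(s3_url, cache_dir="nwb-cache"):
    """Read one dataset over HTTP, caching downloaded data in a local folder.

    Hypothetical helper for illustration. Third-party imports are done inside
    the function so that defining it does not require the packages.
    """
    import fsspec
    import h5py
    from fsspec.implementations.cached import CachingFileSystem
    from pynwb import NWBHDF5IO

    # Wrap the HTTP filesystem so that data fetched once is stored in
    # cache_dir and reused by later requests for the same region.
    fs = CachingFileSystem(
        fs=fsspec.filesystem("http"),
        cache_storage=cache_dir,
    )

    # Open the remote file as a file-like object and hand it to h5py,
    # then pass the open h5py.File to PyNWB.
    with fs.open(s3_url, "rb") as f:
        with h5py.File(f, "r") as file:
            with NWBHDF5IO(file=file, mode="r", load_namespaces=True) as io:
                nwbfile = io.read()
                series = nwbfile.acquisition["lick_times"].time_series["lick_left_times"]
                # Slicing pulls the values into memory before the file closes.
                return series.data[:]
```

Calling ``read_lick_times_cached(s3_url)`` twice with the same URL should let the second call reuse data stored in
the cache folder rather than requesting it from S3 again.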
'''

# sphinx_gallery_thumbnail_path = 'figures/gallery_thumbnails_streaming.png'
