
Update iterative write and parallel I/O tutorial #1633

Merged
11 commits merged into dev on Jan 11, 2023

Conversation

oruebel
Contributor

@oruebel oruebel commented Jan 11, 2023

Motivation

This PR is related to HDMF #623 and fixes #1514

  • Update the iterative write tutorial to:
    • mention GenericDataChunkIterator and crosslink to the corresponding tutorial on HDMF
    • use the new HDF5IO.dataset property to avoid having to close and reopen a file
    • rename the tutorial to add the plot_ prefix so that outputs are captured directly from the tutorial rather than being hardcoded in it
  • Update the parallel I/O tutorial to use HDF5IO to set up a dataset in a file rather than an empty DataChunkIterator
  • Update the Makefile for the docs to clean up files generated by the advanced_io tutorial
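The parallel I/O change replaces an empty DataChunkIterator with a dataset that is pre-allocated up front and then filled slice-by-slice by independent writers. The shape of that pattern can be sketched with a numpy memmap standing in for the HDF5 dataset (the file name, shapes, and the serial "rank" loop here are hypothetical stand-ins; real parallel HDF5 writes would use MPI ranks):

```python
import numpy as np

shape = (4, 10)
fname = "parallel_pattern_demo.dat"

# Step 1 (normally done once, collectively): pre-allocate the full dataset.
mm = np.memmap(fname, dtype="float64", mode="w+", shape=shape)
mm.flush()
del mm

# Step 2 (normally one MPI rank each): every writer fills only its own,
# disjoint slice of the pre-allocated dataset.
for rank in range(shape[0]):
    view = np.memmap(fname, dtype="float64", mode="r+", shape=shape)
    view[rank, :] = rank  # each "rank" writes only its own row
    view.flush()
    del view

# Read back and confirm each row holds its writer's id.
result = np.array(np.memmap(fname, dtype="float64", mode="r", shape=shape))
print(result[:, 0])  # → [0. 1. 2. 3.]
```

The key point the tutorial change illustrates is that the dataset's full shape must be known and allocated before the writers start, since the writers only fill slices.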

How to test the behavior?

Build the docs

Checklist

  • Did you update CHANGELOG.md with your changes?
  • Have you checked our Contributing document?
  • Have you ensured the PR clearly describes the problem and the solution?
  • Is your contribution compliant with our coding style? This can be checked by running flake8 from the source directory.
  • Have you checked to ensure that there aren't other open Pull Requests for the same change?
  • Have you included the relevant issue number using "Fix #XXX" notation where XXX is the issue number? By including "Fix #XXX" you allow GitHub to close issue #XXX when the PR is merged.

@codecov

codecov bot commented Jan 11, 2023

Codecov Report

Merging #1633 (1251ed3) into dev (f4bbbd6) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##              dev    #1633   +/-   ##
=======================================
  Coverage   91.31%   91.31%           
=======================================
  Files          25       25           
  Lines        2534     2534           
  Branches      481      481           
=======================================
  Hits         2314     2314           
  Misses        139      139           
  Partials       81       81           
Flag          Coverage   Δ
integration   70.44%     <ø> (ø)
unit          84.37%     <ø> (ø)

Flags with carried forward coverage won't be shown.


@oruebel oruebel marked this pull request as ready for review January 11, 2023 01:54
@oruebel
Contributor Author

oruebel commented Jan 11, 2023

@CodyCBakerPhD while going through the iterative write tutorial to fix a few issues, I noticed that we were not discussing the GenericDataChunkIterator here. I added a few references to the corresponding tutorial; however, it would be nice to also show the usage of GenericDataChunkIterator here. I think we could update the "Convert large binary data arrays" section of the tutorial to use GenericDataChunkIterator instead of DataChunkIterator. If you have time, could you add those changes to this PR? It should be a fairly simple change, but since you are the expert on GenericDataChunkIterator, it would be best if you could make it.
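For context on the buffering that the tutorial below relies on: wrapping a per-row generator with `buffer_size=10` groups single rows into (10, 10) slabs before each write. DataChunkIterator does this internally; the following is only a pure-numpy illustration of that grouping step, with all names (`row_gen`, `buffered`) hypothetical:

```python
import numpy as np
from itertools import islice

def row_gen(arr):
    """Yield one shape-(ncol,) row at a time, like the tutorial's generator."""
    for row in arr:
        yield row

def buffered(gen, buffer_size):
    """Group single rows from a generator into (buffer_size, ncol) chunks,
    mimicking what a buffer_size setting does for a row-wise iterator."""
    it = iter(gen)
    while True:
        block = list(islice(it, buffer_size))
        if not block:
            return
        yield np.stack(block)

arr = np.arange(1000.0).reshape(100, 10)
chunks = list(buffered(row_gen(arr), buffer_size=10))
print(len(chunks), chunks[0].shape)  # → 10 (10, 10)
```

Concatenating the chunks reproduces the original array, which is why the chunked write path and the all-in-memory path yield the same stored data.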

####################
# Example: Convert large binary data arrays
# -----------------------------------------------------
#
# When converting large data files, a typical problem is that it is often too expensive to load all the data
# into memory. This example is very similar to the data generator example, except that instead of generating
# data on-the-fly in memory, we load data from a file one chunk at a time in our generator.
#
####################
# Create example data
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
import numpy as np
# Create the test data
datashape = (100, 10)  # OK, this is not really large, but we just want to show how it works
num_values = np.prod(datashape)
arrdata = np.arange(num_values).reshape(datashape)
# Write the test data to disk
temp = np.memmap('basic_sparse_iterwrite_testdata.npy', dtype='float64', mode='w+', shape=datashape)
temp[:] = arrdata
del temp # Flush to disk
####################
# Step 1: Create a generator for our array
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#
# Note, we here use a generator for simplicity but we could equally well also implement our own
# :py:class:`~hdmf.data_utils.AbstractDataChunkIterator`.
def iter_largearray(filename, shape, dtype='float64'):
    """
    Generator reading [chunk_size, :] elements from our array in each iteration.
    """
    for i in range(shape[0]):
        # Open the file and read the next chunk
        newfp = np.memmap(filename, dtype=dtype, mode='r', shape=shape)
        curr_data = newfp[i:(i + 1), ...][0]
        del newfp  # Reopen the file in each iteration to prevent accumulation of data in memory
        yield curr_data
    return
####################
# Step 2: Wrap the generator in a DataChunkIterator
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#
from hdmf.data_utils import DataChunkIterator
data = DataChunkIterator(data=iter_largearray(filename='basic_sparse_iterwrite_testdata.npy',
                                              shape=datashape),
                         maxshape=datashape,
                         buffer_size=10)  # Buffer 10 elements into a chunk, i.e., create chunks of shape (10, 10)
####################
# Step 3: Write the data as usual
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#
write_test_file(filename='basic_sparse_iterwrite_largearray.nwb',
                data=data)
####################
# .. tip::
#
#    Again, if we want to explicitly control how our data will be chunked (compressed, etc.)
#    in the HDF5 file, then we need to wrap our :py:class:`~hdmf.data_utils.DataChunkIterator`
#    using :py:class:`~hdmf.backends.hdf5.h5_utils.H5DataIO`.
####################
# Discussion
# ^^^^^^^^^^
# Let's verify that our data was written correctly
# Read the NWB file
from pynwb import NWBHDF5IO # noqa: F811
with NWBHDF5IO('basic_sparse_iterwrite_largearray.nwb', 'r') as io:
    nwbfile = io.read()
    data = nwbfile.get_acquisition('synthetic_timeseries').data
    # Compare all the data values of our two arrays
    data_match = np.all(arrdata == data[:])  # Don't do this for very large arrays!
# Print result message
if data_match:
    print("Success: All data values match")
else:
    print("ERROR: Mismatch between data")
####################
# ``[Out]:``
#
# .. code-block:: python
#
#    Success: All data values match
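The tutorial's verification step notes that `np.all(arrdata == data[:])` should not be used for very large arrays, since `data[:]` loads everything into memory. The same chunking idea applies to verification: compare slab-by-slab so only one chunk is resident at a time. A minimal numpy-only sketch (the helper name `chunks_equal` and the chunk size are hypothetical):

```python
import numpy as np

def chunks_equal(a, b, rows_per_chunk=10):
    """Compare two array-likes row-slab by row-slab to bound memory use.

    Works for any objects supporting .shape and row slicing, e.g. an
    in-memory array vs. a lazily-read HDF5 dataset.
    """
    if a.shape != b.shape:
        return False
    for start in range(0, a.shape[0], rows_per_chunk):
        stop = start + rows_per_chunk
        if not np.array_equal(np.asarray(a[start:stop]), np.asarray(b[start:stop])):
            return False
    return True

x = np.arange(1000).reshape(100, 10)
y = x.copy()
print(chunks_equal(x, y))  # → True (identical arrays)
y[57, 3] += 1
print(chunks_equal(x, y))  # → False (one value differs)
```

Passing the lazy dataset handle itself (rather than `data[:]`) as one of the arguments keeps peak memory proportional to a single chunk.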

@oruebel oruebel added category: enhancement improvements of code or code behavior priority: medium non-critical problem and/or affecting only a small set of NWB users topic: docs issues related to documentation labels Jan 11, 2023
@oruebel oruebel added this to the Next Release milestone Jan 11, 2023
@oruebel oruebel merged commit 8395176 into dev Jan 11, 2023
@oruebel oruebel deleted the update/iter_write_tutorial branch January 11, 2023 02:31
@oruebel
Contributor Author

oruebel commented Jan 11, 2023

@rly thanks for the fixes

mavaylon1 added a commit that referenced this pull request Jan 17, 2023
* Check nwb_version on read (#1612)

* Added NWBHDF5IO.nwb_version property and check for version on NWBHDF5IO.read
* Updated icephys tests to skip version check when writing non NWBFile container
* Add tests for NWB version check on read
* Add unit tests for NWBHDF5IO.nwb_version property
* Updated changelog

Co-authored-by: Ryan Ly <[email protected]>

* Bump setuptools from 65.4.1 to 65.5.1 (#1614)

Bumps [setuptools](https://github.com/pypa/setuptools) from 65.4.1 to 65.5.1.
- [Release notes](https://github.com/pypa/setuptools/releases)
- [Changelog](https://github.com/pypa/setuptools/blob/main/CHANGES.rst)
- [Commits](pypa/setuptools@v65.4.1...v65.5.1)

---
updated-dependencies:
- dependency-name: setuptools
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* modify export.rst to have proper links to the NWBFile API docs (#1615)

* Create project_action.yml (#1617)

* Create project_action.yml

* Update project_action.yml

* Update project_action.yml

* Update project_action.yml (#1620)

* Update project_action.yml (#1623)

* Project action (#1626)

* Create project_action.yml

* Update project_action.yml

* Update project_action.yml

* Update project_action.yml

* Show recommended usage for hdf5plugin in tutorial (#1630)

* Show recommended usage for hdf5plugin in tutorial

* Update docs/gallery/advanced_io/h5dataio.py

* Update docs/gallery/advanced_io/h5dataio.py

Co-authored-by: Heberto Mayorquin <[email protected]>

Co-authored-by: Ben Dichter <[email protected]>
Co-authored-by: Heberto Mayorquin <[email protected]>

* Update iterative write and parallel I/O tutorial (#1633)

* Update iterative write tutorial
* Update doc makefiles to clean up files created by the advanced io tutorial
* Fix #1514  Update parallel I/O tutorial to use H5DataIO instead of DataChunkIterator to setup data for parallel write
* Update changelog
* Fix flake8
* Fix broken external links
* Update make.bat
* Update CHANGELOG.md
* Update plot_iterative_write.py
* Update docs/gallery/advanced_io/plot_iterative_write.py

Co-authored-by: Ryan Ly <[email protected]>

* Update project_action.yml (#1632)

* nwb_schema_2.6.0

* Update CHANGELOG.md

* remove

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Oliver Ruebel <[email protected]>
Co-authored-by: Ryan Ly <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Ben Dichter <[email protected]>
Co-authored-by: Heberto Mayorquin <[email protected]>
Development

Successfully merging this pull request may close these issues.

[Documentation]: Update parallel I/O and iterative write tutorial