Implement staging store for preparing .nwb.json for upload to cloud #42
Conversation
See https://github.com/magland/lindi-dandi - This provides the DANDI upload functionality
This opens the possibility of having mp4-encoded video in Zarr in a .nwb.json file on DANDI. Here's the example on dandi staging (note: it's not a proper nwb): The script that generated it: The custom codec for mp4/h264:
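(For illustration only — this is not the codec linked above. A rough sketch of how a custom mp4/h264 codec could be registered with numcodecs so that Zarr can refer to it as a compressor; the class name and the encode/decode bodies are placeholders, and a real implementation would use a video library such as PyAV to transcode chunks to and from an h264 stream.)

import numcodecs
from numcodecs.abc import Codec

class Mp4AvcCodec(Codec):
    codec_id = "mp4avc"  # the id that appears as "compressor" in the Zarr metadata

    def encode(self, buf):
        # Placeholder: encode a chunk (frames x height x width x channels)
        # into an mp4/h264 byte stream.
        raise NotImplementedError

    def decode(self, buf, out=None):
        # Placeholder: decode the mp4/h264 byte stream back into the original array.
        raise NotImplementedError

# Registering the codec lets Zarr resolve {"id": "mp4avc"} when reading metadata.
numcodecs.register_codec(Mp4AvcCodec)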
Very cool! So an …
Yes, exactly.
@CodyCBakerPhD To illustrate more clearly, here's the content of that .json file (again, not NWB compliant)
Note the compressor id is "mp4avc" (the custom one), and all those references are to the same file -- the consolidated blob on DANDI. Each chunk within that consolidated blob is mp4 encoded. So you have the advantage of chunked zarr and also the advantage of one big video file.
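(For illustration only — the actual file is linked above, and the paths, URL, chunk keys, offsets, and lengths below are made up. A reference file system of that kind looks roughly like this, with every chunk key mapping to [url, byte offset, byte length] within the same consolidated blob:)

{
  "refs": {
    "acquisition/video/data/.zarray": "{\"chunks\": [...], \"compressor\": {\"id\": \"mp4avc\"}, ...}",
    "acquisition/video/data/0.0.0.0": ["https://api.dandiarchive.org/.../download/", 0, 1234567],
    "acquisition/video/data/1.0.0.0": ["https://api.dandiarchive.org/.../download/", 1234567, 2345678]
  }
}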
Yet to review or provide any meaningful feedback, but just want to note that eventually we would get a .json for each .nwb whenever collating .nwb's into BIDS datasets per https://bids.neuroimaging.io/bep032. So it might get confusing to get …
Thanks ... copying those over: .lindi.zarr.json, .nwb.zarr.json, .nwb.lindi, .nwb.lindi.json, .lindi.json, .lindi. If .nwb.json is not specific enough, then I guess my vote would be .nwb.lindi.json because that would allow for .nwb.lindi.parquet or other formats in the future. The problem with .nwb.zarr.json is that it doesn't specify that it contains lindi-specific attributes for handling scalars, references, compound dtypes, etc.
I renamed these files to .nwb.lindi.json to see how that looks: https://gui-staging.dandiarchive.org/dandiset/213569/draft/files?location=000946%2Fsub-BH494&page=1 What do you think @yarikoptic @rly? If we can decide on this, I can create a PR for dandi to enable the "open in neurosift" menu option.
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #42      +/-   ##
==========================================
+ Coverage   82.33%   83.10%   +0.77%
==========================================
  Files          25       28       +3
  Lines        1715     1930     +215
==========================================
+ Hits         1412     1604     +192
- Misses        303      326      +23

☔ View full report in Codecov by Sentry.
I guess it would not hurt us at all to "add support" for …
One advantage of using these LINDI files is the initial load time when loading remotely. For example, this takes around 10 seconds on my machine to fully load the metadata (loading from hdf5). If I load from the .nwb.lindi.json, then this reduces to less than 1 second.

A second advantage is that we can use custom compression codecs that are not available in HDF5. For example, this allows including mp4-compressed video data, rather than requiring external links to .mp4 files. I realize it's ironic that we are solving one type of external link by using another, but the LINDI approach is much more seamless. Here's an example: click on acquisition, and then raw_suite2p_motion_corrected_compressed.

A third advantage is that you can create derived NWB files that include the same data as the upstream file without any actual data duplication, as described earlier in this thread.

A fourth advantage is that you can make corrections to an NWB file (attributes, etc.) and then re-upload without needing to re-upload the bulk of the data.

Finally, there are reasons to believe that remote streaming of actual data chunks will be a lot more efficient for LINDI compared with HDF5. But I don't have any examples showcasing that as of yet.
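(For illustration of the fast-load case, a minimal sketch, assuming LindiH5pyFile is importable from the lindi package and that from_reference_file_system takes a file name as shown later in this thread; the file name below is a placeholder.)

import lindi

# Load from the small .nwb.lindi.json: only this JSON has to be read before
# the full metadata tree is available, which is why the load is sub-second
# compared with the many round trips needed for remote HDF5 metadata.
f = lindi.LindiH5pyFile.from_reference_file_system("sub-BH494.nwb.lindi.json")  # placeholder name
print(list(f["acquisition"].keys()))  # h5py-style access to groups and datasets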
with tempfile.TemporaryDirectory() as tmpdir:
    rfs_fname = f"{tmpdir}/rfs.json"
    with open(rfs_fname, 'w') as f:
        json.dump(rfs, f, indent=2, sort_keys=True)
Should this and
lindi/lindi/LindiH5ZarrStore/LindiH5ZarrStore.py
Lines 463 to 473 in bc71a9d
def to_file(self, file_name: str, *, file_type: Literal["zarr.json"] = "zarr.json"):
    """Write a reference file system corresponding to this store to a file.
    This can then be loaded using LindiH5pyFile.from_reference_file_system(file_name)
    """
    if file_type != "zarr.json":
        raise Exception(f"Unsupported file type: {file_type}")
    ret = self.to_reference_file_system()
    with open(file_name, "w") as f:
        json.dump(ret, f, indent=2)
use the same json.dump args?
Done.
A reference file system works well for making small changes or edits to attributes. But when it comes time to add large datasets and binary chunks, those should not be embedded into the JSON. With this PR there is a mechanism to specify a staging area (a temporary directory on the local machine) where binary chunks are stored as the reference file system is edited. Then, prior to uploading to the cloud (e.g., DANDI), the chunks are consolidated into larger blocks (so that we don't have thousands or even hundreds of thousands of files to upload). Then the data files are uploaded and the reference file system is adjusted so the references point to the URLs of the newly uploaded files rather than to local file paths.
The actual mechanism for upload is not provided by lindi. I think it makes sense to keep that part of it separate, so that lindi is not fundamentally tied to DANDI.
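(To make the last step above concrete, here is a schematic sketch — not lindi's actual code — of rewriting references from local staging paths to the URLs of the uploaded consolidated blobs. It assumes the reference file system layout shown earlier in this thread, where chunk entries look like [path_or_url, offset, length]; the paths and URL in the usage comment are hypothetical.)

import json

def remap_references(rfs: dict, local_path_to_url: dict) -> dict:
    """Return a copy of the reference file system in which chunk references
    that point at local staging files are rewritten to uploaded blob URLs."""
    out = {"refs": {}}
    for key, ref in rfs["refs"].items():
        # Chunk entries look like [path_or_url, offset, length]; inline
        # metadata values (plain strings) are passed through unchanged.
        if isinstance(ref, list) and ref and ref[0] in local_path_to_url:
            out["refs"][key] = [local_path_to_url[ref[0]], *ref[1:]]
        else:
            out["refs"][key] = ref
    return out

# Hypothetical usage after the consolidated blocks have been uploaded:
# with open("staged.nwb.lindi.json") as f:
#     rfs = json.load(f)
# mapping = {"/tmp/staging/blocks/000.dat": "https://.../blobs/abc123"}
# with open("final.nwb.lindi.json", "w") as f:
#     json.dump(remap_references(rfs, mapping), f, indent=2)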
Here's an example script showing how this is intended to be used.
This script resulted in the following files on DANDI (staging site): https://gui-staging.dandiarchive.org/dandiset/213569/draft/files?location=000946%2Fsub-BH494&page=1
These are small .nwb.json files that reference data blobs from (a) the original nwb/hdf5 file on DANDISET 000946 and (b) new blobs for the newly-added autocorrelogram column in the Units table.
I wasn't sure where to put the blob on DANDI, so for now, it's here:
https://gui-staging.dandiarchive.org/dandiset/213569/draft/files?location=sha1&page=1
We'll need to discuss what makes sense for that. A lot of things to consider here.
Here's the supplemented file on Neurosift
https://neurosift.app/?p=/nwb&url=https://api-staging.dandiarchive.org/api/assets/64aa19bd-496f-47be-aa72-2b1cedb9b20a/download/&st=lindi&dandisetId=213569&dandisetVersion=draft
Expand units and then click on "autocorrelograms". You can also explore in the RAW tab.