
Implement staging store for preparing .nwb.json for upload to cloud #42

Merged: 13 commits into main from staging-store, Apr 19, 2024

Conversation

@magland (Collaborator) commented Apr 5, 2024

A reference file system works well for making small changes or edits to attributes. But when it comes time to add large datasets and binary chunks, those should not be embedded in the JSON. This PR adds a mechanism to specify a staging area (a temporary directory on the local machine) where binary chunks are stored as the reference file system is edited. Then, prior to uploading to the cloud (e.g., DANDI), the chunks are consolidated into larger blocks (so that we don't have thousands or even hundreds of thousands of files to upload). Finally, the data files are uploaded and the reference file system is adjusted so that the references point to the URLs of the newly uploaded files rather than to local file paths.
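Conceptually, the consolidation step amounts to something like the following sketch (not lindi's actual code; the function name and the use of a single output blob are simplifications for illustration):

import os


def consolidate_chunks(chunk_files: list, blob_path: str) -> dict:
    # Concatenate the staged chunk files into one blob and return, for each
    # chunk, a [blob_path, offset, size] triple to substitute into the
    # reference file system in place of the local file path.
    refs = {}
    offset = 0
    with open(blob_path, 'wb') as blob:
        for chunk_file in chunk_files:
            size = os.path.getsize(chunk_file)
            with open(chunk_file, 'rb') as f:
                blob.write(f.read())
            refs[os.path.basename(chunk_file)] = [blob_path, offset, size]
            offset += size
    return refs

After upload, each blob_path is then replaced by the blob's download URL, which is what the JSON example later in this thread shows.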

The actual mechanism for upload is not provided by lindi. I think it makes sense to keep that part separate, so that lindi is not fundamentally tied to DANDI.

Here's an example script showing how this is intended to be used.

import hashlib
import lindi
from lindi_dandi import create_asset  # DANDI upload helper, kept separate from lindi
from lindi.LindiStagingStore import StagingArea
from helpers.add_autocorrelograms import add_autocorrelograms


lindi_tests_dandiset_id = '213569'
lindi_tests_dandiset_version = 'draft'


def lindi_test_2():
    # path = '000946/sub-BH494/sub-BH494_ses-20230817T132836_ecephys.nwb'
    download_url = 'https://api-staging.dandiarchive.org/api/assets/11b45116-807e-4ba0-82cd-95530c3b095f/download/'

    with StagingArea.create(base_dir='lindi_staging') as staging_area:
        client = lindi.LindiH5pyFile.from_reference_file_system(download_url, mode='r+', staging_area=staging_area)
        add_autocorrelograms(client)
        upload_nwb_to_dandi(
            nwb_client=client,
            dandiset_id=lindi_tests_dandiset_id,
            dandiset_version=lindi_tests_dandiset_version,
            path='000946/sub-BH494/sub-BH494_ses-20230817T132836_desc-autocorrelograms_ecephys.nwb.json'
        )


def upload_nwb_to_dandi(
    nwb_client: lindi.LindiH5pyFile,
    dandiset_id: str,
    dandiset_version: str,
    path: str
):
    staging_store = nwb_client.staging_store
    if staging_store is None:
        raise ValueError("NWB client does not have a staging store")

    def on_store_blob(filename: str):
        sha1 = _compute_sha1_of_file(filename)
        a = create_asset(
            dandiset_id=dandiset_id,
            dandiset_version=dandiset_version,
            local_filename=filename,
            asset_path=f'sha1/{sha1}',
            leave_existing=True
        )
        assert isinstance(a, dict)
        return a['download_url']

    def on_store_main(filename: str):
        a = create_asset(
            dandiset_id=dandiset_id,
            dandiset_version=dandiset_version,
            local_filename=filename,
            asset_path=path,
            replace_existing=True
        )
        assert isinstance(a, dict)
        return a['download_url']

    staging_store.upload(
        on_store_blob=on_store_blob,
        on_store_main=on_store_main,
        consolidate_chunks=True
    )


def _compute_sha1_of_file(filename: str):
    sha1 = hashlib.sha1()
    with open(filename, 'rb') as f:
        while True:
            data = f.read(65536)
            if not data:
                break
            sha1.update(data)
    return sha1.hexdigest()


if __name__ == '__main__':
    lindi_test_2()

This script resulted in the following files on DANDI (staging site): https://gui-staging.dandiarchive.org/dandiset/213569/draft/files?location=000946%2Fsub-BH494&page=1


These are small .nwb.json files that reference data blobs from (a) the original NWB/HDF5 file on Dandiset 000946 and (b) new blobs for the newly added autocorrelogram column in the Units table.

I wasn't sure where to put the blob on DANDI, so for now, it's here:
https://gui-staging.dandiarchive.org/dandiset/213569/draft/files?location=sha1&page=1

We'll need to discuss what makes sense for that; there are a lot of things to consider here.

Here's the supplemented file on Neurosift
https://neurosift.app/?p=/nwb&url=https://api-staging.dandiarchive.org/api/assets/64aa19bd-496f-47be-aa72-2b1cedb9b20a/download/&st=lindi&dandisetId=213569&dandisetVersion=draft

Expand units and then click on "autocorrelograms". You can also explore in the RAW tab.

@magland (Collaborator, Author) commented Apr 5, 2024

See https://github.com/magland/lindi-dandi - this provides the DANDI upload functionality.

@CodyCBakerPhD commented

Very cool!

So an ImageSeries group written with the Zarr backend through LINDI could have a data field that looks like this?

@magland (Collaborator, Author) commented Apr 5, 2024

> So an ImageSeries group written with the Zarr backend through LINDI could have a data field that looks like this?

Yes exactly

@magland (Collaborator, Author) commented Apr 5, 2024

@CodyCBakerPhD To illustrate more clearly, here's the content of that .json file (again, not NWB compliant):

{
  "refs": {
    ".zgroup": {
      "zarr_format": 2
    },
    "group1/.zgroup": {
      "zarr_format": 2
    },
    "group1/test_video/.zattrs": {
      "neurodata_type": "test_video"
    },
    "group1/test_video/.zgroup": {
      "zarr_format": 2
    },
    "group1/test_video/data/.zarray": {
      "chunks": [
        2000,
        480,
        640,
        3
      ],
      "compressor": {
        "fps": 30,
        "id": "mp4avc"
      },
      "dtype": "|u1",
      "fill_value": 0,
      "filters": null,
      "order": "C",
      "shape": [
        15521,
        480,
        640,
        3
      ],
      "zarr_format": 2
    },
    "group1/test_video/data/0.0.0.0": [
      "https://api-staging.dandiarchive.org/api/assets/c0dd3974-2746-48be-b95a-1cc3766e0403/download/",
      19914396,
      5292428
    ],
    "group1/test_video/data/1.0.0.0": [
      "https://api-staging.dandiarchive.org/api/assets/c0dd3974-2746-48be-b95a-1cc3766e0403/download/",
      13935646,
      3546920
    ],
    "group1/test_video/data/2.0.0.0": [
      "https://api-staging.dandiarchive.org/api/assets/c0dd3974-2746-48be-b95a-1cc3766e0403/download/",
      0,
      3619104
    ],
    "group1/test_video/data/3.0.0.0": [
      "https://api-staging.dandiarchive.org/api/assets/c0dd3974-2746-48be-b95a-1cc3766e0403/download/",
      6994064,
      3641331
    ],
    "group1/test_video/data/4.0.0.0": [
      "https://api-staging.dandiarchive.org/api/assets/c0dd3974-2746-48be-b95a-1cc3766e0403/download/",
      3619104,
      3374960
    ],
    "group1/test_video/data/5.0.0.0": [
      "https://api-staging.dandiarchive.org/api/assets/c0dd3974-2746-48be-b95a-1cc3766e0403/download/",
      25206824,
      3368001
    ],
    "group1/test_video/data/6.0.0.0": [
      "https://api-staging.dandiarchive.org/api/assets/c0dd3974-2746-48be-b95a-1cc3766e0403/download/",
      10635395,
      3300251
    ],
    "group1/test_video/data/7.0.0.0": [
      "https://api-staging.dandiarchive.org/api/assets/c0dd3974-2746-48be-b95a-1cc3766e0403/download/",
      17482566,
      2431830
    ]
  },
  "version": 1
}

Note that the compressor id is "mp4avc" (the custom one), and all of those references point to the same file: the consolidated blob on DANDI. Each chunk within that consolidated blob is mp4-encoded, so you get the advantage of chunked Zarr along with the advantage of one big video file.
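For readers unfamiliar with the reference convention: each triple is [url, offset, size], and a chunk is retrieved with an HTTP range request against the consolidated blob. A minimal sketch (fetch_chunk is a hypothetical helper, not part of lindi; in practice the zarr/fsspec machinery does this for you):

import requests


def fetch_chunk(ref: list) -> bytes:
    # Fetch one compressed chunk given a [url, offset, size] reference
    url, offset, size = ref
    headers = {'Range': f'bytes={offset}-{offset + size - 1}'}
    r = requests.get(url, headers=headers)
    r.raise_for_status()
    return r.content


# e.g. the first chunk of group1/test_video/data from the refs above
chunk = fetch_chunk([
    'https://api-staging.dandiarchive.org/api/assets/c0dd3974-2746-48be-b95a-1cc3766e0403/download/',
    19914396,
    5292428,
])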

@yarikoptic (Contributor) commented

I have yet to review or provide any meaningful feedback, but I just want to note that eventually we would get a .json for each .nwb anyway whenever collating .nwb files into BIDS datasets per https://bids.neuroimaging.io/bep032. So it might get confusing to have both .nwb.json and .json. FWIW, our prior discussion of possible (other) extensions is at

@magland (Collaborator, Author) commented Apr 6, 2024

> I have yet to review or provide any meaningful feedback, but I just want to note that eventually we would get a .json for each .nwb anyway whenever collating .nwb files into BIDS datasets per https://bids.neuroimaging.io/bep032. So it might get confusing to have both .nwb.json and .json. FWIW, our prior discussion of possible (other) extensions is at

Thanks ... copying those over: .lindi.zarr.json, .nwb.zarr.json, .nwb.lindi, .nwb.lindi.json, .lindi.json, .lindi

If .nwb.json is not specific enough, then I guess my vote would be .nwb.lindi.json, because that would allow for .nwb.lindi.parquet or others in the future. The problem with .nwb.zarr.json is that it doesn't indicate that the file contains lindi-specific attributes for handling scalars, references, compound dtypes, etc.

@magland (Collaborator, Author) commented Apr 9, 2024

I renamed these files to .nwb.lindi.json to see how that looks:

https://gui-staging.dandiarchive.org/dandiset/213569/draft/files?location=000946%2Fsub-BH494&page=1

What do you think, @yarikoptic @rly?

If we can decide on this, I can create a PR for dandi to enable the "open in neurosift" menu option.

@codecov-commenter commented Apr 9, 2024

Codecov Report

Attention: Patch coverage is 86.34361% with 31 lines in your changes missing coverage. Please review.

Project coverage is 83.10%. Comparing base (bc71a9d) to head (5b4b6f4).
Report is 143 commits behind head on main.

Files                                          Patch %   Lines
lindi/LindiStagingStore/LindiStagingStore.py   85.44%    23 Missing ⚠️
lindi/LindiStagingStore/StagingArea.py         86.11%     5 Missing ⚠️
lindi/LindiH5pyFile/LindiH5pyFile.py           90.47%     2 Missing ⚠️
lindi/LindiH5ZarrStore/LindiH5ZarrStore.py     83.33%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #42      +/-   ##
==========================================
+ Coverage   82.33%   83.10%   +0.77%     
==========================================
  Files          25       28       +3     
  Lines        1715     1930     +215     
==========================================
+ Hits         1412     1604     +192     
- Misses        303      326      +23     


@magland requested a review from rly on April 12, 2024
@yarikoptic (Contributor) commented

I guess it would not hurt us at all to "add support" for .nwb.lindi.json, even if just to get a way to see the difference between operating on them instead of directly on .nwb files. Would you happen to have some strikingly good example of the performance gains that would come thanks to such external indexing?

@magland (Collaborator, Author) commented Apr 15, 2024

> I guess it would not hurt us at all to "add support" for .nwb.lindi.json, even if just to get a way to see the difference between operating on them instead of directly on .nwb files. Would you happen to have some strikingly good example of the performance gains that would come thanks to such external indexing?

One advantage of using these LINDI files is the initial load time when loading remotely. For example, this takes around 10 seconds on my machine to fully load the metadata (loading from HDF5):

https://neurosift.app/?p=/nwb&dandisetId=000409&dandisetVersion=draft&url=https://api.dandiarchive.org/api/assets/c04f6b30-82bf-40e1-9210-34f0bcd8be24/download/&lindi=0

If I load from the .nwb.lindi.json instead, this reduces to less than 1 second:

https://neurosift.app/?p=/nwb&dandisetId=000409&dandisetVersion=draft&url=https://api.dandiarchive.org/api/assets/c04f6b30-82bf-40e1-9210-34f0bcd8be24/download/&lindi=1
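To time that comparison outside of Neurosift, one could do something like this (a sketch; lindi_json_url is a placeholder, since the .nwb.lindi.json asset URL is not linked directly above):

import time
import lindi

# Placeholder for the .nwb.lindi.json asset corresponding to the HDF5 file above
lindi_json_url = 'https://example.org/sub-XX_ecephys.nwb.lindi.json'

t0 = time.time()
client = lindi.LindiH5pyFile.from_reference_file_system(lindi_json_url, mode='r')
print(f'Loaded metadata in {time.time() - t0:.2f} s')  # ~1 s, vs ~10 s for remote HDF5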

A second advantage is that we can use custom compression codecs that are not available in HDF5. For example, this allows including mp4-compressed video data rather than requiring external links to .mp4 files. I realize it's ironic that we are solving one type of external link by using another, but the LINDI approach is much more seamless. Here's an example:

https://neurosift.app/?p=/nwb&url=https://fsbucket-dendro.flatironinstitute.org/dendro-outputs/2d2b13fd/2d2b13fd.ec254350/output&dandisetId=000871&dandisetVersion=draft&dandiAssetId=407d6e5f-2a16-4879-b38f-b74cad7b9e24&st=lindi

Click on acquisition, and then raw_suite2p_motion_corrected_compressed.
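For context on how a custom codec plugs into the Zarr stack: numcodecs allows registering a codec class under an id such as "mp4avc", which zarr then looks up when it encounters that compressor id in .zarray. The sketch below is illustrative only: the class name, the decode-only design, and the use of PyAV are assumptions, and lindi's actual mp4avc codec may be implemented quite differently.

import io

import av  # PyAV, assumed here for H.264 decoding
import numpy as np
from numcodecs import register_codec
from numcodecs.abc import Codec


class Mp4AvcCodec(Codec):
    # Illustrative decode-only codec registered under the id "mp4avc"
    codec_id = 'mp4avc'

    def __init__(self, fps: float = 30):
        self.fps = fps

    def encode(self, buf):
        raise NotImplementedError('encoding omitted from this sketch')

    def decode(self, buf, out=None):
        # Decode one mp4-encoded chunk into a (frames, height, width, 3) uint8 array
        frames = []
        with av.open(io.BytesIO(bytes(buf))) as container:
            for frame in container.decode(video=0):
                frames.append(frame.to_ndarray(format='rgb24'))
        arr = np.stack(frames)
        if out is not None:
            out[...] = arr.reshape(out.shape)
            return out
        return arr


register_codec(Mp4AvcCodec)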

A third advantage is that you can create derived NWB files that include the same data as the upstream file without any actual data duplication, as described earlier in this thread.

A fourth advantage is that you can make corrections to an NWB file (attributes, etc.) and then re-upload it without needing to re-upload the bulk of the data; see the sketch at the end of this comment.

Finally, there are reasons to believe that remote streaming of actual data chunks will be a lot more efficient with LINDI than with HDF5, but I don't have any examples showcasing that yet.
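The fourth advantage can reuse the exact API from the example script in the PR description (a sketch; the URL, the attribute being corrected, and the asset path are placeholders):

import lindi
from lindi.LindiStagingStore import StagingArea

# Placeholder for an existing .nwb.lindi.json asset
lindi_json_url = 'https://example.org/sub-XX_ecephys.nwb.lindi.json'

with StagingArea.create(base_dir='lindi_staging') as staging_area:
    client = lindi.LindiH5pyFile.from_reference_file_system(
        lindi_json_url, mode='r+', staging_area=staging_area
    )
    # Correct an attribute (hypothetical path/name); attribute edits live in
    # the JSON itself, so no new data chunks are staged
    client['session_start_time'].attrs['comments'] = 'corrected'
    # Re-upload using the helper from the example script above; since no new
    # chunks were staged, presumably only the small main file is re-uploaded
    upload_nwb_to_dandi(
        nwb_client=client,
        dandiset_id='213569',
        dandiset_version='draft',
        path='sub-XX_ecephys.nwb.lindi.json'
    )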

Comment on lines 140 to 143:

with tempfile.TemporaryDirectory() as tmpdir:
    rfs_fname = f"{tmpdir}/rfs.json"
    with open(rfs_fname, 'w') as f:
        json.dump(rfs, f, indent=2, sort_keys=True)
@rly (Contributor) commented Apr 19, 2024

Should this and

def to_file(self, file_name: str, *, file_type: Literal["zarr.json"] = "zarr.json"):
    """Write a reference file system corresponding to this store to a file.
    This can then be loaded using LindiH5pyFile.from_reference_file_system(file_name)
    """
    if file_type != "zarr.json":
        raise Exception(f"Unsupported file type: {file_type}")
    ret = self.to_reference_file_system()
    with open(file_name, "w") as f:
        json.dump(ret, f, indent=2)

use the same function, so that we have consistent file name suffixes and json.dump args?

@magland (Collaborator, Author) replied:

Done.
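For concreteness, the consolidated helper might look something like this (a sketch; _write_rfs_to_file is a hypothetical name, and the actual function in the merged commit may differ):

import json


def _write_rfs_to_file(rfs: dict, file_name: str):
    # Single place that writes a reference file system to JSON, so that
    # to_file and the staging store use the same json.dump arguments
    with open(file_name, 'w') as f:
        json.dump(rfs, f, indent=2, sort_keys=True)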

@magland merged commit 40c94a1 into main on Apr 19, 2024
6 checks passed
@magland deleted the staging-store branch on April 19, 2024 at 15:38