
Implement staging store for preparing .nwb.json for upload to cloud #42

Merged: 13 commits into main from staging-store, Apr 19, 2024

Conversation

@magland (Collaborator) commented Apr 5, 2024

A reference file system works well for making small changes or edits to attributes. But when it comes time to add large datasets and binary chunks, those should not be embedded in the JSON. This PR adds a mechanism to specify a staging area (a temporary directory on the local machine) where binary chunks are stored as the reference file system is edited. Then, prior to uploading to the cloud (e.g., DANDI), the chunks are consolidated into larger blocks (so that we don't have thousands or even hundreds of thousands of files to upload). Finally, the data files are uploaded and the reference file system is adjusted so that the references point to the URLs of the newly uploaded files rather than to local file paths.
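Conceptually, the consolidation step amounts to something like the following sketch (not lindi's actual code; the function name and the use of a single output blob are simplifications for illustration):

import os


def consolidate_chunks(chunk_files: list, blob_path: str) -> dict:
    # Concatenate the staged chunk files into one blob and return, for each
    # chunk, a [blob_path, offset, size] triple to substitute into the
    # reference file system in place of the local file path.
    refs = {}
    offset = 0
    with open(blob_path, 'wb') as blob:
        for chunk_file in chunk_files:
            size = os.path.getsize(chunk_file)
            with open(chunk_file, 'rb') as f:
                blob.write(f.read())
            refs[os.path.basename(chunk_file)] = [blob_path, offset, size]
            offset += size
    return refs

After upload, each blob_path is then replaced by the blob's download URL, which is what the JSON example later in this thread shows.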

The actual mechanism for upload is not provided by lindi. I think it makes sense to keep that part separate, so that lindi is not fundamentally tied to DANDI.

Here's an example script showing how this is intended to be used.

import hashlib
import lindi
from lindi_dandi import create_asset  # DANDI upload helper, kept separate from lindi
from lindi.LindiStagingStore import StagingArea
from helpers.add_autocorrelograms import add_autocorrelograms


lindi_tests_dandiset_id = '213569'
lindi_tests_dandiset_version = 'draft'


def lindi_test_2():
    # path = '000946/sub-BH494/sub-BH494_ses-20230817T132836_ecephys.nwb'
    download_url = 'https://api-staging.dandiarchive.org/api/assets/11b45116-807e-4ba0-82cd-95530c3b095f/download/'

    with StagingArea.create(base_dir='lindi_staging') as staging_area:
        client = lindi.LindiH5pyFile.from_reference_file_system(download_url, mode='r+', staging_area=staging_area)
        add_autocorrelograms(client)
        upload_nwb_to_dandi(
            nwb_client=client,
            dandiset_id=lindi_tests_dandiset_id,
            dandiset_version=lindi_tests_dandiset_version,
            path='000946/sub-BH494/sub-BH494_ses-20230817T132836_desc-autocorrelograms_ecephys.nwb.json'
        )


def upload_nwb_to_dandi(
    nwb_client: lindi.LindiH5pyFile,
    dandiset_id: str,
    dandiset_version: str,
    path: str
):
    staging_store = nwb_client.staging_store
    if staging_store is None:
        raise ValueError("NWB client does not have a staging store")

    def on_store_blob(filename: str):
        sha1 = _compute_sha1_of_file(filename)
        a = create_asset(
            dandiset_id=dandiset_id,
            dandiset_version=dandiset_version,
            local_filename=filename,
            asset_path=f'sha1/{sha1}',
            leave_existing=True
        )
        assert isinstance(a, dict)
        return a['download_url']

    def on_store_main(filename: str):
        a = create_asset(
            dandiset_id=dandiset_id,
            dandiset_version=dandiset_version,
            local_filename=filename,
            asset_path=path,
            replace_existing=True
        )
        assert isinstance(a, dict)
        return a['download_url']

    staging_store.upload(
        on_store_blob=on_store_blob,
        on_store_main=on_store_main,
        consolidate_chunks=True
    )


def _compute_sha1_of_file(filename: str):
    sha1 = hashlib.sha1()
    with open(filename, 'rb') as f:
        while True:
            data = f.read(65536)
            if not data:
                break
            sha1.update(data)
    return sha1.hexdigest()


if __name__ == '__main__':
    lindi_test_2()

This script resulted in the following files on DANDI (staging site): https://gui-staging.dandiarchive.org/dandiset/213569/draft/files?location=000946%2Fsub-BH494&page=1


These are small .nwb.json files that reference data blobs from (a) the original NWB/HDF5 file on Dandiset 000946 and (b) new blobs for the newly added autocorrelogram column in the Units table.

I wasn't sure where to put the blob on DANDI, so for now, it's here:
https://gui-staging.dandiarchive.org/dandiset/213569/draft/files?location=sha1&page=1

We'll need to discuss what makes sense for that; there are a lot of things to consider here.

Here's the supplemented file on Neurosift
https://neurosift.app/?p=/nwb&url=https://api-staging.dandiarchive.org/api/assets/64aa19bd-496f-47be-aa72-2b1cedb9b20a/download/&st=lindi&dandisetId=213569&dandisetVersion=draft

Expand units and then click on "autocorrelograms". You can also explore in the RAW tab.

@magland (Collaborator, Author) commented Apr 5, 2024

See https://github.com/magland/lindi-dandi - this provides the DANDI upload functionality.

@CodyCBakerPhD commented

Very cool!

So an ImageSeries group written with the Zarr backend through LINDI could have a data field that looks like this?

@magland (Collaborator, Author) commented Apr 5, 2024

> So an ImageSeries group written with the Zarr backend through LINDI could have a data field that looks like this?

Yes exactly

@magland (Collaborator, Author) commented Apr 5, 2024

@CodyCBakerPhD To illustrate more clearly, here's the content of that .json file (again, not NWB compliant):

{
  "refs": {
    ".zgroup": {
      "zarr_format": 2
    },
    "group1/.zgroup": {
      "zarr_format": 2
    },
    "group1/test_video/.zattrs": {
      "neurodata_type": "test_video"
    },
    "group1/test_video/.zgroup": {
      "zarr_format": 2
    },
    "group1/test_video/data/.zarray": {
      "chunks": [
        2000,
        480,
        640,
        3
      ],
      "compressor": {
        "fps": 30,
        "id": "mp4avc"
      },
      "dtype": "|u1",
      "fill_value": 0,
      "filters": null,
      "order": "C",
      "shape": [
        15521,
        480,
        640,
        3
      ],
      "zarr_format": 2
    },
    "group1/test_video/data/0.0.0.0": [
      "https://api-staging.dandiarchive.org/api/assets/c0dd3974-2746-48be-b95a-1cc3766e0403/download/",
      19914396,
      5292428
    ],
    "group1/test_video/data/1.0.0.0": [
      "https://api-staging.dandiarchive.org/api/assets/c0dd3974-2746-48be-b95a-1cc3766e0403/download/",
      13935646,
      3546920
    ],
    "group1/test_video/data/2.0.0.0": [
      "https://api-staging.dandiarchive.org/api/assets/c0dd3974-2746-48be-b95a-1cc3766e0403/download/",
      0,
      3619104
    ],
    "group1/test_video/data/3.0.0.0": [
      "https://api-staging.dandiarchive.org/api/assets/c0dd3974-2746-48be-b95a-1cc3766e0403/download/",
      6994064,
      3641331
    ],
    "group1/test_video/data/4.0.0.0": [
      "https://api-staging.dandiarchive.org/api/assets/c0dd3974-2746-48be-b95a-1cc3766e0403/download/",
      3619104,
      3374960
    ],
    "group1/test_video/data/5.0.0.0": [
      "https://api-staging.dandiarchive.org/api/assets/c0dd3974-2746-48be-b95a-1cc3766e0403/download/",
      25206824,
      3368001
    ],
    "group1/test_video/data/6.0.0.0": [
      "https://api-staging.dandiarchive.org/api/assets/c0dd3974-2746-48be-b95a-1cc3766e0403/download/",
      10635395,
      3300251
    ],
    "group1/test_video/data/7.0.0.0": [
      "https://api-staging.dandiarchive.org/api/assets/c0dd3974-2746-48be-b95a-1cc3766e0403/download/",
      17482566,
      2431830
    ]
  },
  "version": 1
}

Note that the compressor id is "mp4avc" (the custom one), and all of those references point to the same file: the consolidated blob on DANDI. Each chunk within that consolidated blob is mp4-encoded, so you get the advantage of chunked Zarr along with the advantage of one big video file.
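For readers unfamiliar with the reference convention: each triple is [url, offset, size], and a chunk is retrieved with an HTTP range request against the consolidated blob. A minimal sketch (fetch_chunk is a hypothetical helper, not part of lindi; in practice the zarr/fsspec machinery does this for you):

import requests


def fetch_chunk(ref: list) -> bytes:
    # Fetch one compressed chunk given a [url, offset, size] reference
    url, offset, size = ref
    headers = {'Range': f'bytes={offset}-{offset + size - 1}'}
    r = requests.get(url, headers=headers)
    r.raise_for_status()
    return r.content


# e.g. the first chunk of group1/test_video/data from the refs above
chunk = fetch_chunk([
    'https://api-staging.dandiarchive.org/api/assets/c0dd3974-2746-48be-b95a-1cc3766e0403/download/',
    19914396,
    5292428,
])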

@yarikoptic (Contributor) commented

I have yet to review or provide any meaningful feedback, but I just want to note that eventually we would get a .json for each .nwb anyway whenever collating .nwb files into BIDS datasets per https://bids.neuroimaging.io/bep032. So it might get confusing to have both .nwb.json and .json. FWIW, our prior discussion of possible (other) extensions is at

@magland (Collaborator, Author) commented Apr 6, 2024

> I have yet to review or provide any meaningful feedback, but I just want to note that eventually we would get a .json for each .nwb anyway whenever collating .nwb files into BIDS datasets per https://bids.neuroimaging.io/bep032. So it might get confusing to have both .nwb.json and .json. FWIW, our prior discussion of possible (other) extensions is at

Thanks ... copying those over: .lindi.zarr.json, .nwb.zarr.json, .nwb.lindi, .nwb.lindi.json, .lindi.json, .lindi

If .nwb.json is not specific enough, then I guess my vote would be .nwb.lindi.json, because that would allow for .nwb.lindi.parquet or others in the future. The problem with .nwb.zarr.json is that it doesn't indicate that the file contains lindi-specific attributes for handling scalars, references, compound dtypes, etc.

@magland (Collaborator, Author) commented Apr 9, 2024

I renamed these files to .nwb.lindi.json to see how that looks:

https://gui-staging.dandiarchive.org/dandiset/213569/draft/files?location=000946%2Fsub-BH494&page=1

What do you think, @yarikoptic @rly?

If we can decide on this, I can create a PR for dandi to enable the "open in neurosift" menu option.

@codecov-commenter commented Apr 9, 2024

Codecov Report

Attention: Patch coverage is 86.34361% with 31 lines in your changes missing coverage. Please review.

Project coverage is 83.10%. Comparing base (bc71a9d) to head (5b4b6f4).
Report is 143 commits behind head on main.

Files                                          Patch %   Lines
lindi/LindiStagingStore/LindiStagingStore.py   85.44%    23 Missing ⚠️
lindi/LindiStagingStore/StagingArea.py         86.11%     5 Missing ⚠️
lindi/LindiH5pyFile/LindiH5pyFile.py           90.47%     2 Missing ⚠️
lindi/LindiH5ZarrStore/LindiH5ZarrStore.py     83.33%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #42      +/-   ##
==========================================
+ Coverage   82.33%   83.10%   +0.77%     
==========================================
  Files          25       28       +3     
  Lines        1715     1930     +215     
==========================================
+ Hits         1412     1604     +192     
- Misses        303      326      +23     


@magland requested a review from rly on April 12, 2024
@yarikoptic (Contributor) commented

I guess it would not hurt us at all to "add support" for .nwb.lindi.json, even if just to get a way to see the difference between operating on them instead of directly on .nwb files. Would you happen to have some strikingly good example of the performance gains that would come thanks to such external indexing?

@magland (Collaborator, Author) commented Apr 15, 2024

> I guess it would not hurt us at all to "add support" for .nwb.lindi.json, even if just to get a way to see the difference between operating on them instead of directly on .nwb files. Would you happen to have some strikingly good example of the performance gains that would come thanks to such external indexing?

One advantage of using these LINDI files is the initial load time when loading remotely. For example, this takes around 10 seconds on my machine to fully load the metadata (loading from HDF5):

https://neurosift.app/?p=/nwb&dandisetId=000409&dandisetVersion=draft&url=https://api.dandiarchive.org/api/assets/c04f6b30-82bf-40e1-9210-34f0bcd8be24/download/&lindi=0

If I load from the .nwb.lindi.json instead, this reduces to less than 1 second:

https://neurosift.app/?p=/nwb&dandisetId=000409&dandisetVersion=draft&url=https://api.dandiarchive.org/api/assets/c04f6b30-82bf-40e1-9210-34f0bcd8be24/download/&lindi=1
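To time that comparison outside of Neurosift, one could do something like this (a sketch; lindi_json_url is a placeholder, since the .nwb.lindi.json asset URL is not linked directly above):

import time
import lindi

# Placeholder for the .nwb.lindi.json asset corresponding to the HDF5 file above
lindi_json_url = 'https://example.org/sub-XX_ecephys.nwb.lindi.json'

t0 = time.time()
client = lindi.LindiH5pyFile.from_reference_file_system(lindi_json_url, mode='r')
print(f'Loaded metadata in {time.time() - t0:.2f} s')  # ~1 s, vs ~10 s for remote HDF5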

A second advantage is that we can use custom compression codecs that are not available in HDF5. For example, this allows including mp4-compressed video data rather than requiring external links to .mp4 files. I realize it's ironic that we are solving one type of external link by using another, but the LINDI approach is much more seamless. Here's an example:

https://neurosift.app/?p=/nwb&url=https://fsbucket-dendro.flatironinstitute.org/dendro-outputs/2d2b13fd/2d2b13fd.ec254350/output&dandisetId=000871&dandisetVersion=draft&dandiAssetId=407d6e5f-2a16-4879-b38f-b74cad7b9e24&st=lindi

Click on acquisition, and then raw_suite2p_motion_corrected_compressed.
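For context on how a custom codec plugs into the Zarr stack: numcodecs allows registering a codec class under an id such as "mp4avc", which zarr then looks up when it encounters that compressor id in .zarray. The sketch below is illustrative only: the class name, the decode-only design, and the use of PyAV are assumptions, and lindi's actual mp4avc codec may be implemented quite differently.

import io

import av  # PyAV, assumed here for H.264 decoding
import numpy as np
from numcodecs import register_codec
from numcodecs.abc import Codec


class Mp4AvcCodec(Codec):
    # Illustrative decode-only codec registered under the id "mp4avc"
    codec_id = 'mp4avc'

    def __init__(self, fps: float = 30):
        self.fps = fps

    def encode(self, buf):
        raise NotImplementedError('encoding omitted from this sketch')

    def decode(self, buf, out=None):
        # Decode one mp4-encoded chunk into a (frames, height, width, 3) uint8 array
        frames = []
        with av.open(io.BytesIO(bytes(buf))) as container:
            for frame in container.decode(video=0):
                frames.append(frame.to_ndarray(format='rgb24'))
        arr = np.stack(frames)
        if out is not None:
            out[...] = arr.reshape(out.shape)
            return out
        return arr


register_codec(Mp4AvcCodec)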

A third advantage is that you can create derived NWB files that include the same data as the upstream file without any actual data duplication, as described earlier in this thread.

A fourth advantage is that you can make corrections to an NWB file (attributes, etc.) and then re-upload it without needing to re-upload the bulk of the data; see the sketch at the end of this comment.

Finally, there are reasons to believe that remote streaming of actual data chunks will be a lot more efficient with LINDI than with HDF5, but I don't have any examples showcasing that yet.
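The fourth advantage can reuse the exact API from the example script in the PR description (a sketch; the URL, the attribute being corrected, and the asset path are placeholders):

import lindi
from lindi.LindiStagingStore import StagingArea

# Placeholder for an existing .nwb.lindi.json asset
lindi_json_url = 'https://example.org/sub-XX_ecephys.nwb.lindi.json'

with StagingArea.create(base_dir='lindi_staging') as staging_area:
    client = lindi.LindiH5pyFile.from_reference_file_system(
        lindi_json_url, mode='r+', staging_area=staging_area
    )
    # Correct an attribute (hypothetical path/name); attribute edits live in
    # the JSON itself, so no new data chunks are staged
    client['session_start_time'].attrs['comments'] = 'corrected'
    # Re-upload using the helper from the example script above; since no new
    # chunks were staged, presumably only the small main file is re-uploaded
    upload_nwb_to_dandi(
        nwb_client=client,
        dandiset_id='213569',
        dandiset_version='draft',
        path='sub-XX_ecephys.nwb.lindi.json'
    )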

Comment on lines 140 to 143:

with tempfile.TemporaryDirectory() as tmpdir:
    rfs_fname = f"{tmpdir}/rfs.json"
    with open(rfs_fname, 'w') as f:
        json.dump(rfs, f, indent=2, sort_keys=True)
@rly (Contributor) commented Apr 19, 2024

Should this and

def to_file(self, file_name: str, *, file_type: Literal["zarr.json"] = "zarr.json"):
    """Write a reference file system corresponding to this store to a file.
    This can then be loaded using LindiH5pyFile.from_reference_file_system(file_name)
    """
    if file_type != "zarr.json":
        raise Exception(f"Unsupported file type: {file_type}")
    ret = self.to_reference_file_system()
    with open(file_name, "w") as f:
        json.dump(ret, f, indent=2)

use the same function, so that we have consistent file name suffixes and json.dump args?

@magland (Collaborator, Author) replied:

Done.
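For concreteness, the consolidated helper might look something like this (a sketch; _write_rfs_to_file is a hypothetical name, and the actual function in the merged commit may differ):

import json


def _write_rfs_to_file(rfs: dict, file_name: str):
    # Single place that writes a reference file system to JSON, so that
    # to_file and the staging store use the same json.dump arguments
    with open(file_name, 'w') as f:
        json.dump(rfs, f, indent=2, sort_keys=True)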

@magland merged commit 40c94a1 into main on Apr 19, 2024
6 checks passed
@magland deleted the staging-store branch on April 19, 2024 at 15:38