[Bug]: [0.7.0, 0.8.0] Fails to open file with consolidated metadata from S3 #205

Open
bjhardcastle opened this issue Jul 15, 2024 · 24 comments
Labels: category: bug (errors in the code or code behavior), priority: medium (non-critical problem and/or affecting only a small set of users)


@bjhardcastle

What happened?

#185 highlights the need for a test for this, so maybe you're already aware of this issue.

The problem stems from self.is_remote() incorrectly returning False.
Adding ConsolidatedMetadataStore to the type check here is enough to get it working again:

if isinstance(self.file.store, FSStore):
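
For illustration, a minimal sketch of what such a change could look like (hypothetical, not the actual hdmf-zarr patch); it inspects the store wrapped by a ConsolidatedMetadataStore so that local consolidated files are not misclassified as remote:

from zarr.storage import ConsolidatedMetadataStore, FSStore

def is_remote(self) -> bool:
    """Return whether the Zarr store points at a remote (e.g. S3) location."""
    store = self.file.store
    # A ConsolidatedMetadataStore only wraps another store: look at the
    # underlying store to decide whether the data actually lives remotely.
    if isinstance(store, ConsolidatedMetadataStore):
        store = store.store
    return isinstance(store, FSStore)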

Steps to Reproduce

import hdmf_zarr

hdmf_zarr.NWBZarrIO(
    path="s3://codeocean-s3datasetsbucket-1u41qdg42ur9/320c7f2a-f145-4a44-8421-2b2fe1d1acbc/nwb/ecephys_712815_2024-05-21_13-01-23_experiment1_recording1.nwb", 
    # not publicly accessible
    mode="r",
).read()

Traceback

Traceback (most recent call last):
  File "C:\Users\ben.hardcastle\github\npc_sessions\.venv\Lib\site-packages\hdmf_zarr\backend.py", line 492, in __open_file_consolidated
    return zarr.open_consolidated(store=store,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ben.hardcastle\github\npc_sessions\.venv\Lib\site-packages\zarr\convenience.py", line 
1335, in open_consolidated
    meta_store = ConsolidatedStoreClass(store, metadata_key=metadata_key)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ben.hardcastle\github\npc_sessions\.venv\Lib\site-packages\zarr\storage.py", line 2974, in __init__
    meta = json_loads(self.store[metadata_key])
                      ~~~~~~~~~~^^^^^^^^^^^^^^
  File "C:\Users\ben.hardcastle\github\npc_sessions\.venv\Lib\site-packages\zarr\storage.py", line 1115, in __getitem__
    raise KeyError(key)
KeyError: '.zmetadata'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "C:\Users\ben.hardcastle\github\npc_sessions\.venv\Lib\site-packages\hdmf\utils.py", line 668, in func_call
    return func(args[0], **pargs)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ben.hardcastle\github\npc_sessions\.venv\Lib\site-packages\hdmf\backends\io.py", line 
56, in read
    f_builder = self.read_builder()
                ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ben.hardcastle\github\npc_sessions\.venv\Lib\site-packages\hdmf\utils.py", line 668, in func_call
    return func(args[0], **pargs)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ben.hardcastle\github\npc_sessions\.venv\Lib\site-packages\hdmf_zarr\backend.py", line 1327, in read_builder
    f_builder = self.__read_group(self.__file, ROOT_NAME)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ben.hardcastle\github\npc_sessions\.venv\Lib\site-packages\hdmf_zarr\backend.py", line 1392, in __read_group
    sub_builder = self.__read_group(sub_group, sub_name)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ben.hardcastle\github\npc_sessions\.venv\Lib\site-packages\hdmf_zarr\backend.py", line 1392, in __read_group
    sub_builder = self.__read_group(sub_group, sub_name)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ben.hardcastle\github\npc_sessions\.venv\Lib\site-packages\hdmf_zarr\backend.py", line 1392, in __read_group
    sub_builder = self.__read_group(sub_group, sub_name)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ben.hardcastle\github\npc_sessions\.venv\Lib\site-packages\hdmf_zarr\backend.py", line 1401, in __read_group
    self.__read_links(zarr_obj=zarr_obj, parent=ret)
  File "C:\Users\ben.hardcastle\github\npc_sessions\.venv\Lib\site-packages\hdmf_zarr\backend.py", line 1420, in __read_links
    target_name, target_zarr_obj = self.resolve_ref(link)
                                   ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ben.hardcastle\github\npc_sessions\.venv\Lib\site-packages\hdmf_zarr\backend.py", line 730, in resolve_ref
    target_zarr_obj = self.__open_file_consolidated(store=source_file,
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ben.hardcastle\github\npc_sessions\.venv\Lib\site-packages\hdmf_zarr\backend.py", line 497, in __open_file_consolidated
    return zarr.open(store=store,
           ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ben.hardcastle\github\npc_sessions\.venv\Lib\site-packages\zarr\convenience.py", line 
132, in open
    raise PathNotFoundError(path)
zarr.errors.PathNotFoundError: nothing found at path ''

Operating System: Windows
Python Executable: Python
Python Version: 3.11
Package Versions: No response

@mavaylon1
Contributor

I am able to open a file with consolidated metadata using S3. I'll look into this.

@mavaylon1 mavaylon1 self-assigned this Jul 18, 2024
@mavaylon1
Contributor

mavaylon1 commented Jul 18, 2024

Open works, but read does not for a certain file; however, the error is not the same, so I do not believe it is related.

@mavaylon1 mavaylon1 added the labels category: bug and priority: medium Jul 18, 2024
@mavaylon1
Contributor

@bjhardcastle I'm not able to reproduce the same error. Is it possible for you to create a public example? (a zarr file I can use)

@mavaylon1
Contributor

We have passing tests that read consolidated-metadata files, so I am not sure what you are facing.

@alejoe91
Collaborator

alejoe91 commented Aug 2, 2024

@bjhardcastle I think this is due to some export functions on our side missing the write_args={'link_data': False} argument, as reported in #179, not to a specific problem reading consolidated data from S3.

On our end, all export functions have now been updated to force link_data: False. Can you try to generate a new NWB file with the pipeline and see if the problem persists? Once a fix on the hdmf-zarr side is in place, we can drop the write_args.
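
For reference, a minimal sketch of the export pattern described above, assuming hdmf-zarr's export API and placeholder file names (not the actual pipeline code):

from hdmf_zarr import NWBZarrIO

# Placeholder paths; the real pipeline paths differ.
with NWBZarrIO("input.nwb.zarr", mode="r") as read_io:
    with NWBZarrIO("output.nwb.zarr", mode="w") as export_io:
        # Force data to be copied into the new file instead of being
        # linked back to the source file.
        export_io.export(src_io=read_io, write_args={"link_data": False})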

@bjhardcastle
Author

bjhardcastle commented Aug 2, 2024

@mavaylon1 Are you talking about these tests?
2ba6bc5

They aren't sufficient:

  1. they don't read the NWB file (as you said yourself, opening it is not actually the problem: reading it is)
  2. the zarr file is empty, so the reading fails for a different reason

Here's an MRE with reading added to your test:

# python -m venv .venv-hdmf-test
# .venv-hdmf-test/scripts/activate (Windows)
# source .venv-hdmf-test/bin/activate (Linux)
# python -m ensurepip
# python -m pip install hdmf-zarr fsspec s3fs

import hdmf_zarr 
import zarr 

s3_path = "https://dandiarchive.s3.amazonaws.com/zarr/ccefbc9f-30e7-4a4c-b044-5b59d300040b/"

with hdmf_zarr.NWBZarrIO(s3_path, mode='r') as read_io:
    read_io.open()
    assert isinstance(read_io.file.store, zarr.storage.ConsolidatedMetadataStore)
    try:
        # this fails:
        nwb = read_io.read()
    except Exception as exc:
        print(repr(exc))
        # ValueError: No data_type found for builder root
   
with hdmf_zarr.NWBZarrIO(s3_path, mode='-r') as read_io:
    read_io.open()
    assert isinstance(read_io.file.store, zarr.storage.FSStore)
    try:    
        # this fails:
        nwb = read_io.read()
    except Exception as exc:
        print(repr(exc))
        # hdmf.backends.errors.UnsupportedOperation: Cannot build data. There are no values.
   
# the zarr file is empty:
z = zarr.open(s3_path, mode='r')
assert not tuple(z.keys())

@mavaylon1
Contributor

mavaylon1 commented Aug 2, 2024

@bjhardcastle I am talking about our roundtrip tests, such as https://github.com/hdmf-dev/hdmf-zarr/blob/8ca578733db5078d1ff6d2dfb47407a680c58caf/tests/unit/base_tests_zarrio.py#L331C9-L331C40

In hdmf-zarr our default is to consolidate metadata, so that test reads a file with a zarr.storage.ConsolidatedMetadataStore and passes.

Thanks for the code. I will look into this.

@bjhardcastle
Author

Is that writing to an S3 bucket though?

@mavaylon1
Contributor

@bjhardcastle No, but my point is that ConsolidatedMetadata works locally. We don't have a read test, and I couldn't reproduce it without an S3 file.

But you gave one, so this should be enough for me to reproduce the error and fix it. :)

@bjhardcastle
Author

Yes it works locally because is_remote() returns False, which is correct.

When the file is remote, is_remote() also returns False, because ConsolidatedMetadataStore is not a subclass of FSStore. This messes up the paths to the underlying data.
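
A quick way to confirm the class relationship described here (plain zarr, no hdmf-zarr involved):

import zarr

# ConsolidatedMetadataStore wraps another store and is not an FSStore subclass,
# so an isinstance(store, FSStore) check misses remote consolidated stores.
assert not issubclass(zarr.storage.ConsolidatedMetadataStore, zarr.storage.FSStore)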

The file I gave you is the one from your tests - but it's empty.
I'm trying to upload a non-empty zarr NWB file via the dandi CLI now, but following the docs hasn't been successful yet. If you have any working code to do that, I'd appreciate it.

@mavaylon1
Contributor

mavaylon1 commented Aug 2, 2024

I have to reproduce the error to comment on your findings regarding is_remote(). I will see what I can do to help on the DANDI side.

@mavaylon1
Contributor

@bjhardcastle At any point did you export?

@bjhardcastle
Author

bjhardcastle commented Aug 2, 2024

@bjhardcastle At any point did you export?

Export what?

@mavaylon1
Contributor

mavaylon1 commented Aug 2, 2024

@bjhardcastle At any point did you export?

Export what?

Did you export a file to get to your current file?

@bjhardcastle
Author

@bjhardcastle At any point did you export?

Export what?

Did you export a file to get to your current file?

Sorry I don't know what you mean. Could you be more specific?

@bjhardcastle
Author

bjhardcastle commented Aug 2, 2024

I have to reproduce the error to comment on your findings regarding is_remote()

Can't you see that the code doesn't make sense? The type of zarr object is not an indicator of whether the data is remote or not.

As a result, lines like this make paths that are completely wrong:

# Resolve the path relative to the current file
if not self.is_remote():
    source_file = os.path.abspath(os.path.join(self.source, source_file))
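
To illustrate (hypothetical values, not taken from the traceback): when is_remote() wrongly returns False for a remote store, the URL gets run through os.path.join/abspath, which prepends the local working directory and collapses the '//' after the scheme:

import os

source = "https://dandiarchive.s3.amazonaws.com/zarr/ccefbc9f-30e7-4a4c-b044-5b59d300040b/"
source_file = "."  # placeholder for the link's source file
# Produces a local-looking path that does not exist, hence PathNotFoundError.
print(os.path.abspath(os.path.join(source, source_file)))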

@mavaylon1
Contributor

That could be the case. I have not taken a deep look at what the issue is. I usually like to reproduce the error so that any changes I make to address that error can be tested.

In any case, I will find some time next week to look at this as I have higher priority issues I've already committed to. Let me know when you figure out uploading your file. I will look into this as well.

Within NWB you can edit files and export them into a new file. If you did not make the file and are unsure, that is fine. It could be helpful to know, and I would suggest finding out if possible.

I invite you to also review our Code of Conduct.

@oruebel
Contributor

oruebel commented Aug 2, 2024

@bjhardcastle @mavaylon1 @alejoe91 thank you for the discussion to help us better understand the issue. We'll review it in our team meeting next week to make a plan for developing a fix and get back to you. I appreciate everyone's patience and hope you have a great weekend.

@alejoe91
Collaborator

alejoe91 commented Aug 5, 2024

@mavaylon1 if this helps, I got the same error on an NWB file produced by a double export (first adding electrical series, second adding units), even when opening the file locally.

Here's a link with the file so you can download it: https://drive.google.com/drive/folders/1f8_92cXrEvOdJWJvghNVNR_sDjpejmEO?usp=sharing

@alejoe91
Collaborator

alejoe91 commented Aug 6, 2024

Another update: switching the units export to an append mechanism seems to produce NWB files without the issue.

To summarize, we currently produce NWB files in multiple steps:

  • Step 1: add subject and session metadata
  • Step 2: add LFP electrical series and electrodes tables (via export). After this step, the file can be read with no errors.
  • Step 3: add Units table (reading fails with export, works with append)

Here are the NWB zarr files produced:

So it seems that the issue is with adding the Units table with export. I hope this helps!
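
For context, a minimal sketch of the append-style alternative mentioned above (placeholder file name; the actual pipeline code may differ):

from hdmf_zarr import NWBZarrIO

with NWBZarrIO("ecephys_session.nwb.zarr", mode="a") as io:
    nwbfile = io.read()
    # ... add the Units table to nwbfile here ...
    io.write(nwbfile)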

@oruebel
Contributor

oruebel commented Aug 6, 2024

Thank you @alejoe91 for debugging this issue further and providing these additional details. We'll take a look.

@mavaylon1
Contributor

@bjhardcastle we are actively investigating adjacent issues that may fix your ticket. Our goal is to either have a solution or a clear path to one by the end of next week.

@alejoe91
Collaborator

@mavaylon1 any lead on this?

@mavaylon1
Contributor

@alejoe91 @bjhardcastle The issue that is stalling this is our implementation of export. I would find out who generated/exported the files, as the files in general might not be valid.
