Allow reading utf-8 encoded json files #1312

Merged
merged 8 commits into from
Feb 9, 2023

Conversation

nhz2
Contributor

@nhz2 nhz2 commented Dec 27, 2022

Fixes #1308

Currently, zarr-python raises an error when reading a JSON file that contains non-ASCII characters encoded as UTF-8. However, Zarr.jl writes JSON files that include non-ASCII characters using UTF-8 encoding.
This PR enables zarr attributes written by Zarr.jl to be read by zarr-python.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions github-actions bot added and removed the "needs release notes" label (automatically applied to PRs which haven't added release notes) Dec 27, 2022
@nhz2 nhz2 marked this pull request as ready for review December 27, 2022 20:12
@nhz2
Contributor Author

nhz2 commented Jan 10, 2023

@MSanKeys963 @jakirkham This PR is ready for review. Also, because I am a first-time contributor, I am not able to trigger GitHub Actions myself.

@sanketverma1704
Member

Thanks for sending the PR @nhz2.
@jakirkham, @joshmoore and @rabernat, please have a look.

@joshmoore
Member

Thanks for this, @nhz2! I've triggered the actions and will go through the code ASAP.

@codecov

codecov bot commented Jan 16, 2023

Codecov Report

Merging #1312 (2024d7f) into main (4dc6f1f) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main     #1312   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           36        36           
  Lines        14744     14748    +4     
=========================================
+ Hits         14744     14748    +4     
Impacted Files Coverage Δ
zarr/meta.py 100.00% <100.00%> (ø)
zarr/tests/test_attrs.py 100.00% <100.00%> (ø)
zarr/util.py 100.00% <100.00%> (ø)

@joshmoore
Member

Relaunched the Windows build after:

    requests.exceptions.ChunkedEncodingError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))

@joshmoore
Member

joshmoore commented Jan 16, 2023

Green 🎉 in case anyone else wants to now take a closer look. (release.rst will need its conflict handled)

For my part, reading this makes complete sense, but I do wonder what the impact on downstream code will be and therefore what types of warnings/modifications might be needed.

@nhz2
Contributor Author

nhz2 commented Jan 17, 2023

This PR should have no effect on the JSON files zarr-python writes; they will still be ASCII-only, with non-ASCII characters escaped. It only affects what kind of JSON files zarr-python can read.
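As a rough illustration of that write/read distinction, here is a minimal sketch that uses only Python's standard json module rather than zarr-python's own helpers. The attribute dictionary is the one from this PR's test fixture; the variable names are made up for the example.

```python
import json

attrs = {"foo": "た"}

# json.dumps escapes non-ASCII characters by default (ensure_ascii=True),
# so the written document is pure ASCII.
ascii_doc = json.dumps(attrs).encode("ascii")

# Another writer may instead emit the raw character as UTF-8 bytes.
utf8_doc = json.dumps(attrs, ensure_ascii=False).encode("utf-8")

# Decoding as UTF-8 reads both forms, because ASCII is a subset of UTF-8.
from_ascii = json.loads(ascii_doc.decode("utf-8"))
from_utf8 = json.loads(utf8_doc.decode("utf-8"))
```

Both documents decode to the same dictionary, which is why widening the reader to UTF-8 does not change how existing ASCII-only files are read.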

@joshmoore
Member

Thanks, @nhz2. Guess I was more concerned whether any calling code (and/or subclasses) would need modification.

@nhz2
Contributor Author

nhz2 commented Jan 24, 2023

In terms of the effect on other code, the main change is
from

def json_loads(s: str) -> Dict[str, Any]:
    """Read JSON in a consistent way."""
    return json.loads(ensure_text(s, 'ascii'))

to

def json_loads(s: bytes) -> Dict[str, Any]:
    """Read JSON in a consistent way."""
    return json.loads(ensure_text(s, 'utf-8'))

Functions called by json_loads

ensure_text is defined in numcodecs: https://github.com/zarr-developers/numcodecs/blob/c9f47a80981dbefe372c4ae7ff3938ba000f2de9/numcodecs/compat.py#L178

If a bytes object is passed to ensure_text, it is decoded and returned as a str; the call fails if the bytes cannot be decoded.
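The decoding behaviour described above can be sketched without numcodecs. The helper `to_text` below is a hypothetical stand-in written for this example; it is not the numcodecs implementation and only mimics the behaviour stated in this comment (decode bytes, pass str through).

```python
import json
from typing import Union


def to_text(s: Union[bytes, str], encoding: str) -> str:
    """Hypothetical stand-in for ensure_text: decode bytes, return str unchanged."""
    if isinstance(s, str):
        return s
    return bytes(s).decode(encoding)


# UTF-8 encoded JSON containing a non-ASCII character.
raw = '{"foo": "た"}'.encode("utf-8")

# Decoding as ASCII fails on the multi-byte character.
try:
    to_text(raw, "ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Decoding as UTF-8 succeeds and the result parses as JSON.
decoded = json.loads(to_text(raw, "utf-8"))
```

With the 'ascii' codec the decode step raises before json.loads is ever reached, which matches the failure reported in #1308.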

json.loads takes a str and returns a Python dict, list, str, int, float, True, False, or None, or raises a JSONDecodeError if the input is not a valid JSON document, according to https://docs.python.org/3/library/json.html
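For reference, a short sketch of those return types using only the standard library; the sample documents are arbitrary.

```python
import json

# One sample document per top-level JSON value type.
samples = ['{"a": 1}', "[1, 2]", '"text"', "3", "2.5", "true", "null"]
kinds = [type(json.loads(s)).__name__ for s in samples]

# Invalid input raises json.JSONDecodeError (a subclass of ValueError).
try:
    json.loads("{not json}")
    invalid_raised = False
except json.JSONDecodeError:
    invalid_raised = True
```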

Should json_loads error if the object returned by json.loads isn't a dict? Maybe that can be added in a future PR?
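One possible shape for such a check is sketched below. The name `json_loads_strict` and the choice of ValueError are assumptions made for illustration; nothing like this is part of this PR.

```python
import json
from typing import Any, Dict, Union


def json_loads_strict(s: Union[bytes, str]) -> Dict[str, Any]:
    """Hypothetical variant: parse JSON and reject anything that is not an object."""
    text = s.decode("utf-8") if isinstance(s, bytes) else s
    obj = json.loads(text)
    if not isinstance(obj, dict):
        # Assumed error type; a real implementation might choose another.
        raise ValueError(f"expected a JSON object, got {type(obj).__name__}")
    return obj


meta = json_loads_strict(b'{"zarr_format": 2}')
```

A call such as `json_loads_strict(b"[1, 2]")` would raise under this sketch, whereas the current function would return the list.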

Functions that call json_loads

consolidate_metadata:

key: json_loads(store[key])

parse_metadata:

meta = json_loads(s)

It is also called a few times in n5.py

ConsolidatedMetadataStore.__init__:

meta = json_loads(self.store[metadata_key])

ConsolidatedMetadataStoreV3.__init__:

meta = json_loads(self.store[metadata_key])

All of these cases directly, or indirectly through parse_metadata, call json_loads on an object obtained from a store. Since store values are bytes, not strings, I changed the type hints to match the current usage.

@joshmoore
Member

Thanks for the analysis, @nhz2! Sounds like it should not impact external consumers.

Any objections to rolling this into 2.14, anyone?

@jakirkham
Member

Thanks for the PR Nathan! 🙏

This is a good improvement to include.

Noting some changes in the typing to still allow str in functions (since it should still technically be valid). Though I agree bytes should be included in a few places as well, for clarity.

Ideally we would capture any bytes-like type here (since they are all valid), but that is somewhat complicated to do in practice. Maybe there will be guidance on this in the future ( python/typing#593 )
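A minimal sketch of what accepting both str and several bytes-like types could look like at runtime. The function name `loads_any` and the explicit Union are assumptions for illustration, not the annotations used in this PR; the Union only lists three common bytes-like types rather than every object that supports the buffer protocol.

```python
import json
from typing import Any, Union

# Assumed alias for the example; not a type defined in zarr-python.
TextOrBytes = Union[str, bytes, bytearray, memoryview]


def loads_any(s: TextOrBytes) -> Any:
    """Hypothetical helper: decode bytes-like input as UTF-8, then parse JSON."""
    if not isinstance(s, str):
        # bytes() copies the buffer, which is fine for small metadata documents.
        s = bytes(s).decode("utf-8")
    return json.loads(s)


payload = '{"foo": "た"}'
results = [
    loads_any(payload),
    loads_any(payload.encode("utf-8")),
    loads_any(bytearray(payload.encode("utf-8"))),
    loads_any(memoryview(payload.encode("utf-8"))),
]
```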

@joshmoore joshmoore mentioned this pull request Feb 8, 2023
nhz2 and others added 2 commits February 8, 2023 17:25
@joshmoore joshmoore merged commit 280d969 into zarr-developers:main Feb 9, 2023
@joshmoore
Member

I think this has failed on the zarr-feedstock: conda-forge/zarr-feedstock#72

Comment on lines +51 to +55
def test_utf8_encoding(self, zarr_version):

    # fixture data
    fixture = group(store=DirectoryStore('fixture'))
    assert fixture['utf8attrs'].attrs.asdict() == dict(foo='た')
Member

This test is failing there

Member

@jakirkham jakirkham left a comment

The issue is that neither of these files was included in the Python sdist

@@ -0,0 +1 @@
{"foo": "た"}
Member

This file is missing from the sdist on PyPI

Member

Thanks, @jakirkham. Argh, I thought I had caught these issues, but likely only if there's code to generate them as well. Are you already working on a fix or shall I?

Member

Feel free to go ahead. I might not get to this for a while

Member

Forget if we already discussed this before, but maybe we can make some fixes to CI to ensure we catch these issues in PRs. Wrote up some thoughts on this in issue ( #1347 )
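One way a CI step might catch a missing fixture is to list the members of the built sdist and compare them against the files the tests need. The sketch below is only an assumption about how such a check could look: the helper names and the fixture paths are hypothetical, and the archive is built in memory so the example is self-contained.

```python
import io
import tarfile
from typing import Dict, Iterable, List


def missing_from_sdist(sdist: bytes, required: Iterable[str]) -> List[str]:
    """Return the required path suffixes that no archive member ends with."""
    with tarfile.open(fileobj=io.BytesIO(sdist), mode="r:gz") as tar:
        names = tar.getnames()
    return [r for r in required if not any(n.endswith(r) for n in names)]


def make_sdist(files: Dict[str, bytes]) -> bytes:
    """Build a small .tar.gz in memory (stands in for a real sdist)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


# Hypothetical fixture paths; the real layout may differ.
required = ["fixture/utf8attrs/.zattrs", "fixture/.zgroup"]

archive = make_sdist({"pkg-0.0/zarr/util.py": b"# code\n"})
missing = missing_from_sdist(archive, required)
```

In this example `missing` lists both fixture paths, since the toy archive contains only a source file; a CI job could fail the build whenever that list is non-empty.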

Comment on lines +1 to +3
{
    "zarr_format": 2
}
Member

Same with this one

joshmoore added a commit to joshmoore/zarr-python that referenced this pull request Feb 12, 2023
This is a temporary fix for the larger issue of out-of-tree
testing described in zarr-developers#1347, but this should allow a release
of 2.14.1 which passes on conda.
joshmoore added a commit that referenced this pull request Feb 12, 2023
This is a temporary fix for the larger issue of out-of-tree
testing described in #1347, but this should allow a release
of 2.14.1 which passes on conda.
joshmoore added a commit to joshmoore/zarr-python that referenced this pull request Mar 10, 2023
@joshmoore joshmoore mentioned this pull request Mar 10, 2023
6 tasks
joshmoore added a commit that referenced this pull request Mar 10, 2023
Successfully merging this pull request may close these issues.

Cannot read attributes that contain non-ASCII characters
4 participants