Skip to content

Conversation

@jbms
Copy link
Contributor

@jbms jbms commented Jun 1, 2022

While this extension provides compatibility with zarr v2, and there
are certainly applications where it makes sense to store arbitrary
Python objects, the current extension has a number of issues:

  • As it is currently specified, it describes the in-memory Python
    object array representation, but surely it is not intended that the
    zarr array would actually store Python object memory addresses.

  • No encoding is specified (e.g. pickle or json).

  • Pickle is problematic since it is Python-specific and inherently
    unsafe to use to read untrusted data.

While this extension provides compatibility with zarr v2, and there
are certainly applications where it makes sense to store arbitrary
Python objects, the current extension has a number of issues:

- As it is currently specified, it describes the in-memory Python
  object array representation, but surely it is not intended that the
  zarr array would actually store Python object memory addresses.

- No encoding is specified (e.g. pickle or json).

- Pickle is problematic since it is Python-specific and inherently
  unsafe to use to read untrusted data.
@jbms jbms force-pushed the remove-object-data-type-extension branch from 0e16ba5 to 6cc2e21 Compare June 1, 2022 21:53
@jakirkham
Copy link
Member

cc @jstriebel @joshmoore

@jstriebel
Copy link
Member

LGTM, I always assumed that's a "works for python only":tm: extension. In this case there's no point to put it into the spec. If this is still supported in zarr-python (which I'd find valid, even if it's not part of the spec), we should probably reconsider the name/shorthand, but that's a different issue.

@jbms
Copy link
Contributor Author

jbms commented Jun 2, 2022

There are some variants of an "object" type that don't have to be Python-only, like a json object type using json, cbor, or msgpack.

Additionally, in principle there could be multiple zarr implementations that support Python, e.g. tensorstore could support Python objects with various codecs, so even if it is Python-specific there could be some benefit in having a specification.

However, I would caution against zarr-python supporting pickle at all in the way it does for zarr v2. Currently, with zarr v2, if you open an array from an untrusted source, it opens you up to an attacker being able to execute arbitrary code by specifying the pickle codec. Instead, it is important that the pickle codec require an explicit opt-in by the user when opening the array, e.g. zarr.open(..., unsafe_codecs=True).

@jakirkham
Copy link
Member

Just a note on context/history, "works with Python" means users need to supply an object_codec explicitly (Pickle being one of many options). Interestingly the PR that added Pickle as a codec also added MsgPack ( zarr-developers/numcodecs#5 ). A number of other alternatives also exist JSON, Categorize, VLen*, etc. and were also mentioned in a notebook when adding object_codec ( zarr-developers/zarr-python#212 ). So folks have been cognizant about the issues around Pickle and tried to provide reasonable alternatives based on user needs.

Think for the most part when this has come up, the interest has largely been around handling of strings, which can be a bit bumpy in Python. It may be worth opening an issue to discuss strings handling separately.

Other issues might be ragged or other array types. Though those probably deserve their own (use case motivated) discussions to ensure we handle them effectively.

@jakirkham jakirkham merged commit 649faa2 into zarr-developers:main Jun 3, 2022
@jakirkham
Copy link
Member

Thanks Jeremy for the PR and Jonathan for the review! 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants