9 changes: 1 addition & 8 deletions appveyor.yml
@@ -40,18 +40,11 @@ environment:

install:
- "SET PATH=%PYTHON%;%PYTHON%\\Scripts;%PATH%"
- git submodule update --init --recursive

build: off

test_script:
- "%CMD_IN_ENV% python -m pip install -U pip setuptools wheel"
- "%CMD_IN_ENV% python -m pip install -rrequirements_dev.txt"
- "%CMD_IN_ENV% python setup.py build_ext --inplace"
- "%CMD_IN_ENV% python -m nose -v"

after_test:
- "%CMD_IN_ENV% python setup.py bdist_wheel"

artifacts:
- path: dist\*
- "%CMD_IN_ENV% python -m pytest -v zarr"
11 changes: 11 additions & 0 deletions docs/release.rst
@@ -107,6 +107,17 @@ Enhancements
* **New Array.hexdigest() method** computes an ``Array``'s hash with ``hashlib``.
By :user:`John Kirkham <jakirkham>`, :issue:`98`, :issue:`203`.
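
  For example, a minimal sketch (the array contents here are illustrative only, and
  ``sha1`` is assumed to be the default hash algorithm)::

  >>> import zarr, numpy as np
  >>> z = zarr.array(np.arange(100))
  >>> len(z.hexdigest())  # length of a SHA-1 hex digest
  40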

* **Improved support for object arrays**. In previous versions of Zarr,
creating an array with ``dtype=object`` was possible but could under certain
circumstances lead to unexpected errors and/or segmentation faults. To make it easier
to configure an object array correctly, a new ``object_codec`` parameter has been
added to array creation functions. See the tutorial section on :ref:`tutorial_objects`
for more information and examples. Also, runtime checks have been added in both Zarr
and Numcodecs so that segmentation faults are no longer possible, even with a badly
configured array. This API change is backwards compatible: previous code that created
an object array and provided an object codec via the ``filters`` parameter will
continue to work; however, a warning will be raised to encourage use of the
``object_codec`` parameter. :issue:`208`, :issue:`212`.
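
  For example, a minimal sketch of the new parameter (the array shape and codec
  choice here are illustrative only)::

  >>> import zarr, numcodecs
  >>> z = zarr.empty(10, dtype=object, object_codec=numcodecs.JSON())
  >>> z[0] = 'foo'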

Bug fixes
~~~~~~~~~
105 changes: 72 additions & 33 deletions docs/tutorial.rst
@@ -178,8 +178,8 @@ print some diagnostics, e.g.::
: blocksize=0)
Store type : builtins.dict
No. bytes : 400000000 (381.5M)
No. bytes stored : 4565053 (4.4M)
Storage ratio : 87.6
No. bytes stored : 3702484 (3.5M)
Storage ratio : 108.0
Chunks initialized : 100/100

If you don't specify a compressor, by default Zarr uses the Blosc
@@ -270,8 +270,8 @@ Here is an example using a delta filter with the Blosc compressor::
Compressor : Blosc(cname='zstd', clevel=1, shuffle=SHUFFLE, blocksize=0)
Store type : builtins.dict
No. bytes : 400000000 (381.5M)
No. bytes stored : 648605 (633.4K)
Storage ratio : 616.7
No. bytes stored : 328085 (320.4K)
Storage ratio : 1219.2
Chunks initialized : 100/100

For more information about available filter codecs, see the `Numcodecs
@@ -394,8 +394,8 @@ property. E.g.::
Compressor : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type : zarr.storage.DictStore
No. bytes : 8000000 (7.6M)
No. bytes stored : 37480 (36.6K)
Storage ratio : 213.4
No. bytes stored : 34840 (34.0K)
Storage ratio : 229.6
Chunks initialized : 10/10

>>> baz.info
@@ -409,8 +409,8 @@
Compressor : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type : zarr.storage.DictStore
No. bytes : 4000000 (3.8M)
No. bytes stored : 23243 (22.7K)
Storage ratio : 172.1
No. bytes stored : 20443 (20.0K)
Storage ratio : 195.7
Chunks initialized : 100/100

Groups also have the :func:`zarr.hierarchy.Group.tree` method, e.g.::
@@ -768,7 +768,6 @@ Here is an example using S3Map to read an array created previously::
b'Hello from the cloud!'



.. _tutorial_strings:

String arrays
@@ -788,40 +787,80 @@ your dataset, then you can use an array with a fixed-length bytes dtype. E.g.::

A fixed-length unicode dtype is also available, e.g.::

>>> z = zarr.zeros(12, dtype='U20')
>>> greetings = ['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', 'Hei maailma!',
... 'Xin chào thế giới', 'Njatjeta Botë!', 'Γεια σου κόσμε!',
... 'こんにちは世界', '世界,你好!', 'Helló, világ!', 'Zdravo svete!',
... 'เฮลโลเวิลด์']
>>> z[:] = greetings
>>> text_data = greetings * 10000
>>> z = zarr.array(text_data, dtype='U20')
>>> z[:]
array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', 'Hei maailma!',
'Xin chào thế giới', 'Njatjeta Botë!', 'Γεια σου κόσμε!', 'こんにちは世界',
'世界,你好!', 'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'],
array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...,
'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'],
dtype='<U20')

For variable-length strings, the "object" dtype can be used, but a filter must be
provided to encode the data. There are currently two codecs available that can encode
variable length string objects, :class:`numcodecs.Pickle` and :class:`numcodecs.MsgPack`.
E.g. using pickle::
For variable-length strings, the "object" dtype can be used, but a codec must be
provided to encode the data (see also :ref:`tutorial_objects` below). At the time of
writing there are three codecs available that can encode variable-length string
objects: :class:`numcodecs.JSON`, :class:`numcodecs.MsgPack` and
:class:`numcodecs.Pickle`. E.g. using JSON::

>>> import numcodecs
>>> z = zarr.zeros(12, dtype=object, filters=[numcodecs.Pickle()])
>>> z[:] = greetings
>>> z = zarr.array(text_data, dtype=object, object_codec=numcodecs.JSON())
>>> z[:]
array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', 'Hei maailma!',
'Xin chào thế giới', 'Njatjeta Botë!', 'Γεια σου κόσμε!', 'こんにちは世界',
'世界,你好!', 'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)
array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...,
'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)

...or alternatively using msgpack (requires `msgpack-python
<https://github.com/msgpack/msgpack-python>`_ to be installed)::

>>> z = zarr.zeros(12, dtype=object, filters=[numcodecs.MsgPack()])
>>> z[:] = greetings
>>> z = zarr.array(text_data, dtype=object, object_codec=numcodecs.MsgPack())
>>> z[:]
array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...,
'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)

If you know ahead of time all the possible string values that can occur, then you could
also use the :class:`numcodecs.Categorize` codec to encode each unique value as an
integer. E.g.::

>>> categorize = numcodecs.Categorize(greetings, dtype=object)
>>> z = zarr.array(text_data, dtype=object, object_codec=categorize)
>>> z[:]
array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', ...,
'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)


.. _tutorial_objects:

Object arrays
-------------

Zarr supports arrays with an "object" dtype. This allows arrays to contain any type of
object, such as variable-length unicode strings, variable-length lists, or other
kinds of Python object. When creating an object array, a codec must be provided via the
``object_codec`` argument. This codec handles encoding (serialization) of Python objects.
At the time of writing there are three codecs available that can serve as a
general-purpose object codec and support encoding of a variety of
object types: :class:`numcodecs.JSON`, :class:`numcodecs.MsgPack` and
:class:`numcodecs.Pickle`.

For example, using the JSON codec::

>>> z = zarr.empty(5, dtype=object, object_codec=numcodecs.JSON())
>>> z[0] = 42
>>> z[1] = 'foo'
>>> z[2] = ['bar', 'baz', 'qux']
>>> z[3] = {'a': 1, 'b': 2.2}
>>> z[:]
array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', 'Hei maailma!',
'Xin chào thế giới', 'Njatjeta Botë!', 'Γεια σου κόσμε!', 'こんにちは世界',
'世界,你好!', 'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)
array([42, 'foo', list(['bar', 'baz', 'qux']), {'a': 1, 'b': 2.2}, None], dtype=object)

Not all codecs support encoding of all object types. The
:class:`numcodecs.Pickle` codec is the most flexible, supporting encoding of any type
of Python object. However, if you are sharing data with anyone other than yourself,
Pickle is not recommended as it is a potential security risk: malicious code can be
embedded within pickled data. The JSON and MsgPack codecs do not have any
security issues and support encoding of unicode strings, lists and dictionaries.
MsgPack is usually faster for both encoding and decoding.
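
For example, a minimal sketch using the Pickle codec to store values that the JSON
codec cannot encode (the particular values here are illustrative only)::

>>> z = zarr.empty(2, dtype=object, object_codec=numcodecs.Pickle())
>>> z[0] = {'spam', 'eggs'}  # a set cannot be encoded as JSON
>>> z[1] = b'\x00\x01\x02'  # raw bytes cannot be encoded as JSON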


.. _tutorial_chunks:

@@ -898,8 +937,8 @@ ratios, depending on the correlation structure within the data. E.g.::
Compressor : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type : builtins.dict
No. bytes : 400000000 (381.5M)
No. bytes stored : 26805735 (25.6M)
Storage ratio : 14.9
No. bytes stored : 15857834 (15.1M)
Storage ratio : 25.2
Chunks initialized : 100/100
>>> f = zarr.array(a, chunks=(1000, 1000), order='F')
>>> f.info
@@ -912,8 +951,8 @@
Compressor : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type : builtins.dict
No. bytes : 400000000 (381.5M)
No. bytes stored : 9633601 (9.2M)
Storage ratio : 41.5
No. bytes stored : 7233241 (6.9M)
Storage ratio : 55.3
Chunks initialized : 100/100

In the above example, Fortran order gives a better compression ratio. This is an
@@ -1014,7 +1053,7 @@ E.g., pickle/unpickle an in-memory array::
>>> import pickle
>>> z1 = zarr.array(np.arange(100000))
>>> s = pickle.dumps(z1)
>>> len(s) > 10000 # relatively large because data have been pickled
>>> len(s) > 5000 # relatively large because data have been pickled
True
>>> z2 = pickle.loads(s)
>>> z1 == z2