7 changes: 6 additions & 1 deletion docs/api/storage.rst
@@ -14,7 +14,12 @@ Storage (``zarr.storage``)
.. autoclass:: DBMStore

.. automethod:: close
.. automethod:: sync
.. automethod:: flush

.. autoclass:: LMDBStore

.. automethod:: close
.. automethod:: flush

.. autofunction:: init_array
.. autofunction:: init_group
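
As a rough illustration of the ``flush`` methods documented above, a store can be flushed explicitly to push any buffered writes to disk before closing (a minimal sketch; it assumes ``DBMStore`` falls back to Python's built-in :mod:`dbm` module when no ``open`` function is given)::

    >>> import zarr
    >>> store = zarr.DBMStore('data/example.db')
    >>> root = zarr.group(store=store, overwrite=True)
    >>> root.attrs['answer'] = 42  # write some metadata through the store
    >>> store.flush()  # force any pending writes to disk without closing
    >>> store.close()  # close the database when finished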
13 changes: 9 additions & 4 deletions docs/release.rst
@@ -30,14 +30,19 @@ Enhancements
:issue:`123`, :issue:`139`.

* **New storage class for DBM-style databases**. The
:class:`zarr.storage.DBMStore` class enables any DBM-style database to be used
as the backing store for an array or group. See the tutorial section on
:ref:`tutorial_storage` for some examples. :issue:`133`, :issue:`186`
:class:`zarr.storage.DBMStore` class enables any DBM-style database, such as gdbm,
ndbm or Berkeley DB, to be used as the backing store for an array or group. See the
tutorial section on :ref:`tutorial_storage` for some examples. :issue:`133`,
:issue:`186`.

* **New storage class for LMDB databases**. The :class:`zarr.storage.LMDBStore` class
enables an LMDB "Lightning" database to be used as the backing store for an array or
group. @@TODO issue

* **New storage class using a nested directory structure for chunk files**. The
:class:`zarr.storage.NestedDirectoryStore` has been added, which is similar to
the existing :class:`zarr.storage.DirectoryStore` class but nests chunk files
for multidimensional arrays into sub-directories. :issue:`155`, :issue:`177`
for multidimensional arrays into sub-directories. :issue:`155`, :issue:`177`.

* **New tree() method for printing hierarchies**. The ``Group`` class has a new
:func:`zarr.hierarchy.Group.tree` method which enables a tree representation of
80 changes: 39 additions & 41 deletions docs/tutorial.rst
@@ -683,9 +683,6 @@ here is an array stored directly into a Zip file, via the
>>> z = root.zeros('foo/bar', shape=(1000, 1000), chunks=(100, 100), dtype='i4')
>>> z[:] = 42
>>> store.close()
>>> import os
>>> os.path.getsize('data/example.zip')
32805

Re-open and check that data have been written::

@@ -713,31 +710,23 @@ Another storage alternative is the :class:`zarr.storage.DBMStore` class, added
in Zarr version 2.2. This class allows any DBM-style database to be used for
storing an array or group. Here is an example using a Berkeley DB B-tree
database for storage (requires `bsddb3
<https://www.jcea.es/programacion/pybsddb.htm>`_ to be installed):
<https://www.jcea.es/programacion/pybsddb.htm>`_ to be installed)::

>>> import bsddb3
>>> store = zarr.DBMStore('data/example.db', open=bsddb3.btopen, flag='n')
>>> root = zarr.group(store=store)
>>> store = zarr.DBMStore('data/example.bdb', open=bsddb3.btopen)
>>> root = zarr.group(store=store, overwrite=True)
>>> z = root.zeros('foo/bar', shape=(1000, 1000), chunks=(100, 100), dtype='i4')
>>> z[:] = 42
>>> store.close()
>>> import os
>>> os.path.getsize('data/example.db')
36864

Re-open and check that data have been written::
Also added in Zarr version 2.2 is the :class:`zarr.storage.LMDBStore` class, which
enables the Lightning Memory-Mapped Database (LMDB) to be used for storing an array or
group (requires `lmdb <http://lmdb.readthedocs.io/>`_ to be installed)::

>>> store = zarr.DBMStore('data/example.db', open=bsddb3.btopen)
>>> root = zarr.group(store=store)
>>> z = root['foo/bar']
>>> z[:]
array([[42, 42, 42, ..., 42, 42, 42],
[42, 42, 42, ..., 42, 42, 42],
[42, 42, 42, ..., 42, 42, 42],
...,
[42, 42, 42, ..., 42, 42, 42],
[42, 42, 42, ..., 42, 42, 42],
[42, 42, 42, ..., 42, 42, 42]], dtype=int32)
>>> store = zarr.LMDBStore('data/example.lmdb')
>>> root = zarr.group(store=store, overwrite=True)
>>> z = root.zeros('foo/bar', shape=(1000, 1000), chunks=(100, 100), dtype='i4')
>>> z[:] = 42
>>> store.close()
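
To confirm that the data have been persisted, either store can be re-opened and read back, mirroring the Zip example above (a minimal sketch; it assumes ``LMDBStore`` opens an existing database at the same path by default)::

    >>> store = zarr.LMDBStore('data/example.lmdb')
    >>> root = zarr.group(store=store)
    >>> z = root['foo/bar']
    >>> z[0, 0]
    42
    >>> store.close()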

It is also possible to use distributed storage systems. The Dask project has
@@ -778,6 +767,8 @@ Here is an example using S3Map to read an array created previously::
>>> z[:].tostring()
b'Hello from the cloud!'
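
For orientation, the lines above are the tail end of a pattern along the following lines (a minimal sketch; the bucket name, group path and ``anon=True`` setting are illustrative assumptions rather than the exact values used above)::

    >>> import s3fs
    >>> s3 = s3fs.S3FileSystem(anon=True)
    >>> store = s3fs.S3Map(root='example-bucket/store', s3=s3, check=False)
    >>> root = zarr.group(store=store)
    >>> z = root['foo/bar/baz']
    >>> z[:].tostring()
    b'Hello from the cloud!'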



.. _tutorial_strings:

String arrays
@@ -875,7 +866,7 @@ a chunk shape is based on simple heuristics and may be far from optimal. E.g.::

>>> z4 = zarr.zeros((10000, 10000), chunks=True, dtype='i4')
>>> z4.chunks
(313, 625)
(625, 625)
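
If the automatic choice is not suitable, a chunk shape can always be provided explicitly instead (a minimal sketch; the shape shown is just an illustration)::

    >>> z5 = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4')
    >>> z5.chunks
    (1000, 1000)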

If you know you are always going to be loading the entire array into memory, you
can turn off chunks by providing ``chunks=False``, in which case there will be
@@ -936,24 +927,26 @@ filters (e.g., byte-shuffle) have been applied.
Parallel computing and synchronization
--------------------------------------

Zarr arrays can be used as either the source or sink for data in parallel
computations. Both multi-threaded and multi-process parallelism are
supported. The Python global interpreter lock (GIL) is released wherever
possible for both compression and decompression operations, so Zarr will
generally not block other Python threads from running.

A Zarr array can be read concurrently by multiple threads or processes. No
synchronization (i.e., locking) is required for concurrent reads.

A Zarr array can also be written to concurrently by multiple threads or
processes. Some synchronization may be required, depending on the way the data
is being written.

If each worker in a parallel computation is writing to a separate region of the
array, and if region boundaries are perfectly aligned with chunk boundaries,
then no synchronization is required. However, if region and chunk boundaries are
not perfectly aligned, then synchronization is required to avoid two workers
attempting to modify the same chunk at the same time.
Zarr arrays have been designed for use as the source or sink for data in
parallel computations. By data source we mean that multiple concurrent read
operations may occur. By data sink we mean that multiple concurrent write
operations may occur, with each writer updating a different region of the
array. Zarr arrays have **not** been designed for situations where multiple
readers and writers are concurrently operating on the same array.

Both multi-threaded and multi-process parallelism are possible. The bottleneck
for most storage and retrieval operations is compression/decompression, and the
Python global interpreter lock (GIL) is released wherever possible during these
operations, so Zarr will generally not block other Python threads from running.

When using a Zarr array as a data sink, some synchronization (locking) may be
required to avoid data loss, depending on how data are being updated. If each
worker in a parallel computation is writing to a separate region of the array,
and if region boundaries are perfectly aligned with chunk boundaries, then no
synchronization is required. However, if region and chunk boundaries are not
perfectly aligned, then synchronization is required to avoid two workers
attempting to modify the same chunk at the same time, which could result in data
loss.

To give a simple example, consider a 1-dimensional array of length 60, ``z``,
divided into three chunks of 20 elements each. If three workers are running and
@@ -986,7 +979,12 @@ some networked file systems). E.g.::
>>> z
<zarr.core.Array (10000, 10000) int32>

This array is safe to read or write from multiple processes,
This array is safe to read or write from multiple processes.
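
For thread-based rather than process-based parallelism, a similar pattern can be used with a thread-level lock (a minimal sketch; it assumes the ``zarr.ThreadSynchronizer`` class and the ``synchronizer`` keyword accepted by the array creation functions)::

    >>> synchronizer = zarr.ThreadSynchronizer()
    >>> z = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4',
    ...                synchronizer=synchronizer)
    >>> z[0:100, :] = 42  # chunk updates are now protected by a lock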

Please note that support for parallel computing is an area of ongoing research
and development. If you are using Zarr for parallel computing, we welcome
feedback, experience, discussion, ideas and advice, particularly about issues
related to data integrity and performance.

.. _tutorial_pickle:
