Merged

85 commits (changes shown below are from 1 commit)
5cdf6c8
added HiddenKeyDict class
rabernat Aug 27, 2017
f305c25
new zarr backend
rabernat Aug 27, 2017
2ea21c5
added HiddenKeyDict class
rabernat Aug 27, 2017
d92bf2f
new zarr backend
rabernat Aug 27, 2017
79da971
add zarr to ci reqs
Oct 5, 2017
31e4409
add zarr api to docs
Oct 5, 2017
2ec5ee5
some zarr tests passing
rabernat Oct 6, 2017
bd21720
Merge pull request #1 from jhamman/zarr_backend
rabernat Oct 6, 2017
7e898fc
merged stuff from joe
rabernat Oct 6, 2017
af5ff6c
Merge branch 'master' of github.com:pydata/xarray into zarr_backend
Oct 6, 2017
9e7cc09
Merge branch 'zarr_backend' of github.com:rabernat/xray into zarr_bac…
Oct 6, 2017
3f01365
requires zarr decorator
Oct 6, 2017
41cf706
Merge pull request #2 from jhamman/zarr_backend
rabernat Oct 6, 2017
fd9fd0f
wip
rabernat Oct 7, 2017
9f16e8f
added chunking test
rabernat Oct 8, 2017
fe9ebe7
remove debuggin statements
rabernat Oct 8, 2017
c01cd09
fixed HiddenKeyDict
rabernat Oct 8, 2017
b3e5d76
added HiddenKeyDict class
rabernat Aug 27, 2017
45375b2
new zarr backend
rabernat Aug 27, 2017
0e79718
add zarr to ci reqs
Oct 5, 2017
3d39ade
add zarr api to docs
Oct 5, 2017
3d09c67
some zarr tests passing
rabernat Oct 6, 2017
0b4a27a
requires zarr decorator
Oct 6, 2017
f39035c
wip
rabernat Oct 7, 2017
6446ea2
added chunking test
rabernat Oct 8, 2017
9136064
remove debuggin statements
rabernat Oct 8, 2017
2966100
fixed HiddenKeyDict
rabernat Oct 8, 2017
6bedf22
wip
rabernat Oct 14, 2017
ced8267
finished merge
rabernat Oct 16, 2017
e461cdb
finished merge
rabernat Oct 16, 2017
049bf9e
create opener object
rabernat Oct 16, 2017
c169128
trying to get caching working
rabernat Oct 16, 2017
82ef456
caching still not working
rabernat Oct 16, 2017
3ee243e
merge conflicts
rabernat Nov 13, 2017
e20c29f
updating zarr backend with new indexing mixins
rabernat Nov 13, 2017
f82c8c1
added new zarr dev test env
rabernat Nov 13, 2017
43e539f
update travis
rabernat Nov 13, 2017
66299f0
move zarr-dev to travis allowed failures
rabernat Nov 13, 2017
2fce362
fix typo in env file
rabernat Nov 13, 2017
c19b81a
wip
rabernat Nov 17, 2017
68b8f07
fixed zarr auto_chunk
rabernat Nov 17, 2017
0ea0dad
refactored zarr tests
rabernat Nov 17, 2017
58b3bf0
new encoding test
rabernat Nov 17, 2017
9da22da
Merge branch 'master' of github.com:pydata/xarray into zarr_backend_jjh
Nov 17, 2017
a8b4785
cleanup and buildout ZarrArrayWrapper, vectorized indexing
Nov 17, 2017
2a6a776
Merge pull request #4 from jhamman/zarr_backend_jjh
rabernat Nov 17, 2017
021d3ba
more wip
rabernat Nov 27, 2017
5ef10d2
removed chaching test
rabernat Nov 17, 2017
e47d936
Merge remote-tracking branch 'origin/zarr_backend' into zarr_backend
rabernat Nov 27, 2017
a4b024e
very close to passing all tests
rabernat Nov 27, 2017
d8842a6
Merge remote-tracking branch 'upstream/master' into zarr_backend
rabernat Nov 28, 2017
54d116d
modified inheritance
rabernat Nov 29, 2017
94678f4
subclass AbstractWriteableDataStore
rabernat Nov 29, 2017
64942e5
Merge remote-tracking branch 'origin/zarr_backend' into zarr_backend
rabernat Dec 1, 2017
f584456
xfailed certain tests
rabernat Dec 1, 2017
c43284e
pr comments wip
rabernat Dec 4, 2017
9df6e50
removed autoclose
rabernat Dec 4, 2017
012e858
new test for chunk encoding
rabernat Dec 4, 2017
b1819f4
added another test
rabernat Dec 5, 2017
8eb98c9
tests for HiddenKeyDict
rabernat Dec 6, 2017
64bd76c
flake8
rabernat Dec 6, 2017
cffa158
Merge remote-tracking branch 'upstream/master' into zarr_backend
rabernat Dec 6, 2017
3b4a941
zarr version update
rabernat Dec 6, 2017
688f415
added more tests
rabernat Dec 6, 2017
c115a2b
added compressor test
rabernat Dec 6, 2017
4c92531
docs
rabernat Dec 6, 2017
61027eb
weird ascii character issue
rabernat Dec 6, 2017
bbaa776
doc fixes
rabernat Dec 6, 2017
c8f23a5
what's new
rabernat Dec 6, 2017
f0c76f7
more file encoding nightmares
rabernat Dec 6, 2017
a84e388
Tests for backends.zarr._replace_slices_with_arrays
shoyer Dec 6, 2017
37bc2f0
respond to @shoyer's review
rabernat Dec 6, 2017
8cd1707
final fixes
rabernat Dec 7, 2017
ac27411
put back @shoyer's original max function
rabernat Dec 7, 2017
618bf81
another try with 2.7-safe max function
rabernat Dec 7, 2017
e942130
put back @shoyer's original max function
rabernat Dec 7, 2017
b1fa690
bypass lock on ArrayWriter
rabernat Dec 8, 2017
4089d13
Merge branch 'zarr_backend' of github.com:rabernat/xarray into zarr_b…
rabernat Dec 8, 2017
ba200c1
eliminate read mode
rabernat Dec 8, 2017
8dafaf7
added zarr distributed integration test
rabernat Dec 8, 2017
85174cd
fixed max bug
rabernat Dec 8, 2017
c76a01b
change lock to False
rabernat Dec 11, 2017
c011c2d
fix doc typos
rabernat Dec 11, 2017
054ffeb
Merge branch 'master' into zarr_backend
rabernat Dec 12, 2017
f5633ca
Merge branch 'master' into zarr_backend
rabernat Dec 12, 2017
5 changes: 3 additions & 2 deletions xarray/backends/common.py
@@ -164,9 +164,10 @@ def __exit__(self, exception_type, exception_value, traceback):


 class ArrayWriter(object):
-    def __init__(self):
+    def __init__(self, lock=GLOBAL_LOCK):
         self.sources = []
         self.targets = []
+        self.lock = lock

     def add(self, source, target):
         if isinstance(source, dask_array_type):
@@ -184,7 +185,7 @@ def sync(self):
             import dask.array as da
             import dask
             if LooseVersion(dask.__version__) > LooseVersion('0.8.1'):
-                da.store(self.sources, self.targets, lock=GLOBAL_LOCK)
+                da.store(self.sources, self.targets, lock=self.lock)
rabernat (Contributor, Author) commented on Dec 8, 2017:
FYI, I made this modification to the ArrayWriter class to allow us to bypass the use of a lock when writing, since (the way we have implemented it), the writes to the zarr store should all be thread safe and not require any lock. Later on, when we initialize the ZarrStore, we create a writer for it as ArrayWriter(lock=None). @mrocklin, does this seem like the right approach?
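To make the intent concrete, here is a minimal, self-contained sketch (not from the PR; the store path and array shapes are invented) of the pattern being relied on: dask chunks that map one-to-one onto zarr chunks, written with `dask.array.store` and no lock.

```python
# Sketch only: one dask chunk per zarr chunk, so no two tasks touch the
# same zarr block and the store call can skip locking entirely.
import dask.array as da
import zarr

source = da.arange(16, chunks=4)                      # four dask chunks
target = zarr.open("no_lock_example.zarr", mode="w",  # hypothetical path
                   shape=source.shape, chunks=(4,), dtype=source.dtype)
da.store(source, target, lock=False)                  # no lock needed here
```

The diff above constructs the writer with `lock=None`, which dask should treat the same way as `lock=False` here, since it only acquires a lock when one is actually supplied.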

Contributor:
If you can guarantee that no two tasks will write to the same block in Zarr then yes, I think that it is appropriate to avoid locking. This is based on old information though.

rabernat (Contributor, Author):
Another observation: this call to dask.array.store does not show up on my distributed dashboard task / status pages. The writes are obviously happening, but I can't follow their progress. Since we are going to be using this to move some very large datasets, interactive monitoring via the dashboard is quite important. Do you have any idea why this is not showing up?

rabernat (Contributor, Author):
> If you can guarantee that no two tasks will write to the same block in Zarr then yes, I think that it is appropriate to avoid locking.

As of this PR, we do not allow multiple dask chunks per zarr chunk. That scenario is covered by the test suite. It may change in the future, but that's how it is for now.

Once we cross that bridge, we will also have to deal with the fact that both zarr and dask have their own locking (aka "synchronization" in zarr parlance) enforcement mechanisms. We will presumably have to pick one.
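For reference, a hedged illustration of the zarr-side mechanism mentioned above (the store path and array parameters are invented): zarr can guard its own chunk writes with a synchronizer, which is the machinery that would overlap with dask's locking if overlapping writes were ever permitted.

```python
# Sketch only: zarr's own synchronization ("synchronizer"), not used by this
# PR because dask chunks are required to align one-to-one with zarr chunks.
import zarr

sync = zarr.ThreadSynchronizer()
z = zarr.open("sync_example.zarr", mode="w", shape=(100,), chunks=(10,),
              dtype="i4", synchronizer=sync)
z[:] = 0  # each chunk write acquires that chunk's lock via the synchronizer
```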

Contributor:
> Another observation: this call to dask.array.store does not show up on my distributed dashboard task / status pages. The writes are obviously happening, but I can't follow their progress. Since we are going to be using this to move some very large datasets, interactive monitoring via the dashboard is quite important. Do you have any idea why this is not showing up?

There is no reason that a task run on the distributed system will not show up on the dashboard. My first guess is that somehow you're using a local scheduler.

Contributor:
> As of this PR, we do not allow multiple dask chunks per zarr chunk. That scenario is covered by the test suite. It may change in the future, but that's how it is for now.
>
> Once we cross that bridge, we will also have to deal with the fact that both zarr and dask have their own locking (aka "synchronization" in zarr parlance) enforcement mechanisms. We will presumably have to pick one.

I suspect that we will never change this behavior. I don't think we should ever have multiple dask chunks write to one zarr chunk. Any to_zarr storage function should rechunk explicitly.

If for some reason we did need to synchronize, Dask provides a distributed locking mechanism that could be keyed by chunk label.
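A hypothetical sketch of that fallback (function and key names are invented; nothing like this exists in the PR, and it assumes the distributed package is installed): one dask.distributed.Lock per zarr chunk, acquired around each write.

```python
# Hypothetical only: per-chunk synchronization via dask.distributed.Lock,
# keyed by the chunk's label, if overlapping writes ever had to be supported.
from dask.distributed import Lock

def write_block(block, zarr_array, region, chunk_key):
    # One named lock per zarr chunk, shared across all workers in the cluster.
    with Lock("zarr-chunk-" + chunk_key):
        zarr_array[region] = block
    return block
```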

rabernat (Contributor, Author):
> There is no reason that a task run on the distributed system will not show up on the dashboard. My first guess is that somehow you're using a local scheduler.

I was not using a local scheduler. After digging further, I can see the tasks on the distributed dashboard when using a regular zarr.DirectoryStore, but not when I pass a gcsfs.mapping.GCSMap to to_zarr. Is there any reason these two should behave differently?

rabernat (Contributor, Author):
Update: it does eventually show up; it just takes a really long time.

Member:
Yes, this looks good.

             else:
                 da.store(self.sources, self.targets)
             self.sources = []
11 changes: 9 additions & 2 deletions xarray/backends/zarr.py
@@ -10,7 +10,7 @@
 from ..core import indexing
 from ..core.utils import FrozenOrderedDict, HiddenKeyDict
 from ..core.pycompat import iteritems, OrderedDict, integer_types
-from .common import (AbstractWritableDataStore, BackendArray)
+from .common import AbstractWritableDataStore, BackendArray, ArrayWriter
 from .. import conventions

 # need some special secret attributes to tell us the dimensions
@@ -298,8 +298,15 @@ def __init__(self, zarr_group, mode='r', synchronizer=None, group=None,
         if _DIMENSION_KEY not in self.ds.attrs:
             self.ds.attrs[_DIMENSION_KEY] = {}

+        if writer is None:
+            # by default, we should not need a lock for writing zarr because
+            # we do not (yet) allow overlapping chunks during write
+            zarr_writer = ArrayWriter(lock=None)
+        else:
+            zarr_writer = writer
+
         # do we need to define attributes for all of the opener keyword args?
-        super(ZarrStore, self).__init__(writer)
+        super(ZarrStore, self).__init__(zarr_writer)

     def open_store_variable(self, name, zarr_array):
         data = indexing.LazilyIndexedArray(ZarrArrayWrapper(name, self))