-
-
Notifications
You must be signed in to change notification settings - Fork 370
Update guess-chunks so that it works with zero dimensions. #154
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@newt0311: Do you think this would fix one of the problems we have encountered in pydata/xarray#1528? Would it enable the following code to work? za = zarr.create(shape=(), store='tmp_file')
za[...] = 0Currently this raises a permission error. |
|
Just checked. Unfortunately no. But I think the changes needed to make this work are fairly small. I can just roll them into my existing branch... -- PG. |
|
My changes are for handling cases like arrays with shape (2, 0, 2). Without my changes, the second zero dimension causes problems. |
| ar = np.ndarray(()) | ||
| ar[()] = 100 | ||
| z = array(ar) | ||
| assert_array_equal(ar, z[:]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect!
|
I'm pretty sure this fixes pydata/xarray#1528. |
…tores. The problem was the empty file name.
|
One more piece of the puzzle. Python dicts don't care if a key is an empty string but unix file-systems do... -- PG. |
|
The build failures here look unrelated to my changes. Could somebody trigger a re-build? Thanks. |
|
I think your build failure in python3.6 is related to a flake issue |
|
I think you're right. This should fix it. |
|
I just checked out your branch locally and confirmed it fixes 4 out of 12 failing tests in pydata/xarray#1528! @alimanfoo: merging this PR would really help us move forward on the xarray side. |
|
Nice this seems to work with minimal changes to existing code. It would be good to review how the h5py API works with 0d datasets (I.e. scalars) and also with nd datasets with one or more zero length dims, to be sure there is compatibility between h5py and zarr where possible. The important methods where I've tried to get compatibility are getitem and setitem on the Array class, and create_dataset and require_dataset on the Group class. This PR may already have achieved good compatibility but I am not familiar with how h5py behaves in these cases so would like to get some more info on h5py behaviour. E.g., how do you create a 0d dataset in h5py? What does getitem return from 0d dataset when given empty tuple [()], or total slice [:], or ellipsis [...]? Similar questions for nd dataset with one or more zero length dimensions? Also would be nice to check that Array.resize works fine if there are any 0 length dims. And check that sensible exceptions are raised when trying to do nonsense stuff on 0d arrays (eg resize). If it would help the xarray work I'd be happy to merge this PR now and raise issues to review h5py compatibility and other API concerns, then address those a bit later on but before next release. One other minor point, using "null" to pad out the chunk key for 0d array is a bit ugly. Suggest using "0" instead. |
|
The following works in h5py: import h5py
f = h5py.File("mytestfile.hdf5", "w")
dset = f.create_dataset("mydataset", ())
dset[...] = 99
assert dset[...] == 99
assert dset[()] == 99
|
|
The xarray work is not so urgent as to warrant a premature, rushed merge. I agree totally that h5py compatibility should be a high priority. It sounds like a few more tests, including of exceptions, are needed. If exact compatibility with the h5py api is desired, one option would be to actually require h5py for the tests and build this into the test suite...i.e. check explicitly that zarr operations and h5py operations give identical results / raise the same exceptions. |
|
Thanks Ryan, some tests would be good. Comparing against h5py in the tests
is a nice idea, although this isn't done anywhere else in zarr test suite
at the moment so I'd be happy for now with some simple direct tests along
the lines of existing unit tests for Array getitem and setitem and Group
create_dataset.
On other thing that may be worth reviewing is exactly what happens to
scalar (0d) arrays in terms of encoding and storage. I think under the
current code in this PR a scalar value would still get passed through
compressor and any filters before storage. I'm actually surprised that
works, but main point is that compressor or filters on scalar is
pointless/nonsense. Maybe better to special case scalars and don't allow
any compressor or filters, and see if any shortcuts are possible given no
compressor or filters. May also be worth mentioning some of this in the
storage spec so it's explicit how scalars should be implemented.
On Tue, 10 Oct 2017 at 02:06, Ryan Abernathey ***@***.***> wrote:
The xarray work is not so urgent as to warrant a premature, rushed merge.
I agree totally that h5py compatibility should be a high priority. It
sounds like a few more tests, including of exceptions, are needed.
If exact compatibility with the h5py api is desired, one option would be
to actually require h5py for the tests and build this into the test
suite...i.e. check explicitly that zarr operations and h5py operations give
identical results / raise the same exceptions.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<https://github.com/alimanfoo/zarr/pull/154#issuecomment-335330069>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAq8QtRKCUWRleuHCnTYB1XNxvVObQViks5sqsMigaJpZM4PMWbo>
.
--
Alistair Miles
Head of Epidemiological Informatics
Centre for Genomics and Global Health <http://cggh.org>
Big Data Institute Building
Old Road Campus
Roosevelt Drive
Oxford
OX3 7LF
United Kingdom
Phone: +44 (0)1865 743596
Email: [email protected]
Web: http://a <http://purl.org/net/aliman>limanfoo.github.io/
Twitter: https://twitter.com/alimanfoo
|
|
Btw happy to work on this (tests, spec) myself when back from leave if no
one has time.
On Wed, 11 Oct 2017 at 21:45, Alistair Miles <[email protected]>
wrote:
Thanks Ryan, some tests would be good. Comparing against h5py in the tests
is a nice idea, although this isn't done anywhere else in zarr test suite
at the moment so I'd be happy for now with some simple direct tests along
the lines of existing unit tests for Array getitem and setitem and Group
create_dataset.
On other thing that may be worth reviewing is exactly what happens to
scalar (0d) arrays in terms of encoding and storage. I think under the
current code in this PR a scalar value would still get passed through
compressor and any filters before storage. I'm actually surprised that
works, but main point is that compressor or filters on scalar is
pointless/nonsense. Maybe better to special case scalars and don't allow
any compressor or filters, and see if any shortcuts are possible given no
compressor or filters. May also be worth mentioning some of this in the
storage spec so it's explicit how scalars should be implemented.
On Tue, 10 Oct 2017 at 02:06, Ryan Abernathey ***@***.***>
wrote:
> The xarray work is not so urgent as to warrant a premature, rushed merge.
> I agree totally that h5py compatibility should be a high priority. It
> sounds like a few more tests, including of exceptions, are needed.
>
> If exact compatibility with the h5py api is desired, one option would be
> to actually require h5py for the tests and build this into the test
> suite...i.e. check explicitly that zarr operations and h5py operations give
> identical results / raise the same exceptions.
>
> —
> You are receiving this because you were mentioned.
>
>
> Reply to this email directly, view it on GitHub
> <https://github.com/alimanfoo/zarr/pull/154#issuecomment-335330069>, or mute
> the thread
> <https://github.com/notifications/unsubscribe-auth/AAq8QtRKCUWRleuHCnTYB1XNxvVObQViks5sqsMigaJpZM4PMWbo>
> .
>
--
Alistair Miles
Head of Epidemiological Informatics
Centre for Genomics and Global Health <http://cggh.org>
Big Data Institute Building
Old Road Campus
Roosevelt Drive
Oxford
OX3 7LF
United Kingdom
Phone: +44 (0)1865 743596
Email: ***@***.***
Web: http://a <http://purl.org/net/aliman>limanfoo.github.io/
Twitter: https://twitter.com/alimanfoo
--
Alistair Miles
Head of Epidemiological Informatics
Centre for Genomics and Global Health <http://cggh.org>
Big Data Institute Building
Old Road Campus
Roosevelt Drive
Oxford
OX3 7LF
United Kingdom
Phone: +44 (0)1865 743596
Email: [email protected]
Web: http://a <http://purl.org/net/aliman>limanfoo.github.io/
Twitter: https://twitter.com/alimanfoo
|
|
Just want to add my two cents. I used to be confused about this since different libraries (like numpy, h5py and pytables) support 0-d array differently. Please note that h5py and pytables don't really support it well due to hdf5 limitations so I would not use them as the final design if you are aiming for the correct behavior. I do like the numpy model as I think it's somewhat consistent at least. In numpy: I think at the end of the day the types are different (scalar vs nd-array) and the round trip from numpy to zarr (or pytables/h5py in that regard) should be seamless. Last I read about this I think HDF5 has an empty data set support but you can not store the shape. The end result is that you ended up writing meta data yourself so you can preserve the original shape like (10, 0, 2). |
|
Thanks Will, very timely, I'm just looking at this.
On Tue, 24 Oct 2017 at 16:43, Will Lee ***@***.***> wrote:
Just want to add my two cents.
I used to be confused about this since different libraries (like numpy,
h5py and pytables) support 0-d array differently. Please note that h5py and
pytables don't really support it well due to hdf5 limitations so I would
not use them as the final design if you are aiming for the correct
behavior. I do like the numpy model as I think it's somewhat consistent at
least. In numpy:
>>> a = numpy.arange(30)
>>> type(a[0])
numpy.int64
# Note that this is really not a 0-d array, it's a scalar type instead where there is no shape
>>> a[0].shape
()
# Note that the previous scalar value is different from a 0-d array:
>>> a[0:0]
array([], dtype=int64)
# where there is a shape of length 1 (the content is is 0 so the shape is preserved)
>>> a[0:0].shape
(0,)
# the type is an n-d array
>>> print type(a[0:0])
<type 'numpy.ndarray'>
# Here is an nd-array with 1 element, which is different from the numpy scalar
>>> print a[1:2]
[1]
# Of course then it has a shape:
>>> print a[1:2].shape
(1,)
>>> print type(a[1:2])
<type 'numpy.ndarray'>
# You can actually reshape this to arbitrary dimension and it works fine:
>>> print a[1:2].reshape((1, 1, 1, 1))
[[[[1]]]]
>>> a[1:2].reshape((1, 1, 1, 1)).shape
(1, 1, 1, 1)
I think at the end of the day the types are different (scalar vs nd-array)
and the round trip from numpy to zarr (or pytables/h5py in that regard)
should be seamless. Last I read about this I think HDF5 has an empty data
set support but you can not store the shape. The end result is that you
ended up writing meta data yourself so you can preserve the original shape
like (10, 0, 2).
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<https://github.com/alimanfoo/zarr/pull/154#issuecomment-339035385>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAq8QrkgxE4pgyR0tFI4qC4sDLV12-Fxks5svgWigaJpZM4PMWbo>
.
--
Alistair Miles
Head of Epidemiological Informatics
Centre for Genomics and Global Health <http://cggh.org>
Big Data Institute Building
Old Road Campus
Roosevelt Drive
Oxford
OX3 7LF
United Kingdom
Phone: +44 (0)1865 743596
Email: [email protected]
Web: http://a <http://purl.org/net/aliman>limanfoo.github.io/
Twitter: https://twitter.com/alimanfoo
|
|
I've created a separate PR #160 to focus on zero length dimensions. It has some fixes from this PR manually merged, plus some extra tests. |
|
I've created another PR #161 to focus on zero-dimensional arrays, building on work here. Comments welcome there. |
Just a possible solution. It seemed to work for me but I'm not as familiar with the code-base so some feedback would be appreciated.