-
-
Notifications
You must be signed in to change notification settings - Fork 370
WIP: Use kenjutsu #116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Use kenjutsu #116
Conversation
zarr/util.py
Outdated
| if isinstance(item, tuple): | ||
| rf_item = reformat_slices(item) | ||
| if Ellipsis not in rf_item: | ||
| rf_item += (Ellipsis,) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be necessary, but is really a bug on my part where I'm being overly strict. Will fix in kenjutsu with PR ( jakirkham/kenjutsu#59 ). In the interim, this shouldn't cause any issues and should be fine even when we fix it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Solved in 0.4.1.
|
Appears there is not enough coverage, but only in the Python 3.6 case. 😕 Not really sure what to do about that though or whether that is something for me to improve or whether it lies outside these changes. Edit: More info would be very helpful. ( https://github.com/alimanfoo/zarr/issues/117 ) |
|
Hi @jakirkham, thank you for this. Would mind breaking out just the changes that address #93 into a separate PR? That would help a lot. Regarding fancy indexing, if I've understood this correctly, I think there may be performance issues. In the worst case, chunks could get decompressed many times over. Also, again in the worst case, if there are lots of indices, then there could be a lot of Python looping. I think we need a solution that ensures each chunk of the array is decompressed at most once, and also which minimises looping. I have some ideas, happy to discuss further. |
So just move the multiple indices into a separate PR? No problem.
Indeed. Was interested first to see if we could get it to work. Plus it would provide some demonstration as to how one might use this functionality to handle these slices. Figured optimization would require some discussion or passing of the baton. Not sure I know enough about Zarr's internals to do this alone. |
|
I'll probably borrow from this PR and tweak it as it makes for a nice template. However, I have broken out the slice normalization fixes into PR ( https://github.com/alimanfoo/zarr/pull/119 ). |
|
Thanks for breaking out #119. For fancy indexing, the critical section of code is within For fancy indexing I think it is potentially simpler to handle the fancy part of the index item as a boolean array, rather than a sequence of integer indices. The code within the loop could then be modified to slice this boolean array, to obtain a sub-array of the boolean array which would be appropriate for indexing just that chunk. I.e., the boolean array would get divided up into pieces, with one piece for each chunk along the dimension the boolean array is being applied to. Figuring out the pieces is easy because they are equal sized, determined by the size of the chunks along that dimension. Within the loop there is already the current offset for each dimension, which tells you where you are in the context of the whole array. So this offset and the chunk shape is all that is needed to slice the boolean indexing array for the current chunk. Then the whole chunk can be loaded into a numpy array and the selection applied via numpy indexing. The trick is then to figure out where the data extracted from the current chunk should end up in the output array. I think probably we'd have to keep track of the offsets into the output array too, which means counting how many True values from the boolean array have been handled so far. All of this could I think be done with an integer index array instead of a boolean array, but would be a bit harder as it would require searching the array to find the set of indices that apply to the current chunk, then mapping those indices into the chunk's coordinate system, which I think will be more complex to code and slower. There may also need to be some modification to the get_chunk_range function to handle a Boolean array, which in the first instance could be dumb and just return all chunks for the indexed dimension as overlapping the selection, but potentially could be optimised so only the chunks where there is some data to extract are returned as in-range. This is probably all making very little sense, but thought I'd share ideas. This is the basic route I was planning to explore at some point, although have some other priorities I have to work through at the moment. |
|
@alimanfoo any update on this? Would love to use this to update an Array using an array of indices. |
|
Hi @samiur, apologies I'm snowed under with other work at the moment. TBH I think support for fancy indexing to update the contents of a zarr array may be some way off. I need to think through how to implement limited support for fancy indexing in |
|
@alimanfoo no worries! Thanks for the help on this. |
|
Closing as indexing code is substantially changed via #172. Happy to revisit using kenjutsu in future if it looks like there's a lot of common code. |
Fixes https://github.com/alimanfoo/zarr/issues/93
Fixes https://github.com/alimanfoo/zarr/issues/78
Use kenjutsu to handle slice normalization and shape determination.
Also have extend further to handle a sequence of indices in the slice, but may require some discussion about whether this is ok.