Implement preferred_chunks for netcdf 4 backends #7948
Conversation
Hope this is up to standard! Please tell me if there is anything I can do to improve this PR.
Is the resulting chunk size a multiple of the on-disk file chunk size, or is it exactly the file chunk size? That is, if I open a file with a file chunk size of 256, do I get a dask array with chunk size 256, or some larger chunk size that is a multiple of 256? By itself, 256 would perform worse than one large chunk for a lot of array sizes, just from the overhead.
@djhoese if
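The distinction being discussed can be sketched in pure Python (this is an illustration of the principle, not xarray's or dask's actual algorithm; `align_to_disk_chunks` is a hypothetical helper): `chunks={}` hands back the on-disk chunks verbatim, while an `"auto"`-style policy would pick a larger multiple of the disk chunk to amortize per-chunk overhead.

```python
def align_to_disk_chunks(target: int, disk_chunk: int, dim_size: int) -> int:
    """Pick the largest multiple of the on-disk chunk size that does not
    exceed the target (at least one disk chunk, capped at the dimension
    size), so every dask chunk covers whole storage chunks."""
    multiple = max(1, target // disk_chunk)
    return min(multiple * disk_chunk, dim_size)

# With 256-element disk chunks and a ~4096-element target, an "auto"-style
# policy would choose 4096 (a multiple of 256) rather than 256 itself.
print(align_to_disk_chunks(4096, 256, 100_000))  # -> 4096
print(align_to_disk_chunks(100, 256, 100_000))   # -> 256 (never below one disk chunk)
```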
Thanks for taking this on @mraspaud. This has been something folks have wanted for some time now!
@Illviljan @headtr1ck help!
xarray/tests/test_backends.py (outdated)

```python
assert all(np.asanyarray(chunksizes) == expected)

@contextlib.contextmanager
def create_chunked_file(self, array_shape, chunk_sizes) -> None:
```
Suggested change:

```diff
-def create_chunked_file(self, array_shape, chunk_sizes) -> None:
+def create_chunked_file(self, array_shape: tuple[int, int, int], chunk_sizes: tuple[int, int, int]) -> Generator[str, None, None]:
```
I haven't checked it, but it should point you in the right direction.
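As a side note on the suggested annotation: a `@contextlib.contextmanager`-decorated function that yields a string is annotated with `Generator[str, None, None]` (yield type, send type, return type), not `-> None`. A minimal self-contained sketch, using a hypothetical `create_tmp_file` helper rather than the PR's actual test fixture:

```python
import contextlib
import os
import tempfile
from collections.abc import Generator

@contextlib.contextmanager
def create_tmp_file(suffix: str = ".nc") -> Generator[str, None, None]:
    # The function body yields a str, so the return annotation is
    # Generator[str, None, None]; mypy checks the yield against it.
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)
    try:
        yield path
    finally:
        os.remove(path)

# Usage: the context manager yields the temporary file's path.
with create_tmp_file() as path:
    assert path.endswith(".nc")
```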
for more information, see https://pre-commit.ci
I attempted to fix the type annotation problem; please tell me if there is more.
LGTM! Thanks for your patience here @mraspaud
Can you add a note to whats-new to advertise this great improvement, please?
What's new is now updated.
The mypy failures are related to these changes, I think:
@Illviljan can you take a look please?
Thanks, @mraspaud!
My pleasure! Thanks for merging!
According to the `open_dataset` documentation, using `chunks="auto"` or `chunks={}` should yield datasets with variables chunked according to the preferred chunks of the backend. However, neither the netcdf4 nor the h5netcdf backend seems to implement the `preferred_chunks` encoding attribute needed for this to work.

This PR adds this attribute to the encoding when the data is read. As a result, `chunks="auto"` in `open_dataset` returns variables with chunk sizes that are multiples of the chunks in the nc file, and `chunks={}` returns the variables with the exact nc chunk sizes.

- `whats-new.rst`
- `api.rst`
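The mechanism can be sketched in plain Python (this is an illustration of the idea, not xarray's actual implementation; `resolve_chunks` is a hypothetical helper): the backend records the on-disk chunks under `preferred_chunks` in each variable's encoding, and the chunking step falls back to those values for any dimension the user did not specify.

```python
def resolve_chunks(requested: dict, preferred: dict, sizes: dict) -> dict:
    """Per dimension: an explicit user request wins, otherwise the
    backend's preferred (on-disk) chunk, otherwise the full dimension."""
    return {
        dim: requested.get(dim, preferred.get(dim, size))
        for dim, size in sizes.items()
    }

# Hypothetical variable with on-disk chunks of 256 along "x", 512 along "y".
preferred = {"x": 256, "y": 512}
sizes = {"x": 10_000, "y": 20_000}

print(resolve_chunks({}, preferred, sizes))           # -> {'x': 256, 'y': 512}
print(resolve_chunks({"x": 1024}, preferred, sizes))  # -> {'x': 1024, 'y': 512}
```

The first call mirrors the `chunks={}` behavior described above (exact nc chunk sizes); a per-dimension override shows how explicit user chunks take precedence.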