
Implement preferred_chunks for netcdf 4 backends #7948

Merged: 21 commits into pydata:main, Sep 11, 2023

Conversation

mraspaud
Contributor

@mraspaud mraspaud commented Jun 28, 2023

According to the open_dataset documentation, using chunks="auto" or chunks={} should yield datasets whose variables are chunked according to the backend's preferred chunks. However, neither the netcdf4 nor the h5netcdf backend seems to implement the preferred_chunks encoding attribute needed for this to work.

This PR adds that attribute to the encoding when the data is read. As a result, chunks="auto" in open_dataset returns variables whose chunk sizes are multiples of the chunks in the nc file, and chunks={} returns variables with exactly the nc chunk sizes.
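For illustration only (this is not the PR's actual diff), the mechanism amounts to recording the on-disk chunk sizes in each variable's encoding under the preferred_chunks key, keyed by dimension name. A minimal sketch, assuming the netCDF4 Python API; the function name is made up:

```python
import netCDF4


def preferred_chunks_encoding(nc_var: netCDF4.Variable) -> dict:
    # Variable.chunking() returns the string "contiguous" for unchunked
    # variables, or a list of per-dimension chunk sizes.
    chunks = nc_var.chunking()
    if chunks is None or chunks == "contiguous":
        return {}
    # Map each dimension name to its on-disk chunk size.
    return {"preferred_chunks": dict(zip(nc_var.dimensions, chunks))}
```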

@mraspaud
Contributor Author

Hope this is up to the standards! Please tell me if there is anything I can do to improve this PR.

@mraspaud mraspaud changed the title Implement peferred_chunks for netcdf 4 backend Implement preferred_chunks for netcdf 4 backend Jun 28, 2023
@mraspaud mraspaud changed the title Implement preferred_chunks for netcdf 4 backend Implement preferred_chunks for netcdf 4 backends Jun 28, 2023
@djhoese
Copy link
Contributor

djhoese commented Jun 28, 2023

Is the resulting chunk size a multiple of the on-disk file chunk size, or is it exactly the file chunk size? That is, if I open a file with an on-disk chunk size of 256, do I get a dask array with chunk size 256, or some larger chunk size that is a multiple of 256? By itself, 256 would perform worse than one large chunk for many array sizes, just from the overhead.

@mraspaud
Contributor Author

@djhoese If chunks={}, the exact on-disk chunk sizes are used. If chunks="auto", a multiple of the on-disk chunk size is used, to accommodate dask's chunk size preferences. As I understand it, this is in line with what the open_dataset docstring says.
For our use case, chunks="auto" with this patch gave a significant performance boost in processing the data compared to the main branch.
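To make the two modes concrete, a minimal usage sketch; the file name "chunked_file.nc" and variable name "some_var" are hypothetical, and the 1024x1024 figure is just an example of what dask might pick:

```python
import xarray as xr

# chunks={} keeps the exact on-disk chunk sizes, e.g. data stored in
# 256x256 chunks comes back as a dask array with 256x256 chunks.
ds_exact = xr.open_dataset("chunked_file.nc", chunks={})

# chunks="auto" lets dask choose multiples of the on-disk chunks that
# approach its preferred chunk size, e.g. 256x256 might become 1024x1024.
ds_auto = xr.open_dataset("chunked_file.nc", chunks="auto")

print(ds_exact["some_var"].chunks)
print(ds_auto["some_var"].chunks)
```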

@jhamman (Member) left a comment

Thanks for taking this on @mraspaud. This has been something folks have wanted for some time now!

(Resolved review thread on xarray/tests/test_backends.py)
@Illviljan Illviljan added the run-benchmark (Run the ASV benchmark workflow) label Jun 28, 2023
@dcherian dcherian requested a review from jhamman July 3, 2023 15:40
@dcherian
Contributor

Found 5 errors in 1 file (checked 142 source files)
xarray/tests/test_backends.py:1568: error: Need type annotation for "tmp_file"  [var-annotated]
xarray/tests/test_backends.py:1577: error: Argument 1 to "contextmanager" has incompatible type "Callable[[NetCDF4Base, Any, Any], None]"; expected "Callable[[NetCDF4Base, Any, Any], Iterator[<nothing>]]"  [arg-type]
xarray/tests/test_backends.py:1578: error: The return type of a generator function should be "Generator" or one of its supertypes  [misc]
xarray/tests/test_backends.py:1599: error: Need type annotation for "tmp_file"  [var-annotated]

@Illviljan @headtr1ck help!
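Not necessarily the fix that was applied here, but mypy's [var-annotated] error generally means it cannot infer a variable's type from its first assignment, and an explicit annotation resolves it. A hypothetical sketch, with tempfile standing in for the test suite's tmp-file helper:

```python
import tempfile

# An explicit annotation on first assignment satisfies [var-annotated].
tmp_file: str = tempfile.NamedTemporaryFile(suffix=".nc").name
```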

assert all(np.asanyarray(chunksizes) == expected)

@contextlib.contextmanager
def create_chunked_file(self, array_shape, chunk_sizes) -> None:
Collaborator

Suggested change:
- def create_chunked_file(self, array_shape, chunk_sizes) -> None:
+ def create_chunked_file(self, array_shape: tuple[int, int, int], chunk_sizes: tuple[int, int, int]) -> Generator[str, None, None]:

I haven't checked it, but it should point you in the right direction.
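One detail the suggestion leaves implicit: Generator must be imported for the annotation to resolve. A self-contained sketch of the annotated signature, with a placeholder body standing in for the real file-creation logic:

```python
import contextlib
from collections.abc import Generator  # typing.Generator on Python < 3.9


class NetCDF4Base:
    @contextlib.contextmanager
    def create_chunked_file(
        self,
        array_shape: tuple[int, int, int],
        chunk_sizes: tuple[int, int, int],
    ) -> Generator[str, None, None]:
        # The real method writes a chunked netCDF file and yields its
        # path; here we yield a placeholder to keep the sketch runnable.
        yield "path/to/tmp.nc"
```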

@mraspaud
Contributor Author

I attempted to fix the type annotation problem; please tell me if there is anything more.

@mraspaud
Contributor Author

@dcherian @jhamman anything more I can do on this?

@dcherian (Contributor) left a comment

LGTM! Thanks for your patience here @mraspaud

Can you add a note to whats-new to advertise this great improvement please?

@dcherian dcherian added the plan to merge (Final call for comments) label Aug 31, 2023
@mraspaud
Contributor Author

The whats-new entry is now updated.

@dcherian
Contributor

dcherian commented Sep 8, 2023

The mypy failures are related to these changes, I think:

xarray/tests/test_backends.py:1559: error: "str" has no attribute "data"  [attr-defined]
xarray/tests/test_backends.py:1559: error: Invalid index type "str" for "str"; expected type "SupportsIndex | slice"  [index]
xarray/tests/test_backends.py:1576: error: "str" has no attribute "data"  [attr-defined]
xarray/tests/test_backends.py:1576: error: Invalid index type "str" for "str"; expected type "SupportsIndex | slice"  [index]
xarray/tests/test_backends.py:1610: error: "str" has no attribute "encoding"  [attr-defined]
xarray/tests/test_backends.py:1610: error: Invalid index type "str" for "str"; expected type "SupportsIndex | slice"  [index]

@Illviljan can you take a look please?

@dcherian dcherian removed the plan to merge (Final call for comments) label Sep 9, 2023
@Illviljan Illviljan enabled auto-merge (squash) September 11, 2023 18:30
@Illviljan Illviljan disabled auto-merge September 11, 2023 23:05
@Illviljan Illviljan merged commit de66dae into pydata:main Sep 11, 2023
@Illviljan
Contributor

Thanks, @mraspaud!

@mraspaud mraspaud deleted the feature-nc-preferred-chunks branch September 12, 2023 09:00
@mraspaud
Contributor Author

My pleasure! Thanks for merging!

Labels
io · run-benchmark (Run the ASV benchmark workflow) · topic-backends

Development
Successfully merging this pull request may close these issues:
If a NetCDF file is chunked on disk, open it with compatible dask chunks

7 participants