
Implement preferred_chunks for netcdf 4 backends #7948

Merged: 21 commits into pydata:main, Sep 11, 2023

Conversation

mraspaud
Contributor

@mraspaud mraspaud commented Jun 28, 2023

According to the open_dataset documentation, using chunks="auto" or chunks={} should yield datasets whose variables are chunked according to the backend's preferred chunks. However, neither the netcdf4 nor the h5netcdf backend seems to implement the preferred_chunks encoding attribute needed for this to work.

This PR adds that attribute to the encoding when the data is read. As a result, chunks="auto" in open_dataset returns variables whose chunk sizes are multiples of the chunks in the nc file, and chunks={} returns variables with exactly the nc chunk sizes.
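For illustration only (this is not the PR's actual diff), the mechanism amounts to recording the on-disk chunk sizes in each variable's encoding under the preferred_chunks key, keyed by dimension name. A minimal sketch, assuming the netCDF4 Python API; the function name is made up:

```python
import netCDF4


def preferred_chunks_encoding(nc_var: netCDF4.Variable) -> dict:
    # Variable.chunking() returns the string "contiguous" for unchunked
    # variables, or a list of per-dimension chunk sizes.
    chunks = nc_var.chunking()
    if chunks is None or chunks == "contiguous":
        return {}
    # Map each dimension name to its on-disk chunk size.
    return {"preferred_chunks": dict(zip(nc_var.dimensions, chunks))}
```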

@mraspaud
Contributor Author

Hope this is up to the standards! Please tell me if there is anything I can do to improve this PR.

@mraspaud mraspaud changed the title Implement peferred_chunks for netcdf 4 backend Implement preferred_chunks for netcdf 4 backend Jun 28, 2023
@mraspaud mraspaud changed the title Implement preferred_chunks for netcdf 4 backend Implement preferred_chunks for netcdf 4 backends Jun 28, 2023
@djhoese
Copy link
Contributor

djhoese commented Jun 28, 2023

Is the resulting chunk size a multiple of the on-disk file chunk size, or is it exactly the file chunk size? That is, if I open a file with an on-disk chunk size of 256, do I get a dask array with chunk size 256, or some larger chunk size that is a multiple of 256? By itself, 256 would perform worse than one large chunk for many array sizes, just from the overhead.

@mraspaud
Contributor Author

@djhoese If chunks={}, the exact on-disk chunk sizes are used. If chunks="auto", a multiple of the on-disk chunk size is used, to accommodate dask's chunk size preferences. As I understand it, this is in line with what the open_dataset docstring says.
For our use case, chunks="auto" with this patch gave a significant performance boost in processing the data compared to the main branch.
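To make the two modes concrete, a minimal usage sketch; the file name "chunked_file.nc" and variable name "some_var" are hypothetical, and the 1024x1024 figure is just an example of what dask might pick:

```python
import xarray as xr

# chunks={} keeps the exact on-disk chunk sizes, e.g. data stored in
# 256x256 chunks comes back as a dask array with 256x256 chunks.
ds_exact = xr.open_dataset("chunked_file.nc", chunks={})

# chunks="auto" lets dask choose multiples of the on-disk chunks that
# approach its preferred chunk size, e.g. 256x256 might become 1024x1024.
ds_auto = xr.open_dataset("chunked_file.nc", chunks="auto")

print(ds_exact["some_var"].chunks)
print(ds_auto["some_var"].chunks)
```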

@jhamman (Member) left a comment

Thanks for taking this on @mraspaud. This has been something folks have wanted for some time now!

(Resolved review thread on xarray/tests/test_backends.py)
@Illviljan Illviljan added the run-benchmark (Run the ASV benchmark workflow) label Jun 28, 2023
@dcherian dcherian requested a review from jhamman July 3, 2023 15:40
@dcherian
Contributor

Found 5 errors in 1 file (checked 142 source files)
xarray/tests/test_backends.py:1568: error: Need type annotation for "tmp_file"  [var-annotated]
xarray/tests/test_backends.py:1577: error: Argument 1 to "contextmanager" has incompatible type "Callable[[NetCDF4Base, Any, Any], None]"; expected "Callable[[NetCDF4Base, Any, Any], Iterator[<nothing>]]"  [arg-type]
xarray/tests/test_backends.py:1578: error: The return type of a generator function should be "Generator" or one of its supertypes  [misc]
xarray/tests/test_backends.py:1599: error: Need type annotation for "tmp_file"  [var-annotated]

@Illviljan @headtr1ck help!
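Not necessarily the fix that was applied here, but mypy's [var-annotated] error generally means it cannot infer a variable's type from its first assignment, and an explicit annotation resolves it. A hypothetical sketch, with tempfile standing in for the test suite's tmp-file helper:

```python
import tempfile

# An explicit annotation on first assignment satisfies [var-annotated].
tmp_file: str = tempfile.NamedTemporaryFile(suffix=".nc").name
```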

assert all(np.asanyarray(chunksizes) == expected)

@contextlib.contextmanager
def create_chunked_file(self, array_shape, chunk_sizes) -> None:
Collaborator

Suggested change:
- def create_chunked_file(self, array_shape, chunk_sizes) -> None:
+ def create_chunked_file(self, array_shape: tuple[int, int, int], chunk_sizes: tuple[int, int, int]) -> Generator[str, None, None]:

I haven't checked it, but it should point you in the right direction.
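One detail the suggestion leaves implicit: Generator must be imported for the annotation to resolve. A self-contained sketch of the annotated signature, with a placeholder body standing in for the real file-creation logic:

```python
import contextlib
from collections.abc import Generator  # typing.Generator on Python < 3.9


class NetCDF4Base:
    @contextlib.contextmanager
    def create_chunked_file(
        self,
        array_shape: tuple[int, int, int],
        chunk_sizes: tuple[int, int, int],
    ) -> Generator[str, None, None]:
        # The real method writes a chunked netCDF file and yields its
        # path; here we yield a placeholder to keep the sketch runnable.
        yield "path/to/tmp.nc"
```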

@mraspaud
Contributor Author

I attempted to fix the type annotation problem; please tell me if there is anything more.

@mraspaud
Contributor Author

@dcherian @jhamman anything more I can do on this?

@dcherian (Contributor) left a comment

LGTM! Thanks for your patience here @mraspaud

Can you add a note to whats-new to advertise this great improvement please?

@dcherian dcherian added the plan to merge (Final call for comments) label Aug 31, 2023
@mraspaud
Contributor Author

The whats-new entry is now updated.

@dcherian
Contributor

dcherian commented Sep 8, 2023

The mypy failures are related to these changes, I think:

xarray/tests/test_backends.py:1559: error: "str" has no attribute "data"  [attr-defined]
xarray/tests/test_backends.py:1559: error: Invalid index type "str" for "str"; expected type "SupportsIndex | slice"  [index]
xarray/tests/test_backends.py:1576: error: "str" has no attribute "data"  [attr-defined]
xarray/tests/test_backends.py:1576: error: Invalid index type "str" for "str"; expected type "SupportsIndex | slice"  [index]
xarray/tests/test_backends.py:1610: error: "str" has no attribute "encoding"  [attr-defined]
xarray/tests/test_backends.py:1610: error: Invalid index type "str" for "str"; expected type "SupportsIndex | slice"  [index]

@Illviljan can you take a look please?

@dcherian dcherian removed the plan to merge (Final call for comments) label Sep 9, 2023
@Illviljan Illviljan enabled auto-merge (squash) September 11, 2023 18:30
@Illviljan Illviljan disabled auto-merge September 11, 2023 23:05
@Illviljan Illviljan merged commit de66dae into pydata:main Sep 11, 2023
@Illviljan
Contributor

Thanks, @mraspaud!

@mraspaud mraspaud deleted the feature-nc-preferred-chunks branch September 12, 2023 09:00
@mraspaud
Contributor Author

My pleasure! Thanks for merging!

Labels
io · run-benchmark (Run the ASV benchmark workflow) · topic-backends

Development
Successfully merging this pull request may close these issues:
If a NetCDF file is chunked on disk, open it with compatible dask chunks

7 participants