
Automatic region detection and transpose for to_zarr() #8434

Merged — 12 commits into pydata:main on Nov 14, 2023

Conversation

@slevang (Contributor) commented Nov 9, 2023

A quick pass at implementing these two improvements for zarr region writes:

  1. allow passing region={dim: "auto"}, which opens the existing zarr store and identifies the correct slice to write to, using a variation of the approach suggested by @DahnJ here. We also check for non-matching coordinates and non-contiguous indices.
  2. automatically transpose dimensions if they otherwise match the existing store but are out of order

@github-actions bot added the topic-backends, topic-zarr (Related to zarr storage library), and io labels — Nov 9, 2023


def _auto_detect_regions(ds, region, store):
    ds_original = open_zarr(store)
@slevang (Contributor Author)

This may be neglecting important args passed to the original to_zarr() call. Should we pass things through here like consolidated, group, zarr_version, etc?

@max-sixty (Collaborator) commented Nov 9, 2023

We could add **kwargs to both to cover bases...

Though enumerating the list of options will be manual. I think the ones you suggest above make sense. If we miss something, it's not a disaster — "auto" may not work in some rare cases, but I don't think it will be an unexpected error, and we can add kwargs later.

@@ -1578,7 +1622,7 @@ def to_zarr(
     compute: bool = True,
     consolidated: bool | None = None,
     append_dim: Hashable | None = None,
-    region: Mapping[str, slice] | None = None,
+    region: Mapping[str, slice | Literal["auto"]] | None = None,
@max-sixty (Collaborator)

We could also offer region="auto" to run the "auto" on all dims (but not required)

@slevang (Contributor Author)

Oh yeah, that's an obvious improvement, thanks. Now I'm wondering, is there any reason not to have this be the only auto option, i.e. Mapping[str, slice] | Literal["auto"]? Would there ever be a need to do a mix e.g. region={"x": "auto", "y": slice(0, 10)}?

@max-sixty (Collaborator)

Yes good point, I think that's a simpler API.

I guess there's possibly some corner cases where we have one index but not the other. No strong view on whether it's worth the additional complexity to enable those...

@max-sixty (Collaborator) left a comment

Awesome, thanks @slevang !


Separately, if anyone can get to #8371, I think that increases in importance with this PR, since it'll be easier to write to unaligned chunks. And losing data is really not ideal...

@max-sixty (Collaborator)

This is awesome so I really don't want to put up any blockers, but one thought:

If we are writing to the index that we're reading, is that going to be a race condition between two concurrent processes that are writing to different regions?

I think we should probably not be writing any index when we're using region, as either it:

  • Is along a region-dim, in which case there's now a race condition
  • Isn't along a region dim, in which case it definitionally doesn't share a dim with a region dim and can't be written anyway

This would also mitigate #6069 (comment)

@slevang (Contributor Author) commented Nov 9, 2023

If we are writing to the index that we're reading, is that going to be a race condition between two concurrent processes that are writing to different regions?

I think this is definitely a valid concern. Trying to wrap my head around potential uses, but if you're indeed correct that we should never write an index when using region, we could easily add a step here that just drops any IndexVariables from the dataset:

https://github.com/pydata/xarray/blob/main/xarray/backends/api.py#L1697-L1704

Aren't there perhaps cases where someone would want to update indices as well as data when writing to region? This won't really work now with auto, but I could imagine uses with explicit slices.
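That drop-the-IndexVariables step could look roughly like this (a standalone sketch, not the PR's actual code; the dataset is illustrative):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"test": (("x", "y"), np.ones((4, 4)))},
    coords={"x": np.arange(4), "y": np.arange(4)},
)

# The IndexVariables are exactly the dimension coordinates, so dropping
# the keys of ds.indexes removes them while leaving data variables intact.
ds_no_indexes = ds.drop_vars(list(ds.indexes))
```

Concurrent region writers would then never touch the shared on-disk indexes at all.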

@max-sixty (Collaborator) commented Nov 10, 2023

Aren't there perhaps cases where someone would want to update indices as well as data when writing to region? This won't really work now with auto, but I could imagine uses with explicit slices.

We could start by dropping indexes on region="auto" — then there's no change in behavior, always an escape hatch, and we eliminate the race condition.

We'd still have the small inconvenience of #6069 (comment), and more complicated behavior to explain. I also don't think it's ever needed to write indexes in chunks — indexes always need to be in memory at the moment anyway. So if anyone (@jhamman ?) is confident that that assertion is correct, we could drop indexes on any region=.

Does that make sense?

Co-authored-by: Maximilian Roos <[email protected]>
@max-sixty (Collaborator)

I don't mean to keep hanging things on this Christmas tree, but to the extent you're up for it, adding a suggestion to use region to this error message would be a perfect fit, since this will just work now¹:

ValueError: variable 'lat' already exists with different dimension sizes: {'lat': 25} != {'lat': 10}. to_zarr() only supports changing dimension sizes when explicitly appending, but append_dim=None.

Footnotes

  1. if the chunks align! Otherwise it'll mangle the data, but we can fix that...

@rabernat (Contributor)

Thanks so much @slevang for taking on this ambitious and very useful feature! 🙌

From my point of view, I've been a little reluctant around this because of the complexity of trying to align coordinates between source and target dataset. I'd like to have a little discussion about what sort of behavior we want to support. This PR makes some reasonable but ultimately opinionated choices about how to do it.

What we are talking about here is alignment. Xarray already has conventions and machinery around this, i.e. align. Could we reuse the align logic here? My reading of the code and tests is that this PR basically implements some variation of join="exact". Could we also support other join modes? Let's try to enumerate all the different possibilities here.

I am looking at a very similar problem with appending data in #8427

@slevang (Contributor Author) commented Nov 10, 2023

Could we also support other join modes? Let's try to enumerate all the different possibilities here.

Yes, happy to iterate on this. When I wrote this up, I was only thinking of mode="r+", for which I believe an "exact" join of some form is all we can support, since we don't allow changing shapes or metadata, and zarr can only write slices. I'm not sure the .sel() I used corresponds directly to any xr.align mode — it's sort of an "inner" join with a subsequent "exact" check?

But what I hadn't thought of is that you can also use mode="a" with region, in which case we need some more complicated logic to find any overlapping portion, plus a potential expansion of the index space. Does that sound right?

@slevang (Contributor Author) commented Nov 10, 2023

Allowing region="auto" together with mode="a" can definitely put us in the race condition suggested by @max-sixty. We need to first read the appended dim and then update it in one go, which makes this type of combined read/write op inherently unsafe from multiple processes. Unless someone sees another way, this makes me think we perhaps shouldn't support the region="auto" + mode="a" combination at all, or we should thoroughly warn about this pattern in the docs.

These issues are summarized pretty nicely in this post on #6069.

I'll say that for my own uses, I've come to think that pre-allocating a store with compute=False (oversize if necessary), then writing the chunks with r+ is a much safer pattern overall. For something like a daily data collection job with mode="a", I've been bitten too many times by a process randomly hanging somewhere in the middle of 1) updating indices, 2) writing data or 3) consolidating metadata. Then you end up with a store that is effectively corrupted and inconsistent which is a huge hassle to fix.

If we use r+ and pre-allocation, we shouldn't have to touch the indices at all, and then the only remaining pitfall I can see is lost data with parallel writes on an unaligned chunk structure.

@max-sixty (Collaborator)

If we use r+ and pre-allocation, we shouldn't have to touch the indices at all, and then the only remaining pitfall I can see is lost data with parallel writes on an unaligned chunk structure.

Agree! I had been holding that r+ was the impactful feature here.

Am I correct in thinking we'd need to change the proposed code to not write the region-ed index?

@rabernat (Contributor)

I wonder if we are hitting the limits of file-like "mode" syntax and instead want to use a more expressive, sql-like lexicon like "update", "insert", "upsert" to describe all of these options.

I'm in transit right now and writing from my phone. Will try to provide some more detailed responses to the excellent discussion above soon.

@max-sixty (Collaborator)

I wonder if we are hitting the limits of file-like "mode" syntax and instead want to use a more expressive, sql-like lexicon like "update", "insert", "upsert" to describe all of these options.

I'm in transit right now and writing from my phone. Will try to provide some more detailed responses to the excellent discussion above soon.

OK. I agree we could consider formalizing & naming these.

How do you think about whether we merge a partial change? You have better & wider context than me, but I had thought that the proposed code dominated the old code without precluding any future additions (though dependent on the range of potential future additions!).

@rabernat (Contributor)

How do you think about whether we merge a partial change?

Sure, let's move forward with this. It seems like a great improvement. But I do think we need to zoom out and take a more comprehensive look at how we "join" when writing to existing zarr datasets (both with region and with append).

Regarding the race condition on dimension coordinates, I believe that a very reasonable approach would be to never write dimension coordinates with region='auto'. The added benefit of this is that it will also improve performance because there are fewer I/O ops. AFAIU, the whole point of this new mode is to align exactly with the target dataset dimension coordinates, so overwriting them is unnecessary.

@rabernat (Contributor) left a comment

A few suggestions about how to implement and test that dimension coordinates are not overwritten.

    storage_options=storage_options,
    zarr_version=zarr_version,
)
region = _validate_and_autodetect_region(dataset, region, open_kwargs)
@rabernat (Contributor)

Somewhere around here might be the right place to drop the dimension coordinates for region='auto'.


expected = ds.copy()
expected["test"][2:4, 6:8] += 1
assert_identical(ds_updated, expected)
@rabernat (Contributor)

This would be the place to add a test verifying that dimension coordinates are not being overwritten. To accomplish this, you could use a pattern similar to what @olimcc did in #8297: patch the store object so that it counts how many times a certain method has been called.

with (
    self.create_zarr_target() as store,
    patch.object(
        Group, "__getitem__", side_effect=Group.__getitem__, autospec=True
    ) as mock,
):
    ds.to_zarr(store, mode="w")
    # We expect this to request array metadata information, so call_count
    # should be >= 1; at the time of writing, 2 calls are made
    xrds = xr.open_zarr(store)
    call_count = mock.call_count
    assert call_count > 0
    # compute() requests array data, which should not trigger additional
    # metadata requests; we assert that the number of calls has not
    # increased after fetching the array
    xrds.test.compute(scheduler="sync")
    assert mock.call_count == call_count

Here you would want to patch __setitem__ rather than __getitem__.
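The same patching idea can be shown on a toy stand-in class (FakeStore is hypothetical, purely to illustrate recording `__setitem__` calls; the real test would patch zarr's Group):

```python
from unittest.mock import patch

class FakeStore:
    """Hypothetical stand-in for zarr.Group, to demonstrate the pattern."""

    def __init__(self):
        self.arrays = {}

    def __setitem__(self, key, value):
        self.arrays[key] = value

written = []

def record(self, key, value):
    # Record which keys get written, then perform the real write.
    written.append(key)
    self.arrays[key] = value

with patch.object(FakeStore, "__setitem__", side_effect=record, autospec=True):
    store = FakeStore()
    store["test"] = [1, 2, 3]  # a data variable being written
    store["x"] = [0, 1, 2]     # a dimension coordinate being written
```

The test can then assert on `written` — for example, that no dimension-coordinate key appears in it after a region write.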

@slevang (Contributor Author) commented Nov 13, 2023

That all makes sense to me. Shouldn't be hard to drop the indices and test that they aren't being written. @rabernat @max-sixty do you agree that we should explicitly disallow mode="a" with any variant of region="auto" for now then?

@rabernat (Contributor)

do you agree that we should explicitly disallow mode="a" with any variant of region="auto"

👍 to this.

@max-sixty (Collaborator)

That all makes sense to me. Shouldn't be hard to drop the indices and test that they aren't being written. @rabernat @max-sixty do you agree that we should explicitly disallow mode="a" with any variant of region="auto" for now then?

That's totally fine for now.

I think as we clarify these categories, we'll have mode="a" be unsafe for concurrent writes anyway, so allowing region="auto" there would actually be OK. But I definitely don't want to slow this down.

@slevang requested a review from rabernat — November 14, 2023 02:38
Comment on lines 1712 to 1716
[
    name
    for name, v in dataset.variables.items()
    if isinstance(v, IndexVariable)
]
@max-sixty (Collaborator) commented Nov 14, 2023

Does this suffice?

Suggested change:
-    [
-        name
-        for name, v in dataset.variables.items()
-        if isinstance(v, IndexVariable)
-    ]
+    dataset.indexes

@slevang (Contributor Author)

Nice, yes
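The equivalence behind the suggestion is easy to check (a small sketch; the dataset and names are illustrative):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"test": (("x", "y"), np.zeros((2, 3)))},
    coords={"x": [0, 1], "y": [10, 20, 30], "label": "run-1"},
)

# The list comprehension over IndexVariables and the keys of ds.indexes
# pick out the same names: the dimension coordinates, but not the scalar
# coordinate "label" or the data variable "test".
via_isinstance = [
    name for name, v in ds.variables.items() if isinstance(v, xr.IndexVariable)
]
```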

@max-sixty (Collaborator) left a comment

Looks excellent! Thanks a lot @slevang .

I'll wait for @rabernat before hitting the button.

Comment on lines +5350 to +5351
assert "x" not in written_variables
assert "y" not in written_variables
@rabernat (Contributor)

👌

@max-sixty (Collaborator)

I'm hitting the button! Thank you v much @slevang , this is a big improvement!

@max-sixty merged commit f0ade3d into pydata:main — Nov 14, 2023
dcherian added a commit to rabernat/xarray that referenced this pull request Nov 29, 2023
* main:
  [skip-ci] dev whats-new (pydata#8467)
  2023.11.0 Whats-new (pydata#8461)
  migrate the other CI to python 3.11 (pydata#8416)
  preserve vlen string dtypes, allow vlen string fill_values (pydata#7869)
  Pin mypy < 1.7 (pydata#8458)
  Fix typos found by codespell (pydata#8457)
  [skip-ci] Small updates to IO docs. (pydata#8452)
  Deprecate certain cftime frequency strings following pandas (pydata#8415)
  Added driver parameter for h5netcdf (pydata#8360)
  Raise exception in to_dataset if resulting variable is also the name of a coordinate (pydata#8433)
  Automatic region detection and transpose for `to_zarr()` (pydata#8434)
  remove `cdms2` (pydata#8441)
  Remove PseudoNetCDF (pydata#8446)
  Pin pint to >=0.22 (pydata#8445)
  Remove keep_attrs from resample signature (pydata#8444)
  Rename `to_array` to `to_dataarray` (pydata#8438)
  Add missing DataArray.dt.total_seconds() method (pydata#8435)
  Declare Dataset, DataArray, Variable, GroupBy unhashable (pydata#8392)
Labels: io, topic-backends, topic-zarr (Related to zarr storage library)
Projects: none yet

Successfully merging this pull request may close these issues:

  • Allow passing coordinates in to_zarr(region=...) rather than passing indexes

3 participants