
Automatic region detection and transpose for to_zarr() #8434

Merged — 12 commits into pydata:main on Nov 14, 2023

Conversation

@slevang (Contributor) commented Nov 9, 2023

A quick pass at implementing these two improvements for zarr region writes:

  1. allow passing region={dim: "auto"}, which opens the existing zarr store and identifies the correct slice to write to, using a variation of the approach suggested by @DahnJ here. We also check for non-matching coordinates and non-contiguous indices.
  2. automatically transpose dimensions if they otherwise match the existing store but are out of order

@github-actions bot added the topic-backends, topic-zarr (Related to zarr storage library), and io labels — Nov 9, 2023


def _auto_detect_regions(ds, region, store):
    ds_original = open_zarr(store)
@slevang (Contributor Author)

This may be neglecting important args passed to the original to_zarr() call. Should we pass things through here like consolidated, group, zarr_version, etc?

@max-sixty (Collaborator) commented Nov 9, 2023

We could add **kwargs to both to cover bases...

Though enumerating the list of options will be manual. I think the ones you suggest above make sense. If we miss something, it's not a disaster — "auto" may not work in some rare cases, but I don't think it will be an unexpected error, and we can add kwargs later.

@@ -1578,7 +1622,7 @@ def to_zarr(
     compute: bool = True,
     consolidated: bool | None = None,
     append_dim: Hashable | None = None,
-    region: Mapping[str, slice] | None = None,
+    region: Mapping[str, slice | Literal["auto"]] | None = None,
@max-sixty (Collaborator)

We could also offer region="auto" to run the "auto" on all dims (but not required)

@slevang (Contributor Author)

Oh yeah, that's an obvious improvement, thanks. Now I'm wondering, is there any reason not to have this be the only auto option, i.e. Mapping[str, slice] | Literal["auto"]? Would there ever be a need to do a mix e.g. region={"x": "auto", "y": slice(0, 10)}?

@max-sixty (Collaborator)

Yes good point, I think that's a simpler API.

I guess there's possibly some corner cases where we have one index but not the other. No strong view on whether it's worth the additional complexity to enable those...

@max-sixty (Collaborator) left a comment

Awesome, thanks @slevang !


Separately, if anyone can get to #8371, I think that increases in importance with this PR, since it'll be easier to write to unaligned chunks. And losing data is really not ideal...

@max-sixty (Collaborator)

This is awesome so I really don't want to put up any blockers, but one thought:

If we are writing to the index that we're reading, is that going to be a race condition between two concurrent processes that are writing to different regions?

I think we should probably not be writing any index when we're using region, as either it:

  • Is along a region-dim, in which case there's now a race condition
  • Isn't along a region dim, in which case it definitionally doesn't share a dim with a region dim and can't be written anyway

This would also mitigate #6069 (comment)

@slevang (Contributor Author) commented Nov 9, 2023

If we are writing to the index that we're reading, is that going to be a race condition between two concurrent processes that are writing to different regions?

I think this is definitely a valid concern. Trying to wrap my head around potential uses, but if you're indeed correct that we should never write an index when using region, we could easily add a step here that just drops any IndexVariables from the dataset:

https://github.com/pydata/xarray/blob/main/xarray/backends/api.py#L1697-L1704

Aren't there perhaps cases where someone would want to update indices as well as data when writing to region? This won't really work now with auto, but I could imagine uses with explicit slices.
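That drop-the-IndexVariables step could look roughly like this (a standalone sketch, not the PR's actual code; the dataset is illustrative):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"test": (("x", "y"), np.ones((4, 4)))},
    coords={"x": np.arange(4), "y": np.arange(4)},
)

# The IndexVariables are exactly the dimension coordinates, so dropping
# the keys of ds.indexes removes them while leaving data variables intact.
ds_no_indexes = ds.drop_vars(list(ds.indexes))
```

Concurrent region writers would then never touch the shared on-disk indexes at all.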

@max-sixty (Collaborator) commented Nov 10, 2023

Aren't there perhaps cases where someone would want to update indices as well as data when writing to region? This won't really work now with auto, but I could imagine uses with explicit slices.

We could start by dropping indexes on region="auto" — then there's no change in behavior, always an escape hatch, and we eliminate the race condition.

We'd still have the small inconvenience of #6069 (comment), and more complicated behavior to explain. I also don't think it's ever needed to write indexes in chunks — indexes always need to be in memory at the moment anyway. So if anyone (@jhamman ?) is confident that that assertion is correct, we could drop indexes on any region=.

Does that make sense?

Co-authored-by: Maximilian Roos <[email protected]>
@max-sixty (Collaborator)

I don't mean to keep hanging things on this Christmas tree, but to the extent you're up for it, adding a suggestion to use region to this error message would be a perfect fit, since this will just work now¹:

ValueError: variable 'lat' already exists with different dimension sizes: {'lat': 25} != {'lat': 10}. to_zarr() only supports changing dimension sizes when explicitly appending, but append_dim=None.

Footnotes

  1. if the chunks align! Otherwise it'll mangle the data, but we can fix that...

@rabernat (Contributor)

Thanks so much @slevang for taking on this ambitious and very useful feature! 🙌

From my point of view, I've been a little reluctant around this because of the complexity of trying to align coordinates between source and target dataset. I'd like to have a little discussion about what sort of behavior we want to support. This PR makes some reasonable but ultimately opinionated choices about how to do it.

What we are talking about here is alignment. Xarray already has conventions and machinery around this, i.e. align. Could we reuse the align logic here? My reading of the code and tests is that this PR basically implements some variation of join="exact". Could we also support other join modes? Let's try to enumerate all the different possibilities here.

I am looking at a very similar problem with appending data in #8427

@slevang (Contributor Author) commented Nov 10, 2023

Could we also support other join modes? Let's try to enumerate all the different possibilities here.

Yes, happy to iterate on this. When I wrote this up, I was only thinking of mode="r+", for which I believe an "exact" join of some form is all we can support, since we don't allow changing shapes or metadata, and zarr can only write slices. I'm not sure the .sel() I used corresponds directly to any xr.align mode — it's sort of an "inner" join with a subsequent "exact" check?

But what I hadn't thought of is that you can also use mode="a" with region, in which case we need some more complicated logic to find any overlapping portion, plus a potential expansion of the index space. Does that sound right?

@slevang (Contributor Author) commented Nov 10, 2023

Allowing region="auto" together with mode="a" can definitely put us in the race condition suggested by @max-sixty. We need to first read the appended dim and then update it in one go, which makes this type of combined read/write op inherently unsafe from multiple processes. Unless someone sees another way, this makes me think we perhaps shouldn't support the region="auto" + mode="a" combination at all, or we should thoroughly warn about this pattern in the docs.

These issues are summarized pretty nicely in this post on #6069.

I'll say that for my own uses, I've come to think that pre-allocating a store with compute=False (oversize if necessary), then writing the chunks with r+ is a much safer pattern overall. For something like a daily data collection job with mode="a", I've been bitten too many times by a process randomly hanging somewhere in the middle of 1) updating indices, 2) writing data or 3) consolidating metadata. Then you end up with a store that is effectively corrupted and inconsistent which is a huge hassle to fix.

If we use r+ and pre-allocation, we shouldn't have to touch the indices at all, and then the only remaining pitfall I can see is lost data with parallel writes on an unaligned chunk structure.

@max-sixty (Collaborator)

If we use r+ and pre-allocation, we shouldn't have to touch the indices at all, and then the only remaining pitfall I can see is lost data with parallel writes on an unaligned chunk structure.

Agree! I had been holding that r+ was the impactful feature here.

Am I correct in thinking we'd need to change the proposed code to not write the region-ed index?

@rabernat (Contributor)

I wonder if we are hitting the limits of file-like "mode" syntax and instead want to use a more expressive, sql-like lexicon like "update", "insert", "upsert" to describe all of these options.

I'm in transit right now and writing from my phone. Will try to provide some more detailed responses to the excellent discussion above soon.

@max-sixty (Collaborator)

I wonder if we are hitting the limits of file-like "mode" syntax and instead want to use a more expressive, sql-like lexicon like "update", "insert", "upsert" to describe all of these options.

I'm in transit right now and writing from my phone. Will try to provide some more detailed responses to the excellent discussion above soon.

OK. I agree we could consider formalizing & naming these.

How do you think about whether we merge a partial change? You have better & wider context than me, but I had thought that the proposed code dominated the old code without precluding any future additions (though dependent on the range of potential future additions!).

@rabernat (Contributor)

How do you think about whether we merge a partial change?

Sure, let's move forward with this. It seems like a great improvement. But I do think we need to zoom out and take a more comprehensive look at how we "join" when writing to existing zarr datasets (both with region and with append).

Regarding the race condition on dimension coordinates, I believe that a very reasonable approach would be to never write dimension coordinates with region='auto'. The added benefit of this is that it will also improve performance because there are fewer I/O ops. AFAIU, the whole point of this new mode is to align exactly with the target dataset dimension coordinates, so overwriting them is unnecessary.

@rabernat (Contributor) left a comment

A few suggestions about how to implement and test that dimension coordinates are not overwritten.

    storage_options=storage_options,
    zarr_version=zarr_version,
)
region = _validate_and_autodetect_region(dataset, region, open_kwargs)
@rabernat (Contributor)

Somewhere around here might be the right place to drop the dimension coordinates for region='auto'.


expected = ds.copy()
expected["test"][2:4, 6:8] += 1
assert_identical(ds_updated, expected)
@rabernat (Contributor)

This would be the place to add a test verifying that dimension coordinates are not being overwritten. To accomplish this, you could use a pattern similar to what @olimcc did in #8297: patch the store object so that it counts how many times a certain method has been called.

with (
    self.create_zarr_target() as store,
    patch.object(
        Group, "__getitem__", side_effect=Group.__getitem__, autospec=True
    ) as mock,
):
    ds.to_zarr(store, mode="w")
    # We expect this to request array metadata information, so call_count
    # should be >= 1; at the time of writing, 2 calls are made
    xrds = xr.open_zarr(store)
    call_count = mock.call_count
    assert call_count > 0
    # compute() requests array data, which should not trigger additional
    # metadata requests; we assert that the number of calls has not
    # increased after fetching the array
    xrds.test.compute(scheduler="sync")
    assert mock.call_count == call_count

Here you would want to patch __setitem__ rather than __getitem__.
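The same patching idea can be shown on a toy stand-in class (FakeStore is hypothetical, purely to illustrate recording `__setitem__` calls; the real test would patch zarr's Group):

```python
from unittest.mock import patch

class FakeStore:
    """Hypothetical stand-in for zarr.Group, to demonstrate the pattern."""

    def __init__(self):
        self.arrays = {}

    def __setitem__(self, key, value):
        self.arrays[key] = value

written = []

def record(self, key, value):
    # Record which keys get written, then perform the real write.
    written.append(key)
    self.arrays[key] = value

with patch.object(FakeStore, "__setitem__", side_effect=record, autospec=True):
    store = FakeStore()
    store["test"] = [1, 2, 3]  # a data variable being written
    store["x"] = [0, 1, 2]     # a dimension coordinate being written
```

The test can then assert on `written` — for example, that no dimension-coordinate key appears in it after a region write.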

@slevang (Contributor Author) commented Nov 13, 2023

That all makes sense to me. Shouldn't be hard to drop the indices and test that they aren't being written. @rabernat @max-sixty do you agree that we should explicitly disallow mode="a" with any variant of region="auto" for now then?

@rabernat (Contributor)

do you agree that we should explicitly disallow mode="a" with any variant of region="auto"

👍 to this.

@max-sixty (Collaborator)

That all makes sense to me. Shouldn't be hard to drop the indices and test that they aren't being written. @rabernat @max-sixty do you agree that we should explicitly disallow mode="a" with any variant of region="auto" for now then?

That's totally fine for now.

I think as we clarify these categories, we'll have mode="a" be unsafe for concurrent writes anyway, so allowing region="auto" there would actually be OK. But I definitely don't want to slow this down.

@slevang requested a review from rabernat — November 14, 2023 02:38
Comment on lines 1712 to 1716
[
    name
    for name, v in dataset.variables.items()
    if isinstance(v, IndexVariable)
]
@max-sixty (Collaborator) commented Nov 14, 2023

Does this suffice?

Suggested change:
-    [
-        name
-        for name, v in dataset.variables.items()
-        if isinstance(v, IndexVariable)
-    ]
+    dataset.indexes

@slevang (Contributor Author)

Nice, yes
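The equivalence behind the suggestion is easy to check (a small sketch; the dataset and names are illustrative):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"test": (("x", "y"), np.zeros((2, 3)))},
    coords={"x": [0, 1], "y": [10, 20, 30], "label": "run-1"},
)

# The list comprehension over IndexVariables and the keys of ds.indexes
# pick out the same names: the dimension coordinates, but not the scalar
# coordinate "label" or the data variable "test".
via_isinstance = [
    name for name, v in ds.variables.items() if isinstance(v, xr.IndexVariable)
]
```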

@max-sixty (Collaborator) left a comment

Looks excellent! Thanks a lot @slevang .

I'll wait for @rabernat before hitting the button.

Comment on lines +5350 to +5351
assert "x" not in written_variables
assert "y" not in written_variables
@rabernat (Contributor)

👌

@max-sixty (Collaborator)

I'm hitting the button! Thank you v much @slevang , this is a big improvement!

@max-sixty merged commit f0ade3d into pydata:main — Nov 14, 2023
dcherian added a commit to rabernat/xarray that referenced this pull request Nov 29, 2023
* main:
  [skip-ci] dev whats-new (pydata#8467)
  2023.11.0 Whats-new (pydata#8461)
  migrate the other CI to python 3.11 (pydata#8416)
  preserve vlen string dtypes, allow vlen string fill_values (pydata#7869)
  Pin mypy < 1.7 (pydata#8458)
  Fix typos found by codespell (pydata#8457)
  [skip-ci] Small updates to IO docs. (pydata#8452)
  Deprecate certain cftime frequency strings following pandas (pydata#8415)
  Added driver parameter for h5netcdf (pydata#8360)
  Raise exception in to_dataset if resulting variable is also the name of a coordinate (pydata#8433)
  Automatic region detection and transpose for `to_zarr()` (pydata#8434)
  remove `cdms2` (pydata#8441)
  Remove PseudoNetCDF (pydata#8446)
  Pin pint to >=0.22 (pydata#8445)
  Remove keep_attrs from resample signature (pydata#8444)
  Rename `to_array` to `to_dataarray` (pydata#8438)
  Add missing DataArray.dt.total_seconds() method (pydata#8435)
  Declare Dataset, DataArray, Variable, GroupBy unhashable (pydata#8392)
Labels: io, topic-backends, topic-zarr (Related to zarr storage library)
Projects: none yet

Successfully merging this pull request may close these issues:

  • Allow passing coordinates in to_zarr(region=...) rather than passing indexes

3 participants