Automatic region detection and transpose for to_zarr()
#8434
Conversation
xarray/backends/api.py
Outdated
    def _auto_detect_regions(ds, region, store):
        ds_original = open_zarr(store)
This may be neglecting important args passed to the original to_zarr() call. Should we pass things through here like consolidated, group, zarr_version, etc.?
We could add **kwargs to both to cover our bases... though enumerating the list of options will be manual. I think the ones you suggest above make sense. If we miss something, it's not a disaster: "auto" may not work in some rare cases, but I don't think it will be an unexpected error, and we can add kwargs later.
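To make the passthrough concrete, here is a minimal sketch of the idea, using a hypothetical helper name and a stub opener in place of the real open_zarr (the real helper would forward options like consolidated, group, and zarr_version):

```python
def open_original_for_autodetect(store, open_zarr, **open_kwargs):
    """Hypothetical sketch: forward backend options from the original
    to_zarr() call (e.g. consolidated, group, zarr_version) instead of
    opening the existing store with defaults only."""
    return open_zarr(store, **open_kwargs)


# Usage with a stub opener that simply records what it was given:
received = {}

def stub_open_zarr(store, **kwargs):
    received.update(kwargs)
    return "dataset"

result = open_original_for_autodetect(
    "some-store", stub_open_zarr, consolidated=True, group="sub", zarr_version=2
)
```

The point is only that whatever option set is chosen, it flows through unchanged to the opener.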
xarray/backends/api.py
Outdated
@@ -1578,7 +1622,7 @@ def to_zarr(
     compute: bool = True,
     consolidated: bool | None = None,
     append_dim: Hashable | None = None,
-    region: Mapping[str, slice] | None = None,
+    region: Mapping[str, slice | Literal["auto"]] | None = None,
We could also offer region="auto" to run the "auto" detection on all dims (but not required).
Oh yeah, that's an obvious improvement, thanks. Now I'm wondering: is there any reason not to have this be the only auto option, i.e. Mapping[str, slice] | Literal["auto"]? Would there ever be a need to do a mix, e.g. region={"x": "auto", "y": slice(0, 10)}?
Yes, good point, I think that's a simpler API.
I guess there are possibly some corner cases where we have one index but not the other. No strong view on whether it's worth the additional complexity to enable those...
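The two call styles under discussion can be normalized internally; this is a sketch with a hypothetical helper name, not the actual implementation. region="auto" expands to auto-detection on every dimension, while a mapping can still mix "auto" entries with explicit slices:

```python
def normalize_region(region, dims):
    """Expand region="auto" to auto-detection on all dims; pass mappings
    through unchanged so explicit slices and "auto" entries can be mixed."""
    if region == "auto":
        return {dim: "auto" for dim in dims}
    return dict(region)


all_auto = normalize_region("auto", ["x", "y"])
mixed = normalize_region({"x": "auto", "y": slice(0, 10)}, ["x", "y"])
```

With a helper like this, downstream code only ever sees a per-dimension mapping.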
This is awesome so I really don't want to put up any blockers, but one thought: if we are writing to the index that we're reading, is that going to be a race condition between two concurrent processes that are writing to different regions? I think we should probably not be writing any index when we're using region.
This would also mitigate #6069 (comment).
I think this is definitely a valid concern. Trying to wrap my head around potential uses, but if you're indeed correct that we should never write an index when using region, the relevant write path is here: https://github.com/pydata/xarray/blob/main/xarray/backends/api.py#L1697-L1704
Aren't there perhaps cases where someone would want to update indices as well as data when writing to a region?
We could start by dropping indexes on region="auto". We'd still have the small inconvenience of #6069 (comment), and more complicated behavior to explain.
I also don't think it's ever needed to write indexes in chunks; indexes always need to be in memory at the moment anyway. So if anyone (@jhamman?) is confident that that assertion is correct, we could drop indexes on any region write.
Does that make sense?
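The dropping idea can be pictured with plain dicts standing in for dataset.variables and the dataset's index names (the names here are hypothetical stand-ins for illustration, not xarray API):

```python
def drop_index_variables(variables, index_names):
    """Return the variables to write with index (dimension-coordinate)
    entries removed, so a region write never touches the indexes on disk
    and concurrent region writers cannot race on them."""
    return {name: var for name, var in variables.items() if name not in index_names}


variables = {
    "x": [0, 1, 2],                              # dimension coordinate
    "y": [10, 20],                               # dimension coordinate
    "temperature": [[1, 2], [3, 4], [5, 6]],     # data variable
}
to_write = drop_index_variables(variables, index_names={"x", "y"})
```

Only the data variables survive, which is exactly what a region write needs to touch.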
Co-authored-by: Maximilian Roos <[email protected]>
I don't mean to keep hanging things on this Christmas tree, but to the extent you're up for it, adding a suggestion to use region to this error message would be a perfect fit, since this will just work now.
Thanks so much @slevang for taking on this ambitious and very useful feature! 🙌
From my point of view, I've been a little reluctant around this because of the complexity of trying to align coordinates between the source and target datasets. I'd like to have a little discussion about what sort of behavior we want to support. This PR makes some reasonable but ultimately opinionated choices about how to do it.
What we are talking about here is alignment. Xarray already has some conventions and machinery around this, i.e. align. Could we be reusing the align logic here? My reading of the code and tests is that this PR basically implements some variation of that alignment logic.
I am looking at a very similar problem with appending data in #8427.
Yes, happy to iterate on this. When I wrote this up, I was only thinking of mode="r+". But what I hadn't thought of is that you can also use region with mode="a".
Allowing that opens up harder alignment questions. These issues are summarized pretty nicely in this post on #6069. I'll say that for my own uses, I've come to think that pre-allocating a store with compute=False and then filling it in with region writes is the cleaner pattern. If we use ...
Agree! I had been holding that thought. Am I correct in thinking we'd need to change the proposed code to not write the region-ed index?
I wonder if we are hitting the limits of file-like "mode" syntax and instead want to use a more expressive, SQL-like lexicon like "update", "insert", "upsert" to describe all of these options. I'm in transit right now and writing from my phone. Will try to provide some more detailed responses to the excellent discussion above soon.
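One way to picture that lexicon, using a plain dict as a stand-in for an on-disk store (this is purely an illustration of the naming idea, not anything in xarray):

```python
def write(store, incoming, mode):
    """SQL-flavored write modes: "insert" adds only keys that don't exist
    yet, "update" overwrites only keys that already exist, and "upsert"
    does both."""
    if mode == "insert":
        updates = {k: v for k, v in incoming.items() if k not in store}
    elif mode == "update":
        updates = {k: v for k, v in incoming.items() if k in store}
    elif mode == "upsert":
        updates = dict(incoming)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    store.update(updates)
    return store


store = {"a": 1}
write(store, {"a": 2, "b": 3}, mode="update")  # only "a" is overwritten
```

Each of xarray's current mode/append_dim/region combinations could then be described as one of these verbs plus a selection.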
OK. I agree we could consider formalizing & naming these. How do you think about whether we merge a partial change? You have better & wider context than me, but I had thought that the proposed code dominated the old code without precluding any future additions (though that depends on the range of potential future additions!).
Sure, let's move forward with this. It seems like a great improvement. But I do think we need to zoom out and take a more comprehensive look at how we "join" when writing to existing zarr datasets (both with region and with append_dim). Regarding the race condition on dimension coordinates, I believe that a very reasonable approach would be to never write dimension coordinates with region.
A few suggestions about how to implement and test that dimension coordinates are not overwritten.
xarray/backends/api.py
Outdated
        storage_options=storage_options,
        zarr_version=zarr_version,
    )
    region = _validate_and_autodetect_region(dataset, region, open_kwargs)
Somewhere around here might be the right place to drop the dimension coordinates for region='auto'.
    expected = ds.copy()
    expected["test"][2:4, 6:8] += 1
    assert_identical(ds_updated, expected)
This would be the place to add a test to verify that dimension coordinates are not being overwritten. To accomplish this, you could use a similar pattern to what @olimcc did in #8297: patch the store object so that it counts how many times a certain method has been called
xarray/xarray/tests/test_backends.py
Lines 2892 to 2909 in e5d163a
with (
    self.create_zarr_target() as store,
    patch.object(
        Group, "__getitem__", side_effect=Group.__getitem__, autospec=True
    ) as mock,
):
    ds.to_zarr(store, mode="w")
    # We expect this to request array metadata information, so call_count should be >= 1,
    # At time of writing, 2 calls are made
    xrds = xr.open_zarr(store)
    call_count = mock.call_count
    assert call_count > 0
    # compute() requests array data, which should not trigger additional metadata requests
    # we assert that the number of calls has not increased after fetching the array
    xrds.test.compute(scheduler="sync")
    assert mock.call_count == call_count
Here you would want to patch __setitem__ rather than __getitem__.
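The counting pattern itself is plain unittest.mock and can be demonstrated self-contained with a toy store class (toy names; this assumes patching dunder methods on the class works as in the excerpt above):

```python
from unittest.mock import patch


class ToyStore:
    """Minimal stand-in for a zarr group that records writes."""

    def __init__(self):
        self.data = {}

    def __setitem__(self, key, value):
        self.data[key] = value


store = ToyStore()
with patch.object(
    ToyStore, "__setitem__", side_effect=ToyStore.__setitem__, autospec=True
) as mock:
    # side_effect delegates to the real method, so the write still happens,
    # but every call is counted by the mock.
    store["temperature"] = [1, 2, 3]
    writes = mock.call_count
```

In the real test, asserting that __setitem__ was never called for the index variables proves the dimension coordinates were not overwritten.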
That all makes sense to me. Shouldn't be hard to drop the indices and test that they aren't being written. @rabernat @max-sixty do you agree that we should explicitly disallow ...?
👍 to this.
That's totally fine for now. I think as we clarify these categories, we'll have ...
xarray/backends/api.py
Outdated
    [
        name
        for name, v in dataset.variables.items()
        if isinstance(v, IndexVariable)
    ]
Does this suffice?
    -[
    -    name
    -    for name, v in dataset.variables.items()
    -    if isinstance(v, IndexVariable)
    -]
    +dataset.indexes
Nice, yes
    assert "x" not in written_variables
    assert "y" not in written_variables
👌
I'm hitting the button! Thank you v much @slevang, this is a big improvement!
* main:
  - [skip-ci] dev whats-new (pydata#8467)
  - 2023.11.0 Whats-new (pydata#8461)
  - migrate the other CI to python 3.11 (pydata#8416)
  - preserve vlen string dtypes, allow vlen string fill_values (pydata#7869)
  - Pin mypy < 1.7 (pydata#8458)
  - Fix typos found by codespell (pydata#8457)
  - [skip-ci] Small updates to IO docs. (pydata#8452)
  - Deprecate certain cftime frequency strings following pandas (pydata#8415)
  - Added driver parameter for h5netcdf (pydata#8360)
  - Raise exception in to_dataset if resulting variable is also the name of a coordinate (pydata#8433)
  - Automatic region detection and transpose for `to_zarr()` (pydata#8434)
  - remove `cdms2` (pydata#8441)
  - Remove PseudoNetCDF (pydata#8446)
  - Pin pint to >=0.22 (pydata#8445)
  - Remove keep_attrs from resample signature (pydata#8444)
  - Rename `to_array` to `to_dataarray` (pydata#8438)
  - Add missing DataArray.dt.total_seconds() method (pydata#8435)
  - Declare Dataset, DataArray, Variable, GroupBy unhashable (pydata#8392)
Closes:
- to_zarr(region=...) rather than passing indexes (#7702)
- to_zarr could transpose dims (#8421)

- [x] whats-new.rst

A quick pass at implementing these two improvements for zarr region writes:
- region={dim: "auto"}, which opens the existing zarr store and identifies the correct slice to write to, using a variation of the approach suggested by @DahnJ here. We also check for non-matching coordinates and non-contiguous indices.
- Automatic transposing of the dataset's dimensions to match the order of the existing store, per #8421.
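The slice-detection step can be sketched in plain Python. This is a hypothetical helper for illustration; the real implementation works on the store's pandas indexes rather than lists, and also handles descending coordinates:

```python
def detect_region_slice(store_index, write_index):
    """Locate write_index within store_index and return the slice to write
    to. Raises on non-matching coordinates or non-contiguous positions."""
    positions = {value: i for i, value in enumerate(store_index)}
    try:
        matched = [positions[v] for v in write_index]
    except KeyError as err:
        raise ValueError(
            f"coordinate {err.args[0]!r} not found in existing store"
        ) from err
    if matched != list(range(matched[0], matched[0] + len(matched))):
        raise ValueError("matched coordinates are not a contiguous block")
    return slice(matched[0], matched[-1] + 1)


# e.g. writing the chunk covering coordinates 30 and 40 of a store whose
# index on this dimension is 10..50:
region = detect_region_slice([10, 20, 30, 40, 50], [30, 40])
```

The returned slice is what would be passed through as the region entry for that dimension; coordinates absent from the store, or present but scattered, are rejected rather than silently misplaced.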