Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inferring concatenation order from coordinate data values #18

Closed
TomNicholas opened this issue Mar 9, 2024 · 18 comments · Fixed by #52 or #357
Closed

Inferring concatenation order from coordinate data values #18

TomNicholas opened this issue Mar 9, 2024 · 18 comments · Fixed by #52 or #357
Labels
xarray Requires changes to xarray upstream

Comments

@TomNicholas
Copy link
Member

TomNicholas commented Mar 9, 2024

If xr.combine_by_coords is used, then indexes must be created for dimension coordinates in order to use them to infer the correct order of concatenation. This requires loading of the data for that coordinate in order to create the index. We can do this, but I think we will then end up with a Dataset which contains some ManifestArrays but also some numpy arrays for the dimension coordinates. This isn't itself a problem - we have that information, we can load it from the files.

The tricky part is that we presumably don't want to write those actual data values back out during the serialization step, we want to only write the chunk manifest for each dimension coordinate. How can we regain access to this information, or carry it around ready for serialization at the end?

@TomNicholas TomNicholas mentioned this issue Mar 10, 2024
15 tasks
@TomNicholas
Copy link
Member Author

TomNicholas commented Mar 11, 2024

My understanding (/ my recollection 😅 ) of the code in xr.combine_by_coords is that it infers concatenation order from the pandas indexes backing any 1D coordinate variables. So as long as those indexes are present and correct, the datasets can be ordered. So perhaps we could just manually create those indexes when we create the dataset (which requires loading data from the files), so the indexes are eager (i.e. how they normally work) but the corresponding array data is virtualized.

This would mean we would need to both read the kerchunk references and actually read data values from the file at dataset construction time. We would want to have a function flag to opt in to this behaviour.

@TomNicholas TomNicholas changed the title Virtualizing coordinates that were used to infer concat order Inferring concatenation order from coordinate data values Mar 11, 2024
@TomNicholas
Copy link
Member Author

TomNicholas commented Mar 27, 2024

Okay I prematurely closed this, as it was not actually as simple as it seemed.

perhaps we could just manually create those indexes when we create the dataset (which requires loading data from the files), so the indexes are eager (i.e. how they normally work) but the corresponding array data is virtualized.

This would mean we would need to both read the kerchunk references and actually read data values from the file at dataset construction time.

This is what I implemented in #52, but I didn't remember to check this part:

The tricky part is that we presumably don't want to write those actual data values back out during the serialization step, we want to only write the chunk manifest for each dimension coordinate.

Turns out that doesn't work yet - at some point within xr.combine_by_coords the coordinate variable is being recreated from the index as a numpy array, even though the original coordinate variable was a virtual ManifestArray. Hopefully this can be fixed with another change to xarray upstream similar to pydata/xarray#8872.

In the meantime I should remove that section from the docs so as not to commit false advertising! EDIT: Removed in 2c5be3f

cc @norlandrhagen

@benbovy
Copy link

benbovy commented Mar 28, 2024

So perhaps we could just manually create those indexes when we create the dataset (which requires loading data from the files), so the indexes are eager (i.e. how they normally work) but the corresponding array data is virtualized.

In general I wonder if it would be useful to have some custom Xarray index here. Basically a simple subclass or a very lightweight wrapper around xarray.indexes.PandasIndex where the .create_variables() method is re-implemented such that the indexed coordinate variable doesn't wrap the pandas.Index object but returns a variable still wrapping the original ManifestArray (IIUC).

we presumably don't want to write those actual data values back out during the serialization step

Would such decoupling of the pandas.Index from the indexed coordinate data solve this?

The tricky part is that we still need to refactor Xarray IndexVariable, which is still too tied to pandas.Index and currently prevents implementing lazy custom indexes (see pydata/xarray#8124). Perhaps as a temporary workaround we could use a regular Variable for the indexed coordinate instead of an IndexVariable...

@dcherian
Copy link

dcherian commented Mar 28, 2024

we presumably don't want to write those actual data values back out during the serialization step,

TWriting would speed things up reading from the consolidated files. Kerchunk calls this "inlining". For e.g. you wouldn't want read ~90k time steps from 90k files to construct a 90K long time coordinate

@TomNicholas
Copy link
Member Author

a variable still wrapping the original ManifestArray (IIUC).

Would such decoupling of the pandas.Index from the indexed coordinate data solve this?

Yeah that sounds like what we need! I basically need the ManifestArray inside the coordinate variable to still exist after xarray concat etc. operations, so that I can write it to disk. So I need the coordinate data to still wrap the original duck array, without having coerced it or dropped it in favour of a numpy array created from the pd.Index.

Perhaps as a temporary workaround we could use a regular Variable for the indexed coordinate instead of an IndexVariable...

I think this is what I was hoping I could achieve with pydata/xarray#8872, but clearly I was missing some subtleties with IndexVariable vs Variable.

In general I wonder if it would be useful to have some custom Xarray index here.

Naively, I don't really understand why this use case requires a custom Xarray index. I am still happy to use PandasIndex for the index class (i.e. not custom), I just want my wrapped coordinate variable data to not be messed with.

@TomNicholas
Copy link
Member Author

TWriting would speed things up reading from the consolidated files. Kerchunk calls this "inlining". For e.g. you wouldn't want read ~90k time steps from 90k files to construct a 90K long time coordinate

@dcherian if we have retained both the index and the ManifestArray, wouldn't the choice of whether or not to "inline" be something that can be deferred to the write step?

@dcherian
Copy link

IIUC yes.

@benbovy
Copy link

benbovy commented Mar 28, 2024

Naively, I don't really understand why this use case requires a custom Xarray index. I am still happy to use PandasIndex for the index class (i.e. not custom), I just want my wrapped coordinate variable data to not be messed with.

Xarray indexes have full control on the data wrapped by their corresponding indexed coordinates, via Index.create_variables().

Currently PandasIndex.create_variables() creates and returns a new IndexVariable backed by the pandas.Index to prevent duplicating data in memory. In most cases this is probably what we want but not here IIUC, so I'm afraid you'll need some custom index if you want to override that current behavior.

Actually, it might be enough to have a custom index as a lightweight wrapper around any xarray index (PandasIndex or other) where create_variables doesn't replace the original coordinate variable data.

@TomNicholas
Copy link
Member Author

Okay thanks. So implementing such an index would be essentially involve subclassing PandasIndex but overriding .create_variables()?

Actually, it might be enough to have a custom index as a lightweight wrapper around any xarray index (PandasIndex or other) where create_variables doesn't replace the original coordinate variable data.

This seems like something that might be useful in general, a sort of InaccessibleDuckArrayIndex.

@benbovy
Copy link

benbovy commented Mar 28, 2024

Alternatively, we could add a build option to PandasIndex to opt out from the default behavior. That would be perfectly reasonable IMO. No need for any subclass or wrapper then, unless you'd like to also support custom indexes in VirtualiZarr (if that's relevant).

@TomNicholas
Copy link
Member Author

Alternatively, we could add a build option to PandasIndex to opt out from the default behavior.

A build option as in a new kwarg to .create_variables() or something? How would I opt in to that?

unless you'd like to also support custom indexes in VirtualiZarr (if that's relevant).

I don't think it is? Currently the only reason I want these indexes is so that xr.combine_by_coords can look at them to infer the correct order of concatenation of the datasets. But there might be some future feature that would require a custom index...

@benbovy
Copy link

benbovy commented Mar 28, 2024

A build option as in a new kwarg to .create_variables() or something?

Rather a new argument to PandasIndex.__init__(..., wrap_index_data=True) such that we could opt out with something like ds.set_xindex("x", PandasIndex, wrap_index_data=False).

Currently the only reason I want these indexes is so that xr.combine_by_coords can look at them to infer the correct order of concatenation of the datasets. But there might be some future feature that would require a custom index...

Yeah I don't know. Maybe to eventually allow users manually setting custom indexes and doing some domain-specific operations before "saving" the resulting virtual dataset?

@TomNicholas
Copy link
Member Author

Rather a new argument to PandasIndex.__init__(..., wrap_index_data=True) such that we could opt out with something like ds.set_xindex("x", PandasIndex, wrap_index_data=False).

Both this and the subclassing-PandasIndex approach seem like good general solutions.

Yeah I don't know. Maybe to eventually allow users manually setting custom indexes and doing some domain-specific operations before "saving" the resulting virtual dataset?

Yeah quite possibly. One thing at a time though! I appreciate your input here.

@benbovy
Copy link

benbovy commented Mar 28, 2024

One thing at a time though!

Yes sure! :-)

@TomNicholas
Copy link
Member Author

@benbovy do we actually need a subclass / build option at all? Can't we just change the implementation of PandasIndex to say "If the wrapped variable contains a numpy array, avoid duplicating data (i.e. wrap_index_data=True). But if it contains any other duck array, return that instead unaltered (i.e. wrap_index_data=False)." So basically make the currently implementation a memory optimization that only gets applied if the PandasIndex sees that the variable is wrapping a numpy array. Are there any cases where that approach would fail?

@benbovy
Copy link

benbovy commented Mar 29, 2024

Are there any cases where that approach would fail?

I guess in theory such memory optimization might make sense for other, in-memory duck arrays? In practice I don't know much.

Exposing some control to advanced users for this behavior a-priori doesn't seem like a bad idea to me.

@TomNicholas
Copy link
Member Author

I realized that xr.combine_by_coords should actually already work fine - if you are willing to load the relevant coordinates into memory (and therefore also have those values saved into the resultant references on-disk).

In [1]: from virtualizarr import open_virtual_dataset

In [2]: import xarray as xr

In [3]: ds1 = open_virtual_dataset('air1.nc', loadable_variables=['time', 'lat', 'lon'], cftime_variables=['time'])

In [4]: ds2 = open_virtual_dataset('air2.nc', loadable_variables=['time', 'lat', 'lon'], cftime_variables=['time'])

In [5]: xr.combine_by_coords([ds2, ds1], coords='minimal', compat='override')
Out[5]: 
<xarray.Dataset> Size: 8MB
Dimensions:  (time: 2920, lat: 25, lon: 53)
Coordinates:
  * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time     (time) float32 12kB 1.867e+06 1.867e+06 ... 1.885e+06 1.885e+06
Data variables:
    air      (time, lat, lon) int16 8MB ManifestArray<shape=(2920, 25, 53), d...
Attributes:
    Conventions:  COARDS
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
    title:        4x daily NMC reanalysis (1948)

the only issue with this result is that somehow the dtype of time has been changed from datetime64[ns] to float32.


This approach doesn't solve the original issue, but it might also be fine in a lot of cases. xr.combine_by_coords can only auto-order along dimensions that have a 1D coordinate, and 1D variables are small, so if they are split across many files its likely that you wanted to include them in loadable_variables anyway.

@TomNicholas TomNicholas mentioned this issue Jul 1, 2024
7 tasks
TomNicholas added a commit that referenced this issue Jul 2, 2024
* add warning when user passes indexes=None

* expect warnings in tests

* new test to show how #18 doesn't fully work yet

* release notes

* lint
@TomNicholas
Copy link
Member Author

The tricky part is that we presumably don't want to write those actual data values back out during the serialization step, we want to only write the chunk manifest for each dimension coordinate.

In hindsight I think this was an unnecessary and overly-complicated requirement. It's often totally fine to just load dimension coordinates into memory and write them into the zarr (now icechunk) store. #357 changes the default to be to create indexes for any dimension coordinate which appears in loadable_variables. Anyone with a more pathological case than that can deal with it another way.

TomNicholas added a commit that referenced this issue Dec 18, 2024
TomNicholas added a commit that referenced this issue Dec 18, 2024
* remove warning

* fix test by loading dimension coordinates

* fix other test by passing loadable_variables

* move loadable variables docs section before combining

* remove recommendation to not create indexes

* de-emphasise avoiding creating indexes

* document using xr.combine_by_coords

* clarify todo

* signpost segue

* remove extra line

* release notes

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* correct some PR numbers in release notes

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refer to #18 in release notes

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
xarray Requires changes to xarray upstream
Projects
None yet
3 participants