
Commit 9e3d3bd

TomNicholas, max-sixty, and shoyer authored
Datatree alignment docs (#9501)
* remove too-long underline
* draft section on data alignment
* fixes
* draft section on coordinate inheritance
* various improvements
* more improvements
* link from other page
* align call include all 3 datasets
* link back to use cases
* clarification
* small improvements
* remove TODO after #9532
* add todo about #9475
* correct xr.align example call
* add links to netCDF4 documentation
* Consistent voice

  Co-authored-by: Maximilian Roos <[email protected]>
* keep indexes in lat lon selection to dodge #9475
* unpack generator properly

  Co-authored-by: Stephan Hoyer <[email protected]>
* ideas for next section
* briefly summarize what alignment means
* clarify that it's the data in each node that was previously unrelated
* fix incorrect indentation of code block
* display the tree with redundant coordinates again
* remove content about non-inherited coords for a follow-up PR
* remove todo
* remove todo now that aggregations are re-implemented
* remove link to (unmerged) migration guide
* remove todo about improving error message
* correct statement in data-structures docs
* fix internal link

---------

Co-authored-by: Maximilian Roos <[email protected]>
Co-authored-by: Stephan Hoyer <[email protected]>
1 parent 93b4859 commit 9e3d3bd

2 files changed

+151
-3
lines changed

doc/user-guide/data-structures.rst

+2-1
@@ -771,7 +771,7 @@ Here there are four different coordinate variables, which apply to variables in
771771
``station`` is used only for ``weather`` variables
772772
``lat`` and ``lon`` are only used for ``satellite`` images
773773

774-
Coordinate variables are inherited to descendent nodes, which means that
774+
Coordinate variables are inherited to descendent nodes, which is only possible because
775775
variables at different levels of a hierarchical DataTree are always
776776
aligned. Placing the ``time`` variable at the root node automatically indicates
777777
that it applies to all descendent nodes. Similarly, ``station`` is in the base
@@ -800,6 +800,7 @@ included by default unless you exclude them with the ``inherit`` flag:
800800
801801
dt2["/weather/temperature"].to_dataset(inherit=False)
802802
803+
For more examples and further discussion see :ref:`alignment and coordinate inheritance <hierarchical-data.alignment-and-coordinate-inheritance>`.
803804

804805
.. _coordinates:
805806

doc/user-guide/hierarchical-data.rst

+149-2
@@ -1,7 +1,7 @@
1-
.. _hierarchical-data:
1+
.. _userguide.hierarchical-data:
22

33
Hierarchical data
4-
==============================
4+
=================
55

66
.. ipython:: python
77
:suppress:
@@ -15,6 +15,8 @@ Hierarchical data
1515
1616
%xmode minimal
1717
18+
.. _why:
19+
1820
Why Hierarchical Data?
1921
----------------------
2022

@@ -644,3 +646,148 @@ We could use this feature to quickly calculate the electrical power in our signa
644646
645647
power = currents * voltages
646648
power
649+
650+
.. _hierarchical-data.alignment-and-coordinate-inheritance:
651+
652+
Alignment and Coordinate Inheritance
653+
------------------------------------
654+
655+
.. _data-alignment:
656+
657+
Data Alignment
658+
~~~~~~~~~~~~~~
659+
660+
The data in different datatree nodes are not totally independent. In particular, dimensions (and indexes) in child nodes must be exactly aligned with those in their parent nodes.
661+
Exact alignment means that shared dimensions must be the same length, and indexes along those dimensions must be equal.
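
As a minimal sketch of what "exact" alignment means, the snippet below compares small synthetic datasets with :py:func:`~xarray.align`. The datasets ``a``, ``b`` and ``c`` and their values are hypothetical and exist only for illustration.

```python
import numpy as np
import xarray as xr

# Hypothetical toy datasets sharing the dimension "x".
a = xr.Dataset({"a_data": ("x", np.arange(3.0))}, coords={"x": [10, 20, 30]})
b = xr.Dataset({"b_data": ("x", np.ones(3))}, coords={"x": [10, 20, 30]})
c = xr.Dataset({"c_data": ("x", np.ones(3))}, coords={"x": [10, 20, 99]})

# Same length and equal index values along "x": exact alignment succeeds.
xr.align(a, b, join="exact")

# Same length but different index values along "x": exact alignment fails.
try:
    xr.align(a, c, join="exact")
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Here ``a`` and ``c`` have the same length along ``x``, so it is the unequal index values alone that cause the failure.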
662+
663+
.. note::
664+
If you were a previous user of the prototype `xarray-contrib/datatree <https://github.com/xarray-contrib/datatree>`_ package, this is different from what you're used to!
665+
In that package, the data stored in each node was completely unrelated to the data in every other node. The data model is now slightly stricter.
666+
This allows us to provide features like :ref:`coordinate-inheritance`.
667+
668+
To demonstrate, let's first generate some example datasets which are not aligned with one another:
669+
670+
.. ipython:: python
671+
672+
# (drop the attributes just to make the printed representation shorter)
673+
ds = xr.tutorial.open_dataset("air_temperature").drop_attrs()
674+
675+
ds_daily = ds.resample(time="D").mean("time")
676+
ds_weekly = ds.resample(time="W").mean("time")
677+
ds_monthly = ds.resample(time="ME").mean("time")
678+
679+
These datasets have different lengths along the ``time`` dimension, and are therefore not aligned along that dimension.
680+
681+
.. ipython:: python
682+
683+
ds_daily.sizes
684+
ds_weekly.sizes
685+
ds_monthly.sizes
686+
687+
We cannot store these non-alignable variables on a single :py:class:`~xarray.Dataset` object, because they do not exactly align:
688+
689+
.. ipython:: python
690+
:okexcept:
691+
692+
xr.align(ds_daily, ds_weekly, ds_monthly, join="exact")
693+
694+
But we :ref:`previously said <why>` that multi-resolution data is a good use case for :py:class:`~xarray.DataTree`, so surely we should be able to store these in a single :py:class:`~xarray.DataTree`?
695+
If we first try to create a :py:class:`~xarray.DataTree` with these different-length time dimensions present in both parents and children, we will still get an alignment error:
696+
697+
.. ipython:: python
698+
:okexcept:
699+
700+
xr.DataTree.from_dict({"daily": ds_daily, "daily/weekly": ds_weekly})
701+
702+
This is because DataTree checks that data in child nodes align exactly with their parents.
703+
704+
.. note::
705+
This requirement of aligned dimensions is similar to netCDF's concept of `inherited dimensions <https://www.unidata.ucar.edu/software/netcdf/workshops/2007/groups-types/Introduction.html>`_, as in netCDF-4 files dimensions are `visible to all child groups <https://docs.unidata.ucar.edu/netcdf-c/current/groups.html>`_.
706+
707+
This alignment check is performed up through the tree, all the way to the root, and is therefore equivalent to requiring that this :py:func:`~xarray.align` command succeeds:
708+
709+
.. code:: python
710+
711+
xr.align(child.dataset, *(parent.dataset for parent in child.parents), join="exact")
712+
713+
To represent our unalignable data in a single :py:class:`~xarray.DataTree`, we must instead place all variables which are a function of these different-length dimensions into nodes that are not direct descendents of one another, e.g. organize them as siblings.
714+
715+
.. ipython:: python
716+
717+
dt = xr.DataTree.from_dict(
718+
{"daily": ds_daily, "weekly": ds_weekly, "monthly": ds_monthly}
719+
)
720+
dt
721+
722+
Now we have a valid :py:class:`~xarray.DataTree` structure which contains all the data at each different time frequency, stored in a separate group.
723+
724+
This is a useful way to organise our data because we can still operate on all the groups at once.
725+
For example we can extract all three timeseries at a specific lat-lon location:
726+
727+
.. ipython:: python
728+
729+
dt.sel(lat=75, lon=300)
730+
731+
or compute the standard deviation of each timeseries to find out how it varies with sampling frequency:
732+
733+
.. ipython:: python
734+
735+
dt.std(dim="time")
736+
737+
.. _coordinate-inheritance:
738+
739+
Coordinate Inheritance
740+
~~~~~~~~~~~~~~~~~~~~~~
741+
742+
Notice that in the trees we constructed above there is some redundancy - the ``lat`` and ``lon`` variables appear in each sibling group, but are identical across the groups.
743+
744+
.. ipython:: python
745+
746+
dt
747+
748+
We can use "Coordinate Inheritance" to define them only once in a parent group and remove this redundancy, whilst still being able to access those coordinate variables from the child groups.
749+
750+
.. note::
751+
This is also a new feature relative to the prototype `xarray-contrib/datatree <https://github.com/xarray-contrib/datatree>`_ package.
752+
753+
Let's instead place only the time-dependent variables in the child groups, and put the non-time-dependent ``lat`` and ``lon`` variables in the parent (root) group:
754+
755+
.. ipython:: python
756+
757+
dt = xr.DataTree.from_dict(
758+
{
759+
"/": ds.drop_dims("time"),
760+
"daily": ds_daily.drop_vars(["lat", "lon"]),
761+
"weekly": ds_weekly.drop_vars(["lat", "lon"]),
762+
"monthly": ds_monthly.drop_vars(["lat", "lon"]),
763+
}
764+
)
765+
dt
766+
767+
This is preferred to the previous representation because it now makes it clear that all of these datasets share common spatial grid coordinates.
768+
Defining the common coordinates just once also ensures that the spatial coordinates for each group cannot become out of sync with one another during operations.
769+
770+
We can still access the coordinates defined in the parent groups from any of the child groups as if they were actually present on the child groups:
771+
772+
.. ipython:: python
773+
774+
dt.daily.coords
775+
dt["daily/lat"]
776+
777+
As we can still access them, we say that the ``lat`` and ``lon`` coordinates in the child groups have been "inherited" from their common parent group.
778+
779+
If we print just one of the child nodes, it will still display inherited coordinates, but explicitly mark them as such:
780+
781+
.. ipython:: python
782+
783+
print(dt["/daily"])
784+
785+
This helps to differentiate which variables are defined on the datatree node that you are currently looking at, and which were defined somewhere above it.
786+
787+
We can also still perform all the same operations on the whole tree:
788+
789+
.. ipython:: python
790+
791+
dt.sel(lat=[75], lon=[300])
792+
793+
dt.std(dim="time")
