diff --git a/doc/conf.py b/doc/conf.py
index 6c6efb47f6b..d2f6cdf3aa1 100644
--- a/doc/conf.py
+++ b/doc/conf.py
@@ -324,6 +324,7 @@
"cftime": ("https://unidata.github.io/cftime", None),
"sparse": ("https://sparse.pydata.org/en/latest/", None),
"cubed": ("https://tom-e-white.com/cubed/", None),
+ "datatree": ("https://xarray-datatree.readthedocs.io/en/latest/", None),
}
diff --git a/doc/internals/duck-arrays-integration.rst b/doc/internals/duck-arrays-integration.rst
index 1f1f57974df..a674acb04fe 100644
--- a/doc/internals/duck-arrays-integration.rst
+++ b/doc/internals/duck-arrays-integration.rst
@@ -35,7 +35,7 @@ Python Array API standard support
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As an integration library xarray benefits greatly from the standardization of duck-array libraries' APIs, and so is a
-big supporter of the `Python Array API Standard `_. .
+big supporter of the `Python Array API Standard `_.
We aim to support any array libraries that follow the Array API standard out-of-the-box. However, xarray does occasionally
call some numpy functions which are not (yet) part of the standard (e.g. :py:meth:`xarray.DataArray.pad` calls :py:func:`numpy.pad`).
diff --git a/doc/internals/extending-xarray.rst b/doc/internals/extending-xarray.rst
index a180b85044f..cb1b23e78eb 100644
--- a/doc/internals/extending-xarray.rst
+++ b/doc/internals/extending-xarray.rst
@@ -14,6 +14,11 @@ Xarray is designed as a general purpose library and hence tries to avoid
including overly domain specific functionality. But inevitably, the need for more
domain specific logic arises.
+.. _internals.accessors.composition:
+
+Composition over Inheritance
+----------------------------
+
One potential solution to this problem is to subclass Dataset and/or DataArray to
add domain specific functionality. However, inheritance is not very robust. It's
easy to inadvertently use internal APIs when subclassing, which means that your
@@ -23,11 +28,17 @@ only return native xarray objects.
The standard advice is to use :issue:`composition over inheritance <706>`, but
reimplementing an API as large as xarray's on your own objects can be an onerous
task, even if most methods are only forwarding to xarray implementations.
+(For an example of a project which took this approach of subclassing see `UXarray `_).
If you simply want the ability to call a function with the syntax of a
method call, then the builtin :py:meth:`~xarray.DataArray.pipe` method (copied
from pandas) may suffice.
+.. _internals.accessors.writing accessors:
+
+Writing Custom Accessors
+------------------------
+
To resolve this issue for more complex cases, xarray has the
:py:func:`~xarray.register_dataset_accessor` and
:py:func:`~xarray.register_dataarray_accessor` decorators for adding custom
diff --git a/doc/internals/index.rst b/doc/internals/index.rst
index 7e13f0cfe95..46972ff69bd 100644
--- a/doc/internals/index.rst
+++ b/doc/internals/index.rst
@@ -1,6 +1,6 @@
.. _internals:
-xarray Internals
+Xarray Internals
================
Xarray builds upon two of the foundational libraries of the scientific Python
@@ -11,15 +11,14 @@ compiled code to :ref:`optional dependencies`.
The pages in this section are intended for:
* Contributors to xarray who wish to better understand some of the internals,
-* Developers who wish to extend xarray with domain-specific logic, perhaps to support a new scientific community of users,
-* Developers who wish to interface xarray with their existing tooling, e.g. by creating a plugin for reading a new file format, or wrapping a custom array type.
-
+* Developers from other fields who wish to extend xarray with domain-specific logic, perhaps to support a new scientific community of users,
+* Developers of other packages who wish to interface xarray with their existing tools, e.g. by creating a backend for reading a new file format, or wrapping a custom array type.
.. toctree::
:maxdepth: 2
:hidden:
- variable-objects
+ internal-design
duck-arrays-integration
chunked-arrays
extending-xarray
diff --git a/doc/internals/internal-design.rst b/doc/internals/internal-design.rst
new file mode 100644
index 00000000000..11b4ee39da9
--- /dev/null
+++ b/doc/internals/internal-design.rst
@@ -0,0 +1,224 @@
+.. ipython:: python
+ :suppress:
+
+ import numpy as np
+ import pandas as pd
+ import xarray as xr
+
+ np.random.seed(123456)
+ np.set_printoptions(threshold=20)
+
+.. _internal design:
+
+Internal Design
+===============
+
+This page gives an overview of the internal design of xarray.
+
+In totality, the Xarray project defines 4 key data structures.
+In order of increasing complexity, they are:
+
+- :py:class:`xarray.Variable`,
+- :py:class:`xarray.DataArray`,
+- :py:class:`xarray.Dataset`,
+- :py:class:`datatree.DataTree`.
+
+The user guide lists only :py:class:`xarray.DataArray` and :py:class:`xarray.Dataset`,
+but :py:class:`~xarray.Variable` is the fundamental object internally,
+and :py:class:`~datatree.DataTree` is a natural generalisation of :py:class:`xarray.Dataset`.
+
+.. note::
+
+ Our :ref:`roadmap` includes plans both to document :py:class:`~xarray.Variable` as fully public API,
+ and to merge the `xarray-datatree `_ package into xarray's main repository.
+
+Internally private :ref:`lazy indexing classes ` are used to avoid loading more data than necessary,
+and flexible indexes classes (derived from :py:class:`~xarray.indexes.Index`) provide performant label-based lookups.
+
+
+.. _internal design.data structures:
+
+Data Structures
+---------------
+
+The :ref:`data structures` page in the user guide explains the basics and concentrates on user-facing behavior,
+whereas this section explains how xarray's data structure classes actually work internally.
+
+
+.. _internal design.data structures.variable:
+
+Variable Objects
+~~~~~~~~~~~~~~~~
+
+The core internal data structure in xarray is the :py:class:`~xarray.Variable`,
+which is used as the basic building block behind xarray's
+:py:class:`~xarray.Dataset`, :py:class:`~xarray.DataArray` types. A
+:py:class:`~xarray.Variable` consists of:
+
+- ``dims``: A tuple of dimension names.
+- ``data``: The N-dimensional array (typically a NumPy or Dask array) storing
+ the Variable's data. It must have the same number of dimensions as the length
+ of ``dims``.
+- ``attrs``: An ordered dictionary of metadata associated with this array. By
+ convention, xarray's built-in operations never use this metadata.
+- ``encoding``: Another ordered dictionary used to store information about how
+ these variable's data is represented on disk. See :ref:`io.encoding` for more
+ details.
+
+:py:class:`~xarray.Variable` has an interface similar to NumPy arrays, but extended to make use
+of named dimensions. For example, it uses ``dim`` in preference to an ``axis``
+argument for methods like ``mean``, and supports :ref:`compute.broadcasting`.
+
+However, unlike ``Dataset`` and ``DataArray``, the basic ``Variable`` does not
+include coordinate labels along each axis.
+
+:py:class:`~xarray.Variable` is public API, but because of its incomplete support for labeled
+data, it is mostly intended for advanced uses, such as in xarray itself, for
+writing new backends, or when creating custom indexes.
+You can access the variable objects that correspond to xarray objects via the (readonly)
+:py:attr:`Dataset.variables ` and
+:py:attr:`DataArray.variable ` attributes.
+
+
+.. _internal design.dataarray:
+
+DataArray Objects
+~~~~~~~~~~~~~~~~~
+
+The simplest data structure used by most users is :py:class:`~xarray.DataArray`.
+A :py:class:`~xarray.DataArray` is a composite object consisting of multiple
+:py:class:`~xarray.core.variable.Variable` objects which store related data.
+
+A single :py:class:`~xarray.core.Variable` is referred to as the "data variable", and stored under the :py:attr:`~xarray.DataArray.variable`` attribute.
+A :py:class:`~xarray.DataArray` inherits all of the properties of this data variable, i.e. ``dims``, ``data``, ``attrs`` and ``encoding``,
+all of which are implemented by forwarding on to the underlying ``Variable`` object.
+
+In addition, a :py:class:`~xarray.DataArray` stores additional ``Variable`` objects stored in a dict under the private ``_coords`` attribute,
+each of which is referred to as a "Coordinate Variable". These coordinate variable objects are only allowed to have ``dims`` that are a subset of the data variable's ``dims``,
+and each dim has a specific length. This means that the full :py:attr:`~xarray.DataArray.sizes` of the dataarray can be represented by a dictionary mapping dimension names to integer sizes.
+The underlying data variable has this exact same size, and the attached coordinate variables have sizes which are some subset of the size of the data variable.
+Another way of saying this is that all coordinate variables must be "alignable" with the data variable.
+
+When a coordinate is accessed by the user (e.g. via the dict-like :py:class:`~xarray.DataArray.__getitem__` syntax),
+then a new ``DataArray`` is constructed by finding all coordinate variables that have compatible dimensions and re-attaching them before the result is returned.
+This is why most users never see the ``Variable`` class underlying each coordinate variable - it is always promoted to a ``DataArray`` before returning.
+
+Lookups are performed by special :py:class:`~xarray.indexes.Index` objects, which are stored in a dict under the private ``_indexes`` attribute.
+Indexes must be associated with one or more coordinates, and essentially act by translating a query given in physical coordinate space
+(typically via the :py:meth:`~xarray.DataArray.sel` method) into a set of integer indices in array index space that can be used to index the underlying n-dimensional array-like ``data``.
+Indexing in array index space (typically performed via the :py:meth:`~xarray.DataArray.isel` method) does not require consulting an ``Index`` object.
+
+Finally a :py:class:`~xarray.DataArray` defines a :py:attr:`~xarray.DataArray.name` attribute, which refers to its data
+variable but is stored on the wrapping ``DataArray`` class.
+The ``name`` attribute is primarily used when one or more :py:class:`~xarray.DataArray` objects are promoted into a :py:class:`~xarray.Dataset`
+(e.g. via :py:meth:`~xarray.DataArray.to_dataset`).
+Note that the underlying :py:class:`~xarray.core.Variable` objects are all unnamed, so they can always be referred to uniquely via a
+dict-like mapping.
+
+.. _internal design.dataset:
+
+Dataset Objects
+~~~~~~~~~~~~~~~
+
+The :py:class:`~xarray.Dataset` class is a generalization of the :py:class:`~xarray.DataArray` class that can hold multiple data variables.
+Internally all data variables and coordinate variables are stored under a single ``variables`` dict, and coordinates are
+specified by storing their names in a private ``_coord_names`` dict.
+
+The dataset's dimensions are the set of all dims present across any variable, but (similar to in dataarrays) coordinate
+variables cannot have a dimension that is not present on any data variable.
+
+When a data variable or coordinate variable is accessed, a new ``DataArray`` is again constructed from all compatible
+coordinates before returning.
+
+.. _internal design.subclassing:
+
+.. note::
+
+ The way that selecting a variable from a ``DataArray`` or ``Dataset`` actually involves internally wrapping the
+ ``Variable`` object back up into a ``DataArray``/``Dataset`` is the primary reason :ref:`we recommend against subclassing `
+ Xarray objects. The main problem it creates is that we currently cannot easily guarantee that for example selecting
+ a coordinate variable from your ``SubclassedDataArray`` would return an instance of ``SubclassedDataArray`` instead
+ of just an :py:class:`xarray.DataArray`. See `GH issue `_ for more details.
+
+.. _internal design.lazy indexing:
+
+Lazy Indexing Classes
+---------------------
+
+Lazy Loading
+~~~~~~~~~~~~
+
+If we open a ``Variable`` object from disk using :py:func:`~xarray.open_dataset` we can see that the actual values of
+the array wrapped by the data variable are not displayed.
+
+.. ipython:: python
+
+ da = xr.tutorial.open_dataset("air_temperature")["air"]
+ var = da.variable
+ var
+
+We can see the size, and the dtype of the underlying array, but not the actual values.
+This is because the values have not yet been loaded.
+
+If we look at the private attribute :py:meth:`~xarray.Variable._data` containing the underlying array object, we see
+something interesting:
+
+.. ipython:: python
+
+ var._data
+
+You're looking at one of xarray's internal `Lazy Indexing Classes`. These powerful classes are hidden from the user,
+but provide important functionality.
+
+Calling the public :py:attr:`~xarray.Variable.data` property loads the underlying array into memory.
+
+.. ipython:: python
+
+ var.data
+
+This array is now cached, which we can see by accessing the private attribute again:
+
+.. ipython:: python
+
+ var._data
+
+Lazy Indexing
+~~~~~~~~~~~~~
+
+The purpose of these lazy indexing classes is to prevent more data being loaded into memory than is necessary for the
+subsequent analysis, by deferring loading data until after indexing is performed.
+
+Let's open the data from disk again.
+
+.. ipython:: python
+
+ da = xr.tutorial.open_dataset("air_temperature")["air"]
+ var = da.variable
+
+Now, notice how even after subsetting the data has does not get loaded:
+
+.. ipython:: python
+
+ var.isel(time=0)
+
+The shape has changed, but the values are still not shown.
+
+Looking at the private attribute again shows how this indexing information was propagated via the hidden lazy indexing classes:
+
+.. ipython:: python
+
+ var.isel(time=0)._data
+
+.. note::
+
+ Currently only certain indexing operations are lazy, not all array operations. For discussion of making all array
+ operations lazy see `GH issue #5081 `_.
+
+
+Lazy Dask Arrays
+~~~~~~~~~~~~~~~~
+
+Note that xarray's implementation of Lazy Indexing classes is completely separate from how :py:class:`dask.array.Array`
+objects evaluate lazily. Dask-backed xarray objects delay almost all operations until :py:meth:`~xarray.DataArray.compute`
+is called (either explicitly or implicitly via :py:meth:`~xarray.DataArray.plot` for example). The exceptions to this
+laziness are operations whose output shape is data-dependent, such as when calling :py:meth:`~xarray.DataArray.where`.
diff --git a/doc/internals/variable-objects.rst b/doc/internals/variable-objects.rst
deleted file mode 100644
index 6ae3c2f7e6d..00000000000
--- a/doc/internals/variable-objects.rst
+++ /dev/null
@@ -1,31 +0,0 @@
-Variable objects
-================
-
-The core internal data structure in xarray is the :py:class:`~xarray.Variable`,
-which is used as the basic building block behind xarray's
-:py:class:`~xarray.Dataset` and :py:class:`~xarray.DataArray` types. A
-``Variable`` consists of:
-
-- ``dims``: A tuple of dimension names.
-- ``data``: The N-dimensional array (typically, a NumPy or Dask array) storing
- the Variable's data. It must have the same number of dimensions as the length
- of ``dims``.
-- ``attrs``: An ordered dictionary of metadata associated with this array. By
- convention, xarray's built-in operations never use this metadata.
-- ``encoding``: Another ordered dictionary used to store information about how
- these variable's data is represented on disk. See :ref:`io.encoding` for more
- details.
-
-``Variable`` has an interface similar to NumPy arrays, but extended to make use
-of named dimensions. For example, it uses ``dim`` in preference to an ``axis``
-argument for methods like ``mean``, and supports :ref:`compute.broadcasting`.
-
-However, unlike ``Dataset`` and ``DataArray``, the basic ``Variable`` does not
-include coordinate labels along each axis.
-
-``Variable`` is public API, but because of its incomplete support for labeled
-data, it is mostly intended for advanced uses, such as in xarray itself or for
-writing new backends. You can access the variable objects that correspond to
-xarray objects via the (readonly) :py:attr:`Dataset.variables
-` and
-:py:attr:`DataArray.variable ` attributes.
diff --git a/doc/whats-new.rst b/doc/whats-new.rst
index 157795f08d1..b83697a3b20 100644
--- a/doc/whats-new.rst
+++ b/doc/whats-new.rst
@@ -145,6 +145,8 @@ Breaking changes
Documentation
~~~~~~~~~~~~~
+- Added page on the internal design of xarray objects.
+ (:pull:`7991`) By `Tom Nicholas `_.
- Added examples to docstrings of :py:meth:`Dataset.assign_attrs`, :py:meth:`Dataset.broadcast_equals`,
:py:meth:`Dataset.equals`, :py:meth:`Dataset.identical`, :py:meth:`Dataset.expand_dims`,:py:meth:`Dataset.drop_vars`
(:issue:`6793`, :pull:`7937`) By `Harshitha `_.