Dask merge back #2597
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,35 @@ | ||
| Iris Dask Interface | ||
| ******************* | ||
|
|
||
| Iris uses `dask <http://dask.pydata.org>`_ to manage lazy data interfaces and processing graphs. | ||
| The key principles that define this interface are: | ||
|
|
||
| * A call to :attr:`cube.data` will always load all of the data. | ||
|
|
||
| * Once this has happened: | ||
|
|
||
| * :attr:`cube.data` is a mutable NumPy masked array or ``ndarray``, and | ||
| * ``cube._numpy_array`` is a private NumPy masked array, accessible via :attr:`cube.data`, which may strip off the mask and return a reference to the bare ``ndarray``. | ||
|
|
||
| * You can use :attr:`cube.data` to set the data. This accepts: | ||
|
|
||
| * a NumPy array (including masked array), which is assigned to ``cube._numpy_array``, or | ||
| * a dask array, which is assigned to ``cube._dask_array``, while ``cube._numpy_array`` is set to ``None``. | ||
|
|
||
| * ``cube._dask_array`` may be ``None``; otherwise it is expected to be a dask array: | ||
|
|
||
| * this may wrap a proxy to a file collection, or | ||
| * this may wrap the NumPy array in ``cube._numpy_array``. | ||
|
|
||
| * All dask arrays wrap array-like objects where missing data are represented by ``nan`` values: | ||
|
|
||
| * Masked arrays derived from these dask arrays create their mask using the locations of ``nan`` values. | ||
| * Where dask-wrapped arrays of ``int`` require masks, these arrays will first be cast to ``float``. | ||
|
What kind of float container? The smallest one possible for the range of values and dtype defined?

Thanks @bjlittle

This doesn't seem robust to me. Perhaps I'm missing something about the implementation, but I don't think you can represent the full int64 range of values as float64 ones in one container:

    >>> arr = np.array([np.iinfo('int64').min, np.iinfo('int64').max])
    array([-9223372036854775808, 9223372036854775807])
    >>> arr.astype('float64').astype('int64')
    array([-9223372036854775808, -9223372036854775808])

I have not looked at the implementation. Perhaps this is done element-wise so isn't a problem?

Member
Author
@cpelley Interesting observation. Do you have an actual data use case for this?

My query is not driven by a use case. I'm not sure I have seen a 64-bit integer field which spans a large enough range that it cannot be represented by a 64-bit float field. However, this is my point, I don't know :)

Thanks @bjlittle

Member
You can cast to float64 without overflow, but not back due to rounding:

    >>> arr = np.array([np.iinfo('int64').min, np.iinfo('int64').max])
    array([-9223372036854775808, 9223372036854775807])
    >>> arr.astype('float64')
    array([ -9.22337204e+18, 9.22337204e+18])

Casting to float is always a compromise though: you can't have a 1-1 mapping of all integers->floats with the same bit size.

To extend the illustration:

    >>> np.set_printoptions(precision=18)
    >>> np.array([np.iinfo('int64').min, np.iinfo('int64').max], dtype='int64')
    array([-9223372036854775808, 9223372036854775807])
    >>> np.array([np.iinfo('int64').min, np.iinfo('int64').max], dtype='float64')
    array([ -9.223372036854775808e+18, 9.223372036854775808e+18])

Note, this problem is not restricted to the very extreme of the limits. |
||
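The NaN convention and the integer cast discussed above can be illustrated with plain NumPy. This is a minimal sketch, not the Iris implementation: using `np.ma.masked_invalid` to rebuild the mask is an assumption (it also masks infinities), and the array values are made up for illustration.

```python
import numpy as np

# Missing points are encoded as NaN; a mask can be rebuilt from their
# locations. (Assumption: masked_invalid stands in for whatever Iris does.)
raw = np.array([1.5, np.nan, 3.0])
masked = np.ma.masked_invalid(raw)
print(masked.mask.tolist())  # [False, True, False]

# Integer data cannot hold NaN, so it has to be cast to float first.
ints = np.array([1, 2, 3], dtype='int64')
as_float = ints.astype('float64')
as_float[1] = np.nan

# float64 has a 53-bit significand, so integers above 2**53 may not
# survive the round trip, well short of the int64 limits.
big = np.array([2**53, 2**53 + 1], dtype='int64')
round_trip = big.astype('float64').astype('int64')
print(round_trip.tolist())  # [9007199254740992, 9007199254740992]
```

The last two lines show the point made in the thread: precision is lost for any integer that needs more than 53 significant bits, not only at the extremes of the int64 range.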
|
|
||
| * In order to support this mask conversion, cubes have a ``fill_value`` defined as part of their metadata, which may be ``None``. | ||
|
|
||
| * Array copying is kept to an absolute minimum: | ||
|
|
||
| * array references should always be passed rather than new arrays created, unless a copy is explicitly requested. | ||
|
|
||
| * To test for the presence of a dask array of any sort, we use :func:`iris._lazy_data.is_lazy_data`. This is implemented as ``hasattr(data, 'compute')``. | ||
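The principles listed in this file can be sketched as a toy class. This is an illustration under stated assumptions, not the Iris code: `ToyCube` and `FakeLazyArray` are hypothetical names, and clearing `_dask_array` when real data is assigned is an assumption the text does not confirm. The only detail taken directly from the text is that laziness is detected with `hasattr(data, 'compute')`.

```python
import numpy as np


def is_lazy_data(data):
    """Return True if ``data`` looks like a dask array (it has ``compute``)."""
    return hasattr(data, 'compute')


class FakeLazyArray:
    """Hypothetical stand-in for a dask array: holds a loader, not the data."""

    def __init__(self, loader):
        self._loader = loader

    def compute(self):
        # A real dask array would execute its task graph here.
        return self._loader()


class ToyCube:
    """Hypothetical holder mirroring ``_numpy_array`` and ``_dask_array``."""

    def __init__(self, data):
        self._numpy_array = None
        self._dask_array = None
        self.data = data

    def has_lazy_data(self):
        return self._numpy_array is None

    @property
    def data(self):
        # Touching ``data`` always loads all of the data.
        if self._numpy_array is None:
            self._numpy_array = self._dask_array.compute()
        return self._numpy_array

    @data.setter
    def data(self, value):
        if is_lazy_data(value):
            self._dask_array = value
            self._numpy_array = None
        else:
            # asanyarray keeps masked arrays as masked arrays.
            self._numpy_array = np.asanyarray(value)
            self._dask_array = None  # assumption, not stated in the text


cube = ToyCube(FakeLazyArray(lambda: np.arange(3)))
print(cube.has_lazy_data())  # True
print(cube.data.tolist())    # [0, 1, 2]
print(cube.has_lazy_data())  # False
```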
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -38,3 +38,4 @@ | |
| tests.rst | ||
| deprecations.rst | ||
| release.rst | ||
| dask_interface.rst | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -176,8 +176,8 @@ For example, to mask values that lie beyond the range of the original data: | |
| >>> scheme = iris.analysis.Linear(extrapolation_mode='mask') | ||
| >>> new_column = column.interpolate(sample_points, scheme) | ||
| >>> print(new_column.coord('altitude').points) | ||
| [ nan 494.44451904 588.88891602 683.33325195 777.77783203 | ||
| 872.222229 966.66674805 1061.11108398 1155.55541992 nan] | ||
| [-- 494.44451904296875 588.888916015625 683.333251953125 777.77783203125 | ||
|
I think this shows the point I was making above.

Member
does it?

About differentiating between nan and masked values, I think so.

Member
You are right about the point you are making, but I don't believe that this is the right time to be making it. This PR is to merge a feature branch into Iris which has been under construction for 4 months, and every decision has been discussed in great detail already. This method may not be ideal, but with dask having no support for masked values it is the best option we have. We have by no means kept development of the feature branch a secret, and there has been plenty of time and space for discussion of major implementation decisions, which is not in this PR. This is just to review the last 10 commits, as @bjlittle pointed out in his first comment, so even though you are right, there is really nothing we can do about it now. |
||
| 872.2222290039062 966.666748046875 1061.111083984375 1155.555419921875 --] | ||
|
|
||
|
|
||
| .. _caching_an_interpolator: | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,230 @@ | ||
| .. _real_and_lazy_data: | ||
|
|
||
|
|
||
| .. testsetup:: * | ||
|
|
||
| import dask.array as da | ||
| import iris | ||
| import numpy as np | ||
|
|
||
|
|
||
| ================== | ||
| Real and Lazy Data | ||
| ================== | ||
|
|
||
| We have seen in the :doc:`user_guide_introduction` section of the user guide that | ||
| Iris cubes contain data and metadata about a phenomenon. The data element of a cube | ||
| is always an array, but the array may be either "real" or "lazy". | ||
|
|
||
| In this section of the user guide we will look specifically at the concepts of | ||
| real and lazy data as they apply to the cube and other data structures in Iris. | ||
|
|
||
|
|
||
| What is real and lazy data? | ||
| --------------------------- | ||
|
|
||
| In Iris, we use the term **real data** to describe data arrays that are loaded | ||
| into memory. Real data is typically provided as a | ||
| `NumPy array <https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html>`_, | ||
| which has a shape and data type that are used to describe the array's data points. | ||
| Each data point takes up a small amount of memory, which means large NumPy arrays can | ||
| take up a large amount of memory. | ||
|
|
||
| Conversely, we use the term **lazy data** to describe data that is not loaded into memory. | ||
| (This is sometimes also referred to as **deferred data**.) | ||
| In Iris, lazy data is provided as a | ||
| `dask array <http://dask.pydata.org/en/latest/array-overview.html>`_. | ||
| A dask array also has a shape and data type | ||
| but typically the dask array's data points are not loaded into memory. | ||
| Instead the data points are stored on disk and only loaded into memory in | ||
| small chunks when absolutely necessary (see the section :ref:`when_real_data` | ||
| for examples of when this might happen). | ||
|
|
||
| The primary advantage of using lazy data is that it enables | ||
| `out-of-core processing <https://en.wikipedia.org/wiki/Out-of-core_algorithm>`_; | ||
| that is, the loading and manipulating of datasets that otherwise would not fit into memory. | ||
|
|
||
| You can check whether a cube has real data or lazy data by using the method | ||
| :meth:`~iris.cube.Cube.has_lazy_data`. For example:: | ||
|
|
||
| >>> cube = iris.load_cube(iris.sample_data_path('air_temp.pp')) | ||
| >>> cube.has_lazy_data() | ||
| True | ||
| # Realise the lazy data. | ||
| >>> cube.data | ||
| >>> cube.has_lazy_data() | ||
| False | ||
|
|
||
|
|
||
| .. _when_real_data: | ||
|
|
||
| When does my data become real? | ||
| ------------------------------ | ||
|
|
||
| When you load a dataset using Iris the data array will almost always initially be | ||
| a lazy array. This section details some operations that will realise lazy data | ||
| as well as some operations that will maintain lazy data. We use the term **realise** | ||
| to mean converting lazy data into real data. | ||
|
|
||
| Most operations on data arrays can be run equivalently on both real and lazy data. | ||
| If the data array is real then the operation will be run on the data array | ||
| immediately. The results of the operation will be available as soon as processing is completed. | ||
| If the data array is lazy then the operation will be deferred and the data array will | ||
| remain lazy until you request the result (such as when you call ``cube.data``):: | ||
|
|
||
| >>> cube = iris.load_cube(iris.sample_data_path('air_temp.pp')) | ||
| >>> cube.has_lazy_data() | ||
| True | ||
| >>> cube += 5 | ||
| >>> cube.has_lazy_data() | ||
| True | ||
|
|
||
| The process by which the operation is deferred until the result is requested is | ||
| referred to as **lazy evaluation**. | ||
|
|
||
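Lazy evaluation can be seen directly with a dask array, independent of Iris. The following is a minimal sketch assuming dask and NumPy are installed; the array contents and chunk size are arbitrary choices for illustration.

```python
import dask.array as da
import numpy as np

# Wrap a small real array in a dask array split into two chunks.
lazy = da.from_array(np.arange(6.0), chunks=3)

# This only extends the task graph; no arithmetic has happened yet.
deferred = lazy + 5
print(hasattr(deferred, 'compute'))  # True: still a lazy dask array

# Requesting the result triggers the computation and returns a NumPy array.
result = deferred.compute()
print(result.tolist())  # [5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
```

In Iris the equivalent trigger is touching ``cube.data``, as described in the text.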
| Certain operations, including regridding and plotting, can only be run on real data. | ||
| Calling such operations on lazy data will automatically realise your lazy data. | ||
|
|
||
| You can also realise (and so load into memory) your cube's lazy data if you 'touch' the data. | ||
| To 'touch' the data means directly accessing the data by calling ``cube.data``, | ||
| as in the previous example. | ||
|
|
||
| Core data | ||
| ^^^^^^^^^ | ||
|
|
||
| Cubes have the concept of "core data". This returns the cube's data in its | ||
| current state: | ||
|
|
||
| * If a cube has lazy data, calling the cube's :meth:`~iris.cube.Cube.core_data` method | ||
| will return the cube's lazy dask array. Calling the cube's | ||
| :meth:`~iris.cube.Cube.core_data` method **will never realise** the cube's data. | ||
| * If a cube has real data, calling the cube's :meth:`~iris.cube.Cube.core_data` method | ||
| will return the cube's real NumPy array. | ||
|
While in such a related space, is it worth mentioning here

Member
Yes. It may be worth including a mention of coord.lazy_data somewhere, which I will discuss with the dev team here, but this section is specifically about coord.core_data, which refers to the data's current state. This is therefore not the space to add an example of coord.lazy_data, which (as you say) will load a dask array regardless of the data's current state.

Sorry, related area being under the parent level 'When does my data become real?'. |
||
|
|
||
| For example:: | ||
|
|
||
| >>> cube = iris.load_cube(iris.sample_data_path('air_temp.pp')) | ||
| >>> cube.has_lazy_data() | ||
| True | ||
|
|
||
| >>> the_data = cube.core_data() | ||
| >>> type(the_data) | ||
| <class 'dask.array.core.Array'> | ||
| >>> cube.has_lazy_data() | ||
| True | ||
|
|
||
| # Realise the lazy data. | ||
| >>> cube.data | ||
| >>> the_data = cube.core_data() | ||
| >>> type(the_data) | ||
| <type 'numpy.ndarray'> | ||
| >>> cube.has_lazy_data() | ||
| False | ||
|
|
||
|
|
||
| Coordinates | ||
| ----------- | ||
|
|
||
| In the same way that Iris cubes contain a data array, Iris coordinates contain a | ||
| points array and an optional bounds array. | ||
| Coordinate points and bounds arrays can also be real or lazy: | ||
|
|
||
| * A :class:`~iris.coords.DimCoord` will only ever have **real** points and bounds | ||
| arrays because of monotonicity checks that realise lazy arrays. | ||
| * An :class:`~iris.coords.AuxCoord` can have **real or lazy** points and bounds. | ||
| * An :class:`~iris.aux_factory.AuxCoordFactory` (or derived coordinate) | ||
| can have **real or lazy** points and bounds. If all of the | ||
| :class:`~iris.coords.AuxCoord` instances used to construct the derived coordinate | ||
| have real points and bounds then the derived coordinate will have real points | ||
| and bounds, otherwise the derived coordinate will have lazy points and bounds. | ||
|
|
||
| Iris cubes and coordinates have very similar interfaces, which extends to accessing | ||
| coordinates' lazy points and bounds: | ||
|
|
||
| .. doctest:: | ||
|
|
||
| >>> cube = iris.load_cube(iris.sample_data_path('hybrid_height.nc')) | ||
|
|
||
| >>> dim_coord = cube.coord('model_level_number') | ||
| >>> print(dim_coord.has_lazy_points()) | ||
| False | ||
| >>> print(dim_coord.has_bounds()) | ||
| False | ||
| >>> print(dim_coord.has_lazy_bounds()) | ||
| False | ||
|
|
||
| >>> aux_coord = cube.coord('sigma') | ||
| >>> print(aux_coord.has_lazy_points()) | ||
| True | ||
| >>> print(aux_coord.has_bounds()) | ||
| True | ||
| >>> print(aux_coord.has_lazy_bounds()) | ||
|
no

Member
What about it? What are you expecting to see here?

"Iris cubes and coordinates have very similar interfaces, which extends to accessing I expect to see

Member
Yes. You are right, we should discuss those points in this section somewhere. Not in this PR though, as this is the mergeback of the feature branch for a pre-release candidate. But what I will do is add a link to this comment and the one above in the project ticket about final documentation so that we can include your suggestions in later revisions of the docs.

Member
Author
@cpelley it does say

Thanks both, happy with that :) |
||
| True | ||
|
|
||
| # Realise the lazy points. This will **not** realise the lazy bounds. | ||
| >>> points = aux_coord.points | ||
| >>> print(aux_coord.has_lazy_points()) | ||
| False | ||
| >>> print(aux_coord.has_lazy_bounds()) | ||
| True | ||
|
|
||
| >>> derived_coord = cube.coord('altitude') | ||
| >>> print(derived_coord.has_lazy_points()) | ||
| True | ||
| >>> print(derived_coord.has_bounds()) | ||
| True | ||
| >>> print(derived_coord.has_lazy_bounds()) | ||
| True | ||
|
|
||
| .. note:: | ||
| Printing a lazy :class:`~iris.coords.AuxCoord` will realise its points and bounds arrays! | ||
|
|
||
|
|
||
| Dask processing options | ||
| ----------------------- | ||
|
|
||
| As stated earlier in this user guide section, Iris uses dask to provide | ||
| lazy data arrays for both Iris cubes and coordinates. Iris also uses dask | ||
| functionality for processing deferred operations on lazy arrays. | ||
|
|
||
| Dask provides processing options to control how deferred operations on lazy arrays | ||
| are computed. This is provided via the ``dask.set_options`` interface. | ||
| We can make use of this functionality in Iris. This means we can | ||
| control how dask arrays in Iris are processed, for example giving us power to | ||
| run Iris processing in parallel. | ||
|
|
||
| Iris by default applies a single dask processing option. This specifies that | ||
| all dask processing in Iris should be run in serial (that is, without any | ||
| parallel processing enabled). | ||
|
|
||
| The dask processing option applied by Iris can be overridden by manually setting | ||
| dask processing options for either or both of: | ||
|
|
||
| * the number of parallel workers to use, | ||
| * the scheduler to use. | ||
|
|
||
| This must be done **before** importing Iris. For example, to specify that dask | ||
| processing within Iris should use four workers in a thread pool:: | ||
|
|
||
| >>> from multiprocessing.pool import ThreadPool | ||
| >>> import dask | ||
| >>> dask.set_options(get=dask.threaded.get, pool=ThreadPool(4)) | ||
|
|
||
| >>> import iris | ||
| >>> # Iris processing here... | ||
|
|
||
| .. note:: | ||
| These dask processing options will last for the lifetime of the Python session | ||
| and must be re-applied in other or subsequent sessions. | ||
|
|
||
| Other dask processing options are also available. See the | ||
| `dask documentation <http://dask.pydata.org/en/latest/scheduler-overview.html>`_ | ||
| for more information on setting dask processing options. | ||
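``dask.set_options`` was the interface available when this text was written; later dask releases replaced it with ``dask.config.set``. The sketch below shows a comparable setting with the newer call, assuming a recent dask release. It is not part of the documentation under review, and the array is only a placeholder workload.

```python
import dask
import dask.array as da

# Placeholder workload: a chunked array of ones.
x = da.ones((1000, 1000), chunks=(250, 250))

# Use the threaded scheduler with four workers inside this block only.
with dask.config.set(scheduler='threads', num_workers=4):
    total = x.sum().compute()

print(float(total))  # 1000000.0
```

Used as a context manager, the setting is scoped to the block; called as a plain function, it persists for the rest of the session.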
|
|
||
|
|
||
| Further reading | ||
| --------------- | ||
|
|
||
| This section of the Iris user guide provides a quick overview of real and lazy | ||
| data within Iris. For more details on these and related concepts, | ||
| see the whitepaper on lazy data. | ||
Have we lost the means to differentiate between 'nan' values and 'masked' values in that case?
Looks like this is the case:
iris master:
This branch:
Differentiating between masked values and nan values can be important. An example: regridding a field with masked data to a target with a different coordinate system, where extrapolation is set to 'nan' and takes place due to a mismatch between the source and target domains (i.e. not 100% overlap).
Though I suspect this behaviour has not changed for regridding, at the point of saving this data to disk and loading it back in again, we have lost the information which allows us to know which values were actually masked and which were 'nan' values. For our project, we cache data to disk, which depends on knowing the difference between a masked value and a 'nan' value for the very reason above.

Thanks @cpelley. This is a known bug that we need to address, see #2578

To clarify, I mentioned regridding only as a use case for why one might have both masked and nan values present (I didn't realise there was a problem there).
The thing I'm demonstrating as no longer working above is loading data which has nan values within it (they are indistinguishable from masked values). I hope this is not intended behaviour, but either way it is not captured by #2578 :)
I think this would be a blocker for us using dask right now at least.

Captured in #2609