Merged
122 commits
cc07254
Define @skip_biggus test decorator. (#2353)
pp-mo Feb 13, 2017
d2e0849
Generic lazy data handling. (#2356)
pp-mo Feb 13, 2017
04509be
Use _lazy_data functions for cube data.
pp-mo Feb 10, 2017
a9a9cfd
Hack for dual lazy support, i.e. biggus OR dask.
pp-mo Feb 10, 2017
5cdd2c3
Add mask/NaN translations into iris._lazy_data.
pp-mo Feb 12, 2017
f8d7e80
Started skipping tests.
pp-mo Feb 12, 2017
53edef0
Revert unnecessary change to integration/test_pp.
pp-mo Feb 13, 2017
0aed81f
Various skips.
pp-mo Feb 13, 2017
899f005
Disable Travis example + docs tests for now.
pp-mo Feb 13, 2017
6869f9a
dask based merge
marqh Feb 14, 2017
a0a9a06
skip all iris_grib tests
marqh Feb 14, 2017
e974847
Lazy pp loading
djkirkham Feb 14, 2017
bd9afce
switched netcdf loader from biggus to dask. untested. (#35)
corinnebosley Feb 14, 2017
794a230
skip failing netcdf unit mock tests: chunks do not add up to shape
marqh Feb 14, 2017
e1a8b1a
pp_load data property fix
marqh Feb 15, 2017
889ec70
as_concrete_array always returns a masked array
marqh Feb 15, 2017
0f1899c
Use Dask for concatenate (#38)
AlexHilson Feb 15, 2017
e3d2549
pp unit test
marqh Feb 15, 2017
051c81d
Don't make lazy wrappers for cube shape and dtype. (#37)
pp-mo Feb 15, 2017
4f255fa
biggus ArrayStack.multidim_array_stack with da.stack
marqh Feb 15, 2017
784c47d
is_lazy_data over isinstance
marqh Feb 15, 2017
7e0694a
test_field_collection with dask
marqh Feb 15, 2017
959b6cd
use np.dtype in mock tests
marqh Feb 15, 2017
3861b87
typo fix and fill_value guarantee (#39)
corinnebosley Feb 15, 2017
a7f5af4
Replace biggus ndarray with lazy as_concrete_data in pp pyke rules. (…
pp-mo Feb 15, 2017
f3efe3c
remove biggus lazy data, skip netcdf save
marqh Feb 15, 2017
fa5dc6d
fix cube pickle test
marqh Feb 15, 2017
f5294a3
skip netCDF save
marqh Feb 15, 2017
e8afc4a
Fixes for biggus array checks (#41)
corinnebosley Feb 15, 2017
a114cc5
replace biggus lazy use for now, patch out netcdf save tests
marqh Feb 15, 2017
2f7649f
skip as fill value lost
marqh Feb 15, 2017
de490cf
Don't try and merge 0-d arrays (#42)
AlexHilson Feb 15, 2017
c1fd32a
biggus skippers (#43)
corinnebosley Feb 16, 2017
d46b608
plot skippers (#45)
corinnebosley Feb 16, 2017
678dac0
skippers added for more not-serious failures (#44)
corinnebosley Feb 16, 2017
42f219d
header dates corrected (#46)
corinnebosley Feb 16, 2017
530e0c6
test implementation tweaks
marqh Feb 16, 2017
6c204cc
skip non lazy coord loading
marqh Feb 16, 2017
12735cb
pickling test skip
marqh Feb 16, 2017
25a798f
removed some unnecessary skippers, mostly on concatenate tests (#2388)
corinnebosley Feb 21, 2017
da3b947
Data first (#2392)
marqh Feb 23, 2017
51cb91c
first run at daskifying stats and maths (#2402)
corinnebosley Feb 23, 2017
083feca
Refactor AuxCoordFactory unit tests to remove biggus references (#2401)
djkirkham Feb 23, 2017
8e35fb1
Mocked PPFields need a useable _data property. (#2395)
pp-mo Feb 24, 2017
4521fa2
passing tests unskipped (#2408)
corinnebosley Feb 24, 2017
b1bf843
Dask aggs (#2406)
pp-mo Feb 24, 2017
211e457
PPfield core_data (#2410)
djkirkham Feb 27, 2017
138440f
Replace intersection (#2409)
corinnebosley Feb 27, 2017
c0e1878
Tidy dask section of dev guide
DPeterK Mar 7, 2017
1c01ab7
Coding standards tweaks to daskified cube.py. (#2416)
DPeterK Mar 8, 2017
042c9e4
Reinstate `as_lazy_data` (#2421)
DPeterK Mar 9, 2017
d731cc0
Interim CML refactor for fill_value and dtype
bjlittle Mar 13, 2017
1edeed2
Reuse multidim_daskstack in merge + fast um loading. (#2423)
lbdreyer Mar 13, 2017
8b95033
Tighten purpose of `array_masked_to_nans` (#2424)
DPeterK Mar 14, 2017
c2d9fa9
fill_value and dtype (#2433)
bjlittle Mar 15, 2017
d61f455
Consistent check for masked array types (#2434)
DPeterK Mar 15, 2017
0e26f81
NetCDF save with dask support (#2411)
bjlittle Mar 20, 2017
e1ff306
Replace dask 'compute()' usage with a common realisation call. (#2) (…
lbdreyer Mar 21, 2017
c143dfc
Remove some `@tests.skip_biggus` that are now unnecessary (#2450)
lbdreyer Mar 22, 2017
d2f3ddb
Cube.__init__ uses data setter.
bjlittle Mar 22, 2017
6eade27
Fix for dtype.kind (#2453)
bjlittle Mar 24, 2017
c3c04bd
Only use truediv in cubes (#2458)
DPeterK Mar 27, 2017
08f15a9
Remove the last bits of biggus (#2451)
DPeterK Mar 30, 2017
3bbd675
Dask data manager (#2461)
bjlittle Mar 30, 2017
441f77f
Fix test fo test_rules.py
lbdreyer Mar 22, 2017
03e59b8
Remove biggus test skipper: fix pp field data (#2473)
lbdreyer Apr 6, 2017
33d6498
Remove biggus from test comment.
bjlittle Apr 12, 2017
e45acf7
Bring iris_grib back into iris.
lbdreyer Apr 16, 2017
85eff75
Daskify grib loading/saving
lbdreyer Apr 16, 2017
b23bd9a
GribWrapper deferred data testing
corinnebosley Apr 13, 2017
0a6b516
Fix import order. (#4)
bjlittle Apr 20, 2017
8fdfede
Dask options for Iris processing (#2511)
DPeterK Apr 25, 2017
af01efb
Fix iris_grib integration tests.
lbdreyer Apr 25, 2017
df22e62
Fix handling of data with missing values in the DataProxy for Grib ed…
lbdreyer Apr 26, 2017
c1223d8
Fix fill value usage at grib save time
lbdreyer Apr 26, 2017
b3ebb08
DataManager used for cell measures (#2513)
pp-mo Apr 28, 2017
4a9a576
Dask cube and data manager integration (#2492)
bjlittle May 2, 2017
81d0d45
Prevent sliced coords sharing data with their parent coord. (#2517)
pp-mo May 2, 2017
b7938ac
Remove `LazyArray` class from coords (#2518)
DPeterK May 3, 2017
8fd3d48
Populate cube.fill_value thru cube.data setter.
bjlittle May 4, 2017
4437b7c
Dask core data (#2521)
bjlittle May 4, 2017
faa86ac
Dask merge concat fill value (#2520)
bjlittle May 4, 2017
bde6aac
Unify getitem code between Cube, Coord and CellMeasures. (#2519)
pp-mo May 5, 2017
d30a996
Move fill-value handling into DataManager.
bjlittle May 10, 2017
77eb942
Improve util funcs for testing coord regularity
DPeterK May 11, 2017
0e5318f
Update docstring
DPeterK May 11, 2017
02424e2
Simplify numeric tests
DPeterK May 11, 2017
bbf4cd7
Add _proportion() comment.
bjlittle May 11, 2017
57e0392
Update __eq__ doc-string.
bjlittle May 11, 2017
f0b0cf9
Integrate DataManager with Coord, DimCoord, and AuxCoord (#2527)
DPeterK May 11, 2017
b8aa885
Dask data manager propagate (#2548)
bjlittle May 15, 2017
f517087
Remove biggus skippers and update cml
DPeterK May 15, 2017
3fd8925
More Coordinate tests for variously lazy/real points and bounds (#2547)
pp-mo May 15, 2017
91af9bb
Fix non-lazy pickle test
DPeterK May 15, 2017
c4d7a77
Unexpectedly good
DPeterK May 16, 2017
10a4088
Check realised data without copying anything
DPeterK May 16, 2017
d22aeff
Ensure declared dtype of data proxy objects matches returned dtype (#…
djkirkham May 17, 2017
ed485a3
Update test data sha
djkirkham May 17, 2017
09a0613
Coord points and bounds inheritance. (#2553)
pp-mo May 17, 2017
26906c3
Purge spurious imports (#2560)
DPeterK May 17, 2017
1669ee8
De-mock some NetCDF tests (#2556)
DPeterK May 17, 2017
4be3880
Fix rules test for _make_cube (#2557)
DPeterK May 18, 2017
9d803a7
Handle MaskedConstant in cube maths (#2526)
djkirkham May 18, 2017
015cee9
Fix PP bmdi handling (#2564)
pp-mo May 18, 2017
51bb2ae
Remove the last bits of biggus (#2568)
DPeterK May 18, 2017
e30890b
Use six.assertRegex to avoid Python 3 test method deprecation. (#2565)
pp-mo May 19, 2017
3c57106
Ensure cube.transpose functionality (#2580)
DPeterK May 24, 2017
968ba9a
Reinstate doctest and extest
DPeterK May 23, 2017
7276b56
Reinstate deprecated Future options
DPeterK May 23, 2017
5da2e2a
Update interpolation doc example vals
DPeterK May 24, 2017
a65ddd4
Add missing unit test tests.main call
bjlittle Jun 2, 2017
fbf0bab
Backout PR #2452.
bjlittle Jun 2, 2017
8c90b58
Fix experimental stratify CML
bjlittle Jun 2, 2017
56825a5
Reinstate #2477 isMaskedArray to is_masked
bjlittle Jun 5, 2017
bbf0941
Align TestCubeCollapsed CML based on #2440
bjlittle Jun 5, 2017
821e325
Align TestAnalysisWeights CML based on #2440
bjlittle Jun 5, 2017
e287ffe
Fix unit test for numpy float index slicing.
bjlittle Jun 8, 2017
92e50d0
Avoid mock assert_called_once python3.6 specific.
bjlittle Jun 8, 2017
78515fd
Manually add "Lazy bounds test (#2590)".
bjlittle Jun 9, 2017
d2af0db
Manually add "Dask docs, first draft (#2583)".
bjlittle Jun 9, 2017
33c8249
Review actions.
bjlittle Jun 13, 2017
e2eeea3
Fix for dask v0.15+ get_sync.
bjlittle Jun 14, 2017
4 changes: 2 additions & 2 deletions .travis.yml
@@ -22,7 +22,7 @@ git:
depth: 10000

install:
- - export IRIS_TEST_DATA_REF="7c0e32c8812b464e467a9555bdc25dc1e0c5be0c"
+ - export IRIS_TEST_DATA_REF="2f3a6bcf25f81bd152b3d66223394074c9069a96"
- export IRIS_TEST_DATA_SUFFIX=$(echo "${IRIS_TEST_DATA_REF}" | sed "s/^v//")

# Install miniconda
@@ -54,7 +54,7 @@ install:
conda install --quiet --file minimal-conda-requirements.txt;
else
if [[ "$TRAVIS_PYTHON_VERSION" == 3* ]]; then
- sed -e '/ecmwf_grib/d' -e '/esmpy/d' -e '/iris-grib/d' -e 's/#.\+$//' conda-requirements.txt | xargs conda install --quiet;
+ sed -e '/ecmwf_grib/d' -e '/esmpy/d' -e 's/#.\+$//' conda-requirements.txt | xargs conda install --quiet;
else
conda install --quiet --file conda-requirements.txt;
fi
7 changes: 0 additions & 7 deletions INSTALL
@@ -80,9 +80,6 @@ numpy 1.9 or later (http://numpy.scipy.org/)
Python package for scientific computing including a powerful N-dimensional
array object.

biggus 0.14 or later (https://github.com/SciTools/biggus)
Virtual large arrays and lazy evaluation.

scipy 0.10 or later (http://www.scipy.org/)
Python package for scientific computing.

@@ -128,10 +125,6 @@ grib-api 1.9.16 or later
edition 2 messages. A compression library such as Jasper is required
to read JPEG2000 compressed GRIB2 files.

iris-grib 0.9 or later
(https://github.com/scitools/iris-grib)
Iris interface to ECMWF's GRIB API

matplotlib 1.2.0 (http://matplotlib.sourceforge.net/)
Python package for 2D plotting.

10 changes: 5 additions & 5 deletions conda-requirements.txt
@@ -2,14 +2,14 @@
# conda create -n <name> --file conda-requirements.txt

# Mandatory dependencies
biggus
cartopy
matplotlib<1.9
netcdf4
numpy
pyke
udunits2
cf_units
dask

# Iris build dependencies
setuptools
@@ -25,12 +25,12 @@ imagehash
requests

# Optional iris dependencies
nc_time_axis
iris-grib
ecmwf_grib
esmpy>=7.0
gdal
libmo_unpack
pandas
pyugrid
mo_pack
nc_time_axis
pandas
python-stratify
pyugrid
2 changes: 0 additions & 2 deletions docs/iris/src/conf.py
@@ -158,8 +158,6 @@
'scipy': ('http://docs.scipy.org/doc/scipy/reference/', None),
'matplotlib': ('http://matplotlib.org/', None),
'cartopy': ('http://scitools.org.uk/cartopy/docs/latest/', None),
'biggus': ('https://biggus.readthedocs.io/en/latest/', None),
'iris-grib': ('http://iris-grib.readthedocs.io/en/latest/', None),
}


35 changes: 35 additions & 0 deletions docs/iris/src/developers_guide/dask_interface.rst
@@ -0,0 +1,35 @@
Iris Dask Interface
*******************

Iris uses `dask <http://dask.pydata.org>`_ to manage lazy data interfaces and processing graphs.
The key principles that define this interface are:

* A call to :attr:`cube.data` will always load all of the data.

* Once this has happened:

* :attr:`cube.data` is a mutable NumPy masked array or ``ndarray``, and
* ``cube._numpy_array`` is a private NumPy masked array, accessible via :attr:`cube.data`, which may strip off the mask and return a reference to the bare ``ndarray``.

* You can use :attr:`cube.data` to set the data. This accepts:

* a NumPy array (including masked array), which is assigned to ``cube._numpy_array``, or
* a dask array, which is assigned to ``cube._dask_array``, while ``cube._numpy_array`` is set to None.

* ``cube._dask_array`` may be None, otherwise it is expected to be a dask array:

* this may wrap a proxy to a file collection, or
* this may wrap the NumPy array in ``cube._numpy_array``.

* All dask arrays wrap array-like objects where missing data are represented by ``nan`` values:

@cpelley cpelley Jun 13, 2017

Have we lost the means to differentiate between 'nan' values and 'masked' values in that case?
Looks like this is the case:

iris master:

>>> iris.load_cube('nan_mask_tmp.nc').data
masked_array(data = [1.0 -- nan],
             mask = [False  True False],
       fill_value = 1e+20)

This branch:

>>> iris.load_cube('nan_mask_tmp.nc').data
masked_array(data = [1.0 -- --],
             mask = [False  True  True],
       fill_value = 1e+20)

Differentiating between masked values and nan values can be important. An example: regridding a field with masked data to a target with a different coordinate system, where extrapolation is set to 'nan' and takes place due to a mismatch between the source and target domains (i.e. not 100% overlap).
I suspect this behaviour has not changed for regridding, but once the data is saved to disk and loaded back in, we lose the information that tells us which values were actually masked and which were 'nan'. Our project caches data to disk and depends on knowing the difference between a masked value and a 'nan' value for exactly this reason.

Member Author

Thanks @cpelley. This is a known bug that we need to address, see #2578

@cpelley cpelley Jun 14, 2017

To clarify, I mentioned regridding only as a usecase for why one might have both masked and nan values present (I didn't realise there was a problem there).

The thing I'm demonstrating as no longer working above is to load data which has nan values within it (they are indistinguishable from masked values). I hope this is not intended behaviour, but either way it is not captured by #2578 :)

I think this would be a blocker for us using dask right now at least.

Captured in #2609


* Masked arrays derived from these dask arrays create their mask using the locations of ``nan`` values.
* Where dask-wrapped arrays of ``int`` require masks, these arrays will first be cast to ``float``.

@cpelley cpelley Jun 13, 2017

What kind of float container? The smallest one possible for the range of values and dtype defined?
My first thought is of memory consumption and performance (speed). As I say, I have not looked at the implementation and have no idea of any benchmarking performed, but it would give me greater confidence if I knew what this might mean for performance.

Member Author

@cpelley I've created issue #2602 to address this concern.

Thanks @bjlittle

@cpelley cpelley Jun 13, 2017

This doesn't seem robust to me:

Perhaps I'm missing something about the implementation but I don't think you can represent the full int64 range of values as float64 ones in one container:

>>> arr = np.array([np.iinfo('int64').min, np.iinfo('int64').max])
>>> arr
array([-9223372036854775808, 9223372036854775807])
>>> arr.astype('float64').astype('int64')
array([-9223372036854775808, -9223372036854775808])

I have not looked at the implementation. Perhaps this is done element-wise so isn't a problem?
Either way, I think further explanation in the docs here would be useful.

Member Author

@cpelley Interesting observation. Do you have an actual data use case for this?

My query is not driven by a use case. I'm not sure I have seen a 64-bit integer field which spans a large enough range that it cannot be represented by a 64-bit float field. However, this is my point: I don't know :)
Currently it won't fall over; it will silently overflow.

Member Author

@cpelley I've raised issue #2603 to investigate this further.

Thanks @bjlittle

Member

You can cast to float64 without overflow, but not back due to rounding:

>>> arr = np.array([np.iinfo('int64').min, np.iinfo('int64').max])
>>> arr
array([-9223372036854775808, 9223372036854775807])
>>> arr.astype('float64')
array([ -9.22337204e+18,   9.22337204e+18])

Casting to float is always a compromise though, you can't have a 1-1 mapping of all integers->floats with the same bit size.

To extend the illustration:

>>> np.set_printoptions(precision=18)
>>> np.array([np.iinfo('int64').min, np.iinfo('int64').max], dtype='int64')
array([-9223372036854775808, 9223372036854775807])
>>> np.array([np.iinfo('int64').min, np.iinfo('int64').max], dtype='float64')
array([ -9.223372036854775808e+18,   9.223372036854775808e+18])

Note, this problem is not restricted to the very extreme of the limits.


* In order to support this mask conversion, cubes have a ``fill_value`` defined as part of their metadata, which may be ``None``.

* Array copying is kept to an absolute minimum:

* array references should always be passed, not new arrays created, unless an explicit copy operation is requested.

* To test for the presence of a dask array of any sort, we use :func:`iris._lazy_data.is_lazy_data`. This is implemented as ``hasattr(data, 'compute')``.
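
As a minimal sketch of the duck-typed check described in the last point (this is not the Iris source; ``is_lazy_data_sketch`` and ``FakeLazyArray`` are hypothetical names used only for illustration), testing for a ``compute`` attribute is enough to tell a lazy-like object from a plain in-memory container:

```python
def is_lazy_data_sketch(data):
    """Return True if `data` looks lazy, i.e. exposes a `compute` attribute."""
    return hasattr(data, 'compute')


class FakeLazyArray(object):
    """Hypothetical stand-in for a dask array; it only mimics `compute`."""

    def __init__(self, values):
        self._values = values

    def compute(self):
        # A real dask array would execute its task graph at this point.
        return list(self._values)


real = [1.0, 2.0, 3.0]
lazy = FakeLazyArray(real)

print(is_lazy_data_sketch(real))   # False
print(is_lazy_data_sketch(lazy))   # True
```

Any object with a ``compute`` attribute passes this test, so it is a convention rather than a strict type check.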
1 change: 1 addition & 0 deletions docs/iris/src/developers_guide/index.rst
@@ -38,3 +38,4 @@
tests.rst
deprecations.rst
release.rst
dask_interface.rst
7 changes: 4 additions & 3 deletions docs/iris/src/userguide/index.rst
@@ -6,11 +6,11 @@ Iris user guide

How to use the user guide
---------------------------
- If you are reading this user guide for the first time it is strongly recommended that you read the user guide
- fully before experimenting with your own data files.
+ If you are reading this user guide for the first time it is strongly recommended that you read the user guide
+ fully before experimenting with your own data files.


- Much of the content has supplementary links to the reference documentation; you will not need to follow these
+ Much of the content has supplementary links to the reference documentation; you will not need to follow these
links in order to understand the guide but they may serve as a useful reference for future exploration.

.. htmlonly::
@@ -30,6 +30,7 @@ User guide table of contents
saving_iris_cubes.rst
navigating_a_cube.rst
subsetting_a_cube.rst
real_and_lazy_data.rst
plotting_a_cube.rst
interpolation_and_regridding.rst
merge_and_concat.rst
4 changes: 2 additions & 2 deletions docs/iris/src/userguide/interpolation_and_regridding.rst
@@ -176,8 +176,8 @@ For example, to mask values that lie beyond the range of the original data:
>>> scheme = iris.analysis.Linear(extrapolation_mode='mask')
>>> new_column = column.interpolate(sample_points, scheme)
>>> print(new_column.coord('altitude').points)
- [ nan 494.44451904 588.88891602 683.33325195 777.77783203
- 872.222229 966.66674805 1061.11108398 1155.55541992 nan]
+ [-- 494.44451904296875 588.888916015625 683.333251953125 777.77783203125

I think this shows the point I was making above.

Member

does it?

About differentiating between nan and masked values, I think so.

Member

You are right about the point you are making, but I don't believe that this is the right time to be making it.

This PR is to merge a feature branch into Iris which has been under construction for 4 months, and every decision has been discussed in great detail already. This method may not be ideal, but with dask having no support for masked values it is the best option we have.

We have by no means kept development of the feature branch a secret, and there has been plenty of time and space for discussion of major implementation decisions, which is not in this PR. This is just to review the last 10 commits, as @bjlittle pointed out in his first comment, so even though you are right, there is really nothing we can do about it now.

+ 872.2222290039062 966.666748046875 1061.111083984375 1155.555419921875 --]


.. _caching_an_interpolator:
230 changes: 230 additions & 0 deletions docs/iris/src/userguide/real_and_lazy_data.rst
@@ -0,0 +1,230 @@
.. _real_and_lazy_data:


.. testsetup:: *

import dask.array as da
import iris
import numpy as np


==================
Real and Lazy Data
==================

We have seen in the :doc:`user_guide_introduction` section of the user guide that
Iris cubes contain data and metadata about a phenomenon. The data element of a cube
is always an array, but the array may be either "real" or "lazy".

In this section of the user guide we will look specifically at the concepts of
real and lazy data as they apply to the cube and other data structures in Iris.


What is real and lazy data?
---------------------------

In Iris, we use the term **real data** to describe data arrays that are loaded
into memory. Real data is typically provided as a
`NumPy array <https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html>`_,
which has a shape and data type that are used to describe the array's data points.
Each data point takes up a small amount of memory, which means large NumPy arrays can
take up a large amount of memory.

Conversely, we use the term **lazy data** to describe data that is not loaded into memory.
(This is sometimes also referred to as **deferred data**.)
In Iris, lazy data is provided as a
`dask array <http://dask.pydata.org/en/latest/array-overview.html>`_.
A dask array also has a shape and data type
but typically the dask array's data points are not loaded into memory.
Instead the data points are stored on disk and only loaded into memory in
small chunks when absolutely necessary (see the section :ref:`when_real_data`
for examples of when this might happen).

The primary advantage of using lazy data is that it enables
`out-of-core processing <https://en.wikipedia.org/wiki/Out-of-core_algorithm>`_;
that is, the loading and manipulating of datasets that otherwise would not fit into memory.
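
To make the idea of out-of-core processing concrete, here is a small pure-Python sketch. It is only an illustration of the principle, not how dask or Iris implement it, and ``chunked_mean`` and ``read_chunks`` are hypothetical helpers. The mean is accumulated one chunk at a time, so the full dataset never has to be held in memory at once:

```python
def chunked_mean(chunks):
    """Compute a mean while holding only one chunk in memory at a time."""
    total = 0.0
    count = 0
    for chunk in chunks:
        total += sum(chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError('no data points supplied')
    return total / count


def read_chunks(n_points, chunk_size):
    """Hypothetical data source: yields successive chunks of 0..n_points-1."""
    for start in range(0, n_points, chunk_size):
        yield list(range(start, min(start + chunk_size, n_points)))


result = chunked_mean(read_chunks(n_points=1000, chunk_size=100))
print(result)   # 499.5
```

In a real workflow the chunks would come from files on disk rather than from a generator of integers.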

You can check whether a cube has real data or lazy data by using the method
:meth:`~iris.cube.Cube.has_lazy_data`. For example::

>>> cube = iris.load_cube(iris.sample_data_path('air_temp.pp'))
>>> cube.has_lazy_data()
True
# Realise the lazy data.
>>> cube.data
>>> cube.has_lazy_data()
False


.. _when_real_data:

When does my data become real?
------------------------------

When you load a dataset using Iris the data array will almost always initially be
a lazy array. This section details some operations that will realise lazy data
as well as some operations that will maintain lazy data. We use the term **realise**
to mean converting lazy data into real data.

Most operations on data arrays can be run equivalently on both real and lazy data.
If the data array is real then the operation will be run on the data array
immediately. The results of the operation will be available as soon as processing is completed.
If the data array is lazy then the operation will be deferred and the data array will
remain lazy until you request the result (such as when you call ``cube.data``)::

>>> cube = iris.load_cube(iris.sample_data_path('air_temp.pp'))
>>> cube.has_lazy_data()
True
>>> cube += 5
>>> cube.has_lazy_data()
True

The process by which the operation is deferred until the result is requested is
referred to as **lazy evaluation**.
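
The following toy example sketches what lazy evaluation means in practice. It is an assumption-laden illustration and not dask's implementation; the ``Deferred`` class is hypothetical. Adding to a deferred value only records the operation, and the work is done when the result is explicitly requested:

```python
class Deferred(object):
    """Toy deferred value: records a function to call and runs it on demand."""

    def __init__(self, func):
        self._func = func
        self.evaluations = 0

    def __add__(self, other):
        # Build a new deferred value; nothing is computed yet.
        return Deferred(lambda: self.compute() + other)

    def compute(self):
        # Count each evaluation so the deferral can be observed.
        self.evaluations += 1
        return self._func()


source = Deferred(lambda: 10)
result = source + 5            # the addition is recorded, not performed
print(source.evaluations)      # 0
print(result.compute())        # 15
print(source.evaluations)      # 1
```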

Certain operations, including regridding and plotting, can only be run on real data.
Calling such operations on lazy data will automatically realise your lazy data.

You can also realise (and so load into memory) your cube's lazy data by 'touching' it.
To 'touch' the data means to access it directly through ``cube.data``,
as in the previous example.

Core data
^^^^^^^^^

Cubes have the concept of "core data". This returns the cube's data in its
current state:

* If a cube has lazy data, calling the cube's :meth:`~iris.cube.Cube.core_data` method
will return the cube's lazy dask array. Calling the cube's
:meth:`~iris.cube.Cube.core_data` method **will never realise** the cube's data.
* If a cube has real data, calling the cube's :meth:`~iris.cube.Cube.core_data` method
will return the cube's real NumPy array.

@cpelley cpelley Jun 13, 2017

While in such a related space, is it worth mentioning Cube.lazy_data() here, which will return a dask array regardless? (Unless this has been changed or removed; I haven't looked at anything which lies outside this PR.) Is the reason for providing this property that converting a NumPy array into a dask array is much more expensive than it was with biggus arrays?

Member

Yes. It may be worth including a mention of coord.lazy_data somewhere, which I will discuss with the dev team here, but this section is specifically about coord.core_data, which refers to the data's current state. This is therefore not the space to add an example of coord.lazy_data, which (as you say) will load a dask array regardless of the data's current state.

@cpelley cpelley Jun 13, 2017

Sorry, by related area I meant being under the parent level 'When does my data become real?'.
Reading the documentation, I was expecting to see it discussed, or at least a reference to another area of the docs.


For example::

>>> cube = iris.load_cube(iris.sample_data_path('air_temp.pp'))
>>> cube.has_lazy_data()
True

>>> the_data = cube.core_data()
>>> type(the_data)
<class 'dask.array.core.Array'>
>>> cube.has_lazy_data()
True

# Realise the lazy data.
>>> cube.data
>>> the_data = cube.core_data()
>>> type(the_data)
<type 'numpy.ndarray'>
>>> cube.has_lazy_data()
False
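
The behaviour shown above can be modelled with a small stand-in class. This is a hedged sketch of the semantics only, not the Iris implementation; ``ToyCube`` and its loader callable are hypothetical. ``core_data()`` returns whatever is currently held without triggering a load, while accessing ``data`` realises the array and keeps the result:

```python
class ToyCube(object):
    """Hypothetical holder mimicking the real/lazy behaviour described above."""

    def __init__(self, loader):
        self._loader = loader     # callable standing in for a lazy array
        self._real = None         # populated once the data is realised

    def has_lazy_data(self):
        return self._real is None

    def core_data(self):
        # Return the data in its current state; never triggers loading.
        return self._loader if self.has_lazy_data() else self._real

    @property
    def data(self):
        # Touching the data realises it and keeps the real result.
        if self._real is None:
            self._real = self._loader()
            self._loader = None
        return self._real


cube = ToyCube(lambda: [1, 2, 3])
print(callable(cube.core_data()), cube.has_lazy_data())   # True True
_ = cube.data                                             # realise the data
print(cube.core_data(), cube.has_lazy_data())             # [1, 2, 3] False
```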


Coordinates
-----------

In the same way that Iris cubes contain a data array, Iris coordinates contain a
points array and an optional bounds array.
Coordinate points and bounds arrays can also be real or lazy:

* A :class:`~iris.coords.DimCoord` will only ever have **real** points and bounds
arrays because of monotonicity checks that realise lazy arrays.
* An :class:`~iris.coords.AuxCoord` can have **real or lazy** points and bounds.
* An :class:`~iris.aux_factory.AuxCoordFactory` (or derived coordinate)
can have **real or lazy** points and bounds. If all of the
:class:`~iris.coords.AuxCoord` instances used to construct the derived coordinate
have real points and bounds then the derived coordinate will have real points
and bounds, otherwise the derived coordinate will have lazy points and bounds.

Iris cubes and coordinates have very similar interfaces, which extends to accessing
coordinates' lazy points and bounds:

.. doctest::

>>> cube = iris.load_cube(iris.sample_data_path('hybrid_height.nc'))

>>> dim_coord = cube.coord('model_level_number')
>>> print(dim_coord.has_lazy_points())
False
>>> print(dim_coord.has_bounds())
False
>>> print(dim_coord.has_lazy_bounds())
False

>>> aux_coord = cube.coord('sigma')
>>> print(aux_coord.has_lazy_points())
True
>>> print(aux_coord.has_bounds())
True
>>> print(aux_coord.has_lazy_bounds())

@cpelley cpelley Jun 13, 2017

no Coord.lazy_data()? what about Coord.core_data()?

Member

What about it? What are you expecting to see here?

@cpelley cpelley Jun 13, 2017

"Iris cubes and coordinates have very similar interfaces, which extends to accessing
coordinates' lazy points and bounds"

I expect to see Coord.lazy_data() and Coord.core_data() illustrated here if they do indeed apply to coordinates as they do to cubes (and if not, to say so too).

Member

Yes. You are right, we should discuss those points in this section somewhere. Not in this PR though, as this is the mergeback of the feature branch for a pre-release candidate. But what I will do is add a link to this comment and the one above in the project ticket about final documentation so that we can include your suggestions in later revisions of the docs.

Member Author

@cpelley it does say very similar and not identical. As @corinnebosley states, we're going to iterate over the documentation (we know it's not complete or perfect), so this feedback is welcomed; that's why we're keen to make a pre-release candidate available asap.

Thanks both, happy with that :)

True

# Realise the lazy points. This will **not** realise the lazy bounds.
>>> points = aux_coord.points
>>> print(aux_coord.has_lazy_points())
False
>>> print(aux_coord.has_lazy_bounds())
True

>>> derived_coord = cube.coord('altitude')
>>> print(derived_coord.has_lazy_points())
True
>>> print(derived_coord.has_bounds())
True
>>> print(derived_coord.has_lazy_bounds())
True

.. note::
Printing a lazy :class:`~iris.coords.AuxCoord` will realise its points and bounds arrays!


Dask processing options
-----------------------

As stated earlier in this user guide section, Iris uses dask to provide
lazy data arrays for both Iris cubes and coordinates. Iris also uses dask
functionality for processing deferred operations on lazy arrays.

Dask provides processing options to control how deferred operations on lazy arrays
are computed. This is provided via the ``dask.set_options`` interface.
We can make use of this functionality in Iris. This means we can
control how dask arrays in Iris are processed, for example giving us power to
run Iris processing in parallel.

Iris by default applies a single dask processing option. This specifies that
all dask processing in Iris should be run in serial (that is, without any
parallel processing enabled).

The dask processing option applied by Iris can be overridden by manually setting
dask processing options for either or both of:

* the number of parallel workers to use,
* the scheduler to use.

This must be done **before** importing Iris. For example, to specify that dask
processing within Iris should use four workers in a thread pool::

>>> from multiprocessing.pool import ThreadPool
>>> import dask
>>> dask.set_options(get=dask.threaded.get, pool=ThreadPool(4))

>>> import iris
>>> # Iris processing here...

.. note::
These dask processing options will last for the lifetime of the Python session
and must be re-applied in other or subsequent sessions.

Other dask processing options are also available. See the
`dask documentation <http://dask.pydata.org/en/latest/scheduler-overview.html>`_
for more information on setting dask processing options.


Further reading
---------------

This section of the Iris user guide provides a quick overview of real and lazy
data within Iris. For more details on these and related concepts,
see the whitepaper on lazy data.