Add a blocked K-Fold cross-validator #254

Merged: 9 commits merged into master from the blockkfold branch on Jun 4, 2020

Conversation

leouieda
Member

The `BlockKFold` class splits the data into spatial blocks and does
K-fold cross-validation splits on the blocks. Unlike `BlockShuffleSplit`,
it's difficult to balance the folds to have the same amount of data
(since blocks can have different numbers of points), so each fold will
have the same number of blocks but not necessarily the same amount of
data. Refactor code from `BlockShuffleSplit` into a base class that can
be shared with the new class.

Reminders:

  • Run make format and make check to make sure the code follows the style guide.
  • Add tests for new features or tests that would have caught the bug that you're fixing.
  • Add new public functions/methods/classes to doc/api/index.rst and the base __init__.py file for the package.
  • Write detailed docstrings for all functions/classes/methods. It often helps to design better code if you write the docstrings first.
  • If adding new functionality, add an example to the docstring, gallery, and/or tutorials.
  • Add your full name, affiliation, and ORCID (optional) to the AUTHORS.md file (if you haven't already) in case you'd like to be listed as an author on the Zenodo archive of the next release.

@leouieda
Member Author

Here is a follow-up to #251 adding a blocked K-fold cross-validator. I couldn't think of a good way to do the balancing on this one. Any ideas?

@jessepisel
Contributor

That's a tough one: getting the number of points balanced while keeping the block sizes consistent from fold to fold. Is there any way it could use some kind of nearest-neighbor search and adjust the size of the blocks for each fold, ensuring the number of points is the same? Or some kind of weighting scheme for blocks? That could be a rabbit hole itself...

@leouieda
Member Author

> That could be a rabbit hole itself...

It certainly is.

I've been trying to think of a way to make sure the folds have close to the same number of data points without interfering with the block size. That would likely mean that the block order wouldn't be random, but would be such that folds are balanced. That could be a good extra option, so you could have a shuffle mode and a balanced mode.

@santisoler
Member

Nice feature for Verde!
I only have one question regarding the `X` array-like argument. Why not ask for `coordinates` and rearrange them inside the `split` method? (I've just seen this is also present in `BlockShuffleSplit`, added in #251, although I hadn't spotted it there.)

@leouieda
Member Author

@santisoler that was how I had implemented them originally. The problem is that they are not compatible with their scikit-learn counterparts in that case, which meant that `cross_val_score` needed to know which type of class it was getting: a scikit-learn one or a Verde one. It would also mean that people doing machine learning on spatial data wouldn't be able to use our classes outside of Verde.

The memory copy when generating `X` is not ideal, but I thought it was a good trade-off for making these classes generally applicable.

@leouieda
Member Author

I think I might have found a way to always balance the folds, even without shuffling.

  1. Make a list of the block sizes.
  2. Split this list into N partitions of approximately equal sum. The split preserves the order of the blocks.
  3. Do the k-fold on these N partitions instead of the actual blocks. Basically do `KFold(n_splits=N).split(range(N))`.

The number of blocks per partition will vary depending on how much data is in each block. The folds will not be shuffled since the partition split preserves the order of the blocks. To make this work with shuffling, we can shuffle the blocks before doing the partitioning.

Splitting the list is the challenging part but I found this solution: https://stackoverflow.com/a/54024280

It could be implemented as a `balanced` argument that does the partitioning if `True` or doesn't if `False`. Probably best to default to `True`.
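
For illustration, here is a minimal NumPy sketch of the partition-by-sum idea, following the approach in the linked StackOverflow answer. The function name `partition_by_sum` and the example values are editorial; the code that ended up in Verde may differ in detail.

```python
import numpy as np


def partition_by_sum(values, parts):
    """Indices that split *values* into *parts* pieces of roughly equal sum."""
    cumulative = np.cumsum(values)
    total = cumulative[-1]
    # Ideal cumulative sum at the end of each of the first parts - 1 pieces.
    targets = np.arange(1, parts) * total / parts
    # Closest break points in the cumulative sum to those ideal targets.
    indices = np.searchsorted(cumulative, targets, side="right")
    # Fail loudly if the split would leave an empty piece (more likely when
    # the number of parts is close to the number of elements).
    if np.unique(indices).size < indices.size or np.any(indices == 0):
        raise ValueError("Cannot partition values into parts of equal sum.")
    return indices


# 8 blocks with different numbers of points, split into 3 folds.
block_sizes = np.array([5, 3, 4, 6, 2, 4, 5, 7])
split_points = partition_by_sum(block_sizes, parts=3)
folds = np.split(np.arange(block_sizes.size), split_points)
# folds -> [array([0, 1, 2]), array([3, 4, 5]), array([6, 7])]
# Each fold gets 12 data points even though the number of blocks differs.
```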

@jessepisel
Contributor

So the larger blocks are all the same size, but the sub-chunks have varying sizes that contain equal numbers of points? Or do I have it backwards: vary the large blocks but keep the sub-chunks the same size? I think the first option would work because you could hold the number of samples the same for each fold.

@leouieda
Member Author

This got a bit confusing in terms of terminology:

  • Block: spatial block defined by `block_split` (a regular mesh). All blocks have the same size but different amounts of data. All blocks have at least one data point.
  • Fold: partition of "data" in KFold CV. One fold is used for testing, the rest are used for training. Each data point can only be assigned to one fold (no repetition).

The current implementation does this:

  1. Split the data into blocks.
  2. Optionally shuffle the blocks.
  3. Partition the blocks evenly into folds (each fold has the same number of blocks).
  4. Assign data to folds based on which blocks are in each fold (if fold 1 contains blocks 3 and 4, the data in those blocks is assigned to fold 1).

What I'm proposing is:

  1. Split the data into blocks.
  2. Optionally shuffle the blocks.
  3. Make a list with the number of data points in each block (block_n_data).
  4. Split the blocks into folds in a way that each fold has the same number of data points. To do this, we can find the partition points by splitting the list `block_n_data` into parts of near-equal sum. This tells us which blocks belong in each fold (a code sketch of these steps follows right after this list).
  5. Assign data to folds based on which blocks are in each fold.
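
A rough, self-contained sketch of these five steps (toy inputs; `block_labels` here is a hypothetical stand-in for whatever block membership a `block_split`-style binning would produce):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_folds = 3

# Hypothetical input: one spatial block label per data point (20 points, 6 blocks).
block_labels = rng.integers(0, 6, size=20)

# 1-2. List the unique blocks and optionally shuffle them.
blocks = rng.permutation(np.unique(block_labels))
# 3. Number of data points in each block, in the (shuffled) block order.
block_n_data = np.array([np.count_nonzero(block_labels == block) for block in blocks])
# 4. Partition the blocks so that each fold has a near-equal number of points,
#    by splitting the cumulative sum at near-equal targets.
cumulative = np.cumsum(block_n_data)
targets = np.arange(1, n_folds) * cumulative[-1] / n_folds
split_points = np.searchsorted(cumulative, targets, side="right")
folds_of_blocks = np.split(blocks, split_points)
# 5. Assign data point indices to folds based on block membership.
folds = [
    np.where(np.isin(block_labels, fold_blocks))[0] for fold_blocks in folds_of_blocks
]
```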

@jessepisel
Contributor

Thanks for the clarification. What you are proposing makes sense: the equal number of points for each fold seems reasonable, and, like you said, it will help with understanding which blocks are in which fold.

The function finds the indices where an array should be split to have n
parts of approximately equal sum. Sometimes this can't be done (there
would be empty partitions) so raise an exception when that happens. It's
more likely to happen when the number of parts is close to the number of
elements.
@leouieda
Member Author

leouieda commented May 1, 2020

@jessepisel I just coded that method and it does work pretty well! See the gallery example comparing balanced and unbalanced results:

[Image: gallery example comparing balanced and unbalanced folds]

There were a few tricks involved since the method fails if the number of splits is too close to the number of blocks.

@santisoler and @jessepisel would either of you mind doing another round of reviews? Thank you for all the input!

@jessepisel
Contributor

@leouieda I like this visual. It turned out really nice. I will look through the changes this week and see what I can find.

@leouieda
Member Author

leouieda commented Jun 4, 2020

Merging this in since all checks pass.

@jessepisel sorry to fast-track this, but I wanted to get it in for Transform next week. I think the API is settled, so any changes that may be needed can be handled easily in a future release. Thanks for all your input!

@leouieda leouieda merged commit 263b29a into master Jun 4, 2020
@leouieda leouieda deleted the blockkfold branch June 4, 2020 15:31
@jessepisel
Contributor

No worries, I apologize for not being more prompt.

@leouieda
Member Author

leouieda commented Jun 4, 2020

I'm the last person you should apologize to for that 😉

jessepisel added a commit to jessepisel/verde that referenced this pull request Oct 28, 2020
* Bug fix in grid_to_table for wrong order of Dataset coords (fatiando#229)

We were relying on the order of the coords attribute of the data set to meshgrid 
the coordinates. This is wrong because xarray takes the coordinate order from 
dims instead, which is what we should also do. Took this chance to refactor the 
code and included a test that would have caught the bug.

* Update files from fatiando/contributing (fatiando#233)

Changes have been made in https://github.com/fatiando/contributing to:
MAINTENANCE.md
Update the copies in this repository to match.
See fatiando/community@91699a0

* Use newer Mac on Azure Pipelines (fatiando#234)

Changed from 10.13 (High Sierra) to 10.15 (Catalina) because Azure will
remove the 10.13 VMs soon.

* Allow grid_to_table to take DataArrays (fatiando#235)

Previously would fail because we rely on getting the data variable name
from the Dataset. Need to check if it's a Dataset (has the `data_vars`
attribute) and set the data column name to "scalars" if it's not.

Fixes fatiando#225

* Add function for rolling windows on irregular data (fatiando#236)

The `rolling_window` function returns the indices of points falling on
rolling/moving windows. It works for irregularly sampled data and is a
nice complement to `xarray.DataArray.rolling`  (which only works on
grids). Uses a KDTree to select points but is not compatible with
pykdtree (it's missing the method we need).

* Add function to split data on an expanding window (fatiando#238)

For selecting data in windows of various size around a center point.
Sizes do not have to be increasing necessarily. Returns only the indices
of points falling inside each window.

* Explicitly set PR trigger for all branches on Azure (fatiando#244)

For some reason, Azure Pipelines CI wasn't starting up PRs, only on
master and tags. Explicitly setting a trigger for PRs against any branch
seems to work.

* Fix bug for 3+ coordinates in windowing functions (fatiando#243)

The `rolling_window` and `expanding_window` functions fail if given more than 2 coordinates.
If we had full support for N-D coordinates, they could be N-D windowing functions
but until that time, they will have to be 2D (horizontal) only functions. Add test cases using
`extra_coords` in `grid_coordinates` to generate a third coordinate.

Fixes fatiando#242

* Generalize coordinate systems in BaseGridder docs (fatiando#240)

Make docstrings of `BaseGridder` more general, without specifying
a coordinate system. Add `dims` as a class variable that will be used by
default on `grid`, `scatter` and `profile` methods.

* Add a convex hull masking function (fatiando#237)

Use scipy's Delaunay triangulation to determine points that are outside
the convex hull of the data points. Useful to mask out wildly
extrapolated points from grids. It's easier to use than `distance_mask`
because it doesn't require distance calculations (and the associated
projections). Use the new function in some examples and tutorials.
Normalize the coordinates prior to calculations to avoid errors
with floating point conversion when triangulating (as suggested by 
Ari Hartikainen).

Fixes fatiando#126

* Fix bug with profile distances being unprojected (fatiando#231)

`BaseGridder.profile` has the option to apply a projection to the
coordinates before predicting so we can pass geographic coordinates to
cartesian gridders. In these cases, the distance along the profile is
calculated by `profile_coordinates` with the unprojected coordinates (in
the geographic case it would be degrees). The profile point calculation
is also done assuming that coordinates are Cartesian, which is clearly
wrong if inputs are longitude and latitude. To fix this, project the input
points prior to passing to `profile_coordinates`. This means that the 
distances are Cartesian and generation of profile points is also 
Cartesian (as is assumed by `profile_coordinates`). The generated 
coordinates are projected back so that the user gets longitude and 
latitude but distances are still projected Cartesian. Add a warning in the
docstring of the modified behaviour due to this bug and a section in the
tutorial on projections.

* Update TravisCI configuration to new formats (fatiando#247)

Travis started warning that some options in the config were deprecated.
Update them to the new format to avoid crashing builds later on.

* Add function to project 2D gridded data (fatiando#246)

The `project_grid` function transforms a grid to the given projection. 
It re-samples the data using `ScipyGridder` (by default) and runs a 
blocked mean (optional) to avoid aliasing when the points aren't evenly 
distributed in the projected coordinates (like in polar projections). 
Always applies a `convexhull_mask` to the grid to avoid extrapolation to 
points that had no original data (this is already done by scipy for linear 
and cubic interpolation). The function only works for 
`xarray.DataArray` (not `Dataset`) because I couldn't find an easy way 
to guarantee that all variables have the same dimensions. Move the new 
function and `project_region` to the new module `verde/projections.py`.

* Changelog entry for v1.4.0 (fatiando#248)

New minor release with new features and bug fixes. No breaking changes
or deprecations.

* Fix broken TravisCI deploy (fatiando#250)

The docs and configuration validation tool had recommended using
`cleanup: false` instead of `skip_cleanup: true` when deploying. Turns
out Travis decided to ignore the new parameter and our Github Pages
deploy was broken since the docs were deleted before we could deploy.
Revert back to the old argument until Travis really removes it.

* Add a block version of ShuffleSplit cross-validator (fatiando#251)

Cross-validation of spatial data with random splits can often lead to 
overestimation of accuracy (see https://doi.org/10.1111/ecog.02881 
for a nice overview). To account for this, the splits can be done by 
spatial blocks. In this case, the data are partitioned into blocks and 
blocks are split randomly. That guarantees some independence of 
measurements between blocks assigned to test/train sets. This 
change adds the `BlockShuffleSplit` cross-validator which does the 
random splitting of blocks in a scikit-learn compatible class. It can be 
used with `verde.cross_val_score` like any other scikit-learn cross-
validator. In fact, it can be used entirely outside of Verde for any type 
of machine learning on spatial data. The class takes care of balancing 
the random splits so that the right amount of train/test data are 
included in each (since blocks can have wildly different numbers of 
points).
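
A short usage sketch of this cross-validator (the synthetic data and parameter values below are only illustrative):

```python
import numpy as np
import verde as vd

# Synthetic scattered data for illustration.
rng = np.random.default_rng(seed=0)
coordinates = (rng.uniform(0, 10, 500), rng.uniform(-5, 5, 500))
data = np.sin(coordinates[0]) + np.cos(coordinates[1])

# Shuffle-split by spatial blocks instead of individual points.
cv = vd.BlockShuffleSplit(spacing=1.0, n_splits=5, test_size=0.2, random_state=0)
scores = vd.cross_val_score(vd.Spline(), coordinates, data, cv=cv)

# Outside of Verde, the class behaves like a scikit-learn splitter operating
# on an (n_samples, 2) array of easting/northing coordinates:
feature_matrix = np.column_stack(coordinates)
for train_index, test_index in cv.split(feature_matrix):
    pass  # train_index/test_index are integer index arrays for each split
```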

* Add optional argument to least_squares to copy Jacobian matrix (fatiando#255)

Add a new copy_jacobian optional argument to least_squares to avoid
scaling the Jacobian matrix in place. Add a warning in its docstring
about memory consumption when copy_jacobian is True. Add a test function
for this new feature.

* Update files from fatiando/contributing (fatiando#256)

Changes have been made in https://github.com/fatiando/contributing to:
MAINTENANCE.md
Update the copies in this repository to match.
See fatiando/community@340dafd

* Add a blocked option for train_test_split (fatiando#253)

Add a `blocked=False` option to `verde.train_test_split` to split along 
blocks instead of randomly. Uses `BlockShuffleSplit` if `True` instead 
of the scikit-learn `ShuffleSplit`. Add a gallery example comparing 
the two options.

* Fix typo in README contributing section (fatiando#258)

Equality -> Equally

* Pin pylint to 2.4.* (fatiando#259)

Version 2.5.0 introduced a bug with the multiprocessing option. This
happens often enough that we should have pylint pinned and only upgrade
when we choose to do it.

* Update tests to latest release of Scikit Learn (fatiando#267)

Change the expected representation of any gridder based on the new
default behaviour of Scikit Learn. After 0.23, Scikit Learn only shows
those parameters whose default value has been changed when giving
a string representation of the gridder. See
scikit-learn/scikit-learn#17061 and its changelog for more information.

* Gridders apply projections using only the first two coordinates (fatiando#264)

Create a private function used by BaseGridder that applies projections only on the first two
elements of the coordinates, but returns the whole set of coordinates, replacing the first two with their projected versions. Add a test function that passes extra_coords to include more than two coordinates when predicting through the grid, scatter and profile methods.

* Remove "blocked" argument from train_test_split (fatiando#257)

It was a bit redundant since we have to specify `shape` or `spacing`
along with it. So we can just ask for `spacing` or `shape` and assume
`blocked=True` if they are provided. Also fix the missing description in
the gallery example and make slight changes to its wording.

Co-authored-by: Santiago Soler <[email protected]>
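
After this change, a blocked split is requested simply by passing `spacing` (or `shape`), while a plain random split remains the default. A hedged usage sketch with made-up data:

```python
import numpy as np
import verde as vd

rng = np.random.default_rng(seed=0)
coordinates = (rng.uniform(0, 10, 500), rng.uniform(-5, 5, 500))
data = np.sin(coordinates[0])

# Random split of individual points (scikit-learn style).
train, test = vd.train_test_split(coordinates, data, random_state=0)
# Blocked split: providing spacing switches to BlockShuffleSplit internally.
train_blocked, test_blocked = vd.train_test_split(
    coordinates, data, spacing=1.0, random_state=0
)
```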

* Add extra_coords on BaseGridder generated objects (fatiando#265)

After predicting through the profile, scatter and grid methods,
BaseGridder returns generated objects like xr.Dataset or pd.DataFrame.
Any extra coordinate passed through the extra_coords keyword argument
is now included on those objects as new columns or non-dimension
coordinates (depending on the generated object).

* Add a blocked K-Fold cross-validator (fatiando#254)

The `BlockKFold` class splits the data into spatial blocks and does
K-fold cross-validation splits on the blocks. Refactor code from 
`BlockShuffleSplit` into a base class that can be shared with the
new class. Folds are balanced by making splits based on the number
of data points in each fold instead of the number of blocks.
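
A brief usage sketch (illustrative data and parameters; fold balancing is on by default, as described above):

```python
import numpy as np
import verde as vd

rng = np.random.default_rng(seed=0)
coordinates = (rng.uniform(0, 10, 500), rng.uniform(-5, 5, 500))
data = np.sin(coordinates[0]) + np.cos(coordinates[1])

# K-fold over spatial blocks; folds are balanced by number of data points.
kfold = vd.BlockKFold(spacing=1.0, n_splits=5, shuffle=True, random_state=0)
scores = vd.cross_val_score(vd.Spline(), coordinates, data, cv=kfold)
```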

* Pin Cartopy to 0.17.* until we sort out problems (fatiando#270)

For some reason, the coast lines features are plotting on the entire
figure instead of only the land or ocean. Might be an error on our side.
For now, pin cartopy on CI to fix our docs.

* Changelog entry for v1.5.0 (fatiando#271)

Includes the new cross-validators and fixes to our base classes to
accommodate Harmonica.

* Allow specifying the scoring function in cross_val_score (fatiando#273)

Currently, `cross_val_score` only uses the estimator's `.score` method (which
is R² for all our gridders). But sometimes we want to use a different metric,
like MSE or MAE. These changes allow passing in any scikit-learn compatible
scorer through a new `scoring` parameter. This can be a string or callable,
like in scikit-learn's `cross_val_score`. This needed score computation to be
delegated to a separate function in `verde.base.utils` so it could be reused in
`cross_val_score`, plus a dummy class to be able to use the scikit-learn scorers correctly.
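
A hedged usage sketch of the new `scoring` parameter (the data is synthetic and the scorer name is just one example of a scikit-learn scorer string):

```python
import numpy as np
import verde as vd

rng = np.random.default_rng(seed=0)
coordinates = (rng.uniform(0, 10, 200), rng.uniform(-5, 5, 200))
data = np.sin(coordinates[0])

# Default scoring uses the estimator's .score (R²); override it with any
# scikit-learn scorer name or callable, e.g. negative mean absolute error.
scores = vd.cross_val_score(
    vd.Spline(), coordinates, data, scoring="neg_mean_absolute_error"
)
```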

* Format the doc/conf.py file with Black (fatiando#275)

Ran black on `doc/conf.py` and added it to the `black` and `flake8` runs
in the main Makefile. Removed the setting of `sys.path` since we're
installing the package in editable mode, so it should be importable without
that hack.

* Require Black >=20.8b1 (fatiando#284)

After version 19, black changed the formatting slightly in a few files.
This updates the format and sets a minimum version of black in the
development environment and the CI.

* Fix Cartopy issue and require >= 0.18 (fatiando#283)

Unpin Cartopy because Cartopy 0.17.0 has a compatibility issue with Matplotlib > 3.3.0.
Remove the unneeded tight_layout that was causing errors.
Replace the filled polygons used for plotting land and ocean with coastlines.

* Fix typos on docstrings (fatiando#281)

Fix a typo in the description of the data_names parameter of the
BaseGridder.grid method. Fix a typo in the description of the
BaseGridder.project_coordinates method.

* Raise errors for invalid rolling windows arguments (fatiando#280)

Raise a specific error if no spacing or shape is passed to
rolling_window(). Raise a specific error if the window size is larger
than the smaller dimension of the region. Add tests for these new
errors that match the error messages to identify which error has been
raised.

* Remove configuration files for unused CI (fatiando#292)

We no longer use Stickler, Codacy, or CodeClimate. So there is no reason
to keep these files around.

* Update files from fatiando/contributing (fatiando#294)

Changes have been made in https://github.com/fatiando/contributing to:
CONTRIBUTING.md
Update the copies in this repository to match.
See fatiando/community@e899f3f

* Add function to convert grid to Dataset (fatiando#282)

Add a function that converts coordinates and data as np.arrays to
xarray.Dataset. This would make handling grids easier, with the option of
efficiently saving them as netCDF files. Make the BaseGridder.grid method use
this function instead of building the xarray.Dataset itself. Add a new
check_data_names function that converts a str into a tuple with a single
str. Make BaseGridder.grid, BaseGridder.scatter and BaseGridder.profile
able to take a single str for the data_names argument. Add test function
for default data_names. Change data_names on all examples and tutorials to
a single str.

* Refactor get_data_names and related check functions (fatiando#295)

Refactor `get_data_names` as a private method of `BaseGridder`: simplify the
`get_data_names` logic and make `get_data_names` use `check_data_names` to
convert a single string to a tuple and to check that `data_names` has the same
number of elements as the data. Add `data_names_defaults` as a class attribute of
`BaseGridder`. Move `check_extra_coords_names` to `verde/base/utils.py` along
with the other check functions. Make `check_data_names` raise an error if
`data_names` is `None`, but have `BaseGridder.get_data_names` return default
values if `data_names` is `None`. Rename the test functions for `make_xarray_grid`.

Co-authored-by: Leonardo Uieda <[email protected]>
Co-authored-by: Fatiando a Terra Bot <[email protected]>
Co-authored-by: Santiago Soler <[email protected]>
Co-authored-by: Rowan Cockett <[email protected]>