Add a blocked K-Fold cross-validator #254

Merged: 9 commits merged into master from the blockkfold branch on Jun 4, 2020

Conversation

leouieda
Member

The `BlockKFold` class splits the data into spatial blocks and does
K-fold cross-validation splits on the blocks. Unlike `BlockShuffleSplit`,
it's difficult to balance the folds to have the same amount of data
(since blocks can have different numbers of points), so each fold will
have the same number of blocks but not necessarily the same amount of
data. Refactor code from `BlockShuffleSplit` into a base class that can
be shared with the new class.

Reminders:

  • Run make format and make check to make sure the code follows the style guide.
  • Add tests for new features or tests that would have caught the bug that you're fixing.
  • Add new public functions/methods/classes to doc/api/index.rst and the base __init__.py file for the package.
  • Write detailed docstrings for all functions/classes/methods. It often helps to design better code if you write the docstrings first.
  • If adding new functionality, add an example to the docstring, gallery, and/or tutorials.
  • Add your full name, affiliation, and ORCID (optional) to the AUTHORS.md file (if you haven't already) in case you'd like to be listed as an author on the Zenodo archive of the next release.

@leouieda
Member Author

Here is a follow-up to #251 adding a blocked K-fold cross-validator. I couldn't think of a good way to do the balancing on this one. Any ideas?

@jessepisel
Contributor

That's a tough one: getting the number of points balanced while keeping the block sizes consistent from fold to fold. Is there any way it could use some kind of nearest-neighbor search and adjust the size of the blocks for each fold, ensuring the number of points is the same? Or some kind of weighting scheme for blocks? That could be a rabbit hole itself...

@leouieda
Member Author

> That could be a rabbit hole itself...

It certainly is.

I've been trying to think of a way to make sure the folds have close to the same number of data points without interfering with the block size. That would likely mean that the block order wouldn't be random, but would be such that folds are balanced. That could be a good extra option, so you could have a shuffle mode and a balanced mode.

@santisoler
Member

Nice feature for Verde!
I only have one question regarding the `X` array-like argument. Why not ask for `coordinates` and rearrange them inside the `split` method? (I've just seen this is also present in `BlockShuffleSplit`, added in #251, although I hadn't spotted it there.)

@leouieda
Member Author

@santisoler that was how I had implemented them originally. The problem is that they are not compatible with their scikit-learn counterparts in that case, which meant that `cross_val_score` needed to know which type of class it was getting: a scikit-learn one or a Verde one. It would also mean that people doing machine learning on spatial data wouldn't be able to use our classes outside of Verde.

The memory copy when generating `X` is not ideal, but I thought it was a good trade-off for making these classes generally applicable.

@leouieda
Member Author

I think I might have found a way to always balance the folds, even without shuffling.

  1. Make a list of the block sizes.
  2. Split this list into N partitions of approximately equal sum. The split preserves the order of the blocks.
  3. Do the k-fold on these N partitions instead of the actual blocks. Basically do `KFold(n_splits=N).split(range(N))`.

The number of blocks per partition will vary depending on how much data is in each block. The folds will not be shuffled since the partition split preserves the order of the blocks. To make this work with shuffling, we can shuffle the blocks before doing the partitioning.

Splitting the list is the challenging part but I found this solution: https://stackoverflow.com/a/54024280

It could be implemented as a `balanced` argument that does the partitioning if `True` or doesn't if `False`. Probably best to default to `True`.
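
For illustration, here is a minimal NumPy sketch of the partition-by-sum idea, following the approach in the linked StackOverflow answer. The function name `partition_by_sum` and the example values are editorial; the code that ended up in Verde may differ in detail.

```python
import numpy as np


def partition_by_sum(values, parts):
    """Indices that split *values* into *parts* pieces of roughly equal sum."""
    cumulative = np.cumsum(values)
    total = cumulative[-1]
    # Ideal cumulative sum at the end of each of the first parts - 1 pieces.
    targets = np.arange(1, parts) * total / parts
    # Closest break points in the cumulative sum to those ideal targets.
    indices = np.searchsorted(cumulative, targets, side="right")
    # Fail loudly if the split would leave an empty piece (more likely when
    # the number of parts is close to the number of elements).
    if np.unique(indices).size < indices.size or np.any(indices == 0):
        raise ValueError("Cannot partition values into parts of equal sum.")
    return indices


# 8 blocks with different numbers of points, split into 3 folds.
block_sizes = np.array([5, 3, 4, 6, 2, 4, 5, 7])
split_points = partition_by_sum(block_sizes, parts=3)
folds = np.split(np.arange(block_sizes.size), split_points)
# folds -> [array([0, 1, 2]), array([3, 4, 5]), array([6, 7])]
# Each fold gets 12 data points even though the number of blocks differs.
```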

@jessepisel
Contributor

So the larger blocks are all the same size, but the sub-chunks have varying sizes that contain equal numbers of points? Or do I have it backwards: vary the large blocks but keep the sub-chunks the same size? I think the first option would work because you could hold the number of samples the same for each fold.

@leouieda
Member Author

This got a bit confusing in terms of terminology:

  • Block: spatial block defined by `block_split` (a regular mesh). All blocks have the same size but different amounts of data. All blocks have at least one data point.
  • Fold: partition of "data" in KFold CV. One fold is used for testing, the rest are used for training. Each data point can only be assigned to one fold (no repetition).

The current implementation does this:

  1. Split the data into blocks.
  2. Optionally shuffle the blocks.
  3. Partition the blocks evenly into folds (each fold has the same number of blocks).
  4. Assign data to folds based on which blocks are in each fold (if fold 1 contains blocks 3 and 4, the data in those blocks is assigned to fold 1).

What I'm proposing is:

  1. Split the data into blocks.
  2. Optionally shuffle the blocks.
  3. Make a list with the number of data points in each block (block_n_data).
  4. Split the blocks into folds in a way that each fold has the same number of data points. To do this, we can find the partition points by splitting the list `block_n_data` into parts of near-equal sum. This tells us which blocks belong in each fold (a code sketch of these steps follows right after this list).
  5. Assign data to folds based on which blocks are in each fold.
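
A rough, self-contained sketch of these five steps (toy inputs; `block_labels` here is a hypothetical stand-in for whatever block membership a `block_split`-style binning would produce):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_folds = 3

# Hypothetical input: one spatial block label per data point (20 points, 6 blocks).
block_labels = rng.integers(0, 6, size=20)

# 1-2. List the unique blocks and optionally shuffle them.
blocks = rng.permutation(np.unique(block_labels))
# 3. Number of data points in each block, in the (shuffled) block order.
block_n_data = np.array([np.count_nonzero(block_labels == block) for block in blocks])
# 4. Partition the blocks so that each fold has a near-equal number of points,
#    by splitting the cumulative sum at near-equal targets.
cumulative = np.cumsum(block_n_data)
targets = np.arange(1, n_folds) * cumulative[-1] / n_folds
split_points = np.searchsorted(cumulative, targets, side="right")
folds_of_blocks = np.split(blocks, split_points)
# 5. Assign data point indices to folds based on block membership.
folds = [
    np.where(np.isin(block_labels, fold_blocks))[0] for fold_blocks in folds_of_blocks
]
```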

@jessepisel
Contributor

Thanks for the clarification. What you are proposing makes sense: the equal number of points for each fold seems reasonable, and, like you said, it will help with understanding which blocks are in which fold.

The function finds the indices where an array should be split to have n
parts of approximately equal sum. Sometimes this can't be done (there
would be empty partitions) so raise an exception when that happens. It's
more likely to happen when the number of parts is close to the number of
elements.
@leouieda
Member Author

leouieda commented May 1, 2020

@jessepisel I just coded that method and it does work pretty well! See the gallery example comparing balanced and unbalanced results:

[Image: gallery example comparing balanced and unbalanced folds]

There were a few tricks involved since the method fails if the number of splits is too close to the number of blocks.

@santisoler and @jessepisel would either of you mind doing another round of reviews? Thank you for all the input!

@jessepisel
Contributor

@leouieda I like this visual. It turned out really nice. I will look through the changes this week and see what I can find.

@leouieda
Member Author

leouieda commented Jun 4, 2020

Merging this in since all checks pass.

@jessepisel sorry to fast-track this, but I wanted to get it in for Transform next week. I think the API is settled, so any changes that may be needed can be handled easily in a future release. Thanks for all your input!

@leouieda leouieda merged commit 263b29a into master Jun 4, 2020
@leouieda leouieda deleted the blockkfold branch June 4, 2020 15:31
@jessepisel
Contributor

No worries, I apologize for not being more prompt.

@leouieda
Member Author

leouieda commented Jun 4, 2020

I'm the last person you should apologize to for that 😉

jessepisel added a commit to jessepisel/verde that referenced this pull request Oct 28, 2020
* Bug fix in grid_to_table for wrong order of Dataset coords (fatiando#229)

We were relying on the order of the coords attribute of the data set to meshgrid 
the coordinates. This is wrong because xarray takes the coordinate order from 
dims instead, which is what we should also do. Took this chance to refactor the 
code and included a test that would have caught the bug.

* Update files from fatiando/contributing (fatiando#233)

Changes have been made in https://github.com/fatiando/contributing to:
MAINTENANCE.md
Update the copies in this repository to match.
See fatiando/community@91699a0

* Use newer Mac on Azure Pipelines (fatiando#234)

Changed from 10.13 (High Sierra) to 10.15 (Catalina) because Azure will
remove the 10.13 VMs soon.

* Allow grid_to_table to take DataArrays (fatiando#235)

Previously would fail because we rely on getting the data variable name
from the Dataset. Need to check if it's a Dataset (has the `data_vars`
attribute) and set the data column name to "scalars" if it's not.

Fixes fatiando#225

* Add function for rolling windows on irregular data (fatiando#236)

The `rolling_window` function returns the indices of points falling on
rolling/moving windows. It works for irregularly sampled data and is a
nice complement to `xarray.DataArray.rolling`  (which only works on
grids). Uses a KDTree to select points but is not compatible with
pykdtree (it's missing the method we need).

* Add function to split data on an expanding window (fatiando#238)

For selecting data in windows of various size around a center point.
Sizes do not have to be increasing necessarily. Returns only the indices
of points falling inside each window.

* Explicitly set PR trigger for all branches on Azure (fatiando#244)

For some reason, Azure Pipelines CI wasn't starting up PRs, only on
master and tags. Explicitly setting a trigger for PRs against any branch
seems to work.

* Fix bug for 3+ coordinates in windowing functions (fatiando#243)

The `rolling_window` and `expanding_window` functions fail if given more than 2 coordinates.
If we had full support for N-D coordinates, they could be N-D windowing functions
but until that time, they will have to be 2D (horizontal) only functions. Add test cases using
`extra_coords` in `grid_coordinates` to generate a third coordinate.

Fixes fatiando#242

* Generalize coordinate systems in BaseGridder docs (fatiando#240)

Make docstrings of `BaseGridder` more general, without specifying
a coordinate system. Add `dims` as a class variable that will be used by
default on `grid`, `scatter` and `profile` methods.

* Add a convex hull masking function (fatiando#237)

Use scipy's Delaunay triangulation to determine points that are outside
the convex hull of the data points. Useful to mask out wildly
extrapolated points from grids. It's easier to use than `distance_mask`
because it doesn't require distance calculations (and the associated
projections). Use the new function in some examples and tutorials.
Normalize the coordinates prior to calculations to avoid errors
with floating point conversion when triangulating (as suggested by 
Ari Hartikainen).

Fixes fatiando#126

* Fix bug with profile distances being unprojected (fatiando#231)

`BaseGridder.profile` has the option to apply a projection to the
coordinates before predicting so we can pass geographic coordinates to
cartesian gridders. In these cases, the distance along the profile is
calculated by `profile_coordinates` with the unprojected coordinates (in
the geographic case it would be degrees). The profile point calculation
is also done assuming that coordinates are Cartesian, which is clearly
wrong if inputs are longitude and latitude. To fix this, project the input
points prior to passing to `profile_coordinates`. This means that the 
distances are Cartesian and generation of profile points is also 
Cartesian (as is assumed by `profile_coordinates`). The generated 
coordinates are projected back so that the user gets longitude and 
latitude but distances are still projected Cartesian. Add a warning in the
docstring of the modified behaviour due to this bug and a section in the
tutorial on projections.

* Update TravisCI configuration to new formats (fatiando#247)

Travis started warning that some options in the config were deprecated.
Update them to the new format to avoid crashing builds later on.

* Add function to project 2D gridded data (fatiando#246)

The `project_grid` function transforms a grid to the given projection. 
It re-samples the data using `ScipyGridder` (by default) and runs a 
blocked mean (optional) to avoid aliasing when the points aren't evenly 
distributed in the projected coordinates (like in polar projections). 
Always applies a `convexhull_mask` to the grid to avoid extrapolation to 
points that had no original data (this is already done by scipy for linear 
and cubic interpolation). The function only works for 
`xarray.DataArray` (not `Dataset`) because I couldn't find an easy way 
to guarantee that all variables have the same dimensions. Move the new 
function and `project_region` to the new module `verde/projections.py`.

* Changelog entry for v1.4.0 (fatiando#248)

New minor release with new features and bug fixes. No breaking changes
or deprecations.

* Fix broken TravisCI deploy (fatiando#250)

The docs and configuration validation tool had recommended using
`cleanup: false` instead of `skip_cleanup: true` when deploying. Turns
out Travis decided to ignore the new parameter and our Github Pages
deploy was broken since the docs were deleted before we could deploy.
Revert back to the old argument until Travis really removes it.

* Add a block version of ShuffleSplit cross-validator (fatiando#251)

Cross-validation of spatial data with random splits can often lead to 
overestimation of accuracy (see https://doi.org/10.1111/ecog.02881 
for a nice overview). To account for this, the splits can be done by 
spatial blocks. In this case, the data are partitioned into blocks and 
blocks are split randomly. That guarantees some independence of 
measurements between blocks assigned to test/train sets. This 
change adds the `BlockShuffleSplit` cross-validator which does the 
random splitting of blocks in a scikit-learn compatible class. It can be 
used with `verde.cross_val_score` like any other scikit-learn cross-
validator. In fact, it can be used entirely outside of Verde for any type 
of machine learning on spatial data. The class takes care of balancing 
the random splits so that the right amount of train/test data are 
included in each (since blocks can have wildly different numbers of 
points).
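
A short usage sketch of this cross-validator (the synthetic data and parameter values below are only illustrative):

```python
import numpy as np
import verde as vd

# Synthetic scattered data for illustration.
rng = np.random.default_rng(seed=0)
coordinates = (rng.uniform(0, 10, 500), rng.uniform(-5, 5, 500))
data = np.sin(coordinates[0]) + np.cos(coordinates[1])

# Shuffle-split by spatial blocks instead of individual points.
cv = vd.BlockShuffleSplit(spacing=1.0, n_splits=5, test_size=0.2, random_state=0)
scores = vd.cross_val_score(vd.Spline(), coordinates, data, cv=cv)

# Outside of Verde, the class behaves like a scikit-learn splitter operating
# on an (n_samples, 2) array of easting/northing coordinates:
feature_matrix = np.column_stack(coordinates)
for train_index, test_index in cv.split(feature_matrix):
    pass  # train_index/test_index are integer index arrays for each split
```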

* Add optional argument to least_squares to copy Jacobian matrix (fatiando#255)

Add a new copy_jacobian optional argument to least_squares to avoid
scaling the Jacobian matrix in place. Add a warning in its docstring
about memory consumption when copy_jacobian is True. Add a test function
for this new feature.

* Update files from fatiando/contributing (fatiando#256)

Changes have been made in https://github.com/fatiando/contributing to:
MAINTENANCE.md
Update the copies in this repository to match.
See fatiando/community@340dafd

* Add a blocked option for train_test_split (fatiando#253)

Add a `blocked=False` option to `verde.train_test_split` to split along 
blocks instead of randomly. Uses `BlockShuffleSplit` if `True` instead 
of the scikit-learn `ShuffleSplit`. Add a gallery example comparing 
the two options.

* Fix typo in README contributing section (fatiando#258)

Equality -> Equally

* Pin pylint to 2.4.* (fatiando#259)

Version 2.5.0 introduced a bug with the multiprocessing option. This
happens often enough that we should have pylint pinned and only upgrade
when we choose to do it.

* Update tests to latest release of Scikit Learn (fatiando#267)

Change the expected representation of any gridder based on the new
default behaviour of Scikit Learn. After 0.23, Scikit Learn only shows
those parameters whose default value has been changed when giving
a string representation of the gridder. See
scikit-learn/scikit-learn#17061 and its changelog for more information.

* Gridders apply projections using only the first two coordinates (fatiando#264)

Create a private function used by BaseGridder that applies projections only on the first two
elements of the coordinates, but returns the whole set of coordinates, replacing the first two with their projected versions. Add a test function that passes extra_coords to include more than two coordinates when predicting through the grid, scatter and profile methods.

* Remove "blocked" argument from train_test_split (fatiando#257)

It was a bit redundant since we have to specify `shape` or `spacing`
along with it. So we can just ask for `spacing` or `shape` and assume
`blocked=True` if they are provided. Also fix the missing description in
the gallery example and make slight changes to its wording.

Co-authored-by: Santiago Soler <[email protected]>
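
After this change, a blocked split is requested simply by passing `spacing` (or `shape`), while a plain random split remains the default. A hedged usage sketch with made-up data:

```python
import numpy as np
import verde as vd

rng = np.random.default_rng(seed=0)
coordinates = (rng.uniform(0, 10, 500), rng.uniform(-5, 5, 500))
data = np.sin(coordinates[0])

# Random split of individual points (scikit-learn style).
train, test = vd.train_test_split(coordinates, data, random_state=0)
# Blocked split: providing spacing switches to BlockShuffleSplit internally.
train_blocked, test_blocked = vd.train_test_split(
    coordinates, data, spacing=1.0, random_state=0
)
```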

* Add extra_coords on BaseGridder generated objects (fatiando#265)

After predicting through the profile, scatter and grid methods,
BaseGridder returns generated objects like xr.Dataset or pd.DataFrame.
Any extra coordinate passed through the extra_coords keyword argument
is now included on those objects as new columns or non-dimension
coordinates (depending on the generated object).

* Add a blocked K-Fold cross-validator (fatiando#254)

The `BlockKFold` class splits the data into spatial blocks and does
K-fold cross-validation splits on the blocks. Refactor code from 
`BlockShuffleSplit` into a base class that can be shared with the
new class. Folds are balanced by making splits based on the number
of data points in each fold instead of the number of blocks.
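
A brief usage sketch (illustrative data and parameters; fold balancing is on by default, as described above):

```python
import numpy as np
import verde as vd

rng = np.random.default_rng(seed=0)
coordinates = (rng.uniform(0, 10, 500), rng.uniform(-5, 5, 500))
data = np.sin(coordinates[0]) + np.cos(coordinates[1])

# K-fold over spatial blocks; folds are balanced by number of data points.
kfold = vd.BlockKFold(spacing=1.0, n_splits=5, shuffle=True, random_state=0)
scores = vd.cross_val_score(vd.Spline(), coordinates, data, cv=kfold)
```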

* Pin Cartopy to 0.17.* until we sort out problems (fatiando#270)

For some reason, the coast lines features are plotting on the entire
figure instead of only the land or ocean. Might be an error on our side.
For now, pin cartopy on CI to fix our docs.

* Changelog entry for v1.5.0 (fatiando#271)

Includes the new cross-validators and fixes to our base classes to
accommodate Harmonica.

* Allow specifying the scoring function in cross_val_score (fatiando#273)

Currently, `cross_val_score` only uses the estimator's `.score` method (which
is R² for all our gridders). But sometimes we want to use a different metric,
like MSE or MAE. These changes allow passing in any scikit-learn compatible
scorer through a new `scoring` parameter. This can be a string or callable,
like in scikit-learn's `cross_val_score`. This needed score computation to be
delegated to a separate function in `verde.base.utils` so it could be reused in
`cross_val_score`, plus a dummy class to be able to use the scikit-learn scorers correctly.
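
A hedged usage sketch of the new `scoring` parameter (the data is synthetic and the scorer name is just one example of a scikit-learn scorer string):

```python
import numpy as np
import verde as vd

rng = np.random.default_rng(seed=0)
coordinates = (rng.uniform(0, 10, 200), rng.uniform(-5, 5, 200))
data = np.sin(coordinates[0])

# Default scoring uses the estimator's .score (R²); override it with any
# scikit-learn scorer name or callable, e.g. negative mean absolute error.
scores = vd.cross_val_score(
    vd.Spline(), coordinates, data, scoring="neg_mean_absolute_error"
)
```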

* Format the doc/conf.py file with Black (fatiando#275)

Ran black on `doc/conf.py` and added it to the `black` and `flake8` runs
in the main Makefile. Removed the setting of `sys.path` since we're
installing the package in editable mode, so it should be importable without
that hack.

* Require Black >=20.8b1 (fatiando#284)

After version 19, black changed the formatting slightly in a few files.
This updates the format and sets a minimum version of black in the
development environment and the CI.

* Fix Cartopy issue and require >= 0.18 (fatiando#283)

Unpin Cartopy because Cartopy 0.17.0 has a compatibility issue with Matplotlib > 3.3.0.
Remove the unneeded tight_layout that was causing errors.
Replace the filled polygons used for plotting land and ocean with coastlines.

* Fix typos on docstrings (fatiando#281)

Fix a typo in the description of the data_names parameter of the
BaseGridder.grid method. Fix a typo in the description of the
BaseGridder.project_coordinates method.

* Raise errors for invalid rolling windows arguments (fatiando#280)

Raise a specific error if no spacing or shape is passed to
rolling_window(). Raise a specific error if the window size is larger
than the smaller dimension of the region. Add tests for these new
errors that match the error messages to identify which error has been
raised.

* Remove configuration files for unused CI (fatiando#292)

We no longer use Stickler, Codacy, or CodeClimate. So there is no reason
to keep these files around.

* Update files from fatiando/contributing (fatiando#294)

Changes have been made in https://github.com/fatiando/contributing to:
CONTRIBUTING.md
Update the copies in this repository to match.
See fatiando/community@e899f3f

* Add function to convert grid to Dataset (fatiando#282)

Add a function that converts coordinates and data as np.arrays to
xarray.Dataset. This would make handling grids easier, with the option of
efficiently saving them as netCDF files. Make the BaseGridder.grid method use
this function instead of building the xarray.Dataset itself. Add a new
check_data_names function that converts a str into a tuple with a single
str. Make BaseGridder.grid, BaseGridder.scatter and BaseGridder.profile
able to take a single str for the data_names argument. Add test function
for default data_names. Change data_names on all examples and tutorials to
a single str.

* Refactor get_data_names and related check functions (fatiando#295)

Refactor `get_data_names` as a private method of `BaseGridder`: simplify the
`get_data_names` logic and make `get_data_names` use `check_data_names` to
convert a single string to a tuple and to check that `data_names` has the same
number of elements as the data. Add `data_names_defaults` as a class attribute of
`BaseGridder`. Move `check_extra_coords_names` to `verde/base/utils.py` along
with the other check functions. Make `check_data_names` raise an error if
`data_names` is `None`, but have `BaseGridder.get_data_names` return default
values if `data_names` is `None`. Rename the test functions for `make_xarray_grid`.

Co-authored-by: Leonardo Uieda <[email protected]>
Co-authored-by: Fatiando a Terra Bot <[email protected]>
Co-authored-by: Santiago Soler <[email protected]>
Co-authored-by: Rowan Cockett <[email protected]>