---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  fig.width = 7,
  fig.height = 5,
  dpi = 300
)
```
# `ppsr` - Predictive Power Score
<!-- badges: start -->
[![R-CMD-check](https://github.com/paulvanderlaken/ppsr/workflows/R-CMD-check/badge.svg)](https://github.com/paulvanderlaken/ppsr/actions)
[![CRAN status](https://www.r-pkg.org/badges/version/ppsr)](https://cran.r-project.org/package=ppsr)
[![CRAN_Downloads_Total](http://cranlogs.r-pkg.org/badges/grand-total/ppsr)](https://cran.r-project.org/package=ppsr)
<!-- badges: end -->
`ppsr` is the R implementation of the **Predictive Power Score** (PPS).
The PPS is an asymmetric, data-type-agnostic score that can detect linear or
non-linear relationships between two variables.
The score ranges from 0 (no predictive power) to 1 (perfect predictive power).
The general concept of PPS is useful for data exploration purposes,
in the same way correlation analysis is.
You can read more about the (dis)advantages of using PPS in [this blog post](https://towardsdatascience.com/rip-correlation-introducing-the-predictive-power-score-3d90808b9598).
## Installation
You can install the latest stable version of `ppsr` from CRAN:
```{r, eval = FALSE, echo = TRUE}
install.packages('ppsr')
```
Not all recent features and bug fixes may be included in the CRAN release.
Instead, you might want to install the most recent development version of `ppsr` from GitHub:
```{r, eval = FALSE, echo = TRUE}
# install.packages('devtools') # Install devtools if needed
devtools::install_github('paulvanderlaken/ppsr')
```
## Computing PPS
**PPS** represents a **framework for evaluating predictive validity**.
There is _not one single way_ of computing a predictive power score, but rather there are _many different ways_.
You can select different machine learning algorithms, their associated parameters,
cross-validation schemes, and/or model evaluation metrics.
Each of these design decisions will affect your model's predictive performance and,
in turn, affect the resulting predictive power score you compute.
Hence, you can compute many different PPSs for any given predictor and target variable.
For example, the PPS computed with a _decision tree_ regression model...
```{r}
ppsr::score(iris, x = 'Sepal.Length', y = 'Petal.Length', algorithm = 'tree')[['pps']]
```
...will differ from the PPS computed with a _simple linear regression_ model.
```{r}
ppsr::score(iris, x = 'Sepal.Length', y = 'Petal.Length', algorithm = 'glm')[['pps']]
```
## Usage
The `ppsr` package has four main functions to compute PPS:
* `score()` computes an x-y PPS
* `score_predictors()` computes all X-y PPS
* `score_df()` computes all X-Y PPS
* `score_matrix()` computes all X-Y PPS, and shows them in a matrix
where `x` and `y` represent an individual predictor/target,
and `X` and `Y` represent all predictors/targets in a given dataset.
### Examples:
`score()` computes the PPS for a single target and predictor
```{r}
ppsr::score(iris, x = 'Sepal.Length', y = 'Petal.Length')
```
`score_predictors()` computes all PPSs for a single target using all predictors in a dataframe
```{r}
ppsr::score_predictors(df = iris, y = 'Species')
```
`score_df()` computes all PPSs for every target-predictor combination in a dataframe
```{r}
ppsr::score_df(df = iris)
```
`score_matrix()` computes all PPSs for every target-predictor combination in a dataframe,
but returns only the scores, arranged in a neat matrix like the familiar correlation matrix
```{r}
ppsr::score_matrix(df = iris)
```
Currently, the `ppsr` package computes PPS by default using...
* the default decision tree implementation of the `rpart` package, wrapped by `parsnip`
* *weighted F1* scores to evaluate classification models, and _MAE_ to evaluate regression models
* 5-fold cross-validation
You can call the `available_algorithms()` and `available_evaluation_metrics()` functions
to see what alternative settings are supported.
Note that the calculated PPS reflects the **out-of-sample** predictive validity
when more than one cross-validation fold is used.
If you prefer to look at in-sample scores, you can set `cv_folds = 1`.
Note that in such cases overfitting can become an issue, particularly with the more flexible algorithms.
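For instance, a minimal sketch of inspecting the supported settings and computing an in-sample score could look as follows (assuming `score()` accepts the `algorithm` and `cv_folds` arguments as described above):
```{r, eval = FALSE}
# List the algorithms and evaluation metrics that ppsr currently supports
ppsr::available_algorithms()
ppsr::available_evaluation_metrics()

# In-sample PPS: a single fold means no train-test split is made
ppsr::score(iris, x = 'Sepal.Length', y = 'Petal.Length',
            algorithm = 'glm', cv_folds = 1)[['pps']]
```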
## Visualizing PPS
In addition, three main functions wrap around these computational
functions to help you visualize your PPS using `ggplot2`:
* `visualize_pps()` produces a barplot of all X-y PPS, or a heatmap of all X-Y PPS
* `visualize_correlations()` produces a heatmap of all X-Y correlations
* `visualize_both()` produces the two heatmaps of all X-Y PPS and correlations side-by-side
### Examples:
If you specify a target variable (`y`) in `visualize_pps()`, you get a barplot of its predictors.
```{r, PPS-barplot}
ppsr::visualize_pps(df = iris, y = 'Species')
```
If you do not specify a target variable in `visualize_pps()`, you get the PPS matrix visualized as a heatmap.
```{r, PPS-heatmap}
ppsr::visualize_pps(df = iris)
```
Some users might find it useful to look at a correlation matrix for comparison.
```{r, correlation-heatmap}
ppsr::visualize_correlations(df = iris)
```
With `visualize_both()` you generate the PPS and correlation matrices side by side, for easy comparison.
```{r, sbs-heatmap, fig.width=14}
ppsr::visualize_both(df = iris)
```
You can change the colors of the visualizations using the functions' arguments.
There are also arguments to change the color of the text scores.
Furthermore, the functions return `ggplot2` objects, so you can easily change the theme and other settings.
```{r, custom-plot}
ppsr::visualize_pps(df = iris,
                    color_value_high = 'red',
                    color_value_low = 'yellow',
                    color_text = 'black') +
  ggplot2::theme_classic() +
  ggplot2::theme(plot.background = ggplot2::element_rect(fill = "lightgrey")) +
  ggplot2::theme(title = ggplot2::element_text(size = 15)) +
  ggplot2::labs(title = 'Add your own title',
                subtitle = 'Maybe an informative subtitle',
                caption = 'Did you know ggplot2 includes captions?',
                x = 'You could call this\nthe independent variable\nas well')
```
## Parallelization
The number of predictive models that one needs to build in order to fill
the PPS matrix of a dataframe grows quadratically with the number of columns
in that dataframe, as every column is paired with every other column.
For traditional correlation analyses, this is not a problem.
Yet, with more computation-intensive algorithms, with many train-test splits,
and with large or high-dimensional datasets, it can take a decent amount of time
to build all the predictive models and derive their PPSs.
One way to speed matters up is to use the `ppsr::score_predictors()` function and
focus on predicting only the target/dependent variable you are most interested in.
Yet, since version `0.0.1`, all `ppsr::score_*` and `ppsr::visualize_*` functions
take two arguments that facilitate parallel computing.
You can parallelize `ppsr`'s computations by setting the `do_parallel` argument to `TRUE`.
If you do, a cluster will be created using the `parallel` package.
By default, this cluster uses the maximum number of cores (see `parallel::detectCores()`) minus 1.
However, you can use the second argument, `n_cores`, to manually specify the number of cores you want `ppsr` to use.
### Examples:
```{r, eval = FALSE}
ppsr::score_df(df = mtcars, do_parallel = TRUE)
```
```{r, eval = FALSE}
ppsr::visualize_pps(df = iris, do_parallel = TRUE, n_cores = 2)
```
## Interpreting PPS
The PPS is a **normalized score** that ranges from 0 (no predictive power)
to 1 (perfect predictive power).
The normalization occurs by comparing how well we are able to predict the values of
a _target_ variable (`y`) using the values of a _predictor_ variable (`x`),
relative to two **benchmarks**: a perfect prediction and a naive prediction.
The **perfect prediction** can be theoretically derived.
A perfect regression model produces no error (=0.0),
whereas a perfect classification model results in 100% accuracy, recall, et cetera (=1.0).
The **naive prediction** is derived empirically.
A naive _regression_ model is simulated by predicting the mean `y` value for all observations.
This is similar to how R-squared is calculated.
A naive _classification_ model is simulated by taking the best among two models:
one predicting the modal `y` class, and one predicting random `y` classes for all observations.
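To make these naive benchmarks concrete, here is a rough, simplified sketch (illustrative only, not `ppsr`'s internal code):
```{r, eval = FALSE}
# Naive regression benchmark: predict the mean y for every observation
naive_regression_pred <- rep(mean(iris$Petal.Length), nrow(iris))

# Naive classification benchmark: the better of (a) always predicting the
# modal class and (b) predicting random classes
modal_class       <- names(which.max(table(iris$Species)))
naive_modal_pred  <- rep(modal_class, nrow(iris))
naive_random_pred <- sample(levels(iris$Species), nrow(iris), replace = TRUE)
```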
Whenever we train an "informed" model to predict `y` using `x`,
we can assess how well it performs by comparing it to these two benchmarks.
Suppose we train a regression model, and its mean absolute error (MAE) is 0.10.
Suppose the naive model resulted in an MAE of 0.40.
We know the perfect model would produce no error, which means an MAE of 0.0.
With these three scores, we can normalize the performance of our informed regression model
by interpolating its score between the perfect and the naive benchmarks.
In this case, our model's performance lies about 1/4<sup>th</sup> of the way from the perfect model,
and 3/4<sup>ths</sup> of the way from the naive model.
In other words, our model's predictive power score is 75%: it produced 75% less error than the naive baseline, and was only 25% short of perfect predictions.
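The arithmetic behind that normalization can be sketched as follows (illustrative only, not `ppsr`'s internal code):
```{r, eval = FALSE}
# Interpolate the informed model's error between the naive and perfect benchmarks
mae_model   <- 0.10  # error of the informed regression model
mae_naive   <- 0.40  # error of the naive, mean-predicting model
mae_perfect <- 0.00  # a perfect regression model produces no error

pps <- (mae_naive - mae_model) / (mae_naive - mae_perfect)
pps  # 0.75
```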
Using such normalized scores for model performance allows us to easily interpret
how much better our models are as compared to a naive baseline.
Moreover, such normalized scores allow us to compare and contrast different
modeling approaches, in terms of the algorithms, the target's data type,
the evaluation metrics, and any other settings used.
## Considerations
The main use of PPS is as a tool for data exploration.
It trains out-of-the-box machine learning models to assess the predictive relations in your dataset.
However, this makes PPS quite a "quick and dirty" approach.
The trained models are not at all tailored to your specific regression/classification problem.
For example, it could be that you get many PPSs of 0 with the default settings.
A known issue is that the default decision tree often does not find valuable splits
and reverts to predicting the mean `y` value found at its root.
Here, it could help to try calculating PPS with different settings (e.g., `algorithm = 'glm'`).
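A quick sketch of what that could look like, assuming `score_matrix()` passes the `algorithm` argument on to the underlying models just like `score()` does:
```{r, eval = FALSE}
# Recompute the PPS matrix with a GLM instead of the default decision tree
ppsr::score_matrix(df = iris, algorithm = 'glm')
```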
At other times, predictive relationships may rely on a combination of variables
(i.e., interaction/moderation). These are not captured by the PPS calculations,
which consider only one predictor at a time.
PPS is simply not suited for capturing such complexities.
In these cases, it might be more interesting to train models on all your features simultaneously
and turn to concepts like [feature/variable importance](https://topepo.github.io/caret/variable-importance.html),
[partial dependency](https://christophm.github.io/interpretable-ml-book/pdp.html), [conditional expectations](https://christophm.github.io/interpretable-ml-book/ice.html), [accumulated local effects](https://christophm.github.io/interpretable-ml-book/ale.html), and others.
In general, the PPS should not be considered more than a fast and easy tool for
finding starting points for further, in-depth analysis.
Keep in mind that you can build much better predictive models than the default
PPS functions if you tailor your modeling efforts to your specific data context.
## Open issues & development
PPS is a relatively young concept, and likewise the `ppsr` package is still under development.
If you spot any bugs or potential improvements, please raise an issue or submit a pull request.
Currently on the development agenda are:
* Support for different modeling techniques/algorithms
* Support for generalized linear models for multinomial classification
* Passing/setting of parameters for models
* Different model evaluation metrics
* Support for user-defined model evaluation metrics
* Downsampling for large datasets
## Attribution
This R package was inspired by 8080labs' Python package [`ppscore`](https://github.com/8080labs/ppscore).
The same 8080labs also developed an earlier, unfinished [R implementation of PPS](https://github.com/8080labs/ppscoreR).
Read more about the big ideas behind PPS in [this blog post](https://towardsdatascience.com/rip-correlation-introducing-the-predictive-power-score-3d90808b9598).