# Causal Modeling {#sec-causal}
![](img/chapter_gp_plots/gp_plot_10.svg){width=75%}
:::{.content-visible when-format='html'}
> All those causal effects will be lost in time, like tears in rain... without adequate counterfactual considerations.
> ~ Roy Batty (paraphrased)
:::
Causal inference is a very important topic in machine learning and statistical modeling approaches. It is also a very difficult one to understand well, or consistently, because *not everyone agrees on how to define a cause in the first place*. Our focus here is merely practical: we just want to discuss some of the modeling approaches commonly used when attempting to answer causal questions. Causal modeling in general is such a deep topic that we won't be able to go into as much detail as it deserves, but we will try to give you a sense of the landscape and some of the key ideas.
## Key Ideas {#sec-causal-key-ideas}
- No model can tell you whether a relationship is causal or not. Causality is inferred, not proven, based on the available evidence.
- The same models could be used for similar data settings to answer a causal question or a purely predictive question. A key difference is in the interpretation of the results.
- Experimental designs, such as randomized control trials, are considered the gold standard for causal inference. But the gold standard is often not practical, and it is not without limitations even when it is.
- Causal inference is often done with observational data, which is often the only option, and that's okay.
- Counterfactual thinking is at the heart of causal inference, but can be useful for all modeling contexts.
- Several models exist which are typically employed to answer a more causal-oriented question. These include graphical models, uplift modeling, and more.
- Interactions are the norm for most modeling scenarios, while causal inference generally focuses on a single effect. If an effect varies depending on other features, be cautious about aggregating your results to a single effect, since that aggregate could be misleading.
### Why it matters {#sec-causal-why}
Often we need a precise statement about the feature-target relationship, not just a declaration that there is 'some' relationship. For example, we might want to know how well a drug works and for whom, or show that an advertisement results in a certain amount of new sales. We generally need to know whether the effect is real, how large it is, and often, how uncertain the estimate is.
Causal modeling is, like machine learning, more of an approach than a specific model, and that approach may involve the design or implementation of models we've already seen, but conducted in a different way to answer the key question. Without more precision in our understanding, we could miss the effect, or overstate it, and make bad decisions as a result.
### Helpful context {#sec-causal-good-to-know}
This section is pretty high level; we won't go into much technical detail, so even a basic understanding of correlation and modeling should be enough.
## Prediction and Explanation Revisited {#sec-causal-prediction-explanation}
We introduced the idea of prediction and explanation in the context of linear models in @sec-lm-prediction-vs-explanation, and it's worth revisiting here. One attribute of a causal model is an intense focus on the explanatory power of the model. We want to demonstrate that there is a relationship between (usually) a single feature and the target, and we want to know the precise manner of this relationship as much as possible. Even if we use complex models, the endeavor is to explain the specifics.
```{r}
#| echo: false
#| label: prediction-vs-explanation-demo
set.seed(42)
n = 10000
x = rnorm(n)
y_num = rbinom(n, 1, plogis(.05*x))
y = factor(y_num)
model = glm(y ~ x, family = 'binomial')
x_OR = round(exp(coef(model)['x']), 2)
# summary(model)
# summary(model)
#confint(model)['x',]
y_pred0 = predict(model, data.frame(x=0), type='response')
y_pred1 = predict(model, data.frame(x=1), type='response')
prob_diff = label_percent()(y_pred1 - y_pred0)
pred_class = factor(as.integer(predict(model, type = 'response') > .5))
base_rate = label_percent()(mean(y == '1'))
# mlr3measures::confusion_matrix(y, pred_class, positive = '1')
x_new = data.frame(x = rnorm(100))
y_new = factor(rbinom(100, 1, plogis(.05 * x_new$x)))
pred_class_new = factor(as.integer(predict(model, newdata = x_new, type = 'response') > .5))
# mean(y_new == '1')
# mlr3measures::confusion_matrix(y_new, pred_class_new, positive = '1')
causal_power = readRDS('data/causal_power.rds')
```
```{r}
#| echo: false
#| eval: false
#| label: prediction-vs-explanation-demo-power
get_power = function(reps = 1000) {
pvals = replicate(reps, {
y = rbinom(n, 1, plogis(.05*x)) # -3 + .05*x for lower base rate ~5%
model = glm(y ~ x, family = 'binomial')
summary(model)$coefficients['x', 4]
})
mean(pvals < .05)
}
causal_power = scales::label_percent()(get_power())
saveRDS(causal_power, 'data/causal_power.rds')
```
Let's say that we used some particular causal modeling approach to explain a feature-target relationship in a classification setting. We have 10,000 observations, and the baseline rate of the target is about `r base_rate`. We have a model that predicts the target `y` based on the feature of interest `x`, and we may have used some causal technique like propensity score weighting or another approach to help control for confounding (we'll discuss these later).
The coefficient, though small with an odds ratio of `r x_OR`, is statistically significant (take our word for it), so we have a slight positive relationship. Under certain settings such as this, where we are interested in causal effects and where we have controlled for various other factors to obtain this result, we might be satisfied with interpreting this relationship as is.
```{r}
#| echo: false
#| eval: false
#| label: fig-prediction-vs-explanation-demo-plot
#| fig-cap: Results from a hypothetical causal model
#| out-width: 100%
pdat = plot(ggeffects::ggpredict(model, terms = 'x [-4:4]'))$data
ggeffects::ggpredict(model, terms = 'x') |>
plot(use_theme = FALSE) +
geom_segment(
# aes(
x = 0,
xend = 1,
y = y_pred0,
yend = y_pred1,
# ),
color = okabe_ito[['darkblue']],
linewidth = 1.5
) +
geom_segment(
x = 0,
xend = min(pdat$x),
y = y_pred0,
yend = y_pred0,
color = okabe_ito[['darkblue']],
arrow = arrow(type = 'closed', angle = 30, length = unit(0.1, 'inches'))
) +
geom_segment(
x = 1,
xend = min(pdat$x),
y = y_pred1,
yend = y_pred1,
color = okabe_ito[['darkblue']],
arrow = arrow(type = 'closed', angle = 30, length = unit(0.1, 'inches'))
) +
geom_polygon(
data = pdat[1:4,], # just make some data the right length
aes(x = c(0, 0, 1, 1), y = c(.4, y_pred0, y_pred1, .4)),
fill = okabe_ito[['darkblue']],
alpha = .05
) +
labs(
title = '',
caption = 'The model does not predict well but suggests a small positive effect.',
) +
scale_y_continuous(
breaks = c(seq(.4, .6, .05), seq(.49, .52, .01)),
labels = scales::label_percent(),
limits = c(.4, .6)
) +
scale_x_continuous(breaks = seq(-4, 4, 1), labels = scales::label_number(), limits = range(x)) +
coord_cartesian(x = range(x))
ggsave('img/causal-prediction-vs-explanation-demo-plot.svg', width = 8, height = 6)
```
![Results from a hypothetical causal model](img/causal-prediction-vs-explanation-demo-plot.svg){width=100% #fig-prediction-vs-explanation-demo-plot}
But if we are interested in predictive performance, we would be disappointed with this model. It predicts the target at about the same rate as guessing, even on the data it's fit on, and does even worse with new data. Even the effect as shown is quite small by typical standards, as it would take a standard deviation change in the feature to get a ~`r prob_diff` change in the probability of the target (x is standardized).
If we are concerned solely with explanation, we would first want to ask ourselves whether we can trust our result given the data, the model, and the various issues that went into producing it. If so, we can then see whether the effect is large enough to be of interest, and whether the result is useful in making decisions[^seedsofchange]. It may very well be: maybe the target concerns the rate of survival, where any increase is worthwhile. Or perhaps the data circumstances demand such an interpretation, because it is costly to obtain more. For more exploratory efforts however, this sort of result would likely not be enough to come to any strong conclusion, even if explanation is the only goal.
[^seedsofchange]: This is a contrived example, but it is definitely something that you might see in the wild. The relationship is weak, and though statistically significant, the model can't predict the target well at all. The **statistical power** is actually decent in this case, roughly `r causal_power`, but this is mainly because the sample size is so large and it is a very simple model setting. The same coefficient with a base rate of around 5% would have a power of around 20%. <br> This is a common issue, and it's why we always need to be careful about how we interpret our models. In practice, we would generally need to consider other factors, such as the cost of a false positive or false negative, or the cost of the data and running the model itself, to determine if the model is worth using.
As another example, consider the world happiness data we've used in previous demonstrations. We want to explain the association between country-level characteristics and the population's happiness. We likely aren't as interested in predicting next year's happiness score as in which attributes are correlated with a happy populace in general. Or, in the U.S., we might be interested in specific factors related to presidential elections, for which there are relatively few data points. In these cases, explanation is the focus, and we may not even need a model at all to come to our conclusions.
So in some settings we may be more interested in understanding the underlying mechanisms of the data, and in others we may be more interested in predictive performance. In the end though, the distinction between prediction and explanation is a bit problematic, not least because we often want to do both.
Although it's often framed that way, *prediction is not just what we do with new data*. It is the very means by which we get any explanation of effects via coefficients, marginal effects, visualizations, and other model results. Additionally, when the focus is on predictive performance, if we can't explain the results we get, we will typically feel dissatisfied, and may still question how well the model is actually doing.
Here are some ways we might think about different modeling contexts:
- **Descriptive Analysis**: Here we have an exploration of data with no modeling focus. We'll use descriptive statistics and visualizations to help us understand what's going on. An end product may be an infographic or a highly visual report. Even here, we might use models to aid visualizations, or otherwise to help us understand the data better, but their specific implementation or result is not of much interest.
- **Exploratory Modeling**: When using models for exploration, focus should probably be on both prediction and explanation. The former can help inform the strength of the results for future exploration, while the latter will often provide useful insights.
- **Causal Modeling**: Here the focus is on understanding causal effects. We focus on explanation, and on prediction for the data at hand. We may very well be interested in predictive performance also, and often are in industry.
- **Generalization**: When our goal is generalizing to unseen data as we have discussed elsewhere, the focus is mostly on predictive performance[^transport], as we need something to help us predict things in the future. This does not mean we can't use the model to understand the data though, and explanation could still possibly be as important depending on the context.
[^transport]: In causal modeling, there is the notion of **transportability**, which is the idea that a model can be used in, or generalize to, a different setting than it was trained on. For example, you may see an effect for one demographic group and want to know whether it holds for another. It is closely related to the notion of external validity, and is also related to the concepts we've hit on in our discussion of interactions (@sec-lm-extend-interactions).
Depending on the context, we may be more interested in explanation or in predictive performance, but in practice we often want both. It is crucial to remind yourself why you are interested in the problem, what a model is capable of telling you about it, and to be clear about what you want to get out of the result.
## Classic Experimental Design {#sec-causal-classic}
```{r}
#| eval: false
#| echo: false
#| label: r-random-assignment-HTML
#| fig-cap: Random Assignment
#|
# Load the necessary library
# Create a data frame with random assignments
set.seed(1234)
N = 250
df = data.frame(
x = rnorm(N, sd = .1),
y = rnorm(N, sd = .1),
group = sample(c('A', 'B'), N, replace = TRUE),
col_group = sample(rainbow(5), N, replace = TRUE)
)
dot_size = 3
# Plot the data as a single cluster
p_all = ggplot(df, aes(x, y)) +
geom_point(
aes(color = col_group, shape = col_group),
fill = 'white',
size = dot_size,
show.legend = FALSE,
alpha = 1
) +
labs(
title = 'Random Assignment to Groups A and B',
) +
theme_void() +
theme(
plot.title = element_text(size = 30, hjust = 0.5, margin = margin(.5, .5, .5, .5, 'cm'))
)
# p_all
# Plot the data separated into groups A and B
p_groups = ggplot(df, aes(x, y)) +
geom_point(
aes(color = col_group, shape = col_group),
fill = 'white',
# border = 5,
# pch=21,
size = dot_size,
show.legend = FALSE,
alpha = 1
) +
facet_wrap(~ group) +
labs(
# title = 'Random Assignment to Groups A and B',
) +
theme_void() +
theme(
strip.text = element_text(size = 36, margin = margin(.5, .5, .5, .5, 'cm')),
panel.border = element_rect(color = 'gray25', fill=NA),
panel.spacing = unit(2, 'cm'),
plot.margin = margin(1, 1, 1, 1, 'cm')
)
library(waffle)
p_waff = df |>
count(group, col_group) |>
ggplot( aes(values=n, fill = col_group)) +
geom_waffle(n_rows = 10, size = .5, color = 'white', show.legend = FALSE) +
facet_wrap(~group) +
theme_void() +
theme(
strip.text = element_text(size = 36, margin = margin(.5, .5, .5, .5, 'cm')),
# panel.border = element_rect(color = 'black', fill=NA),
panel.spacing = unit(2, 'cm'),
plot.margin = margin(1, 1, 1, 1, 'cm')
)
design = '
##11##
222222
333333
'
(p_all /
p_groups/
p_waff) +
plot_layout(width = c(1, 2, 2), design = design) &
# scale_color_brewer(palette = 'Set1', aesthetics = c('fill', 'color'))# + set1 supposedly print friendly
scale_color_manual(values = unname(okabe_ito), aesthetics = c('fill', 'color'))
ggsave(
'img/causal-random-assignment.svg',
width = 9,
height = 12,
dpi = 300
)
```
```{r}
#| eval: false
#| echo: false
#| label: r-random-assignment-PDF
#| fig-cap: Random Assignment
#|
# Load the necessary library
# Create a data frame with random assignments
set.seed(1234)
N = 250
df = data.frame(
x = rnorm(N, sd = .1),
y = rnorm(N, sd = .1),
group = sample(c('A', 'B'), N, replace = TRUE),
col_group = sample(rainbow(5), N, replace = TRUE)
)
dot_size = 5
# Plot the data as a single cluster
p_all = ggplot(df, aes(x, y)) +
geom_point(
aes(shape = col_group),
fill = 'white',
size = dot_size,
show.legend = FALSE,
alpha = 1
) +
labs(
title = 'Random Assignment to Groups A and B',
) +
theme_void() +
theme(
plot.title = element_text(size = 30, hjust = 0.5, margin = margin(.5, .5, .5, .5, 'cm'))
)
# p_all
# Plot the data separated into groups A and B
p_groups = ggplot(df, aes(x, y)) +
geom_point(
aes(shape = col_group),
fill = 'white',
size = dot_size,
show.legend = FALSE,
alpha = 1
) +
facet_wrap(~ group) +
labs(
# title = 'Random Assignment to Groups A and B',
) +
theme_void() +
theme(
strip.text = element_text(size = 36, margin = margin(.5, .5, .5, .5, 'cm')),
panel.border = element_rect(color = 'gray25', fill=NA),
panel.spacing = unit(2, 'cm'),
plot.margin = margin(1, 1, 1, 1, 'cm')
)
library(ggpattern)
pch_values = c(1:3, 5:6)
shape_labels = c("\u25A1", "\u25CB", "\u25B3", "\u25C7", "\u25BD") # Square, Circle, Triangle, Diamond, Inverted Triangle (all open)
p_waff = df |>
count(group, col_group) |>
mutate(col_group = factor(col_group, levels = unique(col_group), labels = shape_labels)) |>
ggplot( aes()) +
geom_col(aes(x = col_group, shape = col_group, y = n), alpha = .05) +
# geom_col_pattern(aes(x = col_group, pattern = col_group, y = n), alpha = .05) +
facet_wrap(~group) +
theme_void() +
labs(
x = '',
) +
theme(
strip.text = element_text(size = 36, margin = margin(.5, .5, .5, .5, 'cm')),
axis.text.x = element_text(size = 12, color = 'gray50'),
# panel.border = element_rect(color = 'black', fill=NA),
panel.spacing = unit(2, 'cm'),
plot.margin = margin(1, 1, 1, 1, 'cm')
)
# p_waff
# mc thinks we're
design = '
##11##
222222
333333
'
(p_all /
p_groups/
p_waff) +
plot_layout(width = c(1, 2, 2), design = design) &
# scale_color_brewer(palette = 'Set1', aesthetics = c('fill', 'color'))# + set1 supposedly print friendly
# scale_color_manual(values = unname(okabe_ito), aesthetics = c('fill', 'color'))
scale_shape_manual(values = c(1:3, 5:6, 12))
ggsave(
'img/causal-random-assignment-pdf.svg',
width = 9,
height = 12,
dpi = 300
)
```
:::{.content-visible when-format="html"}
![Random Assignment](img/causal-random-assignment.svg){width=40% #fig-random-assignment}
:::
:::{.content-visible when-format="pdf"}
![Random Assignment](img/causal-random-assignment-pdf.svg){width=40% #fig-random-assignment}
:::
Many are familiar with the basic idea of an experiment, where we have a **treatment** group and a **control** group, and we want to measure the difference between the two groups. The 'treatment' could be a new drug, a marketing campaign, or a new app feature. If we randomly assign our observational units to the two groups, say, one that gets the new app feature and one that doesn't, we can be more confident that the two groups are essentially the same aside from the treatment. Furthermore, any difference we see in the outcome, for example, customer satisfaction with the app, is probably due to the treatment.
This is the basic idea behind a **randomized control trial**, or **RCT**. We can randomly assign the groups in a variety of ways, but you can think of it as flipping a coin, and assigning each sample to the treatment when the coin comes up on one side, and to the control when it comes up on the other. The idea is that the only difference between the two groups is the treatment, and so any difference in the outcome can be attributed to the treatment. This is visualized in @fig-random-assignment, where the colors/shapes represent different subgroups within the sample. Their distribution is roughly similar across the treatment groups after assignment, and would become more so with more data.
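Just to make the mechanics plain, here is a minimal sketch of random assignment in code; the group labels and sample size are arbitrary.

```{r}
#| eval: false
#| label: r-random-assignment-sketch
# randomly assign each unit to a group, like a virtual coin flip
set.seed(42)
n = 20
group = sample(c('treatment', 'control'), size = n, replace = TRUE)
table(group)
```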
### Analysis of Experiments {#sec-causal-experiments}
Many of those who have taken a statistics course have been exposed to the simple **t-test** to determine whether two groups are different. For many this is their first introduction to statistical modeling. The t-test tells us whether the difference in means between the two groups is *statistically* significant. However, it definitely *does not* tell us whether the treatment itself caused the difference, whether the effect is large or even real, or whether the treatment was a good idea in the first place. It just tells us whether the two groups are statistically different.
It turns out that a t-test is just a linear regression model. It's a special case of linear regression where there is only one feature, a categorical variable with two levels. The coefficient from the linear regression tells you the mean difference in the outcome between the two groups, and under the same conditions, the t-statistic from the linear regression and the one from a standard t-test function give identical statistical results.
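As a quick demonstration of the equivalence, here is a minimal sketch with simulated data; the group difference and sample size are arbitrary, and the equal-variance t-test matches the default linear regression setup.

```{r}
#| eval: false
#| label: r-ttest-vs-lm
set.seed(42)
group = rep(c('control', 'treatment'), each = 50)
y = rnorm(100, mean = ifelse(group == 'treatment', .5, 0))

t.test(y ~ group, var.equal = TRUE)  # classic two-sample t-test
summary(lm(y ~ group))  # same t-statistic (up to sign) and p-value for the group coefficient
```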
Analysis of variance, or **ANOVA**, allows the t-test to be extended to more than two groups, and to multiple features, and is also commonly employed to analyze the results of experimental design settings. But ANOVA is still just a linear regression. Even when we get into more complicated design settings such as repeated measures and mixed designs, it's still just a linear model; we'd just be using mixed models (@sec-lm-extend-mixed-models). In general, we're going to use similar tools to analyze the results of our experiments as we would for other modeling settings.
If linear regression didn't suggest any notion of causality to you before, it shouldn't now either. The model is *identical* whether there was an experimental design with random assignment or not. The only difference is that the data was collected in a different way, and the theoretical assumptions and motivations are different. Even the statistical assumptions are the same whether you use random assignment or not, whether there are more than two groups, and whether the treatment is continuous or categorical.
Experimental design[^exprand] can give us more confidence in the causal explanation of model results, whatever model is used, and this is why we like to use it when we can. It helps us control for the unobserved factors that might otherwise be influencing the results. If we can be fairly certain the observations are essentially the same *except* for the treatment, then we can be more confident that the treatment is the cause of any differences we see, and be more confident in a causal interpretation of the results. But it doesn't change the model itself, and the results of a model don't prove a causal relationship on their own. Your experimental study will also be limited by the quality of the data, and the population it generalizes to. Even with strong design and modeling, if care isn't taken in the modeling process to even assess the generalization of the results (@sec-ml-generalization), you may find they don't hold up[^explimits].
[^exprand]: Note that experimental design is not just any setting that uses random assignment, but more generally how we introduce *control* in the sample settings.
[^explimits]: Many experimental design settings involve sometimes very small samples due to the cost of the treatment implementation and other reasons. This often limits exploration of more complex relationships (e.g., interactions), and it is relatively rare to see any assessment of performance generalization. It would probably worry many to know how many important experimental results are based on p-values with small data, and this is part of the problem seen with the [replication crisis](https://en.wikipedia.org/wiki/Replication_crisis) in science.
:::{.callout-note title='A/B Testing' collapse='true'}
**A/B testing** is just marketing-speak for a project focused on comparing two groups. It implies randomized assignment, but you'd have to understand the context to know if that is actually the case.
:::
## Natural Experiments {#sec-causal-natural}
```{r}
#| label: r-causal-covid-vax-deaths
#| eval: false
#| echo: false
deaths = read_csv('misc/covid_new_deaths_per_million.csv', col_select= matches('date|United States$')) |>
rename(deaths_per_mi = `United States`)
vax = read_csv('misc/us_covid_vaccinations.csv')
dim(deaths)
dim(vax)
deaths |> print(n=100)
# min(deaths$date)
# length(seq.Date(as_date('2020-01-05'), as_date('2022-12-31'), by = 'day'))
covid_us = full_join(
vax, deaths |> select(date, deaths_per_mi) |> filter(deaths_per_mi>0), by = 'date'
)
covid_us |>
select(date, people_vaccinated, deaths_per_mi) |>
filter(date < '2022-01-01') |>
drop_na() |>
# mutate(total_vaccinations = total_vaccinations - dplyr::lag(total_vaccinations)) |>
mutate(people_vaccinated_million = people_vaccinated/1e6) |>
select(-people_vaccinated) |>
pivot_longer(-date, names_to = 'variable', values_to = 'value') |>
filter(variable != 'people_vax_million') |>
mutate(variable = if_else(variable == 'deaths_per_mi', '# deaths per million', '# vaccinated in millions')) |>
# group_by(variable) |>
# mutate(scale_value = scale(value)) |>
ggplot(aes(x = date, y = value)) +
# geom_point(aes(color = variable)) +
geom_line(aes(color = variable, linetype = variable)) +
geom_smooth(aes(color = variable, linetype = variable), se=FALSE, method = 'gam', formula = y ~ s(x, bs='cs', k = 25)) +
scale_x_date(date_breaks = '1 month', labels = label_date_short()) +
see::scale_color_okabeito() +
# geom_smooth(method = 'lm') +
labs(
x = '',
y = '',
caption = 'Data from Our World in Data: https://github.com/owid/',
title = 'Covid Vaccinations and Deaths in the US'
) +
theme(
legend.direction = 'vertical',
)
ggsave(
'img/causal-covid-vax-deaths.svg',
width = 8,
height = 6
)
```
```{r}
#| eval: false
#| echo: false
#| label: r-causal-covid-derivs
# just eda
library(mgcv)
library(gratia)
model_data = covid_us |>
select(date, people_vaccinated, deaths_per_mi) |>
filter(date < '2022-01-01') |>
drop_na() |>
# mutate(total_vaccinations = total_vaccinations - dplyr::lag(total_vaccinations)) |>
mutate(people_vaccinated_million = people_vaccinated/1e6)
# Model death data
death_model = gam(deaths_per_mi ~ s(date), data = model_data |> mutate(date = as.numeric(date)))
death_derivatives = derivatives(death_model, term = 's(date)')
# Model vaccination data
vax_model = gam(people_vaccinated_million ~ s(date), data = model_data |> mutate(date = as.numeric(date)))
vax_derivatives = derivatives(vax_model, term = 's(date)')
# Plot death derivatives
# draw(death_derivatives)
# draw(vax_derivatives)
vax_derivatives |>
select(date = data, vax_d = derivative) |>
left_join(death_derivatives |> select(date = data, death_d = derivative), by = 'date') |>
# arrange(date) |>
mutate(
date = as_date(date),
# vax_d = dplyr::lag(vax_d, 4)
) |>
pivot_longer(-date, names_to = 'variable', values_to = 'value') |>
ggplot(aes(x = date, y = value)) +
geom_line(aes(color = variable, linetype = variable)) +
scale_x_date(date_breaks = '1 month', labels = label_date_short()) +
see::scale_color_okabeito() +
labs(
x = '',
y = '',
)
# ggplot(aes(x = vax_d, y = death_d)) +
# geom_point()
```
![Covid Vaccinations and Deaths in the US](img/causal-covid-vax-deaths.svg){width=50% #fig-covid-vax-deaths}
As we noted, random assignment or a formal experiment is not always possible or practical to implement. But sometimes we get to do it anyway, or at least we can get pretty close! Occasionally, the world gives us a **natural experiment**, where the assignment to the groups is essentially random, or where there is a clear break before and after some event occurs, such that we can examine the change as we would in a pre-post design.
The COVID-19 pandemic provides an example of a natural experiment. The pandemic introduced sudden and widespread changes, such as lockdowns, remote work, and vaccination campaigns, that were not influenced by individuals' prior characteristics or behaviors. The randomness in the timing and implementation of these changes allows researchers to compare outcomes before and after a policy or the pandemic itself, or between regions with varying policies, to infer causal effects.
For instance, we could compare states or counties that had mask mandates to those that didn't at the same time or with similar characteristics. Or we might compare areas that had high vaccination rates to nearby areas that didn't. But these still aren't true experiments, so we'd need to control for as many additional factors as we can that might influence the results, like population density, age, wealth, and so on. Eventually we might still get a pretty good idea of the causal impact of these interventions.
## Causal Inference {#sec-causal-inference}
While we all have a natural intuition about causality, it can actually be a fairly elusive notion to grasp. Causality is a very old topic, philosophically dating back millennia, and more formally [hundreds of years](https://plato.stanford.edu/entries/causation-medieval/). Random assignment is a relatively new idea, say [150 years old](https://plato.stanford.edu/entries/peirce/), and was posited even before Wright, Fisher, and Neyman, and the 20th century rise of statistics. But with statistics and random assignment we had a way to start using models to help us reason about causal relationships. [Pearl and others](https://muse.jhu.edu/pub/56/article/867087/summary) came along to provide an algorithmic perspective from computer science, and economists like Heckman got into the game as well. We were even using programming approaches to do causal inference back in the 1970s! Eventually most scientific disciplines became acquainted with causal inference in some fashion, and things have been progressing along for some time.
Because of its long history, causal inference is a broad field, and there are many ways to approach it. We've already discussed some of the basics, but there are many other ways to reason about causality. And of course, we can use models to help us understand the causal effects we are interested in.
### Key assumptions of causal inference {#sec-causal-assumptions}
Causal inference at its core is the process of identifying and estimating causal effects. But like other scientific and modeling endeavors, it relies on several key assumptions to do so. The main assumptions include:
- **Consistency**: The *potential* outcome under the observed treatment is the same as the *observed* outcome. This suggests there is *no interference* between units, and that there are *no hidden variations of the treatment*.
- **Exchangeability**: The treatment assignment is independent of the potential outcomes, given the observed covariates. In other words, the treatment assignment is as good as random after conditioning on the covariates. This is often referred to as *no unmeasured **confounding***.
- **Positivity**: Every individual has a positive probability of receiving each treatment level.
It can be difficult to meet these assumptions, and there is not always a clear path to a solution. As an example, say we want to assess a new curriculum's effect on student performance. We can randomly assign students, but they can interact with one another both in and outside of the classroom. Those who receive the treatment may be more likely to talk to one another, and this could affect the outcome, enhancing its effects if it is beneficial. This would violate our assumption of no interference between units, and we might need to choose an alternative design or outcome to account for it.
The following demonstrates a common issue that is regularly guarded against in causal modeling: confounding. The confounder `U` is a variable that affects both the treatment `X` and the target `Y`. We'll generate some synthetic data with a confounder and fit two models, one with the confounder and one without, then compare the coefficients of the feature of interest in both models.
:::{.panel-tabset}
##### Python
```{python}
#| label: py-causal-assumptions
#| eval: false
import numpy as np  # used for np.mean below
import pandas as pd
import statsmodels.api as sm
from numpy.random import normal as rnorm
def get_coefs(n = 100, true = 1):
U = rnorm(size=n) # Unmeasured confounder
X = 0.5 * U + rnorm(size=n) # Treatment influenced by U
Y = true * X + U + rnorm(size=n) # Outcome influenced by X and U
data = pd.DataFrame({'X': X, 'U': U, 'Y': Y})
# Fit a linear regression model with and
# without adjusting for the unmeasured confounder
model = sm.OLS(data['Y'], sm.add_constant(data['X'])).fit()
model2 = sm.OLS(data['Y'], sm.add_constant(data[['X', 'U']])).fit()
return model.params['X'], model2.params['X']
def simulate_confounding(nreps = 100, n = 100, true=1):
results = []
for _ in range(nreps):
results.append(get_coefs(n, true))
results = np.mean(results, axis=0)
return pd.DataFrame({
'true': true,
'estimate_1': results[0],
'estimate_2': results[1],
}, index=['X']).round(3)
simulate_confounding(n=1000, nreps=500)
```
##### R
```{r}
#| label: r-causal-assumptions
#| eval: false
get_coefficients = function(n = 100, true = 1) {
U = rnorm(n) # Unmeasured confounder
X = 0.5 * U + rnorm(n) # Treatment influenced by U
Y = true * X + U + rnorm(n) # Outcome influenced by X and U
data = data.frame(X = X, U = U, Y = Y)  # include U so both models can use the same data
# Fit a linear regression model with and
# without adjusting for the unmeasured confounder
model = lm(Y ~ X, data = data)
model2 = lm(Y ~ X + U, data = data)
c(coef(model)['X'], coef(model2)['X'])
}
simulate_confounding = function(nreps, n, true) {
results = replicate(nreps, get_coefficients(n, true))
results = rowMeans(results)
data.frame(
true = true,
estimate_1 = results[1],
estimate_2 = results[2]
)
}
simulate_confounding(nreps = 500, n = 1000, true = 1)
```
:::
Results suggest that the coefficient for `X` is different in the two models. If we don't include the confounder, the feature's relationship with the target is biased upward. The nature of the bias ultimately depends on the relationship between the confounder and the treatment and target, but in this case it's pretty clear!
```{r}
#| echo: false
#| label: tbl-r-causal-assumptions
#| tbl-cap: Coefficients with and without the confounder
simulate_confounding = function(n, nreps, true = 1) {
results = replicate(
nreps,
{
U = rnorm(n) # Unmeasured confounder
X = 0.5 * U + rnorm(n) # Treatment influenced by U
Y = true * X + U + rnorm(n) # Outcome influenced by X and U
# Create a tibble
data = tibble(X = X, Y = Y)
# Fit a linear regression model with and
# without adjusting for the unmeasured confounder
model = lm(Y ~ X, data = data)
model2 = lm(Y ~ X + U, data = data)
c(coef(model)['X'], coef(model2)['X'])
}
)
results = rowMeans(results)
tibble(
true = true,
estimate_1 = results[1],
estimate_2 = results[2]
)
}
as_tibble(simulate_confounding(n = 1000, nreps = 500)) |>
gt()
```
Though this is a simple demonstration, it shows why we need to be careful in our modeling and analysis, and if we are interested in causal relationships, we need to be aware of our assumptions and help make them plausible. If we suspect something is a confounder, we can include it in our model to get a more accurate estimate of the effect of the treatment.
More generally, with causal approaches to modeling, we are expressly interested in interpreting the effect of one feature on another, and we are interested in the mechanisms that bring about that effect. We are not just interested in the mere correlation between variables, or just predictive capabilities of the model. As we'll see though, we can use the same models we've seen already, but need these additional considerations to draw causal conclusions.
## Models for Causal Inference {#sec-causal-models}
We can use many modeling approaches to help us reason about causal relationships, and this can be both a blessing and a curse. Our models can be more complex, and we can use more data, which can potentially give us more confidence in our conclusions. But we can still be easily fooled by our models, as well as by ourselves. We'll need to be careful in how we go about things, but let's see what some of our options are!
Any model can potentially be used to answer a causal question, and which one you use will depend on the data setting and the question you are asking. The following covers a few models that might be seen in various academic and professional settings.
### Linear regression {#sec-causal-lm}
Yep, [linear regression](https://matheusfacure.github.io/python-causality-handbook/05-The-Unreasonable-Effectiveness-of-Linear-Regression.html). The old standby is possibly the most widely used model for causal inference, historically and even today. We've seen linear regression as a kind of graphical model (@fig-graph-lm), and in that sense, it can serve as the starting point for structural equation models and other approaches we'll talk about next, which many consider to be true causal models. It can also be used as a baseline model for more complex causal modeling approaches.
Linear regression can potentially tell us for any particular feature, what that feature's relationship with the target is, holding the other features constant. This **ceteris paribus** interpretation - 'all else being equal' - already gets us into a causal mindset. If we had randomization and no confounding, and the feature-target relationship was linear, we could interpret the coefficient of the feature as the causal effect.
However, your standard linear model doesn't care where the data came from or what the underlying structure *should* be. It only does what you ask of it, and will tell you about group differences whether they come from a randomized experiment or not. For example, as we saw earlier, if you don't include potential confounders, you could get a biased estimate of the effect[^biasedcause]. It also cannot tell you whether X affects Y or vice versa. So linear regression by itself cannot save us from the difficulties of causal inference, nor can it really be considered a causal model. But it can be useful as a starting point in conjunction with other approaches.
[^biasedcause]: A reminder that a conclusion of 'no effect' is also a causal statement, and can also be a biased one. Also, you can come to the same *practical* conclusion with a biased estimate as with an unbiased one.
<!--
As an example, let's consider a simple linear model with a confounder. We'll generate some synthetic data with a confounder, and fit two models, one with the confounder and one without. We'll compare the coefficients of the feature of interest, `x`, in both models.
:::{.panel-tabset}
##### R
```{r}
#| label: r-demo-confounding
# Set seed for reproducibility
set.seed(42)
# Generate synthetic data
n = 100
z = rnorm(n) # the confounder
x = 2 * z + rnorm(n) # the confounded feature
y = 3 * z + rnorm(n) # the target
data = tibble(x = x, y = y, z = z)
# Fit linear models
model_without_z = lm(y ~ x, data = data)
model_with_z = lm(y ~ x + z, data = data)
# Compare x coefficients
c(coef(model_without_z)['x'], coef(model_with_z)['x'])
```
##### Python
```{python}
#| label: py-demo-confounding
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
# Set seed for reproducibility
np.random.seed(42)
# Generate synthetic data
n = 100
z = np.random.normal(size=n) # the confounder
x = 2 * z + np.random.normal(size=n) # the confounded feature
y = 3 * z + np.random.normal(size=n) # the target
data = pd.DataFrame({'x': x, 'y': y, 'z': z})
# Fit linear models
model_without_z = LinearRegression().fit(data[['x']], data['y'])
model_with_z = LinearRegression().fit(data[['x', 'z']], data['y'])
# Compare x coefficients
model_without_z.coef_[0].round(3), model_with_z.coef_[0].round(3)
```
:::
The results show that the coefficient of `x` is different in the two models. If we don't include the confounder, the feature's relationship with the target, which in this case is zero, is a reflection of the correlation it has with the confounder, which is also correlated with the target. Without including the confounder, we can come away with the wrong conclusion about the relationship between `x` and `y`.
As we can see from this simple demo, linear regression by itself cannot save us from the difficulties of causal inference, nor really can be considered a causal model. But it can be useful as a starting point in conjunction with other approaches.
-->
:::{.callout-note title='Weighting and Sampling Methods' collapse='true'}
Common techniques for traditional statistical models used for causal inference include a variety of **weighting** or **sampling** methods. These methods are used to adjust the data so that the 'treatment' groups are more similar, and a causal effect can be more accurately estimated. Sampling methods include techniques such as **stratification** and **matching**, which focus on the selection of the sample as a means to balance treatment and control groups. Weighting methods include **inverse probability weighting** and **propensity score weighting**, which focus on adjusting the weights of the observations to make the groups more similar. They have extensions to continuous treatments as well.
Sampling and weighting methods are not models themselves, and can potentially be used with just about any model that attempts to estimate the effect of a treatment. A nice overview of using such methods vs. standard regression/ML can be found on [Cross Validated](https://stats.stackexchange.com/a/544958).
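As a rough sketch of the inverse probability weighting idea, here is a minimal example with simulated data; the variable names and effect sizes are made up for illustration.

```{r}
#| eval: false
#| label: r-ipw-sketch
set.seed(42)
n = 1000
x = rnorm(n)                              # a confounder
treatment = rbinom(n, 1, plogis(.5 * x))  # treatment is more likely for higher x
y = 1 * treatment + x + rnorm(n)          # true treatment effect of 1

# the naive comparison is biased by the confounder
coef(lm(y ~ treatment))['treatment']

# step 1: model the probability of treatment (the propensity score)
ps = fitted(glm(treatment ~ x, family = binomial))

# step 2: weight each observation by the inverse probability of the
# treatment it actually received
wts = ifelse(treatment == 1, 1 / ps, 1 / (1 - ps))

# step 3: the weighted comparison recovers something close to the true effect
coef(lm(y ~ treatment, weights = wts))['treatment']
```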
:::
### Graphical models & structural equation models {#sec-causal-graphical-sem}
```{r}
#| echo: false
#| eval: false
#| label: fig-causal-dag
#| fig-cap: A Causal DAG
library(ggdag)
# using more R-like syntax to create the same DAG
tidy_ggdag = dagify(
target ~ x + z,
x ~ z + w,
z ~ w,
# z1 ~ w1 + v,
# z2 ~ w2 + v,
# w1 ~ ~w2, # bidirected path
exposure = 'x',
outcome = 'target',
labels = c(x = 'Phys.\nAct.', z = 'Diet.\nHabit', w = 'Access', target = 'Cholesterol')
)
# ggdag(
# tidy_ggdag,
# node_size = 25,
# use_labels = "label",
# label_col = okabe_ito[3],
# point_col = 'blue',
# text = FALSE,
# stylized = FALSE
# ) +
# geom_dag_edges(
# edge_width = 1.5,
# edge_color = okabe_ito[2],
# edge_alpha = 0.5,
# arrow_directed = grid::arrow(length = grid::unit(10, 'pt'), type = 'closed')
# ) +
# geom_dag_point(color = okabe_ito[1])
# tidy_ggdag |>
# ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
p = tidy_dagitty(tidy_ggdag, layout = 'nicely') |>
ggdag(label_col = okabe_ito[1], text = FALSE)
p +
geom_dag_edges(
edge_width = 1.5,
edge_color = okabe_ito[2],
edge_alpha = 0.5,
arrow_directed = grid::arrow(length = grid::unit(15, 'pt'), type = 'closed'),
) +
# add focus to physical activity
geom_point(
data = p$data |> filter(name=='x', to == 'target') |> slice(1) ,
aes(x, y),
color = okabe_ito[2],
alpha = 1,
size = 28
) +
geom_point(
data = p$data |> filter(name=='target') ,
aes(x, y),
color = okabe_ito[6],
alpha = 1,
size = 28
) +
geom_dag_point(aes(), show.legend = FALSE, color = okabe_ito[1], fill = NA, size = 25) +
geom_dag_label(aes(label = label), size = 3) +
theme_void()
ggsave('img/causal-dag.svg')
```
![A Causal DAG](img/causal-dag.svg){width=75% #fig-causal-dag}
**Graphical and Structural Equation Models (SEM)** are flexible approaches to regression and classification, and have one of the longest histories of formal statistical modeling, dating back over a century[^wright]. As an example, @fig-causal-dag shows a *directed acyclic graph* (DAG) that represents a causal model. The arrows indicate the direction of the causal relationships; each node is a feature or the target, and some features are influenced by others.
[^wright]: [Wright](https://en.wikipedia.org/wiki/Sewall_Wright) is credited with coming up with what would be called **path analysis** in the 1920s, which is a precursor to and part of SEM and form of graphical model.
In that graph, our focal treatment, or 'exposure', is physical activity, and we want to see its effect on a health outcome like cholesterol levels. However, dietary habits would also affect the outcome, and affect how much physical activity one does. Both dietary habits and physical activity may in part reflect access to healthy food. The target in question does not affect any other nodes, and in fact the causal flow is in one direction, so there is no cycle in the graph (i.e., it is 'acyclic').
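As an aside, a DAG like this can also be expressed and queried in code. Here is a minimal sketch using the dagitty package, where the node names are just shorthand for the labels in the figure; assuming the graph is correct, it reports which variables we'd need to adjust for to estimate the effect of physical activity on cholesterol.

```{r}
#| eval: false
#| label: r-dag-adjustment-sketch
library(dagitty)

dag = dagitty('dag {
  Access -> DietHabit
  Access -> PhysAct
  DietHabit -> PhysAct
  DietHabit -> Cholesterol
  PhysAct -> Cholesterol
}')

# minimal adjustment set(s) for the physical activity -> cholesterol effect
adjustmentSets(dag, exposure = 'PhysAct', outcome = 'Cholesterol')
```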
One thing to note relative to the other graphical model depictions we've seen is that the arrows flow directly to a target or set of targets, as opposed to just producing an 'output' that we then compare with the target. In graphical causal models, we're making clear the direction and focus of the causal relationships, i.e., the causal structure, as opposed to the model structure. Also, in graphical causal models, the effects for any given feature are adjusted for the other features in the model in a particular way, so that we can think about them in isolation, rather than as a collective set of features that are all influencing the target[^graphout].
[^graphout]: If we were to model this in an overly simple fashion with linear regressions for any variable with an arrow to it, you could say physical activity and dietary habits would basically be the output of their respective models. It isn't that simple in practice though, such that we can just run separate regressions and feed in the results to the next one, though that's how they used to do it back in the day. We have to take more care in how we adjust for all features in the model, as well as correctly account for the uncertainty if we do take a multi-stage approach.
Structural equation models are widely employed in the social sciences and education, and are often used to model both observed and *latent* variables (@sec-data-latent), with either serving as features or targets[^sembias]. They are also used to model causal relationships, to the point that historically they were even called 'causal graphical models' or 'causal structural models'. SEMs are actually a special case of the graphical models just described, which are more common in non-social science disciplines. Compared to other graphical modeling techniques like DAGs, SEMs will typically have more assumptions, and these are often difficult to meet[^semass].
[^sembias]: Your authors have to admit some bias here, but we hope the presentation for SEM is balanced. We've spent a lot of our past dealing with SEMs, and almost every application we saw had too little data and was grossly overfit. Many SEM programming approaches even added multiple ways to overfit the data even further, and it is difficult to trust the results reported in many papers that used them. But that's not the fault of SEM in general: it can be a useful tool when used correctly, and it can help answer causal questions. But it can easily be misused by those not familiar with its assumptions and limitations.
[^semass]: @vanderweele_invited_2012 provides a nice overview of the increased assumptions of SEM relative to other methods.
The following shows a relatively simple SEM, a latent variable **mediation model** involving social support and self esteem, and with depression as the outcome of interest (@fig-sem). Each latent variable has three observed measures, e.g., item scores collected from a psychological inventory or personal survey. The observed variables are *caused* by the latent, i.e., *unseen* or *hidden*, variables. In other words, the observed item score is a less than perfect reflection of the true underlying latent variable, which is what we're really interested in. The effects of the latent constructs of social support and self esteem on depression may be of equal interest in this setting. For social support, we'd be interested in the direct effect on depression, as well as the indirect effect through self esteem.
```{r}
#| label: r-causal-sem-plot
#| echo: false
g = DiagrammeR::grViz('img/graphical-sem.dot')
g |>
DiagrammeRsvg::export_svg() |>
charToRaw() |>
rsvg::rsvg_svg("img/sem.svg")
```
![SEM with latent and observed variables](img/sem.svg){width=75% #fig-sem}
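To give a sense of how such a model might be specified, here is a minimal sketch using the lavaan package in R; the indicator names (e.g., `ss1` through `ss3`) and the data frame are hypothetical.

```{r}
#| eval: false
#| label: r-sem-mediation-sketch
library(lavaan)

model = '
  # measurement model: each latent variable has three observed indicators
  social_support =~ ss1 + ss2 + ss3
  self_esteem    =~ se1 + se2 + se3
  depression     =~ dep1 + dep2 + dep3

  # structural model: direct effect (c) and indirect effect via self esteem (a*b)
  self_esteem ~ a * social_support
  depression  ~ b * self_esteem + c * social_support

  indirect := a * b
  total    := c + a * b
'

# fit = sem(model, data = df)  # df would hold the nine observed items
# summary(fit, standardized = TRUE)
```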
Formal graphical models provide a much richer set of tools for handling confounding, interactions, and indirect effects than simpler linear models, and for this reason they can be very useful for causal inference. A cautionary note though: models like linear regression can be seen as a special case, and we know that linear regression by itself is not a causal model. So for these tools to provide valid causal estimates, they need to be used in a way that is consistent with the assumptions of both the underlying causal model and the estimation approach.
:::{.callout-note title='Causal Language' collapse='true'}
It's often been suggested that we reserve certain phrasing, for example that feature X has an *effect* on target Y, for the causal model setting. But the model we use can only tell us that the data are consistent with the effect we're trying to understand, not that it actually exists. In everyday language, we use causal language whenever we think the relationship is or should be causal, and we think that's okay in a modeling context too, as long as you are clear about the limits of your generalizability.
:::
### Counterfactual thinking {#sec-causal-counterfactual}
<!-- if need a viz here; see dalle gen image at img/what_if_counterfactual.jpeg otherwise drop-->
![What if I had done something different...?](img/causal-what_if_counterfactual.jpeg){width=25%}
When we think about causality, we really ought to think about **counterfactuals**. What would have happened if I had done something different? What would have happened if I had done something sooner rather than later? What would have happened if I had done nothing at all? It's natural to question our own actions in this way, but we can think like this in a modeling context too. In terms of our treatment effect example, we can summarize counterfactual thinking as:
> The question is not whether there is a difference between A and B but whether there would still be a difference if A *was* B and B *was* A.
This is the essence of counterfactual thinking. It's not about whether there is a difference between two groups, but whether there would still be a difference if those in one group had actually been treated differently. In this sense, we are concerned with the **potential outcomes** of the treatment, however defined.
Here is a more concrete example:
- Roy is shown ad A, and buys the product.
- Pris is shown ad B, and does not buy the product.
What are we to make of this? Which ad is better? **A** seems to be, but maybe Pris wouldn't have bought the product if shown that ad either, and maybe Roy would have bought the product if shown ad **B** too! With counterfactual thinking, we are concerned with the potential outcomes of the treatment, which in this case is whether or not to show the ad.
Let's say ad A is the new one, i.e., our treatment group, and B is the status quo ad, our control group. Without randomization, our real question can't be answered by a simple test of whether means or predictions differ between the two groups, as this estimate would be biased if the groups were already different in some way to start with. The real question is, for those who saw ad A, what the difference in outcome would have been had they not seen it.
From a prediction standpoint, we can get an initial estimate straightforwardly. We demonstrated counterfactual predictions before in @sec-knowing-counterfactual-predictions, but we can revisit it briefly here. For those in the treatment, we can just plug in their feature values with treatment set to ad A. Then we make a prediction with treatment set to ad B. This approach is basically the **S-Learner** approach to meta-learning, which we'll discuss in a bit, as well as a simple form of **G-computation**, widely used in causal inference.
:::{.panel-tabset}
##### Python
```{python}
#| eval: false
#| label: py-demo-counterfactual
(
    model.predict(X.assign(treatment = 'A'))
    - model.predict(X.assign(treatment = 'B'))
)
```
##### R
```{r}
#| eval: false
#| label: r-demo-counterfactual
predict(model, X |> mutate(treatment = 'A')) -
predict(model, X |> mutate(treatment = 'B'))
```
:::
With counterfactual thinking explicitly in mind, we can see that the difference in predictions is the difference in the potential outcomes of the treatment. This is a very simple demo to illustrate how easy it is to start getting some counterfactual results from our models. But it's typically not quite that simple in practice, and there are many ways to get this estimate wrong as well. As in other circumstances, the data and our assumptions about the problem can potentially lead us astray. But, assuming those aspects of our modeling endeavor are in order, this is one way to get an estimate of a causal effect.
### Uplift modeling {#sec-causal-uplift}
![A sleeping dog](img/causal-sleeping_dog.jpeg){width=25%}
The counterfactual prediction we just did provides a result that can be called the **uplift** or **gain** from the treatment, particularly when compared to a baseline metric. **Uplift modeling** is a general term applied to models where counterfactual thinking is at the forefront, especially in a marketing context. Uplift modeling is not a specific model per se, but any model that is used to answer a question about the potential outcomes of a treatment. The key question is what is the gain, or uplift, in applying a treatment vs. the baseline? Typically any statistical model can be used to answer this question, and often the model is a classification model, for example, whether Roy from the previous section bought the product or not.
It is common in uplift modeling to distinguish certain types of individuals or instances, and we think it's useful to extend this to other modeling contexts as well. In the context of our previous example they are:
- **Sure things**: those who would buy the product whether or not shown the ad.
- **Lost causes**: those who would not buy the product whether or not shown the ad.
- **Sleeping dogs**: those who would buy the product if not shown the ad, but not if they are shown the ad. Also referred to as the 'Do not disturb' group!
- **Persuadables**: those who would buy the product if shown the ad, but not if not shown the ad.
We can generalize these conceptual groups beyond the marketing context to any treatment effect we might be interested in. So it's worthwhile to think about which aspects of your data could correspond to these groups. One of the additional goals in uplift modeling is to identify persuadables for additional treatment efforts, and to avoid wasting money on the lost causes. But to reach such goals, we have to think causally first!
:::{.callout-note title='Uplift Modeling in R and Python' collapse='true'}
There are more widely used tools for uplift modeling and meta-learners in Python than in R, but there are some options in R as well. In Python you can check out [causalml](https://causalml.readthedocs.io/en/latest/index.html) and [sci-kit uplift](https://www.uplift-modeling.com/en/v0.5.1/index.html) for some nice tutorials and documentation.
:::
### Meta-Learning {#sec-causal-meta}
[Meta-learners](https://arxiv.org/pdf/1706.03461.pdf) are used in machine learning contexts to assess potentially causal relationships between some treatment and an outcome. The core model can be of any kind you might want to use; the extra steps taken around it are what target the causal relationship. The most common types of meta-learners are:
- **S-learner** - **s**ingle model for both groups; predict the (counterfactual) difference as when all observations are treated vs when all are not, similar to our previous demonstrations of counterfactual predictions.
- **T-learner** - **t**wo models, one for each of the control and treatment groups respectively; get predictions as if all observations are 'treated' (i.e., using the treatment model) versus when all are 'control' (using the control model), and take the difference.
- **X-learner** - a more complicated modification to the T-learner using a multi-step approach.
- **R-learner** - also called (Double) Debiased ML. An approach that uses a **r**esidual-based model to adjust for the treatment effect[^fancypath].
[^fancypath]: As a simple overview, think of it this way with Y outcome, T treatment and X confounders/other features. Y and T are each regressed on X via some ML model, and the residuals from both are used in a subsequent model, $Y_{res} \sim T_{res}$, to estimate the treatment effect. Or, if you know how path analysis works, or even standard linear regression, it's pretty much just that with ML. As an exercise, start with a linear regression for the target on all features, then just do a linear regression for a chosen focal feature predicted by the non-focal features. Next, regress the target on the non-focal features. Finally, just do a linear regression with the residuals from both models. The resulting coefficient will be what you started with for the focal feature in the first regression.
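To make the residual-on-residual idea behind the R-learner a bit more concrete, here is a minimal sketch using only plain linear regressions and simulated data; the variable names and effect sizes are arbitrary.

```{r}
#| eval: false
#| label: r-fwl-sketch
set.seed(123)
n  = 1000
x1 = rnorm(n)
x2 = rnorm(n)
x_focal = .5 * x1 - .3 * x2 + rnorm(n)   # focal feature, related to the others
y  = .4 * x_focal + x1 + x2 + rnorm(n)   # target

# coefficient for the focal feature from the full regression
coef(lm(y ~ x_focal + x1 + x2))['x_focal']

# the same coefficient via residuals (the Frisch-Waugh-Lovell result)
res_focal = resid(lm(x_focal ~ x1 + x2))  # focal feature with the others partialed out
res_y     = resid(lm(y ~ x1 + x2))        # target with the others partialed out
coef(lm(res_y ~ res_focal))['res_focal']
```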
Some variants of these models exist also. As elsewhere, the key idea is to use the model to predict the potential outcomes of the treatment levels to estimate the causal effect. Most models traditionally used in a machine learning context, e.g., random forests, boosted trees, or neural networks, are not designed to accurately estimate causal effects, nor correctly estimate the uncertainty in those effects. Meta-learners attempt to address the issue with regard to the effect, but you'll typically still have your work cut out for you to understand the uncertainty in that effect.
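To make the basic mechanics concrete, here is a minimal T-learner sketch with simulated data, using plain linear models to stand in for whatever flexible ML model you might actually use; the variable names and effect sizes are arbitrary.

```{r}
#| eval: false
#| label: r-tlearner-sketch
set.seed(42)
n = 1000
x = rnorm(n)
treatment = rbinom(n, 1, .5)
y = .5 * treatment + .3 * x + .2 * treatment * x + rnorm(n)  # heterogeneous effect

d = data.frame(x = x, treatment = treatment, y = y)

# fit separate models to the treated and control groups
m_treat   = lm(y ~ x, data = d[d$treatment == 1, ])
m_control = lm(y ~ x, data = d[d$treatment == 0, ])

# predicted outcome for every observation under each condition
pred_treat   = predict(m_treat, newdata = d)
pred_control = predict(m_control, newdata = d)

# individual-level effect estimates and their average
cate = pred_treat - pred_control
mean(cate)  # estimated average treatment effect (roughly .5 here)
```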
:::{.callout-note title='Meta-Learners vs. Meta-Analysis' collapse='true'}
Meta-learners are not to be confused with **meta-analysis**, which is also related to understanding causal effects. Meta-analysis attempts to combine the results of multiple *studies* to get a better estimate of the true effect. The studies are typically conducted by different researchers and in different settings. The term **meta-learning** has also been used to refer to what is more commonly called **ensemble learning**, the approach used in random forests and boosting. It is also probably used by other people that don't bother to look things up before naming their technical terms.
:::
### Other models used for causal inference {#sec-causal-others}