-
Notifications
You must be signed in to change notification settings - Fork 51
Order of variables in by
matters?
#1373
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
If we switch the order of columns in the data grid, results are exactly opposite: set.seed(123)
n <- 200
d <- data.frame(
score = rnorm(n),
grp = as.factor(sample(c("treatment", "control"), n, TRUE)),
time = as.factor(sample(1:3, n, TRUE))
)
model2 <- lm(score ~ grp * time, data = d)
grid <- insight::get_datagrid(model2, c("grp", "time"), factors = "all")
marginaleffects::avg_predictions(
model2,
newdata = grid,
by = c("time", "grp"),
hypothesis = ~pairwise | time
)
#>
#> time Hypothesis Estimate Std. Error z Pr(>|z|) S
#> 1 (1 treatment) - (1 control) -0.4183 0.233 -1.797 0.0724 3.8
#> 2 (2 treatment) - (2 control) -0.0554 0.238 -0.233 0.8161 0.3
#> 3 (3 treatment) - (3 control) -0.0503 0.222 -0.226 0.8212 0.3
#> 2.5 % 97.5 %
#> -0.875 0.038
#> -0.522 0.411
#> -0.486 0.386
#>
#> Type: response
marginaleffects::avg_predictions(
model2,
newdata = grid,
by = c("grp", "time"),
hypothesis = ~pairwise | time
)
#>
#> [1] time Hypothesis Estimate Std. Error z Pr(>|z|) S
#> [8] 2.5 % 97.5 %
#> <0 rows> (or 0-length row.names)
#>
#> Type: response Created on 2025-01-28 with reprex v2.1.1 |
Yes, this is expected.
|
I can understand that the sign will differ, but I wouldn't expect that one approach returns no results at all? |
Or that the column order in the data grid is responsible whether you get a result or not? |
To be clear: the rational would be that you get only a result if the variables in |
Another example: The same function call with literally the same data grid produces no output in one situation, and output in the other situation. I doubt this is expected, else it should be very clearly documented how the column order in the data grid affects computation: library(marginaleffects)
# data
dat <- get_dataset("thornton")
dat$incentive <- as.factor(dat$incentive)
dat$hiv2004 <- as.factor(dat$hiv2004)
# model
mod <- glm(
outcome ~ incentive * agecat,
data = dat,
family = binomial
)
# data grid
grid <- marginaleffects::datagrid(model = mod, by = c("incentive", "agecat"))
# column order
head(grid)
#> incentive agecat
#> 1 0 <18
#> 2 1 <18
#> 3 0 18 to 35
#> 4 1 18 to 35
#> 5 0 >35
#> 6 1 >35
# no results / output!
marginaleffects::avg_predictions(
mod,
by = c("incentive", "agecat"),
newdata = grid,
hypothesis = ~ pairwise | agecat
)
#>
#> [1] agecat Hypothesis Estimate Std. Error z Pr(>|z|) S
#> [8] 2.5 % 97.5 %
#> <0 rows> (or 0-length row.names)
#>
#> Type: response
# change order of data grid columns
grid <- marginaleffects::datagrid(model = mod, by = c("agecat", "incentive"))
# column order, essentially same datagrid
head(grid)
#> agecat incentive
#> 1 <18 0
#> 2 18 to 35 0
#> 3 >35 0
#> 4 <18 1
#> 5 18 to 35 1
#> 6 >35 1
# results / outout!
marginaleffects::avg_predictions(
mod,
by = c("incentive", "agecat"),
newdata = grid,
hypothesis = ~ pairwise | agecat
)
#>
#> agecat Hypothesis Estimate Std. Error z Pr(>|z|) S
#> <18 (1 <18) - (0 <18) 0.474 0.0609 7.79 <0.001 47.1
#> 18 to 35 (1 18 to 35) - (0 18 to 35) 0.458 0.0293 15.65 <0.001 181.0
#> >35 (1 >35) - (0 >35) 0.432 0.0341 12.66 <0.001 119.6
#> 2.5 % 97.5 %
#> 0.355 0.593
#> 0.401 0.515
#> 0.365 0.499
#>
#> Type: response Created on 2025-01-29 with reprex v2.1.1 |
Can you try that again with latest |
I would expect that both function calls return the same output (the second one), and that the order of variables in
by
does not matter, or am I missing something?Created on 2025-01-27 with reprex v2.1.1
The text was updated successfully, but these errors were encountered: