Skip to content

Order of variables in by matters? #1373

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
strengejacke opened this issue Jan 27, 2025 · 7 comments
Closed

Order of variables in by matters? #1373

strengejacke opened this issue Jan 27, 2025 · 7 comments

Comments

@strengejacke
Copy link
Contributor

I would expect that both function calls return the same output (the second one), and that the order of variables in by does not matter, or am I missing something?

set.seed(123)
n <- 200
d <- data.frame(
  score = rnorm(n),
  grp = as.factor(sample(c("treatment", "control"), n, TRUE)),
  time = as.factor(sample(1:3, n, TRUE))
)

model2 <- lm(score ~ grp * time, data = d)
grid <- insight::get_datagrid(model2, c("time", "grp"), factors = "all")

marginaleffects::avg_predictions(
  model2,
  newdata = grid,
  by = c("time", "grp"),
  hypothesis = ~pairwise | time
)
#> 
#> [1] time       Hypothesis Estimate   Std. Error z          Pr(>|z|)   S         
#> [8] 2.5 %      97.5 %    
#> <0 rows> (or 0-length row.names)
#> 
#> Type:  response


marginaleffects::avg_predictions(
  model2,
  newdata = grid,
  by = c("grp", "time"),
  hypothesis = ~pairwise | time
)
#> 
#>  time                  Hypothesis Estimate Std. Error      z Pr(>|z|)   S
#>     1 (treatment 1) - (control 1)  -0.4183      0.233 -1.797   0.0724 3.8
#>     2 (treatment 2) - (control 2)  -0.0554      0.238 -0.233   0.8161 0.3
#>     3 (treatment 3) - (control 3)  -0.0503      0.222 -0.226   0.8212 0.3
#>   2.5 % 97.5 %
#>  -0.875  0.038
#>  -0.522  0.411
#>  -0.486  0.386
#> 
#> Type:  response

Created on 2025-01-27 with reprex v2.1.1

@strengejacke
Copy link
Contributor Author

If we switch the order of columns in the data grid, results are exactly opposite:

set.seed(123)
n <- 200
d <- data.frame(
  score = rnorm(n),
  grp = as.factor(sample(c("treatment", "control"), n, TRUE)),
  time = as.factor(sample(1:3, n, TRUE))
)

model2 <- lm(score ~ grp * time, data = d)

grid <- insight::get_datagrid(model2, c("grp", "time"), factors = "all")
marginaleffects::avg_predictions(
  model2,
  newdata = grid,
  by = c("time", "grp"),
  hypothesis = ~pairwise | time
)
#> 
#>  time                  Hypothesis Estimate Std. Error      z Pr(>|z|)   S
#>     1 (1 treatment) - (1 control)  -0.4183      0.233 -1.797   0.0724 3.8
#>     2 (2 treatment) - (2 control)  -0.0554      0.238 -0.233   0.8161 0.3
#>     3 (3 treatment) - (3 control)  -0.0503      0.222 -0.226   0.8212 0.3
#>   2.5 % 97.5 %
#>  -0.875  0.038
#>  -0.522  0.411
#>  -0.486  0.386
#> 
#> Type:  response

marginaleffects::avg_predictions(
  model2,
  newdata = grid,
  by = c("grp", "time"),
  hypothesis = ~pairwise | time
)
#> 
#> [1] time       Hypothesis Estimate   Std. Error z          Pr(>|z|)   S         
#> [8] 2.5 %      97.5 %    
#> <0 rows> (or 0-length row.names)
#> 
#> Type:  response

Created on 2025-01-28 with reprex v2.1.1

@vincentarelbundock
Copy link
Owner

Yes, this is expected.

pairwise takes the outer product of estimates as they arrive, and returns the upper triangular elements. Similarly, reference takes the difference between each estimate and the first estimate. In both cases, the order matters.

@strengejacke
Copy link
Contributor Author

I can understand that the sign will differ, but I wouldn't expect that one approach returns no results at all?

@strengejacke
Copy link
Contributor Author

Or that the column order in the data grid is responsible whether you get a result or not?

@strengejacke
Copy link
Contributor Author

To be clear: the rational would be that you get only a result if the variables in by have the opposite order as columns in the data grid?

@strengejacke
Copy link
Contributor Author

Another example: The same function call with literally the same data grid produces no output in one situation, and output in the other situation. I doubt this is expected, else it should be very clearly documented how the column order in the data grid affects computation:

library(marginaleffects)

# data
dat <- get_dataset("thornton")
dat$incentive <- as.factor(dat$incentive)
dat$hiv2004 <- as.factor(dat$hiv2004)

# model
mod <- glm(
    outcome ~ incentive * agecat,
    data = dat,
    family = binomial
)

# data grid
grid <- marginaleffects::datagrid(model = mod, by = c("incentive", "agecat"))

# column order
head(grid)
#>   incentive   agecat
#> 1         0      <18
#> 2         1      <18
#> 3         0 18 to 35
#> 4         1 18 to 35
#> 5         0      >35
#> 6         1      >35

# no results / output!
marginaleffects::avg_predictions(
    mod,
    by = c("incentive", "agecat"),
    newdata = grid,
    hypothesis = ~ pairwise | agecat
)
#> 
#> [1] agecat     Hypothesis Estimate   Std. Error z          Pr(>|z|)   S         
#> [8] 2.5 %      97.5 %    
#> <0 rows> (or 0-length row.names)
#> 
#> Type:  response

# change order of data grid columns
grid <- marginaleffects::datagrid(model = mod, by = c("agecat", "incentive"))

# column order, essentially same datagrid
head(grid)
#>     agecat incentive
#> 1      <18         0
#> 2 18 to 35         0
#> 3      >35         0
#> 4      <18         1
#> 5 18 to 35         1
#> 6      >35         1

# results / outout!
marginaleffects::avg_predictions(
    mod,
    by = c("incentive", "agecat"),
    newdata = grid,
    hypothesis = ~ pairwise | agecat
)
#> 
#>    agecat                  Hypothesis Estimate Std. Error     z Pr(>|z|)     S
#>  <18      (1 <18) - (0 <18)              0.474     0.0609  7.79   <0.001  47.1
#>  18 to 35 (1 18 to 35) - (0 18 to 35)    0.458     0.0293 15.65   <0.001 181.0
#>  >35      (1 >35) - (0 >35)              0.432     0.0341 12.66   <0.001 119.6
#>  2.5 % 97.5 %
#>  0.355  0.593
#>  0.401  0.515
#>  0.365  0.499
#> 
#> Type:  response

Created on 2025-01-29 with reprex v2.1.1

@vincentarelbundock
Copy link
Owner

Can you try that again with latest main?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants