Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

select() does not keep variable.labels attribute created by foreign functions #5831

Closed
stragu opened this issue Mar 30, 2021 · 8 comments
Closed
Assignees
Labels
bug an unexpected problem or unintended behavior

Comments

@stragu
Copy link

stragu commented Mar 30, 2021

I am not sure this is a dplyr problem per se, but thought I'd report it to see what you think and if I should take it somewhere else instead.

A dataframe that has column labels stored as a dataframe attribute variable.labels will lose them when using dplyr::select(). An example of such a dataframe is a labelled SPSS file imported with foreign::read.spss(). Other verbs in dplyr (as far as I have tested) do not lose the labels, which is why it is surprising and looks like a bug.

A workaround is to stick to Tidyverse packages and import the data with haven::read_sav(), which stores the label attribute per column.

But would it be possible for dplyr to cater for the first case, as the attribute comes from a base package (even though the square-bracket slicing doesn't even cater for it)? Or is it out of scope?

Here is a reproducible example:

library(foreign)
# import example SAV from foreign
sav <- system.file("files", "electric.sav", package = "foreign")
dat_for <- read.spss(file = sav, to.data.frame = TRUE)

# labels are saved as a variable.labels attribute for the whole dataframe
str(dat_for)
#> 'data.frame':    240 obs. of  13 variables:
#>  $ CASEID  : num  13 30 53 84 89 102 117 132 151 153 ...
#>  $ FIRSTCHD: Factor w/ 5 levels "NO CHD","SUDDEN  DEATH",..: 3 3 2 3 2 3 3 3 2 2 ...
#>  $ AGE     : num  40 49 43 50 43 50 45 47 53 49 ...
#>  $ DBP58   : num  70 87 89 105 110 88 70 79 102 99 ...
#>  $ EDUYR   : num  16 11 12 8 NA 8 NA 9 12 14 ...
#>  $ CHOL58  : num  321 246 262 275 301 261 212 372 216 251 ...
#>  $ CGT58   : num  0 60 0 15 25 30 0 30 0 10 ...
#>  $ HT58    : num  68.8 72.2 69 62.5 68 68 66.5 67 67 64.3 ...
#>  $ WT58    : num  190 204 162 152 148 142 196 193 172 162 ...
#>  $ DAYOFWK : Factor w/ 7 levels "SUNDAY","MONDAY",..: NA 5 7 4 2 1 NA 1 3 5 ...
#>  $ VITAL10 : Factor w/ 2 levels "ALIVE","DEAD": 1 1 2 1 2 2 1 1 2 2 ...
#>  $ FAMHXCVR: Factor w/ 2 levels "NO","YES": 2 1 1 2 1 1 1 1 1 2 ...
#>  $ CHD     : num  1 1 1 1 1 1 1 1 1 1 ...
#>  - attr(*, "variable.labels")= Named chr [1:13] "CASE IDENTIFICATION NUMBER" "FIRST CHD EVENT" "AGE AT ENTRY" "AVERAGE DIAST BLOOD PRESSURE 58" ...
#>   ..- attr(*, "names")= chr [1:13] "CASEID" "FIRSTCHD" "AGE" "DBP58" ...

# dplyr::select() loses the variable.labels attribute:
library(dplyr, warn.conflicts = FALSE)
dat_for %>% select(1:2) %>% str()
#> 'data.frame':    240 obs. of  2 variables:
#>  $ CASEID  : num  13 30 53 84 89 102 117 132 151 153 ...
#>  $ FIRSTCHD: Factor w/ 5 levels "NO CHD","SUDDEN  DEATH",..: 3 3 2 3 2 3 3 3 2 2 ...

# other verbs don't lose them, for example:
dat_for %>% mutate(test = "test") %>% str()
#> 'data.frame':    240 obs. of  14 variables:
#>  $ CASEID  : num  13 30 53 84 89 102 117 132 151 153 ...
#>  $ FIRSTCHD: Factor w/ 5 levels "NO CHD","SUDDEN  DEATH",..: 3 3 2 3 2 3 3 3 2 2 ...
#>  $ AGE     : num  40 49 43 50 43 50 45 47 53 49 ...
#>  $ DBP58   : num  70 87 89 105 110 88 70 79 102 99 ...
#>  $ EDUYR   : num  16 11 12 8 NA 8 NA 9 12 14 ...
#>  $ CHOL58  : num  321 246 262 275 301 261 212 372 216 251 ...
#>  $ CGT58   : num  0 60 0 15 25 30 0 30 0 10 ...
#>  $ HT58    : num  68.8 72.2 69 62.5 68 68 66.5 67 67 64.3 ...
#>  $ WT58    : num  190 204 162 152 148 142 196 193 172 162 ...
#>  $ DAYOFWK : Factor w/ 7 levels "SUNDAY","MONDAY",..: NA 5 7 4 2 1 NA 1 3 5 ...
#>  $ VITAL10 : Factor w/ 2 levels "ALIVE","DEAD": 1 1 2 1 2 2 1 1 2 2 ...
#>  $ FAMHXCVR: Factor w/ 2 levels "NO","YES": 2 1 1 2 1 1 1 1 1 2 ...
#>  $ CHD     : num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ test    : chr  "test" "test" "test" "test" ...
#>  - attr(*, "variable.labels")= Named chr [1:13] "CASE IDENTIFICATION NUMBER" "FIRST CHD EVENT" "AGE AT ENTRY" "AVERAGE DIAST BLOOD PRESSURE 58" ...
#>   ..- attr(*, "names")= chr [1:13] "CASEID" "FIRSTCHD" "AGE" "DBP58" ...
dat_for %>% filter(AGE > 50) %>% str()
#> 'data.frame':    77 obs. of  13 variables:
#>  $ CASEID  : num  151 161 178 314 344 349 429 460 467 482 ...
#>  $ FIRSTCHD: Factor w/ 5 levels "NO CHD","SUDDEN  DEATH",..: 2 5 3 3 2 3 3 3 2 2 ...
#>  $ AGE     : num  53 54 52 53 52 52 54 53 54 54 ...
#>  $ DBP58   : num  102 93 83 91 107 83 90 99 98 87 ...
#>  $ EDUYR   : num  12 9 9 7 NA 14 8 NA 8 11 ...
#>  $ CHOL58  : num  216 265 269 292 351 292 235 273 215 283 ...
#>  $ CGT58   : num  0 0 15 0 0 10 0 30 20 22 ...
#>  $ HT58    : num  67 65.5 68.8 60.9 68 64.4 74.9 67.3 67.4 68.9 ...
#>  $ WT58    : num  172 165 140 149 160 154 254 168 171 181 ...
#>  $ DAYOFWK : Factor w/ 7 levels "SUNDAY","MONDAY",..: 3 7 4 NA 6 5 6 4 4 2 ...
#>  $ VITAL10 : Factor w/ 2 levels "ALIVE","DEAD": 2 2 2 1 2 1 1 1 2 2 ...
#>  $ FAMHXCVR: Factor w/ 2 levels "NO","YES": 1 2 2 1 2 1 1 1 1 2 ...
#>  $ CHD     : num  1 1 1 1 1 1 1 1 1 1 ...
#>  - attr(*, "variable.labels")= Named chr [1:13] "CASE IDENTIFICATION NUMBER" "FIRST CHD EVENT" "AGE AT ENTRY" "AVERAGE DIAST BLOOD PRESSURE 58" ...
#>   ..- attr(*, "names")= chr [1:13] "CASEID" "FIRSTCHD" "AGE" "DBP58" ...
dat_for %>% rename(ages = AGE) %>% str()
#> 'data.frame':    240 obs. of  13 variables:
#>  $ CASEID  : num  13 30 53 84 89 102 117 132 151 153 ...
#>  $ FIRSTCHD: Factor w/ 5 levels "NO CHD","SUDDEN  DEATH",..: 3 3 2 3 2 3 3 3 2 2 ...
#>  $ ages    : num  40 49 43 50 43 50 45 47 53 49 ...
#>  $ DBP58   : num  70 87 89 105 110 88 70 79 102 99 ...
#>  $ EDUYR   : num  16 11 12 8 NA 8 NA 9 12 14 ...
#>  $ CHOL58  : num  321 246 262 275 301 261 212 372 216 251 ...
#>  $ CGT58   : num  0 60 0 15 25 30 0 30 0 10 ...
#>  $ HT58    : num  68.8 72.2 69 62.5 68 68 66.5 67 67 64.3 ...
#>  $ WT58    : num  190 204 162 152 148 142 196 193 172 162 ...
#>  $ DAYOFWK : Factor w/ 7 levels "SUNDAY","MONDAY",..: NA 5 7 4 2 1 NA 1 3 5 ...
#>  $ VITAL10 : Factor w/ 2 levels "ALIVE","DEAD": 1 1 2 1 2 2 1 1 2 2 ...
#>  $ FAMHXCVR: Factor w/ 2 levels "NO","YES": 2 1 1 2 1 1 1 1 1 2 ...
#>  $ CHD     : num  1 1 1 1 1 1 1 1 1 1 ...
#>  - attr(*, "variable.labels")= Named chr [1:13] "CASE IDENTIFICATION NUMBER" "FIRST CHD EVENT" "AGE AT ENTRY" "AVERAGE DIAST BLOOD PRESSURE 58" ...
#>   ..- attr(*, "names")= chr [1:13] "CASEID" "FIRSTCHD" "AGE" "DBP58" ...

# but to be fair, base slicing also loses the attribute:
dat_for[,1:2] %>% str()
#> 'data.frame':    240 obs. of  2 variables:
#>  $ CASEID  : num  13 30 53 84 89 102 117 132 151 153 ...
#>  $ FIRSTCHD: Factor w/ 5 levels "NO CHD","SUDDEN  DEATH",..: 3 3 2 3 2 3 3 3 2 2 ...

# import same file with haven
library(haven)
dat_hav <- read_sav(file = sav)

# labels are attributes of columns
str(dat_hav$CASEID)
#>  num [1:240] 13 30 53 84 89 102 117 132 151 153 ...
#>  - attr(*, "label")= chr "CASE IDENTIFICATION NUMBER"
#>  - attr(*, "format.spss")= chr "F4.0"
#>  - attr(*, "display_width")= int 0

# dplyr::select() keeps them
dat_hav %>% select(1:2) %>% str()
#> tibble[,2] [240 × 2] (S3: tbl_df/tbl/data.frame)
#>  $ CASEID  : num [1:240] 13 30 53 84 89 102 117 132 151 153 ...
#>   ..- attr(*, "label")= chr "CASE IDENTIFICATION NUMBER"
#>   ..- attr(*, "format.spss")= chr "F4.0"
#>   ..- attr(*, "display_width")= int 0
#>  $ FIRSTCHD: dbl+lbl [1:240] 3, 3, 2, 3, 2, 3, 3, 3, 2, 2, 6, 2, 3, 5, 3, 3, 3, 3, ...
#>    ..@ label        : chr "FIRST CHD EVENT"
#>    ..@ format.spss  : chr "F1.0"
#>    ..@ display_width: int 0
#>    ..@ labels       : Named num [1:5] 1 2 3 5 6
#>    .. ..- attr(*, "names")= chr [1:5] "NO CHD" "SUDDEN  DEATH" "NONFATALMI" "FATAL   MI" ...
#>  - attr(*, "label")= chr "                       SPSS/PC+"

Created on 2021-03-30 by the reprex package (v1.0.0)

using:

  • R 4.0.4
  • dplyr 1.0.5
  • foreign 0.8-81
  • haven 2.3.1
@romainfrancois
Copy link
Member

🤔 actually, mutate() only keeps them in the .keep = all case:

# other verbs don't lose them, for example:
dat_for %>% mutate(test = "test", .keep = "all") %>% str()
#> 'data.frame':    240 obs. of  14 variables:
#>  $ CASEID  : num  13 30 53 84 89 102 117 132 151 153 ...
#>  $ FIRSTCHD: Factor w/ 5 levels "NO CHD","SUDDEN  DEATH",..: 3 3 2 3 2 3 3 3 2 2 ...
#>  $ AGE     : num  40 49 43 50 43 50 45 47 53 49 ...
#>  $ DBP58   : num  70 87 89 105 110 88 70 79 102 99 ...
#>  $ EDUYR   : num  16 11 12 8 NA 8 NA 9 12 14 ...
#>  $ CHOL58  : num  321 246 262 275 301 261 212 372 216 251 ...
#>  $ CGT58   : num  0 60 0 15 25 30 0 30 0 10 ...
#>  $ HT58    : num  68.8 72.2 69 62.5 68 68 66.5 67 67 64.3 ...
#>  $ WT58    : num  190 204 162 152 148 142 196 193 172 162 ...
#>  $ DAYOFWK : Factor w/ 7 levels "SUNDAY","MONDAY",..: NA 5 7 4 2 1 NA 1 3 5 ...
#>  $ VITAL10 : Factor w/ 2 levels "ALIVE","DEAD": 1 1 2 1 2 2 1 1 2 2 ...
#>  $ FAMHXCVR: Factor w/ 2 levels "NO","YES": 2 1 1 2 1 1 1 1 1 2 ...
#>  $ CHD     : num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ test    : chr  "test" "test" "test" "test" ...
#>  - attr(*, "variable.labels")= Named chr [1:13] "CASE IDENTIFICATION NUMBER" "FIRST CHD EVENT" "AGE AT ENTRY" "AVERAGE DIAST BLOOD PRESSURE 58" ...
#>   ..- attr(*, "names")= chr [1:13] "CASEID" "FIRSTCHD" "AGE" "DBP58" ...

# however they're not kept in these cases:
dat_for %>% mutate(test = "test", .keep = "used") %>% str()
#> 'data.frame':    240 obs. of  1 variable:
#>  $ test: chr  "test" "test" "test" "test" ...
dat_for %>% mutate(test = "test", .keep = "unused") %>% str()
#> 'data.frame':    240 obs. of  14 variables:
#>  $ CASEID  : num  13 30 53 84 89 102 117 132 151 153 ...
#>  $ FIRSTCHD: Factor w/ 5 levels "NO CHD","SUDDEN  DEATH",..: 3 3 2 3 2 3 3 3 2 2 ...
#>  $ AGE     : num  40 49 43 50 43 50 45 47 53 49 ...
#>  $ DBP58   : num  70 87 89 105 110 88 70 79 102 99 ...
#>  $ EDUYR   : num  16 11 12 8 NA 8 NA 9 12 14 ...
#>  $ CHOL58  : num  321 246 262 275 301 261 212 372 216 251 ...
#>  $ CGT58   : num  0 60 0 15 25 30 0 30 0 10 ...
#>  $ HT58    : num  68.8 72.2 69 62.5 68 68 66.5 67 67 64.3 ...
#>  $ WT58    : num  190 204 162 152 148 142 196 193 172 162 ...
#>  $ DAYOFWK : Factor w/ 7 levels "SUNDAY","MONDAY",..: NA 5 7 4 2 1 NA 1 3 5 ...
#>  $ VITAL10 : Factor w/ 2 levels "ALIVE","DEAD": 1 1 2 1 2 2 1 1 2 2 ...
#>  $ FAMHXCVR: Factor w/ 2 levels "NO","YES": 2 1 1 2 1 1 1 1 1 2 ...
#>  $ CHD     : num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ test    : chr  "test" "test" "test" "test" ...
dat_for %>% mutate(test = "test", .keep = "none") %>% str()
#> 'data.frame':    240 obs. of  1 variable:
#>  $ test: chr  "test" "test" "test" "test" ...

Created on 2021-03-30 by the reprex package (v0.3.0)

@romainfrancois romainfrancois added the bug an unexpected problem or unintended behavior label Mar 30, 2021
@romainfrancois romainfrancois self-assigned this Mar 30, 2021
@stragu
Copy link
Author

stragu commented Mar 31, 2021

Thanks, @romainfrancois

It also disappears when moving the mutated columns:

library(foreign); library(dplyr, warn.conflicts = FALSE)
#> Warning: package 'dplyr' was built under R version 4.0.4
sav <- system.file("files", "electric.sav", package = "foreign")
dat_for <- read.spss(file = sav, to.data.frame = TRUE)
# mutate and move
dat_for %>% mutate(test = "test", .before = 1) %>% attributes() %>% names()
#> [1] "names"     "row.names" "class"

Created on 2021-03-31 by the reprex package (v1.0.0)

@hadley
Copy link
Member

hadley commented Mar 31, 2021

It's worth noting that we are consistent with base R here:

library(foreign)
sav <- system.file("files", "electric.sav", package = "foreign")
dat_for <- read.spss(file = sav, to.data.frame = TRUE)
attr(dat_for, "variable.labels")
#>                                 CASEID                               FIRSTCHD 
#>           "CASE IDENTIFICATION NUMBER"                      "FIRST CHD EVENT" 
#>                                    AGE                                  DBP58 
#>                         "AGE AT ENTRY"      "AVERAGE DIAST BLOOD PRESSURE 58" 
#>                                  EDUYR                                 CHOL58 
#>                   "YEARS OF EDUCATION"    "SERUM CHOLESTEROL 58 -- MG PER DL" 
#>                                  CGT58                                   HT58 
#>     "NO OF CIGARETTES PER DAY IN 1958" "STATURE, 1958 -- TO NEAREST 0.1 INCH" 
#>                                   WT58                                DAYOFWK 
#>             "BODY WEIGHT, 1958 -- LBS"                         "DAY OF DEATH" 
#>                                VITAL10                               FAMHXCVR 
#>                  "STATUS AT TEN YEARS"                "FAMILY HISTORY OF CHD" 
#>                                    CHD 
#>  "INCIDENCE OF CORONARY HEART DISEASE"
attr(dat_for[1:2], "variable.labels")
#> NULL

Created on 2021-03-31 by the reprex package (v1.0.0)

@hadley
Copy link
Member

hadley commented Mar 31, 2021

And you don't have this problem with haven because it attaches the labels to the individual variables:

sav <- system.file("files", "electric.sav", package = "foreign")

df <- haven::read_spss(sav)
attr(df[[1]], "label")
#> [1] "CASE IDENTIFICATION NUMBER"

Created on 2021-03-31 by the reprex package (v1.0.0)

@hadley
Copy link
Member

hadley commented Mar 31, 2021

If we do implement attribute preserving, we'll need to document that we're now inconsistent with base R.

@stragu
Copy link
Author

stragu commented Mar 31, 2021

Thanks, @hadley

The consistence with base was indeed what I was highlighting in my (lengthy) reprex. Should this instead be reported to R Core, given that foreign is part of it?

I guess the main issue for dplyr is the inconsistency in how it preserves the labels or not depending on the verb used.

@hadley
Copy link
Member

hadley commented Apr 1, 2021

The problem is there are two types of consistency at play here:

  • internally consistency (i.e. should all dplyr functions treat attributes the same way)
  • external consistency (i.e. should dplyr treat attributes the same was a base R)

It's not possible to have both, and you're basically asking to switch from external consistency to internal consistency. I think that's reasonable, but we have had this discussion before so we need to be careful that we not just flip-flopping between the two as different folks file issues and we forget the original context.

The implementation is straightforward; we just need to make sure that this is a net change for the better.

@DavisVaughan
Copy link
Member

Related issue #5294 - which also mentions that dplyr_col_select() should use dplyr_reconstruct() #5294 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug an unexpected problem or unintended behavior
Projects
None yet
Development

No branches or pull requests

4 participants