-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
select() does not keep variable.labels attribute created by foreign functions #5831
Comments
🤔 actually, # other verbs don't lose them, for example:
dat_for %>% mutate(test = "test", .keep = "all") %>% str()
#> 'data.frame': 240 obs. of 14 variables:
#> $ CASEID : num 13 30 53 84 89 102 117 132 151 153 ...
#> $ FIRSTCHD: Factor w/ 5 levels "NO CHD","SUDDEN DEATH",..: 3 3 2 3 2 3 3 3 2 2 ...
#> $ AGE : num 40 49 43 50 43 50 45 47 53 49 ...
#> $ DBP58 : num 70 87 89 105 110 88 70 79 102 99 ...
#> $ EDUYR : num 16 11 12 8 NA 8 NA 9 12 14 ...
#> $ CHOL58 : num 321 246 262 275 301 261 212 372 216 251 ...
#> $ CGT58 : num 0 60 0 15 25 30 0 30 0 10 ...
#> $ HT58 : num 68.8 72.2 69 62.5 68 68 66.5 67 67 64.3 ...
#> $ WT58 : num 190 204 162 152 148 142 196 193 172 162 ...
#> $ DAYOFWK : Factor w/ 7 levels "SUNDAY","MONDAY",..: NA 5 7 4 2 1 NA 1 3 5 ...
#> $ VITAL10 : Factor w/ 2 levels "ALIVE","DEAD": 1 1 2 1 2 2 1 1 2 2 ...
#> $ FAMHXCVR: Factor w/ 2 levels "NO","YES": 2 1 1 2 1 1 1 1 1 2 ...
#> $ CHD : num 1 1 1 1 1 1 1 1 1 1 ...
#> $ test : chr "test" "test" "test" "test" ...
#> - attr(*, "variable.labels")= Named chr [1:13] "CASE IDENTIFICATION NUMBER" "FIRST CHD EVENT" "AGE AT ENTRY" "AVERAGE DIAST BLOOD PRESSURE 58" ...
#> ..- attr(*, "names")= chr [1:13] "CASEID" "FIRSTCHD" "AGE" "DBP58" ...
# however they're not kept in these cases:
dat_for %>% mutate(test = "test", .keep = "used") %>% str()
#> 'data.frame': 240 obs. of 1 variable:
#> $ test: chr "test" "test" "test" "test" ...
dat_for %>% mutate(test = "test", .keep = "unused") %>% str()
#> 'data.frame': 240 obs. of 14 variables:
#> $ CASEID : num 13 30 53 84 89 102 117 132 151 153 ...
#> $ FIRSTCHD: Factor w/ 5 levels "NO CHD","SUDDEN DEATH",..: 3 3 2 3 2 3 3 3 2 2 ...
#> $ AGE : num 40 49 43 50 43 50 45 47 53 49 ...
#> $ DBP58 : num 70 87 89 105 110 88 70 79 102 99 ...
#> $ EDUYR : num 16 11 12 8 NA 8 NA 9 12 14 ...
#> $ CHOL58 : num 321 246 262 275 301 261 212 372 216 251 ...
#> $ CGT58 : num 0 60 0 15 25 30 0 30 0 10 ...
#> $ HT58 : num 68.8 72.2 69 62.5 68 68 66.5 67 67 64.3 ...
#> $ WT58 : num 190 204 162 152 148 142 196 193 172 162 ...
#> $ DAYOFWK : Factor w/ 7 levels "SUNDAY","MONDAY",..: NA 5 7 4 2 1 NA 1 3 5 ...
#> $ VITAL10 : Factor w/ 2 levels "ALIVE","DEAD": 1 1 2 1 2 2 1 1 2 2 ...
#> $ FAMHXCVR: Factor w/ 2 levels "NO","YES": 2 1 1 2 1 1 1 1 1 2 ...
#> $ CHD : num 1 1 1 1 1 1 1 1 1 1 ...
#> $ test : chr "test" "test" "test" "test" ...
dat_for %>% mutate(test = "test", .keep = "none") %>% str()
#> 'data.frame': 240 obs. of 1 variable:
#> $ test: chr "test" "test" "test" "test" ... Created on 2021-03-30 by the reprex package (v0.3.0) |
Thanks, @romainfrancois It also disappears when moving the mutated columns: library(foreign); library(dplyr, warn.conflicts = FALSE)
#> Warning: package 'dplyr' was built under R version 4.0.4
sav <- system.file("files", "electric.sav", package = "foreign")
dat_for <- read.spss(file = sav, to.data.frame = TRUE)
# mutate and move
dat_for %>% mutate(test = "test", .before = 1) %>% attributes() %>% names()
#> [1] "names" "row.names" "class" Created on 2021-03-31 by the reprex package (v1.0.0) |
It's worth noting that we are consistent with base R here: library(foreign)
sav <- system.file("files", "electric.sav", package = "foreign")
dat_for <- read.spss(file = sav, to.data.frame = TRUE)
attr(dat_for, "variable.labels")
#> CASEID FIRSTCHD
#> "CASE IDENTIFICATION NUMBER" "FIRST CHD EVENT"
#> AGE DBP58
#> "AGE AT ENTRY" "AVERAGE DIAST BLOOD PRESSURE 58"
#> EDUYR CHOL58
#> "YEARS OF EDUCATION" "SERUM CHOLESTEROL 58 -- MG PER DL"
#> CGT58 HT58
#> "NO OF CIGARETTES PER DAY IN 1958" "STATURE, 1958 -- TO NEAREST 0.1 INCH"
#> WT58 DAYOFWK
#> "BODY WEIGHT, 1958 -- LBS" "DAY OF DEATH"
#> VITAL10 FAMHXCVR
#> "STATUS AT TEN YEARS" "FAMILY HISTORY OF CHD"
#> CHD
#> "INCIDENCE OF CORONARY HEART DISEASE"
attr(dat_for[1:2], "variable.labels")
#> NULL Created on 2021-03-31 by the reprex package (v1.0.0) |
And you don't have this problem with haven because it attaches the labels to the individual variables: sav <- system.file("files", "electric.sav", package = "foreign")
df <- haven::read_spss(sav)
attr(df[[1]], "label")
#> [1] "CASE IDENTIFICATION NUMBER" Created on 2021-03-31 by the reprex package (v1.0.0) |
If we do implement attribute preserving, we'll need to document that we're now inconsistent with base R. |
Thanks, @hadley The consistence with base was indeed what I was highlighting in my (lengthy) reprex. Should this instead be reported to R Core, given that foreign is part of it? I guess the main issue for dplyr is the inconsistency in how it preserves the labels or not depending on the verb used. |
The problem is there are two types of consistency at play here:
It's not possible to have both, and you're basically asking to switch from external consistency to internal consistency. I think that's reasonable, but we have had this discussion before so we need to be careful that we not just flip-flopping between the two as different folks file issues and we forget the original context. The implementation is straightforward; we just need to make sure that this is a net change for the better. |
Related issue #5294 - which also mentions that |
I am not sure this is a dplyr problem per se, but thought I'd report it to see what you think and if I should take it somewhere else instead.
A dataframe that has column labels stored as a dataframe attribute
variable.labels
will lose them when usingdplyr::select()
. An example of such a dataframe is a labelled SPSS file imported withforeign::read.spss()
. Other verbs in dplyr (as far as I have tested) do not lose the labels, which is why it is surprising and looks like a bug.A workaround is to stick to Tidyverse packages and import the data with
haven::read_sav()
, which stores thelabel
attribute per column.But would it be possible for dplyr to cater for the first case, as the attribute comes from a base package (even though the square-bracket slicing doesn't even cater for it)? Or is it out of scope?
Here is a reproducible example:
Created on 2021-03-30 by the reprex package (v1.0.0)
using:
The text was updated successfully, but these errors were encountered: