Figure 20.8 not working #51

Open
ale-ch opened this issue May 16, 2021 · 2 comments

Comments


ale-ch commented May 16, 2021

The following code:

set.seed(123)

fviz_nbclust(
  ames_1hot_scaled, 
  kmeans, 
  method = "wss", 
  k.max = 25, 
  verbose = FALSE
)

Returns:

Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

My environment is:
R version 4.0.5 (2021-03-31)
factoextra 1.0.7
AmesHousing 0.0.4
caret 6.0.86
dplyr 1.0.5

@panjunchang

I think the reason is that the scaling step produced NaN values (I don't know why it does this).
Before running this code, run the following:
ames_1hot_scaled[, "Neighborhood.Hayden_Lake"] <- 0

Then it will run well.
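
A more general version of this workaround, sketched here on a toy matrix rather than the real `ames_1hot_scaled`, is to find every column that contains NA/NaN after scaling and drop it before clustering. The object and column names below are hypothetical and only stand in for the one-hot matrix.

```r
# Toy stand-in for a one-hot matrix that has one all-zero column
toy <- cbind(
  a           = c(1, 2, 3, 4),
  empty_level = c(0, 0, 0, 0),
  b           = c(0, 1, 0, 1)
)
toy_scaled <- scale(toy)

# Names of columns that contain at least one NA/NaN after scaling
bad_cols <- colnames(toy_scaled)[colSums(is.na(toy_scaled)) > 0]
bad_cols

# Drop those columns so kmeans() only receives finite values
keep_cols <- setdiff(colnames(toy_scaled), bad_cols)
toy_clean <- toy_scaled[, keep_cols, drop = FALSE]

set.seed(123)
fit <- kmeans(toy_clean, centers = 2)
```

Dropping the column and setting it to 0 should be equivalent for k-means, since a constant column contributes nothing to the distances between observations.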

Member

bradleyboehmke commented May 10, 2022

Apparently the AmesHousing package was updated from v0.0.3 to v0.0.4, and the Neighborhood variable now contains a "Hayden_Lake" factor level, but there are no observations for that neighborhood when using AmesHousing::make_ames() (see last bullet in this NEWS.md file).

# Hayden_Lake shows up as a level
levels(ames_full[["Neighborhood"]])
##  [1] "North_Ames"                              "College_Creek"
##  [3] "Old_Town"                                "Edwards"
##  [5] "Somerset"                                "Northridge_Heights"
##  [7] "Gilbert"                                 "Sawyer"
##  [9] "Northwest_Ames"                          "Sawyer_West"
## [11] "Mitchell"                                "Brookside"
## [13] "Crawford"                                "Iowa_DOT_and_Rail_Road"
## [15] "Timberland"                              "Northridge"
## [17] "Stone_Brook"                             "South_and_West_of_Iowa_State_University"
## [19] "Clear_Creek"                             "Meadow_Village"
## [21] "Briardale"                               "Bloomington_Heights"
## [23] "Veenker"                                 "Northpark_Villa"
## [25] "Blueste"                                 "Greens"
## [27] "Green_Hills"                             "Landmark"
## [29] "Hayden_Lake"

# But there are no observations for that level
as_tibble(ames_1hot) %>% 
  select(Neighborhood.Hayden_Lake) %>% 
  distinct()
## # A tibble: 1 × 1
## Neighborhood.Hayden_Lake
##                    <dbl>
## 1                      0

Consequently, when you one-hot encode that variable, the Neighborhood.Hayden_Lake column is filled with zeros. Scaling that column divides by a standard deviation of zero, so you get NaNs:

as_tibble(ames_1hot_scaled) %>% select(Neighborhood.Hayden_Lake)
## # A tibble: 2,930 × 1
##    Neighborhood.Hayden_Lake
##                       <dbl>
##  1                      NaN
##  2                      NaN
##  3                      NaN
##  4                      NaN
##  5                      NaN
##  6                      NaN
##  7                      NaN
##  8                      NaN
##  9                      NaN
## 10                      NaN
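
The NaNs follow from how standardization works: `scale()` subtracts each column's mean and divides by its standard deviation, and an all-zero column has a standard deviation of zero, so every entry becomes 0/0. A minimal sketch on a made-up toy matrix (the column names are illustrative, not taken from the Ames data):

```r
# Toy matrix: one constant all-zero column, one column with variation
toy <- cbind(
  all_zero = c(0, 0, 0, 0),
  varying  = c(1, 2, 3, 4)
)

toy_scaled <- scale(toy)

# The constant column has standard deviation 0, so scaling computes 0 / 0
sd(toy[, "all_zero"])
all(is.nan(toy_scaled[, "all_zero"]))

# The varying column scales normally and contains no missing values
anyNA(toy_scaled[, "varying"])
```

Any constant column behaves this way, not only a column of zeros, so the same check can flag other features that carry no variation.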

If we coerce the factor columns to character prior to one-hot encoding, the empty Hayden_Lake level is no longer encoded and it works as illustrated in the book:

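# Recode ordinal variables to numeric, then coerce the remaining factors to
# character so that levels with no observations are not encoded
ames_full <- AmesHousing::make_ames() %>%
  mutate_if(str_detect(names(.), 'Qual|Cond|QC|Qu'), as.numeric) %>%
  mutate_if(is.factor, as.character)

# One-hot encode the features (Sale_Price is the response and is excluded)
full_rank <- caret::dummyVars(Sale_Price ~ ., data = ames_full, fullRank = TRUE)
ames_1hot <- predict(full_rank, ames_full)

# Re-scale the newly encoded features before checking the dimensions
ames_1hot_scaled <- scale(ames_1hot)
dim(ames_1hot_scaled)
## [1] 2930  240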
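
An alternative that keeps the columns as factors is to remove unused levels with base R's `droplevels()` before encoding. This is only a sketch on a toy data frame and has not been checked against the book's full pipeline; `model.matrix()` stands in for `caret::dummyVars()` here to keep the example free of dependencies, and the `toy`, `price`, and `hood` names are hypothetical.

```r
# Toy factor with a declared level ("Hayden_Lake") that has no observations
toy <- data.frame(
  price = c(100, 150, 120, 180),
  hood  = factor(
    c("Edwards", "Gilbert", "Edwards", "Gilbert"),
    levels = c("Edwards", "Gilbert", "Hayden_Lake")
  )
)

# Encoding as-is creates a dummy column for the unused level
cols_before <- colnames(model.matrix(price ~ hood, data = toy))
cols_before

# droplevels() removes levels with zero observations from every factor column
toy_clean <- droplevels(toy)
levels(toy_clean$hood)

# After dropping, the empty level no longer gets a dummy column
cols_after <- colnames(model.matrix(price ~ hood, data = toy_clean))
cols_after
```

Compared with coercing to character, this keeps the original ordering of the remaining levels, which may matter if later steps rely on the factor's reference level.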
