Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lab 06 #22

Open
sunaynagoel opened this issue Sep 30, 2019 · 44 comments
Open

lab 06 #22

sunaynagoel opened this issue Sep 30, 2019 · 44 comments
Labels

Comments

@sunaynagoel
Copy link

Is the Lab for Wk 06 going to be actually due on Oct 3rd or are you going to move the due date?

@lecy
Copy link
Collaborator

lecy commented Oct 1, 2019

I'm adjusting the date. This semester has gotten the best of me.

Lab 6 will be short and sweet, the remainder of assignments will be posted tonight. I appreciate the patience.

@RickyDuran
Copy link

@lecy Lab 6 is showing up in the schedule, but the page is not live yet.

I am not sure if it is supposed to be.

Thanks!
Ricky

image

@lecy
Copy link
Collaborator

lecy commented Oct 2, 2019

Thanks Ricky - it was a broken link, should be working now.

@jmacost5
Copy link

jmacost5 commented Oct 2, 2019

So I made this code for the first part of the lab 6, how do I find the variable for year?

Salaries <- 
Salaries %>% 
mutate( salary.adj = salary*(1.03)^( max(yearID) - yearID ) )
group_by( Year, Teams)
summarize(Salaries)

@lecy
Copy link
Collaborator

lecy commented Oct 2, 2019

You can find data dictionaries or peek at your data using:

help( Salaries )
head( Salaries )

@jmacost5
Copy link

jmacost5 commented Oct 2, 2019

So when I replace year with yearID in the code it still cannot find year

@lecy
Copy link
Collaborator

lecy commented Oct 2, 2019

I think you might be confusing dataset names (Teams) with ID names in the dataset.

@sunaynagoel
Copy link
Author

I am working on the same question as well. I do have a question about merging? This is taking forever and not producing any result?


merge(Teams, Salaries, by.x= "teamID",by.y = "yearID", all = "TRUE") %>%
pander()

@lecy
Copy link
Collaborator

lecy commented Oct 2, 2019

By.x and by.y are used when variable names differ (e.g. “team.id” and “teamID”). You should be using a compound ID instead.

c(“id1”,”id2”)

@sunaynagoel
Copy link
Author

It is doing the same thing . What am I doing wrong ?

merge (Teams, Salaries, by= c("teamID","yearID"),all = "TRUE") %>%
pander()

@lecy
Copy link
Collaborator

lecy commented Oct 2, 2019

TRUE is an object not a string in R?

merge( Teams, Salaries, by=c("teamID","yearID"), all=TRUE )

I am assuming you have already aggregated your Salaries dataset up to the team level by year?

I'm not sure you want to keep all because the Teams dataset covers all seasons back to 1885, while the Salaries dataset only starts in 1985.

@jmacost5
Copy link

jmacost5 commented Oct 2, 2019

yearID is what is being said the name of the variable year is, I put both year and yearID in so I am not understanding what I am doing wrong to not get the right variable for year? Am I not understanding what the question is asking?

year <- 
Salaries <- 
  Salaries %>% 
  mutate( salary.adj = salary*(1.03)^( max(yearID) - yearID ) )
  merge( Teams, Salaries, by=c("teamID","yearID"), all=TRUE )
group_by(Teams)
summarize(Salaries)
pander()

@sunaynagoel
Copy link
Author

sunaynagoel commented Oct 2, 2019

@jmacost5 I am working on the same thing but I am going piece by piece. I have this. I am sure I missing something here.

#new variable for total salary team wise
Salaries %>%
  group_by(teamID, yearID)%>%
  summarise ( n=n(),
              totalsalary= sum(salary.adj,na.rm = TRUE))%>%
  
   pander()

@sunaynagoel
Copy link
Author

@lecy

TRUE is an object not a string in R?

  • When I put it under quotes it gives me error.
merge( Teams, Salaries, by=c("teamID","yearID"), all=TRUE )

I am assuming you have already aggregated your Salaries dataset up to the team level by year?

  • I tried.
    I'm not sure you want to keep all because the Teams dataset covers all seasons back to 1885, while the Salaries dataset only starts in 1985.
  • I don't want to keep all the data. i just want to keep data from 1985.

@lecy
Copy link
Collaborator

lecy commented Oct 2, 2019

Which merge would make sense here?

image

Note that in base R the merge arguments translate to:

merge( ..., all=TRUE )  # creates an outer join
merge( ..., all=FALSE)  # creates an inner join
merge( ..., all.x=TRUE)  # creates a left join
merge( ..., all.y=TRUE)  # creates a right join

@lecy
Copy link
Collaborator

lecy commented Oct 2, 2019

@jmacost5 I would first create your new team salaries dataset, then merge it with Teams. Not sure what a couple of these steps are doing.

year <- 
Salaries <- 
  Salaries %>% 
  mutate( salary.adj = salary*(1.03)^( max(yearID) - yearID ) )
  merge( Teams, Salaries, by=c("teamID","yearID"), all=TRUE )
group_by(Teams)
summarize(Salaries)
pander()

Maybe something like?

team.salaries <- 
  Salaries %>% 
  mutate( salary.adj = salary*(1.03)^( max(yearID) - yearID ) ) %>%
  group_by( ... ) %>%
  summarize( ... )

dat.teams.salaries <- merge( Teams, team.salaries, by=c("teamID","yearID"), all=? )

@sunaynagoel
Copy link
Author

The submission link for Lab-06 is not working for me.

@etbartell
Copy link

etbartell commented Oct 3, 2019

Here is my code for creating the aggregate salary dataset:

team.Salaries <-
  Salaries %>%
  mutate( salary.adj = salary*(1.03)^( max(yearID) - yearID ) ) %>%
  group_by( teamID, yearID ) %>%
  summarize( n = n(), team.budget = sum(salary.adj) ) %>%
  filter( n > 24 ) %>% pander()

It results in 4 columns of data without titles (should be teamID, yearID, n, & team.budget). Then when I try to merge this dataset with Teams I get this error message:

Unified <- 
  merge(team.Salaries, Teams, by=c("teamID", "yearID"), all=F)

Error in fix.by(by.x, x) : 'by' must specify uniquely valid columns

I can only assume this is because summary table had no column titles and therefore no variable names to match with Teams. I've updated and reloaded all of the packages in use and re-started R-studio several times. I'm not sure what else could be the problem.

@jmacost5
Copy link

jmacost5 commented Oct 3, 2019

Here is my code for creating the aggregate salary dataset:

team.Salaries <-
  Salaries %>%
  mutate( salary.adj = salary*(1.03)^( max(yearID) - yearID ) ) %>%
  group_by( teamID, yearID ) %>%
  summarize( n = n(), team.budget = sum(salary.adj) ) %>%
  filter( n > 24 ) %>% pander()

It results in 4 columns of data without titles (should be teamID, yearID, n, & team.budget). Then when I try to merge this dataset with Teams I get this error message:

Unified <- 
  merge(team.Salaries, Teams, by=c("teamID", "yearID"), all=F)

Error in fix.by(by.x, x) : 'by' must specify uniquely valid columns

I can only assume this is because summary table had no column titles and therefore no variable names to match with Teams. I've updated and reloaded all of the packages in use and re-started R-studio several times. I'm not sure what else could be the problem.

I am not super good at this but I believe that the all=F needs to be all= TRUE

@etbartell
Copy link

@jmacost5 I tried all 4 merging possibilities. All gave the same error message. I agree though, that one should be correct.

@jmacost5
Copy link

jmacost5 commented Oct 3, 2019

this is giving me the 4 variables, not the labels, but it should work. {r} team.Salaries <- Salaries %>% mutate( salary.adj = salary*(1.03)^( max(yearID) - yearID ) ) %>% group_by( teamID, yearID ) %>% summarize( n = n(), team.budget = sum(salary.adj) ) %>% filter( n > 24 ) %>% pander() dat.team.salaries <- merge( Teams, Salaries, by=c("teamID","yearID"), all=TRUE )
My question thought is how to make a variable for the cost per win, and do we have to scale it up for inflation?

@etbartell
Copy link

@jmacost5 That's the thing. We have to merge Teams with the new dataset ("team.Salaries") rather than the original "Salaries" dataset. This will combine inflation-adjusted team budgets with Win stats so that we can calculate cost-per-win. The problem I'm stuck on is that it won't let me merge the two datasets because the labels are missing from my "team.Salaries" set.

@etbartell
Copy link

I would think it should look something like this but anyone please feel free to show me where I'm wrong:

# aggregate team salaries, adjust team budgets for inflation, and screen out teams with less than 25 players
team.Salaries <-
  Salaries %>%
  mutate( salary.adj = salary*(1.03)^( max(yearID) - yearID ) ) %>%
  group_by( teamID, yearID ) %>%
  summarize( n = n(), team.budget = sum(salary.adj) ) %>%
  filter( n > 24 ) %>% pander()
# merge aggregate salary dataset with Teams
dat.team.salaries <-
  merge(team.Salaries, Teams, by=c("teamID","yearID"), all=T)
# calculate cost per-win and list the 25 lowest-paying teams
dat.team.salaries %>%
  mutate(cost.per.win = team.budget/W) %>%
  arrange(cost.per.win) %>%
  select("yearID","teamID","lgID","Rank","G","W","n","team.budget","cost.per.win") %>%
  head(25) %>% pander()

@lecy
Copy link
Collaborator

lecy commented Oct 3, 2019

@etbartell It's strange because when I run the code you posted here I get the correct results.

head( Unified )
  teamID yearID  n team.budget lgID franchID divID Rank   G Ghome  W  L DivWin
1    ANA   1997 31    54596239   AL      ANA     W    2 162    82 84 78      N
2    ANA   1998 34    70278139   AL      ANA     W    2 162    81 85 77      N
3    ANA   1999 40    91548199   AL      ANA     W    4 162    81 70 92      N
4    ANA   2000 30    82584880   AL      ANA     W    3 162    81 82 80      N
5    ANA   2001 30    74058241   AL      ANA     W    3 162    81 75 87      N
6    ANA   2002 28    93359559   AL      ANA     W    2 162    81 99 63      N
  WCWin LgWin WSWin   R   AB    H X2B X3B  HR  BB   SO  SB CS HBP SF  RA  ER
1     N     N     N 829 5628 1531 279  25 161 617  953 126 72  45 57 794 730
2     N     N     N 787 5630 1530 314  27 147 510 1028  93 45  48 41 783 720
3     N     N     N 711 5494 1404 248  22 158 511 1022  71 45  43 42 826 762
4     N     N     N 864 5628 1574 309  34 236 608 1024  93 52  47 43 869 805
5     N     N     N 691 5551 1447 275  26 158 494 1001 116 52  77 53 730 671
6     Y     Y     Y 851 5678 1603 333  32 152 462  805 117 51  74 64 644 595
   ERA CG SHO SV IPouts   HA HRA BBA  SOA   E  DP    FP           name
1 4.52  9   5 39   4364 1506 202 605 1050 123 140 0.980 Anaheim Angels
2 4.49  3   5 52   4332 1481 164 630 1091 106 146 0.983 Anaheim Angels
3 4.79  4   7 37   4294 1472 177 624  877 106 156 0.983 Anaheim Angels
4 5.00  5   3 46   4344 1534 228 662  846 134 182 0.978 Anaheim Angels
5 4.20  6   1 43   4313 1452 168 525  947 103 142 0.983 Anaheim Angels
6 3.69  7  14 54   4357 1345 169 509  999  87 151 0.986 Anaheim Angels
                        park attendance BPF PPF teamIDBR teamIDlahman45
1 Edison International Field    1767330 102 102      ANA            ANA
2 Edison International Field    2519280 102 102      ANA            ANA
3 Edison International Field    2253123  99 100      ANA            ANA
4 Edison International Field    2066982 102 103      ANA            ANA
5 Edison International Field    2000919 101 101      ANA            ANA
6 Edison International Field    2305547 100  99      ANA            ANA
  teamIDretro
1         ANA
2         ANA
3         ANA
4         ANA
5         ANA
6         ANA

I have not encountered this issue before with the functions stripping column names.

Can you print your session info while you are getting these results?

sessionInfo()

And send me your RMD file by email?

@sunaynagoel
Copy link
Author

Here is my code for creating the aggregate salary dataset:

team.Salaries <-
  Salaries %>%
  mutate( salary.adj = salary*(1.03)^( max(yearID) - yearID ) ) %>%
  group_by( teamID, yearID ) %>%
  summarize( n = n(), team.budget = sum(salary.adj) ) %>%
  filter( n > 24 ) %>% pander()

It results in 4 columns of data without titles (should be teamID, yearID, n, & team.budget). Then when I try to merge this dataset with Teams I get this error message:

Unified <- 
  merge(team.Salaries, Teams, by=c("teamID", "yearID"), all=F)

Error in fix.by(by.x, x) : 'by' must specify uniquely valid columns

I can only assume this is because summary table had no column titles and therefore no variable names to match with Teams. I've updated and reloaded all of the packages in use and re-started R-studio several times. I'm not sure what else could be the problem.

Try
Unified <-
merge (Teams, team.salaries, all = FALSE)

@lecy
Copy link
Collaborator

lecy commented Oct 3, 2019

@sunaynagoel Link for submission should be active now.

@etbartell
Copy link

etbartell commented Oct 3, 2019

@lecy Sending my Rmd now. Here is the sessionInfo:

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Lahman_7.0-1 pander_0.6.3 dplyr_0.8.3 

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2       digest_0.6.21    crayon_1.3.4     assertthat_0.2.1 R6_2.4.0         magrittr_1.5     pillar_1.4.2     rlang_0.4.0     
 [9] tools_3.6.1      glue_1.3.1       purrr_0.3.2      xfun_0.9         compiler_3.6.1   pkgconfig_2.0.3  knitr_1.25       tidyselect_0.2.5
[17] tibble_2.1.3    

@lecy
Copy link
Collaborator

lecy commented Oct 3, 2019

@etbartell Here is the problem:

# calculate cost per-win and list the 25 lowest-paying teams
team.Salaries <-
  Salaries %>%
  mutate( salary.adj = salary*(1.03)^( max(yearID) - yearID ) ) %>%
  group_by( teamID, yearID ) %>%
  summarize( n = n(), team.budget = sum(salary.adj) ) %>%
  filter( n > 24 ) %>% pander()

Assignment occurs after the last operation on the right-hand side has completed. What is the last operation here?

So it was assigning the formatted pander table to your team.Salaries object, not the data frame.

class( team.Salaries )
[1] "NULL"

Easy fix, remove the pander() function:

team.Salaries <-
  Salaries %>%
  mutate( salary.adj = salary*(1.03)^( max(yearID) - yearID ) ) %>%
  group_by( teamID, yearID ) %>%
  summarize( n = n(), team.budget = sum(salary.adj) ) %>%
  filter( n > 24 )

@jmacost5
Copy link

jmacost5 commented Oct 3, 2019

For some reason when I put in my code it says that team.budget is not found.

team.Salaries <-
  Salaries %>%
  mutate( salary.adj = salary*(1.03)^( max(yearID) - yearID ) ) %>%
  group_by( teamID, yearID ) %>%
  summarize( n = n(), team.budget = sum(salary.adj) ) %>%
  filter( n > 24 ) %>% pander()
dat.team.salaries <- merge( Teams, Salaries, by=c("teamID","yearID"), all=TRUE )
 mutate(cost.per.win = team.budget/W) %>%
  arrange(cost.per.win) %>%
  select("yearID","teamID","lgID","Rank","G","W","n","team.budget","cost.per.win") %>%
  head(25) %>% pander()

@lecy
Copy link
Collaborator

lecy commented Oct 3, 2019

@jmacost5 Check out the comment right above. You need to remove pander() during assignment.

@jmacost5
Copy link

jmacost5 commented Oct 3, 2019

I removed pander, I honestly thought I did after the comment, but I am getting the same error.

Error in mutate(cost.per.win = team.budget/W) :
object 'team.budget' not found

team.Salaries <-
  Salaries %>%
  mutate( salary.adj = salary*(1.03)^( max(yearID) - yearID ) ) %>%
  group_by( teamID, yearID ) %>%
  summarize( n = n(), team.budget = sum(salary.adj) ) %>%
  filter( n > 24 ) %>% pander()
dat.team.salaries <- merge( Teams, Salaries, by=c("teamID","yearID"), all=TRUE )
 mutate(cost.per.win = team.budget/W)
  arrange(cost.per.win)
  select("yearID","teamID","lgID","Rank","G","W","n","team.budget","cost.per.win")
  head(25)

@etbartell
Copy link

etbartell commented Oct 3, 2019

@jmacost5 I separated the merge chunk from the rest of it and it finally gave me the right table:

dat.team.salaries <-
  merge(team.Salaries, Teams, by=c("teamID","yearID"), all=T)
dat.team.salaries %>%
  mutate(cost.per.win = team.budget/W) %>%
  arrange(cost.per.win) %>%
  select("yearID","teamID","lgID","Rank","G","W","n","team.budget","cost.per.win") %>%
  head(25)

@etbartell
Copy link

@lecy Thank you, that worked! I'm just curious though, why would the pander() function get rid of the column names? I thought I had defined the team.Salaries data frame with the group_by, summary, and filter functions, and that it would just retain the names from Salaries.

@jmacost5
Copy link

jmacost5 commented Oct 3, 2019

I am still getting a the same error when I remove the

team.Salaries <-
  Salaries %>%
  mutate( salary.adj = salary*(1.03)^( max(yearID) - yearID ) ) %>%
  group_by( teamID, yearID ) %>%
  summarize( n = n(), team.budget = sum(salary.adj) ) %>%
  filter( n > 24 ) %>% 
dat.team.salaries <- 
  merge( Teams, Salaries, by=c("teamID","yearID"), all=TRUE )
dat.team.salaries %>%
 mutate(cost.per.win = team.budget/W) %>%
  arrange(cost.per.win) %>%
  select("yearID","teamID","lgID","Rank","G","W","n","team.budget","cost.per.win") %>%
  head(25)

Error: object 'team.budget' not found

@etbartell
Copy link

@jmacost5 Your code there doesn't include the merge, and also you need to remove the pander() function after the filter.

@jmacost5
Copy link

jmacost5 commented Oct 3, 2019

I literally swear that I removed pander I do not understand how it keeps popping up. Can you see if the code is ok now?

@lecy
Copy link
Collaborator

lecy commented Oct 3, 2019

@etbartell Thanks for catching that! @jmacost5 is it working now?

@jmacost5
Copy link

jmacost5 commented Oct 3, 2019

It is still giving me an error, It is the code listed above (I edited it)

Error in Salaries %>% mutate(salary.adj = salary * (1.03)^(max(yearID) - :
could not find function "%>%<-"

@lecy
Copy link
Collaborator

lecy commented Oct 3, 2019

@jmacost5 The pipe operator lives in the dplyr package, so make sure you have library(dplyr) somewhere.

@jmacost5
Copy link

jmacost5 commented Oct 3, 2019

I do, in the beginning of the paper I have all the library() for the assignment

# note that you cannot include an install.packages() call in an RMD file when knitting
library( dplyr )
library( pander )
library( Lahman )

@lecy
Copy link
Collaborator

lecy commented Oct 3, 2019

Are you getting the error during a knit, or working interactively with code chunks?

If it's interactive, you have to load the dplyr package by running that chunk at the top first, or typing the library() command into the console.

@jmacost5
Copy link

jmacost5 commented Oct 3, 2019

Yes I am, I ran all the chunks above (i just found this function) and I tried knitting it and I still get the error, the run all above chunks should cause the programs to be loaded right? I looked and it says that they are installed too.

@lecy
Copy link
Collaborator

lecy commented Oct 3, 2019

Send me your RMD by email please.

@lecy
Copy link
Collaborator

lecy commented Oct 3, 2019

Ok, here are the issues:

(1) You left a pipe operator at the end of the line when creating your team budget dataset.
(2) You create a team budget dataset, but then use the original Salaries dataset in the merge step.

team.Salaries <-
  Salaries %>%
  mutate( salary.adj = salary*(1.03)^( max(yearID) - yearID ) ) %>%
  group_by( teamID, yearID ) %>%
  summarize( n = n(), team.budget = sum(salary.adj) ) %>%
  filter( n > 24 )    # %>%     never end a recipe with a pipe! 

dat.team.salaries <- 
  # merge( Teams, Salaries, by=c("teamID","yearID"), all=TRUE )   # need to use your new dataset, not original 
  merge( Teams, team.Salaries, by=c("teamID","yearID"), all=TRUE )

@lecy lecy added the lab-06 label Oct 15, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants