|
| 1 | +--- |
| 2 | +title: "Introducing nflscrapR: A how to guide" |
| 3 | +author: "Maksim Horowitz (@bklynmaks)" |
| 4 | +date: "`r Sys.Date()`" |
| 5 | +output: rmarkdown::html_vignette |
| 6 | +vignette: > |
| 7 | + %\VignetteIndexEntry{Introducing nflscrapR: A how to guide} |
| 8 | + %\VignetteEngine{knitr::rmarkdown} |
| 9 | + %\VignetteEncoding{UTF-8} |
| 10 | +--- |
| 11 | + |
| 12 | +## Introduction |
| 13 | + |
| 14 | +The lack of publicly available National Football League (NFL) data sources has been a major obstacle to modern, reproducible research in football analytics. While clean play-by-play and season-level data are available through open-source software packages for other sports, equivalent datasets are not freely available to researchers interested in the statistical analysis of the NFL. We created a publicly available, open-source R package called `nflscrapR` that allows easy access to NFL data from 2009-2016. Using a JSON API maintained by the NFL, the package downloads, cleans, parses, and outputs datasets at the individual play, player, game, and season levels. By giving analysts a common source to build on, the package advances public NFL research and makes that research easier to reproduce. This document shows example use cases for the different functions of `nflscrapR`. The examples explore just a few areas of NFL analysis and show how easy the package is to use. |
| 15 | + |
| 16 | + |
| 17 | +## Loading the `nflscrapR` package |
| 18 | + |
| 19 | +Use the following code to install and load the package. The `devtools` package is needed to install `nflscrapR` because the package is currently hosted on GitHub: |
| 20 | + |
| 21 | +```{r, eval = FALSE} |
| 22 | +library(devtools) |
| 23 | +
|
| 24 | +devtools::install_github(repo = "maksimhorowitz/nflscrapR") |
| 25 | +
|
| 26 | +# Load the package |
| 27 | +
|
| 28 | +library(nflscrapR, quietly = TRUE) |
| 29 | +
|
| 30 | +``` |
| 31 | + |
| 32 | +```{r, echo = FALSE} |
| 33 | +library(nflscrapR) |
| 34 | +``` |
| 35 | + |
| 36 | +## Games and GameID Function |
| 37 | + |
| 38 | +The `nflscrapR` package provides a function that lists all games in a season along with each game's associated GameID. The `season_games` function returns a dataframe with the home team, the away team, the game date, and the GameID. Teams are denoted by their abbreviations. This function lets users identify the GameIDs of matchups of interest, which can then be passed to the other functions of `nflscrapR`. The example code below shows how to use the `season_games` function: |
| 39 | + |
| 40 | +*Note: The `season_games` function takes a few minutes to run* |
| 41 | + |
| 42 | +```{r} |
| 43 | +# Loading all games from the 2014 season |
| 44 | +games2014 <- season_games(Season = 2014) |
| 45 | +
|
| 46 | +head(games2014) |
| 47 | +``` |
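Once the games table is available, finding the GameID for a matchup is a plain filter. The following is a minimal sketch on a hand-made data frame: the column names `GameID`, `date`, `home`, and `away`, the placeholder team abbreviations, and the IDs are all assumptions for illustration and are not real output of `season_games`.

```r
# Toy stand-in for the output of season_games(); column names, team
# abbreviations, and IDs are placeholders, not real package output.
games_toy <- data.frame(
  GameID = c("1000000001", "1000000002", "1000000003"),
  date   = as.Date(c("2014-09-04", "2014-09-07", "2014-09-07")),
  home   = c("AAA", "BBB", "CCC"),
  away   = c("DDD", "EEE", "FFF"),
  stringsAsFactors = FALSE
)

# Keep every game involving the team of interest, home or away
team <- "CCC"
team_games <- games_toy[games_toy$home == team | games_toy$away == team, ]

# These IDs are what the game-level functions expect as input
team_ids <- team_games$GameID
print(team_ids)
```

With the real table, the same filter should apply once the actual column names are confirmed with `names(games2014)`.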
| 48 | + |
| 49 | +## Play-by-Play Functions |
| 50 | + |
| 51 | +The `nflscrapR` package contains two play-by-play functions. The single-game function outputs a 62-column dataframe and the season-long function outputs a 63-column dataframe, each with detailed information about every play, including pre-snap information, the play call, and post-snap results. To explore the documentation of the function and learn what each column describes, use the following code: |
| 52 | + |
| 53 | +```{r, eval= FALSE} |
| 54 | +
|
| 55 | +help(game_play_by_play) |
| 56 | +
|
| 57 | +``` |
| 58 | + |
| 59 | +- `game_play_by_play`: extracts a single game's play-by-play data |
| 60 | +- `season_play_by_play`: extracts a full season's play-by-play data |
| 61 | + + _**Note this function takes a minute or two to run**_ |
| 62 | + |
| 63 | +*Note: There are errors within the NFL API raw data. Numerous extra-point scores are |
| 64 | +missing due to what seems to be a bug in the API. Fortunately, this does not have a major |
| 65 | +effect on the games beyond a difference of a few points in the scores.* |
| 66 | + |
| 67 | +### Using `game_play_by_play` |
| 68 | + |
| 69 | +Here, we will explore Super Bowl XLVII between the Baltimore Ravens and the San |
| 70 | +Francisco 49ers. The Ravens won the game 34-31. Below we explore some interesting elements of the game using the `game_play_by_play` function: |
| 71 | + |
| 72 | +```{r, warning = FALSE} |
| 73 | +
|
| 74 | +# Download the game data |
| 75 | +superbowl47 <- game_play_by_play(GameID = 2013020300) |
| 76 | +
|
| 77 | +# Explore dataframe dimensions |
| 78 | +dim(superbowl47) |
| 79 | +``` |
| 80 | + |
| 81 | +We see that Super Bowl XLVII had `r nrow(superbowl47)` plays. Next we explore whether one team dominated offensive possession: |
| 82 | + |
| 83 | +```{r} |
| 84 | +# Counting Offensive Plays using dplyr |
| 85 | +suppressMessages(library(dplyr)) |
| 86 | +superbowl47 %>% group_by(posteam) %>% summarise(offensiveplays = n()) %>% |
| 87 | + filter(., posteam != "") |
| 88 | +``` |
| 89 | + |
| 90 | +As seen above, the Ravens ran more plays than the 49ers. There are many directions |
| 91 | +one can take for further insight. Some ideas are as follows: |
| 92 | + |
| 93 | +- Examine time of possession to see if one team dictated the pace of play |
| 94 | +- Analyze the run-pass breakdown of both teams to see play-calling tendencies |
| 95 | +- Take a further step and look at plays by drive to see if there was one |
| 96 | +specific drive that created the difference in total offensive plays between |
| 97 | +the teams |
| 98 | +- Add more statistics on the "play" level such as yards per play, points |
| 99 | +per play, or play duration |
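The run-pass breakdown idea from the list above can be sketched as a cross-tabulation. This is only an illustration on made-up rows: `posteam` matches the column used earlier, while the `PlayType` column and its `"Run"`/`"Pass"` values are assumptions that should be checked against the real play-by-play dataframe.

```r
# Toy play-by-play rows; `PlayType` and its values are assumed for this sketch
pbp_toy <- data.frame(
  posteam  = c("BAL", "BAL", "BAL", "SF", "SF", "SF", "SF"),
  PlayType = c("Run", "Pass", "Pass", "Run", "Run", "Pass", "Run"),
  stringsAsFactors = FALSE
)

# Count plays by team (rows) and play type (columns)
play_counts <- table(pbp_toy$posteam, pbp_toy$PlayType)

# Convert counts to row-wise shares so each team's row sums to 1
play_shares <- prop.table(play_counts, margin = 1)
print(round(play_shares, 2))
```

Shares are used instead of raw counts so that teams with different numbers of offensive plays can be compared directly.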
| 100 | + |
| 101 | +For the sake of this example the last bullet is explored in more detail. To explore |
| 102 | +play level statistics we manipulate the following variables to get yards per play, |
| 103 | +points per play, and play duration: |
| 104 | + |
| 105 | +- `Yards.Gained` |
| 106 | +- `PosTeamScore` |
| 107 | +- `PlayTimeDiff` |
| 108 | + |
| 109 | +Below, using `dplyr`, we compute the aforementioned statistics for each team and use |
| 110 | +`ggplot2` to visualize the summarized data: |
| 111 | + |
| 112 | +```{r, warning=FALSE, fig.align='center', fig.height= 7, fig.width= 9.5} |
| 113 | +# Loading the ggplot2 library |
| 114 | +suppressMessages(library(ggplot2)) |
| 115 | +suppressMessages(library(gridExtra)) |
| 116 | +
|
| 117 | +# Using dplyr to compute per-team summary statistics |
| 118 | +sb_team_summary_stats <- superbowl47 %>% group_by(posteam) %>% |
| 119 | + summarise(offensiveplays = n(), |
| 120 | + avg.yards.gained = mean(Yards.Gained, |
| 121 | + na.rm = TRUE), |
| 122 | + pointsperplay = max(PosTeamScore) / n(), |
| 123 | + playduration = mean(PlayTimeDiff)) %>% |
| 124 | + filter(., posteam != "") %>% |
| 125 | + as.data.frame() |
| 126 | +
|
| 127 | +# Yards per play plot |
| 128 | +plot_yards <- ggplot(sb_team_summary_stats, aes(x = posteam, y = avg.yards.gained)) + |
| 129 | + geom_bar(aes(fill = posteam), stat = "identity") + |
| 130 | + geom_label(aes(x = posteam, y = avg.yards.gained + .3, |
| 131 | + label = round(avg.yards.gained,2)), |
| 132 | + size = 4, fontface = "bold") + |
| 133 | + labs(title = "Superbowl 47: Yards per Play by Team", |
| 134 | + x = "Teams", y = "Average Yards per Play") + |
| 135 | + scale_fill_manual(values = c("#241773", "#B3995D")) |
| 136 | +
|
| 137 | +# Points per play plot |
| 138 | +plot_points <- ggplot(sb_team_summary_stats, aes(x = posteam, y = pointsperplay)) + |
| 139 | + geom_bar(aes(fill = posteam), stat = "identity") + |
| 140 | + geom_label(aes(x = posteam, y = pointsperplay + .05, |
| 141 | + label = round(pointsperplay,5)), |
| 142 | + size = 4, fontface = "bold") + |
| 143 | + labs(title = "Superbowl 47: Points per Play by Team", |
| 144 | + x = "Teams", y = "Points per Play") + |
| 145 | + scale_fill_manual(values = c("#241773", "#B3995D")) |
| 146 | +
|
| 147 | +# Play duration plot |
| 148 | +plot_time <- ggplot(sb_team_summary_stats, aes(x = posteam, y = playduration)) + |
| 149 | + geom_bar(aes(fill = posteam), stat = "identity") + |
| 150 | + geom_label(aes(x = posteam, y = playduration + .05, |
| 151 | + label = round(playduration,2)), |
| 152 | + size = 4, fontface = "bold") + |
| 153 | + labs(title = "Superbowl 47: Average Play Time Duration \n by Team", |
| 154 | + x = "Teams", y = "Average Play Duration") + |
| 155 | + scale_fill_manual(values = c("#241773", "#B3995D")) |
| 156 | +
|
| 157 | +
|
| 158 | +grid.arrange(plot_yards, plot_points, plot_time, ncol =2) |
| 159 | +
|
| 160 | +``` |
| 161 | + |
| 162 | +### The season_play_by_play function |
| 163 | + |
| 164 | +The above example is just one way to manipulate play-by-play dataframes to gather insights about a single football game. The `season_play_by_play` function allows users to gather further insights at aggregate levels. Each season generates tens of thousands of plays to analyze. For an example that uses `season_play_by_play`, check out this analysis of [Adrian Peterson's running tendencies posted on the CMU Sports Analytics club blog](https://tartansportsanalytics.com/2016/03/24/introducing-nflscrapr-part-2/). |
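As a rough idea of what aggregate-level analysis looks like, the sketch below averages yards per play by team across several games. The rows are invented, the integer `GameID` values are placeholders, and only `posteam` and `Yards.Gained` are column names taken from the examples above.

```r
# Toy multi-game play-by-play rows (invented values, placeholder GameIDs)
season_toy <- data.frame(
  GameID       = c(1, 1, 1, 2, 2, 2),
  posteam      = c("BAL", "BAL", "SF", "BAL", "SF", "SF"),
  Yards.Gained = c(5, NA, 3, 10, 0, 7),
  stringsAsFactors = FALSE
)

# Average yards per play by team across all games.
# The formula interface of aggregate() drops rows with NA by default.
team_yards <- aggregate(Yards.Gained ~ posteam, data = season_toy, FUN = mean)
print(team_yards)
```

Adding `GameID` to the right-hand side of the formula would give one row per team per game instead of one row per team.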
| 165 | + |
| 166 | +## Detailed Boxscore Functions |
| 167 | + |
| 168 | +Another set of useful and interesting functions in nflscrapR are the detailed box score functions. These functions output game and season level statistics ranging from passing to defense to kick returning. The season level functions output dataframes |
| 169 | +with 56 columns while the game level function outputs a dataframe with 55 columns. |
| 170 | + |
| 171 | +There are three different detailed box score functions. They are summarized as follows: |
| 172 | + |
| 173 | +- `player_game`: outputs a dataframe for a single game with one line per player who recorded any |
| 174 | +measurable statistics ranging from passing to kick returning |
| 175 | +- `season_player_game`: outputs a dataframe for an entire season. **One line per player per game**. All measured statistics are reported. |
| 176 | +- `agg_player_season`: outputs a dataframe with aggregate statistics for an entire season. There is one line per player in this dataframe. |
| 177 | + |
| 178 | +*Note: The `season_player_game` and `agg_player_season` functions take a few minutes |
| 179 | +to run.* |
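The relationship between the per-game and per-season outputs can be illustrated by summing game rows into season totals. This is a sketch of the idea on invented rows, not a description of how `agg_player_season` is implemented; `passyds` appears later in this vignette, while `rushyds` and the player names are assumptions.

```r
# Toy per-player, per-game rows (one line per player per game)
player_games_toy <- data.frame(
  Season  = c(2014, 2014, 2014, 2014),
  name    = c("Player A", "Player A", "Player B", "Player B"),
  passyds = c(250, 310, 0, 0),
  rushyds = c(5, -2, 80, 120),
  stringsAsFactors = FALSE
)

# Collapse to one line per player per season by summing each statistic
season_totals <- aggregate(cbind(passyds, rushyds) ~ Season + name,
                           data = player_games_toy, FUN = sum)
print(season_totals)
```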
| 180 | + |
| 181 | +### Using season_player_game |
| 182 | + |
| 183 | +Below is an example using the `season_player_game` function. The plots show a few ways to visualize player data across seasons. The first plot shows Joe Flacco's passing yards by game over the past 7 seasons. The second plot shows his pass attempts per game, to visualize his evolution from a game manager to a top-tier quarterback. |
| 184 | + |
| 185 | +```{r, echo = FALSE} |
| 186 | +data(playerstats09, playerstats10, playerstats11, playerstats12, |
| 187 | + playerstats13, playerstats14, playerstats15) |
| 188 | +
|
| 189 | +allplayerstats <- rbind(playerstats09, playerstats10, playerstats11, |
| 190 | + playerstats12, playerstats13, playerstats14, |
| 191 | + playerstats15) |
| 192 | +``` |
| 193 | + |
| 194 | +```{r, eval = FALSE} |
| 195 | +
|
| 196 | +# Loading all the statistics for each player by game from 2009-2015 |
| 197 | +# Note: The below code takes about 15 minutes to run. |
| 198 | +
|
| 199 | +playerstats09 <- season_player_game(Season = 2009) |
| 200 | +playerstats10 <- season_player_game(Season = 2010) |
| 201 | +playerstats11 <- season_player_game(Season = 2011) |
| 202 | +playerstats12 <- season_player_game(Season = 2012) |
| 203 | +playerstats13 <- season_player_game(Season = 2013) |
| 204 | +playerstats14 <- season_player_game(Season = 2014) |
| 205 | +playerstats15 <- season_player_game(Season = 2015) |
| 206 | +
|
| 207 | +# Combining into one dataframe using rbind() |
| 208 | +
|
| 209 | +allplayerstats <- rbind(playerstats09, playerstats10, playerstats11, |
| 210 | + playerstats12, playerstats13, playerstats14, |
| 211 | + playerstats15) |
| 212 | +
|
| 213 | +# Examining dimensions: |
| 214 | +dim(allplayerstats) |
| 215 | +``` |
| 216 | + |
| 217 | +```{r} |
| 218 | +#### Using ggplot to explore Joe Flacco's passing trends ### |
| 219 | +
|
| 220 | +# filter for Flacco |
| 221 | +flacco_data <- filter(allplayerstats, name == "J.Flacco") |
| 222 | +flacco_data <- flacco_data[order(flacco_data$date),] |
| 223 | +
|
| 224 | +# Add game number. Note that Flacco's 2015 season ended with an injury after 10 games |
| 225 | +flacco_data$gamenumber <- as.factor(c(rep(1:16, times = 6), 1:10)) |
| 226 | +
|
| 227 | +# Creating Passing Yards Plot by game |
| 228 | +flacco_passyds_plot <- ggplot(flacco_data, aes(x = gamenumber, y = passyds)) + |
| 229 | + theme_bw() + |
| 230 | + geom_bar(stat = "identity", aes(alpha = passyds), fill = "#241773") + |
| 231 | + theme(strip.background = element_rect(fill = "black", size = 1.5), |
| 232 | + strip.text = element_text(color = "white", face = "bold")) + |
| 233 | + geom_label(aes(x = gamenumber, y = ifelse(passyds > 70, passyds - 60, |
| 234 | + passyds + 10), |
| 235 | + label = passyds), |
| 236 | + size = 3) + |
| 237 | + ggtitle("Joe Flacco's Passing Yards by Game \n Across Seasons") + |
| 238 | + ylab("Passing Yards") + xlab("Game Number") + guides(fill = FALSE) + |
| 239 | + facet_wrap(~Season, ncol = 1) |
| 240 | +``` |
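The game-number column above is built from hard-coded counts (16 games in six seasons, then 10). An alternative sketch, assuming only that the data has `Season` and `date` columns as used above, derives the counter from the data itself. The rows below are invented and `qb_toy` is a hypothetical name.

```r
# Toy per-game rows for one player, deliberately out of date order
qb_toy <- data.frame(
  Season  = c(2014, 2015, 2014, 2015, 2014),
  date    = as.Date(c("2014-09-14", "2015-09-13", "2014-09-07",
                      "2015-09-20", "2014-09-21")),
  passyds = c(100, 200, 300, 400, 500)
)

# Sort chronologically, then number the games within each season
qb_toy <- qb_toy[order(qb_toy$date), ]
qb_toy$gamenumber <- ave(seq_along(qb_toy$date), qb_toy$Season,
                         FUN = seq_along)
print(qb_toy)
```

This numbers the games a player appeared in, which differs from the scheduled week whenever a game was missed.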
| 241 | + |
| 242 | +The plot below of Flacco's passing yards across games provides an interesting visual. |
| 243 | +By examining the plot, one can see that he usually has a strong first game followed by a much weaker second game (excluding the 2015 season). One can also see that in 2015 Flacco was on pace to throw for a career high in yards in a season before he was derailed by a torn ACL. |
| 244 | + |
| 245 | +```{r, echo = FALSE, fig.align= "center", fig.height= 9, fig.width=9} |
| 246 | +flacco_passyds_plot |
| 247 | +``` |
| 248 | + |
| 249 | +Next we examine Flacco's pass attempts across games. This allows visualization of |
| 250 | +his progression from a game manager to the elite-level quarterback that he is today. The bar plots below suggest the following insights: |
| 251 | + |
| 252 | +- Flacco's most consistent season in terms of pass attempts was 2010 |
| 253 | +- In 2013 and 2014 the Ravens started the season aggressively, relying heavily on |
| 254 | +the passing game with 62 attempts in the first game |
| 255 | +- In 2009, 2010, and 2011 Flacco's attempts tailed off at the end of the seasons. This could be related to the Ravens having already clinched a playoff berth by that point in the season. |
| 256 | + |
| 257 | +```{r, fig.align= "center", fig.height= 9, fig.width=9} |
| 258 | +
|
| 259 | +# Creating Passing Attempts Plot by Game across Seasons |
| 260 | +ggplot(flacco_data, aes(x = gamenumber, y = pass.att)) + |
| 261 | + theme_bw() + |
| 262 | + geom_bar(stat = "identity", aes(alpha = pass.att), fill = "#241773") + |
| 263 | + theme(strip.background = element_rect(fill = "black", size = 1.5), |
| 264 | + strip.text = element_text(color = "white", face = "bold")) + |
| 265 | + geom_label(aes(x = gamenumber, y = ifelse(pass.att > 60, pass.att - 5, |
| 266 | + pass.att), |
| 267 | + label = pass.att), |
| 268 | + size = 3) + |
| 269 | + ggtitle("Joe Flacco's Passing Attempts by Game \n Across Seasons") + |
| 270 | + ylab("Passing Attempts") + xlab("Game Number") + guides(fill = FALSE) + |
| 271 | + facet_wrap(~Season, ncol = 1) |
| 272 | +``` |
| 273 | + |
| 274 | +### Using `player_game` and `agg_player_season` |
| 275 | + |
| 276 | +Similar exploration of game-level statistics can be done with the `player_game` |
| 277 | +function. The nice thing about the `player_game` function is that it downloads, parses, and cleans the data very quickly, so if you are interested in particular games, you can download the data almost instantaneously. |
| 278 | + |
| 279 | +The `agg_player_season` function gives users season-total statistics, which is useful for building a running list of statistics or for easily calculating totals over the available range of seasons. |
| 280 | + |
| 281 | +Note that there is also a simple box score function, which separates each of the |
| 282 | +measured statistics into a list of dataframes. Its output is much closer to what is seen in a standard box score. The function is named `simple_boxscore`. |
| 283 | + |
| 284 | +## Roster Function |
| 285 | + |
| 286 | +Users can also download a data set of rosters for each team by season. The rosters |
| 287 | +include all players **who recorded a measured statistic** in the raw data. That is, a player who did not record a passing, rushing, receiving, defensive, punt return, kick return, punting, or kicking statistic will not be on the roster. To use the function, specify a season of interest and the abbreviation of your team of interest. To find team abbreviations, load the `nflteams` |
| 288 | +data set stored in the package. The following code shows the roster for the Tennessee Titans in 2013: |
| 289 | + |
| 290 | +*Note: The `season_rosters` function takes a few minutes to run* |
| 291 | + |
| 292 | +```{r} |
| 293 | +
|
| 294 | +# Find Titans Abbreviation |
| 295 | +data(nflteams) |
| 296 | +ten_abbr <- filter(nflteams, TeamName == "Tennessee Titans")$Abbr |
| 297 | +
|
| 298 | +# Load the Titans Roster from 2013 |
| 299 | +tenroster2013 <- season_rosters(Season = 2013, TeamInt = ten_abbr) |
| 300 | +
|
| 301 | +head(tenroster2013) |
| 302 | +``` |
| 303 | + |
| 304 | +## Drive Summary Function |
| 305 | + |
| 306 | +The `drive_summary` function allows users to download data sets with information about each drive in a game. The function takes a GameID as input and returns a dataframe with |
| 307 | +18 columns. The help documentation for the function describes each of the columns |
| 308 | +in more detail. Drive summaries provide interesting insight into what types of play calling lead to specific drive results such as scores, punts, or turnovers. |
| 309 | + |
| 310 | +### Using the drive_summary function |
| 311 | + |
| 312 | +Displayed below are the first eight drive summaries from the last game of the 2015 regular season, between the Minnesota Vikings and the Green Bay Packers. |
| 313 | +```{r} |
| 314 | +# Drive summary from final game of 2015 season |
| 315 | +min_gb_drivesummary <- drive_summary(GameID = 2016010310) |
| 316 | +
|
| 317 | +head(min_gb_drivesummary, 8) |
| 318 | +``` |
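One simple way to relate drives to outcomes is to tabulate drive results by team. The sketch below uses invented rows; the `posteam` and `Result` column names and the result labels are assumptions, so check `names(min_gb_drivesummary)` and the help page for the actual columns.

```r
# Toy drive-level rows; column names and result labels are assumed
drives_toy <- data.frame(
  posteam = c("MIN", "GB", "MIN", "GB", "MIN", "GB"),
  Result  = c("Punt", "Field Goal", "Touchdown", "Punt", "Punt",
              "Interception"),
  stringsAsFactors = FALSE
)

# Count drive results by team
result_counts <- table(drives_toy$posteam, drives_toy$Result)
print(result_counts)

# Share of each team's drives that ended in points
scoring <- drives_toy$Result %in% c("Touchdown", "Field Goal")
scoring_rate <- tapply(scoring, drives_toy$posteam, mean)
print(round(scoring_rate, 2))
```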