<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>R, RStudio, and the tidyverse</title>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link href="css/bootstrap.min.css" rel="stylesheet">
<link href="css/custom.css" rel="stylesheet">
</head>
<body class="markdown github">
<header class="navbar-inverse navbar-fixed-top">
<div class="container">
<nav role="navigation">
<div class="navbar-header">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#bs-example-navbar-collapse-1">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a href="index.html" class="navbar-brand">J298 Data Journalism</a>
</div> <!-- /.navbar-header -->
<!-- Collect the nav links, forms, and other content for toggling -->
<div class="collapse navbar-collapse" id="bs-example-navbar-collapse-1">
<ul class="nav navbar-nav">
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Class notes<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="week1.html">What is data?</a></li>
<li><a href="week2.html">Types of stories</a></li>
<li><a href="week3.html">Working with spreadsheets</a></li>
<li><a href="week4.html">Acquiring, cleaning, and formatting data</a></li>
<li><a href="week5.html">R, RStudio, and the tidyverse</a></li>
<li><a href="week6.html">Data journalism in the tidyverse</a></li>
<li><a href="week7.html">Don't let the data lie to you</a></li>
<li><a href="week8.html">Databases and SQL</a></li>
<li><a href="week9.html">Finding stories using maps</a></li>
<li><a href="week10.html">Maps meet databases</a></li>
<li><a href="week11.html">More PostGIS</a></li>
<li><a href="week12.html">R practice</a></li>
<li><a href="week13.html">PostGIS practice</a></li>
<li><a href="week14.html">More fun with R</a></li>
</ul>
</li>
<li><a href="software.html">Software</a></li>
<li><a href="datasets.html">Data</a></li>
<li><a href="questions.html">If you get stuck</a></li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Email instructors<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="mailto:[email protected]">Peter Aldhous</a></li>
<li><a href="mailto:[email protected]">Amanda Hickman</a></li>
</ul>
</li>
</ul>
</div><!-- /.navbar-collapse -->
</nav>
</div> <!-- /.navbar-header -->
</header>
<div class="container all">
<h1 id="r,-rstudio,-and-the-tidyverse"><a name="r,-rstudio,-and-the-tidyverse" href="#r,-rstudio,-and-the-tidyverse"></a>R, RStudio, and the tidyverse</h1><h3 id="introducing-r-and-rstudio"><a name="introducing-r-and-rstudio" href="#introducing-r-and-rstudio"></a>Introducing R and RStudio</h3><p>In today’s class we will work with <strong><a href="http://www.r-project.org/">R</a></strong>, a powerful tool designed by statisticians for data analysis. Described on its website as a “free software environment for statistical computing and graphics,” R is a programming language that opens a world of possibilities for making graphics and analyzing and processing data. Indeed, just about anything you may want to do with data can be done with R, from web scraping to making interactive graphics.</p><p><strong><a href="https://www.rstudio.com/">RStudio</a></strong> is an “integrated development environment,” or IDE, for R that provides a user-friendly interface.</p><p>Launch RStudio, and the screen should look like this:</p><p><img src="img/class5_1.jpg" alt=""></p><p>The main panel to the left is the R Console. Type valid R code here, hit <code>return</code>, and it will run. See what happens if you run:</p><pre class="r hljs"><code class="R" data-origin="<pre><code class="R">print("Hello World!")
</code></pre>">print(<span class="hljs-string">"Hello World!"</span>)
</code></pre><h3 id="the-data-we-will-use"><a name="the-data-we-will-use" href="#the-data-we-will-use"></a>The data we will use</h3><p>Download the data for this class from <a href="data/week5.zip">here</a>, unzip the folder and place it on your desktop. We will use this data over the next two weeks. It contains the following files:</p><ul>
<li><p><code>ca_discipline.csv</code> Disciplinary alerts and actions issued by the Medical Board of California from 2008 to 2017. Processed from downloads available <a href="http://www.mbc.ca.gov/Publications/Disciplinary_Actions/">here</a>. Contains the following variables:</p>
<ul>
<li><code>alert_date</code> Date alert issued.</li><li><code>last_name</code> Last name of doctor/health care provider.</li><li><code>first_name</code> First name of doctor/health care provider.</li><li><code>middle_name</code> Middle/other names.</li><li><code>name_suffix</code> Name suffix (Jr., II, etc.).</li><li><code>city</code> City of practice location.</li><li><code>state</code> State of practice location.</li><li><code>license</code> California medical license number.</li><li><code>action_type</code> Type of action.</li><li><code>action_date</code> Date of action.</li></ul>
</li></ul><ul>
<li><p><code>ca_medicare_opioids.csv</code> Data on prescriptions of opioid drugs under the Medicare Part D Prescription Drug Program by doctors in California, from 2013 to 2015. Filtered from the national data downloads available <a href="https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Part-D-Prescriber.html">here</a>. This is the public release of the data that ProPublica used FOIA to obtain for earlier years for the story we discussed in Week 2. Contains the following variables:</p>
<ul>
<li><code>npi</code> <a href="https://npiregistry.cms.hhs.gov/">National Provider Identifier</a> (NPI) for the doctor/organization making the claim. This is a unique code for each health care provider.</li><li><code>nppes_provider_last_org_name</code> For individual doctors, their last name. For organizations, the organization name.</li><li><code>nppes_provider_first_name</code> First name for individual doctors, blank for organizations.</li><li><code>nppes_provider_city</code> City where the provider is located.</li><li><code>nppes_provider_state</code> State where the provider is located; “CA” for all of these records.</li><li><code>specialty_description</code> Provider’s medical specialty, reported on their Medicare claims. For providers that have more than one Medicare specialty code reported on their claims, the code associated with the largest number of services.</li><li><code>description_flag</code> Source of the <code>specialty_description</code>. <ul>
<li><code>S</code> Medicare Specialty Code description.</li><li><code>T</code> Taxonomy Code Classification description.</li></ul>
</li><li><code>drug_name</code> Includes both brand names (drugs that have a trademarked name) and generic names (drugs that do not have a trademarked name).</li><li><code>generic_name</code> The chemical ingredient of a drug rather than the trademarked brand name under which the drug is sold.</li><li><code>bene_count</code> Total number of unique Medicare Part D beneficiaries (i.e. patients) with at least one claim for the drug. Counts fewer than 11 are suppressed and are indicated by a blank.</li><li><code>total_claim_count</code> Number of Medicare Part D claims; includes original prescriptions and refills. If less than 11, counts are not included in the data file.</li><li><code>total_30_day_fill_count</code> Total number of Medicare Part D standardized 30-day fills. The standardized 30-day fill is derived from the number of days supplied on each Part D claim divided by 30.</li><li><code>total_day_supply</code> Total number of days’ supply for this drug.</li><li><code>total_drug_cost</code> Total cost paid for all associated claims; includes ingredient cost, dispensing fee, sales tax, and any applicable fees.</li><li><code>bene_count_ge65</code> Total number of unique Medicare Part D beneficiaries age 65 and older with at least one claim for the drug. A blank indicates the value is suppressed.</li><li><code>bene_count_ge65_suppress_flag</code> Why the <code>bene_count_ge65</code> variable is suppressed:<ul>
<li><code>*</code> Suppressed due to <code>bene_count_ge65</code> between 1 and 10.</li><li><code>#</code> Suppressed because the “less than 65 year old” group (not displayed) contains a beneficiary count between 1 and 10.</li><li><code>total_claim_count_ge65</code> Number of Medicare Part D claims for beneficiaries age 65 and older; includes original prescriptions and refills. A blank indicates the value is suppressed.</li><li><code>ge65_suppress_flag</code> Why the <code>total_claim_count_ge65</code>, <code>total_30_day_fill_count_ge65</code>, <code>total_day_supply_ge65</code>, and <code>total_drug_cost_ge65</code> variables are suppressed:<ul>
<li><code>*</code> Suppressed due to <code>total_claim_count_ge65</code> between 1 and 10.</li></ul>
</li><li><code>#</code> Suppressed because the “less than 65 year old” group (not displayed) contains a claim count between 1 and 10.</li></ul>
</li><li><code>total_30_day_fill_count_ge65</code> Number of Medicare Part D standardized 30-day fills for beneficiaries age 65 and older. If <code>total_claim_count_ge65</code> is suppressed, this variable is also suppressed.</li><li><code>total_day_supply_ge65</code> Total days’ supply for which this drug was dispensed, for beneficiaries age 65 and older. If <code>total_claim_count_ge65</code> is suppressed, this variable is also suppressed.</li><li><code>total_drug_cost_ge65</code> Total drug cost paid for all associated claims for beneficiaries age 65 and older. If <code>total_claim_count_ge65</code> is suppressed, this is also suppressed.</li><li><code>year</code> 2013, 2014, or 2015.</li></ul>
</li><li><p><code>npi_license.csv</code> Crosswalk file to join NPI identifiers to state license numbers, processed from the download available <a href="http://www.nber.org/data/npi-state-license-crosswalk.html">here</a> to include license numbers potentially matching California doctors. This will provide one way of joining the prescription data to the medical board disciplinary data. As we shall see, problems with the data mean that it is not infallible. Contains the following variables:</p>
<ul>
<li><code>npi</code> National Provider Identifier, as described above.</li><li><code>plicnum</code> State license number, from the original file.</li><li><code>license</code> Processed from <code>plicnum</code> to conform to the format of California medical license numbers.</li></ul>
</li></ul><h3 id="some-words-of-caution,-before-we-start"><a name="some-words-of-caution,-before-we-start" href="#some-words-of-caution,-before-we-start"></a>Some words of caution, before we start</h3><p>The US is currently in the grip of an epidemic of opioid abuse and addiction. Although <a href="https://www.buzzfeed.com/danvergano/2-old-painkiller-papers">widespread medical prescription</a> of opioids <a href="https://www.buzzfeed.com/danvergano/whats-causing-the-opioid-crisis">helped drive addiction</a>, a <a href="https://www.buzzfeed.com/danvergano/life-expectancy-opioid-overdoses">majority of overdoses</a> now occur through the consumption of drugs purchased illegally.</p><p>Opioids have important medical uses, and just because a doctor prescribes large amounts of the drugs doesn’t necessarily mean they are practising irresponsibly. Turning any of the analyses in the next two classes into stories would require a lot of additional reporting, beyond the data work.</p><p>As ProPublica explained, in <a href="https://www.propublica.org/article/how-we-analyzed-medicares-drug-data-long-methodology">the methods</a> for its stories based on Medicare Part D prescription data:</p><blockquote>
<p>The data could not tell us everything. We interviewed many high-volume prescribers to better understand their patients and their practices. Some told us their numbers were high because they were credited with prescriptions by others working in the same practice. In addition, providers who primarily work in long-term care facilities or busy clinics with many patients naturally may write more prescriptions.</p>
</blockquote><h3 id="reproducibility:-save-your-scripts"><a name="reproducibility:-save-your-scripts" href="#reproducibility:-save-your-scripts"></a>Reproducibility: Save your scripts</h3><p>Data journalism should ideally be fully documented and reproducible. R makes this easy, as every operation performed can be saved in a script, and repeated by running that script. Click on the <img src="img/class5_2.jpg" alt=""> icon at top left and select <code>R Script</code>. A new panel should now open:</p><p><img src="img/class5_3.jpg" alt=""></p><p>Any code we type here can be run in the console. Hitting <code>Run</code> will run the line of code on which the cursor is sitting. To run multiple lines of code, highlight them and click <code>Run</code>.</p><p>Click on the save/disk icon in the script panel and save the blank script to the folder on your desktop containing the data for this week, calling it <code>week5.R</code>.</p><h3 id="set-your-working-directory"><a name="set-your-working-directory" href="#set-your-working-directory"></a>Set your working directory</h3><p>Now we can set the working directory to this folder by selecting from the top menu <code>Session>Set Working Directory>To Source File Location</code>. (Doing so means we can load the files in this directory without having to refer to the full path for their location, and anything we save will be written to this folder.)</p><p>Notice how this code appears in the console:</p><pre class="r hljs"><code class="R" data-origin="<pre><code class="R">setwd("~/Desktop/week5")
</code></pre>">setwd(<span class="hljs-string">"~/Desktop/week5"</span>)
</code></pre><h3 id="save-your-data"><a name="save-your-data" href="#save-your-data"></a>Save your data</h3><p>The panel at top right has three tabs, the first showing the <code>Environment</code>, or all of the “objects” loaded into memory for this R session. Save this as well, and you won’t have to load and process all of the data again if you return to a project later.</p><p>Click on the save/disk icon in the <code>Environment</code> panel to save the file as <code>week5.RData</code>. The following code should appear in the Console:</p><pre class="r hljs"><code class="r" data-origin="<pre><code class="r">save.image("~/Desktop/week5/week5.RData")
</code></pre>">save.image(<span class="hljs-string">"~/Desktop/week5/week5.RData"</span>)
</code></pre><p>Copy this code into your script, placing it at the end, with a comment, explaining what it does:</p><pre class="r hljs"><code class="r" data-origin="<pre><code class="r"># save session data
save.image("~/Desktop/week5/week5.RData")
</code></pre>"><span class="hljs-comment"># save session data</span>
save.image(<span class="hljs-string">"~/Desktop/week5/week5.RData"</span>)
</code></pre><p>Now if you run your entire script, the last action will always be to save the data in your environment.</p><h3 id="comment-your-code"><a name="comment-your-code" href="#comment-your-code"></a>Comment your code</h3><p>Anything that appears on a line after <code>#</code> will be treated as a comment, and will be ignored when the code is run. You can use this to explain what the code does. Get into the habit of commenting your code: Don’t trust yourself to remember!</p><h3 id="some-r-code-basics"><a name="some-r-code-basics" href="#some-r-code-basics"></a>Some R code basics</h3><ul>
<li><code><-</code> is known as an “assignment operator.” It means: “Make the object named to the left equal to the output of the code to the right.”</li><li><code>&</code> means AND, in Boolean logic.</li><li><code>|</code> means OR, in Boolean logic.</li><li><code>!</code> means NOT, in Boolean logic.</li><li>When referring to values entered as text, or to dates, put them in quote marks, like this: <code>"United States"</code>, or <code>"2016-07-26"</code>. Numbers are not quoted.</li><li>When entering two or more values as a list, combine them using the function <code>c</code>, for combine, with the values separated by commas, for example: <code>c("2017-07-26","2017-08-04")</code></li><li>As in a spreadsheet, you can specify a range of values with a colon, for example: <code>c(1:10)</code> creates a list of integers (whole numbers) from one to ten.</li><li><p>Some common operators:</p>
<ul>
<li><code>+</code> <code>-</code> add, subtract.</li><li><code>*</code> <code>/</code> multiply, divide.</li><li><code>></code> <code><</code> greater than, less than.</li><li><code>>=</code> <code><=</code> greater than or equal to, less than or equal to.</li><li><code>!=</code> not equal to.</li></ul>
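<p>The conventions in this list can be tried directly in the console. The following is a minimal sketch only; the object names and values are invented for illustration and do not come from the class data.</p><pre class="r hljs"><code class="r"># combine values with c(); numbers are not quoted, text is
ages <- c(25, 40, 67, 81)
countries <- c("United States", "Mexico")

# a colon gives a range of whole numbers, here one to ten
one_to_ten <- c(1:10)

# arithmetic: multiplication is done before addition, so this is 13
total <- 7 + 3 * 2

# comparisons return TRUE or FALSE for each value
over_65 <- ages > 65
not_40 <- ages != 40

# Boolean logic: & is AND, | is OR, ! is NOT
working_age <- ages >= 18 & ages <= 65
young_or_old <- ages < 30 | ages > 80
not_over_65 <- !over_65

print(over_65)
print(working_age)
</code></pre><p>Each comparison is applied to every value in <code>ages</code> in turn, which is what makes these operators useful for filtering data.</p>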
</li><li><p>Equals signs can be a little confusing, but see how they are used in the code we use today:</p>
<ul>
<li><code>==</code> test whether an object is equal to a value. This is often used when filtering data, as we will see.</li><li><code>=</code> make an object equal to a value; similar to <code><-</code>, but used within a function (see below).</li></ul>
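<p>A short sketch of the difference between the two, using invented values: <code>==</code> asks a question about each value, while <code>=</code> sets an argument inside a function call. The base R function <code>round</code> is used here only as a convenient example of a function with a named argument.</p><pre class="r hljs"><code class="r"># == tests each value and returns TRUE or FALSE
states <- c("CA", "WA", "CA", "NV")
in_california <- states == "CA"

# = sets the value of an argument inside a function;
# here digits = 2 tells round() to keep two decimal places
rounded <- round(3.14159, digits = 2)

print(in_california)
print(rounded)
</code></pre>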
</li><li><p>Handling null values:</p>
<ul>
<li>Nulls are designated as <code>NA</code>.</li><li><code>is.na(x)</code> looks for nulls within variable <code>x</code>.</li><li><code>!is.na(x)</code> looks for non-null values within variable <code>x</code>.</li></ul>
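<p>As a minimal sketch of how these tests behave, assuming a small set of invented values rather than the class data:</p><pre class="r hljs"><code class="r"># toy values invented for illustration
middle_names <- c("Arthur", NA, "Wing", NA)

# TRUE marks each null
missing <- is.na(middle_names)

# TRUE marks each non-null value
present <- !is.na(middle_names)

# TRUE counts as 1 when summed, so this counts the nulls
missing_count <- sum(missing)

# square brackets (base R) keep only the values flagged TRUE
known_names <- middle_names[present]

print(missing)
print(missing_count)
print(known_names)
</code></pre><p>Note that a test such as <code>middle_names == NA</code> returns <code>NA</code> for every value rather than <code>TRUE</code> or <code>FALSE</code>, which is why <code>is.na</code> is needed.</p>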
</li></ul><p>Here, <code>is.na</code> is a <strong>function</strong>. Functions are followed by parentheses, and act on the data/code in the parentheses.</p><p><strong>Important:</strong> Object and variable names in R should not contain spaces.</p><h3 id="introducing-r-packages-and-the-tidyverse"><a name="introducing-r-packages-and-the-tidyverse" href="#introducing-r-packages-and-the-tidyverse"></a>Introducing R packages and the tidyverse</h3><p>Much of the power of R comes from the thousands of “packages” written by its community of open source contributors. These are optimized for specific statistical, graphical or data-processing tasks. To see what packages are available in the basic distribution of R, select the <code>Packages</code> tab in the panel at bottom right. To find packages for particular tasks, try searching Google using appropriate keywords and the phrase “R package.”</p><p>Our goal for today’s class is to get used to processing and analyzing data using a powerful series of R packages known as the <strong><a href="https://blog.rstudio.org/2016/09/15/tidyverse-1-0-0/">tidyverse</a></strong>.</p><p>The tidyverse was pioneered by <a href="http://hadley.nz/">Hadley Wickham</a>, chief scientist at RStudio, but now has many contributors.</p><p>Today, we will start by using:</p><ul>
<li><strong><a href="http://readr.tidyverse.org/">readr</a></strong> Reads and writes CSV and other text files.</li><li><strong><a href="http://dplyr.tidyverse.org/">dplyr</a></strong> Processes and analyzes data, using the operations we discussed in the first class.</li><li><strong><a href="http://lubridate.tidyverse.org/">lubridate</a></strong> Makes working with dates and times much easier.</li></ul><p>To install a package, click on the <code>Install</code> icon in the <code>Packages</code> tab, type its name into the dialog box, and make sure that <code>Install dependencies</code> is checked, as some packages will only run correctly if other packages are also installed. The tidyverse packages can be installed in one go. Click <code>Install</code> and all of the required packages should install:</p><p><img src="img/class5_4.jpg" alt=""></p><p>Notice that the following code appears in the console:</p><pre class="r hljs"><code class="r" data-origin="<pre><code class="r">install.packages("tidyverse")
</code></pre>">install.packages(<span class="hljs-string">"tidyverse"</span>)
</code></pre><p>So you can also install packages with code in this format, without using the point-and-click interface.</p><p>Each time you start R, it’s a good idea to click on <code>Update</code> in the <code>Packages</code> panel to update all your installed packages to the latest versions.</p><p>Installing a package makes it available to you, but to use it in any R session you need to load it. You can do this by checking its box in the <code>Packages</code> panel. However, we will enter the following code into our script, then highlight these lines of code and run them:</p><pre class="r hljs"><code class="r" data-origin="<pre><code class="r"># load packages to read and write csv files, process data, and work with dates
library(readr)
library(dplyr)
library(lubridate)
</code></pre>"><span class="hljs-comment"># load packages to read and write csv files, process data, and work with dates</span>
<span class="hljs-keyword">library</span>(readr)
<span class="hljs-keyword">library</span>(dplyr)
<span class="hljs-keyword">library</span>(lubridate)
</code></pre><p>At this point, and at regular intervals, save your script, by clicking the save/disk icon in the script panel, or using the <code>⌘-S</code> keyboard shortcut.</p><h3 id="load-and-view-data"><a name="load-and-view-data" href="#load-and-view-data"></a>Load and view data</h3><h4 id="load-data"><a name="load-data" href="#load-data"></a>Load data</h4><p>You can load data into the current R session by selecting <code>Import Dataset>From Text File...</code> in the <code>Environment</code> tab.</p><p>However, we will use the <code>read_csv</code> function from the <strong>readr</strong> package. Copy the following code into your script and <code>Run</code>:</p><pre class="r hljs"><code class="r" data-origin="<pre><code class="r"># load ca medical board disciplinary actions data
ca_discipline &lt;- read_csv("ca_discipline.csv")
</code></pre>"><span class="hljs-comment"># load ca medical board disciplinary actions data</span>
ca_discipline <- read_csv(<span class="hljs-string">"ca_discipline.csv"</span>)
</code></pre><p>Notice that the <code>Environment</code> now contains an object of the type <code>tbl_df</code>, a variety of the standard R object for holding tables of data, known as a <strong>data frame</strong>:</p><p><img src="img/class5_5.jpg" alt=""></p><h4 id="examine-the-data"><a name="examine-the-data" href="#examine-the-data"></a>Examine the data</h4><p>We can <code>View</code> data at any time by clicking on its table icon in the <code>Environment</code> tab in the <code>Grid</code> view. The following code has the same effect:</p><pre class="r hljs"><code class="r" data-origin="<pre><code class="r">View(ca_discipline)
</code></pre>">View(ca_discipline)
</code></pre><p>The <code>glimpse</code> function from <strong>dplyr</strong> will tell you more about the variables in your data, including their data type. Copy this code into your script and <code>Run</code>:</p><pre class="r hljs"><code class="r" data-origin="<pre><code class="r"># view structure of data
glimpse(ca_discipline)
</code></pre>"><span class="hljs-comment"># view structure of data</span>
glimpse(ca_discipline)
</code></pre><p>This should give the following output in the R Console:</p><pre class="json hljs"><code class="JSON" data-origin="<pre><code class="JSON">Variables: 10
$ alert_date &lt;date&gt; 2008-04-18, 2008-04-21, 2008-04-23, 2008-04-28, 2008-05-15, 2008-05-15, 2008-06-18, 2008-06-27, 2008...
$ last_name &lt;chr&gt; "Boyajian", "Cragen", "Chow", "Gravich", "Kabacy", "Aboulhosn", "Harron", "Fitzpatrick", "Adrian", "M...
$ first_name &lt;chr&gt; "John", "Richard", "Hubert", "Anna", "George", "Kamal", "Raymond", "Christian", "Adrian", "Pamela", "...
$ middle_name &lt;chr&gt; "Arthur", "Darin", "Wing", NA, "E.", "Fouad", "A.", "John", NA, "J.", "Quoc", "M.", "Elisabet", NA, "...
$ name_suffix &lt;chr&gt; NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ city &lt;chr&gt; "Boise", "Temecula", "San Gabriel", "Los Angeles", "Lacey", "Yakima", "Bridgeport", "Las Vegas", "Las...
$ state &lt;chr&gt; "ID", "CA", "CA", "CA", "WA", "WA", "WV", "NM", "NV", "CA", "CA", "CA", "CA", "CA", "CA", "TN", "CA",...
$ license &lt;chr&gt; "A25855", "A54872", "G45435", "A40805", "G13766", "CFE40080", "G8415", "G47520", "AFE56237", "G85601"...
$ action_type &lt;chr&gt; "Surrendered", "Surrendered", "Superior Court Order/Restrictions", "Superior Court Order/Restrictions...
$ action_date &lt;date&gt; 2008-04-18, 2008-04-21, 2008-03-10, 2008-04-25, 2008-05-15, 2008-05-15, 2008-06-18, 2008-06-27, 2008...
</code></pre>">Variables: 10
$ alert_date <date> 2008-04-18, 2008-04-21, 2008-04-23, 2008-04-28, 2008-05-15, 2008-05-15, 2008-06-18, 2008-06-27, 2008...
$ last_name <chr> "Boyajian", "Cragen", "Chow", "Gravich", "Kabacy", "Aboulhosn", "Harron", "Fitzpatrick", "Adrian", "M...
$ first_name <chr> "John", "Richard", "Hubert", "Anna", "George", "Kamal", "Raymond", "Christian", "Adrian", "Pamela", "...
$ middle_name <chr> "Arthur", "Darin", "Wing", NA, "E.", "Fouad", "A.", "John", NA, "J.", "Quoc", "M.", "Elisabet", NA, "...
$ name_suffix <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ city <chr> "Boise", "Temecula", "San Gabriel", "Los Angeles", "Lacey", "Yakima", "Bridgeport", "Las Vegas", "Las...
$ state <chr> "ID", "CA", "CA", "CA", "WA", "WA", "WV", "NM", "NV", "CA", "CA", "CA", "CA", "CA", "CA", "TN", "CA",...
$ license <chr> "A25855", "A54872", "G45435", "A40805", "G13766", "CFE40080", "G8415", "G47520", "AFE56237", "G85601"...
$ action_type <chr> "Surrendered", "Surrendered", "Superior Court Order/Restrictions", "Superior Court Order/Restrictions...
$ action_date <date> 2008-04-18, 2008-04-21, 2008-03-10, 2008-04-25, 2008-05-15, 2008-05-15, 2008-06-18, 2008-06-27, 2008...
</code></pre><p><code>chr</code> means “character,” or a string of text (which can also be treated as a categorical variable); <code>date</code> means a date. While we don’t have these data types here, <code>int</code> means an integer, or whole number; <code>dbl</code> means a number that may include decimal fractions; and <code>POSIXct</code> means a full date and timestamp.</p><p>If you run into any trouble importing data with <strong>readr</strong>, you may need to specify the data types for some columns — in particular for date and time. <a href="https://github.com/hadley/readr/blob/master/vignettes/column-types.Rmd">This link</a> explains how to set data types for individual variables when importing data with <strong>readr</strong>.</p><p>To specify an individual column use the name of the data frame and the column name, separated by <code>$</code>. Type this into your script and run:</p><pre class="r hljs"><code class="r" data-origin="<pre><code class="r"># print values for alert_date in the ca_discipline data
print(ca_discipline$alert_date)
</code></pre>"><span class="hljs-comment"># print values for alert_date in the ca_discipline data</span>
print(ca_discipline$alert_date)
</code></pre><p>The output will be the first 1,000 values for that variable.</p><p>If you need to change the data type for any variable, use the following functions:</p><ul>
<li><code>as.character</code> converts to a text string.</li><li><code>as.numeric</code> converts to a number that may include decimal fractions (<code>dbl</code>).</li><li><code>as.factor</code> converts to a categorical variable.</li><li><code>as.integer</code> converts to an integer.</li><li><code>as.Date</code> converts to a date.</li><li><code>as.POSIXct</code> converts to a full date and timestamp.</li></ul><p>So this code will convert the <code>alert_date</code> values to text:</p><pre class="r hljs"><code class="r" data-origin="<pre><code class="r"># convert alert_date to text
ca_discipline$alert_date &lt;- as.character(ca_discipline$alert_date)
glimpse(ca_discipline)
</code></pre>"><span class="hljs-comment"># convert alert_date to text</span>
ca_discipline$alert_date <- as.character(ca_discipline$alert_date)
glimpse(ca_discipline)
</code></pre><p>Notice that the data type for <code>alert_date</code> has now changed:</p><pre class="json hljs"><code class="JSON" data-origin="<pre><code class="JSON">Observations: 7,561
Variables: 10
$ alert_date &lt;chr&gt; "2008-04-18", "2008-04-21", "2008-04-23", "2008-04-28", "2008-05-15", "2008-05-15", "2008-06-18", "20...
$ last_name &lt;chr&gt; "Boyajian", "Cragen", "Chow", "Gravich", "Kabacy", "Aboulhosn", "Harron", "Fitzpatrick", "Adrian", "M...
$ first_name &lt;chr&gt; "John", "Richard", "Hubert", "Anna", "George", "Kamal", "Raymond", "Christian", "Adrian", "Pamela", "...
$ middle_name &lt;chr&gt; "Arthur", "Darin", "Wing", NA, "E.", "Fouad", "A.", "John", NA, "J.", "Quoc", "M.", "Elisabet", NA, "...
$ name_suffix &lt;chr&gt; NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ city &lt;chr&gt; "Boise", "Temecula", "San Gabriel", "Los Angeles", "Lacey", "Yakima", "Bridgeport", "Las Vegas", "Las...
$ state &lt;chr&gt; "ID", "CA", "CA", "CA", "WA", "WA", "WV", "NM", "NV", "CA", "CA", "CA", "CA", "CA", "CA", "TN", "CA",...
$ license &lt;chr&gt; "A25855", "A54872", "G45435", "A40805", "G13766", "CFE40080", "G8415", "G47520", "AFE56237", "G85601"...
$ action_type &lt;chr&gt; "Surrendered", "Surrendered", "Superior Court Order/Restrictions", "Superior Court Order/Restrictions...
$ action_date &lt;date&gt; 2008-04-18, 2008-04-21, 2008-03-10, 2008-04-25, 2008-05-15, 2008-05-15, 2008-06-18, 2008-06-27, 2008...
</code></pre>">Observations: 7,561
Variables: 10
$ alert_date <chr> "2008-04-18", "2008-04-21", "2008-04-23", "2008-04-28", "2008-05-15", "2008-05-15", "2008-06-18", "20...
$ last_name <chr> "Boyajian", "Cragen", "Chow", "Gravich", "Kabacy", "Aboulhosn", "Harron", "Fitzpatrick", "Adrian", "M...
$ first_name <chr> "John", "Richard", "Hubert", "Anna", "George", "Kamal", "Raymond", "Christian", "Adrian", "Pamela", "...
$ middle_name <chr> "Arthur", "Darin", "Wing", NA, "E.", "Fouad", "A.", "John", NA, "J.", "Quoc", "M.", "Elisabet", NA, "...
$ name_suffix <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ city <chr> "Boise", "Temecula", "San Gabriel", "Los Angeles", "Lacey", "Yakima", "Bridgeport", "Las Vegas", "Las...
$ state <chr> "ID", "CA", "CA", "CA", "WA", "WA", "WV", "NM", "NV", "CA", "CA", "CA", "CA", "CA", "CA", "TN", "CA",...
$ license <chr> "A25855", "A54872", "G45435", "A40805", "G13766", "CFE40080", "G8415", "G47520", "AFE56237", "G85601"...
$ action_type <chr> "Surrendered", "Surrendered", "Superior Court Order/Restrictions", "Superior Court Order/Restrictions...
$ action_date <date> 2008-04-18, 2008-04-21, 2008-03-10, 2008-04-25, 2008-05-15, 2008-05-15, 2008-06-18, 2008-06-27, 2008...
</code></pre><h3 id="process-and-analyze-data-with-dplyr"><a name="process-and-analyze-data-with-dplyr" href="#process-and-analyze-data-with-dplyr"></a>Process and analyze data with dplyr</h3><p>Now we will use <strong>dplyr</strong> to process the data, using the basic operations we discussed in week 1:</p><ul>
<li><p><strong>Sort:</strong> Largest to smallest, oldest to newest, alphabetical etc.</p>
</li><li><p><strong>Filter:</strong> Select a defined subset of the data.</p>
</li><li><p><strong>Summarize/Aggregate:</strong> Deriving one value from a series of other values to produce a summary statistic. Examples include: count, sum, mean, median, maximum, minimum etc. Often you’ll <strong>group</strong> data into categories first, and then aggregate by group.</p>
</li><li><p><strong>Join:</strong> Merging entries from two or more datasets based on common field(s), e.g. unique ID number, last name and first name.</p>
</li></ul><p>Here are some of the most useful functions in <strong>dplyr</strong>:</p><ul>
<li><code>select</code> Choose which columns to include.</li><li><code>filter</code> <strong>Filter</strong> the data.</li><li><code>arrange</code> <strong>Sort</strong> the data, by size for continuous variables, by date, or alphabetically.</li><li><code>group_by</code> <strong>Group</strong> the data by a categorical variable.</li><li><code>summarize</code> <strong>Summarize</strong>, or aggregate (for each group if following <code>group_by</code>). Often used in conjunction with functions including:<ul>
<li><code>mean(x)</code> Calculate the mean, or average, for variable <code>x</code>.</li><li><code>median(x)</code> Calculate the median.</li><li><code>max(x)</code> Find the maximum value.</li><li><code>min(x)</code> Find the minimum value.</li><li><code>sum(x)</code> Add all the values together.</li><li><code>n()</code> Count the number of records. Here there isn’t a variable in the brackets of the function, because the number of records applies to all variables.</li><li><code>n_distinct(x)</code> Count the number of unique values in variable <code>x</code>.</li></ul>
</li><li><code>mutate</code> Create new column(s) in the data, or change existing column(s).</li><li><code>rename</code> Rename column(s).</li><li><code>bind_rows</code> Merge two data frames into one, combining data from columns with the same name.</li></ul><p>There are also various functions to <strong>join</strong> data, which we will explore next week.</p><p>These functions can be chained together using the “pipe” operator <code>%>%</code>, which makes the output of one line of code the input for the next. This allows you to run through a series of operations in a logical order. I find it helpful to think of <code>%>%</code> as meaning “then.”</p><p>Now we will use <strong>dplyr</strong> to turn the <code>alert_date</code> variable back to dates:</p><pre class="r hljs"><code class="R" data-origin="<pre><code class="R"># convert alert_date to date using dplyr
ca_discipline &lt;- ca_discipline %&gt;%
mutate(alert_date = as.Date(alert_date))
</code></pre>"><span class="hljs-comment"># convert alert_date to date using dplyr</span>
ca_discipline <- ca_discipline %>%
mutate(alert_date = as.Date(alert_date))
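# optional check, not part of the original notes: class() reports the data type,
# so you can confirm that alert_date is now stored as dates
class(ca_discipline$alert_date)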
</code></pre><p>This code copies the <code>ca_discipline</code> data frame, and then (<code>%>%</code>) uses <strong>dplyr</strong>’s <code>mutate</code> function to change the data type for the <code>alert_date</code> variable. Because the copied data frame has the same name, it overwrites the original version.</p><h4 id="filter-and-sort-data"><a name="filter-and-sort-data" href="#filter-and-sort-data"></a>Filter and sort data</h4><p>To get used to working with <strong>dplyr</strong>, we will now start filtering and sorting the data according to the doctors’ locations, and the types of disciplinary actions they faced.</p><p>First, let’s look at the types of disciplinary actions in the data:</p><pre class="r hljs"><code class="R" data-origin="<pre><code class="R"># look at types of disciplinary actions
types &lt;- ca_discipline %&gt;%
select(action_type) %&gt;%
unique()
</code></pre>"><span class="hljs-comment"># look at types of disciplinary actions</span>
types <- ca_discipline %>%
select(action_type) %>%
unique()
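# an alternative sketch, not in the original notes: dplyr's distinct() keeps one row
# for each unique value of the named column (types_alt is a hypothetical name)
types_alt <- ca_discipline %>%
  distinct(action_type)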
</code></pre><p>This code first copies the <code>ca_discipline</code> data frame into a new object called <code>types</code>. Then (<code>%>%</code>) it <code>select</code>s the <code>action_type</code> variable only. Finally, it uses a function called <code>unique</code> to keep only the unique values in that variable, with no duplicates.</p><p>The new data frame has one column and 75 rows. If you <code>View</code> it, the first few rows should look like this:</p><p><img src="img/class5_6.jpg" alt=""></p><p>Many of the values for <code>action_type</code> seem to be subtle variations of the same thing. If we were going to analyze all the types in detail, we might need to group some of these together, after speaking to an expert to understand what they all mean. But some of the action types are clear and unambiguous: <code>Revoked</code> is the medical board’s most severe sanction, which cancels a doctor’s license to practise. So let’s first <strong>filter</strong> the data to look at those actions:</p><pre class="r hljs"><code class="R" data-origin="<pre><code class="R"># filter for license revocations only
revoked &lt;- ca_discipline %&gt;%
filter(action_type == "Revoked")
</code></pre>"><span class="hljs-comment"># filter for license revocations only</span>
revoked <- ca_discipline %>%
filter(action_type == <span class="hljs-string">"Revoked"</span>)
</code></pre><p>This code copies the <code>ca_discipline</code> data into a new data frame called <code>revoked</code> and then (<code>%>%</code>) uses the <code>filter</code> function to include only actions in which a doctor’s license was revoked. Notice the use of <code>==</code> to test whether <code>action_type</code> is <code>Revoked</code>.</p><p>There should be 446 rows in the filtered data, and the first few should look like this:</p><p><img src="img/class5_7.jpg" alt=""></p><p>Some of the doctors are not actually based in California. Doctors can be licensed to practise in more than one state, and the Medical Board of California typically issues its own sanction if a doctor is disciplined in their home state. So let’s now <strong>filter</strong> the data to look only at doctors with revoked licenses who were based in California, and <strong>sort</strong> them by city.</p><pre class="r hljs"><code class="R" data-origin="<pre><code class="R"># filter for license revocations by doctors based in California, and sort by city
revoked_ca &lt;- ca_discipline %&gt;%
filter(action_type == "Revoked"
&amp; state == "CA") %&gt;%
arrange(city)
</code></pre>"><span class="hljs-comment"># filter for license revocations by doctors based in California, and sort by city</span>
revoked_ca <- ca_discipline %>%
filter(action_type == <span class="hljs-string">"Revoked"</span>
& state == <span class="hljs-string">"CA"</span>) %>%
arrange(city)
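# sketch of the reverse sort mentioned in the text below: desc() sorts city from Z to A
# (revoked_ca_desc is a hypothetical name, used so that revoked_ca is not overwritten)
revoked_ca_desc <- revoked_ca %>%
  arrange(desc(city))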
</code></pre><p>Here, the <code>filter</code> combines two conditions with <code>&</code>. That means that both have to be met for the data to be included. Then <code>arrange</code> <strong>sorts</strong> the data, which for a text variable will be in alphabetical order. If you wanted to sort in reverse alphabetical order, the code would be: <code>arrange(desc(city))</code>.</p><p>There should be 274 rows in the filtered data, and the first few should look like this:</p><p><img src="img/class5_8.jpg" alt=""></p><p>This code will achieve the same result. You should be able to work out why:</p><pre class="r hljs"><code class="R" data-origin="<pre><code class="R"># filter for license revocations by doctors based in California, and sort by city
revoked_ca &lt;- revoked %&gt;%
filter(state == "CA") %&gt;%
arrange(city)
</code></pre>"><span class="hljs-comment"># filter for license revocations by doctors based in California, and sort by city</span>
revoked_ca <- revoked %>%
filter(state == <span class="hljs-string">"CA"</span>) %>%
arrange(city)
</code></pre><p>Now let’s <strong>filter</strong> for doctors based in Berkeley or Oakland who have had their licenses revoked:</p><pre class="r hljs"><code class="R" data-origin="<pre><code class="R"># doctors in Berkeley or Oakland who have had their licenses revoked
revoked_oak_berk &lt;- ca_discipline %&gt;%
filter(action_type == "Revoked"
&amp; (city == "Oakland" | city == "Berkeley"))
</code></pre>"><span class="hljs-comment"># doctors in Berkeley or Oakland who have had their licenses revoked </span>
revoked_oak_berk <- ca_discipline %>%
filter(action_type == <span class="hljs-string">"Revoked"</span>
& (city == <span class="hljs-string">"Oakland"</span> | city == <span class="hljs-string">"Berkeley"</span>))
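# an alternative sketch, not in the original notes: %in% tests whether city matches
# any value in a list of values (revoked_oak_berk_alt is a hypothetical name)
revoked_oak_berk_alt <- ca_discipline %>%
  filter(action_type == "Revoked"
         & city %in% c("Oakland", "Berkeley"))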
</code></pre><p>There should be just two doctors:</p><p><img src="img/class5_9.jpg" alt=""></p><p>This code uses <code>|</code> to look for doctors in either Oakland <strong>or</strong> Berkeley. That part of the <code>filter</code> function is wrapped in parentheses to ensure that it is evaluated first.</p><p>See what happens if you remove those parentheses, and work out why the result has changed. We will discuss this in class.</p><h4 id="append-data-using-`bind_rows`"><a name="append-data-using-`bind_rows`" href="#append-data-using-`bind_rows`"></a>Append data using <code>bind_rows</code></h4><p>To demonstrate the <code>bind_rows</code> function, we will <strong>filter</strong> for doctors with revoked licenses in each of the two cities separately, and then append one data frame to the other.</p><pre class="r hljs"><code class="R" data-origin="<pre><code class="R"># doctors in Berkeley who had their licenses revoked
revoked_berk &lt;- ca_discipline %&gt;%
filter(action_type == "Revoked"
&amp; city == "Berkeley")
# doctors in Oakland who had their licenses revoked
revoked_oak &lt;- ca_discipline %&gt;%
filter(action_type == "Revoked"
&amp; city == "Oakland")
# doctors in Berkeley or Oakland who have had their licenses revoked
revoked_oak_berk &lt;- bind_rows(revoked_oak, revoked_berk)
</code></pre>"><span class="hljs-comment"># doctors in Berkeley who had their licenses revoked</span>
revoked_berk <- ca_discipline %>%
filter(action_type == <span class="hljs-string">"Revoked"</span>
& city == <span class="hljs-string">"Berkeley"</span>)
<span class="hljs-comment"># doctors in Oakland who had their licenses revoked</span>
revoked_oak <- ca_discipline %>%
filter(action_type == <span class="hljs-string">"Revoked"</span>
& city == <span class="hljs-string">"Oakland"</span>)
<span class="hljs-comment"># doctors in Berkeley or Oakland who have had their licenses revoked</span>
revoked_oak_berk <- bind_rows(revoked_oak, revoked_berk)
</code></pre><h4 id="in-class-practice-with-filtering-and-sorting"><a name="in-class-practice-with-filtering-and-sorting" href="#in-class-practice-with-filtering-and-sorting"></a>In-class practice with filtering and sorting</h4><ul>
<li><p><strong>Filter</strong> the <code>ca_discipline</code> data to show licenses <code>Revoked</code> for doctors based in Los Angeles. <strong>Sort</strong> the result in reverse date order, most recent first. </p>
</li><li><p><strong>Filter</strong> the data to show licenses <code>Suspended</code> or <code>Revoked</code> for doctors in Los Angeles or San Diego. <strong>Sort</strong> the result in alphabetical order of the doctors’ names, first by last name, then by first name, then by middle name(s). (Hint: You can sort by multiple variables by separating them with a comma.)</p>
</li></ul><h4 id="write-data-to-a-csv-file"><a name="write-data-to-a-csv-file" href="#write-data-to-a-csv-file"></a>Write data to a CSV file</h4><p>The <strong>readr</strong> package can also be used to write data from your environment into a CSV file:</p><pre class="r hljs"><code class="R" data-origin="<pre><code class="R"># write data to CSV file
write_csv(revoked_oak_berk, "revoked_oak_berk.csv", na = "")
</code></pre>"><span class="hljs-comment"># write data to CSV file</span>
write_csv(revoked_oak_berk, <span class="hljs-string">"revoked_oak_berk.csv"</span>, na = <span class="hljs-string">""</span>)
</code></pre><p>The code <code>na = ""</code> ensures that null values in the data are written as blank cells; otherwise they would contain the letters <code>NA</code>.</p><h4 id="group-and-summarize-data"><a name="group-and-summarize-data" href="#group-and-summarize-data"></a>Group and summarize data</h4><p>Next we will <strong>group</strong> and <strong>summarize</strong> data by counting disciplinary actions by year and by month. But before doing that, we need to use the <strong>lubridate</strong> package to extract the year and month from <code>action_date</code>.</p><pre class="r hljs"><code class="R" data-origin="<pre><code class="R"># extract year and month from action_date
ca_discipline &lt;- ca_discipline %&gt;%
mutate(year = year(action_date),
month = month(action_date))
</code></pre>"><span class="hljs-comment"># extract year and month from action_date</span>
ca_discipline <- ca_discipline %>%
mutate(year = year(action_date),
month = month(action_date))
</code></pre><p>Now we have two extra columns in the data, giving the year and the month as numbers:</p><p><img src="img/class5_10.jpg" alt=""></p><p>The <code>year</code> and <code>month</code> functions are from the <strong>lubridate</strong> package.</p><p>Previously we used <strong>dplyr</strong>’s <code>mutate</code> function to modify an existing variable. Here we used it to create new variables. You can create or modify multiple variables in the same <code>mutate</code> function, separating each one by commas. Notice the use of <code>=</code> to make a variable equal to the output of code, within the <code>mutate</code> function.</p><p>Now we can calculate the number of license revocations for doctors based in California by year:</p><pre class="r hljs"><code class="R" data-origin="<pre><code class="R"># license revocations for doctors based in California, by year
revoked_ca_year &lt;- ca_discipline %&gt;%
filter(action_type == "Revoked"
&amp; state == "CA") %&gt;%
group_by(year) %&gt;%
summarize(revocations = n())
</code></pre>"><span class="hljs-comment"># license revocations for doctors based in California, by year</span>
revoked_ca_year <- ca_discipline %>%
filter(action_type == <span class="hljs-string">"Revoked"</span>
& state == <span class="hljs-string">"CA"</span>) %>%
group_by(year) %>%
summarize(revocations = n())
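# an alternative sketch, not in the original notes: dplyr's count() combines group_by
# and summarize in one step; the count column is named n by default
# (revoked_ca_year_alt is a hypothetical name)
revoked_ca_year_alt <- ca_discipline %>%
  filter(action_type == "Revoked"
         & state == "CA") %>%
  count(year)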
</code></pre><p>This should be the result:</p><p><img src="img/class5_11.jpg" alt=""></p><p>The code first <strong>filters</strong> the data, as before, then <strong>groups</strong> by the new variable <code>year</code> using <code>group_by</code>, then <strong>summarizes</strong> by counting the number of records for each year using <code>summarize</code>. The last function creates a new variable called <code>revocations</code> from <code>n()</code>, which is a count of the rows in the data for each year.</p><p>This result and the raw <code>ca_discipline</code> data show that we only have partial data for 2008. So if we want to count the number of license revocations by month over all the years, we should first <strong>filter</strong> out the data for 2008, which will otherwise skew the result.</p><pre class="r hljs"><code class="R" data-origin="<pre><code class="R"># license revocations for doctors based in California, by month
revoked_ca_month &lt;- ca_discipline %&gt;%
filter(action_type == "Revoked"
&amp; state == "CA"
&amp; year &gt;= 2009) %&gt;%
group_by(month) %&gt;%
summarize(revocations = n())
</code></pre>"><span class="hljs-comment"># license revocations for doctors based in California, by month</span>
revoked_ca_month <- ca_discipline %>%
filter(action_type == <span class="hljs-string">"Revoked"</span>
& state == <span class="hljs-string">"CA"</span>
& year >= <span class="hljs-number">2009</span>) %>%
group_by(month) %>%
summarize(revocations = n())
</code></pre><p>This should be the result:</p><p><img src="img/class5_12.jpg" alt=""></p><p>Notice how we used <code>>=</code> to filter for data where the year was 2009 or greater. The following code will achieve the same result:</p><pre class="r hljs"><code class="R" data-origin="<pre><code class="R"># license revocations for doctors based in California, by month
revoked_ca_month &lt;- ca_discipline %&gt;%
filter(action_type == "Revoked"
&amp; state == "CA"
&amp; year != 2008) %&gt;%
group_by(month) %&gt;%
summarize(revocations = n())
</code></pre>"><span class="hljs-comment"># license revocations for doctors based in California, by month</span>
revoked_ca_month <- ca_discipline %>%
filter(action_type == <span class="hljs-string">"Revoked"</span>
& state == <span class="hljs-string">"CA"</span>
& year != <span class="hljs-number">2008</span>) %>%
group_by(month) %>%
summarize(revocations = n())
</code></pre><p>We can <strong>group</strong> and <strong>summarize</strong> by more than one variable at a time. The following code counts the number of actions of all types, by month and year. Again, we will first <strong>filter</strong> out the incomplete data for 2008.</p><pre class="r hljs"><code class="R" data-origin="<pre><code class="R"># disciplinary actions for doctors in California by year and month, from 2009 to 2017
actions_year_month &lt;- ca_discipline %&gt;%
filter(state == "CA"
&amp; year &gt;= 2009) %&gt;%
group_by(year, month) %&gt;%
summarize(actions = n()) %&gt;%
arrange(year, month)
</code></pre>"><span class="hljs-comment"># disciplinary actions for doctors in California by year and month, from 2009 to 2017</span>
actions_year_month <- ca_discipline %>%
filter(state == <span class="hljs-string">"CA"</span>
& year >= <span class="hljs-number">2009</span>) %>%
group_by(year, month) %>%
summarize(actions = n()) %>%
arrange(year, month)
</code></pre><p>The first few rows should look like this:</p><p><img src="img/class5_13.jpg" alt=""></p><p>Notice that <code>group_by</code> <strong>groups</strong> and <code>arrange</code> <strong>sorts</strong> the data by two variables each, separated by commas.</p><p>Finally, let’s calculate the <strong>mean</strong> and <strong>median</strong> number of disciplinary actions issued per month, over the period 2009 to 2017:</p><pre class="r hljs"><code class="R" data-origin="<pre><code class="R"># mean and median actions per month, 2009 to 2017
summary_year_month &lt;- actions_year_month %&gt;%
ungroup() %&gt;%
summarize(mean = mean(actions),
median = median(actions))
</code></pre>"><span class="hljs-comment"># mean and median actions per month, 2009 to 2017</span>
summary_year_month <- actions_year_month %>%
ungroup() %>%
summarize(mean = mean(actions),
median = median(actions))
</code></pre><p>This should be the result:</p><p><img src="img/class5_14.jpg" alt=""></p><p>In this code we calculated two summary statistics, <code>mean</code> and <code>median</code>, in the same <code>summarize</code> function, separating each calculation by a comma. First, however, we had to <code>ungroup</code> the <code>actions_year_month</code> data frame. After summarizing by <code>year</code> and <code>month</code>, it is still grouped by <code>year</code>, so without <code>ungroup</code> the mean and median would be calculated separately for each year.</p><p>Think about the similarities and differences between <strong>grouping</strong> and <strong>summarizing</strong> data using <strong>dplyr</strong> and the spreadsheet pivot tables you made in week 3. We will discuss this in class.</p><h4 id="in-class-practice-with-filtering,-grouping,-and-summarizing"><a name="in-class-practice-with-filtering,-grouping,-and-summarizing" href="#in-class-practice-with-filtering,-grouping,-and-summarizing"></a>In-class practice with filtering, grouping, and summarizing</h4><ul>
<li><p>Calculate the total number of licenses <code>Suspended</code> or <code>Revoked</code> for doctors based in California for each year.</p>
</li><li><p>Calculate the total number of licenses revoked each year for doctors based in states other than California.</p>
</li></ul><h3 id="closing-down-properly"><a name="closing-down-properly" href="#closing-down-properly"></a>Closing down properly</h3><p>Whenever you exit R, get into the habit of saving your script and the data in your environment. Then close your script and any data frames open in <code>View</code>. When you close R, select <code>Don't Save</code> at this prompt:</p><p><img src="img/class5_15.jpg" alt=""></p><p>These actions ensure that R Studio will open cleanly, without remnants of a previous session, when you next launch it.</p><h3 id="exercises/assignment"><a name="exercises/assignment" href="#exercises/assignment"></a>Exercises/assignment</h3><p>These exercises are designed for you to practice writing code to load, <strong>filter</strong>, <strong>sort</strong>, <strong>group</strong>, and <strong>summarize</strong> data.</p><p>First open the <code>.RData</code> file from class, by clicking this icon (<img src="img/class5_16.jpg" alt="">) in the <code>Environment</code> panel, and navigating to the folder with the data. If you do this, you won’t need to reload the <code>ca_discipline</code> data, and you won’t need to create the variable <code>year</code> in the data, which you will need for the exercises below.</p><p>However, if you’d like to practice these things, you are also welcome to start from scratch and include in your script the code that loads the <code>ca_discipline</code> data and creates the <code>year</code> variable.</p><p>Now open a new R script, save it into the same folder as your data with the name <code>week5_assignment.R</code>, and set your working directory to this location, as before.</p><ul>
<li><p>Using the <code>ca_discipline</code> data, count the number of revoked licenses in each city in 2017 only, and sort so that the cities with the most revoked licenses appear first. Hint: <code>filter</code> by year first, then <code>group_by</code> city, then <code>summarize</code> with a count (<code>n()</code>), before sorting with <code>arrange</code>.</p>
</li><li><p>Count the number of actions of any type by city and by year, not including the incomplete data for 2008. Hint: again you need to <code>filter</code> to remove the data for 2008, but this time you will need to <code>group_by</code> two variables before counting with <code>summarize</code>.</p>
</li><li><p>Find the doctor(s) based in California with the largest number of actions of any type in the <code>ca_discipline</code> data. Hint: There is no need to <code>filter</code> the data for this one. You will first need to <code>group_by</code> several variables so that you can easily identify the doctors from the data that is returned (think names, location etc). Then you will need to <code>summarize</code> with a count, before sorting with <code>arrange</code> so that the doctors with the most actions on their record are at the top of the data.</p>
</li><li><p>Write this data to a file called <code>doctors_all_actions.csv</code>.</p>
</li><li><p>Load the file <code>ca_medicare_opioids.csv</code> using <code>read_csv</code> to make a data frame called <code>ca_opioids</code>. Then make a data frame with just one column, <code>generic_name</code>, showing all the generic drugs in the <code>ca_opioids</code> data (with no duplicates). Hint: This is very similar to looking for all of the types of actions in the <code>ca_discipline</code> data, which we did in class. You will need to use <code>select</code>.</p>
</li></ul><p>File your R script (<code>week5_assignment.R</code>) with the code to complete these exercises, and the saved CSV file (<code>doctors_all_actions.csv</code>), via bCourses by <strong>Weds Feb 21 at 8.00pm</strong>.</p><h3 id="further-reading"><a name="further-reading" href="#further-reading"></a>Further reading</h3><p><strong><a href="https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html">Introduction to dplyr</a></strong></p><p><strong><a href="https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf">RStudio Data Wrangling Cheat Sheet</a></strong><br>Also introduces the <a href="https://blog.rstudio.org/2014/07/22/introducing-tidyr/"><strong>tidyr</strong></a> package, which can manage wide-to-long transformations, and text-to-columns splits, among other data manipulations.</p><p><strong><a href="http://stackoverflow.com/">Stack Overflow</a></strong><br>For any work involving code, this question-and-answer site is a great resource for when you get stuck, to see how others have solved similar problems. Search the site, or <a href="http://stackoverflow.com/questions/tagged/r">browse R questions</a>.</p>
</div> <!-- /.container all -->
<script src="https://code.jquery.com/jquery.min.js"></script>
<script src="js/bootstrap.min.js"></script>
</body>
</html>