<!DOCTYPE html>
<html>

<head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
  <title>What is data?</title>
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <link href="css/bootstrap.min.css" rel="stylesheet"> 
  <link href="css/custom.css" rel="stylesheet">
</head>

<body class="markdown github">

	<header class="navbar-inverse navbar-fixed-top">
		<div class="container">
			<nav role="navigation">
				<div class="navbar-header">
					<button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#bs-example-navbar-collapse-1">
						<span class="sr-only">Toggle navigation</span>
						<span class="icon-bar"></span>
						<span class="icon-bar"></span>
						<span class="icon-bar"></span>
					</button>
					<a href="index.html" class="navbar-brand">J298 Data Journalism</a>
				</div> <!-- /.navbar-header -->
				<!-- Collect the nav links, forms, and other content for toggling -->
				<div class="collapse navbar-collapse" id="bs-example-navbar-collapse-1">
					<ul class="nav navbar-nav">
						<li class="dropdown">
							<a href="#" class="dropdown-toggle" data-toggle="dropdown">Class notes<b class="caret"></b></a>
							<ul class="dropdown-menu">
								<li><a href="week1.html">What is data?</a></li>
								<li><a href="week2.html">Types of stories</a></li>
								<li><a href="week3.html">Working with spreadsheets</a></li>
								<li><a href="week4.html">Acquiring, cleaning, and formatting data</a></li>
								<li><a href="week5.html">R, RStudio, and the tidyverse</a></li>
								<li><a href="week6.html">Data journalism in the tidyverse</a></li>
								<li><a href="week7.html">Don't let the data lie to you</a></li>
								<li><a href="week8.html">Databases and SQL</a></li>
								<li><a href="week9.html">Finding stories using maps</a></li>
								<li><a href="week10.html">Maps meet databases</a></li>
								<li><a href="week11.html">More PostGIS</a></li>
								<li><a href="week12.html">R practice</a></li>
								<li><a href="week13.html">PostGIS practice</a></li>
								<li><a href="week14.html">More fun with R</a></li>
							</ul>
						</li>
						<li><a href="software.html">Software</a></li>
						<li><a href="datasets.html">Data</a></li>
						<li><a href="questions.html">If you get stuck</a></li>
						<li class="dropdown">
							<a href="#" class="dropdown-toggle" data-toggle="dropdown">Email instructors<b class="caret"></b></a>
							<ul class="dropdown-menu">
								<li><a href="mailto:p.aldhous@gmail.com">Peter Aldhous</a></li>
								<li><a href="mailto:abh@berkeley.edu">Amanda Hickman</a></li>
							</ul>
						</li>
					</ul>
				</div><!-- /.navbar-collapse -->
			</nav>
		</div> <!-- /.container -->
	</header>

	
	<div class="container all">

<h1 id="what-is-data?"><a name="what-is-data?" href="#what-is-data?"></a>What is data?</h1><h3 id="what-can-data-journalism-do-for-me?"><a name="what-can-data-journalism-do-for-me?" href="#what-can-data-journalism-do-for-me?"></a>What can data journalism do for me?</h3><p>You almost certainly didn’t come to J-school to do math. So why should you want to learn about data journalism?</p><p>Presumably you did come here for some of these reasons:</p><ul>
<li><p>To find and tell great stories.</p>
</li><li><p>To help people understand the complex and often confusing world in which we live.</p>
</li><li><p>To hold those in positions of power to account.</p>
</li><li><p>To expose injustice.</p>
</li></ul><p>Over the coming semester, we hope to equip you with  some basic skills with data that will help you achieve those goals.</p><p>Data can provide valuable context for any story. Importantly, it can be where you find the idea for a story in the first place. It can also expose when fallible or manipulative human sources are giving you false information: If you don’t want to be bamboozled by statistics, you need to be able to make sense of the data for yourself.</p><p>But data journalism isn’t a skill to be practised in isolation. We want you to think of data as just another source of information: like public records, and like the people you interview. Indeed, we’re going to encourage you to think in terms of “interviewing” data, and show you how to ask questions of a dataset.</p><h3 id="the-data-frame-of-mind"><a name="the-data-frame-of-mind" href="#the-data-frame-of-mind"></a>The data frame of mind</h3><p>Some good news: This is <em>not</em> rocket science!</p><p>Working with data needn’t be mysterious or daunting. If you think clearly, get to know the data you’re working with, and apply the  skills we’ll cover in the coming weeks, you’ll be able to use data to become a better journalist.</p><p>Data journalism is another tool in the modern journalist’s toolbox. To put that another way, it’s the <em>journalism</em> part of data journalism that really matters.</p><p>So we’re going to encourage you to develop a data frame of mind. 
Rather than treating data as an afterthought, researched when an editor asks if you’ve got some numbers or a chart to go with that story, we want you to put it front-and-center when you start your reporting: As well as thinking about who you need to speak to, think about what sources of data are available, and how you can ask questions of that data that will inform the rest of your reporting.</p><h3 id="don't-drown-your-audience-in-a-sea-of-statistics"><a name="don't-drown-your-audience-in-a-sea-of-statistics" href="#don't-drown-your-audience-in-a-sea-of-statistics"></a>Don’t drown your audience in a sea of statistics</h3><p>A common mistake made by reporters starting to work with data is to throw lots of numbers into their stories. This is rarely enlightening and usually off-putting. The goal of working with data is for you to find new stories, or gain a deeper understanding of what you’re covering. When you tell those stories, use only the numbers that are necessary to get the point across.</p><h3 id="the-data-we-will-use"><a name="the-data-we-will-use" href="#the-data-we-will-use"></a>The data we will use</h3><p>Download the data for this session from <a href="data/week1.zip">here</a>, unzip the folder and place it on your desktop. It contains the following files:</p><ul>
<li><p><code>berkeley_collisions.csv</code> Data on injury and fatal traffic accidents in Berkeley from 2006 to 2014, from the <a href="http://tims.berkeley.edu/">Transportation Injury Mapping System</a>. The data comes from the California Highway Patrol’s <a href="http://iswitrs.chp.ca.gov/Reports/jsp/userLogin.jsp">Statewide Integrated Traffic Records System</a> and was then geocoded for mapping by UC Berkeley’s Safe Transportation Research &amp; Education Center.</p>
</li><li><p><code>mlb_salaries_2015.csv</code> Salaries of players in Major League Baseball at the start of the 2015 season, from the <a href="http://www.seanlahman.com/baseball-archive/statistics/">Lahman Baseball Database</a>.</p>
</li></ul><p>These files are in CSV format, which stands for comma-separated values; they have a <code>.csv</code> extension. These are plain text files, in which columns in the data are separated by commas. CSV is a common format for storing and exchanging data. Values that are intended to be treated as text, rather than numbers, are often enclosed in quote marks.</p><p>Here is what the <code>berkeley_collisions.csv</code> file looks like when you open it in a text editor.</p><p><img src="img/class1_1.jpg" alt=""></p><p>When you ask for data, requesting it as CSVs or other plain text files is a good idea, as just about all software that handles data can export and import data as text. (If a government agency tells you that they cannot export data from their systems as text files, they are almost certainly mistaken, or lying!) You will also find that many online databases provide data for download in this format.</p><p>The characters used to separate the columns in a text file, called “delimiters,” may vary. A <code>.tsv</code> extension, for instance, indicates that the variables are separated by tabs. More generally, text files may have the extension <code>.txt</code>.</p><h3 id="get-to-know-your-data"><a name="get-to-know-your-data" href="#get-to-know-your-data"></a>Get to know your data</h3><p>Before attempting to analyze a dataset, it’s important to know what, exactly, you’re working with. So the first thing you usually want to do with any data is open it up in a spreadsheet, and make sure you understand how it is structured.</p><p>Go to your <a href="https://drive.google.com/drive/my-drive"><strong>Google Drive</strong></a> account. Now select <code>NEW&gt;Google Sheets</code>. 
In the spreadsheet, select <code>File&gt;Import...</code> from the top menu, and at the next dialog box, select the  <code>Upload</code> tab:</p><p><img src="img/class1_2.jpg" alt=""></p><p>Navigate to the file <code>berkeley_collisions.csv</code> and complete the import.</p><p>In Google Sheets, the file should look like this:</p><p><img src="img/class1_4.jpg" alt=""></p><p>The top row should be treated as a header row. So drag the gray divider at the base of the empty cell at top left, between column <code>A</code> and row <code>1</code>, like this:</p><p><strong>Before:</strong></p><p><img src="img/class1_5.jpg" alt=""></p><p><strong>After:</strong></p><p><img src="img/class1_6.jpg" alt=""></p><p>Select the header row by clicking the number <code>1</code> at the left, and then type <code>⌘-B</code> to make the headers bold. Notice how the header row now remains in place at the top of the screen as you scroll up and down through the data.</p><p><img src="img/class1_7.jpg" alt=""></p><h3 id="types-of-data:-categorical-vs.-continuous"><a name="types-of-data:-categorical-vs.-continuous" href="#types-of-data:-categorical-vs.-continuous"></a>Types of data: categorical vs. continuous</h3><p>Tables of data consist of a series of “variables,” which are simply measurements or attributes of all of the “records” in the dataset. For example, school students might gather data about themselves for a class project, noting their gender and eye color, and height and weight. Here, each student is a record in the data, for which there will be a value for gender, eye color, height, and weight.</p><p>But there’s an important difference between gender and eye color, called “categorical” variables, and height and weight, termed “continuous.”</p><ul>
<li><p><strong>Categorical</strong> variables are descriptive labels given to individual records, assigning them to different groups. The simplest categorical data is dichotomous, meaning that there are just two possible groups — in an election, for instance, people either voted, or they did not. More commonly, there are multiple categories. When analyzing traffic accidents, for example, you might consider the weather conditions when the accident happened, in categories such as “clear,” “cloudy,” “raining,” “fog,” and so on.</p>
</li><li><p><strong>Continuous</strong> data is richer, consisting of numbers that can have a range of values on a sliding scale. The number of people killed or injured in a traffic accident would be one example.</p>
</li></ul><p>There’s a third type of data we often need to consider: <strong>date and time</strong>. A simple timeline treats date/time as a continuous variable, but date/time data can also be made categorical (days of the week, months of the year, and so on).</p><h3 id="know-the-variables-in-your-data,-and-how-they-are-coded"><a name="know-the-variables-in-your-data,-and-how-they-are-coded" href="#know-the-variables-in-your-data,-and-how-they-are-coded"></a>Know the variables in your data, and how they are coded</h3><p>Datasets will usually contain a mixture of categorical, continuous, and date/time variables.</p><p>The Berkeley traffic accidents data is in a typical layout, with each row in the data representing an individual collision, each of which has a unique code, its <code>CASEID</code>. Having a unique ID for each record in a dataset is good practice.</p><p>Here, each of the columns in the data is a separate variable.</p><p>Don’t assume, however, that every number in a dataset represents a continuous variable. Text descriptions can make datasets unwieldy, so database managers often adopt simpler codes, which are often numbers, to store categorical data.</p><p>Looking at the first few columns, <code>YEAR_</code> is fairly obviously a date-related variable, while the latitudes and longitudes (<code>POINT_Y</code> and <code>POINT_X</code>) for each collision, and the numbers of people <code>KILLED</code> or <code>INJURED</code>, are continuous variables.</p><p>But the other columns in this view, including <code>DAYWEEK</code> and <code>CRASHSEV</code>, are actually categorical variables, coded as numbers.</p><p>Like this example, many datasets are hard to understand without their supporting documentation. So each time you acquire a dataset, make sure you also obtain any documents/descriptions that are necessary to interpret it. 
These might be called the “codebook,” “data dictionary,” or “record layout.” Whatever they are called, you will need to understand all of the variables in the data, and how they are coded.</p><p><a href="TIMS.html">Here is the codebook</a> for the Berkeley traffic accident data. Notice, for example, how the day of the week and the severity of the collision are coded:</p><p><img src="img/class1_8.jpg" alt=""></p><h3 id="is-your-data-wide-or-long?"><a name="is-your-data-wide-or-long?" href="#is-your-data-wide-or-long?"></a>Is your data wide or long?</h3><p>Often, especially if you are working with data over time, the data you obtain may not be in the format above, with one variable in each column.</p><p>Here, for example, is some data downloaded from the <a href="http://data.worldbank.org/indicator/?tab=all">World Bank’s data site</a> on the <a href="http://data.worldbank.org/indicator/TX.VAL.TECH.CD?view=chart">value of high-technology exports</a> for different countries and groups of countries over time, expressed in current U.S. dollars. There are four variables in this data, which I’ve color-coded to make them easier to spot:</p><p><img src="img/class1_9.jpg" alt=""></p><p>The variables are:</p><ul>
<li><code>Country Name</code> Yellow</li><li><code>Country Code</code> Green</li><li><code>Year</code> Blue</li><li><code>High-Tech Exports</code> Pink</li></ul><p>While this “wide” data format makes the spreadsheet easier for people to scan, most software for data analysis and visualization wants the data in a neat “long” format, with one variable in each column, like this:</p><p><img src="img/class1_10.jpg" alt=""></p><p>So you may need to convert data from wide to long format. We will learn how to do this in Week 4.</p><h3 id="how-do-i-interview-data?-the-basic-operations"><a name="how-do-i-interview-data?-the-basic-operations" href="#how-do-i-interview-data?-the-basic-operations"></a>How do I interview data? The basic operations</h3><p>There are many sophisticated statistical methods for crunching data, beyond the scope of these classes. But the majority of a data journalist’s work involves the following simple operations:</p><ul>
<li><p><strong>Sort:</strong> Largest to smallest, oldest to newest, alphabetical etc.</p>
</li><li><p><strong>Filter:</strong> Select a defined subset of the data.</p>
</li><li><p><strong>Summarize/Aggregate:</strong> Derive one value from a series of other values to produce a summary statistic. Examples include count, sum, mean, median, maximum, and minimum. Often you’ll <strong>group</strong> data into categories first, and then aggregate by group.</p>
</li><li><p><strong>Join:</strong> Merging entries from two or more datasets based on common field(s), for example a unique code, or last name and first name.</p>
</li></ul><p>We’ll return to these basic operations with data repeatedly over the coming weeks, as we pose questions of various datasets.</p><h3 id="working-with-categorical-data"><a name="working-with-categorical-data" href="#working-with-categorical-data"></a>Working with categorical data</h3><p>You might imagine that there is little that you can do with categorical data alone, but it can be powerful, and can also be used to create new continuous variables.</p><p>The most basic operation with categorical data is to <strong>group</strong> and <strong>aggregate</strong> it by counting the number of records that fall into each category. This gives a table of “frequencies.” Often these are divided by the total number of records, and then multiplied by 100 to show them as percentages of the total.</p><p>Here is an example, showing data on the racial and ethnic identities of residents of Oakland, from the 2010 Census:</p><p><img src="img/class1_11.jpg" alt=""></p><p>(Source: <a href="http://factfinder2.census.gov/faces/nav/jsf/pages/index.xhtml">American FactFinder</a>, U.S. Census Bureau)</p><p>Creating frequency counts from categorical data creates a new continuous variable — what has changed is the level of analysis. 
In this example, the original data would consist of a huge table with a record for each person, noting their racial/ethnic identity as categorical variables; in creating the frequency table shown here, the level of analysis has shifted from the individual to the racial/ethnic group.</p><h3 id="working-with-continuous-data:-consider-the-distribution"><a name="working-with-continuous-data:-consider-the-distribution" href="#working-with-continuous-data:-consider-the-distribution"></a>Working with continuous data: Consider the distribution</h3><p>When handling continuous data, there are more possibilities for <strong>aggregation</strong> than simply counting: You can add the numbers to give a total, for example, or calculate an average.</p><p>But summarizing continuous data in a single value inevitably loses a lot of information held in variation within the data. Understanding this variation may be key to working out the story the data may tell, and deciding how to analyze and visualize it. So often the first thing a good data journalist does when examining a dataset is to chart the <strong>distribution</strong> of each continuous variable. You can think of this as the “shape” of the dataset, for each variable.</p><p>Many variables, such as human height and weight, follow a “normal” distribution. 
If you draw a graph plotting the range of values in the data along the horizontal axis (also known as the X axis), and the number of individual data points for each value on the vertical or Y axis, a normal distribution gives a bell-shaped curve:</p><p><img src="img/class1_12.jpg" alt=""></p><p>(Source: edited from <a href="http://en.wikipedia.org/wiki/Normal_distribution#mediaviewer/File:Standard_deviation_diagram.svg">Wikimedia Commons</a>)</p><p>This type of chart, showing the distribution as a smoothed line, is known as a “density plot.”</p><p>In this example, the X axis is labeled with multiples of a summary statistic called the “standard deviation.” This is a measure of the spread of the data: if you extend one standard deviation either side of the average, it will cover just over 68% of the data points; two standard deviations will cover just over 95%. In simple terms, the standard deviation is a single number that summarizes whether the curve is tall and thin, or short and fat.</p><p>Sometimes, however, it’s very clear just from looking at the shape of a dataset that it is not normally distributed. Here, for example, is the distribution of Major League Baseball salaries at the start of the 2015 season, drawn as columns in increments of $500,000. This type of chart is called a “histogram.”</p><p><img src="img/class1_13.png" alt=""></p><p>(Source: Peter Aldhous, data from the <a href="http://www.seanlahman.com/baseball-archive/statistics/">Lahman Baseball Database</a>)</p><p>The distribution for this labor market is highly “skewed.” Almost half of the players were paid less than $1 million, while just a handful of players were paid more than $20 million; the highest-paid was pitcher Clayton Kershaw, paid more than $32 million by the Los Angeles Dodgers. 
If you wanted to write a story about the lifestyle of a “typical” baseball player, who would you choose?</p><p>As you think about data in your beat, remember that almost all economic data is highly skewed. In general, there are a lot of “have nots” (or at least “have littles”) at the bottom end of the distribution, and a long tail with only a few “have a lots” at the top.</p><h3 id="beyond-the-“average”:-mean,-median,-and-mode"><a name="beyond-the-“average”:-mean,-median,-and-mode" href="#beyond-the-“average”:-mean,-median,-and-mode"></a>Beyond the “average”: mean, median, and mode</h3><p>Most people know how to calculate an average: add everything up, and divide this sum by the total number of values. Statisticians call this <strong>aggregate</strong> measure the “mean,” and for normally distributed data, it sits right on the top of the bell curve.</p><p>The mean is just one example of what statisticians call a “measure of central tendency.” And for skewed data like our baseball salaries, it may not be the most useful <strong>aggregation</strong> of the data.</p><p>The most common alternative is the “median,” which is the number that sits in the middle, when all the values are arranged in order. (If you have an even number of values, and no single number occupies the middle position, it would be the average of the two middle values.)</p><p>Notice how leading media outlets, such as The Upshot at <em>The New York Times</em>, often use medians, rather than means, in graphics summarizing skewed distributions, such as incomes or house prices. 
Here is an example from April 2014:</p><p><img src="img/class1_14.jpg" alt=""></p><p>(Source: The Upshot, <a href="http://www.nytimes.com/2014/04/23/upshot/the-american-middle-class-is-no-longer-the-worlds-richest.html"><em>The New York Times</em></a>)</p><p>Statisticians also sometimes consider the “mode,” which is the value that appears most frequently in the dataset.</p><h3 id="plot-a-histogram-of-the-salary-distribution"><a name="plot-a-histogram-of-the-salary-distribution" href="#plot-a-histogram-of-the-salary-distribution"></a>Plot a histogram of the salary distribution</h3><p>Import the file <code>mlb_salaries_2015.csv</code> to a new Google Sheet.</p><p>To quickly plot a histogram in Google Sheets, select the letter at the top to highlight the column for which you want to see the distribution (here <code>H</code> for <code>salary_mil</code>). Then select <code>Insert&gt;Chart</code> from the top menu.</p><p>The default chart should look like this:</p><p><img src="img/class1_19.jpg" alt=""></p><p>Now change the <code>Chart type</code> to <code>Histogram chart</code>; you will need to scroll down to find this option, highlighted here:</p><p><img src="img/class1_20.jpg" alt=""></p><p>Once selected, the chart should look like this:</p><p><img src="img/class1_21.jpg" alt=""></p><p>You can also <code>CUSTOMIZE</code> the chart by selecting a different increment, or bin width, for the columns:</p><p><img src="img/class1_22.jpg" alt=""></p><h3 id="calculate-mean,-median,-and-mode"><a name="calculate-mean,-median,-and-mode" href="#calculate-mean,-median,-and-mode"></a>Calculate mean, median, and mode</h3><p>Now create three new column headers: <code>mean</code>, <code>median</code>, and <code>mode</code>:</p><p><img src="img/class1_23.jpg" alt=""></p><p>To run calculations in a spreadsheet we need to use <strong>formulas</strong>, which all start with the <code>=</code> symbol.</p><p>In the first cell of the <code>mean</code> column enter the following formula, which 
calculates the mean (called <code>AVERAGE</code> in a spreadsheet) of all of the values in column <code>H</code>. The colon tells the formula to use all of the values in the range from H2 to H818.</p><pre class="sql hljs"><code class="SQL" data-origin="<pre><code class=&quot;SQL&quot;>=AVERAGE(H2:H818)
</code></pre>">=AVERAGE(H2:H818)
</code></pre><p>Or alternatively, to select all the values in column <code>H</code> without having to define their row numbers:</p><pre class="sql hljs"><code class="SQL" data-origin="<pre><code class=&quot;SQL&quot;>=AVERAGE(H:H)
</code></pre>">=AVERAGE(H:H)
</code></pre><p>Now calculate the median salary:</p><pre class="sql hljs"><code class="SQL" data-origin="<pre><code class=&quot;SQL&quot;>=MEDIAN(H:H)
</code></pre>">=MEDIAN(H:H)
</code></pre><p>And the mode:</p><pre class="sql hljs"><code class="SQL" data-origin="<pre><code class=&quot;SQL&quot;>=MODE(H:H)
</code></pre>">=MODE(H:H)
</code></pre><p>Across Major League Baseball at the start of the 2015 season, the mean salary was $4.3 million. But when summarizing a distribution in a single value, we usually want to give a “typical” number. Here the mean is inflated by the vast salaries paid to a handful of star players, and may be a bad choice. The median salary of $1.9 million gives a more realistic view of what a typical MLB player was paid.</p><p>The mode is less commonly used, but in this case also tells us something interesting: it was $507,500, a sum earned by 19 out of the 817 players. This was the minimum salary paid under 2015 MLB contracts, which explains why it turns up more frequently than any other number. A journalist who considered the median, mode, and full range of the salary distribution may produce a richer story than one who failed to think beyond the “average.”</p><h3 id="spreadsheet-functions"><a name="spreadsheet-functions" href="#spreadsheet-functions"></a>Spreadsheet functions</h3><p>In the formulas above, <code>AVERAGE</code>, <code>MEDIAN</code> and <code>MODE</code> are <strong>functions</strong>. They act on the data specified in the parentheses. We’ll become much more familiar with functions as we work with R and SQL code in the coming weeks.</p><p>Notice as you start to type a formula that Google Sheets will suggest functions that you can use:</p><p><img src="img/class1_24.jpg" alt=""></p><p>And when it’s clear which function you are using, Google Sheets gives some hints on how it should be used:</p><p><img src="img/class1_25.jpg" alt=""></p><p><a href="https://support.google.com/docs/table/25273?hl=en">Here</a> is a full list of the functions available in Google Sheets.</p><h3 id="rounding:-avoid-spurious-precision"><a name="rounding:-avoid-spurious-precision" href="#rounding:-avoid-spurious-precision"></a>Rounding: Avoid spurious precision</h3><p>Often when you run calculations on numbers, you’ll obtain precise answers that can run to many decimal places. 
But think about the precision with which the original numbers were measured, and don’t quote numbers that are more precise than this. When rounding numbers to the appropriate level of precision, if the next digit is four or less, round down; if it’s six or more, round up. There are various schemes for rounding if the next digit is five, and there are no further digits to go on: I’d suggest rounding to an even number, which may be up or down, as this is the international standard in computing.</p><p>To round the mean value for the baseball salary data to two decimal places, use the following formula in an empty cell in the spreadsheet:</p><pre class="sql hljs"><code class="SQL" data-origin="<pre><code class=&quot;SQL&quot;>=ROUND(I2,2)
</code></pre>">=ROUND(I2,2)
</code></pre><p>Here, the value of <code>2</code>, after the comma, defines the number of decimal places to round to. (Similarly, <code>0</code> would round to the nearest whole number, <code>-1</code> to the nearest 10, <code>-2</code> to the nearest hundred, and so on.)</p><p>You can also run functions on functions. Notice that you get the same result if you edit the original formula to the following:</p><pre class="sql hljs"><code class="SQL" data-origin="<pre><code class=&quot;SQL&quot;>=ROUND(AVERAGE(H:H),2)
</code></pre>">=ROUND(AVERAGE(H:H),2)
</code></pre><p>This formula runs the <code>ROUND</code> function on the result of the <code>AVERAGE</code> function.</p><h3 id="sampling-and-margins-of-error"><a name="sampling-and-margins-of-error" href="#sampling-and-margins-of-error"></a>Sampling and margins of error</h3><p>Only sometimes is it possible to obtain and analyze all of the data, like we just did for the 2015 baseball salaries. Other times we may need to draw conclusions by taking a sample of the data. Opinion and election polling is the most obvious example.</p><p>For a sample to be valid, it must obey a simple statistical rule: Every member of the group to which you wish to generalize the results of your analysis must have an equal chance of being included.</p><p>Entire textbooks have been written on sampling methods. The simplest form is random sampling — such as when numbers are written on pieces of paper, put into a bag, shaken up, and then drawn out one by one. Opinion pollsters may generate their samples by randomly generating valid telephone numbers, and calling those numbers.</p><p>But there are other methods, and the important thing is not that a sample was derived randomly, but that it is <em>representative</em> of the group from which it is drawn. In other words, sampling needs to avoid systematic bias that makes particular people more or less likely to be included.</p><p>Be especially wary of using data from any sample that was not selected to be representative of a wider group. Media organizations frequently run informal online “polls” to engage their audience, but they tell us little about public opinion, as people who happened to visit a news website and cared enough to answer the questions posed may not be representative of the wider population.</p><p>To have a good chance of being representative, samples must also be sufficiently large. 
If you randomly sample ten people, for instance, chance effects mean that you may draw a sample that contains eight women and two men, or perhaps no men at all. Sample 1,000 people from the same population, however, and the proportions of men and women sampled won’t deviate so far from an even split.</p><p>This is why polls should give a “margin of error,” which is a measure of the uncertainty that arises from taking a relatively small sample. These margins of error are usually derived from a range of values that statisticians call the “95% confidence interval.” This means that if the same population were sampled repeatedly, the results would fall within this range of values 95 times out of 100.</p><p>Here is a listing of the polls conducted in the run-up to the Alabama Senate Special Election on December 12, 2017:</p><p><img src="img/class1_26.jpg" alt=""></p><p>(Source: <a href="https://www.realclearpolitics.com/epolls/2017/senate/al/alabama_senate_special_election_moore_vs_jones-6271.html">RealClearPolitics</a>)</p><p>Notice the sample sizes (here all for Likely Voters, or LV) and the figures for Margin of Error (MoE).</p><p>The Trafalgar Group poll, for instance, gave Republican Roy Moore a lead of 5 percentage points, 51% to 46%, but the margin of error was 2.6 percentage points. What this means is that the pollsters were 95% confident from their sample of more than 1,400 likely voters that Moore’s support lay between 48.4% and 53.6% (51%, plus or minus 2.6 percentage points), while Democrat Doug Jones’s support lay between 43.4% and 48.6% (46%, plus or minus 2.6 percentage points). Because those two ranges overlap, the result was not exactly clear cut.</p><p>But look at the wide divergence in results from the various polls. Especially in a volatile race like this, you don’t want to place too much reliance on a single poll.</p><p>When dealing with polling and survey data, look for the margins of error. 
Be careful not to mislead your audience by making a big deal of differences that may just be due to sampling error. Consider quoting polling averages for a defined period, which are likely to be more reliable, as it’s unlikely that all of the polls will be affected by sampling error in exactly the same way.</p><p><a href="https://www.surveymonkey.com/mp/margin-of-error-calculator/">Here</a> is a simple web app, from SurveyMonkey, that allows you to estimate the margin of error given the sample size, the confidence level, and the size of the population. There are about 225 million eligible U.S. voters. While fewer will be registered and fewer still will actually vote, see what happens if you put 225,000,000 into the population box, and 500 or 1,000 into the sample size box. A larger sample will give a smaller margin of error.</p><p><a href="https://www.surveymonkey.com/mp/sample-size-calculator/">This app</a>, meanwhile, calculates the sample size needed to obtain results within a given margin of error.</p><h3 id="be-skeptical-of-your-data:-ask-what's-really-being-measured"><a name="be-skeptical-of-your-data:-ask-what's-really-being-measured" href="#be-skeptical-of-your-data:-ask-what's-really-being-measured"></a>Be skeptical of your data: Ask what’s really being measured</h3><p>Data can be seductive, but you need to approach it skeptically, just like you would any other source.</p><p>Always question how the data was obtained, and what is actually being measured. Even commonly quoted numbers like the unemployment rate depend on assumptions that can be questioned. As this graph shows, the <a href="http://www.bls.gov/">Bureau of Labor Statistics</a> actually has a range of measures of unemployment:</p><p><img src="img/class1_27.png" alt=""></p><p>(Source: Peter Aldhous, from <a href="https://download.bls.gov/pub/time.series/ln/ln.data.1.AllData">Bureau of Labor Statistics</a> data)</p><p>This is what those measures mean:</p><ul>
<li>U1: Unemployed for 15 weeks or more.</li><li>U2: Unemployed who involuntarily lost their last job, or completed a temporary job.</li><li><strong>U3: the generally cited unemployment rate.</strong></li><li>U4: U3 plus “discouraged” workers, not looking for work because they don’t think jobs are available.</li><li>U5: U3 plus discouraged and “marginally attached” workers, who hadn’t searched for work in four weeks prior to survey.</li><li>U6: as U5, but also includes people working part time but who want full-time work.</li></ul><p>So what is the best measure of unemployment? That may depend on the story you are trying to tell.</p><h3 id="remember-that-summarizing-the-data-with-a-single-number-can-mask-the-most-striking-stories"><a name="remember-that-summarizing-the-data-with-a-single-number-can-mask-the-most-striking-stories" href="#remember-that-summarizing-the-data-with-a-single-number-can-mask-the-most-striking-stories"></a>Remember that summarizing the data with a single number can mask the most striking stories</h3><p>Always think about how to <strong>filter</strong>, <strong>group</strong>, and <strong>aggregate</strong> your data to tell stories that are relevant to your particular audience, or groups within it. While many news stories talk about the unemployment rate, unemployment is experienced very differently by different parts of the population.</p><p>The following charts break down the unemployment rate (U3) over the years by sex, race, and age group:</p><p><img src="img/class1_28.png" alt=""></p><p><img src="img/class1_29.png" alt=""></p><p><img src="img/class1_30.png" alt=""></p><p>(Source: Peter Aldhous, from <a href="https://download.bls.gov/pub/time.series/ln/ln.data.1.AllData">Bureau of Labor Statistics</a> data)</p><h3 id="per-what?-working-with-rates-and-percentages"><a name="per-what?-working-with-rates-and-percentages" href="#per-what?-working-with-rates-and-percentages"></a>Per what? 
Working with rates and percentages</h3><p>Often it doesn’t make much sense to consider raw numbers. There are more murders in Oakland (population from 2010 U.S. Census: 390,724) than in Orinda (2010 population: 17,643). But that’s a fairly meaningless comparison, unless we level the playing field by correcting for the size of the two cities. For fair and meaningful comparisons, data journalists often need to work with rates: per capita, per thousand people, and so on.</p><p>In simple terms, a rate is one number divided by another number. The key word is “per.” Per capita means “per person,” so to calculate a per capita figure you must divide the total value by the population size. But remember that most people find very small numbers hard to grasp: 0.001 and 0.0001 look similarly small at a glance, even though the first is ten times as large as the second. So when calculating rates for rare events like murders, per capita may not be a good choice. You may need to consider the rate per 1,000 people, per 10,000 people, or even per 100,000 people: simply divide the numbers as before, then multiply by the “per” figure.</p><p>In addition to leveling the playing field to allow meaningful comparisons, rates can also help bring large numbers, which are again hard for most people to grasp, into perspective: it means little to most people to be told that the annual GDP of the United States is almost $17 trillion, but knowing that GDP per person is just over $50,000 is easier to comprehend.</p><p>Percentages are just a special case of rates, meaning “per hundred.” So to calculate a percentage, you divide one number by another and then multiply by 100.</p><h3 id="doing-simple-math-with-rates-and-percentages"><a name="doing-simple-math-with-rates-and-percentages" href="#doing-simple-math-with-rates-and-percentages"></a>Doing simple math with rates and percentages</h3><p>Often you will need to calculate percentage change. 
The formula for this is:</p><pre class="sql hljs"><code class="SQL" data-origin="<pre><code class=&quot;SQL&quot;>(new value - old value) / old value * 100
</code></pre>">(new value - old value) / old value * 100
</code></pre><p>Here <code>/</code> means “divided by” and <code>*</code> means “multiplied by.” If you were using this formula in a spreadsheet, you would start with <code>=</code>. The brackets around the first part of the calculation show that it should be conducted first, before dividing the result by the old value.</p><p>Percentage increases are hard to comprehend once the number is doubled or more. Doubling corresponds to a 100% increase, tripling to a 200% increase, and so on. So rather than saying something increased by 125%, perhaps say that the number “more than doubled,” and give the before and after values.</p><p>Also, remember that a large percentage increase on a small number still gives a small number. If there were just five burglaries in a neighborhood in one year, and six the next, that’s a 20% increase in burglaries, which sounds alarming, until you’re told the actual numbers. So always consider when it’s helpful to tell your reader the actual numbers, rather than rattling off percentage changes that could be misleading.</p><p>Sometimes you may need to compare two rates or percentages. For example, if 50 out of 150 black mortgage applicants in a given income bracket are denied a mortgage, and 300 out of 2,400 white applicants in the same income bracket are denied a mortgage, the percentage rates of denial for the two groups are:</p><p><strong>Black:</strong></p><pre class="sql hljs"><code class="SQL" data-origin="<pre><code class=&quot;SQL&quot;>50 / 150 * 100 = 33.3%
</code></pre>">50 / 150 * 100 = 33.3%
</code></pre><p><strong>White:</strong></p><pre class="sql hljs"><code class="SQL" data-origin="<pre><code class=&quot;SQL&quot;>300 / 2,400 * 100 = 12.5%
</code></pre>">300 / 2,400 * 100 = 12.5%
</code></pre><p>You can divide one percentage or rate by the other, but be careful how you describe the result:</p><pre class="sql hljs"><code class="SQL" data-origin="<pre><code class=&quot;SQL&quot;>33.3 / 12.5 = 2.664
</code></pre>">33.3 / 12.5 = 2.664
</code></pre><p>You can say from this calculation that black applicants are about 2.7 times <em>as</em> likely to be denied loans as white applicants. But even though the Associated Press style guide doesn’t make the distinction, don’t say black applicants are about 2.7 times <em>more</em> likely to be denied loans. Strictly speaking, <em>more</em> likely refers to the following calculation:</p><pre class="sql hljs"><code class="SQL" data-origin="<pre><code class=&quot;SQL&quot;>(33.3 - 12.5) / 12.5 = 1.664
</code></pre>">(33.3 - 12.5) / 12.5 = 1.664
</code></pre><h3 id="how-statisticians-ask-questions-with-data"><a name="how-statisticians-ask-questions-with-data" href="#how-statisticians-ask-questions-with-data"></a>How statisticians ask questions with data</h3><p>As data journalists, we want to ask questions of data. When statisticians do this, they assign probabilities to the answers to specific questions. They might ask whether variables are related to one another: for instance, do wealthier people tend to live longer? Or they might ask whether different groups are different from one another: for example, do patients given an experimental drug get better more quickly than those given the standard treatment?</p><p>When asking these questions, the most common statistical approach may seem back to front. Rather than asking whether the answer they’re interested in is likely to be true, statisticians usually instead calculate probabilities that the observed results would be obtained if the “null hypothesis” is correct.</p><p>In the examples given above, the null hypotheses are that there is no relationship between wealth and lifespan, and that the new drug is just as effective as the old treatment.</p><p>The resulting probabilities are often given as <em>p</em> values, which are shown as decimal numbers between 0 and 1.</p><p>The decimal 0.001 is the same as the fraction 1/1000, and <code>&lt;</code> is the mathematical symbol for “less than.” So this means that there was less than one in a thousand chance that the difference in participation in the riot between Northerners and Southerners was caused by a chance sampling effect.</p><p>This would be called a “significant” result. When statisticians use this word, they don’t necessarily mean that the result has real-world consequence. It just means that the result is unlikely to be due to chance. 
However, if you have framed your question carefully, a statistically significant result may be very consequential indeed.</p><p>There is no fixed cut-off for judging a result to be statistically significant. But as a general rule, <code>p &lt;0.05</code> is considered the minimum standard. This means you would expect to get a result like this by chance fewer than 5 times out of 100. If Meyer had obtained a result only just meeting this standard, he might still have concluded that Northerners were more likely to riot, but would probably have been more cautious in how he worded his story.</p><p>When considering differences between groups, statisticians sometimes avoid <em>p</em> values, and instead give 95% confidence intervals, like the margins of error on opinion polls. Only if these don’t overlap would a statistician assume that the results for different groups are significantly different.</p><p>So when interpreting numbers from studies, pay attention to <em>p</em> values and confidence intervals.</p><h3 id="relationships-between-variables:-correlation-and-its-pitfalls"><a name="relationships-between-variables:-correlation-and-its-pitfalls" href="#relationships-between-variables:-correlation-and-its-pitfalls"></a>Relationships between variables: correlation and its pitfalls</h3><p>Some of the most powerful stories that data can tell examine how one variable relates to another. 
This video from a BBC documentary made by Hans Rosling of the Gapminder Foundation, for example, explores the relationship between life expectancy in different countries and the nations’ wealth:</p><p class="oembeded"><iframe src="http://www.youtube.com/embed/jbkSRLYSojo?wmode=transparent&amp;jqoemcache=XU48E" width="425" height="349" allowfullscreen="true" allowscriptaccess="always" scrolling="no" frameborder="0"></iframe></p><p>(Source: <a href="http://www.gapminder.org/videos/200-years-that-changed-the-world-bbc/">BBC/Gapminder</a>)</p><p>Correlation refers to statistical methods that test the strength of the relationship between two variables recorded for each of the records in a dataset. Correlations can either be positive, which means that two variables tend to increase together; or negative, which means that as one variable increases in value, the other one tends to decrease.</p><p>Tests of correlation determine whether the recorded relationship between the two variables is likely to have arisen by chance — here the null hypothesis is that there is actually no relationship between the two.</p><p>Statisticians usually test for correlation because they suspect that variation in one variable causes variation in the other, but correlation cannot prove causation. For example, there is a statistically significant correlation between children’s shoe sizes and their reading test scores, but clearly having bigger feet doesn’t make a child a better reader. In reality, older children are likely both to have bigger feet and be better at reading — the causation lies elsewhere.</p><p>Here, the child’s age is a “lurking” variable. Lurking variables are a general problem in data analysis, not just in tests of correlation, and some can be hard even for experts to spot.</p><p>For example, by the early 1990s epidemiological studies suggested that women who took Hormone Replacement Therapy (HRT) after menopause were less likely to suffer from coronary heart disease. 
But some years later, when doctors ran clinical trials in which they gave women HRT to test this protective effect, it actually caused a statistically significant <em>increase</em> in heart disease. Going back to the original studies, researchers found that women who had HRT tended to be from higher socioeconomic groups, who had better diets and exercised more.</p><p>Data journalists should be very wary of falling into similar traps. While you may not be able to gather all of the necessary data and run statistical tests, take special care to think about possible lurking variables.</p><h3 id="assignment"><a name="assignment" href="#assignment"></a>Assignment</h3><ul>
<li><p><strong>File a preliminary pitch for a data-driven story.</strong> This should clearly articulate the thought you’ve already put into this reporting idea. The more thorough your pitch, the more feedback we can give you. At a minimum, it should include:</p>
<ul>
<li><p>A description of the area you’re interested in exploring or reporting on, with the questions you intend to address.</p>
</li><li><p>A news hook, or explanation of why this matters now.</p>
</li><li>A description of the data that’s available and the agencies or organizations that maintain it.</li></ul>
<p>You are only required to submit one pitch, but you may submit up to three if you’d like us to advise on which one looks most promising.</p>
<p><strong>Due: Sat Jan 27 at 8pm</strong> </p>
</li><li><p><strong>Reading assignment.</strong> Read the stories for discussion in <a href="week2.html">week 2</a>, and come to class next week prepared to address the questions about them posed in the class notes.</p>
<p><strong>Due: By next week’s class</strong> </p>
</li></ul><h3 id="further-reading"><a name="further-reading" href="#further-reading"></a>Further reading</h3><p>Sarah Cohen: <a href="http://store.ire.org/products/numbers-in-the-newsroom-using-math-and-statistics-in-news-second-edition"><em>Numbers in the Newsroom: Using Math and Statistics in News</em></a></p><p>Philip Meyer: <a href="http://www.amazon.com/Precision-Journalism-Reporters-Introduction-Science/dp/0742510883"><em>Precision Journalism: A Reporter’s Introduction to Social Science Methods</em></a></p>
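<h3 id="worked-examples-in-code"><a name="worked-examples-in-code" href="#worked-examples-in-code"></a>Worked examples in code</h3><p>The arithmetic described in these notes (rates per a given number of people, percentage change, comparing two percentages, and the margin of error on a poll) can be sketched in a few lines of Python. This is a minimal illustration rather than part of the class materials. The function names are made up for this example, the city of 400,000 people with 80 incidents is hypothetical, and the sample size of 1,400 is a stand-in for the Trafalgar Group’s “more than 1,400” likely voters. The margin-of-error function assumes the standard textbook approximation for a simple random sample drawn from a very large population, which may differ slightly from what the SurveyMonkey calculator or an individual pollster uses.</p>

```python
import math


def rate_per(count, population, per=100_000):
    """Rate per `per` people: divide, then multiply by the "per" figure."""
    return count / population * per


def percent_of(part, whole):
    """A percentage is a rate per hundred."""
    return part / whole * 100


def percent_change(old, new):
    """(new value - old value) / old value * 100"""
    return (new - old) / old * 100


def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate 95% margin of error, in percentage points.

    Assumption: simple random sample from a very large population,
    using the textbook formula z * sqrt(p * (1 - p) / n).
    """
    return z * math.sqrt(proportion * (1 - proportion) / sample_size) * 100


# Hypothetical city: 80 incidents among 400,000 people.
print(round(rate_per(80, 400_000), 1))        # 20.0 per 100,000 people

# Burglaries rising from five to six.
print(round(percent_change(5, 6), 1))         # 20.0 (percent)

# Mortgage denial rates from the example in the notes.
black = percent_of(50, 150)                   # about 33.3
white = percent_of(300, 2400)                 # 12.5
print(round(black / white, 1))                # 2.7 times AS likely
print(round((black - white) / white, 1))      # 1.7 times MORE likely

# Margin of error for three sample sizes (1,400 is a stand-in value).
for n in (500, 1000, 1400):
    print(n, round(margin_of_error(n), 1))    # 4.4, 3.1, 2.6
```

<p>Because this sketch carries the unrounded percentages through the division, the ratio comes out as 2.667 rather than the 2.664 obtained from the rounded figures 33.3 and 12.5. Either way, “about 2.7 times as likely” is the fair description. The margin of error for a sample of 1,400 comes out at about 2.6 percentage points, in line with the figure reported for the Trafalgar Group poll.</p>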

	</div> <!-- /.container all -->
	<script src="https://code.jquery.com/jquery.min.js"></script>
	<script src="js/bootstrap.min.js"></script>
</body>
</html>