teach about how to make a box plot, introduce comparing box plots and histograms, incorporate variability into all lesson references to statistical questions, add a page for practicing making a box plot by hand (see #2129, #2111, #2104, #2085)
flannery-denny committed Aug 8, 2024
1 parent 26fe0db commit 20087e7
Showing 14 changed files with 265 additions and 64 deletions.
Original file line number Diff line number Diff line change
@@ -8,6 +8,11 @@
"description": "a symmetrical histogram with tall bars on the left and right side, a gap in the middle, and shorter bars inside of the tall bars.",
"source" : "Created by the Bootstrap Team based on contrived data",
"license" : "Creative Commons 4.0 - NC - SA"
},
"blank-0to80-num-line-numbered.png": {
"description": "A number line labelled at 0, 10, 20, 30, 40, 50, 60, 70, 80",
"source" : "Created by the Bootstrap Team",
"license" : "Creative Commons 4.0 - NC - SA"
},
"box-n-whisker-plot.png": {
"description": "A sample box-and-whisker plot based on contrived data",
@@ -85,7 +90,11 @@
"source" : "Created by the Bootstrap Team based on contrived data",
"license" : "Creative Commons 4.0 - NC - SA"
},
"histogram-a.png": {
"histogram-pounds.png": {
"description": "histogram of pounds, with a tall bar counting 17 animals weighing 0-20 pounds, and much shorter bars for the subsequent 20-pound intervals indicating counts of 3, 4, 2, 2, 1, 1, 0, and 2 animals respectively",
"source" : "Created in pyret by the Bootstrap Team based on contrived data",
"license" : "Creative Commons 4.0 - NC - SA"
},
"histogram-a.png": {
"description": "A histogram with 6 bins. The middle 2 bars are 3 times as tall as the rest.",
"source" : "Created by the Bootstrap Team based on contrived data",
"license" : "Creative Commons 4.0 - NC - SA"
@@ -171,11 +180,21 @@
"source" : "Created by the Bootstrap Team in Pyret based on contrived data",
"license" : "Creative Commons 4.0 - NC - SA",
"caption" : "Right Skew"
},
"smith.png": {
"description": "a box plot of the Smith family data distributed across the full length of the number line",
"source" : "Created by the Bootstrap Team in Pyret based on contrived data",
"license" : "Creative Commons 4.0 - NC - SA"
},
"symmetric.png": {
"description": "a box plot with equally long whiskers, and boxes that are narrower than the whiskers, but the same width as each other",
"source" : "Created by the Bootstrap Team in Pyret based on contrived data",
"license" : "Creative Commons 4.0 - NC - SA",
"caption" : "Symmetric"
},
"young.png": {
"description": "a box plot of the Young family data clustered tightly at the right end of the number line",
"source" : "Created by the Bootstrap Team in Pyret based on contrived data",
"license" : "Creative Commons 4.0 - NC - SA"
}
}
224 changes: 168 additions & 56 deletions lessons/Data-Science/box-plots/langs/en-us/index.adoc

Large diffs are not rendered by default.

@@ -0,0 +1,60 @@
= Distribution of a Dataset

== The Ages at the Smith Family Reunion

1, 44, 3, 42, 46, 74, 75, 21, 74, 70, 40, 41, 45

@n Order the Numbers from Least to Greatest: @fitb{}{@ifsoln{1, 3, 21, 40, 41, 42, 44, 45, 46, 70, 74, 74, 75}}

@n
@fitbruby{5em}{@ifsoln{1}}{Minimum} @hspace{1em}
@fitbruby{5em}{@ifsoln{30.5}}{Q1} @hspace{1em}
@fitbruby{5em}{@ifsoln{44}}{Median} @hspace{1em}
@fitbruby{5em}{@ifsoln{72}}{Q3} @hspace{1em}
@fitbruby{5em}{@ifsoln{75}}{Maximum} @hspace{5em}
@fitbruby{10em}{@ifsoln{74}}{Range} @hspace{1em} @fitbruby{10em}{@ifsoln{41.5}}{Interquartile Range (IQR)}

== The Ages at the Young Family's First-Cousin Gathering

70, 68, 69, 72, 65, 75, 65, 78, 70, 72, 71, 70

@n Order the Numbers from Least to Greatest: @fitb{}{@ifsoln{65, 65, 68, 69, 70, 70, 70, 71, 72, 72, 75, 78}}

@n
@fitbruby{5em}{@ifsoln{65}}{Minimum} @hspace{1em}
@fitbruby{5em}{@ifsoln{68.5}}{Q1} @hspace{1em}
@fitbruby{5em}{@ifsoln{70}}{Median} @hspace{1em}
@fitbruby{5em}{@ifsoln{72}}{Q3} @hspace{1em}
@fitbruby{5em}{@ifsoln{78}}{Maximum} @hspace{5em}
@fitbruby{10em}{@ifsoln{13}}{Range} @hspace{1em} @fitbruby{10em}{@ifsoln{3.5}}{Interquartile Range (IQR)}
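
The two summaries above can be checked with a short script. This is a minimal sketch in Python (not the Pyret environment the lesson images were made in), and `five_number_summary` is a hypothetical helper name. It assumes the by-hand convention the answer key appears to follow: Q1 and Q3 are the medians of the lower and upper halves, and the middle value is left out of both halves when the count is odd. Other tools may use a different quartile convention and report slightly different values.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max).

    Assumed convention: quartiles are the medians of the lower and
    upper halves; the middle value is excluded when the count is odd.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]          # values below the median position
    upper = ordered[half + n % 2:]  # skip the middle value when n is odd
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

# Ages copied from the two datasets on this page
smith = [1, 44, 3, 42, 46, 74, 75, 21, 74, 70, 40, 41, 45]
young = [70, 68, 69, 72, 65, 75, 65, 78, 70, 72, 71, 70]

for name, ages in (("Smith", smith), ("Young", young)):
    low, q1, med, q3, high = five_number_summary(ages)
    print(name, (low, q1, med, q3, high), "range:", high - low, "IQR:", q3 - q1)
```

Under this convention the Smith ages give 1, 30.5, 44, 72, 75 with a range of 74 and an IQR of 41.5, and the Young ages give 65, 68.5, 70, 72, 78 with a range of 13 (78 - 65) and an IQR of 3.5.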

== Box Plots - Visualizing Shape

Plot the 5-Number Summaries for each family on the number lines below. Then draw a box from Q1 to Q3, add a line at the median to divide the box into 2 sections, and draw whiskers out to the minimum and maximum values to reveal the box plot.

@n Smith: @ifnotsoln{@image{../images/blank-0to80-num-line-numbered.png}} @ifsoln{@image{../images/smith.png}}

@n Young: @ifnotsoln{@image{../images/blank-0to80-num-line-numbered.png}} @ifsoln{@image{../images/young.png}}

== Compare and Contrast

@n For which family gathering was the median age more typical? How do you know? @fitb{}{}

@fitb{}{@ifsoln{Young. Because the data is more closely clustered around the median.}}

@n What else do you Notice and Wonder about the data from these two family gatherings?

@fitb{}{}

@fitb{}{}

@fitb{}{}

@n What advantages were there to making these two box plots on number lines with the same scale? @fitb{}{}

@fitb{}{@ifsoln{Seeing both box plots on the same scale makes them easy to compare.}}

@n What might have been a benefit of making the box plots on number lines with different scales? @fitb{}{}

@fitb{}{@ifsoln{It might be easier to read the precise values for the Young family data if we zoomed in on the range the data actually falls within.}}
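
One way to put a number on "more typical" in the first Compare and Contrast question is to count how many ages sit close to each family's median. The sketch below is in Python. `share_near_median` is a hypothetical helper, and the 5-year window is an arbitrary assumption chosen only for illustration, not something the worksheet specifies.

```python
from statistics import median

def share_near_median(values, window):
    """Fraction of values that lie within `window` units of the median."""
    center = median(values)
    close = [v for v in values if abs(v - center) <= window]
    return len(close) / len(values)

# Ages copied from the two datasets on this page
smith = [1, 44, 3, 42, 46, 74, 75, 21, 74, 70, 40, 41, 45]
young = [70, 68, 69, 72, 65, 75, 65, 78, 70, 72, 71, 70]

smith_share = share_near_median(smith, 5)  # 6 of 13 ages are within 5 years of 44
young_share = share_near_median(young, 5)  # 11 of 12 ages are within 5 years of 70
print(f"Smith: {smith_share:.0%}  Young: {young_share:.0%}")
```

With this window, roughly 46% of the Smith ages and 92% of the Young ages fall near the median, which agrees with the answer that the Young data is clustered more tightly. A different window would change the percentages, so the comparison is only a rough companion to reading the box plots.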

@@ -67,7 +67,7 @@ The Data Cycle is a roadmap that guides us in the process of data analysis. You'
@A{Lookup, arithmetic, and statistical questions.}

@Q{What's the difference between arithmetic and statistical questions?}
@A{A statistical question does not specify a particular arithmetic process, while an arithmetic question does.}
@A{A statistical question anticipates variability in the data related to the question and accounts for it in the answers, while an arithmetic question anticipates a specific answer related to a particular arithmetic process.}

*Consider Data*
@Q{What do we need to determine in this phase?}
4 changes: 2 additions & 2 deletions lessons/Data-Science/data-cycle/langs/en-us/index.adoc
@@ -219,7 +219,7 @@ Most questions can be broken down into one of four categories:

@slidebreak

- *Statistical questions* - These kinds of questions are the most interesting! And are often best asked with "in general" attached, because the answer isn't black and white. If we ask "are dogs heavier than cats?", we know that not every dog is heavier than every cat! We just want to know if it is _generally true_ or _generally false_!
- *Statistical questions* - These kinds of questions are the most interesting! They are often best asked with "in general" attached, because we expect some variability and the answer isn't black and white. If we ask "are dogs heavier than cats?", we know that not every dog is heavier than every cat! We just want to know if it is _generally true_ or _generally false_!

- *Questions we can't answer* - We might wonder where the animal shelter is located, or what time of year the data was gathered! But the data in the table won’t help us answer that question, so as Data Scientists we might need to do some research beyond the data. And if nothing turns up, we simply recognize that there are limits to what we can analyze.

@@ -233,7 +233,7 @@ Most questions can be broken down into one of four categories:
@Q{What kind of question is "How old is Toggle?" How do you know?}
@A{It's a _lookup question_ because it can be answered by just looking at the table.}
@Q{What kind of question is "Are older animals adopted more quickly than younger animals?" How do you know?}
@A{It's a _statistical question_ because we are wondering what is happening in general.}
@A{It's a _statistical question_ because we expect some variability in the data and are wondering what is happening in general.}
}

=== Investigate
@@ -27,7 +27,7 @@ Each question a Data Scientist asks adds a chapter to the story of their researc
* *Arithmetic questions* - Answered by doing calculations (comparing, averaging, totaling, etc.) with values from one single column. Examples of arithmetic questions might be “How much does the heaviest animal weigh?” or “What is the average age of animals from the shelter?”
* *Statistical questions* - These often involve multiple steps to answer, and the answer isn't black and white. When we compare two statistics we are actually comparing two data sets. If we ask "are dogs heavier than cats?", we know that not every dog is heavier than every cat! We just want to know if it is generally true or generally false!
* *Statistical questions* - These are questions that both _expect some variability in the data_ related to the question and _account for it in the answers_. Statistical questions often involve multiple steps to answer, and the answers aren't black and white. When we compare two statistics we are actually comparing two data sets. If we ask "are dogs heavier than cats?", we know that not every dog is heavier than every cat! We just want to know if it is generally true or generally false!
* *Questions we can't answer* - We might wonder where the animal shelter is located, or what time of year the data was gathered! But the data in the table won’t help us answer that question, so as Data Scientists we might need to do some research beyond the data. And if nothing turns up, we simply recognize that there are limits to what we can analyze.
@@ -24,7 +24,7 @@ There are two phases of data collection.
- Find graphs related to eating habits by conducting an online search using terms like, "What do Americans eat for snacks?"

=== 2. Asking a Meaningful Statistical Question
Once you have all of your data and have looked at several graphics online about America’s snacking habits, you will declare your statistical question for this project. Some suggestions are below, but feel free to develop your own based on your analysis of the data.
Once you have all of your data and have looked at several graphics online about America’s snacking habits, you will declare your @vocab{statistical question} for this project. Some suggestions are below, but feel free to develop your own based on your analysis of the data.

- What time of day do we eat the healthiest snacks?
- We know that snacks high in saturated fats are bad for you. Do high-fat snacks get unhealthy ratings? In general, how good would you say you are, as a class, at judging healthiness?
@@ -24,7 +24,7 @@ There are two phases of data collection.
- Find graphs related to time use by conducting an online search using terms like, “How do Americans use their time?” @link{https://flowingdata.com/2015/11/10/counting-the-hours/,Here} is one possible resource to check out (from FlowingData).

=== 2. Asking a Meaningful Statistical Question
Once you have all of your data and have looked at several graphics online about America’s time use, you will declare your statistical question for this project. Some suggestions are below, but feel free to develop your own based on your analysis of the data.
Once you have all of your data and have looked at several graphics online about America’s time use, you will declare your @vocab{statistical question} for this project. Some suggestions are below, but feel free to develop your own based on your analysis of the data.

- On average, how long do I spend on homework compared with students across the country?
- Do people who identify as male, female or non-binary take longer to groom themselves?
@@ -19,7 +19,7 @@ Students will investigate four types of threats to validity by pretending to be
== Phases of the Project

=== 1. Decide on a question.
You and a partner will choose a statistical question that you would be interested in exploring. (Note, you will not actually be investigating this question in full. Rather, you will be developing a faulty plan to answer the question.) Be sure that your question is one that could be answered by closely analyzing data, which also lends itself to many threats needing to be addressed.
You and a partner will choose a @vocab{statistical question} that you would be interested in exploring. (Note, you will not actually be investigating this question in full. Rather, you will be developing a faulty plan to answer the question.) Be sure that your question is one that could be answered by closely analyzing data, which also lends itself to many threats needing to be addressed.

=== 2. Develop a faulty research plan.
You and your partner will develop a faulty plan to research your statistical question. Remember, your goal here is to use data to misconstrue and mislead. Be sure to describe in detail how you will incorporate each of the following threats to validity.
10 changes: 10 additions & 0 deletions lib/glossary-terms.json
@@ -1952,6 +1952,16 @@
"description": "utilizar información de una muestra para sacar conclusiones sobre la población más grande de la que se origina la muestra"
}
},
{
"en-us": {
"keywords": [[ "statistical question"]],
"description": "questions that _expect some variability in the data_ related to the question and _account for it in the answers_. They focus on describing or analyzing patterns and trends in data sets, to gain understanding of what is generally true about the data, rather than computing a single precise value."
},
"es-mx": {
"keywords": [[ "pregunta estadística" ]],
"description": "Preguntas que _esperan cierta variabilidad en los datos_ relacionados con la pregunta y _la tienen en cuenta en las respuestas_. Se centran en describir o analizar patrones y tendencias en conjuntos de datos, para comprender lo que es generalmente cierto acerca de los datos, en lugar de calcular un único valor preciso."
}
},
{
"en-us": {
"keywords": [[ "strength" ]],