Boxplot Rethink (#2139)
* teach about how to make a box plot, introduce comparing box plots and histograms, incorporate variability into all lesson references to statistical questions, add a page for practicing making a box plot by hand (see #2129, #2111, #2104, #2085)
* [DS] add more box-plot contracts to our langtable
* [DS] use @ifsoln to show the word 'teacher', rather than @teacher (closes #2140)
* finalize box plots lesson in response to pull request updates (see #2139, #2129)
flannery-denny committed Aug 13, 2024
1 parent 26cfa16 commit d0a693a
Showing 18 changed files with 375 additions and 77 deletions.
7 changes: 4 additions & 3 deletions README-slides.adoc
Original file line number Diff line number Diff line change
Expand Up @@ -38,7 +38,7 @@ between md2gslides and Google, authenticating your computer. You're ready to roc
The script `build-slide` is used to build the slides for an
individual lesson, whose name is supplied as its argument, e.g.,

build-slide function-composition
./build-slide function-composition

This assumes that the lesson `function-composition` is
(or will be created) in the `distribution/**/lessons` directory.
Expand Down Expand Up @@ -154,9 +154,10 @@ The lesson-plan adoc files are typically split into slides at the 2-level
sections, with the lower-level sections being ignored. An author
can insert their own slide-break with the directive
`@slidebreak`, which is a no-op as far as the regular HTML
generation is concerned.
generation is concerned, or `@slidebreak{layout-name}` to request a
specific slide layout.

A final 1-level section, if it is named `Addtional Exercises`, is
A final 1-level section, if it is named `Additional Exercises`, is
also converted to a slide.

PD slides are marked with `@pd-slide{...}` or
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,11 @@
"description": "a symmetrical histogram with tall bars on the left and right side, a gap in the middle, and shorter bars inside of the tall bars.",
"source" : "Created by the Bootstrap Team based on contrived data",
"license" : "Creative Commons 4.0 - NC - SA"
},
"blank-0to80-num-line-numbered.png": {
"description": "A number line labelled at 0, 10, 20, 30, 40, 50, 60, 70, 80",
"source" : "Created by the Bootstrap Team",
"license" : "Creative Commons 4.0 - NC - SA"
},
"box-n-whisker-plot.png": {
"description": "A sample box-and-whisker plot based on contrived data",
Expand All @@ -20,7 +25,12 @@
"license" : "Creative Commons 4.0 - NC - SA"
},
"box-plot-pounds.png": {
"description": "box plot of pounds with a 5 number summary of min: 0.1, Q1: 3.9, Q2: 11.3, Q3 60.4, Max: 172",
"description": "box plot of pounds with a 5-number summary of min: 0.1, Q1: 3.9, Q2: 11.3, Q3: 60.4, Max: 172",
"source" : "Created in Pyret by the Bootstrap Team based on contrived data",
"license" : "Creative Commons 4.0 - NC - SA"
},
"box-plot-pounds-cropped.png": {
"description": "box plot of pounds. 5-number summary is not visible, but would be min: 0.1, Q1: 3.9, Q2: 11.3, Q3: 60.4, Max: 172",
"source" : "Created in Pyret by the Bootstrap Team based on contrived data",
"license" : "Creative Commons 4.0 - NC - SA"
},
Expand Down Expand Up @@ -85,7 +95,11 @@
"source" : "Created by the Bootstrap Team based on contrived data",
"license" : "Creative Commons 4.0 - NC - SA"
},
"histogram-a.png": {
"histogram-pounds.png": {
"description": "histogram of pounds, with a tall bar counting 17 animals weighing 0-20 pounds, and much shorter bars for the subsequent 20-pound intervals indicating counts of 3, 4, 2, 2, 1, 1, 0, and 2 animals respectively",
"source" : "Created in Pyret by the Bootstrap Team based on contrived data",
"license" : "Creative Commons 4.0 - NC - SA"
},
"histogram-a.png": {
"description": "A histogram with 6 bins. The middle 2 bars are 3 times as tall as the rest.",
"source" : "Created by the Bootstrap Team based on contrived data",
"license" : "Creative Commons 4.0 - NC - SA"
Expand Down Expand Up @@ -171,11 +185,21 @@
"source" : "Created by the Bootstrap Team in Pyret based on contrived data",
"license" : "Creative Commons 4.0 - NC - SA",
"caption" : "Right Skew"
},
"ledet.png": {
"description": "a box plot of the Ledet family data distributed across the full length of the number line",
"source" : "Created by the Bootstrap Team in Pyret based on contrived data",
"license" : "Creative Commons 4.0 - NC - SA"
},
"symmetric.png": {
"description": "a box plot with equally long whiskers, and boxes that are narrower than the whiskers, but the same width as each other",
"source" : "Created by the Bootstrap Team in Pyret based on contrived data",
"license" : "Creative Commons 4.0 - NC - SA",
"caption" : "Symmetric"
},
"watson.png": {
"description": "a box plot of the Watson family data clustered tightly at the right end of the number line",
"source" : "Created by the Bootstrap Team in Pyret based on contrived data",
"license" : "Creative Commons 4.0 - NC - SA"
}
}
312 changes: 248 additions & 64 deletions lessons/Data-Science/box-plots/langs/en-us/index.adoc

Large diffs are not rendered by default.

Original file line number Diff line number Diff line change
@@ -0,0 +1,59 @@
= Distribution of a Dataset

== Family Gatherings by the Numbers

@vspace{1ex}

*Ledet Family Ages*: 1, 44, 3, 42, 46, 74, 75, 21, 74, 70, 40, 41, 45 @right{*Average:* 44.3 years old}

@n Order the Ages from Least to Greatest: @fitb{}{@ifsoln{1, 3, 21, 40, 41, 42, 44, 45, 46, 70, 74, 74, 75}}

Then compute:
@fitbruby{5em}{@ifsoln{1}}{Minimum} @hspace{1em}
@fitbruby{5em}{@ifsoln{30.5}}{Q1} @hspace{1em}
@fitbruby{5em}{@ifsoln{44}}{Median} @hspace{1em}
@fitbruby{5em}{@ifsoln{72}}{Q3} @hspace{1em}
@fitbruby{5em}{@ifsoln{75}}{Maximum} @hspace{5em}
@fitbruby{5em}{@ifsoln{74}}{Range} @hspace{1em} @fitbruby{10em}{@ifsoln{41.5}}{Interquartile Range (IQR)}

@vspace{2ex}

*Watson Family Ages:* 70, 68, 69, 72, 65, 75, 65, 78, 70, 72, 71, 70 @right{*Average:* 70.4 years old}

@n Order the Ages from Least to Greatest: @fitb{}{@ifsoln{65, 65, 68, 69, 70, 70, 70, 71, 72, 72, 75, 78}}

Then compute:
@fitbruby{5em}{@ifsoln{65}}{Minimum} @hspace{1em}
@fitbruby{5em}{@ifsoln{68.5}}{Q1} @hspace{1em}
@fitbruby{5em}{@ifsoln{70}}{Median} @hspace{1em}
@fitbruby{5em}{@ifsoln{72}}{Q3} @hspace{1em}
@fitbruby{5em}{@ifsoln{78}}{Maximum} @hspace{5em}
@fitbruby{5em}{@ifsoln{13}}{Range} @hspace{1em} @fitbruby{10em}{@ifsoln{3.5}}{Interquartile Range (IQR)}

== Box Plots - Visualizing Shape

Make box plots of each family's age distribution on the number lines below. Hint: Plot the 5-Number Summaries, draw a box around the IQR (from Q1 to Q3), let the median split the box into 2 parts, and add whiskers from the box to the minimum and maximum values.

@n Ledet: @ifnotsoln{@image{../images/blank-0to80-num-line-numbered.png}} @ifsoln{@image{../images/ledet.png}}

@n Watson: @ifnotsoln{@image{../images/blank-0to80-num-line-numbered.png}} @ifsoln{@image{../images/watson.png}}

== Compare and Contrast

@n For which family gathering was the average age more typical? How do you know? @fitb{}{@ifsoln{Watson. Because the data is more closely clustered.}}

@fitb{}{@ifsoln{The Range and IQR are significantly smaller. The mean and median are much more similar.}}

@n What else do you Notice and Wonder about the data from these two family gatherings?

@fitb{}{@ifsoln{Answers will vary.}}

@fitb{}{@ifsoln{The ages for the oldest quarter of the families fall within about the same interval.}}

@fitb{}{@ifsoln{The minimum age for the Watson family is significantly higher than the median age for the Ledet family.}}

@n We plotted both of these box plots on number lines with the same scale. Did that make sense?

@fitb{}{@ifsoln{Yes. Seeing both box plots on the same scale definitely makes them easier to compare,}}

@fitb{}{@ifsoln{but it might be easier to read the precise values for the Watson family data if we zoomed in on the range the data actually falls within.}}
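
The solution values on this page can be checked with a short script. What follows is a minimal sketch in JavaScript, not part of the commit. It assumes quartiles are computed as the median of each half of the sorted data, with the overall median left out of both halves when the count is odd; that is the method that reproduces the Q1 and Q3 values in the solutions. The helper names `median` and `fiveNumberSummary` are hypothetical.

```javascript
// Median of an array that is already sorted in ascending order.
function median(sorted) {
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
}

// Five-number summary using the "median of each half" method (an assumption):
// when the count is odd, the overall median is excluded from both halves.
function fiveNumberSummary(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const half = Math.floor(sorted.length / 2);
  const lower = sorted.slice(0, half);
  const upper = sorted.slice(sorted.length - half);
  const min = sorted[0];
  const max = sorted[sorted.length - 1];
  const q1 = median(lower);
  const q3 = median(upper);
  return { min, q1, median: median(sorted), q3, max, range: max - min, iqr: q3 - q1 };
}

// Ages copied from the worksheet.
const ledet = [1, 44, 3, 42, 46, 74, 75, 21, 74, 70, 40, 41, 45];
const watson = [70, 68, 69, 72, 65, 75, 65, 78, 70, 72, 71, 70];

console.log(fiveNumberSummary(ledet));
console.log(fiveNumberSummary(watson));
```

Other quartile conventions (for example, interpolation-based ones) can give slightly different Q1 and Q3 values for the same data, so the method matters when comparing against the worksheet.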
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,6 @@ Describe the shape of the box plots on the left. _Do your best to incorporate th
| 2 | @centered-image{../images/boxplot-2.png, 300} | @ifsoln{Symmetric}
| 3 | @centered-image{../images/boxplot-3.png, 300} | @ifsoln{Skew Right}
| 4 | @centered-image{../images/boxplot-4.png, 300} | @ifsoln{Symmetric}
| 5 | @centered-image{../images/boxplot-6.png, 300} | @ifsoln{Evenly Distributed}
| 5 | @centered-image{../images/boxplot-6.png, 300} | @ifsoln{Symmetric - More specifically: Evenly Distributed}

|===
Original file line number Diff line number Diff line change
Expand Up @@ -67,7 +67,7 @@ The Data Cycle is a roadmap that guides us in the process of data analysis. You'
@A{Lookup, arithmetic, and statistical questions.}

@Q{What's the difference between arithmetic and statistical questions?}
@A{A statistical question does not specify a particular arithmetic process, while an arithmetic question does.}
@A{A statistical question anticipates variability in the data related to the question and accounts for it in the answers, while an arithmetic question anticipates a specific answer related to a particular arithmetic process.}

*Consider Data*
@Q{What do we need to determine in this phase?}
Expand Down
4 changes: 2 additions & 2 deletions lessons/Data-Science/data-cycle/langs/en-us/index.adoc
Original file line number Diff line number Diff line change
Expand Up @@ -219,7 +219,7 @@ Most questions can be broken down into one of four categories:

@slidebreak

- *Statistical questions* - These kinds of questions are the most interesting! And are often best asked with "in general" attached, because the answer isn't black and white. If we ask "are dogs heavier than cats?", we know that not every dog is heavier than every cat! We just want to know if it is _generally true_ or _generally false_!
- *Statistical questions* - These kinds of questions are the most interesting! They are often best asked with "in general" attached, because we expect some variability and the answer isn't black and white. If we ask "are dogs heavier than cats?", we know that not every dog is heavier than every cat! We just want to know if it is _generally true_ or _generally false_!

- *Questions we can't answer* - We might wonder where the animal shelter is located, or what time of year the data was gathered! But the data in the table won’t help us answer that question, so as Data Scientists we might need to do some research beyond the data. And if nothing turns up, we simply recognize that there are limits to what we can analyze.

Expand All @@ -233,7 +233,7 @@ Most questions can be broken down into one of four categories:
@Q{What kind of question is "How old is Toggle?" How do you know?}
@A{It's a _lookup question_ because it can be answered by just looking at the table.}
@Q{What kind of question is "Are older animals adopted more quickly than younger animals?" How do you know?}
@A{It's a _statistical question_ because we are wondering what is happening in general.}
@A{It's a _statistical question_ because we expect some variability in the data and are wondering what is happening in general.}
}

=== Investigate
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,7 @@ Each question a Data Scientist asks adds a chapter to the story of their researc
* *Arithmetic questions* - Answered by doing calculations (comparing, averaging, totaling, etc.) with values from one single column. Examples of arithmetic questions might be “How much does the heaviest animal weigh?” or “What is the average age of animals from the shelter?”
* *Statistical questions* - These often involve multiple steps to answer, and the answer isn't black and white. When we compare two statistics we are actually comparing two data sets. If we ask "are dogs heavier than cats?", we know that not every dog is heavier than every cat! We just want to know if it is generally true or generally false!
* *Statistical questions* - These are questions that both _expect some variability in the data_ related to the question and _account for it in the answers_. Statistical questions often involve multiple steps to answer, and the answers aren't black and white. When we compare two statistics we are actually comparing two data sets. If we ask "are dogs heavier than cats?", we know that not every dog is heavier than every cat! We just want to know if it is _generally_ true or _generally_ false!
* *Questions we can't answer* - We might wonder where the animal shelter is located, or what time of year the data was gathered! But the data in the table won’t help us answer that question, so as Data Scientists we might need to do some research beyond the data. And if nothing turns up, we simply recognize that there are limits to what we can analyze.
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@ There are two phases of data collection.
- Find graphs related to eating habits by conducting an online search using terms like, "What do Americans eat for snacks?"

=== 2. Asking a Meaningful Statistical Question
Once you have all of your data and have looked at several graphics online about America’s snacking habits, you will declare your statistical question for this project. Some suggestions are below, but feel free to develop your own based on your analysis of the data.
Once you have all of your data and have looked at several graphics online about America’s snacking habits, you will declare your @vocab{statistical question} for this project. Some suggestions are below, but feel free to develop your own based on your analysis of the data.

- What time of day do we eat the healthiest snacks?
- We know that snacks high in saturated fats are bad for you. Do high-fat snacks get unhealthy ratings? In general, how good would you say you are, as a class, at judging healthiness?
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@ There are two phases of data collection.
- Find graphs related to time use by conducting an online search using terms like, “How do Americans use their time?” @link{https://flowingdata.com/2015/11/10/counting-the-hours/,Here} is one possible resource to check out (from FlowingData).

=== 2. Asking a Meaningful Statistical Question
Once you have all of your data and have looked at several graphics online about America’s time use, you will declare your statistical question for this project. Some suggestions are below, but feel free to develop your own based on your analysis of the data.
Once you have all of your data and have looked at several graphics online about America’s time use, you will declare your @vocab{statistical question} for this project. Some suggestions are below, but feel free to develop your own based on your analysis of the data.

- On average, how long do I spend on homework compared with students across the country?
- Do people who identify as male, female or non-binary take longer to groom themselves?
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,7 @@ Students will investigate four types of threats to validity by pretending to be
== Phases of the Project

=== 1. Decide on a question.
You and a partner will choose a statistical question that you would be interested in exploring. (Note, you will not actually be investigating this question in full. Rather, you will be developing a faulty plan to answer the question.) Be sure that your question is one that could be answered by closely analyzing data, which also lends itself to many threats needing to be addressed.
You and a partner will choose a @vocab{statistical question} that you would be interested in exploring. (Note, you will not actually be investigating this question in full. Rather, you will be developing a faulty plan to answer the question.) Be sure that your question is one that could be answered by closely analyzing data, which also lends itself to many threats needing to be addressed.

=== 2. Develop a faulty research plan.
You and your partner will develop a faulty plan to research your statistical question. Remember, your goal here is to use data to misconstrue and mislead. Be sure to describe in detail how you will incorporate each of the following threats to validity.
Expand Down
10 changes: 10 additions & 0 deletions lib/glossary-terms.json
Original file line number Diff line number Diff line change
Expand Up @@ -1952,6 +1952,16 @@
"description": "utilizar información de una muestra para sacar conclusiones sobre la población más grande de la que se origina la muestra"
}
},
{
"en-us": {
"keywords": [[ "statistical question"]],
"description": "questions that _expect some variability in the data_ related to the question and _account for it in the answers_. They focus on describing or analyzing patterns and trends in data sets, to gain understanding of what is generally true about the data, rather than computing a single precise value."
},
"es-mx": {
"keywords": [[ "pregunta estadística" ]],
"description": "Preguntas que _esperan cierta variabilidad en los datos_ relacionados con la pregunta y _la tienen en cuenta en las respuestas_. Se centran en describir o analizar patrones y tendencias en conjuntos de datos, para comprender lo que es generalmente cierto acerca de los datos, en lugar de calcular un único valor preciso."
}
},
{
"en-us": {
"keywords": [[ "strength" ]],
Expand Down
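
To illustrate how a directive such as `@vocab{statistical question}` might resolve against an entry of this shape, here is a minimal sketch. It is not the build system's actual lookup: `lookupTerm` and `glossarySample` are hypothetical names, and case-insensitive matching is an assumption. The sample description is shortened to the first sentence of the entry added in this hunk.

```javascript
// One entry shaped like the glossary-terms.json hunk (description shortened).
const glossarySample = [
  {
    "en-us": {
      "keywords": [["statistical question"]],
      "description": "questions that _expect some variability in the data_ related to the question and _account for it in the answers_."
    }
  }
];

// Hypothetical lookup: return the description for a term in a given language.
// "keywords" is an array of arrays, so it is flattened before matching.
function lookupTerm(glossary, lang, term) {
  const wanted = term.trim().toLowerCase();
  const entry = glossary.find(
    (e) => e[lang] && e[lang].keywords.flat().some((k) => k.toLowerCase() === wanted)
  );
  return entry ? entry[lang].description : undefined;
}

console.log(lookupTerm(glossarySample, "en-us", "Statistical Question"));
console.log(lookupTerm(glossarySample, "en-us", "box plot")); // undefined: not in the sample
```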
20 changes: 20 additions & 0 deletions lib/langtable.js
Original file line number Diff line number Diff line change
Expand Up @@ -304,6 +304,26 @@ var langTable = {
"domain": [{"name": "table-name", "type":"Table"}, {"name": "column", "type": "String"}],
"range": "Image",
"example-pyret": "modified-box-plot(animals-table, \"pounds\")"},
{"name": "box-plot-scaled",
"domain": [{"name": "table-name", "type":"Table"}, {"name": "column", "type": "String"}, {"name": "low", "type": "Number"}, {"name": "high", "type": "Number"}],
"range": "Image",
"example-pyret": "box-plot-scaled(animals-table, \"weeks\", 1, 40)"},
{"name": "vert-box-plot",
"domain": [{"name": "table-name", "type":"Table"}, {"name": "column", "type": "String"}],
"range": "Image",
"example-pyret": "vert-box-plot(animals-table, \"weeks\")"},
{"name": "modified-vert-box-plot",
"domain": [{"name": "table-name", "type":"Table"}, {"name": "column", "type": "String"}],
"range": "Image",
"example-pyret": "modified-vert-box-plot(animals-table, \"pounds\")"},
{"name": "modified-box-plot-scaled",
"domain": [{"name": "table-name", "type":"Table"}, {"name": "column", "type": "String"}, {"name": "low", "type": "Number"}, {"name": "high", "type": "Number"}],
"range": "Image",
"example-pyret": "modified-box-plot-scaled(animals-table, \"weeks\", 1, 40)"},
{"name": "modified-vert-box-plot-scaled",
"domain": [{"name": "table-name", "type":"Table"}, {"name": "column", "type": "String"}, {"name": "low", "type": "Number"}, {"name": "high", "type": "Number"}],
"range": "Image",
"example-pyret": "modified-vert-box-plot-scaled(animals-table, \"weeks\", 1, 40)"},
{"name": "histogram",
"domain": [{"name": "table-name", "type":"Table"}, {"name": "labels", "type": "String"}, {"name": "values", "type": "String"}, {"name": "bin-size", "type": "Number"}],
"range": "Image",
Expand Down
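
Each entry above pairs a `domain` and `range` with an `example-pyret` call. The sketch below shows one way such an entry could be rendered as a contract line and sanity-checked. The helpers `formatContract` and `exampleArity` are hypothetical, and the output notation is an assumption for illustration, not necessarily how Bootstrap renders contracts.

```javascript
// Two entries copied from the langtable.js hunk; the real table holds many more.
const sampleEntries = [
  {
    "name": "box-plot-scaled",
    "domain": [
      {"name": "table-name", "type": "Table"},
      {"name": "column", "type": "String"},
      {"name": "low", "type": "Number"},
      {"name": "high", "type": "Number"}
    ],
    "range": "Image",
    "example-pyret": "box-plot-scaled(animals-table, \"weeks\", 1, 40)"
  },
  {
    "name": "vert-box-plot",
    "domain": [
      {"name": "table-name", "type": "Table"},
      {"name": "column", "type": "String"}
    ],
    "range": "Image",
    "example-pyret": "vert-box-plot(animals-table, \"weeks\")"
  }
];

// Hypothetical helper: render an entry as "name :: (arg :: Type, ...) -> Range".
function formatContract(entry) {
  const args = entry.domain.map((d) => `${d.name} :: ${d.type}`).join(", ");
  return `${entry.name} :: (${args}) -> ${entry.range}`;
}

// Hypothetical check: count the arguments in the example call.
// Naive: splits on commas, so it would miscount string arguments containing commas.
function exampleArity(entry) {
  const inside = entry["example-pyret"].replace(/^[^(]*\(/, "").replace(/\)$/, "");
  return inside.split(",").length;
}

for (const entry of sampleEntries) {
  console.log(formatContract(entry));
  console.log("arity matches domain:", exampleArity(entry) === entry.domain.length);
}
```

A check like `exampleArity` is useful when adding the `-scaled` variants, since their examples must supply the extra `low` and `high` arguments that the unscaled versions do not take.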
