-
Notifications
You must be signed in to change notification settings - Fork 2
/
index.Rmd
executable file
·175 lines (134 loc) · 12 KB
/
index.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
---
title: "Data Mining with R"
author: "Bradley Boehmke"
site: bookdown::bookdown_site
output: bookdown::gitbook
documentclass: book
bibliography: references.bib
biblio-style: apalike
link-citations: yes
github-repo: bradleyboehmke/uc-bana-7025
twitter-handle: bradleyboehmke
description: "Master the art of data wrangling & analysis with the R programming language."
---
`r if (knitr::is_latex_output()) '<!--'`
# Syllabus {-}
```{block, type = "note"}
This is the primary "textbook" for the *Machine Learning* section of the UC BANA 4080 Data Mining course. The following is a truncated syllabus; for the full syllabus along with complete course content please visit the online course content in [Canvas](https://uc.instructure.com/).
```
Welcome to ___Data Mining with R___! This course provides an intensive, hands-on introduction to data mining and analysis techniques. You will learn the fundamental skills required to extract informative attributes, relationships, and patterns from data sets. You will gain hands-on experience with exploratory data analysis, data visualization, unsupervised learning techniques such as clustering and dimension reduction, and supervised learning techniques such as linear regression, regularized regression, decision trees, random forests, and more! You will also be exposed to some more advanced topics such as ensembling techniques, deep learning, model stacking, and model interpretation. Together, this will provide you with a solid foundation of tools and techniques applied in organizations to aid modern day data-driven decision making.
```{block, type='video'}
<iframe id="kaltura_player" src="https://cdnapisec.kaltura.com/p/1492301/sp/149230100/embedIframeJs/uiconf_id/49148882/partner_id/1492301?iframeembed=true&playerId=kaltura_player&entry_id=1_0ytoqhb0&flashvars[streamerType]=auto&flashvars[localizationCode]=en_US&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_i9fbuuuq" width="640" height="610" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="BANA 4080: Course Introduction"></iframe>
```
## Learning Objectives {-}
Upon successfully completing this course, you will be able to:
* Apply data wrangling techniques to manipulate and prepare data for analysis.
* Use exploratory data analysis and visualization to provide descriptive insights of data.
* Apply common unsupervised learning algorithms to find common groupings of observations and features in a given dataset.
* Describe and apply a sound analytic modeling process.
* Apply, compare, and contrast various predictive modeling techniques.
* Have the resources and understanding to continue advancing your data mining and analysis capabilities.
...all with R!
```{block, type = "note"}
This course assumes no prior knowledge of R. Experience with programming concepts or another programming language will help, but is not required to understand the material.
```
## Material {-}
This course is split into two main sections - <u>***Data Wrangling***</u> and <u>***Machine Learning***</u>. The data wrangling section will provide you the fundamental skills required to acquire, munge, transform, manipulate, and visualize data in a computing environment that fosters reproducibility. The primary course material for this section is provided via this [free online book](https://bradleyboehmke.github.io/uc-bana-7025/).
The second section focused on machine learning section will expose you to several algorithms to identify hidden patterns and relationships within your data. The primary course material for this part of the course is provided via this [free online book](https://bradleyboehmke.github.io/uc-bana-4080/). There will also be recorded lectures and additional supplementary resources provided via Canvas.
## Class Structure {-}
__Modules__: For this class each module is covered over the course of week. In the "Overview" section for each module you will find overall learning objectives, a short description of the learning content covered in that module, along with all tasks that are required of you for that module (i.e. quizzes, lab). Each module will have two or more primary lessons and associated quizzes along with a lab.
__Lessons__: For each lesson you will read and work through the tutorial. Short videos will be sprinkled throughout the lesson to further discuss and reinforce lesson concepts. Each lesson will have various "TODO" exercises throughout, along with end-of-lesson exercises. I highly recommend you work through these exercises as they will prepare you for the quizzes, labs, and project work.
__Quizzes__: There will be a short quiz associated with _each lesson_. These quizzes will be hosted in the course website on Canvas. Please check Canvas for due dates for these quizzes.
__Labs__: There will be a lab associated with _each module_. For these labs students will be guided through a case study step-by-step. The aim is to provide a detailed view on how to manage a variety of complex real-world data; how to convert real problems into data wrangling and analysis problems; and to apply R to address these problems and extract insights from the data. These labs will be provided via the course website on Canvas and the submission of these labs will also be done through the course website on Canvas. Please check Canvas for due dates for these labs.
__Projects__: There will be two projects designed for you to put to work the tools and knowledge that you gain throughout this course. This provides you with multiple benefits.
- It will provide you with more experience using data wrangling tools on real life data sets.
- It helps you become a self-directed learner. As a data scientist, a large part of your job is to self-direct your learning and interests to find unique and creative ways to find insights in data.
- It starts to build your data science portfolio. Establishing a data science portfolio is a great way to show potential employers your ability to work with data.
## Schedule {-}
```{block, type = "note"}
See the [Canvas](https://uc.instructure.com/) course webpage for a detailed schedule with due dates for quizzes, labs, etc.
```
| Module | Description |
|:-------------:|:----------------------------------------------------|
| | **DATA WRANGLING** |
| **1** | **Introduction** |
| | R fundamentals & the Rstudio IDE |
| | Deeper understanding of vectors |
| **2** | **Reproducible Documents and Importing Data** |
| | Managing your workflow and reproducibility |
| | Data structures & importing data |
| **3** | **Tidy Data and Data Manipulation** |
| | Data manipulation & summarization |
| | Tidy data |
| **4** | **Relational Data and More Tidyverse Packages** |
| | Relational data |
| | Leveraging the Tidyverse to text & date-time data |
| **5** | **Data Visualization & Exploration** |
| | Data visualization |
| | Exploratory data analysis |
| **6** | **Creating Efficient Code in R** |
| | Control statements & iteration |
| | Writing functions |
| **7** | **Mid-term Project** |
| | **MACHINE LEARNING** |
| **8** | **Introduction to Applied Modeling** |
| | Introduction to machine learning |
| | First model with Tidymodels |
| **9** | **First Regression Models** |
| | Simple linear regression |
| | Multiple linear regression |
| **10** | **More Modeling Processes** |
| | Feature engineering |
| | Resampling |
| **11** | **Classification & Regularization** |
| | Logistic regression |
| | Regularized regression |
| **12** | **Hyperparameter Tuning & Non-linearity** |
| | Hyperparameter tuning |
| | Multivariate adaptive regression splines |
| **13** | **Tree-based Models** |
| | Decision trees |
| | Bagging |
| | Random forests |
| **14** | **Unsupervised learning** |
| | Clustering |
| | Dimension reduction |
| **15** | **Final Project** |
## Conventions used in this book {-}
The following typographical conventions are used in this book:
* ___strong italic___: indicates new terms,
* __bold__: indicates package & file names,
* `inline code`: monospaced highlighted text indicates functions or other commands that could be typed literally by the user,
* code chunk: indicates commands or other text that could be typed literally by the user
```{r, first-code-chunk, collapse=TRUE}
1 + 2
```
In addition to the general text used throughout, you will notice the following cells that provide additional context for improved learning:
```{block, type = "video"}
A video demonstrating this topic is available in Canvas.
```
```{block, type = "tip"}
A tip or suggestion that will likely produce better results.
```
```{block, type = "note"}
A general note that could improve your understanding but is not required for the course requirements.
```
```{block, type = "warning"}
Warning or caution to look out for.
```
```{block, type = "todo"}
Knowledge check exercises to gauge your learning progress.
```
## Feedback {-}
To report errors or bugs that you find in this course material please post an issue at https://github.com/bradleyboehmke/uc-bana-4080/issues. For all other communication be sure to use Canvas or the university email.
```{block, type = "tip"}
When communicating with me via email, please always include **BANA4080** in the subject line.
```
## Acknowledgements {-}
This course and its materials have been influenced by the following resources:
- Jenny Bryan, [STAT 545: Data wrangling, exploration, and analysis with R](http://stat545.com/)
- Garrett Grolemund & Hadley Wickham, [R for Data Science](http://r4ds.had.co.nz/index.html)
- Stephanie Hicks, [Statistical Computing](https://www.stephaniehicks.com/jhustatcomputing2021/)
- Chester Ismay & Albert Kim, [ModernDive](https://moderndive.netlify.app/index.html)
- Alex Douglas et al., [An Introduction to R](https://intro2r.com/)
- Brandon Greenwell, [Hands-on Machine Learning with R](https://bradleyboehmke.github.io/HOML/)