-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathGLM_for_count.Qmd
132 lines (88 loc) · 6.48 KB
/
GLM_for_count.Qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
---
title: "Genalized Linear Model: Using Count Data"
subtitle: Learning how to used Generalized Linear models in R
title-block-banner: true
date: "`r Sys.Date()`"
author:
- name: Ana Bravo & Tendai Gwanzura
affiliations:
- Florida International University
- Robert Stempel College of Public Health and Social Work
toc: true
number-sections: true
highlight-style: pygments
format:
html:
code-fold: true
html-math-method: katex
---
## Libraries Used
```{r}
#| message: false
#|
library(tidyverse)
library(katex)
```
## Introduction
In this presentation, we will be discussing how to use the Generalized Linear Model (GLM) method using count data in the R programming language. We will show how to clean and wrangle the data, show necessary columns to execute our method, discuss assumptions held, showcase the code used to run this GLM, and the interpretation of our output results.
## What is a GLM?
### lets start with a quick overview of the simple linear regression
As you know, a simple linear regression has two major components: a Y, dependent outcome and an X, which is your independent or your predictor variable. the model looks something like this:
$$E(Y|X) = \beta_0 + \beta_1$$ where $\beta_0$ is your intercept and $\beta_1$ is your slope. A linear model is a function, that us used to fit a data. We often use this method to see the association, or strength in association between two variables of interest:
```{r}
#| message: false
# load data:
data(trees)
#rename variables:
names(trees) <- c("DBH_in","height_ft", "volume_ft3")
# simple model:
model <- lm(DBH_in ~ height_ft, data = trees)
# plot:
simple_trees <-
#data
ggplot(data = trees) +
# x and y
aes(x = height_ft, y = DBH_in) +
#labels
labs(title = "Example of Simple Association - Using Trees Data",
x = "Height in feet",
y = "Diameter in inches") +
# add points
geom_point() +
# add the lm
geom_smooth(method = "lm", color = "blue", se = FALSE) +
# add a simple theme
theme_bw()
# actual plot:
simple_trees
```
Often you will find this model writen in this form:
$$ E(Y|X) = \beta_0 + \beta_1 + \varepsilon$$ where $\beta_0$ and $\beta_1$ are our coefficients that need to be estimated and \$ \varepsilon\$ used for more complex lines Our **goal** is to see the association between an outcome with an exposure.
### Assumptions of a linear question
Before diving into the generalized models, lets quickly overview the assumptions of linear regression.
- Assumption 1: for each combination of independent variable x, y is a random variable with a [certain probability distribution.]{style="background-color:yellow"}
- Assumption 2: Y values are statistical dependent.
- Assumption 3: the mean value of Y, for each specific combination of values of X, is a linear function.
- Assumption 4: the variance of Y, is the same for any fixed value of $X_n$
- Assumption 5: for any fixed combination of $X_n$, Y is normally distributed.
Something to note, that in the simple explanation above we are assuming our Y variable (remember one of the points we talked above above? that y holds a specific distribution!) is continuous. So lets talk about our Y variable having a count distribution.
### lets talk about the generalized linear model
the term "generalized" is a big umbrella term used to describe a large class of models. Our response variable $y_i$ is following an exponential family distribution with a mean of $u_i$ which is sometimes non-linear! However [McCallagh and Nelder](link) considered them to be linear because our covariate affect the distribution of $y_i$ only through linear combination.
there are three major components of a GLM:
- A Random component: which specifies the probability distribution of our response variable
- Systematic component: specifies the explanatory variable $(x_1 .. x_n)$ in our model. Or their linear combination $(\beta_0 + \beta_1)$ etc.
- a link function: specifies the LINK between the random and systematic components. This helps us figure out our expected values of the response is related to the linear combination of our explanatory variables. for example the link function $g(u) = u$, which is called an identity function. this models the mean directly. Or, in the case we will talk about today the log of the mean:
$$ log(\pmb{\mu}) = \alpha + \beta_1x_1 + ... B_nx_n + \epsilon$$
### What is a Poisson Regression?
a Poisson regression models how the mean of a discrete (or we can say count too!) response variable $Y$ depends on our explanatory $X$ variables. Here is a simple look at the Poisson regression:
$$ log \lambda_i = \beta_0 + \beta x_i $$ where the random component: the distribution of $Y$ is the mean of $\lambda$ and the systematic component is the explanatory variable (or your X variables, which can be continuous or categorical) that is linearly associated. Or can be transformed if non-linear, and the link function is the log link stated in the section above `(link the section number here?)`
An advantage of using GLM over a normal line model is the link function gives us more flexibility in modeling and this model uses the Maximum likelihood estimate. Additionally we can use different inference tools like Wald's test for logistic and Poisson models.
### important points of Poisson models
- we can use this type of link function by using count data. count data can be describes at the number of devices that can access the internet, number of sex partners you have in your lifetime, and the number of individuals infected with a disease.
- Poisson is unimodel and skewed to the right, both the mean and the variance are the same. In other words, when the count is larger it tends to be more varied.
- if our mew increases the skew decreases and the distribution starts to become more bell shapped. our mean tends to look something like this:
$$\pmb{\mu} = exp(\alpha + \beta x) = e^\alpha (e^\beta)^x $$ where one unit increase in X has a multiplicative impact on your $e^\beta$ power on the mean. (More on this a little later in the interpretation section!)
### one last point before we move on
when your modeling count data, the link scale is linear. So the effects are additive on the link. While your response scale is nonlinear (this is on the exponent) and so the effects are *multiplicative.* makes sense? we will work out an example now!
## Example 1: using GLM for count responses - Crab Data
Lets import the `crabs.txt` data from the University of Florida's [open-source data files.](https://users.stat.ufl.edu/~aa/cat/data/)