GLM_for_count.Qmd

---
title: "Genalized Linear Model: Using Count Data"
subtitle: Learning how to used Generalized Linear models in R 
title-block-banner: true
date: "`r Sys.Date()`"
author:
  - name: Ana Bravo & Tendai Gwanzura
    affiliations:
      - Florida International University
      - Robert Stempel College of Public Health and Social Work
toc: true
number-sections: true
highlight-style: pygments
format:
  html:
    code-fold: true
    html-math-method: katex
---

## Libraries Used

```{r}
#| message: false
#| 
library(tidyverse)
library(katex)

```

## Introduction

In this presentation, we will be discussing how to use the Generalized Linear Model (GLM) method using count data in the R programming language. We will show how to clean and wrangle the data, show necessary columns to execute our method, discuss assumptions held, showcase the code used to run this GLM, and the interpretation of our output results.

## What is a GLM?

### lets start with a quick overview of the simple linear regression

As you know, a simple linear regression has two major components: a Y, dependent outcome and an X, which is your independent or your predictor variable. the model looks something like this:

$$E(Y|X) = \beta_0 + \beta_1$$ where $\beta_0$ is your intercept and $\beta_1$ is your slope. A linear model is a function, that us used to fit a data. We often use this method to see the association, or strength in association between two variables of interest:

```{r}
#| message: false


# load data:
data(trees)

#rename variables:
names(trees) <- c("DBH_in","height_ft", "volume_ft3")

# simple model:
model <- lm(DBH_in ~ height_ft, data = trees)

# plot:
simple_trees <-
  #data
  ggplot(data = trees) +
  
  # x and y
  aes(x = height_ft, y = DBH_in) + 
  
  #labels
  labs(title = "Example of Simple Association - Using Trees Data",
       x = "Height in feet",
       y = "Diameter in inches") + 
  
  # add points
  geom_point() + 
  
  # add the lm
  geom_smooth(method = "lm", color = "blue", se = FALSE) +
  
  # add a simple theme
  theme_bw()

# actual plot: 
simple_trees

```

Often you will find this model writen in this form:

$$ E(Y|X) = \beta_0 + \beta_1 + \varepsilon$$ where $\beta_0$ and $\beta_1$ are our coefficients that need to be estimated and \$ \varepsilon\$ used for more complex lines Our **goal** is to see the association between an outcome with an exposure.

### Assumptions of a linear question

Before diving into the generalized models, lets quickly overview the assumptions of linear regression.

-   Assumption 1: for each combination of independent variable x, y is a random variable with a [certain probability distribution.]{style="background-color:yellow"}
-   Assumption 2: Y values are statistical dependent.
-   Assumption 3: the mean value of Y, for each specific combination of values of X, is a linear function.
-   Assumption 4: the variance of Y, is the same for any fixed value of $X_n$
-   Assumption 5: for any fixed combination of $X_n$, Y is normally distributed.

Something to note, that in the simple explanation above we are assuming our Y variable (remember one of the points we talked above above? that y holds a specific distribution!) is continuous. So lets talk about our Y variable having a count distribution.

### lets talk about the generalized linear model

the term "generalized" is a big umbrella term used to describe a large class of models. Our response variable $y_i$ is following an exponential family distribution with a mean of $u_i$ which is sometimes non-linear! However [McCallagh and Nelder](link) considered them to be linear because our covariate affect the distribution of $y_i$ only through linear combination.

there are three major components of a GLM:

-   A Random component: which specifies the probability distribution of our response variable
-   Systematic component: specifies the explanatory variable $(x_1 .. x_n)$ in our model. Or their linear combination $(\beta_0 + \beta_1)$ etc.
-   a link function: specifies the LINK between the random and systematic components. This helps us figure out our expected values of the response is related to the linear combination of our explanatory variables. for example the link function $g(u) = u$, which is called an identity function. this models the mean directly. Or, in the case we will talk about today the log of the mean:

$$ log(\pmb{\mu}) = \alpha + \beta_1x_1 + ... B_nx_n + \epsilon$$

### What is a Poisson Regression?

a Poisson regression models how the mean of a discrete (or we can say count too!) response variable $Y$ depends on our explanatory $X$ variables. Here is a simple look at the Poisson regression:

$$  log \lambda_i = \beta_0 + \beta x_i $$ where the random component: the distribution of $Y$ is the mean of $\lambda$ and the systematic component is the explanatory variable (or your X variables, which can be continuous or categorical) that is linearly associated. Or can be transformed if non-linear, and the link function is the log link stated in the section above `(link the section number here?)`

An advantage of using GLM over a normal line model is the link function gives us more flexibility in modeling and this model uses the Maximum likelihood estimate. Additionally we can use different inference tools like Wald's test for logistic and Poisson models.

### important points of Poisson models

-   we can use this type of link function by using count data. count data can be describes at the number of devices that can access the internet, number of sex partners you have in your lifetime, and the number of individuals infected with a disease.
-   Poisson is unimodel and skewed to the right, both the mean and the variance are the same. In other words, when the count is larger it tends to be more varied.
-   if our mew increases the skew decreases and the distribution starts to become more bell shapped. our mean tends to look something like this:

$$\pmb{\mu} = exp(\alpha + \beta x) = e^\alpha (e^\beta)^x $$ where one unit increase in X has a multiplicative impact on your $e^\beta$ power on the mean. (More on this a little later in the interpretation section!)

### one last point before we move on

when your modeling count data, the link scale is linear. So the effects are additive on the link. While your response scale is nonlinear (this is on the exponent) and so the effects are *multiplicative.* makes sense? we will work out an example now!

## Example 1: using GLM for count responses - Crab Data

Lets import the `crabs.txt` data from the University of Florida's [open-source data files.](https://users.stat.ufl.edu/~aa/cat/data/)