1_read_data.Rmd

---
title: "Data import, Seurat object creation and overview"
authors: "Karla Lindquist (with Angelo Pelonero and Rebecca Jaszczak)"
date: "3/30/2021"
output: html_notebook
--- 


</br> 

For this tutorial, we will be analyzing a dataset of Peripheral Blood Mononuclear Cells (PBMC) freely available from 10X Genomics. There are 2,700 single cells that were sequenced on the Illumina NextSeq 500. The raw data was obtained from [here](https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz), and is also linked on the course website under ["Raw data"](http://tiny.ucsf.edu/dsiscrnaseq).


Other raw data (for your future explorations) are available [here](https://support.10xgenomics.com/single-cell-gene-expression/datasets).

</br>

#### Reading raw data and creating a `Seurat` object  

We start by reading in the data. The `Read10X()` function reads in the output of the cellranger pipeline from 10X, returning a unique molecular identified (UMI) count matrix. The values in this matrix represent the number of molecules for each feature (i.e. gene; row) that are detected in each cell (column).

For more information about the cellranger pipeline, visit the [Cell Ranger webpage](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger).

_Note: do _NOT_ try to open the raw "genes.tsv" file in Excel! The gene names could get converted to something unreadable or unmeaningful (e.g. gene SEPT2 will be converted by Excel to 2-Sep or 9/2)._ 


</br> 

__Method 1__: Read data using a RStudio project directory (if you created a project folder with subfolders called "src" for your notebooks/scripts and another subfolder called "data" for the data, and within the data folder you will have "filtered_gene_bc_matrices/hg19" if you downloaded the data.zip file from the course website).  
```{r}
getwd() ## confirm that your working directory is your "src" folder if you are using this method
projdir <- sub("src", "", getwd()) ## this removes the "src" part of the directory string to get you back to the main project folder 

# Load the PBMC dataset (note: you may need to remove the "./" part)
pbmc.data <- Read10X(data.dir = paste0(projdir, "./data/filtered_gene_bc_matrices/hg19/"))
```


__Method 2__: Read data by first setting your working directory to where the data files called "barcodes.tsv", "genes.tsv", and "matrix.mtx" are located. You will need to update the `setwd()` line to reflect your system. If running this code in an R notebook you will get a message to let you know that this is a temporary setting. If you are using this method then the code chunks in the rest of the tutorial where output is saved/loaded will also need to be updated to set the working directory. 
```{r}
setwd("~/Desktop/scRNAseq/data/filtered_gene_bc_matrices/hg19") ## example - update to your system
dir() ## confirm that you have the 3 raw data files required

pbmc.data <- Read10X(data.dir = ".")
```

 
</br> 

__More about the data formats:__ we are starting with base call (BCL) files here, which are generated by Illumina sequencers. What if you are starting with FASTQ files? You will need to use another software to create files that can be read by Seurat. An example pipeline using cellranger can be found in this tutoral called ["10x genomics single-cell RNAseq analysis from SRA data using Cell Ranger and Seurat"](https://bioinformaticsworkbook.org/dataAnalysis/RNA-Seq/Single_Cell_RNAseq/Chromium_Cell_Ranger.html#gsc.tab=0).  

</br>


We next use the count matrix provided to create a `Seurat` object. The object serves as a container that contains both the raw unnormalized expression data initially, and eventually analysis results (like PCA, or clustering results) for a single-cell dataset. 


When you create the Seurat object you can set some initial filtering parameters, such as in this example where we will select at least `min.features` genes that are expressed in at least `min.cells` cells. These will depend on your organism and experiment. Here we are selecting _genes_ expressed in >= 3 cells and _cells_ with at least 200 genes.   

```{r}
# Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts = pbmc.data, min.cells = 3, min.features = 200)
pbmc
```

The message "Feature names cannot have underscores ..." appears for some genes, e.g. in this data, ENSG00000229388	RP11-442N24__B.1.  

</br> 

#### Inspecting the `Seurat` object 

Now that the data is loaded into R, you can also use some of the functions that you would use for data frames, such as `dim()`, `names()` etc. Within each named object there are slots which can be can be viewed by name as well using `slotNames()` and accessed using the `@` symbol. 
For a technical discussion of the Seurat object structure, check out their [GitHub Wiki](https://github.com/satijalab/seurat/wiki).          

```{r}
dim(pbmc) ## rows (features/genes) x columns (barcodes/cells)
names(pbmc) ## for now we just have the raw RNA counts 
head(rownames(pbmc)) ## gene names from genes.tsv file
head(colnames(pbmc)) ## cell barcodes from barcodes.tsv
slotNames(pbmc[["RNA"]]) ## show all slot names
```

Let's look at the first 10 rows of the count matrix (from the matrix.mtx file). You can refer to the parts of a Seurat object by name (e.g. RNA) and the slots contained within it (e.g. counts).  

```{r}
head(pbmc[["RNA"]]@counts, n=10)
```


Now let's examine the counts from a few specific genes in the first 30 cells.  

```{r}
pbmc.data[c("CD3D", "TCL1A", "MS4A1"), 1:30]
```
The `.` values in the matrix represent 0s (no molecules detected). Since most values in an scRNA-seq matrix are 0, Seurat uses a sparse-matrix representation whenever possible. This results in significant memory and speed savings for Drop-seq/inDrop/10x data. This is especially important was datasets venture into millions of cells.  

Why are scRNA-seq data often sparse? Because of a phenomenon called "dropout" where genes may have low/zero mRNA in once cell but not another cell (even if it's from the same type/treatment), or the capture of mRNA is not complete in a particular cell. Modeling of expression data is often done using "zero-inflated" models as a result. Other methods focus on genes that are highly variable across cell types and/or imputation. For further discussion of this topic, see ["Embracing the dropouts in single-cell RNA-seq analysis"](https://www.nature.com/articles/s41467-020-14976-9) by Pen Qiu (Nature Communications 2020).  

Let's take a quick look at what this means in practical terms:
```{r}
dense.size <- object.size(as.matrix(pbmc.data))
dense.size
```
~710 million bytes = 710 megabytes

```{r}
sparse.size <- object.size(pbmc.data)
sparse.size
```
~30 million bytes = 30 megabytes

```{r}
dense.size/sparse.size
```
The dense matrix is about 24 times larger than the sparse matrix format (ignore the "bytes" part of the output). This will become more relevant as we conduct complex analysis and our Seurat object (built off of the sparse matrix) has complex, high memory calculations added to it.