Named after multiple-input multiple-output (MIMO) control systems and the associated methods in system identification, which use a combination of perturbations and measurements to decode a system's dynamics, this package is designed to assist those attempting to understand biological dynamics by designing, performing, and analyzing perturbation scRNA-seq experiments.
Python Dependencies: pandas, numpy, matplotlib, seaborn, scipy, sklearn, statsmodels, Facebook PCA (though sklearn SVD or PCA should also work), networkx, Infomap (this can be removed), and goatools
- Our paper
- Perturb-seq Google Forum
- Chimera Correction and Code
- sgRNA Design Tools
- Addgene Plasmids: pPS, pBA439
- GEO Database link
Please let me know (here or in the Google Forum) if there are any areas that you'd like to see improved or new items to be added!
- Q: How many perturbations can I do?
- A: Commercial droplet scRNA-seq methods cost between $0.10 and $0.20 per cell. It takes ~10 cells/perturbation to observe signature effects and ~100 cells/perturbation to see individual gene-level effects robustly. Based on your budget, you can crunch the numbers (also see the Design of Experiments section below).
For a rough comparison of our pilot scRNA-seq to population RNA-seq of the same perturbation, see this iPython notebook.
In designing Perturb-seq like experiments, there are a few key factors to keep in mind:
Are you interested in broad transcriptional signatures or individual gene-level differential expression? If the former, a rough approximation may be around 10 cells/perturbation. If the latter, 100 or more cells may be required, depending on the effect size.
A similar approximation for reads/cell would be a couple thousand for signatures and tens of thousands for gene-level.
As in any pooled screen, the representation of each perturbation in the library will vary. With genome-wide CRISPR libraries, the difference between the 10th and 90th percentile of a library is roughly 6-fold (Wang, 2013). Depending on how strongly a user wants to ensure that every member of the library is represented, the cells/perturbation figure should be multiplied by an additional factor to reflect this variance.
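The rules of thumb above can be turned into a quick back-of-the-envelope calculation. The sketch below is only illustrative: the function name `plan_experiment` and its parameters are hypothetical and not part of this package, and the `representation_factor` is a value you choose yourself to compensate for uneven library representation.

```python
import math


def plan_experiment(n_perturbations, cells_per_perturbation,
                    representation_factor=1.0, cost_per_cell=0.15,
                    reads_per_cell=2000):
    """Rough sizing of a pooled perturbation screen (hypothetical helper).

    Returns (total_cells, total_reads, total_cost_in_dollars).
    """
    # Scale the target coverage by a user-chosen factor for uneven representation.
    total_cells = math.ceil(
        n_perturbations * cells_per_perturbation * representation_factor
    )
    total_reads = total_cells * reads_per_cell
    total_cost = total_cells * cost_per_cell
    return total_cells, total_reads, total_cost


# Example: 100 perturbations, gene-level resolution (~100 cells each),
# a 2x safety factor for representation, and $0.20 per cell.
cells, reads, cost = plan_experiment(
    100, 100, representation_factor=2.0, cost_per_cell=0.2, reads_per_cell=20000
)
print(f"cells: {cells}, reads: {reads}, cost: ${cost:,.0f}")
```

The cost here covers only the per-cell scRNA-seq price quoted above. Sequencing and enrichment PCR costs would need to be added separately.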
Our approach of using a high MOI, instead of either a single vector with multiple sgRNAs or vectors with different selection methods, benefits from ease of implementation and the ability to represent a large diversity of combinations (limited only by the number of cells).
However, challenges include a Poisson-like variance in the number of sgRNA/cell, sgRNA detection sensitivity, and the formation of PCR chimeras during the enrichment PCR procedure that can create misassignments.
All three of these factors should be assessed in pilot experiments to troubleshoot. An example of such a pilot is shown below (modified from the Drop-seq style species mixing experiments):
The distribution of reads going to the Perturb-seq vector (antiparallel) from 10X RNA-seq is shown above. Note that while the expression of the construct is comparable to that of a housekeeping gene, only a fraction of the reads overlap with the 18 bp barcode (colored section in the coverage track). As such, when you have a short barcode it is advisable in most cases to perform enrichment PCR to obtain sensitive GBC/CBC pairing.
- An expression matrix output by a high throughput scRNA-seq protocol (such as Drop-seq or 10X cellranger)
- Guide barcode (GBC) PCR data to pair perturbations with cell barcodes (for certain applications this may be obtainable directly from the RNA-seq data)
- A database of preassociated sgRNA/GBC pairs (either by Sanger sequencing or NGS)
- Guide barcodes and cell barcodes have to be paired accurately, accounting for chimeric PCR products
- A simple fitness calculation is possible by determining the difference between the initial abundances of a GBC and how many cells it appeared in
- MOI and detection probability are evaluated by comparing the observed number of GBCs/cell to a Poisson distribution that is first zero-truncated, due to FACS selection for transduced cells, and then zero-inflated, due to detection dropout (see this iPython notebook)
- A cell state classifier is defined on wildtype or control cells and then applied to all cells in an experiment (see this iPython notebook). These classifications can be used as outputs to be predicted (instead of gene expression) or as covariates in the model.
- The linear model integrating all covariates (and interaction terms as desired) is fit. An EM-like approach filters cells that look much more like control cells than perturbed cells (see this iPython notebook for an example)
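As an illustration of the MOI and detection step in the list above, here is a minimal sketch of one way to fit a zero-truncated, zero-inflated Poisson to the observed GBCs/cell. It is not the notebook's code. The function names are hypothetical, and the parameterization is an assumption: a cell shows zero GBCs with probability `1 - detection`, and otherwise its count follows a Poisson with mean `moi` conditioned on being at least one. Under that assumption the likelihood factorizes, so `detection` is the fraction of cells with at least one GBC and `moi` is obtained by matching the mean of the nonzero counts.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import poisson


def zt_zi_poisson_pmf(k, moi, detection):
    """PMF of the assumed zero-truncated, then zero-inflated Poisson."""
    k = np.asarray(k)
    # Zero truncation: renormalize the Poisson over k >= 1.
    truncated = poisson.pmf(k, moi) / -np.expm1(-moi)
    # Zero inflation: undetected cells contribute all of the mass at k == 0.
    return np.where(k == 0, 1.0 - detection, detection * truncated)


def fit_moi_and_detection(gbcs_per_cell):
    """Maximum-likelihood estimates of (moi, detection) under the model above."""
    counts = np.asarray(gbcs_per_cell)
    nonzero = counts[counts > 0]
    if nonzero.size == 0:
        raise ValueError("no cells with a detected GBC")
    detection = nonzero.size / counts.size
    mean_nonzero = nonzero.mean()
    if mean_nonzero <= 1.0:
        # Every detected cell has exactly one GBC: the MOI estimate tends to 0.
        return 0.0, detection
    # The mean of a zero-truncated Poisson is moi / (1 - exp(-moi)); solve for moi.
    moi = brentq(
        lambda lam: lam / -np.expm1(-lam) - mean_nonzero,
        1e-9, 10.0 * mean_nonzero,
    )
    return moi, detection


# Toy example with made-up counts of GBCs detected per cell.
observed = [0, 1, 1, 2, 1, 0, 3, 1, 2, 1]
moi_hat, detection_hat = fit_moi_and_detection(observed)
print(moi_hat, detection_hat)
print(zt_zi_poisson_pmf(np.arange(5), moi_hat, detection_hat))
```

Comparing the fitted PMF to the observed histogram of GBCs/cell gives a quick check on whether a pilot experiment behaves as expected.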
- The regulatory coefficients obtained from the model are the most informative output, giving an estimate of the extent to which each covariate (perturbation, cell state, pairwise interaction between perturbations, etc.) impacted a given gene.
- Cell state effects are obtained by predicting the cell states based on the linear model instead of predicting gene expression
- Cell size effects (genes detected or transcripts detected) can be predicted as well
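To make the idea of a covariates-by-genes coefficient matrix concrete, here is a toy sketch on simulated data. It uses plain ordinary least squares via `numpy.linalg.lstsq` as a stand-in, so it does not reproduce the package's actual fitting procedure, regularization, or cell filtering. The function name `fit_regulatory_coefficients` and all simulated values are hypothetical.

```python
import numpy as np


def fit_regulatory_coefficients(design, expression):
    """Least-squares fit of expression (cells x genes) on design (cells x covariates).

    Returns a covariates x genes coefficient matrix.
    """
    coefs, *_ = np.linalg.lstsq(design, expression, rcond=None)
    return coefs


rng = np.random.default_rng(0)
n_cells, n_genes = 500, 3

# Covariates: guide A, guide B, their pairwise interaction, and an intercept.
guide_a = rng.integers(0, 2, n_cells)
guide_b = rng.integers(0, 2, n_cells)
design = np.column_stack(
    [guide_a, guide_b, guide_a * guide_b, np.ones(n_cells)]
).astype(float)

# Simulated "true" effects: one row per covariate, one column per gene.
true_coefs = np.array([
    [-1.0, 0.0, 0.0],   # guide A represses gene 0
    [0.0, 0.5, 0.0],    # guide B induces gene 1
    [0.0, 0.0, 2.0],    # A:B interaction induces gene 2
    [1.0, 1.0, 1.0],    # baseline expression
])
expression = design @ true_coefs + rng.normal(scale=0.1, size=(n_cells, n_genes))

coefs = fit_regulatory_coefficients(design, expression)
print(np.round(coefs, 2))
```

The same function can be pointed at other outputs, such as cell state scores or the number of genes detected per cell, by passing them in place of the expression matrix.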