Add gene expression + CRISPR dataset for the cellranger module and scrnaseq pipeline #1415

Merged
157 changes: 94 additions & 63 deletions data/genomics/homo_sapiens/10xgenomics/cellranger/README.md
@@ -10,6 +10,7 @@ This folder contains test datasets from 10X Genomics for reference and testing i
| 10k_pbmc | [Human PBMC from a Healthy Donor, 10k cells - multi (v2)](https://www.10xgenomics.com/resources/datasets/human-pbmc-from-a-healthy-donor-10-k-cells-multi-v-2-2-standard-5-0-0) | GEX, Fixed RNA Profiling, V(D)J-B, V(D)J-T, Antibody Capture | `count`, `vdj`, `multi` |
| 10k_pbmc_cmo | [10k Human PBMCs Stained with TotalSeq™-B Human Universal Cocktail, Singleplex Sample](https://www.10xgenomics.com/resources/datasets/10k-human-pbmcs-stained-with-totalseq-b-human-universal-cocktail-singleplex-sample-1-standard) | GEX, Cell Multiplexing | `count`, `multi` |
| 4plex_scFFPE | [Mixture of Healthy and Cancer FFPE Tissues Dissociated using Miltenyi FFPE Tissue Dissociation Kit, Multiplexed Samples, 4 Probe Barcodes](https://www.10xgenomics.com/datasets/mixture-of-healthy-and-cancer-ffpe-tissues-dissociated-using-miltenyi-ffpe-tissue-dissociation-kit-multiplexed-samples-4-probe-barcodes-1-standard) | GEX, FFPE, Cell Multiplexing | `multi` |
| sc3_v3_5k_a549_gex_crispr | [5k A549, Lung Carcinoma Cells, No Treatment Transduced with a CRISPR Pool](https://www.10xgenomics.com/datasets/5-k-a-549-lung-carcinoma-cells-no-treatment-transduced-with-a-crispr-pool-3-1-standard-6-0-0) | GEX, CRISPR | `count`, `multi` |

# Subsampling

@@ -20,67 +21,97 @@ Unless stated otherwise, FASTQs were naively subsampled to 10,000 reads by readi

```bash
.
|-- 10k_pbmc_cmo
| |-- 10k_pbmc_cmo_config.csv
| |-- 10k_pbmc_cmo_count_feature_reference.csv
| |-- README.md
| `-- fastqs
| |-- cmo
| | |-- subsampled_SC3_v3_NextGem_DI_CellPlex_Human_PBMC_10K_1_multiplexing_capture_S1_L001_R1_001.fastq.gz
| | `-- subsampled_SC3_v3_NextGem_DI_CellPlex_Human_PBMC_10K_1_multiplexing_capture_S1_L001_R2_001.fastq.gz
| |-- gex_1
| | |-- subsampled_SC3_v3_NextGem_DI_CellPlex_Human_PBMC_10K_1_gex_S2_L001_R1_001.fastq.gz
| | `-- subsampled_SC3_v3_NextGem_DI_CellPlex_Human_PBMC_10K_1_gex_S2_L001_R2_001.fastq.gz
| `-- gex_2
| |-- subsampled_SC3_v3_NextGem_DI_CellPlex_Human_PBMC_10K_2_gex_S1_L001_R1_001.fastq.gz
| `-- subsampled_SC3_v3_NextGem_DI_CellPlex_Human_PBMC_10K_2_gex_S1_L001_R2_001.fastq.gz
|-- 5k_cmvpos_tcells
| |-- 5k_human_antiCMV_T_TBNK_connect_Multiplex_count_feature_reference.csv
| |-- README.md
| |-- fastqs
| | |-- ab
| | | |-- subsampled_5k_human_antiCMV_T_TBNK_connect_AB_S2_L004_R1_001.fastq.gz
| | | `-- subsampled_5k_human_antiCMV_T_TBNK_connect_AB_S2_L004_R2_001.fastq.gz
| | |-- gex_1
| | | |-- subsampled_5k_human_antiCMV_T_TBNK_connect_GEX_1_S1_L001_R1_001.fastq.gz
| | | `-- subsampled_5k_human_antiCMV_T_TBNK_connect_GEX_1_S1_L001_R2_001.fastq.gz
| | `-- vdj
| | |-- subsampled_5k_human_antiCMV_T_TBNK_connect_VDJ_S1_L001_R1_001.fastq.gz
| | `-- subsampled_5k_human_antiCMV_T_TBNK_connect_VDJ_S1_L001_R2_001.fastq.gz
| `-- 5k_cmvpos_tcells_config.csv
|-- README.md
|-- 10k_pbmc
| -- 4plex_human_liver_colorectal_ovarian_panc_scFFPE_multiplex_S1_L001_R1_001.subsampled.fastq.gz
| -- 4plex_human_liver_colorectal_ovarian_panc_scFFPE_multiplex_S1_L001_R2_001.subsampled.fastq.gz
| -- 4plex_human_liver_colorectal_ovarian_panc_scFFPE_multiplex_S1_L002_R1_001.subsampled.fastq.gz
| -- 4plex_human_liver_colorectal_ovarian_panc_scFFPE_multiplex_S1_L002_R2_001.subsampled.fastq.gz
| -- 4plex_human_liver_colorectal_ovarian_panc_scFFPE_multiplex_S1_L003_R1_001.subsampled.fastq.gz
| -- 4plex_human_liver_colorectal_ovarian_panc_scFFPE_multiplex_S1_L003_R2_001.subsampled.fastq.gz
| -- 4plex_human_liver_colorectal_ovarian_panc_scFFPE_multiplex_S1_L004_R1_001.subsampled.fastq.gz
| -- 4plex_human_liver_colorectal_ovarian_panc_scFFPE_multiplex_S1_L004_R2_001.subsampled.fastq.gz
|-- 10k_pbmc
| |-- fastqs
| | |-- 5gex
| | | |-- 5fb
| | | | |-- subsampled_sc5p_v2_hs_PBMC_10k_5fb_S1_L001_R1_001.fastq.gz
| | | | `-- subsampled_sc5p_v2_hs_PBMC_10k_5fb_S1_L001_R2_001.fastq.gz
| | | `-- 5gex
| | | |-- subsampled_sc5p_v2_hs_PBMC_10k_5gex_S1_L001_R1_001.fastq.gz
| | | `-- subsampled_sc5p_v2_hs_PBMC_10k_5gex_S1_L001_R2_001.fastq.gz
| | |-- bcell
| | | |-- subsampled_sc5p_v2_hs_PBMC_10k_b_S1_L001_R1_001.fastq.gz
| | | `-- subsampled_sc5p_v2_hs_PBMC_10k_b_S1_L001_R2_001.fastq.gz
| | `-- tcell
| | |-- subsampled_sc5p_v2_hs_PBMC_10k_t_S1_L001_R1_001.fastq.gz
| | `-- subsampled_sc5p_v2_hs_PBMC_10k_t_S1_L001_R2_001.fastq.gz
| |-- sc5p_v2_hs_PBMC_10k_multi_5gex_5fb_b_t_config.csv
| `-- sc5p_v2_hs_PBMC_10k_multi_5gex_5fb_b_t_feature_ref.csv
`-- references
|-- README.md
`-- vdj
`-- refdata-cellranger-vdj-GRCh38-alts-ensembl-5.0.0
|-- fasta
| |-- regions.fa
| `-- supp_regions.fa
`-- reference.json
├── 10k_pbmc
│ ├── 10k_pbmc_config.csv
│ ├── README.md
│ ├── fastqs
│ │ ├── 5gex
│ │ │ ├── 5fb
│ │ │ │ ├── subsampled_sc5p_v2_hs_PBMC_10k_5fb_S1_L001_R1_001.fastq.gz
│ │ │ │ └── subsampled_sc5p_v2_hs_PBMC_10k_5fb_S1_L001_R2_001.fastq.gz
│ │ │ └── 5gex
│ │ │ ├── subsampled_sc5p_v2_hs_PBMC_10k_5gex_S1_L001_R1_001.fastq.gz
│ │ │ └── subsampled_sc5p_v2_hs_PBMC_10k_5gex_S1_L001_R2_001.fastq.gz
│ │ ├── bcell
│ │ │ ├── subsampled_sc5p_v2_hs_PBMC_10k_b_S1_L001_R1_001.fastq.gz
│ │ │ └── subsampled_sc5p_v2_hs_PBMC_10k_b_S1_L001_R2_001.fastq.gz
│ │ └── tcell
│ │ ├── subsampled_sc5p_v2_hs_PBMC_10k_t_S1_L001_R1_001.fastq.gz
│ │ └── subsampled_sc5p_v2_hs_PBMC_10k_t_S1_L001_R2_001.fastq.gz
│ └── sc5p_v2_hs_PBMC_10k_multi_5gex_5fb_b_t_feature_ref.csv
├── 10k_pbmc_cmo
│ ├── 10k_pbmc_cmo_config.csv
│ ├── 10k_pbmc_cmo_count_feature_reference.csv
│ ├── README.md
│ └── fastqs
│ ├── cmo
│ │ ├── subsampled_SC3_v3_NextGem_DI_CellPlex_Human_PBMC_10K_1_multiplexing_capture_S1_L001_R1_001.fastq.gz
│ │ └── subsampled_SC3_v3_NextGem_DI_CellPlex_Human_PBMC_10K_1_multiplexing_capture_S1_L001_R2_001.fastq.gz
│ ├── gex_1
│ │ ├── subsampled_SC3_v3_NextGem_DI_CellPlex_Human_PBMC_10K_1_gex_S2_L001_R1_001.fastq.gz
│ │ └── subsampled_SC3_v3_NextGem_DI_CellPlex_Human_PBMC_10K_1_gex_S2_L001_R2_001.fastq.gz
│ └── gex_2
│ ├── subsampled_SC3_v3_NextGem_DI_CellPlex_Human_PBMC_10K_2_gex_S1_L001_R1_001.fastq.gz
│ └── subsampled_SC3_v3_NextGem_DI_CellPlex_Human_PBMC_10K_2_gex_S1_L001_R2_001.fastq.gz
├── 4plex_scFFPE
│ ├── 4plex_human_liver_colorectal_ovarian_panc_scFFPE_multiplex_S1_L001_R1_001.subsampled.fastq.gz
│ ├── 4plex_human_liver_colorectal_ovarian_panc_scFFPE_multiplex_S1_L001_R2_001.subsampled.fastq.gz
│ ├── 4plex_human_liver_colorectal_ovarian_panc_scFFPE_multiplex_S1_L002_R1_001.subsampled.fastq.gz
│ ├── 4plex_human_liver_colorectal_ovarian_panc_scFFPE_multiplex_S1_L002_R2_001.subsampled.fastq.gz
│ ├── 4plex_human_liver_colorectal_ovarian_panc_scFFPE_multiplex_S1_L003_R1_001.subsampled.fastq.gz
│ ├── 4plex_human_liver_colorectal_ovarian_panc_scFFPE_multiplex_S1_L003_R2_001.subsampled.fastq.gz
│ ├── 4plex_human_liver_colorectal_ovarian_panc_scFFPE_multiplex_S1_L004_R1_001.subsampled.fastq.gz
│ └── 4plex_human_liver_colorectal_ovarian_panc_scFFPE_multiplex_S1_L004_R2_001.subsampled.fastq.gz
├── 5k_cmvpos_tcells
│ ├── 5k_cmvpos_tcells_config.csv
│ ├── 5k_human_antiCMV_T_TBNK_connect_Multiplex_count_feature_reference.csv
│ ├── README.md
│ └── fastqs
│ ├── ab
│ │ ├── subsampled_5k_human_antiCMV_T_TBNK_connect_AB_S2_L004_R1_001.fastq.gz
│ │ └── subsampled_5k_human_antiCMV_T_TBNK_connect_AB_S2_L004_R2_001.fastq.gz
│ ├── gex_1
│ │ ├── subsampled_5k_human_antiCMV_T_TBNK_connect_GEX_1_S1_L001_R1_001.fastq.gz
│ │ └── subsampled_5k_human_antiCMV_T_TBNK_connect_GEX_1_S1_L001_R2_001.fastq.gz
│ └── vdj
│ ├── subsampled_5k_human_antiCMV_T_TBNK_connect_VDJ_S1_L001_R1_001.fastq.gz
│ └── subsampled_5k_human_antiCMV_T_TBNK_connect_VDJ_S1_L001_R2_001.fastq.gz
├── README.md
├── hashing_demultiplexing
│ ├── 438-21-raw_feature_bc_matrix.h5
│ ├── 438_21_raw_HTO.csv
│ ├── README.md
│ ├── hto
│ │ ├── barcodes.tsv.gz
│ │ ├── features.tsv.gz
│ │ └── matrix.mtx.gz
│ ├── hto.tar.gz
│ ├── rna
│ │ ├── barcodes.tsv.gz
│ │ ├── features.tsv.gz
│ │ └── matrix.mtx.gz
│ └── rna.tar.gz
├── references
│ ├── README.md
│ └── vdj
│ └── refdata-cellranger-vdj-GRCh38-alts-ensembl-5.0.0
│ ├── fasta
│ │ ├── regions.fa
│ │ └── supp_regions.fa
│ └── reference.json
└── sc3_v3_5k_a549_gex_crispr
├── README.md
├── SC3_v3_NextGem_DI_CRISPR_A549_5K_Multiplex_config.csv
├── SC3_v3_NextGem_DI_CRISPR_A549_5K_Multiplex_count_feature_reference.csv
└── fastqs
├── crispr
│ ├── subsampled_SC3_v3_NextGem_DI_CRISPR_A549_5K_crispr_S4_L001_R1_001.fastq.gz
│ ├── subsampled_SC3_v3_NextGem_DI_CRISPR_A549_5K_crispr_S4_L001_R2_001.fastq.gz
│ ├── subsampled_SC3_v3_NextGem_DI_CRISPR_A549_5K_crispr_S4_L002_R1_001.fastq.gz
│ └── subsampled_SC3_v3_NextGem_DI_CRISPR_A549_5K_crispr_S4_L002_R2_001.fastq.gz
└── gex
├── subsampled_SC3_v3_NextGem_DI_CRISPR_A549_5K_gex_S5_L001_R1_001.fastq.gz
├── subsampled_SC3_v3_NextGem_DI_CRISPR_A549_5K_gex_S5_L001_R2_001.fastq.gz
├── subsampled_SC3_v3_NextGem_DI_CRISPR_A549_5K_gex_S5_L002_R1_001.fastq.gz
└── subsampled_SC3_v3_NextGem_DI_CRISPR_A549_5K_gex_S5_L002_R2_001.fastq.gz
```
@@ -0,0 +1,37 @@
# 5k A549, Lung Carcinoma Cells, No Treatment Transduced with a CRISPR Pool

This dataset was obtained from [10x Genomics](https://www.10xgenomics.com/datasets/5-k-a-549-lung-carcinoma-cells-no-treatment-transduced-with-a-crispr-pool-3-1-standard-6-0-0) and modified as described below. The dataset description and download instructions are reproduced from the 10x Genomics website.

## Dataset overview

A549 lung carcinoma cells that expressed dCas9-KRAB were transduced with a pool containing 93 total sgRNAs (90 sgRNAs targeting 45 different genes and 3 non-targeting control sgRNAs, all using Capture Sequence 2 inserted into the hairpin structure of the sgRNA). Cells were obtained by 10x Genomics from MilliporeSigma. Selected cells (cultured in a selection media) for each condition were individually frozen. Aliquots of cells were then thawed and counted. The same cells were used as part of the multiplexed sample.

Libraries were prepared following the Chromium Single Cell 3' Reagent Kits User Guide (v3.1 Chemistry Dual Index) with Feature Barcoding technology for CRISPR Screening User Guide (CG000316) and sequenced on Illumina NovaSeq 6000.

### Single Cell 3’ CRISPR Screening v3.1 Dual Index Library

- Sequencing Depth: 21,401 read pairs per cell
- Paired-end, dual indexing:
  - Read 1: 28 cycles (16 bp barcode, 12 bp UMI)
  - i5 index: 10 cycles (sample index)
  - i7 index: 10 cycles (sample index)
  - Read 2: 90 cycles (transcript)

### Key Metrics

- Estimated Number of Cells: 5,867
- Median Genes per Cell: 3,194
- Median UMI Counts per Cell: 10,773

## Original input files download

Original input files were downloaded using:

```bash
# Input Files
curl -O https://s3-us-west-2.amazonaws.com/10x.files/samples/cell-exp/6.0.0/SC3_v3_NextGem_DI_CRISPR_A549_5K_Multiplex/SC3_v3_NextGem_DI_CRISPR_A549_5K_Multiplex_fastqs.tar
curl -O https://cf.10xgenomics.com/samples/cell-exp/6.0.0/SC3_v3_NextGem_DI_CRISPR_A549_5K_Multiplex/SC3_v3_NextGem_DI_CRISPR_A549_5K_Multiplex_config.csv
curl -O https://cf.10xgenomics.com/samples/cell-exp/6.0.0/SC3_v3_NextGem_DI_CRISPR_A549_5K_Multiplex/SC3_v3_NextGem_DI_CRISPR_A549_5K_Multiplex_count_feature_reference.csv
```

## Changes to original input files

For both the gene expression and CRISPR FASTQ files, only the lane 1 and lane 2 files were kept, and each kept file was subsampled to its first 10,000 reads.
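
The exact commands used for this step are not recorded here. As a rough sketch, assuming the original files follow the usual Illumina naming pattern (`*_L00N_R[12]_001.fastq.gz`) and that each read occupies exactly four lines, the subsampling could look like the following. The function names `subsample_fastq` and `subsample_lanes` are hypothetical helpers, not part of any tool.

```bash
# Keep the first N reads of a gzipped FASTQ file (4 lines per read).
# Usage: subsample_fastq <input.fastq.gz> <output.fastq.gz> [n_reads]
subsample_fastq() {
    local input="$1" output="$2" n_reads="${3:-10000}"
    gzip -dc "$input" | head -n "$(( n_reads * 4 ))" | gzip -c > "$output"
}

# Subsample only the lane 1 and lane 2 files found in a directory,
# writing each result with a "subsampled_" prefix (assumed naming).
# Usage: subsample_lanes <input_dir> <output_dir>
subsample_lanes() {
    local in_dir="$1" out_dir="$2" fq
    mkdir -p "$out_dir"
    for fq in "$in_dir"/*_L00[12]_R[12]_001.fastq.gz; do
        [ -e "$fq" ] || continue   # the glob matched nothing
        subsample_fastq "$fq" "$out_dir/subsampled_$(basename "$fq")"
    done
}

# Example call (both paths are placeholders):
# subsample_lanes /path/to/original/gex fastqs/gex
```

Taking the first reads of a file is not a random sample, but it keeps R1 and R2 in sync because both files list reads in the same order.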

In the count_feature_reference.csv file, the older gene symbol H2AFY was replaced with the newer symbol MACROH2A1, which designates the same gene, for compatibility with newer genome references.
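
How the symbol was replaced is not stated; a plain text substitution is one way to do it. The sketch below assumes that every occurrence of `H2AFY` in the file should be renamed, whichever column it appears in. `rename_gene_symbol` is a hypothetical helper name.

```bash
# Replace every occurrence of an old gene symbol with a new one in a CSV file.
# Writes to a temporary file first so the original is only overwritten on success.
# Usage: rename_gene_symbol <old_symbol> <new_symbol> <file.csv>
rename_gene_symbol() {
    local old="$1" new="$2" csv="$3"
    sed "s/${old}/${new}/g" "$csv" > "${csv}.tmp" && mv "${csv}.tmp" "$csv"
}

# Example call on the feature reference from this dataset:
# rename_gene_symbol H2AFY MACROH2A1 \
#     SC3_v3_NextGem_DI_CRISPR_A549_5K_Multiplex_count_feature_reference.csv
```

Going through a temporary file rather than `sed -i` avoids the differences between GNU and BSD `sed` in how in-place editing is invoked.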
@@ -0,0 +1,12 @@
[gene-expression]
reference,/path/to/references/refdata-gex-GRCh38-2020-A
expect-cells,5000

[feature]
reference,/path/to/feature_refs/SC3P_CellPlex_Set_A_millipore_pool_v2_jul_2020.csv

[libraries]
fastq_id,fastqs,lanes,physical_library_id,feature_types,subsample_rate
SC3_v3_NextGem_DI_CRISPR_A549_5K_gex,/path/to/fastqs/SC3_v3_NextGem_DI_CRISPR_A549_5K/SC3_v3_NextGem_DI_CRISPR_A549_5K_gex,any,CRISPR_A549_5K_gex,gene expression,
SC3_v3_NextGem_DI_CRISPR_A549_5K_crispr,/path/to/fastqs/SC3_v3_NextGem_DI_CRISPR_A549_5K/SC3_v3_NextGem_DI_CRISPR_A549_5K_crispr,any,CRISPR_A549_5K_crispr,Crispr Guide Capture,
