gravitogen/scJournal

Journalling Single-Cell Reading

File types in Single-Cell Genomics and Data-Science

  1. BCL file: Binary Base Call (BCL) files are the raw data files generated by Illumina sequencers. Illumina sequencing technology uses cluster generation and sequencing-by-synthesis chemistry to sequence millions or billions of clusters on a flow cell, depending on the sequencing platform. During sequencing, the Real-Time Analysis (RTA) software on the instrument makes a base call for every cluster at every cycle and writes the base, together with a quality score expressing the confidence in the call, to base call (.bcl) files. As the name implies, this happens in real time: at each cycle of the run, a call is added for every location identified on the flow cell (organized by lanes and tiles). BCL files are stored in binary format and represent the raw output of a sequencing run. When sequencing completes, the base calls in the BCL files must be converted into sequence data.

This process is called BCL to FASTQ conversion.

  2. FASTQ file: Short (and long) sequencing reads from the sequencer are stored in FASTQ format (files with the extension .fastq). This format records the sequence of each read and the quality of each sequenced base. The quality score Q encodes the probability p that the corresponding base call is incorrect, with Q = -10 log10(p). Each read occupies four lines: (i) a header beginning with @, (ii) the sequence, (iii) a spacer line beginning with +, and (iv) the per-base quality scores encoded as ASCII characters. The symbols in the examples below follow the Phred+33 encoding, in which the ASCII code of the character is the score plus 33.

  • Score = 10 (symbol ‘+’) => probability of incorrect base call = 0.1 => base call accuracy = 90%
  • Score = 20 (symbol ‘5’) => probability of incorrect base call = 0.01 => base call accuracy = 99%
  • Score = 30 (symbol ‘?’) => probability of incorrect base call = 0.001 => base call accuracy = 99.9%. This is a commonly accepted threshold for quality trimming.
  • Score = 40 (symbol ‘I’) => probability of incorrect base call = 0.0001 => base call accuracy = 99.99%
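
The score table above can be reproduced with a few lines of Python. This is a minimal sketch rather than a production parser: it assumes the Phred+33 encoding (which matches the symbols listed above), assumes each record occupies exactly four lines, and uses hypothetical helper names (`read_fastq`, `decode_quality`, `phred_to_error_prob`) and a made-up read `read1` chosen only for illustration.

```python
import io

# Assumption: Phred+33 encoding, i.e. score = ASCII code - 33.
PHRED_OFFSET = 33


def phred_to_error_prob(q):
    """Convert a Phred score Q to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def decode_quality(qual):
    """Decode an ASCII quality string into a list of integer Phred scores."""
    return [ord(c) - PHRED_OFFSET for c in qual]


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a four-line-per-read FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        spacer = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not spacer.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual


# A made-up record whose quality string uses the four symbols from the list above.
example = "@read1\nACGT\n+\n+5?I\n"

for name, seq, qual in read_fastq(io.StringIO(example)):
    for base, q in zip(seq, decode_quality(qual)):
        p = phred_to_error_prob(q)
        print(f"{name} {base} Q={q} P(error)={p:.4f} accuracy={100 * (1 - p):.2f}%")
```

For real data, an established parser such as the one in Biopython is usually a safer choice; the sketch only shows how the header, sequence, spacer, and quality lines relate to the scores.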

Reference:

Blog reading

  1. https://liorpachter.wordpress.com/2019/06/21/single-cell-rna-seq-for-dummies/
  2. https://liorpachter.wordpress.com/2019/07/01/high-velocity-rna-velocity/

Data Analysis Workflow with Examples

  1. Generating gzipped fastq dump from SRA Accession list: Generate FASTQ GZ
  2. From .SRA to BAM file

RNA Velocity Papers

  1. https://www.embopress.org/doi/10.1038/msb.2011.62
  2. https://www.nature.com/articles/s41586-018-0414-6
  3. https://www.sciencedirect.com/science/article/abs/pii/S1097276518307974
  4. https://www.biorxiv.org/content/10.1101/673285v1
  5. https://www.pnas.org/content/116/39/19490
  6. https://jef.works/blog/2020/08/25/using-scvelo-in-R-using-reticulate/

Reference datasets

  1. mm10 genome index for kallisto used for RNA velocity: https://zenodo.org/record/3623148

Notable people in single-cell research

  1. Rahul Satija
  2. Sten Linnarsson
  3. Peter V. Kharchenko
  4. Lior Pachter
  5. Valentine Svensson
  6. Fabian J. Theis