sci-dash incorrect information #29

Open
AgedMordorBlue opened this issue Jan 23, 2024 · 6 comments

@AgedMordorBlue

Hi Job,

I'm getting some weird output in my sci-dash:

  • The Total Input Reads column adds up to more than the Total input read-pairs
  • Many cell metrics (mean reads/cell, mean UMI/cell, etc) are about 10 times lower than indicated in the STARsolo summary output.

I went back to the STARsolo summary file, and the values in the sci-dash don't match what is written there. The Summary stats themselves are much more in line with the sci-dash output of an earlier version of the pipeline. I've added the sci-dash JSON and the STARsolo Summary.csv content of the same sample below.

Best,
Yani

sci-dash JSON:
"sample_succes": {
  "5mm_dsDNAse": {
    "n_pairs_success": 373470356,
    "sequencing_saturation": 0.751197,
    "estimated_cells": 5738,
    "total_mapped_reads": 131048010,
    "total_unique_reads": 111547480,
    "total_multimapped_reads": 19500530,
    "total_correct_reads_genes": 90306169,
    "total_exonic_reads": 41117200,
    "total_intronic_reads": 49189000,
    "total_intergenic_reads": 40741810,
    "total_mitochondrial_reads": 0,
    "total_exonicAS_reads": 2446128,
    "total_intronicAS_reads": 7610787,
    "mean_reads_per_cell": 2414,
    "mean_genes_per_cell": 210,
    "mean_umis_per_cell": 376

STARsolo Summary:
Number of Reads,185672402
Reads With Valid Barcodes,1
Sequencing Saturation,0.751197
Q30 Bases in CB+UMI,1
Q30 Bases in RNA read,0.93507
Reads Mapped to Genome: Unique+Multiple,0.705802
Reads Mapped to Genome: Unique,0.600776
Reads Mapped to GeneFull_Ex50pAS: Unique+Multiple GeneFull_Ex50pAS,0.486374
Reads Mapped to GeneFull_Ex50pAS: Unique GeneFull_Ex50pAS,0.441959
Estimated Number of Cells,5738
Unique Reads in Cells Mapped to GeneFull_Ex50pAS,71273128
Fraction of Unique Reads in Cells,0.868553
Mean Reads per Cell,12421
Median Reads per Cell,9947
UMIs in Cells,17618322
Mean UMI per Cell,3070
Median UMI per Cell,2504
Mean GeneFull_Ex50pAS per Cell,1596
Median GeneFull_Ex50pAS per Cell,1442
Total GeneFull_Ex50pAS Detected,20151

STARsolo summary of prior run (I think the switch from GeneFull to GeneFull_Ex50pAS explains the difference between versions):
Number of Reads,190841919
Reads With Valid Barcodes,1
Sequencing Saturation,0.730686
Q30 Bases in CB+UMI,1
Q30 Bases in RNA read,0.934432
Reads Mapped to Genome: Unique+Multiple,0.818547
Reads Mapped to Genome: Unique,0.676697
Reads Mapped to GeneFull: Unique+Multiple GeneFull,0.545113
Reads Mapped to GeneFull: Unique GeneFull,0.483755
Estimated Number of Cells,5758
Unique Reads in Cells Mapped to GeneFull,79965182
Fraction of Unique Reads in Cells,0.866167
Mean Reads per Cell,13887
Median Reads per Cell,11178
UMIs in Cells,21385715
Mean UMI per Cell,3714
Median UMI per Cell,3041
Mean GeneFull per Cell,1792
Median GeneFull per Cell,1629
Total GeneFull Detected,17842
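
As a rough cross-check of the numbers above, the per-cell means in the current STARsolo Summary can be recomputed from its own totals. The sketch below assumes (this is inferred from the values, not taken from the STARsolo documentation) that "Mean Reads per Cell" is "Unique Reads in Cells" divided by "Estimated Number of Cells", and likewise for UMIs. The helper name `parse_summary` is made up for illustration.

```python
import csv
import io

# Subset of the STARsolo Summary.csv lines quoted above (current run).
SUMMARY_CSV = """\
Estimated Number of Cells,5738
Unique Reads in Cells Mapped to GeneFull_Ex50pAS,71273128
Mean Reads per Cell,12421
UMIs in Cells,17618322
Mean UMI per Cell,3070
"""

def parse_summary(text):
    """Parse 'key,value' lines into a dict of floats (hypothetical helper)."""
    rows = csv.reader(io.StringIO(text))
    return {row[0]: float(row[1]) for row in rows if len(row) == 2}

summary = parse_summary(SUMMARY_CSV)
cells = summary["Estimated Number of Cells"]

# Assumption: per-cell means are totals over filtered cells / number of cells.
mean_reads = summary["Unique Reads in Cells Mapped to GeneFull_Ex50pAS"] / cells
mean_umis = summary["UMIs in Cells"] / cells

# Values reported in the sci-dash JSON for the same sample.
dash_mean_reads = 2414
dash_mean_umis = 376

print(int(mean_reads), summary["Mean Reads per Cell"])
print(int(mean_umis), summary["Mean UMI per Cell"])
print(round(mean_reads / dash_mean_reads, 1), round(mean_umis / dash_mean_umis, 1))
```

Under that assumption the recomputed means agree with the Summary file, so the discrepancy sits on the sci-dash side rather than in the STARsolo output.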

@J0bbie
Collaborator

J0bbie commented Jan 24, 2024

Hi Yani,

Good catch! It was indeed generating a mean/sum based on all 'raw' cells / ambient RNA (instead of just the filtered cells).
This was throwing the numbers off.
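
A toy illustration of that effect, not the actual sci-dash code: averaging over every raw barcode (including low-count ambient ones) pulls the mean far below the mean over the filtered cells. The barcodes and counts below are invented for the example.

```python
from statistics import mean

# Hypothetical UMI counts per barcode: three real cells plus ambient barcodes.
umis_per_barcode = {
    "AAACCC": 3000,
    "AAAGGG": 3200,
    "AACTTT": 2800,
    "ACGTAC": 5,
    "AGTCAG": 2,
    "ATTGCA": 1,
    "CAGTCA": 4,
    "CCATGG": 3,
    "CGTAGC": 0,
}

# Hypothetical set of barcodes that passed cell filtering.
filtered_barcodes = {"AAACCC", "AAAGGG", "AACTTT"}

# Mean over all raw barcodes (the behaviour described as the bug).
mean_raw = mean(umis_per_barcode.values())

# Mean over filtered cells only (the intended statistic).
mean_filtered = mean(umis_per_barcode[bc] for bc in filtered_barcodes)

print(mean_raw, mean_filtered)
```

The more ambient barcodes there are relative to real cells, the larger the gap between the two means becomes.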

I've fixed this in the latest commit and also made some other small changes to the sci-dash.

Just pull the latest code, delete the sci-dash folder of your run and start the snakemake workflow again. It should re-generate just the sci-dash.

Let me know if this fixed it for you!

Best,

Job

@gauravvaidya16

gauravvaidya16 commented Jan 29, 2024

Hi Job,

I am facing a similar issue: both samples have identical stats in the sci-dash, but the STARsolo summary files for the two samples differ. Also, the successful read-pairs of the two samples together add up to more than the total input read-pairs.

Best,
Gaurav

Below are the sci-dash JSON and the STARsolo summaries for each sample:

"sample_succes": {
  "Pmor_50percPEG": {
    "n_pairs_success": 362401472,
    "total_reads": 175944916,
    "sequencing_saturation": 0.490957,
    "perc_mapped_reads_genome": 0.560275,
    "perc_unique_reads_genome_unique": 0.305331,
    "perc_mapped_reads_gene": 0.126788,
    "perc_unique_reads_gene_unique": 0.105158,
    "estimated_cells": 8953,
    "mean_reads_per_cell": 1458,
    "mean_umi_per_cell": 729,
    "mean_genes_per_cell": 560,
    "total_exonic_reads": 8109995,
    "total_intronic_reads": 7492202,
    "total_intergenic_reads": 46385009,
    "total_mitochondrial_reads": 0,
    "total_exonicAS_reads": 1492351,
    "total_intronicAS_reads": 3167266
  }

  "Pmor": {
    "n_pairs_success": 202924630,
    "total_reads": 175944916,
    "sequencing_saturation": 0.490957,
    "perc_mapped_reads_genome": 0.560275,
    "perc_unique_reads_genome_unique": 0.305331,
    "perc_mapped_reads_gene": 0.126788,
    "perc_unique_reads_gene_unique": 0.105158,
    "estimated_cells": 8953,
    "mean_reads_per_cell": 1458,
    "mean_umi_per_cell": 729,
    "mean_genes_per_cell": 560,
    "total_exonic_reads": 8109995,
    "total_intronic_reads": 7492202,
    "total_intergenic_reads": 46385009,
    "total_mitochondrial_reads": 0,
    "total_exonicAS_reads": 1492351,
    "total_intronicAS_reads": 3167266
  }

STARsolo Summary for Pmor_50percPEG

Screenshot 2024-01-29 at 14 16 52

STARsolo Summary for Pmor

Screenshot 2024-01-29 at 14 17 28

@J0bbie
Collaborator

J0bbie commented Jan 31, 2024

I think I figured it out. It had to do with similar sample names and the regular expression used to retrieve the STARsolo files: ad29488

I.e., Pmor / Pmor_50percPEG were retrieving the wrong statistics files because the wildcard search did not include the species.
Could you try again with the latest code and see if it makes more sense now?
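
A minimal sketch of how such a prefix collision can arise. It does not reproduce the pipeline's actual pattern or the change in ad29488: the file-name scheme `{sample}_{species}_Summary.csv`, the species label `speciesA`, and both helper functions are assumptions made for illustration.

```python
from fnmatch import fnmatch

# Hypothetical file names of the form {sample}_{species}_Summary.csv.
summary_files = [
    "Pmor_speciesA_Summary.csv",
    "Pmor_50percPEG_speciesA_Summary.csv",
]

def find_loose(sample, files):
    """Wildcard search on the sample name only; ambiguous for shared prefixes."""
    return sorted(f for f in files if fnmatch(f, f"{sample}_*_Summary.csv"))

def find_strict(sample, species, files):
    """Match the full sample and species, so 'Pmor' cannot pick up 'Pmor_50percPEG'."""
    return [f for f in files if f == f"{sample}_{species}_Summary.csv"]

# The loose pattern for 'Pmor' also matches the Pmor_50percPEG file.
loose_hits = find_loose("Pmor", summary_files)
strict_hits = find_strict("Pmor", "speciesA", summary_files)

print(loose_hits)
print(strict_hits)
```

In this toy setup, taking the first loose hit for `Pmor` would return the `Pmor_50percPEG` file, which would make both samples show the same statistics.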

@gauravvaidya16

Hi Job,

It fixed most of the stats. The one remaining problem is that the successful read-pairs of the two samples together are still higher than the total input read-pairs.

@J0bbie
Collaborator

J0bbie commented Feb 16, 2024

That indeed sounds a bit fishy. I'll check whether I'm double-counting some reads somewhere.
Are you using hashing-barcodes for these samples by chance?
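
One quick check that could help narrow this down is to compare `n_pairs_success` from the sci-dash JSON with the "Number of Reads" STARsolo reports for the same sample. The sketch below uses the 5mm_dsDNAse values from the first comment. It assumes, without confirmation from the pipeline code, that the two numbers are meant to count the same read-pairs, and the function name is made up.

```python
def pairs_to_reads_ratio(n_pairs_success, starsolo_number_of_reads):
    """Ratio of sci-dash successful pairs to STARsolo input reads (hypothetical check)."""
    return n_pairs_success / starsolo_number_of_reads

# Values quoted earlier in this thread for sample 5mm_dsDNAse.
ratio = pairs_to_reads_ratio(373470356, 185672402)

# Assumption: a ratio well above 1 means the sci-dash counts more pairs
# than STARsolo received, which is worth investigating.
suspicious = ratio > 1.0

print(f"{ratio:.3f}", suspicious)
```

A ratio above 1 does not say where the extra counts come from; it only flags the sample for a closer look.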

@gauravvaidya16

No, the samples are unhashed.
