Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lower evidence for fusions with Arriba #41

Closed
erprateek opened this issue Jan 22, 2020 · 18 comments
Closed

Lower evidence for fusions with Arriba #41

erprateek opened this issue Jan 22, 2020 · 18 comments

Comments

@erprateek
Copy link

erprateek commented Jan 22, 2020

Hi,

We have been observing consistently low evidence (split_reads) for fusions detected in cell lines/solid tumor samples as compared to when I use FusionCatcher or Pizzly. Is this expected or is there parameter tuning that can regulate this behavior?

Thanks,
Prateek

@suhrig
Copy link
Owner

suhrig commented Jan 22, 2020

Hi Prateek,

STAR has an issue with the alignment of split reads when the paired-end mates overlap. You should run STAR with --peOverlapNbasesMin 10, which alleviates this problem. I plan to make this a recommendation for running STAR in conjunction with Arriba.

In addition, you can run STAR with --alignSplicedMateMapLminOverLmate 0.5, which improves alignment of chimeric reads that span introns.

Lastly, you can increase the value of --chimScoreDropMax, which will cause STAR to report chimeric alignments with poorer alignment score. Note that this might lead to more false positives, though.

I should note that the columns split_reads1, split_reads2, and discordant_mates do not reflect all supporting reads. Some reads may have been discarded by one of Arriba's filters. You need to sum up the numbers of these columns and the number of reads listed in the column filters to obtain the total number of supporting reads.

Regards,
Sebastian

@erprateek
Copy link
Author

erprateek commented Jan 24, 2020

Hi Sebastian,

Thanks for your input here. I fired off several experiments based on your recommendations but I am running into an issue with STAR (2.7.3a) process not communicating with the system that it has finished. I have posted this issue on STAR repo as well. Will get back to you once that gets resolved.

Prateek

@erprateek
Copy link
Author

Seems like the issue I was running into was due to working with a pre-packaged binary instead of building from source. New experiments underway.

@erprateek
Copy link
Author

Hi Sebastian, while I am observing that there is a clear boost to the signal of True positive fusions using --peOverlapNbasesMin 10 and --alignSplicedMateMapLminOverLmate 0.5 , there are a few fusion events that start losing evidence as well. Do you happen to know why that might be happening?

@erprateek
Copy link
Author

erprateek commented Jan 30, 2020

fusions.discarded.tsv:SLC34A2 ROS1 +/+ -/- 4:25665939 6:117645578 CDS CDS translocation downstream downstream 0 0 1 18 63 low. . duplicates(1),min_support . . . .

fusions.tsv:SLC34A2 ROS1 +/+ -/- 4:25665952 6:117645578 splice-site splice-site translocation downstream downstream 0 4 1 15 63 high . . duplicates(1) ACCCACTCTTCAGGACTC-GGGATCAAGTGGTCAG___AGAGAGACACCAAAGGGAAGATTCTCTGTTTCTTCCAAGGGATTGGGAGATTGATTTTACTTCTCGGATTTCTCTACTTTTTCGTGTGCTCCCTGGATATTCTTAGTAGCGCCTTCCAGCTGGTTGGAG|ATGATTTTTGGATACCAGAAACAAGTTTCATACTTACTATTATAGTTGGAATATTTCTGGTTGTTACAATCCCACTGACCTTTG___TCTGGCATAGAAGATTAAAGAATCAAAAAAGTGCCAAGGAAGGGGTGACAGTGCTTATAAACGAAGACAAAGAGTTGGCTGAGCTGCGAGGTCTGGC out-of-frame PTLQDsgssgqretpkgrfsvsskglgdfyfsdfstfscapwiflvapsswle|mifgyqkqvsylll

When I turn those two flags on, I lose any evidence for this fusion

@erprateek
Copy link
Author

STAR logs from the default run vs experiment. Seems there is a substantial difference of chimeric reads sample_pipeline_params_arriba_opt.txt
sample_pipeline_params_only.txt

@suhrig
Copy link
Owner

suhrig commented Jan 31, 2020

there are a few fusion events that start losing evidence as well. Do you happen to know why that might be happening?

Looking at your log files, the parameter --alignSplicedMateMapLminOverLmate 0.5 might be at fault here. It permits poorer quality alignments where only half of a read (vs. the default of 66%) must map if it contains a splice site. Possibly, STAR finds alternative non-chimeric alignments and does not consider the chimeric alignment option anymore. What happens when you only use --peOverlapNbasesMin 10 in that particular case?

I have tested both parameters independently on some of my own test data and noticed that both of them occasionally cause chimeric reads to be lost (although there is a big net gain). The reasons why these reads are lost are not fully clear to me. The lost chimeric alignments seem to have the following pattern: STAR still aligns the non-chimeric segment of the read. The alignment is a few bases longer, but still a substantial number of bases are clipped. However, STAR does not seem to attempt to search for a chimeric alignment using the clipped segment, even though it maps uniquely to the genome (according to Blat).

If I find some time tomorrow, I can do some more tests. In case you have the time it would be helpful if you could send me a pair of reads that gets lost for ROS1 fusion. When you run Arriba with -I, you will find the names of the supporting reads in the column read_identifiers. That way, it should be easy to figure out, which reads are lost due to the optimized alignment parameters.

Thanks a lot for bringing this to my attention. Perhaps I will need to reconsider if/how to change the default parameters of STAR.

@erprateek
Copy link
Author

I am running a master parameter sweep across the following STAR params to find the best set (including defaults) for a given sample with known fusions. Criteria I am evaluating on:

  • detection by arriba
  • level of evidence reported by arriba for each of the known fusions (compared to other methods)

--chimSegmentMin
--chimJunctionOverhangMin
--chimScoreMin
--chimScoreDropMax
--chimScoreJunctionNonGTAG
--chimScoreSeparation
--chimSegmentReadGapMax
--peOverlapNbasesMin
--alignSJDBoverhangMin
--chimNonchimScoreDropMin

This is probably going to take a couple of weeks before I summarize results

@suhrig
Copy link
Owner

suhrig commented Feb 4, 2020

Wow, this is very useful. Thanks a lot, very much appreciated!

In order to save some computation, you might want to limit the repeated realignment to only those reads that are either unmapped or that have clipped bases. It probably makes no sense to realign reads that map perfectly to the reference genome, because they would probably be mapped to the same locus regardless of the chimeric parameters. Just a thought.

@erprateek
Copy link
Author

Hi Sebastian,

Thanks for the recommendation. Yes I have limited the analysis to the unmapped/clipped bases reads. Just wanted to let you know that our cluster is sooooper busy so this would have to wait for atleast 2-3 months. Here is the set of reads that gets lost:

missed_reads.txt

@suhrig
Copy link
Owner

suhrig commented Feb 23, 2020

Hi Prateek,

Thanks for following up with the sequences. Can you also send me the full STAR command? When I run the latest version of STAR the reads are aligned, regardless of whether I use the old or the optimized parameters. Moreover, can you confirm that the reads are indeed not aligned with the optimized parameters? I want to rule out that Arriba simply ignores them for whatever reason.

So far I have only made moderate progress in investigating this issue. Here is what I have found out so far: Apparently, some reads are not aligned when the --peOverlapNbasesMin parameter is used. Strangely, when I merge the reads manually and align them in single-end mode, STAR succeeds. So possibly this is a bug in the peOverlapMerge code. I will try to get to the root of this.

Regards,
Sebastian

@erprateek
Copy link
Author

Yes thats what I have been seeing as well in the so-far completed parameter sweep for STAR. I am not moving forward with the --peOverlapNbasesMin parameter currently.

My current STAR command is:

STAR version=2.7.3a
STAR compilation time,server,dir=Fri Feb 14 15:52:55 PST 2020 
/apps/star/2.7.3a/source
##### Command Line:
/apps/star/2.7.3a/source/STAR --genomeDir ${genome_dir} --readFilesIn r1.fq r2.fq --runThreadN 8 --chimSegmentMin 20 --twopassMode Basic --twopass1readsN -1 --outFilterMatchNminOverLread 0.5 --outFilterScoreMinOverLread 0.5 --sjdbOverhang 149 --limitBAMsortRAM 40850000000 --peOverlapNbasesMin 10 --chimOutType WithinBAM --outSAMtype BAM Unsorted --quantMode GeneCounts --outReadsUnmapped Fastx --outBAMcompression 0 --outStd BAM_Unsorted

@erprateek
Copy link
Author

erprateek commented Feb 24, 2020

I have a separate sort step that sorts the BAMs and then I submit it to Arriba. This is due to the fact that we have 4 fastqs (2 R1 and 2 R2 from different lanes). 2 sets of STAR jobs are run followed by sort+merge and then submit to Arriba.

@erprateek
Copy link
Author

I spot checked a few of the reads from the set that I had sent in the star output Unmapped.out.mate* files, I cannot find these reads in those.

@erprateek
Copy link
Author

erprateek commented Feb 24, 2020

For completion, here is the set of values that I observe for unmapped reads with or without peOverlapNbasesMin:

pipeline_run_no_pe/log_final_out:                                  UNMAPPED READS:
pipeline_run_no_pe/log_final_out-  Number of reads unmapped: too many mismatches |	0
pipeline_run_no_pe/log_final_out-       % of reads unmapped: too many mismatches |	0.00%
pipeline_run_no_pe/log_final_out-            Number of reads unmapped: too short |	6272716
pipeline_run_no_pe/log_final_out-                 % of reads unmapped: too short |	6.86%
pipeline_run_no_pe/log_final_out-                Number of reads unmapped: other |	1217391
pipeline_run_no_pe/log_final_out-                     % of reads unmapped: other |	1.33%
--
pipeline_run_with_pe/log_final_out:                                  UNMAPPED READS:
pipeline_run_with_pe/log_final_out-  Number of reads unmapped: too many mismatches |	1522830
pipeline_run_with_pe/log_final_out-       % of reads unmapped: too many mismatches |	1.67%
pipeline_run_with_pe/log_final_out-            Number of reads unmapped: too short |	1965539
pipeline_run_with_pe/log_final_out-                 % of reads unmapped: too short |	2.15%
pipeline_run_with_pe/log_final_out-                Number of reads unmapped: other |	347804
pipeline_run_with_pe/log_final_out-                     % of reads unmapped: other |	0.38%

@suhrig
Copy link
Owner

suhrig commented Feb 28, 2020

Hi Prateek,

I cannot reproduce the alignment behavior with the example reads you sent to me. Even when I use the STAR parameters you stated, the reads always align the same way. But after close inspection of the fusion predictions of two of my own samples (one with 50nt read length, the other with 100nt), I now have a better overview of the negative effects of the parameters --peOverlapNbasesMin and --alignSplicedMateMapLminOverLmate. Overall, there is a net gain in alignments. The alignment statistics in your last comment prove this. However, there are some individual breakpoints that lose supporting reads. The reasons for this are diverse. As mentioned in my previous post, sometimes the reads are not aligned at all; other times, they are aligned differently. The most common pattern I have observed is this:

When the above mentioned parameters are adjusted as suggested by me in the beginning, then instead of clipping a read at the breakpoint and making a chimeric alignment, STAR prefers to extend the alignment for a couple of bases (3-10). The extended alignment usually contains one or two mismatches and is finally also clipped (when too many mismatches accumulate), but the clipped segment is shorter. I am not sure, why this shorter clipped segment is not used for a chimeric alignment. To my impression it was often long enough to be mappable.

There is no single parameter to blame for this, --peOverlapNbasesMin or --alignSplicedMateMapLminOverLmate. Sometimes it's one, sometimes it's the other, and I have even seen a case where both had to be modified to observe the effect. --alignSplicedMateMapLminOverLmate could be increased a bit (maybe to 0.55 or 0.6) to prevent the suboptimal alignments, but you'll lose the extra sensitivity gained from reducing it in the first place, of course. It's a trade-off. Playing with the parameter --peOverlapNbasesMin does not make much sense IMO. Either the mates overlap or not.

Probably, only the developer of STAR can explain this behavior and what to do about it. I'll collect some example reads and consult him.

@suhrig
Copy link
Owner

suhrig commented Jun 11, 2020

Hi Prateek,

I reported the loss of some chimeric alignments caused by --peOverlapNbasesMin and --alignSplicedMateMapLminOverLmate to the developer of STAR. The good new is: While playing around with STAR parameters, I noticed that much of this deterioration disappears when the new chimeric alignment detection is used (STAR has two internal algorithms for chimeric detection - and old one and a new one). The new one can be triggered by setting --chimMultimapNmax to a value > 1. The bad news is that this mode is not yet compatible compatible with Arriba. When the new mode is used, STAR does not report the chimeric alignments in BAM format. It only reports them in its proprietary Chimeric.out.junction file. However, I have submitted a code patch to the developer, which gives STAR the ability to report multimapping chimeric alignments in BAM format. Sooner or later the patch should find its way into the main code base of STAR. I am positive that much of the lost reads will be rescued, once this has been done.

Regards,
Sebastian

@suhrig
Copy link
Owner

suhrig commented Oct 10, 2020

Hi Prateek,

Arriba 2.0.0 has been released and the optimized alignment parameters (--alignSplicedMateMapLminOverLmate 0.5 and --peOverlapNbasesMin 10) are now enabled by default. In addition, the latest release of STAR, 2.7.6a, is now capable of using the new chimeric alignment algorithm (which supports multi-mapping chimeric reads) in conjunction with output in BAM format. Although, I have noticed that in general the uniquely mapping chimeric reads are often fewer than with the old algorithm, the overall quality seems to have improved, such that in sum, there are usually more supporting reads for the true fusion events. I doubt there is much more that could be done about increasing the supporting read counts at this point. For these reasons, I will close the issue for now.

Regards,
Sebastian

@suhrig suhrig closed this as completed Oct 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants