Duplicate entries in vcf file #10
Comments
Thanks for checking on this example, @Zoeyoungxy. These cases typically occur when the same deletion is found within 2 (or N) different contextual haplotypes. Sometimes these differences are biologically interesting, but right now they often result from sequencing and assembly noise as well. This is another aspect of scaling joint genotyping to larger cohorts (besides runtime) that will need more optimization in the future, since the rate of this phenomenon tends to increase with sample count. The filtration decision will depend on the downstream application. We'll be adding more outputs soon that make it easier to match the full assembly contig to each VCF entry, which should help in understanding these cases.
Thank you for your help, and I look forward to the updates for sawfish. Regarding the example above, may I retain the row with the higher QUAL value for now?
Retaining the record with higher QUAL sounds like a reasonable strategy. Ideally you might want to merge the GT values across these records first, but that would be more complicated to implement. I'll keep this ticket open to note that we can use an optional post-processing script to marginalize over all (apparent) haplotypes with the same SV.
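As a rough illustration of the "keep the higher QUAL record" strategy, here is a minimal Python sketch. It is not part of sawfish: `dedupe_by_qual` is a hypothetical helper, and it assumes an uncompressed, tab-delimited VCF in which duplicates are records sharing the same CHROM, POS, REF, and ALT. It does not merge GT values across the duplicate records.

```python
def dedupe_by_qual(vcf_lines):
    """Keep one record per (CHROM, POS, REF, ALT): the one with the highest QUAL.

    Assumption: lines are tab-delimited VCF text. Header lines are passed
    through unchanged, and records keep the position of their first occurrence.
    """
    header = []
    best = {}  # key -> (qual, line); dict insertion order is preserved
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            header.append(line)
            continue
        fields = line.split("\t")
        chrom, pos, _id, ref, alt, qual = fields[:6]
        key = (chrom, pos, ref, alt)
        # A missing QUAL (".") ranks below any numeric QUAL.
        q = float(qual) if qual != "." else float("-inf")
        if key not in best or q > best[key][0]:
            best[key] = (q, line)
    return header + [line for _, line in best.values()]
```

Usage would look like `open("out.vcf", "w").write("\n".join(dedupe_by_qual(open("in.vcf"))) + "\n")`, where both file names are placeholders. Ties in QUAL keep the first record seen.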
I’m encountering an issue while using Sawfish for joint calling.
In the VCF file generated, I noticed some loci appear as completely identical entries, differing from the typical multiallelic case.
Here is an example:
chr1 122014316 sawfish:106:1718:0:0 TTTGTAATGTCTGCAAGTGGATATTCAGACCTCTTTGAGGCCTTCGTTGGAAAAGGGATTTCTTCATATTATGCTAGACAGAATAATTCTCAGTAACTTCCTTGTGTTGTGTGTATTCAACTCACAGAGTTGAACGATCCTTTACAGAGAGCAGACTTGAAACACTCTTTTTGTGGAATTTGCAAGTGGAGATTTCAGCCGCTTTGAGGTCAATGGTACAATAGGAAATATCTTCCTATAGAAAATAGACAGAATGATTCTCATAAACTCCTTTGTGATGTGTGCGTTCAACTCACAGAGTTTAACCTTTCTTTTCATAGAGCAGTTAGGAAACACTTTGC T 999 PASS SVTYPE=DEL;END=122014656;SVLEN=-340;HOMLEN=2;HOMSEQ=TT GT:GQ:PL:AD:PS 0/1:32:32,0,232:5,1:. 1/1:12:200,12,0:0,4:. ./.:.:0,0,0:0,0:. ./.:.:0,0,0:0,0:. 0/1:2:702,0,2:1,15:.
chr1 122014316 sawfish:60:1727:1:0 TTTGTAATGTCTGCAAGTGGATATTCAGACCTCTTTGAGGCCTTCGTTGGAAAAGGGATTTCTTCATATTATGCTAGACAGAATAATTCTCAGTAACTTCCTTGTGTTGTGTGTATTCAACTCACAGAGTTGAACGATCCTTTACAGAGAGCAGACTTGAAACACTCTTTTTGTGGAATTTGCAAGTGGAGATTTCAGCCGCTTTGAGGTCAATGGTACAATAGGAAATATCTTCCTATAGAAAATAGACAGAATGATTCTCATAAACTCCTTTGTGATGTGTGCGTTCAACTCACAGAGTTTAACCTTTCTTTTCATAGAGCAGTTAGGAAACACTTTGC T 72 PASS SVTYPE=DEL;END=122014656;SVLEN=-340;HOMLEN=6;HOMSEQ=TTGTAA GT:GQ:PL:AD:PS 0/0:15:0,15,250:5,0:. ./.:.:0,0,0:0,0:. ./.:.:0,0,0:0,0:. ./.:.:0,0,0:0,0:. 0/0:6:0,6,100:2,0:.
In this case, both entries describe a DEL with the same SVLEN (-340) and identical POS, but the GT results for individual samples differ.
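To see how often this happens in a file, a short sketch like the following can list such record groups and the samples whose GT differs between them. The function names are hypothetical, and the code assumes tab-delimited VCF text with a FORMAT column that contains GT.

```python
from collections import defaultdict


def find_duplicate_groups(vcf_lines):
    """Group records by (CHROM, POS, REF, ALT) and return groups with >1 record."""
    groups = defaultdict(list)
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        key = (fields[0], fields[1], fields[3], fields[4])
        groups[key].append(fields)
    return {k: v for k, v in groups.items() if len(v) > 1}


def genotypes(fields):
    """Return the GT string of every sample, located via the FORMAT column."""
    gt_idx = fields[8].split(":").index("GT")
    return [sample.split(":")[gt_idx] for sample in fields[9:]]


def differing_samples(records):
    """Return 0-based sample indices whose GT is not the same in all records."""
    per_record = [genotypes(r) for r in records]
    return [i for i, gts in enumerate(zip(*per_record)) if len(set(gts)) > 1]
```

For each group returned by `find_duplicate_groups`, `differing_samples(group)` shows which sample columns disagree, which may help decide whether a simple QUAL-based filter is acceptable for a given downstream application.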
What could be the cause of such duplicate entries in the VCF file, and should I filter out one of them? If so, what criteria should I use to decide which row to keep?
Best wishes