Duplicate entries in vcf file #10

Open
Zoeyoungxy opened this issue Dec 20, 2024 · 3 comments
Comments

@Zoeyoungxy

I’m encountering an issue while using sawfish for joint calling.
In the generated VCF file, some loci appear as multiple records with identical CHROM, POS, REF, and ALT values. This differs from the typical multiallelic case, where the records at one position carry different ALT alleles.
Here is an example:

chr1 122014316 sawfish:106:1718:0:0 TTTGTAATGTCTGCAAGTGGATATTCAGACCTCTTTGAGGCCTTCGTTGGAAAAGGGATTTCTTCATATTATGCTAGACAGAATAATTCTCAGTAACTTCCTTGTGTTGTGTGTATTCAACTCACAGAGTTGAACGATCCTTTACAGAGAGCAGACTTGAAACACTCTTTTTGTGGAATTTGCAAGTGGAGATTTCAGCCGCTTTGAGGTCAATGGTACAATAGGAAATATCTTCCTATAGAAAATAGACAGAATGATTCTCATAAACTCCTTTGTGATGTGTGCGTTCAACTCACAGAGTTTAACCTTTCTTTTCATAGAGCAGTTAGGAAACACTTTGC T 999 PASS SVTYPE=DEL;END=122014656;SVLEN=-340;HOMLEN=2;HOMSEQ=TT GT:GQ:PL:AD:PS 0/1:32:32,0,232:5,1:. 1/1:12:200,12,0:0,4:. ./.:.:0,0,0:0,0:. ./.:.:0,0,0:0,0:. 0/1:2:702,0,2:1,15:.
chr1 122014316 sawfish:60:1727:1:0 TTTGTAATGTCTGCAAGTGGATATTCAGACCTCTTTGAGGCCTTCGTTGGAAAAGGGATTTCTTCATATTATGCTAGACAGAATAATTCTCAGTAACTTCCTTGTGTTGTGTGTATTCAACTCACAGAGTTGAACGATCCTTTACAGAGAGCAGACTTGAAACACTCTTTTTGTGGAATTTGCAAGTGGAGATTTCAGCCGCTTTGAGGTCAATGGTACAATAGGAAATATCTTCCTATAGAAAATAGACAGAATGATTCTCATAAACTCCTTTGTGATGTGTGCGTTCAACTCACAGAGTTTAACCTTTCTTTTCATAGAGCAGTTAGGAAACACTTTGC T 72 PASS SVTYPE=DEL;END=122014656;SVLEN=-340;HOMLEN=6;HOMSEQ=TTGTAA GT:GQ:PL:AD:PS 0/0:15:0,15,250:5,0:. ./.:.:0,0,0:0,0:. ./.:.:0,0,0:0,0:. ./.:.:0,0,0:0,0:. 0/0:6:0,6,100:2,0:.

In this case, both entries describe a DEL with identical POS, END, and SVLEN (-340), but they differ in QUAL (999 vs. 72), in HOMLEN/HOMSEQ, and in the GT results for individual samples.
What could cause such duplicate entries in the VCF file, and should I filter out one of them? If so, what criteria should I use to decide which record to keep?

Best wishes
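
For readers who want to check how often this happens in their own output, the following is a minimal sketch that lists groups of records sharing the same variant description. It is not part of sawfish. The function names `sv_key` and `find_duplicate_records` are hypothetical, and the choice of key (CHROM, POS, REF, ALT plus the SVTYPE, END, and SVLEN INFO fields) is an assumption based on the example above. It uses only the Python standard library and expects uncompressed, tab-delimited VCF lines.

```python
from collections import defaultdict


def sv_key(fields):
    """Build a key describing one SV allele (assumed definition of 'duplicate')."""
    # Parse key=value INFO entries; flag-style entries without '=' are skipped.
    info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
    return (
        fields[0],            # CHROM
        fields[1],            # POS
        fields[3],            # REF
        fields[4],            # ALT
        info.get("SVTYPE"),
        info.get("END"),
        info.get("SVLEN"),
    )


def find_duplicate_records(vcf_lines):
    """Return {key: [record IDs]} for every key seen in more than one record."""
    groups = defaultdict(list)
    for line in vcf_lines:
        # Skip blank lines and header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        groups[sv_key(fields)].append(fields[2])  # column 3 is the record ID
    return {key: ids for key, ids in groups.items() if len(ids) > 1}
```

Passing an open file handle (for example `find_duplicate_records(open("joint_call.vcf"))`, where the file name is a placeholder) returns one entry per duplicated variant, with the IDs of the records that share it.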

@ctsa
Member

ctsa commented Dec 20, 2024

Thanks for checking on this example, @Zoeyoungxy.

These cases typically occur when the same deletion is found within two (or more) different contextual haplotypes. Sometimes these differences are biologically interesting, but right now they are often the result of sequencing and assembly noise. This is another aspect of scaling joint genotyping to larger cohorts (besides runtime) that will need more optimization in the future, since the rate of this phenomenon tends to increase with sample count. The filtration decision will depend on the downstream application. We'll be adding more outputs soon that make it easier to match the full assembly contig to each VCF entry, which should help in understanding these cases.

@Zoeyoungxy
Author

Thank you for your help. I look forward to the sawfish updates. For the example above, would it be reasonable to keep only the record with the higher QUAL value for now?

@ctsa
Member

ctsa commented Dec 24, 2024

Retaining the record with the higher QUAL sounds like a reasonable strategy. Ideally you might want to merge the GT values across these records first, but that would be more complicated to implement.

I'll keep this ticket open to note that we can use an optional post-processing script to marginalize over all (apparent) haplotypes with the same SV.
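
The simpler strategy discussed above, keeping only the record with the higher QUAL, can be sketched as follows. This is an illustrative stand-in, not the post-processing script mentioned in the ticket, and it does not merge GT values. The function names `record_key`, `parse_qual`, and `keep_highest_qual` are hypothetical, and treating records as duplicates when CHROM, POS, REF, ALT, SVTYPE, END, and SVLEN all match is an assumption drawn from the example in this thread.

```python
def record_key(fields):
    """Assumed duplicate key: CHROM, POS, REF, ALT plus SVTYPE/END/SVLEN."""
    # Flag-style INFO entries without '=' are ignored.
    info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
    return (fields[0], fields[1], fields[3], fields[4],
            info.get("SVTYPE"), info.get("END"), info.get("SVLEN"))


def parse_qual(value):
    """Convert the QUAL column to a float; a missing value ('.') sorts lowest."""
    try:
        return float(value)
    except ValueError:
        return float("-inf")


def keep_highest_qual(vcf_lines):
    """Return VCF lines with one record per key, keeping the highest QUAL.

    Header lines pass through unchanged. The surviving record takes the
    position of the first record seen for its key, so sort order is preserved.
    On a QUAL tie the first record seen is kept.
    """
    output = []   # lines to emit, in order
    best = {}     # key -> (index into output, QUAL of the record stored there)
    for line in vcf_lines:
        if not line.strip():
            continue
        if line.startswith("#"):
            output.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        key = record_key(fields)
        qual = parse_qual(fields[5])
        if key not in best:
            best[key] = (len(output), qual)
            output.append(line)
        elif qual > best[key][1]:
            index = best[key][0]
            output[index] = line
            best[key] = (index, qual)
    return output
```

Applied to the two example records in this thread, the sketch would keep `sawfish:106:1718:0:0` (QUAL 999) and drop `sawfish:60:1727:1:0` (QUAL 72). Whether discarding the second record's genotypes is acceptable depends on the downstream application, as noted above.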
