deepvariant calling too many variants? #209
Comments
Hi @aderzelle Something seems strange in these files. DeepVariant reports the allele depth (AD) at these positions as [600819]: 36,15; [600831]: 36,15; [600834]: 35,15, while the GATK output reports [600819]: 49,0; [600831]: 49,0; [600834]: 49,0. Those are quite different reports about the underlying content of the reads in the region. DeepVariant re-assembles the region, so the reassembler may be identifying a different haplotype. To be conclusive about these differences, we'd need to see the BAM, and we'd definitely be interested in taking a look if you don't mind sharing.
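To make the size of the disagreement concrete, here is a minimal Python sketch that turns the AD values quoted above into alternate-allele fractions. The helper name `alt_fraction` is hypothetical, and the sketch assumes biallelic `ref,alt` AD strings; it is not part of either caller.

```python
def alt_fraction(ad):
    """Fraction of reads supporting the alternate allele.

    `ad` is a VCF-style allele depth string "ref,alt" (biallelic sites only).
    """
    ref, alt = (int(x) for x in ad.replace(" ", "").split(","))
    total = ref + alt
    return alt / total if total else 0.0


# AD values as quoted in the comment above, keyed by position.
deepvariant_ad = {600819: "36,15", 600831: "36,15", 600834: "35,15"}
gatk_ad = {600819: "49,0", 600831: "49,0", 600834: "49,0"}

for pos in sorted(deepvariant_ad):
    dv = alt_fraction(deepvariant_ad[pos])
    gatk = alt_fraction(gatk_ad[pos])
    print(f"{pos}: DeepVariant alt fraction {dv:.2f}, GATK alt fraction {gatk:.2f}")
```

Roughly 30% of the reads support the alternate allele in the DeepVariant view, against none in the GATK view, which is why the two callers must be seeing different realignments of the same reads.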
Hello,
@AndrewCarroll You can get a sample BAM from the initial discussion thread on the rtg-users group: https://groups.google.com/a/realtimegenomics.com/forum/#!topic/rtg-users/U0UQnR2LRtw I just ran DeepVariant 0.8.0 on the sample BAM from there and reproduced the results that @aderzelle reported.
Hi @Lenbok Thank you for the note. With the links from @aderzelle, I was able to pull in the file and visualize this event. I think what is happening is that the variants can be represented in an internally consistent way at two different sets of positions, and that DeepVariant's reassembly is generating both sets of candidates. The neural net always sees reads reassembled in the context of the particular position being evaluated, so each candidate appears to have supporting evidence when inspected relative to the reference. We've had some internal discussions about how to improve candidate haplotype assignment for reads, but it will likely take some time to implement, test, and release. Thank you for highlighting this issue.
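As a toy illustration of how one underlying change can be written at two different sets of positions, the Python sketch below applies two differently placed deletions to a short repetitive reference and gets the same haplotype both times. The reference sequence, the variants, and the helper `apply_variants` are all invented for illustration; they are not the actual event at 600819, 600831, and 600834.

```python
def apply_variants(reference, variants):
    """Apply (pos, ref, alt) variants (0-based pos) to a reference string.

    Variants are applied right to left so earlier coordinates stay valid.
    """
    seq = reference
    for pos, ref, alt in sorted(variants, reverse=True):
        assert seq[pos:pos + len(ref)] == ref, "REF allele does not match reference"
        seq = seq[:pos] + alt + seq[pos + len(ref):]
    return seq


# Invented reference containing a short CA repeat.
reference = "TTCACACAGG"

# Two representations of "delete one CA unit", anchored at different positions.
left_repr = [(1, "TCA", "T")]
right_repr = [(5, "ACA", "A")]

print(apply_variants(reference, left_repr))   # TTCACAGG
print(apply_variants(reference, right_repr))  # TTCACAGG
```

Each representation is consistent with the reads on its own. Reporting both at once, as if they were independent events, is the kind of double counting described above.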
Hi @aderzelle, we're continuing to look into this issue. I'm leaving this open for now, and will give you an update later.
Hi @pichuan
Hi @aderzelle Today we released DeepVariant v0.9, which contains several changes to the code and the trained models. As part of this release, we introduced changes that fix the issue for the BAM snippets presented, and which we think will generally fix the issue you observed in other cases. To briefly summarize what we believe to be the cause: during candidate generation, a de Bruijn graph of variant and reference haplotypes is constructed. In rare cases, graph paths are created in which every local connection is valid but no individual read supports the entire path. In your case, this caused two similar representations to generate candidates at different positions, each of which could be locally supported. In our fix, we require at least some read support for the path through the graph that makes up a candidate haplotype. We also found a separate change that resolves your case: it was sensitive to the k-mer length used to construct the graph. The default is 10, and increasing it to 15 also resolved your issue. We think this may reflect local repetitiveness. We have exposed this parameter in make_examples as --dbg_min_k. It is available when running make_examples directly, but not in the Docker image. Since the issue should be resolved in v0.9 without this change, this is mostly for your information in case you want to experiment with other tweaks. We would be interested to hear your feedback confirming that this case is resolved in v0.9. Thank you,
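The failure mode described above can be reproduced on a toy example. The Python sketch below is an assumption-laden illustration, not DeepVariant's implementation: the two haplotype reads are invented, the helper names are hypothetical, and k = 4 and k = 7 stand in for the real values of 10 and 15. The two haplotypes differ at two sites separated by a shared stretch of 6 bases, so a small k lets paths switch haplotypes inside that stretch.

```python
from collections import defaultdict


def build_graph(reads, k):
    """De Bruijn graph: nodes are k-mers, edges join consecutive k-mers in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph


def enumerate_paths(graph, source, sink):
    """All source-to-sink sequences; assumes the toy graph has no cycles."""
    paths, stack = [], [(source, source)]
    while stack:
        node, seq = stack.pop()
        if node == sink:
            paths.append(seq)
            continue
        for nxt in graph.get(node, ()):
            stack.append((nxt, seq + nxt[-1]))
    return sorted(paths)


def split_by_support(reads, k):
    """Return (supported, unsupported) paths; supported means some read contains it."""
    graph = build_graph(reads, k)
    paths = enumerate_paths(graph, reads[0][:k], reads[0][-k:])
    supported = [p for p in paths if any(p in r for r in reads)]
    unsupported = [p for p in paths if p not in supported]
    return supported, unsupported


# Two invented haplotypes sharing both flanks and a 6-base middle segment.
reads = [
    "GATTACAGACCGTAGCTTGAACAT",  # haplotype A: A at site 1, C at site 2
    "GATTACAGTCCGTAGGTTGAACAT",  # haplotype B: T at site 1, G at site 2
]

for k in (4, 7):
    supported, unsupported = split_by_support(reads, k)
    print(f"k={k}: {len(supported)} supported, {len(unsupported)} unsupported paths")
```

With k = 4 the graph contains two extra paths that mix the alleles of haplotypes A and B, even though every edge on them comes from a real read. Filtering on read support removes them, and so does raising k above the length of the shared segment, which mirrors the two remedies described in the comment.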
Thanks a lot! I will certainly let you know whether the fix solves this kind of issue. In the meantime, we have sequenced more samples and are now refining our assembly, so I am not planning to rerun DeepVariant in the coming weeks, but it will be done within a month. Thanks again for looking into this issue. I will let you know if I come across other strange cases.
Hello, it seems DeepVariant is "overcalling" in some regions.
I was using RTG vcfeval to compare different VCFs (one VCF = one population) and ended up with a strange case in which some variants were both true and false positives. I first thought it was a bug in vcfeval, but after I submitted the case to the RTG guys, they ran a little analysis that concluded that
I then compared with the calls made by GATK and Octopus for that location.
Here is what Octopus reports, so nothing for the sites 19, 31, 34
And here is what GATK 4 reports, explicitly reporting no het site for 19, 31, 34
However, DeepVariant gives a "PASS" for those variants.
I can send you the BAM and VCF files if you wish.
Thank you