Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

deepvariant calling too many variants? #209

Closed
ghost opened this issue Aug 22, 2019 · 8 comments
Closed

deepvariant calling too many variants? #209

ghost opened this issue Aug 22, 2019 · 8 comments
Assignees

Comments

@ghost
Copy link

ghost commented Aug 22, 2019

Hello, it seems deepvariant is producing "overcalling" in some regions.
I was using RTG vcfeval to compare different vcf (one vcf = one population )and ended up with a strange case with some variants being both true and false positive. I first thought it was a bug of vcfeval but after submitting the case to the RTG guys, they ran a little analysis that concluded that

$ rtg extract vcfeval.out/output.vcf.gz
Chrom_2 6000819 . G GACA . . SYNC=6000819;CALL_WEIGHT=0.667;CALL=TP GT . 0/1
Chrom_2 6000820 . A . . . CALL=IGN GT . 0/0
Chrom_2 6000828 . G A . . SYNC=6000819,6000828;BASE=TP;CALL=FP;CALL_ALTERNATE GT 0/1 0/1
Chrom_2 6000829 . A . . . BASE=IGN;CALL=IGN GT 0/0 0/0
Chrom_2 6000831 . GA G . . SYNC=6000819;CALL_WEIGHT=0.667;CALL=TP GT . 0/1
Chrom_2 6000833 . C . . . CALL=IGN GT . 0/0
Chrom_2 6000834 . GCC G . . SYNC=6000819;CALL_WEIGHT=0.667;CALL=TP GT . 0/1
Chrom_2 6000836 . C G . . SYNC=6000819,6000836;BASE=TP;CALL=FP;CALL_ALTERNATE GT 0/1 0/1
Chrom_2 6000837 . A . . . BASE=IGN;CALL=IGN GT 0/0 0/0

The --XX debugging options I used let us verify that the resulting baseline and call haplotypes are identical within the region, and enable an experimental annotation to flag variants that were matched on less optimal solutions. The variants at 6000828 and 6000836 are CALL=FP, because vcfeval was able to instead match three call variants at 6000819, 6000831, and 6000834. It's like the call set in this region is the union of two ways to represent the baseline haplotype.

I thereafter compared with called made by GATK and Octopus for that location
Here is what Octopus reports, so nothing for the sites 19, 31, 34

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	chaos
Chrom_2	6000828	.	G	A	1609.23	PASS	AC=1;AN=2;DP=51;MQ=60;MQ0=0;NS=1;RFQUAL_ALL=26.99	GT:GQ:DP:MQ:PS:PQ:RFQUAL:FT	1|0:1609:51:60:6000828:100:26.99:PASS
Chrom_2	6000836	.	C	G	1609.23	PASS	AC=1;AN=2;DP=53;MQ=60;MQ0=0;NS=1;RFQUAL_ALL=30.51	GT:GQ:DP:MQ:PS:PQ:RFQUAL:FT	1|0:1609:53:60:6000828:100:30.51:PASS

And here is what GATK 4 reports, explicitly reporting no het site for 19, 31, 34

Chrom_2	6000811	.	C	.	156.54	.	DP=49	GT:AD:DP:RGQ	0/0:49,0:49:118
Chrom_2	6000812	.	T	.	156.54	.	DP=49	GT:AD:DP:RGQ	0/0:49,0:49:118
Chrom_2	6000813	.	T	.	156.54	.	DP=49	GT:AD:DP:RGQ	0/0:49,0:49:118
Chrom_2	6000814	.	T	.	156.54	.	DP=49	GT:AD:DP:RGQ	0/0:49,0:49:118
Chrom_2	6000815	.	A	.	156.54	.	DP=49	GT:AD:DP:RGQ	0/0:49,0:49:118
Chrom_2	6000816	.	G	.	156.54	.	DP=49	GT:AD:DP:RGQ	0/0:49,0:49:118
Chrom_2	6000817	.	C	.	156.54	.	DP=49	GT:AD:DP:RGQ	0/0:49,0:49:118
Chrom_2	6000818	.	C	.	156.54	.	DP=49	GT:AD:DP:RGQ	0/0:49,0:49:118
**Chrom_2	6000819**	.	G	.	156.54	.	DP=49	GT:AD:DP:RGQ	0/0:49,0:49:118
Chrom_2	6000820	.	A	.	156.54	.	DP=49	GT:AD:DP:RGQ	0/0:49,0:49:118
Chrom_2	6000821	.	C	.	156.54	.	DP=49	GT:AD:DP:RGQ	0/0:49,0:49:118
Chrom_2	6000822	.	A	.	156.54	.	DP=49	GT:AD:DP:RGQ	0/0:49,0:49:118
Chrom_2	6000823	.	A	.	156.54	.	DP=49	GT:AD:DP:RGQ	0/0:49,0:49:118
Chrom_2	6000824	.	C	.	156.54	.	DP=49	GT:AD:DP:RGQ	0/0:49,0:49:118
Chrom_2	6000825	.	A	.	156.54	.	DP=49	GT:AD:DP:RGQ	0/0:49,0:49:118
Chrom_2	6000826	.	A	.	156.54	.	DP=49	GT:AD:DP:RGQ	0/0:49,0:49:118
Chrom_2	6000827	.	C	.	156.54	.	DP=49	GT:AD:DP:RGQ	0/0:49,0:49:118
**Chrom_2	6000828**	.	G	A	953.6	.	AC=1;AF=0.5;AN=2;BaseQRankSum=-0.489;DP=51;ExcessHet=3.0103;FS=0;MLEAC=1;MLEAF=0.5;MQ=60;MQRankSum=0;QD=19.46;ReadPosRankSum=0.24;SOR=0.616	GT:AD:DP:GQ:PL	0/1:24,25:49:99:961,0,1214
Chrom_2	6000829	.	A	.	156.54	.	DP=49	GT:AD:DP:RGQ	0/0:49,0:49:120
Chrom_2	6000830	.	C	.	156.54	.	DP=49	GT:AD:DP:RGQ	0/0:49,0:49:120
**Chrom_2	6000831**	.	G	.	156.54	.	DP=49	GT:AD:DP:RGQ	0/0:49,0:49:120
Chrom_2	6000832	.	A	.	156.54	.	DP=49	GT:AD:DP:RGQ	0/0:49,0:49:120
Chrom_2	6000833	.	C	.	156.54	.	DP=49	GT:AD:DP:RGQ	0/0:49,0:49:120
**Chrom_2	6000834**	.	G	.	156.54	.	DP=49	GT:AD:DP:RGQ	0/0:49,0:49:120
Chrom_2	6000835	.	C	.	156.54	.	DP=49	GT:AD:DP:RGQ	0/0:49,0:49:120
**Chrom_2	6000836**	.	C	G	1019.6	.	AC=1;AF=0.5;AN=2;BaseQRankSum=-1.192;DP=53;ExcessHet=3.0103;FS=0;MLEAC=1;MLEAF=0.5;MQ=60;MQRankSum=0;QD=20.39;ReadPosRankSum=0.029;SOR=0.555	GT:AD:DP:GQ:PL	0/1:25,25:50:99:1027,0,1186
Chrom_2	6000837	.	A	.	156.54	.	DP=43	GT:AD:DP:RGQ	0/0:43,0:43:120
Chrom_2	6000838	.	T	.	156.54	.	DP=43	GT:AD:DP:RGQ	0/0:43,0:43:120
Chrom_2	6000839	.	G	.	156.54	.	DP=43	GT:AD:DP:RGQ	0/0:43,0:43:120
Chrom_2	6000840	.	T	.	156.54	.	DP=43	GT:AD:DP:RGQ	0/0:43,0:43:120
Chrom_2	6000841	.	T	.	156.54	.	DP=43	GT:AD:DP:RGQ	0/0:43,0:43:120
Chrom_2	6000842	.	A	.	156.54	.	DP=43	GT:AD:DP:RGQ	0/0:43,0:43:120
Chrom_2	6000843	.	C	.	156.54	.	DP=43	GT:AD:DP:RGQ	0/0:43,0:43:120
Chrom_2	6000844	.	G	.	156.54	.	DP=43	GT:AD:DP:RGQ	0/0:43,0:43:120
Chrom_2	6000845	.	G	.	156.54	.	DP=43	GT:AD:DP:RGQ	0/0:43,0:43:120
Chrom_2	6000846	.	T	.	156.54	.	DP=43	GT:AD:DP:RGQ	0/0:43,0:43:120
Chrom_2	6000847	.	A	.	156.54	.	DP=43	GT:AD:DP:RGQ	0/0:43,0:43:120
Chrom_2	6000848	.	A	.	156.54	.	DP=43	GT:AD:DP:RGQ	0/0:43,0:43:120
Chrom_2	6000849	.	G	.	156.54	.	DP=43	GT:AD:DP:RGQ	0/0:43,0:43:120
Chrom_2	6000850	.	T	.	156.54	.	DP=43	GT:AD:DP:RGQ	0/0:43,0:43:120
Chrom_2	6000851	.	C	.	156.54	.	DP=43	GT:AD:DP:RGQ	0/0:43,0:43:120

However, deepvariant gives a "PASS" for those variants.

Chrom_2	6000455	.	T	C,<*>	51.4	PASS	.	GT:GQ:DP:AD:VAF:PL	0/1:51:77:31,46,0:0.597403,0:51,0,69,990,990,990
Chrom_2	6000456	.	C	<*>	0	.	END=6000818	GT:GQ:MIN_DP:PL	0/0:48:49:0,204,2279
**Chrom_2	6000819**	.	G	GACA,<*>	13.3	PASS	.	GT:GQ:DP:AD:VAF:PL	0/1:13:52:36,15,0:0.288462,0:13,0,50,990,990,990
Chrom_2	6000820	.	A	<*>	0	.	END=6000827	GT:GQ:MIN_DP:PL	0/0:50:49:0,147,1469
**Chrom_2	6000828**	.	G	A,<*>	4.1	PASS	.	GT:GQ:DP:AD:VAF:PL	0/1:4:50:39,11,0:0.22,0:1,0,36,990,990,990
Chrom_2	6000829	.	A	<*>	0	.	END=6000830	GT:GQ:MIN_DP:PL	0/0:50:50:0,150,1499
**Chrom_2	6000831**	.	GA	G,<*>	15.1	PASS	.	GT:GQ:DP:AD:VAF:PL	0/1:15:51:36,15,0:0.294118,0:14,0,45,990,990,990
Chrom_2	6000833	.	C	<*>	0	.	END=6000833	GT:GQ:MIN_DP:PL	0/0:50:36:0,108,1079
**Chrom_2	6000834**	.	GCC	G,<*>	9.9	PASS	.	GT:GQ:DP:AD:VAF:PL	0/1:10:51:35,15,0:0.294118,0:9,0,39,990,990,990
**Chrom_2	6000836**	.	C	G,<*>	5.8	PASS	.	GT:GQ:DP:AD:VAF:PL	0/1:6:37:25,12,0:0.324324,0:4,0,39,990,990,990
Chrom_2	6000837	.	A	<*>	0	.	END=6000897	GT:GQ:MIN_DP:PL	0/0:50:43:0,156,1559
Chrom_2	6000898	.	C	T,<*>	44.1	PASS	.	GT:GQ:DP:AD:VAF:PL	0/1:44:50:31,19,0:0.38,0:44,0,150,990,990,990
Chrom_2	6000899	.	T	<*>	0	.	END=6001000	GT:GQ:MIN_DP:PL	0/0:50:45:0,147,1469

I can send you the bam and vcf file if you wish so.
Thank you

@AndrewCarroll
Copy link
Collaborator

Hi @aderzelle

Something seems strange in these files. In the DeepVariant output, DeepVariant reports the Allele Depth at these positions as [600819]: 36,15 ; [600831]: 36,15 ; [600834]: 35,15

While the GATK output reports the positions as: [600819]: 49,0 ; [600831]: 49, 0 ; [600834]: 49,0

Those are quite different reports about the underlying content of the reads in the region.. DeepVariant conducts a re-assembly of the region, so it may be the case that the reassembler identifies a different haplotype. I think to be conclusive about these differences, we'd need to see the BAM. But we'd definitely be interested to take a look at this if you don't mind sharing.

@ghost
Copy link
Author

ghost commented Aug 23, 2019

Hello,
it's ok for me to share, where should I send you the download link?

@Lenbok
Copy link

Lenbok commented Aug 28, 2019

@AndrewCarroll You can get a sample bam from the initial discussion thread on the rtg-users group: https://groups.google.com/a/realtimegenomics.com/forum/#!topic/rtg-users/U0UQnR2LRtw

I just ran deepvariant 0.8.0 on the sample bam from there and replicated the results that @aderzelle reported.

image

@AndrewCarroll
Copy link
Collaborator

Hi @Lenbok

Thank you for the note, with the links from @aderzelle, I was able to pull in the file and visualize this event.

I think what is happening is that there are variants that can be represented in an internally consistent way at two different sets of positions. I think that DeepVariant reassembly is generating these two sets of candidates. The neural net always sees positions reassembled in the context of that particular position, so there looks to be evidence for support for each when inspected relative to the reference.

We've had some internal discussions about how to improve candidate haplotype assignment for reads, but it will likely take some time to implement, test, and release.

Thank you for highlighting this issue.

@pichuan
Copy link
Collaborator

pichuan commented Oct 30, 2019

Hi @aderzelle , we're continuing to look into this issue. I'm leaving this open for now, and will give you an update later.

@pichuan pichuan assigned pichuan and akolesnikov and unassigned pichuan Oct 30, 2019
@ghost
Copy link
Author

ghost commented Nov 2, 2019

Hi @pichuan
thanks a lot for the update. No worries ... in the meanwhile I am working on improving the genome assembly, it's a very complex one (in short, I haven't moved on forgetting about deepvariants ;) ).

@AndrewCarroll
Copy link
Collaborator

AndrewCarroll commented Nov 13, 2019

Hi @aderzelle

Today we released DeepVariant v0.9, which contains several changes to code and training models. As part of this release, we have introduced changes which fix the issue for the BAM snippets presented, and which we think will generally fix the issue that you observed in other cases.

To briefly summarize what we believe to be the cause - in candidate generation, a de Bruijn graph of variant and reference haplotypes is constructed. In rare cases, some graph paths are created in which local connections are valid, but no individual read supports the entire path. In your case, this caused two similar representations to generate candidates at different positions, each of which could be locally supported.

In our fix, we require at least some support for the constructed graph of the candidate haplotype.

We also noticed a separate fix that resolves your case. Specifically, your case was sensitive to the kmer length used to construct the graph. By default, this is 10, but we noticed that increasing to 15 also resolved your issue. We think this may reflect local repetitiveness. We have exposed this parameter in make_examples as: --dbg_min_k

This is available when running make_examples directly, but not in the Docker image. Since the issue should be resolved in v0.9 without this change, this is mostly for your information if you want to experiment with other tweaks.

We would be interested to hear your feedback confirming this case is resolved in v0.9.

Thank you,
Andrew

@ghost
Copy link
Author

ghost commented Nov 14, 2019

Thanks a lot!
I am not surprised by your explanation, the organism I am working with is notoriously known to be a nightmare to assemble. Therefore I am not surprised at its "little" scale it fooled Deepvariant.

I will certainly let you know if the fix solved that kind of issues. In the meanwhile, we have sequenced more samples and are now refining our assembly. Therefore I am not planning to relaunch deepvariant in the coming weeks but it wil be done within a month.

thanks again for having looked into this issue and I will let you know if I come across other strange cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants