Small model for DeepVariant

DeepVariant performs variant calling by way of image classification. Sequenced reads are read from .bam files, marshalled into pileup images that encode a variety of read and variant information and written out as image-like tensors. These images are fed to a convolutional neural network (CNN) for classification.

This approach achieves state-of-the-art accuracy. In an effort to speed up DeepVariant, version 1.8.0 introduces a second, light-weight multilayer perceptron (MLP) that provides an initial genotype classification for all biallelic variants before pileup images are generated.

When this small model is sufficiently confident in its classification, i.e. the phred-scaled class probability (GQ) exceeds a threshold supplied as an input parameter, the genotype for that putative variant is accepted. If the GQ value does not pass the threshold, the candidate continues through the regular image-classification pipeline: it is converted to a pileup image and classified by the CNN during call_variants.
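The acceptance rule can be sketched in a few lines of Python. This is an illustration, not DeepVariant's code: it assumes GQ is the phred-scaled probability that the most likely genotype is wrong, and the function names, the cap of 99, and the use of `>=` for the comparison are all hypothetical choices.

```python
import math

MAX_GQ = 99  # hypothetical cap so a probability of exactly 1.0 stays finite


def phred_gq(class_probs):
    """Phred-scale the probability that the most likely genotype is wrong."""
    p_error = 1.0 - max(class_probs)
    if p_error <= 0.0:
        return MAX_GQ
    return min(MAX_GQ, int(round(-10.0 * math.log10(p_error))))


def accept_small_model_call(class_probs, gq_threshold):
    """Return True if the small model's call is confident enough to keep.

    Assumption: the source only says GQ must be "above" the threshold,
    so whether the real comparison is strict is not known.
    """
    return phred_gq(class_probs) >= gq_threshold
```

Under these assumptions, a candidate whose top class probability is 0.999 has a GQ of 30 and would be accepted at a threshold of 20, while one at 0.9 has a GQ of 10 and would be passed on to the CNN.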

Details on the small model

The "small model" is a simple, fully-connected MLP with two hidden layers. It has roughly 360k parameters, considerably fewer than the CNN's 82M. It accepts a feature vector for each putative variant containing allele frequency, base and mapping quality, variant and haplotype information, and the allele frequencies of adjacent positions.
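To give a feel for where an MLP's parameter count comes from, the sketch below counts the weights and biases of a fully-connected network. The layer sizes are hypothetical and chosen only for illustration; the actual hidden-layer widths of the small model are not stated here, and the three output classes are an assumption based on the usual hom-ref / het / hom-alt genotype classes.

```python
def mlp_param_count(layer_sizes):
    """Count weights and biases of a fully-connected network.

    layer_sizes lists the width of every layer, input first and output last.
    Each pair of adjacent layers contributes n_in * n_out weights plus
    n_out biases.
    """
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# Hypothetical sizes: 66 input features, two hidden layers of 256 units,
# and 3 output classes. These are NOT the small model's real dimensions.
example_total = mlp_param_count([66, 256, 256, 3])
```

Because the count is dominated by the product of adjacent layer widths, even fairly wide hidden layers keep such a network orders of magnitude smaller than a CNN with tens of millions of parameters.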

The small model is called during the make_examples step, after candidate sweeping but before pileup image generation. This way we avoid generating pileup images for candidates we have already classified. If the small model's classification is accepted, make_examples writes the call to disk as a CallVariantsOutput (CVO) proto; these protos are then consumed by postprocess_variants alongside the CVOs from call_variants.

Fig 1: The small model is invoked during make_examples before pileup image generation, writing CVO protos that are consumed by postprocess_variants.

Model Features

The following table breaks down the feature vector of a single candidate locus.

identifying features		allele features		variant features		allele context features
contig	chr20	num_reads_supports_ref	20	is_snp	1	variant_allele_frequency_at_minus_4	0
start	101370	num_reads_supports_alt_1	21	is_insertion	0	variant_allele_frequency_at_minus_3	0
end	101371	total_depth	41	is_deletion	0	variant_allele_frequency_at_minus_2	0
ref	T	variant_allele_frequency_1	51	insertion_length	0	variant_allele_frequency_at_minus_1	2
alt_1	C	ref_mapping_quality	60	deletion_length	0	variant_allele_frequency_at_plus_0	51
genotype	1	alt_1_mapping_quality	60			variant_allele_frequency_at_plus_1	2
		ref_base_quality	88			variant_allele_frequency_at_plus_2	0
		alt_1_base_quality	80			variant_allele_frequency_at_plus_3	2
		ref_reverse_strand_ratio	0			variant_allele_frequency_at_plus_4	0
		alt_1_reverse_strand_ratio	0

Table 1: An example feature vector for the small model for WGS. The identifying features such as contig, position, etc. are not passed to the model. Additionally, for long-read sequencing platforms such as PacBio and ONT, the allele features are computed per haplotype (0, 1, 2) when phasing information is available, thereby adding 30 more features. The variant features describe what type of variant the candidate represents. Finally, the default range for the allele context features is -25 to +25; for brevity, we show just -4 to +4.

The features named variant_allele_frequency_at_X provide the model with allele information about the surroundings of each candidate locus. The model may adjust its classification probabilities if the candidate is in a region of high variant density and is surrounded by many other putative variants. These sites are generally very difficult to call, and the small model has learned to be less confident in these regions.
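As a rough illustration of how such context features could be assembled, the sketch below builds the variant_allele_frequency_at_X entries for a window around a locus. The function name, the input mapping, and the choice to fill positions without a candidate with 0 are assumptions made for this example, not DeepVariant's implementation.

```python
def allele_context_features(vaf_by_position, locus, window=4):
    """Build variant_allele_frequency_at_{minus,plus}_N features around a locus.

    vaf_by_position maps a reference position to the variant allele
    frequency (in percent) observed there; positions missing from the
    mapping are assumed to contribute 0.
    """
    features = {}
    for offset in range(-window, window + 1):
        side = "minus" if offset < 0 else "plus"
        name = f"variant_allele_frequency_at_{side}_{abs(offset)}"
        features[name] = vaf_by_position.get(locus + offset, 0)
    return features


# Values chosen to reproduce the context columns of Table 1.
table_1_context = allele_context_features(
    {101369: 2, 101370: 51, 101371: 2, 101373: 2}, locus=101370
)
```

With the default window of -25 to +25 mentioned in Table 1, the same function would be called with `window=25` and yield 51 context features.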

Accuracy numbers

Fig 2: Accuracy numbers by sequencing platform for HG003.

Runtime Improvements

As stated previously, the main motivation of the small model is to reduce runtime while minimizing degradation in accuracy.

Fig 3: Runtime by sequencing platform

Runtime vs Number of calls made per model

The small model affects the total runtime in multiple ways:

An added classification computation occurs during make_examples. This exerts an additive effect on runtime.
All candidate variants the small model classifies successfully, i.e. with sufficient confidence, no longer need to be converted to pileup images during make_examples. This has a reductive effect on runtime.
call_variants scales more or less linearly with the number of examples needing to be classified. Since the small model reduces the number of pileup images, this strongly reduces the call_variants runtime.

The GQ threshold determines the level of confidence we require from the small model before accepting its classification. As a result, the runtime scales with this threshold: for higher GQ thresholds, fewer classifications are accepted and more candidates are classified by the CNN. This is shown clearly in Figure 4.
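The relationship between the threshold and the CNN's workload can be illustrated with a toy example. The GQ values below are made up, and the function name and the `>=` comparison are assumptions; the point is only that raising the threshold moves candidates from the small model to the CNN.

```python
def route_candidates(small_model_gqs, gq_threshold):
    """Return (accepted_by_small_model, sent_to_cnn) counts for a threshold."""
    accepted = sum(1 for gq in small_model_gqs if gq >= gq_threshold)
    return accepted, len(small_model_gqs) - accepted


# Made-up small-model GQ values for eight candidates, for illustration only.
example_gqs = [5, 12, 18, 22, 25, 31, 40, 55]

for threshold in (10, 20, 30):
    accepted, to_cnn = route_candidates(example_gqs, threshold)
    print(f"threshold={threshold}: small model={accepted}, CNN={to_cnn}")
```

Since call_variants scales roughly linearly with the number of examples, the second count is the one that drives most of the runtime difference.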

Fig 4: The number of variants called by which model (top) and the resulting total runtime (bottom) by GQ threshold for PacBio HG003.

How to change default values or turn it off

The behavior of the small model is fully customizable:

--disable_small_model will disable the small model entirely and therefore skip the small model feature vector generation and classification.
--make_examples_extra_args "small_model_snp_gq_threshold=<VALUE>": Specifies the GQ threshold for SNPs. Setting it to -1 prevents the small model from calling any SNPs.
--make_examples_extra_args "small_model_indel_gq_threshold=<VALUE>": Specifies the GQ threshold for INDELs. Setting it to -1 prevents the small model from calling any INDELs.
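When both thresholds need to be set, a small helper can assemble the flag value. This sketch assumes that multiple key=value pairs are passed to --make_examples_extra_args as a single comma-separated string; the helper name and the threshold values are hypothetical.

```python
def small_model_extra_args(snp_gq_threshold=None, indel_gq_threshold=None):
    """Build the value for --make_examples_extra_args from the two thresholds.

    Assumption: several key=value pairs are joined with commas.
    A threshold left as None is simply omitted.
    """
    pairs = []
    if snp_gq_threshold is not None:
        pairs.append(f"small_model_snp_gq_threshold={snp_gq_threshold}")
    if indel_gq_threshold is not None:
        pairs.append(f"small_model_indel_gq_threshold={indel_gq_threshold}")
    return ",".join(pairs)


# Hypothetical values: require GQ 20 for SNPs, never call INDELs
# with the small model.
flag = f'--make_examples_extra_args="{small_model_extra_args(20, -1)}"'
```

The resulting string can be appended to the usual command line; the remaining arguments are unchanged.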

What about multi-allelic variants?

The small model in 1.8.0 release does not consider multi-allelic variants, due to the inherent complexity of classifying these sites. This may change in the future.
