Documentation for CNNScoreVariants #8526

felixm3 · 2023-09-22T17:39:46Z

Where can I find detailed documentation on CNNScoreVariants beyond the paragraph here please?

I'm looking for something similar to this copied below with details on inputs, architecture, how the CNNs were trained, etc.

Input/output
Clair3 uses a pileup input design simplified from that of its predecessors, and a full-alignment input to cover as many details in the read alignments as possible. Supplementary Fig. 3 visualizes the pileup and full-alignment inputs of a random SNP, insertion, deletion or non-variant. The pileup input comprises 594 integers: 33 genome positions wide, with 18 features at each position (A+, C+, G+, T+, IS+, I1S+, DS+, D1S+, DR+, A−, C−, G−, T−, IS−, I1S−, DS−, D1S− and DR−, where the + and − indicate the positive strand and negative strand, respectively). A, C, G, T, I and D represent the counts of read support for the four nucleotides, and insertion and deletion. Superscript ‘1’ means only the indel with the highest read support is counted if various lengths of indel are found in a candidate site (that is, all indels are counted if without ‘1’). Subscript ‘S’ means the starting position of an indel. Subscript ‘R’ means the following positions of an indel. For example, a 3-bp deletion with the most read support will have the first deleted base counted in either D1S+ or D1S−, and the second and third deleted bases counted in either DR+ or DR−. The design was determined experimentally, but the rationale is that for 1-bp indels that are easy to call, look into the differences between the ‘S’ counts, but reduce the quality if the ‘R’ counts and discrepancy between positions increase. Supplementary Fig. 3 provides some intuitions on how the features are counted given four random examples. For developers to confirm their understanding, the input creation logic is available at https://github.com/HKU-BAL/Clair3/blob/main/preprocess/CreateTensorPileup.py (ref. 16). The pileup output is explained in Supplementary Section 3. The indel allele (or two indel alleles) with the highest read support is used as the output according to the decision made in the 21-genotype task.
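As a way to check the arithmetic in the paragraph above, here is a minimal Python sketch of the 33 × 18 pileup layout and of the 3-bp deletion example. It is not taken from Clair3's CreateTensorPileup.py: the names `FEATURES`, `empty_pileup` and `count_top_deletion` are hypothetical, the position-major layout is an assumption, and only the counters the passage explicitly mentions (D1S and DR) are updated.

```python
# Feature names copied from the quoted passage; ASCII "+"/"-" stand in for
# the strand signs. The nested-list layout is an assumption for illustration.
BASE_FEATURES = ["A", "C", "G", "T", "IS", "I1S", "DS", "D1S", "DR"]
FEATURES = [f + "+" for f in BASE_FEATURES] + [f + "-" for f in BASE_FEATURES]
WINDOW = 33  # genome positions per candidate site


def empty_pileup():
    """Return a WINDOW x len(FEATURES) matrix of zero counts."""
    return [[0] * len(FEATURES) for _ in range(WINDOW)]


def count_top_deletion(tensor, start, length, strand):
    """Record one read carrying the most-supported deletion (hypothetical helper).

    The first deleted base goes to D1S on the read's strand and the
    remaining deleted bases go to DR, as in the passage's 3-bp example.
    """
    tensor[start][FEATURES.index("D1S" + strand)] += 1
    for offset in range(1, length):
        tensor[start + offset][FEATURES.index("DR" + strand)] += 1


pileup = empty_pileup()
count_top_deletion(pileup, start=16, length=3, strand="+")
print(len(FEATURES), WINDOW * len(FEATURES))
```

Flattening this matrix gives the 594 integers mentioned in the text (33 positions × 18 features).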
The full-alignment input comprises 23,496 integers—eight channels of 33 genome positions and a maximum number of reads of 89. A description of the eight channels is provided in Supplementary Section 1. The full-alignment output is explained in Supplementary Section 3. The two indel length tasks can represent the exact indel length from −15 to 15 bp, or below −15 bp/above 15 bp. An indel call with an exact length will output the most reads-supported allele at that length. Otherwise, the most reads-supported allele below −15 bp/above 15 bp is outputted. Supplementary Fig. 4 shows the performance of Clair3 for different indel lengths and supports a cutoff at 15 bp. In training, indel length task 1 is given the smaller number, and in all our variant calling experiments, no length predictions in task 1 larger than in task 2 were observed. The maximum supported coverage of full-alignment input was 89. If the coverage was above 89, random subsampling of reads was applied. If the coverage was below 89, zero-padding was applied with reads placed at the center of the input. The maximum supported coverage of full-alignment input can be increased by changing the ‘matrix_depth_dict’ variable in the ‘param_f.py’ configuration file.
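The depth handling described above (random subsampling above 89 reads, centered zero-padding below 89) can be sketched as follows for a single channel. This is an illustration under stated assumptions, not Clair3's code: `fit_depth` is a hypothetical helper, keeping the subsampled reads in their original order is an assumption, and putting the extra padding row at the bottom when the padding is odd is also an assumption.

```python
import random

MAX_DEPTH = 89  # maximum number of reads, from the quoted passage
WINDOW = 33     # genome positions
CHANNELS = 8    # full-alignment channels


def fit_depth(reads, max_depth=MAX_DEPTH, rng=None):
    """Return exactly max_depth rows for one channel (hypothetical helper).

    reads: list of rows, each a list of WINDOW integers.
    """
    rng = rng or random.Random(0)
    width = len(reads[0]) if reads else WINDOW
    if len(reads) > max_depth:
        # Too deep: randomly subsample, preserving original read order (assumption).
        keep = sorted(rng.sample(range(len(reads)), max_depth))
        return [reads[i] for i in keep]
    # Too shallow: zero-pad above and below so the reads sit in the center.
    pad_total = max_depth - len(reads)
    top = pad_total // 2
    bottom = pad_total - top
    return (
        [[0] * width for _ in range(top)]
        + [list(row) for row in reads]
        + [[0] * width for _ in range(bottom)]
    )


shallow = [[1] * WINDOW for _ in range(10)]
deep = [[1] * WINDOW for _ in range(120)]
print(len(fit_depth(shallow)), len(fit_depth(deep)))
print(CHANNELS * WINDOW * MAX_DEPTH)
```

The last line reproduces the 23,496 integers quoted in the text (8 channels × 33 positions × 89 reads).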

Network architecture
The pileup and full-alignment networks are shown in Supplementary Fig. 5. The pileup network uses two bidirectional long short-term memory (Bi-LSTM) layers with 128 and 160 LSTM units. Stacked LSTM layers enable the network to learn the characteristics of raw sequential signal from different aspects at each position, but without increasing memory capacity, which enables the network to converge faster. Compared to Clair, the transpose-split layer is removed for a 40% speedup, with a small performance loss that is taken care of in full-alignment calling. The full-alignment network is derived from a residual neural network (ResNet) and uses three standard residual blocks. A convolutional layer is added on top of each residual block to expand the channels but reduce dimensionality across the channels. A spatial pyramid pooling (SPP; ref. 17) layer is used to tackle the problem of varying coverage in the full-alignment input. SPP is a pooling layer that removes a network’s fixed-size constraint, thus avoiding the need for input cropping or warping at the beginning. The SPP layer generates various receptive fields using three pooling scales (1 × 1, 2 × 2 and 3 × 3) in each channel. It then pools the receptive fields of all channels and generates a fixed-length output for the next layer. In both networks, the dropout rates of 0.2 for the flatten layer, 0.5 for the penultimate dense layer, and 0.2 for the task-specific final dense layers, are empirically determined. In comparison, the Inception-v3 network, used as the full-alignment network in DeepVariant and PEPPER, has ~24 million parameters, approximately eight times as many as the 2,989,210 parameters of Clair3’s full-alignment network.
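To illustrate why an SPP layer yields a fixed-length output whatever the input height, here is a small pure-Python sketch that pools one channel at the three scales named above. It is a sketch only: `spp_channel` is a hypothetical function, the use of max pooling and the floor/ceil bin boundaries are assumptions, and none of it is taken from Clair3's implementation.

```python
import math


def spp_channel(matrix, levels=(1, 2, 3)):
    """Pool one 2-D channel into sum(n * n for n in levels) values.

    Each level n splits the matrix into an n x n grid of (possibly
    overlapping) bins and keeps the maximum of each bin.
    """
    rows, cols = len(matrix), len(matrix[0])
    pooled = []
    for n in levels:
        for i in range(n):
            r0 = (i * rows) // n
            r1 = math.ceil((i + 1) * rows / n)
            for j in range(n):
                c0 = (j * cols) // n
                c1 = math.ceil((j + 1) * cols / n)
                pooled.append(
                    max(matrix[r][c] for r in range(r0, r1) for c in range(c0, c1))
                )
    return pooled


# Two inputs with different numbers of rows give the same output length.
low = [[r + c for c in range(33)] for r in range(20)]
high = [[r + c for c in range(33)] for r in range(89)]
print(len(spp_channel(low)), len(spp_channel(high)))
```

With scales 1 × 1, 2 × 2 and 3 × 3, each channel contributes 1 + 4 + 9 = 14 values, so the next layer sees the same length for any input height.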

We tried removing a residual block from the full-alignment network, and the overall F1-score reduced by ~4%, on average, in multiple experiments with HG003 and HG004, and coverage from 10× to 50×. On adding a residual block, the overall F1-score improvements were unnoticeable, but the network speed slowed by 20%. Removing the SPP layer reduced the Indel F1-score by ~10% at 10× coverage. More results of removing the insertion, phasing, mapping quality (MQ), base quality (BQ) channel or the two Indel length tasks are shown in Supplementary Table 11. A visualization toolkit that shows the network activations of individual inputs using guided propagation is available from the GitHub and Zenodo (ref. 16) repositories. An example is given in Supplementary Fig. 6.

Model availability and training
Pretrained models are provided in Clair3’s installation. Models for specific chemistries and base callers that are tested and supported by the ONT developers are available through Rerio (ref. 18). Detailed steps, options and caveats for training a pileup model and a full-alignment model are available in Clair3’s GitHub repository (https://github.com/HKU-BAL/Clair3/blob/main/docs/pileup_training.md and https://github.com/HKU-BAL/Clair3/blob/main/docs/full_alignment_training_r1.md); these are continually updated. The pretrained models, although targeted for use in production, were trained using multiple GIAB samples with known variants and ten coverages for each sample (more details are provided in Supplementary Section 4), and chromosome 20 is always held out in Clair3.
