-
Notifications
You must be signed in to change notification settings - Fork 34
RdRP OTU system
https://serratus-public.s3.amazonaws.com/rce/rdrp_otus/2020-01-20/otus.fa
https://serratus-public.s3.amazonaws.com/rce/rdrp_otus/2020-01-20/otus.tsv
https://serratus-public.s3.amazonaws.com/rce/rdrp_otus/2020-01-20/acc_table.tsv
We need a hierarchical classification of RdRPs to organize known Riboviruses and characterize new RdRPs discovered by Serratus. Taxonomy is not adequate because many known viruses are not classified at lower ranks ("unclassified Ribovirus", "unclassified Betacoronavirus" etc.), and in general sequence similarity is not sufficient to make a taxonomic assignment of a new sequence to a named clade.
Identifying a set of "known" RdRPs is not straightforward because there is no curated RdRP database analogous to SILVA for 16S and it is not trivial to extract RdRPs from GenBank due to inconsistent terminology and missing annotations. In practice, we must use sequence-based searches (diamond, motifator...) to identify RdRPs in GenBank from full-length nucleotide genomes and in translated viral ORFs and CDSs. With sequence search, there is a continuous spectrum from high confidence to low confidence and below; this must be addressed by setting thresholds which are necessarily somewhat arbitrary.
If a sequence is present only in a viral genome which is not annotated but is a low-confidence-RdRP according to motifator, is it "known"? This could be argued both ways. For claims in a paper, we must be conservative so we will not claim it as novel. There is a spectrum from "clearly known" through "arguable" to "clearly novel assuming comprehensive due diligence in finding RdRPs in GenBank".
We should make our best effort collect known RdRPs and assign them to at least two bins: (1) high-confidence ("gold") and (2) other ("bronze", because silver sounds too good).
Gold sequences will be the foundation for classifying assembled RdRPs.
The "bronze" bin could contain RdRPs which are not known with high-confidence, but are identifiable easily enough that we should not claim them as novel. However, this is not needed if we use high-confidence motifator to search both known sequences and assemblies. This is because a high-confidence assembly match will be in the gold set if it is already known; it cannot be 90% similar to an RdRP in the bronze set because then motifator would have identified it with high confidence.
Possible methods include PFAM HMMs, diamond against a trusted reference, motifator, and Raffy (a random forest classifier combining inputs from diamond and motifator).
With HMMs and diamond, there are a few drawbacks. E-value is a measure of sequence similarity, but sequence similarity does not necessarily imply homology, and homology does not necessarily imply RdRP function (e.g., RTs are homologs with a different function). With diamond, we have seen that E-value < 1e-6 is a good classifier but for HMMs we haven't investigated this systematically. Also, the PFAM domains have varying overlaps with the A+C region we want to use as a marker.
Raffy looks to work well, but it's more complicated and more difficult to use than motifator, and I see no compelling advantage to using it.
Motifator has excellent specificity for RdRP, good sensitivity, and locates an informative marker segment (A+C). Therefore, motifator is used as the automated method for identifying RdRPs in uncurated sequences.
Notation: A+C is the maximal segment containing both A and C motifs. AxBxC is a sequence containg A, B and C motifs in canonical order, separated by xxx.
Possible methods include glocal alignment to a pre-trimmed reference such as the Wolf original sequences (before permutation hacking), and Motifator. Glocal has the problem that permutations must be identified, otherwise the trim boundaries may be quite wrong, e.g. when the closest known species has a different order of motifs. Motifator is therefore preferred as the automated method.
Manual trimming can be used in anomalous cases such as unusual permuated domains where motifator fails.
The gold RdRP set our best set of high-confidence knowns, trimmmed to A+C. An RdRP fragment missing one or more motifs is operationally considered to be unknown. A claim of novelty is reasonable if we find the ABC segment for such a fragment. Each gold sequence is distinct; if duplicates are found in different sources these are noted in an annotation file.
The sources for gold v0.1 are described next, in decreasing priority order. Representative sequences are chosen in this order.
Proteins annotated by UniProt as RdRP. As with rdrp1, these are assumed to be monkey-curated and motifator is used to trim; any sequence predicted as RdRP with low-, medium- or high-confidence and intact ABC is included.
The current Serratus query from AB. I am assuming this was monkey-curated and all sequences can be considered high-confidence RdRP. These sequences are trimmed by motifator as for UniProt. Here, the manual curation trumps the motifator classification.
RdRPs from Koonin paper (from AB). Considered to be curated, so all motifator hits are included.
This is a complete (?) set of viral ORFs from GenBank from AB. I don't know exactly how it was built. From these, only high-confidence motifator predictions are included.
This is a complete (?) set of viral nucleotide sequences downloaded from GenBank by AB. From these, only high-confidence motifator predictions are included.
Not yet constructed, and might not be needed. These will be medium- and low-confidence RdRPs from gb241 plus other non-curated sources we can find, e.g. motifator search of 6-frame translation of full-length nt virus genomes.
Gold is constructed from the above sources, giving a set of unique trimming RdRP sequences. OTUs are constructed from these by UCLUST at species-like, genus-like and family-like thresholds (90%, 75% and 40%). At each stage (uniques, 90, 75, 40), input to UCLUST is the centroids from the previous stage.
OTU system is versioned by date. If multiple versions are posted on the same dat, then letters a, b... will be appended to the date.
FASTA file with unique Gold sequences trimmed to A+C. Typical label is:
>u23 f123.g456.s6789 taxon:2585030
Here, u, f, g and s are the unique, family-like, genus-like and species-like integer identifiers and taxon is the NCBI TaxID.
Tab-separated file with the following fields:
-
Accession. May be from GenBank, UniProt or some other database. The database can probably be distinguished from the identifier, but there is no systematic way to do this.
-
Unique integer (100% identity cluster number).
-
Species-like integer (90% identity cluster number).
-
Genus-like integer (75% identity cluster number).
-
Family-like integer (40% identity cluster number).
-
NCBI TaxID.
-
Scientific name.
-
String with all taxonomy ranks.
Accession table. Tab-separated text with one line for every accession in the input sets. The accession is the first field. This enables any input sequence to be mapped to a unique Gold or to the discard pile.
Once an OTU system is in production, it should be stable so that a given query sequence will be assigned to the same OTU. To maintain a stable reference, new gold sequences will be added as follows. All new sequences that do not match otus.fa at 90% identity are collected in new.fa. The unique sequences in new.fa are identified as new.uniques.fa; each of these receives a new unique integer identifier (u). These are clustered at 90% identity, giving centroids new.id90.fa which are assigned new species-like identifiers (s). Then, new.id90.fa is clustered at 75% giving centroids new.id75.fa with new g identifiers, and new.id75.fa is clustered at 40% giving centroids new.id40.fa each of which is assigned a new f identifier.