Clarify discussion of "recombinant" samples
huddlej committed Aug 19, 2024
1 parent d3ea840 commit 63f6779
Showing 2 changed files with 3 additions and 1 deletion.
3 changes: 2 additions & 1 deletion manuscript/cartography.tex
@@ -419,7 +419,8 @@ \subsection{SARS-CoV-2 clusters recapitulate broad genetic groups corresponding

To test the optimal cluster parameters identified above, we applied embedding methods to late SARS-CoV-2 data and compared clusters from these embeddings to the corresponding Nextstrain clades and Pango lineages.
Compared to the 17 Nextstrain clades defined in this time period (Supplementary Fig.~S\ref{S_Fig_sarscov2_late_embeddings_by_Nextstrain_clade}), the closest clusters were from t-SNE (normalized VI=0.09 with 66 clusters) and UMAP (normalized VI=0.09 with 13 clusters, Fig.~\ref{fig:sars-cov-2-2022-2023-clusters-vs-Nextstrain-clade} and Supplementary Table~S\ref{S_Table_optimal_cluster_parameters}).
- We attributed t-SNE's additional clusters to recombinant lineages that were genetically distinct but which received a generic ``recombinant'' label in Nextstrain's clade definitions instead of a unique clade name.
+ We attributed t-SNE's additional clusters to recombinant lineages that were genetically distinct but which received a generic ``recombinant'' label in Nextstrain's clade definitions instead of a unique clade name (Supplementary Fig.~S\ref{S_Fig_sarscov2_late_embeddings_by_Nextstrain_clade}).
+ Although we did not consider these non-monophyletic recombinant samples when calculating VI distances between clusters and Nextstrain clades, these samples appear in each embedding where they could form their own distinct clusters.
Only t-SNE, UMAP, and genetic distance clusters were fully monophyletic (Supplementary Table~S\ref{S_Table_monophyletic_clusters}).
Genetic distance, PCA, and t-SNE clusters were best supported by cluster-specific mutations with 16 of 17 clusters (94\%), 6 of 7 clusters (86\%), and 51 of 66 clusters (77\%), respectively (Supplementary Table~S\ref{S_Table_mutations_per_cluster}).
Clusters from t-SNE had the lowest average within-group distances (Supplementary Fig.~S\ref{S_Fig_sarscov2_within_between_group_distances}).
1 change: 1 addition & 0 deletions manuscript/cartography_supplement.tex
@@ -204,6 +204,7 @@ \section*{Supplementary data}
\includegraphics[width=0.9\columnwidth]{figures/sarscov2-test-embeddings-by-Nextstrain_clade-clade.png}
\caption{{\bf Phylogeny of late (2022--2023) SARS-CoV-2 sequences plotted by number of nucleotide substitutions from the most recent common ancestor on the x-axis (top) and low-dimensional embeddings of the same sequences by PCA (middle left), MDS (middle right), t-SNE (bottom left), and UMAP (bottom right).}
Tips in the tree and embeddings are colored by their Nextstrain clade assignment.
+ Tips that could not be assigned to a predefined Nextstrain clade due to recombination were colored as ``recombinant''.
Line segments in each embedding reflect phylogenetic relationships with internal node positions calculated from the mean positions of their immediate descendants in each dimension (see Methods).
Line thickness in the embeddings scales by the square root of the number of leaves descending from a given node in the phylogeny.
Clade labels in the tree and embeddings highlight larger clades.