This is the official code repository for the paper "Learning from pre-pandemic data to forecast viral escape". This paper is a joint collaboration between the Marks Lab and the OATML group.
EVEscape is a model that computes the predicted likelihood of a given viral protein variant to induce immune escape from antibodies. For each protein, EVEscape predicts escape from data sources available pre-pandemic: sequence likelihood predictions from broader viral evolution, antibody accessibility information from protein structures, and changes in binding interaction propensity from residue chemical properties.
Computing EVEscape scores consists of three components:
- Fitness: use scores from EVE, an unsupervised generative model of mutation effect from broader evolutionary sequences
- Accessibility: calculate WCN from PDB structures of relevant conformations of the viral protein of interest
- Dissimilarity: calculate difference in charge and hydrophobicity between the mutant residue and the wildtype
The components are then standardized and fed into a temperature scaled logistic function, and we take the the log transform of the product of the 3 terms to obtain final EVEscape scores.
We also provide EVEscape scores for all single mutation variants of SARS-CoV-2 Spike and aggregate strain-level predictions for all GISAID strains, and EVEscape rankings of newly occurring GISAID strains and visualization of likely future mutations will be available at evescape.org.
The scripts folder contains python scripts to calculate EVEscape scores for all single mutations and aggregate available deep mutational scanning data for SARS-CoV-2 RBD, Flu HA, HIV Env, Lassa glycoprotein, and Nipah fusion and glycoproteins from data. Specifically this includes the following two scripts:
- process_protein_data.py calculates the three EVEscape components
- evescape_scores.py creates the final evescape scores and outputs scores and processed DMS data in summaries_with_scores
The scripts folder also contains a python script score_pandemic_strains.py to calculate EVEscape scores for all strains in GISAID. The output strain scores (~150MB unzipped) can be downloaded as follows:
curl -o strain_scores_20230318.zip https://marks.hms.harvard.edu/evescape/strain_scores_20230318.zip
unzip strain_scores_20230318.zip
rm strain_scores_20230318.zip
The workflow of the scripts to create the data tables in results needed for the main figures of the EVEscape paper is available in evescape_summary.pdf. Additional data tables are available in the paper supplement.
The training data required to obtain EVEscape scores is one or multiple PDB files of the viral antigen, EVE scores (see next subsection) and a fasta file of the wildtype sequence for the viral protein of interest.
To download the RBD escape data used as validation data in this project (~120MB unzipped):
curl -o escape_dms_data_20220109.zip https://marks.hms.harvard.edu/evescape/escape_dms_data_20220109.zip
unzip escape_dms_data_20220109.zip
rm escape_dms_data_20220109.zip
(originally downloaded from SARS2_RBD_Ab_escape_maps)
We leverage the original EVE codebase to compute the evolutionary indices used in EVEscape.
The MSAs used to train the EVE models used in this project can be found in the supplemental material of the paper (Data S1).
We modify the Bayesian VAE training script to support the following hyperparameter choices in the MSA_processing call:
- sequence re-weighting in MSA (theta): we choose a value of 0.01 that is better suited to viruses (Hopf et al., Riesselman et al.)
- fragment filtering (threshold_sequence_frac_gaps): we keep sequences in the MSA that align to at least 50% of the target sequence.
- position filtering (threshold_focus_cols_frac_gaps): we keep columns with at least 70% coverage, except for SARS-CoV-2 Spike for which we lower the required value to 30% in order to maximally cover experimental positions and significant pandemic sites.
We train 5 independent models with different random seeds.
For the 5 independently-trained models, we compute evolutionary indices sampling 20k times from the approximate posterior distribution (ie., num_samples_compute_evol_indices=20000). We then average the resulting scores across the 5 models to obtain the final EVE scores used in EVEscape.
We provide open access to the EVE models we trained for the various viruses discussed in the paper. These model checkpoints were obtained following the training procedure described above. To download checkpoints for a viral protein of interest, please adapt the following example with the relevant filename (filenames are listed in the table underneath).
curl -o EVE_checkpoints_I4EPC4.zip https://marks.hms.harvard.edu/evescape/EVE_checkpoints_I4EPC4.zip
unzip EVE_checkpoints_I4EPC4.zip
rm EVE_checkpoints_I4EPC4.zip
Uniprot ID | Organism | Protein name | Filename |
---|---|---|---|
I4EPC4 | Influenza A virus | Hemagglutinin | EVE_checkpoints_I4EPC4.zip |
P0DTC2 (pre2020 sequences only) | SARS-CoV-2 | Spike glycoprotein | EVE_checkpoints_P0DTC2_full_pre2020.zip |
P0DTC2 | SARS-CoV-2 | Spike glycoprotein | EVE_checkpoints_P0DTC2_full.zip |
Q2N0S5 | HIV-1 | Envelope glycoprotein gp160 | EVE_checkpoints_Q2N0S5.zip |
FUS_NIPAV | Nipah virus | Fusion glycoprotein F0 | EVE_checkpoints_FUS_NIPAV.zip |
GLYC_LASSJ | Lassa virus | Pre-glycoprotein polyprotein GP complex | EVE_checkpoints_GLYC_LASSJ.zip |
GLYCP_NIPAV | Nipah virus | Glycoprotein G | EVE_checkpoints_GLYCP_NIPAV.zip |
The entire codebase is written in python. The corresponding environment may be created via conda and the provided requirements.txt file as follows:
conda config --add channels conda-forge
conda create --name evescape_env --file requirements.txt
conda activate evescape_env
The environment installs in minutes.
After collecting the training data, generating EVEscape scores for all single mutations runs in minutes. Strain scoring of all GISAID strains runs in 2 hours on 64G of memory.
This project is available under the MIT license.
If you use this code, please cite the following paper:
Nicole N. Thadani*, Sarah Gurev*, Pascal Notin*, Noor Youssef, Nathan J. Rollins, Daniel Ritter, Chris Sander, Yarin Gal, Debora S. Marks. Learning from pre-pandemic data to forecast viral escape. Nature. 2023.
(* equal contribution)
Links:
- Publication: https://www.nature.com/articles/s41586-023-06617-0
- Website: https://www.evescape.org/
See new work using EVEscape to design infectious Spike proteins that forecast future neutralizing antibody escape on BioRxiv.