Skip to content

Joint Sequence-Structure Generation of Nucleic Acid and Protein Complexes with SE(3)-Discrete Diffusion

License

Notifications You must be signed in to change notification settings

Profluent-Internships/MMDiff

Repository files navigation

MMDiff: Joint Sequence-Structure Generation of Nucleic Acid and Protein Complexes with SE(3)-Discrete Diffusion (NeurIPS MLSB 2023)

PyTorch Lightning Config: Hydra

Checkpoints DOI

Description

Official PyTorch implementation of Towards Joint Sequence-Structure Generation of Nucleic Acid and Protein Complexes with SE(3)-Discrete Diffusion.

Animation of a diffusion model-generated 3D nucleic acid molecule visualized successively

Table of contents

How to install

Install Mamba

wget "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh"
bash Mambaforge-$(uname)-$(uname -m).sh  # accept all terms and install to the default location
rm Mambaforge-$(uname)-$(uname -m).sh  # (optionally) remove installer after using it
source ~/.bashrc  # alternatively, one can restart their shell session to achieve the same result

Install dependencies

# clone project
git clone https://github.com/Profluent-Internships/MMDiff
cd MMDiff

# create conda environment
mamba env create -f environment.yaml
conda activate MMDiff  # note: one still needs to use `conda` to (de)activate environments

# install local project as package
pip3 install -e .

Download pre-trained checkpoints

Note: Make sure to be located in the project's root directory beforehand (e.g., ~/MMDiff/)

# fetch and extract model checkpoints directory
wget https://zenodo.org/record/8247932/files/MMDiff_Release_Checkpoints.tar.gz
tar -xzf MMDiff_Release_Checkpoints.tar.gz -C checkpoints
rm MMDiff_Release_Checkpoints.tar.gz

Install US-align and qTMclust to cluster generated structures

cd $MY_PROGRAMS_DIR  # download US-align to your choice of directory (e.g., `~/Programs/`)
git clone https://github.com/pylelab/USalign.git && cd USalign/ && git checkout 97325d3aad852f8a4407649f25e697bbaa17e186
g++ -static -O3 -ffast-math -lm -o USalign USalign.cpp
g++ -static -O3 -ffast-math -lm -o qTMclust qTMclust.cpp

Note: Make sure to update the usalign_exec_path and qtmclust_exec_path values in e.g., configs/paths/default.yaml to reflect where you have placed the US-align and qTMclust executables on your machine.

How to install RoseTTAFold2NA to score generated sequences

cd forks/RoseTTAFold2NA/

# create conda environment for RoseTTAFold2NA
conda deactivate
mamba env create -f RF2na-linux.yml --prefix RF2NA/
conda activate RF2NA/

# install SE(3)-Transformer locally to run RoseTTAFold2NA #
cd SE3Transformer/
pip install --no-cache-dir -r requirements.txt
python setup.py install
cd ..

# download pre-trained weights locally #
cd network
wget https://files.ipd.uw.edu/dimaio/RF2NA_apr23.tgz
tar xvfz RF2NA_apr23.tgz
du -sh weights/ # note: this directory should contain a 1.1GB weights file
cd ..

# download sequence and structure databases #
# note: downloading these databases is unnecessary if one will only be using RF2NA's single-sequence mode #
# uniref30 [46G]
wget http://wwwuser.gwdg.de/~compbiol/uniclust/2020_06/UniRef30_2020_06_hhsuite.tar.gz
mkdir -p UniRef30_2020_06
tar xfz UniRef30_2020_06_hhsuite.tar.gz -C ./UniRef30_2020_06

# BFD [272G]
wget https://bfd.mmseqs.com/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt.tar.gz
mkdir -p bfd
tar xfz bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt.tar.gz -C ./bfd

# structure templates (including *_a3m.ffdata, *_a3m.ffindex)
wget https://files.ipd.uw.edu/pub/RoseTTAFold/pdb100_2021Mar03.tar.gz
tar xfz pdb100_2021Mar03.tar.gz

# RNA databases
mkdir -p RNA
cd RNA

# Rfam [300M]
wget ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.full_region.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz
gunzip Rfam.cm.gz
cmpress Rfam.cm

# RNAcentral [12G]
wget ftp://ftp.ebi.ac.uk/pub/databases/RNAcentral/current_release/rfam/rfam_annotations.tsv.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/RNAcentral/current_release/id_mapping/id_mapping.tsv.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/RNAcentral/current_release/sequences/rnacentral_species_specific_ids.fasta.gz
../input_prep/reprocess_rnac.pl id_mapping.tsv.gz rfam_annotations.tsv.gz   # ~8 minutes
gunzip -c rnacentral_species_specific_ids.fasta.gz | makeblastdb -in - -dbtype nucl  -parse_seqids -out rnacentral.fasta -title "RNACentral"

# nt [151G]
update_blastdb.pl --decompress nt
cd ../../../../  # return to the root directory of the project

How to prepare input data

How to download protein input data

To get the protein portion of the training dataset, first download the PDB and then preprocess it using our provided scripts. The PDB can be downloaded from the RCSB: https://www.wwpdb.org/ftp/pdb-ftp-sites#rcsbpdb. Note that our scripts assume you have downloaded the PDB in the mmCIF format. Navigate down to "Download Protocols" and follow the download instructions depending on your location.

WARNING: Downloading PDB can take up to 1TB of space.

After downloading, you should have a directory formatted like this: https://files.rcsb.org/pub/pdb/data/structures/divided/mmCIF/

00/
01/
02/
..
zz/

In this directory, unzip all the files:

find . -name "*.gz" -exec gzip -d {} +

Then run the following with <mmcif_dir> replaced with the location of the files downloaded from the PDB.

python src/data/components/pdb/process_pdb_mmcif_files.py --mmcif_dir <mmcif_dir> --write_dir <processed_dir> --num_processes 50 --skip_existing --verbose

See the script for more options. Each mmCIF will be written as a pickle file that we read and process in the data loading pipeline. A protein_metadata.csv will be saved that contains the pickle path of each protein example as well as additional information about each example for faster filtering.

How to enable PDB clustering

To use clustered training data, download the clusters at 30% sequence identity via the RCSB. This download link also works at the time of writing:

https://cdn.rcsb.org/resources/sequence/clusters/clusters-by-entity-30.txt

Place this file in data/processed_pdb or anywhere in your file system. Update your config to point to the clustered data:

data:
  cluster_path: data/processed_pdb/clusters-by-entity-30.txt

To use clustered data, set sample_mode to either cluster_time_batch or cluster_length_batch. See next section for details.

Setting the batching mode

experiment:
  # Use one of the following:

  # Each batch contains multiple time steps of the same protein.
  sample_mode: time_batch

  # Each batch contains multiple proteins of the same length.
  sample_mode: length_batch

  # Each batch contains multiple time steps of a protein from a cluster.
  sample_mode: cluster_time_batch

  # Each batch contains multiple clusters of the same length.
  sample_mode: cluster_length_batch

How to download nucleic acid input data

# download complexes from the PDB for training
python notebooks/creating_protein_na_datasets_from_the_pdb.py

How to analyze input data

# download Pfam databases and related metadata
pfam_db_dir=~/Databases/Pfam
mkdir -p $pfam_db_dir

curl ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz | gunzip > $pfam_db_dir/Pfam-A.hmm
curl ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.dat.gz | gunzip > $pfam_db_dir/Pfam-A.hmm.dat
curl ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/active_site.dat.gz | gunzip > $pfam_db_dir/active_site.dat

cd $pfam_db_dir
hmmpress Pfam-A.hmm

# execute analysis script to attach Pfam annotations to existing dataset metadata
python notebooks/analyze_dataset_characteristics.py

How to process nucleic acid input data

First, group each of the downloaded PDB files by their two middle alphanumeric characters, with <pdb_dir> replaced with the location of the PDB files downloaded.

cd <pdb_dir> && find . -maxdepth 1 -type f -name "*.pdb" -exec sh -c 'dir=$(basename {} | awk -F"[._]" "{print substr(\$1, 2, 2)}"); mkdir -p "$dir"; mv {} "$dir"' \;

Then, run the following with <pdb_dir> replaced with the location of the PDB files downloaded.

python src/data/components/pdb/process_pdb_na_files.py --pdb_dir <pdb_dir> --write_dir <processed_dir> --num_processes 50 --skip_existing --verbose

See the script for more options. Each PDB will be written as a pickle file that we read and process in the data loading pipeline. A na_metadata.csv will be saved that contains the pickle path of each nucleic acid example as well as additional information about each example for faster filtering.

Note: The process_pdb_na_files.py script must be run after first running the process_pdb_mmcif_files.py script, to ensure that the nucleic acid complexes curated by the creating_protein_na_datasets_from_the_pdb.py script overwrite those collected by process_pdb_mmcif_files.py. This results in de-duplication of approximately 2,500 PDB complexes that are assembled by both process_pdb_mmcif_files.py (first) and creating_protein_na_datasets_from_the_pdb.py (second, and preferred since nucleic acid molecules are included in this collection).

How to combine protein and nucleic acid metadata for dataloading

Lastly, when building the dataset from scratch, we need to combine the protein_metadata.csv and na_metadata.csv files by joining the two by their pdb_name columns and dropping duplicate rows contained in protein_metadata.csv (i.e., preferring to keep entries in na_metadata.csv over those in protein_metadata.csv). One can do so as follows:

python src/data/components/pdb/join_pdb_metadata.py na_metadata_csv_path=<processed_dir>/na_metadata.csv protein_metadata_csv_path=<processed_dir>/protein_metadata.csv metadata_output_csv_path=<processed_dir>/metadata.csv

After the script above is finished, the resulting metadata.csv file should also be placed in metadata/ as PDB_NA_Dataset.csv to finalize the information required for the model's dataloading pipeline, as shown below:

cp <processed_dir>/metadata.csv metadata/PDB_NA_Dataset.csv

How to train

Train model with default configuration

# train on CPU
python src/train.py trainer=cpu

# train on GPU
python src/train.py trainer=gpu

Train model with chosen experiment configuration from configs/experiment/. For example:

python src/train.py experiment=pdb_prot_na_gen_se3

Note: You can override any parameter from the command line like this

python src/train.py trainer.max_epochs=20 data.batch_size=64

How to sample

sample.py is the inference script, which can be run as follows.

python src/sample.py

For example, to reproduce the sampling experiments in our manuscript, one can run the inference script as follows, for protein-only, nucleic acid-only, and protein-nucleic acid sequence-structure diffusion, respectively:

# protein-only sampling and evaluation
python src/sample.py inference.name='protein_only_sequence_structure_se3_discrete_diffusion_stratified_eval_${now:%Y-%m-%d}_${now:%H-%M-%S}' inference.seed=123 inference.run_statified_eval=true inference.filter_eval_split=true inference.run_self_consistency_eval=true inference.run_diversity_eval=true inference.run_novelty_eval=true inference.output_dir=\'./inference_eval_outputs/\' inference.samples.min_length=10 inference.samples.max_length=120 inference.samples.num_length_steps=10 inference.samples.min_num_chains=1 inference.samples.max_num_chains=4 paths.usalign_exec_path=$HOME/Programs/USalign/USalign paths.qtmclust_exec_path=$HOME/Programs/USalign/qTMclust data.data_cfg.filtering.mmcif_allowed_oligomer=[notype] data.data_cfg.filtering.allowed_molecule_types=[protein] ckpt_path=checkpoints/protein_na_sequence_structure_g42jpyug_rotations_epoch_286.ckpt
# nucleic acid-only sampling and evaluation
python src/sample.py inference.name='na_only_sequence_structure_se3_discrete_diffusion_stratified_eval_${now:%Y-%m-%d}_${now:%H-%M-%S}' inference.seed=456 inference.run_statified_eval=true inference.filter_eval_split=true inference.run_self_consistency_eval=true inference.run_diversity_eval=true inference.run_novelty_eval=true inference.output_dir=\'./inference_eval_outputs/\' inference.samples.min_length=10 inference.samples.max_length=120 inference.samples.num_length_steps=10 inference.samples.min_num_chains=1 inference.samples.max_num_chains=4 paths.usalign_exec_path=$HOME/Programs/USalign/USalign paths.qtmclust_exec_path=$HOME/Programs/USalign/qTMclust data.data_cfg.filtering.mmcif_allowed_oligomer=[notype] data.data_cfg.filtering.allowed_molecule_types=[na] inference.measure_auxiliary_na_metrics=true ckpt_path=checkpoints/protein_na_sequence_structure_g42jpyug_rotations_epoch_286.ckpt
# protein-nucleic acid sampling and evaluation
python src/sample.py inference.name='protein_na_sequence_structure_se3_discrete_diffusion_stratified_eval_${now:%Y-%m-%d}_${now:%H-%M-%S}' inference.seed=789 inference.run_statified_eval=true inference.filter_eval_split=true inference.run_self_consistency_eval=true inference.run_diversity_eval=true inference.run_novelty_eval=true inference.output_dir=\'./inference_eval_outputs/\' inference.samples.min_length=10 inference.samples.max_length=120 inference.samples.num_length_steps=10 inference.samples.min_num_chains=1 inference.samples.max_num_chains=4 paths.usalign_exec_path=$HOME/Programs/USalign/USalign paths.qtmclust_exec_path=$HOME/Programs/USalign/qTMclust data.data_cfg.filtering.mmcif_allowed_oligomer=[notype] data.data_cfg.filtering.allowed_molecule_types=[protein,na] ckpt_path=checkpoints/protein_na_sequence_structure_g42jpyug_rotations_epoch_286.ckpt

The config for inference is in configs/sample.yaml. See the config for different inference options. By default, inference will use the model weights checkpoints/protein_na_sequence_structure_g42jpyug_rotations_epoch_286.ckpt for protein-nucleic acid joint generation of sequence and structure. Simply change the ckpt_path to use your custom weights (e.g., for structure-only generation) as desired.

Note: Evaluation with RoseTTAFold2NA for assessing designability does not allow semicolons (:) in the names of any FASTA file inputs. Keep this in mind when choosing your value for inference.name or inference.output_dir.

ckpt_path: <path>

Samples will be saved to inference.output_dir in the sample.yaml. By default it is set to ./inference_outputs/. Sample outputs will be saved as follows:

inference_outputs
└── 10D_07M_2023Y_20h_46m_13s                                         # date time of inference
    └── length_100                                                    # sampled length
        ├── sample_0                                                  # sample ID for length
        │   ├── self_consistency                                      # self-consistency results
        │   │   ├── rf2na                                             # RoseTTAFold2NA prediction using generated chain sequences
        │   │   │   ├── sample_1.pdb
        │   │   ├── {protein_,na_,}sample_1.pdb
        │   │   ├── sc_results.csv                                    # summary metrics CSV
        │   │   └── seqs
        │   │       ├── {P:,S:,R:,PR:,}sample_0_chain_a.fa            # generated sequence for each generated chain
        │   │       ├── {P:,S:,R:,PR:,}sample_0_chain_b.fa            # note: `P:` -> protein; `S:` -> single-stranded DNA; `R` -> RNA; and `PR:` -> protein-RNA
        │   │       ├── {P:,S:,R:,PR:,}sample_0_chain_a.fasta         # note: MMDiff sequence when `inference.generate_protein_sequences_using_pmpnn` is `true` for protein-only generation
        │   │       ├── {P:,S:,R:,PR:,}sample_0_chain_b.fasta
        │   │       └── ...
        │   ├── {protein_,na_,}bb_traj_1.pdb                          # `x_{t-1}` diffusion trajectory
        │   ├── {protein_,na_,}sample_1.pdb                           # final sample
        │   └── x0_traj_1.pdb                                         # `x_0` model prediction trajectory
        └── sample_1                                                  # next sample

Acknowledgements

MMDiff builds upon the source code and data from the following projects:

We thank all their contributors and maintainers!

Citing this work

If you use the code or data associated with this package or otherwise find such work useful, please cite:

@inproceedings{
  morehead2023towards,
  title={Towards Joint Sequence-Structure Generation of Nucleic Acid and Protein Complexes with SE(3)-Discrete Diffusion},
  author={Morehead, Alex and Bhatnagar, Aadyot and Ruffolo, Jeffrey A. and Madani, Ali},
  booktitle={NeurIPS Machine Learning in Structural Biology Workshop},
  year={2023}
}

About

Joint Sequence-Structure Generation of Nucleic Acid and Protein Complexes with SE(3)-Discrete Diffusion

Resources

License

Stars

Watchers

Forks