English | 简体中文

HelixDock: Pre-Training on Large-Scale Generated Docking Conformations with HelixDock to Unlock the Potential of Protein-ligand Structure Prediction Models

This repository contains the implementation for our paper.

Protein-ligand structure prediction is crucial in drug discovery, determining interactions between small molecules (ligands) and target proteins (receptors). Traditional physics-based docking tools are widely used but suffer from limited conformational sampling and imprecise scoring functions, impacting their accuracy. Recent advances using deep learning aim to enhance prediction accuracy but are hindered by limited training data. HelixDock addresses these challenges by pre-training on large-scale docking conformations generated by traditional physics-based tools and then fine-tuning with a limited set of experimentally validated receptor-ligand complexes. This approach significantly improves prediction accuracy and model generalizability. HelixDock has been rigorously benchmarked against both physics-based and deep learning-based baselines, demonstrating superior precision and robust transferability. HelixDock also demonstrates outstanding capabilities in cross-docking and structure-based virtual screening benchmarks, successfully identifying highly active inhibitors in real-world virtual screening projects.

Online Service

To try the model without installing anything, use the online interface PaddleHelix HelixDock-forcast, which we provide as a web service.

License

This project is licensed under the CC BY-NC License.

Under this license, you are free to share, copy, distribute, and transmit the work, subject to the following restrictions:

  • Attribution (BY): You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
  • Non-Commercial (NC): You may not use the material for commercial purposes, but you are free to use it for academic research, education, and other non-commercial purposes.

For more details, please refer to the full license text.

Environment

Installation

In addition to the packages listed in requirements.txt, the openbabel tool is needed to calculate the aligned RMSD between the predicted pose and the crystal pose. Use the following commands to set up the environment.

conda create -n helixdock python=3.7
conda activate helixdock
pip install -r requirements.txt
conda install openbabel==2.4.1 -c conda-forge

Note that the rdkit version must be 2022.3.3; other versions may cause errors when loading the model parameters.
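The version requirement above can be checked before running anything else. The following is a minimal sketch, not part of this repository: the helper names `parse_version` and `rdkit_version_ok` are hypothetical, and the comparison assumes that rdkit reports its version as a dotted string of integers (for example `2022.03.3`, which is treated as equal to `2022.3.3`).

```python
EXPECTED_RDKIT = "2022.3.3"  # version stated in this README


def parse_version(version):
    """Turn a dotted version string such as '2022.03.3' into (2022, 3, 3)."""
    return tuple(int(part) for part in version.split("."))


def rdkit_version_ok(installed, expected=EXPECTED_RDKIT):
    """Return True if the installed version matches the expected one.

    Version strings that are not purely numeric (e.g. pre-releases)
    are reported as a mismatch.
    """
    try:
        return parse_version(installed) == parse_version(expected)
    except ValueError:
        return False


if __name__ == "__main__":
    try:
        import rdkit
    except ImportError:
        print("rdkit is not installed in this environment")
    else:
        status = "OK" if rdkit_version_ok(rdkit.__version__) else "MISMATCH"
        print(f"rdkit {rdkit.__version__}: {status}")
```

Run it inside the activated helixdock environment; a mismatch suggests reinstalling rdkit at the version given above.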

Download the Trained Model Parameters

Here we provide the model parameters that can be used to reproduce the results of our paper.

mkdir -p model
wget https://paddlehelix.bd.bcebos.com/HelixDock/helixdock.pdparams
mv helixdock.pdparams ./model/

Download the raw data

# PDBbind core set
wget https://paddlehelix.bd.bcebos.com/HelixDock/pdbbind_core_raw.tgz
tar xzf pdbbind_core_raw.tgz
mkdir -p ../data/PDBbind_v2020/complex/
mv pdbbind_core/* ../data/PDBbind_v2020/complex/


# PoseBusters dataset
wget https://paddlehelix.bd.bcebos.com/HelixDock/posebuster_raw.tgz
tar xzf posebuster_raw.tgz

Download the processed data

mkdir -p data/processed/
# PDBbind core set
wget https://paddlehelix.bd.bcebos.com/HelixDock/pdbbind_core_processed.tgz
tar xzf pdbbind_core_processed.tgz
mv pdbbind_core_processed data/processed/

# PoseBusters dataset
wget https://paddlehelix.bd.bcebos.com/HelixDock/posebuster_processed.tgz
tar xzf posebuster_processed.tgz
mv posebuster_processed data/processed/

Usage

To reproduce the results of our paper, we provide the following scripts:

# reproduce the results on the PDBbind core set
sh reproduce_core.sh

The output is organized as:

    ./log/reproduce_core/save_output/step-1
        mol_name.sdf

where mol_name.sdf is the predicted conformation for the input mol.
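To confirm that the run produced output, the predicted poses can be enumerated from the directory shown above. This is a small illustrative sketch; the helper `list_predictions` is hypothetical and not part of the repository, and it assumes only that each prediction is written as a `.sdf` file directly inside the output directory.

```python
from pathlib import Path


def list_predictions(output_dir):
    """Return {mol_name: path} for every .sdf file directly inside output_dir.

    A missing directory yields an empty dict rather than an error.
    """
    root = Path(output_dir)
    if not root.is_dir():
        return {}
    return {path.stem: path for path in sorted(root.glob("*.sdf"))}


if __name__ == "__main__":
    predictions = list_predictions("./log/reproduce_core/save_output/step-1")
    print(f"found {len(predictions)} predicted poses")
    for name, path in predictions.items():
        print(name, path)
```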

# reproduce the results on PoseBusters
# note that reproducing the PoseBusters results requires sampling multiple poses and ranking them using RTMScore and the posebuster score.
sh reproduce_posebuster.sh

The output is organized as:

    ./log/reproduce_posebuster/save_output/step-1
        mol_name.sdf

where mol_name.sdf is the predicted conformation for the input mol.
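Predicted poses are typically compared with the crystal pose by RMSD. The sketch below shows only the basic formula on two lists of matched atom coordinates. It is an illustration under stated assumptions, not the evaluation used in this repository, which relies on openbabel for the aligned RMSD: here the atoms are assumed to be in the same order in both lists, and no symmetry correction or superposition is applied. The function name `rmsd` is hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z).

    Atom i in coords_a is assumed to correspond to atom i in coords_b.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("both poses must contain the same number of atoms")
    if not coords_a:
        raise ValueError("at least one atom is required")
    squared_sum = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(coords_a, coords_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(squared_sum / len(coords_a))


if __name__ == "__main__":
    crystal = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    predicted = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]  # crystal shifted by 1 along x
    print(rmsd(crystal, predicted))
```

For ligands with symmetric groups, a symmetry-aware tool gives a fairer comparison than this index-matched version.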

Data Availability

To advance the frontiers of small molecule drug discovery and provide maximum support for academic researchers, HelixDock's latest technology is fully open to the academic community. This includes access to the code and billion-scale training data, accelerating the application of AI in small molecule drug research and promoting development in this field. (Commercial customers can inquire about specific business rules through the "Cooperation Consultation" entry on the official website.)

Training data can be obtained free of charge by contacting the PaddleHelix team at the following link (please include your institution's name): https://paddlehelix.baidu.com/partnership.

Citing this work

If you use the code or data in this repository, please cite:

@article{liu2024pretraining,
      title={Pre-Training on Large-Scale Generated Docking Conformations with HelixDock to Unlock the Potential of Protein-ligand Structure Prediction Models}, 
      author={Lihang Liu and Shanzhuo Zhang and Donglong He and Xianbin Ye and Jingbo Zhou and Xiaonan Zhang and Yaoyao Jiang and Weiming Diao and Hang Yin and Hua Chai and Fan Wang and Jingzhou He and Liang Zheng and Yonghui Li and Xiaomin Fang},
      year={2024},
      eprint={2310.13913},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}