![]() |
BioAI Hackathon at the University of Warsaw (May 12–15, 2025) |
The aim of the project was to compare the performance of Graph Neural Networks (GNNs) with network propagation methods for disease-gene prioritization. We aimed at assessing and comparing how well these two strategies can prioritize candidate disease genes for further biological validation. As for the use case, we used genes involved in hereditary ataxia.
Medical genetics aims at diagnosing patients by finding genetic variants explaining their disease. While approximately half of the patients with rare genetic disease receive a genetic diagnosis, the other half remains undiagnosed. The reason is that some genomic variants are currently unknown to be disease-associated. Since experimental studies are usually quite expensive and time-consuming, in silico disease-gene prioritization is very useful in the discovery of new disease-genes. By ranking genes, the methods provide sets of the most promising new candidate genes relevant for a disease. Currently, novel bioinformatics methods, such as graph neural networks are increasingly popular. The huge number of such methods requires further understanding of how they compare to previous methods. In particular, it is of crucial importance for bioinformaticians to provide only high-confidence candidates for experimental validation. Therefore, during the hackathon we aimed at comparing the performance of graph neural networks with network propagation methods. Compared to network propagation, GNNs are expected to offer improved performance in recognizing complex network patterns, but are dependent on advanced configuration and limited by high computational requirements. Our project explores the potential and limitations of these two types of models.
- Jędrzej Kubica
- German Demidov
- Anna Simonyan
- Abolhassan Bahari
- Rahaf Ahmad
- Sreeram Peela
- Mikolaj Kocikowski
- Paulina Domek
In this study, we used the following software packages:
- https://github.com/geneDRAGNN/geneDRAGNN
- https://github.com/GiDeCarlo/XGDAG
- https://github.com/anthbapt/multixrank
For each method we used the following input:
- a biological network (protein-protein interaction constructed with BioGRID and Reactome)
- a list of known disease genes for ataxia (from https://www.genomicsengland.co.uk/)
- as node features, we used gene expression in cerebellum (from the GTEx database; https://www.gtexportal.org) and case/control variants statistics (from the AstraZeneca PheWAS Portal; https://azphewas.com/)
Everyone is welcome to clone this repository and build upon the project, for example, to expand the benchmarking of network-based methods for identifying disease-associated genes, or to obtain a functional XGDAG environment and geneDRAGNN setup.
This project demonstrates how to use the pretrained model from the geneDRAGNN repository to generate disease relevance scores for genes using a Multi-Layer Perceptron (MLP) model trained on lung adenocarcinoma (LUAD) data.
🔍 What This Does
- Loads a pretrained MLP model: MLP_node_trial99.ckpt
- Accepts node features of shape (N, 107)
- Outputs predicted gene scores in gene_scores.csv
🧪 Setup
-
Clone the Repo git clone https://github.com/hanikhatib/geneDRAGNN.git cd geneDRAGNN
-
Install Requirements (CPU) pip install torch torchvision torchaudio pandas numpy pip install torch-scatter -f https://data.pyg.org/whl/torch-2.1.0+cpu.html pip install torch-sparse -f https://data.pyg.org/whl/torch-2.1.0+cpu.html pip install torch-geometric
-
Prepare Input Files Place gene_features_107.csv and gene_names.csv in the repo folder.
-
Run Inference python3 run_gene_score_inference_final_v2.py
✅ Output: gene_scores.csv with pathogenicity relevance scores
- The pretrained model is trained for Lung Adenocarcinoma (LUAD).
- Using this model for other diseases will require retraining.
TBD
As a network propagation method, we used Random Walk with Restart, as implemented in MultiXrank (https://github.com/anthbapt/multixrank).
🔍 What This Does
- Loads a biological network and known disease genes (seeds)
- Outputs predicted gene scores
🧪 Setup
-
Clone the Repo git clone https://github.com/anthbapt/multixrank cd multixrank
-
Install Requirements (CPU) pip install multixrank
-
Prepare Input Files Place interactome_human.tsv (format: ENSG1\tENSG2) and ataxia_genes.txt (ENSG per row) in the ataxia folder.
-
Run MultiXrank as described in https://github.com/anthbapt/multixrank
✅ Output: multixrank_output_ataxia.tsv with scores
- We used the default MultiXrank parameters.
The list of the top 20 prioritized genes for ataxia using the geneDRAGNN tool that was trained on LUAD:
The list of the top 20 genes for ataxia using the retrained geneDRAGNN tool:
We were not able to utilize the XGDAG GNN due to challenges in reproducing a functional environment for this tool, across systems from a Windows laptop to a Linux HPC. The provided Conda setups were not functional and we could not resolve all dependencies. We requested the authores provide a containerized, platform-independent solution, however, due to the time limitations of the hackathon, we created a compatible Docker image from scratch. This way we were able to run the tool on the included, default data. This allowed us to realize several input files for XGDAG must be prepared with another tool, with limited documentation of the process. The tool set up unfortunately left us with not enough time to process and analyze our own data.
All scripts to run MultiXrank for ataxia genes are in multixrank/
Comparison of the top 20 genes from geneDRAGNN, retrained geneDRAGNN and MultiXrank:
During the hackathon, we conducted the project entitled "GRAIL: Gene Ranking for Ataxia using Interactome-based Learning" to compare the performance of two promising strategies for disease gene prioritization: graph neural networks and network propagation methods. Despite challenges with usability and limited time, we ran geneDRAGNN (a GNN-based method) and MultiXrank (a network propagation method). We used an example dataset consisting of a protein-protein interaction network and known gene for hereditary ataxia. Our results show that both methods produce distinct sets of top-ranked genes, which suggests that they might provide complementary results. The retraining of geneDRAGNN using disease-specific features might improve its performance. This project demonstrates the promise of GNNs in disease-gene prioritization. To conclude, a combined strategy that leverages the interpretability and scalability of propagation with the expressive power of GNNs may yield the best results for real-world rare disease research.
- Retrain XGDAG and geneDRAGNN using disease-specific input data
- GNN vs network propagation: leave-one-out cross-validation, compare the top 20 highest scoring genes (the most promising candidates for ataxia)
- Use XAI strategies to compute explanation subgraphs
- Construct multi-layer networks for GNN imputation
- Improve the usability of GNNs for non-CS researchers
- Perform external validation of candidate genes using the internal database of 30K+ of exomes and genomes of rare disease patients at IMGAG Tübingen.
- Altabaa, A., Huang, D., Byles-Ho, C., Khatib, H., Sosa, F., & Hu, T. (2022). GeneDRAGNN: Gene disease prioritization using graph neural networks. 2022 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), 1–10. https://doi.org/10.1109/cibcb55180.2022.9863043
- Ata, S. K., Wu, M., Fang, Y., Ou-Yang, L., Kwoh, C. K., & Li, X.-L. (2021). Recent advances in network-based methods for disease gene prediction. Briefings in Bioinformatics, 22(4). https://doi.org/10.1093/bib/bbaa303
- Bekker, J., & Davis, J. (2020). Learning from positive and unlabeled data: a survey. Machine Learning, 109(4), 719–760. https://doi.org/10.1007/s10994-020-05877-5
- Laman Trip, D. S., van Oostrum, M., Memon, D., Frommelt, F., Baptista, D., Panneerselvam, K., Bradley, G., Licata, L., Hermjakob, H., Orchard, S., Trynka, G., McDonagh, E. M., Fossati, A., Aebersold, R., Gstaiger, M., Wollscheid, B., & Beltrao, P. (2025). A tissue-specific atlas of protein-protein associations enables prioritization of candidate disease genes. Nature Biotechnology. https://doi.org/10.1038/s41587-025-02659-z
- Mastropietro, A., De Carlo, G., & Anagnostopoulos, A. (2023). XGDAG: explainable gene-disease associations via graph neural networks. Bioinformatics (Oxford, England), 39(8). https://doi.org/10.1093/bioinformatics/btad482
- Shi, Y., Huang, Z., Wang, W., Zhong, H., Feng, S., & Sun, Y. (2020). Masked label prediction: Unified message passing model for semi-supervised classification. In arXiv [cs.LG]. arXiv. http://arxiv.org/abs/2009.03509