This repository contains the implementation for the ROLAND paper. It is a version of GraphGym modified to include our new model implementation.
Please see the get_roland_public_data.sh
script for accessing the publicly available datasets used in our paper from the Stanford Large Network Dataset Collection.
We provide the MD5 values of the datasets in public_dataset_MD5.txt
so that you can verify the integrity of the downloaded datasets.
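The check can be scripted. The following is a minimal sketch, not part of the repository: the helper names `md5_of_file` and `verify_md5` are hypothetical, and the layout of public_dataset_MD5.txt is not assumed, so the expected digest is passed in as a plain string.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large dataset files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected_hex):
    """Return True when the file's MD5 matches `expected_hex` (case-insensitive)."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

Calling `verify_md5(downloaded_file, value_from_md5_list)` for each dataset should return `True` when the download is intact.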
We recorded our complete conda environment configuration in roland_environment.yml.
The essential packages are listed in requirements.txt.
The script ./run/replicate_roland_tables.sh
runs the experiments for all tables in the paper.
We uploaded our raw TensorBoard logs to this repository as well. This section allows you to replicate our tables directly from those logs without re-running the experiments.
- First, upload the logs to TensorBoard Dev following the tutorial here. This gives you an experiment_id.
- To generate the tables in pandas format, run the following commands:
cd ./run
python3 make_table_tensorboard_dev.py <The Experiment ID You Obtained>
This script prints the LaTeX-formatted tables to your terminal.
The model configurations corresponding to each entry in Tables 2-4 are located in ./run/replication_configs/.
Each file name describes the dataset, dataset configuration and model specification. You can also look into these yaml files for more details.
We tested our replication materials in June 2021 (the Camera Ready Version) on Ubuntu 20.04 + RTX3090 (CUDA 11.4).
@inproceedings{you2022roland,
title={ROLAND: graph learning framework for dynamic graphs},
author={You, Jiaxuan and Du, Tianyu and Leskovec, Jure},
booktitle={Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
pages={2358--2366},
year={2022}
}
Backend: the GraphGym Platform
Details on Deploying GraphGym
GraphGym is a platform for designing and evaluating Graph Neural Networks (GNN).
Highlights
1. Highly modularized pipeline for GNN
- Data: Data loading, data splitting
- Model: Modularized GNN implementation
- Tasks: Node / edge / graph level GNN tasks
- Evaluation: Accuracy, ROC AUC, ...
2. Reproducible experiment configuration
- Each experiment is fully described by a configuration file
3. Scalable experiment management
- Easily launch thousands of GNN experiments in parallel
- Auto-generate experiment analyses and figures across random seeds and experiments.
4. Flexible user customization
- Easily register your own modules in
graphgym/contrib/
, such as data loaders, GNN layers, loss functions, etc.
TL;DR: GraphGym is great for GNN beginners, domain experts and GNN researchers.
Scenario 1: You are a beginner to GNN, who wants to understand how GNN works.
You have probably read many exciting papers on GNN and tried to write your own GNN implementation. Even with existing GNN packages, you still have to code up the essential pipeline on your own. GraphGym is a perfect place for you to start learning standardized GNN implementation and evaluation.
Scenario 2: You want to apply GNN to your exciting applications.
You probably know that there are hundreds of possible GNN models, and selecting the best model is notoriously hard. Even worse, we have shown in our paper that the best GNN designs for different tasks differ drastically. GraphGym provides a simple interface to try out thousands of GNNs in parallel and understand the best designs for your specific task. GraphGym also recommends a "go-to" GNN design space, after investigating 10 million GNN model-task combinations.
Figure 2: A guideline for desirable GNN design choices.
(Sampling from 10 million GNN model-task combinations.)
Scenario 3: You are a GNN researcher who wants to innovate GNN models or propose new GNN tasks.
Say you have proposed a new GNN layer ExampleConv
.
GraphGym can help you convincingly argue that ExampleConv
is better than say GCNConv
:
when randomly sampling from 10 million possible model-task combinations, how often does ExampleConv outperform GCNConv when everything else is fixed (including the computational cost)?
Moreover, GraphGym can help you easily do hyper-parameter search, and visualize what design choices are better.
In sum, GraphGym can greatly facilitate your GNN research.
Requirements
- CPU or NVIDIA GPU, Linux, Python3
- PyTorch and various Python packages; instructions for installing these dependencies are given below
1. Python environment: We recommend using the Conda package manager
conda create -n graphgym python=3.7
source activate graphgym
2. PyTorch: Install PyTorch. We have verified GraphGym under PyTorch 1.4.0 and torchvision 0.5.0. For example:
pip install torch==1.4.0 torchvision==0.5.0
3. PyTorch Geometric: Install PyTorch Geometric, following their instructions. For example:
# CUDA versions: cpu, cu92, cu101, cu102, cu110
# TORCH versions: 1.4.0, 1.5.0, 1.6.0, 1.7.0
CUDA=cu101
TORCH=1.4.0
pip install torch-scatter==latest+${CUDA} -f https://pytorch-geometric.com/whl/torch-${TORCH}.html
pip install torch-sparse==latest+${CUDA} -f https://pytorch-geometric.com/whl/torch-${TORCH}.html
pip install torch-cluster==latest+${CUDA} -f https://pytorch-geometric.com/whl/torch-${TORCH}.html
pip install torch-spline-conv==latest+${CUDA} -f https://pytorch-geometric.com/whl/torch-${TORCH}.html
pip install torch-geometric
Alternatively, you can install dependencies of torch-geometric
in one line:
pip install torch-scatter torch-sparse torch-cluster torch-spline-conv -f https://data.pyg.org/whl/torch-${TORCH}+${CUDA}.html
Note: the installation of torch-scatter
can take up to 10 minutes. Please make sure the CUDA version and PyTorch version match your system driver exactly.
4. Clone GraphGym and install other dependencies:
git clone https://github.com/snap-stanford/GraphGym
cd GraphGym
pip install -r requirements.txt
python setup.py develop
5. Test the installation
Run a test GNN experiment using GraphGym, specified in run/configs/example.yaml
.
The experiment is node classification on the Cora dataset (random 80/20 train/val split).
cd run
bash run_single.sh
A full example is specified in run/run_single.sh
.
1.1 Specify a configuration file.
In GraphGym, an experiment is fully specified by a .yaml
file.
Unspecified configurations in the .yaml
file will be populated by the default values in graphgym/config.py
.
For example, in run/configs/example.yaml
, there are configurations on dataset, training, model, GNN, etc.
Each configuration is also described in graphgym/config.py
.
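The way unspecified fields fall back to defaults can be pictured as a recursive merge. This is a minimal sketch using plain dictionaries, not GraphGym's actual configuration code; `merge_config` and the example keys and values are hypothetical.

```python
import copy


def merge_config(defaults, overrides):
    """Return a new dict: `defaults` with values from `overrides` applied recursively."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Descend into nested sections instead of replacing them wholesale.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical defaults and user overrides, for illustration only.
defaults = {"dataset": {"name": "Cora"}, "gnn": {"layers_mp": 2, "dim_inner": 16}}
overrides = {"gnn": {"layers_mp": 3}}
config = merge_config(defaults, overrides)
```

Here only `gnn.layers_mp` is overridden; every other field keeps its default value.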
1.2 Launch an experiment.
For example, in run/run_single.sh
:
python main.py --cfg configs/example.yaml --repeat 3
You can specify the number of different random seeds to repeat via --repeat
.
1.3 Understand the results.
Experimental results will be automatically saved in directory run/results/${CONFIG_NAME}/
;
in the example above, it is run/results/example/
.
Results for different random seeds will be saved in different subdirectories, such as run/results/example/2
.
The aggregated results over all the random seeds are automatically generated into run/results/example/agg
,
including the mean and standard deviation _std
for each metric.
Train/val/test results are further saved into subdirectories, such as run/results/example/agg/val
; here,
stats.json
stores the results after each epoch aggregated across random seeds,
best.json
stores the results at the epoch with the highest validation accuracy.
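The aggregation described above can be sketched as follows. This is an illustration of the idea only, not GraphGym's aggregation code: the function names are hypothetical, the input is assumed to be one list of per-epoch values per random seed, and the population standard deviation is an assumption.

```python
import statistics


def aggregate_seeds(runs, metric="accuracy"):
    """Aggregate per-epoch metric values across seeds.

    `runs` is a list (one entry per seed) of lists (one value per epoch).
    Returns a list of dicts holding the mean and standard deviation per epoch.
    """
    aggregated = []
    for epoch, values in enumerate(zip(*runs)):
        aggregated.append({
            "epoch": epoch,
            metric: statistics.mean(values),
            metric + "_std": statistics.pstdev(values),
        })
    return aggregated


def best_epoch(aggregated, metric="accuracy"):
    """Return the aggregated entry with the highest mean value of `metric`."""
    return max(aggregated, key=lambda entry: entry[metric])
```

Passing the validation series to `aggregate_seeds` and then calling `best_epoch` mirrors the relationship between stats.json and best.json described above.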
A full example is specified in run/run_batch.sh
.
2.1 Specify a base file.
GraphGym supports running a batch of experiments.
To start, a user needs to select a base architecture --config
.
The batch of experiments will be created by perturbing certain configurations of the base architecture.
2.2 (Optional) Specify a base file for computational budget.
Additionally, GraphGym allows a user to select a base architecture to control the computational budget for the grid search, --config_budget
.
The computational budget is currently measured by the number of trainable parameters; the control is achieved by automatically adjusting the hidden dimension size of the GNN.
If no --config_budget
is provided, GraphGym will not control the computational budget.
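To illustrate how a parameter budget can be matched by adjusting the hidden size, here is a minimal sketch. It is not GraphGym's implementation: a stack of dense layers stands in for a GNN, and `count_params` and `match_budget` are hypothetical helpers.

```python
def count_params(dim_in, dim_hidden, dim_out, num_layers):
    """Count weights and biases of a stack of dense layers (a stand-in for a GNN)."""
    dims = [dim_in] + [dim_hidden] * (num_layers - 1) + [dim_out]
    return sum(a * b + b for a, b in zip(dims[:-1], dims[1:]))


def match_budget(budget, dim_in, dim_out, num_layers, max_hidden=4096):
    """Return the hidden size whose parameter count is closest to `budget`."""
    return min(
        range(1, max_hidden + 1),
        key=lambda hidden: abs(count_params(dim_in, hidden, dim_out, num_layers) - budget),
    )
```

With this scheme, a deeper model receives a smaller hidden size than a shallower one under the same budget, so that comparisons across depths stay roughly parameter-matched.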
2.3 Specify a grid file.
A grid file describes how to perturb the base file in order to generate the batch of experiments. For example, the base file could specify an experiment with a 3-layer GCN for Cora node classification. The grid file then specifies how to perturb the experiment along different dimensions, such as number of layers, model architecture, dataset, level of task, etc.
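Conceptually, perturbing a base file along several dimensions is a Cartesian product over the listed values. The sketch below shows that idea with plain dictionaries; it does not reproduce the grid file syntax or the behavior of configs_gen.py, and the helper names, keys and values are hypothetical.

```python
import copy
import itertools


def set_by_path(config, dotted_key, value):
    """Set `config["a"]["b"]` given the dotted key "a.b", creating sections as needed."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = value


def expand_grid(base, grid):
    """Yield one config per combination of the values listed in `grid`."""
    keys = list(grid)
    for combo in itertools.product(*(grid[key] for key in keys)):
        config = copy.deepcopy(base)
        for key, value in zip(keys, combo):
            set_by_path(config, key, value)
        yield config


# Hypothetical base experiment and grid, for illustration only.
base = {"dataset": {"name": "Cora"}, "gnn": {"layer_type": "gcnconv", "layers_mp": 3}}
grid = {"gnn.layers_mp": [2, 4, 6], "gnn.layer_type": ["gcnconv", "sageconv"]}
configs = list(expand_grid(base, grid))
```

Three layer counts times two layer types give six experiment configurations, each a perturbed copy of the base.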
2.4 Generate config files for the batch of experiments, based on the information specified above.
For example, in run/run_batch.sh
:
python configs_gen.py --config configs/${DIR}/${CONFIG}.yaml \
--config_budget configs/${DIR}/${CONFIG}.yaml \
--grid grids/${DIR}/${GRID}.txt \
--out_dir configs
2.5 Launch the batch of experiments.
For example, in run/run_batch.sh
:
bash parallel.sh configs/${CONFIG}_grid_${GRID} $REPEAT $MAX_JOBS $SLEEP
Each experiment will be repeated $REPEAT
times.
We implemented a queue system to sequentially launch all the jobs, with $MAX_JOBS
concurrent jobs running at the same time.
In practice, our system works great when handling thousands of jobs.
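A bounded job queue of this kind can be sketched in Python with a thread pool that supervises subprocesses. This is an illustration of the idea, not a port of parallel.sh: `run_jobs` and `repeat_commands` are hypothetical helpers, and the sleep interval between launches is not modeled.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_jobs(commands, max_jobs):
    """Run each command as a subprocess, keeping at most `max_jobs` alive at once.

    Returns the exit codes in the same order as `commands`.
    """
    def run_one(command):
        return subprocess.run(command, check=False).returncode

    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        # map() preserves input order even though jobs may finish out of order.
        return list(pool.map(run_one, commands))


def repeat_commands(config_paths, repeat):
    """Build one `main.py` invocation per config file (paths are supplied by the caller)."""
    return [
        [sys.executable, "main.py", "--cfg", path, "--repeat", str(repeat)]
        for path in config_paths
    ]
```

For example, `run_jobs(repeat_commands(paths, 3), max_jobs=8)` would work through all config files with at most eight experiments running at any time.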
2.6 Understand the results.
Experimental results will be automatically saved in directory run/results/${CONFIG_NAME}_grid_${GRID_NAME}/
;
in the example above, it is run/results/example_grid_example/
.
After running each experiment, GraphGym also automatically aggregates the results across different models and saves them in
run/results/example_grid_example/agg
.
There, val.csv
represents validation accuracy for each model configuration at the final epoch,
and val_best.csv
represents the results at the epoch with the highest validation accuracy.
When test set split is provided, test.csv
represents test accuracy for each model configuration at the final epoch,
and test_best.csv
represents the results at the epoch with the highest validation accuracy.
We provide a handy tool that automatically gives an overview of a batch of experiments in
analysis/example.ipynb
.
cd analysis
jupyter notebook
example.ipynb # automatically provide an overview of a batch of experiments
A highlight of GraphGym is that it allows users to easily register their customized modules.
The supported customized modules are provided in directory graphgym/contrib/
, including:
- Activation
graphgym/contrib/act/
, - Configuration
graphgym/contrib/config/
, - Feature augmentation
graphgym/contrib/feature_augment/
, - Feature encoder
graphgym/contrib/feature_encoder/
, - GNN head
graphgym/contrib/head/
, - GNN layer
graphgym/contrib/layer/
, - Data loader
graphgym/contrib/loader/
, - Loss function
graphgym/contrib/loss/
, - GNN network
graphgym/contrib/network/
, - Optimizer
graphgym/contrib/optimizer/
, - GNN global pooling (graph classification only)
graphgym/contrib/pooling/
, - GNN stage
graphgym/contrib/stage/
, - Data transformations
graphgym/contrib/transform/
.
Within each directory, at least one example is provided, showing how to register user-customized modules.
Note that new user customized modules may result in new configurations; in these cases, new configuration fields
should be registered at graphgym/contrib/config/
.
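The registration mechanism follows the common registry pattern: a module is stored under a name, and a configuration field later selects it by that name. The sketch below shows the pattern in isolation. It is not GraphGym's registration API; `layer_registry`, `register`, `build_layer` and the placeholder `ExampleConv` class are hypothetical, and the examples in graphgym/contrib/ remain the authoritative reference.

```python
# Hypothetical registry; the names here are illustrative, not GraphGym's API.
layer_registry = {}


def register(registry, name):
    """Return a decorator that stores the decorated object in `registry` under `name`."""
    def decorator(obj):
        if name in registry:
            raise KeyError(f"'{name}' is already registered")
        registry[name] = obj
        return obj
    return decorator


@register(layer_registry, "exampleconv")
class ExampleConv:
    """Placeholder layer; a real one would subclass a PyTorch module."""

    def __init__(self, dim_in, dim_out):
        self.dim_in = dim_in
        self.dim_out = dim_out


def build_layer(name, **kwargs):
    """Look up a registered layer by its config name and instantiate it."""
    return layer_registry[name](**kwargs)
```

A configuration value such as `"exampleconv"` is then enough to select the custom layer, without touching the code that builds the model.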
Note: Applying to your own datasets.
A common use case will be applying GraphGym to your favorite datasets.
To do so, you may follow our example in graphgym/contrib/loader/example.py
.
To provide more flexibility, GraphGym currently accepts a list of NetworkX graphs
or DeepSNAP graphs as the input;
the following attributes in the graphs will be automatically loaded and parsed: node_feature
, node_label
, edge_feature
, edge_label
,
graph_feature
, graph_label
.
Additionally, we have provided examples on how to transform PyG datasets into the accepted format.
Further details on the data representation are described in the DeepSNAP documentation.
Reproducing experiments in Design Space for Graph Neural Networks, Jiaxuan You, Rex Ying, Jure Leskovec, NeurIPS 2020 Spotlight. You may refer to the paper or project webpage for more details.
# NOTE: We include the raw results with GraphGym
# If you run the following code, the results will be overwritten.
cd run
bash run_design_round1.sh # first round experiments, on a design space of 315K GNN designs
bash run_design_round2.sh # second round experiments, on a design space of 96 GNN designs
cd ../analysis
jupyter notebook
design_space.ipynb # reproducing all the analyses in the paper
Jiaxuan You initiated the project and is the main contributor to the entire GraphGym platform. Rex Ying contributed the feature augmentation modules. Jonathan Gomes Selman added OGB support to GraphGym.
GraphGym is inspired by the framework of pycls. GraphGym adopts DeepSNAP as its data representation, which is a Python library that assists efficient deep learning on graphs. Part of GraphGym relies on PyTorch Geometric functionality.
We warmly welcome the community to contribute to GraphGym.
GraphGym is particularly designed to enable contribution / customization in a simple way.
For example, you may contribute your modules to graphgym/contrib/
by creating pull requests.
If you find GraphGym or our paper useful, please cite our paper:
@InProceedings{you2020design,
title = {Design Space for Graph Neural Networks},
author = {You, Jiaxuan and Ying, Rex and Leskovec, Jure},
booktitle = {NeurIPS},
year = {2020}
}