Skip to content

Latest commit

 

History

History
17 lines (13 loc) · 1.29 KB

README.md

File metadata and controls

17 lines (13 loc) · 1.29 KB

Requirement

  • Python 3.6
  • Tensoflow 1.14

Dataset Preparation

First, process dataset to generate the graph and node features. The format is as follows:

  • feature.bin, the feature vectors of the node instances. To save the space, we store them as sparse matrix via scipy.sparse.lil.lil_matrix. Each row represents all the feature of a node.
  • label.bin, the one-hot labels of the node instances. Each row represents a one-hot vector of class label.
  • graph.bin, a dict in the format {index: [index_of_neighbor_nodes]}, where the neighbor nodes are organized as a list.

Then, we will randomly sample nodes to split the nodes into train/eval/test sets. For citation datasets cora, citeseer, pubmed, we follow the processing of GAT to split train/val/test sets. For amazon product dataset, we will use the default data separation given by OGB. For other datasets, we will take the ratio=60%, 20%, 20%. The node indexs are stored as:

  • idx_train.bin, the indexs of training instances.
  • idx_eval.bin, the indexs of evaluation instances.
  • idx_test.bin, the indexs of test instances.

Please refer to test_data.ipynb for the detailed information of the datasets under ./data folder.