Got low accuracy when replicating the experiments on ogbn-mag dataset (due to torch-geometric 2.0.0). Fixed by downgrading to torch-geometric 1.5.0 #40

Open
pdhung3012 opened this issue Sep 17, 2021 · 18 comments

Comments

@pdhung3012

Hello

My friend and I tried to replicate the experiment on the ogbn-mag dataset. We did not change anything in the configuration compared to the original code. You can see the scripts I ran here:
https://github.com/pdhung3012/pyHGT/blob/master/ogbn-mag/preprocess_ogbn_mag_RTX3090.py
https://github.com/pdhung3012/pyHGT/blob/master/ogbn-mag/train_ogbn_mag_RTX3090.py
https://github.com/pdhung3012/pyHGT/blob/master/ogbn-mag/eval_ogbn_mag_RTX3090.py

However, we could not reach the accuracy of around 0.5 that you reported using a Tesla graphics card. Here is the configuration of our machines:

Python 3.8
Nvidia RTX 3090 (my friend used a Titan Xp)
PyTorch 1.8.0
CUDA 11.1

Configuration:
/home/hungphd/anaconda3/envs/py38/bin/python /home/hungphd/git/pyHGT/ogbn-mag/eval_ogbn_mag_RTX3090.py --prev_norm --last_norm --use_RTE
+--------------+-----------------------+
| Parameter    | Value                 |
+--------------+-----------------------+
| data_dir     | dataset_v1/OGB_MAG.pk |
| model_dir    | ./hgt_4layer          |
| task_type    | variance_reduce       |
| vr_num       | 8                     |
| n_pool       | 8                     |
| n_batch      | 32                    |
| batch_size   | 128                   |
| conv_name    | hgt                   |
| n_hid        | 512                   |
| n_heads      | 8                     |
| n_layers     | 4                     |
| cuda         | 0                     |
| dropout      | 0.200                 |
| sample_depth | 6                     |
| sample_width | 520                   |
| prev_norm    | 1                     |
| last_norm    | 1                     |
| use_RTE      | 1                     |
+--------------+-----------------------+

We both achieved quite low accuracy:
Model #Params: 21173389
eval: 100%|██████████| 328/328 [1:07:18<00:00, 12.31s/it, accuracy=0.002]
0.0021459739144948616

Here is what we see when training:
Epoch: 93 LR: 0.00004 Train Loss: 4.2447 Train Acc: 0.1353 Valid Acc: 0.0956 Test Acc: 0.0021
Data Preparation: 17.0s
Epoch: 94 LR: 0.00003 Train Loss: 4.2646 Train Acc: 0.1341 Valid Acc: 0.0928 Test Acc: 0.0021
Data Preparation: 16.2s
Epoch: 95 LR: 0.00003 Train Loss: 4.2497 Train Acc: 0.1360 Valid Acc: 0.0945 Test Acc: 0.0017
Data Preparation: 12.8s
Epoch: 96 LR: 0.00002 Train Loss: 4.2595 Train Acc: 0.1328 Valid Acc: 0.0989 Test Acc: 0.0021
Data Preparation: 13.3s
Epoch: 97 LR: 0.00002 Train Loss: 4.2609 Train Acc: 0.1349 Valid Acc: 0.1002 Test Acc: 0.0016
Data Preparation: 13.3s
Epoch: 98 LR: 0.00001 Train Loss: 4.2525 Train Acc: 0.1352 Valid Acc: 0.0954 Test Acc: 0.0012
Data Preparation: 14.1s
Epoch: 99 LR: 0.00001 Train Loss: 4.2530 Train Acc: 0.1346 Valid Acc: 0.0967 Test Acc: 0.0021
Data Preparation: 12.8s
Epoch: 100 LR: 0.00000 Train Loss: 4.2650 Train Acc: 0.1340 Valid Acc: 0.0981 Test Acc: 0.0019
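
For context on how low these numbers are: ogbn-mag is a venue-classification task, and assuming the 349 venue classes given in the OGB dataset description, a uniformly random guess would score a little under 0.3%. The sketch below computes that baseline (the helper names are illustrative and not part of pyHGT). Under that assumption, a test accuracy near 0.002 is at or below chance level, even though the training accuracy is well above it.

```python
import math

# Assumption: ogbn-mag has 349 venue classes (per the OGB dataset description).
NUM_CLASSES = 349


def chance_accuracy(num_classes: int) -> float:
    """Expected accuracy of a uniformly random guess."""
    return 1.0 / num_classes


def uniform_cross_entropy(num_classes: int) -> float:
    """Cross-entropy loss of a model that predicts a uniform distribution."""
    return math.log(num_classes)


if __name__ == "__main__":
    print(f"chance accuracy:    {chance_accuracy(NUM_CLASSES):.4f}")
    print(f"uniform-guess loss: {uniform_cross_entropy(NUM_CLASSES):.3f}")
```

A gap this large between training and test accuracy may point to a problem in how the test split is built or fed to the model rather than to training itself, though the thread does not confirm the cause.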

Is there a compatibility problem between the code and newer versions of PyTorch, CUDA, or the graphics card?

Sincerely

@acbull
Owner

acbull commented Sep 17, 2021

The log is very weird. I guess it's probably due to some PyG update. I'll take a look at it later. In the meantime, could you try our reported PyG version?

@pdhung3012
Author

The log is very weird. I guess it's probably due to some PyG update. I'll take a look at it later. In the meantime, could you try our reported PyG version?

It is 2.0.0. I checked by running pip install, which reports the version because PyG is already installed:
Requirement already satisfied: torch-geometric in /home/hungphd/.local/lib/python3.8/site-packages (2.0.0)
Requirement already satisfied: jinja2 in /home/hungphd/.local/lib/python3.8/site-packages (from torch-geometric) (3.0.1)
Requirement already satisfied: numpy in /home/hungphd/anaconda3/envs/py38/lib/python3.8/site-packages (from torch-geometric) (1.20.3)
Requirement already satisfied: googledrivedownloader in /home/hungphd/.local/lib/python3.8/site-packages (from torch-geometric) (0.4)
Requirement already satisfied: pandas in /home/hungphd/.local/lib/python3.8/site-packages (from torch-geometric) (1.3.3)
Requirement already satisfied: scipy in /home/hungphd/.local/lib/python3.8/site-packages (from torch-geometric) (1.7.1)
Requirement already satisfied: scikit-learn in /home/hungphd/.local/lib/python3.8/site-packages (from torch-geometric) (0.24.2)
Requirement already satisfied: yacs in /home/hungphd/.local/lib/python3.8/site-packages (from torch-geometric) (0.1.8)
Requirement already satisfied: tqdm in /home/hungphd/.local/lib/python3.8/site-packages (from torch-geometric) (4.62.2)
Requirement already satisfied: pyparsing in /home/hungphd/.local/lib/python3.8/site-packages (from torch-geometric) (2.4.7)
Requirement already satisfied: networkx in /home/hungphd/.local/lib/python3.8/site-packages (from torch-geometric) (2.6.3)
Requirement already satisfied: rdflib in /home/hungphd/.local/lib/python3.8/site-packages (from torch-geometric) (6.0.0)
Requirement already satisfied: requests in /home/hungphd/anaconda3/envs/py38/lib/python3.8/site-packages (from torch-geometric) (2.26.0)
Requirement already satisfied: PyYAML in /home/hungphd/anaconda3/envs/py38/lib/python3.8/site-packages (from torch-geometric) (5.4.1)
Requirement already satisfied: MarkupSafe>=2.0 in /home/hungphd/.local/lib/python3.8/site-packages (from jinja2->torch-geometric) (2.0.1)
Requirement already satisfied: python-dateutil>=2.7.3 in /home/hungphd/anaconda3/envs/py38/lib/python3.8/site-packages (from pandas->torch-geometric) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /home/hungphd/anaconda3/envs/py38/lib/python3.8/site-packages (from pandas->torch-geometric) (2021.1)
Requirement already satisfied: six>=1.5 in /home/hungphd/anaconda3/envs/py38/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas->torch-geometric) (1.15.0)
Requirement already satisfied: setuptools in /home/hungphd/anaconda3/envs/py38/lib/python3.8/site-packages (from rdflib->torch-geometric) (52.0.0.post20210125)
Requirement already satisfied: isodate in /home/hungphd/.local/lib/python3.8/site-packages (from rdflib->torch-geometric) (0.6.0)
Requirement already satisfied: charset-normalizer~=2.0.0 in /home/hungphd/anaconda3/envs/py38/lib/python3.8/site-packages (from requests->torch-geometric) (2.0.5)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /home/hungphd/anaconda3/envs/py38/lib/python3.8/site-packages (from requests->torch-geometric) (1.26.6)
Requirement already satisfied: idna<4,>=2.5 in /home/hungphd/anaconda3/envs/py38/lib/python3.8/site-packages (from requests->torch-geometric) (3.2)
Requirement already satisfied: certifi>=2017.4.17 in /home/hungphd/anaconda3/envs/py38/lib/python3.8/site-packages (from requests->torch-geometric) (2021.5.30)
Requirement already satisfied: threadpoolctl>=2.0.0 in /home/hungphd/.local/lib/python3.8/site-packages (from scikit-learn->torch-geometric) (2.2.0)
Requirement already satisfied: joblib>=0.11 in /home/hungphd/.local/lib/python3.8/site-packages (from scikit-learn->torch-geometric) (1.0.1)
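
Reading the version out of pip install output works, but the same information can be obtained more directly with `pip show torch-geometric` or from Python. The snippet below is a minimal sketch using the standard-library `importlib.metadata` module (available from Python 3.8). The package list is an assumption about which distributions matter for this repository, so adjust it as needed.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional

# Assumed list of relevant PyPI distribution names; adjust to your environment.
PACKAGES = [
    "torch",
    "torch-geometric",
    "torch-scatter",
    "torch-sparse",
    "torch-cluster",
    "ogb",
]


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def report_versions(names: Iterable[str] = PACKAGES) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version (or None)."""
    return {name: installed_version(name) for name in names}


if __name__ == "__main__":
    for name, version in report_versions().items():
        print(f"{name:16s} {version or 'not installed'}")
```

Pasting this kind of report into an issue makes it easier to spot a mismatch between PyTorch and the PyG companion packages.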

@pdhung3012
Author

pdhung3012 commented Sep 17, 2021

The strange thing is that when we ran the experiment on the OAG dataset with your original configuration, we got almost the same accuracy as reported in the paper (around 51% in MRR).
https://nl-pl-hgt.slack.com/files/U02DV3WU2P7/F02EB7ELFH7/image.png

@pdhung3012
Author

The log is very weird. I guess it's probably due to some PyG update. I'll take a look at it later. In the meantime, could you try our reported PyG version?

I can try that version (torch-geometric 1.3.2). However, I think the code should also work with newer versions of torch-geometric. Downgrading would require downgrading torch-scatter and torch-sparse as well, and those might not be compatible with PyTorch 1.8.1 (it is impossible to install PyTorch 1.3 on the latest RTX 3090).

@pdhung3012
Author

I downgraded to torch-geometric 1.3.2 and got this error:
/home/hungphd/anaconda3/envs/py38/bin/python /home/hungphd/git/pyHGT/ogbn-mag/preprocess_ogbn_mag_RTX3090.py
Traceback (most recent call last):
  File "/home/hungphd/git/pyHGT/ogbn-mag/preprocess_ogbn_mag_RTX3090.py", line 25, in <module>
    dataset = PygNodePropPredDataset(name='ogbn-mag')
  File "/home/hungphd/anaconda3/envs/py38/lib/python3.8/site-packages/ogb/nodeproppred/dataset_pyg.py", line 69, in __init__
    self.data, self.slices = torch.load(self.processed_paths[0])
  File "/home/hungphd/anaconda3/envs/py38/lib/python3.8/site-packages/torch/serialization.py", line 592, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/home/hungphd/anaconda3/envs/py38/lib/python3.8/site-packages/torch/serialization.py", line 851, in _load
    result = unpickler.load()
ModuleNotFoundError: No module named 'torch_geometric.data.storage'

and the training file has this error:
Traceback (most recent call last):
  File "/home/hungphd/git/pyHGT/ogbn-mag/train_ogbn_mag_RTX3090.py", line 110, in <module>
    gnn = GNN(conv_name = args.conv_name, in_dim = len(graph.node_feature['paper'][0]), \
  File "/home/hungphd/git/pyHGT/ogbn-mag/pyHGT/model.py", line 66, in __init__
    self.gcs.append(GeneralConv(conv_name, n_hid, n_hid, num_types, num_relations, n_heads, dropout, use_norm = prev_norm, use_RTE = use_RTE))
  File "/home/hungphd/git/pyHGT/ogbn-mag/pyHGT/conv.py", line 308, in __init__
    self.base_conv = HGTConv(in_hid, out_hid, num_types, num_relations, n_heads, dropout, use_norm, use_RTE)
  File "/home/hungphd/git/pyHGT/ogbn-mag/pyHGT/conv.py", line 13, in __init__
    super(HGTConv, self).__init__(node_dim=0, aggr='add', **kwargs)
TypeError: __init__() got an unexpected keyword argument 'node_dim'

I guess I need to downgrade torch-cluster, torch-scatter, and torch-sparse too, but that is impossible because it would require a PyTorch version that the RTX 3090 does not support.
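
The `TypeError` in the training script says that the base class constructor called from `HGTConv.__init__` does not accept a `node_dim` keyword in the installed torch-geometric version. A quick way to confirm this kind of mismatch without running a full training job is to inspect the constructor signature. The sketch below uses two hypothetical stand-in classes (`OldBase` and `NewBase`) so that it runs without torch-geometric installed; they are not the library's actual definitions.

```python
import inspect


def accepts_keyword(callable_obj, name: str) -> bool:
    """Return True if callable_obj can be called with the keyword `name`."""
    params = inspect.signature(callable_obj).parameters
    if name in params:
        return params[name].kind in (
            inspect.Parameter.POSITIONAL_OR_KEYWORD,
            inspect.Parameter.KEYWORD_ONLY,
        )
    # A **kwargs catch-all also accepts the keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Hypothetical stand-ins for an older and a newer message-passing base class.
class OldBase:
    def __init__(self, aggr="add", flow="source_to_target"):
        self.aggr = aggr
        self.flow = flow


class NewBase:
    def __init__(self, aggr="add", flow="source_to_target", node_dim=0):
        self.aggr = aggr
        self.flow = flow
        self.node_dim = node_dim


if __name__ == "__main__":
    print("OldBase accepts node_dim:", accepts_keyword(OldBase, "node_dim"))
    print("NewBase accepts node_dim:", accepts_keyword(NewBase, "node_dim"))
```

With torch-geometric installed, the same helper could be pointed at `torch_geometric.nn.MessagePassing` to see whether the installed release supports the arguments that `conv.py` passes.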

@pdhung3012
Author

I checked and saw that the latest update to ogbn-mag was 7 months ago, when torch-geometric 1.6.3 was the latest version.

I tried to run the preprocess_.py but it still has the same error as with 1.3. However, the train_.py is working now. I will share the results with you when they are available.
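
The `No module named 'torch_geometric.data.storage'` error is raised while unpickling the cached processed dataset. That suggests the cache was written under torch-geometric 2.0.0 and is being loaded under an older release that lacks that module. If that is the cause (an assumption, not confirmed in this thread), deleting the `processed` directory forces the dataset to be rebuilt with the installed version. The sketch below lists or removes such directories; the `dataset/ogbn_mag` root is an assumed location and the function names are illustrative.

```python
import shutil
from pathlib import Path
from typing import List


def find_processed_dirs(root: Path) -> List[Path]:
    """Return every directory named 'processed' under root."""
    return sorted(p for p in root.rglob("processed") if p.is_dir())


def clear_processed_cache(root: Path, dry_run: bool = True) -> List[Path]:
    """Delete cached 'processed' directories; only list them when dry_run is True."""
    targets = find_processed_dirs(root)
    if not dry_run:
        for path in targets:
            # A nested match may already be gone if its parent was removed.
            if path.exists():
                shutil.rmtree(path)
    return targets


if __name__ == "__main__":
    # Assumed location: adjust to wherever the dataset was downloaded.
    dataset_root = Path("dataset/ogbn_mag")
    if dataset_root.exists():
        for path in clear_processed_cache(dataset_root, dry_run=True):
            print("would remove:", path)
    else:
        print("dataset root not found:", dataset_root)
```

The default is a dry run so that nothing is deleted until the listed paths have been checked. The raw download is left untouched, so only the processing step has to be repeated.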

@pdhung3012
Author

With torch-geometric 1.6.3, the accuracy is also very low in my training run:

Epoch: 95 LR: 0.00003 Train Loss: 4.0600 Train Acc: 0.1657 Valid Acc: 0.0555 Test Acc: 0.0022
Data Preparation: 18.8s
Epoch: 96 LR: 0.00002 Train Loss: 4.0500 Train Acc: 0.1670 Valid Acc: 0.0575 Test Acc: 0.0016
Data Preparation: 19.1s
Epoch: 97 LR: 0.00002 Train Loss: 4.0503 Train Acc: 0.1668 Valid Acc: 0.0524 Test Acc: 0.0018
Data Preparation: 18.9s
Epoch: 98 LR: 0.00001 Train Loss: 4.0638 Train Acc: 0.1650 Valid Acc: 0.0619 Test Acc: 0.0021
Data Preparation: 19.4s
Epoch: 99 LR: 0.00001 Train Loss: 4.0438 Train Acc: 0.1680 Valid Acc: 0.0591 Test Acc: 0.0016
Data Preparation: 19.8s
Epoch: 100 LR: 0.00000 Train Loss: 4.0554 Train Acc: 0.1661 Valid Acc: 0.0592 Test Acc: 0.0019

@pdhung3012
Author

It seems that I got the code working with much better results using torch-geometric 1.5.0 (released in May 2020). I also had to make some small fixes to your code. For example, the preprocess.py file is missing the declaration of Evaluator, and node_year_dict should be updated to node_year.

Here is what I got. Does that seem normal?

Epoch: 51 LR: 0.00026 Train Loss: 1.4745 Train Acc: 0.5646 Valid Acc: 0.4391 Test Acc: 0.4235
Data Preparation: 22.6s
Epoch: 52 LR: 0.00025 Train Loss: 1.4691 Train Acc: 0.5634 Valid Acc: 0.4395 Test Acc: 0.4405
Data Preparation: 21.9s
Epoch: 53 LR: 0.00025 Train Loss: 1.4615 Train Acc: 0.5677 Valid Acc: 0.4418 Test Acc: 0.4142
Data Preparation: 21.6s
Epoch: 54 LR: 0.00024 Train Loss: 1.4544 Train Acc: 0.5687 Valid Acc: 0.4288 Test Acc: 0.4125
Data Preparation: 21.4s
Epoch: 55 LR: 0.00024 Train Loss: 1.4552 Train Acc: 0.5692 Valid Acc: 0.4374 Test Acc: 0.4250
Data Preparation: 21.2s
UPDATE!!! 0.4516340750319941
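
The per-epoch lines in these logs follow a fixed pattern, which makes it easy to compare runs across torch-geometric versions programmatically. The following is a minimal sketch of a parser for that pattern. The function names are illustrative, and the regular expression is written only against the log format shown in this thread.

```python
import re
from typing import Dict, List

# Matches one "Epoch: ... Test Acc: ..." record; whitespace may be spaces or newlines.
EPOCH_RE = re.compile(
    r"Epoch:\s*(?P<epoch>\d+)\s+LR:\s*(?P<lr>[\d.]+)\s+"
    r"Train Loss:\s*(?P<train_loss>[\d.]+)\s+Train Acc:\s*(?P<train_acc>[\d.]+)\s+"
    r"Valid Acc:\s*(?P<valid_acc>[\d.]+)\s+Test Acc:\s*(?P<test_acc>[\d.]+)"
)


def parse_epochs(log_text: str) -> List[Dict[str, float]]:
    """Extract one dict of metrics per epoch record found in log_text."""
    rows = []
    for match in EPOCH_RE.finditer(log_text):
        row = {key: float(value) for key, value in match.groupdict().items()}
        row["epoch"] = int(match.group("epoch"))
        rows.append(row)
    return rows


def best_by_valid(rows: List[Dict[str, float]]) -> Dict[str, float]:
    """Return the epoch record with the highest validation accuracy."""
    return max(rows, key=lambda row: row["valid_acc"])


if __name__ == "__main__":
    sample = (
        "Epoch: 52 LR: 0.00025 Train Loss: 1.4691 Train Acc: 0.5634 "
        "Valid Acc: 0.4395 Test Acc: 0.4405 Data Preparation: 21.9s "
        "Epoch: 53 LR: 0.00025 Train Loss: 1.4615 Train Acc: 0.5677 "
        "Valid Acc: 0.4418 Test Acc: 0.4142"
    )
    best = best_by_valid(parse_epochs(sample))
    print("best epoch by validation accuracy:", best["epoch"])
```

Selecting the epoch by validation accuracy rather than test accuracy keeps the comparison between library versions fair.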

@acbull
Owner

acbull commented Sep 18, 2021 via email

@pdhung3012
Author

Hi: Thanks for figuring out the problem. Do you happen to know which part made your previous experiment fail?

Let me check to see which part caused the problem. I got a final test accuracy of 47% (I guess there are still some bugs, since the reported accuracy is around 50%). I think the ogbn-mag code in the repository currently works with torch-geometric 1.5.0, but not with the latest 2.0.0 or with the reported 1.3.2.
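
Since only one torch-geometric release reproduced the results in this thread, a start-up check in the training script could warn when a different release is installed. The sketch below is one way to do that with the standard library only. `KNOWN_GOOD` encodes the 1.5.0 finding from this thread as an assumption, and the helper names are illustrative rather than part of pyHGT.

```python
import re
import warnings
from importlib import metadata
from typing import Optional, Tuple

# Assumption from this thread: 1.5.0 reproduced the results; others are untested.
KNOWN_GOOD = (1, 5, 0)


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '1.5.0' or '2.0.0+cu111' into a tuple of its leading integers."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def matches_known_good(version_text: str, known_good=KNOWN_GOOD) -> bool:
    """True when the leading version components equal the known-good release."""
    return parse_version(version_text)[: len(known_good)] == tuple(known_good)


def warn_if_untested(dist_name: str = "torch-geometric") -> Optional[str]:
    """Warn if the installed release differs from KNOWN_GOOD; return its version."""
    try:
        installed = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None
    if not matches_known_good(installed):
        expected = ".".join(str(part) for part in KNOWN_GOOD)
        warnings.warn(
            f"{dist_name} {installed} is installed; results in this thread "
            f"were reproduced with {expected}."
        )
    return installed


if __name__ == "__main__":
    print("installed torch-geometric:", warn_if_untested())
```

A warning rather than a hard failure is used here because the thread later reports that a torch-geometric 2.0 implementation also reaches similar accuracy.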

@pdhung3012 changed the title from "Got low accuracy when replicating the experiments on ogbn-mag dataset" to "Got low accuracy when replicating the experiments on ogbn-mag dataset (due to torch-geometric 2.0.0). Fixed by downgrading to torch-geometric 1.5.0" on Sep 18, 2021
@pdhung3012
Author

Hi
Here is what I got from running eval....py:

/home/hungphd/anaconda3/envs/py38/bin/python /home/hungphd/git/oldCommits/pyHGT/ogbn-mag/eval_ogbn_mag.py --task_type variance_reduce --prev_norm --last_norm --use_RTE
+--------------+--------------------+
| Parameter    | Value              |
+--------------+--------------------+
| data_dir     | dataset/OGB_MAG.pk |
| model_dir    | ./hgt_4layer       |
| task_type    | variance_reduce    |
| vr_num       | 8                  |
| n_pool       | 8                  |
| n_batch      | 32                 |
| batch_size   | 128                |
| conv_name    | hgt                |
| n_hid        | 512                |
| n_heads      | 8                  |
| n_layers     | 4                  |
| cuda         | 0                  |
| dropout      | 0.200              |
| sample_depth | 6                  |
| sample_width | 520                |
| prev_norm    | 1                  |
| last_norm    | 1                  |
| use_RTE      | 1                  |
+--------------+--------------------+
Model #Params: 21173389
eval: 100%|██████████| 328/328 [1:10:02<00:00, 12.81s/it, accuracy=0.489]
0.4934786237153962

@acbull
Owner

acbull commented Sep 24, 2021

Hi:

I just noticed that the PyG team re-implemented the HGT model using their updated API. Have you tried it?

https://github.com/pyg-team/pytorch_geometric/blob/master/examples/hetero/hgt_dblp.py

@pdhung3012
Author

No, I haven't, since I downgraded torch-geometric to the version your code works with. I will try it after I finish my experiments on my current torch-geometric version.
Compared to your version, I don't see any graph sampling code in the new HGT model. Is that correct?

@acbull
Owner

acbull commented Sep 24, 2021 via email

@pdhung3012
Author

I see

@pdhung3012
Author

I will check and notify you

@mxdlzg

mxdlzg commented Dec 13, 2021

Any updates?

@pdhung3012
Author

Any updates?

Sorry for the late reply. I am able to run the ogbn-mag experiments with both this version (torch-geometric 1.5) and the newer version by Microsoft (torch-geometric 2.0). The accuracies of the two versions are quite similar, at around 50%.
