Duplicate graphs in synthetic benchmark #5

Open · seanli3 opened this issue May 22, 2023 · 1 comment

Comments

seanli3 commented May 22, 2023

Thank you for publishing the synthetic dataset for cut vertices and edges.

I'm trying to run the experiment, but I found one issue that I would like to ask you to clarify.

In synthetic_wrapper.py, you generate a replacement graph for every item in PygPCQM4Mv2Dataset (line 395), so in total there are 3746620 graphs. But because the new graphs are generated from a small set of parameters using the five generators counter_example_1_1, counter_example_1_2, counter_example_2_1, counter_example_2_2 and counter_exmple_3, most of the 3746620 graphs are actually duplicates.
For example, I counted unique graphs by comparing the edge_index tensors, and only 1024 out of 3746620 are unique, which means only about 0.03% of the graphs are distinct. I expect I would find even more duplicates if I ran proper isomorphism tests on those 1024 graphs.
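For reference, a minimal sketch of how such a count could be done is below. It assumes the dataset yields torch_geometric.data.Data objects and hashes the raw edge_index bytes, so it only detects exact duplicates, not isomorphic graphs with different node orderings; the helper names are mine, not from the repo.

```python
import hashlib

def edge_index_key(data):
    # Hash the raw bytes of the edge_index tensor.
    # This is an exact-match proxy: isomorphic graphs with different
    # node orderings will still be counted as distinct.
    return hashlib.sha256(data.edge_index.cpu().numpy().tobytes()).hexdigest()

def count_unique(dataset):
    # `dataset` is assumed to iterate over torch_geometric.data.Data objects,
    # e.g. the synthetic dataset produced by synthetic_wrapper.py.
    seen = set()
    for data in dataset:
        seen.add(edge_index_key(data))
    return len(seen)
```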

It feels like a waste of time to benchmark on all 3746620 graphs. Why not create a smaller set of unique graphs and benchmark on that instead?

seanli3 commented May 23, 2023

Also, just to add to my question: this setup might leak test samples into training, since graphs in the test set may be duplicates of graphs in the training set.
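A quick way to check for this kind of leakage, again using exact edge_index comparison and hypothetical split variables (train_set, test_set), could look like this:

```python
def split_overlap(train_set, test_set):
    # Count how many test graphs have an edge_index identical to some
    # training graph (uses edge_index_key from the snippet above).
    train_keys = {edge_index_key(d) for d in train_set}
    leaked = sum(edge_index_key(d) in train_keys for d in test_set)
    return leaked, len(test_set)
```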
