Duplicate graphs in synthetic benchmark #5

Open · seanli3 opened this issue May 22, 2023 · 1 comment

Comments

seanli3 commented May 22, 2023

Thank you for publishing the synthetic dataset for cut vertices and edges.

I'm trying to run the experiment, but I found one issue that I would like to ask you to clarify.

In synthetic_wrapper.py, you generate a replacement graph for every item in PygPCQM4Mv2Dataset (line 395), so in total there are 3746620 graphs. But because the new graphs are generated from a small set of parameters using the five generators counter_example_1_1, counter_example_1_2, counter_example_2_1, counter_example_2_2 and counter_exmple_3, most of the 3746620 graphs are actually duplicates.
For example, I counted unique graphs by comparing the edge_index tensors, and only 1024 out of 3746620 are unique, which means only about 0.03% of the graphs are distinct. I expect I would find even more duplicates if I ran proper isomorphism tests on those 1024 graphs.
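For reference, a minimal sketch of how such a count could be done is below. It assumes the dataset yields torch_geometric.data.Data objects and hashes the raw edge_index bytes, so it only detects exact duplicates, not isomorphic graphs with different node orderings; the helper names are mine, not from the repo.

```python
import hashlib

def edge_index_key(data):
    # Hash the raw bytes of the edge_index tensor.
    # This is an exact-match proxy: isomorphic graphs with different
    # node orderings will still be counted as distinct.
    return hashlib.sha256(data.edge_index.cpu().numpy().tobytes()).hexdigest()

def count_unique(dataset):
    # `dataset` is assumed to iterate over torch_geometric.data.Data objects,
    # e.g. the synthetic dataset produced by synthetic_wrapper.py.
    seen = set()
    for data in dataset:
        seen.add(edge_index_key(data))
    return len(seen)
```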

It feels like a waste of time to benchmark on all 3746620 graphs. Why not create a smaller set of unique graphs and benchmark on that instead?

seanli3 commented May 23, 2023

Also, just to add to my question: this setup might leak test samples into training, since graphs in the test set may be duplicates of graphs in the training set.
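A quick way to check for this kind of leakage, again using exact edge_index comparison and hypothetical split variables (train_set, test_set), could look like this:

```python
def split_overlap(train_set, test_set):
    # Count how many test graphs have an edge_index identical to some
    # training graph (uses edge_index_key from the snippet above).
    train_keys = {edge_index_key(d) for d in train_set}
    leaked = sum(edge_index_key(d) in train_keys for d in test_set)
    return leaked, len(test_set)
```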
