bug-report: data collection questions #5

BBerabi · 2023-02-09T19:49:42Z

Hello,

Thank you for your work! I have several questions about the training and data preprocessing.

In the paper, you say that in each round the fixer and the breaker are trained only on the newly generated data. Please see the screenshot and the equations. None of the equations mentions the round-0 dataset on which the initial breaker is trained.

This seemed a bit odd to me. I would expect the newly generated data to be merged with the existing round-0 data, and the model to be trained on the combined dataset. The newly generated data is synthetic and therefore likely to be biased, so a model trained only on it could easily forget what it learned from the initial data, which contains real-world bugs and fixes.

At first, I thought the equations were wrong, but the text also backs them up. See the screenshot from the paper:

To double-check, I looked into your code and found what seem to be several inconsistencies with the paper.

Contrary to what the paper describes, you do merge the synthetically generated dataset with the initial dataset. See here: https://github.com/michiyasunaga/BIFI/blob/main/src/c006__generate_paired_data_from_fixer.py#L47-L52

However, the merge is done in an unusual way. The dataset for the next step takes 1/3 of its samples from the initial data and the other 2/3 from the synthetically generated data. See here: https://github.com/michiyasunaga/BIFI/blob/main/src/c006__generate_paired_data_from_fixer.py#L59-L62

Even more surprising, data points are duplicated. The total size is set to 30’000’000, and you repeat the data points, so the same samples appear in the dataset multiple times. See here: https://github.com/michiyasunaga/BIFI/blob/main/src/c006__generate_paired_data_from_fixer.py#L58-L62
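To make the question concrete, here is a minimal sketch of the mixing as I understand it. It is not the repository's code: the names `repeat_to_size` and `mix_with_repetition`, the toy sizes, and the choice of cyclic repetition (rather than, say, random sampling) are all my own assumptions for illustration.

```python
from itertools import cycle, islice


def repeat_to_size(samples, size):
    """Cycle through `samples` until exactly `size` items are produced.

    Hypothetical helper: cyclic repetition is an assumption, not
    necessarily how the repository repeats data points.
    """
    if not samples:
        raise ValueError("cannot repeat an empty dataset")
    return list(islice(cycle(samples), size))


def mix_with_repetition(initial, synthetic, total_size):
    """Build a dataset with 1/3 initial and 2/3 synthetic samples."""
    n_initial = total_size // 3            # share taken from the initial data
    n_synthetic = total_size - n_initial   # remaining share from synthetic data
    return (repeat_to_size(initial, n_initial)
            + repeat_to_size(synthetic, n_synthetic))


# Toy stand-ins for the real datasets; total_size=30 plays the role of 30'000'000.
initial = [f"real_{i}" for i in range(4)]
synthetic = [f"synth_{i}" for i in range(5)]
mixed = mix_with_repetition(initial, synthetic, total_size=30)

print(len(mixed))       # total number of training samples
print(len(set(mixed)))  # number of distinct samples among them
```

With these toy inputs, 30 training samples are built from only 9 distinct ones. With a total of 30’000’000, the same split would give 10’000’000 samples from the initial data and 20’000’000 from the synthetic data, regardless of how many distinct samples each source actually has.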

Unless I missed something, the paper does not mention any of this, and I do not understand the motivation behind these choices.

Could you please clarify these issues?

Thanks in advance!

Best,
Berkay Berabi
