bug-report: data collection questions #5

Open
@BBerabi BBerabi commented Feb 9, 2023

Hello,

Thank you for your work! I have several questions about the training and data preprocessing.

In the paper, you state that in each round the fixer and the breaker are trained only on the newly generated data. Please see the screenshot of the equations. None of the equations mentions the round-0 dataset on which the initial breaker is trained.

[Screenshot: training equations from the paper]

This seemed odd to me, because I would expect the newly generated data to be merged with the existing round-0 data and the models to be trained on the joined dataset. Since the newly generated dataset is synthetic and therefore likely to be biased, the models could easily forget what they learned from the initial data, which contains real-world bugs and fixes.

At first, I thought the equations were wrong, but the text also backs them up. See the screenshot from the paper:

[Screenshot: accompanying text from the paper]

To double-check, I looked into your code, and I may have found several inconsistencies with the paper.

You do in fact merge the synthetically generated dataset with the initial dataset. See here: https://github.com/michiyasunaga/BIFI/blob/main/src/c006__generate_paired_data_from_fixer.py#L47-L52

However, the merge is done in an unusual way. The dataset for the next step is built so that 1/3 of it comes from the initial data and the other 2/3 comes from the synthetically generated data. See here: https://github.com/michiyasunaga/BIFI/blob/main/src/c006__generate_paired_data_from_fixer.py#L59-L62

Even stranger, the data points are duplicated. The total size is set to 30,000,000, and the data points are repeated to reach it, so the same samples appear in the dataset multiple times. See here: https://github.com/michiyasunaga/BIFI/blob/main/src/c006__generate_paired_data_from_fixer.py#L58-L62
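
To make the behaviour described above concrete, here is a minimal Python sketch of that kind of mixing. It is not the repository's code. The function name `mix_by_repetition`, its parameters, and the toy data are hypothetical and only illustrate the pattern as I understand it: a fixed total size, one third filled from the initial data, two thirds from the synthetic data, and each source repeated until its share is full.

```python
from itertools import cycle, islice


def mix_by_repetition(initial, synthetic, total_size):
    """Build a dataset of exactly `total_size` items.

    Hypothetical illustration: one third of the slots are filled from
    `initial`, the rest from `synthetic`. Each source is cycled, so its
    items are repeated whenever it is smaller than its share.
    """
    n_initial = total_size // 3            # share taken from the initial data
    n_synthetic = total_size - n_initial   # remaining share, about two thirds
    if (n_initial and not initial) or (n_synthetic and not synthetic):
        raise ValueError("cannot fill a non-empty share from an empty source")
    mixed = list(islice(cycle(initial), n_initial))
    mixed += list(islice(cycle(synthetic), n_synthetic))
    return mixed


# Toy data standing in for the real (bad, good) code pairs.
initial_data = ["real_0", "real_1"]
synthetic_data = ["syn_0", "syn_1", "syn_2"]

mixed = mix_by_repetition(initial_data, synthetic_data, total_size=12)
print(len(mixed), mixed.count("real_0"), mixed.count("syn_0"))
```

With a target size much larger than either source, every sample ends up in the result many times, which is the duplication in question. Whether this acts as a deliberate re-weighting of the two sources or is an unintended side effect is exactly what I would like to understand.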

Unless I missed something, the paper does not mention any of this, and I do not understand the motivation behind these choices.

Could you please clarify these issues?

Thanks in advance!

Best,
Berkay Berabi
