Add option to remove duplicate samples in training #259
danstan5 wants to merge 2 commits into huggingface:main
Conversation
My main concern with these changes is that they propose that we:
- generate a large list of pairs,
- reduce it
Intuitively, it should be simpler to just generate a "proper" list from the get-go. Let us consider an example. A user has a binary text classification task. They have labeled 16 text samples for this task. They set their num_iterations to 20 and remove_duplicate_samples to False.
Then, the Trainer will initialize 16 * 20 * 2 = 640 pairs. (The * 2 is due to a positive and a negative pair being added every time) These pairs likely contain duplicates.
In another example, the user sets remove_duplicate_samples to True, but keeps all other parameters the same. The first time the model is trained, it might use 500 pairs. On a second run, it may train with just 350, or with 600. There is an inconsistency here, which may fairly strongly affect the training (speed & accuracy).
I propose an alternative solution which solves the same problem: a unique_pairs argument. This argument is passed to sentence_pairs_generation and sentence_pairs_generation_multilabel and ensures that no two pairs are identical. These functions likely perform a sampling of all possible combinations for efficiency, and do their best to return exactly num_samples * num_iterations * 2 pairs. This may not always be possible, as there is a bound on the number of possible pairs; in that case a warning can be raised.
With this approach, users keep strict control over the number of training pairs used. It should have the same training efficiency benefits as your proposal.
What are your thoughts on this? If desired, I could try to work on this.
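A rough sketch of the proposed behaviour (the helper name `generate_unique_pairs` and its signature are hypothetical, not the actual setfit API): enumerate each unordered pair once, bucket by label agreement, then sample without replacement so duplicates cannot occur.

```python
import itertools
import random
import warnings


def generate_unique_pairs(sentences, labels, num_target_pairs, seed=42):
    """Hypothetical sketch: sample unique positive/negative pairs up-front
    instead of generating duplicates and pruning them afterwards."""
    rng = random.Random(seed)
    positives, negatives = [], []
    # Enumerate every unordered pair exactly once, bucketed by label agreement.
    for (s1, l1), (s2, l2) in itertools.combinations(zip(sentences, labels), 2):
        (positives if l1 == l2 else negatives).append((s1, s2))
    half = num_target_pairs // 2
    if half > len(positives) or half > len(negatives):
        # Bound on the number of possible unique pairs reached.
        warnings.warn("Fewer unique pairs exist than requested; returning all.")
    pos = rng.sample(positives, min(half, len(positives)))
    neg = rng.sample(negatives, min(half, len(negatives)))
    return pos, neg


# Toy usage: 4 samples, 2 classes, ask for 4 pairs (2 positive + 2 negative).
pos, neg = generate_unique_pairs(["a", "b", "c", "d"], [0, 0, 1, 1], 4)
print(len(pos), len(neg))  # → 2 2
```

Because sampling is without replacement, the pair count is deterministic for a given seed, which addresses the run-to-run inconsistency described above.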
```python
parser.add_argument("--keep_body_frozen", default=False, action="store_true")
parser.add_argument("--add_data_augmentation", default=False)
parser.add_argument("--remove_duplicate_samples", type=bool, default=False)
parser.add_argument("--train_time", type=bool, default=False)
```
Good idea to add a training time argument here.
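As an aside on the diff above: `type=bool` is a well-known argparse pitfall. argparse calls `bool()` on the raw string, and any non-empty string is truthy, so `--remove_duplicate_samples False` still yields `True`. A minimal demonstration (using placeholder flag names, not the script's actual ones):

```python
import argparse

parser = argparse.ArgumentParser()
# Pitfall: bool("False") is True, because any non-empty string is truthy.
parser.add_argument("--buggy", type=bool, default=False)
# The pattern --keep_body_frozen already uses works as expected.
parser.add_argument("--fixed", default=False, action="store_true")

args = parser.parse_args(["--buggy", "False"])
print(args.buggy)  # → True, despite the user passing "False"
print(args.fixed)  # → False (flag not passed)
```

`action="store_true"`, as used for `--keep_body_frozen`, avoids this entirely.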
```python
def sentence_pairs_remove_duplicates(sentences: List[InputExample]):
    key_pairs = set()
    rm_duplicate_sentences = []
    for s in sentences:
        key = tuple(sorted(s.texts))
        if key not in key_pairs:
            key_pairs.add(key)
            if key[0] != key[1]:
                rm_duplicate_sentences.append(s)
    return rm_duplicate_sentences
```
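Run against a toy input, the function behaves like this (the function body is repeated so the snippet runs standalone, with a minimal stand-in for `sentence_transformers.InputExample`):

```python
from typing import List


class InputExample:
    """Minimal stand-in for sentence_transformers.InputExample, demo only."""
    def __init__(self, texts, label=0):
        self.texts = texts
        self.label = label


def sentence_pairs_remove_duplicates(sentences: List[InputExample]):
    key_pairs = set()
    rm_duplicate_sentences = []
    for s in sentences:
        key = tuple(sorted(s.texts))  # order-insensitive key: (a, b) == (b, a)
        if key not in key_pairs:
            key_pairs.add(key)
            if key[0] != key[1]:      # also drops self-pairs (a text paired with itself)
                rm_duplicate_sentences.append(s)
    return rm_duplicate_sentences


pairs = [
    InputExample(texts=["a", "b"]),
    InputExample(texts=["b", "a"]),  # reversed duplicate of the first, dropped
    InputExample(texts=["a", "a"]),  # self-pair, dropped by key[0] != key[1]
    InputExample(texts=["a", "c"]),
]
print(len(sentence_pairs_remove_duplicates(pairs)))  # → 2
```

Note that the `key[0] != key[1]` check removes self-pairs in addition to duplicates, which is the point raised in the discussion below.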
No. Tbh I wrote these changes ages ago but only got around to evaluating on a public dataset / opening the PR yesterday.

For my use case I was slashing as much data as possible to speed up training; this is a remnant of that which I forgot I added!

I can run some tests again with 2. applied or not, and we can see what the analysis shows. My intuition says labelling identical pairs wouldn't improve performance, but we can see..

But I will hold off until we have some resolution on whether it should be applied (based on your first comment @tomaarsen) 👍
I agree it seems unintuitive to generate a list and then reduce it. My original thinking was that conceptually it is understandable to say "oh, we've removed 80% of the data as duplicates, that will greatly speed up my training time". On your idea of generating the "proper" list, I've also added more fundamental thoughts on #258. I'd be happy to add the `unique_pairs` argument. The only outstanding point for me is the case of >2 classes. Again, happy to add it.
I agree that it sounds like a sensible approach, and I understand the issue regarding 3+ classes.

That all said, I'm a little wary of creating a large imbalance in positive/negative pairs, as I'm not sure whether the literature on contrastive learning favors having more or fewer negative pairs. I'm concerned about the situation where someone has e.g. 100 samples per class, with 50 classes: with unique pairs enforced, there are far more possible negative pairs than positive ones.

One solution is simply to get the number of unique positive pairs, and then randomly sample equally many negative pairs.
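To put a number on the imbalance in the 100-samples-per-class, 50-classes example: the counts of unique positive (same-class) and negative (cross-class) pairs follow directly from binomial coefficients.

```python
from math import comb

n_classes, per_class = 50, 100
total = n_classes * per_class

positives = n_classes * comb(per_class, 2)  # unique same-class pairs
negatives = comb(total, 2) - positives      # unique cross-class pairs

print(positives)              # → 247500
print(negatives)              # → 12250000
print(negatives / positives)  # → ~49.5 negatives per positive
```

So with unique pairs alone, negatives outnumber positives roughly 50:1, which is why subsampling negatives down to the positive count (as suggested above) may be needed.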
Work equivalent to this PR will continue in #268.
Related #258: Adds an optional bool parameter `remove_duplicate_samples` to the SetFit `train` method. I've added a test + updated the `run_fewshot` script (for reproducing the test set results). Let me know any further thoughts or extras required, thanks!