
Add option to remove duplicate samples in training #259

Closed
danstan5 wants to merge 2 commits into huggingface:main from danstan5:remove-duplicate-samples

Conversation

@danstan5
Contributor

@danstan5 danstan5 commented Jan 8, 2023

Related to #258: adds an optional bool parameter remove_duplicate_samples to the SetFit train method.

I've added a test and updated the run_fewshot script (for reproducing the test_set results). Let me know if you have any further thoughts or extras required, thanks!

@tomaarsen tomaarsen mentioned this pull request Jan 9, 2023
Member

@tomaarsen tomaarsen left a comment


My main concern with these changes is that they propose that we:

  1. generate a large list of pairs,
  2. reduce it

Intuitively, it should be simpler to just generate a "proper" list from the get-go. Let us consider an example: a user has a binary text classification task with 16 labeled text samples. They set num_iterations to 20 and remove_duplicate_samples to False.
The Trainer will then initialize 16 * 20 * 2 = 640 pairs (the * 2 is because a positive and a negative pair are added each time). These pairs likely contain duplicates.
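To make the duplicate problem concrete, here is a minimal sketch (not the actual SetFit code; `generate_pairs` is a hypothetical simplified stand-in for `sentence_pairs_generation`) showing that 16 binary-labeled samples with num_iterations=20 produce 640 pairs drawn from a much smaller pool of unique pairs, so duplicates are unavoidable:

```python
import random

def generate_pairs(texts, labels, num_iterations, seed=0):
    """Simplified sketch of SetFit-style pair generation: each iteration
    adds one positive and one negative pair per sample."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_iterations):
        for text, label in zip(texts, labels):
            # Candidates sharing the label (excluding self) and with other labels
            positives = [t for t, l in zip(texts, labels) if l == label and t != text]
            negatives = [t for t, l in zip(texts, labels) if l != label]
            pairs.append((text, rng.choice(positives), 1))
            pairs.append((text, rng.choice(negatives), 0))
    return pairs

texts = [f"sample {i}" for i in range(16)]
labels = [i % 2 for i in range(16)]
pairs = generate_pairs(texts, labels, num_iterations=20)
print(len(pairs))  # 16 * 20 * 2 = 640

# Only 2 * C(8, 2) = 56 unique positive and 8 * 8 = 64 unique negative
# unordered pairs exist, so the 640 generated pairs must contain duplicates.
unique = {(tuple(sorted(p[:2])), p[2]) for p in pairs}
print(len(unique) < len(pairs))  # True
```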

In another example, the user sets remove_duplicate_samples to True, but keeps all other parameters the same. The first time the model is trained, it does so using 500 pairs. On another run, it may train with just 350, or 600. This inconsistency may fairly strongly affect the training (speed & accuracy).

I propose an alternative that covers the same use case: a unique_pairs argument. This argument is passed to sentence_pairs_generation and sentence_pairs_generation_multilabel and ensures that no two pairs are identical. These functions would likely perform a sampling of all possible combinations for efficiency, and do their best to return exactly num_samples * num_iterations * 2 pairs. This may not always be possible, as there is a bound on the number of possible unique pairs, in which case a warning can be raised.

With this approach, users keep strict control over the number of training pairs used. It should have the same training efficiency benefits as your proposal.
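A minimal sketch of the proposed behaviour (hypothetical; `sample_unique_pairs` and its signature are not the SetFit API): draw up to the requested number of distinct pairs, warning when the pool of possible pairs is the limiting factor.

```python
import itertools
import random
import warnings

def sample_unique_pairs(texts, labels, target, seed=0):
    """Sketch of the proposed `unique_pairs` behaviour: return up to
    `target` distinct (text_a, text_b, is_positive) pairs."""
    rng = random.Random(seed)
    # Enumerate all unordered, non-identical pairs once
    positives = [(a, b, 1) for (a, la), (b, lb)
                 in itertools.combinations(zip(texts, labels), 2) if la == lb]
    negatives = [(a, b, 0) for (a, la), (b, lb)
                 in itertools.combinations(zip(texts, labels), 2) if la != lb]
    pool = positives + negatives
    if target > len(pool):
        warnings.warn(f"Only {len(pool)} unique pairs available; requested {target}.")
        target = len(pool)
    # Sampling without replacement guarantees uniqueness
    return rng.sample(pool, target)

texts = [f"text {i}" for i in range(16)]
labels = [i % 2 for i in range(16)]
# 640 requested, but only 56 positive + 64 negative = 120 unique pairs exist
pairs = sample_unique_pairs(texts, labels, target=640)
print(len(pairs))  # 120
```

Materializing every combination is O(n^2), so a real implementation would sample pairs lazily for large datasets; the sketch only illustrates the uniqueness guarantee and the warning.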

What are your thoughts on this? If desired, I could try to work on this.

parser.add_argument("--keep_body_frozen", default=False, action="store_true")
parser.add_argument("--add_data_augmentation", default=False, action="store_true")
parser.add_argument("--remove_duplicate_samples", default=False, action="store_true")
parser.add_argument("--train_time", default=False, action="store_true")
Member


Good idea to add a training time argument here.

Comment on lines +686 to +695
def sentence_pairs_remove_duplicates(sentences: List[InputExample]) -> List[InputExample]:
    key_pairs = set()
    rm_duplicate_sentences = []
    for s in sentences:
        # Order-insensitive key: (a, b) and (b, a) count as the same pair
        key = tuple(sorted(s.texts))
        if key not in key_pairs:
            key_pairs.add(key)
            # Also drop pairs whose two texts are identical
            if key[0] != key[1]:
                rm_duplicate_sentences.append(s)
    return rm_duplicate_sentences
Member


This differs slightly from my understanding of #258. In particular, this function:

  1. removes duplicate pairs
  2. removes pairs where the items are identical

I believe that #258 only mentions the former. Did you find the pairs where both items are identical to be a problem?

Contributor Author


No. Tbh I wrote these changes ages ago but only got round to evaluating on a public dataset / opening the PR yesterday.
For my use case I was slashing as much data as possible to speed up training; this is a remnant from that which I forgot I'd added!

I can run some tests again with and without 2. applied, and we can see what the analysis shows. My intuition says labelling identical pairs wouldn't improve performance, but we can see..

But I will hold off until we have some resolution on whether it should be applied (based on your first comment @tomaarsen) 👍

@tomaarsen tomaarsen added the enhancement New feature or request label Jan 9, 2023
@danstan5
Contributor Author

danstan5 commented Jan 9, 2023

I agree it seems unintuitive to generate a list and then reduce it. My original thinking was that conceptually it is easy to understand: "oh, we've removed 80% of the data as duplicates, that will greatly speed up my training". On your idea of generating the "proper" list, I've also added more fundamental thoughts on #258.

I'd be happy to add the unique_pairs logic; it's sensible, and in fact I think it could really just go in as a default rather than an optional parameter - when would you want to randomly duplicate pairs over incorporating new data?

The only outstanding point for me is that in the case of >2 classes, remove_duplicate_samples would start removing more positive samples, as positives are sampled evenly with negative pairs despite being far fewer. In these cases the runtime speed-ups come mostly from removing excessive positive pairs, which the tests have shown (at high sample numbers at least) to have little impact on accuracy.

Again, happy to add unique_pairs, but maybe in a separate PR for now?
Let me know if you have any thoughts on this, thanks @tomaarsen.

@tomaarsen
Member

tomaarsen commented Jan 10, 2023

I'd be happy to add the unique_pairs logic; it's sensible, and in fact I think it could really just go in as a default rather than an optional parameter - when would you want to randomly duplicate pairs over incorporating new data?

I agree that it sounds like unique_pairs should be True by default.

I understand the issue regarding 3+ classes, but won't the unique_pairs approach also create considerably more negative pairs than positive pairs? If we simply take all possible combinations in a 3+ class dataset (except identical pairs), then the majority of pairs would be negative. I believe the unique_pairs approach is near-identical in effect to the remove_duplicate_samples approach.

That all said, I'm a little wary of creating a large imbalance in positive/negative pairs, as I'm not sure whether the literature on contrastive learning favors having more or fewer negative pairs. I'm concerned about the situation where someone has e.g. 100 samples per class, with 50 classes. If they use unique_pairs=True, then they have 12,497,500 possible pairs (excluding identical pairs). I believe that out of those pairs, only 247,500 will be positive, representing roughly 2% of all pairs. With num_iterations=20, we sample num_iterations * 5000 * 2 = 200,000 pairs, out of which likely ~4,000 will be positive compared to ~196,000 negative pairs.
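The class-imbalance arithmetic for the 50-class scenario can be checked directly. This quick computation excludes identical (self) pairs; the exact totals shift slightly if self-pairs are counted:

```python
from math import comb

classes, per_class = 50, 100
n = classes * per_class                   # 5000 samples in total
total = comb(n, 2)                        # unordered pairs, no self-pairs
positive = classes * comb(per_class, 2)   # same-class pairs
print(total, positive)                    # 12497500 247500
print(positive / total)                   # ~0.0198, i.e. roughly 2% positive

num_iterations = 20
sampled = num_iterations * n * 2          # 200000 pairs drawn
print(round(sampled * positive / total))  # ~3961 expected positive pairs
```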

One solution is simply to get the number of unique positive pairs, and then randomly sample equally many negative pairs.
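A sketch of that balancing idea (hypothetical helper, not part of SetFit): collect every unique positive pair, then sample an equal number of unique negative pairs.

```python
import itertools
import random

def balanced_unique_pairs(texts, labels, seed=0):
    """Take all unique positive pairs, then sample equally many unique
    negative pairs, yielding a 50/50 positive/negative split."""
    rng = random.Random(seed)
    positives, negatives = [], []
    # Enumerate all unordered, non-identical pairs once
    for (a, la), (b, lb) in itertools.combinations(zip(texts, labels), 2):
        (positives if la == lb else negatives).append((a, b, int(la == lb)))
    # min() guards the degenerate case of fewer negatives than positives
    k = min(len(positives), len(negatives))
    return positives[:k] + rng.sample(negatives, k)

texts = [f"t{i}" for i in range(12)]
labels = [i % 3 for i in range(12)]   # 3 classes, 4 samples each
pairs = balanced_unique_pairs(texts, labels)
pos = sum(1 for p in pairs if p[2] == 1)
print(pos, len(pairs) - pos)  # 18 18: positives and negatives balanced
```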

@danstan5 danstan5 mentioned this pull request Jan 12, 2023
@danstan5
Contributor Author

Work equivalent to this PR will continue in #268.

@danstan5 danstan5 closed this Jan 13, 2023