
Add option to remove duplicate samples in training #259

Closed
danstan5 wants to merge 2 commits into huggingface:main from danstan5:remove-duplicate-samples

Conversation

@danstan5
Contributor

@danstan5 danstan5 commented Jan 8, 2023

Related to #258: adds an optional bool parameter remove_duplicate_samples to the SetFit train method.

I've added a test and updated the run_fewshot script (for reproducing the test_set results). Let me know if you have any further thoughts or extras required, thanks!

@tomaarsen tomaarsen mentioned this pull request Jan 9, 2023
Member

@tomaarsen tomaarsen left a comment


My main concern with these changes is that they propose that we:

  1. generate a large list of pairs,
  2. reduce it

Intuitively, it should be simpler to just generate a "proper" list from the get-go. Let us consider an example: a user has a binary text classification task with 16 labeled text samples. They set num_iterations to 20 and remove_duplicate_samples to False.
The Trainer will then initialize 16 * 20 * 2 = 640 pairs (the * 2 is because a positive and a negative pair are added each time). These pairs likely contain duplicates.
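To make the duplicate problem concrete, here is a minimal sketch (not the actual SetFit code; `generate_pairs` is a hypothetical simplified stand-in for `sentence_pairs_generation`) showing that 16 binary-labeled samples with num_iterations=20 produce 640 pairs drawn from a much smaller pool of unique pairs, so duplicates are unavoidable:

```python
import random

def generate_pairs(texts, labels, num_iterations, seed=0):
    """Simplified sketch of SetFit-style pair generation: each iteration
    adds one positive and one negative pair per sample."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_iterations):
        for text, label in zip(texts, labels):
            # Candidates sharing the label (excluding self) and with other labels
            positives = [t for t, l in zip(texts, labels) if l == label and t != text]
            negatives = [t for t, l in zip(texts, labels) if l != label]
            pairs.append((text, rng.choice(positives), 1))
            pairs.append((text, rng.choice(negatives), 0))
    return pairs

texts = [f"sample {i}" for i in range(16)]
labels = [i % 2 for i in range(16)]
pairs = generate_pairs(texts, labels, num_iterations=20)
print(len(pairs))  # 16 * 20 * 2 = 640

# Only 2 * C(8, 2) = 56 unique positive and 8 * 8 = 64 unique negative
# unordered pairs exist, so the 640 generated pairs must contain duplicates.
unique = {(tuple(sorted(p[:2])), p[2]) for p in pairs}
print(len(unique) < len(pairs))  # True
```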

In another example, the user sets remove_duplicate_samples to True, but keeps all other parameters the same. The first time the model is trained, it does so using 500 pairs. On another run, it may train with just 350, or 600. This inconsistency may fairly strongly affect the training (speed & accuracy).

I propose an alternative that covers the same use case: a unique_pairs argument. This argument is passed to sentence_pairs_generation and sentence_pairs_generation_multilabel and ensures that no two pairs are identical. These functions would likely perform a sampling of all possible combinations for efficiency, and do their best to return exactly num_samples * num_iterations * 2 pairs. This may not always be possible, as there is a bound on the number of possible unique pairs, in which case a warning can be raised.

With this approach, users keep strict control over the number of training pairs used. It should have the same training efficiency benefits as your proposal.
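A minimal sketch of the proposed behaviour (hypothetical; `sample_unique_pairs` and its signature are not the SetFit API): draw up to the requested number of distinct pairs, warning when the pool of possible pairs is the limiting factor.

```python
import itertools
import random
import warnings

def sample_unique_pairs(texts, labels, target, seed=0):
    """Sketch of the proposed `unique_pairs` behaviour: return up to
    `target` distinct (text_a, text_b, is_positive) pairs."""
    rng = random.Random(seed)
    # Enumerate all unordered, non-identical pairs once
    positives = [(a, b, 1) for (a, la), (b, lb)
                 in itertools.combinations(zip(texts, labels), 2) if la == lb]
    negatives = [(a, b, 0) for (a, la), (b, lb)
                 in itertools.combinations(zip(texts, labels), 2) if la != lb]
    pool = positives + negatives
    if target > len(pool):
        warnings.warn(f"Only {len(pool)} unique pairs available; requested {target}.")
        target = len(pool)
    # Sampling without replacement guarantees uniqueness
    return rng.sample(pool, target)

texts = [f"text {i}" for i in range(16)]
labels = [i % 2 for i in range(16)]
# 640 requested, but only 56 positive + 64 negative = 120 unique pairs exist
pairs = sample_unique_pairs(texts, labels, target=640)
print(len(pairs))  # 120
```

Materializing every combination is O(n^2), so a real implementation would sample pairs lazily for large datasets; the sketch only illustrates the uniqueness guarantee and the warning.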

What are your thoughts on this? If desired, I could try to work on this.

parser.add_argument("--keep_body_frozen", default=False, action="store_true")
parser.add_argument("--add_data_augmentation", default=False, action="store_true")
parser.add_argument("--remove_duplicate_samples", default=False, action="store_true")
parser.add_argument("--train_time", default=False, action="store_true")
Member


Good idea to add a training time argument here.

Comment on lines +686 to +695
def sentence_pairs_remove_duplicates(sentences: List[InputExample]) -> List[InputExample]:
    key_pairs = set()
    rm_duplicate_sentences = []
    for s in sentences:
        # Order-insensitive key: (a, b) and (b, a) count as the same pair
        key = tuple(sorted(s.texts))
        if key not in key_pairs:
            key_pairs.add(key)
            # Also drop pairs whose two texts are identical
            if key[0] != key[1]:
                rm_duplicate_sentences.append(s)
    return rm_duplicate_sentences
Member


This differs slightly from my understanding of #258. In particular, this function:

  1. removes duplicate pairs
  2. removes pairs where the items are identical

I believe that #258 only mentions the former. Did you find the pairs where both items are identical to be a problem?

Contributor Author


No. Tbh I wrote these changes ages ago but only got round to evaluating on a public dataset / opening the PR yesterday.
For my use case I was slashing as much data as possible to speed up training; this is a remnant from that which I forgot I'd added!

I can run some tests again with and without 2. applied, and we can see what the analysis shows. My intuition says labelling identical pairs wouldn't improve performance, but we can see..

But I will hold off until we have some resolution on whether it should be applied (based on your first comment @tomaarsen) 👍

@tomaarsen tomaarsen added the enhancement New feature or request label Jan 9, 2023
@danstan5
Contributor Author

danstan5 commented Jan 9, 2023

I agree it seems unintuitive to generate a list and then reduce it. My original thinking was that conceptually it is easy to understand: "oh, we've removed 80% of the data as duplicates, that will greatly speed up my training". On your idea of generating the "proper" list, I've also added more fundamental thoughts on #258.

I'd be happy to add the unique_pairs logic; it's sensible, and in fact I think it could really just go in as a default rather than an optional parameter - when would you want to randomly duplicate pairs over incorporating new data?

The only outstanding point for me is that in the case of >2 classes, remove_duplicate_samples would start removing more positive samples, as positives are sampled evenly with negative pairs despite being far fewer. In these cases the runtime speed-ups come mostly from removing excessive positive pairs, which the tests have shown (at high sample numbers at least) to have little impact on accuracy.

Again, happy to add unique_pairs, but maybe in a separate PR for now?
Let me know if you have any thoughts on this, thanks @tomaarsen.

@tomaarsen
Member

tomaarsen commented Jan 10, 2023

I'd be happy to add the unique_pairs logic; it's sensible, and in fact I think it could really just go in as a default rather than an optional parameter - when would you want to randomly duplicate pairs over incorporating new data?

I agree that it sounds like unique_pairs should be True by default.

I understand the issue regarding 3+ classes, but won't the unique_pairs approach also create considerably more negative pairs than positive pairs? If we simply take all possible combinations in a 3+ class dataset (except identical pairs), then the majority of pairs would be negative. I believe the unique_pairs approach is near-identical in effect to the remove_duplicate_samples approach.

That all said, I'm a little wary of creating a large imbalance in positive/negative pairs, as I'm not sure whether the literature on contrastive learning favors having more or fewer negative pairs. I'm concerned about the situation where someone has e.g. 100 samples per class, with 50 classes. If they use unique_pairs=True, then they have 12,497,500 possible pairs (excluding identical pairs). I believe that out of those pairs, only 247,500 will be positive, representing roughly 2% of all pairs. With num_iterations=20, we sample num_iterations * 5000 * 2 = 200,000 pairs, out of which likely ~4,000 will be positive compared to ~196,000 negative pairs.
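The class-imbalance arithmetic for the 50-class scenario can be checked directly. This quick computation excludes identical (self) pairs; the exact totals shift slightly if self-pairs are counted:

```python
from math import comb

classes, per_class = 50, 100
n = classes * per_class                   # 5000 samples in total
total = comb(n, 2)                        # unordered pairs, no self-pairs
positive = classes * comb(per_class, 2)   # same-class pairs
print(total, positive)                    # 12497500 247500
print(positive / total)                   # ~0.0198, i.e. roughly 2% positive

num_iterations = 20
sampled = num_iterations * n * 2          # 200000 pairs drawn
print(round(sampled * positive / total))  # ~3961 expected positive pairs
```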

One solution is simply to get the number of unique positive pairs, and then randomly sample equally many negative pairs.
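A sketch of that balancing idea (hypothetical helper, not part of SetFit): collect every unique positive pair, then sample an equal number of unique negative pairs.

```python
import itertools
import random

def balanced_unique_pairs(texts, labels, seed=0):
    """Take all unique positive pairs, then sample equally many unique
    negative pairs, yielding a 50/50 positive/negative split."""
    rng = random.Random(seed)
    positives, negatives = [], []
    # Enumerate all unordered, non-identical pairs once
    for (a, la), (b, lb) in itertools.combinations(zip(texts, labels), 2):
        (positives if la == lb else negatives).append((a, b, int(la == lb)))
    # min() guards the degenerate case of fewer negatives than positives
    k = min(len(positives), len(negatives))
    return positives[:k] + rng.sample(negatives, k)

texts = [f"t{i}" for i in range(12)]
labels = [i % 3 for i in range(12)]   # 3 classes, 4 samples each
pairs = balanced_unique_pairs(texts, labels)
pos = sum(1 for p in pairs if p[2] == 1)
print(pos, len(pairs) - pos)  # 18 18: positives and negatives balanced
```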

@danstan5 danstan5 mentioned this pull request Jan 12, 2023
@danstan5
Contributor Author

Work equivalent to this PR will continue in #268.

@danstan5 danstan5 closed this Jan 13, 2023