Sampling strategy (new sampler) by danstan5 · Pull Request #5 · tomaarsen/setfit

danstan5 · 2023-09-14T11:44:03Z

Hi @tomaarsen. This is still WIP, but I thought I'd share it with you at this stage.

We now have a sampling_strategy argument that accepts "oversampling" (the default) and "undersampling". I've also left the unbalanced "unique_pairs" option in for now, as "unique"; the code for it is very similar, and it can easily be removed later if desired.

We've discussed backwards compatibility: I've left num_iterations in as an optional argument. For now it can be set to 0 to get the number of pairs from the new default "oversampling" strategy.

It's worth noting that the sampling is not exactly the same as before. For example:

Previous: ["Sent A L0", "Sent B L1", "Sent C L1"] positive samples @ num_iterations=2 could result in:

{"Sent A L0"-"Sent A L0"} , {"Sent B L1"-"Sent C L1"} , {"Sent C L1"-"Sent B L1"}
{"Sent A L0"-"Sent A L0"} , {"Sent B L1"-"Sent C L1"} , {"Sent C L1"-"Sent B L1"}

While now it might result in:

{"Sent A L0"-"Sent A L0"} , {"Sent B L1"-"Sent B L1"} , {"Sent C L1"-"Sent C L1"}
{"Sent A L0"-"Sent A L0"} , {"Sent B L1"-"Sent C L1"} , {"Sent B L1"-"Sent B L1"}

The random selection of the second sentence will now draw new perturbations first. The net effect over many num_iterations is a slightly less random drawing of samples. It shouldn't impact scores, but this way it is not truly backwards compatible with the old sampler. There's no nice way to achieve that unless you bring back all the old sampling functions, which I'd rather not do. I can produce some tests to show that the accuracy is closely aligned with <v1.0? The exact sampling will differ slightly, though.
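To make the example above concrete, here is a minimal sketch of the pool of unique positive pairs for the three toy sentences. It is only an illustration of the pair space, not the code in sampler.py: the helper name `unique_positive_pairs` and the (sentence, label) tuple layout are assumptions made for this sketch.

```python
from itertools import combinations_with_replacement

# Toy data from the example above, as (sentence, label) tuples.
samples = [("Sent A", 0), ("Sent B", 1), ("Sent C", 1)]


def unique_positive_pairs(samples):
    """Return every unordered same-label pair, self-pairs included.

    Hypothetical helper for illustration; not part of setfit.
    """
    return [
        (a, b)
        for (a, label_a), (b, label_b) in combinations_with_replacement(samples, 2)
        if label_a == label_b
    ]


pairs = unique_positive_pairs(samples)
print(pairs)
```

With only four unique positive pairs available, filling 6 positive slots (3 sentences at num_iterations=2) necessarily repeats some pairs, which is consistent with the repeated pairs in the rows above.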

tomaarsen · 2023-10-17T20:40:47Z

@danstan5

Hello!

Apologies for the delay, I've been busy. This is looking quite slick, definitely looking forward to merging this. I have a few questions:

How does unique differ from undersampling right now? I think zip continues until the shortest argument is exhausted, so even if the positive and negative pairs are identical, then I think this line means that the number of positive and negative pairs are still equally sampled:

setfit/src/setfit/sampler.py

Line 139 in 567f1c9

for pos_pair, neg_pair in zip(self.get_positive_pairs(), self.get_negative_pairs()):

Can we get roughly <v1.0.0 performance if we add:

        if sampling_strategy == "random":
            self.len_pos_pairs = len(self.sentences)
            self.len_neg_pairs = len(self.sentences)

Then this still needs a few tests and presumably trainer_unique_pairs.py has to be removed.
And I'll add sampling_strategy to the TrainingArguments, and I'll consider fully removing num_iterations if we can add the "random" strategy.

Tom Aarsen

…o pr-5

danstan5 · 2023-10-19T10:08:50Z

Hey @tomaarsen. In answer to your Qs:

You're right! 🐞 The logic is broken here: it should be zip_longest so that the unique case works correctly.
- I will add tests to confirm that the logic works as expected after adding the fix.

Roughly <v1.0.0 performance is reproducible by setting num_iterations (which will now default to None). This draws the same number and ratio of pairs as before:
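The difference between the two iterators can be shown on a small example. This is a sketch with made-up pair lists (`pos_pairs` and `neg_pairs` are hypothetical stand-ins for the sampler's generators), not the actual fix:

```python
from itertools import zip_longest

# Hypothetical, unbalanced pair lists: 2 positive pairs, 3 negative pairs.
pos_pairs = [("A", "A"), ("B", "C")]
neg_pairs = [("A", "B"), ("A", "C"), ("B", "A")]

# zip stops at the shorter input, so the third negative pair is dropped.
truncated = list(zip(pos_pairs, neg_pairs))

# zip_longest keeps going and pads the exhausted side with None.
full = list(zip_longest(pos_pairs, neg_pairs))

print(len(truncated), len(full))
```

Any loop built on zip_longest has to skip the None fill values on the exhausted side before yielding a pair.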

setfit/src/setfit/sampler.py

Lines 86 to 88 in 567f1c9

    
        if num_iterations is not None and num_iterations > 0:
            self.len_pos_pairs = num_iterations * len(self.sentences)
            self.len_neg_pairs = num_iterations * len(self.sentences)
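As a standalone sketch of the arithmetic in that conditional, the function below returns the pair counts for the legacy path. The function name and the `None` return for the "fall through to sampling_strategy" case are assumptions made for illustration only:

```python
def legacy_pair_counts(num_sentences, num_iterations=None):
    """Mirror the conditional above as a plain function (hypothetical helper).

    Returns (len_pos_pairs, len_neg_pairs) when num_iterations is set to a
    positive value, otherwise None to signal that the sampling strategy
    should decide the counts instead.
    """
    if num_iterations is not None and num_iterations > 0:
        n = num_iterations * num_sentences
        return n, n
    return None


# 3 sentences at num_iterations=2, as in the earlier example.
print(legacy_pair_counts(3, 2))
print(legacy_pair_counts(3))
```

Both None and 0 skip the legacy path, which matches the description of num_iterations as an optional override.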

The underlying sampler here has changed (it's slightly less random now). There's no workaround for this without keeping all the <v1.0.0 code for complete reproducibility. I suggest the following:

Run a comparison on an evaluation dataset before and after the new sampler, to show that the metrics are within the margin of error that random sampling already produced before.

I'll get these tests and comparisons added, along with the TrainingArguments changes (which need to be in place to run the tests).

sampler for refactor WIP

08892f6

danstan5 changed the title ~~Refactor sampler~~ Sampling strategy (new sampler) Sep 14, 2023

tomaarsen added 4 commits October 17, 2023 21:41

Merge branch 'refactor_v2' of https://github.com/tomaarsen/setfit int…

429de0f

…o pr-5

Run formatters

173f084

Remove tests from modeling.py

c23959a

The code here is moved to sampler.py, which will need its own tester file

Merge branch 'refactor_v2' of https://github.com/tomaarsen/setfit int…

567f1c9

…o pr-5

Merge branch 'refactor_v2' of https://github.com/tomaarsen/setfit int…

67ddedc

…o pr-5

danstan5 and others added 14 commits October 19, 2023 12:37

sampler logic fix "unique" strategy

d37ee09

add sampler tests (not complete)

0ef8837

add sampling_strategy into TrainingArguments

131aa26

Merge branch 'refactor-sampling' of https://github.com/danstan5/setfit …

c6c6228

…into refactor-sampling

num_iterations removed from TrainingArguments

7431005

run_fewshot compatible with <v.1.0.0

3bd2acc

Run make style

3d07e6c

Use "no" as the default evaluation_strategy

978daee

This matches the default from 'transformers'

Move num_iterations back to TrainingArguments

2802a3f

Fix broken trainer tests due to new default sampling

391f991

Use the Contrastive Dataset for Distillation

f8b7253

Set the default logging steps at 50

38e9607

Add max_steps argument to TrainingArguments

4ead15d

Change max_steps conditional

eb70336

The old conditional was True with the default -1, not ideal

tomaarsen merged commit 3478799 into tomaarsen:refactor_v2 Oct 27, 2023

tomaarsen mentioned this pull request Nov 24, 2023

Add training option unique_pairs huggingface/setfit#268

Closed

3 tasks

tomaarsen merged 20 commits into tomaarsen:refactor_v2 from danstan5:refactor-sampling
