Skip to content

Comments

Sampling strategy (new sampler)#5

Merged
tomaarsen merged 20 commits intotomaarsen:refactor_v2from
danstan5:refactor-sampling
Oct 27, 2023
Merged

Sampling strategy (new sampler)#5
tomaarsen merged 20 commits intotomaarsen:refactor_v2from
danstan5:refactor-sampling

Conversation

@danstan5
Copy link

@danstan5 danstan5 commented Sep 14, 2023

Hi @tomaarsen. This is still WIP but thought I'd share with you at this stage..

We now have sampling_strategy that accepts oversampling (default) undersampling and I've left the "unique_pairs" unbalanced option in there for now as unique (the code for this is very similar so left for now but can easily be removed later if desired).

We've discussed backwards compatibility... I've left num_iterations as an optional, for now this can be dropped to 0 to give the new default "oversampling" num pairs.

It's worth noting the sampling is not exactly the same as before.. For example:

Previous: ["Sent A L0", "Sent B L1", Sent C L1"] positive samples @num_iterations=2 could result in:

{"Sent A L0"-"Sent A L0"} , {"Sent B L1"-"Sent C L1"} , {"Sent C L1"-"Sent B L1"}
{"Sent A L0"-"Sent A L0"} , {"Sent B L1"-"Sent C L1"} , {"Sent C L1"-"Sent B L1"}

While now might result in:

{"Sent A L0"-"Sent A L0"} , {"Sent B L1"-"Sent B L1"} , {"Sent C L1"-"Sent C L1"}
{"Sent A L0"-"Sent A L0"} , {"Sent B L1"-"Sent C L1"} , {"Sent B L1"-"Sent B L1"}

The randomness of the second sentence section instead will randomly draw new perturbations first. The net effect over lots of num_iterations is slightly less random drawing of samples, it shouldn't impact scores but is not truly backwards compatible with the old sampler this way. There's no nice way to do this, unless you bring back in all the old sampling functions which I'd rather not. I can produce some tests to show the accuracy are very much aligned with <v1.0? But exact sampling will differ slighter...

@danstan5 danstan5 changed the title Refactor sampler Sampling strategy (new sampler) Sep 14, 2023
@tomaarsen
Copy link
Owner

@danstan5

Hello!

Apologies for the delay, I've been busy. This is looking quite slick, definitely looking forward to merging this. I have a few questions:

  • How does unique differ from undersampling right now? I think zip continues until the shortest argument is exhausted, so even if the positive and negative pairs are identical, then I think this line means that the number of positive and negative pairs are still equally sampled:

    for pos_pair, neg_pair in zip(self.get_positive_pairs(), self.get_negative_pairs()):

  • Can we get roughly <v1.0.0 performance if we add:

        if sampling_strategy == "random":
            self.len_pos_pairs = len(self.sentences)
            self.len_neg_pairs = len(self.sentences)

Then this still needs a few tests and presumably trainer_unique_pairs.py has to be removed.
And I'll add sampling_strategy to the TrainingArguments, and I'll consider fully removing num_iterations if we can add the "random" strategy.

  • Tom Aarsen

@danstan5
Copy link
Author

danstan5 commented Oct 19, 2023

Hey @tomaarsen. In answer to your Qs:

  1. Your right ! 🐞 The logic is broken here should be zip_longest so the unique case works correctly

    • I will add tests for this to confirm logic works as expected after adding the fix.
  2. Roughly <v1.0.0 performance is reproducible by setting num_iterations (which will now default to None). This draws the same no. & ratio of pairs as previous:

    if num_iterations is not None and num_iterations > 0:
    self.len_pos_pairs = num_iterations * len(self.sentences)
    self.len_neg_pairs = num_iterations * len(self.sentences)

    The underlying sampler here has changed (it's slightly less random now). There's no work around for this without keeping all the <v1.0.0 code for complete reproducibility. I suggest the following:

    • Run evaluation dataset comparison pre/post new sampler: to show the metrics are within the margin error that would occur before from the random sampling.

I'll get these tests/ comparisons added + TrainingArguments added (as need them in place to run the tests).

@tomaarsen tomaarsen merged commit 3478799 into tomaarsen:refactor_v2 Oct 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants