Sampling strategy (new sampler) #5
The code here is moved to sampler.py, which will need its own tester file
Hello! Apologies for the delay, I've been busy. This is looking quite slick, definitely looking forward to merging this. I have a few questions:
```python
if sampling_strategy == "random":
    self.len_pos_pairs = len(self.sentences)
    self.len_neg_pairs = len(self.sentences)
```

Then this still needs a few tests and presumably
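On the tests: a rough sketch of what a tester for `sampler.py` could look like. To be clear, the real sampler's constructor signature isn't shown here, so this uses a stand-in class that only mirrors the attributes from the snippet above:

```python
# Stand-in mirroring just the attributes from the snippet above; the real
# class lives in sampler.py and its constructor signature may differ.
class StubSampler:
    def __init__(self, sentences, sampling_strategy="oversampling"):
        self.sentences = sentences
        if sampling_strategy == "random":
            self.len_pos_pairs = len(self.sentences)
            self.len_neg_pairs = len(self.sentences)


def test_random_strategy_matches_sentence_count():
    sampler = StubSampler(["a", "b", "c"], sampling_strategy="random")
    # With the "random" strategy, both pair counts equal len(sentences).
    assert sampler.len_pos_pairs == 3
    assert sampler.len_neg_pairs == 3


test_random_strategy_matches_sentence_count()
```

The same shape would work as a pytest file once the real class is imported in place of the stub.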
Hey @tomaarsen. In answer to your Qs:
I'll get these tests/comparisons added.
This matches the default from `transformers`.
The old conditional was `True` with the default of `-1`, which wasn't ideal.
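For context, a minimal illustration of the bug class being described (my own toy example, not the project code): a sentinel default of `-1` is truthy in Python, so a plain truthiness check fires even when the user never set the value:

```python
num_iterations = -1  # sentinel default meaning "not set by the user"

# Old-style check: -1 is truthy, so this branch runs unintentionally.
old_path_taken = bool(num_iterations)

# An explicit comparison only fires for genuinely positive values.
new_path_taken = num_iterations > 0

print(old_path_taken, new_path_taken)  # -> True False
```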
Hi @tomaarsen. This is still WIP, but I thought I'd share it with you at this stage.
We now have `sampling_strategy`, which accepts `oversampling` (default) and `undersampling`, and I've left the "unique_pairs" unbalanced option in there for now as `unique` (the code for this is very similar, so it's left in for now but can easily be removed later if desired).

We've discussed backwards compatibility... I've left `num_iterations` as an optional; for now it can be dropped to 0 to give the new default "oversampling" number of pairs.

It's worth noting the sampling is not exactly the same as before. For example:
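To make the three modes concrete, here's a rough sketch of how the pair counts could be derived per strategy (my own illustration of the idea; the actual implementation in `sampler.py` may differ):

```python
def pair_counts(num_pos, num_neg, sampling_strategy="oversampling"):
    """Return (len_pos_pairs, len_neg_pairs) for each strategy.

    - oversampling: draw the minority class up to the majority count
    - undersampling: draw the majority class down to the minority count
    - unique: keep the naturally unbalanced unique-pair counts
    """
    if sampling_strategy == "oversampling":
        n = max(num_pos, num_neg)
        return n, n
    if sampling_strategy == "undersampling":
        n = min(num_pos, num_neg)
        return n, n
    if sampling_strategy == "unique":
        return num_pos, num_neg
    raise ValueError(f"Unknown sampling_strategy: {sampling_strategy!r}")


print(pair_counts(10, 4, "oversampling"))   # -> (10, 10)
print(pair_counts(10, 4, "undersampling"))  # -> (4, 4)
print(pair_counts(10, 4, "unique"))         # -> (10, 4)
```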
Previous: `["Sent A L0", "Sent B L1", "Sent C L1"]` positive samples @ `num_iterations=2` could result in: …

While now it might result in: …
The randomness of the second-sentence selection will instead randomly draw new perturbations first. The net effect over lots of `num_iterations` is slightly less random drawing of samples; it shouldn't impact scores, but it is not truly backwards compatible with the old sampler this way. There's no nice way to do this unless you bring back all the old sampling functions, which I'd rather not. I can produce some tests to show the accuracies are very much aligned with <v1.0? But the exact sampling will differ slightly...
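To illustrate the behavioural difference described above with a toy sketch (again my own example, not the project code): the old approach draws the second sentence independently each time, so duplicates can appear immediately, while the new approach shuffles the candidate pool and walks through it, so repeats only appear once the pool is exhausted:

```python
import random

candidates = ["Sent A", "Sent B", "Sent C"]

# Old-style: independent draws with replacement; duplicates possible
# within a single pass.
rng = random.Random(0)
old_draws = [rng.choice(candidates) for _ in range(3)]

# New-style: shuffle once and iterate without replacement; each
# candidate appears exactly once per pass, so drawing is "less random"
# over many iterations but never wastes a sample on an early repeat.
rng = random.Random(0)
pool = candidates.copy()
rng.shuffle(pool)
new_draws = pool[:3]

# One full new-style pass is always a permutation of the candidates.
print(sorted(new_draws) == sorted(candidates))  # -> True
```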