
Quadratic sampling #5403

Merged

oobabooga merged 35 commits into oobabooga:dev from kalomaze:quadratic-sampling on Feb 4, 2024

Conversation

@kalomaze (Contributor) commented Jan 30, 2024

Quadratic Sampling

The idea behind this is to simplify sampling as much as possible for the purposes of creative writing.

The design I've been testing (on a Mistral 7b so far) is "quadratic sampling". The way that it works is:

  • We transform each logit based on a quadratic function with a scaling factor & a reference value (h). A higher scaling factor will generally be more deterministic.
  • Logits closer to the reference value (which is the maximum logit) will be boosted in score, so that the top tokens become more evenly distributed, in order to avoid repetition and improve vocabulary usage
  • Because we are using the top logit as the reference value, the modifications should theoretically scale somewhat well across different models which have different "scales" (e.g. Yi 34b with its 64k vocab)
  • We inherently penalize small logits in the process of making the top ones more even, leading to a more coherent distribution overall without having to resort to cutting out tokens completely.
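The bullet points above can be sketched in a few lines of plain Python (hypothetical helper name; the merged implementation operates on tensors and may differ in detail):

```python
def quadratic_transform(logits, smoothing_factor):
    """Penalize each logit by its squared distance from the top logit h.

    Tokens near h keep almost all of their score (the top spots flatten out),
    while tail tokens are pushed down quadratically hard.
    """
    h = max(logits)  # reference value: the maximum logit
    return [h - smoothing_factor * (x - h) ** 2 for x in logits]

logits = [5.0, 4.5, 3.0, -1.0]
print(quadratic_transform(logits, 0.25))
# [5.0, 4.9375, 4.0, -4.0] -- the top-two gap shrinks from 0.5 to 0.0625,
# while the tail token drops from -1.0 to -4.0
```

Note how a single smoothing_factor both evens out the top tokens and punishes the tail, which is the point of the design.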


TL;DR for the end user: a sampler that can make the model less deterministic while also punishing extremely low-probability options. A reasonable range from my testing is 0-1.0; 0.2-0.3 seemed like the optimal range to tinker with for creative outputs.

It can also be used to make generation pseudo-deterministic: extremely close options keep more of their relative weighting than they would under a low temperature value.

Higher smoothing_factor = more deterministic; lower = more even top probabilities.

Values under 0.1 are not recommended unless you're setting it to 0 to disable it.

oobabooga and others added 29 commits December 14, 2023 22:39
@BadisG (Contributor) commented Feb 2, 2024

@Ph0rk0z Maybe by putting smoothing after MinP, it got way later in the order of samplers.

Let's see this situation:

Sampler1 -> Smoothing -> Sampler 2 -> Sampler 3 -> MinP -> Sampler 4

@kalomaze decided to move only Smoothing, so this is what happened:

Sampler1 -> Sampler 2 -> Sampler 3 -> MinP -> Smoothing -> Sampler 4

The difference is quite large: now Smoothing is considered after 3 samplers instead of 1 in this example.

I feel like Smoothing didn't need to be moved (it had a fine position in the previous commit) and that only MinP had to be moved. MinP should be first (or second, if we count temperature) in the sampler order no matter what; that's my 2 cents.
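As a toy illustration of why position in the chain matters, here is a self-contained sketch (hypothetical logits and cutoff, not the webui code) where applying a min-p style cutoff before versus after a quadratic smoothing step keeps a different set of tokens:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def min_p_keep(logits, p):
    """Indices of tokens whose probability is at least p * the top probability."""
    probs = softmax(logits)
    cutoff = p * max(probs)
    return [i for i, pr in enumerate(probs) if pr >= cutoff]

def smooth(logits, k):
    """Quadratic smoothing: penalize by squared distance from the top logit."""
    h = max(logits)
    return [h - k * (x - h) ** 2 for x in logits]

logits = [4.0, 1.5, 1.4]

# MinP first: the third token falls just below the cutoff and is dropped.
print(min_p_keep(logits, 0.08))               # -> [0, 1]

# Smoothing first flattens the top, so the same cutoff now keeps all three.
print(min_p_keep(smooth(logits, 0.1), 0.08))  # -> [0, 1, 2]
```

The surviving vocabulary differs depending on which transform runs first, which is exactly why the placement being discussed here is not cosmetic.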

@Ph0rk0z (Contributor) commented Feb 3, 2024

Looking through there, it doesn't seem like there's another place it could go. Those other samplers aren't hijacked.

@oobabooga (Owner) commented

The way I see it, this parameter is an alternative to temperature. So maybe TemperatureLogitsWarperWithDynatemp should be renamed to something more general like ModifiedTemperatureLogitsWarper, and the transformation should go there and be applied when smoothing_factor > 0. Then temperature_last will automatically make smoothing_factor be applied last.

@oobabooga oobabooga changed the base branch from main to dev February 4, 2024 03:19
@oobabooga oobabooga merged commit b6077b0 into oobabooga:dev Feb 4, 2024
@BadisG (Contributor) commented Feb 4, 2024

@oobabooga I'm not sure it's a good idea to include the smoothing sampler in "temp_last"; I was using this order:

MinP -> Smoothing -> Temp

And now I feel it's not possible anymore with this new configuration, right?

Will you consider adding a custom sampler order feature one day? That would fix this issue quite easily; I think that having the right order can have a big impact on the output.

@kalomaze (Contributor, Author) commented Feb 4, 2024

The way I see it, this parameter is an alternative to temperature. So maybe TemperatureLogitsWarperWithDynatemp should be renamed to something more general like ModifiedTemperatureLogitsWarper, and the transformation should go there and be applied when smoothing_factor > 0. Then temperature_last will automatically make smoothing_factor be applied last.

I strongly oppose this. Temperature changes the base relative distance between the probabilities, and the quadratic transformation will naturally change as a direct consequence, unless my tests with the log-prob viewer were wrong.

@kalomaze (Contributor, Author) commented Feb 4, 2024

Alright, so I think I was wrong to some extent. While the relationship between the two values isn't perfectly linear, it's predictable.

The Temperature value squared x 0.25 gives a Smoothing value that will always result in the same output, so this means:

  • 4.0 Temperature & 4.0 Smoothing
  • 1.0 Temperature & 0.25 Smoothing
  • 5.0 Temperature & 6.25 Smoothing

Should all be equivalent transformations on the log probs (?)

EDIT: I'm not confident that you can estimate a smoothing value that, by itself, will be the same as high temp + high smoothing combinations in all cases, so I think that the change to replace Temperature completely probably takes away some degree of control, as I'd initially thought.

In any case, it might be best to keep both options instead of arbitrarily grouping them together if just for the fact that it'd be easier to scale and control it that way (see what happened with DynaTemp as a range rather than it being two values.)

Here is a different branch that adds smoothing_last as a proper option in the meantime, for those who prefer the old behavior where it was considered separate to Temperature:
https://github.com/kalomaze/text-generation-webui/tree/quad-smooth-last
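The claimed relationship can be checked numerically outside the webui. A plain-Python sketch (hypothetical helper name; assumes temperature division is applied before the quadratic transform, and relies on softmax ignoring the constant h term):

```python
import math

def sample_probs(logits, temperature, smoothing):
    """Temperature scaling, then the quadratic transform, then softmax."""
    scaled = [x / temperature for x in logits]
    h = max(scaled)
    transformed = [h - smoothing * (x - h) ** 2 for x in scaled]
    m = max(transformed)
    exps = [math.exp(x - m) for x in transformed]
    z = sum(exps)
    return [e / z for e in exps]

logits = [5.0, 3.2, 2.9, 1.0, -2.0]
a = sample_probs(logits, 1.0, 0.25)
b = sample_probs(logits, 4.0, 4.0)
c = sample_probs(logits, 5.0, 6.25)
# all three settings share smoothing / temperature**2 == 0.25, so the
# resulting distributions agree to floating-point precision
```

Algebraically, after dividing by T the transform becomes a constant minus (S / T^2) * (logit - max)^2, and softmax discards the constant, so only the ratio S / T^2 matters; the three pairs in the bullet list all have ratio 0.25.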

@biship commented Feb 4, 2024

Why not remove the binary "x_last" sequencing, let users order their samplers, and just recommend an order (and values) in each chat template? It always seemed weird to me that the only sampler whose position we could change was temp, and even then only first or last. When min_p was "discovered", it wasn't obvious that it worked best with temp not last, so users set up their chat templates wrong. Properly curated text-generation-webui chat templates go a long way toward helping users.

@kalomaze (Contributor, Author) commented Feb 4, 2024

Having a customizable order is the ideal solution, yeah

@oobabooga (Owner) commented Feb 4, 2024

About custom order for sampling parameters, I don't see a compelling reason for it. Why not also have the same parameter appearing 2 or more times in the stack, having temperature mixed with dynamic temperature, etc. It becomes a black box.

As I see it, there are 3 main types of parameters:

  • Those that remove tail tokens: top_p, min_p, top_k, typical_p, tfs, top_a, epsilon_cutoff, eta_cutoff
  • Temperature-like parameters that flatten the distribution or make it more peaked: temperature, dynamic temperature, quadratic sampling
  • Parameters that control repetition: repetition_penalty, presence_penalty, frequency_penalty

For the sake of interpretability and simplicity, I believe in using only 1 parameter of each type. I have never seen a reason to not apply the repetition penalty first, so temperature_last is sufficient for changing the order of the tail cutoff parameter and the temperature parameter (whatever each one may be). This is also why I don't see any reason to mix quadratic sampling with temperature, just like dynamic temperature is not currently mixed with temperature.

@Ph0rk0z (Contributor) commented Feb 5, 2024

Tested current on the same preset I was using and nothing broke. I tried dynamic temp and regular temp with qSampling, I kept going back to low temperatures anyway. Don't know if that applies to all models.

@BadisG (Contributor) commented Feb 5, 2024

Tested current on the same preset I was using and nothing broke. I tried dynamic temp and regular temp with qSampling, I kept going back to low temperatures anyway. Don't know if that applies to all models.

@Ph0rk0z The merged version deactivates temp when smoothing is applied, so no matter what temp you're using it won't change anything. Tbh, I preferred when I had control over the two samplers at the same time; they don't have exactly the same effect, so you could find a nice combo out of it.

@BadisG (Contributor) commented Feb 5, 2024

About custom order for sampling parameters, I don't see a compelling reason for it. Why not also have the same parameter appearing 2 or more times in the stack, having temperature mixed with dynamic temperature, etc. It becomes a black box.

As I see it, there are 3 main types of parameters:

  • Those that remove tail tokens: top_p, min_p, top_k, typical_p, tfs, top_a, epsilon_cutoff, eta_cutoff
  • Temperature-like parameters that flatten the distribution or make it more peaked: temperature, dynamic temperature, quadratic sampling
  • Parameters that control repetition: repetition_penalty, presence_penalty, frequency_penalty

For the sake of interpretability and simplicity, I believe in using only 1 parameter of each type. I have never seen a reason to not apply the repetition penalty first, so temperature_last is sufficient for changing the order of the tail cutoff parameter and the temperature parameter (whatever each one may be). This is also why I don't see any reason to mix quadratic sampling with temperature, just like dynamic temperature is not currently mixed with temperature.

@oobabooga Even if we accept that samplers can be summed up in 3 groups, I still don't think it's that simple.

  1. Let's say top_a comes before temp and it works fine; you can't just assume that all the tail-token removers should come before temp. Maybe min_p works better if it's applied after temperature.
  2. Mixing multiple same-group samplers is fine: NovelAI uses a preset that mixes top_a and tfs, and it gives great results in practice. That alone adds to the complexity and suggests that having control over the sampler order might be a good addition.

So yeah, I'd also like to have this feature to do some experiments, even if it's only an extension; I wouldn't mind. If we can squeeze more performance out of our current models with a better sampler order, there's no reason not to go for it, in my opinion.

@Ph0rk0z (Contributor) commented Feb 5, 2024

The merged version deactivates temp when smoothing is applied, so no matter what temp you're using it won't change anything,

I know it does now. I meant in the previous versions.

I preferred when I had control over the two samplers at the same time

Yeah, it's not ideal, but at least it still works. I too mix min_p with typical_p, for instance; I would be devastated not to be able to have both. The latter means I barely have to use any repetition penalty.

@biship commented Feb 5, 2024

About custom order for sampling parameters, I don't see a compelling reason for it. Why not also have the same parameter appearing 2 or more times in the stack, having temperature mixed with dynamic temperature, etc. It becomes a black box.

As I see it, there are 3 main types of parameters:

  • Those that remove tail tokens: top_p, min_p, top_k, typical_p, tfs, top_a, epsilon_cutoff, eta_cutoff
  • Temperature-like parameters that flatten the distribution or make it more peaked: temperature, dynamic temperature, quadratic sampling
  • Parameters that control repetition: repetition_penalty, presence_penalty, frequency_penalty

For the sake of interpretability and simplicity, I believe in using only 1 parameter of each type. I have never seen a reason to not apply the repetition penalty first, so temperature_last is sufficient for changing the order of the tail cutoff parameter and the temperature parameter (whatever each one may be). This is also why I don't see any reason to mix quadratic sampling with temperature, just like dynamic temperature is not currently mixed with temperature.

I respect your viewpoint, but that is how it is today.
As a coder, isn't it easier to implement something that isn't built around a bunch of assumptions and existing patterns?
Remove all the logic behind why things would and wouldn't be combined or sequenced in certain orders.
Let users sequence and enable/disable samplers as they see fit, even if it's to their detriment.
It has to be a simpler solution to implement and maintain when new samplers show up.
Just guide users down the correct paths with templates that work.

Repository owner deleted a comment from Myobu1 Feb 5, 2024
@oobabooga (Owner) commented

@BadisG @biship I have added this option here: #5443

Tests are welcome.

@akujinnoninjin commented Feb 21, 2024

Alright, so I think I was wrong to some extent. While the relationship between the two values isn't perfectly linear, it's predictable.

The Temperature value squared x 0.25 gives a Smoothing value that will always result in the same output, so this means:

  • 4.0 Temperature & 4.0 Smoothing
  • 1.0 Temperature & 0.25 Smoothing
  • 5.0 Temperature & 6.25 Smoothing

Should all be equivalent transformations on the log probs (?)

EDIT: I'm not confident that you can estimate a smoothing value that, by itself, will be the same as high temp + high smoothing combinations in all cases, so I think that the change to replace Temperature completely probably takes away some degree of control, as I'd initially thought.

In any case, it might be best to keep both options instead of arbitrarily grouping them together if just for the fact that it'd be easier to scale and control it that way (see what happened with DynaTemp as a range rather than it being two values.)

Here is a different branch that adds smoothing_last as a proper option in the meantime, for those who prefer the old behavior where it was considered separate to Temperature: https://github.com/kalomaze/text-generation-webui/tree/quad-smooth-last

I've been playing around in koboldcpp and noticed similar behavior: it's pretty consistent that increasing the temperature by a factor of N can be 'cancelled out' by increasing the smoothing by a factor of N^2. I have confirmed in testing that the generations from [T=1, S=0.25], [T=2, S=1], [T=3, S=2.25], [T=4, S=4], and [T=0.5, S=0.0625] are all identical at fixed seeds (i.e. when N is 1/2/3/4/0.5 for that initial pair of values). I've also repeated this for a few other initial values, and the pattern has held up.

Looking at the code, I think this makes some sense. At first glance I wasn't sure, since the smoothing factor is only applied to the quadratic difference while the temperature is applied to the whole logit; however, because of the normalisation that happens with softmax after the smoothing function, I believe the effects of the "h" value are essentially cancelled out, since h is constant across all tokens. I think.
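The cancellation being described is softmax's shift invariance: adding the same constant to every logit leaves the probabilities untouched. A minimal check with toy numbers:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, -0.5]
shifted = [x + 3.7 for x in logits]  # h enters every token identically
# softmax(logits) and softmax(shifted) agree term by term: the constant
# cancels in the exp ratio, which is why h drops out of the distribution
```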

Either way, there is one suggestion I would like to make based on this behavior, regarding the implementation with Dynamic Temperature:

As mentioned, the effects of the quadratic sampling are dependent on the combination of temperature and smoothing factor. However in its current implementation in Dynamic Temp, the smoothing factor remains constant no matter what the temperature is adjusted to. This means that when the temperature is adjusted below 1, the chosen smoothing factor is effectively increased, and when the temperature is adjusted above 1 it is decreased.

As an example, consider a dynamic temp range of 0.5 to 2, with a smoothing factor of 0.25:

  • At the low end, that's equivalent to [1, 1]
  • At the high end, that's equivalent to [1, 0.0625]
  • And if you increased the max to 5, it's the equivalent of [1, 0.01]!

I think it could make more sense to adjust the value based on temperature, for consistency. If you take the chosen smoothing factor as being defined at Temp=1, you can multiply it by the square of the actual temperature to get a consistent effective value across all ranges... but I'm not sure whether that would just obviate temperature entirely.
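Using the S / T^2 relationship discussed earlier in the thread, both the drift and the proposed compensation can be written down directly (hypothetical helper names; a sketch, not koboldcpp or webui code):

```python
def effective_smoothing(smoothing_factor, temperature):
    """What the chosen smoothing factor 'feels like' at temperature 1."""
    return smoothing_factor / temperature ** 2

def compensated_smoothing(smoothing_factor, temperature):
    """Scale the factor by T**2 so its effect stays constant as temp moves."""
    return smoothing_factor * temperature ** 2

# dynamic temperature range 0.5-2.0 with smoothing_factor = 0.25:
print(effective_smoothing(0.25, 0.5))  # 1.0    (low end)
print(effective_smoothing(0.25, 2.0))  # 0.0625 (high end)
print(effective_smoothing(0.25, 5.0))  # 0.01   (max raised to 5)
```

These reproduce the three bullet-point equivalences above; compensated_smoothing is the "multiply by the square of the actual temperature" idea, which by construction makes effective_smoothing(compensated_smoothing(s, t), t) return s.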

Edit: I suppose you could potentially go a stage further and invert the relationship: make the smoothing *more* deterministic at higher temperatures and less so at lower ones. You could even add a second control factor to allow that relationship to be adjusted separately. My first thought is a (constant + 1) that you multiply by the square of the temperature: at 0 the smoothing would be flat, at -1 you'd get the current behavior, and at +1 you'd get the inverse. But that's just off the top of my head; I'd have to see the actual logprobs to know whether it's worth doing. Do you have a public copy of your matplotlib graph setups anywhere?

Edit2: I originally said divide when it should be multiply

PoetOnTheRun pushed a commit to PoetOnTheRun/text-generation-webui that referenced this pull request Feb 22, 2024