Quadratic sampling #5403
Conversation
Merge dev branch
Merge dev branch (oobabooga#5257)
@Ph0rk0z Maybe by putting smoothing after MinP, it ended up much later in the sampler order. Consider this situation: @kalomaze decided to move only Smoothing, so this is what happened. The difference is quite large: Smoothing is now applied after 3 samplers instead of 1 in this example. I feel that Smoothing didn't need to be moved (it had a fine position in the previous commit) and that only MinP had to be moved. MinP should be first (or second, if we count temperature) in the sampler order no matter what; that's my 2 cents.
Looking through there, it doesn't seem like there's another place it could go. Those other samplers aren't hijacked.
The way I see it, this parameter is an alternative to temperature. So maybe
@oobabooga I'm not sure it's a good idea to include the smoothing sampler in "temp_last". I was using this order: And now I feel it's no longer possible with this new configuration, right? Will you consider adding a custom sampler order feature one day? That would fix this issue quite easily. I think having the right order can make a big impact on the output.
I strongly oppose this. Temperature can change the base relative distances between the probabilities, and the quadratic transformation will naturally change as a direct consequence, unless my tests with the log prob viewer were wrong.
Alright, so I think I was wrong to some extent. While the relationship between the two values isn't perfectly linear, it's predictable. The Temperature value squared × 0.25 will always result in the same output, so this means
should all be equivalent transformations on the log probs (?) EDIT: I'm not confident that you can estimate a smoothing value that, by itself, will match high temp + high smoothing combinations in all cases, so I think that replacing Temperature completely probably takes away some degree of control, as I'd initially thought. In any case, it might be best to keep both options instead of arbitrarily grouping them together, if only because it'd be easier to scale and control that way (see what happened with DynaTemp as a range rather than two values). Here is a different branch that adds
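For what it's worth, the "squared" relationship would fall straight out of softmax's shift invariance, assuming the quadratic transform is applied after temperature scaling in roughly the form $-s\,(x_i/T - x_{\max}/T)^2$ plus a constant (this form is my reading of it, not necessarily the exact implementation):

$$
-s\left(\frac{x_i - x_{\max}}{T}\right)^2 = -\frac{s}{T^2}\,(x_i - x_{\max})^2
$$

Since additive constants cancel in softmax, the resulting distribution depends only on $s/T^2$, so scaling $T \to NT$ together with $s \to N^2 s$ leaves the output unchanged.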
Why not remove the binary 'x_last' sequencing, let users order their samplers, and just recommend an order (and values) in each chat template? It always seemed weird to me that the only sampler whose position we could change was temp, and only to first or last. When min_p was 'discovered', it wasn't obvious that it worked best with temp not last, so users created their chat templates wrong. Properly curated text-generation-webui chat templates go a long way toward helping users.
Having a customizable order is the ideal solution, yeah
About a custom order for sampling parameters, I don't see a compelling reason for it. Why not also allow the same parameter to appear 2 or more times in the stack, temperature mixed with dynamic temperature, etc.? It becomes a black box. As I see it, there are 3 main types of parameters:
For the sake of interpretability and simplicity, I believe in using only 1 parameter of each type. I have never seen a reason not to apply the repetition penalty first, so
Tested the current version on the same preset I was using and nothing broke. I tried dynamic temp and regular temp with qSampling; I kept going back to low temperatures anyway. Don't know if that applies to all models.
@Ph0rk0z The merged version deactivates temp when smoothing is applied, so no matter what temp you're using, it won't change anything. Tbh I preferred it when I had control over the two samplers at the same time; they don't have exactly the same effect, so you could find a nice combo out of them.
@oobabooga Even if we consider that samplers can be grouped into 3 types, I still don't think it's that simple.
So yeah, I'd also like to have this feature to do some experiments, even if it's just an extension; I wouldn't mind. If we can squeeze more performance out of our current models with a better sampler order, there's no reason not to go for it, in my opinion.
I know it does now. I mean in the previous versions.
Yea, it's not ideal, but at least it still works. I too mix min_P with typical_P, for instance. I would be devastated not to be able to have both there. The latter makes it so I barely have to do any repetition penalty.
I respect your viewpoint, but that is how it is today.
I've been playing around on koboldcpp and noticed a similar behavior: it's pretty consistent that increasing the temperature by a factor of N can be 'cancelled out' by increasing the smoothing by a factor of N^2. I have confirmed in testing that the generations from [T=1, S=0.25], [T=2, S=1], [T=3, S=2.25], [T=4, S=4], and [T=0.5, S=0.0625] are all identical at fixed seeds (i.e. when N is 1/2/3/4/0.5 for that initial pair of values). I've also repeated this for a few other initial values, and the pattern has held up.

Looking at the code, I think this makes some sense. At first glance I wasn't sure, since the smoothing factor is only applied to the quadratic difference while the temperature is applied to the whole logit; however, because of the normalisation that happens with softmax during the smoothing function, I believe the effects of the "h" value are essentially cancelled out, since it is constant across all tokens. I think.

Either way, there is one suggestion I would like to make based on this behavior, regarding the implementation with Dynamic Temperature. As mentioned, the effects of quadratic sampling depend on the combination of temperature and smoothing factor. However, in its current implementation with Dynamic Temp, the smoothing factor remains constant no matter what the temperature is adjusted to. This means that when the temperature is adjusted below 1, the chosen smoothing factor is effectively increased, and when it is adjusted above 1, it is effectively decreased. As an example, consider a dynamic temp range of 0.5 to 2, with a smoothing factor of 0.25:
I think it could make more sense to adjust the value based on temperature for consistency. If you take the chosen smoothing factor as being at Temp=1, then you can multiply it by the square of the actual temperature to get a consistent effective value across all ranges... but I'm not sure if that would just obviate temperature entirely.

Edit: I suppose you could potentially go a stage further and invert the relationship: make the smoothing *more* deterministic at higher temperatures, and less so at lower ones. You could even add a second control factor to allow that relationship to be adjusted separately. My first thought is adding a (constant + 1) that you multiply by that square of the temperature: at 0 the smoothing would be flat, at -1 you'd get the current behavior, and at +1 you'd get the inverse. But that's just off the top of my head; I'd have to see the actual log probs to see if it was actually worth doing. Do you have a public copy of your matplotlib graph setups anywhere?

Edit 2: I originally said divide when it should be multiply.
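The "cancelled out by N^2" observation can be checked numerically. The sketch below assumes the transform is temperature scaling followed by `logit' = -s * (logit - max_logit)^2`, with the constant `max_logit` term dropped since it cancels in softmax; that form is my reading of the behavior described here, not necessarily the exact implementation:

```python
import math

def smoothed_probs(logits, temperature, smoothing_factor):
    # Assumed form of quadratic/smoothing sampling: temperature scaling
    # first, then logit' = -s * (logit - max_logit)^2.  Any additive
    # constant (like the max_logit/T term) is identical across tokens,
    # so softmax cancels it and we can drop it here.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    transformed = [-smoothing_factor * (x - m) ** 2 for x in scaled]
    z = max(transformed)
    exps = [math.exp(x - z) for x in transformed]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.5, 1.0, -0.5]          # toy logits, purely illustrative
base = smoothed_probs(logits, 1.0, 0.25)  # the [T=1, S=0.25] reference

# Every (T, S) pair with S / T^2 == 0.25 should yield the same distribution.
for t, s in [(2.0, 1.0), (3.0, 2.25), (4.0, 4.0), (0.5, 0.0625)]:
    probs = smoothed_probs(logits, t, s)
    assert all(abs(a - b) < 1e-9 for a, b in zip(probs, base))
```

Under this assumption, the effective parameter is just `S / T^2`, which is exactly the pattern observed at fixed seeds above.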
--------- Co-authored-by: oobabooga <112222186+oobabooga@users.noreply.github.com>
Quadratic Sampling
The idea behind this is to simplify sampling as much as possible for the purposes of creative writing.
The design I've been testing (on a Mistral 7b so far) is "quadratic sampling". The way that it works is:
TL;DR of what this means for an end user: a sampler that can both make the model less deterministic and punish extremely low probability options.
A reasonable range from my testing is 0-1.0; 0.2-0.3 seemed like the optimal range to tinker with for creative outputs.
It can also be used to make the model pseudo-deterministic in a way that gives extremely close options less relative change than a lower temperature value would.
Higher `smoothing_factor` = more deterministic; lower = more even top probabilities. Values under 0.1 are not recommended unless you're setting it to 0 to disable it.
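A rough sketch of both effects at once, assuming the transform is `logit' = max_logit - s * (logit - max_logit)^2` (my reading of the design above; function and variable names are illustrative, and in the real sampler `smoothing_factor = 0` is a special case that disables the transform entirely):

```python
import math

def quadratic_smoothing(logits, smoothing_factor):
    # Assumed transform: each logit is pulled down by s * (its distance
    # from the top logit)^2, so near-ties are barely changed while tokens
    # far below the top are punished quadratically.
    m = max(logits)
    return [m - smoothing_factor * (x - m) ** 2 for x in logits]

def softmax(xs):
    z = max(xs)
    exps = [math.exp(x - z) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Two near-tied tokens, one plausible token, one very unlikely token.
logits = [5.0, 4.8, 2.0, -4.0]
before = softmax(logits)
after = softmax(quadratic_smoothing(logits, 0.25))

# The two near-tied top options move closer together...
assert after[1] / after[0] > before[1] / before[0]
# ...while the very unlikely tail option is punished even further.
assert after[3] < before[3]
```

This matches the description: close options stay competitive (unlike lowering temperature, which sharpens the whole distribution uniformly), while the low-probability tail is squashed.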