Cubic sampling w/ curve param #5551
Conversation
Merge dev branch
Merge dev branch (oobabooga#5257)
I get the nan error as well, using exl2_HF (utils.py line 2734). It's a problem from transformers.
2.99 is the highest I can go before this error. But so far I was able to get the factor down to .05 with that value.

I only have two braincells to knock together, but to me 3 - 3 is 0, and then that's 0/2, which makes NaN. Btw: changing it to 10 did fix it for me, up to 9.99 of course. I am getting the best results with .20-.2X and 1.04, judging by the token distribution pictured in the PR video. Also, .02 and 4.82-5.6 was another decent point. Otherwise this removes too many tokens and gets very deterministic.

The nan error was caused by making operations with

Will retest. It works, but I'm back to square one on how to set it.


This builds upon the original Quadratic Sampling method with an additional parameter that I've labeled "smoothing_curve".
The idea is to enable even lower smoothing_factor values than ~0.25ish to work well; we do this by applying a cubic transformation to compensate, which seems to make the falloff steeper.
Not ready to merge yet; it needs empirical testing from users. My hope is that you can fully avoid having to use truncation schemes and instead apply a fully "smooth" transformation to the distribution across different models.
The higher the smoothing_curve, the steeper the falloff (so it becomes harsher).
2024-02-19_15-42-06.mp4
Brief tests on a 7b show that it does in fact help make lower smoothing_factor values coherent in practice (so far, at least).
A smoothing_curve of 1.0 is the "old" behavior and has no effect.