Probability Metric + New Normalization #276
Conversation
Force-pushed from 0d351b2 to 6bcf160.
I'm still not clear on why you need a new system for the PMI requests; can you elaborate?
Normalization = CharNorm | TokenNorm | PMINorm

def normalize_log_probs(
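For context, a minimal, hypothetical reconstruction of the empty-dataclass + match design being discussed might look like the sketch below; the names and signatures are assumptions based on this thread, not the PR's exact code:

from dataclasses import dataclass


@dataclass
class CharNorm:
    ignore_first_space: bool = False


@dataclass
class TokenNorm:
    pass


@dataclass
class PMINorm:
    pass


Normalization = CharNorm | TokenNorm | PMINorm


def normalize_log_probs(
    normalization: Normalization,
    choices_logprob: list[float],
    unconditioned_logprob: list[float] | None,
    choices_text: list[str] | None,
    choices_tokens: list[list[int]] | None,
) -> list[float]:
    # Hypothetical reconstruction; only the dataclass names come from the diff above.
    match normalization:
        case CharNorm(ignore_first_space=ignore):
            # Divide by the character count, optionally skipping a leading space.
            assert choices_text is not None
            lengths = [
                max(len(text) - (1 if ignore and text.startswith(" ") else 0), 1)
                for text in choices_text
            ]
            return [lp / n for lp, n in zip(choices_logprob, lengths)]
        case TokenNorm():
            # Divide by the number of tokens in each choice.
            assert choices_tokens is not None
            return [lp / max(len(toks), 1) for lp, toks in zip(choices_logprob, choices_tokens)]
        case PMINorm():
            # Subtract the unconditioned log-probability (pointwise mutual information).
            assert unconditioned_logprob is not None
            return [lp - ulp for lp, ulp in zip(choices_logprob, unconditioned_logprob)]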
Why don't you use the mechanism we have for the metrics with arguments, where you instantiate a class and then execute the correct normalisation based on the arguments?
Using empty dataclasses and then match cases and asserts has a very different style from the rest of the lib.
class Normalization:
    def __init__(self, level: str, ignore_first_space: bool = False):
        if level not in ["character", "token", "pmi"]:
            raise ValueError(...)
        self.level = level
        self.ignore_first_space = ignore_first_space

    def normalize(
        self,
        choices_logprob: list[float],
        unconditioned_logprob: list[float] | None,
        choices_text: list[str] | None,
        choices_tokens: list[list[int]] | None,
    ): ...
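For illustration, a call site under this single-class design might look like the following; the argument values are invented for the example:

# Hypothetical usage of the single-class sketch above; the numbers are made up.
norm = Normalization(level="token")
scores = norm.normalize(
    choices_logprob=[-12.3, -8.1],
    unconditioned_logprob=None,
    choices_text=None,
    choices_tokens=[[464, 3290], [464, 2415, 318]],
)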
Btw normalization is too broad as a name - you'll need to find something else
Because ignore_first_space is a property of just the character normalization. The parameter doesn't make sense in the other normalization contexts, so it makes sense to me for it to be a property of character normalization. It can be confusing for people passing, let's say, token normalization, as they might think that it will not consider the first space token.
Then I would be OK with one class for each; I just don't see the point of the empty dataclasses, which add an extra layer of complexity to the code.
In general, we've observed that adding too many layers of abstraction/disconnected code (like here: empty dataclasses, plus a very long normalization function whose cases are uncorrelated to one another, plus a to_str function) is the best way to introduce bugs, as it becomes very easy to forget that these pieces of code are linked, whereas using classes to group code logic reduces some of this problem.
So in the end, your proposed change would be to switch to literals (by flattening, so that char norm now has 2 literals), but keep the rest of the logic the same?
As in the example you showed, I don't really see why we need any class at all; we don't need to keep any state there whatsoever, we just need to know how to normalize.
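For instance, the stateless, function-only version this comment argues for could be as small as the following hypothetical sketch:

def token_normalize(choices_logprob: list[float], choices_tokens: list[list[int]]) -> list[float]:
    # Per-token log-probability; no object state is needed for this.
    return [lp / max(len(toks), 1) for lp, toks in zip(choices_logprob, choices_tokens)]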
> I don't really see why we need any class at all; we don't need to keep any state there whatsoever, we just need to know how to normalize
- homogeneity with the rest of the code style
- avoiding having many functions (normalisation, to string) which are unconnected to the code base.
Give me a minute, I'll send you how I would write this.
I can also do it in an OOP way.
class Normalization:
    def normalize(self, logprobs):
        pass

class CharNorm(Normalization):
    def normalize(self, logprobs):
        pass
Probably better, as we will more likely be adding new Normalizations than adding new functions that use them. wdyt @clefourrier?
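Fleshed out, that OOP variant might look like this hypothetical sketch; the method signature is simplified and assumed, not taken from the PR:

from abc import ABC, abstractmethod


class Normalization(ABC):
    @abstractmethod
    def normalize(self, logprob: float, text: str, tokens: list[int],
                  unconditioned_logprob: float | None) -> float: ...


class CharNorm(Normalization):
    def normalize(self, logprob, text, tokens, unconditioned_logprob):
        return logprob / max(len(text), 1)  # per-character log-probability


class TokenNorm(Normalization):
    def normalize(self, logprob, text, tokens, unconditioned_logprob):
        return logprob / max(len(tokens), 1)  # per-token log-probability


class PMINorm(Normalization):
    def normalize(self, logprob, text, tokens, unconditioned_logprob):
        assert unconditioned_logprob is not None
        return logprob - unconditioned_logprob  # pointwise mutual information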
Not a fan of the OOP way; the different norms have nothing in common, so having them inherit from a common class is weird at best.
The way you did it here is the best, I think, but Clementine is right that it does not match the rest of the codebase. I would however still go for that approach.
Because PMI needs two requests: one with the normal query and one with the unconditioned query (usually empty or just "Answer:").
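To illustrate the two-request scheme described here (the prompt wording and the score_logprob helper below are assumptions, not lighteval's API):

def pmi_score(score_logprob, context: str, choice: str,
              unconditioned_prefix: str = "Answer:") -> float:
    # score_logprob(prompt, continuation) is a hypothetical helper returning
    # the log-probability of `continuation` given `prompt`.
    conditioned = score_logprob(context, choice)                 # log P(choice | context)
    unconditioned = score_logprob(unconditioned_prefix, choice)  # log P(choice | "Answer:")
    return conditioned - unconditioned                           # pointwise mutual information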
Co-authored-by: Nathan Habib <[email protected]>
Force-pushed from f333587 to 20513e2.
Rebased and made a small update:
Also noticed that we already have this metric. Here are the comments I wrote on that function.
Thanks for the modifs and the tests! This looks good to be merged after running ruff.
Co-authored-by: Nathan Habib <[email protected]>
Ruffed; let's wait for the tests and merge it.
What does this implement/fix? Explain your changes.

This PR adds two new features:

1) New Probability Metric, allowing us to collect the probability of the correct answer. This can be either the raw probability or the probability mass (normalized by the other choices).
2) Revamps Acc/Prob normalization and adds two new normalizations:
a) Token normalization, which we found to be better than acc norm on most non-English languages.
b) PointwiseMutualInformation normalization, which is a good way to test tasks with unlikely tokens; see: https://arxiv.org/abs/2406.08446

Lastly, I have made some small changes to the requests processing, removing parts which are not needed and can easily cause bugs.

Comments

- I am not really content with having a new category just for normalization, but I didn't find a better way in the current system. The problem is that when creating requests we only have access to the sample function, but nothing else, so we can't really do any kind of structural decomposition :(
- These new norms are only added for non-single-token types of tasks. Adding them to single-token tasks would require improving the request-creation logic to be maintainable, and can be done in another PR.

PS: Relevant discussion about token norm: EleutherAI/lm-evaluation-harness#1396

Co-authored-by: Hynek Kydlicek <[email protected]>
Co-authored-by: Nathan Habib <[email protected]>
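As a sketch of the two probability variants described above (the function names are illustrative, not lighteval's actual API):

import math


def raw_probability(gold_logprob: float) -> float:
    # Raw probability of the gold answer.
    return math.exp(gold_logprob)


def probability_mass(gold_logprob: float, choices_logprob: list[float]) -> float:
    # Probability of the gold answer normalized over all choices.
    total = sum(math.exp(lp) for lp in choices_logprob)
    return math.exp(gold_logprob) / total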