
Conversation

hynky1999 (Collaborator)

What does this implement/fix? Explain your changes.

This PR adds two new features:

  1. A new Probability metric, which collects the probability of the correct answer. This can be either the raw probability or the probability mass (normalized over the other choices).
  2. Revamped Acc/Prob normalization, adding two new normalizations (see the sketch after this list):
    a) Token normalization, which we found to work better than acc norm on most non-English languages.
    b) Pointwise mutual information (PMI) normalization, which is a good way to test tasks with unlikely tokens; see https://arxiv.org/abs/2406.08446
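
For context (not part of the PR text), here is a minimal sketch of how these metrics and normalizations are typically computed; the function names and signatures below are illustrative assumptions, not the PR's actual API:

import numpy as np

def prob_of_correct(choices_logprob: list[float], gold_idx: int, normalized: bool = False) -> float:
    # Probability metric: raw probability of the gold choice, or its probability mass among all choices.
    probs = np.exp(np.array(choices_logprob))
    return float(probs[gold_idx] / probs.sum()) if normalized else float(probs[gold_idx])

def token_norm(choice_logprob: float, choice_tokens: list[int]) -> float:
    # Token normalization: divide the choice log-probability by its token count instead of its character count.
    return choice_logprob / max(len(choice_tokens), 1)

def pmi_norm(choice_logprob: float, unconditioned_logprob: float) -> float:
    # PMI normalization: log P(choice | query) - log P(choice | unconditioned query).
    return choice_logprob - unconditioned_logprob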

Lastly, I made some small changes to request processing, removing parts that are not needed and can easily cause bugs.

Comments

  • I am not really happy about having a new category just for normalization, but I didn't find a better way in the current system. The problem is that when creating requests we only have access to the sample fc, but nothing else, so we can't really do any kind of structural decomposition :(
  • These new norms are only added for non-single-token task types. Adding them to single-token tasks would require improving the request-creation logic to keep it maintainable, and can be done in another PR.

PS: Relevant discussion about token norm: EleutherAI/lm-evaluation-harness#1396

@hynky1999 hynky1999 force-pushed the prob_metrics_and_more_norms branch from 0d351b2 to 6bcf160 Compare August 22, 2024 12:30
@clefourrier clefourrier (Member) left a comment

I'm still not clear why you need a new system for the PMI requests, can you elaborate?

Normalization = CharNorm | TokenNorm | PMINorm


def normalize_log_probs(
Member:

Why don't you use the mechanism we have for metrics with arguments, where you instantiate a class and then execute the correct normalisation based on the arguments?
Using empty dataclasses, then case statements and asserts, has a very different style from the rest of the lib.

Member:

class Normalization:
    def __init__(self, level: str, ignore_first_space: bool = False):
        if level not in ["character", "token", "pmi"]:
            raise ValueError(...)
        self.level = level
        self.ignore_first_space = ignore_first_space

    def normalize(
        self,
        choices_logprob: list[float],
        unconditioned_logprob: list[float] | None,
        choices_text: list[str] | None,
        choices_tokens: list[list[int]] | None,
    ): ...

Member:

Btw normalization is too broad as a name - you'll need to find something else

@hynky1999 hynky1999 (Collaborator, Author), Aug 22, 2024:

Because ignore_first_space is a property of just char normalization. The parameter doesn't make sense in the other normalization contexts, so it makes sense to me for it to be a property of character normalization. It can be confusing for people passing, say, token normalization, as they might think it will not consider the first space token.

Member:

Then I would be OK with one class for each - I just don't see the point of the empty dataclasses, which add an extra layer of complexity to the code.

Member:

In general, we've observed that adding too many layers of abstraction/disconnected code (like here: empty dataclasses, plus a very long normalization function whose cases are unrelated to one another, plus a to_str function) is the best way to introduce bugs, as it becomes very easy to forget that these pieces of code are linked - whereas using classes to group code logic reduces some of this problem.

Collaborator Author:

So in the end, your proposed change would be to switch to literals (by flattening, so that char norm now has 2 literals), but keep the rest of the logic the same?

Like in the example you showed, I don't really see why we need any class at all; we don't need to keep any state there whatsoever, we just need to know how to normalize.

Member:

I don't really see why we need any class at all; we don't need to keep any state there whatsoever, we just need to know how to normalize

  1. Homogeneity with the rest of the code style.
  2. Avoiding having many functions (normalisation, to string) that are unconnected to the code base.

Give me a minute, I'll send you how I would write this.

Collaborator Author:

I can also do it the OOP way:

class Normalization:
    def normalize(self, logprobs):
        pass


class CharNorm(Normalization):
    def normalize(self, logprobs):
        pass
Probably better, as we are more likely to be adding new normalizations than new functions that use them. wdyt @clefourrier ?

Member:

Not a fan of the OOP way; the different norms have nothing in common, so having them inherit from a common class is weird at best.
The way you did it here is the best I think, but Clémentine is right that it does not match the rest of the codebase. I would however still go for that approach.

@hynky1999 (Collaborator, Author):

I'm still not clear why you need a new system for the PMI requests, can you elaborate?

Because PMI needs two requests: one with the normal query and one with the unconditioned query (usually empty or just Answer:).
Thus you need to create two loglikelihood requests.
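
For illustration, here is a hedged sketch of that request pairing; the request class and helper names are placeholders, not lighteval's actual classes:

from dataclasses import dataclass

@dataclass
class LoglikelihoodRequest:
    context: str
    continuation: str

def build_pmi_requests(query: str, choices: list[str], unconditioned_query: str = "Answer:") -> list[LoglikelihoodRequest]:
    # One request conditioned on the real query and one on the (near-)empty query, for every choice;
    # the metric later computes logprob_conditioned - logprob_unconditioned per choice.
    requests = []
    for choice in choices:
        requests.append(LoglikelihoodRequest(context=query, continuation=choice))
        requests.append(LoglikelihoodRequest(context=unconditioned_query, continuation=choice))
    return requests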

@hynky1999 hynky1999 force-pushed the prob_metrics_and_more_norms branch from f333587 to 20513e2 Compare August 30, 2024 12:56
@hynky1999 (Collaborator, Author) commented Aug 30, 2024:

Rebased and made a small update:

  • Moved the raw probability to the target_perplexity category to reduce wasteful compute.
  • This meant that I had to change the signature that the comp fcs receive, so I updated that.

I also noticed that we have the metric acc_gold_likelihoods, but it was computing a completely different thing than expected, so I updated it to compute what it should.

Here are the comments I wrote for that fc:

    # TODO: (hynek) I think this function is absolutely broken and doesn't work as intended. The target_accs used to be True/False:
    # True if the argmax of the logits is in the gold, False otherwise. We then took only the first gold for some reason.

    # Thus the name doesn't reflect that at all, and neither do the comments:
    # 1) I think you wanted to talk about probs
    # 2) Even that's not true; what this measures is whether every token's prob will be > the others'
    # 3) This was measured across just a single gold, not across all possible golds

@hynky1999 hynky1999 requested a review from NathanHB August 30, 2024 15:41
@NathanHB (Member) commented Sep 2, 2024:

Thanks for the modifs and the tests! This looks good to be merged after running ruff. For the normalization functions, I would go for the way where you use a function to route to the correct behaviour, as we do not need OOP here.
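
For reference, a minimal sketch of the routing-function approach being agreed on here, reusing the CharNorm/TokenNorm/PMINorm names from the diff above; the exact dataclass fields and bodies are illustrative assumptions, not the merged implementation:

from dataclasses import dataclass

@dataclass
class CharNorm:
    ignore_first_space: bool = False

@dataclass
class TokenNorm:
    pass

@dataclass
class PMINorm:
    pass

Normalization = CharNorm | TokenNorm | PMINorm

def normalize_log_probs(
    normalization: Normalization,
    choices_logprob: list[float],
    unconditioned_logprob: list[float] | None,
    choices_text: list[str] | None,
    choices_tokens: list[list[int]] | None,
) -> list[float]:
    # Route to the normalization selected by the config object.
    match normalization:
        case CharNorm(ignore_first_space=skip):
            assert choices_text is not None
            texts = [t[1:] if skip and t.startswith(" ") else t for t in choices_text]
            return [lp / max(len(t), 1) for lp, t in zip(choices_logprob, texts)]
        case TokenNorm():
            assert choices_tokens is not None
            return [lp / max(len(toks), 1) for lp, toks in zip(choices_logprob, choices_tokens)]
        case PMINorm():
            assert unconditioned_logprob is not None
            return [lp - ulp for lp, ulp in zip(choices_logprob, unconditioned_logprob)]
    raise ValueError(f"Unknown normalization: {normalization}")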

@hynky1999 (Collaborator, Author):

Ruffed, let's wait for tests and merge it

@NathanHB NathanHB merged commit 8c787df into main Sep 2, 2024
2 checks passed
hynky1999 added a commit that referenced this pull request May 22, 2025
NathanHB added a commit that referenced this pull request Sep 19, 2025