
Conversation

hynky1999 (Collaborator)

What does this implement/fix? Explain your changes.

This PR adds two new features:

  1. A new Probability metric, which collects the probability of the correct answer. This can be either the raw probability or the probability mass (normalized over the other choices).
  2. Revamped Acc/Prob normalization, adding two new normalizations (see the sketch after this list):
    a) Token normalization, which we found to work better than acc norm on most non-English languages.
    b) Pointwise mutual information (PMI) normalization, which is a good way to test tasks with unlikely tokens; see https://arxiv.org/abs/2406.08446
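
For context (not part of the PR text), here is a minimal sketch of how these metrics and normalizations are typically computed; the function names and signatures below are illustrative assumptions, not the PR's actual API:

import numpy as np

def prob_of_correct(choices_logprob: list[float], gold_idx: int, normalized: bool = False) -> float:
    # Probability metric: raw probability of the gold choice, or its probability mass among all choices.
    probs = np.exp(np.array(choices_logprob))
    return float(probs[gold_idx] / probs.sum()) if normalized else float(probs[gold_idx])

def token_norm(choice_logprob: float, choice_tokens: list[int]) -> float:
    # Token normalization: divide the choice log-probability by its token count instead of its character count.
    return choice_logprob / max(len(choice_tokens), 1)

def pmi_norm(choice_logprob: float, unconditioned_logprob: float) -> float:
    # PMI normalization: log P(choice | query) - log P(choice | unconditioned query).
    return choice_logprob - unconditioned_logprob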

Lastly, I made some small changes to request processing, removing parts that are not needed and can easily cause bugs.

Comments

  • I am not really happy about having a new category just for normalization, but I didn't find a better way in the current system. The problem is that when creating requests we only have access to the sample fc, but nothing else, so we can't really do any kind of structural decomposition :(
  • These new norms are only added for non-single-token task types. Adding them to single-token tasks would require improving the request-creation logic to keep it maintainable, and can be done in another PR.

PS: Relevant discussion about token norm: EleutherAI/lm-evaluation-harness#1396

@hynky1999 hynky1999 force-pushed the prob_metrics_and_more_norms branch from 0d351b2 to 6bcf160 Compare August 22, 2024 12:30
@clefourrier clefourrier (Member) left a comment

I'm still not clear why you need a new system for the PMI requests, can you elaborate?

Normalization = CharNorm | TokenNorm | PMINorm


def normalize_log_probs(
Member:

Why don't you use the mechanism we have for metrics with arguments, where you instantiate a class and then execute the correct normalisation based on the arguments?
Using empty dataclasses, then case statements and asserts, has a very different style from the rest of the lib.

Member:

class Normalization:
    def __init__(self, level: str, ignore_first_space: bool = False):
        if level not in ["character", "token", "pmi"]:
            raise ValueError(...)
        self.level = level
        self.ignore_first_space = ignore_first_space

    def normalize(
        self,
        choices_logprob: list[float],
        unconditioned_logprob: list[float] | None,
        choices_text: list[str] | None,
        choices_tokens: list[list[int]] | None,
    ): ...

Member:

Btw normalization is too broad as a name - you'll need to find something else

@hynky1999 hynky1999 (Collaborator, Author), Aug 22, 2024:

Because ignore_first_space is a property of just char normalization. The parameter doesn't make sense in the other normalization contexts, so it makes sense to me for it to be a property of character normalization. It can be confusing for people passing, say, token normalization, as they might think it will not consider the first space token.

Member:

Then I would be OK with one class for each - I just don't see the point of the empty dataclasses, which add an extra layer of complexity to the code.

Member:

In general, we've observed that adding too many layers of abstraction/disconnected code (like here: empty dataclasses, plus a very long normalization function whose cases are unrelated to one another, plus a to_str function) is the best way to introduce bugs, as it becomes very easy to forget that these pieces of code are linked - whereas using classes to group code logic reduces some of this problem.

Collaborator Author:

So in the end, your proposed change would be to switch to literals (by flattening, so that char norm now has 2 literals), but keep the rest of the logic the same?

Like in the example you showed, I don't really see why we need any class at all; we don't need to keep any state there whatsoever, we just need to know how to normalize.

Member:

I don't really see why we need any class at all; we don't need to keep any state there whatsoever, we just need to know how to normalize

  1. Homogeneity with the rest of the code style.
  2. Avoiding having many functions (normalisation, to string) that are unconnected to the code base.

Give me a minute, I'll send you how I would write this.

Collaborator Author:

I can also do it the OOP way:

class Normalization:
    def normalize(self, logprobs):
        pass


class CharNorm(Normalization):
    def normalize(self, logprobs):
        pass
Probably better, as we are more likely to be adding new normalizations than new functions that use them. wdyt @clefourrier ?

Member:

Not a fan of the OOP way; the different norms have nothing in common, so having them inherit from a common class is weird at best.
The way you did it here is the best I think, but Clémentine is right that it does not match the rest of the codebase. I would however still go for that approach.

@hynky1999 (Collaborator, Author):

I'm still not clear why you need a new system for the PMI requests, can you elaborate?

Because PMI needs two requests: one with the normal query and one with the unconditioned query (usually empty or just Answer:).
Thus you need to create two loglikelihood requests.
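
For illustration, here is a hedged sketch of that request pairing; the request class and helper names are placeholders, not lighteval's actual classes:

from dataclasses import dataclass

@dataclass
class LoglikelihoodRequest:
    context: str
    continuation: str

def build_pmi_requests(query: str, choices: list[str], unconditioned_query: str = "Answer:") -> list[LoglikelihoodRequest]:
    # One request conditioned on the real query and one on the (near-)empty query, for every choice;
    # the metric later computes logprob_conditioned - logprob_unconditioned per choice.
    requests = []
    for choice in choices:
        requests.append(LoglikelihoodRequest(context=query, continuation=choice))
        requests.append(LoglikelihoodRequest(context=unconditioned_query, continuation=choice))
    return requests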

@hynky1999 hynky1999 force-pushed the prob_metrics_and_more_norms branch from f333587 to 20513e2 Compare August 30, 2024 12:56
@hynky1999 (Collaborator, Author) commented Aug 30, 2024:

Rebased and made a small update:

  • Moved the raw probability to the target_perplexity category to reduce wasteful compute.
  • This meant that I had to change the signature that the comp fcs receive, so I updated that.

I also noticed that we have the metric acc_gold_likelihoods, but it was computing a completely different thing than expected, so I updated it to compute what it should.

Here are the comments I wrote for that fc:

    # TODO: (hynek) I think this function is absolutely broken and doesn't work as intended. The target_accs used to be True/False:
    # True if the argmax of the logits is in the gold, False otherwise. We then took only the first gold for some reason.

    # Thus the name doesn't reflect that at all, and neither do the comments:
    # 1) I think you wanted to talk about probs
    # 2) Even that's not true; what this measures is whether every token's prob will be > the others'
    # 3) This was measured across just a single gold, not across all possible golds

@hynky1999 hynky1999 requested a review from NathanHB August 30, 2024 15:41
@NathanHB (Member) commented Sep 2, 2024:

Thanks for the modifs and the tests! This looks good to be merged after running ruff. For the normalization functions, I would go for the way where you use a function to route to the correct behaviour, as we do not need OOP here.
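
For reference, a minimal sketch of the routing-function approach being agreed on here, reusing the CharNorm/TokenNorm/PMINorm names from the diff above; the exact dataclass fields and bodies are illustrative assumptions, not the merged implementation:

from dataclasses import dataclass

@dataclass
class CharNorm:
    ignore_first_space: bool = False

@dataclass
class TokenNorm:
    pass

@dataclass
class PMINorm:
    pass

Normalization = CharNorm | TokenNorm | PMINorm

def normalize_log_probs(
    normalization: Normalization,
    choices_logprob: list[float],
    unconditioned_logprob: list[float] | None,
    choices_text: list[str] | None,
    choices_tokens: list[list[int]] | None,
) -> list[float]:
    # Route to the normalization selected by the config object.
    match normalization:
        case CharNorm(ignore_first_space=skip):
            assert choices_text is not None
            texts = [t[1:] if skip and t.startswith(" ") else t for t in choices_text]
            return [lp / max(len(t), 1) for lp, t in zip(choices_logprob, texts)]
        case TokenNorm():
            assert choices_tokens is not None
            return [lp / max(len(toks), 1) for lp, toks in zip(choices_logprob, choices_tokens)]
        case PMINorm():
            assert unconditioned_logprob is not None
            return [lp - ulp for lp, ulp in zip(choices_logprob, unconditioned_logprob)]
    raise ValueError(f"Unknown normalization: {normalization}")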

@hynky1999 (Collaborator, Author):

Ruffed, let's wait for tests and merge it

@NathanHB NathanHB merged commit 8c787df into main Sep 2, 2024
2 checks passed
hynky1999 added a commit that referenced this pull request May 22, 2025
NathanHB added a commit that referenced this pull request Sep 19, 2025