
Information Retrieval (IR) metrics implementation (MAP, MRR, P@K, R@K, HR@K) [wip] #4991

Closed
wants to merge 10 commits

Conversation

@lucadiliello (Contributor) commented Dec 6, 2020

What does this PR do?

This PR integrates 5 new metrics for information retrieval into the metrics package: it adds 5 classes in metrics.retrieval and 5 functions in metrics.functional.
The added metrics comprise:

  • Mean Average Precision (MAP)
  • Mean Reciprocal Rank (MRR)
  • Precision @ K
  • Recall @ K
  • Hit Rate @ K
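For reference, these follow the standard IR definitions: MAP averages, over queries, the per-query Average Precision (the mean of P@k taken at each rank k where a relevant document appears); MRR averages the reciprocal rank 1/rank_q of the first relevant document for each query q; and Hit Rate @ K is the fraction of queries with at least one relevant document among the top K results.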

The discussion about this PR started in the metrics channel on Slack.
I decided to open a single PR since all the metrics inherit from a RetrievalMetric class that contains most of the logic.

Since the structure of the metrics should be discussed further, I did not implement tests or update the docs yet. I will do that in future commits if the proposed solution is accepted.

The main problem arises from the fact that in IR, for each query a set of documents is evaluated and a score is assigned to each of them. The performance therefore cannot be computed on a single "row" of the test set; it has to be computed on each query's results separately.

As an example, suppose you want to compute the P@1 of some retrieval system. The precision @ k is the fraction of relevant documents among the k that received the highest scores.
To group predictions about the same query together, I used an additional indexes tensor:

>>> indexes = torch.tensor([0, 0, 0, 1, 1, 1, 1])
>>> preds = torch.tensor([0.2, 0.3, 0.5, 0.1, 0.3, 0.5, 0.2])
>>> target = torch.tensor([False, False, True, False, True, False, False])

>>> p_k = PrecisionAtK(k=1)
>>> p_k(indexes, preds, target)
>>> # P@1 for the first query is 1 because its highest-scoring document is relevant;
>>> # for the second query it is 0, so the mean is 0.5.
>>> p_k.compute()
0.5

In this example, the first 3 predictions refer to query 0 and the remaining 4 to query 1. Unfortunately, not all queries are compared against the same number of documents, so I used the indexes tensor to group predictions about the same query.
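For clarity, here is a minimal standalone sketch in plain PyTorch of how the indexes tensor is meant to work; the precision_at_k helper below is purely illustrative and is not the proposed PrecisionAtK implementation. Predictions are grouped by query id, P@k is computed inside each group, and the per-query values are averaged.

import torch

def precision_at_k(indexes, preds, target, k=1):
    # Group predictions by query id, compute P@k per query, then average over queries.
    scores = []
    for group in indexes.unique():
        mask = indexes == group                             # predictions belonging to one query
        group_preds, group_target = preds[mask], target[mask]
        top_k = group_preds.topk(min(k, group_preds.numel())).indices
        scores.append(group_target[top_k].float().mean())   # fraction of relevant docs in the top k
    return torch.stack(scores).mean()

indexes = torch.tensor([0, 0, 0, 1, 1, 1, 1])
preds = torch.tensor([0.2, 0.3, 0.5, 0.1, 0.3, 0.5, 0.2])
target = torch.tensor([False, False, True, False, True, False, False])
print(precision_at_k(indexes, preds, target, k=1))          # tensor(0.5000)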

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified; Bugfixes should be included in bug-fix release milestones (m.f.X) and features should be included in (m.X.b) releases.

Did you have fun?

Make sure you had fun coding 🙃

@pep8speaks commented Dec 6, 2020

Hello @lucadiliello! Thanks for updating this PR.

Line 74:121: E501 line too long (121 > 120 characters)

Comment last updated at 2020-12-13 19:28:27 UTC

@SkafteNicki SkafteNicki added the Metrics and feature labels Dec 6, 2020
@SkafteNicki SkafteNicki added this to the 1.2 milestone Dec 6, 2020
@rohitgr7 (Contributor) commented Dec 6, 2020

P@K and R@K are already WIP in #4837 and its follow-up PRs

@SkafteNicki (Member)

@rohitgr7 if you look at the wiki page for precision and recall (https://en.m.wikipedia.org/wiki/Precision_and_recall), it actually has two definitions, one for classification and one for information retrieval. It therefore makes sense that we may need different implementations for different use cases. Maybe @lucadiliello can give an explanation of why we need a special precision / recall for information retrieval?

@lucadiliello (Contributor, Author)

Precision and recall in the Information Retrieval context are different. In IR you usually have a query Q that is compared with n documents D_1, D_2, ..., D_n, of which some may be relevant and some not. For each pair (Q, D_i) the model computes a score s_i. Once the scores are predicted, the documents are ordered by score.

After the sorting, you compute precision and recall in this way:
P@K = (number of relevant documents in the first K) / K
R@K = (number of relevant documents in the first K) / (total number of relevant documents)
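For illustration, with hypothetical numbers: suppose K = 2 and the documents, sorted by score, have relevances [1, 0, 1, 0]. The top 2 contain one relevant document, so P@2 = 1/2; since there are 2 relevant documents in total, R@2 = 1/2 as well.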

This is quite different from the usual precision and recall, mainly because you have to order the predictions over a set of document scores, and you also have to use many predictions (n) to compute the Precision or Recall with respect to a single query.

More info here

I do not see any existing metric that allows me to do this.

@tadejsv (Contributor) commented Dec 6, 2020

@lucadiliello Would you be able to provide links to some repositories/datasets/examples where evaluation is done with a varying (unpredictable) number of documents per query?

@justusschock (Member) left a comment


Also make sure all the arguments are documented for every function/class.

Review comments (outdated, resolved) on:
pytorch_lightning/metrics/retrieval/precision.py
pytorch_lightning/metrics/retrieval/recall.py
pytorch_lightning/metrics/retrieval/hit_rate.py
@SkafteNicki (Member) left a comment


Overall very nice. What reference implementation can we use to test against?
Also remember to update docs/source/metrics.rst with the new metrics.

Review comment on: pytorch_lightning/metrics/__init__.py
@lucadiliello (Contributor, Author) commented Dec 7, 2020

@lucadiliello Would you be able to provide links to some repositories/datasets/examples, where evaluation is done with varying (unpredictable) number of documents per query?

I can mention:

and others. The problem arises from the fact that in real-world applications you have some engine that, given a query, extracts a variable number of documents with some heuristic. After that, an NLP model is used to find the best document among those proposed.
To be as compatible as possible with the mentioned datasets and many others, I propose this solution.

@Borda (Member) commented Dec 8, 2020

@lucadiliello great work! Mind splitting it into a few smaller PRs, as we did with @tadejsv?
I would suggest each metric as a single PR, plus adding the functional version...

@Borda Borda changed the base branch from master to release/1.2-dev December 14, 2020 17:33
@Borda Borda changed the title from "Information Retrieval (IR) metrics implementation (MAP, MRR, P@K, R@K, HR@K)" to "Information Retrieval (IR) metrics implementation (MAP, MRR, P@K, R@K, HR@K) [wip]" Jan 6, 2021
@edenlightning edenlightning modified the milestones: 1.2, 1.3 Feb 8, 2021
Base automatically changed from release/1.2-dev to master February 11, 2021 14:31
@edenlightning edenlightning removed this from the 1.3 milestone Feb 22, 2021
@Borda (Member) commented Mar 15, 2021

Hi @lucadiliello, we are very happy that you decided to improve PL and opened this PR.
Nevertheless, we have moved the metrics package out of PL into a separate repository, torchmetrics, and unfortunately there is no GH option to transfer your open PR there to finish the job. So I would like to kindly ask: if you are able to finish this PR by the end of this week, we can then transfer your merged commit to the new repo; otherwise, please open the very same PR in the new repository... 🐰

Anyway thank you very much for being with us ⚡

@Borda (Member) commented Mar 22, 2021

@lucadiliello I have transferred the last IR metrics to the new torchmetrics...
As we have already started the deprecation here in PL, it would be even more difficult to finish adding a new metric here. May I kindly ask you to reopen the very same PR in torchmetrics (the reason is to preserve you as the contributing author)? I will help you finish it there :]

So please, open a PR in TM with the very same content and refer to this PL PR 🐰
