
Information Retrieval (IR) metrics implementation (MAP, MRR, P@K, R@K, HR@K) [wip] #4991

Closed
wants to merge 10 commits

Conversation

@lucadiliello (Contributor) commented Dec 6, 2020

What does this PR do?

This PR integrates 5 new metrics for information retrieval into the metrics package: it adds 5 classes in metrics.retrieval and 5 functions in metrics.functional.
The added metrics comprise:

  • Mean Average Precision (MAP)
  • Mean Reciprocal Rank (MRR)
  • Precision @ K
  • Recall @ K
  • Hit Rate @ K
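For reference, these follow the standard IR definitions: MAP averages, over queries, the per-query Average Precision (the mean of P@k taken at each rank k where a relevant document appears); MRR averages the reciprocal rank 1/rank_q of the first relevant document for each query q; and Hit Rate @ K is the fraction of queries with at least one relevant document among the top K results.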

The discussion about this PR started in the metrics channel on Slack.
I decided to open a single PR since all the metrics inherit from a RetrievalMetric class that contains most of the logic.

Since the structure of the metrics should be discussed further, I did not implement tests or update the docs yet. I will do that in future commits if the proposed solution is accepted.

The main problem arises from the fact that in IR, for each query a set of documents is evaluated and a score is assigned to each of them. The performance therefore cannot be computed on a single "row" of the test set; it has to be computed on each query's results separately.

As an example, suppose you want to compute the P@1 of some retrieval system. The precision @ k is the fraction of relevant documents among the k that received the highest scores.
To group predictions about the same query together, I used an additional indexes tensor:

>>> indexes = torch.tensor([0, 0, 0, 1, 1, 1, 1])
>>> preds = torch.tensor([0.2, 0.3, 0.5, 0.1, 0.3, 0.5, 0.2])
>>> target = torch.tensor([False, False, True, False, True, False, False])

>>> p_k = PrecisionAtK(k=1)
>>> p_k(indexes, preds, target)
>>> # P@1 for the first query is 1 because its highest-scoring document is relevant;
>>> # for the second query it is 0, so the mean is 0.5.
>>> p_k.compute()
0.5

In this example, the first 3 predictions refer to query 0 and the remaining 4 to query 1. Unfortunately, not all queries are compared against the same number of documents, so I used the indexes tensor to group predictions about the same query.
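For clarity, here is a minimal standalone sketch in plain PyTorch of how the indexes tensor is meant to work; the precision_at_k helper below is purely illustrative and is not the proposed PrecisionAtK implementation. Predictions are grouped by query id, P@k is computed inside each group, and the per-query values are averaged.

import torch

def precision_at_k(indexes, preds, target, k=1):
    # Group predictions by query id, compute P@k per query, then average over queries.
    scores = []
    for group in indexes.unique():
        mask = indexes == group                             # predictions belonging to one query
        group_preds, group_target = preds[mask], target[mask]
        top_k = group_preds.topk(min(k, group_preds.numel())).indices
        scores.append(group_target[top_k].float().mean())   # fraction of relevant docs in the top k
    return torch.stack(scores).mean()

indexes = torch.tensor([0, 0, 0, 1, 1, 1, 1])
preds = torch.tensor([0.2, 0.3, 0.5, 0.1, 0.3, 0.5, 0.2])
target = torch.tensor([False, False, True, False, True, False, False])
print(precision_at_k(indexes, preds, target, k=1))          # tensor(0.5000)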

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified; Bugfixes should be included in bug-fix release milestones (m.f.X) and features should be included in (m.X.b) releases.

Did you have fun?

Make sure you had fun coding 🙃

@pep8speaks commented Dec 6, 2020

Hello @lucadiliello! Thanks for updating this PR.

Line 74:121: E501 line too long (121 > 120 characters)

Comment last updated at 2020-12-13 19:28:27 UTC

@SkafteNicki SkafteNicki added the Metrics and feature labels Dec 6, 2020
@SkafteNicki SkafteNicki added this to the 1.2 milestone Dec 6, 2020
@rohitgr7 (Contributor) commented Dec 6, 2020

P@K and R@K are already WIP in #4837 and its follow-up PRs

@SkafteNicki (Member)

@rohitgr7 if you look at the wiki page for precision and recall (https://en.m.wikipedia.org/wiki/Precision_and_recall), it actually has two definitions, one for classification and one for information retrieval. It therefore makes sense that we may need different implementations for different use cases. Maybe @lucadiliello can give an explanation of why we need a special precision / recall for information retrieval?

@lucadiliello (Contributor, Author)

Precision and recall in the Information Retrieval context are different. In IR you usually have a query Q that is compared with n documents D_1, D_2, ..., D_n, of which some may be relevant and some not. For each pair (Q, D_i) the model computes a score s_i. Once the scores are predicted, the documents are ordered by score.

After the sorting, you compute precision and recall in this way:
P@K = (number of relevant documents in the first K) / K
R@K = (number of relevant documents in the first K) / (total number of relevant documents)
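For illustration, with hypothetical numbers: suppose K = 2 and the documents, sorted by score, have relevances [1, 0, 1, 0]. The top 2 contain one relevant document, so P@2 = 1/2; since there are 2 relevant documents in total, R@2 = 1/2 as well.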

This is quite different from the usual precision and recall, mainly because you have to order the predictions over a set of document scores, and you also have to use many predictions (n) to compute the Precision or Recall with respect to a single query.

More info here

I do not see any existing metric that allows me to do this.

@tadejsv (Contributor) commented Dec 6, 2020

@lucadiliello Would you be able to provide links to some repositories/datasets/examples where evaluation is done with a varying (unpredictable) number of documents per query?

@justusschock (Member) left a comment


Also make sure all the arguments are documented for every function/class.

Review comments (outdated, resolved) on:
pytorch_lightning/metrics/retrieval/precision.py
pytorch_lightning/metrics/retrieval/recall.py
pytorch_lightning/metrics/retrieval/hit_rate.py
@SkafteNicki (Member) left a comment


Overall very nice. What reference implementation can we use to test against?
Also remember to update docs/source/metrics.rst with the new metrics.

Review comment on: pytorch_lightning/metrics/__init__.py
@lucadiliello (Contributor, Author) commented Dec 7, 2020

@lucadiliello Would you be able to provide links to some repositories/datasets/examples, where evaluation is done with varying (unpredictable) number of documents per query?

I can mention:

and others. The problem arises from the fact that in real-world applications you have some engine that, given a query, extracts a variable number of documents with some heuristic. After that, an NLP model is used to find the best document among those proposed.
To be as compatible as possible with the mentioned datasets and many others, I propose this solution.

@Borda (Member) commented Dec 8, 2020

@lucadiliello great work! Mind splitting it into a few smaller PRs, as we did with @tadejsv?
I would suggest each metric as a single PR, plus adding the functional version...

@Borda Borda changed the base branch from master to release/1.2-dev December 14, 2020 17:33
@Borda Borda changed the title from "Information Retrieval (IR) metrics implementation (MAP, MRR, P@K, R@K, HR@K)" to "Information Retrieval (IR) metrics implementation (MAP, MRR, P@K, R@K, HR@K) [wip]" Jan 6, 2021
@edenlightning edenlightning modified the milestones: 1.2, 1.3 Feb 8, 2021
Base automatically changed from release/1.2-dev to master February 11, 2021 14:31
@edenlightning edenlightning removed this from the 1.3 milestone Feb 22, 2021
@Borda (Member) commented Mar 15, 2021

Hi @lucadiliello, we are very happy that you decided to improve PL and opened this PR.
Nevertheless, we have moved the metrics package out of PL into a separate repository, torchmetrics, and unfortunately there is no GH option to transfer your open PR there to finish the job. So I would like to kindly ask: if you are able to finish this PR by the end of this week, we can then transfer your merged commit to the new repo; otherwise, please open the very same PR in the new repository... 🐰

Anyway thank you very much for being with us ⚡

@Borda (Member) commented Mar 22, 2021

@lucadiliello I have transferred the last IR metrics to the new torchmetrics...
As we have already started the deprecation here in PL, it would be even more difficult to finish adding a new metric here. May I kindly ask you to reopen the very same PR in torchmetrics (the reason is to preserve you as the contributing author)? I will help you finish it there :]

So please, open a PR in TM with the very same content and refer to this PL PR 🐰
