Information Retrieval (IR) metrics implementation (MAP, MRR, P@K, R@K, HR@K) [wip] #4991
Conversation
Hello @lucadiliello! Thanks for updating this PR.
Comment last updated at 2020-12-13 19:28:27 UTC
P@K and R@K are already WIP in #4837 and its follow-up PRs.
@rohitgr7 if you look at the wiki page for precision and recall (https://en.m.wikipedia.org/wiki/Precision_and_recall), it actually has two definitions: one for classification and one for information retrieval. It therefore makes sense that we may need different implementations for different use cases. Maybe @lucadiliello can give an explanation of why we need special precision / recall for information retrieval?
Precision and recall in the Information Retrieval context are different. In IR you usually have a query and a set of candidate documents, each of which receives a score from the model. After sorting the documents by score, you compute precision and recall over the top-k results. This is quite different from the usual precision and recall, especially because you have to order predictions about a set of document scores and you also have to use many predictions together to obtain a single metric value. More info here. I do not see any existing metric that allows me to do so.
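For concreteness, here is a minimal sketch of how P@K and R@K are typically computed for a single query. This is illustrative only, not this PR's implementation; the names `scores` (predicted score per candidate document) and `target` (binary relevance labels) are assumptions made for the example.

```python
# Sketch only: P@K and R@K for a single query, given one score and one
# relevance label per candidate document.
import torch

def precision_at_k(scores: torch.Tensor, target: torch.Tensor, k: int) -> torch.Tensor:
    # Sort documents by predicted score and keep the relevance of the top-k.
    top_k = target[scores.argsort(descending=True)][:k]
    # Fraction of the top-k documents that are relevant.
    return top_k.float().mean()

def recall_at_k(scores: torch.Tensor, target: torch.Tensor, k: int) -> torch.Tensor:
    top_k = target[scores.argsort(descending=True)][:k]
    # Fraction of all relevant documents that appear in the top-k.
    return top_k.float().sum() / target.float().sum()

scores = torch.tensor([0.9, 0.1, 0.4, 0.7])   # one query, four candidate documents
target = torch.tensor([0, 0, 1, 1])           # 1 = relevant, 0 = not relevant
print(precision_at_k(scores, target, k=3))    # tensor(0.6667)
print(recall_at_k(scores, target, k=3))       # tensor(1.)
```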
@lucadiliello Would you be able to provide links to some repositories/datasets/examples where evaluation is done with a varying (unpredictable) number of documents per query?
Also make sure all the arguments are documented for every function/class.
Overall very nice. What reference implementation can we use to test against?
Also remember to update docs/source/metrics.rst with the new metrics.
I can mention:
and others. The problem arises from the fact that in real-world applications you have some engine that, given a query, extracts a variable number of documents with some heuristic. After that, an NLP model is used to find the best document among those proposed.
@lucadiliello great work! Mind splitting it into a few smaller PRs, as we did with @tadejsv?
Hi @lucadiliello, we are very happy that you decided to improve PL and open this PR. Anyway, thank you very much for being with us ⚡
@lucadiliello I have transferred the last IR metrics to the new TorchMetrics package. So pls, open a PR in TM with the very same content and refer to this PL PR 🐰
What does this PR do?
Integrate 5 new metrics for information retrieval into the metrics package. This PR adds 5 classes in metrics.retrieval and 5 functions in metrics.functional.
The added metrics comprise Mean Average Precision (MAP), Mean Reciprocal Rank (MRR), Precision at K (P@K), Recall at K (R@K), and Hit Rate at K (HR@K).
The discussion about this PR started on the metrics channel on slack.
I decided to open a single PR since all the metrics inherit from a RetrievalMetric class that contains most of the logic. Since the structure of the metrics should be discussed further, I did not implement tests and I did not update the docs. I will do that in future commits if the proposed solution is accepted.
The main problem arises from the fact that in IR, for each query a set of documents is evaluated and a score is assigned to each of them. The performance therefore cannot be computed on a single "row" of the test set, but must be computed over each query's results separately.
As an example, suppose you want to compute the P@1 of some retrieval system. The precision @ k is the fraction of relevant documents among the k documents that received the highest scores.
Then, to group predictions about the same query together, I used an additional indexes tensor. For example, if the first 3 entries of indexes are 0 and the remaining ones are 1, the first 3 predictions refer to query 0 and the others to query 1. Unfortunately, not all queries are compared with the same number of documents, so I used the indexes tensor to group predictions about the same queries, as sketched below.
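To illustrate the grouping mechanism, here is a hedged sketch of a base class that accumulates predictions and groups them by an indexes tensor at compute time. Class names, signatures, and tensor values below are assumptions made for illustration; they are not the actual RetrievalMetric code from this PR.

```python
# Sketch only: per-query grouping via an `indexes` tensor. Not the PR's API.
from abc import ABC, abstractmethod
import torch

class RetrievalMetricSketch(ABC):
    def __init__(self) -> None:
        self.indexes, self.preds, self.target = [], [], []

    def update(self, indexes: torch.Tensor, preds: torch.Tensor, target: torch.Tensor) -> None:
        # Accumulate raw predictions; grouping is deferred to `compute`.
        self.indexes.append(indexes)
        self.preds.append(preds)
        self.target.append(target)

    def compute(self) -> torch.Tensor:
        indexes = torch.cat(self.indexes)
        preds = torch.cat(self.preds)
        target = torch.cat(self.target)
        # Evaluate the metric on each query's documents separately, then average.
        per_query = [
            self._metric(preds[indexes == q], target[indexes == q])
            for q in indexes.unique()
        ]
        return torch.stack(per_query).mean()

    @abstractmethod
    def _metric(self, preds: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        """Compute the metric over the documents of a single query."""

class PrecisionAt1(RetrievalMetricSketch):
    def _metric(self, preds: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # Is the highest-scored document of this query relevant?
        return target[preds.argmax()].float()

# The first 3 predictions belong to query 0 and the last 2 to query 1; queries
# may have a different number of candidate documents.
metric = PrecisionAt1()
metric.update(
    indexes=torch.tensor([0, 0, 0, 1, 1]),
    preds=torch.tensor([0.2, 0.9, 0.5, 0.7, 0.3]),
    target=torch.tensor([0, 1, 0, 0, 1]),
)
print(metric.compute())  # tensor(0.5000)
```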
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines.
Did you have fun?
Make sure you had fun coding 🙃