Add recall and NDCG operations in msmarco-v2-vector #610

Merged
jimczi merged 27 commits into elastic:master from jimczi:jim/msmarco-v2-vector
Jun 10, 2024
Conversation

@jimczi
Contributor

@jimczi jimczi commented May 16, 2024

This change adds an operation called knn-recall that computes the following metrics:

  • Recall
  • NDCG
  • Avg number of nodes visited during search

The new queries-recall.json file contains all the queries (76 in total) from the test set, along with their embeddings and the top 1000 ids computed by brute force over the entire corpus.
For the relevance metrics, the qrels.tsv file contains annotations for all the queries listed in queries-recall.json. This file is generated from the original training data available at ir_datasets/msmarco_passage_v2.
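For reference, recall@k and NDCG@k can be computed from a ranked list of ids and the qrels annotations along these lines; this is a minimal sketch, not the track's actual implementation, and the function and variable names are illustrative:

```python
import math

def recall_at_k(retrieved, true_neighbors, k):
    """Fraction of the true top-k neighbors found in the first k results."""
    if not true_neighbors:
        return 0.0
    hits = len(set(retrieved[:k]) & set(true_neighbors[:k]))
    return hits / min(k, len(true_neighbors))

def ndcg_at_k(retrieved, qrels, k):
    """Normalized discounted cumulative gain over graded relevance judgments.

    qrels maps doc id -> relevance grade; unjudged docs count as 0.
    """
    dcg = sum(qrels.get(doc, 0) / math.log2(rank + 2)
              for rank, doc in enumerate(retrieved[:k]))
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```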

jimczi added 12 commits May 16, 2024 14:04
This change adds an operation called knn-recall that computes the following metrics:
  * Recall
  * NDCG
  * Avg number of nodes visited during search

Given the size of the corpus, the true top N values used for recall operations have been approximated offline for each query as follows:
```
{
    "knn": {
        "field": "emb",
        "query_vector": query['emb'],
        "k": 10000,
        "num_candidates": 10000
    },
    "rescore": {
        "window_size": 10000,
        "query": {
            "query_weight": 0,
            "rescore_query": {
                "script_score": {
                    "query": {
                        "match_all": {}
                    },
                    "script": {
                        "source": "double value = dotProduct(params.query_vector, 'emb'); return sigmoid(1, Math.E, -value);",
                        "params": {
                            "query_vector": vec
                        }
                    }
                }
            }
        }
    }
}
```
This means that the computed recall is measured against the system's best possible approximate nearest-neighbor run rather than the exact top N.
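In Painless, `sigmoid(value, k, a)` evaluates `value^a / (k^a + value^a)`, so `sigmoid(1, Math.E, -value)` reduces to the logistic function `1 / (1 + e^(-value))`, which maps the raw dot product into (0, 1) and keeps the rescore scores positive. A quick Python sketch of that equivalence, assuming the Painless definition above:

```python
import math

def painless_sigmoid(value, k, a):
    # Painless scoring helper: sigmoid(value, k, a) = value^a / (k^a + value^a)
    return value ** a / (k ** a + value ** a)

def rescore_score(dot_product):
    # sigmoid(1, Math.E, -value) == 1 / (1 + e^(-value)), the logistic function
    return painless_sigmoid(1.0, math.e, -dot_product)
```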

For the relevance metrics, the `qrels.tsv` file contains annotations for all the queries listed in `queries.json`. This file is generated from the original training data available at [ir_datasets/msmarco_passage_v2](https://ir-datasets.com/msmarco-passage-v2.html#msmarco-passage-v2/train).
@jimczi jimczi requested a review from afoucret May 17, 2024 12:29
@jimczi jimczi requested a review from 1stvamp May 17, 2024 13:14
"dynamic": false,
"_source": {
"enabled": false,
"mode": "synthetic"
Should this be a param?

Contributor

@afoucret afoucret left a comment


A few comments, but nothing that would prevent you from merging the PR.

```
for query in dataset.queries_iter():
    emb = await retrieve_embed_for_query(co, query[1])
    resp = await es.search(
        index="msmarco-v2", query=get_brute_force_query(emb), size=1000, _source=["_none_"], fields=["docid"]
    )
```
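The docid fields from each response could then be folded into the queries-recall.json entries; a hypothetical sketch (the field names are illustrative, not necessarily the track's actual schema):

```python
def to_recall_entry(query_id, query_text, embedding, response):
    """Build one queries-recall.json entry from an Elasticsearch response body (a dict)."""
    # Each hit carries the requested "docid" field as a single-element list
    ids = [hit["fields"]["docid"][0] for hit in response["hits"]["hits"]]
    return {"id": query_id, "text": query_text, "emb": embedding, "ids": ids}
```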

Maybe a param?

@jimczi jimczi merged commit b6f3535 into elastic:master Jun 10, 2024
@jimczi jimczi deleted the jim/msmarco-v2-vector branch June 10, 2024 13:32
@gareth-ellis
Member

@jimczi should this be backported to 8.15? I tried to backport #708, due to it adding about 50 minutes to IT tests, but it seems that this PR was never backported to 8.15, so the changes are only in master. (By default, Rally chooses the 8.15 branch when benchmarking against 8.x where the version being tested is 8.15 or later; serverless always runs from master.)

@jimczi
Contributor Author

jimczi commented Dec 4, 2024

Sorry for the delay here @gareth-ellis.
Are you trying to add this challenge somewhere? I can do the backport, but I'm not sure that's what you're suggesting here.

gareth-ellis pushed a commit to gareth-ellis/rally-tracks that referenced this pull request Dec 6, 2024
(cherry picked from commit b6f3535)
gareth-ellis added a commit that referenced this pull request Dec 16, 2024
* Add recall and NDCG operations in msmarco-v2-vector (#610)

(cherry picked from commit b6f3535)

* Exclude msmarco from IT tests (#708)

---------

Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>
