Add date filter to background linking reranker #786

chriskamphuis · 2019-08-21T13:56:48Z

The date is saved as an additional field.
The date filter can be toggled using a command-line argument.

Without datefilter (BM25 + RM3)

target/appassembler/bin/SearchCollection -searchnewsbackground -index lucene-index.core18.pos+docvectors+rawdocs -topicreader NewsBackgroundLinking -topics ~/topics/newsir18-topics.txt -bm25 -rm3 -hits 100 -backgroundlinking.k 100 -runtag bm25_rm3 -output test/bm25-rm3.txt

will result in NDCG@5 = 0.3526

With datefilter (BM25 + RM3)

target/appassembler/bin/SearchCollection -searchnewsbackground -index lucene-index.core18.pos+docvectors+rawdocs -topicreader NewsBackgroundLinking -topics ~/topics/newsir18-topics.txt -bm25 -rm3 -hits 100 -backgroundlinking.k 100 -backgroundlinking.datefilter -runtag bm25_rm3_df -output outputs/bm25-rm3-df.txt

will result in NDCG@5 = 0.4171

chriskamphuis · 2019-08-21T13:59:15Z

The date filter removes articles published after the topic article. Such articles can be relevant, but they are non-relevant far more often.
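
As a rough illustration of the filtering rule described above, here is a minimal sketch in plain Java. It does not use Lucene or Anserini classes: `Hit`, `dropFutureHits`, and the sample dates are hypothetical names and values chosen for the example, and the date unit (epoch milliseconds) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

final class DateFilterSketch {
  /** Hypothetical stand-in for a scored hit; not an Anserini class. */
  static final class Hit {
    final String docid;
    final long publishedDate; // assumed to be epoch milliseconds
    final float score;

    Hit(String docid, long publishedDate, float score) {
      this.docid = docid;
      this.publishedDate = publishedDate;
      this.score = score;
    }
  }

  /** Keeps hits published at or before the query article, preserving rank order. */
  static List<Hit> dropFutureHits(List<Hit> ranked, long queryDocDate) {
    List<Hit> kept = new ArrayList<>();
    for (Hit hit : ranked) {
      if (hit.publishedDate <= queryDocDate) {
        kept.add(hit);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<Hit> ranked = List.of(
        new Hit("a", 100L, 3.0f),
        new Hit("b", 250L, 2.5f), // published after the query article
        new Hit("c", 200L, 1.0f));
    for (Hit hit : dropFutureHits(ranked, 200L)) {
      System.out.println(hit.docid); // prints "a" then "c"
    }
  }
}
```

The sketch treats a document published at exactly the same time as the query article as allowed, which matches a strict `date > queryDocDate` test for removal.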

lintool · 2019-08-21T14:56:33Z

@Peilin-Yang can you take a look?

Peilin-Yang · 2019-08-22T13:38:49Z

src/main/java/io/anserini/index/generator/WapoGenerator.java

@@ -86,6 +86,7 @@ public Document createDocument(WashingtonPostCollection.Document wapoDoc) {
    // This is needed to break score ties by docid.
    doc.add(new SortedDocValuesField(FIELD_ID, new BytesRef(id)));
    doc.add(new LongPoint(WapoField.PUBLISHED_DATE.name, wapoDoc.getPublishDate()));
+    doc.add(new StoredField(WapoField.PUBLISHED_DATE.name, wapoDoc.getPublishDate()));


I am not sure if StoredField is needed. Can you try removing it to see if it still works?

StoredField is needed:

An indexed long field for fast range filters. If you also need to store the value, you should add a separate StoredField instance.

according to http://lucene.apache.org/core/8_2_0/core/org/apache/lucene/document/LongPoint.html

But maybe this should also be optional, like positions in the index.

Peilin-Yang · 2019-08-22T13:41:20Z

src/main/java/io/anserini/rerank/lib/NewsBackgroundLinkingReranker.java

+
+    if(context.getSearchArgs().backgroundlinking_datefilter){
+      try{
+        Document queryDoc = reader.document(NewsBackgroundLinkingTopicReader.convertDocidToLuceneDocid(reader, queryDocId));


You can use LongPoint.newRangeQuery here.
See http://lucene.apache.org/core/8_2_0/core/org/apache/lucene/document/LongPoint.html#newRangeQuery-java.lang.String-long-long-

I will try this, but I expect that filtering in the initial retrieval step will change the results, because the query constructed by RM3 will be different.

This is actually interesting.
I think we should keep the filter the same for both the initial ranking and the relevance feedback; otherwise the results are biased, right?

If the filter models "no information from the future should be used", then the filter should be everywhere (and technically, IDF values should also depend on time...).

If it is just reflecting how judging took place, then maybe that is a different situation.

How about we make this consistent with what we did for the microblog track?
That is, only filter out the "future" documents in the initial ranking?

anserini/src/main/java/io/anserini/search/SearchCollection.java

Line 505 in 453a9d0

Query filter = LongPoint.newRangeQuery(TweetGenerator.StatusField.ID_LONG.name, 0L, t);

Ok the results are:

BM25 with datefilter in initial ranking:
NDCG@5 : 0.3735

BM25 + RM3 with datefilter in initial ranking:
NDCG@5 : 0.3433

BM25 + RM3 with datefilter in both initial ranking and reranking:
NDCG@5 : 0.3444

BM25 + RM3 with datefilter only in reranking (initial implementation):
NDCG@5 : 0.4171

So it seems that the date filter in the initial ranking helps when no reranking is applied. When reranking, however, RM3's query construction suffers from missing the documents that were filtered out.

I think we should apply the filter here the same way we did for the MB track.
@lintool wdyt?

Well, it depends on the answer to my question below about the task guidelines... do we want to "win" or model the task accurately? :)
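
The difference between the variants compared in this thread comes down to which documents the feedback step gets to read. The following is a minimal sketch in plain Java, not Anserini or Lucene code: the `Hit` class, the sample dates, and the "top-k of the current list" feedback rule are illustrative assumptions standing in for RM3's feedback set.

```java
import java.util.ArrayList;
import java.util.List;

final class FilterPlacementSketch {
  /** Hypothetical hit with a publication date; not an Anserini class. */
  static final class Hit {
    final String docid;
    final long date;

    Hit(String docid, long date) {
      this.docid = docid;
      this.date = date;
    }
  }

  /** Drops hits published after the query article, preserving rank order. */
  static List<Hit> notAfter(List<Hit> ranked, long queryDate) {
    List<Hit> kept = new ArrayList<>();
    for (Hit hit : ranked) {
      if (hit.date <= queryDate) {
        kept.add(hit);
      }
    }
    return kept;
  }

  /** Docids of the top-k hits, standing in for the set a feedback model would read. */
  static List<String> feedbackDocids(List<Hit> ranked, int k) {
    List<String> ids = new ArrayList<>();
    for (Hit hit : ranked.subList(0, Math.min(k, ranked.size()))) {
      ids.add(hit.docid);
    }
    return ids;
  }

  public static void main(String[] args) {
    long queryDate = 200L;
    List<Hit> initial = List.of(
        new Hit("future1", 300L),
        new Hit("past1", 100L),
        new Hit("future2", 400L),
        new Hit("past2", 150L));

    // Filter in the initial ranking: feedback only sees earlier documents.
    System.out.println(feedbackDocids(notAfter(initial, queryDate), 2)); // [past1, past2]

    // Filter only after reranking: feedback still sees "future" documents.
    System.out.println(feedbackDocids(initial, 2)); // [future1, past1]
  }
}
```

In the second case the final list can still be filtered afterwards, so no future document is returned, but the expanded query was built with evidence from future documents.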

chriskamphuis · 2019-08-26T13:27:59Z

More of a general comment on this PR: the big increase in performance is also likely due to the fact that the task guidelines initially stated that articles could only be relevant if they were published before the topic article. This results in a bias in the judgments, because the pool contains more articles published before the topic article.

lintool · 2019-08-26T16:48:38Z

Do the official guidelines say anything about using "future evidence"? Obviously, in the task scenario, "documents in the future" haven't been written yet... so it's not really a "bias" in judgments... it's reflecting the reality of the task?

chriskamphuis · 2019-08-26T18:40:23Z

Initially the idea was as you mentioned: it would not make sense to consider future documents relevant, because they had not yet been published when the topic article was published. However, after discussion, it was decided that future documents can be relevant:

There is time between when the article is published and when someone reads it. In the meantime, another article that is important for the context of the query article can be published. In that case, that article would be considered relevant too.

So using future evidence would make sense, since it can even contain relevant documents. But it turns out that filtering out all documents published after the query article helps effectiveness.

chriskamphuis · 2019-08-26T18:41:57Z

Re: bias in the judgments. I think that because the submitted runs did not take future documents into account, the pool of scored documents might be biased (although future docs can be relevant).

lintool · 2019-08-27T02:22:10Z

I understand now. The key is that article publication time and user reading time may not be the same.

Since @Peilin-Yang has been doing CR, I'm happy to merge after he gives 👍

Are we just making the date filter a command-line parameter?

Peilin-Yang · 2019-08-27T03:03:22Z

@chriskamphuis @lintool
I understand.
So the reason that you put the filter as the last step is just a heuristic that is similar to deduping, right?

lintool · 2019-08-27T03:07:43Z

@Peilin-Yang to be precise, it is because the qrels only include documents before the article time.
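
To see why this matters for the reported numbers, here is a hedged sketch of an NDCG@k computation in which documents missing from the qrels count as non-relevant. It assumes linear gain and a log2 rank discount, which may differ from the gain mapping used in the official evaluation; the class name, docids, and relevance grades are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

final class NdcgSketch {
  /** DCG over the first k gains, with a log2(rank + 1) discount (ranks start at 1). */
  static double dcg(List<Integer> gains, int k) {
    double sum = 0.0;
    for (int i = 0; i < Math.min(k, gains.size()); i++) {
      sum += gains.get(i) / (Math.log(i + 2) / Math.log(2));
    }
    return sum;
  }

  /** NDCG@k where any docid absent from the qrels gets gain 0. */
  static double ndcgAtK(List<String> ranking, Map<String, Integer> qrels, int k) {
    List<Integer> gains = new ArrayList<>();
    for (String docid : ranking) {
      gains.add(qrels.getOrDefault(docid, 0));
    }
    List<Integer> ideal = new ArrayList<>(qrels.values());
    ideal.sort(Comparator.reverseOrder());
    double idealDcg = dcg(ideal, k);
    return idealDcg == 0.0 ? 0.0 : dcg(gains, k) / idealDcg;
  }

  public static void main(String[] args) {
    // Hypothetical judgments: only earlier documents were judged.
    Map<String, Integer> qrels = Map.of("past1", 2, "past2", 1);

    // An unjudged "future" document at rank 1 pushes judged documents down.
    System.out.println(ndcgAtK(List.of("future1", "past1", "past2"), qrels, 5));

    // Removing it lets the judged documents move up.
    System.out.println(ndcgAtK(List.of("past1", "past2"), qrels, 5));
  }
}
```

Under these assumptions, an unjudged document can only lower the score of a ranking, which is consistent with the date filter helping most when it is applied as the last step.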

Peilin-Yang

Overall looks good to me, please fix the code format before merging

Peilin-Yang · 2019-08-27T03:18:52Z

src/main/java/io/anserini/rerank/lib/NewsBackgroundLinkingReranker.java

+    }
+
+    if(context.getSearchArgs().backgroundlinking_datefilter){
+      try{


nit: add space before {

Peilin-Yang · 2019-08-27T03:19:32Z

src/main/java/io/anserini/rerank/lib/NewsBackgroundLinkingReranker.java

+
+    if(context.getSearchArgs().backgroundlinking_datefilter){
+      try{
+        Document queryDoc = reader.document(NewsBackgroundLinkingTopicReader.convertDocidToLuceneDocid(reader, queryDocId));


Can we add a proper line break so that the line is not too long?

Peilin-Yang · 2019-08-27T03:19:50Z

src/main/java/io/anserini/rerank/lib/NewsBackgroundLinkingReranker.java

+        long queryDocDate = Long.parseLong(queryDoc.getField(PUBLISHED_DATE.name).stringValue());
+        for (int i = 0; i < docs.documents.length; i++) {
+          long date = Long.parseLong(docs.documents[i].getField(PUBLISHED_DATE.name).stringValue());
+          if(date > queryDocDate){


nit: add space before {

Add date filter to background linking reranker

b9e07e0

Date is saved as an additional field Date filter can be toggled using command line argument

lintool requested a review from Peilin-Yang August 21, 2019 14:56

chriskamphuis added 2 commits August 22, 2019 10:24

Removed unnecessary string

8c2923a

Formatting

ecfdd00

Peilin-Yang reviewed Aug 22, 2019

Peilin-Yang approved these changes Aug 27, 2019

chriskamphuis added 2 commits August 27, 2019 09:31

Formatting

6a80ddd

Formatting

481e36f

lintool merged commit c5ee9af into castorini:master Aug 29, 2019

Peilin-Yang Aug 22, 2019

chriskamphuis Aug 23, 2019

Peilin-Yang Aug 22, 2019

chriskamphuis Aug 23, 2019

Peilin-Yang Aug 24, 2019

arjenpdevries Aug 24, 2019

Peilin-Yang Aug 25, 2019

chriskamphuis Aug 26, 2019

chriskamphuis Aug 26, 2019

Peilin-Yang Aug 26, 2019

lintool Aug 26, 2019
