Add date filter to background linking reranker #786
Date is saved as an additional field
Date filter can be toggled using command line argument
The date filter removes articles published after the topic article. These can be relevant, but far more often they are not.
@Peilin-Yang can you take a look?
@@ -86,6 +86,7 @@ public Document createDocument(WashingtonPostCollection.Document wapoDoc) {
// This is needed to break score ties by docid.
doc.add(new SortedDocValuesField(FIELD_ID, new BytesRef(id)));
doc.add(new LongPoint(WapoField.PUBLISHED_DATE.name, wapoDoc.getPublishDate()));
doc.add(new StoredField(WapoField.PUBLISHED_DATE.name, wapoDoc.getPublishDate()));
I am not sure if StoredField is needed. Can you try removing it to see if it still works?
StoredField is needed:
"An indexed long field for fast range filters. If you also need to store the value, you should add a separate StoredField instance."
according to http://lucene.apache.org/core/8_2_0/core/org/apache/lucene/document/LongPoint.html
But maybe this should also be optional, like positions in the index.
if(context.getSearchArgs().backgroundlinking_datefilter){
try{
Document queryDoc = reader.document(NewsBackgroundLinkingTopicReader.convertDocidToLuceneDocid(reader, queryDocId));
You can use LongPoint.newRangeQuery here.
See http://lucene.apache.org/core/8_2_0/core/org/apache/lucene/document/LongPoint.html#newRangeQuery-java.lang.String-long-long-
I will try this, but I expect that if the filter is applied in the initial retrieval step, the results will differ, because the query that RM3 constructs will be different.
This is actually interesting.
I think we should keep the filter the same for both the initial ranking and the relevance feedback; otherwise the results are biased, right?
If the filter models "no information from the future should be used", then the filter should be everywhere (and technically, IDF values should also depend on time...).
If it is just reflecting how judging took place, then maybe that is a different situation.
How about making this consistent with what we did for the microblog track?
That is, only filter the "future" documents in the initial ranking?
Query filter = LongPoint.newRangeQuery(TweetGenerator.StatusField.ID_LONG.name, 0L, t);
Ok the results are:
BM25 with datefilter in initial ranking:
NDCG@5 : 0.3735
BM25 + RM3 with datefilter in initial ranking:
NDCG@5 : 0.3433
BM25 + RM3 with datefilter in both initial ranking and reranking:
NDCG@5 : 0.3444
BM25 + RM3 with datefilter in only reranking (initial implementation):
NDCG@5 : 0.4171
So it seems that the date filter in the initial ranking is good when no reranking is applied. With reranking, however, RM3's query construction suffers from missing the documents that were filtered out.
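To make the placement question concrete, here is a toy sketch of which documents a feedback model would read in each setup. It does not implement BM25 or RM3 and is not the PR's code: `Hit`, `dropFuture`, `feedbackDocids`, and the sample dates are hypothetical, chosen only to show that filtering before feedback changes the feedback set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FilterPlacementSketch {
    // Hypothetical stand-in for a ranked hit; not an Anserini or Lucene class.
    static final class Hit {
        final String docid;
        final long publishedDate; // assumed to be a long timestamp

        Hit(String docid, long publishedDate) {
            this.docid = docid;
            this.publishedDate = publishedDate;
        }
    }

    // Removes hits published after the query article, preserving rank order.
    static List<Hit> dropFuture(List<Hit> ranked, long queryDocDate) {
        List<Hit> kept = new ArrayList<>();
        for (Hit hit : ranked) {
            if (hit.publishedDate <= queryDocDate) {
                kept.add(hit);
            }
        }
        return kept;
    }

    // Docids of the top-k hits, i.e. the documents a feedback model would read.
    static List<String> feedbackDocids(List<Hit> ranked, int k) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < Math.min(k, ranked.size()); i++) {
            ids.add(ranked.get(i).docid);
        }
        return ids;
    }

    public static void main(String[] args) {
        long queryDocDate = 100L;
        List<Hit> initial = Arrays.asList(
            new Hit("a", 90L), new Hit("b", 120L), new Hit("c", 95L),
            new Hit("d", 130L), new Hit("e", 80L));

        // Filter in the initial ranking: feedback never sees "future" documents.
        System.out.println(feedbackDocids(dropFuture(initial, queryDocDate), 2)); // [a, c]

        // Filter only in reranking: feedback still reads the unfiltered top-k.
        System.out.println(feedbackDocids(initial, 2)); // [a, b]
    }
}
```

In the second setup the "future" document `b` still contributes to the expanded query even though it is later removed from the final ranking, which is the difference the numbers above are measuring.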
I think we should apply the filter here the same way we did for the MB track.
@lintool wdyt?
Well, it depends on the answer to my question below about the task guidelines... do we want to "win" or model the task accurately? :)
More of a general comment on this PR: the big increase in performance is likely also due to the fact that the task guidelines initially stated that articles could only be relevant if they were published before the topic article. This results in a bias in the judgments, because the pool consists of more articles published before the topic article.
Do the official guidelines say anything about using "future evidence"? Obviously, in the task scenario, "documents in the future" haven't been written yet... so it's not really a "bias" in judgments... it's reflecting the reality of the task?
So initially the idea was as you mentioned: it would not make sense to consider future documents relevant, as they are not yet published at the moment the topic article is published. However, after discussion, it was decided that future documents can be relevant. There is time between when the article is published and when someone is reading it. In the meantime, another article that is important for the context of the query article can be published. In this case that article would be considered relevant too. So using future evidence would make sense; it can even contain relevant documents. But it turns out that filtering out all documents published after the query article helps effectiveness.
Re: bias in the judgments. I think that because runs were submitted without taking future documents into account, the pool of judged documents might be biased. (Although future docs can be relevant.)
I understand now. The key is that article publication time and user reading time may not be the same. Since @Peilin-Yang has been doing CR, I'm happy to merge after he gives 👍 Are we just making the date filter a command-line parameter?
@chriskamphuis @lintool
@Peilin-Yang to be precise, it is because the qrels only include documents before the article time. |
Overall looks good to me. Please fix the code formatting before merging.
}
if(context.getSearchArgs().backgroundlinking_datefilter){
try{
nit: add space before {
if(context.getSearchArgs().backgroundlinking_datefilter){
try{
Document queryDoc = reader.document(NewsBackgroundLinkingTopicReader.convertDocidToLuceneDocid(reader, queryDocId));
Can we add a line break here so that the line is not too long?
long queryDocDate = Long.parseLong(queryDoc.getField(PUBLISHED_DATE.name).stringValue());
for (int i = 0; i < docs.documents.length; i++) {
long date = Long.parseLong(docs.documents[i].getField(PUBLISHED_DATE.name).stringValue());
if(date > queryDocDate){
nit: add space before {
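For reference, the reranking-side filter under review can be sketched without Lucene. This is a minimal illustration under stated assumptions, not the PR's code: `ScoredHit`, `dropFutureHits`, and the sample values are hypothetical, and the publication date is assumed to be a long timestamp as in the diff.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class DateFilterSketch {
    // Hypothetical stand-in for a scored search hit; not an Anserini or Lucene class.
    static final class ScoredHit {
        final String docid;
        final float score;
        final long publishedDate; // assumed to be a long timestamp

        ScoredHit(String docid, float score, long publishedDate) {
            this.docid = docid;
            this.score = score;
            this.publishedDate = publishedDate;
        }
    }

    // Keeps hits published at or before the query article, preserving rank order.
    static List<ScoredHit> dropFutureHits(List<ScoredHit> ranked, long queryDocDate) {
        List<ScoredHit> kept = new ArrayList<>();
        for (ScoredHit hit : ranked) {
            // Same comparison as the diff: a hit with date > queryDocDate is removed.
            if (hit.publishedDate <= queryDocDate) {
                kept.add(hit);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<ScoredHit> ranked = Arrays.asList(
            new ScoredHit("doc1", 9.1f, 90L),
            new ScoredHit("doc2", 8.4f, 120L),  // published after the query article
            new ScoredHit("doc3", 7.7f, 100L)); // same timestamp as the query article

        for (ScoredHit hit : dropFutureHits(ranked, 100L)) {
            System.out.println(hit.docid + " " + hit.score);
        }
    }
}
```

Note that a hit with exactly the query article's timestamp is kept, since the condition in the diff only removes strictly later dates.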
Date is saved as an additional field
Date filter can be toggled using command line argument
Without datefilter (BM25 + RM3): NDCG@5 = 0.3526
With datefilter (BM25 + RM3): NDCG@5 = 0.4171