Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add date filter to background linking reranker #786

Merged
merged 5 commits into from
Aug 29, 2019
Merged

Add date filter to background linking reranker #786

merged 5 commits into from
Aug 29, 2019

Conversation

chriskamphuis
Copy link
Collaborator

Date is saved as an additional field
Date filter can be toggled using command line argument

Without datefilter (BM25 + RM3)

target/appassembler/bin/SearchCollection -searchnewsbackground -index lucene-index.core18.pos+docvectors+rawdocs -topicreader NewsBackgroundLinking -topics ~/topics/newsir18-topics.txt -bm25 -rm3 -hits 100 -backgroundlinking.k 100 -runtag bm25_rm3 -output test/bm25-rm3.txt

will result in NDCG@5 = 0.3526

With datefilter (BM25 + RM3)

target/appassembler/bin/SearchCollection -searchnewsbackground -index lucene-index.core18.pos+docvectors+rawdocs -topicreader NewsBackgroundLinking -topics ~/topics/newsir18-topics.txt -bm25 -rm3 -hits 100 -backgroundlinking.k 100 -backgroundlinking.datefilter -runtag bm25_rm3_df -output outputs/bm25-rm3-df.txt

will result in NDCG@5 = 0.4171

Date is saved as an additional field
Date filter can be toggled using command line argument
@chriskamphuis
Copy link
Collaborator Author

chriskamphuis commented Aug 21, 2019

The datefilter filters out articles published after the topic article. These can be relevant, but tend to be not relevant much more often.

@lintool
Copy link
Member

lintool commented Aug 21, 2019

@Peilin-Yang can you take a look?

@@ -86,6 +86,7 @@ public Document createDocument(WashingtonPostCollection.Document wapoDoc) {
// This is needed to break score ties by docid.
doc.add(new SortedDocValuesField(FIELD_ID, new BytesRef(id)));
doc.add(new LongPoint(WapoField.PUBLISHED_DATE.name, wapoDoc.getPublishDate()));
doc.add(new StoredField(WapoField.PUBLISHED_DATE.name, wapoDoc.getPublishDate()));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if StoreField is needed. Can you try to remove it to see if it still works?

Copy link
Collaborator Author

@chriskamphuis chriskamphuis Aug 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Storefield is needed:

An indexed long field for fast range filters. If you also need to store the value, you should add a separate StoredField instance.

according to http://lucene.apache.org/core/8_2_0/core/org/apache/lucene/document/LongPoint.html

But maybe this should also be optional, like positions in the index.


if(context.getSearchArgs().backgroundlinking_datefilter){
try{
Document queryDoc = reader.document(NewsBackgroundLinkingTopicReader.convertDocidToLuceneDocid(reader, queryDocId));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will try this, but I expect if you do it in the initial retrieval step the results will be different as the RM3 constructed query will be different.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually interesting.
I think we should keep the filter same for both initial ranking and the relevance feedback otherwise the results are biased. right?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the filter models "no information from the future should be used", then the filter should be everywhere (and technically, IDF values should also depend on time...).

If it is just reflecting how judging took place, then maybe that is a different situation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about let's make this consistent with what we did for microblog track?
That is, to only filter the "future" documents in the initial ranking?

Query filter = LongPoint.newRangeQuery(TweetGenerator.StatusField.ID_LONG.name, 0L, t);

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok the results are:

BM25 with datefilter in initial ranking:
NDCG@5 : 0.3735

BM25 + RM3 with datefilter in initial ranking:
NDCG@5 : 0.3433

BM25 + RM3 with datefilter in both initial ranking and reranking:
NDCG@5 : 0.3444

BM25 + RM3 with datefilter in only reranking (initial implementation)
NDCG@5 : 0.4171

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it seems that datefilter in initial ranking is good when no reranking is applied, however when trying to rerank the query construction of RM3 does suffer from missing the documents that are filtered out.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep the filter here as the same way we did for MB track.
@lintool wdyt?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it depends on the answer to my question below about the task guidelines... do we want to "win" or model the task accurately? :)

@chriskamphuis
Copy link
Collaborator Author

More a general comment on this PR: The reason for the big increase in performance is also likely due to the fact that initially the task guidelines stated that articles could only relevant if they were published before the topic article; resulting in a bias in the judgments because the pooling consists of more articles published before the topic article.

@lintool
Copy link
Member

lintool commented Aug 26, 2019

Do the official guidelines say anything about using "future evidence"? Obviously, in the task scenario, "documents in the future" haven't been written yet... so it's not really a "bias" in judgments... it's reflecting the reality of the task?

@chriskamphuis
Copy link
Collaborator Author

So initially the idea was like you mentioned, it would not make sense to consider future documents relevant as they are not published yet at the moment the topic article is published. However, after discussion, it was decided that future documents can be relevant:

There is time between when the article is published and when someone is reading it. In the meantime another article that is important for the context of the query article can be published. In this case that article would be considered relevant to.

So using future evidence would make sense, it can even contain relevant documents. But it turns out that filtering out all documents that are published after the query article helps the effectiveness.

@chriskamphuis
Copy link
Collaborator Author

Re: bias in the judgement. I think because runs were submitted not taking future documents into account the pool of scored documents might be biased. (Although future docs can be relevant)

@lintool
Copy link
Member

lintool commented Aug 27, 2019

I understand now. The key is that article publication time and user reading time may not be the same.

Since @Peilin-Yang has been doing CR, I'm happy to merge after he gives 👍

Are we just making the date filter command-line parameters?

@Peilin-Yang
Copy link
Collaborator

@chriskamphuis @lintool
I understand.
So the reason that you put the filter as the last step is just a heuristic that is similar to deduping, right?

@lintool
Copy link
Member

lintool commented Aug 27, 2019

@Peilin-Yang to be precise, it is because the qrels only include documents before the article time.

Copy link
Collaborator

@Peilin-Yang Peilin-Yang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me, please fix the code format before merging

}

if(context.getSearchArgs().backgroundlinking_datefilter){
try{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add space before {


if(context.getSearchArgs().backgroundlinking_datefilter){
try{
Document queryDoc = reader.document(NewsBackgroundLinkingTopicReader.convertDocidToLuceneDocid(reader, queryDocId));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make proper line return so that the line is not too long?

long queryDocDate = Long.parseLong(queryDoc.getField(PUBLISHED_DATE.name).stringValue());
for (int i = 0; i < docs.documents.length; i++) {
long date = Long.parseLong(docs.documents[i].getField(PUBLISHED_DATE.name).stringValue());
if(date > queryDocDate){
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add space before {

@lintool lintool merged commit c5ee9af into castorini:master Aug 29, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants