Team OpenWebSearch (OWS) at LongEval 2024

The repository of team OWS for LongEval 2024 at CLEF 2024.

Setup:

tira-cli download --dataset longeval-tiny-train-20240315-training

Idea 1: Features from WOWS

Feature-based Learning to Rank (Gijs)
- Use the WOWS submissions to create around 50 features
  - Extract and add features via code like the query performance prediction tutorial (https://github.com/tira-io/teaching-ir-with-shared-tasks/blob/main/tutorials/tutorial-query-performance-prediction.ipynb) or the Java data access tutorial (https://github.com/tira-io/teaching-ir-with-shared-tasks/blob/main/tutorials/tutorial-data-access-from-java.ipynb)
  - Query Features: Query Intent + Query Performance Prediction
  - Document Features: Web Page Genre + Corpus Graph + Readability
  - Query-Document Features: BM25 + MonoT5 + ColBERT
  - ToDo: Look for more interesting components
- Apply LambdaMART
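
A minimal sketch of how the three feature groups above could be merged into one feature vector per query-document pair and written in the LETOR/SVMlight-style text format that many LambdaMART implementations read. All feature names, ids, and values below are made up for illustration, and some tools may require numeric query ids.

```python
from typing import Dict, List, Tuple

# Hypothetical feature tables; names and values are illustrative only.
query_feats: Dict[str, Dict[str, float]] = {"q1": {"qpp": 0.4}}
doc_feats: Dict[str, Dict[str, float]] = {
    "d1": {"readability": 55.0},
    "d2": {"readability": 30.0},
}
pair_feats: Dict[Tuple[str, str], Dict[str, float]] = {
    ("q1", "d1"): {"bm25": 12.5, "monot5": 0.9},
    ("q1", "d2"): {"bm25": 3.0},  # monot5 missing on purpose
}
labels: Dict[Tuple[str, str], int] = {("q1", "d1"): 1, ("q1", "d2"): 0}


def feature_names(*tables: Dict) -> List[str]:
    """Collect a stable, sorted list of all feature names in the given tables."""
    names = set()
    for table in tables:
        for feats in table.values():
            names.update(feats)
    return sorted(names)


def to_letor_lines(pairs: List[Tuple[str, str]]) -> List[str]:
    """Merge query, document, and query-document features per pair.

    Missing features default to 0.0 (an assumption; imputation is a design choice).
    """
    names = feature_names(query_feats, doc_feats, pair_feats)
    lines = []
    for qid, docid in pairs:
        merged: Dict[str, float] = {}
        merged.update(query_feats.get(qid, {}))
        merged.update(doc_feats.get(docid, {}))
        merged.update(pair_feats.get((qid, docid), {}))
        cols = " ".join(
            f"{i}:{merged.get(name, 0.0):g}" for i, name in enumerate(names, start=1)
        )
        lines.append(f"{labels.get((qid, docid), 0)} qid:{qid} {cols} # {docid}")
    return lines


letor_lines = to_letor_lines([("q1", "d1"), ("q1", "d2")])
for line in letor_lines:
    print(line)
```

The feature columns are ordered alphabetically by name here, so the same ordering has to be reused when the test data is converted.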

Idea 2: Archive Lookup / Learning from History / History Repeats Itself / Exploit Overlap / Zipf's Law

I already know which topics will be submitted, and I already know "similar" documents.

Archive lookup: We look up what was good a few months ago and try to transfer this via two strategies and combinations thereof: (1) query reformulation, (2) document reformulation.
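
A minimal sketch of the lookup step, assuming queries from two time slots are matched by their normalized text. The ids, query texts, and the matching rule are illustrative assumptions, not the actual LongEval data format.

```python
from collections import defaultdict
from typing import Dict, Set


def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(query.lower().split())


def archive_lookup(
    current_queries: Dict[str, str],
    past_queries: Dict[str, str],
    past_qrels: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Map each current query id to documents that were relevant
    for the same (normalized) query text in an earlier time slot."""
    past_by_text = defaultdict(set)
    for past_qid, text in past_queries.items():
        past_by_text[normalize(text)].add(past_qid)

    known: Dict[str, Set[str]] = {}
    for qid, text in current_queries.items():
        docs: Set[str] = set()
        for past_qid in past_by_text.get(normalize(text), ()):
            docs |= past_qrels.get(past_qid, set())
        known[qid] = docs
    return known


# Hypothetical example data.
current = {"c1": "Car  Shelter", "c2": "new topic"}
past = {"p1": "car shelter", "p2": "tax return"}
past_relevant = {"p1": {"doc7", "doc9"}, "p2": {"doc3"}}

known_docs = archive_lookup(current, past, past_relevant)
overlap = sum(1 for docs in known_docs.values() if docs) / len(current)
print(known_docs, overlap)
```

Queries with an empty result set are the "new" queries for which no history is available.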

Query Reformulation (Daria):
- Idea:
  - Queries overlap over the different time slots
  - I.e. for some query, we know which documents were clicked a few months ago
  - We insert the clicked documents into the current corpus and reformulate the query with RM3 until the known relevant documents from a few months ago are at the top positions
    - Use "Explain Like I am BM25" to guide the reformulation
  - Remove the old doc ids from the ranking
  - Combine it with the Corpus Graph idea
    - What to do if a query is new?
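
The loop described above could look roughly like the following sketch. It uses a tiny pure-Python BM25 and a greedy term expansion as a simplified stand-in for RM3, so it is not the actual pipeline. The `old:` id prefix, the toy documents, and the function names are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def bm25_rank(terms: List[str], docs: Dict[str, List[str]],
              k1: float = 1.2, b: float = 0.75) -> List[str]:
    """Rank doc ids by BM25 score; ties are broken by doc id."""
    n = len(docs)
    avgdl = sum(len(tokens) for tokens in docs.values()) / n
    df = Counter(t for tokens in docs.values() for t in set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return sorted(scores, key=lambda d: (-scores[d], d))


def reformulate(query: List[str], docs: Dict[str, List[str]],
                known_relevant: Set[str], max_added: int = 5) -> List[str]:
    """Greedily add terms from the known relevant documents until
    those documents occupy the top positions (or nothing improves)."""
    def rr_sum(terms: List[str]) -> float:
        ranking = bm25_rank(terms, docs)
        return sum(1.0 / (ranking.index(d) + 1) for d in known_relevant)

    target = sum(1.0 / i for i in range(1, len(known_relevant) + 1))
    terms = list(query)
    candidates = sorted({t for d in known_relevant for t in docs[d]} - set(terms))
    for _ in range(max_added):
        current = rr_sum(terms)
        if current >= target - 1e-9 or not candidates:
            break
        best = max(candidates, key=lambda t: rr_sum(terms + [t]))
        if rr_sum(terms + [best]) <= current:
            break
        terms.append(best)
        candidates.remove(best)
    return terms


# Current corpus plus one clicked document from the archive (prefix "old:").
corpus = {
    "new1": ["cheap", "flights", "paris", "airline", "tickets"],
    "new2": ["paris", "airline", "discount", "offers"],
    "new3": ["cheap", "hotel", "deals"],
    "old:7": ["paris", "airline", "discount", "tickets"],
}
reformulated = reformulate(["cheap", "paris"], corpus, {"old:7"})
# Remove the old doc ids from the final ranking.
final_ranking = [d for d in bm25_rank(reformulated, corpus) if not d.startswith("old:")]
print(reformulated, final_ranking)
```

In this toy example, the reformulated query moves `new2`, the document most similar to the archived click, above `new3`.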
Document Reformulation/Oracle Indexing/Index Partitioning (Maik):
- Idea:
  - Reformulate all documents so that they are retrieved only for the queries for which they are relevant and not for queries for which they are not relevant
  - This is possible because queries overlap across time slots
  - Learn an optimal sequence-to-sequence translation of documents to ideal documents on the training data
  - Use this on the test data that is X months in the future and uses the same topics (information needs) but (slightly) updated documents
- What do we need:
  - Sequence-to-sequence training dataset: Document -> perfect document
  - How do we construct this?
    - document => all queries for which the document is relevant (DeepCT / SPLADE training objective)
    - Reverse bipartite graph between documents and their relevant queries
    - Include corpus graph idea?
    - Goal: nDCG above 0.8
- Technologies to look into:
  - Naive Bayes: P(query|document text) or P(partition|document text)
  - Transformer: sequence-to-sequence
  - SPLADE
  - RM3 reversed
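
A minimal sketch of how the sequence-to-sequence training pairs could be built by inverting the relevance judgments (document => all queries for which it is relevant). The qrels layout, the separator, and the choice to give documents without relevant queries an empty target are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def invert_qrels(qrels: List[Tuple[str, str, int]]) -> Dict[str, List[str]]:
    """Reverse the query -> document edges: doc id -> ids of queries it is relevant for."""
    doc_to_queries = defaultdict(list)
    for query_id, doc_id, relevance in qrels:
        if relevance > 0:
            doc_to_queries[doc_id].append(query_id)
    return dict(doc_to_queries)


def build_pairs(doc_texts: Dict[str, str], query_texts: Dict[str, str],
                qrels: List[Tuple[str, str, int]], sep: str = " ; ") -> List[dict]:
    """Create (source, target) pairs: document text -> its relevant queries."""
    doc_to_queries = invert_qrels(qrels)
    pairs = []
    for doc_id in sorted(doc_texts):
        queries = sorted({query_texts[q] for q in doc_to_queries.get(doc_id, [])})
        pairs.append({
            "doc_id": doc_id,
            "source": doc_texts[doc_id],
            # Empty target: the document should ideally match no known query.
            "target": sep.join(queries),
        })
    return pairs


# Hypothetical example data.
doc_texts = {"d1": "Paris airline discount offers", "d2": "Cheap hotel deals"}
query_texts = {"q1": "cheap flights paris", "q2": "airline discount"}
qrels = [("q1", "d1", 1), ("q2", "d1", 2), ("q1", "d2", 0)]

training_pairs = build_pairs(doc_texts, query_texts, qrels)
for pair in training_pairs:
    print(pair)
```

The same inverted mapping could also provide the class labels for the Naive Bayes variant, with each query (or partition) treated as a class.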
