When do Generative Query and Document Expansions Fail? A Comprehensive Study Across Methods, Retrievers, and Datasets

orionw/LM-expansions

License: MIT

Official repository for the paper "When do Generative Query and Document Expansions Fail? A Comprehensive Study Across Methods, Retrievers, and Datasets". It includes code to reproduce the results and links to the data generated by the models.

Overview

This project presents a comprehensive study on generative query and document expansions across various methods, retrievers, and datasets. It aims to identify when these expansions fail and provide insights into improving information retrieval systems.

Data

The generations from the models can be found at orionweller/llm-based-expansions-generations, organized by dataset and expansion type.
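
To see what a downloaded copy of the generations contains, a small helper along these lines may be handy. It is only a sketch: it assumes a local copy laid out as `<root>/<dataset>/<file>.jsonl`, and the function name `group_generations_by_dataset` is hypothetical, not part of this repository.

```python
from collections import defaultdict
from pathlib import Path


def group_generations_by_dataset(root: str) -> dict[str, list[str]]:
    """Map each dataset folder under `root` to the sorted .jsonl files it holds.

    Assumes a <root>/<dataset>/<file>.jsonl layout; other files are ignored.
    """
    grouped: dict[str, list[str]] = defaultdict(list)
    for path in sorted(Path(root).glob("*/*.jsonl")):
        grouped[path.parent.name].append(path.name)
    return dict(grouped)


if __name__ == "__main__":
    # Hypothetical local folder name; adjust to wherever the data was downloaded.
    for dataset, files in group_generations_by_dataset(
        "llm-based-expansions-generations"
    ).items():
        print(dataset, files)
```

If the folder does not exist, the helper simply returns an empty dictionary.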

Requirements

  • Python 3.10
  • conda
  • OpenAI API key (for using OpenAI models)
  • Together.ai or Anthropic API keys (if using their services)
  • GPU (if using Llama for generation)
  • pyserini (for BM25 results reproduction)

Setup

  1. Clone the repository:

    git clone https://github.com/orionw/LM-expansions.git
    cd LM-expansions
    
  2. Install the correct Python environment:

    conda env create --file=environment.yaml -y && conda activate expansions
    
  3. Download the local data:

    git clone https://huggingface.co/datasets/orionweller/llm-based-expansions-eval-datasets
    

    This dataset contains data not otherwise available on Hugging Face, such as scifact-refute, along with other datasets converted to a common format. To reproduce the creation of scifact-refute, see scripts/make_scifact_refute.py.

Reproduce

Reproduce Expansions Data

  1. Set up your environment variables (e.g., OPENAI_API_KEY) if using OpenAI models.

  2. Create or modify a prompt config (examples are in prompt_configs/*), then pass it to the generation script along with the dataset name. For instance:

    bash generate_expansions.sh scifact_refute prompt_configs/chatgpt_doc2query.jsonl
    
  3. Adjust parameters as needed:

    • num_examples: maximum number of instances to predict
    • temperature: controls the randomness of predictions

    Note: If using Together.ai or Anthropic models, set their API keys in your environment as well. Llama generation requires a GPU.
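
Before launching a generation run, it can help to confirm that the API key is set and to look at what a prompt config contains. The sketch below is illustrative only: `missing_env_vars` and `load_jsonl` are hypothetical helpers, and the only assumption about the config is that it holds one JSON object per line, as the `.jsonl` extension suggests. It does not assume any particular field names.

```python
import json
import os
from pathlib import Path


def missing_env_vars(required: list[str]) -> list[str]:
    """Return the names in `required` that are unset or empty."""
    return [name for name in required if not os.environ.get(name)]


def load_jsonl(path: str) -> list[dict]:
    """Parse one JSON object per non-blank line of a JSONL file."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    print("missing environment variables:", missing_env_vars(["OPENAI_API_KEY"]))
    config_path = "prompt_configs/chatgpt_doc2query.jsonl"
    if Path(config_path).exists():
        for record in load_jsonl(config_path):
            # Print only the keys, since the config schema is not assumed here.
            print(sorted(record))
```

Printing only the keys keeps the sketch independent of the actual config schema, which is defined by the files in prompt_configs/.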

Reproduce Model Results Using Expansions

  1. Run the model using the following command structure:

    bash rerank.sh <dataset name> <name of run> <shard id> <num shards> <query expansion path or "none"> <"none" if not using query expansions, otherwise "replace" or "append" the query with the expansion> <document expansion path or "none"> <"none" if not using document expansions, otherwise "replace" or "append" the document with the expansion> <model name> <number of queries to run> <number of docs to run>
    

    Example:

    bash rerank.sh "scifact_refute" "testing" 0 1 "none" "none" "llm-based-expansions-generations/scifact_refute/expansion_hyde_chatgpt64.jsonl" "replace" "contriever_msmarco" 10 100
    
  2. Results will be written to results/<dataset name>/<name of run>/<dataset name>-<name of run>-run.txt.

  3. Evaluate the results:

    bash evaluate.sh scifact_refute testing
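
Because rerank.sh takes eleven positional arguments, naming them can reduce ordering mistakes. The following is a minimal sketch, not part of the repository: the function and parameter names are my own labels, the argument order follows the command structure above, and the output path follows the pattern given in step 2.

```python
import shlex

VALID_MODES = {"none", "replace", "append"}


def build_rerank_command(
    dataset: str,
    run_name: str,
    model_name: str,
    num_queries: int,
    num_docs: int,
    *,
    shard_id: int = 0,
    num_shards: int = 1,
    query_expansion_path: str = "none",
    query_expansion_mode: str = "none",
    doc_expansion_path: str = "none",
    doc_expansion_mode: str = "none",
) -> list[str]:
    """Assemble the rerank.sh argument list in its positional order."""
    for mode in (query_expansion_mode, doc_expansion_mode):
        if mode not in VALID_MODES:
            raise ValueError(f"mode must be one of {sorted(VALID_MODES)}, got {mode!r}")
    return [
        "bash", "rerank.sh", dataset, run_name,
        str(shard_id), str(num_shards),
        query_expansion_path, query_expansion_mode,
        doc_expansion_path, doc_expansion_mode,
        model_name, str(num_queries), str(num_docs),
    ]


def run_file_path(dataset: str, run_name: str) -> str:
    """Path where the run file is expected, following the pattern in step 2."""
    return f"results/{dataset}/{run_name}/{dataset}-{run_name}-run.txt"


if __name__ == "__main__":
    command = build_rerank_command(
        "scifact_refute", "testing", "contriever_msmarco", 10, 100,
        doc_expansion_path=(
            "llm-based-expansions-generations/scifact_refute/"
            "expansion_hyde_chatgpt64.jsonl"
        ),
        doc_expansion_mode="replace",
    )
    print(shlex.join(command))  # pass `command` to subprocess.run to execute it
    print(run_file_path("scifact_refute", "testing"))
```

The values in the usage section mirror the example command above; the sketch only builds the command and leaves execution to the caller.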
    

To reproduce the top 1000 BM25 results:

  1. Install pyserini following their installation docs.

  2. Run the BM25 retrieval:

    bash make_bm25_run.sh <your folder> <your dataset name> <document id field> <document text fields> <query id field> <query text fields>
    

    Example:

    bash make_bm25_run.sh bm25 scifact_refute doc_id "title,text" query_id text
    

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citing

If you found the code, data or paper useful, please cite:

@inproceedings{weller-etal-2024-generative,
    title = "When do Generative Query and Document Expansions Fail? A Comprehensive Study Across Methods, Retrievers, and Datasets",
    author = "Weller, Orion  and
      Lo, Kyle  and
      Wadden, David  and
      Lawrie, Dawn  and
      Van Durme, Benjamin  and
      Cohan, Arman  and
      Soldaini, Luca",
    booktitle = "Findings of the Association for Computational Linguistics: EACL 2024",
    month = mar,
    year = "2024",
    address = "St. Julian{'}s, Malta",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-eacl.134",
    pages = "1987--2003",
}

This project builds on many others (see the paper for a full list of references), including code from TART and InPars. Please check them out!
