Source code for the paper *Optimizing Keyphrase Ranking for Relevance and Diversity Using Submodular Function Optimization (SFO)*, accepted at KDBC 2024.
We have used the following datasets for this project:
- Inspec Dataset: Link to Inspec Dataset
- NUS Dataset: Link to NUS Dataset
- SemEval Dataset: Link to SemEval Dataset
We use MiniLM for sentence embeddings.
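As a quick illustration (assuming the `sentence-transformers` implementation of MiniLM; the checkpoint name `all-MiniLM-L6-v2` is an assumption and may differ from the one used in the paper), embeddings for candidate keyphrases can be obtained like this:

```python
# Minimal sketch: embedding keyphrase candidates with MiniLM via
# sentence-transformers. The checkpoint name is an assumption; the
# paper may use a different MiniLM variant.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

candidates = ["submodular optimization", "keyphrase extraction", "greedy selection"]
embeddings = model.encode(candidates)  # numpy array, one row per phrase
print(embeddings.shape)
```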
To run this project locally, follow these steps:

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/KPRanking-with-SFO.git
   cd KPRanking-with-SFO-main
   ```

2. Create a virtual environment and activate it:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install the dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Download the datasets (or let them be fetched automatically via the HuggingFace `datasets` library).
If you want to measure the following metrics for your own dataset:
- GKP: Average Number of Gold Keyphrases per document (title, abstract, full text)
- KPL: Average Keyphrase Length
- DL: Average Document Length (in words)
You can use the script `DatasetMetrics.py` provided in this repository. To run it:

```bash
python DatasetMetrics.py --dataset <dataset_name>
```
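For a rough sense of what the script computes, the sketch below derives the three statistics from a list of records. It is illustrative only: the record layout (`document` and `keyphrases` fields) and counting lengths in words are assumptions, and `DatasetMetrics.py` remains the reference implementation.

```python
# Illustrative computation of GKP, KPL, and DL over a toy corpus.
# Record layout is assumed: {"document": str, "keyphrases": [str, ...]}.

def dataset_metrics(records):
    n_docs = len(records)
    n_keyphrases = sum(len(r["keyphrases"]) for r in records)
    gkp = n_keyphrases / n_docs  # avg gold keyphrases per document
    kpl = sum(  # avg keyphrase length, counted in words here
        len(kp.split()) for r in records for kp in r["keyphrases"]
    ) / n_keyphrases
    dl = sum(len(r["document"].split()) for r in records) / n_docs
    return {"GKP": gkp, "KPL": kpl, "DL": dl}

sample = [{"document": "a toy document about graph mining",
           "keyphrases": ["toy document", "graph mining"]}]
print(dataset_metrics(sample))  # {'GKP': 2.0, 'KPL': 2.0, 'DL': 6.0}
```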
To execute the main script, simply run:

```bash
python sfoOnRealDataset.py
```
The following pseudocode outlines our Submodular Function Optimization (SFO) approach for balancing relevance and diversity in keyphrase selection; an illustrative Python sketch follows the notation list below.
Here:
- 𝑉 is the set of candidate keyphrases.
- 𝑁 is the desired number of keyphrases.
- 𝛼 is a parameter that adjusts the balance between relevance and diversity.
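As a concrete companion to the pseudocode, here is a minimal sketch of one common instantiation of such a greedy relevance–diversity trade-off: an MMR-style objective over cosine similarities, where each step maximizes 𝛼 · relevance minus (1 − 𝛼) · redundancy. The scoring function is an assumption for illustration, not necessarily the paper's exact submodular objective; `cand_embs` plays the role of 𝑉, `N` of 𝑁, and `alpha` of 𝛼.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def greedy_select(cand_embs, doc_emb, N, alpha=0.5):
    """Greedily select N keyphrases from the candidate set V.

    Illustrative objective only: alpha * relevance(candidate, document)
    minus (1 - alpha) * similarity to the closest already-selected
    candidate (an MMR-style trade-off). The paper's submodular
    objective may be defined differently.
    """
    relevance = [cosine(e, doc_emb) for e in cand_embs]
    selected, remaining = [], list(range(len(cand_embs)))
    while remaining and len(selected) < N:
        def marginal_gain(i):
            redundancy = max(
                (cosine(cand_embs[i], cand_embs[j]) for j in selected),
                default=0.0,
            )
            return alpha * relevance[i] - (1 - alpha) * redundancy
        best = max(remaining, key=marginal_gain)
        selected.append(best)
        remaining.remove(best)
    return selected  # indices into the candidate list V
```

Given MiniLM embeddings from the earlier snippet, `greedy_select(model.encode(candidates), model.encode(doc_text), N=5, alpha=0.5)` would return the indices of the chosen keyphrases.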
Our model was evaluated on three datasets, comparing F1-Score@5, Intra-List Distance (ILD), Subtopic Recall (SR), and average runtime against three baselines: EmbedRank++, SIFRank, and DPP. The performance of our method in each metric is summarized below.
| Dataset | Method | F1-Score@5 | ILD | SR | Runtime per Doc (s) |
|---|---|---|---|---|---|
| Inspec | SFO | 19.29 | 0.8277 | 0.8144 | 0.3656 |
| | EmbedRank++ | 61.85 | 0.5633 | 0.7326 | 0.1882 |
| | SIFRank | 65.00 | 0.3464 | 0.6807 | 0.6443 |
| | DPP | 62.43 | 0.2128 | 0.3734 | 1.2433 |
| NUS | SFO | 20.98 | 0.8619 | 0.7867 | 2.3251 |
| | EmbedRank++ | 74.18 | 0.7634 | 0.3836 | 7.1397 |
| | SIFRank | 66.53 | 0.3069 | 0.6034 | 19.2913 |
| | DPP | 31.95 | 0.1950 | 0.3081 | 1.8312 |
| SemEval | SFO | 16.90 | 0.8640 | 0.7345 | 2.4978 |
| | EmbedRank++ | 77.05 | 0.7408 | 0.3798 | 7.4637 |
| | SIFRank | 66.66 | 0.2908 | 0.4615 | 20.6732 |
| | DPP | 23.82 | 0.2015 | 0.4618 | 2.0910 |
These results demonstrate that SFO outperforms the baselines on the diversity measures (ILD and SR) while maintaining competitive runtime efficiency.
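If you want to reproduce the diversity numbers on new data, note that Intra-List Distance is conventionally the average pairwise cosine distance over the selected keyphrases' embeddings. The helper below follows that conventional definition, which is assumed here (the paper's exact formulation may differ):

```python
import numpy as np
from itertools import combinations

def intra_list_distance(embs):
    """Average pairwise cosine distance (1 - cosine similarity) over a
    selected list's embeddings. Conventional ILD definition, assumed
    to match the paper's usage."""
    pairs = list(combinations(range(len(embs)), 2))
    if not pairs:
        return 0.0
    def cosine(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return sum(1 - cosine(embs[i], embs[j]) for i, j in pairs) / len(pairs)
```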
If you find this work helpful in your research, please cite our paper:
```bibtex
@misc{umair2024optimizingkeyphraserankingrelevance,
  title={Optimizing Keyphrase Ranking for Relevance and Diversity Using Submodular Function Optimization (SFO)},
  author={Muhammad Umair and Syed Jalaluddin Hashmi and Young-Koo Lee},
  year={2024},
  eprint={2410.20080},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2410.20080},
}
```