Link prediction in Citation Networks

Repository for the project of the INF554 Machine Learning 1 course.
This project was developed by Fabrizio Indirli, Leon Kloten, Martin Wohlfender and Seongbin Lim.

Introduction

The project consisted of predicting missing links in a citation network. Some edges were removed from a citation network built on a dataset of 27770 articles, and the aim was to build a model that predicts whether a given edge was part of the original network.

To do so, 17 features were extracted from the given data and various supervised learning models were developed, reaching an F1 score of almost 0.97. Six of the features largely drive prediction performance, including the number of paths, the TF-IDF cosine similarities of abstracts, the source hub score, the target authority score and the Resource Allocation index. Among the classifiers that were tried, XGBoost and a neural network produced the best results.
You can read more about our results and methodology in report.pdf and presentation.pdf.
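
As an illustration of one of the features named above, the Resource Allocation index of a node pair sums 1/degree(w) over the common neighbors w of the two nodes. The sketch below is a minimal pure-Python version on an undirected toy graph. The function names and node IDs are made up for the example, and treating the citation graph as undirected is an assumption; this is not the code in the scripts folder.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (source, target) pairs."""
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)
    return adjacency


def resource_allocation_index(adjacency, u, v):
    """Sum 1/degree(w) over every common neighbor w of u and v."""
    common = adjacency.get(u, set()) & adjacency.get(v, set())
    return sum(1.0 / len(adjacency[w]) for w in common)


# Toy graph with made-up node IDs (not taken from the dataset).
toy_edges = [(1, 3), (2, 3), (1, 4), (2, 4), (4, 5)]
toy_adjacency = build_adjacency(toy_edges)

# Nodes 1 and 2 share neighbors 3 (degree 2) and 4 (degree 3).
ra_score = resource_allocation_index(toy_adjacency, 1, 2)
```

A pair with many low-degree common neighbors scores higher than a pair whose only shared neighbor is a highly connected node.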

Folder structure

Folder scripts contains the feature-generation scripts that are called by create_features.py
Folder data contains the CSV/TXT data files generated by running create_features.py
Folder predictions will contain the predictions file generated by model.py
node_information.csv is a spreadsheet that contains the data about the 27770 articles of the citation network: the columns are Node_ID, Publication Year, Title, Authors, Journal, Abstract
training_set.txt contains the known data on the edges of the citation network: each line of this file has the form Source_Node_ID, Target_Node_ID, Exists, and the associated edge (Source_Node_ID, Target_Node_ID) exists in the citation network only if Exists = 1
testing_set.txt contains candidate edges Source_Node_ID, Target_Node_ID that have to be classified
create_features.py is the script that has to be called to generate the feature files in the data folder
model.py is the script that executes the model and outputs a prediction in the predictions folder; it requires that all the feature files are present in the data folder (run create_features.py to create those files)
report.pdf is a short (4-page) report of the project
presentation.pdf is the longer presentation of the project
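
The edge file formats described above can be read with a few lines of Python. This is a minimal sketch, not the loader used by create_features.py: the helper names are hypothetical, and because the exact field separator is not stated here, the parser accepts both commas and whitespace. Node IDs are kept as strings so that nothing about their format is assumed.

```python
def parse_edge_line(line):
    """Parse one line into (source_id, target_id, label).

    label is None when the line has no third field, as in testing_set.txt.
    """
    # Assumption: fields are separated by commas and/or whitespace.
    fields = line.replace(",", " ").split()
    if len(fields) not in (2, 3):
        raise ValueError(f"unexpected edge line: {line!r}")
    source_id, target_id = fields[0], fields[1]
    label = int(fields[2]) if len(fields) == 3 else None
    return source_id, target_id, label


def read_edges(path):
    """Read every non-empty line of an edge file."""
    with open(path, encoding="utf-8") as handle:
        return [parse_edge_line(line) for line in handle if line.strip()]


# Made-up node IDs, for illustration only.
labeled = parse_edge_line("1001 1002 1")
unlabeled = parse_edge_line("1001, 1003")
```

With the default files in the root folder, read_edges("training_set.txt") would return labeled triples and read_edges("testing_set.txt") would return triples whose label is None.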

How to run the code

Default training_set.txt, testing_set.txt and node_information.csv files are already provided.
If you want to use your own, put them (with the same names and structure!) in the root folder, next to create_features.py and model.py, replacing the default ones, and delete all the contents of the data folder.
Run create_features.py to generate the files containing the features.

WARNING: Computing one of the features (TF-IDF cosine similarity) requires about 13 GB of RAM on the provided dataset and might fail with "MemoryError" if not enough memory is available. This is why this feature is provided precomputed for the default dataset in the "training_sims.csv" and "testing_sims.csv" files inside the data folder. It can still be computed from scratch on a computer with enough memory.
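
For readers unfamiliar with the feature, the sketch below shows TF-IDF cosine similarity on sparse dictionaries, scoring one pair of texts at a time. It is an illustration only and not the repository's implementation: the function names, the whitespace tokenizer, the smoothed IDF variant and the toy abstracts are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(documents)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term count times a smoothed inverse document frequency.
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    if len(vec_a) > len(vec_b):
        vec_a, vec_b = vec_b, vec_a  # iterate over the smaller vector
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy abstracts, invented for the example.
abstracts = [
    "graph neural network",
    "graph neural network",
    "quantum field theory",
]
vectors = tfidf_vectors(abstracts)
same_text = cosine_similarity(vectors[0], vectors[1])
disjoint_text = cosine_similarity(vectors[0], vectors[2])
```

Identical texts score close to 1 and texts with no shared terms score 0; each candidate edge would get one such score as a feature.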

REMARKS:
- Because they take a long time to compute, the following feature files are also provided: closeness_centrality_nodes.txt, number_paths_testing.txt, number_paths_training.txt, hits_infos.csv
- The files containing the calculated features are stored in the data folder.
- The scripts that calculate the features are stored in the scripts folder.
Run model.py to generate the predictions.

REMARK: The predictions are stored in the predictions folder.
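
Because model.py needs the feature files to be present in the data folder, a small check before running it can save a failed run. The sketch below is a hypothetical helper, not part of the repository. It only knows the feature file names mentioned in this README, so the set actually required by model.py may be larger.

```python
from pathlib import Path

# Only the file names mentioned in this README; the full set required by
# model.py may be larger.
KNOWN_FEATURE_FILES = [
    "training_sims.csv",
    "testing_sims.csv",
    "closeness_centrality_nodes.txt",
    "number_paths_training.txt",
    "number_paths_testing.txt",
    "hits_infos.csv",
]


def missing_feature_files(data_dir, names=KNOWN_FEATURE_FILES):
    """Return the names from `names` that are not files inside data_dir."""
    data_path = Path(data_dir)
    return [name for name in names if not (data_path / name).is_file()]
```

Called as missing_feature_files("data") from the root folder, it returns an empty list when every listed file is present, and otherwise the names that create_features.py still has to generate.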

fabrizio-indirli/citationsPrediction
