This repository contains the code used for the paper. To reproduce the results, proceed as follows:
- Download the data files. We plan to make these available on a webserver in the future; for now, please ask us for them. Save the following files in data/arxiv/keywords-backend/ (a quick check that everything is in place is sketched after the list):
papers
paper_topics
all_lengths.json
broadness_lda
and the following in data/arxiv/thomsonreuters/:
JournalHomeGrid-2001.csv
...
JournalHomeGrid-2009.csv
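As a quick sanity check, the following sketch verifies that all required files are present. It assumes the paths above, one JournalHomeGrid CSV per year from 2001 through 2009, and that it is run from the repository root.
import os

# Expected data files as listed above; adjust if your layout differs.
required = [
    'data/arxiv/keywords-backend/papers',
    'data/arxiv/keywords-backend/paper_topics',
    'data/arxiv/keywords-backend/all_lengths.json',
    'data/arxiv/keywords-backend/broadness_lda',
] + ['data/arxiv/thomsonreuters/JournalHomeGrid-%d.csv' % year
     for year in range(2001, 2010)]

missing = [path for path in required if not os.path.exists(path)]
if missing:
    print('Missing files:', missing)
else:
    print('All data files found.')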
- Set up a MySQL database and save the connection data in settings_private.py (a connection check is sketched below the settings).
DB_PASS = '...'
DB_USER = '...'
DB_HOST = '...'
DB_NAME = '...'
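To verify that the connection settings work, here is a minimal sketch assuming the PyMySQL driver; the project's own code may use a different MySQL client library.
import pymysql
import settings_private as s

# Connect with the credentials from settings_private.py and run a trivial query.
conn = pymysql.connect(host=s.DB_HOST, user=s.DB_USER,
                       password=s.DB_PASS, database=s.DB_NAME)
with conn.cursor() as cur:
    cur.execute('SELECT VERSION()')
    print('Connected to MySQL', cur.fetchone()[0])
conn.close()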
- Create the database structure and import the arXiv, Paperscape, and JIF data.
mysql < database_structure.sql
python arxiv_importer.py
python paperscape_importer.py
python jif_importer.py
- Run the pre-processing steps.
python analysis.py
python net.py
- Run the following SQL command against the database to initialize the train_real column from the train column (a Python alternative is sketched below).
UPDATE analysissingle512_authors SET train_real = train;
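If you prefer not to use the mysql shell, the same statement can be issued from Python (again a sketch assuming PyMySQL):
import pymysql
import settings_private as s

conn = pymysql.connect(host=s.DB_HOST, user=s.DB_USER,
                       password=s.DB_PASS, database=s.DB_NAME)
with conn.cursor() as cur:
    # Initialize the train_real column from the original train assignment.
    cur.execute('UPDATE analysissingle512_authors SET train_real = train')
conn.commit()
conn.close()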
- Generate the cross-validation groups and prepare the x and y data.
python run_local.py prepare
- Train the random forest and neural network models for each cross-validation round $i (0 to 19); a loop over all rounds is sketched after the commands.
python run_cluster.py train-rf $i
python run_cluster.py train-net $i
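To run all 20 rounds locally in one go, a minimal driver sketch (on an actual cluster you would typically submit one job per round instead):
import subprocess

# Train the random forest and the neural network for every cross-validation round.
for i in range(20):
    subprocess.run(['python', 'run_cluster.py', 'train-rf', str(i)], check=True)
    subprocess.run(['python', 'run_cluster.py', 'train-net', str(i)], check=True)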
- Evaluate the trained models as well as some naive baseline models for each $i and summarize the results (see the loop sketch after the commands).
python run_local.py evaluate-rf --i $i
python run_local.py evaluate-net --i $i
python run_local.py evaluate-linear-naive --i $i
python run_local.py summarize
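The evaluation commands can be looped in the same way (a sketch; summarize runs once after all rounds have been evaluated):
import subprocess

# Evaluate each model type for every cross-validation round, then summarize.
for i in range(20):
    for task in ('evaluate-rf', 'evaluate-net', 'evaluate-linear-naive'):
        subprocess.run(['python', 'run_local.py', task, '--i', str(i)], check=True)
subprocess.run(['python', 'run_local.py', 'summarize'], check=True)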
The summary files will be placed in data/analysissingle512/evaluate/no-max-hindex; the results for each individual trained model will be placed in data/analysissingle512/evaluate/no-max-hindex/task-results.