Generation of a dataset including published offers from Euraxess
The process is composed of two phases:
- Daily download of offers
This is a simple step that should be set up as a cron job. The following command should be run every day (note that in a crontab entry each `%` must be escaped as `\%`):
> wget "ANONYMIZED_EURAXESS_JOBPOSTS_URL" --output-document=[your_path]/jobs_`date +%Y-%m-%d_%H:%M:%S`.xml
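
Where wget is not available, the same daily download could be sketched in Python. This is only an illustration: `snapshot_path` and `download_snapshot` are hypothetical helper names that are not part of this repository, and the file name simply mirrors the `jobs_<date>_<time>.xml` pattern of the command above.

```python
from datetime import datetime
from pathlib import Path
from urllib.request import urlopen


def snapshot_path(directory, now=None):
    """Build the timestamped file name, mirroring the wget command above."""
    now = now or datetime.now()
    return Path(directory) / ("jobs_" + now.strftime("%Y-%m-%d_%H:%M:%S") + ".xml")


def download_snapshot(url, directory):
    """Fetch the feed at `url` and store it as a new timestamped XML file."""
    target = snapshot_path(directory)
    with urlopen(url) as response:
        target.write_bytes(response.read())
    return target
```

Calling `download_snapshot(url, "[your_path]")` once a day from a scheduler would then produce one XML file per run.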
- Consolidation of downloaded offers
Since the same offer appears repeatedly across the retrieved files, the downloaded data must be consolidated by running the following Python script:
> python main.py -c config_file
This script processes the downloaded XML files, extracts the necessary information, and consolidates the offers into a final CSV file.
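
The consolidation idea can be illustrated with a minimal sketch. It is not the code in `main.py`. It assumes, purely for illustration, that each offer is a `<job>` element with `<id>` and `<title>` children, and the function names are hypothetical. The real feed layout and extracted fields may differ.

```python
import csv
import io
import xml.etree.ElementTree as ET

FIELDS = ["id", "title"]  # hypothetical columns; the real script may extract more


def extract_offers(xml_text):
    """Return one dict per <job> element (tag names are assumptions)."""
    root = ET.fromstring(xml_text)
    return [
        {field: (job.findtext(field) or "").strip() for field in FIELDS}
        for job in root.iter("job")
    ]


def consolidate(snapshots):
    """Merge offers from several snapshots, keeping one row per offer id."""
    offers = {}
    for xml_text in snapshots:
        for offer in extract_offers(xml_text):
            # An offer seen again in a later snapshot replaces the earlier copy.
            offers[offer["id"]] = offer
    return list(offers.values())


def to_csv(offers):
    """Serialise the consolidated offers as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(offers)
    return buffer.getvalue()
```

Keying the offers by identifier is what removes the duplicates: an offer that shows up in several daily files ends up as a single CSV row.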
The script keeps track of downloaded files that have already been processed. To regenerate the whole dataset from scratch, run the consolidation script with the --resetCSV flag.
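
One possible way to keep track of processed files is sketched below. How `main.py` actually stores this state is not documented here, so the log file (one processed file name per line), the function names, and the `reset` parameter (standing in for `--resetCSV`) are all assumptions made for illustration.

```python
from pathlib import Path


def pending_files(xml_dir, log_path, reset=False):
    """List the XML files that still need processing.

    `log_path` is a hypothetical text file with one processed file name per line.
    With reset=True the log is ignored, so every file is returned.
    """
    log = Path(log_path)
    done = set()
    if log.exists() and not reset:
        done = set(log.read_text().splitlines())
    return sorted(p for p in Path(xml_dir).glob("*.xml") if p.name not in done)


def mark_processed(files, log_path, reset=False):
    """Record processed file names; with reset=True the log is rewritten."""
    mode = "w" if reset else "a"
    with open(log_path, mode) as log:
        for path in files:
            log.write(path.name + "\n")
```

In this sketch a normal run only touches files missing from the log, while a reset run reprocesses everything and starts the log afresh.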