Suggest Courses from University Based on Interests / Bio
- Scraping - Custom for each school. Can be skipped if you already have all of your school's course data.
- Indexing - A new index for each school, because each school can have a different mapping. Given a JSON file containing the data and a mapping, index into Elasticsearch with the correct fields and types.
- Inference - Course suggestions
- Evaluation - Given a "golden dataset", we compare different approaches:
  - BM25, the Elasticsearch default (retrieval + reranking)
  - DPR (retrieval + reranking)
  - BM25 (retrieval) + CE (reranking)
  - DPR (retrieval) + CE (reranking)
  - DPR + BM25 (retrieval) + CE (reranking)
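The evaluation step above compares approaches against a golden dataset. The metrics used are not stated here, so the following is only a minimal sketch, assuming the golden dataset maps each query to a set of relevant course IDs and that each approach returns a ranked list of IDs. Recall@k and mean reciprocal rank are illustrative choices, and all function names are hypothetical.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant courses that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for course_id in ranked[:k] if course_id in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant course, or 0.0 if none is returned."""
    for rank, course_id in enumerate(ranked, start=1):
        if course_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: Dict[str, Dict[str, List[str]]],
             golden: Dict[str, Set[str]],
             k: int = 5) -> Dict[str, Dict[str, float]]:
    """Average both metrics over all golden queries, once per approach.

    runs:   approach name -> query -> ranked course IDs (assumed format)
    golden: query -> set of relevant course IDs (assumed format)
    """
    if not golden:
        return {}
    report = {}
    for approach, results in runs.items():
        recalls = [recall_at_k(results.get(q, []), rel, k) for q, rel in golden.items()]
        rrs = [reciprocal_rank(results.get(q, []), rel) for q, rel in golden.items()]
        report[approach] = {
            f"recall@{k}": sum(recalls) / len(recalls),
            "mrr": sum(rrs) / len(rrs),
        }
    return report
```

Calling `evaluate` with one entry per approach (for example "bm25", "dpr", "bm25+ce") yields a table of scores that can be compared side by side.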
To run the scraper for a website:
- Make sure you are in the 'spiders' directory
- execute:
scrapy runspider $SPIDER_NAME -O $OUTPUT_FILE
Example: scrapy runspider waterlooCompSci.py -O test.json
Note: you can pass paths for the file names.
Commands to run indexing
BM25
python .\indexing\index.py -i uwaterloo-courses -d .\scraping\contents\waterloo\output.json
DPR
python .\indexing\index.py -i uwaterloo-courses-dpr -d .\scraping\contents\waterloo\output.json
T5
python .\indexing\index.py -i uwaterloo-courses-t5 -d .\scraping\contents\waterloo\output.json
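How index.py turns the scraped JSON into documents is not shown here. As a rough sketch, assuming the scraper output is a JSON array of course objects and the mapping follows the usual Elasticsearch "properties" layout, the preparation step could look like the following. The field names in EXAMPLE_MAPPING and both helper functions are hypothetical.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List

# Hypothetical mapping for one school; the real field names and types
# depend on what that school's scraper outputs.
EXAMPLE_MAPPING: Dict[str, Any] = {
    "properties": {
        "code": {"type": "keyword"},
        "title": {"type": "text"},
        "description": {"type": "text"},
    }
}


def load_courses(path: str) -> List[Dict[str, Any]]:
    """Read the scraper output, assumed to be a JSON array of course objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def to_bulk_actions(index: str,
                    courses: Iterable[Dict[str, Any]],
                    mapping: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one bulk action per course, keeping only the mapped fields."""
    fields = mapping["properties"].keys()
    for course in courses:
        # Drop any scraped field that the mapping does not declare.
        source = {name: course[name] for name in fields if name in course}
        yield {"_index": index, "_source": source}
```

Dicts with "_index" and "_source" keys are the action shape accepted by the `bulk` helper in the official Elasticsearch Python client, so the generator's output could be passed to it after the index has been created with the mapping.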
Commands to run inference
Replace YOUR QUERY with a query string. RETRIEVERS can be any combination of bm25, dpr, and t5. N is the number of courses returned, and BOOL should be set to True if you would like to rerank the results. Note that reranking occurs automatically when multiple retrievers are used.
python .\inference\infer.py -q "YOUR QUERY" -rt RETRIEVERS -n N -rr BOOL
As an example of DPR retrieval with reranking:
python .\inference\infer.py -q "Machine Learning" -rt dpr -n 5 -rr True
As an example of BM25 + T5 retrieval (automatic reranking):
python .\inference\infer.py -q "Machine Learning" -rt bm25 t5 -n 5
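The flag behavior described above, including automatic reranking with multiple retrievers, can be sketched with argparse. This is not the actual infer.py: only the short flags (-q, -rt, -n, -rr) come from the commands above, while the long option names, the defaults, and the str_to_bool helper are assumptions.

```python
import argparse
from typing import List, Optional


def str_to_bool(value: str) -> bool:
    """Parse 'True'/'False' style strings.

    A plain type=bool would treat any non-empty string, including 'False', as True.
    """
    lowered = value.strip().lower()
    if lowered in {"true", "t", "yes", "y", "1"}:
        return True
    if lowered in {"false", "f", "no", "n", "0"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Suggest courses for a query")
    # Long option names and defaults are hypothetical.
    parser.add_argument("-q", "--query", required=True)
    parser.add_argument("-rt", "--retrievers", nargs="+",
                        choices=["bm25", "dpr", "t5"], default=["bm25"])
    parser.add_argument("-n", "--num-results", type=int, default=5)
    parser.add_argument("-rr", "--rerank", type=str_to_bool, default=False)
    args = parser.parse_args(argv)
    # Using more than one retriever always turns reranking on.
    args.rerank = args.rerank or len(args.retrievers) > 1
    return args
```

With this sketch, `parse_args(["-q", "Machine Learning", "-rt", "bm25", "t5", "-n", "5"])` ends up with reranking enabled even though -rr was not passed.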
- Fine-tune DPR
- Remove excess info from course descriptions
- Investigate low scores