Science Concierge is a Python repository for content-based recommendation based on latent semantic analysis (LSA) topic distance and the Rocchio algorithm. It is the backend algorithm for Scholarfy (www.scholarfy.net), an automatic scheduler for conferences.
See the full article on PLOS ONE or arXiv, or the full TeX manuscript and presentation here. A scaled-up version of Scholarfy covering 14.3M articles from PubMed is available at pubmed.scholarfy.net.
First, clone the repository.
$ git clone https://github.com/titipata/science_concierge
Install the dependencies using pip,
$ pip install -r requirements.txt
Install the library using setup.py,
$ python setup.py develop install
We provide example csv files from the PubMed Open Access Subset that you can download and play with (we parsed them using pubmed_parser). Each file contains pmc, pmid, title, abstract, and publication_year as column names.
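As a quick sanity check on the file layout described above, the sketch below loads a CSV with pandas and verifies the five columns. The helper name load_articles, the choice to drop rows without an abstract, and the two-row sample are assumptions made for illustration; they are not part of science_concierge.

```python
import io
import pandas as pd

# Column names listed in the text above.
EXPECTED_COLUMNS = ["pmc", "pmid", "title", "abstract", "publication_year"]

def load_articles(source):
    """Hypothetical helper: read one CSV and keep rows with an abstract."""
    df = pd.read_csv(source, encoding="utf-8")
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError("missing columns: %s" % missing)
    # Rows without an abstract cannot be vectorized, so drop them here.
    return df.dropna(subset=["abstract"]).reset_index(drop=True)

# Tiny made-up sample standing in for a real pubmed_oa_{year}.csv file.
sample = io.StringIO(
    "pmc,pmid,title,abstract,publication_year\n"
    "PMC1,101,First title,Some abstract text,2016\n"
    "PMC2,102,Second title,,2016\n"
)
articles = load_articles(sample)
print(articles[["pmid", "title"]])
```

With a real download, pass the file path (for example 'data/pubmed_oa_2016.csv') instead of the in-memory sample.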
Use the download function to download the example data,
import science_concierge
science_concierge.download(['pubmed_oa_2015.csv', 'pubmed_oa_2016.csv'])
We provide pubmed_oa_{year}.csv for {year} = 2007, ..., 2016 (note that the 2007 file contains all publications before 2008). Alternatively, use awscli to download the data,
$ aws s3 cp s3://science-of-science-bucket/science_concierge/data/ . --recursive
You can build a quick recommendation model by importing the ScienceConcierge class and using the fit method to fit a list of documents. Then use recommend to recommend documents based on liked or disliked documents.
import pandas as pd
from science_concierge import ScienceConcierge
df = pd.read_csv('data/pubmed_oa_2016.csv', encoding='utf-8')
docs = list(df.abstract) # provide list of abstracts
titles = list(df.title) # titles
# select weighting from 'count', 'tfidf', or 'entropy'
recommend_model = ScienceConcierge(stemming=True, ngram_range=(1,1),
                                   weighting='entropy', norm=None,
                                   n_components=200, n_recommend=200,
                                   verbose=True)
recommend_model.fit(docs) # input list of documents or abstracts
index = recommend_model.recommend(likes=[10000], dislikes=[]) # input lists of liked/disliked document indices (here we like titles[10000])
docs_recommend = [titles[i] for i in index[0:10]] # recommended documents
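To give a feel for how likes and dislikes can steer a recommendation, here is a minimal NumPy sketch of a Rocchio-style update over document vectors. It is not the library's implementation: the function name rocchio_recommend, the weights alpha and beta, the Euclidean distance, and the choice to skip already-voted documents are all assumptions made for illustration.

```python
import numpy as np

def rocchio_recommend(vectors, likes, dislikes, alpha=1.0, beta=0.5, n_recommend=5):
    """Rank documents by distance to a Rocchio-adjusted preference vector.

    vectors         : (n_docs, n_dims) array of document vectors, e.g. LSA topics
    likes, dislikes : lists of row indices into vectors
    alpha, beta     : illustrative weights, not values taken from the library
    """
    vectors = np.asarray(vectors, dtype=float)
    query = np.zeros(vectors.shape[1])
    if likes:
        # Move the preference vector toward the centroid of liked documents.
        query += alpha * vectors[likes].mean(axis=0)
    if dislikes:
        # Move it away from the centroid of disliked documents.
        query -= beta * vectors[dislikes].mean(axis=0)
    # Euclidean distance from every document to the preference vector.
    distances = np.linalg.norm(vectors - query, axis=1)
    ranked = np.argsort(distances).tolist()
    # Assumption of this sketch: skip documents the user already voted on.
    voted = set(likes) | set(dislikes)
    return [i for i in ranked if i not in voted][:n_recommend]

# Made-up 2-D vectors: documents 0 and 1 are similar, as are 2 and 3.
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
print(rocchio_recommend(toy, likes=[0], dislikes=[2], n_recommend=2))  # prints [1, 3]
```

Liking document 0 and disliking document 2 ranks document 1 first, since it sits closest to the adjusted preference vector.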
We also provide add-on vectorizer classes, LogEntropyVectorizer and BM25Vectorizer, for calculating document-term weighting from an input list of documents. Here is an example usage.
from science_concierge import LogEntropyVectorizer
l_model = LogEntropyVectorizer(norm=None, ngram_range=(1,2),
                               stop_words='english', min_df=1, max_df=0.8)
X = l_model.fit_transform(docs) # where docs is list of documents
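For readers unfamiliar with log-entropy weighting, the sketch below applies the standard textbook formula to a small dense count matrix: a local weight log(1 + tf) multiplied by a global weight 1 + sum_j(p_ij log p_ij) / log(n_docs). The function name log_entropy_weight is hypothetical, and LogEntropyVectorizer may differ in details such as normalization or sparse handling.

```python
import numpy as np

def log_entropy_weight(counts):
    """Log-entropy weighting of a dense (n_docs, n_terms) count matrix.

    Requires at least two documents, since the formula divides by log(n_docs).
    """
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    # p[j, i]: share of term i's total occurrences that fall in document j.
    totals = counts.sum(axis=0)
    p = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
    # Treat 0 * log(0) as 0 when summing the entropy term.
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    # Global weight: 1 for a term concentrated in one document,
    # 0 for a term spread evenly over all documents.
    global_weight = 1.0 + plogp.sum(axis=0) / np.log(n_docs)
    # Local weight: dampened term frequency.
    return np.log1p(counts) * global_weight

# Made-up counts: term 0 appears only in document 0, term 1 once in each.
toy_counts = np.array([[2, 1],
                       [0, 1]])
weights = log_entropy_weight(toy_counts)
print(weights)
```

In this toy case the evenly spread term gets weight 0 everywhere, while the concentrated term keeps its full local weight log(3) in document 0.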
If you already have a sparse document-term matrix such as X above, you can pass it directly to the fit_document_matrix method.
recommend_model = ScienceConcierge(n_components=200, n_recommend=200)
recommend_model.fit_document_matrix(X)
index = recommend_model.recommend(likes=[10000], dislikes=[])
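The LSA step that operates on a sparse document-term matrix can be illustrated with scikit-learn's TruncatedSVD. This is a sketch under the assumption that LSA here means a truncated SVD projection followed by a distance lookup in topic space; it uses a made-up 4x5 matrix and 2 components instead of the 200 used above, and it does not reproduce the internals of fit_document_matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Toy document-term matrix: 4 documents, 5 terms (made-up counts).
X_toy = csr_matrix(np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 1],
    [0, 0, 3, 1, 0],
    [0, 0, 1, 2, 0],
], dtype=float))

# Project documents onto 2 latent topics.
svd = TruncatedSVD(n_components=2, random_state=0)
topics = svd.fit_transform(X_toy)  # dense array of shape (4, 2)

# Nearest neighbour of document 0 in topic space, excluding itself.
distances = np.linalg.norm(topics - topics[0], axis=1)
distances[0] = np.inf
nearest = int(np.argmin(distances))
print(nearest)
```

Documents 0 and 1 share vocabulary, so document 1 comes out as the nearest neighbour of document 0 in the reduced space.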
- numpy
- pandas
- unidecode
- nltk with white space tokenizer and Porter stemmer; use science_concierge.download_nltk() to download required corpora (there is a stemmer bug in nltk==3.2.2)
- scikit-learn
- cachetools
- joblib
Copyright (c) 2015 Titipat Achakulvisut, Daniel E. Acuna, Tulakan Ruangrong, Konrad Kording