titipata/science_concierge

Science Concierge

A Python repository for content-based recommendation based on latent semantic analysis (LSA) topic distance and the Rocchio algorithm. Science Concierge is the backend algorithm for Scholarfy (www.scholarfy.net), an automatic scheduler for conferences.

See the full article on PLOS ONE or arXiv, or the full TeX manuscript and presentation here. You can also see a scaled-up version of Scholarfy covering 14.3M articles from PubMed at pubmed.scholarfy.net.

Usage

First, clone the repository.

$ git clone https://github.com/titipata/science_concierge

Install the dependencies using pip:

$ pip install -r requirements.txt

Install the library using setup.py:

$ python setup.py develop install

Download example data

We provide example CSV files from the PubMed Open Access Subset that you can download and experiment with (parsed using pubmed_parser). Each file contains the columns pmc, pmid, title, abstract, and publication_year. Use the download function to fetch the example data:

import science_concierge
science_concierge.download(['pubmed_oa_2015.csv', 'pubmed_oa_2016.csv'])

We provide pubmed_oa_{year}.csv for {year} = 2007, ..., 2016 (note that the 2007 file contains all publications before 2008). Alternatively, use awscli to download the data:

$ aws s3 cp s3://science-of-science-bucket/science_concierge/data/ . --recursive
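The file naming scheme described above can be turned into a list for the download function. The following is a minimal sketch; the helper name `example_filenames` is hypothetical and not part of the library, and the download call is left commented out because it needs network access.

```python
def example_filenames(first_year=2007, last_year=2016):
    """Build pubmed_oa_{year}.csv names for an inclusive range of years.

    Hypothetical helper: only the 2007..2016 range named in this README
    is accepted.
    """
    if first_year < 2007 or last_year > 2016 or first_year > last_year:
        raise ValueError('years must satisfy 2007 <= first_year <= last_year <= 2016')
    return ['pubmed_oa_{}.csv'.format(year)
            for year in range(first_year, last_year + 1)]


files = example_filenames(2014, 2016)
print(files)

# Uncomment to fetch the files with the library's download function:
# import science_concierge
# science_concierge.download(files)
```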

Example usage of Science Concierge

You can build a quick recommender by importing the ScienceConcierge class and using the fit method to fit a list of documents. Then use recommend to get recommended documents based on the documents you like or dislike.

import pandas as pd
from science_concierge import ScienceConcierge

df = pd.read_csv('data/pubmed_oa_2016.csv', encoding='utf-8')
docs = list(df.abstract) # provide list of abstracts
titles = list(df.title) # titles
# select weighting from 'count', 'tfidf', or 'entropy'
recommend_model = ScienceConcierge(stemming=True, ngram_range=(1,1),
                                   weighting='entropy', norm=None,
                                   n_components=200, n_recommend=200,
                                   verbose=True)
recommend_model.fit(docs) # input list of documents or abstracts
index = recommend_model.recommend(likes=[10000], dislikes=[]) # input lists of liked/disliked document indices (here we like titles[10000])
docs_recommend = [titles[i] for i in index[0:10]] # recommended documents
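To give some intuition for how likes and dislikes can steer a recommendation, here is a minimal pure-Python sketch of a Rocchio-style update on toy topic vectors. It is not the library's implementation: the function names, the weights `beta` and `gamma`, the Euclidean distance, and the choice to exclude already-voted documents are all assumptions made for illustration.

```python
import math


def mean_vector(vectors, dim):
    """Component-wise mean of a list of vectors (zero vector if the list is empty)."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def rocchio_query(doc_vectors, likes, dislikes, beta=1.0, gamma=0.5):
    """Move the query toward liked documents and away from disliked ones.

    beta and gamma are illustrative weights, not the library's defaults.
    """
    dim = len(doc_vectors[0])
    liked = mean_vector([doc_vectors[i] for i in likes], dim)
    disliked = mean_vector([doc_vectors[i] for i in dislikes], dim)
    return [beta * l - gamma * d for l, d in zip(liked, disliked)]


def recommend(doc_vectors, likes, dislikes, n_recommend=3):
    """Rank documents that were not voted on by distance to the query vector."""
    query = rocchio_query(doc_vectors, likes, dislikes)
    voted = set(likes) | set(dislikes)
    candidates = [i for i in range(len(doc_vectors)) if i not in voted]
    candidates.sort(key=lambda i: math.dist(query, doc_vectors[i]))
    return candidates[:n_recommend]


# Toy 2-D "topic" vectors standing in for LSA output.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.8, 0.3]]
result = recommend(vectors, likes=[0], dislikes=[2], n_recommend=2)
print(result)
```

In this toy setup, documents close to the liked vector and far from the disliked one rank first. In practice the vectors would come from the fitted LSA model rather than being written by hand.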

Vectorizer available

We also provide add-on vectorizer classes, including LogEntropyVectorizer and BM25Vectorizer, for computing document-term weighting from an input list of documents. Here is an example usage:

from science_concierge import LogEntropyVectorizer
l_model = LogEntropyVectorizer(norm=None, ngram_range=(1,2),
                               stop_words='english', min_df=1, max_df=0.8)
X = l_model.fit_transform(docs) # where docs is list of documents

When we already have a sparse document-term matrix, as in this case, we can use the fit_document_matrix method directly.

recommend_model = ScienceConcierge(n_components=200, n_recommend=200)
recommend_model.fit_document_matrix(X)
index = recommend_model.recommend(likes=[10000], dislikes=[])
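For readers unfamiliar with log-entropy weighting, the sketch below computes the common textbook form on a small dense count matrix: a local weight log(1 + tf) multiplied by a global weight 1 + sum(p log p) / log(n). This is an illustration under that assumed definition; the function name `log_entropy_weights` is hypothetical, and LogEntropyVectorizer may differ in details such as normalization and sparse handling.

```python
import math


def log_entropy_weights(counts):
    """Apply log-entropy weighting to raw term counts.

    counts: list of documents, each a list of term counts in the same
    vocabulary order. Returns a matrix of the same shape.
    """
    n_docs = len(counts)
    n_terms = len(counts[0])
    global_weights = []
    for t in range(n_terms):
        total = sum(doc[t] for doc in counts)
        entropy = 0.0
        for doc in counts:
            if doc[t] > 0:
                p = doc[t] / total  # share of the term's occurrences in this document
                entropy += p * math.log(p)
        # 1 for a term concentrated in one document, 0 for a term spread evenly.
        g = 1.0 + entropy / math.log(n_docs) if n_docs > 1 else 1.0
        global_weights.append(g)
    return [[math.log(1 + doc[t]) * global_weights[t] for t in range(n_terms)]
            for doc in counts]


# Term 0 appears only in the first document; term 1 is spread evenly.
toy_counts = [[2, 1], [0, 1]]
weighted = log_entropy_weights(toy_counts)
print(weighted)
```

The evenly spread term receives a weight near zero because it does not help distinguish documents, while the concentrated term keeps its full local weight.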

Dependencies

Members

License

Copyright (c) 2015 Titipat Achakulvisut, Daniel E. Acuna, Tulakan Ruangrong, Konrad Kording
