This repository contains all programming assignments given in the course Information Retrieval and Web Search at the University of Mannheim during the Spring Term 2017.
For Homework 1, the (Damerau-)Levenshtein distance has been implemented both with dynamic programming and with recursion.
Several flags customize what happens at runtime:
- original: left side of the comparison
- compare: right side of the comparison
- recursive: true if a recursive version shall be used
- damerau: true if the Damerau-Levenshtein distance shall be used
- weigths: true if custom weights for transposition/replacement shall be used
The implementation is written in Go. A couple of tests show sample output, and benchmark tests show the difference in runtime between the recursive and the dynamic programming versions.
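The dynamic programming variant can be sketched as follows. This is a minimal illustration, not the repository's code: it assumes unit costs for every operation (no custom weights), it implements the restricted form of Damerau-Levenshtein (optimal string alignment, where only adjacent characters can be transposed), and the function name `editDistance` and its parameters are hypothetical.

```go
package main

import "fmt"

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// editDistance (hypothetical name) computes the Levenshtein distance between
// original and compare with dynamic programming. If damerau is true, swapping
// two adjacent characters also counts as a single edit (optimal string
// alignment variant). All operations cost 1; this is an assumption.
func editDistance(original, compare string, damerau bool) int {
	a, b := []rune(original), []rune(compare)

	// d[i][j] holds the distance between a[:i] and b[:j].
	d := make([][]int, len(a)+1)
	for i := range d {
		d[i] = make([]int, len(b)+1)
		d[i][0] = i // delete i characters
	}
	for j := 0; j <= len(b); j++ {
		d[0][j] = j // insert j characters
	}

	for i := 1; i <= len(a); i++ {
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			// deletion, insertion, replacement (or match)
			d[i][j] = min3(d[i-1][j]+1, d[i][j-1]+1, d[i-1][j-1]+cost)

			// transposition of two adjacent characters
			if damerau && i > 1 && j > 1 && a[i-1] == b[j-2] && a[i-2] == b[j-1] {
				if t := d[i-2][j-2] + 1; t < d[i][j] {
					d[i][j] = t
				}
			}
		}
	}
	return d[len(a)][len(b)]
}

func main() {
	fmt.Println(editDistance("kitten", "sitting", false)) // 3
	fmt.Println(editDistance("ca", "ac", false))          // 2
	fmt.Println(editDistance("ca", "ac", true))           // 1
}
```

The table has one cell per pair of prefixes, so the runtime is proportional to the product of the two string lengths. A naive recursive version recomputes the same prefix pairs many times, which is the kind of gap the benchmark tests are meant to show.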
- Term weighting: Compute TF-IDF for a toy document collection with different definitions for TF and IDF and rank the documents given a query with cosine similarity.
- Distance/similarity metrics: Rank documents for a given query using 'raw Euclidean distance', 'normalized Euclidean distance' and 'cosine similarity'
- Optimizing vector space model: Given a toy collection of TF-IDF vectors, perform random projections to reduce computation costs. Pre-cluster the documents using a given set of leader vectors. Finally, retrieve the top 5 documents for a query vector using the random projection vectors and the leader vectors with their clusters.
- Classic probabilistic retrieval: Given a query, rank documents with the 'Binary independence model', the 'Two-Poisson model' and 'BM25'
- Unigram Likelihood Model for Information Retrieval: The task of this programming assignment was to build a query likelihood model based on a unigram language model for the 20 News corpus, which takes ad-hoc queries and ranks the documents by relevance. This part is implemented using Scala and the Spark API.
- Latent Semantic Indexing: Computing the similarity of latent vectors for a toy collection of documents and a query
- Text Clustering: Using 'K-Means' and 'Single Pass Clustering' to cluster a toy collection of TF-IDF vectors
- IR Evaluation: Calculating precision, recall, F1, P@k, R-precision, average precision and mean average precision for a toy collection of retrieval results and their relevance ratings
- Semantic Retrieval with Word-Embeddings: Implementation of a simple retrieval engine based on aggregation of word embeddings using the pretrained 'GloVe' word embeddings and a random subsample of 500 documents from the '20 News Groups dataset'
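
As an illustration of the IR Evaluation item in the list above, some of the metrics can be sketched in Go. This is a minimal sketch under assumptions, not the code from the assignment: relevance is binary, a ranking is given as a slice of booleans in rank order, and all function names are hypothetical.

```go
package main

import "fmt"

// precisionAtK returns the fraction of relevant documents among the top k
// results. rel[i] is true if the document at rank i+1 is relevant.
func precisionAtK(rel []bool, k int) float64 {
	if k <= 0 {
		return 0
	}
	hits := 0
	for i := 0; i < k && i < len(rel); i++ {
		if rel[i] {
			hits++
		}
	}
	return float64(hits) / float64(k)
}

// recall returns the fraction of all relevant documents that were retrieved.
func recall(rel []bool, totalRelevant int) float64 {
	if totalRelevant == 0 {
		return 0
	}
	hits := 0
	for _, r := range rel {
		if r {
			hits++
		}
	}
	return float64(hits) / float64(totalRelevant)
}

// f1 returns the harmonic mean of precision p and recall r.
func f1(p, r float64) float64 {
	if p+r == 0 {
		return 0
	}
	return 2 * p * r / (p + r)
}

// averagePrecision sums the precision at every rank that holds a relevant
// document and divides by the total number of relevant documents.
func averagePrecision(rel []bool, totalRelevant int) float64 {
	if totalRelevant == 0 {
		return 0
	}
	sum, hits := 0.0, 0
	for i, r := range rel {
		if r {
			hits++
			sum += float64(hits) / float64(i+1)
		}
	}
	return sum / float64(totalRelevant)
}

func main() {
	// Hypothetical ranking of 5 results; 3 documents are relevant overall.
	rel := []bool{true, false, true, false, false}
	total := 3

	p := precisionAtK(rel, len(rel))
	r := recall(rel, total)
	fmt.Printf("precision:   %.3f\n", p)                              // 0.400
	fmt.Printf("recall:      %.3f\n", r)                              // 0.667
	fmt.Printf("F1:          %.3f\n", f1(p, r))                       // 0.500
	fmt.Printf("R-precision: %.3f\n", precisionAtK(rel, total))       // 0.667
	fmt.Printf("AP:          %.3f\n", averagePrecision(rel, total))   // 0.556
}
```

Mean average precision is the arithmetic mean of `averagePrecision` over all queries in the collection.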