A 'Learning to Rank' (LETOR) search engine built completely from scratch over the Wikipedia corpus

nileshsah/letor-search-engine

'Learning to Rank' (LETOR) Text-based Search Engine

A stand-alone project for building a search engine completely from scratch over the Wikipedia data corpus, using the Hadoop framework and the RankLib learning-to-rank library.

The project is configured to run various Hadoop jobs over the Wikipedia database dump to compute parameters such as:

  • Page Rank
  • TF-IDF
  • Page Index
  • Page Length

Example Usage: hadoop jar letor-search-engine.jar com.nellex.hadoop.WikiIndexProcessing
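
To make two of the quantities above concrete, here is a minimal in-memory sketch of TF-IDF and PageRank in Python. It is not the project's Hadoop implementation: the function names, the damping factor of 0.85, the fixed iteration count, and the unsmoothed `log(N / df)` form of IDF are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return {doc_id: {term: tf * idf}} for a dict of doc_id -> list of tokens.

    Assumption: raw term counts for tf and log(N / df) for idf.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        scores[doc_id] = {
            term: count * math.log(n_docs / df[term]) for term, count in tf.items()
        }
    return scores


def page_rank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of page -> list of outgoing links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank
```

A MapReduce version would split each PageRank round into a map step that emits rank shares per outgoing link and a reduce step that sums them per page; the loop above performs the same arithmetic on a single machine.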

Currently the project uses a SQL database to store the index table; the schema is provided in artifacts/Wiki.sql.

Various helper classes are provided under the com.nellex.helpers package for parsing the output generated by the Hadoop runs into the required schema.
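
As a rough illustration of what such a parsing step might look like, the sketch below loads tab-separated `key<TAB>value` lines into a SQL table. It is written in Python with `sqlite3` as a stand-in; the project's helpers are Java classes, and the `page_rank` table, its columns, and the `load_page_rank` function are hypothetical rather than taken from artifacts/Wiki.sql. It also assumes the Hadoop jobs write one tab-separated key and value per line.

```python
import sqlite3


def load_page_rank(lines, conn):
    """Insert 'title<TAB>score' lines into a hypothetical page_rank table.

    Returns the number of rows written.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page_rank (title TEXT PRIMARY KEY, score REAL)"
    )
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines in the job output
        # Split on the first tab only, so titles are kept intact.
        title, _, score = line.partition("\t")
        rows.append((title, float(score)))
    conn.executemany("INSERT OR REPLACE INTO page_rank VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)
```

In practice `lines` would be an open file over a job output part file, and `conn` a connection to the real database.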

We use the RankLib library as our ranking engine. It is first trained on a labeled dataset generated by the Python script search.py, which works by querying the Wikipedia search API with a randomly sampled set of unigrams and bigrams.
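
The query-sampling idea can be sketched as follows. This is not the contents of search.py: the function names, the fixed seed, and the choice of sampling from a single token list are assumptions. The URL uses the public MediaWiki search endpoint (`action=query&list=search`). How the returned results are turned into relevance labels is not described here, so that step is left out.

```python
import random
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def sample_queries(tokens, n_unigrams, n_bigrams, seed=0):
    """Randomly sample unigram and bigram queries from a list of tokens."""
    rng = random.Random(seed)
    unigrams = sorted(set(tokens))
    # Bigrams are pairs of adjacent tokens joined by a space.
    bigrams = sorted({" ".join(pair) for pair in zip(tokens, tokens[1:])})
    return (
        rng.sample(unigrams, min(n_unigrams, len(unigrams)))
        + rng.sample(bigrams, min(n_bigrams, len(bigrams)))
    )


def search_url(query, limit=10):
    """Build a MediaWiki full-text search URL for the given query."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)
```

Each URL can then be fetched with any HTTP client; the JSON response lists matching pages under `query.search`.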

The features used for training the model include: (1) query id, (2) covered query terms in the title, (3) title length, (4) covered query term ratio for the title, (5) length of the document, (6) PageRank score, (7) occurrence count of the query terms in the document, (8) sum of the term frequency for the query terms in the document, (9) minimum of the term frequency, (10) maximum of the term frequency, (11) mean of the term frequency, (12) sum of tf * idf, (13) min of tf * idf, (14) max of tf * idf, (15) mean of tf * idf, and (16) IDF.
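
A minimal sketch of how such a feature vector could be computed for one query and document pair, and written in the `<label> qid:<id> <index>:<value> ... # <comment>` line format that RankLib reads. The function names are hypothetical, and two readings are assumptions: "occurrence count" is taken as the number of query terms that appear in the document at least once, and "IDF" as the sum of IDF values over the query terms. The query id is emitted as the `qid` field rather than as a numbered feature.

```python
from collections import Counter


def query_doc_features(query_terms, title_tokens, body_tokens, page_rank, idf):
    """Return the per-pair feature list (everything except the query id)."""
    if not query_terms:
        raise ValueError("query_terms must not be empty")
    n = len(query_terms)
    title_set = set(title_tokens)
    covered_title = sum(1 for t in query_terms if t in title_set)
    tf = Counter(body_tokens)
    tfs = [tf[t] for t in query_terms]
    tfidfs = [tf[t] * idf.get(t, 0.0) for t in query_terms]
    return [
        covered_title,                 # covered query terms in the title
        len(title_tokens),             # title length
        covered_title / n,             # covered query term ratio for the title
        len(body_tokens),              # document length
        page_rank,                     # PageRank score
        sum(1 for c in tfs if c > 0),  # query terms occurring in the document (assumed reading)
        sum(tfs), min(tfs), max(tfs), sum(tfs) / n,
        sum(tfidfs), min(tfidfs), max(tfidfs), sum(tfidfs) / n,
        sum(idf.get(t, 0.0) for t in query_terms),  # summed IDF (assumed reading)
    ]


def ranklib_line(label, qid, features, comment=""):
    """Format one training example as a RankLib (LETOR-style) input line."""
    parts = [str(label), f"qid:{qid}"]
    parts += [f"{i}:{v:.6g}" for i, v in enumerate(features, start=1)]
    line = " ".join(parts)
    return f"{line} # {comment}" if comment else line
```

All lines that share a `qid` are treated as candidates for the same query, so examples for one query should be written together.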

License

MIT License. See the license for more details.
