A stand-alone project for building a search engine completely from scratch over the Wikipedia data corpus, using the Hadoop framework and the RankLib learning-to-rank library.
The project is configured to run various Hadoop jobs over the Wikipedia database dump to compute quantities such as:
- PageRank
- TF-IDF
- Page Index
- Page Length
Example usage: `hadoop jar letor-search-engine.jar com.nellex.hadoop.WikiIndexProcessing`
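For orientation, a term-frequency mapper in this style might look like the minimal sketch below (the class name, input layout, and key encoding are assumptions for illustration, not the project's actual code):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper: emits (term@docId, 1) pairs for a downstream TF count.
public class TermFrequencyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text termAndDoc = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumed input layout: "docId<TAB>pageText" per line.
        String[] parts = value.toString().split("\t", 2);
        if (parts.length < 2) {
            return;
        }
        String docId = parts[0];
        StringTokenizer tokens = new StringTokenizer(parts[1].toLowerCase());
        while (tokens.hasMoreTokens()) {
            termAndDoc.set(tokens.nextToken() + "@" + docId);
            context.write(termAndDoc, ONE);
        }
    }
}
```

A companion reducer would simply sum the counts per (term, document) key to yield the raw term frequencies used by the TF-IDF job.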
Currently the project uses a SQL database to store the index table; the schema is provided in artifacts/Wiki.sql.
Various helper classes have been provided under the com.nellex.helpers package for parsing the generated output of the Hadoop runs into the required schema.
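As a sketch of that loading step, a helper along these lines could read a Hadoop part file and insert rows over JDBC (the table name, columns, JDBC URL, and credentials below are placeholders rather than the schema in artifacts/Wiki.sql, and a JDBC driver must be on the classpath):

```java
import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class IndexLoader {
    public static void main(String[] args) throws Exception {
        // args[0]: a Hadoop part file with tab-separated "term<TAB>docId<TAB>tf" lines (assumed layout).
        try (Connection conn = DriverManager.getConnection("jdbc:mysql://localhost/wiki", "user", "pass");
             PreparedStatement stmt =
                 conn.prepareStatement("INSERT INTO term_index (term, doc_id, tf) VALUES (?, ?, ?)");
             BufferedReader reader = Files.newBufferedReader(Paths.get(args[0]))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] cols = line.split("\t");
                if (cols.length < 3) {
                    continue; // skip malformed lines
                }
                stmt.setString(1, cols[0]);
                stmt.setLong(2, Long.parseLong(cols[1]));
                stmt.setInt(3, Integer.parseInt(cols[2]));
                stmt.addBatch();
            }
            stmt.executeBatch();
        }
    }
}
```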
We use the RankLib library as our ranking engine. The model is first trained on a labeled dataset generated by the Python script search.py, which queries the Wikipedia search API with a randomly sampled set of unigrams and bigrams.
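Example training command (the file names and the choice of LambdaMART, ranker 6, are illustrative, not necessarily the project's exact configuration): `java -jar RankLib.jar -train train.txt -ranker 6 -metric2t NDCG@10 -save model.txt`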
The features used for training the model include:
1. query id
2. covered query terms in the title
3. title length
4. covered query term ratio for the title
5. length of the document
6. PageRank score
7. occurrence count of the query terms in the document
8. sum of the term frequencies of the query terms in the document
9. minimum term frequency of the query terms in the document
10. maximum term frequency
11. mean term frequency
12. sum of tf * idf
13. minimum of tf * idf
14. maximum of tf * idf
15. mean of tf * idf
16. IDF
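RankLib consumes these features in the standard LETOR/SVMlight text format, one example per line: a relevance label, the query id, and indexed feature values. A hedged illustration with made-up values (only the first few feature indices shown):

```
2 qid:42 1:3 2:0.75 3:12 4:0.25 5:1875 6:0.0013 7:9 # doc=example_article
```

In practice each line carries all of the features listed above for one (query, document) pair.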
MIT License. See the license for more details.