- Christian Armstrong
- Daniel Liu
This project implements a simple search engine for Wikipedia-like XML dumps. It indexes pages, computes relevance and importance scores, and allows users to search for pages using a command-line interface.
Run the indexer to process an XML wiki dump and generate index files:
python index.py <XML filepath> <titles filepath> <docs filepath> <words filepath>
- `<XML filepath>`: Path to the XML file to index.
- `<titles filepath>`: Output file for ID-Title pairs.
- `<docs filepath>`: Output file for ID-PageRank pairs.
- `<words filepath>`: Output file for Word-ID-Relevance triples.
Run the querier to search the indexed data:
python query.py [--pagerank] <titleIndex> <documentIndex> <wordIndex>
- `--pagerank` (optional): If present, search results take PageRank into account when ranking.
- `<titleIndex>`: File with ID-Title pairs (from the indexer).
- `<documentIndex>`: File with ID-PageRank pairs (from the indexer).
- `<wordIndex>`: File with Word-ID-Relevance triples (from the indexer).
You will be prompted to enter search queries. Type `:quit` to exit.
- Purpose: Measures how important a word is to a document in the corpus.
- How:
- For each word in a page, compute term frequency (TF) as the count of the word divided by the max word count in that page.
- Compute inverse document frequency (IDF) as `log(total_docs / docs_with_word)`.
- The relevance score for a word in a page is `TF * IDF`.
- Stop words and numbers are excluded; words are stemmed.
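The scoring steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `tf_idf`, the input shape (a dict of page ID to already-tokenized words), and the use of the natural logarithm are all assumptions.

```python
import math
from collections import Counter


def tf_idf(pages):
    """Map each page ID to {word: relevance}.

    `pages` is a hypothetical dict of page ID -> list of words that have
    already been stemmed and stripped of stop words and numbers.
    """
    # Count how many pages contain each word at least once.
    docs_with_word = Counter()
    for words in pages.values():
        docs_with_word.update(set(words))

    total_docs = len(pages)
    relevance = {}
    for page_id, words in pages.items():
        counts = Counter(words)
        if not counts:
            relevance[page_id] = {}
            continue
        # TF is normalized by the most frequent word on the page.
        max_count = max(counts.values())
        relevance[page_id] = {
            # Natural log is an assumption; the README only says "log".
            word: (count / max_count) * math.log(total_docs / docs_with_word[word])
            for word, count in counts.items()
        }
    return relevance


pages = {1: ["dog", "dog", "cat"], 2: ["cat"], 3: ["bird", "cat"]}
scores = tf_idf(pages)
```

In this toy corpus, "cat" appears on every page, so its IDF is `log(3 / 3) = 0` and it contributes nothing to any page's relevance, while "dog" on page 1 scores `1.0 * log(3)`.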
- Purpose: Measures the importance of a page based on the link structure.
- How:
- Build a directed graph where nodes are pages and edges are links between them.
- Use the PageRank algorithm (with damping factor epsilon) to iteratively update scores until convergence.
- Special cases:
- Pages linking to nothing are treated as linking to all other pages (except themselves).
- Self-links and duplicate links are ignored.
- Links to non-existent pages are ignored.
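One way to sketch the iteration and the special cases above is shown below. The function name `pagerank`, the input shape (a dict of page title to linked titles), the default values `epsilon=0.15` and `delta=0.001`, and the Euclidean-distance stopping rule are assumptions for illustration; the project's own values and convergence test may differ.

```python
import math


def pagerank(links, epsilon=0.15, delta=0.001):
    """Return {page: rank} for a hypothetical dict of page -> linked titles."""
    pages = list(links)
    n = len(pages)
    if n == 0:
        return {}
    if n == 1:
        return {pages[0]: 1.0}

    # Clean the graph: drop self-links, duplicates, and links to unknown pages.
    out = {}
    for page in pages:
        targets = {t for t in links[page] if t != page and t in links}
        # A page with no remaining links is treated as linking to every other page.
        if not targets:
            targets = set(pages) - {page}
        out[page] = targets

    ranks = {page: 1.0 / n for page in pages}
    while True:
        # Every page receives a base share of epsilon / n ...
        new_ranks = {page: epsilon / n for page in pages}
        # ... plus (1 - epsilon) of each linking page's rank, split evenly.
        for page, targets in out.items():
            for target in targets:
                new_ranks[target] += (1 - epsilon) * ranks[page] / len(targets)
        # Stop once the Euclidean distance between iterations is small (assumed rule).
        distance = math.sqrt(sum((new_ranks[p] - ranks[p]) ** 2 for p in pages))
        ranks = new_ranks
        if distance < delta:
            return ranks


# "A" has a self-link, a duplicate link, and a dead link; "B" links to nothing.
links = {"A": ["B", "C", "A", "B", "Missing"], "B": [], "C": ["A"]}
ranks = pagerank(links)
```

Because every page ends up with at least one outgoing link after cleaning, the ranks keep summing to 1 across iterations.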
- For each query, compute a relevance score for each page based on the sum of TF-IDF scores for the query words.
- If `--pagerank` is specified, add the PageRank score to the relevance score.
- Return the top 10 results (or fewer if fewer are available).
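The ranking steps above might look like the following sketch. The function name `top_results` and the in-memory shapes of the indexes are hypothetical. It also assumes that only pages containing at least one query word are scored, and that a word repeated in the query is counted once per occurrence; the README does not state either point.

```python
def top_results(query_words, word_index, pageranks=None, limit=10):
    """Rank page IDs for a query.

    `word_index` is a hypothetical dict of word -> {page ID: relevance};
    `pageranks` is an optional dict of page ID -> PageRank score.
    """
    scores = {}
    # Sum the TF-IDF relevance of every query word for each matching page.
    for word in query_words:
        for page_id, relevance in word_index.get(word, {}).items():
            scores[page_id] = scores.get(page_id, 0.0) + relevance
    # With PageRank enabled, add each page's rank to its relevance total.
    if pageranks is not None:
        for page_id in scores:
            scores[page_id] += pageranks.get(page_id, 0.0)
    ranked = sorted(scores, key=lambda page_id: scores[page_id], reverse=True)
    return ranked[:limit]


word_index = {"comput": {1: 0.8, 2: 0.3}, "scienc": {2: 0.4, 3: 0.1}}
pageranks = {1: 0.01, 2: 0.02, 3: 0.9}

without_pagerank = top_results(["comput", "scienc"], word_index)
with_pagerank = top_results(["comput", "scienc"], word_index, pageranks)
```

In this made-up data, page 3 is barely relevant to the query but has a high PageRank, so it moves from last place to first once PageRank is added.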
- Excludes numbers and stop words from indexing.
- Handles edge cases in link graph (self-links, dead links, duplicates).
- Stems words for better matching.
- Interactive REPL for searching.
- Handles empty queries, queries with only spaces, gibberish, punctuation, and numbers.
- Tested with multiple and duplicate words, case insensitivity, and leading/trailing spaces.
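The project's actual tokenizer and stemmer are not described here, so the following is only a sketch of one normalization scheme that would handle the inputs listed above. The function name `normalize`, the tiny `STOP_WORDS` set, the whitespace split, the alphabetic-only filter, and the pluggable `stem` callable (identity by default) are all assumptions.

```python
# Illustrative subset only; a real stop-word list would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def normalize(text, stem=lambda word: word):
    """Turn raw query or page text into a list of index terms.

    `stem` is a hypothetical hook for a stemming function; the default
    leaves words unchanged.
    """
    terms = []
    # split() with no arguments ignores leading, trailing, and repeated spaces.
    for token in text.lower().split():
        # Keep purely alphabetic tokens; this drops numbers and any token
        # that contains punctuation (an assumed, deliberately strict rule).
        if token.isalpha() and token not in STOP_WORDS:
            terms.append(stem(token))
    return terms
```

Under these assumptions, `normalize("  HELLO ")` and `normalize("hello")` produce the same terms, while empty, whitespace-only, and numeric queries produce an empty list and therefore no results.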
Query (no pagerank): computer science
- LEO (computer)
- Malware
- Motherboard
- Graphical user interface
- PCP
- Hacker (term)
- Junk science
- Gary Kildall
- Macro virus
- Foonly
Query (pagerank): computer science
- Islamabad Capital Territory
- Java (programming language)
- Portugal
- Jürgen Habermas
- Mercury (planet)
- Pakistan
- Malware
- Graphical user interface
- LEO (computer)
- Isaac Asimov
Query (no pagerank): testing results
- JUnit
- GRE Physics Test
- Median lethal dose
- First-class cricket
- Kiritimati
- Mendelian inheritance
- Autosomal dominant polycystic kidney
- Telecommunications in Morocco
- Kuomintang
- Flying car (aircraft)
Query (pagerank): testing results
- Kuomintang
- Netherlands
- Northern Hemisphere
- Montoneros
- Nazi Germany
- Pakistan
- JUnit
- General relativity
- GRE Physics Test
- Manhattan Project
Query (no pagerank): dog cheese man fight house
- Isle of Man
- Cuisine of the Midwestern United States
- Limburg
- Morphology (linguistics)
- LMS
- Public house
- Fahrenheit 451
- Pizza
- HOL
- Enter the Dragon
Query (pagerank): dog cheese man fight house
- Neolithic
- Netherlands
- Isle of Man
- Morphology (linguistics)
- North Pole
- Franklin D. Roosevelt
- Falklands War
- Empress Jitō
- Nazi Germany
- Normandy
Query (no pagerank): dog dog dog dog
- Morphology (linguistics)
- Mustelidae
- Kyle MacLachlan
- Cuisine of the Midwestern United States
- Eth
- Phoenix (TV series)
- John James Rickard Macleod
- James Madison University
- Poltergeist
- Novial
Query (pagerank): dog dog dog dog
- Morphology (linguistics)
- North Pole
- Neolithic
- Guam
- Nazi Germany
- Franklin D. Roosevelt
- Mustelidae
- Pennsylvania
- Grammatical gender
- Kyle MacLachlan
Query (no pagerank): "" (empty)
- No results
Query (no pagerank): " " (spaces)
- No results
Query (no pagerank): ja;sldkfj;alksdjfa;sdlkf (gibberish)
- No results
Query (no pagerank): body?!? (punctuation)
- No results
Query (no pagerank): 17208372 (numbers)
- No results
Query (no pagerank): " hello" (leading space)
- Java (programming language)
- Enjambment
- Shoma Morita
- Forth (programming language)
- Shock site
- John Woo
- Luxembourgish language
- Mandy Patinkin
- HAL 9000
- Kareem Abdul-Jabbar
Query (no pagerank): hello
- Java (programming language)
- Enjambment
- Shoma Morita
- Forth (programming language)
- Shock site
- John Woo
- Luxembourgish language
- Mandy Patinkin
- HAL 9000
- Kareem Abdul-Jabbar
Query (no pagerank): HELLO (all caps)
- Java (programming language)
- Enjambment
- Shoma Morita
- Forth (programming language)
- Shock site
- John Woo
- Luxembourgish language
- Mandy Patinkin
- HAL 9000
- Kareem Abdul-Jabbar