- Read website URLs from the input.txt file.
- Crawl the website contents and store the occurrence list of each word in the occurenceList folder, where the file name is the reference for the index term and the file content is the occurrence list for that word. Each occurrence list entry stores the website URL's hash value, the website URL, and the word count.
- Build a compressed trie from the website contents, discard stop words, and store the file name as a reference to the index term in the external nodes of the trie.
- A query for a single word takes O(n) time, where n is the length of the word; a query for a sentence takes O(n) time, where n is the length of the sentence. Taking the intersection of all queried URL lists takes O(nm) time, where n is the number of URLs returned for the first word and m is the number of words in the sentence.
- For the ranking algorithm, we start with a list of URLs, each with a word frequency, sort them by frequency in descending order, and return the sorted URLs (see the sketch after this list).
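As a rough illustration of the intersection and ranking steps above, here is a minimal sketch; the function name `intersect_and_rank` and the dict-of-frequencies shape are my assumptions, not the actual code in searchEngine.py:

```python
# Minimal sketch: intersect the per-word query results, then rank.
# Assumes each word's query returns a dict mapping URL -> frequency
# (an assumed shape, not the actual data structure in searchEngine.py).

def intersect_and_rank(per_word_results):
    if not per_word_results:
        return []
    # Start from the URLs of the first word (size n) and intersect with
    # the remaining m-1 words' results: O(n*m) overall.
    common = set(per_word_results[0])
    for result in per_word_results[1:]:
        common &= set(result)
    # Rank by total frequency across all query words, descending.
    return sorted(common,
                  key=lambda url: sum(r[url] for r in per_word_results),
                  reverse=True)

# Made-up example:
print(intersect_and_rank([
    {"https://a.com": 5, "https://b.com": 2},
    {"https://a.com": 1, "https://b.com": 7},
]))  # -> ['https://b.com', 'https://a.com']
```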
First, make sure you have a Python 3 environment (preferably inside a virtualenv). Open a terminal on macOS or Linux, or cmd on Windows, cd to the working directory, and run the following commands:
pip3 install -r requirements.txt
python3 searchEngine.py
The input.txt file contains the 7 website links I used as the input source. If you want to test other pages, just copy their URLs and add them to this input file, then run searchEngine.py again to update the trie. You can then search for keywords from those pages by adding some code to the searchEngine.py file.
The output file contains the results of the sample queries and is created by running searchEngine.py. Feel free to run searchEngine.py again to recreate it.
This file contains the URLs returned by the query, ranked by the frequency of each word: a website that contains the query words more often ranks higher.

This is the main program in my project.
I imported a package named BeautifulSoup, which parses the content of the URLs in input.txt into plain text.
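For illustration, here is a minimal sketch of that step; the use of `requests` and the exact BeautifulSoup calls are my assumptions about the approach, not necessarily the code in searchEngine.py:

```python
import requests
from bs4 import BeautifulSoup

def url_to_text(url):
    # Download the page and strip the HTML markup, keeping only visible text.
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text(separator=" ")
```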
There are 4 parts in this file:
A compressed trie for the set of index terms, where each external node stores the index of the occurrence list of the associated term.
Because of their large total size, the occurrence lists are stored on disk, as files in the occurenceList folder, after running this program. A simplified sketch of this structure follows.
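Below is a self-contained radix-tree toy model I wrote for illustration; the node layout and function names are assumptions, not the classes in searchEngine.py:

```python
# Simplified compressed trie: internal edges carry string labels, and an
# external node stores the occurrence-list file name for its term.

class Node:
    def __init__(self):
        self.children = {}      # edge label (string) -> Node
        self.file_name = None   # set on external nodes, e.g. "python.txt"

def insert(root, word, file_name):
    node = root
    while True:
        for label in list(node.children):
            # Longest common prefix between the word and this edge label.
            common = 0
            while (common < len(label) and common < len(word)
                   and label[common] == word[common]):
                common += 1
            if common == 0:
                continue
            if common < len(label):
                # Split the edge: keep the shared prefix, push the rest down.
                mid = Node()
                mid.children[label[common:]] = node.children.pop(label)
                node.children[label[:common]] = mid
                node = mid
            else:
                node = node.children[label]
            word = word[common:]
            break
        else:
            if word:
                leaf = Node()
                leaf.file_name = file_name
                node.children[word] = leaf
            else:
                node.file_name = file_name
            return

def search(root, word):
    node = root
    while word:
        for label, child in node.children.items():
            if word.startswith(label):
                node, word = child, word[len(label):]
                break
        else:
            return None
    return node.file_name   # reference into the occurenceList folder
```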
Search a given word in the URL contents, which have already been filtered:
- insert(): insert a keyword and a website URL into the trie
- searchOne(): search for a single keyword
- searchManyOrOne(): search for a sentence or a single word
These search methods return the URLs that contain all the input words, ranked by frequency; a simplified stand-in is sketched below.
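Here is a self-contained stand-in with the same three method names, using a plain dict as the inverted index instead of the compressed trie; it is a behavioral sketch under my own assumptions, not the project's implementation:

```python
from collections import defaultdict

class MiniIndex:
    def __init__(self):
        # word -> url -> occurrence count
        self.index = defaultdict(lambda: defaultdict(int))

    def insert(self, word, url):
        self.index[word][url] += 1

    def searchOne(self, word):
        return dict(self.index.get(word, {}))

    def searchManyOrOne(self, sentence):
        results = [self.searchOne(w) for w in sentence.lower().split()]
        if not results or any(not r for r in results):
            return []
        common = set.intersection(*map(set, results))
        # Rank by total frequency across all query words, descending.
        return sorted(common,
                      key=lambda u: sum(r[u] for r in results),
                      reverse=True)

idx = MiniIndex()
idx.insert("python", "https://www.python.org")
idx.insert("tutorial", "https://www.python.org")
print(idx.searchManyOrOne("python tutorial"))  # ['https://www.python.org']
```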
Get the website URLs and parse them into plain text. Then perform a series of data-processing steps: convert to lowercase, remove stop words, and split the plain text into individual words. Then store these words and the URL in the inverted file. After this preparation, the content is ready to be searched.
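A minimal sketch of that preprocessing, assuming a tiny placeholder stop-word list (the project presumably uses a fuller one):

```python
import re

# Placeholder stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}

def preprocess(plain_text):
    words = re.findall(r"[a-z0-9]+", plain_text.lower())  # lowercase + tokenize
    return [w for w in words if w not in STOP_WORDS]      # drop stop words

print(preprocess("Python is a great language, and Python is popular."))
# -> ['python', 'great', 'language', 'python', 'popular']
```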
Takes a URL and converts it into a hash value, in preparation for building a dictionary used by the fast intersection algorithm.
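For example (MD5 via hashlib is my assumption; searchEngine.py may use a different hash function):

```python
import hashlib

def url_hash(url):
    # Hash the URL so it can serve as a compact dictionary key.
    return hashlib.md5(url.encode("utf-8")).hexdigest()

print(url_hash("https://www.python.org"))
```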
This is the occurenceList folder, where I store the occurrence list for each word as required.
In the ‘python.txt’ file shown below, there are 6 websites that contain the word ‘python’.
Taking the first line as an example: the first string is the hash value calculated from the URL, the second string is the website URL, and the third is the frequency of the word ‘python’ on that website.
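Assuming the three fields are whitespace-separated, one entry per line, such a file could be read back like this (a sketch; the actual delimiter in the occurenceList files may differ):

```python
def load_occurrence_list(path):
    # Parse one occurrence-list file into: url_hash -> (url, frequency).
    entries = {}
    with open(path) as f:
        for line in f:
            url_hash, url, count = line.split()
            entries[url_hash] = (url, int(count))
    return entries
```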

This file specifies all the Python 3 libraries needed for this search engine project. Install these required libraries with pip3 install -r requirements.txt.
