Pythia & Delphi form the two main parts of our search engine: Pythia is the query engine and Delphi is the web crawler. Both programs are built using Elixir, a functional programming language, for the following reasons:
- Friendly syntax, great documentation and community support
- Faster than Ruby or Python
- As it runs on the Erlang VM, it is easy to spawn lightweight processes and build highly concurrent systems
As well as Elixir, we also used:
- ESpec and Hound to test the code
- Phoenix Framework to build the query engine
- Ecto and Postgres to store the urls and the collected information
- Video to project presentation (coming soon)
- Link to project presentation
$ git clone https://github.com/Andy-Bell/delphi
$ cd delphi
$ mix deps.get
$ mix ecto.create && mix ecto.migrate
Delphi includes 3 main parts:
- Url Writer
- Url Scraper
- Spider
Url Writer takes one url (string) as an argument, and writes the urls on that page to the database.
$ iex -S mix
iex> UrlWriter.add_url_to_table("http://www.makersacademy.com")
The database has a unique-url constraint. If the UrlWriter tries to add a url that is already in the url database, it returns an error message and moves on to the next url.
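The duplicate handling described above can be sketched in plain Elixir. This is only an illustration: the `UrlStoreSketch` module and its functions are hypothetical, and a `MapSet` stands in for the url database and its unique constraint, so it does not show how Delphi talks to Postgres.

```elixir
defmodule UrlStoreSketch do
  # Hypothetical stand-in for the url database: a MapSet plays the role of
  # the unique-url constraint.

  # Tries to add one url; returns an error tuple if it is already stored.
  def insert(store, url) do
    if MapSet.member?(store, url) do
      {:error, :already_exists}
    else
      {:ok, MapSet.put(store, url)}
    end
  end

  # Adds each url in turn. A duplicate is recorded as skipped and the loop
  # simply moves on to the next url.
  def insert_all(store, urls) do
    {final, skipped} =
      Enum.reduce(urls, {store, []}, fn url, {acc, skipped} ->
        case insert(acc, url) do
          {:ok, updated} -> {updated, skipped}
          {:error, :already_exists} -> {acc, [url | skipped]}
        end
      end)

    {final, Enum.reverse(skipped)}
  end
end
```

Calling `UrlStoreSketch.insert_all(MapSet.new(), urls)` returns the updated store together with the list of urls that were skipped as duplicates.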
Url Scraper, like the Url Writer, takes one url (string) as an argument. It collects the title and the description of that page and adds them to a second database, which stores the scraped data.
$ iex -S mix
iex> UrlScraper.search_urls("http://www.makersacademy.com")
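As a rough idea of what collecting the title and description involves, here is a minimal sketch that works on an HTML string. `ScrapeSketch` is a hypothetical module, not Delphi's `UrlScraper`, and it assumes the description lives in a `<meta name="description" content="...">` tag with double-quoted attributes in that order. Regexes are fragile on real-world HTML, so a proper HTML parser would be the safer choice.

```elixir
defmodule ScrapeSketch do
  # Hypothetical helper: extracts the <title> text and the meta description
  # from an HTML string. Returns nil for a field that cannot be found.

  def extract(html) do
    %{
      title: capture(title_regex(), html),
      description: capture(description_regex(), html)
    }
  end

  # Case-insensitive, and "." also matches newlines inside the title.
  defp title_regex, do: ~r/<title[^>]*>(.*?)<\/title>/is

  # Assumes name comes before content and both use double quotes.
  defp description_regex, do: ~r/<meta\s+name="description"\s+content="([^"]*)"/i

  defp capture(regex, html) do
    case Regex.run(regex, html, capture: :all_but_first) do
      [value] -> String.trim(value)
      _ -> nil
    end
  end
end
```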
Web Crawler/Spider goes through the url database, scrapes the information from those links, and adds the urls and the scraped information to their respective databases. It then follows the urls on the pages it accessed and continues collecting information from every page it visits. As mentioned above, Elixir made it quick and easy to spawn hundreds of spiders that go through different urls and collect information concurrently.
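To give a feel for how cheaply Elixir runs many spiders at once, here is a minimal sketch built on `Task.async_stream/3` from the standard library. `SpiderSketch` and `fetch_fun` are hypothetical names: `fetch_fun` stands in for whatever scrapes one page and returns the urls found on it, and the concurrency limit and timeout are assumed values, not Delphi's settings.

```elixir
defmodule SpiderSketch do
  # Hypothetical sketch, not Delphi's actual spider.
  # Crawls `urls` concurrently, one lightweight process per url, with at most
  # `max_concurrency` running at the same time. Returns {url, result} pairs
  # in the same order as the input; urls that time out are dropped.
  def crawl(urls, fetch_fun, max_concurrency \\ 100) do
    urls
    |> Task.async_stream(fn url -> {url, fetch_fun.(url)} end,
      max_concurrency: max_concurrency,
      timeout: 10_000,
      on_timeout: :kill_task
    )
    |> Enum.flat_map(fn
      {:ok, pair} -> [pair]
      {:exit, _reason} -> []
    end)
  end
end

# Usage with a fake fetcher that pretends every page links to "/about".
fake_fetch = fn url -> [url <> "/about"] end
SpiderSketch.crawl(["http://a.com", "http://b.com"], fake_fetch)
```

Capping `max_concurrency` is one simple way to keep hundreds of spiders from overwhelming the sites being crawled or the database.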
$ clone this repo locally
$ cd pythia
$ mix deps.get
$ npm install
Open two terminals, and start phantomjs on one terminal:
$ phantomjs --wd
Run the tests on the other terminal:
$ mix test
$ mix server.start
Visit http://localhost:4000 from your browser.
We used the Phoenix Framework to build Pythia. It is very similar to the Ruby on Rails framework but provides more flexibility around customisation.
Pythia handles user queries and returns relevant results to the user. It builds the result list in two steps:
- Accessing the database which stores the information scraped from the urls, and returning all the relevant links and stored information.
- Ranking the results with a simple algorithm that checks whether the keyword is included in the url, title and/or description.
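The ranking step can be sketched as follows. This is an assumption-laden illustration rather than Pythia's actual algorithm: the `RankSketch` module is hypothetical, pages are plain maps with `:url`, `:title` and `:description` keys, and each field that contains the keyword is worth one point.

```elixir
defmodule RankSketch do
  # Hypothetical scoring: one point for each field (url, title, description)
  # that contains the keyword, ignoring case. The equal weights are an
  # assumption made for this sketch.
  def score(page, keyword) do
    needle = String.downcase(keyword)

    [page.url, page.title, page.description]
    |> Enum.count(fn field ->
      field != nil and String.contains?(String.downcase(field), needle)
    end)
  end

  # Drops pages that do not match at all and sorts the rest, best match first.
  def rank(pages, keyword) do
    pages
    |> Enum.map(fn page -> {score(page, keyword), page} end)
    |> Enum.filter(fn {score, _page} -> score > 0 end)
    |> Enum.sort_by(fn {score, _page} -> -score end)
    |> Enum.map(fn {_score, page} -> page end)
  end
end
```

A page whose url, title and description all mention the keyword scores 3 and is listed above a page that only mentions it in the description.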
![Pythia&Delphi](web/static/assets/images/Pythia%20search%20bar.png)
Improving spider efficiency and limitations, and the search algorithm