This project implements a multi-threaded web crawler and search engine. It crawls web pages, indexes their content, and provides search functionality over the indexed text.
- Web Crawling: Fetches HTML content from web pages starting from a seed URL.
- HTML Parsing: Extracts links and removes HTML tags, scripts, and styles.
- Inverted Index: Builds an inverted index of words found in the crawled pages (a sketch of this structure follows the list).
- Multi-threaded: Uses a thread pool for concurrent crawling and indexing.
- Search Functionality: Allows searching the indexed content using query terms.
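As a rough illustration of what the inverted index might look like, the sketch below maps each word to the pages it appears on and a per-page occurrence count. The class and field names are assumptions made for illustration, not the project's actual types.

```java
import java.util.Map;
import java.util.TreeMap;

// Minimal sketch of an inverted index: word -> (page URL -> occurrence count).
// Names and structure are illustrative assumptions, not the project's own classes.
public class SimpleInvertedIndex {
    private final TreeMap<String, TreeMap<String, Integer>> index =
            new TreeMap<String, TreeMap<String, Integer>>();

    // Record one occurrence of a word on a page.
    public void add(String word, String pageUrl) {
        TreeMap<String, Integer> pages = index.get(word);
        if (pages == null) {
            pages = new TreeMap<String, Integer>();
            index.put(word, pages);
        }
        Integer count = pages.get(pageUrl);
        pages.put(pageUrl, count == null ? 1 : count + 1);
    }

    // Look up all pages (and counts) containing a word.
    public Map<String, Integer> lookup(String word) {
        TreeMap<String, Integer> pages = index.get(word);
        return pages == null ? new TreeMap<String, Integer>() : pages;
    }
}
```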
The Driver class is the main entry point of the application. It sets up logging, initializes the crawler, and processes search queries.
```java
public static void main(String[] args) {
    // ... (initialization code)
    InvertedIndexBuilder builder = new InvertedIndexBuilder();
    index = builder.createInvertedIndex(seedURL);
    index.writeInvertedIndex();

    Search search = new Search(index);
    search.processSearch(queryFile, writer);
    // ... (error handling and cleanup)
}
```
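The project lists Log4j as its logging dependency and writes a debug.log file; below is a minimal sketch of how the Driver might obtain and configure a Log4j 1.x logger. The configuration file name is an assumption, and the project may set up its appenders differently.

```java
import org.apache.log4j.Logger;
import org.apache.log4j.PropertyConfigurator;

public class Driver {
    // Class-level logger, as is conventional with Log4j 1.x.
    private static final Logger log = Logger.getLogger(Driver.class);

    public static void main(String[] args) {
        // The properties file name is an assumption; the project may configure
        // its appenders (e.g. the debug.log file) differently.
        PropertyConfigurator.configure("log4j.properties");
        log.debug("Crawler starting");
        // ... (crawl, index, and search as shown above)
    }
}
```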
The InvertedIndexBuilder class is responsible for crawling web pages and building the inverted index.
```java
ArrayList<String> newLinks = HTMLParser.parseLinks(fetch.getHTML());
// ... (link processing code)
for (String newLink : newLinks) {
    // ... (URL handling code)
    if (newLink.startsWith("http")) {
        newURL = new URL(newLink);
    } else {
        // Resolve a relative link against the page it was found on.
        newURL = new URL(new URL(url), newLink);
    }
    // ... (more processing)
}
```
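The crawl is multi-threaded and stops after roughly 30 pages. The sketch below shows one way the crawl loop could be organized around a fixed thread pool; the class names, the placeholder fetch method, and the exact limit handling are assumptions for illustration, not the project's actual implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of a bounded, multi-threaded crawl: each task fetches one page and
// submits any newly discovered links until the page limit is reached.
public class CrawlerSketch {
    private static final int PAGE_LIMIT = 30;

    private final ExecutorService pool = Executors.newFixedThreadPool(5);
    private final Set<String> visited =
            Collections.synchronizedSet(new HashSet<String>());

    public void crawl(final String url) {
        // Skip URLs already seen and stop submitting work once the limit is hit.
        if (visited.size() >= PAGE_LIMIT || !visited.add(url)) {
            return;
        }
        pool.execute(new Runnable() {
            public void run() {
                // fetchAndParseLinks() stands in for the project's fetch/parse step.
                for (String link : fetchAndParseLinks(url)) {
                    crawl(link);
                }
            }
        });
    }

    // Placeholder: fetch the page and return the links found on it.
    private List<String> fetchAndParseLinks(String url) {
        return new ArrayList<String>();
    }
}
```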
The HTMLParser class parses HTML content, removes tags, and extracts links.
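As a rough sketch of how tags, scripts, and styles could be stripped and links extracted with regular expressions (the class and method names here are assumptions; the project's HTMLParser may work differently):

```java
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative regex-based cleanup and link extraction; not the project's actual parser.
public class HTMLParserSketch {

    // Remove <script> and <style> blocks first, then any remaining tags.
    public static String stripHTML(String html) {
        String text = html.replaceAll("(?is)<script.*?</script>", " ");
        text = text.replaceAll("(?is)<style.*?</style>", " ");
        text = text.replaceAll("<[^>]*>", " ");
        return text.replaceAll("\\s+", " ").trim();
    }

    // Collect the href values of anchor tags.
    public static ArrayList<String> parseLinks(String html) {
        ArrayList<String> links = new ArrayList<String>();
        Pattern p = Pattern.compile("(?is)<a[^>]+href\\s*=\\s*\"([^\"]*)\"");
        Matcher m = p.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }
}
```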
The Search class processes search queries and returns results based on the inverted index.
```java
public void processSearch(String path, PrintWriter writer) {
    ArrayList<String> queryList = QueryListFactory.createQueryList(path);
    index.processQueries(queryList);
    HashMap<Integer, TreeSet<Map.Entry<String, Integer>>> results = index.getSearchResults();
    // ... (result processing and writing)
}
```
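A minimal sketch of how a query could be answered from the index structure sketched earlier: look up each query word, merge the per-page counts, and rank pages by total occurrences. The scoring scheme and names below are illustrative assumptions, not the project's actual ranking.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative search over a word -> (page URL -> count) index.
public class SearchSketch {

    // Sum per-page counts across all query words, then sort pages by score.
    public static List<Map.Entry<String, Integer>> search(
            Map<String, ? extends Map<String, Integer>> index, String[] queryWords) {
        Map<String, Integer> scores = new HashMap<String, Integer>();
        for (String word : queryWords) {
            Map<String, Integer> pages = index.get(word.toLowerCase());
            if (pages == null) {
                continue;
            }
            for (Map.Entry<String, Integer> page : pages.entrySet()) {
                Integer current = scores.get(page.getKey());
                scores.put(page.getKey(),
                        (current == null ? 0 : current) + page.getValue());
            }
        }
        List<Map.Entry<String, Integer>> ranked =
                new ArrayList<Map.Entry<String, Integer>>(scores.entrySet());
        Collections.sort(ranked, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                return b.getValue().compareTo(a.getValue()); // highest score first
            }
        });
        return ranked;
    }
}
```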
Run the Driver class with the following command-line arguments:
```
java Driver -u <seed_url> -q <query_file>
```
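For example, with a hypothetical seed URL and query file (both are placeholders, not values from the project):

```
java Driver -u http://example.com/index.html -q queries.txt
```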
- invertedindex.txt: Contains the built inverted index.
- searchresults.txt: Contains the search results for the given queries.
- debug.log: Contains debug information and logs.
- Java 6 or higher
- Log4j for logging
- JUnit for testing (optional)
This project is designed to crawl up to 30 pages starting from the seed URL. It uses multi-threading to improve performance and handles both absolute and relative URLs.
This code is not for distribution.