This project implements a multi-threaded web crawler and search engine. It crawls web pages, indexes their content, and provides search functionality over the indexed text.
- Web Crawling: Fetches HTML content from web pages starting from a seed URL.
- HTML Parsing: Extracts links and removes HTML tags, scripts, and styles.
- Inverted Index: Builds an inverted index of words found in the crawled pages (a sketch of this structure follows the list).
- Multi-threaded: Uses a thread pool for concurrent crawling and indexing.
- Search Functionality: Allows searching the indexed content using query terms.
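As a rough illustration of what the inverted index might look like, the sketch below maps each word to the pages it appears on and a per-page occurrence count. The class and field names are assumptions made for illustration, not the project's actual types.

```java
import java.util.Map;
import java.util.TreeMap;

// Minimal sketch of an inverted index: word -> (page URL -> occurrence count).
// Names and structure are illustrative assumptions, not the project's own classes.
public class SimpleInvertedIndex {
    private final TreeMap<String, TreeMap<String, Integer>> index =
            new TreeMap<String, TreeMap<String, Integer>>();

    // Record one occurrence of a word on a page.
    public void add(String word, String pageUrl) {
        TreeMap<String, Integer> pages = index.get(word);
        if (pages == null) {
            pages = new TreeMap<String, Integer>();
            index.put(word, pages);
        }
        Integer count = pages.get(pageUrl);
        pages.put(pageUrl, count == null ? 1 : count + 1);
    }

    // Look up all pages (and counts) containing a word.
    public Map<String, Integer> lookup(String word) {
        TreeMap<String, Integer> pages = index.get(word);
        return pages == null ? new TreeMap<String, Integer>() : pages;
    }
}
```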
The Driver class is the main entry point of the application. It sets up logging, initializes the crawler, and processes search queries.
```java
public static void main(String[] args) {
    // ... (initialization code)
    InvertedIndexBuilder builder = new InvertedIndexBuilder();
    index = builder.createInvertedIndex(seedURL);
    index.writeInvertedIndex();

    Search search = new Search(index);
    search.processSearch(queryFile, writer);
    // ... (error handling and cleanup)
}
```
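The project lists Log4j as its logging dependency and writes a debug.log file; below is a minimal sketch of how the Driver might obtain and configure a Log4j 1.x logger. The configuration file name is an assumption, and the project may set up its appenders differently.

```java
import org.apache.log4j.Logger;
import org.apache.log4j.PropertyConfigurator;

public class Driver {
    // Class-level logger, as is conventional with Log4j 1.x.
    private static final Logger log = Logger.getLogger(Driver.class);

    public static void main(String[] args) {
        // The properties file name is an assumption; the project may configure
        // its appenders (e.g. the debug.log file) differently.
        PropertyConfigurator.configure("log4j.properties");
        log.debug("Crawler starting");
        // ... (crawl, index, and search as shown above)
    }
}
```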
The InvertedIndexBuilder class is responsible for crawling web pages and building the inverted index.
```java
ArrayList<String> newLinks = HTMLParser.parseLinks(fetch.getHTML());
// ... (link processing code)
for (String newLink : newLinks) {
    // ... (URL handling code)
    if (newLink.startsWith("http")) {
        newURL = new URL(newLink);
    } else {
        // Resolve a relative link against the page it was found on.
        newURL = new URL(new URL(url), newLink);
    }
    // ... (more processing)
}
```
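The crawl is multi-threaded and stops after roughly 30 pages. The sketch below shows one way the crawl loop could be organized around a fixed thread pool; the class names, the placeholder fetch method, and the exact limit handling are assumptions for illustration, not the project's actual implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of a bounded, multi-threaded crawl: each task fetches one page and
// submits any newly discovered links until the page limit is reached.
public class CrawlerSketch {
    private static final int PAGE_LIMIT = 30;

    private final ExecutorService pool = Executors.newFixedThreadPool(5);
    private final Set<String> visited =
            Collections.synchronizedSet(new HashSet<String>());

    public void crawl(final String url) {
        // Skip URLs already seen and stop submitting work once the limit is hit.
        if (visited.size() >= PAGE_LIMIT || !visited.add(url)) {
            return;
        }
        pool.execute(new Runnable() {
            public void run() {
                // fetchAndParseLinks() stands in for the project's fetch/parse step.
                for (String link : fetchAndParseLinks(url)) {
                    crawl(link);
                }
            }
        });
    }

    // Placeholder: fetch the page and return the links found on it.
    private List<String> fetchAndParseLinks(String url) {
        return new ArrayList<String>();
    }
}
```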
The HTMLParser class parses HTML content, removes tags, and extracts links.
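As a rough sketch of how tags, scripts, and styles could be stripped and links extracted with regular expressions (the class and method names here are assumptions; the project's HTMLParser may work differently):

```java
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative regex-based cleanup and link extraction; not the project's actual parser.
public class HTMLParserSketch {

    // Remove <script> and <style> blocks first, then any remaining tags.
    public static String stripHTML(String html) {
        String text = html.replaceAll("(?is)<script.*?</script>", " ");
        text = text.replaceAll("(?is)<style.*?</style>", " ");
        text = text.replaceAll("<[^>]*>", " ");
        return text.replaceAll("\\s+", " ").trim();
    }

    // Collect the href values of anchor tags.
    public static ArrayList<String> parseLinks(String html) {
        ArrayList<String> links = new ArrayList<String>();
        Pattern p = Pattern.compile("(?is)<a[^>]+href\\s*=\\s*\"([^\"]*)\"");
        Matcher m = p.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }
}
```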
The Search class processes search queries and returns results based on the inverted index.
```java
public void processSearch(String path, PrintWriter writer) {
    ArrayList<String> queryList = QueryListFactory.createQueryList(path);
    index.processQueries(queryList);
    HashMap<Integer, TreeSet<Map.Entry<String, Integer>>> results = index.getSearchResults();
    // ... (result processing and writing)
}
```
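A minimal sketch of how a query could be answered from the index structure sketched earlier: look up each query word, merge the per-page counts, and rank pages by total occurrences. The scoring scheme and names below are illustrative assumptions, not the project's actual ranking.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative search over a word -> (page URL -> count) index.
public class SearchSketch {

    // Sum per-page counts across all query words, then sort pages by score.
    public static List<Map.Entry<String, Integer>> search(
            Map<String, ? extends Map<String, Integer>> index, String[] queryWords) {
        Map<String, Integer> scores = new HashMap<String, Integer>();
        for (String word : queryWords) {
            Map<String, Integer> pages = index.get(word.toLowerCase());
            if (pages == null) {
                continue;
            }
            for (Map.Entry<String, Integer> page : pages.entrySet()) {
                Integer current = scores.get(page.getKey());
                scores.put(page.getKey(),
                        (current == null ? 0 : current) + page.getValue());
            }
        }
        List<Map.Entry<String, Integer>> ranked =
                new ArrayList<Map.Entry<String, Integer>>(scores.entrySet());
        Collections.sort(ranked, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                return b.getValue().compareTo(a.getValue()); // highest score first
            }
        });
        return ranked;
    }
}
```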
Run the Driver class with the following command-line arguments:
```
java Driver -u <seed_url> -q <query_file>
```
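For example, with a hypothetical seed URL and query file (both are placeholders, not values from the project):

```
java Driver -u http://example.com/index.html -q queries.txt
```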
- invertedindex.txt: Contains the built inverted index.
- searchresults.txt: Contains the search results for the given queries.
- debug.log: Contains debug information and logs.
- Java 6 or higher
- Log4j for logging
- JUnit for testing (optional)
This project is designed to crawl up to 30 pages starting from the seed URL. It uses multi-threading to improve performance and handles both absolute and relative URLs.
This code is not for distribution.