# icrawler

A simple web crawler that indexes a website's pages and lists all the static assets on each page.
## Installation

You need Node.js >= 4 and npm >= 2. Download this module, then run `npm install` followed by `npm run build`. You can try the module out from `playground.js`. Alternatively, use it in your own project via `npm link`: run `npm link` in this module's directory, then go to your project directory and run `npm link icrawler`.
## Tests and linting

Make sure you are in the project's root directory. To run the tests, use `npm run test`; to run the linter, use `npm run lint`.
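Assuming the versions above (Node.js >= 4, npm >= 2) are installed, a typical setup session looks like this; run these from the module's root directory:

```shell
node --version   # expect v4 or newer
npm --version    # expect 2 or newer
npm install      # install dependencies
npm run build    # build the module
npm run test     # run the test suite
npm run lint     # run the linter
```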
## Usage

```js
const icrawler = require('icrawler');

const url = 'jquery.com';
icrawler.crawl(url).then(result => {
  // do something with the result, e.g. print it out
  console.log(result);
});
```
## Example output

Sample output from crawling gocardless.com is available in `output.txt`.
## Limitations

- Only works on HTML pages; other content types (JSON, XML, etc.) are not supported.
- Does not validate the URL. If the URL is invalid, or its domain does not exist, the crawler returns an empty array.
- May take a while if the website is large or the internet connection is slow.
## Dependencies

- A library to make HTTP requests (the built-in Node `http` module works)
- A library to parse HTML (e.g. `htmlparser` or `htmlparser2`)
- An algorithm to traverse the site's URLs
## Checklist

- [x] Does not traverse other domains or subdomains
- [x] Does not loop forever while traversing a website
- [x] Collects all static assets (images, stylesheets, and JS) on each page
- [x] Produces output in the specified format
- [x] Does not use a web-crawling framework
- [x] README with clear installation and running instructions
- [x] Module works as specified (functionality)
- [x] Module is structured nicely and commented (structure and clarity)
- [x] Edge cases handled (robustness)
- [x] Tests written (testing)
- [x] Linting
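As an illustration of the asset-collection item, the per-tag rule can be written as a small pure function in the shape of an htmlparser2-style `onopentag(name, attribs)` callback. `assetUrl` here is a hypothetical helper for illustration, not part of icrawler's API:

```javascript
// Given a tag name and its attributes, return the URL of the static asset
// it references, or null if the tag does not reference one we track.
function assetUrl(name, attribs) {
  if (name === 'img' && attribs.src) return attribs.src;       // images
  if (name === 'script' && attribs.src) return attribs.src;    // external JS
  if (name === 'link' && attribs.rel === 'stylesheet' && attribs.href) {
    return attribs.href;                                       // stylesheets
  }
  return null; // anchors, inline scripts, etc. are not static assets
}

console.log(assetUrl('img', { src: '/logo.png' }));                      // '/logo.png'
console.log(assetUrl('link', { rel: 'stylesheet', href: '/main.css' })); // '/main.css'
console.log(assetUrl('a', { href: '/about' }));                          // null
```

Keeping this rule in one pure function makes it easy to unit-test without any network access.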