# icrawler

A simple web crawler that indexes a website's pages and lists all the static assets on each page.
## Installation

You need Node.js >= 4 and npm >= 2. Download this module, then run `npm install` followed by `npm run build`. You can try the module out from `playground.js`. Alternatively, use it in your own project via `npm link`: run `npm link` in this module's directory, then go to your project directory and run `npm link icrawler`.
## Tests and linting

Make sure you are in the project's root directory. To run the tests, use `npm run test`; to run the linter, use `npm run lint`.
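Assuming the versions above (Node.js >= 4, npm >= 2) are installed, a typical setup session looks like this; run these from the module's root directory:

```shell
node --version   # expect v4 or newer
npm --version    # expect 2 or newer
npm install      # install dependencies
npm run build    # build the module
npm run test     # run the test suite
npm run lint     # run the linter
```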
## Usage

```js
const icrawler = require('icrawler');

const url = 'jquery.com';
icrawler.crawl(url).then(result => {
  // do something with the result, e.g. print it out
  console.log(result);
});
```
## Example output

Sample output from crawling gocardless.com is available in `output.txt`.
## Limitations

- Only works on HTML pages; other content types (JSON, XML, etc.) are not supported.
- Does not validate the URL. If the URL is invalid, or its domain does not exist, the crawler returns an empty array.
- May take a while if the website is large or the internet connection is slow.
## Dependencies

- A library to make HTTP requests (the built-in Node `http` module works)
- A library to parse HTML (e.g. `htmlparser` or `htmlparser2`)
- An algorithm to traverse the site's URLs
## Checklist

- [x] Does not traverse other domains or subdomains
- [x] Does not loop forever while traversing a website
- [x] Collects all static assets (images, stylesheets, and JS) on each page
- [x] Produces output in the specified format
- [x] Does not use a web-crawling framework
- [x] README with clear installation and running instructions
- [x] Module works as specified (functionality)
- [x] Module is structured nicely and commented (structure and clarity)
- [x] Edge cases handled (robustness)
- [x] Tests written (testing)
- [x] Linting
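As an illustration of the asset-collection item, the per-tag rule can be written as a small pure function in the shape of an htmlparser2-style `onopentag(name, attribs)` callback. `assetUrl` here is a hypothetical helper for illustration, not part of icrawler's API:

```javascript
// Given a tag name and its attributes, return the URL of the static asset
// it references, or null if the tag does not reference one we track.
function assetUrl(name, attribs) {
  if (name === 'img' && attribs.src) return attribs.src;       // images
  if (name === 'script' && attribs.src) return attribs.src;    // external JS
  if (name === 'link' && attribs.rel === 'stylesheet' && attribs.href) {
    return attribs.href;                                       // stylesheets
  }
  return null; // anchors, inline scripts, etc. are not static assets
}

console.log(assetUrl('img', { src: '/logo.png' }));                      // '/logo.png'
console.log(assetUrl('link', { rel: 'stylesheet', href: '/main.css' })); // '/main.css'
console.log(assetUrl('a', { href: '/about' }));                          // null
```

Keeping this rule in one pure function makes it easy to unit-test without any network access.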