Web crawler in Node.js that dynamically spiders whole websites.
IMPORTANT: This is a DEVELOPMENT tool, therefore it SHOULD NOT be used against a website you DO NOT OWN!
It helps you map and process entire websites by spidering them and parsing each page in a smart way. It follows all the links and tests the form objects several times, which makes it possible to check the whole website effectively.
- Table Of Contents
- What's this for?
- Features
- Dependencies
- Installation
- Configuration
- Usage
- Examples
- Tests
- How it works
- Bugs
- Changelog
- Contributing
- Licence
This project was born with the aim of improving legacy code, but it is not restricted to that.
salmonJS crawls every page starting from an entry-point URL, retrieving all the links in each page and firing all the events bound to any DOM element in order to process all the possible combinations automatically. The only "limitation" of an automatic robot is user input, so for those cases test case files have been implemented, where it's possible to define custom input values (e.g. POST variables for forms, input values for JavaScript prompts, etc.).
With this in mind, the usage of salmonJS can vary depending on your needs, such as checking legacy code for dead code or profiling the web app's performance.
Here are a few suggestions about its usage:
- Improve the legacy code
  - Check the dead code (enabling the code coverage server-side)
  - Discover 500 Internal Server Errors
  - Discover notices and warnings
  - SQL profiling
- Testing
  - Process forms (it'll create easy test cases to be manually compiled)
  - Process automatically JS events attached to DOM nodes
- Scraping
  - Get the page content for each URL
  - Get the screenshot for each URL
- Enumeration
  - URLs list
  - Execution times
  - Page output
  - Page load
  - ...
- Command Line Interface
- Catch and handle all the events bound to DOM elements (regardless of how they have been set)
- Follow any 3xx redirect, JS document.location and meta redirect (can be disabled)
- Ignore duplicated URLs / requests and external URLs
- Test case files, with support of:
- COOKIEs
- FILES upload
- GET parameters
- HTTP headers
- POST parameters
- HTTP authentication
- Proxy settings
- Politeness Policy
- Generate a report for each page crawled, with:
- Screenshot
- HTTP headers
- HTTP method
- Data sent (GET and POST)
- Page output
- Execution time
- Console messages
- Alerts, Confirmations & Prompts
- Errors
- List of successful and failed requests
- Pool system to limit the number of workers running at the same time and queue the rest
- Multiple crawlers working asynchronously, one URL each
- Support for the following HTML tags: a, area, base, form, frame, iframe, img, input, link, script
- URL normalisation
- Process the web page using PhantomJS
- Process all the output content types
- Keep the connection alive for lower CPU and memory load on the server
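The pool behaviour listed above (limit the number of workers running at the same time and queue the rest) can be illustrated with a minimal sketch. This is not salmonJS's actual implementation: `createPool`, `add`, `stats` and the `done` callback are hypothetical names chosen for the example, and the code assumes a current Node.js release.

```javascript
// Minimal worker pool sketch: at most `limit` tasks run at once,
// the others wait in a FIFO queue until a slot is released.
// NOTE: createPool/add/stats/done are hypothetical names, not the salmonJS API.
function createPool(limit) {
  const queue = [];
  let running = 0;

  function next() {
    // Start queued tasks while there is a free slot.
    while (running < limit && queue.length > 0) {
      const task = queue.shift();
      let released = false;
      running++;
      task(function done() {
        if (released) return; // ignore repeated calls from the same task
        released = true;
        running--;
        next();
      });
    }
  }

  return {
    add(task) { queue.push(task); next(); },
    stats() { return { running: running, queued: queue.length }; }
  };
}

// Usage: three "crawlers" compete for two slots.
const pool = createPool(2);
const started = [];
const finish = {};
['/a', '/b', '/c'].forEach(function (url) {
  pool.add(function (done) {
    started.push(url);
    finish[url] = done; // a real worker would call done() once the page is processed
  });
});
console.log(started, pool.stats()); // '/c' is still queued at this point
finish['/a']();                     // frees a slot, so '/c' starts
console.log(started, pool.stats());
```

Passing a `done` callback to each task keeps the sketch independent of how a worker is actually run (child process, PhantomJS instance, and so on).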
salmonJS is based on Node.js and PhantomJS and uses Redis as queue manager.
salmonJS is tested using TravisCI on the following versions:
- Node.js: v0.8.26 and v0.10.25
- PhantomJS: v1.9.7
- Redis: v2.9.6
This is the list of main dependencies:
- optimist v0.6.1
- path v0.4.9
- colors v0.6.2
- redis v0.10.1
- glob v3.2.9
- winston v0.7.3
- insight v0.3.1
- request v2.34.0
- htmltidy v0.0.6
- URIjs v1.12.1
- jsonfn v0.31.0
This is the list of development dependencies:
- jshint v2.4.4
- yuidocjs v0.3.49
- jasmine-node v1.14.2
- phapper v0.1.9
- grunt-cli v0.1.13
- grunt-jasmine-node-coverage v0.1.8
- grunt-contrib-jshint v0.9.2
- grunt-release v0.7.0
- grunt-contrib-yuidoc v0.5.2
- grunt-todo v0.2.0
- chai v1.9.1
- grunt-verb v0.0.5
At the moment salmonJS is available only as an npm package.
You can install it directly from npm using the following command:
[user@hostname ~]$ npm install salmonjs -g
Alternatively, you can get the source code from GitHub and then run this command:
[user@hostname ~/salmonjs]$ npm install
Here is an example of a test case file:
; Test Case File
; generated by salmonJS v0.5.0 (http://www.salmonjs.org) at Sat, 01 Jan 1970 00:00:00 GMT
; url = http://www.example.com
; id = http___www_example_com
[GET]
variable1=value1
[POST]
variable1=value1
variable2=value2
variable3=@/path/to/file.ext ; use @ in front to use the upload feature (the file MUST exists)
[COOKIE]
name=value
[HTTP_HEADERS]
header=value
[CONFIRM]
Message=true ; true = OK, false = Cancel
[PROMPT]
Question="Answer"
The file uses the INI format.
There are different sections you can use and customise to your needs:
- GET, used for the GET variables in the query string
  The key is the variable name, the value is the variable's value.
- POST, used for the POST variables
  The key is the variable name, the value is the variable's value.
  It's also possible to send files by putting @ in front of the path of the file to upload (the file MUST exist).
- COOKIE, used to set the cookies
  The key is the cookie name, the value is the content.
  It's not possible to configure the domain, the path and the expiration date of the cookie; salmonJS configures them automatically.
- HTTP_HEADERS, used to set HTTP headers
  The key is the HTTP header name, the value is the header's value.
- CONFIRM, used to change the behaviour of JS confirm dialogs
  The key is the message text, the value is the button to be clicked.
  Possible values: true = OK, false = Cancel
- PROMPT, used to change the behaviour of JS prompt dialogs
  The key is the question text, the value is the answer.
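To make the structure of a test case file concrete, here is a minimal sketch of how such a file could be read into sections. It is an illustration only: `parseTestCase` is a hypothetical helper (not the parser salmonJS uses), it assumes that `;` starts a comment only at the beginning of a line or after whitespace, and it represents `@` uploads as `{ upload: path }`, which is a choice made for this example.

```javascript
// Sketch of a reader for the test case format described above.
// NOTE: parseTestCase is a hypothetical helper, not part of salmonJS.
function parseTestCase(text) {
  const sections = {};
  let name = null;
  for (const raw of text.split(/\r?\n/)) {
    // Drop full-line and trailing ";" comments, then trim.
    // Simplification: a value containing " ;" would be cut short.
    const line = raw.replace(/(^|\s);.*$/, '').trim();
    if (line === '') continue;

    const header = /^\[(\w+)\]$/.exec(line);
    if (header) {
      name = header[1];
      sections[name] = {};
      continue;
    }

    const eq = line.indexOf('=');
    if (name === null || eq === -1) continue; // ignore lines outside a section

    const key = line.slice(0, eq).trim();
    // Strip optional double quotes around the value.
    const value = line.slice(eq + 1).trim().replace(/^"(.*)"$/, '$1');
    if (name === 'POST' && value.charAt(0) === '@') {
      sections[name][key] = { upload: value.slice(1) }; // file to upload
    } else {
      sections[name][key] = value;
    }
  }
  return sections;
}

const sample = [
  '; Test Case File',
  '[GET]',
  'variable1=value1',
  '[POST]',
  'variable3=@/path/to/file.ext ; use @ in front to use the upload feature',
  '[CONFIRM]',
  'Message=true ; true = OK, false = Cancel',
  '[PROMPT]',
  'Question="Answer"'
].join('\n');

const testCase = parseTestCase(sample);
console.log(JSON.stringify(testCase, null, 2));
```

All values are kept as strings, so a consumer would still need to turn the CONFIRM value `true` or `false` into the button to click.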
__ _____ _______
.-----.---.-.| |.--------.-----.-----._| | __|
|__ --| _ || || | _ | | |__ |
|_____|___._||__||__|__|__|_____|__|__|_______|_______|
salmonJS v0.5.0
Copyright (C) 2014 Fabio Cicerchia <[email protected]>
Web Crawler in Node.js to spider dynamically whole websites.
Usage: node ./bin/salmonjs
Options:
--uri The URI to be crawled [required]
-c, --credentials Username and password for HTTP authentication (format "username:password")
-d, --details Store details for each page (in the specified folder)
-f, --follow Follows redirects [default: false]
-p, --proxy Proxy settings (format: "ip:port" or "username:password@ip:port")
-w, --workers Maximum number of asynchronous workers [default: 10]
-r, --restore Restore the previous interrupted session [default: false]
-s, --sanitise Sanitise any malformed HTML page [default: false]
--cases Test cases folder
--redis Redis configuration (format "ip:port") [default: "127.0.0.1:6379"]
--timeout Resource timeout [default: 5000]
--attempts Number of attempts before stop to request the URL [default: 5]
--interval Number of millisecond before try to fetch an URL after a failure [default: 5000]
--disable-stats Disable anonymous report usage stats [default: false]
-q, --quiet Disable all the output messages
-v Verbose
--version Display the current version
--help Show the help
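The `--attempts` and `--interval` options describe a retry policy: request the URL again after a pause, and give up after a fixed number of tries. The sketch below illustrates that idea under stated assumptions: `fetchWithRetry`, `fetchOnce` and the injectable `wait` option are hypothetical names, and `attempts` is interpreted as the total number of tries, which the help text does not spell out.

```javascript
// Retry sketch for the --attempts / --interval options.
// NOTE: fetchWithRetry and its parameters are hypothetical, not the salmonJS API.
function fetchWithRetry(fetchOnce, options, callback) {
  // `wait` is injectable so the sketch can be exercised without real delays.
  const wait = options.wait || setTimeout;
  let tried = 0;

  function attempt() {
    tried++;
    fetchOnce(function (err, result) {
      if (!err) return callback(null, result, tried);
      if (tried >= options.attempts) return callback(err, null, tried); // give up
      wait(attempt, options.interval); // try again after the pause
    });
  }

  attempt();
}

// Usage: a fake fetch that fails twice before succeeding.
let calls = 0;
function flakyFetch(done) {
  calls++;
  if (calls < 3) return done(new Error('timeout'));
  done(null, '<html>...</html>');
}

const immediately = function (fn) { fn(); }; // stand-in for setTimeout in this demo
let outcome = null;
fetchWithRetry(
  flakyFetch,
  { attempts: 5, interval: 5000, wait: immediately }, // documented defaults
  function (err, body, tried) {
    outcome = { ok: !err, tried: tried };
    console.log(outcome);
  }
);
```

Omitting `wait` makes the helper fall back to `setTimeout`, so the pause between tries is the `interval` value in milliseconds.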
These are just a few examples of how to use salmonJS:
[user@hostname ~]$ salmonjs --uri "http://www.google.com"
[user@hostname ~]$ salmonjs --uri "www.google.com"
[user@hostname ~]$ salmonjs --uri "/tmp/file.html"
[user@hostname ~]$ salmonjs --uri "file.html"
You can find more detailed examples in the folder docs/examples.
[user@hostname ~/salmonjs]$ npm test
- Start processing a URL
- Open a system process to PhantomJS
- Open the URL
- If there is a JS event, put it into a dedicated stack
- Inject custom event listener
- Override existing event listener
- Collect all the relevant info from the page for the report
- On load complete, execute the events in the stack
- Start to process the web page
- Get all the links from the page content
- Normalise and filter by uniqueness all the URLs collected
- Get all the JS events bound to DOM elements
- Clone the web page for each new combination in the page (confirm)
- Put the web page instance in a dedicated stack for each JS event
- Process all the web pages in the stack
- Get all the links from the page content
- Reiterate until there are no more JS events
- If there is an error, retry up to 5 times
- Collect all the data sent by the parser
- Create test cases for POST data with normalised fields
- Get POST test cases for current URL
- Launch a new crawler for each test case
- Store details in report file
- Increase the counter for possible crawlers to be launched based on the links
- Check whether the links have already been processed
- If not, launch a new process for each link
- If there are no more links to be processed, check if there are still sub-crawlers running
- If not, terminate the process
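The step "normalise and filter by uniqueness all the URLs collected", together with the check for links that have already been processed, can be sketched as follows. This is a minimal illustration and not the project's code: `normaliseUrl` and `uniqueUrls` are hypothetical helpers built on the `URL` class bundled with current Node.js releases, whereas salmonJS itself lists URIjs among its dependencies. Dropping the fragment is an assumption made for this example; sorting query string parameters and skipping external URLs follow the feature list and changelog.

```javascript
// Sketch of "normalise, then filter by uniqueness".
// NOTE: normaliseUrl/uniqueUrls are hypothetical helpers, not the salmonJS API.
function normaliseUrl(link, base) {
  const url = new URL(link, base); // resolves relative links against the page URL
  url.hash = '';                   // assumption: fragments do not identify a new page
  url.searchParams.sort();         // same parameters in a different order => same URL
  return url.toString();
}

function uniqueUrls(links, base, seen) {
  const origin = new URL(base).origin;
  const fresh = [];
  for (const link of links) {
    const url = normaliseUrl(link, base);
    // Skip external URLs and anything already processed or queued.
    if (new URL(url).origin !== origin || seen.has(url)) continue;
    seen.add(url);
    fresh.push(url);
  }
  return fresh;
}

// Usage: three spellings of the same page plus one external link.
const seen = new Set();
const links = [
  '/about?b=2&a=1',
  'about?a=1&b=2#team',
  'HTTP://WWW.Example.com/about?a=1&b=2',
  'http://other.example.org/'
];
console.log(uniqueUrls(links, 'http://www.example.com/', seen));
// → [ 'http://www.example.com/about?a=1&b=2' ]
```

Keeping `seen` outside the function lets several pages share one set, so a link found on a later page is not crawled again.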
For a list of bugs please go to the GitHub Issue Page.
- Improved documentation
- Fixed the pool system in order to make sure it doesn't fork too many processes
- Fixed URL normalisation
- Fixed Redis port
- Fixed issue with sanitise & link extraction
- Added a few more examples
- Added demo website for integration/functional testing
- Added support to retrieve relative url as well
- Fixed problem with storing report file when the directory already exists
- Improved url processing speed - removing duplicates
- Added more html tags to the parser
- Fixed issue with report + issue with serialisation of the pool's queue
- Fixed minor bugs with processing local files + added urijs
- Fixed #45
- Updated project url to www.salmonjs.org
- Disabled 2 tests because of a possible "bug" in phantomjs
- Removed ci for node.js 0.11
- Disabled gzip, it's not recognised properly from phantomjs
- Added readme generation via grunt
- Added todo file, generated via grunt
- Linted the code
- Added dev utilities with grunt
- Coverage improvements
- Improved tests
- Removed casperjs and added jasmine & grunt
- Improved documentation + cleanup + fixed tests
- Fixed #38
- Fixed #39
- Fixed #26
- Fixed #37
- Fixed #40
- Fixed #34
- Fixed #15 #16
- Fixed #45
- Fixed proxy undefined + fix on pool mechanism
- Fixed #9
- Closed #33
- Closed #20
- Changed CDN
- Closed #14
- Closed #23
- Closed #25
- Minor improvements
- Fixed several bugs
- Workaround to avoid stopping the execution when the URL is 'about:blank'
- Fixed #30
- Improved code
- Fixed issue with upload
- Fixed #12
- Rebranding: say goodbye to spidey, say hello to salmonJS
- Added more nodejs versions
- Fixed #7
- Fixed #6
- Wrap the functional test to avoid problems with code coverage
- Fixed #5
- Closed #13
- Improved docs
- Improved tests
- Added insight support
- Improved to execute all the JS events in the parser
- Removed support for casperjs as parser
- Fixed tests to work with the new casperjs (dev) version
- Fixed #11
- Do not process the page if there are no links
- Disabled the processing if it's not HTML
- Added follow redirects option
- Linted
- Improved report
- Fixed issue with top parent node (that has no parentNode)
- Increased redis version
- Minor improvements
- Fixed several bugs
- Fixed attribute name for script tag (thanks to Ben Ellis)
- Added custom user agent value + removed duplicate property 'tags'
- Improved test case file to use section (e.g.: POST, GET, ...)
- Removed glob as external dependency
- Added report id and comments in test cases
- Minor improvements
- Fixed several bugs
- Fixed viewport size and page content
- Improved coverage generation
- Improved the crawler using workers
- Improved casper & phantom parsers + more tests with casperjs + new classes to emulate fs & glob + minor changes
- Added tests with CasperJS for DOM interaction + draft for parser with CasperJS
- Improved url normalisation
- Added IOC
- Removed winston log from mocha tests
- Added .travis.yml
- Added tests with mocha & chai
- Added phantomjs dependency to npm
- Fixed bugged behaviour which was keeping open the node.js process
- Checked if the attributes ('on*') attached at runtime are handled correctly
- Better redis error handling
- Fixed bug with or condition
- Minor changes + improved documentation
- Added screenshot feature
- Rebrand
- Added html template for reports
- Added support for report generation
- Created a separate file for the console
- Added http authentication
- Fixed existsSync using the right node.js version
- Added help
- Linted the code
- Added sorting querystring parameters
- Improved to parse url and file path
- Retrieving events bound to elements & process them + fix signature hash
- Added tests
- Converted project from PHP to Node.js
- Minor improvements
- Fixed several bugs
Please take a moment to review this document in order to make the contribution process easy and effective for everyone involved.
Following these guidelines helps to communicate that you respect the time of the developers managing and developing this open source project. In return, they should reciprocate that respect in addressing your issue, assessing changes, and helping you finalize your pull requests.
The issue tracker is the preferred channel for bug reports, feature requests and submitting pull requests, but please respect the following restrictions:
- Please do not derail or troll issues. Keep the discussion on topic and respect the opinions of others.
A bug is a demonstrable problem that is caused by the code in the repository. Good bug reports are extremely helpful - thank you!
Guidelines for bug reports:
- Use the GitHub issue search: check if the issue has already been reported.
- Check if the issue has been fixed: try to reproduce it using the latest master or development branch in the repository.
- Isolate the problem: ideally create a reduced test case.
A good bug report shouldn't leave others needing to chase you up for more information. Please try to be as detailed as possible in your report. What is your environment? What steps will reproduce the issue? What OS experiences the problem? What would you expect to be the outcome? All these details will help people to fix any potential bugs.
Example:
Short and descriptive example bug report title
A summary of the issue and the browser/OS environment in which it occurs. If suitable, include the steps required to reproduce the bug.
- This is the first step
- This is the second step
- Further steps, etc.
<url> - a link to the reduced test case
Any other information you want to share that is relevant to the issue being reported. This might include the lines of code that you have identified as causing the bug, and potential solutions (and your opinions on their merits).
Feature requests are welcome. But take a moment to find out whether your idea fits with the scope and aims of the project. It's up to you to make a strong case to convince the project's developers of the merits of this feature. Please provide as much detail and context as possible.
Good pull requests - patches, improvements, new features - are a fantastic help. They should remain focused in scope and avoid containing unrelated commits.
Please ask first before embarking on any significant pull request (e.g. implementing features, refactoring code), otherwise you risk spending a lot of time working on something that the project's developers might not want to merge into the project.
Please adhere to the coding conventions used throughout a project (indentation, accurate comments, etc.) and any other requirements (such as test coverage).
Following this process is the best way to get your work included in the project:
- Fork the project, clone your fork, and configure the remotes:
    # Clone your fork of the repo into the current directory
    git clone https://github.com/<your-username>/salmonjs
    # Navigate to the newly cloned directory
    cd salmonjs
    # Assign the original repo to a remote called "upstream"
    git remote add upstream https://github.com/salmonjs/salmonjs
- If you cloned a while ago, get the latest changes from upstream:
    git checkout master
    git pull upstream master
- Create a new topic branch (off the main project development branch) to contain your feature, change, or fix:
    git checkout -b <topic-branch-name>
- Make sure to update, or add to the tests when appropriate. Patches and features will not be accepted without tests. Run npm test to check that all tests pass after you've made changes.
- Commit your changes in logical chunks. Please adhere to these git commit message guidelines or your code is unlikely to be merged into the main project. Use Git's interactive rebase feature to tidy up your commits before making them public.
- Locally merge (or rebase) the upstream development branch into your topic branch:
    git pull [--rebase] upstream master
- Push your topic branch up to your fork:
    git push origin <topic-branch-name>
- Open a Pull Request with a clear title and description.
- If you are asked to amend your changes before they can be merged in, please use git commit --amend (or rebasing for multi-commit Pull Requests) and force push to your remote feature branch. You may also be asked to squash commits.
IMPORTANT: By submitting a patch, you agree to license your work under the same license as that used by the project.
If you have commit access, please follow this process for merging patches and cutting new releases.
- Check that a change is within the scope and philosophy of the project.
- Check that a change has any necessary tests and a proper, descriptive commit message.
- Checkout the change and test it locally.
- If the change is good, and authored by someone who cannot commit to master, please try to avoid using GitHub's merge button. Apply the change to master locally (feel free to amend any minor problems in the author's original commit if necessary).
- If the change is good, and authored by another maintainer/collaborator, give them a "Ship it!" comment and let them handle the merge.
- All non-trivial changes should be put up for review using GitHub Pull Requests.
- Your change should not be merged into master (or another feature branch), without at least one "Ship it!" comment from another maintainer/collaborator on the project. "Looks good to me" is not the same as "Ship it!".
- Try to avoid using GitHub's merge button. Locally rebase your change onto master and then push to GitHub.
- Once a feature branch has been merged into its target branch, please delete the feature branch from the remote repository.
- Include all new functional changes in the CHANGELOG.
- Use a dedicated commit to increment the version. The version needs to be added to the CHANGELOG.md (inc. date) and the package.json.
- The commit message must be of v0.0.0 format.
- Create an annotated tag for the version: git tag -m "v0.0.0" v0.0.0.
- Push the changes and tags to GitHub: git push --tags origin master.
- Publish the new version to npm: npm publish.
salmonJS's license follows:
====
Copyright (C) 2014 Fabio Cicerchia [email protected]
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
====
This license applies to all parts of salmonJS that are not externally maintained libraries. The externally maintained libraries used by salmonJS are:
-
tiny-sha1, located at src/sha1. tiny-sha1's license follows: """ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. """