
Crawl mode #1

Open
agis opened this issue Sep 16, 2017 · 2 comments

agis (Owner) commented Sep 16, 2017

Add functionality that allows the user to provide a single target URL, the "root" node, from which mcdetect would begin scraping the site and checking every internal link for issues. The traversal could be bounded by a max-depth option.
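As a rough sketch of what this could look like, assuming mcdetect keeps driving headless Chrome through Puppeteer (the `crawl` function, `CrawlOptions` interface and `maxDepth` option below are illustrative names, not a committed API):

```ts
import puppeteer from "puppeteer";
import { URL } from "url";

interface CrawlOptions {
  maxDepth: number; // how many link "hops" away from the root to follow
}

async function crawl(root: string, opts: CrawlOptions): Promise<void> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const origin = new URL(root).origin;
  const visited = new Set<string>();
  // Breadth-first queue of [url, depth] pairs, starting at the root node.
  const queue: Array<[string, number]> = [[root, 0]];

  while (queue.length > 0) {
    const [url, depth] = queue.shift()!;
    if (visited.has(url) || depth > opts.maxDepth) continue;
    visited.add(url);

    await page.goto(url, { waitUntil: "networkidle2" });
    // ...run the existing mixed-content checks against `page` here...

    // Collect same-origin links and enqueue them one level deeper.
    const links = await page.$$eval("a[href]", (anchors) =>
      anchors.map((a) => (a as HTMLAnchorElement).href)
    );
    for (const link of links) {
      if (link.startsWith(origin) && !visited.has(link)) {
        queue.push([link, depth + 1]);
      }
    }
  }

  await browser.close();
}
```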

If you'd like to see this feature implemented, please add a 👍 reaction to indicate interest.

agis added the feature label Sep 16, 2017
Spunkie commented Sep 22, 2017

Along with this feature you may want to implement a way to exclude URL patterns with a regex or something similar. A common issue I've run into: if you are scraping a dynamic site that has a calendar/event system, you will often get stuck endlessly scraping the previous/next days.

A max-depth would certainly help in this case, but I would personally prefer setting max-depth to indefinite and manually excluding the few problematic pages as the scraper finds them. That way I can be sure the whole site has been scanned.

This may sound like a big annoyance, but I generally find that even when I'm scanning a site with thousands of pages, I only need 2-3 exclude patterns for the whole site.
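Roughly what I have in mind (the exclude patterns and the `shouldCrawl` helper below are hypothetical, just to show the idea):

```ts
// Hypothetical user-supplied exclude patterns; none of this is an existing
// mcdetect option, it's only meant to illustrate the idea.
const excludePatterns: RegExp[] = [
  /\/calendar\//,          // skip an events calendar section entirely
  /[?&]date=\d{4}-\d{2}/,  // skip the endless previous/next day pages
];

function shouldCrawl(url: string): boolean {
  return !excludePatterns.some((pattern) => pattern.test(url));
}

// In the crawl loop, filter before enqueueing a link:
// if (link.startsWith(origin) && shouldCrawl(link)) queue.push([link, depth + 1]);
```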


Another issue I often run into with tools like this is their inability to tell when a dynamic page is actually trying to start a file download.

So a static download URL like example.com/file.txt would be properly ignored, but if the URL is something like example.com/index.php?task=download&filename=file.txt, the tool often processes it as a normal page, resulting in it trying to download random binaries from the site or the request simply timing out.

It would be really nice if your tool could look at the response headers the server sends so it can properly log/skip these file downloads.
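Roughly what I mean, assuming a Puppeteer-style `response` event that exposes the headers (`looksLikeDownload` is just a made-up helper name):

```ts
// Classify a response as a file download based on its headers; a dynamic
// download URL sends the same kind of headers as a static one, so this
// catches both cases. Purely illustrative, not an existing mcdetect API.
function looksLikeDownload(headers: Record<string, string>): boolean {
  const contentType = (headers["content-type"] || "").toLowerCase();
  const disposition = (headers["content-disposition"] || "").toLowerCase();
  return disposition.includes("attachment") || !contentType.includes("text/html");
}

// With Puppeteer, response headers are available on the `response` event:
// page.on("response", (res) => {
//   if (looksLikeDownload(res.headers())) {
//     console.warn(`skipping file download: ${res.url()}`);
//   }
// });
```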

agis (Owner) commented Sep 23, 2017

@Spunkie I like the idea of having an option to exclude/include URLs by a regex. It should be somewhat simple to implement. I'll try to ship it in the first iteration of the scrape mode.

Regarding the second issue: along with limiting the "Accept" header to only accept HTML, we could maybe read the response headers as soon as they are received (if headless Chrome allows us to do so) and abort the request if they indicate a different resource type than the one we want (i.e. HTML). The thing is, we can't always rely on them. The best thing would be if Chrome provided us with DOM-parsing-related events. I'll look into this; it would be a separate feature though, outside of "scrape mode". I've created #4 to track this.
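For reference, a minimal sketch of those two ideas, assuming Puppeteer (the `checkPage` helper is hypothetical, and it only skips processing after the full response arrives; a true mid-flight abort based on response headers would need lower-level DevTools protocol interception):

```ts
import { Page } from "puppeteer";

// Hypothetical helper: visit a URL, but bail out before running any checks
// if the response isn't an HTML document.
async function checkPage(page: Page, url: string): Promise<void> {
  // Ask only for HTML; a well-behaved server may honour this, but we can't
  // rely on it, hence the response-side check below.
  await page.setExtraHTTPHeaders({ accept: "text/html" });

  const response = await page.goto(url, { waitUntil: "networkidle2" });
  const contentType = response?.headers()["content-type"] || "";
  if (!contentType.includes("text/html")) {
    // Not an HTML document (e.g. a forced file download): log it and move on.
    console.warn(`skipping non-HTML resource: ${url}`);
    return;
  }

  // ...run the mixed-content checks against `page` here...
}
```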

Thanks for the great feedback!

agis changed the title from Scrape mode to Crawl mode Feb 1, 2018