
Crawl mode #1

Open
agis opened this issue Sep 16, 2017 · 2 comments

agis (Owner) commented Sep 16, 2017

Add functionality that allows the user to provide a single target URL, the "root" node, from which mcdetect would begin scraping the site and checking every internal link for issues. The traversal could be bounded by a max-depth option.
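As a rough sketch of what this could look like, assuming mcdetect keeps driving headless Chrome through Puppeteer (the `crawl` function, `CrawlOptions` interface and `maxDepth` option below are illustrative names, not a committed API):

```ts
import puppeteer from "puppeteer";
import { URL } from "url";

interface CrawlOptions {
  maxDepth: number; // how many link "hops" away from the root to follow
}

async function crawl(root: string, opts: CrawlOptions): Promise<void> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const origin = new URL(root).origin;
  const visited = new Set<string>();
  // Breadth-first queue of [url, depth] pairs, starting at the root node.
  const queue: Array<[string, number]> = [[root, 0]];

  while (queue.length > 0) {
    const [url, depth] = queue.shift()!;
    if (visited.has(url) || depth > opts.maxDepth) continue;
    visited.add(url);

    await page.goto(url, { waitUntil: "networkidle2" });
    // ...run the existing mixed-content checks against `page` here...

    // Collect same-origin links and enqueue them one level deeper.
    const links = await page.$$eval("a[href]", (anchors) =>
      anchors.map((a) => (a as HTMLAnchorElement).href)
    );
    for (const link of links) {
      if (link.startsWith(origin) && !visited.has(link)) {
        queue.push([link, depth + 1]);
      }
    }
  }

  await browser.close();
}
```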

If you'd like to see this feature implemented, please add a 👍 reaction to indicate interest.

agis added the feature label Sep 16, 2017
Spunkie commented Sep 22, 2017

Along with this feature you may want to implement a way to exclude URL patterns with a regex or something similar. A common issue I've run into: if you are scraping a dynamic site that has a calendar/event system, you will often get stuck endlessly scraping the previous/next days.

A max-depth would certainly help in this case, but I would personally prefer setting max-depth to indefinite and manually excluding the few problematic pages as the scraper finds them. That way I can be sure the whole site has been scanned.

This may sound like a big annoyance, but I generally find that even when I'm scanning a site with thousands of pages, I only need 2-3 exclude patterns for the whole site.
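Roughly what I have in mind (the exclude patterns and the `shouldCrawl` helper below are hypothetical, just to show the idea):

```ts
// Hypothetical user-supplied exclude patterns; none of this is an existing
// mcdetect option, it's only meant to illustrate the idea.
const excludePatterns: RegExp[] = [
  /\/calendar\//,          // skip an events calendar section entirely
  /[?&]date=\d{4}-\d{2}/,  // skip the endless previous/next day pages
];

function shouldCrawl(url: string): boolean {
  return !excludePatterns.some((pattern) => pattern.test(url));
}

// In the crawl loop, filter before enqueueing a link:
// if (link.startsWith(origin) && shouldCrawl(link)) queue.push([link, depth + 1]);
```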


Another issue I often run into with tools like this is their inability to tell when a dynamic page is actually trying to start a file download.

So a static download URL like example.com/file.txt would be properly ignored, but if the URL is something like example.com/index.php?task=download&filename=file.txt, the tool often processes it as a normal page, resulting in it trying to download random binaries from the site or the request simply timing out.

It would be really nice if your tool could look at the response headers the server sends so it can properly log/skip these file downloads.
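Roughly what I mean, assuming a Puppeteer-style `response` event that exposes the headers (`looksLikeDownload` is just a made-up helper name):

```ts
// Classify a response as a file download based on its headers; a dynamic
// download URL sends the same kind of headers as a static one, so this
// catches both cases. Purely illustrative, not an existing mcdetect API.
function looksLikeDownload(headers: Record<string, string>): boolean {
  const contentType = (headers["content-type"] || "").toLowerCase();
  const disposition = (headers["content-disposition"] || "").toLowerCase();
  return disposition.includes("attachment") || !contentType.includes("text/html");
}

// With Puppeteer, response headers are available on the `response` event:
// page.on("response", (res) => {
//   if (looksLikeDownload(res.headers())) {
//     console.warn(`skipping file download: ${res.url()}`);
//   }
// });
```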

agis (Owner) commented Sep 23, 2017

@Spunkie I like the idea of having an option to exclude/include URLs by a regex. It should be somewhat simple to implement. I'll try to ship it in the first iteration of the scrape mode.

Regarding the second issue: along with limiting the "Accept" header to only accept HTML, we could maybe read the response headers as soon as they are received (if headless Chrome allows us to do so) and abort the request if they indicate a different resource type than the one we want (i.e. HTML). The thing is, we can't always rely on them. The best thing would be if Chrome provided us with DOM-parsing-related events. I'll look into this; it would be a separate feature though, outside of "scrape mode". I've created #4 to track this.
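For reference, a minimal sketch of those two ideas, assuming Puppeteer (the `checkPage` helper is hypothetical, and it only skips processing after the full response arrives; a true mid-flight abort based on response headers would need lower-level DevTools protocol interception):

```ts
import { Page } from "puppeteer";

// Hypothetical helper: visit a URL, but bail out before running any checks
// if the response isn't an HTML document.
async function checkPage(page: Page, url: string): Promise<void> {
  // Ask only for HTML; a well-behaved server may honour this, but we can't
  // rely on it, hence the response-side check below.
  await page.setExtraHTTPHeaders({ accept: "text/html" });

  const response = await page.goto(url, { waitUntil: "networkidle2" });
  const contentType = response?.headers()["content-type"] || "";
  if (!contentType.includes("text/html")) {
    // Not an HTML document (e.g. a forced file download): log it and move on.
    console.warn(`skipping non-HTML resource: ${url}`);
    return;
  }

  // ...run the mixed-content checks against `page` here...
}
```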

Thanks for the great feedback!

agis changed the title from Scrape mode to Crawl mode Feb 1, 2018