Crawl mode #1
Along with this feature, you may want to implement a way to exclude URL patterns with a regex or something similar. A common issue I've run into: if you are scraping a dynamic site that has a calendar/event system, you will often get stuck endlessly crawling the previous/next days. This may seem like a big annoyance, but I generally find that even when I'm scanning a site with thousands of pages, I only need 2-3 exclude patterns for the whole site.

Another issue I run into often with tools like this is an inability to tell when a dynamic page is actually trying to start a file download, as opposed to a static download url. It would be really nice if your tool could look at the headers sent, so it could properly log/skip these file downloads.
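The exclude-pattern idea above could be sketched roughly like this. The function name and the example patterns are illustrative only, not part of mcdetect; the point is that a small list of regexes is enough to break out of an endless calendar crawl:

```typescript
// Hypothetical helper: decide whether a URL should be skipped,
// given a list of user-supplied exclude patterns.
function isExcluded(url: string, excludePatterns: RegExp[]): boolean {
  return excludePatterns.some((pattern) => pattern.test(url));
}

// Two or three patterns usually cover a whole site, e.g. skipping
// an endless prev/next calendar widget:
const excludes = [/\/calendar\//, /[?&]date=\d{4}-\d{2}-\d{2}/];

isExcluded("https://example.com/calendar/2019-01-02", excludes); // true
isExcluded("https://example.com/about", excludes);               // false
```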
@Spunkie I like the idea of having an option to exclude/include URLs by a regex. It should be fairly simple to implement, and I'll try to ship it in the first iteration of the scrape mode.

Regarding the second issue: along with limiting the "Accept" request header to HTML only, we could read the response headers as soon as they are received (if headless Chrome allows us to do so) and abort the request if they indicate a different resource type than we want (i.e. anything other than HTML). The thing is, we can't always rely on them; the best option would be if Chrome provided us with DOM-parsing-related events. I'll look into this, though it would be a separate feature, outside of "scrape mode". I've created #4 to track it. Thanks for the great feedback!
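The header check discussed above could look something like the sketch below. This is not mcdetect's actual code; the function name and heuristics are assumptions. In Puppeteer, the headers would come from a `page.on('response', ...)` handler via `response.headers()`, which returns a plain lowercase-keyed object:

```typescript
// Given response headers, decide whether the body is HTML worth
// parsing, or a file download that should be logged and skipped.
function looksLikeHtml(headers: Record<string, string>): boolean {
  const contentType = headers["content-type"] ?? "";
  const disposition = headers["content-disposition"] ?? "";
  // "attachment" signals a download regardless of the content type.
  if (/attachment/i.test(disposition)) return false;
  return /text\/html|application\/xhtml\+xml/i.test(contentType);
}
```

As noted in the comment, headers are not fully reliable (servers mislabel content types), so this would be a best-effort filter rather than a guarantee.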
Add functionality that would allow the user to provide one target URL, the "root" node, from which mcdetect would begin scraping the site and checking every internal link for issues. The traversal could be bounded by a `max-depth` option.

If you'd like to see this feature implemented, please add a 👍 reaction to indicate interest.
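The proposed crawl mode could be sketched as a breadth-first traversal bounded by `max-depth`. Everything here is a minimal illustration, not mcdetect's implementation; `fetchLinks` is a hypothetical stand-in for whatever would extract internal links from a rendered page:

```typescript
// Link extractor: given a URL, return the internal links found on
// that page. In practice this would drive headless Chrome.
type FetchLinks = (url: string) => string[];

// Breadth-first crawl from `root`, visiting pages at most
// `maxDepth` hops away. Returns every URL reached.
function crawl(root: string, maxDepth: number, fetchLinks: FetchLinks): string[] {
  const visited = new Set<string>([root]);
  let frontier = [root];
  for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      for (const link of fetchLinks(url)) {
        if (!visited.has(link)) {
          visited.add(link);
          next.push(link);
        }
      }
    }
    frontier = next;
  }
  return [...visited];
}
```

The `visited` set both deduplicates work and prevents infinite loops on cyclic link graphs, which pairs naturally with the regex excludes suggested in the comments.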