[Features]: Enable sitemap.xml-based crawling and following domain redirects #14
Conversation
This is useful in cases where the top-level site, e.g. mywebsite.com, redirects to www.mywebsite.com.
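For illustration only, one minimal way to resolve the redirected origin is via the Fetch API's response.url, which holds the final URL after redirects; the function name below is hypothetical and not part of this PR:

```ts
// Sketch only: resolve the origin a site redirects to, so crawling can
// continue against e.g. www.mywebsite.com when mywebsite.com redirects there.
// resolveRedirectedOrigin is an illustrative name, not sitefetch's API.
async function resolveRedirectedOrigin(inputUrl: string): Promise<string> {
  // fetch() follows redirects by default; response.url holds the final URL.
  const res = await fetch(inputUrl, { redirect: "follow" })
  return new URL(res.url).origin
}

// await resolveRedirectedOrigin("https://mywebsite.com")
// → "https://www.mywebsite.com" (when the site redirects there)
```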
New and updated dependencies detected (Socket for GitHub).
Diff context for the review comments below:

```
await this.#crawlSitemap(url)
}

await this.#fetchPage(url, {
```
we probably don't need to continue normal fetching if sitemap is enabled, i.e.:

```
if (this.options.enableSitemap) {
  await this.#crawlSitemap(url)
} else {
  await this.#fetchPage(url, {})
}
```
So, while sitemap.xml fetching is a great thing to have for larger sites or search-optimized sites, a lot of sites still don't have one.
I was viewing the enableSitemap option as "we are enabling sitemap-based crawling if it's available", but also assuming that if someone calls siteFetch() they want the site fetched regardless of whether a sitemap is available. E.g. if someone is batch-fetching a bunch of sites, they probably want the sites fetched, using the sitemap where possible, but don't want a site skipped just because it has no sitemap.
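A minimal sketch of that fallback behaviour, with a hypothetical tryCrawlSitemap()/fetchPage() pair standing in for the PR's #crawlSitemap and #fetchPage:

```ts
// Hypothetical stand-in: returns false when the site exposes no sitemap.
async function tryCrawlSitemap(siteUrl: string): Promise<boolean> {
  const res = await fetch(new URL("/sitemap.xml", siteUrl))
  if (!res.ok) return false // no sitemap: let the caller fall back
  // ...parse the sitemap and enqueue its URLs here...
  return true
}

// Hypothetical stand-in for normal page-by-page crawling.
async function fetchPage(siteUrl: string): Promise<void> {
  // ...fetch the page and follow its links here...
}

async function crawl(siteUrl: string, options: { enableSitemap?: boolean }) {
  // Prefer the sitemap when the option is on and a sitemap exists;
  // otherwise fall back to crawling the site as usual.
  if (options.enableSitemap && (await tryCrawlSitemap(siteUrl))) return
  await fetchPage(siteUrl)
}
```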
From the README diff under review:

…crawled as normal:
```
sitefetch https://nextjs.org --enable-sitemap
```
can it simply be:
```
sitefetch https://example.com/sitemap.xml
```
This requires that you know ahead of time whether the site you're trying to crawl has a sitemap, which means making an additional request; see the comment above. Happy to change it if you feel strongly about it.
we can just check if the input url ends with /sitemap.xml
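A sketch of the suggested check (not the eventual implementation):

```ts
// Treat the input as a sitemap when its path ends with /sitemap.xml,
// otherwise crawl it as a regular site entry point.
function isSitemapUrl(input: string): boolean {
  return new URL(input).pathname.endsWith("/sitemap.xml")
}

// isSitemapUrl("https://example.com/sitemap.xml") // true
// isSitemapUrl("https://nextjs.org")              // false
```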
btw we use …
Oops! I will nuke the package-lock then. I actually use Bun for all of my personal stuff, but most projects don't, so I'm used to reverting to npm for OSS contributions.
This PR adds two features:
- Crawling based on sitemap.xml: if it is present, all the specified URLs will be added to the crawl queue.
- Following domain redirects, e.g. when mywebsite.com redirects to www.mywebsite.com.
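As a rough illustration of the sitemap-queueing idea described above: fetch /sitemap.xml and push every <loc> entry onto the crawl queue. The regex-based extraction and plain-array queue are assumptions for the sketch, not the PR's actual implementation.

```ts
// Illustrative only: error handling, sitemap indexes and robots.txt
// discovery are omitted.
async function enqueueSitemapUrls(origin: string, queue: string[]): Promise<void> {
  const res = await fetch(new URL("/sitemap.xml", origin))
  if (!res.ok) return // no sitemap: nothing to enqueue

  const xml = await res.text()
  // Naive extraction of <loc>…</loc> entries from the sitemap XML.
  for (const match of xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)) {
    queue.push(match[1])
  }
}

// const queue: string[] = []
// await enqueueSitemapUrls("https://nextjs.org", queue)
```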