@K-Mistele

This PR adds two features:

  1. The option to follow domain redirects, e.g. from somewebsite.com to www.somewebsite.com. Currently, redirects to subdomains are not followed.
  2. The option to check for a sitemap.xml. If it is present, all the URLs it lists will be added to the crawl queue.

The redirect option is useful when the top-level site, e.g. mywebsite.com, redirects to www.mywebsite.com.
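
To make the redirect case concrete, here is a minimal sketch of one way a crawler could decide that a redirect stays on the same site. It is not taken from this PR's diff: the helper names `stripWww` and `isSameSiteRedirect` are hypothetical, and the rule that only a leading `www.` is treated as equivalent is an assumption made for illustration.

```typescript
// Hypothetical helpers for illustration; not the PR's actual implementation.

// Drop a leading "www." so that example.com and www.example.com compare equal.
function stripWww(hostname: string): string {
  return hostname.toLowerCase().replace(/^www\./, "");
}

// Returns true when the redirect target is on the same host as the original
// URL, ignoring a leading "www." (assumed rule; other subdomains are rejected).
function isSameSiteRedirect(original: string, redirected: string): boolean {
  const from = new URL(original);
  const to = new URL(redirected);
  return stripWww(from.hostname) === stripWww(to.hostname);
}
```

Under this assumption, a redirect from somewebsite.com to www.somewebsite.com would be followed, while a redirect to an unrelated domain or to another subdomain would not.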
@socket-security

New and updated dependencies detected.

| Package | New capabilities | Transitives | Size | Publisher |
| --- | --- | --- | --- | --- |
| npm/[email protected] | None | +1 | 192 kB | amitgupta |
| npm/[email protected] 🔁 npm/[email protected] | Transitive: eval, network | +19 | 185 MB | rolldownbot |


await this.#crawlSitemap(url)
}

await this.#fetchPage(url, {
Owner

we probably don't need to continue with normal fetching if sitemap is enabled, i.e.:

if (this.options.enableSitemap) {
  await this.#crawlSitemap(url)
} else {
  await this.#fetchPage(url, {})
}

Author

So, while sitemap.xml fetching is great to have for larger or search-optimized sites, a lot of sites still don't have one.
I was viewing the enableSitemap option as "we are enabling sitemap-based crawling if it's available", while also assuming that anyone who calls siteFetch() wants the site fetched whether or not a sitemap is available. E.g. if someone is batch-fetching a bunch of sites, they probably want each site fetched, using the sitemap where possible, and don't want a site skipped just because it has no sitemap.

crawled as normal:
```
sitefetch https://nextjs.org --enable-sitemap
```
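
The "use the sitemap if available, otherwise crawl as normal" behavior described above could be sketched roughly as follows. This is an illustration only, not the PR's code: `sitemapUrlFor`, `extractSitemapUrls`, and `buildInitialQueue` are hypothetical names, the sitemap is assumed to live at `/sitemap.xml` on the site root, and the caller is assumed to pass `null` when the sitemap request fails or returns nothing.

```typescript
// Hypothetical helpers for illustration; not the PR's actual implementation.

// Assumed location of the sitemap: /sitemap.xml at the site root.
function sitemapUrlFor(siteUrl: string): string {
  return new URL("/sitemap.xml", siteUrl).href;
}

// Collect the text of every <loc> element with a simple regex.
// A real crawler would likely use an XML parser and decode entities.
function extractSitemapUrls(xml: string): string[] {
  const urls: string[] = [];
  const pattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(xml)) !== null) {
    urls.push(match[1]);
  }
  return urls;
}

// The site URL is always queued, so the site is crawled even when
// sitemapXml is null (no sitemap found). Sitemap URLs are added on top.
function buildInitialQueue(siteUrl: string, sitemapXml: string | null): string[] {
  const queue: string[] = [siteUrl];
  if (sitemapXml === null) {
    return queue;
  }
  for (const url of extractSitemapUrls(sitemapXml)) {
    if (!queue.includes(url)) {
      queue.push(url);
    }
  }
  return queue;
}
```

The point of the design is that a missing sitemap only means fewer seed URLs, never a skipped site.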
Owner

can it simply be:

sitefetch https://example.com/sitemap.xml

Author

This requires knowing ahead of time whether the site you're trying to crawl has a sitemap, which means making an additional request (see the above comment). Happy to change it if you feel strongly about it.

Owner

we can just check if the input url ends with /sitemap.xml
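
A check like the one suggested here might look like the sketch below. The function name `isSitemapUrl` is hypothetical, and comparing the parsed pathname (so that a query string is ignored) is an assumption rather than something settled in this thread.

```typescript
// Hypothetical helper for illustration; not the PR's actual implementation.

// Returns true when the input parses as a URL whose path ends with /sitemap.xml.
// Unparseable input is treated as "not a sitemap URL".
function isSitemapUrl(input: string): boolean {
  try {
    return new URL(input).pathname.endsWith("/sitemap.xml");
  } catch {
    return false;
  }
}
```

With this in place, `sitefetch https://example.com/sitemap.xml` could be routed to sitemap crawling, and any other input to the normal crawl.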

@egoist
Owner

egoist commented Jan 17, 2025

btw we use Bun to install packages here 😄

@K-Mistele
Author

> btw we use Bun to install packages here 😄

Oops! I will nuke the package-lock then. I actually use Bun for all of my personal stuff, but most projects don't, so I'm used to reverting to npm for OSS contributions.
