[Features]: Enable sitemap.xml-based crawling and following domain redirects #14

K-Mistele · 2025-01-16T22:46:01Z

This PR adds two features:

The option to follow domain redirects, e.g. from somewebsite.com to www.somewebsite.com. Currently, redirects to subdomains are not followed.
The option to check for a sitemap.xml. If one is present, all the URLs it lists are added to the crawl queue.

Following redirects is useful in cases where the top-level site, e.g. mywebsite.com, redirects to www.mywebsite.com.
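As a rough illustration of the redirect case described above, here is a minimal sketch of a host comparison that treats `somewebsite.com` and `www.somewebsite.com` as the same site. The helper names `stripWww` and `isSameSiteRedirect` are hypothetical and do not come from this PR; the actual implementation may match hosts differently (for example, it may allow other subdomains as well).

```typescript
// Hypothetical helpers, not taken from the PR.
// Normalizes a hostname by lowercasing it and dropping a leading "www.".
function stripWww(hostname: string): string {
  return hostname.toLowerCase().replace(/^www\./, "")
}

// Returns true when the redirect target is the same site as the requested
// URL, ignoring only a leading "www." (an assumption made for this sketch).
function isSameSiteRedirect(requested: string, redirected: string): boolean {
  const from = new URL(requested)
  const to = new URL(redirected)
  return stripWww(from.hostname) === stripWww(to.hostname)
}
```

A crawler could call a check like this on the final response URL before deciding whether to keep a redirected page in the queue.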

socket-security · 2025-01-16T22:46:23Z

New and updated dependencies detected.

| Package | New capabilities | Transitives | Size | Publisher |
| --- | --- | --- | --- | --- |
| npm/[email protected] | None | `+1` | 192 kB | amitgupta |
| npm/[email protected] 🔁 npm/[email protected] | Transitive: eval, network | `+19` | 185 MB | rolldownbot |


egoist · 2025-01-17T07:34:17Z

src/index.ts

+      await this.#crawlSitemap(url)
+    }
+
    await this.#fetchPage(url, {


we probably don't need to continue normal fetching if sitemap is enabled, ie:

if (this.options.enableSitemap) { await this.#crawlSitemap(url) } else { await this.#fetchPage(url, {}) }

So, while sitemap.xml fetching is a great thing to have for larger sites or search-optimized sites, a lot of sites still don't have one.
I was viewing the enableSitemap option as "we are enabling sitemap-based crawling if it's available", while also assuming that anyone who calls siteFetch() wants the site fetched regardless of whether a sitemap is available. E.g. if someone is batch-fetching a bunch of sites, they probably want the sites fetched, using the sitemap if possible, but don't want a site skipped just because it has no sitemap.
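The "use the sitemap if possible, but never skip the site" behavior could be sketched roughly as below. This is only an illustration: `extractSitemapUrls` and `planCrawl` are hypothetical names, the regex-based `<loc>` extraction is a simplification (it does not decode XML entities or handle sitemap index files), and the PR itself may parse sitemaps with a dedicated library.

```typescript
// Hypothetical helper: pulls every <loc> entry out of a sitemap document.
// A simple regex is used for brevity; XML entities are not decoded.
function extractSitemapUrls(xml: string): string[] {
  const urls: string[] = []
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g
  let match: RegExpExecArray | null
  while ((match = locPattern.exec(xml)) !== null) {
    const loc = match[1]
    if (loc) urls.push(loc)
  }
  return urls
}

// Hypothetical helper: builds the initial crawl queue. The start URL is
// always included, so a missing or empty sitemap never skips the site.
function planCrawl(startUrl: string, sitemapXml: string | null): string[] {
  const fromSitemap = sitemapXml ? extractSitemapUrls(sitemapXml) : []
  // Deduplicate while preserving insertion order.
  return Array.from(new Set([startUrl, ...fromSitemap]))
}
```

With `sitemapXml` set to `null` (for example, after a failed sitemap request), the queue contains only the start URL and crawling proceeds as normal.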

egoist · 2025-01-17T07:35:21Z

README.md

+crawled as normal:
+``` 
+sitefetch https://nextjs.org --enable-sitemap
+```


can it simply be:

sitefetch https://example.com/sitemap.xml

This requires that you know ahead of time whether the site you're trying to crawl has a sitemap, which requires making an additional request (see the comment above). Happy to change it if you feel strongly about it.

we can just check if the input url ends with /sitemap.xml
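A check like the one suggested here might look roughly as follows. The function name `isSitemapUrl` is hypothetical, and comparing against the parsed pathname (rather than the raw string) is an assumption made for this sketch so that query strings and fragments do not affect the result.

```typescript
// Hypothetical helper: returns true when the input URL points at a
// sitemap.xml file, based on the path only.
function isSitemapUrl(input: string): boolean {
  return new URL(input).pathname.endsWith("/sitemap.xml")
}
```

With this in place, `sitefetch https://example.com/sitemap.xml` could switch to sitemap-based crawling without a separate flag, while any other URL would be crawled as normal.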

egoist · 2025-01-17T07:36:53Z

btw we use Bun to install packages here 😄

K-Mistele · 2025-01-17T16:00:28Z

> btw we use Bun to install packages here 😄

Oops! I will nuke the package-lock then. I actually use Bun for all of my personal stuff but most projects don't so I'm used to reverting to npm for OSS contributions.

K-Mistele added 2 commits January 16, 2025 16:00

feat: add support for following domain redirects (disabled by default)

d13b01a

This is useful in cases where the top-level site e.g. mywebsite.com redirects to www.mywebsite.com

feat: add sitemap parser

cc3470c

egoist reviewed Jan 17, 2025
