WARC-Resource-Type field possibilities (feedback wanted) #96

ikreymer · 2024-03-04T20:55:28Z

Browsers have different ways of reporting the 'resource type' for any resource that's being fetched. When using browser-based crawling, it is often easy to access this 'resource type' and store it in a custom WARC header.

It is possible to introduce a WARC-Resource-Type header to store this type. Unfortunately, there isn't a single standard of 'resource types' and various browser APIs expose different variations on this.

If a resource type is written to a WARC header, is there a way to make it future proof to support different vocabularies?

Some possibilities include:

Chrome DevTools Protocol (CDP) resource type - this is the easiest for Chromium-based crawling, as these fields are directly accessible, but it is not especially well standardized and could change at any time.
Fetch Request.destination - this is a well-standardized vocabulary, but it is not a one-to-one mapping and may not be accessible for non-Fetch data.
Extension API webRequest.resourceType - better standardized and supported by all the major browsers for browser extensions, with some differences between them. Not quite one-to-one with CDP types.

One approach to make this more future proof might be to prefix the resourceType with a namespace based on where the data is coming from and which vocabulary is used.

For example: cdp:Document or cdp:Image if using CDP; webRequest:sub_frame or webRequest:image if using webRequest; destination:image or destination:document if using destination, etc.

This allows for expanding into other vocabularies in the future, but may be harder to parse.
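As a rough illustration of what parsing the prefixed form could involve, here is a minimal Python sketch. The function name `parse_resource_type`, the `ResourceType` tuple, and the set of known namespaces are hypothetical and only mirror the examples above; unprefixed values (such as a bare CDP value) are returned with no namespace.

```python
from typing import NamedTuple, Optional

# Hypothetical namespaces, taken from the examples in this proposal.
KNOWN_NAMESPACES = {"cdp", "webRequest", "destination"}


class ResourceType(NamedTuple):
    namespace: Optional[str]  # None when the value carries no known prefix
    value: str


def parse_resource_type(header_value: str) -> ResourceType:
    """Split a 'namespace:value' header value into its two parts."""
    text = header_value.strip()
    namespace, sep, value = text.partition(":")
    if sep and namespace in KNOWN_NAMESPACES and value:
        return ResourceType(namespace, value)
    # No recognised prefix: keep the whole string as the value.
    return ResourceType(None, text)
```

With this sketch, `parse_resource_type("cdp:Document")` yields the namespace `cdp` and the value `Document`, while a bare `Image` yields no namespace.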

Alternatively, there could be a fixed allowed vocabulary made up of values common to at least two of the above, which might be:
document, image, media, script, stylesheet, font, ping, websocket, fetch and a catch-all other.

(In this case, we should specify how the more specific values are recorded, e.g. main_frame / sub_frame would be recorded as document.)
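A possible shape for such a mapping, sketched in Python. The alias table is an assumption for illustration only: it covers a handful of values from the three vocabularies and is not a complete or agreed mapping (in particular, folding XHR into fetch is a choice made here, not something settled in this thread).

```python
# The fixed vocabulary proposed above.
COMMON_TYPES = {
    "document", "image", "media", "script", "stylesheet",
    "font", "ping", "websocket", "fetch", "other",
}

# Illustrative aliases only; keys are lowercased source values.
ALIASES = {
    "main_frame": "document",   # webRequest
    "sub_frame": "document",    # webRequest
    "frame": "document",        # Request.destination
    "iframe": "document",       # Request.destination
    "style": "stylesheet",      # Request.destination
    "audio": "media",           # Request.destination
    "video": "media",           # Request.destination
    "xhr": "fetch",             # CDP (assumed to fold into fetch)
    "xmlhttprequest": "fetch",  # webRequest (assumed to fold into fetch)
}


def to_common_type(raw: str) -> str:
    """Map a vocabulary-specific value onto the fixed vocabulary."""
    key = raw.strip().lower()
    key = ALIASES.get(key, key)
    # Anything not in the fixed vocabulary falls into the catch-all.
    return key if key in COMMON_TYPES else "other"
```

Lowercasing first means CDP values such as `Document` and their lowercased forms both land on the same entry.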

Other thoughts / suggestions welcome!


ikreymer · 2024-03-04T21:10:01Z

I should note our initial implementation just stores the CDP value, e.g. WARC-Resource-Type: Document, WARC-Resource-Type: Image, etc., without a prefix, as that was the easiest to try. We could also just keep that, but wanted to see if there were any thoughts on the above proposals. Other tools that work directly with the Chrome DevTools Protocol, such as Brozzler or the Chrome Extractor for Heritrix, would have the same vocabulary as well, so this may not be an immediate concern.
It is mostly a question of other tools and future-proofing to support vocabularies not coming from CDP, if such a header were to be standardized.

tw4l · 2024-03-04T22:53:40Z

Note that Puppeteer and Playwright use the CDP values but lowercased: https://playwright.dev/docs/api/class-request#:~:text=resourceType%E2%80%8B&text=ResourceType%20will%20be%20one%20of,%2C%20websocket%20%2C%20manifest%20%2C%20other%20

tw4l · 2024-03-04T22:54:04Z

Playwright mapping for Firefox: https://github.com/microsoft/playwright/blob/73ffaf65d75b2378168ac5a11eb37cced03ff6ea/packages/playwright-core/src/server/firefox/ffNetworkManager.ts#L161

ato · 2024-03-05T08:48:33Z

Do we have any use cases in mind for this field when reading the WARC?

I guess one might be listing all the top-level crawled documents. This can't be done accurately by Content-Type alone, as XHR/Fetch requests can have text/html responses.

The main_frame/sub_frame distinction also seems interesting for that use case. It's not in the CDP resource type but if we map to one of the other vocabularies presumably it could be determined from the frameId?

I guess the hopsFromSeed metadata field could be used for listing top-level crawled documents but it's coarse grained and doesn't make distinctions between different kinds of embedded content.

It's also possible for an image to have a text/html Content-Type and still display correctly due to MIME sniffing. So similarly if you wanted to do something with all the images in a crawl, Content-Type alone is insufficient.
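To make the use case concrete, here is a hedged Python sketch that selects records by resource type instead of Content-Type. It assumes each record's WARC headers have already been read into a plain dict by some WARC parser, and that the crawler wrote an unprefixed CDP value on response records; the function name `urls_with_resource_type` is hypothetical.

```python
from typing import Dict, Iterable, List


def urls_with_resource_type(
    records: Iterable[Dict[str, str]], wanted: str
) -> List[str]:
    """Return target URIs of response records whose resource type matches."""
    urls = []
    for headers in records:
        if headers.get("WARC-Type") != "response":
            continue
        # Compare case-insensitively: CDP uses 'Document', other tools lowercase it.
        if headers.get("WARC-Resource-Type", "").lower() == wanted.lower():
            urls.append(headers.get("WARC-Target-URI", ""))
    return urls
```

Calling it with `"Document"` would skip a text/html response fetched via XHR, and calling it with `"Image"` would include an image served with a text/html Content-Type. Since the CDP type does not separate main frames from sub-frames, this alone lists documents rather than strictly top-level pages.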

tw4l · 2024-03-05T16:02:23Z

We've added this to our WARCs in response to a user-submitted issue: webrecorder/browsertrix-crawler#451, with the primary use case being differentiating between resources fetched by JavaScript (via fetch, xhr) versus resources loaded directly from the HTML.

edsu · 2024-03-05T16:09:39Z

This is probably off topic for this issue, but it came up recently in the context of using mailbagit that it would be useful to know if a record is for a seed URL. Or is there another common way of doing that? The motivation here is to be able to pick out URLs from the WARC data to serve as entry points during replay.

ato · 2024-03-06T00:06:17Z

> it would be useful to know if a record is for a seed URL. Or is there another common way of doing that?

For WARCs created by Heritrix, a metadata record without the via and hopsFromSeed fields is indicative of a seed. If the crawler doesn't populate those fields, though, I don't think there's a reliable way to tell from a WARC file alone. Requests without a Referer header might also be indicative for some crawlers, but not for ones that obey Referrer-Policy: no-referrer.
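A minimal Python sketch of that heuristic, assuming the metadata record's payload has already been extracted as text made of simple "name: value" lines. The helper names are hypothetical, and the check is only meaningful for crawlers that populate via and hopsFromSeed for non-seed URLs.

```python
from typing import Dict


def parse_warc_fields(payload: str) -> Dict[str, str]:
    """Parse 'name: value' lines from a metadata record payload."""
    fields = {}
    for line in payload.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip():
            fields[name.strip()] = value.strip()
    return fields


def looks_like_seed(metadata_payload: str) -> bool:
    """Heuristic: neither 'via' nor 'hopsFromSeed' is present."""
    fields = parse_warc_fields(metadata_payload)
    return "via" not in fields and "hopsFromSeed" not in fields
```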

WACZ defines an accompanying pages.jsonl file for entry points.
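For replay entry points, reading that file could look roughly like the Python sketch below. It assumes the file holds one JSON object per line, with a leading header line that has no url field and page entries that each carry a url; the function name `entry_point_urls` is hypothetical.

```python
import json
from typing import Iterable, List


def entry_point_urls(lines: Iterable[str]) -> List[str]:
    """Collect page URLs from the lines of a pages.jsonl file."""
    urls = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # The header line describes the file and has no "url" key.
        if "url" in entry:
            urls.append(entry["url"])
    return urls
```

Because a file object is an iterable of lines, the function can be passed an open pages.jsonl directly.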

benoit74 · 2024-06-27T07:05:10Z

See webrecorder/browsertrix-crawler#630 for feedback "from the trenches".

This was referenced Mar 4, 2024

warc: add Network.resourceType (https://chromedevtools.github.io/devt… webrecorder/browsertrix-crawler#481

Merged

Add request initiator to WARC? webrecorder/browsertrix-crawler#451

Closed

edsu mentioned this issue Mar 5, 2024

Add the message body to the pages tab in replayweb UAlbanyArchives/mailbagit#247

Open

tw4l mentioned this issue Mar 14, 2024

Document new WARC fields in 1.x crawler-produced WACZ files webrecorder/browsertrix#1588

Open

maeb mentioned this issue May 6, 2024

Support WARC extension fields from Browsertrix nlnwa/gowarc#76

Merged
