Support WARC extension fields from Browsertrix #76

Merged
merged 3 commits into from
May 6, 2024

@maeb maeb commented May 6, 2024

This PR adds the fields WARC-Page-ID, WARC-Resource-Type and WARC-JSON-Metadata to the list of known fields, so that the casing of WARC header field names produced by Browsertrix is preserved during processing.

@maeb maeb changed the title Support WARC extensions from Browsertrix Support WARC extension fields from Browsertrix May 6, 2024

maeb commented May 6, 2024

Need to rename WarcPageId to WarcPageID (for consistency) before merging this.

@maeb maeb force-pushed the feat/extension-fields branch 2 times, most recently from cbf6d83 to cccbe4c Compare May 6, 2024 08:15

maeb commented May 6, 2024

maeb added 3 commits May 6, 2024 10:56
To preserve the case of field-names produced by Browsertrix, this commit adds
the fields WARC-Page-ID, WARC-JSON-Metadata and WARC-Resource-Type to the list
of known header fields.

gowarc can handle unknown fields but normalizes their field-names to title-case.

The WARC specification allows for extension fields in the WARC header.
Adds some more test cases.
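
For illustration, the behaviour described in these commits can be sketched roughly as follows: known field names are looked up case-insensitively and returned in their canonical spelling, while unknown names fall back to title-case. This is a minimal sketch and not gowarc's actual implementation; `knownFields`, `titleCase` and `normalizeFieldName` are hypothetical names chosen for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// knownFields maps a lower-cased field name to its canonical spelling.
// Hypothetical table for illustration; not gowarc's actual data structure.
var knownFields = map[string]string{
	"warc-page-id":       "WARC-Page-ID",
	"warc-resource-type": "WARC-Resource-Type",
	"warc-json-metadata": "WARC-JSON-Metadata",
}

// titleCase upper-cases the first letter of each hyphen-separated part
// and lower-cases the rest. Field names are assumed to be ASCII.
func titleCase(name string) string {
	parts := strings.Split(name, "-")
	for i, p := range parts {
		if p == "" {
			continue
		}
		parts[i] = strings.ToUpper(p[:1]) + strings.ToLower(p[1:])
	}
	return strings.Join(parts, "-")
}

// normalizeFieldName returns the canonical spelling for known fields
// and falls back to title-case for unknown (extension) fields.
func normalizeFieldName(name string) string {
	if canonical, ok := knownFields[strings.ToLower(name)]; ok {
		return canonical
	}
	return titleCase(name)
}

func main() {
	fmt.Println(normalizeFieldName("warc-page-id"))   // known field: WARC-Page-ID
	fmt.Println(normalizeFieldName("WARC-Custom-ID")) // unknown field: Warc-Custom-Id
}
```

Without an entry in the known-field table, a name such as WARC-Page-ID would come out as Warc-Page-Id, which is the case change this PR avoids for the Browsertrix fields.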
@maeb maeb merged commit dbd27d8 into master May 6, 2024
3 checks passed
@maeb maeb deleted the feat/extension-fields branch May 6, 2024 08:59