quoted-string WARC header values are not parsed correctly #130

Open
JustAnotherArchivist opened this issue Jul 4, 2021 · 0 comments

warcio fails to parse this valid WARC record correctly:

import gzip
import io
import warcio.archiveiterator


# Ordinary header lines that are unrelated to the bug
noise0 = b'WARC/1.1\r\nWARC-Record-ID: <urn:uuid:fe4275e8-87bd-435c-a3ff-9586e86427be>\r\nWARC-Type: warcinfo\r\nContent-Length: 0\r\nWARC-Date: 2021-07-04T17:52:55Z\r\n'
# The header under test: a quoted-string containing a backslash-escaped LF
signal = b'WARC-Filename: "foo\\\nbar"\r\n'
noise1 = b'\r\n\r\n\r\n' # End of headers and end of record
f = io.BytesIO(gzip.compress(noise0 + signal + noise1))
for record in warcio.archiveiterator.WARCIterator(f):
    print(repr(record.rec_headers.get_header('WARC-Filename')))

The critical part here is the escaped line feed in the WARC-Filename value. WARC's quoted-string definition permits any valid UTF-8 character to be escaped with a backslash, including an LF. The correct value for the header would be, in Python notation, 'foo\nbar'. Instead, the code above prints '"foo\\'.
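
For illustration, here is a minimal sketch of the unquoting step that would yield the expected value. The function name `unquote_quoted_string` is hypothetical and not part of warcio's API, and it assumes the complete quoted-string (including both quotes) has already been isolated from the header block. It does not address the separate problem of locating the end of the header value, which presumably is where the line-based splitting goes wrong.

```python
def unquote_quoted_string(value: str) -> str:
    """Decode a quoted-string: drop the surrounding quotes and resolve
    backslash escapes. Hypothetical helper, not warcio code."""
    if len(value) < 2 or not value.startswith('"'):
        raise ValueError("not a quoted-string")
    out = []
    i = 1  # skip the opening quote
    while i < len(value):
        ch = value[i]
        if ch == "\\":
            # A backslash escapes the next character, whatever it is (LF included).
            if i + 1 >= len(value):
                raise ValueError("dangling backslash")
            out.append(value[i + 1])
            i += 2
        elif ch == '"':
            # An unescaped quote must be the final character.
            if i != len(value) - 1:
                raise ValueError("data after closing quote")
            return "".join(out)
        else:
            out.append(ch)
            i += 1
    raise ValueError("unterminated quoted-string")


# The value from the record above: "foo, backslash, LF, bar"
print(repr(unquote_quoted_string('"foo\\\nbar"')))
```

With the value from the example record, this prints 'foo\nbar'.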

WARC-Filename and Content-Type are the only official fields that may contain quoted-strings, but any unofficial field can also use them.

Cf. iipc/warc-specifications#71 and iipc/warc-specifications#72 for bugs in the standard related to this.
