quoted-string WARC header values are not parsed correctly #130

JustAnotherArchivist · 2021-07-04T18:06:57Z

warcio fails to parse this valid WARC record correctly:

import gzip
import io
import warcio.archiveiterator


noise0 = b'WARC/1.1\r\nWARC-Record-ID: <urn:uuid:fe4275e8-87bd-435c-a3ff-9586e86427be>\r\nWARC-Type: warcinfo\r\nContent-Length: 0\r\nWARC-Date: 2021-07-04T17:52:55Z\r\n'
# The value is a quoted-string containing a backslash-escaped LF (a quoted-pair).
signal = b'WARC-Filename: "foo\\\nbar"\r\n'
noise1 = b'\r\n\r\n\r\n'  # End of headers and end of record
f = io.BytesIO(gzip.compress(noise0 + signal + noise1))
for record in warcio.archiveiterator.WARCIterator(f):
    print(repr(record.rec_headers.get_header('WARC-Filename')))

The critical part here is the escaped line feed in the WARC-Filename value. WARC's quoted-string definition permits any valid UTF-8 character to be escaped with a backslash, including an LF. The correct value for the header would be, in Python notation, 'foo\nbar'. Instead, the code above prints '"foo\\'.
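For illustration, here is a minimal sketch of how a quoted-string value could be decoded once the full value has been read. The helper name `unquote_warc_value` is hypothetical and not part of warcio. It assumes the value starts and ends with a double quote and that every backslash introduces a quoted-pair.

```python
def unquote_warc_value(raw: str) -> str:
    """Decode a WARC quoted-string, resolving backslash escapes.

    Hypothetical helper for illustration; not part of warcio.
    """
    if len(raw) < 2 or raw[0] != '"' or raw[-1] != '"':
        return raw  # not a quoted-string; return it unchanged
    out = []
    chars = iter(raw[1:-1])
    for ch in chars:
        if ch == '\\':
            # quoted-pair: drop the backslash and keep the next character
            # literally (a dangling backslash is kept as-is)
            ch = next(chars, '\\')
        out.append(ch)
    return ''.join(out)


print(repr(unquote_warc_value('"foo\\\nbar"')))
```

With the value from the record above, this yields the expected 'foo\nbar'.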

WARC-Filename and Content-Type are the only official fields that may contain quoted-strings, but any unofficial field can use them as well.
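Since any field may carry a quoted-string, the handling arguably belongs in the generic line splitting rather than in per-field code. The following is a rough sketch of that idea, not warcio's implementation. The function name `split_header_lines` is hypothetical, and it assumes that every unescaped double quote opens or closes a quoted-string, which the standard does not state unambiguously.

```python
from typing import List


def split_header_lines(block: bytes) -> List[bytes]:
    """Split a header block on CRLF, ignoring line breaks inside quoted-strings.

    Hypothetical helper; assumes each unescaped '"' toggles quoted-string state.
    """
    lines = []
    start = 0
    i = 0
    in_quotes = False
    while i < len(block):
        byte = block[i:i + 1]
        if in_quotes and byte == b'\\':
            i += 2  # skip the backslash and the byte it escapes
            continue
        if byte == b'"':
            in_quotes = not in_quotes
        elif not in_quotes and block[i:i + 2] == b'\r\n':
            lines.append(block[start:i])
            i += 2
            start = i
            continue
        i += 1
    if start < len(block):
        lines.append(block[start:])  # trailing data without a final CRLF
    return lines


headers = b'WARC-Type: warcinfo\r\nWARC-Filename: "foo\\\nbar"\r\nContent-Length: 0\r\n'
for line in split_header_lines(headers):
    print(repr(line))
```

The escaped LF stays inside the WARC-Filename line instead of terminating it, which is the behaviour a readline-based parser does not provide.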

Cf. iipc/warc-specifications#71 and iipc/warc-specifications#72 for bugs in the standard related to this.

CorentinB mentioned this issue Sep 9, 2023

"warcio check" does not warn of illegal characters in field names or values, including LF #158

Closed
