warcio fails to parse this valid WARC record correctly:
import gzip
import io
import warcio.archiveiterator

noise0 = b'WARC/1.1\r\nWARC-Record-ID: <urn:uuid:fe4275e8-87bd-435c-a3ff-9586e86427be>\r\nWARC-Type: warcinfo\r\nContent-Length: 0\r\nWARC-Date: 2021-07-04T17:52:55Z\r\n'
signal = b'WARC-Filename: "foo\\\nbar"\r\n'
noise1 = b'\r\n\r\n\r\n'  # End of headers and end of record
f = io.BytesIO(gzip.compress(noise0 + signal + noise1))
for record in warcio.archiveiterator.WARCIterator(f):
    print(repr(record.rec_headers.get_header('WARC-Filename')))
The critical part here is the escaped line feed in the WARC-Filename value. WARC's quoted-string definition permits any valid UTF-8 character to be escaped with a backslash, including an LF. The correct value for the header would be, in Python notation, 'foo\nbar'. Instead, the code above prints '"foo\\'.
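To illustrate where a value like `'"foo\\'` can come from, here is a minimal sketch that splits the header bytes at the first LF before looking at quoting. This is only an assumption about the kind of line-based parsing that would produce the observed output, not a description of warcio's actual code path.

```python
# The header line from the report: a quoted-string containing an escaped LF.
signal = b'WARC-Filename: "foo\\\nbar"\r\n'

# Assumption: a line-oriented parser cuts the input at the first LF,
# without checking whether that LF sits inside a quoted-string.
first_line = signal.split(b'\n', 1)[0]

# Split the field name from the value at the first colon.
name, _, value = first_line.decode('utf-8').partition(':')
value = value.strip()

print(repr(value))  # '"foo\\'
```

The truncated value matches what the reproduction prints, which suggests the line boundary is decided before the quoted-string is interpreted.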
WARC-Filename and Content-Type are the only official fields which may contain quoted-strings, but any unofficial field can also use them.
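For reference, decoding a quoted-string as described above could look roughly like the following sketch. The function name `unquote_quoted_string` is hypothetical and not part of warcio; it assumes the complete field value (including the escaped LF) has already been isolated as a `str`.

```python
def unquote_quoted_string(value: str) -> str:
    """Strip the surrounding quotes and resolve backslash escapes.

    Hypothetical helper for illustration; not a warcio API.
    """
    if len(value) < 2 or value[0] != '"' or value[-1] != '"':
        raise ValueError('not a quoted-string')
    out = []
    chars = iter(value[1:-1])
    for ch in chars:
        if ch == '\\':
            # A backslash escapes whatever character follows it, LF included.
            try:
                ch = next(chars)
            except StopIteration:
                raise ValueError('dangling backslash') from None
        elif ch == '"':
            raise ValueError('unescaped quote inside quoted-string')
        out.append(ch)
    return ''.join(out)


print(repr(unquote_quoted_string('"foo\\\nbar"')))  # 'foo\nbar'
```

Applied to the WARC-Filename value from the reproduction, this yields the expected `'foo\nbar'`.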
Cf. iipc/warc-specifications#71 and iipc/warc-specifications#72 for bugs in the standard related to this.