warcio does not preserve HTTP header whitespace #129

Open
JustAnotherArchivist opened this issue May 27, 2021 · 3 comments

@JustAnotherArchivist
Contributor

JustAnotherArchivist commented May 27, 2021

import io
import warcio.warcwriter


# Write an uncompressed WARC into an in-memory buffer.
output = io.BytesIO()
writer = warcio.warcwriter.WARCWriter(output, gzip=False)
# Raw HTTP response: the custom header has two spaces after the colon and a trailing tab.
payload = io.BytesIO(b'HTTP/1.1 200 OK\r\nDate: Thu, 27 May 2021 22:03:54 GMT\r\nContent-Length: 0\r\nX-custom:  header with two spaces before the value and a tab after\t\r\n\r\n')
record = writer.create_warc_record('http://example.org/', 'response', payload=payload)
writer.write_record(record)
print(output.getvalue())

Expected output for the custom header (where \t is a literal tab):

X-custom:  header with two spaces before the value and a tab after\t

Actual output (only one space between the colon and the value, and the tab after the header is lost):

X-custom: header with two spaces before the value and a tab after

@ikreymer
Member

This is sort of an edge case. Leading whitespace was at one point used to indicate multi-line headers (now deprecated, though warcio still supports them). I'm not sure the whitespace is still significant from a parsing perspective.
Similar to #128, perhaps there could be a 'raw' mode flag that preserves the whitespace here if desired for when capturing HTTP traffic.
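
To make the 'raw' idea concrete, here is a minimal sketch of keeping the original bytes next to the parsed value. It is independent of warcio's actual parser: `RawHeader` and `split_header_line` are hypothetical names invented for illustration, not warcio APIs, and the sketch handles a single header line only.

```python
from typing import NamedTuple


class RawHeader(NamedTuple):
    """Hypothetical container: parsed name and value plus the untouched bytes."""
    name: str
    value: str
    raw: bytes


def split_header_line(line: bytes) -> RawHeader:
    """Parse one header line (without its CRLF) while keeping the original bytes."""
    name, _, rest = line.partition(b":")
    # Strip optional whitespace (space or tab) from the parsed value only;
    # the raw line is stored unchanged so it can be written back verbatim.
    value = rest.strip(b" \t")
    return RawHeader(name.decode("latin-1"), value.decode("latin-1"), line)


line = b"X-custom:  header with two spaces before the value and a tab after\t"
header = split_header_line(line)
print(repr(header.value))  # parsed value, whitespace stripped
print(header.raw == line)  # raw bytes survive, including the extra space and the tab
```

A writer in such a mode would emit `header.raw` instead of rebuilding the line from `name` and `value`.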

@ikreymer
Member

FWIW, I've never seen an HTTP server that returns a header like this, so (I hope) it's not very common :)

@JustAnotherArchivist
Contributor Author

The whitespace on the line with the field-name has never been significant semantically as far as I know. Neither the whitespace after the colon nor the whitespace at the end of the line is part of the actual field value content. And even with continuation lines: the optional whitespace at the end of a line, the CRLF, and the leading space/tab on the continuation line are together equivalent to a single space.
But yeah, same as #128, this is about correctly preserving the data sent by the server, not the semantic meaning. I've suggested a possible solution there because they are indeed very similar and have essentially the same root cause.
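
The distinction between semantic value and bytes on the wire can be illustrated with a short sketch. The helper `semantic_value` is a hypothetical name, and the regular expression is a simplified reading of the folding rule described above, not warcio's implementation.

```python
import re

# Optional trailing whitespace, CRLF, then at least one leading space or tab
# on the continuation line: an obsolete line fold.
_FOLD = re.compile(rb"[ \t]*\r\n[ \t]+")


def semantic_value(raw_field: bytes) -> bytes:
    """Return the field value as a parser would interpret it."""
    _, _, value = raw_field.partition(b":")
    unfolded = _FOLD.sub(b" ", value)  # each fold collapses to a single space
    return unfolded.strip(b" \t")      # surrounding whitespace is not part of the value


padded = b"X-custom:  some value\t"
plain = b"X-custom: some value"
folded = b"X-custom: some \t\r\n\t value"

# All three carry the same semantic value...
print(semantic_value(padded) == semantic_value(plain) == semantic_value(folded))
# ...but the bytes the server sent differ, which is what an archive should keep.
print(padded == plain)
```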

Yeah, it is fortunately not very common, but I have seen it before, sadly enough. There are a lot of weird HTTP servers out there that operate at the edges of or beyond the specifications...
