WARC-Cipher-Suite field proposal #86
Comments
Not specifically opposed to this, but is the cipher suite alone actually useful / actionable? The original issue was around storing the full SSL cert, which arguably has more value. What is the actual problem being solved? When crawling via a Chromium-based browser, it's possible to get full info about the cert, for example using:
which conveys two key properties: the issuer of the cert, and whether Chrome considers it Certificate Transparency-compliant. This could be used to distinguish MITM certs, for example.
I think separating certificates from TLS cipher info makes sense for several small reasons:
However, the big reason to separate them is that the number of questions and nuances about how to efficiently store X.509 certs in WARCs would overwhelm the questions about cipher suites if the issues were combined.
A lot of that is obviously up to the WARC creator, things like revisits are optional, and other things are extreme edge cases (e.g. client-side certificates). But there are a lot of decisions and complexity involved in storing the (often multiple) KB of certificates associated with web resources. Compare that with a WARC field whose value is smaller than a base-16 SHA-256 Content Digest field; I think it's helpful to separate them. 😀 (Personally I do want to hear thoughts on how to store certificates and whether things have changed since @JustAnotherArchivist's work a few years ago, but thought it made sense to do that in a separate issue.)
The new release of Wget-AT at https://github.com/ArchiveTeam/wget-lua/releases/tag/v1.21.3-at.20231213.01 now implements this WARC header. The decision was made to use Next to the Wget-AT is used for the Archive Team Warrior projects. As this new Wget-AT version is rolled out to all Warrior projects, the This release is the first of several releases to improve SSL/TLS recording in WARC records. The two new headers are seen as a 'minimal' representation of the details of the SSL/TLS session.
Since @acidus99 supports the name change, and the only software I could find using the original name WARC-TLS-Cipher-Suite is acidus99/Kennedy, I've edited the proposal to the new name WARC-Cipher-Suite. I've left the text of the definition as is, but suggestions for how to update the wording to cover the case of SSL would be welcome.
I updated Kennedy to use the new |
- HTTP headers: replace HTTP/2 and alike by HTTP/1.1 to ensure backward-compatibility for WARC readers, see iipc/warc-specifications#15
- store protocol versions and cipher suites in WARC headers WARC-Protocol and WARC-Cipher-Suite, see iipc/warc-specifications#42 iipc/warc-specifications#86
- allow multiple WARC headers of the same name (WARC-Protocol may occur twice to hold the HTTP and TLS version)
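The last item above means a reader cannot keep WARC headers in a plain dictionary keyed by field name. Here is a minimal Python sketch of a parser that preserves repeated fields. The function names and the sample values `h2` and `tls/1.3` are illustrative assumptions, not taken from any tool mentioned in this thread, and folded (multi-line) header values are not handled.

```python
def parse_warc_headers(block: str) -> list[tuple[str, str]]:
    """Parse 'Name: value' lines into an ordered list, keeping duplicates."""
    headers = []
    for line in block.splitlines():
        if not line.strip():
            break  # a blank line ends the header block
        name, sep, value = line.partition(":")
        if not sep:
            continue  # no colon: e.g. the 'WARC/1.1' version line
        headers.append((name.strip(), value.strip()))
    return headers


def get_all(headers: list[tuple[str, str]], name: str) -> list[str]:
    """Return every value of a field, matching the name case-insensitively."""
    wanted = name.lower()
    return [value for field, value in headers if field.lower() == wanted]


# Hypothetical record header with WARC-Protocol appearing twice.
sample = (
    "WARC/1.1\r\n"
    "WARC-Type: response\r\n"
    "WARC-Protocol: h2\r\n"
    "WARC-Protocol: tls/1.3\r\n"
    "WARC-Cipher-Suite: TLS_AES_256_GCM_SHA384\r\n"
    "\r\n"
)
headers = parse_warc_headers(sample)
print(get_all(headers, "WARC-Protocol"))
```

Storing the headers as a list of pairs rather than a mapping keeps both occurrences and their order, at the cost of a linear scan per lookup.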
This field was previously discussed by @ato @nlevitt and @JustAnotherArchivist on an issue in a different repository. That discussion intermixed many topics like the proposed WARC-Protocol field as well as storing X.509 certificates in metadata records. Adding this issue so the idea can be properly discussed and tracked for WARC 1.1+
Proposal
The WARC-Cipher-Suite field is the TLS cipher suite which was used to retrieve any included content. The TLS cipher suite shall be written as the IANA TLS Cipher Suites Value (e.g. TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384).
The WARC-Cipher-Suite field may be used on ‘response’, ‘resource’, ‘request’, ‘metadata’, and ‘revisit’ records, but shall not be used on ‘warcinfo’, ‘conversion’ or ‘continuation’ records.
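As a rough illustration of the two rules in the proposal text (which record types may carry the field, and how the value is written), a writer could guard the field with a small helper like the one below. This is a sketch under stated assumptions: `cipher_suite_field` is a hypothetical name, and the regular expression is only a loose shape check, not a lookup against the IANA registry. It would also reject any SSL-era names, which the proposal wording does not yet cover.

```python
import re

# Record types on which the proposal allows the field.
ALLOWED_RECORD_TYPES = {"response", "resource", "request", "metadata", "revisit"}

# Assumption: a loose shape check for IANA-style names such as
# TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384. It does not consult the registry.
_SUITE_NAME = re.compile(r"TLS_[A-Z0-9_]+")


def cipher_suite_field(record_type: str, suite: str) -> str:
    """Return the header line, or raise ValueError if a rule is violated."""
    if record_type not in ALLOWED_RECORD_TYPES:
        raise ValueError(f"WARC-Cipher-Suite not allowed on {record_type!r} records")
    if not _SUITE_NAME.fullmatch(suite):
        raise ValueError(f"{suite!r} does not look like an IANA cipher suite name")
    return f"WARC-Cipher-Suite: {suite}"


print(cipher_suite_field("response", "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384"))
```

Note that some TLS libraries report suites under their own naming scheme (for example OpenSSL-style `ECDHE-RSA-AES256-GCM-SHA384`), so a writer may need a translation step before emitting the field.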
Motivation
Storing the TLS parameters used to retrieve content is valuable for many use cases (research, archival/posterity, troubleshooting). For example, it could provide context for why a request doesn't have a corresponding response record. The proposed WARC-Protocol field is used to record the protocol version. The WARC-Cipher-Suite field augments this by including what cipher suite was used. As a bonus, the IANA already defines and standardizes the values of these cipher suites, and those values are already used internally by many tools (especially for more modern ciphers).
Background
Per this thread @nlevitt and @ato both liked the idea of recording TLS protocol and cipher info in a WARC file. @nlevitt originally proposed a single custom field that would include both the TLS protocol version and cipher suite that were negotiated. However given that the WARC-Protocol field was being planned separately @ato recommended using WARC-Protocol to record the TLS protocol version and a new field to record the cipher.
Questions
Should the field be named WARC-Cipher-Suite to future proof for other uses beyond TLS? The WARC-Protocol field defines what protocol is used (FTP, TLS, or even a successor). This cipher suite field is an additional/optional field, applicable only when used with a WARC-Protocol value that supports encryption, recording what cipher suite was used. Baking "TLS" into the field name may cause a problem in the future. (I can't help but think of software and standards that still use the "SSL Certificate" or "SSL connection" terminology 🤮)
Edited 2023-12-19 by @ato: Renamed from WARC-TLS-Cipher-Suite to WARC-Cipher-Suite as implemented by @Arkiver2 in Wget-AT and agreed to by @acidus99