Record SSL certificates? #13
Comments
There hasn't been any internal work or discussion, nor does Heritrix currently record certificates. I think it's a great idea though. The standard Python ssl library has getpeercert(), but doesn't have a way to get the full cert chain. It looks like that can be done with the pyOpenSSL library, which warcprox already requires. But it might be enough to record only the peer cert for now.

The main question is how best to record the information in the WARC. There are a number of ways we could do it. My inclination at the moment is to write a special metadata record alongside the first URL recorded with a given certificate. Subsequent URLs from the same host could reference that metadata record. The content of the record could be just the cert in PEM format. Even better would be something that looks like the output of "openssl x509 -text ..." (which includes the cert in PEM format). What do you think?
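As a rough illustration of the idea above, here is a minimal Python sketch that converts a DER-encoded peer certificate to PEM and tracks which certificates have already been given a metadata record, so later URLs can reference the earlier record. The `CertRecordIndex` class, the `urn:uuid:` record IDs, and keying on the SHA-256 fingerprint are all assumptions made for illustration; none of this is existing warcprox code.

```python
import hashlib
import ssl
import uuid


class CertRecordIndex:
    """Hypothetical helper: maps certificate fingerprints to metadata record IDs."""

    def __init__(self):
        self._record_ids = {}

    def record_for(self, der_cert):
        """Return (record_id, pem, is_new) for a DER-encoded certificate.

        is_new is True only the first time a given certificate is seen, which
        is when the caller would write the metadata record holding the PEM.
        """
        fingerprint = hashlib.sha256(der_cert).hexdigest()
        is_new = fingerprint not in self._record_ids
        if is_new:
            # Assumed ID format, mirroring the usual WARC-Record-ID style.
            self._record_ids[fingerprint] = "<urn:uuid:%s>" % uuid.uuid4()
        pem = ssl.DER_cert_to_PEM_cert(der_cert)
        return self._record_ids[fingerprint], pem, is_new
```

In a proxy, `der_cert` would come from `ssl_socket.getpeercert(binary_form=True)` on the upstream connection, and records for later URLs served with the same certificate could refer to the returned record ID.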
After having wanted to investigate this for a long time, I just finally looked a bit into it since I'd like to add TLS certificate records to other WARC-writing tools written in Python and thought I'd share the key findings here. While pyOpenSSL does have the

Regarding the actual WARC records, I basically arrived at the same idea as you did. On every newly established TLS connection, a metadata record containing the certificate would be written (optionally deduped against previous such records using a

This seems like the wrong place for discussion about the details (since it's not only about warcprox), but the warc-specifications issue was basically closed with "implementations first please", so I don't know where it would be appropriate.
Hey thanks for the research @JustAnotherArchivist.
Why does this come into play? Warcprox doesn't validate certificates currently. https://github.com/internetarchive/warcprox/blob/f77c152037/warcprox/mitmproxy.py#L303 Does
I like the idea of recording TLS protocol and cipher info. I think it can be expressed briefly enough to fit in a single WARC header on each WARC record, though.
Drawing inspiration from that, we could do something like
I think this is a fine place for this discussion. So is the #warc channel on the IIPC Slack. I'm having trouble finding the warc-specifications issue; do you have a link? To me that seems like a good place too, especially if framed as discussion / WIP.
I do too. It looks like there's some variation in the way cipher suites are named. It looks like the
Also noting for discussion: there's an existing header proposal (WARC-Protocol) which includes the TLS version but not the cipher suite:
Thanks @ato, that's really useful information. I suppose we should try to use the IANA names. Interestingly, I get an IANA name from ssl_socket.cipher() on my Mac with Python 3.7:
AFAICT it is using OpenSSL, not some other library. On Linux with Python 3.5, I see what you see
We might have to hardcode a mapping in warcprox. I found these relevant resources:
Would you propose we put this info in the WARC-Protocol header?
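For illustration, a hardcoded mapping of the kind mentioned above might look something like the following Python sketch. The table is deliberately partial (a handful of common suites), and the function name and the rule "names starting with `TLS_` are already IANA names" are assumptions for the sake of the example, not anything warcprox implements.

```python
# Partial, illustrative table: OpenSSL cipher names -> IANA cipher suite names.
OPENSSL_TO_IANA = {
    "ECDHE-RSA-AES128-GCM-SHA256": "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
    "ECDHE-RSA-AES256-GCM-SHA384": "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
    "ECDHE-ECDSA-AES128-GCM-SHA256": "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256",
    "AES128-SHA": "TLS_RSA_WITH_AES_128_CBC_SHA",
}


def iana_cipher_name(openssl_name):
    """Return the IANA name for a cipher name such as ssl_socket.cipher()[0].

    Returns None when the name is not in the (incomplete) table, so callers
    can decide whether to fall back to the OpenSSL name or omit the field.
    """
    if openssl_name.startswith("TLS_"):
        # Assumption: TLS_-prefixed names (e.g. TLS 1.3 suites) are already IANA names.
        return openssl_name
    return OPENSSL_TO_IANA.get(openssl_name)
```

A real table would need to cover every suite the local OpenSSL build can negotiate, which is why generating it from one of the published registries may be preferable to maintaining it by hand.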
Hmm. Looks like OpenSSL uses the IANA names for TLS 1.3 ciphers but its own names for older ciphers. This is on Fedora with OpenSSL 1.1.1d:
The command

There's a PEP for a unified Python TLS API which mentions the name problem, but it's still a draft:
Nah, adding protocol-specific details would probably make WARC-Protocol unnecessarily complex to parse. I think we should either amend the WARC-Protocol proposal not to cover TLS or use two separate headers like:
I don't have a strong preference either way. I originally included TLS in WARC-Protocol for completeness and to try to represent protocol layering, but it may instead be simpler to treat TLS separately, as it's not the application layer protocol.

Edit 2023-12-19: See the WARC-Cipher-Suite proposal in iipc/warc-specifications#86
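To make the "two separate headers" option concrete, here is a small Python sketch that builds both headers from the values Python's `SSLSocket.version()` and `SSLSocket.cipher()[0]` return. The header names come from the proposals mentioned in this thread, but the value syntax used here (`tls/1.3` and a bare cipher suite name) and the `tls_warc_headers` function are assumptions for illustration; the proposals themselves define the actual format.

```python
import re


def tls_warc_headers(tls_version, cipher_suite):
    """Build hypothetical WARC headers describing a TLS connection.

    tls_version is a string like "TLSv1.3" (as from SSLSocket.version()),
    cipher_suite a name like "TLS_AES_256_GCM_SHA384". Either may be None.
    """
    headers = {}
    match = re.fullmatch(r"TLSv(\d+(?:\.\d+)?)", tls_version or "")
    if match:
        number = match.group(1)
        if "." not in number:
            number += ".0"  # "TLSv1" -> "1.0"
        # Assumed value syntax, not taken from the proposal text.
        headers["WARC-Protocol"] = "tls/" + number
    if cipher_suite:
        headers["WARC-Cipher-Suite"] = cipher_suite
    return headers
```

Keeping the cipher suite in its own header means a parser of WARC-Protocol never has to understand TLS-specific details, which is the concern raised above.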
I'm curious about teaching warcprox to record SSL certificates. Has there been any internal work or discussion you can share? Do any other crawlers (Heritrix) currently record certificates?
Cf. this discussion of the WARC spec: iipc/warc-specifications#12