Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add additional columns to Almanac Requests table #1293

Closed
tunetheweb opened this issue Sep 15, 2020 · 14 comments
Closed

Add additional columns to Almanac Requests table #1293

tunetheweb opened this issue Sep 15, 2020 · 14 comments
Assignees
Labels
analysis Querying the dataset
Milestone

Comments

@tunetheweb
Copy link
Member

tunetheweb commented Sep 15, 2020

The httparchive.almanac.requests table includes a respHttpVersion and a reqHttpVersion column neither of which contain any meaningful data.

All the HTTP/2 queries last year were based on the JSON_EXTRACT_SCALAR(payload, "$._protocol") column which does appear to be mostly accurate. We should add the protocol column to the httparchive.almanac.requests table and/or replace the respHttpVersion and a reqHttpVersion columns with this data to make it easier and cheaper to query HTTP version requests.

Edit. Complete list of headers we should add if possible:

  • $._protocol
  • $._was_pushed
  • $._tls_version
  • $._tls_cipher_suite
  • $._securityDetails.issuer
  • $._securityDetails.keyExchange
  • $._securityDetails.cipher
  • $._securityDetails.protocol - not sure how this is different than TLS Version to be honest
  • $.request.headers - in JSON format rather than like reqOtherHeaders
  • $.response.headers - in JSON format rather than like respOtherHeaders
  • CrUX rank from summary_pages

Also fix the startedDateTime field as off by 1000.

CC: @gregorywolf

@tunetheweb tunetheweb added the analysis Querying the dataset label Sep 15, 2020
@tunetheweb tunetheweb added this to the 2020 Analysis milestone Sep 15, 2020
@rviscomi
Copy link
Member

rviscomi commented Sep 15, 2020

Yeah I don't know why those fields have such weird values. I think making the protocol field more easily accessible in the request data sounds like a great idea. @tunetheweb @gregorywolf is this something you'd like to see done ASAP in order to update the HTTP/2 chapter queries or is it for 2021+?

@tunetheweb
Copy link
Member Author

I've checked and they are wrong in the HAR.

If we could update it ASAP, it would be really useful and save @gregorywolf (and therefore your coupons!) a lot. However if that's a lot of effort than can just query this from the payload column. It's just of the 17 queries @gregorywolf has so far, 16 are doing a 7.5TB query each for the payload data! 😞 So we should maybe thing about joining queries or creating another temp table if we don't want to touch requests table now. From a quick look I reckon about half the queries would no longer need to use this expensive payload column if we had this.

And if we added these columns while we are at it, then perhaps none would need to look at the payload column at all:

  • _was_pushed
  • response.headers.link - useful for Resource Hints chapter too
  • response.headers.alt-svc - could be come important in future as QUIC and HTTP/3 takes hold
  • response.headers.upgrade
  • _tls_version - useful for security chapter too

But some of these might be less useful so maybe should just take the hit there rather than clogging up the table with so many columns.

@tunetheweb tunetheweb mentioned this issue Sep 15, 2020
17 tasks
@tunetheweb
Copy link
Member Author

Actually I was wrong - all 17 queries need this expensive column at the moment and not just 16/17 ☺️

@gregorywolf
Copy link
Contributor

gregorywolf commented Sep 16, 2020

@rviscomi I think getting this change done ASAP would be great. I also worked with @paulcalvano this morning and he created a new sample_data.requests_withprotocol table which confirmed that there is a massive reduction in cost. In addition to the protocol field it would be great to add the other fields that @tunetheweb has identified in the above comment

@rviscomi
Copy link
Member

I'm supportive of this change. It'd be good to see if there are any other easy optimization wins that we can do all at once.

FWIW I've made it about halfway through running/optimizing the H2 queries. The results are saved to the H2 sheet, so we don't need to rerun those queries unless there are any other structural changes.

@tunetheweb
Copy link
Member Author

Here's some more fields from security chapter last year:

  • $._tls_cipher_suite
  • $._securityDetails.issuer
  • $._securityDetails.keyExchange
  • $._securityDetails.cipher
  • $._securityDetails.protocol - not sure how this is different than TLS Version to be honest

I also discovered they are looking at the respOtherHeaders column. Would that contain all the response headers I listed in the first comment above? If so don't need them added.

@gregorywolf
Copy link
Contributor

gregorywolf commented Sep 16, 2020

@rviscomi Thank you for running the queries for me and dropping the results into the H2 sheet. I am reviewing all of @tunetheweb comments in my PR. There may be reason to rerun a couple of the queries (i.e., combining QUIC & http/2+quic/46 as H3). If/when I decide to rerun some queries should I ask you or @paulcalvano or will there be a new table created that contains the fields of interest as requested above?

@tunetheweb
Copy link
Member Author

tunetheweb commented Sep 18, 2020

I also discovered they are looking at the respOtherHeaders column. Would that contain all the response headers I listed in the first comment above? If so don't need them added.

To answer my own question, it looks like it does! @gregorywolf you should move your queries to using these instead of the payload field. Made one such suggestion on your open PR but should replicate for the rest.

@paulcalvano / @rviscomi I've updated the first comment with the complete list as far as I can see. Can we get requests table updated with these? And also update the request table definition at the same time.

@rviscomi
Copy link
Member

rviscomi commented Sep 18, 2020

Chatted with @gregorywolf about this. Given that we're so far info analysis, I wouldn't want to make analysts rewrite their queries to be more efficient and use these fields. In the H2 queries, I've already run them and saved results, so efficiency isn't a blocker there. If any analysts are running out of quota due to these expensive queries, I'm happy to do the query execution using my unlimited quota.

I do think these are good changes to make for 2021 analysis, so we should update the requests query. But I don't want to make this an urgent change for 2020 analysis.

@tunetheweb
Copy link
Member Author

tunetheweb commented Sep 20, 2020

FYI: I think I've found the reason why the HTTP Version fields are not set correctly and what the odd values are.

Most of them are HTTP/2 pseudo fields incorrectly being parsed under the assumption they are the initial HTTP/1 lines (GET / HTTP/1.1 or HTTP/1.1 200), so it grabs characters 6-8, so :method: becomes od:, :authority: becomes ori (originally I thought this was short for origin!) and status: becomes us: (originally thought this was for USA and being picked up from accept-language field!). So basically all of those are HTTP/2 or HTTP/3.

The blank ones are HTTP/1.1 where the blank line at the end was being incorrectly parsed to get the version.

I've submitted a PR to fix this so probably wanna keep the reqHttpVersion and respHttpVersion fields in our requests table if that gets accepted.

Still think the other columns (including protocol column) would be good to add anyway.

@rviscomi
Copy link
Member

Great catch, and thanks for the fix!

@rviscomi rviscomi modified the milestones: 2020 Analysis, 2020 Backlog Sep 20, 2020
@tunetheweb tunetheweb changed the title Add protocol value to Almanac Requests table Add additional columns to Almanac Requests table Oct 1, 2020
@tunetheweb
Copy link
Member Author

@rviscomi what do you think about adding reqHeadersJSON and respHeadersJSON columns too?

Speaking to @tomvangoethem as his Security PR has a lot of querying of those and the reqOtherHeaders and respOtherHeaders columns are a bit flakey to parse if header contains a comma so easier to parse in JSON.

We could also replace reqOtherHeaders and respOtherHeaders columns with JSON columns, though that would break some existing queries so may be easier to add columns despite the duplication.

Also presume this is for next year? In which case we'll need to run @tomvangoethem 's queries for him this year if you're OK with that.

@rviscomi
Copy link
Member

rviscomi commented Oct 1, 2020

I'm open to more granular summary data, but let's not block on it for this year.

@tunetheweb
Copy link
Member Author

OK added those two columns to first comment for when we return to this later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
analysis Querying the dataset
Projects
None yet
Development

No branches or pull requests

4 participants