Generate 2020 almanac tables on BigQuery #1258

Closed · 7 tasks done
rviscomi opened this issue Aug 26, 2020 · 31 comments
Labels: analysis (Querying the dataset), project management (Keeping the ball rolling)
Milestone: 2020 Analysis

@rviscomi (Member) commented Aug 26, 2020

The 2020_08_01 HTTP Archive crawl completed yesterday and the tables are available on BigQuery. However, to facilitate Web Almanac analysis, we reorganize the data into the almanac dataset to make the results more efficient to query.

@paulcalvano and I will be prepping this dataset with the 2020 results. The existing tables already contain 2019 data, but they don't necessarily make that clear. We should retain the 2019 data and alter the table schemas to add a new date field to distinguish the annual editions (a rough sketch of this schema change follows the table list below):

  • parsed_css
    • Use Rework CSS to parse CSS bodies and save resulting JSON to the table
    • Note: this table already contains a date column
    • See this thread for more context behind the process to generate this table
    • Parse both stylesheets and inline style blocks
  • requests
  • summary_response_bodies
    • Combination of summary_requests metadata with response_bodies blobs
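
As a rough sketch, the schema change for each table could look something like the following, assuming current BigQuery DDL support; the table name and backfill date here are illustrative:

#standardSQL
-- Hypothetical sketch: add a date column to an existing almanac table
-- and stamp the pre-existing rows as 2019 data (names illustrative).
ALTER TABLE `httparchive.almanac.summary_response_bodies`
ADD COLUMN date DATE;

UPDATE `httparchive.almanac.summary_response_bodies`
SET date = DATE '2019-07-01'
WHERE date IS NULL;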

There are also a couple of externally-sourced tables:

  • third_parties
  • h2_prioritization_cdns
    • The table is currently named h2_prioritization_cdns_201909 and is in use by the 2019 HTTP/2 metric 20_07.sql
    • Not sure if this data will be useful again this year; if so, @paulcalvano may know how to regenerate it with the latest data
    • The table schema should be altered to be partitioned by a date field so we can distinguish between annual versions

And there are a couple of convenience tables that may or may not need to be updated, depending on 2020 usage:

  • manifests
  • service_workers

I'd like to explore whether it's feasible to combine the request/response tables into a single table that contains the summary metadata, request payload, and response bodies. That way there would be no SQL joins to contend with for the analysts. The tables would be enormous but AFAIK BigQuery only bills for the columns used, so queries that don't require the bodies would be much cheaper. Not sure if performance is worse.
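
For illustration only, a combined table along those lines might be declared something like this; the name and exact column set are guesses rather than a final schema:

#standardSQL
-- Hypothetical sketch of a single combined requests table (all names assumed).
-- BigQuery bills per column scanned, so queries that skip `payload` and
-- `body` would not pay for those (huge) columns.
CREATE TABLE `httparchive.almanac.requests_combined`
(
  date DATE,
  client STRING,
  page STRING,
  url STRING,
  firstHtml BOOL,
  type STRING,
  payload STRING,  -- request/response HAR metadata as JSON
  body STRING      -- raw response body
)
PARTITION BY date
CLUSTER BY client, firstHtml, type;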

@rviscomi added the analysis (Querying the dataset) and project management (Keeping the ball rolling) labels Aug 26, 2020
@rviscomi added this to the 2020 Analysis milestone Aug 26, 2020
@rviscomi (Member, Author)

I'm replacing the requests table with the contents of requests3, which was the third and most accurate representation of the summary+JSON data. I'm also including a new field requestId which is extracted from the JSON. There are 11 queries from 2019 that reference requests3. These should all be updated to refer to requests instead.
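
For reference, extracting that field from the JSON would look roughly like this; the JSON path is an assumption for illustration, not necessarily the payload's actual field name:

#standardSQL
-- Hypothetical sketch: pull a request ID out of the HAR payload JSON.
SELECT
  JSON_EXTRACT_SCALAR(payload, '$._request_id') AS requestId,
  url
FROM `httparchive.almanac.requests`
WHERE date = '2020-08-01'
LIMIT 10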

I plan to regenerate the response_bodies tables for the 2020_08_01 crawl with a change to the Dataflow pipeline so that the requestId from the request is included with the response bodies. This will make it possible to accurately join requests with response bodies for the summary_response_bodies table. Otherwise it's not possible to disambiguate responses to repeated/redundant requests.
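
Once both sides carry the requestId, the join behind summary_response_bodies becomes unambiguous. A sketch, with assumed table and column names:

#standardSQL
-- Hypothetical sketch of the requestId-based join (table and column
-- names assumed for illustration).
SELECT
  r.client, r.page, r.url, b.body
FROM `httparchive.almanac.requests` r
JOIN `httparchive.almanac.response_bodies` b
  ON b.date = r.date
  AND b.client = r.client
  AND b.requestId = r.requestId
WHERE r.date = '2020-08-01'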

@tunetheweb (Member)

> And there are a couple of convenience tables that may or may not need to be updated, depending on 2020 usage:
>
> • manifests
> • service_workers

Was looking over the 2019 PWA queries and I think it's highly likely this data will be asked for in 2020, as these tables form the basis of most of the 2019 queries. I gather they are just subsets of the response_bodies tables, made to allow cheaper querying?

@tunetheweb (Member)

The PWA authors confirmed they do want the same stats as last year, so let us know if it's possible to date-stamp those two tables and add this year's stats.

Also see this comment:

> @rviscomi would it be possible to create an almanac.response_bodies_scripts table of just the initial HTML (in case of inline <script> tags) and script resources to cut this down as much as possible? Or maybe we should have almanac.response_bodies_firsthtml and almanac.response_bodies_scripts?
>
> Or is there a way to partition the response_bodies table on the firsthtml and type columns so you only pay for the rows you're interested in, without having to duplicate the data?

@rviscomi (Member, Author)

The summary_response_bodies table is already clustered on firstHtml and type so this should be efficient.

@rviscomi (Member, Author)

As for generating the manifests table, I'd imagine this would be straightforward by parsing out the href value of link[rel=manifest] and saving the corresponding response body. We should have this link data somewhere in the Almanac custom metrics.
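
A rough sketch of that extraction, if we went straight at the raw HTML instead of the custom metrics; the regex is simplified for illustration and assumes rel appears before href:

#standardSQL
-- Hypothetical sketch: extract the manifest URL from first-party HTML.
-- A real implementation should handle attribute order and relative URLs.
SELECT
  client,
  page,
  REGEXP_EXTRACT(body, r'''<link[^>]*rel=["']?manifest["']?[^>]*href=["']?([^"'\s>]+)''') AS manifest_url
FROM `httparchive.almanac.summary_response_bodies`
WHERE
  date = '2020-08-01'
  AND firstHtml
  AND REGEXP_CONTAINS(body, r'''rel=["']?manifest''')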

Not so sure how to best generate the service_workers table, since they are registered in JS and trickier to parse. We reached out to the Lighthouse team to provide more metadata in their results, which they fixed recently, but I don't think it was in time for the August crawl: GoogleChrome/lighthouse#9683. @tomayac or @paulcalvano did one of you generate this table and do you remember how it was done?

@tunetheweb (Member)

> The summary_response_bodies table is already clustered on firstHtml and type so this should be efficient.

I'm trying this:

SELECT
   client, page, type, body
FROM
  `httparchive.almanac.summary_response_bodies`
WHERE
  date = '2020-08-01' AND
  (type = 'script' OR firstHTML = true)

And it's telling me "This query will process 25.4 TB when run." 😱

This query, on the other hand, produces a much more reasonable 24.5 GB, but I need the bodies for some of the queries:

SELECT
   client, page, type, count(1)
FROM
  `httparchive.almanac.summary_response_bodies`
WHERE
  date = '2020-08-01' AND
  (type = 'script' OR firstHTML = false)
 GROUP BY
   client, page, type

@tunetheweb mentioned this issue Sep 10, 2020
@rviscomi (Member, Author)

~25 TB is the size of the entire 2020-08-01 partition. BigQuery gives an upper estimate of the query size without knowing how efficient the clustering will be until runtime. So the query should only incur the cost to process HTML and JS responses, which are still maybe ~75% of all responses considering that binary files are excluded. So it's efficient in the sense that it doesn't query all response bodies, but if you're still querying most rows it will be expensive, which is why we discourage this approach in favor of custom metrics.

@tunetheweb (Member)

Ah, good to know about binary files not being included. Makes sense, but I was hoping that would reduce this.

And it does look like JS is the vast majority of requests by volume - and probably even more so by size :-(

[Chart: Response Bodies Request Size]

SELECT
   client, type, firstHtml, count(1) AS count
FROM
  `httparchive.almanac.summary_response_bodies`
WHERE
  date = '2020-08-01'
GROUP BY
   client, type, firstHtml
ORDER BY
   count(1) DESC

@tunetheweb (Member) commented Sep 10, 2020

> Not so sure how to best generate the service_workers table, since they are registered in JS and trickier to parse. We reached out to the Lighthouse team to provide more metadata in their results, which they fixed recently, but I don't think it was in time for the August crawl: GoogleChrome/lighthouse#9683. @tomayac or @paulcalvano did one of you generate this table and do you remember how it was done?

I found this post from @tomayac and from that I created the SQL below:

#standardSQL
CREATE TEMPORARY FUNCTION
  pathResolve(path1 STRING,
    path2 STRING)
  RETURNS STRING
  LANGUAGE js AS """
  function normalizeStringPosix(e,t){for(var n="",r=-1,i=0,l=void 0,o=!1,h=0;h<=e.length;++h){if(h<e.length)l=e.charCodeAt(h);else{if(l===SLASH)break;l=SLASH}if(l===SLASH){if(r===h-1||1===i);else if(r!==h-1&&2===i){if(n.length<2||!o||n.charCodeAt(n.length-1)!==DOT||n.charCodeAt(n.length-2)!==DOT)if(n.length>2){for(var g=n.length-1,a=g;a>=0&&n.charCodeAt(a)!==SLASH;--a);if(a!==g){n=-1===a?"":n.slice(0,a),r=h,i=0,o=!1;continue}}else if(2===n.length||1===n.length){n="",r=h,i=0,o=!1;continue}t&&(n.length>0?n+="/..":n="..",o=!0)}else{var f=e.slice(r+1,h);n.length>0?n+="/"+f:n=f,o=!1}r=h,i=0}else l===DOT&&-1!==i?++i:i=-1}return n}function resolvePath(){for(var e=[],t=0;t<arguments.length;t++)e[t]=arguments[t];for(var n="",r=!1,i=void 0,l=e.length-1;l>=-1&&!r;l--){var o=void 0;l>=0?o=e[l]:(void 0===i&&(i=getCWD()),o=i),0!==o.length&&(n=o+"/"+n,r=o.charCodeAt(0)===SLASH)}return n=normalizeStringPosix(n,!r),r?"/"+n:n.length>0?n:"."}var SLASH=47,DOT=46,getCWD=function(){return""};if(/^https?:/.test(path2)){return path2;}if(/^\\//.test(path2)){return path1+path2.substr(1);}return resolvePath(path1, path2).replace(/^(https?:\\/)/, '$1/');
""";
SELECT
  DISTINCT
  client,
  REGEXP_REPLACE(page, "^http:", "https:") AS page,
  pathResolve(REGEXP_REPLACE(page, "^http:", "https:"),
    REGEXP_EXTRACT(body, "navigator\\.serviceWorker\\.register\\s*\\(\\s*[\"']([^\\),\\s\"']+)")) AS url,
  body
FROM
  `httparchive.almanac.summary_response_bodies`
WHERE
  date = '2020-08-01' AND
  (REGEXP_EXTRACT(body, "navigator\\.serviceWorker\\.register\\s*\\(\\s*[\"']([^\\),\\s\"']+)") IS NOT NULL
    AND REGEXP_EXTRACT(body, "navigator\\.serviceWorker\\.register\\s*\\(\\s*[\"']([^\\),\\s\"']+)") != "/")

Was that what you did last year, @tomayac?

@tunetheweb (Member)

And BTW there's similar code in that post to generate the manifest data. It looks like he got both in one go there, but I'd imagine that only works for service workers defined on the main index page in inline <script> tags, since it needs both <link rel="manifest" ...> and navigator.serviceWorker.register in the same response body. Doing it separately would pick up service workers defined in external .js files as well.

@rviscomi (Member, Author)

Good find! That's from 2018 but I assume it was reused for the 2019 Almanac. I'll try running it and report back here if it works.

@tomayac (Member) commented Sep 11, 2020

Awesome, you figured it out based on the traces I left in the article, all while I was sleeping :-) So, yeah, what @bazzadp found in #1258 (comment) is exactly the way I did it and would do it again in 2020 if I needed to…

@tunetheweb (Member)

Cool, thanks for confirming. BTW, the SQL in the original post on your blog has lost some of the escapes and so isn't recognised as valid SQL by BigQuery, in case you want to fix it. The SQL in the Medium article seems to work though.

@tomayac (Member) commented Sep 11, 2020

Yeah, Medium is the authoritative source in this case. The export to my origin was lossy, unfortunately - tracked as tomayac/blogccasion#3, which I will deal with some time in the future when I will have time™.

@tunetheweb (Member) commented Sep 11, 2020

> • h2_prioritization_cdns
>   • The table is currently named h2_prioritization_cdns_201909 and is in use by the 2019 HTTP/2 metric 20_07.sql
>   • Not sure if this data will be useful again this year; if so, @paulcalvano may know how to regenerate it with the latest data
>   • The table schema should be altered to be partitioned by a date field so we can distinguish between annual versions

BTW the data for this comes from here: https://github.com/andydavies/http2-prioritization-issues#cdns--cloud-hosting-services. It looks like there have been only a very small number of updates since last year, so someone just needs to add a date column and convert the table to SQL inserts.

@rviscomi (Member, Author) commented Sep 11, 2020

Thanks @bazzadp! I created a new dated table h2_prioritization_cdns and added 2019 and 2020 data.

Here's the script snippet I used on the GitHub page, with $0 holding the reference to the <tbody> of the table:

copy(Array.from($0.querySelectorAll('tr')).map(tr => {
  // Take the first word of the second cell (e.g. "pass"/"fail") and capitalize it.
  var passFail = tr.querySelector('td:nth-child(2)').textContent.split(' ')[0].toLowerCase().split('');
  passFail[0] = passFail[0].toUpperCase();
  passFail = passFail.join('');
  // Emit one BigQuery STRUCT literal per CDN row, quoting values via JSON.stringify.
  return `STRUCT<cdn STRING, prioritization_status STRING>(${[tr.querySelector('td:first-child').textContent, passFail].map(JSON.stringify).join(', ')})`;
}).join(`,
`))

Then in BigQuery I appended the output of this query to the table:

SELECT CAST('2020-08-01' AS DATE) AS date, *
FROM UNNEST([
STRUCT<cdn STRING, prioritization_status STRING>("Akamai", "Pass"),
STRUCT<cdn STRING, prioritization_status STRING>("Amazon CloudFront", "Fail"),
STRUCT<cdn STRING, prioritization_status STRING>("BitGravity", "Fail"),
STRUCT<cdn STRING, prioritization_status STRING>("Cachefly", "Fail"),
STRUCT<cdn STRING, prioritization_status STRING>("CDN77", "Fail"),
STRUCT<cdn STRING, prioritization_status STRING>("CDNetworks", "Fail"),
STRUCT<cdn STRING, prioritization_status STRING>("CDNsun", "Pass"),
STRUCT<cdn STRING, prioritization_status STRING>("ChinaCache", "Fail"),
STRUCT<cdn STRING, prioritization_status STRING>("Cloudflare", "Pass"),
STRUCT<cdn STRING, prioritization_status STRING>("DreamHost", "Pass"),
STRUCT<cdn STRING, prioritization_status STRING>("Edgecast", "Fail"),
STRUCT<cdn STRING, prioritization_status STRING>("Facebook", "Pass"),
STRUCT<cdn STRING, prioritization_status STRING>("Fastly", "Pass"),
STRUCT<cdn STRING, prioritization_status STRING>("Google Cloud CDN", "Fail"),
STRUCT<cdn STRING, prioritization_status STRING>("Google Firebase", "Pass"),
STRUCT<cdn STRING, prioritization_status STRING>("Google Storage", "Fail"),
STRUCT<cdn STRING, prioritization_status STRING>("Highwinds", "Fail"),
STRUCT<cdn STRING, prioritization_status STRING>("Incapsula", "Fail"),
STRUCT<cdn STRING, prioritization_status STRING>("Instart Logic", "Fail"),
STRUCT<cdn STRING, prioritization_status STRING>("KeyCDN", "Fail"),
STRUCT<cdn STRING, prioritization_status STRING>("LeaseWeb CDN", "Fail"),
STRUCT<cdn STRING, prioritization_status STRING>("Level 3", "Fail"),
STRUCT<cdn STRING, prioritization_status STRING>("Limelight", "Fail"),
STRUCT<cdn STRING, prioritization_status STRING>("Medianova", "Fail"),
STRUCT<cdn STRING, prioritization_status STRING>("Microsoft Azure", "Fail"),
STRUCT<cdn STRING, prioritization_status STRING>("Netlify", "Fail"),
STRUCT<cdn STRING, prioritization_status STRING>("Reflected Networks", "Fail"),
STRUCT<cdn STRING, prioritization_status STRING>("Rocket CDN", "Fail"),
STRUCT<cdn STRING, prioritization_status STRING>("section.io", "Pass"),
STRUCT<cdn STRING, prioritization_status STRING>("Sucuri Firewall", "Fail"),
STRUCT<cdn STRING, prioritization_status STRING>("StackPath/NetDNA/MaxCDN", "Fail"),
STRUCT<cdn STRING, prioritization_status STRING>("WordPress.com", "Pass"),
STRUCT<cdn STRING, prioritization_status STRING>("WordPress.com Jetpack CDN (Photon)", "Fail"),
STRUCT<cdn STRING, prioritization_status STRING>("Yottaa", "Fail"),
STRUCT<cdn STRING, prioritization_status STRING>("Zeit", "Fail"),
STRUCT<cdn STRING, prioritization_status STRING>("Zenedge", "Fail")
])

EDIT: Then check and correct any CDN names as noted below

Here's the comparison table:

[Comparison table of per-CDN pass/fail status, 2019-07-01 vs 2020-08-01; the status marks were images and are not reproduced here. The 2020 values are listed in the STRUCTs above.]

@tunetheweb mentioned this issue Sep 11, 2020
@rviscomi (Member, Author)

I've partitioned the manifests and service_workers tables by date and added the 2020 data. @bazzadp let me know if the tables look OK to you. There seem to be many more service workers but fewer manifests, which makes me suspicious.

@tunetheweb (Member) commented Sep 14, 2020

Running the following:

SELECT
  date,
  count(distinct pwa_url) as pwas,
  count(distinct sw_url) as service_workers,
  count(distinct manifest_url) as manifests,
  count(0) as total
FROM
  `httparchive.almanac.pwa_candidates`
GROUP BY
  date

Gives this:

| date | pwas | service_workers | manifests | total |
| --- | --- | --- | --- | --- |
| 2019-07-01 | 12,196 | 12,176 | 12,045 | 35,782 |
| 2020-08-01 | 34,395 | 34,493 | 34,152 | 138,584 |

Which seems roughly realistic - a three-fold increase in use in the last year - and tracks roughly with this graph (though the graph suggests it's possibly closer to a two-fold increase).

However, it does seem there is an awful lot of duplication in the data. Should we just include distinct pages? This SQL returns 29,714 rows:

SELECT date, client, pwa_url, sw_url, manifest_url, count(0)
FROM `httparchive.almanac.pwa_candidates`
GROUP BY date, client, pwa_url, sw_url, manifest_url
HAVING count(0) > 1

However, that aside, I think there are bigger issues.

When I run this:

SELECT
  date,
  count(distinct page) as pages,
  count(distinct url) as urls,
  count(0) as total
FROM
  `httparchive.almanac.manifests`
GROUP BY
  date

I get the following:

| date | pages | urls | total |
| --- | --- | --- | --- |
| 2019-07-01 | 266,438 | 254,139 | 476,138 |
| 2020-08-01 | 31,255 | 28,917 | 142,162 |

Which is way off.

Digging into it more, I think I understand why. https://zeals.co.jp/, for example, contains a manifest but not a service worker, and was included in 2019 but not 2020. This is because httparchive.almanac.pwa_candidates only selects HTML pages with both - which means the service worker needs to be registered inline in the page's HTML. I noted this above as a downside of the approach the blog post suggests, and said it would be better to populate each table separately so we also pick up pages where the service worker is set up in a secondary .js file, or even where there is no service worker at all (as in this https://zeals.co.jp/ example!). That seems to be what was done last year, which would explain at least some of the discrepancies. So I think we need to drop pwa_candidates, or at least populate it in two steps (manifests and then service workers).

Similarly, when I run this:

SELECT
  date,
  count(distinct page) as pages,
  count(distinct url) as urls,
  count(0) as total
FROM
  `httparchive.almanac.service_workers`
GROUP BY
  date

I get this:

| date | pages | urls | total |
| --- | --- | --- | --- |
| 2019-07-01 | 30,308 | 30,008 | 52,792 |
| 2020-08-01 | 23,389 | 22,025 | 116,492 |

That drop definitely looks suspect, but I think we should look at this again once we generate manifests and service_workers independently of each other.

@rviscomi (Member, Author)

Thanks @bazzadp, that's really helpful. I've regenerated pwa_candidates using an OR instead. I've also deduplicated the results in service_workers and manifests by grouping in their respective queries. See #1291 for the latest util updates.

The growth from 2019 to 2020 now looks more natural. Please confirm if it looks good from your end.
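
Roughly, the OR-based rebuild amounts to a full outer join of the two tables. A sketch with assumed column names (the actual query is in the #1291 util updates):

#standardSQL
-- Hypothetical sketch of the OR-based pwa_candidates rebuild (names assumed).
SELECT
  date,
  client,
  page AS pwa_url,
  sw.url AS sw_url,
  m.url AS manifest_url
FROM `httparchive.almanac.service_workers` sw
FULL OUTER JOIN `httparchive.almanac.manifests` m
USING (date, client, page)
WHERE date = '2020-08-01'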

@tunetheweb (Member)

That looks better to me thanks!

There are still some dupes in there.

For example:

SELECT *
FROM `httparchive.almanac.service_workers`
WHERE date = '2020-08-01' AND page = 'https://www.lix.com/' AND client = 'desktop'

Maybe we need a DISTINCT on those last two queries?

Anyway, nothing major - we can ignore those.

One more request. There are only 63k service worker pages, so I don't think the 1k or 10k page subsets are going to be very representative of those. Would it be possible to get a reduced version of httparchive.almanac.summary_response_bodies with just the requests for those URLs? That is, ALL the requests and not just the service worker body. Even if it's only in the scratch area. It would just make searching for various JS code a lot cheaper for me. I can then switch to the full httparchive.almanac.summary_response_bodies for the actual queries (and hopefully to custom metrics next year!).

@paulcalvano (Contributor)

Was just working on this with @bazzadp and we ran the following query to extract all response bodies for pages that have a service worker.

SELECT *
FROM `httparchive.response_bodies.2020_08_01_mobile`
WHERE page IN (
  SELECT DISTINCT page
  FROM `httparchive.almanac.service_workers`
  WHERE date = "2020-08-01"
    AND client = "mobile")

This query processes 14.5 TB. The results for it are stored in httparchive.scratchspace.service_worker_response_bodies, which is 187 GB.

@rviscomi (Member, Author)

Only mobile?

@tunetheweb (Member)

Yeah. Just for hunting around for now, and I figure mobile and desktop will be similar enough, so I thought I'd halve my costs and runtimes. I can then decide whether it's even useful, and whether to create a real service_worker_response_bodies table with both clients in the almanac schema or write queries against the full summary_response_bodies table - once I know exactly what I wanna query.

@rviscomi (Member, Author)

SGTM. There was some duplication in the PWA tables, but I think there are easy workarounds using DISTINCT or GROUP BY in their queries. Any objections to closing this issue?

@tunetheweb (Member)

No objection. Raised #1301 for the PWA table definitions for next year (no need to rerun for this year), and we can track the requests table updates in #1293.

@tunetheweb (Member) commented Sep 21, 2020

@rviscomi can we run the following to match the WPT CDN names?

UPDATE `httparchive.almanac.h2_prioritization_cdns` SET cdn = 'Google' WHERE cdn = 'Google Cloud CDN';
UPDATE `httparchive.almanac.h2_prioritization_cdns` SET cdn = 'Automattic' WHERE cdn = 'WordPress.com';

Those seem to be the main two with volume that need to be set, based on this query:

SELECT _cdn_provider, count(0)
FROM `httparchive.almanac.requests` r
  LEFT OUTER JOIN `httparchive.almanac.h2_prioritization_cdns` c ON c.date = r.date AND c.cdn = r._cdn_provider
WHERE
  r.date = "2020-08-01" AND
  --firstHTML AND -- Check with both this commented out and not.
  _cdn_provider IS NOT NULL AND
  cdn IS NULL
GROUP BY _cdn_provider
ORDER BY count(0) DESC;

@tunetheweb reopened this Sep 21, 2020
@rviscomi (Member, Author)

Should we update the 2019 "WordPress" CDN to "Automattic" as well?

@tunetheweb (Member)

Yes and no. Yes, because it's wrong. No, because those aren't the stats we used for the 2019 chapter.

WDYT?

@rviscomi (Member, Author) commented Sep 21, 2020

Maybe we should, so that a JOIN works across years.

Ran the updates. Had to change Wordpress.com to WordPress.com.

@tunetheweb (Member)

Thanks. Closing again.

@rviscomi reopened this Oct 2, 2020
@rviscomi (Member, Author) commented Oct 2, 2020

Some chapters depend on data from the September crawl. Reopening this to track adding '2020-09-01' dated rows to the almanac tables.

@tunetheweb mentioned this issue May 11, 2021