Too many firstHtml fields in the almanac dataset #180

Closed
rviscomi opened this issue Sep 29, 2019 · 8 comments

@rviscomi
Member

The almanac.requests table is the result of joining the summary_requests and requests tables by client, page, and url. But these three fields do not necessarily uniquely identify a given request, because a page can reload the same resource multiple times.
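
For illustration, a check along these lines would surface the non-unique keys (a rough sketch, assuming almanac.requests keeps one row per joined request):

-- Sketch: count joined rows that share the same (client, page, url) key.
-- More than one row per key means the join produced duplicates.
SELECT
  client,
  page,
  url,
  COUNT(*) AS copies
FROM
  `httparchive.almanac.requests`
GROUP BY
  client, page, url
HAVING
  COUNT(*) > 1
ORDER BY
  copies DESC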

There are 61 Almanac queries already merged that depend on firstHtml. These would be slightly skewed because they include requests that have the same URL but are not necessarily the firstHtml request. @paulcalvano found a workaround: query the request payload's _is_base_page JSON field. We could update the clustered firstHtml field of the almanac.requests table to match this value, but changing it would require rerunning all of the queries. Also, the almanac.summary_response_bodies table has no such field, so it's not clear whether those queries can be disambiguated given the current schemas.

We can either:

  1. accept the small degree of skew (maybe +/- 2%) in all of these results and mention the discrepancy in our methodology
  2. find a way to join these datasets more reliably and rerun all of the queries

Any thoughts on the best way to proceed? @HTTPArchive/data-analysts @paulcalvano @OBTo

In any case, this shows an incompatibility in the current table schemas that should be fixed for future crawls.

@rviscomi rviscomi added bug Something isn't working analysis Querying the dataset labels Sep 29, 2019
@rviscomi rviscomi added this to the Content written milestone Sep 29, 2019
@foxdavidj
Contributor

I ran through the queries to get a feel for how many use firstHtml in the affected tables:

httparchive.almanac.summary_response_bodies: 33 queries (14 of which are in the A11Y chapter)
httparchive.almanac.requests: 20 queries

It seems fixing almanac.summary_response_bodies in time isn't very feasible... and I couldn't find any pattern in the data to implement a simple stop-gap to make the data a bit more accurate (e.g., adding another conditional alongside firstHtml).

However, it looks like updating almanac.requests is feasible, which is great: if we update the table and re-run the affected queries, we can get a very good idea of how big a problem this is. Plus, this would fix ~40% of the affected queries.

So why not start here and see how much of an impact it has on the data?

When/if we update the tables, I'll take on the task of re-running the queries, updating the spreadsheets, and pinging the authors about the changes.

@tunetheweb
Member

tunetheweb commented Sep 29, 2019

@rviscomi I don't think the _is_base_page stat is reliable either.

It is being set on the first request, which is often not the first HTML page. So far I've found two reasons why not:

  1. For sites that do a redirect, the initial page (which just returns a 301) is set as the _is_base_page. That may be true, but it is not the firstHTML page, and I'd imagine most people are interested in the firstHTML page since a redirect doesn't contain much:
SELECT page, url
FROM `httparchive.almanac.requests_desktop_1k`
WHERE page != url
and JSON_EXTRACT_SCALAR(payload, "$._is_base_page") = 'true'
  2. Additionally, for sites that do a certificate lookup first (where the HTTPS certificate chain is incomplete but Chrome is able to look up the missing intermediate certificate), the first request is for the missing intermediate cert, and that is incorrectly being set as the _is_base_page. Take https://www.swiftmd.com/ and https://eportal.hec.gov.pk/ for example (only two instances in the _1k table but a lot more in production):
SELECT page, url
FROM `httparchive.almanac.requests_desktop_1k`
WHERE JSON_EXTRACT_SCALAR(payload, "$._is_base_page") = 'true'
and page != url
and url like '%.crt'
LIMIT 1000

Not sure if there are any other reasons like this, but in both these instances the almanac.requests table seems to be setting the firstRequest and firstHTML values correctly as far as I can tell (except for the duplicates you mentioned).

However, back to your original question: you are right, there are plenty of examples of pages with multiple firstHTML rows, but I think I know what's going on with at least some of them: service workers! Sites with these seem to have two rows for firstHTML. Luckily, the ones with a referrer of serviceworker.js or sw.js seem to set the request method to blank, so you could just correct the data like this:

UPDATE `httparchive.almanac.requests`
SET firstHTML = false
WHERE firstHTML = true
and JSON_EXTRACT_SCALAR(payload, "$.request.method") is null

That would correct 20,899 rows of what I am guessing are duplicates, but we should probably do some more digging to be sure.

However when running this:

SELECT page, client, count(1)
FROM `httparchive.almanac.requests`
WHERE firstHTML = true
group by page, client
having count(1) > 1

I'm getting 342,984 rows, so those 20,899 are not the full story. Still digging into the rest...

@foxdavidj
Contributor

While digging into this, I realized each requestid is not unique (like I previously thought) but is sometimes shared by over 500 requests. Is this intended? Am I missing something really obvious here?
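
For reference, a quick check along these lines shows the duplication (a rough sketch, assuming the requestid column is carried over from summary_requests into almanac.requests):

-- Sketch: find the requestid values that appear on the most rows.
SELECT
  requestid,
  COUNT(*) AS copies
FROM
  `httparchive.almanac.requests`
GROUP BY
  requestid
ORDER BY
  copies DESC
LIMIT 10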

@tunetheweb
Member

I'm also confused by this comment:

because a page can reload the same resource multiple times.

While that's true, I would expect Chrome to just reuse the same asset rather than download it again (even if caching is turned off). I would also expect that to happen for assets but not for the HTML page itself. Yet I have found instances of pages loading themselves. The first was the service workers discussed above, but there are others (e.g. https://coopervision.se/) and I can't see why (though it is repeatable in Chrome and WebPageTest, so they are definitely reloading themselves for some reason)...

To be honest, I'd be tempted to just update firstHTML to false for any instance where JSON_EXTRACT_SCALAR(payload, "$.startedDateTime") is later than another instance that already exists for that client, page, and url.
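
Something along these lines could identify the rows to demote (a rough, untested sketch; it assumes the ISO 8601 startedDateTime strings sort chronologically):

-- Sketch: rank firstHTML rows per (client, page, url) by startedDateTime;
-- anything ranked after the first is a candidate for firstHTML = false.
SELECT
  client,
  page,
  url
FROM (
  SELECT
    client,
    page,
    url,
    ROW_NUMBER() OVER (
      PARTITION BY client, page, url
      ORDER BY JSON_EXTRACT_SCALAR(payload, "$.startedDateTime")) AS seq
  FROM
    `httparchive.almanac.requests`
  WHERE
    firstHTML = true)
WHERE
  seq > 1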

@rviscomi
Member Author

rviscomi commented Oct 1, 2019

While digging into this I realized each requestid is not unique (like i previously thought) but sometimes shared with over 500 requests? Is this intended? Am I missing something really obvious here?

@OBTo this is another side-effect of the firstHtml issue. All of the fields from the summary_requests table are duplicated in addition to firstHtml because of the non-unique table join.

I'm getting 342,984 rows so those 20,899 are not the full story. Still digging into the rest...

Here's an example of the same page being requested hundreds of times with a non-null referrer.
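
A query along these lines surfaces more of those cases (a rough sketch; it only counts how often a page requests its own URL and leaves out the referrer, which lives in the HAR payload):

-- Sketch: pages whose own URL shows up as a request many times over.
SELECT
  client,
  page,
  COUNT(*) AS times_requested
FROM
  `httparchive.almanac.requests`
WHERE
  page = url
GROUP BY
  client, page
HAVING
  COUNT(*) > 100
ORDER BY
  times_requested DESC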

@rviscomi
Member Author

rviscomi commented Oct 1, 2019

I've made some progress on disambiguating repeated requests. In the requests dataset, each payload has an _index field, which is a 0-based index of the ordered requests. In theory, we can use this as a key to join the HAR and summary requests tables, in addition to client, page, and url.

For example:

SELECT
  *
FROM (
  SELECT
    _TABLE_SUFFIX AS client,
    ROW_NUMBER() OVER (PARTITION BY _TABLE_SUFFIX, pageid ORDER BY requestid) - 1 AS index,
    *
  FROM
    `httparchive.summary_requests.2019_07_01_*`
  ORDER BY
    requestid)
JOIN (
  SELECT
    _TABLE_SUFFIX AS client,
    url AS page,
    pageid
  FROM
    `httparchive.summary_pages.2019_07_01_*`)
USING
  (client, pageid)
JOIN (
  SELECT
    _TABLE_SUFFIX AS client,
    page,
    url,
    CAST(JSON_EXTRACT_SCALAR(payload, '$._index') AS INT64) AS index,
    payload
  FROM
    `httparchive.requests.2019_07_01_*`)
USING
  (client, page, url, index)
ORDER BY
  index

The biggest issue with this is that there's a huge discrepancy between the number of requests in each dataset:

requests.2019_07_01_desktop: 420,510,876
summary_requests.2019_07_01_desktop: 413,924,986

requests.2019_07_01_mobile: 468,544,640
summary_requests.2019_07_01_mobile: 463,862,666

~25% of pages have a discrepancy between datasets. 1% of pages have a discrepancy larger than 10 requests. A single missing request in the summary dataset would throw off the sequencing of the ROW_NUMBER function and the join would fail to match any subsequent requests for that page.
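
A per-page comparison along these lines can show where the datasets diverge (a rough sketch against the desktop tables only):

-- Sketch: compare per-page request counts between the HAR and summary datasets.
-- Pages where the two counts differ would break the ROW_NUMBER-based join above.
SELECT
  page,
  har_requests,
  summary_request_count
FROM (
  SELECT
    page,
    COUNT(*) AS har_requests
  FROM
    `httparchive.requests.2019_07_01_desktop`
  GROUP BY
    page)
JOIN (
  SELECT
    pages.url AS page,
    COUNT(*) AS summary_request_count
  FROM
    `httparchive.summary_requests.2019_07_01_desktop`
  JOIN
    `httparchive.summary_pages.2019_07_01_desktop` AS pages
  USING
    (pageid)
  GROUP BY
    page)
USING
  (page)
WHERE
  har_requests != summary_request_count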

The query above results in 326,814,605 desktop and 385,157,365 mobile requests. But the good news is that every page has exactly one firstHtml request. You can explore the results in httparchive.almanac.requests2.

@rviscomi
Member Author

rviscomi commented Oct 3, 2019

I had an idea for another approach. Rather than trying to join the incompatible tables, we can extract the summary metadata from the JSON-based HARs themselves. We've already done something like this before: HTTPArchive/httparchive.org#135 (comment)

Looking at the query that powered it, I'm reminded of another HAR field, _final_base_page, which more accurately reflects the meaning of firstHtml than _is_base_page.

Running a new query to try to regenerate the table.
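
The core of it looks roughly like this (a sketch, not the exact query; it assumes _final_base_page is stored in the HAR payload as a 'true'/'false' string, like _is_base_page):

-- Sketch: derive firstHtml from the HAR payload's _final_base_page field
-- while regenerating the almanac requests table from the raw HARs.
SELECT
  _TABLE_SUFFIX AS client,
  page,
  url,
  JSON_EXTRACT_SCALAR(payload, '$._final_base_page') = 'true' AS firstHtml,
  payload
FROM
  `httparchive.requests.2019_07_01_*`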

Update: The query is complete and saved to the almanac.requests3 table (my second attempt at fixing it). I've run a couple of queries against the old and new tables and TBH there's not much of a noticeable difference; if anything, it's on the order of hundredths of a percent. Given that, I'm comfortable keeping the results as-is and not rerunning them or updating spreadsheets/chapters. We can keep a list of errata or known issues in the methodology and make a note about this error there. cc #154 @HTTPArchive/data-analysts

@rviscomi rviscomi closed this as completed Oct 3, 2019
@rviscomi
Member Author

rviscomi commented Jul 7, 2020

@paulcalvano when generating the tables for the 2020 queries, we should keep this issue in mind. Here's the query that produced the best results.
