SEO 2020 queries #1062
Conversation
Pulls in LCP, FID, CLS (Core Web Vitals) and calculates pass scores based on 75% of experiences being in the "good" range. Also pulls in TTFB, FCP, and a speed score from last year's query. Data is grouped by device.
I just noticed that I should be converting pct to 0-100 values.
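The pass calculation described above can be sketched as follows. This is a hypothetical helper, not the PR's actual SQL; the function name and bucket counts are illustrative:

```javascript
// Hypothetical helper (names and counts are illustrative, not from the
// PR's SQL): a Core Web Vitals metric "passes" when at least 75% of the
// recorded page experiences fall in the "good" bucket.
function passesCoreWebVital(goodCount, needsImprovementCount, poorCount) {
  const total = goodCount + needsImprovementCount + poorCount;
  if (total === 0) {
    return null; // no data for this metric/device combination
  }
  return goodCount / total >= 0.75;
}

console.log(passesCoreWebVital(800, 150, 50));  // true  (80% good)
console.log(passesCoreWebVital(700, 200, 100)); // false (70% good)
```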
added 2019 SQL scripts and renamed them; added notes for the new script
Thanks @Tiggerito! I'll give it a more thorough review later. Meanwhile could you also edit the top comment of this PR to include a checklist to show what queries are in this PR and what the remaining metrics to query are?
You can remove the
I'm also on the fence about this pattern. (context)
I'm learning. I didn't realise all commits automatically become part of an ongoing pull request, so I need to keep the first comment up to date on all changes made. We're still working on the metrics and all the queries needed. I believe the other analysts can now see this and can work on it, which is where I wanted to be. Maybe next time we don't do the draft pull so early, and just do the branch? Or is doing the pull important? I guess it triggers a review from the smart guys ;-)
I left them in so that I could identify references to them from last year. Will we be updating almanac.js to list the full names of the scripts that use each property? We'd probably want to include the chapter name so people can find them and know who's using them, e.g. // Used by: Or maybe we just say which chapter: // Used by: SEO. almanac.js could do with a restructure, as some properties use the '10_1' type syntax. Maybe they should also stick with descriptive names?
Gave it a thumbs up.
I'll remove the analyst reviewers for now. I thought it'd be good for everyone to peek at each others' work, but this PR isn't necessarily ready for review yet.
Early is good. And having the draft PR open gives you a place to track progress for each metric using the checklist. It's ok to continuously add to and modify the branch/checklist as the chapter planning evolves. Hopefully in about a week things will be settled and you'll know which metrics need to be queried.
Yes, this is an unfortunate vestige of last year's system. Not sure if it's better to leave them so the 2019 queries make sense or update them for consistency with 2020. In any case, the file name is not the best place to reference the related 2019 query - for that you can add it as a comment in the SQL.
I can help out with reviewing this when it's ready for review @rviscomi @Tiggerito. Can you edit the name of the PR to reflect what it's for as well? By default it just uses the name of the first commit you made haha. Maybe something like:
@@ -0,0 +1,98 @@
# How many JSON-LD scripts on a page
SELECT
ARRAY_LENGTH(REGEXP_EXTRACT_ALL(body, '(?i)<script[^>]*type=[\'"]?application\\/ld\\+json[\'"]?[^>]*>(.*?)<\\/script>')) AS jsonld_scripts_count,
Proposed custom metric:
document.querySelectorAll('script[type="application/ld+json" i]').length
This one probably needs to be on the raw body, as it's designed to find out if the tag's content parses into JSON. But thinking about it, I think I can read the raw text of a script in the DOM and try to parse it there. Will test.
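A sketch of the idea above (parseJsonLd is an assumed helper name, not the final custom metric): read each JSON-LD script's text from the DOM and check whether it parses.

```javascript
// Assumed helper, not the final custom metric. In a custom metric it
// might be applied per script node, e.g.:
//   [...document.querySelectorAll('script[type="application/ld+json" i]')]
//     .filter((s) => parseJsonLd(s.textContent) !== null).length
function parseJsonLd(text) {
  try {
    const data = JSON.parse(text);
    // JSON-LD payloads are objects or arrays, never bare primitives.
    return typeof data === 'object' && data !== null ? data : null;
  } catch (e) {
    return null; // script tag present but content is not valid JSON
  }
}

console.log(parseJsonLd('{"@context": "https://schema.org"}') !== null); // true
console.log(parseJsonLd('not json')); // null
```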
What does the i in the selector signify?
Case insensitivity. Plug the regex in here: https://regexr.com/ and make sure you choose PCRE instead of JavaScript and it should explain each part to you.
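As a side note on flags: BigQuery's RE2 engine accepts the inline `(?i)` modifier, while a JavaScript regex expresses case insensitivity with the `i` flag. A simplified JS equivalent of the SQL pattern (the sample HTML is illustrative):

```javascript
// Simplified JS equivalent of the RE2 pattern above; "i" replaces "(?i)",
// and "s" lets "." cross newlines inside the script body.
const jsonLdScript =
  /<script[^>]*type=['"]?application\/ld\+json['"]?[^>]*>(.*?)<\/script>/gis;

const html = '<SCRIPT TYPE="application/ld+json">{"@type":"Article"}</SCRIPT>';
const matches = [...html.matchAll(jsonLdScript)];
console.log(matches.length); // 1
console.log(matches[0][1]);  // '{"@type":"Article"}'
```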
Sorry, talking about the CSS selector from @rviscomi. I thought that, but tested and it's case insensitive without it.
I just discovered that some attributes are case sensitive, so my gathering for meta tags is missing a whole bunch of data. I opened an issue for it:
HTTPArchive/legacy.httparchive.org#188
What's the best way to resolve this? Do I need to fork and fix to get it into the next crawl?
It'd be nice to fix for the future but it won't be included until the October crawl at the soonest, the results of which would be available too late to be included in the Almanac. I don't think the case sensitivity would significantly affect the results, but if that's a concern we can include a side note about it in the chapter.
I mentioned to the team that we might be able to do a raw HTML crawl at the end to get anything we missed. If not, the side note will work.
My concern is that older websites tended to shout (upper case) their HTML, so they would be the ones we miss.
Can we submit the PR to fix this now while it’s fresh in our mind? Really want to avoid what we did last year which was not think about custom metrics for 11 months and then have a complete blind panic for the last month! Would be really nice to work on issues like this (or even custom metrics we wish we’d added in hindsight!) as we analyse the 2020 results so we’ve a head start on 2021.
Would also mean you can have a quick look during (or at end of) October and if the stats are radically different then can decide what to do.
Fix looks relatively simple since we’re using Chrome for the crawl: https://stackoverflow.com/questions/38399089/ignoring-case-sensitiveness-in-queryselectorall
Will do
Where possible I have changed the queries to use data from almanac.js. I've identified almanac.js queries by rendered_ and body queries by raw_. I've tried to merge as many queries as possible based on the data source and format of the results. This should make it a lot easier to manage as well as save on the number of queries made $$$

- lighthouse.sql - all data from Lighthouse
- rendered_by_device.sql - rendered data from almanac.js grouped by device
- rendered_percentiles_by_device.txt - rendered data from almanac.js reported by percentiles and grouped by device
- raw_by_device.sql - data from the raw HTML - a very expensive query when done on the real data - does not gather much at the moment. This could become our raw/rendered comparison query.

Other scripts are ones that have a unique output so can't be merged, typically because they group by the field they are analysing.

At the moment most are pointing to test tables. But be careful: always double check the cost before making a query, as some old ones are still pointing to live tables. We could do with a test table for httparchive.almanac.summary_response_bodies, as the table I'm currently testing with is in a different format.

We're starting to tie down the exact metrics needed, and get almanac.js updated to fit those needs. My next move here is to see if I can get the comparisons data they want. This would be making a query joining several of the tables.
Added json_ld_scripts to other rendered queries Moved no longer needed raw scripts into a sandbox folder.
renamed them set to use test tables merged where possible
I think it's ok to remove these, the results spreadsheet will be the publicly available place for query results. Edit: Oh nevermind, I see these are the sample inputs to the queries, not the outputs. Anyway, I do still think they can be removed. BigQuery table previews can give future analysts sample data. If there are any gotchas in the data, the README would be the best place to document those.
The way I look at the queries that we check into source control is to help readers scrutinize our methodology. For example, answering "how did you calculate that metric?" I think our queries should be optimized for legibility, so it's ok to have lots of descriptive comments for any parts of the query that can be confusing. IMO the testing system woven into the queries gets in the way of the meat of the queries, which is how the metrics are implemented. Don't get me wrong, it's a brilliant idea to have a toggle to switch back to sample data, but I think it's only useful to us now while we debug the 2020 queries and future analysts won't need to debug queries that are already working, so they'd be mostly superfluous. So I'd recommend stripping the testing code out of the queries, leaving only the essentials.
Reviewed the first query. Let me know what you think of those suggestions, then I'll continue with the other queries.
@@ -0,0 +1,78 @@
#standardSQL
I have a few suggestions that I'll propose by way of a rewritten query.
- I agree with your comment about using percentiles. For this query it makes sense to represent the % of images with alt attributes per page as a distribution.
- Do the division in the JS function while you're at it, which also simplifies the return type. Function renamed accordingly.
- You could simplify some of the defensive `if` statements by returning `null` in the `catch` block. Null values are ignored in the percentile aggregation by default.
- Various whitespace tidying.
- Uppercase for all keywords, like `AS`.
- Remove all "TEST" logic.
```sql
#standardSQL
# Distribution of the coverage of alt attributes over all images per page.
CREATE TEMPORARY FUNCTION get_percent_alt_images(markup_string STRING)
RETURNS FLOAT64 LANGUAGE js AS '''
try {
  var markup = JSON.parse(markup_string);
  return markup.images.img.alt.present / markup.images.img.total;
} catch (e) {
  return null;
}
''';

SELECT
  client,
  percentile,
  ROUND(APPROX_QUANTILES(pct_alt, 1000)[OFFSET(percentile * 10)], 4) AS pct_alt
FROM (
  SELECT
    _TABLE_SUFFIX AS client,
    get_percent_alt_images(JSON_EXTRACT_SCALAR(payload, '$._markup')) AS pct_alt
  FROM
    `httparchive.pages.2020_08_01_*`),
  UNNEST([10, 25, 50, 75, 90]) AS percentile
GROUP BY
  client,
  percentile
ORDER BY
  percentile,
  client
```
Results:
client | percentile | pct_alt |
---|---|---|
desktop | 10 | 0.00% |
mobile | 10 | 0.00% |
desktop | 25 | 10.53% |
mobile | 25 | 9.09% |
desktop | 50 | 50.00% |
mobile | 50 | 50.00% |
desktop | 75 | 89.74% |
mobile | 75 | 89.47% |
desktop | 90 | 100.00% |
mobile | 90 | 100.00% |
(to be honest, I'm very surprised that the median is exactly 50% - please let me know if you spot anything wrong with my query!)
I should have made the first query the last one. It was one I did not think was too valuable, so did not focus on it much.
I think I'd already added it into my percentile query. Will check and fold in your comments there.
Do the division in the JS function while you're at it, which also simplifies the return type. Function renamed accordingly.
A few reasons I leave some maths to the SQL:
- SQL has a safe div, so no exception on zeros. Not an issue with one return value, but a single mistake could break all JS result data.
- Returning more data could come in handy for the SQL writer. e.g. I often return a count and not a boolean indicating non-zero. Again, to minimise churn if the SQL writer decided they wanted something else. Or to try and keep the data provided consistent between queries.
- I have a feeling that SQL runs far faster than JS.
You could simplify some of the defensive if statements by returning null in the catch block. Null values are ignored in the percentile aggregation by default.
That script does not go as defensive as I normally go!
This goes against my own beliefs and what I think is best practice. Exceptions should not be used to handle normal code flow.
It's more apparent when you see some of my more complex JS. If I was not defensive, then an exception would mean all subsequent data calculations would be lost. A bug in an early bit of code could cause the loss of much data.
The pattern I use where I create the result object at the start and always return it (even in an exception) is deliberate. That way an exception does not stop all data from being returned (null).
I've also tried to make my exception/error handling provide visible clues in the results that things were broken. Returning something like a null means an issue is silently ignored. Returning an error message where possible really stands out.
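The result-object pattern described above can be sketched like this. getAlmanacInfo and its fields are made-up names for illustration, not the PR's actual UDF:

```javascript
// Illustrative sketch (names are made up): build the result object first
// and always return it, wrapping each calculation so one failure can't
// wipe out the rest, and recording error messages as visible clues
// rather than silent nulls.
function getAlmanacInfo(payloadString) {
  const result = {};
  try {
    const payload = JSON.parse(payloadString);
    try {
      result.videos_total = payload.videos.length;
    } catch (e) {
      result.videos_total_error = e.message;
    }
    try {
      result.images_total = payload.images.length;
    } catch (e) {
      result.images_total_error = e.message;
    }
  } catch (e) {
    result.parse_error = e.message;
  }
  return result; // always returned, even when parts failed
}

// One broken field does not lose the others:
const info = getAlmanacInfo('{"images": [1, 2, 3]}');
console.log(info.images_total); // 3
console.log(typeof info.videos_total_error); // "string"
```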
That reminds me, I built in exception handling to my custom metrics. I should check that to see if they did have issues. I did that after noticing that a reasonable percentage of results from last year had blank custom metrics, indicating an exception happened and was not handled.
Other comments are sound and I will work on them today.
I'll double check the custom metrics for this to make sure there is nothing strange causing the 50% median.
It's hard to get my head around using percents to feed into percentiles 🤯
I've updated pages_markup_by_device_and_percentile.sql to include the percentile version of that query and have it pulling out a few values to help me check validity and understand things. Here's the result from a live run:
Row | percentile | client | total | img_count | images_with_alt_present_percent | images_with_alt_blank_percent | images_with_alt_missing_percent | images_with_alt_present | images_with_alt_blank | images_with_alt_missing
---|---|---|---|---|---|---|---|---|---|---
1 | 10 | desktop | 5593642 | 2 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0
2 | 10 | mobile | 6347640 | 2 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0
3 | 25 | desktop | 5593642 | 8 | 0.1429 | 0.0 | 0.0 | 1 | 0 | 0
4 | 25 | mobile | 6347640 | 8 | 0.125 | 0.0 | 0.0 | 1 | 0 | 0
5 | 50 | desktop | 5593642 | 21 | 0.5385 | 0.1 | 0.0299 | 7 | 2 | 1
6 | 50 | mobile | 6347640 | 19 | 0.5122 | 0.1111 | 0.0244 | 6 | 2 | 1
7 | 75 | desktop | 5593642 | 43 | 0.9091 | 0.4906 | 0.3333 | 22 | 10 | 5
8 | 75 | mobile | 6347640 | 39 | 0.9063 | 0.5 | 0.3333 | 20 | 9 | 5
9 | 90 | desktop | 5593642 | 82 | 1.0 | 0.8261 | 0.8037 | 48 | 24 | 19
10 | 90 | mobile | 6347640 | 73 | 1.0 | 0.8333 | 0.8235 | 43 | 22 | 18
The first column and last three columns are just counts. The others are the percents.
So I'm guessing it's saying things like (mobile):
The median page has 19 images; 10% have 2 or fewer, and 10% have 73 or more.
10% of pages have alt attributes present on all their images, while 10% have no images with the alt present. The median is to have 51% of alts present.
10% have 82% of their images without any alt attribute. 50% do well, with less than 3% missing.
50% use less than 10% blank attributes.
Time to rest my brain and clean out all that test code.
I agree, it does distract from the main code. On my first real query I found a bug that meant I had to quickly revert back to test mode, so what I had was handy. At least now the test code is available in version control, so I can quickly get it back again. I think I will remove all test stuff and document my testing process in the README, including an indication of how to see my test versions. Can I label commits so it would be easy to find the last version before switching to live values? e.g. "contains test code". I can also document how I get the custom metrics result data to help me test, and then remove my results folder.
I was going to reference a commit that contained all my test code, so people could refer back to it to see how I did things. But I suspect all my commits are lost when it merges? So no record of my test code is kept?
testing image percentile query
removed commented-out code; minor cosmetic changes
Sorry for the delay @Tiggerito! Great work. This PR is most of the way there.
Similar to my feedback on the Markup PR, I do think these queries need to be split up by metric, not by table or grouping. I understand if you don't have the bandwidth to do that now, and I don't want to hold this up any further.
I've made a couple of comments about being more descriptive in the file names about what the underlying metrics are. These comments are applicable now for the queries that only contain a single metric. When we have time to split up the monolithic queries, each of the individual files should be similarly named more descriptively for the metric, so this is a step in that direction.
Other than that, most of my comments apply to how the metrics are aggregated. Grouping by INT64 values would effectively generate a histogram of granularity 1, which could lead to an unlimited number of values. Instead, to summarize distributions of values like this, we typically calculate as percentiles.
Once these comments are addressed we should be ready to run the queries, save the results to the Sheet, and start drafting the chapter. Thanks again for all the time you've spent on this, it will show through on the final result!
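The percentile approach recommended above can be sketched in plain JavaScript. This is a rough, exact analogue of the `APPROX_QUANTILES(x, 1000)[OFFSET(percentile * 10)]` pattern used in these queries; the sample values are illustrative:

```javascript
// Rough JS analogue of APPROX_QUANTILES(x, 1000)[OFFSET(p * 10)]
// (exact rather than approximate; sample values are illustrative).
function percentile(values, p) {
  // Nulls are skipped, matching BigQuery's aggregation behaviour.
  const sorted = values.filter((v) => v !== null).sort((a, b) => a - b);
  if (sorted.length === 0) return null;
  const offset = Math.min(
    sorted.length - 1,
    Math.floor((p / 100) * sorted.length)
  );
  return sorted[offset];
}

const pctAlt = [0, 0.1, 0.25, 0.5, 0.5, 0.75, 0.9, 1, 1, null];
console.log(percentile(pctAlt, 50)); // 0.5
console.log(percentile(pctAlt, 90)); // 1
```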
COUNT(0) AS total,

# Pages with videos
COUNTIF(almanac_info.videos_total > 0) AS has_videos,
Is there a reason to prefer this over reqVideo or bytesVideo in summary_pages?
Didn't know about them. How do I find out how they are generated?
This metric is a bit dodgy anyhow, as it only finds video tags, and many people use other ways to show videos, like iframes. They are aware of its limitations.
The requests.sql query emulates how videos are detected:
function prettyType(mimeType, ext) {
These fields are the count of requests and sum of requests' bytes per page. Could be a useful proxy for this metric.
I did a quick test. I don't think it picks up the embedding of videos unless they autoplay.
I'd class this one as a low-priority metric anyhow. Video is not a major SEO thing. I'll make it clear that the results are limited (I already indicated that). If they want something better we can negotiate.
_TABLE_SUFFIX AS client,
percentile,
url,
get_almanac_info(JSON_EXTRACT_SCALAR(payload, '$._almanac')) AS almanac_info
Ditto on reqVideo/bytesVideo
sql/2020/07_SEO/pages_wpt_bodies_hreflang_by_device_and_http_header_value.sql
sql/2020/07_SEO/pages_wpt_bodies_hreflang_by_device_and_link_tag_value.sql
sql/2020/07_SEO/pages_wpt_bodies_meta_description_by_device_and_char_count.sql
SELECT
client,
COUNT(0) AS total,
ROUND(AVG(wpt_bodies_info.image_links)) AS image_links_avg,
Ditto on percentiles
@antoineeripret I'm not sure how this one works?
@Tiggerito : what I did there, as I was aware that a freq table would end up with a lot of possible values, was to apply an AVG() function on the returned results. Maybe not the best idea I have had.
We can change it to return percentiles instead. I'll have a look at that during the afternoon (Spanish time).
I've already added percentiles for those values in the percentiles script. Just want to make sure we can remove these?
Ah, my bad! Yes, for sure!
I've added some image v text link percentiles and dropped all the queries that are now covered by percentiles.
The requirements for "image v text links" are vague; in fact, that is the entire requirement. I've created separate percentiles for each, as well as a percentile of the percent of links that are images. Hopefully that will cover what they asked for.
I think all is done.
Co-authored-by: Rick Viscomi <[email protected]>
Co-authored-by: Rick Viscomi <[email protected]>
…ibute.sql Co-authored-by: Rick Viscomi <[email protected]>
…_type.sql Co-authored-by: Rick Viscomi <[email protected]>
…eader_value.sql Co-authored-by: Rick Viscomi <[email protected]>
…ag_value.sql Co-authored-by: Rick Viscomi <[email protected]>
I think all that's left is: removing the number-of-links queries (@antoineeripret are you ok with that?) and using the percentile results instead. Maybe adding reqVideo/bytesVideo if that data looks to be of any use. Working out how to do pages_wpt_bodies_text_vs_images_links_by_device.sql (@antoineeripret?)
@Tiggerito: I will have time to work on what is needed this afternoon. If you figure out how to do pages_wpt_bodies_text_vs_images_links_by_device.sql before, just let me know, otherwise just commit what you have so far and I'll continue :)
added image v text link percentiles
This is approved, so I'll merge and any last minute changes can be applied to a new PR. Great work everyone!
SEO 2020 Queries (#908)
Checklist of the metrics needed
Crawlability & Indexability
Not possible as all home pages are 200 OK
pages_robots_txt_by_device_and_status.sql - status_code
lighthouse.sql - robots_txt_valid
Not sure how this is different to Validity?
pages_wpt_bodies_by_device.sql - pct_has_meta_robots
pages_wpt_bodies_by_device.sql - pct_robots_has_x_robots_tag
pages_robots_txt_by_device_and_useragent.sql
pages_robots_txt_by_device_and_useragent.sql
pages_wpt_bodies_by_device.sql - has_canonical
pages_wpt_bodies_by_device.sql - pct_has_self_canonical
pages_wpt_bodies_by_device.sql - pct_is_canonicalized
pages_wpt_bodies_by_device.sql - pct_http_canonical
pages_wpt_bodies_by_device.sql - pct_has_conflict_http_header_changed_canonical
pages_wpt_bodies_by_device.sql - pct_has_meta_robots
pages_wpt_bodies_by_device.sql - pct_has_x_robots_tag
pages_wpt_bodies_by_device.sql - pct_has_meta_robots_and_x_robots_tag
pages_wpt_bodies_by_device.sql - pct_rendering_changed_robots_meta_tag
pages_wpt_bodies_by_device.sql - pct_http_canonical
pages_wpt_bodies_by_device.sql - pct_has_raw_canonical
pages_wpt_bodies_by_device.sql - pct_has_rendered_canonical
pages_wpt_bodies_by_device.sql - pct_has_rendered_but_not_raw_canonical
pages_wpt_bodies_by_device.sql - pct_has_conflict_rendering_changed_canonical
pages_wpt_bodies_by_device.sql - pct_has_conflict_http_header_changed_canonical
pages_wpt_bodies_by_device.sql - pct_has_conflict_http_header_or_rendering_changed_canonical
pages_wpt_bodies_by_device.sql - pct_has_conflict_raw_rendered_hreflang
pages_wpt_bodies_by_device.sql - pct_has_hreflang
pages_wpt_bodies_by_device.sql - pct_has_rendered_hreflang
pages_wpt_bodies_by_device.sql - pct_has_raw_hreflang
pages_wpt_bodies_by_device.sql - pct_has_rendered_but_not_raw_hreflang
pages_wpt_bodies_by_device.sql - pct_has_conflict_raw_rendered_hreflang
Not Available
Not Available
pages_wpt_bodies_by_device.sql - pct_has_raw_jsonld_or_microdata
pages_wpt_bodies_by_device.sql - pct_has_rendered_jsonld_or_microdata
not really conflict
pages_wpt_bodies_by_device.sql - has_only_rendered_jsonld_or_microdata
pages_wpt_bodies_by_device.sql - pct_rendering_changes_structured_data
Content
pages_wpt_bodies_by_device.sql - pct_has_title_tag
pages_wpt_bodies_by_device_and_percentile.sql - title_characters
pages_wpt_bodies_by_device_and_percentile.sql - title_words
pages_wpt_bodies_by_device_and_title_char_count.sql
pages_wpt_bodies_by_device.sql - pct_has_meta_description
pages_wpt_bodies_by_device_and_percentile.sql - meta_description_characters
pages_wpt_bodies_by_device_and_percentile.sql - meta_description_words
pages_wpt_bodies_by_device.sql - pct_has_h1
pages_wpt_bodies_by_device.sql - pct_has_h2
pages_wpt_bodies_by_device.sql - pct_has_h3
pages_wpt_bodies_by_device.sql - pct_has_h4
pages_wpt_bodies_by_device.sql - pct_has_non_empty_h1
pages_wpt_bodies_by_device.sql - pct_has_non_empty_h2
pages_wpt_bodies_by_device.sql - pct_has_non_empty_h3
pages_wpt_bodies_by_device.sql - pct_has_non_empty_h4
pages_wpt_bodies_by_device.sql - pct_has_same_h1_title
pages_wpt_bodies_by_device_and_percentile.sql - visible_words_rendered
pages_wpt_bodies_by_device_and_percentile.sql - visible_words_raw
pages_wpt_bodies_by_device_and_word_count.sql (rendered)
pages_wpt_bodies_structured_data_by_device_and_format.sql
pages_wpt_bodies_structured_data_by_device_and_type.sql
Probably covered by other things
pages_markup_by_device.sql - pct_has_img
pages_markup_by_device.sql - pct_images_with_img_alt_blank_or_present
pages_markup_by_device_and_img_loading.sql - loading
pages_almanac_by_device.sql - pct_has_videos
pages_almanac_by_device_and_percentile.sql - videos_count
Links
pages_wpt_bodies_by_device_and_percentile.sql - outgoing_links_internal
pages_wpt_bodies_anchors_by_device_and_number_outgoing_internal_links.sql
pages_wpt_bodies_by_device_and_percentile.sql - outgoing_links_external
pages_wpt_bodies_anchors_by_device_and_number_outgoing_external_links.sql
pages_wpt_bodies_by_device_and_percentile.sql - outgoing_links
pages_wpt_bodies_anchors_by_device_and_number_links.sql
pages_wpt_bodies_by_device_and_number_rel_attribute.sql
pages_wpt_bodies_text_vs_images_links_by_device.sql - not sure if this is enough?
Advanced
?
?
Asked Performance chapter if they have this
Performance chapter - web_vitals_by_device.sql
Performance 2020 queries #1091 (comment)
"Any metrics that depend on Lighthouse performance audits/scores should wait for the 2020_09_01 dataset results"
Markup chapter - pages_almanac_by_device.sql - pct_no_html_lang_m404
Markup chapter - pages_almanac_by_device_and_html_lang.sql
pages_wpt_bodies_by_device.sql - has_hreflang
pages_wpt_bodies_hreflang_by_device_and_link_tag_value.sql
pages_wpt_bodies_by_device.sql - pct_has_http_hreflang
pages_wpt_bodies_hreflang_by_device_and_http_header_value.sql
pages_wpt_bodies_by_device.sql - pct_https
Don't know how
pages_markup_by_device.sql - pct_has_html_amp_or_emoji_attribute
pages_markup_by_device.sql - pct_has_rel_amphtml_tag
This is from last year and I think it's an out-of-date or flawed idea. I've asked the authors.