Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SEO 2020 queries #1062

Merged
merged 71 commits into from
Sep 24, 2020
Merged

SEO 2020 queries #1062

merged 71 commits into from
Sep 24, 2020

Conversation

Tiggerito
Copy link
Contributor

@Tiggerito Tiggerito commented Jul 19, 2020

SEO 2020 Queries (#908)

Checklist of the metrics needed

Crawlability & Indexability

  • HTTP Status Code
    Not possible as all home pages are 200 OK
  • Robots.txt - Status Code
    pages_robots_txt_by_device_and_status.sql - status_code
  • Robots.txt - Validity
    lighthouse.sql - robots_txt_valid
  • Robots.txt - Syntax Compliance
    Not sure how this is different to Validity?
  • Meta robots tags inclusion
    pages_wpt_bodies_by_device.sql = pct_has_meta_robots
  • X-robots tags inclusion
    pages_wpt_bodies_by_device.sql = pct_robots_has_x_robots_tag
  • Robots directives (generic vs UA-specific)
    pages_robots_txt_by_device_and_useragent.sql
  • Robots values
    pages_robots_txt_by_device_and_useragent.sql
  • % of pages with a canonical tag
    pages_wpt_bodies_by_device.sql - has_canonical
  • % of pages with a self canonical tag
    pages_wpt_bodies_by_device.sql - pct_has_self_canonical
  • % of pages with a canonicalized tag
    pages_wpt_bodies_by_device.sql - pct_is_canonicalized
  • % of pages specifying canonical via http header
    pages_wpt_bodies_by_device.sql - pct_http_canonical
  • % of pages with different canonical values in HTTP header and html tag
    pages_wpt_bodies_by_device.sql - pct_has_conflict_http_header_changed_canonical
  • Usage of Rendering strategies
  • Where the meta robots is included
    pages_wpt_bodies_by_device.sql - pct_has_meta_robots
    pages_wpt_bodies_by_device.sql - pct_has_x_robots_tag
    pages_wpt_bodies_by_device.sql - pct_has_meta_robots_and_x_robots_tag
  • Conflict between raw and rendered meta robots
    pages_wpt_bodies_by_device.sql - pct_rendering_changed_robots_meta_tag
  • Where the meta canonical is included
    pages_wpt_bodies_by_device.sql - pct_http_canonical
    pages_wpt_bodies_by_device.sql - pct_has_raw_canonical
    pages_wpt_bodies_by_device.sql - pct_has_rendered_canonical
    pages_wpt_bodies_by_device.sql - pct_has_rendered_but_not_raw_canonical
  • Conflict between raw and rendered canonical
    pages_wpt_bodies_by_device.sql - pct_has_conflict_rendering_changed_canonical
    pages_wpt_bodies_by_device.sql - pct_has_conflict_http_header_changed_canonical
    pages_wpt_bodies_by_device.sql - pct_has_conflict_http_header_or_rendering_changed_canonical
    pages_wpt_bodies_by_device.sql - pct_has_conflict_raw_rendered_hreflang
  • Where the hreflang is included
    pages_wpt_bodies_by_device.sql - pct_has_hreflang
    pages_wpt_bodies_by_device.sql - pct_has_rendered_hreflang
    pages_wpt_bodies_by_device.sql - pct_has_raw_hreflang
    pages_wpt_bodies_by_device.sql - pct_has_rendered_but_not_raw_hreflang
  • Conflict between raw and rendered hreflang
    pages_wpt_bodies_by_device.sql - pct_has_conflict_raw_rendered_hreflang
  • Where the metadata is included
    Not Available
  • Conflict between raw and rendered metadata
    Not Available
  • Where the structured data is included
    pages_wpt_bodies_by_device.sql - pct_has_raw_jsonld_or_microdata
    pages_wpt_bodies_by_device.sql - pct_has_rendered_jsonld_or_microdata
  • Conflict between raw and rendered structured data
    not really conflict
    pages_wpt_bodies_by_device.sql - has_only_rendered_jsonld_or_microdata
    pages_wpt_bodies_by_device.sql - pct_rendering_changes_structured_data

Content

  • % of pages with <title>
    pages_wpt_bodies_by_device.sql - pct_has_title_tag
  • <title> character count
    pages_wpt_bodies_by_device_and_percentile.sql - title_characters
    pages_wpt_bodies_by_device_and_percentile.sql - title_words
    pages_wpt_bodies_by_device_and_title_char_count.sql
  • % of pages with a meta description
    pages_wpt_bodies_by_device.sql - pct_has_meta_description
  • Meta description character count
    pages_wpt_bodies_by_device_and_percentile.sql - meta_description_characters
    pages_wpt_bodies_by_device_and_percentile.sql - meta_description_words
  • % of pages with h1
    pages_wpt_bodies_by_device.sql - pct_has_h1
  • % of pages with h2
    pages_wpt_bodies_by_device.sql - pct_has_h2
  • % of pages with h3
    pages_wpt_bodies_by_device.sql - pct_has_h3
  • % of pages with h4
    pages_wpt_bodies_by_device.sql - pct_has_h4
  • % of non-empty h1
    pages_wpt_bodies_by_device.sql - pct_has_non_empty_h1
  • % of non-empty h2
    pages_wpt_bodies_by_device.sql - pct_has_non_empty_h2
  • % of non-empty h3
    pages_wpt_bodies_by_device.sql - pct_has_non_empty_h3
  • % of non-empty h4
    pages_wpt_bodies_by_device.sql - pct_has_non_empty_h4
  • % of cases where h1 and <title> are the same
    pages_wpt_bodies_by_device.sql - pct_has_same_h1_title
  • Page word count
    pages_wpt_bodies_by_device_and_percentile.sql - visible_words_rendered
    pages_wpt_bodies_by_device_and_percentile.sql - visible_words_raw
    pages_wpt_bodies_by_device_and_word_count.sql (rendered)
  • Structured data - format used
    pages_wpt_bodies_structured_data_by_device_and_format.sql
  • Structured data - types used
    pages_wpt_bodies_structured_data_by_device_and_type.sql
  • Structured data - Implementation policy
    Probably covered by other things
  • % of pages with an img
    pages_markup_by_device.sql - pct_has_img
  • % of images with an alt
    pages_markup_by_device.sql - pct_images_with_img_alt_blank_or_present
  • % of pages using browser and framework-based image optimization configuration
    pages_markup_by_device_and_img_loading.sql - loading
  • % of pages with a video
    pages_almanac_by_device.sql - pct_has_videos
    pages_almanac_by_device_and_percentile.sql - videos_count

Links

  • No of outgoing links to internal pages (include subdomains)
    pages_wpt_bodies_by_device_and_percentile.sql - outgoing_links_internal
    pages_wpt_bodies_anchors_by_device_and_number_outgoing_internal_links.sql
  • No of outgoing links to external pages
    pages_wpt_bodies_by_device_and_percentile.sql - outgoing_links_external
    pages_wpt_bodies_anchors_by_device_and_number_outgoing_external_links.sql
  • No of links in mobile and desktop version
    pages_wpt_bodies_by_device_and_percentile.sql - outgoing_links
    pages_wpt_bodies_anchors_by_device_and_number_links.sql
  • No of rel
    pages_wpt_bodies_by_device_and_number_rel_attribute.sql
  • Text vs image links
    pages_wpt_bodies_text_vs_images_links_by_device.sql - not sure if this is enough?

Advanced

  • Type of mobile web implementation
    ?
  • Changes to core measures
    ?
  • CWS pass / fail
    Asked Performance chapter if they have this
  • LCP, FID, CLS; TTFB, FCB
    Performance chapter - web_vitals_by_device.sql
    Performance 2020 queries #1091 (comment)
    "Any metrics that depend on Lighthouse performance audits/scores should wait for the 2020_09_01 dataset results"
  • HTML lang inclusion
    Markup chapter - pages_almanac_by_device.sql - pct_no_html_lang_m404
  • Most common lang attributes
    Markup chapter - pages_almanac_by_device_and_html_lang.sql
  • Hreflang inclusion in link tags
    pages_wpt_bodies_by_device.sql - has_hreflang
  • Most common hreflang attributes (link tag)
    pages_wpt_bodies_hreflang_by_device_and_link_tag_value.sql
  • Hreflang tag inclusion in HTTP header
    pages_wpt_bodies_by_device.sql - pct_has_http_hreflang
  • Hreflang most common attributes in HTTP header
    pages_wpt_bodies_hreflang_by_device_and_http_header_value.sql
  • HTTPS usage
    pages_wpt_bodies_by_device.sql -pct_https
  • Browser-based security changes to CORS
    Don't know how
  • contains a attribute
    pages_markup_by_device.sql - pct_has_html_amp_or_emoji_attribute
  • Contains a rel amphtml tag
    pages_markup_by_device.sql - pct_has_rel_amphtml_tag
  • No of links via hashed URLs
    This is from last year and I think an out of date or flawed idea. I've asked the authors

Pulls in LCP, FID, CLS (Core Web Vitals) and calculates pass cores based on 75% in the good place. Also pulls in TTFB, FCP and a speed score from last years query. Data is grouped by device.
@Tiggerito
Copy link
Contributor Author

I just noticed that I should be converting pct to 0-100 values.

@rviscomi
Copy link
Member

Thanks @Tiggerito! I'll give it a more thorough review later.

Meanwhile could you also edit the top comment of this PR to include a checklist to show what queries are in this PR and what the remaining metrics to query are?

10_01_structured_data_rich_results_rendered.sql

You can remove the 10_##_ prefixes from the file names, that was the old metric naming scheme that we're doing away with this year in favor of descriptive file names.

I just noticed that I should be converting pct to 0-100 values.

I'm also on the fence about this pattern. (context)

@rviscomi rviscomi added the analysis Querying the dataset label Jul 19, 2020
@rviscomi rviscomi added this to the 2020 Analysis milestone Jul 19, 2020
@rviscomi rviscomi requested a review from a team July 19, 2020 17:43
@Tiggerito
Copy link
Contributor Author

Thanks @Tiggerito! I'll give it a more thorough review later.

Meanwhile could you also edit the top comment of this PR to include a checklist to show what queries are in this PR and what the remaining metrics to query are?

I'm learning. I didn't realise all commits automatically become part of an ongoing pull request. So I need to keep the first comment up to date on all changes made.

We're still working on the metrics and all the queries needed. I believe the other analysis can now see this and can work on it, which is where I wanted to be.

Maybe next time we don't do the draft pull so early, just do the branch? Or is the doing the pull important? I guess it triggers a review from the smart guys ;-)

10_01_structured_data_rich_results_rendered.sql

You can remove the 10_##_ prefixes from the file names, that was the old metric naming scheme that we're doing away with this year in favor of descriptive file names.

I left them in so that I could identify references to them from last year. Will we be updating almanac.js to list the fullnames of the scripts that use each property? We'd probably want to include the chapter name so people can find them and know who's using them. e.g.

// Used by:
// 2020/07_SEO/core_web_vitals_by_device.sql

Or maybe we just say which chapter

// Used by : SEO

almanac.js could do with a restructure as some properties use the '10_1' type syntax. Maybe they should also stick with descriptive names?

?

I just noticed that I should be converting pct to 0-100 values.

I'm also on the fence about this pattern. (context)

Gave it a thumbs up.

@rviscomi rviscomi removed the request for review from a team July 19, 2020 18:44
@rviscomi
Copy link
Member

rviscomi commented Jul 19, 2020

I'll remove the analyst reviewers for now. I thought it'd be good for everyone to peek at each others' work, but this PR isn't necessarily ready for review yet.

Maybe next time we don't do the draft pull so early, just do the branch? Or is the doing the pull important?

Early is good. And having the draft PR open gives you a place to track progress for each metric using the checklist. It's ok to continuously add to and modify the branch/checklist as the chapter planning evolves. Hopefully in about a week things will be settled and you'll know which metrics need to be queried.

almanac.js could do with a restructure as some properties use the '10_1' type syntax. Maybe they should also stick with descriptive names?

Yes, this is an unfortunate vestige of last year's system. Not sure if it's better to leave them so the 2019 queries make sense or update them for consistency with 2020. In any case, the file name is not the best place to reference the related 2019 query - for that you can add it as a comment in the SQL, # See related: sql/2019/10_SEO/10_01.sql.

@foxdavidj
Copy link
Contributor

I can help out with reviewing this when its ready for review @rviscomi

@Tiggerito Can you edit the name of the PR to reflect what its for as well? By default it just uses the name of the first commit you made haha. Maybe something like: SEO 2020 Queries

@Tiggerito Tiggerito mentioned this pull request Jul 20, 2020
10 tasks
@tunetheweb tunetheweb changed the title Create core_web_vitals_by_device.sql 2020 SEO queries Jul 20, 2020
sql/2020/07_SEO/10_02_lang_attribute_raw.sql Outdated Show resolved Hide resolved
sql/2020/07_SEO/10_04a_has_hreflang_raw.sql Outdated Show resolved Hide resolved
sql/2020/07_SEO/10_04b_top_hreflang_values.sql Outdated Show resolved Hide resolved
sql/2020/07_SEO/10_16_h1_length_percentiles_raw.sql Outdated Show resolved Hide resolved
@@ -0,0 +1,98 @@
# how many lson-ld scripts on a page
SELECT
ARRAY_LENGTH(REGEXP_EXTRACT_ALL(body, '(?i)<script[^>]*type=[\'"]?application\\/ld\\+json[\'"]?[^>]*>(.*?)<\\/script>')) AS jsonld_scripts_count,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Proposed custom metric:

document.querySelectorAll('script[type="application/ld+json" i]').length

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one probably needs to be on the raw body as it's designed to find out if the tags parses into json. But thinking about it, I think I can read the raw text of a script in the DOM and try to parse it there. Will test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does the i in the selector signify?

Copy link
Member

@tunetheweb tunetheweb Jul 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Case insensitivity. Plug the regex in here: https://regexr.com/ and make sure you choose PCRE instead of JavaScript and it should explain each part to you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, talking about the css selector from @rviscomi . I though that, but tested and it's case insensitive without it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just discovered that some attributes are case sensitive, so my gathering for meta tags is missing a whole bunch of data. I opened an issue for it:

HTTPArchive/legacy.httparchive.org#188

What's the best way to resolve this. Do I need to fork and fix to get it into the next crawl?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be nice to fix for the future but it won't be included until the October crawl at the soonest, the results of which would be available too late to be included in the Almanac. I don't think the case sensitivity would significantly affect the results, but if that's a concern we can include a side note about it in the chapter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mentioned to the team that we might be able to do a raw html crawl at the end to get anything we missed. If not, the side note will work.

My concern is that older websites tended to shout (upper case) their html, so they would be the ones we miss.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we submit the PR to fix this now while it’s fresh in our mind? Really want to avoid what we did last year which was not think about custom metrics for 11 months and then have a complete blind panic for the last month! Would be really nice to work on issues like this (or even custom metrics we wish we’d added in hindsight!) as we analyse the 2020 results so we’ve a head start on 2021.

Would also mean you can have a quick look during (or at end of) October and if the stats are radically different then can decide what to do.

Fix looks relatively simple since we’re using Chrome for the crawl: https://stackoverflow.com/questions/38399089/ignoring-case-sensitiveness-in-queryselectorall

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

sql/2020/07_SEO/json-ld-scripts-raw.sql Outdated Show resolved Hide resolved
Where possible I have change the queries to use data from almanac.js. I've identified almanac.js queires by rendered_ and body queries by raw_.

I've tried to merge as many queries as possible based on the data source and format of the results. This should make it a lot easier to manage as well as save on the number of queries made $$$

lighthouse.sql - all data from lighthouse
rendered_by_device.sql - rendered data from almanac.js grouped by device
rendered_percentiles_by_device.txt - rendered data from almanac.js reported by percentiles and grouped by device
raw_by_device.sql - data from the raw html - very expensive query when done on the real data - does not gather much at the moment. This could become our raw/rendered comparison query.

Other scripts are ones that have a unique output so can't be merged. Typically because they group by the field they are analysing.

At the moment most are pointing to test tables. But be careful. Always double check the cost before making a query as some old ones are still pointing to live tables. We could do with a test table for httparchive.almanac.summary_response_bodies as the table I'm currently testing with is in a different format.

We're starting to tie down the exact metrics needed, and get the almanac.js updated to fit those needs. My next move here is to see if I can get the comparisons data they want. This would be making a query joining several of the tables.
Added json_ld_scripts to other rendered queries
Moved no longer needed raw scripts into a sandbox folder.
renamed them
set to use test tables
merged where possible
@rviscomi
Copy link
Member

rviscomi commented Sep 10, 2020

WebPageTest Results is basically data to help me test. Do you not keep that sort of thing? I feel like it will give next years analyst a clue.

I think it's ok to remove these, the results spreadsheet will be the publicly available place for query results.

Edit: Oh nevermind, I see these are the sample inputs to the queries, not the outputs. Anyway, I do still think they can be removed. BigQuery table previews can give future analysts sample data. If there are any gotchas in the data, README would be the best place to document those.

@rviscomi
Copy link
Member

Most of my real queries contain a test data section, controlled by a simple if (false) statement. They can add a lot to the JavaScript but seem to have no effect on query performance. Should I remove my testing system?

The way I look at the queries that we check into source control is to help readers scrutinize our methodology. For example, answering "how did you calculate that metric?" I think our queries should be optimized for legibility, so it's ok to have lots of descriptive comments for any parts of the query that can be confusing. IMO the testing system woven into the queries gets in the way of the meat of the queries, which is how the metrics are implemented. Don't get me wrong, it's a brilliant idea to have a toggle to switch back to sample data, but I think it's only useful to us now while we debug the 2020 queries and future analysts won't need to debug queries that are already working, so they'd be mostly superfluous.

So I'd recommend stripping the testing code out of the queries, leaving only the essentials.

@rviscomi rviscomi requested a review from a team September 10, 2020 16:21
Copy link
Member

@rviscomi rviscomi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the first query. Let me know what you think of those suggestions, then I'll continue with the other queries.

@@ -0,0 +1,78 @@
#standardSQL
Copy link
Member

@rviscomi rviscomi Sep 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few suggestions that I'll propose by way of a rewritten query.

  • I agree with your comment about using percentiles. For this query it makes sense to represent the % of images with alt attributes per page as a distribution.
  • Do the division in the JS function while you're at it, which also simplifies the return type. Function renamed accordingly.
  • You could simplify some of the defensive if statements by returning null in the catch block. Null values are ignored in the percentile aggregation by default.
  • Various whitespace tidying.
  • Uppercase for all keywords, like AS.
  • Remove all "TEST" logic.
#standardSQL
# Distribution of the coverage of alt attributes over all images per page.
CREATE TEMPORARY FUNCTION get_percent_alt_images(markup_string STRING)
RETURNS FLOAT64 LANGUAGE js AS '''
try {
  var markup = JSON.parse(markup_string);
  return markup.images.img.alt.present / markup.images.img.total;
} catch (e) {
  return null;
}
''';

SELECT
  client,
  percentile,
  ROUND(APPROX_QUANTILES(pct_alt, 1000)[OFFSET(percentile * 10)], 4) AS pct_alt
FROM ( 
  SELECT 
    _TABLE_SUFFIX AS client,
    get_percent_alt_images(JSON_EXTRACT_SCALAR(payload, '$._markup')) AS pct_alt
  FROM
    `httparchive.pages.2020_08_01_*`),
  UNNEST([10, 25, 50, 75, 90]) AS percentile
GROUP BY
  client,
  percentile
ORDER BY
  percentile,
  client

Results:

client percentile pct_alt
desktop 10 0.00%
mobile 10 0.00%
desktop 25 10.53%
mobile 25 9.09%
desktop 50 50.00%
mobile 50 50.00%
desktop 75 89.74%
mobile 75 89.47%
desktop 90 100.00%
mobile 90 100.00%

(to be honest, I'm very surprised that the median is exactly 50% - please let me know if you spot anything wrong with my query!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should have made the first query the last one. It was one I did not think was too valuable, so did not focus on it much.

I think I'd already added it into my percentile query. Will check and fold in your comments there.

Do the division in the JS function while you're at it, which also simplifies the return type. Function renamed accordingly.

A few reasons I leave some maths to the SQL:

  • SQL has a safe div so no exception on zeros. Not an issue with one return value, bot could break all JS result data with a single mistake.
  • Returning more data could come in handy for the SQL writer. e.g. I often return a count and not a boolean indicating non zero. Again, to minify churn if the SQL writer decided they wanted something else. Or to try and keep the data provided consistent between queries.
  • I have the feeling that SQL runs far faster then JS.

You could simplify some of the defensive if statements by returning null in the catch block. Null values are ignored in the percentile aggregation by default.

That script does not go as defensive as a normally go!

This goes against my own beliefs and what I think is best practice. Exceptions should not be used to handle normal code flow.

It's more apparent when you see some of my more complex JS. If I was not defensive, then an exception would mean all subsequent data calculations would be lost. A bug in an early bit of code could cause the loss of much data.

The pattern I use where I create the result object at the start and always return it (even in an exception) is deliberate. That way an exception does not stop all data from being returned (null).

I've also tried to make my exception/error handling provide visible clues in the results that things were broken. Returning something like a null means an issue is silently ignored. Returning an error message where possible really stands out.

That reminds me, I built in exception handling to my custom metrics. I should check that to see if they did have issues. I did that after noticing that a reasonable percentage of results from last year had blank custom metrics, indicating an exception happened and was not handled.

Other comments are sound and I will work on them today.

I'll double check the custom metrics for this to make sure there is nothing strange causing the 50% median.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's hard to get my head around using percents to feed into percentiles 🤯

I've updated pages_markup_by_device_and_percentile.sql to include the percentile version of that query and have it pulling out a few values to help me check validity and understand things. Here's the result from a live run:

Row percentile client total img_count images_with_alt_present_percent images_with_alt_blank_percent images_with_alt_missing_percent images_with_alt_present images_with_alt_blank images_with_alt_missing  
1 10 desktop 5593642 2 0.0 0.0 0.0 0 0 0  
2 10 mobile 6347640 2 0.0 0.0 0.0 0 0 0  
3 25 desktop 5593642 8 0.1429 0.0 0.0 1 0 0  
4 25 mobile 6347640 8 0.125 0.0 0.0 1 0 0  
5 50 desktop 5593642 21 0.5385 0.1 0.0299 7 2 1  
6 50 mobile 6347640 19 0.5122 0.1111 0.0244 6 2 1  
7 75 desktop 5593642 43 0.9091 0.4906 0.3333 22 10 5  
8 75 mobile 6347640 39 0.9063 0.5 0.3333 20 9 5  
9 90 desktop 5593642 82 1.0 0.8261 0.8037 48 24 19  
10 90 mobile 6347640 73 1.0 0.8333 0.8235 43 22 18

The first column and last three columns are just counts. The others are the percents.

So I'm guessing it's saying things like (mobile):

The median page has 19 images, 10% have 2 or less, and 10% have 73 or more.

10% of pages have alt attributes present on all their images? While 10% have no images with the alt present. The median is to have 51% of alts present?

10% have 82% of their images without any alt attribute. 50% do well with less than 3% missing.

50% use less than 10% blank attributes.

Time to rest my brain and clean out all that test code.

@Tiggerito
Copy link
Contributor Author

Most of my real queries contain a test data section, controlled by a simple if (false) statement. They can add a lot to the JavaScript but seem to have no effect on query performance. Should I remove my testing system?

The way I look at the queries that we check into source control is to help readers scrutinize our methodology. For example, answering "how did you calculate that metric?" I think our queries should be optimized for legibility, so it's ok to have lots of descriptive comments for any parts of the query that can be confusing. IMO the testing system woven into the queries gets in the way of the meat of the queries, which is how the metrics are implemented. Don't get me wrong, it's a brilliant idea to have a toggle to switch back to sample data, but I think it's only useful to us now while we debug the 2020 queries and future analysts won't need to debug queries that are already working, so they'd be mostly superfluous.

So I'd recommend stripping the testing code out of the queries, leaving only the essentials.

I agree, it does distract from the main code.

On my first real query I found a bug that meant I had to quickly revert back to test mode, so what I had was handy. At least now the test code is available in version control, so I can quickly get it back again.

I think I will remove all test stuff and document my testing process in the README, including an indication on how to see my test versions. Can I label commits so it would be easy to find the last version before switching to live values? e.g. "contains test code".

I can also document how I get the custom metrics result data to help me test, and then remove my results folder.

@Tiggerito Tiggerito mentioned this pull request Sep 10, 2020
75 tasks
@Tiggerito
Copy link
Contributor Author

I was going to reference a commit that contained all my test code, so people could refer back to it to see how I did things. But I suspect all my commits are lost when it merges? So no record of my test code is kept?

testing image percentile query
removed commented out code
minor cosmetic changes
@Tiggerito
Copy link
Contributor Author

@rviscomi @bazzadp Do I need to hold off on gathering real data until this pull request is merged?

Copy link
Member

@rviscomi rviscomi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay @Tiggerito! Great work. This PR is most of the way there.

Similar to my feedback on the Markup PR, I do think these queries need to be split up by metric, not by table or grouping. I understand if you don't have the bandwidth to do that now, and I don't want to hold this up any further.

I've made a couple of comments about being more descriptive in the file names about what the underlying metrics are. These comments are applicable now for the queries that only contain a single metric. When we have time to split up the monolithic queries, each of the individual files should be similarly named more descriptively for the metric, so this is a step in that direction.

Other than that, most of my comments apply to how the metrics are aggregated. Grouping by INT64 values would effectively generate a histogram of granularity 1, which could lead to an unlimited number of values. Instead, to summarize distributions of values like this, we typically calculate as percentiles.

Once these comments are addressed we should be ready to run the queries, save the results to the Sheet, and start drafting the chapter. Thanks again for all the time you've spent on this, it will show through on the final result!

sql/2020/07_SEO/pages_almanac_by_device.sql Outdated Show resolved Hide resolved
COUNT(0) AS total,

# Pages with videos
COUNTIF(almanac_info.videos_total > 0) AS has_videos,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to prefer this over reqVideo or bytesVideo in summary_pages?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't know about them. How do I find out how they are generated?

This metric is a bit dodgy anyhow as it only finds video tags, and many people use other ways to show videos, like iframe. They are aware if its limitation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The requests.sql query emulates how videos are detected:

function prettyType(mimeType, ext) {

These fields are the count of requests and sum of requests' bytes per page. Could be a useful proxy for this metric.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a quick test. I don't think it picks up the embedding of videos unless they auto play.

I'd class this one as a low priority metric anyhow. Video is not a major SEO thing. I'll make it clear that the results are limited (already indicated that). If they want something better we can negotiate.

sql/2020/07_SEO/pages_almanac_by_device_and_percentile.sql Outdated Show resolved Hide resolved
_TABLE_SUFFIX AS client,
percentile,
url,
get_almanac_info(JSON_EXTRACT_SCALAR(payload, '$._almanac')) AS almanac_info
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto on reqVideo/bytesVideo

sql/2020/07_SEO/pages_robots_txt_by_device_and_status.sql Outdated Show resolved Hide resolved
SELECT
client,
COUNT(0) AS total,
ROUND(AVG(wpt_bodies_info.image_links)) AS image_links_avg,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto on percentiles

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@antoineeripret I'm not sure how this one works?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Tiggerito : what I did there, as I was aware that a freq table would end up with a lot of possible values, was to apply an AVG() function on the returned results. Maybe not the best idea I have had.

We can change it to return percentiles instead. I'll have a look at that during the afternoon (Spanish time).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've already added percentiles for those values in the percentiles script. Just want to make sure we can remove these?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, my bad! Yes, for sure !

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added some image v text link percentiles and dropped all the queries that are now covered by percentiles.

The requirements for "image v text links" are vague, in fact that is the requirements. I've created separate percentiles for each as well as a percentile of the percent of links that are images. Hopefully that will cover what they asked for.

I think all is done.

@Tiggerito
Copy link
Contributor Author

@rviscomi

I think all that's left is:

Removing the number of links queries (@antoineeripret are you ok with that?) and using the percentile results instead..

Maybe adding a reqVideo/bytesVideo if that data looks to be of any use.

Working out how to do pages_wpt_bodies_text_vs_images_links_by_device.sql (@antoineeripret ?)

@antoineeripret
Copy link
Contributor

antoineeripret commented Sep 17, 2020

@rviscomi

I think all that's left is:

Removing the number of links queries (@antoineeripret are you ok with that?) and using the percentile results instead..

Maybe adding a reqVideo/bytesVideo if that data looks to be of any use.

Working out how to do pages_wpt_bodies_text_vs_images_links_by_device.sql (@antoineeripret ?)

@Tiggerito : I will have time to work on what is needed this afternoon. If you figure out how to do pages_wpt_bodies_text_vs_images_links_by_device.sql before, just let me know, otherwise just commit what you have so far and I'll continue :)

added image v text link percentiles
@rviscomi
Copy link
Member

This is approved, so I'll merge and any last minute changes can be applied to a new PR. Great work everyone!

@rviscomi rviscomi merged commit f1d86c0 into main Sep 24, 2020
@rviscomi rviscomi deleted the seo-sql-2020 branch September 24, 2020 04:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
analysis Querying the dataset
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants