Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create Samples tables for 2020 almanac. #1032

Closed
paulcalvano opened this issue Jul 15, 2020 · 12 comments
Closed

Create Samples tables for 2020 almanac. #1032

paulcalvano opened this issue Jul 15, 2020 · 12 comments
Assignees
Labels
analysis Querying the dataset
Milestone

Comments

@paulcalvano
Copy link
Contributor

paulcalvano commented Jul 15, 2020

I'm going to be deleting the existing sample_data dataset, and replacing it with a fresh set of sample data using the June 2020 tables.

Will be selecting 10k random pages, and then will use that page list to create all of the other tables.

Tables I'll be creating are

summary_pages_mobile_10k
summary_requests_mobile_10k
pages_mobile_10k
requests_mobile_10k
lighthouse_mobile_10k
response_bodies_mobile_10k
technologies_mobile_10k

summary_pages_desktop_10k
summary_requests_desktop_10k
pages_desktop_10k
requests_desktop_10k
response_bodies_desktop_10k
technologies_desktop_10k

@paulcalvano paulcalvano self-assigned this Jul 15, 2020
@paulcalvano
Copy link
Contributor Author

I backed up the existing sample_data dataset to sample_data_2019

@paulcalvano
Copy link
Contributor Author

Used RAND() to randomly select the 10K pages.

-- summary_pages_mobile_10k
SELECT *
FROM  `httparchive.summary_pages.2020_06_01_mobile` 
ORDER BY RAND() ASC 
LIMIT 10000
-- summary_pages_desktop_10k
SELECT *
FROM  `httparchive.summary_pages.2020_06_01_desktop` 
ORDER BY RAND() ASC 
LIMIT 10000

@paulcalvano
Copy link
Contributor Author

paulcalvano commented Jul 15, 2020

Summary Requests

-- summary_requests_mobile_10k
SELECT *
FROM  `httparchive.summary_requests.2020_06_01_mobile` 
WHERE pageid IN (
    SELECT pageid FROM `httparchive.sample_data.summary_pages_mobile_10k` 
    )
-- summary_requests_desktop_10k
SELECT *
FROM  `httparchive.summary_requests.2020_06_01_desktop` 
WHERE pageid IN (
    SELECT pageid FROM `httparchive.sample_data.summary_pages_desktop_10k` 
    )

@rviscomi rviscomi added the analysis Querying the dataset label Jul 15, 2020
@rviscomi rviscomi added this to the 2020 Analysis milestone Jul 15, 2020
@paulcalvano
Copy link
Contributor Author

Pages

-- pages_mobile_10k
SELECT *
FROM  `httparchive.pages.2020_06_01_mobile` 
WHERE url IN (
    SELECT url FROM `httparchive.sample_data.summary_pages_mobile_10k` 
    )
-- pages_desktop_10k
SELECT *
FROM  `httparchive.pages.2020_06_01_desktop` 
WHERE url IN (
    SELECT url FROM `httparchive.sample_data.summary_pages_desktop_10k` 
    )

@paulcalvano
Copy link
Contributor Author

Requests

-- requests_mobile_10k
SELECT *
FROM  `httparchive.requests.2020_06_01_mobile` 
WHERE page IN (
    SELECT url FROM `httparchive.sample_data.summary_pages_mobile_10k` 
    )
-- requests_desktop_10k
SELECT *
FROM  `httparchive.requests.2020_06_01_desktop` 
WHERE page IN (
    SELECT url FROM `httparchive.sample_data.summary_pages_desktop_10k` 
    )

@paulcalvano
Copy link
Contributor Author

Lighthouse

-- lighthouse_mobile_10k
SELECT *
FROM  `httparchive.lighthouse.2020_06_01_mobile` 
WHERE url IN (
    SELECT url FROM `httparchive.sample_data.summary_pages_mobile_10k` 
    )

@paulcalvano
Copy link
Contributor Author

Response Bodies

-- response_bodies_mobile_10k
SELECT *
FROM  `httparchive.response_bodies.2020_06_01_mobile` 
WHERE page IN (
    SELECT url FROM `httparchive.sample_data.summary_pages_mobile_10k` 
    )
-- response_bodies_desktop_10k
SELECT *
FROM  `httparchive.response_bodies.2020_06_01_desktop` 
WHERE page IN (
    SELECT url FROM `httparchive.sample_data.summary_pages_desktop_10k` 
    )

@paulcalvano
Copy link
Contributor Author

Technologies

-- technologies_mobile_10k
SELECT *
FROM  `httparchive.technologies.2020_06_01_mobile` 
WHERE url IN (
    SELECT url FROM `httparchive.sample_data.summary_pages_mobile_10k` 
    )
-- technologies_desktop_10k
SELECT *
FROM  `httparchive.technologies.2020_06_01_desktop` 
WHERE url IN (
    SELECT url FROM `httparchive.sample_data.summary_pages_desktop_10k` 
    )

@rviscomi
Copy link
Member

rviscomi commented Jul 15, 2020

@paulcalvano thanks for working on these! The almanac dataset last year had some preprocessed tables, for example the summary_response_bodies table combined summary_request data with response_bodies. The other benefit of the almanac dataset was that it used partitioning and clustering to make the queries more efficient. Do you anticipate making something similar available this year? And if so do you think those would be useful at 10k scale? It might be good to train analysts on data that more closely matches the real thing.

See #180 (comment) for some context behind how those were generated.

@paulcalvano
Copy link
Contributor Author

All done. Here's a summary of the tables:

Mobile

Table Rows Size
summary_pages_mobile_10k 10,000 7.4 MB
summary_requests_mobile_10k 896,527 979 MB
pages_mobile_10k 9,999 493.4 MB
requests_mobile_10k 906,324 5.85 GB
lighthouse_mobile_10k 9,954 2.77 GB
response_bodies_mobile_10k 419,970 22 GB
technologies_mobile_10k 102,412 5.9 MB

Desktop

Table Rows Size
summary_pages_desktop_10k 10,000 7.4 MB
summary_requests_desktop_10k 926,387 960 MB
pages_desktop_10k 10,000 490.4 MB
requests_desktop_10k 940,001 5.78 GB
response_bodies_desktop_10k 372,134 18.8 GB
technologies_desktop_10k 104,162 6 MB

@paulcalvano
Copy link
Contributor Author

Thanks @rviscomi . I'll take a look at the preprocessed tables in the morning.

This was referenced Jul 16, 2020
@smatei smatei mentioned this issue Jul 18, 2020
10 tasks
@paulcalvano
Copy link
Contributor Author

I just updated all the sample data tables with August's data. httparchive.sample_data now contains data from 10,000 randomly selected sites from the August 2020 dataset.

The previous sample dataset has been archived as httparchive.sample_data_jun2020

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
analysis Querying the dataset
Projects
None yet
Development

No branches or pull requests

2 participants