GPC Web Crawler

The GPC Web Crawler is developed and maintained by the OptMeowt team. In addition to this readme, check out our Wiki.

1. Research Publications
2. Introduction
3. Data
4. Development
5. Architecture
6. Components
7. Limitations/Known Issues/Bug Fixes
8. Other Resources
9. Thank You!

1. Research Publications

You can find a list of our research publications in the OptMeowt Analysis extension repo.

2. Introduction

The GPC Web Crawler analyzes websites' compliance with Global Privacy Control (GPC) at scale. GPC is a privacy preference signal that people can use to exercise their rights to opt out from web tracking. The GPC Web Crawler is based on Selenium and the OptMeowt Analysis extension.

3. Data

To track the evolution of GPC compliance on the web over time we are performing regular crawls of a set of 11,708 websites. Our crawl results are publicly available (results are for California; Connecticut and Colorado coming soon):

View crawl results on Google Sheets

Please note the following:

While our Crawler has high accuracy, occasional misclassifications are possible (for the accuracy of our Crawler see section 3.5 of our paper "Websites' Global Privacy Control Compliance at Scale and over Time").
Whether GPC applies to a site depends on thresholds of revenue, users, and other criteria. In our paper we estimated GPC applicability based on a site's web traffic estimate (see section 3.2 of our paper).

If you have any questions or suggestions, especially, if you believe a website has been incorrectly identified as non-compliant, please contact us at sebastian[at]privacytechlab.org.

4. Development

You can install the GPC Web Crawler on a consumer-grade computer. We use a MacBook. Get started as follows:

If you want to test sites' compliance with a particular law, for example, the California Consumer Privacy Act (CCPA), make sure to crawl the sites from a computer in the respective geographical location. If you are located in a different location, you can use a VPN. For example, we perform our crawls for the CCPA using Mullvad VPN set to Los Angeles, California.
Sign in to Docker, or create a Docker account if you do not already have one.
Download docker by following the instructions in the official Docker documentation
Authenticate to Docker Hub by following the instructions in the official Docker Documentation.
Clone this repo locally or download a zipped copy and unzip it.
If you are performing a test run of the Crawler or plan on running the Crawler on your own set of sites, follow the directions in the sublist of this bullet. If not, skip to step 6.
1. Open sites.csv and enter the URLs of the sites you want to analyze in the first column. Some examples are included in the file - do not change anything if you simply want to perform a test run.
2. In the root directory of the repo, the Crawler can be started on the chosen test batch of sites in sites.csv with debug mode on by running:
```
make custom
```
To run the Crawler on one of our eight preselected batches sites:
1. If you have already run the Crawler (perhaps to test it, or on another batch) and have containers running, run "make stop && make clean"
2. To start the Crawler with debug mode off, run:
```
  make start
```
  or to start the Crawler with debug mode on:
```
  make start-debug
```
3. When prompted with "Enter a batch number (1-8):", enter a number from one to eight, representing which batch of sites you wish to crawl.
4. If the crawl unexpectedly fails midway through, run make start again and re-select the batch you are interested in.
To check the analysis results, open a browser and navigate to http://localhost:8080/analysis. Ports may be different depending on your local server setup. So, you would need to adjust the URL or your configuration accordingly.
- After the crawl is completed, a .json file containing the analysis results will also be dumped in the crawl_results directory
To view the crawl results in a phpmyadmin, navigate to localhost in your browser. Enter the following credentials when prompted.
- Username: root
- Password: toor
If you modify the analysis extension, you should test it to make sure it still works properly. Some guidelines can be found in the Wiki.

Note: When you perform a crawl, for one reason or another, some sites may fail to analyze. We always perform a second crawl for the sites that failed the first time (i.e., the redo sites).

5. Architecture

Here is an overview of the GPC Web Crawler architecture:

All of this happens within the Desktop environment provided by the headless VNC container. The editable version of this image is in the Google Drive.

6. Components

The GPC Web Crawler consists of various components:

6.1 Crawler Script

The flow of the Crawler script is described in the diagram below.

This script is stored and executed on a Desktop environment living in a docker image. The Crawler also keeps a log of sites that cause errors. It stores these logs in the error-logging.json file and updates this file after each error.

Types of Errors that May Be Logged

TimeoutError: A Selenium error that is thrown when either the page has not loaded in 30 seconds or the page has not responded for 30 seconds. Timeouts are set in driver.setTimeouts.
HumanCheckError: A custom error that is thrown when the site has a title that we have observed means our VPN IP address is blocked or there is a human check on that site. See Limitations/Known Issues for more details.
InsecureCertificateError: A Selenium error that indicates that the site will not be loaded, as it has an insecure certificate.
WebDriverError: A Selenium error that indicates that the WebDriver has failed to execute some part of the script.
WebDriverError: Reached Error Page: This error indicates that an error page has been reached when Selenium tried to load the site.
UnexpectedAlertOpenError: This error indicates that a popup on the site disrupted Selenium's ability to analyze the site (such as a mandatory login).

6.2 OptMeowt Analysis Extension

The OptMeowt Analysis extension is packaged as an xpi file and installed on a Firefox Nightly browser by the Crawler script. When a site loads, the OptMeowt Analysis extension automatically analyzes the site and sends the analysis data to the local SQL database via a POST request. The analysis performed by the OptMeowt Analysis extension investigates the GPC compliance of a given site using a 4-step approach:

The extension checks whether the site is subject to the CCPA by looking at Firefox's urlClassification object. Requests returned by this object are based on the Disconnect list per Firefox's Enhanced Tracking Protection. Sending data to a site on the Disconnect will often qualify as sharing or selling of data subject to people's opt out right.
The extension checks the value of the US Privacy string, the GPP string, and OneTrust's OptanonConsent, OneTrustWPCCPAGoogleOptOut, and OTGPPConsent cookies, if any of these exist.
The extension sends a GPC signal to the site.
The extension rechecks the value of the US Privacy string, OneTrust cookies, and GPP string. If a site respects GPC, the values should be now set to opt out.

The information collected during this process is used to determine whether the site respects GPC. Note that legal obligations to respect GPC differ by geographic location. In order for a site to be GPC compliant, the following statements should be true after the GPC signal was sent for each string or cookie that the site implemented:

the third character of the US Privacy string is a Y
the value of the OptanonConsent cookie is isGpcEnabled=1
the opt out columns in the GPP string's relevant US section (i.e., SaleOptOut, TargetedAdvertisingOptOut, SharingOptOut) have a value of 1; Note that the columns and opt out requirements vary by state
the value of the OneTrustWPCCPAGoogleOptOut cookie is true

6.3 Node.js REST API

We use the REST API to make GET, PUT, and POST requests to the SQL database. The REST API is also local and is run in a separate terminal from the Crawler. Instructions for the REST API can be found in the Wiki.

6.4 SQL Database

The SQL database is a local database that stores analysis data. Instructions to set up the SQL database can be found in the Wiki. The columns of our database tables are below:

id	site_id	domain	sent_gpc	uspapi_before_gpc	uspapi_after_gpc	usp_cookies_before_gpc	usp_cookies_after_gpc	OptanonConsent_before_gpc	OptanonConsent_after_gpc	gpp_before_gpc	gpp_after_gpc	gpp_version	urlClassification	OneTrustWPCCPAGoogleOptOut_before_gpc	OneTrustWPCCPAGoogleOptOut_after_gpc	OTGPPConsent_before_gpc	OTGPPConsent_after_gpc

The first few columns primarily pertain to identifying the site and verifying that the OptMeowt Analysis extension is working properly.

id: autoincrement primary key to identify the database entry
site_id: the id of the domain in the csv file that lists the sites to crawl. This id is used for processing purposes (i.e., to identify domains that redirect to another domain) and is set by the Crawler script
domain: the domain name of the site
sent_gpc: a binary indicator of whether the OptMeowt Analysis extension sent a GPC opt out signal to the site

The remaining columns pertain to the opt out status of a user, i.e., the OptMeowt Analysis extension, which is indicated by the value of the US Privacy string, OptanonConsent cookie, and GPP string. The US Privacy string can be implemented on a site via (1) the client-side JavaScript USPAPI, which returns the US Privacy string value when called, or (2) an HTTP cookie that stores its value. The OptMeowt Analysis extension checks each site for both implementations of the US Privacy string by calling the USPAPI and checking all cookies. The GPP string's value is obtained via the CMPAPI for GPP.

uspapi_before_gpc: return value of calling the USPAPI before a GPC opt out signal is sent
uspapi_after_gpc: return value of calling the USPAPI after a GPC opt out signal was sent
usp_cookies_before_gpc: the value of the US Privacy string in an HTTP cookie before a GPC opt out signal is sent
usp_cookies_after_gpc: the value of the US Privacy string in an HTTP cookie after a GPC opt out signal was sent
OptanonConsent_before_gpc: the isGpcEnabled string from OneTrust's OptanonConsent cookie before a GPC opt out signal is sent. The user is opted out if isGpcEnabled=1, and the user is not opted out if isGpcEnabled=0. If the cookie is present but does not have an isGpcEnabled string, we return "no_gpc"
OptanonConsent_after_gpc: the isGpcEnabled string from OneTrust's OptanonConsent cookie after a GPC opt out signal was sent. The user is opted out if isGpcEnabled=1, and the user is not opted out if isGpcEnabled=0. If the cookie is present but does not have an isGpcEnabled string, we return "no_gpc"
gpp_before_gpc: the value of the GPP string before a GPC opt out signal is sent
gpp_after_gpc: the value of the GPP string after a GPC opt out signal was sent
gpp_version: the version of the CMP API that obtains the GPP string (i.e., v1.0 has a getGPPdata command while v1.1 removes the getGPPdata command and its return values in favor of callback functions)
urlClassification: the return value of Firefox's urlClassificaton object, sorted by category and filtered for the following categories: fingerprinting, tracking_ad, tracking_social, any_basic_tracking, any_social_tracking
OneTrustWPCCPAGoogleOptOut_before_gpc: the value of the OneTrustWPCCPAGoogleOptOut cookie before a GPC signal is sent. This cookie is described by OneTrust. Additional information is available in issue #94
OneTrustWPCCPAGoogleOptOut_after_gpc: the value of the OneTrustWPCCPAGoogleOptOut cookie after a GPC signal was sent. This cookie is described by OneTrust. Additional information is available in issue #94
OTGPPConsent_before_gpc: the value of the OTGPPConsent cookie before a GPC signal is sent. This cookie is described by OneTrust. Additional information is available in issue #94
OTGPPConsent_after_gpc: the value of the OTGPPConsent cookie after a GPC signal was sent. This cookie is described by OneTrust. Additional information is available in issue #94

7. Limitations/Known Issues/Bug Fixes

7.1 Sites that Cannot Be Analyzed

Since we are using Selenium and a VPN to visit the sites we analyze, there are some limitations to the sites we can analyze. There are some types of sites that we cannot analyze due to our methodology:

Sites where the VPN's IP address is blocked.

For example, a site titled "Access Denied" that says we do not have permission to access the site on this server is loaded instead of the real site.
Sites that have some kind of human check.

Some sites can detect that we are using automation tools (i.e., Selenium) and do not let us access the real site. Instead, we are redirected to a page with some kind of captcha or puzzle. We do not try to bypass any human checks.

Since the data collected from both of these types of sites (i.e., (1) sites that block our VPN's IP address and (2) sites that have some kind of human check) will be incorrect and occur because our automation was detected, we list them under HumanCheckError in the error-logging.json file. We have observed a few different site titles that indicate we have reached a site in one of these categories. Most of the titles occur for multiple sites, with the most common being "Just a Moment…" on a captcha from Cloudflare. We detect when our Crawler visits one of these sites by matching the site title of the loaded site with a set of regular expressions that match with the known titles. Clearly, we will miss some sites in this category if we have not seen it and added the title to the set of regular expressions. We are updating the regular expressions as we see more sites like this. For more information, see issue #51.
Sites that block script injection.

For instance, https://www.flickr.com blocks script injection and will not successfully be analyzed. In the debugging table, on the first attempt, the last message will be runAnalysis-fetching, and on the second attempt, the extension logs SQL POSTING: SOMETHING WENT WRONG.
Sites that redirect between multiple domains throughout analysis.

For instance, https://spothero.com/ and https://parkingpanda.com/ are now one entity but still can use both domains. In the debugging table, you will see multiple debugging entries under each domain. Because we store analysis data by domain, the data will be incomplete and will not be added to the database.

7.2 Important Bug Fixes

At some point the Crawler kept returning an empty result for Firefox's urlClassification object. @eakubilo fixed this tricky bug.

8. Other Resources

8.1 Python Library for GPP String Decoding

GPP strings must be decoded. The IAB provides a JavaScript library and an interactive html decoder to do so. To integrate decoding with our colab notebooks for data analysis, we rewrote the library in Python. The library can be found on our Google Drive. More info can be found in our Wiki and the related issue.

8.2 .well-known/gpc.json Python Script

We collect .well-known/gpc.json data after the whole crawl finishes with a separate Python script, selenium-optmeowt-crawler/well-known-collection.py.

Here are the steps for doing so:

Just as the GPC Web Crawler, this script should be run using the same VPN location after all eight crawl batches are completed
Ensure the lock screen setting is as for the usual crawl
Change directories to well-known-crawl
On line 25 of well-known-adhoc.py, change csv_path to the location of the list of sites you wish to crawl
On line 38 of well-known-adhoc.py, change save_path to the location you wish to save the results to, for example: `save_path = "well-known-data.csv"'
Start the script using:
```
python3 well-known-adhoc.py
```

Running this script requires three input files: selenium-optmeowt-crawler/full-crawl-set.csv, which is in the repo, redo-original-sites.csv, and redo-sites.csv. The second two files are not found in the repo and should be created for that crawl based on the instructions in our Wiki. As explained in selenium-optmeowt-crawler/well-known-collection.py, the output is a csv called well-known-data.csv with three columns: Site URL, request status, json data as well as an error json file called well-known-errors.json that logs all errors. To run this script on a csv file of sites without accounting for redo sites, comment all lines between line 27 and line 40 except for line 34.

Details of the .well-known Analysis

Analyze the full crawl set with the redo sites replaced, i.e., using the full set of sites and the sites that we have redone (which replaced the original sites with redo sites).

Output
1. If successful, a csv with three columns will be created: Site URL, request status, json data
2. If not successful, an error json file will be created: logs all errors, including the reason for an error and 500 characters of the request text
  
  Examples of an error:
  - "Expecting value: line 1 column 1 (char 0)": the status code was 200 (site exists and loaded) or 202 (the request is accepted but incomplete processing) but did not find a json (output: Site_URL, 200, None or Site_URL, 202, None)
  - Reason: site sent all incorrect URLs to a generic error page instead of not serving the page, which would have been a 404 status code
Status Codes (HTTP Responses)
- In general, we expect a 404 status code (Not Found) when a site does not have a .well-known/gpc.json (output: Site_URL, 404, None)
- Other possible status codes signaling that the .well-known data is not found include but are not limited to: 403 (Forbidden: the server understands the request but refuses to authorize it), 500 (Internal Server Error: the server encountered an unexpected condition that prevented it from fulfilling the request), 406 (Not Acceptable: the server cannot produce a response matching the list of acceptable values define), 429 (Too Many Requests)
.well-known-collection.py Code Rundown
1. First, the file reads in the full site set, i.e., original sites and redo sites
  - sites_df.index(redo_original_sites[idx]): get the index of the site we want to change
  - sites_list[x] = redo_new_sites[idx]: replace the site with the new site
2. r = requests.get(sites*df[site_idx] + '/.well-known/gpc.json', timeout=35): The code runs with a timeout of 35 seconds (to stay consistent with Crawler timeouts)
  (i) checks if there will be json data, then logging all three columns (Site URL, request status, json data)
  (ii) if there is no json data, it will just log the status and site
  (iii) if r.json is not json data(), the "Expecting value: line 1 column 1 (char 0)", means that the status .." error will appear in the error logging and the error will log site and status
  (iv) if the request.get does not finish within 35 seconds, it will store errors and only log site
Important Code Documentation
- "file1.write(sites_df[site_idx] + "," + str(r.status_code) + ',"' + str(r.json()) + '"\n')" : writing data to a file with three columns (site, status and json data)
- "errors[sites_df[site_idx]] = str(e)" -> store errors with original links
- "with open("well-known-errors.json", "w") as outfile: json.dump(errors, outfile)" -> convert and write JSON object as containing errors to file

9. Thank You!

We would like to thank our supporters!

Major financial support provided by the National Science Foundation.

Additional financial support provided by the Alfred P. Sloan Foundation, Wesleyan University, and the Anil Fernando Endowment.

Conclusions reached or positions taken are our own and not necessarily those of our financial supporters, its trustees, officers, or staff.

Name		Name	Last commit message	Last commit date
Latest commit History 385 Commits
gpc-analysis-extension		gpc-analysis-extension
mariadb-compose		mariadb-compose
rest_api		rest_api
selenium-optmeowt-crawler		selenium-optmeowt-crawler
well-known-crawl		well-known-crawl
.gitignore		.gitignore
FUNDING.yml		FUNDING.yml
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
compose.yaml		compose.yaml
crawler-architecture.png		crawler-architecture.png
data_screen.png		data_screen.png
gpc-logo-small-black.svg		gpc-logo-small-black.svg
nsf.png		nsf.png
package-lock.json		package-lock.json
package.json		package.json
plt_logo.png		plt_logo.png
run-crawlers.sh		run-crawlers.sh
sloan_logo.jpg		sloan_logo.jpg
supervisord.conf		supervisord.conf
wesleyan_shield.png		wesleyan_shield.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GPC Web Crawler

1. Research Publications

2. Introduction

3. Data

4. Development

5. Architecture

6. Components

6.1 Crawler Script

Types of Errors that May Be Logged

6.2 OptMeowt Analysis Extension

6.3 Node.js REST API

6.4 SQL Database

7. Limitations/Known Issues/Bug Fixes

7.1 Sites that Cannot Be Analyzed

7.2 Important Bug Fixes

8. Other Resources

8.1 Python Library for GPP String Decoding

8.2 .well-known/gpc.json Python Script

Details of the .well-known Analysis

9. Thank You!

About

Releases 5

Sponsor this project

Packages

Contributors 11

Languages

License

privacy-tech-lab/gpc-web-crawler

Folders and files

Latest commit

History

Repository files navigation

GPC Web Crawler

1. Research Publications

2. Introduction

3. Data

4. Development

5. Architecture

6. Components

6.1 Crawler Script

Types of Errors that May Be Logged

6.2 OptMeowt Analysis Extension

6.3 Node.js REST API

6.4 SQL Database

7. Limitations/Known Issues/Bug Fixes

7.1 Sites that Cannot Be Analyzed

7.2 Important Bug Fixes

8. Other Resources

8.1 Python Library for GPP String Decoding

8.2 .well-known/gpc.json Python Script

Details of the .well-known Analysis

9. Thank You!

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 5

Sponsor this project

Packages 0

Contributors 11

Languages

Packages