OptMeowt logo

GPC Web Crawler

The GPC Web Crawler is developed and maintained by the OptMeowt team. In addition to this readme, check out our Wiki.

1. Research Publications
2. Introduction
3. Development
4. Architecture
5. Components
6. Limitations/Known Issues/Bug Fixes
7. Other Resources
8. Thank You!

1. Research Publications

You can find a list of our research publications in the OptMeowt Analysis extension repo.

2. Introduction

The GPC Web Crawler analyzes websites' compliance with Global Privacy Control (GPC) at scale. GPC is a privacy preference signal that people can use to exercise their rights to opt out from web tracking. The GPC Web Crawler is based on Selenium and the OptMeowt Analysis extension.
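
For reference, GPC is communicated to a site via the Sec-GPC HTTP request header and the navigator.globalPrivacyControl JavaScript property. As a minimal standalone illustration (not part of the crawler, which sends the signal through the OptMeowt Analysis extension in Firefox), the header could be sent like this with Python's requests library; the URL is a placeholder:

    import requests

    # Minimal standalone illustration of the GPC signal: the Sec-GPC request header
    # set to "1" expresses an opt out preference. The crawler itself sends GPC through
    # the OptMeowt Analysis extension in Firefox; the URL here is a placeholder.
    response = requests.get(
        "https://example.com",
        headers={"Sec-GPC": "1"},
        timeout=30,
    )
    print(response.status_code)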

3. Development

You can install the GPC Web Crawler on a consumer-grade computer. We use a MacBook. Get started as follows:

  1. If you want to test sites' compliance with a particular law, for example, the California Consumer Privacy Act (CCPA), make sure to crawl the sites from a computer in the respective geographical location. If you are located in a different location, you can use a VPN. We perform our crawls for the CCPA using Mullvad VPN set to Los Angeles, California.

  2. Sign in to Docker, or create a Docker account if you do not already have one.

  3. Download Docker by following the instructions in the official Docker documentation.

  4. Authenticate to Docker Hub by following the instructions in the official Docker documentation.

  5. Clone this repo locally or download a zipped copy and unzip it.

  6. Open sites.csv and enter the URLs of the sites you want to analyze in the first column. Some examples are included in the file.

  7. In the root directory of the repo, start the crawler in its Docker container by running:

    sh scripts/start_container.sh

    or to start the crawler with enhanced debugging information:

    sh scripts/start_container.sh debug

    • If you instead want to run the crawler on your local machine, follow the instructions in the Wiki.
  8. To check the analysis results, open a browser and navigate to http://localhost:8080/analysis. Ports may differ depending on your local server setup, so adjust the URL or your configuration accordingly. The results can also be fetched programmatically; see the sketch after this list.

  9. To watch the crawler operate on the Desktop environment, open a browser and navigate to http://localhost:6901/vnc.html. Click the button that says "connect" in the center of the screen. When prompted for a password, enter vncpassword.

  10. To set up the analysis database on your local machine with the same structure and data as in the container, follow these steps:

    Once the crawl is complete, enter the running container by using:

    docker exec -it crawl_test /bin/bash

    Inside the container, create a SQL dump file containing both the database structure and data:

    mysqldump -u root -p analysis > /srv/analysis/entries_export.sql
    • Password: toor

    In a new terminal window, use the following command to copy the SQL file from the container to the current directory on your local machine:

    docker cp crawl_test:/srv/analysis/entries_export.sql ./entries_export.sql

    Finally, open your preferred database manager (such as phpMyAdmin or MySQL Workbench) and import the entries_export.sql file to recreate the database on your local machine.

  11. If you modify the analysis extension, you should test it to make sure it still works properly. Some guidelines can be found in the Wiki.
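
As mentioned in step 8, the analysis results can also be fetched programmatically. Below is a minimal sketch, assuming the REST API serves JSON at the same http://localhost:8080/analysis endpoint and returns a JSON array (both assumptions; adjust to your setup):

    import json
    import requests

    # Fetch the analysis entries from the local REST API. The endpoint is the one
    # from step 8; that it returns a JSON array is our assumption, so adjust as needed.
    resp = requests.get("http://localhost:8080/analysis", timeout=10)
    resp.raise_for_status()
    entries = resp.json()
    print(f"{len(entries)} entries collected so far")
    if entries:
        print(json.dumps(entries[0], indent=2))  # inspect the first entry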

Note: When you perform a crawl, some sites may fail to analyze for one reason or another. We always perform a second crawl for the sites that failed the first time (i.e., the redo sites).

4. Architecture

Here is an overview of the GPC Web Crawler architecture:

crawler-architecture

All of this happens within the Desktop environment provided by the headless VNC container. The editable version of this image is on our Google Drive.

5. Components

The GPC Web Crawler consists of various components:

5.1 Crawler Script

The flow of the crawler script is described in the diagram below.

analysis-flow

This script is stored and executed in a Desktop environment running in a Docker container. The Crawler also keeps a log of sites that cause errors. It stores these logs in the error-logging.json file and updates this file after each error.

Types of Errors that May Be Logged

  1. TimeoutError: A Selenium error that is thrown when either the page has not loaded in 30 seconds or the page has not responded for 30 seconds. Timeouts are set in driver.setTimeouts (see the sketch after this list).
  2. HumanCheckError: A custom error that is thrown when the site has a title that we have observed means our VPN IP address is blocked or there is a human check on that site. See Limitations/Known Issues for more details.
  3. InsecureCertificateError: A Selenium error that indicates that the site will not be loaded, as it has an insecure certificate.
  4. WebDriverError: A Selenium error that indicates that the WebDriver has failed to execute some part of the script.
  5. WebDriverError: Reached Error Page: This error indicates that an error page was reached when Selenium tried to load the site.
  6. UnexpectedAlertOpenError: This error indicates that a popup on the site disrupted Selenium's ability to analyze the site (such as a mandatory login).
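
The crawler script itself is written in Node.js and configures these limits through Selenium's driver.setTimeouts. Purely for illustration, here is a rough equivalent in Python's Selenium bindings, showing the 30-second page-load and script limits and how a timeout surfaces; this is a sketch, not the crawler's actual code:

    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException

    # Rough Python-Selenium illustration of the 30-second limits described above.
    # The crawler's own script is Node.js and configures them via driver.setTimeouts.
    options = webdriver.FirefoxOptions()
    driver = webdriver.Firefox(options=options)
    driver.set_page_load_timeout(30)  # page must finish loading within 30 s
    driver.set_script_timeout(30)     # injected scripts must respond within 30 s

    try:
        driver.get("https://example.com")  # placeholder site
    except TimeoutException as err:
        # The crawler logs comparable failures as TimeoutError in error-logging.json.
        print(f"timed out: {err}")
    finally:
        driver.quit()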

5.2 OptMeowt Analysis Extension

The OptMeowt Analysis extension is packaged as an xpi file and installed on a Firefox Nightly browser by the crawler script. When a site loads, the OptMeowt Analysis extension automatically analyzes the site and sends the analysis data to the local SQL database via a POST request. The analysis performed by the OptMeowt Analysis extension investigates the GPC compliance of a given site using a 4-step approach:

  1. The extension checks whether the site is subject to the CCPA by looking at Firefox's urlClassification object. Requests returned by this object are based on the Disconnect list per Firefox's Enhanced Tracking Protection. Sending data to a site on the Disconnect list will often qualify as sharing or selling of data subject to people's opt out rights.
  2. The extension checks the value of the US Privacy string, the GPP string, and OneTrust's OptanonConsent, OneTrustWPCCPAGoogleOptOut, and OTGPPConsent cookies, if any of these exist.
  3. The extension sends a GPC signal to the site.
  4. The extension rechecks the value of the US Privacy string, OneTrust cookies, and GPP string. If a site respects GPC, the values should now be set to opt out.

The information collected during this process is used to determine whether the site respects GPC. Note that legal obligations to respect GPC differ by geographic location. For a site to be GPC compliant, the following statements should be true after the GPC signal was sent, for each string or cookie that the site implements (see the sketch after this list):

  1. the third character of the US Privacy string is a Y
  2. the value of the OptanonConsent cookie is isGpcEnabled=1
  3. the opt out columns in the GPP string's relevant US section (i.e., SaleOptOut, TargetedAdvertisingOptOut, SharingOptOut) have a value of 1; note that the columns and opt out requirements vary by state
  4. the value of the OneTrustWPCCPAGoogleOptOut cookie is true
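
For illustration, the four checks above could be expressed as follows, assuming the post-GPC values have already been collected into a record shaped like the database columns in Section 5.4. The gpp_us_section_opt_outs field is hypothetical and stands in for the decoded GPP opt out columns:

    # A minimal sketch of the four checks above, assuming the post-GPC values have
    # already been collected into a record shaped like the Section 5.4 columns.
    # The gpp_us_section_opt_outs field is hypothetical and stands in for the
    # decoded opt out columns of the GPP string's relevant US section.
    def respects_gpc(entry: dict) -> bool:
        checks = []

        usp = entry.get("uspapi_after_gpc") or entry.get("usp_cookies_after_gpc")
        if usp:
            checks.append(len(usp) >= 3 and usp[2] == "Y")  # third character is Y

        optanon = entry.get("OptanonConsent_after_gpc")
        if optanon and optanon != "no_gpc":
            checks.append("isGpcEnabled=1" in optanon)

        gpp_opt_outs = entry.get("gpp_us_section_opt_outs")
        if gpp_opt_outs:
            checks.append(all(value == 1 for value in gpp_opt_outs.values()))

        onetrust = entry.get("OneTrustWPCCPAGoogleOptOut_after_gpc")
        if onetrust is not None:
            checks.append(str(onetrust).lower() == "true")

        # Compliant only if every mechanism the site implements indicates an opt out.
        return bool(checks) and all(checks)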

5.3 Node.js REST API

We use the REST API to make GET, PUT, and POST requests to the SQL database. The REST API is also local and is run in a separate terminal from the crawler. Instructions for the REST API can be found in the Wiki.
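
For a sense of what such a request looks like, here is a hedged sketch of a POST in Python; the /analysis route is an assumption (it mirrors the endpoint used for reading results in Section 3), and only a few of the Section 5.4 columns are included. The extension itself performs this POST from its own JavaScript:

    import requests

    # Hedged sketch of a POST like the one the extension makes. The /analysis route
    # is an assumption (it mirrors the endpoint used for reading results in Section 3),
    # and only a few of the Section 5.4 columns are shown.
    record = {
        "site_id": 0,
        "domain": "example.com",
        "sent_gpc": 1,
        "uspapi_before_gpc": "1YNN",
        "uspapi_after_gpc": "1YYN",
    }
    resp = requests.post("http://localhost:8080/analysis", json=record, timeout=10)
    resp.raise_for_status()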

5.4 SQL Database

The SQL database is a local database that stores analysis data. Instructions to set up the SQL database can be found in the Wiki. The columns of our database tables are below:

id, site_id, domain, sent_gpc, uspapi_before_gpc, uspapi_after_gpc, usp_cookies_before_gpc, usp_cookies_after_gpc, OptanonConsent_before_gpc, OptanonConsent_after_gpc, gpp_before_gpc, gpp_after_gpc, gpp_version, urlClassification, OneTrustWPCCPAGoogleOptOut_before_gpc, OneTrustWPCCPAGoogleOptOut_after_gpc, OTGPPConsent_before_gpc, OTGPPConsent_after_gpc

The first few columns primarily pertain to identifying the site and verifying that the OptMeowt Analysis extension is working properly.

  • id: autoincrement primary key to identify the database entry
  • site_id: the id of the domain in the csv file that lists the sites to crawl. This id is used for processing purposes (i.e., to identify domains that redirect to another domain) and is set by the crawler script
  • domain: the domain name of the site
  • sent_gpc: a binary indicator of whether the OptMeowt Analysis extension sent a GPC opt out signal to the site

The remaining columns pertain to the opt out status of a user, i.e., the OptMeowt Analysis extension, which is indicated by the value of the US Privacy string, OptanonConsent cookie, and GPP string. The US Privacy string can be implemented on a site via (1) the client-side JavaScript USPAPI, which returns the US Privacy string value when called, or (2) an HTTP cookie that stores its value. The OptMeowt Analysis extension checks each site for both implementations of the US Privacy string by calling the USPAPI and checking all cookies. The GPP string's value is obtained via the CMP API for GPP (see the sketch after the list of columns below).

  • uspapi_before_gpc: return value of calling the USPAPI before a GPC opt out signal is sent
  • uspapi_after_gpc: return value of calling the USPAPI after a GPC opt out signal was sent
  • usp_cookies_before_gpc: the value of the US Privacy string in an HTTP cookie before a GPC opt out signal is sent
  • usp_cookies_after_gpc: the value of the US Privacy string in an HTTP cookie after a GPC opt out signal was sent
  • OptanonConsent_before_gpc: the isGpcEnabled string from OneTrust's OptanonConsent cookie before a GPC opt out signal is sent. The user is opted out if isGpcEnabled=1, and the user is not opted out if isGpcEnabled=0. If the cookie is present but does not have an isGpcEnabled string, we return "no_gpc"
  • OptanonConsent_after_gpc: the isGpcEnabled string from OneTrust's OptanonConsent cookie after a GPC opt out signal was sent. The user is opted out if isGpcEnabled=1, and the user is not opted out if isGpcEnabled=0. If the cookie is present but does not have an isGpcEnabled string, we return "no_gpc"
  • gpp_before_gpc: the value of the GPP string before a GPC opt out signal is sent
  • gpp_after_gpc: the value of the GPP string after a GPC opt out signal was sent
  • gpp_version: the version of the CMP API that obtains the GPP string (i.e., v1.0 has a getGPPdata command while v1.1 removes the getGPPdata command and its return values in favor of callback functions)
  • urlClassification: the return value of Firefox's urlClassification object, sorted by category and filtered for the following categories: fingerprinting, tracking_ad, tracking_social, any_basic_tracking, any_social_tracking
  • OneTrustWPCCPAGoogleOptOut_before_gpc: the value of the OneTrustWPCCPAGoogleOptOut cookie before a GPC signal is sent. This cookie is described by OneTrust. Additional information is available in issue #94
  • OneTrustWPCCPAGoogleOptOut_after_gpc: the value of the OneTrustWPCCPAGoogleOptOut cookie after a GPC signal was sent. This cookie is described by OneTrust. Additional information is available in issue #94
  • OTGPPConsent_before_gpc: the value of the OTGPPConsent cookie before a GPC signal is sent. This cookie is described by OneTrust. Additional information is available in issue #94
  • OTGPPConsent_after_gpc: the value of the OTGPPConsent cookie after a GPC signal was sent. This cookie is described by OneTrust. Additional information is available in issue #94
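
As referenced above, the USPAPI lookup that the extension performs in JavaScript can be approximated from Python Selenium purely for illustration. The __uspapi function is only defined on sites that implement the US Privacy string, and the site URL below is a placeholder:

    from selenium import webdriver

    # Illustration only: query the client-side USPAPI the way the extension's
    # JavaScript does, here driven through Selenium's execute_async_script.
    USP_SCRIPT = """
    const done = arguments[arguments.length - 1];
    if (typeof window.__uspapi === 'function') {
        window.__uspapi('getUSPData', 1, (data, success) =>
            done(success && data ? data.uspString : null));
    } else {
        done(null);  // the site does not implement the USPAPI
    }
    """

    driver = webdriver.Firefox()
    driver.set_script_timeout(30)
    driver.get("https://example.com")  # placeholder site
    usp_string = driver.execute_async_script(USP_SCRIPT)
    print("US Privacy string before GPC:", usp_string)
    driver.quit()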

6. Limitations/Known Issues/Bug Fixes

6.1 Sites that Cannot Be Analyzed

Since we use Selenium and a VPN to visit the sites we analyze, there are some types of sites that we cannot analyze due to our methodology:

  1. Sites where the VPN's IP address is blocked.

    For example, instead of the real site, a page titled "Access Denied" is loaded that says we do not have permission to access the site on this server.

  2. Sites that have some kind of human check.

    Some sites can detect that we are using automation tools (i.e., Selenium) and do not let us access the real site. Instead, we are redirected to a page with some kind of captcha or puzzle. We do not try to bypass any human checks.

    Since the data collected from both of these types of sites (i.e., (1) sites that block our VPN's IP address and (2) sites that have some kind of human check) would be incorrect and occurs because our automation was detected, we list them under HumanCheckError in the error-logging.json file. We have observed a few different site titles that indicate we have reached a site in one of these categories. Most of the titles occur for multiple sites, with the most common being "Just a Moment…" on a captcha from Cloudflare. We detect when our Crawler visits one of these sites by matching the title of the loaded site against a set of regular expressions that match the known titles (see the sketch after this list). Clearly, we will miss sites in this category whose titles we have not yet seen and added to the set of regular expressions. We update the regular expressions as we encounter more such sites. For more information, see issue #51.

  3. Sites that block script injection.

    For instance, https://www.flickr.com blocks script injection and cannot be successfully analyzed. In the debugging table, the last message on the first attempt will be runAnalysis-fetching, and on the second attempt the extension logs SQL POSTING: SOMETHING WENT WRONG.

  4. Sites that redirect between multiple domains throughout analysis.

    For instance, https://spothero.com/ and https://parkingpanda.com/ are now one entity but still can use both domains. In the debugging table, you will see multiple debugging entries under each domain. Because we store analysis data by domain, the data will be incomplete and will not be added to the database.
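
Regarding the title matching described in item 2, here is a minimal sketch; the actual patterns live in the crawler script and grow over time, so the two below are examples only:

    import re

    # Example patterns only; the crawler maintains its own growing set of regular
    # expressions for titles that indicate a blocked VPN IP address or a human check.
    HUMAN_CHECK_TITLE_PATTERNS = [
        re.compile(r"just a moment", re.IGNORECASE),  # Cloudflare captcha
        re.compile(r"access denied", re.IGNORECASE),  # VPN IP address blocked
    ]

    def is_human_check(page_title: str) -> bool:
        """Return True if the loaded page's title matches a known block or captcha title."""
        return any(pattern.search(page_title) for pattern in HUMAN_CHECK_TITLE_PATTERNS)

    print(is_human_check("Just a moment..."))  # True; the crawler would log a HumanCheckError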

6.2 Important Bug Fixes

  1. At some point the Crawler kept returning an empty result for Firefox's urlClassification object. @eakubilo fixed this tricky bug.

7. Other Resources

7.1 Python Library for GPP String Decoding

GPP strings must be decoded. The IAB provides a JavaScript library and an interactive HTML decoder to do so. To integrate decoding with our Colab notebooks for data analysis, we rewrote the library in Python. The library can be found on our Google Drive.
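
The Python library itself is on our Google Drive and is not reproduced here. As a small, hedged sketch of the container format only: a GPP string consists of '~'-separated, base64url-encoded sections, and extracting the actual fields (header section IDs, per-section opt out columns) requires the full IAB spec, which is what the library implements:

    import base64

    # Not the decoding library itself -- only a sketch of the GPP container format:
    # '~'-separated sections, each a base64url-encoded bit string. Extracting actual
    # fields (header section IDs, per-section opt out columns) requires the full IAB
    # spec, which is what our Python library implements.
    def gpp_sections_as_bits(gpp_string: str) -> list[str]:
        bit_sections = []
        for section in gpp_string.split("~"):
            payload = section.split(".")[0]               # drop any sub-section part
            padded = payload + "=" * (-len(payload) % 4)  # restore base64 padding
            raw = base64.urlsafe_b64decode(padded)
            bit_sections.append("".join(f"{byte:08b}" for byte in raw))
        return bit_sections

    # Hypothetical example string; the first element holds the header section's bits.
    print(gpp_sections_as_bits("DBABLA~BVVqAAEABAA")[0][:12])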

7.2 .well-known/gpc.json Python Script

We collect .well-known/gpc.json data after the whole crawl finishes with a separate Python script, selenium-optmeowt-crawler/well-known-collection.py.

Here are the steps for doing so:

  1. Just like the GPC Web Crawler, this script should be run using the same California VPN after all eight crawl batches are completed

  2. Ensure the lock screen setting is the same as for the usual crawl

  3. Start the script using:

    python3 well-known-collection.py

Running this script requires three input files: selenium-optmeowt-crawler/full-crawl-set.csv, which is in the repo, as well as redo-original-sites.csv and redo-sites.csv. The latter two files are not in the repo and should be created for each crawl based on the instructions in our Wiki. As explained in selenium-optmeowt-crawler/well-known-collection.py, the output is a csv called well-known-data.csv with three columns (Site URL, request status, json data), as well as an error json file called well-known-errors.json that logs all errors. To run this script on a csv file of sites without accounting for redo sites, comment out all lines between line 27 and line 40 except for line 34.
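
For orientation, the request-and-logging loop described above boils down to roughly the following simplified sketch; it is not the script itself and omits the redo-site replacement and the exact CSV quoting shown in the code rundown below (the site list is a placeholder):

    import json
    import requests

    # Simplified sketch of the collection loop; the real script
    # (selenium-optmeowt-crawler/well-known-collection.py) also swaps in the redo
    # sites and uses the exact CSV quoting shown in the code rundown below.
    sites = ["https://example.com", "https://example.org"]  # placeholder site list
    errors = {}

    with open("well-known-data.csv", "w") as out:
        out.write("site,status,json\n")
        for site in sites:
            try:
                r = requests.get(site + "/.well-known/gpc.json", timeout=35)
                try:
                    data = r.json()  # raises ValueError if the body is not json
                except ValueError:
                    data = None
                out.write(f'{site},{r.status_code},"{data}"\n')
            except requests.RequestException as e:
                errors[site] = str(e)  # e.g., the request timed out after 35 seconds

    with open("well-known-errors.json", "w") as outfile:
        json.dump(errors, outfile)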

Details of the .well-known Analysis

Analyze the full crawl set with the redo sites replaced, i.e., the full set of sites in which each site that failed originally is replaced by its redo site.

  • Output

    1. If successful, a csv with three columns will be created: Site URL, request status, json data

    2. If not successful, an error json file will be created that logs all errors, including the reason for each error and 500 characters of the request text

      Examples of an error:

      • "Expecting value: line 1 column 1 (char 0)": the status code was 200 (site exists and loaded) or 202 (the request is accepted but incomplete processing) but did not find a json (output: Site_URL, 200, None or Site_URL, 202, None)
      • Reason: site sent all incorrect URLs to a generic error page instead of not serving the page, which would have been a 404 status code
  • Status Codes (HTTP Responses)

    • In general, we expect a 404 status code (Not Found) when a site does not have a .well-known/gpc.json (output: Site_URL, 404, None)
    • Other possible status codes signaling that the .well-known data is not found include but are not limited to: 403 (Forbidden: the server understands the request but refuses to authorize it), 500 (Internal Server Error: the server encountered an unexpected condition that prevented it from fulfilling the request), 406 (Not Acceptable: the server cannot produce a response matching the list of acceptable values defined in the request), 429 (Too Many Requests)
  • well-known-collection.py Code Rundown

    1. First, the file reads in the full site set, i.e., original sites and redo sites
      • sites_df.index(redo_original_sites[idx]): get the index of the site we want to change
      • sites_list[x] = redo_new_sites[idx]: replace the site with the new site
    2. r = requests.get(sites_df[site_idx] + '/.well-known/gpc.json', timeout=35): the request runs with a timeout of 35 seconds (to stay consistent with Crawler timeouts)
      (i) if json data is returned, all three columns are logged (Site URL, request status, json data)
      (ii) if there is no json data, only the site and status are logged
      (iii) if r.json() does not contain valid json data, the "Expecting value: line 1 column 1 (char 0)" error is added to the error logging, and the site and status are logged
      (iv) if requests.get does not finish within 35 seconds, the error is stored and only the site is logged
  • Important Code Documentation

    • "file1.write(sites_df[site_idx] + "," + str(r.status_code) + ',"' + str(r.json()) + '"\n')" : writing data to a file with three columns (site, status and json data)
    • "errors[sites_df[site_idx]] = str(e)" -> store errors with original links
    • "with open("well-known-errors.json", "w") as outfile: json.dump(errors, outfile)" -> convert and write JSON object as containing errors to file

8. Thank You!

We would like to thank our supporters!


Major financial support provided by the National Science Foundation.

National Science Foundation Logo

Additional financial support provided by the Alfred P. Sloan Foundation, Wesleyan University, and the Anil Fernando Endowment.

Sloan Foundation Logo Wesleyan University Logo

Conclusions reached or positions taken are our own and not necessarily those of our financial supporters or their trustees, officers, or staff.

privacy-tech-lab logo