My analysis revolves around checking what percentage of along with which websites and scripts in this dataset are tracking users location (geolocation) and language preferences as well as their country code. So, as to provide them with a customized content based on the users preferences (eg. location, language)
sample 10 percent - 3.7GB download / 7.4GB on disk
-
Out of the total of 205949 websites/locations in this dataset 46482
(22.57%)
websites were found to be checking for preferred language of the user, usually the language of the browser UI, and their subsequent location/scripts can be found in thelanguage_pref_df
dataframe. -
Out of the total of 205949 websites/locations in this dataset 2216
(1.08%)
websites were found to be checking for user's location using the geolocation api, and their subsequent location/scripts can be found in thegeolocation_df
dataframe.
-
Out of the total of 166862 scripts in this dataset 7700
(4.61%)
scripts were found to be making the calls to check for preferred language of the user, usually the language of the browser UI, and their subsequent scripts can be found in thelanguage_pref_df.script_url
dataframe. -
Out of the total of 166862 scripts in this dataset 1359
(0.81%)
scripts were found to be making the calls to check for user's location using the geolocation api, and their subsequent scripts can be found in thegeolocation_df.script_url
dataframe.
- For geolocation tracking: scripts were being executed from
400
unique domains (geolocation_unique_domains
). - For language & country code tracking: scripts were being executed from
2271
unique domains (language_pref_unique_domains
).