Analyse regex blacklist #19

Merged
merged 9 commits into from
Mar 9, 2021
15 changes: 11 additions & 4 deletions README.md
@@ -16,6 +16,7 @@ This script tries to provide you with a bunch of information that enables you to
id, status, total domains on adlist, covered domains, hits, unique covered domains, address
- the sum of unique covered domains
- optional: list of unique covered domains with adlist_id, address
- optional: analyse regex blacklist

As domains usually appear on more than one adlist, I introduce the concept of ***unique covered domains***: domains that have been visited, would have been blocked, and appear on just one adlist. This might help you to value your adlists not just by how many domains they cover but also by what would happen if you disabled one of them.
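
As a rough illustration of the idea (not the SQL the script actually runs), the following sketch counts unique covered domains from a small list of `adlist_id domain` pairs. The sample data and the helper name `unique_covered` are made up for this example, and the pairs are assumed to be distinct.

```shell
#!/usr/bin/env bash
# HYPOTHETICAL sample data: one "adlist_id domain" pair per line for domains
# that were visited and would have been blocked. Pairs are assumed distinct.
covered_pairs="1 ads.example.com
1 tracker.example.net
2 ads.example.com
2 telemetry.example.org
3 ads.example.com"

# Print "adlist_id domain" for every domain that appears on exactly one adlist.
unique_covered() {
    awk '
        { count[$2]++; list[$2] = $1 }   # occurrences per domain, last adlist seen
        END {
            for (d in count)
                if (count[d] == 1) print list[d], d
        }
    ' | sort                              # awk iteration order is unspecified
}

unique_covered <<< "$covered_pairs"
# prints:
# 1 tracker.example.net
# 2 telemetry.example.org
```

Here `ads.example.com` is covered by three adlists, so disabling any single one of them would not unblock it; the two unique covered domains are the ones that would slip through.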

@@ -25,21 +26,26 @@ As domains usually appear on more than one adlist I introduce the concept of ***
- ~~Disabled blocklist won't be analyzed as gravity is not including domains from deactivated adlists. You can enable all adlists from within the script.~~
The script will warn you if there is a mismatch between the enabled adlists and the data found in the gravity database. Users have the choice to run gravity to clear the mismatch or to proceed anyway. In the latter case the tool will analyze all available data, but results must be interpreted with caution. (see [8dab71](https://github.com/yubiuser/pihole_adlist_tool/commit/8dab71836c1b2407c9626b17fd592399a7ef0b58))

- Black/Whitelisted domains (including regex) are not considered when calculating the number of covered domains (and hits)
- Black/Whitelisted domains (~~including regex~~ see [PR #19](https://github.com/yubiuser/pihole_adlist_tool/pull/19)) are not considered when calculating the number of covered domains (and hits)
- Whitelisted domains reduce the number of blocked domains as reported by pihole compared to the calculated numbers
- Blacklisted domains increase the number of blocked domains as reported by pihole compared to the calculated numbers

- ~~This tool can not deal with domains that have been blocked due to CNAME inspection because pihole doesn't store the actual blocked domain but the CNAME and a corresponding status ("Blocked during deep CNAME inspection"). This CNAME domain will not match a domain from an adlist - if it would it would have been blocked directly.~~ (see [PR #3](https://github.com/yubiuser/pihole_adlist_tool/pull/3))

- Other differences between the number of domains/hits as reported by pihole and calculated numbers are due to change in adlist configuration over time

---
**Cave**
- For the limits of the regex analysis see the [notes of PR #19](https://github.com/yubiuser/pihole_adlist_tool/pull/19)

Depending on the number of enabled adlists and the number of visited domains in the selected time period the calculation might take some time - please be patient.
---
**Caveat**

- Depending on the number of enabled adlists and the number of visited domains in the selected time period the calculation might take some time - please be patient.
On my [NanoPi NeoPlus2](http://wiki.friendlyarm.com/wiki/index.php/NanoPi_NEO_Plus2) (ARM, Quad-core Cortex A53) it takes ~17-18sec to analyse 2.3 million queries from pihole-ftl.db and 347603 domains in gravity.db

- Analysis of regex blacklist can take minutes easily!

- While lists that have attracted no or only very few hits in the analysis are prime candidates for removal, you should also consider the type of blocklist before you ultimately decide to remove a list, e.g. you may want to keep malware or telemetry focused blocklists nonetheless.
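
To make the regex caveat more concrete: the script changes in this PR call `pihole-FTL regex-test` once per visited domain and pull the matching regex IDs out of its output with `grep` and `awk`, which is why the run time grows with the number of domains. A minimal sketch of just that extraction step follows. The sample text is a hypothetical stand-in for the real command output, whose exact wording may differ, and `extract_regex_ids` is a name made up for this example.

```shell
#!/usr/bin/env bash
# Extract regex database IDs from text shaped like the output of
# `pihole-FTL regex-test`. The pipeline is modelled on the one in this PR;
# [0-9]+ (at least one digit) is used here so an empty ID is never emitted.
extract_regex_ids() {
    grep -E -o "DB ID [0-9]+" | awk '{print $3}'
}

# HYPOTHETICAL sample output; the real wording of pihole-FTL may differ.
sample_output='    ^ad[0-9]* matches (regex blacklist, DB ID 12)
    tracker matches (regex blacklist, DB ID 7)
    a line without any database id'

extract_regex_ids <<< "$sample_output"
# prints:
# 12
# 7
```

Each extracted ID is then stored together with the tested domain, so the per-regex counts can be computed afterwards in a single query.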

---
**Requirements**

@@ -70,6 +76,7 @@ options:
-s [total/domains/hits/unique] Set sorting order to total domains, domains covered, hits covered or unique covered domains DESC. (Default sorting: id ASC)
-u Show covered unique domains
-a Run in 'automatic mode'. No user input is required at all, assuming default choice would be to leave everything untouched.
-r Analyse regex as well. Depending on the number of domains and regexes this might take a while.
-h Show this help dialog

```
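
A minimal sketch of how the new `-r` flag sits alongside the existing options, using the same `getopts` string as the script in this PR. The wrapper function `parse_flags` and the default value for `DAYS_REQUESTED` are assumptions made for this example only.

```shell
#!/usr/bin/env bash
# parse_flags is a HYPOTHETICAL helper; the real script parses at top level.
parse_flags() {
    local flag OPTIND=1                        # reset so the function can be called repeatedly
    DAYS_REQUESTED=30 UNIQUE=0 REGEX_MODE=0    # the default of 30 days is an assumption
    while getopts 'd:t:s:uarh' flag; do
        case "${flag}" in
            d) DAYS_REQUESTED="${OPTARG}" ;;
            u) UNIQUE=1 ;;
            r) REGEX_MODE=1 ;;
            *) ;;                              # remaining flags are ignored in this sketch
        esac
    done
}

parse_flags -d 7 -r
echo "days=${DAYS_REQUESTED} unique=${UNIQUE} regex=${REGEX_MODE}"
# prints: days=7 unique=0 regex=1
```

Because `r` takes no argument it can be combined with other switches, for example `-ur`.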
112 changes: 110 additions & 2 deletions pihole_adlist_tool
@@ -26,6 +26,7 @@ TOP=0
SORT=
SORT_ORDER=
UNIQUE=0
REGEX_MODE=0

# variables for adlist management
declare -a adlist_conf_old_enabled
@@ -46,6 +47,10 @@ BLACKLIST_GRAVITY=
NUM_TOTAL_UNIQUE_DOMAINS=
BLACKLIST_CNAME=
NUM_DOMAINS_BLOCKED_FUTURE=
NUM_ENABLED_REGEX=
NUM_REGEX=
NUM_ALL_DOMAINS=
NUM_DOMAINS_BLOCKED_BY_REGEX=

# variables for general info
PIHOLE_DNSMASQ_VERSION=
@@ -61,6 +66,11 @@ declare -i FURTHER_ACTION
bold=$(tput bold)
normal=$(tput sgr0)

# variables for regex analysis
declare -a all_domains
CURRENT_DOMAIN=
REGEX_ID=

# help message
print_help() {
echo
@@ -78,6 +88,8 @@ print_help() {

-a Run in 'automatic mode'. No user input is required at all, assuming default choice would be
to leave everything untouched.

-r Analyse regex as well. Depending on the number of domains and regexes this might take a while.

-h Show this help dialog.
"
@@ -165,13 +177,14 @@ SUDO_SQLITE=$(declare -f sqlite)
trap cleanup_on_trap INT

# getopts flags and assign arguments to variables
while getopts 'd:t:s:uah' flag; do
while getopts 'd:t:s:uarh' flag; do
case "${flag}" in
d) DAYS_REQUESTED="${OPTARG}" ;;
t) TOP="${OPTARG}" ;;
s) SORT="${OPTARG}" ;;
u) UNIQUE=1 ;;
a) AUTOMATIC_MODE=1 ;;
r) REGEX_MODE=1 ;;
h) print_help
exit 0 ;;
*) print_help
@@ -254,7 +267,15 @@ if [ "$UNIQUE" -eq 1 ];
echo -e " [i] UNIQUE: Shown"
else
echo -e " [i] UNIQUE: Not shown"
fi
fi

# print if regex should be analysed as well
if [ "$REGEX_MODE" -eq 1 ];
then
echo -e " [i] REGEX_MODE: Enabled"
else
echo -e " [i] REGEX_MODE: Disabled"
fi

echo -e "\n ++++++++++++++++++++++\n\n"
}
@@ -761,4 +782,91 @@ if [ "$UNIQUE" = 1 ];
echo
fi

# analyse regex
if [ "$REGEX_MODE" -eq 1 ];
then
echo
echo
echo " [i] Analysing regex blacklist ....."
echo " [i] This might take some time (minutes!) - please be patient."
echo
echo


#
# table regex_blacklist contains all blacklist regex from gravity.db
# table all_domains contains all domains (in the selected time period) from the pihole-FTL.db (including domains from CNAME inspection)
# table domain_by_regex contains all domains and the blocking regex

# 1.) copy blacklisted regex info from gravity database
# 2.) copy distinct domains from pihole-FTL.db
# 3.) add distinct domains from pihole-FTL.db found in the additional_info column coming from CNAME inspection (status 9,10,11)
# 4.) save some statistics


sqlite -cmd ".timeout 5000" $TEMP_DB << EOF
ATTACH DATABASE "${PIHOLE_FTL}" AS pihole_ftl_db;
ATTACH DATABASE "${GRAVITY}?mode=ro" AS gravity_db;

CREATE table regex_blacklist (id TEXT UNIQUE, regex TEXT, enabled INTEGER, domains_covered INTEGER);
CREATE table all_domains(domain TEXT UNIQUE);
CREATE table domain_by_regex(domain TEXT, regex_id INTEGER);

INSERT INTO regex_blacklist(id, regex,enabled) SELECT id, domain, enabled FROM gravity_db.domainlist where type=3;
INSERT INTO all_domains(domain) SELECT distinct domain FROM pihole_ftl_db.queries WHERE id>=${FTL_ID};
INSERT OR IGNORE INTO all_domains(domain) SELECT distinct additional_info FROM pihole_ftl_db.queries WHERE status in (9,10,11) AND id>=${FTL_ID};

INSERT INTO info (property, value) Select 'NUM_ALL_DOMAINS', COUNT(*) FROM all_domains;
INSERT INTO info (property, value) Select 'NUM_REGEX', COUNT(*) FROM regex_blacklist;
INSERT INTO info (property, value) Select 'NUM_ENABLED_REGEX', COUNT(id) FROM regex_blacklist WHERE enabled=1;

DETACH DATABASE gravity_db;
DETACH DATABASE pihole_ftl_db;
.exit
EOF


# copy all domains from table all_domains into the array all_domains
# iterate over each domain in all_domains
# for each domain check if it is covered by a regex (using pihole-FTL regex-test)
# if the test returns regex_ids, save them in the domain_by_regex table
# NOTE: pihole-FTL regex-test also tests against the regex whitelist, BUT this is still much faster than a second loop that checks against each blacklist regex individually


# read one domain per line; mapfile avoids word splitting and globbing of the query output
mapfile -t all_domains < <(sqlite "$TEMP_DB" "SELECT domain FROM all_domains")

for CURRENT_DOMAIN in "${all_domains[@]}"; do
    # [0-9]+ requires at least one digit, so an empty REGEX_ID is never inserted
    pihole-FTL regex-test "$CURRENT_DOMAIN" | grep -E -o "DB ID [0-9]+" | awk '{print $3}' | while read -r REGEX_ID; do
        sqlite "$TEMP_DB" "INSERT INTO domain_by_regex(domain, regex_id) VALUES ('$CURRENT_DOMAIN',$REGEX_ID);"
    done
done

# count for each regex_id how many domains are in domain_by_regex and store it in table regex_blacklist
# (no GROUP BY: the bare aggregate returns 0 instead of NULL for a regex that matched no domain)

sqlite $TEMP_DB "UPDATE regex_blacklist SET domains_covered=(SELECT COUNT(regex_id) FROM domain_by_regex WHERE id=regex_id);"

# get stats
# the number of different domains that would have been blocked by regex with the current regex configuration
sqlite $TEMP_DB "INSERT INTO info (property, value) Select 'NUM_DOMAINS_BLOCKED_BY_REGEX', COUNT (distinct domain) FROM domain_by_regex JOIN regex_blacklist ON regex_id=id where enabled=1 ;"

NUM_ALL_DOMAINS=$(sqlite $TEMP_DB "SELECT value FROM info where property='NUM_ALL_DOMAINS';")
NUM_REGEX=$(sqlite $TEMP_DB "SELECT value FROM info where property='NUM_REGEX';")
NUM_ENABLED_REGEX=$(sqlite $TEMP_DB "SELECT value FROM info WHERE property ='NUM_ENABLED_REGEX';")
NUM_DOMAINS_BLOCKED_BY_REGEX=$(sqlite $TEMP_DB "SELECT value FROM info WHERE property ='NUM_DOMAINS_BLOCKED_BY_REGEX';")

echo
echo
echo
echo " [i] ${bold}Regex coverage${normal}"
# prints the regex table
echo
echo
sqlite -column -header $TEMP_DB "SELECT id, enabled, domains_covered, regex FROM regex_blacklist;"
echo
echo
echo " [i] Since ${DATE_FIRST_ANALYZED} you have been visiting ${bold}${NUM_ALL_DOMAINS} different domains${normal}."
echo " You have ${bold}${NUM_REGEX} blacklist regex${normal} configured (${NUM_ENABLED_REGEX} enabled)"
echo " With your enabled blacklist regex you would have covered ${bold}${NUM_DOMAINS_BLOCKED_BY_REGEX} different domains${normal}."
fi

remove_temp_database
112 changes: 110 additions & 2 deletions pihole_adlist_tool_docker
@@ -26,6 +26,7 @@ TOP=0
SORT=
SORT_ORDER=
UNIQUE=0
REGEX_MODE=0

# variables for adlist management
declare -a adlist_conf_old_enabled
@@ -46,6 +47,10 @@ BLACKLIST_GRAVITY=
NUM_TOTAL_UNIQUE_DOMAINS=
BLACKLIST_CNAME=
NUM_DOMAINS_BLOCKED_FUTURE=
NUM_ENABLED_REGEX=
NUM_REGEX=
NUM_ALL_DOMAINS=
NUM_DOMAINS_BLOCKED_BY_REGEX=

# variables for general info
PIHOLE_DNSMASQ_VERSION=
@@ -61,6 +66,11 @@ declare -i FURTHER_ACTION
bold=$(tput bold)
normal=$(tput sgr0)

# variables for regex analysis
declare -a all_domains
CURRENT_DOMAIN=
REGEX_ID=

# help message
print_help() {
echo
@@ -78,6 +88,8 @@ print_help() {

-a Run in 'automatic mode'. No user input is required at all, assuming default choice would be
to leave everything untouched.

-r Analyse regex as well. Depending on the number of domains and regexes this might take a while.

-h Show this help dialog.
"
@@ -165,13 +177,14 @@ SUDO_SQLITE=$(declare -f sqlite)
trap cleanup_on_trap INT

# getopts flags and assign arguments to variables
while getopts 'd:t:s:uah' flag; do
while getopts 'd:t:s:uarh' flag; do
case "${flag}" in
d) DAYS_REQUESTED="${OPTARG}" ;;
t) TOP="${OPTARG}" ;;
s) SORT="${OPTARG}" ;;
u) UNIQUE=1 ;;
a) AUTOMATIC_MODE=1 ;;
r) REGEX_MODE=1 ;;
h) print_help
exit 0 ;;
*) print_help
@@ -243,7 +256,15 @@ if [ "$UNIQUE" -eq 1 ];
echo -e " [i] UNIQUE: Shown"
else
echo -e " [i] UNIQUE: Not shown"
fi
fi

# print if regex should be analysed as well
if [ "$REGEX_MODE" -eq 1 ];
then
echo -e " [i] REGEX_MODE: Enabled"
else
echo -e " [i] REGEX_MODE: Disabled"
fi

echo -e "\n ++++++++++++++++++++++\n\n"
}
@@ -750,4 +771,91 @@ if [ "$UNIQUE" = 1 ];
echo
fi

# analyse regex
if [ "$REGEX_MODE" -eq 1 ];
then
echo
echo
echo " [i] Analysing regex blacklist ....."
echo " [i] This might take some time (minutes!) - please be patient."
echo
echo


#
# table regex_blacklist contains all blacklist regex from gravity.db
# table all_domains contains all domains (in the selected time period) from the pihole-FTL.db (including domains from CNAME inspection)
# table domain_by_regex contains all domains and the blocking regex

# 1.) copy blacklisted regex info from gravity database
# 2.) copy distinct domains from pihole-FTL.db
# 3.) add distinct domains from pihole-FTL.db found in the additional_info column coming from CNAME inspection (status 9,10,11)
# 4.) save some statistics


sqlite -cmd ".timeout 5000" $TEMP_DB << EOF
ATTACH DATABASE "${PIHOLE_FTL}" AS pihole_ftl_db;
ATTACH DATABASE "${GRAVITY}?mode=ro" AS gravity_db;

CREATE table regex_blacklist (id TEXT UNIQUE, regex TEXT, enabled INTEGER, domains_covered INTEGER);
CREATE table all_domains(domain TEXT UNIQUE);
CREATE table domain_by_regex(domain TEXT, regex_id INTEGER);

INSERT INTO regex_blacklist(id, regex,enabled) SELECT id, domain, enabled FROM gravity_db.domainlist where type=3;
INSERT INTO all_domains(domain) SELECT distinct domain FROM pihole_ftl_db.queries WHERE id>=${FTL_ID};
INSERT OR IGNORE INTO all_domains(domain) SELECT distinct additional_info FROM pihole_ftl_db.queries WHERE status in (9,10,11) AND id>=${FTL_ID};

INSERT INTO info (property, value) Select 'NUM_ALL_DOMAINS', COUNT(*) FROM all_domains;
INSERT INTO info (property, value) Select 'NUM_REGEX', COUNT(*) FROM regex_blacklist;
INSERT INTO info (property, value) Select 'NUM_ENABLED_REGEX', COUNT(id) FROM regex_blacklist WHERE enabled=1;

DETACH DATABASE gravity_db;
DETACH DATABASE pihole_ftl_db;
.exit
EOF


# copy all domains from table all_domains into the array all_domains
# iterate over each domain in all_domains
# for each domain check if it is covered by a regex (using pihole-FTL regex-test)
# if the test returns regex_ids, save them in the domain_by_regex table
# NOTE: pihole-FTL regex-test also tests against the regex whitelist, BUT this is still much faster than a second loop that checks against each blacklist regex individually


# read one domain per line; mapfile avoids word splitting and globbing of the query output
mapfile -t all_domains < <(sqlite "$TEMP_DB" "SELECT domain FROM all_domains")

for CURRENT_DOMAIN in "${all_domains[@]}"; do
    # [0-9]+ requires at least one digit, so an empty REGEX_ID is never inserted
    pihole-FTL regex-test "$CURRENT_DOMAIN" | grep -E -o "DB ID [0-9]+" | awk '{print $3}' | while read -r REGEX_ID; do
        sqlite "$TEMP_DB" "INSERT INTO domain_by_regex(domain, regex_id) VALUES ('$CURRENT_DOMAIN',$REGEX_ID);"
    done
done

# count for each regex_id how many domains are in domain_by_regex and store it in table regex_blacklist
# (no GROUP BY: the bare aggregate returns 0 instead of NULL for a regex that matched no domain)

sqlite $TEMP_DB "UPDATE regex_blacklist SET domains_covered=(SELECT COUNT(regex_id) FROM domain_by_regex WHERE id=regex_id);"

# get stats
# the number of different domains that would have been blocked by regex with the current regex configuration
sqlite $TEMP_DB "INSERT INTO info (property, value) Select 'NUM_DOMAINS_BLOCKED_BY_REGEX', COUNT (distinct domain) FROM domain_by_regex JOIN regex_blacklist ON regex_id=id where enabled=1 ;"

NUM_ALL_DOMAINS=$(sqlite $TEMP_DB "SELECT value FROM info where property='NUM_ALL_DOMAINS';")
NUM_REGEX=$(sqlite $TEMP_DB "SELECT value FROM info where property='NUM_REGEX';")
NUM_ENABLED_REGEX=$(sqlite $TEMP_DB "SELECT value FROM info WHERE property ='NUM_ENABLED_REGEX';")
NUM_DOMAINS_BLOCKED_BY_REGEX=$(sqlite $TEMP_DB "SELECT value FROM info WHERE property ='NUM_DOMAINS_BLOCKED_BY_REGEX';")

echo
echo
echo
echo " [i] ${bold}Regex coverage${normal}"
# prints the regex table
echo
echo
sqlite -column -header $TEMP_DB "SELECT id, enabled, domains_covered, regex FROM regex_blacklist;"
echo
echo
echo " [i] Since ${DATE_FIRST_ANALYZED} you have been visiting ${bold}${NUM_ALL_DOMAINS} different domains${normal}."
echo " You have ${bold}${NUM_REGEX} blacklist regex${normal} configured (${NUM_ENABLED_REGEX} enabled)"
echo " With your enabled blacklist regex you would have covered ${bold}${NUM_DOMAINS_BLOCKED_BY_REGEX} different domains${normal}."
fi

remove_temp_database