MalwareUrlAnalyzer is a Python 3 tool that classifies a URL or a list of URLs using the following 3 algorithms:
- LogisticRegression
- RandomForest
- Naive Bayes
POI (Point Of Interest) indicates whether a URL deserves attention. By default, POI is True only if all 3 algorithms predict 1.0 (interesting) for the URL.
The number of 1.0 predictions required for POI to be True can be set with the -r option (2 or 3).
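The POI rule described above can be sketched as a simple vote count. This is a minimal illustration, not the tool's own code: the function name `point_of_interest` is hypothetical, and it assumes that `-r 2` means "at least 2 of the 3 predictions are 1.0".

```python
def point_of_interest(predictions, required=3):
    """Return True when at least `required` classifiers predict 1.0.

    `predictions` holds one value per classifier, in the order used by the
    tool's output: Naive Bayes, Logistic Regression, Random Forest.
    Assumption: the threshold is "at least", not "exactly".
    """
    positives = sum(1 for p in predictions if p == 1.0)
    return positives >= required


print(point_of_interest([1.0, 1.0, 1.0]))              # True
print(point_of_interest([0.0, 1.0, 1.0]))              # False
print(point_of_interest([0.0, 1.0, 1.0], required=2))  # True
```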
Features :
- The training was carried out on "characteristic" URLs and not on domains. The ipynb file is being "cleaned up".
- MalwareUrlAnalyzer is fast and can parse large datasets in just seconds (27998 URLs in 9 s).
- MalwareUrlAnalyzer has only been tested on Linux (Ubuntu 20.04) so far.
- The results can be displayed or exported in JSON and protobuf format.
- A Volatility plugin makes it possible to extract URLs from process memory and mapped space.
It is based on the following datasets :
- https://research.aalto.fi/en/datasets/phishstorm-phishing-legitimate-url-dataset
- http://205.174.165.80/CICDataset/ISCX-URL-2016/
- URLs extracted from MISP / OTX
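Because the models were trained on full URLs rather than bare domains, the path and query string carry information as well as the host name. How the tool tokenizes URLs is not documented here, so the following is only a hypothetical sketch (the function `url_tokens` is an assumption, not part of the project) of how a URL could be split into tokens before vectorization.

```python
import re


def url_tokens(url):
    """Split a URL into lowercase tokens (hypothetical tokenizer, not the tool's own)."""
    # Drop the scheme so "https://example.com/a" and "example.com/a" tokenize alike.
    without_scheme = re.sub(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", "", url)
    # Split on every run of characters that are not ASCII letters or digits.
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", without_scheme) if t]


print(url_tokens("keke.icu/app/7za.exe?id=6986"))
# ['keke', 'icu', 'app', '7za', 'exe', 'id', '6986']
```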
Structure :
- models\ : zip files and ML models
- template\ : proto file for export
- volatility_plugins\ : Volatility plugin and its readme
- export_pb2.py : result of compiling the proto file
- MalwareUrlAnalyzer.py : main script
To use the models, unzip them and place them in the same directory as MalwareUrlAnalyzer.py, or use the -p option to specify their path.
python MalwareUrlAnalyzer.py URL
python MalwareUrlAnalyzer.py -p models URL
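Before running the tool, it can help to confirm that the unzipped models are where `-p` points. The sketch below is an illustration under stated assumptions: the three file names are taken from the `zip` command shown in the integrity section of this README, and the helper `missing_models` is hypothetical.

```python
from pathlib import Path

# File names as listed in the zip command of this README.
MODEL_FILES = ("nb.joblib", "lr.joblib", "randomforest_tfidf.joblib")


def missing_models(model_dir="."):
    """Return the model files that are not present in `model_dir`."""
    base = Path(model_dir)
    return [name for name in MODEL_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_models("models")
    if missing:
        print("Missing model files:", ", ".join(missing))
    else:
        # Each file could then be loaded with joblib.load(path).
        print("All model files found.")
```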
- Decompress all models
7z e models_all.zip
- Check file integrity (md5sum models_all.z*)
# zip -s 10m -r -9 models_all.zip lr.joblib nb.joblib randomforest_tfidf.joblib
14ec71a3d2bbefdcf08697e903d60311 models_all.z01
9aba3c3a98c3b7ee512492da6197c65e models_all.z02
3059a7b8d105162155d74d34193f9d86 models_all.z03
27c49785e4e02917d98f8bec54306006 models_all.z04
fb8815014ff95534183818245edf47e2 models_all.z05
5778deba3d1660a3e856a2f550bffee4 models_all.z06
d44b0286477e9aeeb8b6ec8af4cef029 models_all.zip
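On systems without `md5sum`, the same integrity check can be approximated in Python with the standard `hashlib` module. This is a minimal sketch: the checksums are copied from the list above, while the helper names `md5_of` and `verify` are hypothetical.

```python
import hashlib
from pathlib import Path

# Expected checksums, copied from this README.
EXPECTED_MD5 = {
    "models_all.z01": "14ec71a3d2bbefdcf08697e903d60311",
    "models_all.z02": "9aba3c3a98c3b7ee512492da6197c65e",
    "models_all.z03": "3059a7b8d105162155d74d34193f9d86",
    "models_all.z04": "27c49785e4e02917d98f8bec54306006",
    "models_all.z05": "fb8815014ff95534183818245edf47e2",
    "models_all.z06": "5778deba3d1660a3e856a2f550bffee4",
    "models_all.zip": "d44b0286477e9aeeb8b6ec8af4cef029",
}


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Map each archive part to True (match), False (mismatch) or None (missing)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5_of(path) == expected) if path.is_file() else None
    return results


if __name__ == "__main__":
    for name, status in verify("models").items():
        print(name, {True: "OK", False: "MISMATCH", None: "missing"}[status])
```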
- The following packages are required (argparse and time are part of the Python standard library) :
- pandas
- numpy
- joblib
- argparse
- time
- tqdm
- scikit-learn
pip install -r requirements.txt
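After installation, a quick way to confirm that the third-party dependencies are importable is sketched below. The helper `missing_packages` is hypothetical; the mapping pairs each pip name from the list above with its import name (scikit-learn is imported as `sklearn`).

```python
import importlib.util

# pip name -> import name for the third-party packages listed in this README.
REQUIRED = {
    "pandas": "pandas",
    "numpy": "numpy",
    "joblib": "joblib",
    "tqdm": "tqdm",
    "scikit-learn": "sklearn",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    print("Missing packages:", ", ".join(missing) if missing else "none")
```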
(base) xophidia@ubuntu:~/Desktop$ python MalwareUrlAnalyzer.py -h
usage: MalwareUrlAnalyzer.py [-h] [-f F] [-p P] [-o] [-i] [strings [strings ...]]
Detect Url Benign/Malware
positional arguments:
strings The string to analyze
optional arguments:
-h, --help show this help message and exit
-f F file to analyze
-r {2,3} Set Point Of Interest (True/False) 2: [2 True] 3: [3 True]
-p P where models are. If no -p path is ./
-o print result Json format
-e Export result into protobuf format - filename export
-i, --info Print information
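For readers who want to script around the tool, the options listed above can be mirrored with `argparse`. This is a reconstruction from the help text only, not the project's actual parser. It assumes that `-r` defaults to 3, that `-p` defaults to `./`, and that `-o`, `-e` and `-i` are simple flags.

```python
import argparse


def build_parser():
    """Build a parser mirroring the help text above (a sketch, not the tool's code)."""
    parser = argparse.ArgumentParser(
        prog="MalwareUrlAnalyzer.py", description="Detect Url Benign/Malware"
    )
    parser.add_argument("strings", nargs="*", help="The string to analyze")
    parser.add_argument("-f", help="file to analyze")
    # Assumption: 3 is the default, matching the POI description.
    parser.add_argument("-r", type=int, choices=[2, 3], default=3,
                        help="Set Point Of Interest (True/False)")
    parser.add_argument("-p", default="./", help="where models are")
    parser.add_argument("-o", action="store_true", help="print result Json format")
    parser.add_argument("-e", action="store_true",
                        help="Export result into protobuf format")
    parser.add_argument("-i", "--info", action="store_true", help="Print information")
    return parser


if __name__ == "__main__":
    # Explicit argument list so the example does not depend on sys.argv.
    args = build_parser().parse_args(
        ["-p", "models", "-r", "2", "keke.icu/app/7za.exe?id=6986"]
    )
    print(args.p, args.r, args.strings)
```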
You can analyze either a file (-f) or one or more URL strings.
python MalwareUrlAnalyzer.py -p models -f temp.csv
███▄ ▄███▓ ▄▄▄ ██▓ █ █░ ▄▄▄ ██▀███ ▓█████ █ ██ ██▀███ ██▓ ▄▄▄ ███▄ █ ▄▄▄ ██▓ ▓██ ██▓▒███████▒▓█████ ██▀███
▓██▒▀█▀ ██▒▒████▄ ▓██▒ ▓█░ █ ░█░▒████▄ ▓██ ▒ ██▒▓█ ▀ ██ ▓██▒▓██ ▒ ██▒▓██▒ ▒████▄ ██ ▀█ █ ▒████▄ ▓██▒ ▒██ ██▒▒ ▒ ▒ ▄▀░▓█ ▀ ▓██ ▒ ██▒
▓██ ▓██░▒██ ▀█▄ ▒██░ ▒█░ █ ░█ ▒██ ▀█▄ ▓██ ░▄█ ▒▒███ ▓██ ▒██░▓██ ░▄█ ▒▒██░ ▒██ ▀█▄ ▓██ ▀█ ██▒▒██ ▀█▄ ▒██░ ▒██ ██░░ ▒ ▄▀▒░ ▒███ ▓██ ░▄█ ▒
▒██ ▒██ ░██▄▄▄▄██ ▒██░ ░█░ █ ░█ ░██▄▄▄▄██ ▒██▀▀█▄ ▒▓█ ▄ ▓▓█ ░██░▒██▀▀█▄ ▒██░ ░██▄▄▄▄██ ▓██▒ ▐▌██▒░██▄▄▄▄██ ▒██░ ░ ▐██▓░ ▄▀▒ ░▒▓█ ▄ ▒██▀▀█▄
▒██▒ ░██▒ ▓█ ▓██▒░██████▒░░██▒██▓ ▓█ ▓██▒░██▓ ▒██▒░▒████▒▒▒█████▓ ░██▓ ▒██▒░██████▒ ▓█ ▓██▒▒██░ ▓██░ ▓█ ▓██▒░██████▒ ░ ██▒▓░▒███████▒░▒████▒░██▓ ▒██▒
░ ▒░ ░ ░ ▒▒ ▓▒█░░ ▒░▓ ░░ ▓░▒ ▒ ▒▒ ▓▒█░░ ▒▓ ░▒▓░░░ ▒░ ░░▒▓▒ ▒ ▒ ░ ▒▓ ░▒▓░░ ▒░▓ ░ ▒▒ ▓▒█░░ ▒░ ▒ ▒ ▒▒ ▓▒█░░ ▒░▓ ░ ██▒▒▒ ░▒▒ ▓░▒░▒░░ ▒░ ░░ ▒▓ ░▒▓░
░ ░ ░ ▒ ▒▒ ░░ ░ ▒ ░ ▒ ░ ░ ▒ ▒▒ ░ ░▒ ░ ▒░ ░ ░ ░░░▒░ ░ ░ ░▒ ░ ▒░░ ░ ▒ ░ ▒ ▒▒ ░░ ░░ ░ ▒░ ▒ ▒▒ ░░ ░ ▒ ░ ▓███/@Xophidia_2021 ░ ░▒ ░ ▒░
░ ░ ░ ▒ ░ ░ ░ ░ ░ ▒ ░░ ░ ░ ░░░ ░ ░ ░░ ░ ░ ░ ░ ▒ ░ ░ ░ ░ ▒ ░ ░ ▒ ▒ ░░ ░ ░ ░ ░ ░ ░ ░░ ░
░ ░ ░ ░ ░ ░ ░ ░ ░ ░ ░ ░ ░ ░ ░ ░ ░ ░ ░ ░ ░ ░ ░ ░ ░ ░ ░ ░ ░
Based on :
- https://acris.aalto.fi/ws/portalfiles/portal/16859732/urlset.csv.zip
- https://www.unb.ca/cic/datasets/url-2016.html
- extract URL from OTX Alienvault
- http://205.174.165.80/CICDataset/ISCX-URL-2016/
In the order : ("Naive Bayes | LogisticRegression | RandomForest")
pynb file :
Analyse du fichier temp.csv en cours
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 23/23 [01:45<00:00, 4.58s/it]
Naive Bayes Logistic Regression Random Forest POI
https://www.jeuxvideo.com 0.0 0.0 1.0 False
https://www.lemonde.fr/pixels/live/2021/07/21/p... 0.0 0.0 0.0 False
amazon.fr/ref=nav_logo 1.0 1.0 1.0 True
github.com/Invoke-IR/ForensicPosters 0.0 0.0 0.0 False
scikit-learn.org/stable/modules/generated/sklea... 0.0 0.0 0.0 False
jonashartley.com/hilaryolsen/wp-includes/random... 1.0 1.0 1.0 True
apk.mirror 0.0 0.0 0.0 False
www.rfc-editor.org/rfc/rfc2350.txt 0.0 0.0 0.0 False
cert.pl/en/posts/2018/07/dissecting-smoke-loader/ 0.0 0.0 0.0 False
www.kaggle.com/victorambonati/unsupervised-anom... 0.0 0.0 1.0 False
xsso.xjpakmdcfuqe.ru/e5718ce3090cb9e30634085055... 1.0 1.0 1.0 True
081.ftphosting.pw/user81249/4918/0124.txt 0.0 0.0 0.0 False
fusu.icu/ajax/7z.php?ext=me 1.0 1.0 1.0 True
keke.icu/ajax/7z.php?ext=me 1.0 1.0 1.0 True
luru.icu/js/filters.php 0.0 1.0 1.0 False
luru.icu/js/facebook.js?1555768638150 1.0 1.0 1.0 True
keke.icu/app/7za.exe?id=6986 1.0 1.0 1.0 True
www.jeuxvideo.com 0.0 0.0 1.0 False
www.youtube.com/watch?v=55iZ8qFE2MM 0.0 0.0 1.0 False
de.letscompareonline.com/cgi-bin/ztEE/ 1.0 1.0 1.0 True
rakikuma.com/cgi-bin/K/ 1.0 1.0 1.0 True
pacificgroup.ws/paradisesuiting.com/closed_module 1.0 1.0 1.0 True
real 0m5.973s
user 0m5.386s
sys 0m0.574s
python MalwareUrlAnalyzer.py -p models https://www.developpez.com/actu/316946/L-UE-envisage-de-rendre-les-transferts-de-bitcoins-plus-tracables-en-exigeant-la-collecte-d-informations-sur-le-destinataire-et-l-expediteur/
Analyse de l'url ['https://www.developpez.com/actu/316946/L-UE-envisage-de-rendre-les-transferts-de-bitcoins-plus-tracables-en-exigeant-la-collecte-d-informations-sur-le-destinataire-et-l-expediteur/'] en cours
Naive Bayes Logistic Regression Random Forest POI
0.0 0.0 0.0 False
All the code of the project is licensed under the GNU Lesser General Public License.