Short paper for CSNet 23.
We propose a generalizable approach for network flow image representation to detect patterns without performing any network flow cut-offs.
Further, we introduce a novel method to preprocess network traffic to enhance our resulting models.
In this step, we remove network protocol header information hindering the models' generalizability.
We use a dataset containing malware and benign classes and train several deep learning architectures: VGG-19, ResNet-50, and ResNeXt-50.
ResNet-50 reaches up to
All trained models and the preprocessed dataset are publicly available at heiBOX.
The folder ML_code contains all the code we used to train our machine learning models on our custom datasets. The following sections describe how to use the individual Python scripts.
> dataset_train_test_split.py
Performs a stratified random split of a given dataset into train, test, and validation sets (70/15/15 split).
Usage: python dataset_train_test_split.py [DATASET_PATH]
DATASET_PATH:
The folder containing the dataset ordered into subfolders of classes.
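The stratified 70/15/15 split described above can be sketched as follows. This is a minimal illustration using only the standard library, not the code in dataset_train_test_split.py: the function name `stratified_split`, the `(path, label)` input format, and the rounding rule are assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_split(samples, fractions=(0.70, 0.15, 0.15), seed=0):
    """Split (path, label) pairs into train/test/validation lists.

    Each class is shuffled and split separately, so every subset keeps
    roughly the same class proportions as the full dataset.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append(path)

    train, test, val = [], [], []
    for label in sorted(by_class):
        paths = sorted(by_class[label])  # fixed order before shuffling
        rng.shuffle(paths)
        n_train = round(fractions[0] * len(paths))
        n_test = round(fractions[1] * len(paths))
        train += [(p, label) for p in paths[:n_train]]
        test += [(p, label) for p in paths[n_train:n_train + n_test]]
        # The validation set receives whatever remains of this class.
        val += [(p, label) for p in paths[n_train + n_test:]]
    return train, test, val


if __name__ == "__main__":
    # Hypothetical file names; a real run would list the class subfolders.
    demo = [(f"benign/{i}.png", "benign") for i in range(100)]
    demo += [(f"neris/{i}.png", "neris") for i in range(20)]
    train, test, val = stratified_split(demo)
    print(len(train), len(test), len(val))
```

Splitting per class rather than over the whole dataset keeps rare malware classes represented in all three subsets.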
> train_model.py
Will train a model on a given dataset from scratch or continue training a given model.
Usage: python train_model.py [OPTIONS]
Options:
-d, --dataset_type STRING Can be "multiclass" or "binary" [required]
-m, --model_type STRING Can be "fc" (VGG-FC), "notop" (VGG-NoTop), "resnet" (ResNet), or "next" (ResNeXt) [required]
-p, --preprocessing_type STRING Can be "preprocessed": tells the script to
load the dataset from the PREPROCESSED_PATH
(given in dataset.py), or "payload": tells the
script to load the dataset from the PAYLOAD_PATH
(given in dataset.py). (PATHS need to be modified
in dataset.py!) [required]
-s --s saved_model [NONE|MODEL_PATH] Path to a model to continue training. If
NONE, training is performed from scratch
[default: NONE]
-t --training_optimizer [NONE|OPTIMIZER_PATH] Path to an optimizer to continue
training. If NONE, a new optimizer is initialized
[default: NONE]
-l --learning_rate [NONE|FLOAT] Defines the learning rate to be used. If
NONE, the optimal hyperparameter determined for
the given model will be used [default: NONE]
-e --starting_epoch [INTEGER] Defines the epoch to start training at
[default: 1]
-n --num_epochs [INTEGER] The number of epochs to train the model
[default: 35]
-o [BOOLEAN] If true, oversampling is used. [default: False]
-a [BOOLEAN] If true, an adaptive learning rate is used (if the
validation accuracy reaches 93%, the lr will be
divided by 10). [default: False]
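The adaptive learning rate rule behind the -a option can be illustrated with a small sketch. The function name `adapt_learning_rate` and its parameters are hypothetical. The sketch also assumes the reduction is applied only once, the first time the threshold is reached, which the description above does not state.

```python
def adapt_learning_rate(lr, val_accuracy, already_reduced,
                        threshold=0.93, factor=10.0):
    """Return (new_lr, already_reduced) after one epoch.

    Assumption: the learning rate is divided by `factor` a single time,
    on the first epoch whose validation accuracy reaches `threshold`.
    """
    if not already_reduced and val_accuracy >= threshold:
        return lr / factor, True
    return lr, already_reduced


if __name__ == "__main__":
    lr, reduced = 1e-3, False
    # Hypothetical validation accuracies for four epochs.
    for epoch, acc in enumerate([0.80, 0.91, 0.94, 0.95], start=1):
        lr, reduced = adapt_learning_rate(lr, acc, reduced)
        print(epoch, acc, lr)
```

In a training loop, the returned value would be written back into the optimizer after each validation pass.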
> hyperparameter_optimization.py
Will perform hyperparameter optimization (currently only for multiclass classification).
Usage: python hyperparameter_optimization.py [OPTIONS]
Options:
-m, --model_type STRING Can be "fc" (VGG-FC), "notop" (VGG-NoTop), "resnet" (ResNet), or "next" (ResNeXt) [required]
-p, --preprocessing_type STRING Can be "preprocessed": tells the script to
load the dataset from the PREPROCESSED_PATH
(given in dataset.py), or "payload": tells the
script to load the dataset from the PAYLOAD_PATH
(given in dataset.py). (PATHS need to be modified
in dataset.py!) [required]
-s --s save_dir_best_result [NONE|PATH] Path to save the best resulting model.
[default: NONE]
-o [BOOLEAN] If true oversampling is used. [default: False]
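The options above do not say how the -o flag oversamples. One common approach is random oversampling with replacement, sketched below as an assumption rather than as the repository's implementation. The function name `random_oversample` and the `(item, label)` input format are hypothetical.

```python
import random
from collections import defaultdict


def random_oversample(samples, seed=0):
    """Duplicate minority-class items until all classes have equal size.

    `samples` is a list of (item, label) pairs. Extra items are drawn
    with replacement from the same class.
    """
    if not samples:
        return []
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item, label in samples:
        by_class[label].append(item)

    target = max(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced += [(item, label) for item in items + extra]
    rng.shuffle(balanced)  # avoid long runs of a single class
    return balanced
```

Oversampling would normally be applied to the training set only, so that the test and validation sets keep the original class distribution.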
| | Accuracy | F1-Score |
|---|---|---|
| All classes | | |
| Neris/Virut combined | | |

| | Accuracy | F1-Score |
|---|---|---|
| All classes | | |

| | VGG-NoTop | VGG-FullyConnected | ResNet | ResNeXt |
|---|---|---|---|---|
| All classes | | | | |
| Neris/Virut combined | | | | |

| | VGG-NoTop | VGG-FullyConnected | ResNet | ResNeXt |
|---|---|---|---|---|
| All classes | | | | |