This repository contains an Amazon SageMaker training implementation with data pre-processing (decoding + augmentations) on both GPUs and CPUs for computer vision, allowing you to compare and reduce training time by addressing CPU bottlenecks caused by a growing data pre-processing load. This is achieved with GPU-accelerated JPEG image decoding and by offloading augmentation to GPUs using NVIDIA DALI. Performance bottlenecks and system utilization metrics are compared using Amazon SageMaker Debugger.
- `util_train.py`: Launches Amazon SageMaker PyTorch training jobs with your custom training script.
- `src/sm_augmentation_train-script.py`: Custom training script to train models of different complexities (RESNET-18, RESNET-50, RESNET-152) with data pre-processing implemented for:
  - JPEG decoding and augmentation on CPUs using the PyTorch DataLoader
  - JPEG decoding and augmentation on CPUs & GPUs using NVIDIA DALI (see the sketch after this list)
- `util_debugger.py`: Extracts system utilization metrics with SageMaker Debugger.
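As a rough illustration of the two pre-processing paths above, the sketch below contrasts a CPU-only PyTorch `DataLoader` pipeline with a DALI pipeline that decodes JPEGs with the `mixed` (CPU + GPU) decoder and runs the augmentations on the GPU. The data path, image size, and augmentation choices are illustrative assumptions, not the exact settings used in `src/sm_augmentation_train-script.py`.

```python
import torch
from torchvision import datasets, transforms
from nvidia.dali import pipeline_def, fn, types
from nvidia.dali.plugin.pytorch import DALIGenericIterator

DATA_DIR = "/opt/ml/input/data/train"  # assumed SageMaker training channel path

# --- Baseline: JPEG decoding + augmentation on CPUs via the PyTorch DataLoader ---
cpu_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
cpu_loader = torch.utils.data.DataLoader(
    datasets.ImageFolder(DATA_DIR, transform=cpu_transform),
    batch_size=32, shuffle=True, num_workers=8, pin_memory=True,
)

# --- Alternative: decoding + augmentation offloaded to the GPU via NVIDIA DALI ---
@pipeline_def
def dali_train_pipeline():
    jpegs, labels = fn.readers.file(file_root=DATA_DIR, random_shuffle=True, name="Reader")
    # "mixed" starts JPEG decoding on the CPU and finishes it on the GPU (nvJPEG)
    images = fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)
    images = fn.random_resized_crop(images, size=[224, 224])
    images = fn.crop_mirror_normalize(
        images, dtype=types.FLOAT, output_layout="CHW",
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
        mirror=fn.random.coin_flip(probability=0.5),
    )
    return images, labels

pipe = dali_train_pipeline(batch_size=32, num_threads=8, device_id=0)
pipe.build()
dali_loader = DALIGenericIterator(pipe, ["data", "label"], reader_name="Reader")
```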
- Parameters such as the training data path, S3 bucket, epochs, and other training hyperparameters can be adapted in `util_train.py`.
- The custom training script used is `src/sm_augmentation_train-script.py`.
```python
from util_debugger import get_sys_metric
from util_train import aug_exp_train

aug_exp_train(model_arch='RESNET50',
              batch_size='32',
              aug_operator='dali-gpu',
              instance_type='ml.p3.2xlarge',
              curr_sm_role='to-be-added')
```
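Under the hood, a helper like `aug_exp_train` typically wraps the SageMaker Python SDK's `PyTorch` estimator and enables Debugger system profiling. The sketch below is a hedged guess at that wiring, not the actual body of `util_train.py`; the S3 path, framework version, and hyperparameter names are assumptions.

```python
from sagemaker.pytorch import PyTorch
from sagemaker.debugger import ProfilerConfig, FrameworkProfile

def aug_exp_train(model_arch, batch_size, aug_operator, instance_type, curr_sm_role):
    """Launch one SageMaker training job for a given model/augmentation setup (sketch)."""
    estimator = PyTorch(
        entry_point="sm_augmentation_train-script.py",
        source_dir="src",
        role=curr_sm_role,
        instance_count=1,
        instance_type=instance_type,
        framework_version="1.8.1",          # assumed PyTorch container version
        py_version="py36",
        hyperparameters={
            "model-arch": model_arch,
            "batch-size": batch_size,
            "aug-operator": aug_operator,    # e.g. 'pytorch-cpu' or 'dali-gpu'
        },
        # Collect system metrics (CPU/GPU utilization) for util_debugger.py to analyze
        profiler_config=ProfilerConfig(
            system_monitor_interval_millis=500,
            framework_profile_params=FrameworkProfile(),
        ),
    )
    estimator.fit({"train": "s3://sm-aug-test/imagenette"})  # assumed dataset prefix
    return estimator
```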
- Note that this implementation is currently optimized for single-GPU training to address multi-core CPU bottlenecks. The DALI decoder operation can be updated with improved usage of `device_memory_padding` and `host_memory_padding` for larger multi-GPU instances, as sketched below.
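For reference, a minimal sketch of how those padding hints can be passed to DALI's mixed JPEG decoder. The padding values below follow NVIDIA's ImageNet examples and are an assumption, not values taken from this repository.

```python
from nvidia.dali import fn, types

def decode_jpegs(jpegs):
    # Pre-allocating nvJPEG scratch buffers avoids costly re-allocations when
    # a batch contains unusually large images (values are illustrative).
    return fn.decoders.image(
        jpegs,
        device="mixed",                       # hybrid CPU/GPU JPEG decoding
        output_type=types.RGB,
        device_memory_padding=211025920,      # ~201 MB GPU-side scratch buffer
        host_memory_padding=140544512,        # ~134 MB pinned host scratch buffer
    )
```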
- Create an Amazon S3 bucket called `sm-aug-test` and upload the Imagenette dataset (download link); see the upload sketch after this list.
- Update your SageMaker execution role in the notebook `sm_augmentation_train-script.py` and run the notebook to compare seconds/epoch and system utilization for training jobs by toggling the following parameters:
  - `instance_type` (default: `ml.p3.2xlarge`)
  - `model_arch` (default: `RESNET18`)
  - `batch_size` (default: `32`)
  - `aug_load_factor` (default: `12`)
  - `AUGMENTATION_APPROACHES` (default: `['pytorch-cpu', 'dali-gpu']`)
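A minimal sketch of creating the bucket and uploading the dataset with `boto3` and the SageMaker Python SDK, assuming the Imagenette archive has already been downloaded and extracted locally (the local folder name and key prefix are assumptions):

```python
import boto3
import sagemaker

BUCKET = "sm-aug-test"
region = boto3.session.Session().region_name

# Create the bucket (us-east-1 must omit the LocationConstraint).
s3 = boto3.client("s3", region_name=region)
if region == "us-east-1":
    s3.create_bucket(Bucket=BUCKET)
else:
    s3.create_bucket(Bucket=BUCKET,
                     CreateBucketConfiguration={"LocationConstraint": region})

# Upload the locally extracted Imagenette folder (name assumed) to the bucket.
train_s3_uri = sagemaker.Session().upload_data(
    path="imagenette2", bucket=BUCKET, key_prefix="imagenette"
)
print("Training data uploaded to:", train_s3_uri)
```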
- Comparison results using the above default parameter setup:
  - Seconds/epoch improvement of 72.59% in the Amazon SageMaker training job by offloading JPEG decoding and heavy augmentation to the GPU, addressing the data pre-processing bottleneck to improve the performance-cost ratio.
  - Using the above strategy, the training time improvement is higher for lighter models like `RESNET-18` (which cause more CPU bottlenecks) than for heavier models such as `RESNET-152` as `aug_load_factor` is increased while keeping a lower batch size of `32`.
  - System utilization histograms and CPU bottleneck heatmaps are generated with SageMaker Debugger in the notebook; the Profiler Report and other interactive visuals are available in SageMaker Studio (see the sketch after this list).
- Further detailed results (based on different augmentation loads, batch sizes, and model complexities for training on 8 CPUs and 1 GPU) are available on request.
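A minimal sketch of pulling those Debugger system metrics into histogram and heatmap plots with the `smdebug` notebook utilities; the training job name and region are placeholders, and `get_sys_metric` in `util_debugger.py` may wrap this differently.

```python
from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob
from smdebug.profiler.analysis.notebook_utils.metrics_histogram import MetricsHistogram
from smdebug.profiler.analysis.notebook_utils.heatmap import Heatmap

job_name = "to-be-added"   # name of the completed SageMaker training job
region = "us-east-1"       # assumed region

tj = TrainingJob(job_name, region)
tj.wait_for_sys_profiling_data_to_be_available()

system_metrics_reader = tj.get_systems_metrics_reader()
system_metrics_reader.refresh_event_file_list()

# Distribution of CPU/GPU utilization over the job (bottlenecked CPUs pile up near 100%).
MetricsHistogram(system_metrics_reader).plot(select_dimensions=["CPU", "GPU"])

# Per-core / per-GPU utilization over time, highlighting CPU bottleneck phases.
Heatmap(system_metrics_reader, select_dimensions=["CPU", "GPU"], plot_height=450)
```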
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.