We present a prototype of ONES, which is implemented with RPyC and PyTorch.
- TACC Frontera RTX Nodes.
- CUDA requirements:
  - CUDA == 10.1
  - cuDNN >= 7.6.3
  - NCCL >= 2.4.8
- Download the v1.1.1 repository.
$ git clone https://github.com/kurisusnowdeng/ones_sc21.git
- Set up the virtual environment by running `scripts/env_setup.sh`. The following libraries will be installed:
  - python == 3.7
  - rpyc >= 5.0.1
  - pytorch == 1.4.0
  - torchvision == 0.5.0
  - numpy >= 1.17.4
  - scipy == 1.3.1
  - pytorch-pretrained-bert >= 0.6.2
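As a rough pre-flight check, the version constraints listed above can be compared against what is actually installed. The following is a minimal sketch and not part of ONES: `parse_version` and `satisfies` are hypothetical helper names, and `==` is assumed to mean a prefix match (so 10.1.243 satisfies `== 10.1`). The installed versions in the demo are illustrative values only; substitute the ones reported by your environment.

```python
import re


def parse_version(text):
    """Turn '1.4.0' or '1.4.0+cu101' into a tuple of ints such as (1, 4, 0)."""
    core = text.split("+", 1)[0]  # drop a local build suffix like '+cu101'
    return tuple(int(part) for part in re.findall(r"\d+", core))


def satisfies(installed, spec):
    """Check an installed version string against a spec such as '>= 7.6.3'."""
    op, wanted = spec.split()
    have, want = parse_version(installed), parse_version(wanted)
    if op == "==":
        # Assumption: '== 10.1' accepts any 10.1.x release (prefix match).
        return have[:len(want)] == want
    if op == ">=":
        # Pad with zeros so that '7.6' compares like '7.6.0'.
        width = max(len(have), len(want))
        have += (0,) * (width - len(have))
        want += (0,) * (width - len(want))
        return have >= want
    raise ValueError(f"unsupported operator: {op}")


if __name__ == "__main__":
    # (name, installed version, required spec); installed values are examples.
    checks = [
        ("CUDA", "10.1.243", "== 10.1"),
        ("cuDNN", "7.6.5", ">= 7.6.3"),
        ("NCCL", "2.4.8", ">= 2.4.8"),
        ("pytorch", "1.4.0", "== 1.4.0"),
    ]
    for name, installed, spec in checks:
        status = "ok" if satisfies(installed, spec) else "MISMATCH"
        print(f"{name:8s} {installed:10s} {spec:10s} {status}")
```

Only the two operators that appear in the requirement lists (`==` and `>=`) are handled; anything else raises an error rather than being silently accepted.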
- Launch the system.
$ python -m src.controller --size NUM_NODES --port CONTROLLER_PORT --cache-dir PATH/TO/CACHE/
  `--size` is required and specifies the number of nodes to use.
- On each worker node, make sure that no unrelated process is using any GPU, then join the node to the controller.
$ python -m src.app_manager --port MANAGER_PORT --controller_port CONTROLLER_PORT --cache-dir PATH/TO/CACHE/
- Submit your job.
$ python -m src.workload submit path/to/your_script.py \
    --batch-size=BATCH_SIZE --lr=LEARNING_RATE \
    --dataset-size=DATASET_SIZE --early-stop-patience=PATIENCE
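When several jobs are submitted from a driver script, the submit command above can be assembled as an argument list instead of a hand-typed shell line. The sketch below is only an illustration: `build_submit_command` is a hypothetical helper, it uses only the flags shown in the submit step, and the numeric values in the demo are placeholders rather than recommended settings.

```python
import shlex


def build_submit_command(script, batch_size, lr, dataset_size, patience):
    """Return the argv list for the submit command shown above."""
    return [
        "python", "-m", "src.workload", "submit", str(script),
        f"--batch-size={batch_size}",
        f"--lr={lr}",
        f"--dataset-size={dataset_size}",
        f"--early-stop-patience={patience}",
    ]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    cmd = build_submit_command("path/to/your_script.py", 256, 0.1, 50000, 10)
    # Print the command in a form that can be pasted into a shell.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` from the project root once the controller and the worker nodes are running.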
- Set `/path/to/project` in `scripts/launch.sh`, `scripts/master.slurm`, and `scripts/worker.slurm` to your project directory.
- Run `scripts/preparation.sh` to download the datasets.
- Submit the job to the `rtx` queue. If the `rtx` queue is busy, use the `-n` option to run a smaller cluster (for example, `-n 8`).
$ ./scripts/launch.sh -j JOB_NAME -n 16 -t 06:00:00
- After the job is completed, extract and analyze the results from the logs (located by default in the folder `out/`, which can be changed in `src/config.py`). The following script generates the plots presented in our paper under the project path.
$ python ./scripts/measurement.py
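Before submitting to the queue, it can help to confirm that the `/path/to/project` placeholder has been replaced in all three scripts named in the first step. The following is a minimal sketch, not something shipped with the repository: `unset_placeholders` is a hypothetical helper that assumes the placeholder appears literally in the unedited files.

```python
from pathlib import Path

PLACEHOLDER = "/path/to/project"
SCRIPTS = (
    "scripts/launch.sh",
    "scripts/master.slurm",
    "scripts/worker.slurm",
)


def unset_placeholders(project_dir):
    """Return the scripts that are missing or still contain the placeholder."""
    pending = []
    for rel in SCRIPTS:
        path = Path(project_dir) / rel
        if not path.is_file() or PLACEHOLDER in path.read_text():
            pending.append(rel)
    return pending


if __name__ == "__main__":
    # Run from the project root; an empty result means all three are edited.
    for rel in unset_placeholders("."):
        print(f"still needs editing (or is missing): {rel}")
```

A missing file is reported together with unedited ones, so the check also catches running it from the wrong directory.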