God's in his heaven. All's right with the world! — Pippa Passes
Graduate's in his heaven. All's right with the code! — The dream of a struggling PhD student
Personal Python toolbox including project templates, useful functions, etc. The training framework is tailored for the computing cluster at Vector Institute, but can also be applied to other platforms.
Greatly inspired by and lots of code borrowed from:
* Image credit: Neon Genesis Evangelion
- Manually install PyTorch with CUDA support (see `requirements.txt` for the versions we tested)
- Run `pip install -e .` to install the whole package
- Our code automatically detects previous checkpoints based on the name of the config file. For example, if your config is `xxx_params.py`, we will save/load checkpoints under `checkpoint/xxx_params/models/`. Therefore, if you launch two runs with the same config file (even if its contents differ), your new run might load the checkpoints from the old run. Instead, copy the config file to a new file and rename it (potentially with some modifications to test new settings) so that the old checkpoints are not detected. If you just want to run the same config file multiple times (e.g. 3 random seeds), use the `dup_run_sbatch.sh` script, which adds different suffixes to differentiate the config files.
- For multi-GPU training, we have implemented DDP in the Trainer. Replace `python xxx.py ...` with `python -m torch.distributed.launch --nproc_per_node=$NUM_GPU --master_port=29501 xxx.py ...` to launch it. The deprecation warning can be ignored; we plan to fix it in the future.
- If you run DDP with PyTorch>=2.0, you may encounter the error `error: unrecognized arguments: --local-rank=0`. This is likely due to an incompatibility between PyTorch 1.x and 2.x. You can fix it by changing the `--local_rank` argument to `--local-rank` (e.g. here).
- When initializing the Trainer, we check the number of GPUs here. If you launch training from the command line with `CUDA_VISIBLE_DEVICES=0,... python train.py xxx`, you will pass the check. But if you set `os.environ['CUDA_VISIBLE_DEVICES'] = '0,...'` in the Python file (after importing `nerv`), this may trigger an error. We recommend always using `CUDA_VISIBLE_DEVICES=0,...` to run Python commands.
- Some users have encountered a version issue with `opencv`. For `nerv>=0.1.0,<=v0.2.0`, `opencv-python==4.5.5.64`, `4.6.0.66`, and `4.7.0.72` are the three tested versions.
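The checkpoint naming rule in the notes above can be illustrated with a minimal sketch. The helper `checkpoint_dir_for` and its `root` parameter are hypothetical names used only for illustration, not part of the package; the sketch only assumes that the directory is derived from the config file's base name, as described above.

```python
from pathlib import PurePosixPath


def checkpoint_dir_for(config_path, root="checkpoint"):
    """Hypothetical helper: map a config file to its checkpoint directory.

    Only the file name (without extension) is used, so the config's
    parent directory and its contents play no role.
    """
    name = PurePosixPath(config_path).stem  # "xxx_params.py" -> "xxx_params"
    return str(PurePosixPath(root) / name / "models")


# Two configs with the same file name resolve to the same directory,
# so a second run may pick up checkpoints from the first one.
old_run = checkpoint_dir_for("configs/xxx_params.py")
new_run = checkpoint_dir_for("other_configs/xxx_params.py")

# A renamed copy gets its own directory and starts fresh.
renamed = checkpoint_dir_for("configs/xxx_params_v2.py")
```

Under this assumption, renaming the config file is what separates two runs; editing its contents alone does not.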
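For the `--local-rank` error noted above, one possible workaround is to let the training script accept both spellings, so the same script works whichever form the launcher passes. This is a sketch using only the standard `argparse` module; `build_parser` is a hypothetical name, and the parser in your own script may define other arguments.

```python
import argparse


def build_parser():
    """Hypothetical parser that accepts both spellings of the rank flag."""
    parser = argparse.ArgumentParser()
    # Both option strings store their value in args.local_rank.
    parser.add_argument(
        "--local_rank", "--local-rank",
        dest="local_rank", type=int, default=0,
    )
    return parser


# Underscore spelling
args_underscore = build_parser().parse_args(["--local_rank=1"])
# Hyphen spelling, as in the error message above
args_hyphen = build_parser().parse_args(["--local-rank=0"])
```

Registering both option strings on one argument keeps a single `local_rank` attribute, so the rest of the script does not need to know which spelling was used.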
- SlotFormer (ICLR'23)