Running an experiment

Here we describe the set-up for training a model (including on the Stability cluster).

General set-up


  • Configs are in: experiments/configs.
  • If you wish to use a new model from Hugging Face as the starting point, you will need to tokenise your data. We have an example script for chemrxiv which does this here: experiments/data/
  • You will also need to create a configuration file for the model if one does not already exist, e.g. experiments/configs/hugging-face/full_160M.yml.

If the data is already tokenised for the model you wish to use you can proceed to the next step.


The training scripts use Miniconda to create Python environments, so Miniconda must be installed before you run them. You can follow the bash script here to install Miniconda.

Interactive run

  • Create a conda environment as shown in the documentation and install chemnlp.
  • If using Weights and Biases for logging: export WANDB_BASE_URL="".
  • Run using torchrun, for example:
    torchrun --nnodes 1 --nproc-per-node 4 experiments/scripts/ experiments/configs/hugging-face/full_160M.yml
  • You can use nvidia-smi or wandb logging to monitor efficiency during this step.
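As a rough illustration of how the process count in the torchrun command relates to the GPUs on the node, the sketch below counts devices from the text printed by `nvidia-smi -L` and assembles the command as an argument list. The helper names (`count_gpus`, `build_torchrun_command`), the sample listing and the script path are all assumptions made for illustration; they are not part of chemnlp.

```python
import shlex


def count_gpus(nvidia_smi_listing: str) -> int:
    """Count devices in the text printed by `nvidia-smi -L` (one "GPU <n>: ..." line each)."""
    return sum(1 for line in nvidia_smi_listing.splitlines() if line.startswith("GPU "))


def build_torchrun_command(script: str, config: str, nproc_per_node: int, nnodes: int = 1) -> list[str]:
    """Assemble a torchrun invocation with one worker process per GPU."""
    if nproc_per_node < 1:
        raise ValueError("need at least one process per node")
    return [
        "torchrun",
        "--nnodes", str(nnodes),
        "--nproc-per-node", str(nproc_per_node),
        script,
        config,
    ]


# Hypothetical listing; on a real node this text would come from running `nvidia-smi -L`.
listing = "GPU 0: NVIDIA A100 (UUID: GPU-aaaa)\nGPU 1: NVIDIA A100 (UUID: GPU-bbbb)\n"
command = build_torchrun_command(
    script="path/to/training_script.py",  # placeholder, not a real chemnlp path
    config="experiments/configs/hugging-face/full_160M.yml",
    nproc_per_node=count_gpus(listing),
)
print(shlex.join(command))
```

Keeping the command as a list until the last moment avoids quoting problems if a path ever contains spaces.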

Launching an experiment run through SLURM

  • Take the sbatch_<suffix> script associated with the training run and submit it with an sbatch command as shown in the documentation. This builds the conda environment and installs chemnlp before the job begins. Building the environment can be slow, so if you are not confident your code will run, test it interactively first.
  • Example command:
    sbatch experiments/scripts/ $1 $2 $3  # see script for description of arguments
    sbatch experiments/scripts/ experiments/maw501 maw501 160M_full.yml  # explicit example
  • From within the Stability cluster, you can monitor your job's logs at /fsx/proj-chemnlp/experiments/logs, or at the location set in the sbatch script.
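One way to check on a job from a login node is to read the tail of the newest file in the log directory. The sketch below assumes only that each job writes a plain-text log file into that directory; the file naming scheme is set by the sbatch script and is not assumed here, and the helper names are hypothetical.

```python
from collections import deque
from pathlib import Path
from typing import Optional


def newest_log(log_dir: str) -> Optional[Path]:
    """Return the most recently modified regular file in log_dir, or None if there is none."""
    files = [p for p in Path(log_dir).iterdir() if p.is_file()]
    return max(files, key=lambda p: p.stat().st_mtime) if files else None


def tail(path: Path, n: int = 20) -> list[str]:
    """Return the last n lines of a text file, reading it as a stream."""
    with path.open("r", errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]


if __name__ == "__main__":
    log_dir = "/fsx/proj-chemnlp/experiments/logs"  # default location named above
    if Path(log_dir).is_dir():
        latest = newest_log(log_dir)
        if latest is not None:
            print(f"== {latest} ==")
            print("\n".join(tail(latest)))
```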

Using Weights and Biases

If you don't have the required permission to log to W&B, please request it. In the interim you can disable W&B logging, or log to a project under your own name, by changing the configuration options, e.g. in experiments/configs/hugging-face/full_160M.yml.
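Independently of the chemnlp config file, the W&B client itself reads a few environment variables: `WANDB_MODE` (`online`, `offline` or `disabled`), `WANDB_PROJECT` and `WANDB_ENTITY`. The sketch below sets them from Python; the `wandb_env` helper is hypothetical, and whether the chemnlp configuration takes precedence over these variables is not confirmed here.

```python
import os
from typing import Optional


def wandb_env(mode: str = "online", project: Optional[str] = None, entity: Optional[str] = None) -> dict[str, str]:
    """Build the environment variables that control W&B logging for a run."""
    if mode not in {"online", "offline", "disabled"}:
        raise ValueError(f"unsupported WANDB_MODE: {mode!r}")
    env = {"WANDB_MODE": mode}
    if project:
        env["WANDB_PROJECT"] = project
    if entity:
        env["WANDB_ENTITY"] = entity
    return env


# Turn logging off for this process and anything it launches afterwards.
os.environ.update(wandb_env(mode="disabled"))
```

These variables need to be set before W&B is initialised, so in practice they would be exported in the shell or at the top of the launching script.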

Multi-node training

Multi-node training is currently available for Hugging Face fine-tuning only and is orchestrated through the torch.distributed package. It extends your computing environment to multiple nodes in a distributed data parallel manner, using multiprocessing to parallelise training across devices. To enable it, use the *_multinode script instead of the original SLURM training script, as described in the scripts documentation.
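Each worker process started by torchrun receives its position in the job through the `RANK`, `LOCAL_RANK` and `WORLD_SIZE` environment variables. The sketch below reads them with the standard library only, which can help when adding rank-aware logging or debugging a multi-node launch. The `DistributedContext` class and `read_distributed_context` function are illustrative names, not chemnlp code.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DistributedContext:
    rank: int        # global index of this process across all nodes
    local_rank: int  # index of this process on its own node (usually the GPU index)
    world_size: int  # total number of processes in the job

    @property
    def is_main_process(self) -> bool:
        """True for the single process that should write logs and checkpoints."""
        return self.rank == 0


def read_distributed_context(env: Optional[Mapping[str, str]] = None) -> DistributedContext:
    """Read the variables torchrun exports, falling back to a single-process run."""
    env = os.environ if env is None else env
    return DistributedContext(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )


# Example: 2 nodes with 4 processes each; the third process on the second node.
example = read_distributed_context({"RANK": "6", "LOCAL_RANK": "2", "WORLD_SIZE": "8"})
print(example, example.is_main_process)
```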

Restarting from a checkpoint

This is for Hugging Face fine-tuning only at the moment.

WARNING: Hugging Face does not know you are restarting from a checkpoint and so you may wish to change output_dir in the config file to avoid overwriting old checkpoints. You may wish to use a lower learning rate / different scheduler if continuing training.

You can restart training from a checkpoint by passing checkpoint_path, a directory containing the output from a model saved by HF's Trainer class.

Example config block:

  base: GPTNeoXForCausalLM
  name: EleutherAI/pythia-160m
  revision: main
  checkpoint_path: /fsx/proj-chemnlp/experiments/checkpoints/finetuned/full_160M/checkpoint-1600 # directory to restart training from
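To find a value for checkpoint_path, it can help to locate the most recent checkpoint in a previous run's output directory. The sketch below relies on the Trainer convention of saving to directories named `checkpoint-<step>` and compares the step numbers numerically; the `latest_checkpoint` helper is hypothetical. The transformers library ships a similar utility, `get_last_checkpoint` in `transformers.trainer_utils`.

```python
import re
from pathlib import Path
from typing import Optional

# Trainer saves each checkpoint to a directory named "checkpoint-<global step>".
_CHECKPOINT_DIR = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint directory with the highest step number, or None."""
    best_step, best_path = -1, None
    for child in Path(output_dir).iterdir():
        match = _CHECKPOINT_DIR.match(child.name)
        if match and child.is_dir() and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path
```

Comparing steps as integers matters because a plain string sort would place checkpoint-200 after checkpoint-1600.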

DeepSpeed integration

This is for Hugging Face fine-tuning only and is described in detail here. You can enable DeepSpeed through the Hugging Face TrainingArguments by adding a configuration key of deepspeed_config followed by the name of your configuration file inside the experiments/configs/deepspeed configuration directory.

Example config block:

  deepspeed_config: deepspeed_offload_S3.json # looks in experiments/configs/deepspeed
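Before submitting a job, it can be worth checking that the named file exists in the DeepSpeed configuration directory and parses as JSON. The sketch below is one way to do that; `load_deepspeed_config` is a hypothetical helper, and the way chemnlp itself resolves the name is assumed from the comment in the config block rather than taken from its code.

```python
import json
from pathlib import Path

# Directory the deepspeed_config key is resolved against, per the comment above.
DEEPSPEED_CONFIG_DIR = Path("experiments/configs/deepspeed")


def load_deepspeed_config(name: str, config_dir: Path = DEEPSPEED_CONFIG_DIR) -> dict:
    """Resolve a config file name inside config_dir and return its parsed JSON."""
    path = Path(config_dir) / name
    if not path.is_file():
        raise FileNotFoundError(f"no DeepSpeed config named {name!r} in {config_dir}")
    with path.open() as handle:
        config = json.load(handle)
    if not isinstance(config, dict):
        raise ValueError(f"{path} must contain a JSON object at the top level")
    return config
```

Run from the repository root, `load_deepspeed_config("deepspeed_offload_S3.json")` would return the parsed settings or fail early with a clear error.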