Colab TPU training stuck at end of epoch if checkpoint_callback=True #3660

Closed
ktrapeznikov opened this issue Sep 25, 2020 · 10 comments · Fixed by #4309
Labels: accelerator: tpu (Tensor Processing Unit) · bug (Something isn't working) · waiting on author (Waiting on user action, correction, or update)

Comments
@ktrapeznikov

Training gets stuck at the last step of the first epoch when checkpoint_callback=True is enabled. I am passing tpu_cores=8 to the trainer.

If it's False, training is still slow for the first few steps and then speeds up (I assume this is expected, since it takes XLA a few steps to compile the graphs — see the aside after the repro script below).

Here is a link to the colab notebook example.

### Setup Model
import argparse

import torch
from torch.utils.data import DataLoader, TensorDataset

import pytorch_lightning as pl
from transformers import AutoModelForCausalLM


class GPT2Tuned(pl.LightningModule):
    def __init__(self, hparams: argparse.Namespace):
        super().__init__()
        self.hparams = hparams
        self.model = AutoModelForCausalLM.from_pretrained(self.hparams.model_name_or_path)

    def forward(self, **inputs):
        return self.model(**inputs)

    def _step(self, batch):
        # Causal language modeling: feed the inputs as their own labels.
        inputs = {"input_ids": batch[0], "labels": batch[0]}
        outputs = self(**inputs)
        loss = outputs[0]
        return dict(loss=loss)

    def training_step(self, batch, batch_idx):
        out_dict = self._step(batch)
        return dict(loss=out_dict["loss"], log=out_dict)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.learning_rate)

    def train_dataloader(self):
        # Fake data: 10000 random token sequences of length 128.
        dataset = TensorDataset(torch.randint(10000, (10000, 128)).long())
        return DataLoader(
            dataset,
            batch_size=self.hparams.batch_size,
            shuffle=True,
        )

    @staticmethod
    def add_model_specific_args(parser):
        parser.add_argument("--model_name_or_path", type=str)
        parser.add_argument("--batch_size", default=4, type=int)
        parser.add_argument("--learning_rate", default=5e-5, type=float)
        return parser
#### Training
parser = argparse.ArgumentParser()
parser = GPT2Tuned.add_model_specific_args(parser)
parser = pl.Trainer.add_argparse_args(parser)
args = parser.parse_args(["--gpus", "0"])

# Override the defaults: run on 8 TPU cores with 16-bit precision.
args.__dict__.update(dict(model_name_or_path="gpt2-medium",
                          batch_size=3,
                          precision=16,
                          tpu_cores=8))

model = GPT2Tuned(args)
trainer = pl.Trainer.from_argparse_args(args, checkpoint_callback=True)
trainer.fit(model)
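
(Aside on the slow first steps: XLA compiles a separate graph for every distinct tensor shape, so the smaller final batch of each epoch forces an extra compilation. A minimal tweak to the fake-data loader above, a sketch rather than a confirmed fix for this issue, is to drop the incomplete last batch:)

    def train_dataloader(self):
        dataset = TensorDataset(torch.randint(10000, (10000, 128)).long())
        # drop_last=True keeps every batch the same shape, so XLA reuses
        # its cached compiled graph instead of recompiling at epoch end.
        return DataLoader(
            dataset,
            batch_size=self.hparams.batch_size,
            shuffle=True,
            drop_last=True,
        )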
@ktrapeznikov changed the title from “Colab TPU training stuck at end if epoch if checkpoint_callback=True” to “Colab TPU training stuck at end of epoch if checkpoint_callback=True” on Sep 25, 2020
@github-actions
Contributor

Hi! Thanks for your contribution, great first issue!

@awaelchli
Contributor

Thanks for the reproduction script.
Could be related to #2700.
There is a PR for fixing checkpointing on TPU: #2726 (need to check if this applies).

@ktrapeznikov
Author

Thanks. Seems related. Is there another mechanism to save checkpoints as a temporary fix?
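
(One possible stopgap, a sketch not taken from this thread and assuming torch_xla is installed, is to bypass the built-in checkpointing entirely and save the raw weights from a custom callback with xm.save, which moves tensors to CPU and writes only on the master ordinal:)

import pytorch_lightning as pl
import torch_xla.core.xla_model as xm

class ManualTPUCheckpoint(pl.Callback):
    """Hypothetical workaround: save plain weights ourselves instead of
    relying on the ModelCheckpoint path that hangs here."""

    def __init__(self, filepath="weights.pt"):
        self.filepath = filepath

    def on_epoch_end(self, trainer, pl_module):
        # Runs in every TPU process; xm.save writes only from the master
        # ordinal, so all 8 cores stay in step.
        xm.save(pl_module.state_dict(), self.filepath)

trainer = pl.Trainer(tpu_cores=8, checkpoint_callback=False,
                     callbacks=[ManualTPUCheckpoint()])

(This saves only the model weights, not the optimizer or trainer state, so resuming mid-run would still need the real fix.)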

@awaelchli added the accelerator: tpu (Tensor Processing Unit) and bug (Something isn't working) labels on Sep 28, 2020
@Borda
Member

Borda commented Oct 1, 2020

@ktrapeznikov mind checking it now? Google Colab had an internal issue with TPUs in the past few days...

@Borda added the waiting on author (Waiting on user action, correction, or update) label on Oct 1, 2020
@ktrapeznikov
Author

Last time I tried... same issue

@evanatyourservice

This would be a nice feature to have since I would rarely do inference on TPU, just training.

@edenlightning removed the waiting on author (Waiting on user action, correction, or update) label on Oct 19, 2020
@edenlightning added this to the 1.0.3 milestone on Oct 19, 2020
@edenlightning
Contributor

@lezwon maybe you can help revive the tpu + checkpoint fix?

@lezwon
Contributor

lezwon commented Oct 22, 2020

Been working on it. Finding it a bit tricky. I'll raise a PR with the current work I've done.

@sarmientoj24

Hi, I am getting stuck at 0% on the first epoch when using TPU with pytorch-lightning. Any possible reasons why?

@lezwon
Contributor

lezwon commented Nov 1, 2020

@sarmientoj24 mind sharing the notebook?

@lezwon mentioned this issue on Nov 1, 2020
@edenlightning modified the milestones: 1.0.x → 1.0.7 on Nov 10, 2020
@Borda modified the milestones: 1.0.7 → 1.0.x on Nov 11, 2020
@edenlightning removed this from the 1.0.x milestone on Nov 13, 2020
@edenlightning added the waiting on author (Waiting on user action, correction, or update) label on Nov 17, 2020