Resume training from checkpoint with same hyperparameter #146

Open
Amg9794 opened this issue Oct 11, 2024 · 1 comment

Comments


Amg9794 commented Oct 11, 2024

🚀 The feature, motivation and pitch

Hi,

I trained both the speech encoder (Whisper large-v3) and the linear projector with a frozen Llama 3.2-1B model for an ASR task. All training steps completed, but the eval loss had not yet saturated and there was still room for improvement in the model.

When I resumed training from the last saved checkpoint (which, using the original code's method, only saves the trainable parameters), the results degraded, which was unexpected.

Is there any way to resume training from the same state, with the last saved optimizer/scheduler hyperparameters?

I wrote this function to save a checkpoint with all of these details:

import os
from collections import OrderedDict

import torch

# `logger` is assumed to be the module-level logger already used elsewhere in the training code.

def save_model_checkpoint_peft(model, optimizer, lr_scheduler, epoch, step, best_val_loss, best_val_acc, scaler, cfg, checkpoint_name="checkpoint"):
    logger.info("--> saving model checkpoint...")
    save_dir = os.path.join(cfg.output_dir, checkpoint_name)
    os.makedirs(save_dir, exist_ok=True)
    save_full_path = os.path.join(save_dir, "checkpoint.pt")

    # Unwrap the DDP wrapper so parameter names match when the checkpoint is reloaded
    if cfg.enable_ddp:
        model = model.module

    # Save only the trainable parameters (speech encoder and linear projector)
    trainable_params = OrderedDict()
    for name, param in model.named_parameters():
        if param.requires_grad:
            trainable_params[name] = param.data.cpu()

    checkpoint = {
        'model_state_dict': trainable_params,
        'optimizer_state_dict': optimizer.state_dict(),
        'lr_scheduler_state_dict': lr_scheduler.state_dict() if lr_scheduler else None,
        'epoch': epoch,
        'step': step,
        'best_val_loss': best_val_loss,
        'best_val_acc': best_val_acc,
        'random_state': torch.get_rng_state(),
        'cuda_random_state': torch.cuda.get_rng_state_all() if torch.cuda.is_available() else None,
        'config': cfg.__dict__,
        'scaler': scaler.state_dict() if scaler else None,
    }

    torch.save(checkpoint, save_full_path)
    logger.info(f"Checkpoint saved at {save_full_path}")

Can someone help me with this? Are all of these details necessary to save, and how should the saved lr_scheduler state be used in the train function?
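For the loading side, here is a minimal sketch, assuming a checkpoint produced by the save function above, of how the full state could be restored before the training loop resumes. The function name load_model_checkpoint_peft and its exact signature are illustrative choices, not something from the original code:

def load_model_checkpoint_peft(model, optimizer, lr_scheduler, scaler, cfg, checkpoint_path):
    # Illustrative loader mirroring save_model_checkpoint_peft above.
    checkpoint = torch.load(checkpoint_path, map_location="cpu")

    # Unwrap DDP so parameter names line up with the saved trainable parameters
    target_model = model.module if cfg.enable_ddp else model
    # strict=False because only the trainable parameters were saved
    target_model.load_state_dict(checkpoint['model_state_dict'], strict=False)

    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    if lr_scheduler is not None and checkpoint['lr_scheduler_state_dict'] is not None:
        lr_scheduler.load_state_dict(checkpoint['lr_scheduler_state_dict'])
    if scaler is not None and checkpoint['scaler'] is not None:
        scaler.load_state_dict(checkpoint['scaler'])

    # Restore RNG state so shuffling/dropout continue from the same point
    torch.set_rng_state(checkpoint['random_state'])
    if torch.cuda.is_available() and checkpoint['cuda_random_state'] is not None:
        torch.cuda.set_rng_state_all(checkpoint['cuda_random_state'])

    start_epoch = checkpoint['epoch'] + 1  # or checkpoint['epoch'] if resuming mid-epoch by step
    start_step = checkpoint['step']
    return start_epoch, start_step, checkpoint['best_val_loss'], checkpoint['best_val_acc']

With the state restored this way, the train function keeps using the scheduler exactly as before: call optimizer.step() and then lr_scheduler.step() each iteration (or per epoch, depending on the scheduler), and it continues from the saved learning-rate position. The returned epoch/step can be used to skip batches that were already trained on.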

It would be a great help for me and others too.

Thank you

Alternatives

No response

Additional context

No response

@ddlBoJack (Collaborator)

Hi, we have not implemented resuming of the hyperparameters; only the model parameters are saved. We plan to implement it if time permits, and contributions are welcome.
