Resume training from checkpoint with same hyperparameter #146

Open
Amg9794 opened this issue Oct 11, 2024 · 1 comment

Comments


Amg9794 commented Oct 11, 2024

🚀 The feature, motivation and pitch

Hi,

I trained both the speech encoder (Whisper large-v3) and the linear projector with a frozen Llama 3.2-1B model for an ASR task. All training steps completed, but the eval loss had not yet saturated and there was still room for improvement in the model.

When I resumed training from the last saved checkpoint (which, using the original code's method, only saves the trainable parameters), the results degraded, which was unexpected.

Is there any way to resume training from the same state, with the last saved optimizer/scheduler hyperparameters?

I wrote this function to save a checkpoint with all of these details:

import os
from collections import OrderedDict

import torch

# `logger` is assumed to be the module-level logger already used elsewhere in the training code.

def save_model_checkpoint_peft(model, optimizer, lr_scheduler, epoch, step, best_val_loss, best_val_acc, scaler, cfg, checkpoint_name="checkpoint"):
    logger.info("--> saving model checkpoint...")
    save_dir = os.path.join(cfg.output_dir, checkpoint_name)
    os.makedirs(save_dir, exist_ok=True)
    save_full_path = os.path.join(save_dir, "checkpoint.pt")

    # Unwrap the DDP wrapper so parameter names match when the checkpoint is reloaded
    if cfg.enable_ddp:
        model = model.module

    # Save only the trainable parameters (speech encoder and linear projector)
    trainable_params = OrderedDict()
    for name, param in model.named_parameters():
        if param.requires_grad:
            trainable_params[name] = param.data.cpu()

    checkpoint = {
        'model_state_dict': trainable_params,
        'optimizer_state_dict': optimizer.state_dict(),
        'lr_scheduler_state_dict': lr_scheduler.state_dict() if lr_scheduler else None,
        'epoch': epoch,
        'step': step,
        'best_val_loss': best_val_loss,
        'best_val_acc': best_val_acc,
        'random_state': torch.get_rng_state(),
        'cuda_random_state': torch.cuda.get_rng_state_all() if torch.cuda.is_available() else None,
        'config': cfg.__dict__,
        'scaler': scaler.state_dict() if scaler else None,
    }

    torch.save(checkpoint, save_full_path)
    logger.info(f"Checkpoint saved at {save_full_path}")

Can someone help me with this? Are all of these details necessary to save, and how should the saved lr_scheduler state be used in the train function?
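For the loading side, here is a minimal sketch, assuming a checkpoint produced by the save function above, of how the full state could be restored before the training loop resumes. The function name load_model_checkpoint_peft and its exact signature are illustrative choices, not something from the original code:

def load_model_checkpoint_peft(model, optimizer, lr_scheduler, scaler, cfg, checkpoint_path):
    # Illustrative loader mirroring save_model_checkpoint_peft above.
    checkpoint = torch.load(checkpoint_path, map_location="cpu")

    # Unwrap DDP so parameter names line up with the saved trainable parameters
    target_model = model.module if cfg.enable_ddp else model
    # strict=False because only the trainable parameters were saved
    target_model.load_state_dict(checkpoint['model_state_dict'], strict=False)

    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    if lr_scheduler is not None and checkpoint['lr_scheduler_state_dict'] is not None:
        lr_scheduler.load_state_dict(checkpoint['lr_scheduler_state_dict'])
    if scaler is not None and checkpoint['scaler'] is not None:
        scaler.load_state_dict(checkpoint['scaler'])

    # Restore RNG state so shuffling/dropout continue from the same point
    torch.set_rng_state(checkpoint['random_state'])
    if torch.cuda.is_available() and checkpoint['cuda_random_state'] is not None:
        torch.cuda.set_rng_state_all(checkpoint['cuda_random_state'])

    start_epoch = checkpoint['epoch'] + 1  # or checkpoint['epoch'] if resuming mid-epoch by step
    start_step = checkpoint['step']
    return start_epoch, start_step, checkpoint['best_val_loss'], checkpoint['best_val_acc']

With the state restored this way, the train function keeps using the scheduler exactly as before: call optimizer.step() and then lr_scheduler.step() each iteration (or per epoch, depending on the scheduler), and it continues from the saved learning-rate position. The returned epoch/step can be used to skip batches that were already trained on.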

It would be a great help for me and others too.

Thank you

Alternatives

No response

Additional context

No response

@ddlBoJack (Collaborator)

Hi, we have not implemented resuming of the hyperparameters; only the model parameters are saved. We plan to implement it if time permits, and contributions are welcome.
