feat: add save_model_dir flag where final checkpoint saved (#291)
Ssukriti merged 12 commits into foundation-model-stack:main
Conversation
Wanted to call out a note from the description: for lm_head removal, removes lm_head from
- small refactor of tests
```python
trainer = sft_trainer.train(MODEL_ARGS, DATA_ARGS, save_model_args, None)
logs_path = os.path.join(
    tempdir, FileLoggingTrackerConfig.training_logs_filename
)
_validate_logfile(logs_path)
# validate that no checkpoints created
assert not any(x.startswith("checkpoint-") for x in os.listdir(tempdir))

sft_trainer.save(tempdir, trainer)
```
This was the best way I could think of to test `save()` without having to mock up the SFTTrainer. I had tried doing something like:

```python
training_args = SFTConfig(**transformer_kwargs)
trainer = SFTTrainer(
    model=MODEL_NAME,
    args=TRAIN_ARGS,
    train_dataset=TWITTER_COMPLAINTS_DATA,
)
with tempfile.TemporaryDirectory() as tempdir:
    sft_trainer.save(tempdir, trainer)
```

but this requires preprocessing the train args so they can be parsed into `SFTConfig`, and requires data preprocessing to run; otherwise it hits `AttributeError: 'str' object has no attribute 'column_names'` when trying to tokenize the dataset.
ashokponkumar left a comment:
Thanks Anh! It looks good. Just a few questions.
README.md (outdated)

> `save_model_dir` can be set to a different directory than `output_dir`. If set to the same directory, the final checkpoint, training logs, and any intermediate checkpoints will all be saved to the same directory, as seen below.
>
> Fine tuning example with `save_strategy="epoch"`, `save_total_limit=2`, and `output_dir == save_model_dir == /tmp/same_dir`. Note the checkpoint directories as well as the `training_logs.jsonl`:
Should we instead show an example of `save_model_dir` being a subfolder of `output_dir`? That way the logs, checkpoints, and final model are easily segregable.

Yes, can do! I agree this is the recommended behavior, but wanted to point out the edge case that may cause the most confusion.
```python
checkpoint_dir = job_config.get("save_model_dir")
if not checkpoint_dir:
    checkpoint_dir = os.path.join(
        output_dir, get_highest_checkpoint(output_dir)
    )
```
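For context, here is a minimal sketch of what a helper like `get_highest_checkpoint` could look like; this is a hypothetical reimplementation (not the PR's code), assuming checkpoints are written as `checkpoint-<step>` subdirectories:

```python
import os
import re

def get_highest_checkpoint_sketch(output_dir: str) -> str:
    """Return the checkpoint-<step> subdirectory name with the largest
    step number, or "" if no checkpoints exist (illustrative only)."""
    best_step, best_name = -1, ""
    for name in os.listdir(output_dir):
        match = re.fullmatch(r"checkpoint-(\d+)", name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            step = int(match.group(1))
            if step > best_step:
                best_step, best_name = step, name
    return best_name
```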
Do we know if removing the lm_head is required while resuming a training? If so, we should think about removing the lm_head from each intermediate checkpoint as it is being written.
Removing lm_head is required for loading the tuned model with vLLM, but I don't think it's required for resuming training. Is there a way I could check resuming training? If you have a command to run with sft_trainer, that would be super helpful!
I think we have to stop a run midway and then run the training again. We have to ensure we use the [`resume_from_checkpoint`](https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.Trainer.train.resume_from_checkpoint) argument. It seems we are not currently setting it; we should possibly set it. This will resume from the last checkpoint of the previously stopped run.
Hi @ashokponkumar, we have not heard of any requirement to remove lm_head to resume training. @Abhishek-TAMU will be working on resuming training from a checkpoint in a follow-up PR, as per issue https://github.ibm.com/ai-foundation/watson-fm-stack-tracker/issues/1007. Can we wait for him to add the change? You can post any tips on the issue, @ashokponkumar.
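As background for the discussion above, `Trainer.train()` accepts `resume_from_checkpoint` as either a bool (`True` means resume from the latest checkpoint in `output_dir`) or an explicit checkpoint path. A small hedged sketch of how a wrapper like sft_trainer might translate a user-supplied value (the helper name is hypothetical; the actual wiring is left to the follow-up PR):

```python
def resolve_resume_arg(resume_from_checkpoint):
    """Hypothetical helper: translate a user-supplied value into what
    trainer.train(resume_from_checkpoint=...) expects."""
    if resume_from_checkpoint is None:
        return None  # fresh training run, no resume
    if isinstance(resume_from_checkpoint, bool):
        # True -> transformers picks the latest checkpoint in output_dir
        return resume_from_checkpoint
    # otherwise treat it as an explicit checkpoint directory path
    return str(resume_from_checkpoint)

# Usage sketch: trainer.train(resume_from_checkpoint=resolve_resume_arg(value))
```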
tuning/config/configs.py (outdated)

```python
save_model_dir: str = field(
    default=None,
    metadata={
        "help": "Directory where final checkpoint will be saved to \
```

Suggested change:

```diff
-        "help": "Directory where final checkpoint will be saved to \
+        "help": "Directory where final tuned model will be saved to \
```
README.md (outdated)

> A useful flag to set to limit the number of checkpoints saved is [`save_total_limit`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.save_total_limit). Older checkpoints are deleted from the `output_dir`. For example, if `save_total_limit=1`, only the last checkpoint will be kept. However, while tuning, two checkpoints will exist in `output_dir` for a short time, as the new checkpoint is created before the older one is deleted.
>
> `save_model_dir` can optionally be set to save the designated checkpoint using `SFTTrainer.save_model()`. This can be used in tandem with `save_strategy="no"` to only save the designated checkpoint and not any intermediate checkpoints, which can help to save space.
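To illustrate the rotation behavior that `save_total_limit` implies, here is a standalone sketch of the idea (not the transformers implementation, which handles this internally; the function name is made up for illustration):

```python
import os
import shutil

def rotate_checkpoints_sketch(output_dir, save_total_limit):
    """Delete the oldest checkpoint-<step> directories so that at most
    save_total_limit remain. Illustrative only: transformers performs
    this rotation itself when save_total_limit is set."""
    checkpoints = sorted(
        (d for d in os.listdir(output_dir) if d.startswith("checkpoint-")),
        key=lambda d: int(d.split("-")[1]),
    )
    stale = checkpoints[:-save_total_limit] if save_total_limit else []
    for name in stale:
        shutil.rmtree(os.path.join(output_dir, name))
```

Note this also shows why two checkpoints can briefly coexist: the new checkpoint is written first, and only then are the oldest ones removed.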
I think there has been some confusion here. If the user sets a validation dataset and `load_best_model_at_end`, then the best checkpoint will be saved; if no additional flags are set, the final checkpoint will be saved. This applies when `save_total_limit=1`, where you can control best or last. `save_model()` is always for saving the tuned model at the end, after training; by then any checkpoint information is lost. Can you confirm?
README.md (outdated)

> ### Tips on Parameters to Set
>
> #### Saving models
This README is confusing and can be simplified. We can have 2 subsections:

- Save checkpoints while training
  - `save_strategy`: set to `epoch`; can also be set to `steps` or `no` (as described)
  - checkpoints are saved to `output_dir`
  - `save_total_limit` paragraph
  - NOTE: the `load_best_model_at_end` paragraph applies when `save_total_limit=1`, as it is a training argument; it does not apply to `trainer.save_model`, as that runs after training. We can skip this paragraph, since users can look up `TrainingArguments` themselves; it is described in the URL you linked: https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.save_total_limit
- Save model after training
  - `save_model_dir` can optionally be set to save the tuned model using `SFTTrainer.save_model()`. This can be used in tandem with `save_strategy="no"` to only save the final tuned model and not any intermediate checkpoints, which can help to save space.
  - ways you can use `save_model_dir` and more tips (collapse this section, with examples on expanding)
README.md (outdated)

> `save_model_dir` can optionally be set to save the designated checkpoint using `SFTTrainer.save_model()`. This can be used in tandem with `save_strategy="no"` to only save the designated checkpoint and not any intermediate checkpoints, which can help to save space.
>
> If the user sets a validation dataset and [`load_best_model_at_end`](https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.TrainingArguments.load_best_model_at_end), then the best checkpoint will be saved. If no additional flags are set, the final checkpoint will be saved.
Better to skip this section, as described above; it does not apply to `save_model()`, only to checkpoint saving.

I ended up keeping it, but moved it into the `save_total_limit` section and out of the `save_model_dir` section.
Yes, this is described in the description of the PR above.
Squashed merge-commit message (commit subjects only; sign-off and co-author trailers omitted):

* Set default value of target_modules to be None in LoraConfig
* Removal of transformers logger and addition of python logger
* fix: remove lm_head for granite with llama arch models (#258)
* Add config_utils tests
* Fix: Addition of env var TRANSFORMERS_VERBOSITY check
* Add test for tokenizer in lora config (should be ignored)
* Adding logging support to accelerate launch
* bug: On save event added to callback (#256)
* feat: All metric handling changes (#263)
* feat: Configuration to set logging level for trigger log (#241)
* limit peft deps until investigate (#274)
* Data custom collator (#260)
* Revert "limit peft deps until investigate (#274)" (#275)
* feat: per process state metric (#239)
* Modify test to pass with target_modules: None
* Logging changes and unit tests added
* feat: Add a dockerfile argument to enable aimstack (#261)
* enabling tests for prompt tuning (#278)
* feat: Support pretokenized (#272)
* Update packaging requirement from <24,>=23.2 to >=23.2,<25 (#212)
* fix: do not add special tokens for custom tokenizer (#279)
* fix: bug where the logger was not being used properly (#286)
* Add functionality to free disk space from Github Actions (#287)
* Add unit test to verify target_modules defaults correctly (#281)
* docs: Add documentation on experiment tracking (#257)
* Ensure additional metadata to trackers don't throw error in happy case (#290)
* fix multiple runid creation bug with accelerate (#268)
* feat: logging control operation (#264)
* fix run evaluation to get base model path (#273)
* feat: Added additional events such as on_step_begin, on_optimizer_step, on_substep_end (#293)
* Always update setuptools to latest (#288)
* Rename all fixtures with correct .jsonl extension (#295)
* feat: add save_model_dir flag where final checkpoint saved (#291)
Description of the change

- Adds a `save_model_dir` flag where the final checkpoint can be saved to using `trainer.save_model()`
- Adds `save()` to sft_trainer
- `output_dir` is reserved for checkpoint saving and training logs. This param is still required to pass in, even if no checkpoints are saved by using `save_strategy="no"`
- Removes lm_head from the final checkpoint in `save_model_dir`, if it exists

Related issue number

#217
How to verify the PR

Tested:

1. `save_strategy="no"` and `save_model_dir` set to a different path than `output_dir` --> verified that the final model is saved and no checkpoints are saved in `output_dir`, only logs
2. `save_total_limit=2` and `output_dir` set (aka `save_model_dir` not set) --> only checkpoints are saved, with logs
3. `save_strategy="no"` and `output_dir == save_model_dir` --> verified that logs and model are saved to the path
4. `save_strategy="epoch"` and `save_total_limit=2` and `output_dir == save_model_dir` --> checkpoint dirs, model, and training logs are all written to the path

With `accelerate_launch.py`:

5. `save_total_limit=3` and `save_model_dir == output_dir` --> same as 4; checkpoints, training logs, and model outputted to the path
6. `save_strategy="no"` and `save_model_dir == output_dir` --> same as 3; only model and logs outputted to the path
7. `save_total_limit=1` and `output_dir` a subdir of `save_model_dir` --> `output_dir` with checkpoints and logs inside of `save_model_dir`
8. `save_total_limit=1` and `save_model_dir` a subdir of `output_dir` --> `output_dir` has checkpoints, logs, and a dir with the model

Finally, I also verified that the lm_head removal continued to work as expected: ran vLLM inference on LoRA and fine-tuned llama-13b-base models that were saved in separate and same dirs. Fine tuning got good "no complaint" inference results; LoRA got poor results after tuning but marginal improvement over the base model.