feat: add save_model_dir flag where final checkpoint saved by anhuong · Pull Request #291 · foundation-model-stack/fms-hf-tuning

anhuong · 2024-08-08T05:44:37Z

Description of the change

Add optional save_model_dir flag where final checkpoint can be saved to using trainer.save_model()
- Note that this only saves the model, not the optimizer states
- Adds minimal save() to sft_trainer
output_dir is reserved for checkpoint saving and training logs. This param is still required to pass in even if no checkpoints are saved using save_strategy="no"
- Note that with this update, the training logs will be streamed to output_dir
Update accelerate_launch.py:
- Remove tempdir which caused issues with ephemeral storage
- Remove copy final checkpoint since this is now moved into sft_trainer
- For lm_head removal, removes lm_head from save_model_dirif exists otherwise removes from final checkpoint

Related issue number

#217

How to verify the PR

Tested:

save_strategy=“no” and save_model_dir to diff path than output_dir --> verified saves final model and does not save any checkpoints in ´output_dir` only logs
save_total_limit=2 and output_dir set (akasave_model_dir not set) --> only checkpoints are saved with logs
save_strategy=“no” and output_dir==save_model_dir --> verified that logs and model saved to path
save_strategy="epoch” and save_total_limit=2 andoutput_dir==save_model_dir` --> checkpoint dirs, model, and training logs are all written to path
accelerate_launch.py
- save_total_limit=3 and save_model_dir==output_dir --> same as 4, checkpoints, training logs, and model outputted to path
- save_strategy="no" and save_model_dir==output_dir --> same as 3, only model and logs outputted to path
- save_total_limit=1 and output_dir subdir of save_model_dir --> output_dir with checkpoints and logs inside of save_model_dir
- save_total_limit=1 and save_model_dir subdir of output_dir --> output_dir has checkpoints, logs, and dir with model
accelerate_launch: Finally I also verified that the lm_head removal continued to work as expected:

save_total_limit=1, save_model_dir==output_dir, granite-3b-code-base --> verified that with lora and ft the model was saved to given path with lm_head removed but the checkpoint didn't have lm_head removed
save_strategy="no", save_model_dir==output_dir, granite-3b-code-base --> verified lm_head removed and no addt checkpoints saved
save_total_limit=1, output_dir, granite-3b-code-base (aka no save_model_dir given) --> verified lm_head removed from final checkpoint

Ran vLLM inference on LoRA and fine tuned llama-13b-base model that was saved in separate and same dir. Fine tuning got good "no complaint" inference result, LoRA got poor results after tuning but marginal improvement from base model.

Was the PR tested

I have added >=1 unit test(s) for every new method I have added.
I have ensured all unit tests pass

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

anhuong · 2024-08-08T05:56:22Z

Wanted to call out a note from the description: For lm_head removal, removes lm_head from save_model_dir if exists, otherwise removes from final checkpoint. Is this the behavior we want or do we only want to remove lm_head if save_model_dir passed?

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

- small refactor of tests Signed-off-by: Anh-Uong <anh.uong@ibm.com>

anhuong · 2024-08-08T19:05:54Z

tests/test_sft_trainer.py

+        trainer = sft_trainer.train(MODEL_ARGS, DATA_ARGS, save_model_args, None)
+        logs_path = os.path.join(
+            tempdir, FileLoggingTrackerConfig.training_logs_filename
+        )
+        _validate_logfile(logs_path)
+        # validate that no checkpoints created
+        assert not any(x.startswith("checkpoint-") for x in os.listdir(tempdir))
+
+        sft_trainer.save(tempdir, trainer)


This was the best way I could think of to test the save() without having to try to mock up the SFTTrainer. I had tried doing something like...

training_args = SFTConfig(**transformer_kwargs) trainer = SFTTrainer( model=MODEL_NAME, args=TRAIN_ARGS, train_dataset=TWITTER_COMPLAINTS_DATA, ) with tempfile.TemporaryDirectory() as tempdir: sft_trainer.save(tempdir, trainer)

but this requires needing to preprocess the train args so it can be parsed into SFTConfig and requires data preprocessing to run otherwise hits error AttributeError: 'str' object has no attribute 'column_names' when trying to tokenize the dataset

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

ashokponkumar

Thanks Anh! It looks good. Just a few questions.

README.md

ashokponkumar · 2024-08-09T06:59:55Z

README.md

+
+`save_model_dir` can be set to a different directory than `output_dir`. If set to the same directory, the final checkpoint, training logs, and any intermediate checkpoints will all be saved to the same directory as seen below.
+
+Fine tuning example with `save_strategy="epoch”`, `save_total_limit=2`, and `output_dir==save_model_dir==/tmp/same_dir`. Note the checkpoint directories as well as the `training_logs.jsonl`:


Should we instead show an example of save_model_dir being a subfolder of output_dir? That way the the logs, checkpoints and final model area easily segregable.

Yes can do! I agree this is recommended behavior, but wanted to point out the edge case that may cause the most confusion.

README.md

ashokponkumar · 2024-08-09T07:49:49Z

build/accelerate_launch.py

+        checkpoint_dir = job_config.get("save_model_dir")
+        if not checkpoint_dir:
+            checkpoint_dir = os.path.join(
+                output_dir, get_highest_checkpoint(output_dir)


Do we know if removing the lm_head is required while resuming a training, if so we should think about removing the lm_head from each intermediate checkpoint as it is being written.

Removing lm_head is required for loading the tuned model with vLLM, but I don't think it's required for resuming training. Is there a way I could check resuming training? If you have a command to run with sft_trainer, that would be super helpful!

I think we have to stop a run midway and then run the training again.

We have to ensure we use https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.Trainer.train.resume_from_checkpoint argument. Seems like we are not currently setting it. We should possibly set it. This will resume from the previously stopped last checkpoint.

hi @ashokponkumar we have not heard of any requirement to remove lm_head to resume training. @Abhishek-TAMU will be working on resume training from checkpoint in follow up PR as per issue https://github.ibm.com/ai-foundation/watson-fm-stack-tracker/issues/1007 . Can we wait for him to add the change?

ypu can post any tips on the issue @ashokponkumar

Sure Sukriti.

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

Ssukriti · 2024-08-10T00:47:02Z

tuning/config/configs.py

+    save_model_dir: str = field(
+        default=None,
+        metadata={
+            "help": "Directory where final checkpoint will be saved to \


Suggested change

"help": "Directory where final checkpoint will be saved to \

"help": "Directory where final tuned model will be saved to \

Ssukriti · 2024-08-10T00:58:09Z

README.md

+
+A useful flag to set to limit the number of checkpoints saved is [`save_total_limit`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.save_total_limit). Older checkpoints are deleted from the `output_dir`. For example if `save_total_limit=1`, this will only save the last checkpoint. However, while tuning, two checkpoints will exist in `output_dir` for a short time as the new checkpoint is created and then the older one will be deleted.
+
+`save_model_dir` can optionally be set to save the designated checkpoint using `SFTTrainer.save_model()`. This can be used in tandem with `save_strategy="no"` to only save the designated checkpoint and not any intermediate checkpoints, which can help to save space.


I think there has been some confusion here.

If the user sets a validation dataset and load_best_model_at_end, then the best checkpoint will be saved. If no additional flags are set, the final checkpoint will be saved.

This is for when save_total_limit = 1 , then you can. control best or last

save_model() is always for saving tuned model at end after training. By then any checkpoint information is lost. Can you confirm

Ssukriti · 2024-08-12T19:56:58Z

README.md


+### Tips on Parameters to Set
+
+#### Saving models


This README is confusing and can be simplified.

We can have. 2 subsections

Save checkpoints while training

save_strategy = set to epoch , can also be set to steps or no (as described)

checkpoints are saved to output_dir

save_total_limit para

NOTE: load_best_model para applies when save_total_limit=1 , as it is a training argument . it does not apply to trainer.save_model as that is after saving. we can skip this para on load_best_model as users can look up TrainingArguments themselves. It si always described in URL you linked https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.save_total_limit

Save model after training -
save_model_dircan optionally be set to save the tuned model usingSFTTrainer.save_model(). This can be used in tandem with save_strategy="no"` to only save the final tuned and not any intermediate checkpoints, which can help to save space.

ways you can use save_model_dir and more tips (collapse this section)

examples on expanding:

Ssukriti · 2024-08-12T19:57:59Z

README.md

+
+`save_model_dir` can optionally be set to save the designated checkpoint using `SFTTrainer.save_model()`. This can be used in tandem with `save_strategy="no"` to only save the designated checkpoint and not any intermediate checkpoints, which can help to save space.
+
+If the user sets a validation dataset and [`load_best_model_at_end`](https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.TrainingArguments.load_best_model_at_end), then the best checkpoint will be saved. If no additional flags are set, the final checkpoint will be saved.


better to skip this section as described above, does not apply to save_model. It applies to save checkpoint only

i ended up keeping it but moving it within the save_total_limits section and not in the save_model_dir section

Ssukriti

Have you verified saved model, using save() still infers on vLLM?
Comments on re-organizing README
merge conflicts to be addressed

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

anhuong · 2024-08-13T01:32:24Z

Have you verified saved model, using save() still infers on vLLM?

Yes this is described in the description of the PR above

Ran vLLM inference on LoRA and fine tuned llama-13b-base model that was saved in separate and same dir. Fine tuning got good "no complaint" inference result, LoRA got poor results after tuning but marginal improvement from base model.

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

Ssukriti

Nice work @anhuong !

* Set default value of target_modules to be None in LoraConfig Signed-off-by: Will Johnson <mwjohnson728@gmail.com> * Removal of transformers logger and addition of python logger Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * FMT and lint check: Removal of transformers logger and addition of python logger Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * fix: remove lm_head for granite with llama arch models (#258) * initial code for deleting lm_head Signed-off-by: Anh-Uong <anh.uong@ibm.com> * fix logic for copying checkpoint Signed-off-by: Anh-Uong <anh.uong@ibm.com> * fix check that embed_tokens and lm_head weights are the same Signed-off-by: Anh-Uong <anh.uong@ibm.com> * fix warning assertion Signed-off-by: Anh-Uong <anh.uong@ibm.com> * fix lm_head check, remove test Signed-off-by: Anh-Uong <anh.uong@ibm.com> * small fixes from code review Signed-off-by: Anh-Uong <anh.uong@ibm.com> * fmt Signed-off-by: Anh-Uong <anh.uong@ibm.com> --------- Signed-off-by: Anh-Uong <anh.uong@ibm.com> Co-authored-by: Anh-Uong <anh.uong@ibm.com> Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Add config_utils tests Signed-off-by: Angel Luu <angel.luu@us.ibm.com> * Fix fmt Signed-off-by: Angel Luu <angel.luu@us.ibm.com> * Separate tests out and use docstrings Signed-off-by: Angel Luu <angel.luu@us.ibm.com> * Update more field/value checks from HF defaults Signed-off-by: Angel Luu <angel.luu@us.ibm.com> * Fix: Addition of env var TRANSFORMERS_VERBOSITY check Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * FMT Fix: Addition of env var TRANSFORMERS_VERBOSITY check Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Add test for tokenizer in lora config (should be ignored) Signed-off-by: Angel Luu <angel.luu@us.ibm.com> * Adding logging support to accelerate launch Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * FMT_FIX: Adding logging support to accelerate launch Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * bug: On save event added to callback (#256) * feat: On save event added to callback Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Removed additional bracket Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Removed additional bracket Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Format issues resolved Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: rebase with upstream and add new line Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> --------- Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> Co-authored-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> * feat: All metric handling changes (#263) * feat: All metric handling changes Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Format issues Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> --------- Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * feat: Configuration to set logging level for trigger log (#241) * feat: Added the triggered login in the operation Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Formatting issues Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Added default config Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Moved the variable to right scope Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Checked added to validate config log level Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * fix: Removed some unwanted log file Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> --------- Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * limit peft deps until investigate (#274) Signed-off-by: Anh-Uong <anh.uong@ibm.com> * Data custom collator (#260) * refactor code to preprocess datasets Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * fix formatting Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * allow input/output in validate args Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * format input/output JSON and mask Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * function to return suitable collator Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * add tests for SFT Trainer input/output format Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * remove unused functions Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * add eos token to input/output format Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * fix tests Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * improve docstrings Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * keeping JSON keys constant Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * support for input/output format Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * formatting fixes Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * update rEADME formats Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * formatting README Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> --------- Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> * Revert "limit peft deps until investigate (#274)" (#275) This reverts commit f57ff63. Signed-off-by: Anh-Uong <anh.uong@ibm.com> * feat: per process state metric (#239) Signed-off-by: Harikrishnan Balagopal <harikrishmenon@gmail.com> * Modify test to pass with target_modules: None Signed-off-by: Will Johnson <mwjohnson728@gmail.com> * Logging changes and unit tests added Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * feat: Add a dockerfile argument to enable aimstack (#261) * Add a dockerfile argument at the end of final layer to enable aimstack. Currenlty guarded by a dockerfile argument. Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com> * Set the default value of ENABLE_AIM to false Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com> --------- Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com> * Solved conflict with main Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * FMT:Fix Solved conflict with main Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * enabling tests for prompt tuning Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * feat: Support pretokenized (#272) * feat: support pretokenized datasets Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> * fix: rebase with upstream and review commits Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> * fix: rebase with upstream and review commits Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> * fix: rebase with upstream and review commits Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> * consolidate collator code Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * add valuerrors for incorrect args Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> * feat: add unit tests for validate_data_args and format_dataset Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> * feat: add unit tests for validate_data_args and format_dataset Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> * feat: add unit tests for validate_data_args and format_dataset Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> * feat: add unit tests for validate_data_args and format_dataset Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> --------- Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> Co-authored-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> Co-authored-by: Alex Brooks <alex.brooks@ibm.com> * Update packaging requirement from <24,>=23.2 to >=23.2,<25 (#212) Updates the requirements on [packaging](https://github.com/pypa/packaging) to permit the latest version. - [Release notes](https://github.com/pypa/packaging/releases) - [Changelog](https://github.com/pypa/packaging/blob/main/CHANGELOG.rst) - [Commits](pypa/packaging@23.2...24.1) --- updated-dependencies: - dependency-name: packaging dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Anh Uong <anh.uong@ibm.com> * enabling tests for prompt tuning (#278) Signed-off-by: Abhishek <maurya.abhishek@ibm.com> Co-authored-by: Anh Uong <anh.uong@ibm.com> * fix: do not add special tokens for custom tokenizer (#279) Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> * PR changes for changing logger Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * fix: bug where the logger was not being used properly (#286) Signed-off-by: Hari <harikrishmenon@gmail.com> * Unit Tests changes Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Add functionality to free disk space from Github Actions (#287) * Add functionality to free disk space from Github Actions Signed-off-by: Will Johnson <mwjohnson728@gmail.com> * Add functionality to free disk space from Github Actions, relocate from build-and-publish.yaml to image.yaml Signed-off-by: Will Johnson <mwjohnson728@gmail.com> * Move freeing space step to before building image Signed-off-by: Will Johnson <mwjohnson728@gmail.com> --------- Signed-off-by: Will Johnson <mwjohnson728@gmail.com> * commented os.environ[LOG_LEVEL] in accelerate.py for testing Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * PR changes Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * FIX:FMT Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * PR Changes Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * PR Changes Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Add unit test to verify target_modules defaults correctly (#281) * Add unit test to verify target_modules defaults correctly Signed-off-by: Will Johnson <mwjohnson728@gmail.com> * Add sft_trainer.main test to ensure target modules properly default for LoRA when set to None from CLI Signed-off-by: Will Johnson <mwjohnson728@gmail.com> * fmt Signed-off-by: Will Johnson <mwjohnson728@gmail.com> * Use model_args instead of importing, fix nits Signed-off-by: Will Johnson <mwjohnson728@gmail.com> * Add test to ensure target_modules defaults to None in job config Signed-off-by: Will Johnson <mwjohnson728@gmail.com> * Add additional check, fix nits Signed-off-by: Will Johnson <mwjohnson728@gmail.com> --------- Signed-off-by: Will Johnson <mwjohnson728@gmail.com> * docs: Add documentation on experiment tracking. (#257) Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com> * Ensure additional metadata to trackers don't throw error in happy case. (#290) Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com> * PR Changes Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * fix multiple runid creation bug with accelerate. (#268) Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com> * feat: logging control operation (#264) Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * Metrics file epoch indexing from 0 Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Revert last commit Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * fix run evaluation to get base model path (#273) Signed-off-by: Anh-Uong <anh.uong@ibm.com> * PR Changes Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * PR Changes Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * feat: Added additional events such as on_step_begin, on_optimizer_step, on_substep_end (#293) Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> * Always update setuptools to latest (#288) Signed-off-by: James Busche <jbusche@us.ibm.com> Co-authored-by: Anh Uong <anh.uong@ibm.com> * Rename all fixtures with correct .jsonl extension (#295) Signed-off-by: Will Johnson <mwjohnson728@gmail.com> Co-authored-by: Anh Uong <anh.uong@ibm.com> * feat: add save_model_dir flag where final checkpoint saved (#291) * add save_model_dir flag for final checkpoint Signed-off-by: Anh-Uong <anh.uong@ibm.com> * remove output_dir logic, add save method Signed-off-by: Anh-Uong <anh.uong@ibm.com> * update accelerate_launch, remove save tokenizer Signed-off-by: Anh-Uong <anh.uong@ibm.com> * fix: put back creation of .complete file Signed-off-by: Anh-Uong <anh.uong@ibm.com> * fix failing tests and add new ones Signed-off-by: Anh-Uong <anh.uong@ibm.com> * tests: add sft_trainer test to train and save - small refactor of tests Signed-off-by: Anh-Uong <anh.uong@ibm.com> * add docs on saving checkpoints and fix help msg Signed-off-by: Anh-Uong <anh.uong@ibm.com> * update example and note best checkpoint Signed-off-by: Anh-Uong <anh.uong@ibm.com> * changes based on PR review Signed-off-by: Anh-Uong <anh.uong@ibm.com> * add logging to save, fix error out properly Signed-off-by: Anh-Uong <anh.uong@ibm.com> --------- Signed-off-by: Anh-Uong <anh.uong@ibm.com> --------- Signed-off-by: Will Johnson <mwjohnson728@gmail.com> Signed-off-by: Abhishek <maurya.abhishek@ibm.com> Signed-off-by: Anh-Uong <anh.uong@ibm.com> Signed-off-by: Angel Luu <angel.luu@us.ibm.com> Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com> Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> Signed-off-by: Harikrishnan Balagopal <harikrishmenon@gmail.com> Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com> Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: Hari <harikrishmenon@gmail.com> Signed-off-by: James Busche <jbusche@us.ibm.com> Co-authored-by: Abhishek <maurya.abhishek@ibm.com> Co-authored-by: Sukriti Sharma <Ssukriti@users.noreply.github.com> Co-authored-by: Anh-Uong <anh.uong@ibm.com> Co-authored-by: Abhishek Maurya <124327945+Abhishek-TAMU@users.noreply.github.com> Co-authored-by: Angel Luu <angel.luu@us.ibm.com> Co-authored-by: Angel Luu <an317gel@gmail.com> Co-authored-by: Padmanabha V Seshadri <seshapad@in.ibm.com> Co-authored-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com> Co-authored-by: Hari <harikrishmenon@gmail.com> Co-authored-by: Dushyant Behl <dushyantbehl@users.noreply.github.com> Co-authored-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: James Busche <jbusche@us.ibm.com>

anhuong added 3 commits August 6, 2024 17:24

add save_model_dir flag for final checkpoint

b300216

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

remove output_dir logic, add save method

cee43b9

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

update accelerate_launch, remove save tokenizer

78f25aa

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

anhuong requested review from Ssukriti and alex-jw-brooks as code owners August 8, 2024 05:44

anhuong added 4 commits August 8, 2024 10:33

fix: put back creation of .complete file

b13333f

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

fix failing tests and add new ones

0c8b6e5

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

Merge branch 'main' into save_model_dir

b8fc646

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

tests: add sft_trainer test to train and save

2a9197c

- small refactor of tests Signed-off-by: Anh-Uong <anh.uong@ibm.com>

anhuong commented Aug 8, 2024

View reviewed changes

add docs on saving checkpoints and fix help msg

d28ec4d

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

Ssukriti requested review from ashokponkumar and removed request for alex-jw-brooks August 8, 2024 20:57

ashokponkumar requested changes Aug 9, 2024

View reviewed changes

update example and note best checkpoint

536442c

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

Ssukriti reviewed Aug 10, 2024

View reviewed changes

ashokponkumar previously approved these changes Aug 10, 2024

View reviewed changes

Ssukriti reviewed Aug 12, 2024

View reviewed changes

Ssukriti requested changes Aug 12, 2024

View reviewed changes

merge changes from main branch

9efc549

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

anhuong dismissed ashokponkumar’s stale review via 18bf1c9 August 13, 2024 01:30

changes based on PR review

59f840f

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

anhuong force-pushed the save_model_dir branch from 18bf1c9 to 6d29c6b Compare August 13, 2024 01:31

anhuong force-pushed the save_model_dir branch from 6d29c6b to a93fb72 Compare August 13, 2024 01:35

add logging to save, fix error out properly

9954678

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

anhuong force-pushed the save_model_dir branch from a93fb72 to 9954678 Compare August 13, 2024 01:47

Ssukriti approved these changes Aug 14, 2024

View reviewed changes

Ssukriti merged commit 78909af into foundation-model-stack:main Aug 14, 2024


		`save_model_dir` can be set to a different directory than `output_dir`. If set to the same directory, the final checkpoint, training logs, and any intermediate checkpoints will all be saved to the same directory as seen below.

		Fine tuning example with `save_strategy="epoch”`, `save_total_limit=2`, and `output_dir==save_model_dir==/tmp/same_dir`. Note the checkpoint directories as well as the `training_logs.jsonl`:

	"help": "Directory where final checkpoint will be saved to \
	"help": "Directory where final tuned model will be saved to \


		A useful flag to set to limit the number of checkpoints saved is [`save_total_limit`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.save_total_limit). Older checkpoints are deleted from the `output_dir`. For example if `save_total_limit=1`, this will only save the last checkpoint. However, while tuning, two checkpoints will exist in `output_dir` for a short time as the new checkpoint is created and then the older one will be deleted.

		`save_model_dir` can optionally be set to save the designated checkpoint using `SFTTrainer.save_model()`. This can be used in tandem with `save_strategy="no"` to only save the designated checkpoint and not any intermediate checkpoints, which can help to save space.


		`save_model_dir` can optionally be set to save the designated checkpoint using `SFTTrainer.save_model()`. This can be used in tandem with `save_strategy="no"` to only save the designated checkpoint and not any intermediate checkpoints, which can help to save space.

		If the user sets a validation dataset and [`load_best_model_at_end`](https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.TrainingArguments.load_best_model_at_end), then the best checkpoint will be saved. If no additional flags are set, the final checkpoint will be saved.

Conversation

anhuong commented Aug 8, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description of the change

Related issue number

How to verify the PR

Was the PR tested

Uh oh!

anhuong commented Aug 8, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ashokponkumar left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Ssukriti Aug 12, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Ssukriti left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

anhuong commented Aug 13, 2024

Uh oh!

Ssukriti left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

anhuong commented Aug 8, 2024 •

edited

Loading

anhuong commented Aug 8, 2024 •

edited

Loading

Ssukriti Aug 12, 2024 •

edited

Loading

Ssukriti left a comment •

edited

Loading