
feat: add save_model_dir flag where final checkpoint saved #291

Merged
Ssukriti merged 12 commits into foundation-model-stack:main from anhuong:save_model_dir
Aug 14, 2024

Conversation

@anhuong
Collaborator

@anhuong anhuong commented Aug 8, 2024

Description of the change

  • Add optional save_model_dir flag where final checkpoint can be saved to using trainer.save_model()
    • Note that this only saves the model, not the optimizer states
    • Adds minimal save() to sft_trainer
  • output_dir is reserved for checkpoint saving and training logs. This param is still required even when no checkpoints are saved (save_strategy="no")
    • Note that with this update, the training logs will be streamed to output_dir
  • Update accelerate_launch.py:
    • Remove tempdir which caused issues with ephemeral storage
    • Remove copy final checkpoint since this is now moved into sft_trainer
    • For lm_head removal, removes lm_head from save_model_dir if it exists, otherwise removes it from the final checkpoint
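Taken together, the flags above imply a simple mapping from configuration to where artifacts land. A minimal pure-Python sketch (the helper name and file names here are illustrative, not code from this PR):

```python
import os
from typing import Optional


def resolve_output_paths(output_dir: str, save_model_dir: Optional[str]) -> dict:
    """Sketch of where artifacts land after this change (illustrative only).

    - training logs stream to output_dir
    - intermediate checkpoints (if save_strategy != "no") go to output_dir
    - the final model is written only when save_model_dir is set
    """
    return {
        "training_logs": os.path.join(output_dir, "training_logs.jsonl"),
        "checkpoints": output_dir,  # checkpoint-<step> subdirectories
        "final_model": save_model_dir,  # None => trainer.save_model() not called
    }
```

For example, with save_strategy="no" and a save_model_dir different from output_dir, only logs end up in output_dir while the final model lands in save_model_dir.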

Related issue number

#217

How to verify the PR

Tested:

  1. save_strategy="no" and save_model_dir set to a different path than output_dir --> verified the final model is saved and no checkpoints appear in output_dir, only logs

  2. save_total_limit=2 and output_dir set (aka save_model_dir not set) --> only checkpoints are saved with logs

  3. save_strategy="no" and output_dir==save_model_dir --> verified that logs and model saved to path

  4. save_strategy="epoch" and save_total_limit=2 and output_dir==save_model_dir --> checkpoint dirs, model, and training logs are all written to path

  5. accelerate_launch.py

    • save_total_limit=3 and save_model_dir==output_dir --> same as 4, checkpoints, training logs, and model outputted to path
    • save_strategy="no" and save_model_dir==output_dir --> same as 3, only model and logs outputted to path
    • save_total_limit=1 and output_dir subdir of save_model_dir --> output_dir with checkpoints and logs inside of save_model_dir
    • save_total_limit=1 and save_model_dir subdir of output_dir --> output_dir has checkpoints, logs, and dir with model
  6. accelerate_launch: Finally I also verified that the lm_head removal continued to work as expected:

  • save_total_limit=1, save_model_dir==output_dir, granite-3b-code-base --> verified that with lora and ft the model was saved to given path with lm_head removed but the checkpoint didn't have lm_head removed
  • save_strategy="no", save_model_dir==output_dir, granite-3b-code-base --> verified lm_head removed and no additional checkpoints saved
  • save_total_limit=1, output_dir, granite-3b-code-base (aka no save_model_dir given) --> verified lm_head removed from final checkpoint
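As a concrete illustration, scenario 1 above could be expressed as a job config along these lines (the paths and the model/data fields are placeholders; only save_strategy, output_dir, and save_model_dir are the flags under test):

```json
{
  "model_name_or_path": "path/to/base-model",
  "training_data_path": "path/to/train.jsonl",
  "output_dir": "/tmp/logs-only",
  "save_strategy": "no",
  "save_model_dir": "/tmp/final-model"
}
```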

Ran vLLM inference on LoRA and fine tuned llama-13b-base model that was saved in separate and same dir. Fine tuning got good "no complaint" inference result, LoRA got poor results after tuning but marginal improvement from base model.

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

anhuong added 3 commits August 6, 2024 17:24
Signed-off-by: Anh-Uong <anh.uong@ibm.com>
Signed-off-by: Anh-Uong <anh.uong@ibm.com>
Signed-off-by: Anh-Uong <anh.uong@ibm.com>
@anhuong
Collaborator Author

anhuong commented Aug 8, 2024

Wanted to call out a note from the description: for lm_head removal, removes lm_head from save_model_dir if it exists, otherwise removes it from the final checkpoint. Is this the behavior we want, or do we only want to remove lm_head if save_model_dir is passed?

anhuong added 4 commits August 8, 2024 10:33
Signed-off-by: Anh-Uong <anh.uong@ibm.com>
Signed-off-by: Anh-Uong <anh.uong@ibm.com>
Signed-off-by: Anh-Uong <anh.uong@ibm.com>
- small refactor of tests

Signed-off-by: Anh-Uong <anh.uong@ibm.com>
Comment on lines +463 to +471
    trainer = sft_trainer.train(MODEL_ARGS, DATA_ARGS, save_model_args, None)
    logs_path = os.path.join(
        tempdir, FileLoggingTrackerConfig.training_logs_filename
    )
    _validate_logfile(logs_path)
    # validate that no checkpoints created
    assert not any(x.startswith("checkpoint-") for x in os.listdir(tempdir))

    sft_trainer.save(tempdir, trainer)
Collaborator Author

This was the best way I could think of to test the save() without having to try to mock up the SFTTrainer. I had tried doing something like...

    training_args = SFTConfig(**transformer_kwargs)
    trainer = SFTTrainer(
        model=MODEL_NAME,
        args=TRAIN_ARGS,
        train_dataset=TWITTER_COMPLAINTS_DATA,
    )
    with tempfile.TemporaryDirectory() as tempdir:
        sft_trainer.save(tempdir, trainer)

but this requires preprocessing the train args so they can be parsed into SFTConfig, and requires the data preprocessing to run; otherwise it hits AttributeError: 'str' object has no attribute 'column_names' when trying to tokenize the dataset

Signed-off-by: Anh-Uong <anh.uong@ibm.com>
@Ssukriti Ssukriti requested review from ashokponkumar and removed request for alex-jw-brooks August 8, 2024 20:57
Collaborator

@ashokponkumar ashokponkumar left a comment

Thanks Anh! It looks good. Just a few questions.

README.md Outdated

`save_model_dir` can be set to a different directory than `output_dir`. If set to the same directory, the final checkpoint, training logs, and any intermediate checkpoints will all be saved to the same directory as seen below.

Fine tuning example with `save_strategy="epoch"`, `save_total_limit=2`, and `output_dir==save_model_dir==/tmp/same_dir`. Note the checkpoint directories as well as the `training_logs.jsonl`:
Collaborator

Should we instead show an example of save_model_dir being a subfolder of output_dir? That way the logs, checkpoints, and final model are easily separable.

Collaborator Author

Yes can do! I agree this is recommended behavior, but wanted to point out the edge case that may cause the most confusion.

Collaborator

sure.

    checkpoint_dir = job_config.get("save_model_dir")
    if not checkpoint_dir:
        checkpoint_dir = os.path.join(
            output_dir, get_highest_checkpoint(output_dir)
        )
Collaborator

Do we know if removing the lm_head is required while resuming a training? If so, we should think about removing the lm_head from each intermediate checkpoint as it is being written.

Collaborator Author

Removing lm_head is required for loading the tuned model with vLLM, but I don't think it's required for resuming training. Is there a way I could check resuming training? If you have a command to run with sft_trainer, that would be super helpful!

Collaborator

I think we have to stop a run midway and then run the training again.

We have to ensure we use https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.Trainer.train.resume_from_checkpoint argument. Seems like we are not currently setting it. We should possibly set it. This will resume from the previously stopped last checkpoint.
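To make the resume idea concrete: one way to use resume_from_checkpoint is to locate the newest checkpoint-<step> directory under output_dir and pass it to trainer.train(). A hypothetical sketch (latest_checkpoint is not a function in this repo):

```python
import os
import re


def latest_checkpoint(output_dir: str):
    """Return the highest-numbered checkpoint-<step> directory, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


# The result could then feed Trainer.train, e.g.:
#   trainer.train(resume_from_checkpoint=latest_checkpoint(training_args.output_dir))
```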

Collaborator

hi @ashokponkumar, we have not heard of any requirement to remove lm_head to resume training. @Abhishek-TAMU will be working on resume training from checkpoint in a follow-up PR as per issue https://github.ibm.com/ai-foundation/watson-fm-stack-tracker/issues/1007. Can we wait for him to add the change?

You can post any tips on the issue @ashokponkumar

Collaborator

Sure Sukriti.

Signed-off-by: Anh-Uong <anh.uong@ibm.com>
    save_model_dir: str = field(
        default=None,
        metadata={
            "help": "Directory where final checkpoint will be saved to \
Collaborator

Suggested change
-            "help": "Directory where final checkpoint will be saved to \
+            "help": "Directory where final tuned model will be saved to \

README.md Outdated

A useful flag to set to limit the number of checkpoints saved is [`save_total_limit`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.save_total_limit). Older checkpoints are deleted from the `output_dir`. For example if `save_total_limit=1`, this will only save the last checkpoint. However, while tuning, two checkpoints will exist in `output_dir` for a short time as the new checkpoint is created and then the older one will be deleted.

`save_model_dir` can optionally be set to save the designated checkpoint using `SFTTrainer.save_model()`. This can be used in tandem with `save_strategy="no"` to only save the designated checkpoint and not any intermediate checkpoints, which can help to save space.
Collaborator

I think there has been some confusion here.

If the user sets a validation dataset and load_best_model_at_end, then the best checkpoint will be saved. If no additional flags are set, the final checkpoint will be saved.

This applies when save_total_limit=1; then you can control whether the best or the last checkpoint is kept.

save_model() is always for saving the tuned model at the end, after training. By then any checkpoint information is lost. Can you confirm?

ashokponkumar
ashokponkumar previously approved these changes Aug 10, 2024
README.md Outdated

### Tips on Parameters to Set

#### Saving models
Collaborator

@Ssukriti Ssukriti Aug 12, 2024

This README is confusing and can be simplified.

We can have 2 subsections:

  1. Save checkpoints while training
  • save_strategy = set to epoch, can also be set to steps or no (as described)
  • checkpoints are saved to output_dir
  • save_total_limit para

NOTE: the load_best_model para applies when save_total_limit=1, as it is a training argument. It does not apply to trainer.save_model as that runs after training. We can skip this para on load_best_model as users can look up TrainingArguments themselves. It is always described in the URL you linked https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.save_total_limit

  2. Save model after training
  • save_model_dir can optionally be set to save the tuned model using SFTTrainer.save_model(). This can be used in tandem with save_strategy="no" to only save the final tuned model and not any intermediate checkpoints, which can help to save space.

ways you can use save_model_dir and more tips (collapse this section)

examples on expanding:

README.md Outdated

`save_model_dir` can optionally be set to save the designated checkpoint using `SFTTrainer.save_model()`. This can be used in tandem with `save_strategy="no"` to only save the designated checkpoint and not any intermediate checkpoints, which can help to save space.

If the user sets a validation dataset and [`load_best_model_at_end`](https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.TrainingArguments.load_best_model_at_end), then the best checkpoint will be saved. If no additional flags are set, the final checkpoint will be saved.
Collaborator

Better to skip this section as described above; it does not apply to save_model. It applies to checkpoint saving only.

Collaborator Author

I ended up keeping it but moving it into the save_total_limit section instead of the save_model_dir section.

Collaborator

@Ssukriti Ssukriti left a comment

  1. Have you verified saved model, using save() still infers on vLLM?

  2. Comments on re-organizing README

  3. merge conflicts to be addressed

Signed-off-by: Anh-Uong <anh.uong@ibm.com>
Signed-off-by: Anh-Uong <anh.uong@ibm.com>
@anhuong
Collaborator Author

anhuong commented Aug 13, 2024

Have you verified saved model, using save() still infers on vLLM?

Yes this is described in the description of the PR above

Ran vLLM inference on LoRA and fine tuned llama-13b-base model that was saved in separate and same dir. Fine tuning got good "no complaint" inference result, LoRA got poor results after tuning but marginal improvement from base model.

Signed-off-by: Anh-Uong <anh.uong@ibm.com>
Collaborator

@Ssukriti Ssukriti left a comment

Nice work @anhuong !

@Ssukriti Ssukriti merged commit 78909af into foundation-model-stack:main Aug 14, 2024
anhuong added a commit that referenced this pull request Aug 14, 2024
* Set default value of target_modules to be None in LoraConfig

Signed-off-by: Will Johnson <mwjohnson728@gmail.com>

* Removal of transformers logger and addition of python logger

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* FMT and lint check: Removal of transformers logger and addition of python logger

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* fix: remove lm_head for granite with llama arch models (#258)

* initial code for deleting lm_head

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

* fix logic for copying checkpoint

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

* fix check that embed_tokens and lm_head weights are the same

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

* fix warning assertion

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

* fix lm_head check, remove test

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

* small fixes from code review

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

* fmt

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

---------

Signed-off-by: Anh-Uong <anh.uong@ibm.com>
Co-authored-by: Anh-Uong <anh.uong@ibm.com>
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Add config_utils tests

Signed-off-by: Angel Luu <angel.luu@us.ibm.com>

* Fix fmt

Signed-off-by: Angel Luu <angel.luu@us.ibm.com>

* Separate tests out and use docstrings

Signed-off-by: Angel Luu <angel.luu@us.ibm.com>

* Update more field/value checks from HF defaults

Signed-off-by: Angel Luu <angel.luu@us.ibm.com>

* Fix: Addition of env var TRANSFORMERS_VERBOSITY check

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* FMT Fix: Addition of env var TRANSFORMERS_VERBOSITY check

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Add test for tokenizer in lora config (should be ignored)

Signed-off-by: Angel Luu <angel.luu@us.ibm.com>

* Adding logging support to accelerate launch

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* FMT_FIX: Adding logging support to accelerate launch

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* bug: On save event added to callback (#256)

* feat: On save event added to callback

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* fix: Removed additional bracket

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* fix: Removed additional bracket

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* fix: Format issues resolved

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* fix: rebase with upstream and add new line

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

---------

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>
Co-authored-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

* feat: All metric handling changes (#263)

* feat: All metric handling changes

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* fix: Format issues

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

---------

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* feat: Configuration to set logging level for trigger log (#241)

* feat: Added the triggered login in the operation

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* fix: Formatting issues

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* fix: Added default config

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* fix: Moved the variable to right scope

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* fix: Checked added to validate config log level

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* fix: Removed some unwanted log file

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

---------

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* limit peft deps until investigate (#274)

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

* Data custom collator (#260)

* refactor code to preprocess datasets

Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

* fix formatting

Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

* allow input/output in validate args

Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

* format input/output JSON and mask

Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

* function to return suitable collator

Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

* add tests for SFT Trainer input/output format

Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

* remove unused functions

Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

* add eos token to input/output format

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

* fix tests

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

* improve docstrings

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

* keeping JSON keys constant

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

* support for input/output format

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

* formatting fixes

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

* update rEADME formats

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

* formatting README

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

---------

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com>

* Revert "limit peft deps until investigate (#274)" (#275)

This reverts commit f57ff63.

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

* feat: per process state metric (#239)

Signed-off-by: Harikrishnan Balagopal <harikrishmenon@gmail.com>

* Modify test to pass with target_modules: None

Signed-off-by: Will Johnson <mwjohnson728@gmail.com>

* Logging changes and unit tests added

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* feat: Add a dockerfile argument to enable aimstack (#261)

* Add a dockerfile argument at the end of final layer to enable aimstack.
Currently guarded by a dockerfile argument.

Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>

* Set the default value of ENABLE_AIM to false

Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>

---------

Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>

* Solved conflict with main

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* FMT:Fix Solved conflict with main

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* enabling tests for prompt tuning

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* feat: Support pretokenized (#272)

* feat: support pretokenized datasets

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

* fix: rebase with upstream and review commits

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

* fix: rebase with upstream and review commits

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

* fix: rebase with upstream and review commits

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

* consolidate collator code

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

* add valuerrors for incorrect args

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

* feat: add unit tests for validate_data_args and format_dataset

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

* feat: add unit tests for validate_data_args and format_dataset

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

* feat: add unit tests for validate_data_args and format_dataset

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

* feat: add unit tests for validate_data_args and format_dataset

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

---------

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Co-authored-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Co-authored-by: Alex Brooks <alex.brooks@ibm.com>

* Update packaging requirement from <24,>=23.2 to >=23.2,<25 (#212)

Updates the requirements on [packaging](https://github.com/pypa/packaging) to permit the latest version.
- [Release notes](https://github.com/pypa/packaging/releases)
- [Changelog](https://github.com/pypa/packaging/blob/main/CHANGELOG.rst)
- [Commits](pypa/packaging@23.2...24.1)

---
updated-dependencies:
- dependency-name: packaging
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Anh Uong <anh.uong@ibm.com>

* enabling tests for prompt tuning (#278)

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
Co-authored-by: Anh Uong <anh.uong@ibm.com>

* fix: do not add special tokens for custom tokenizer (#279)

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

* PR changes for changing logger

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* fix: bug where the logger was not being used properly (#286)

Signed-off-by: Hari <harikrishmenon@gmail.com>

* Unit Tests changes

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Add functionality to free disk space from Github Actions (#287)

* Add functionality to free disk space from Github Actions

Signed-off-by: Will Johnson <mwjohnson728@gmail.com>

* Add functionality to free disk space from Github Actions, relocate from build-and-publish.yaml to image.yaml

Signed-off-by: Will Johnson <mwjohnson728@gmail.com>

* Move freeing space step to before building image

Signed-off-by: Will Johnson <mwjohnson728@gmail.com>

---------

Signed-off-by: Will Johnson <mwjohnson728@gmail.com>

* commented os.environ[LOG_LEVEL] in accelerate.py for testing

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* PR changes

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* FIX:FMT

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* PR Changes

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* PR Changes

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Add unit test to verify target_modules defaults correctly (#281)

* Add unit test to verify target_modules defaults correctly

Signed-off-by: Will Johnson <mwjohnson728@gmail.com>

* Add sft_trainer.main test to ensure target modules properly default for LoRA when set to None from CLI

Signed-off-by: Will Johnson <mwjohnson728@gmail.com>

* fmt

Signed-off-by: Will Johnson <mwjohnson728@gmail.com>

* Use model_args instead of importing, fix nits

Signed-off-by: Will Johnson <mwjohnson728@gmail.com>

* Add test to ensure target_modules defaults to None in job config

Signed-off-by: Will Johnson <mwjohnson728@gmail.com>

* Add additional check, fix nits

Signed-off-by: Will Johnson <mwjohnson728@gmail.com>

---------

Signed-off-by: Will Johnson <mwjohnson728@gmail.com>

* docs: Add documentation on experiment tracking. (#257)

Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>

* Ensure additional metadata to trackers don't throw error in happy case. (#290)

Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>

* PR Changes

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* fix multiple runid creation bug with accelerate. (#268)

Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>

* feat: logging control operation (#264)

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* Metrics file epoch indexing from 0

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Revert last commit

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* fix run evaluation to get base model path (#273)

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

* PR Changes

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* PR Changes

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* feat: Added additional events such as on_step_begin, on_optimizer_step, on_substep_end (#293)

Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>

* Always update setuptools to latest (#288)

Signed-off-by: James Busche <jbusche@us.ibm.com>
Co-authored-by: Anh Uong <anh.uong@ibm.com>

* Rename all fixtures with correct .jsonl extension (#295)

Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
Co-authored-by: Anh Uong <anh.uong@ibm.com>

* feat: add save_model_dir flag where final checkpoint saved (#291)

* add save_model_dir flag for final checkpoint

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

* remove output_dir logic, add save method

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

* update accelerate_launch, remove save tokenizer

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

* fix: put back creation of .complete file

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

* fix failing tests and add new ones

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

* tests: add sft_trainer test to train and save

- small refactor of tests

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

* add docs on saving checkpoints and fix help msg

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

* update example and note best checkpoint

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

* changes based on PR review

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

* add logging to save, fix error out properly

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

---------

Signed-off-by: Anh-Uong <anh.uong@ibm.com>

---------

Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
Signed-off-by: Anh-Uong <anh.uong@ibm.com>
Signed-off-by: Angel Luu <angel.luu@us.ibm.com>
Signed-off-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Signed-off-by: Harikrishnan Balagopal <harikrishmenon@gmail.com>
Signed-off-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Hari <harikrishmenon@gmail.com>
Signed-off-by: James Busche <jbusche@us.ibm.com>
Co-authored-by: Abhishek <maurya.abhishek@ibm.com>
Co-authored-by: Sukriti Sharma <Ssukriti@users.noreply.github.com>
Co-authored-by: Anh-Uong <anh.uong@ibm.com>
Co-authored-by: Abhishek Maurya <124327945+Abhishek-TAMU@users.noreply.github.com>
Co-authored-by: Angel Luu <angel.luu@us.ibm.com>
Co-authored-by: Angel Luu <an317gel@gmail.com>
Co-authored-by: Padmanabha V Seshadri <seshapad@in.ibm.com>
Co-authored-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>
Co-authored-by: Alex-Brooks <Alex.Brooks@ibm.com>
Co-authored-by: Hari <harikrishmenon@gmail.com>
Co-authored-by: Dushyant Behl <dushyantbehl@users.noreply.github.com>
Co-authored-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: James Busche <jbusche@us.ibm.com>