
support quant ckpt limit strategy #9494

Merged
5 commits merged into PaddlePaddle:develop on Nov 29, 2024

Conversation

@wtmlon wtmlon (Collaborator) commented Nov 26, 2024

PR types

PR changes

Description

Support an upper limit on how many times a compressed (quantized) checkpoint is produced across training resumes.
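For context, a minimal sketch of the idea, not the PR's actual code: count how many times training has resumed from a quantized checkpoint and, once the count exceeds MAX_QUANTIZATION_TIMES, record that fact so later checkpoint saves fall back to the uncompressed "O0" stage. MAX_QUANTIZATION_TIMES, ckpt_quant_stage, quant_ckpt_resume_times, infohub and the "quant_reach_limit" key appear in the PR; the helper name, index file name and index fields below are hypothetical.

# Illustrative sketch only; everything except MAX_QUANTIZATION_TIMES,
# ckpt_quant_stage, quant_ckpt_resume_times, infohub and "quant_reach_limit"
# is a hypothetical name.
import json
import os

MAX_QUANTIZATION_TIMES = 5  # assumed value for illustration

def resolve_quant_stage_on_resume(checkpoint_dir, requested_stage, infohub):
    """Count resumes from a quantized checkpoint and disable quantization once the limit is hit."""
    index_path = os.path.join(checkpoint_dir, "optimizer_index.json")  # hypothetical file name
    quant_ckpt_resume_times = 0
    if os.path.exists(index_path):
        with open(index_path, "r", encoding="utf-8") as f:
            index = json.load(f)
        # How many times this checkpoint lineage has already been resumed while quantized.
        quant_ckpt_resume_times = index.get("quant_ckpt_resume_times", 0) + 1

    # Quantization times exceed the limit. Turn off the quantization strategy.
    if quant_ckpt_resume_times > MAX_QUANTIZATION_TIMES:
        infohub["quant_reach_limit"] = True  # the save path checks for this key
        return "O0"
    return requested_stage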


paddle-bot bot commented Nov 26, 2024

Thanks for your contribution!


codecov bot commented Nov 26, 2024

Codecov Report

Attention: Patch coverage is 11.42857% with 31 lines in your changes missing coverage. Please review.

Project coverage is 53.10%. Comparing base (8fd33a9) to head (12d84c6).
Report is 17 commits behind head on develop.

Files with missing lines                                 Patch %   Lines
...p/trainer/unified_checkpoint/unified_checkpoint.py      5.00%   19 Missing ⚠️
...lp/quantization/unified_checkpoint_quantization.py     15.38%   11 Missing ⚠️
paddlenlp/transformers/model_utils.py                      0.00%    1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9494      +/-   ##
===========================================
+ Coverage    52.91%   53.10%   +0.19%     
===========================================
  Files          688      694       +6     
  Lines       109331   110989    +1658     
===========================================
+ Hits         57848    58940    +1092     
- Misses       51483    52049     +566     



# Quantization times exceed the limit. Turn off the quantization strategy.
if quant_ckpt_resume_times > MAX_QUANTIZATION_TIMES:
    ckpt_quant_stage = "O0"
@DesmonDay DesmonDay (Contributor) commented Nov 26, 2024

Writing the change to this switch here doesn't feel quite right. MAX_QUANTIZATION_TIMES is mainly meant to limit the number of times you save a compressed checkpoint, so the change to ckpt_quant_stage should be synced into the save logic; changing it here in the loading path has no effect, does it?

Contributor

One remaining question here: this is the optimizer loading logic, so ckpt_quant_stage = "O0" should not be changed from the outside; it should be read directly from the index saved with the checkpoint.
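For context, a minimal sketch of the load-side behaviour being asked for here, assuming the save path stores ckpt_quant_stage in the sharded optimizer index as shown below; the index file name and helper are hypothetical, not PaddleNLP API.

import json
import os

def read_ckpt_quant_stage(resume_from_checkpoint):
    """Read the quantization stage recorded in the checkpoint's own optimizer index,
    instead of relying on an externally mutated ckpt_quant_stage flag."""
    # Hypothetical index file name; the unified checkpoint code has its own naming.
    index_path = os.path.join(resume_from_checkpoint, "optimizer_index.json")
    if not os.path.exists(index_path):
        return "O0"  # no index means the optimizer state was saved unquantized
    with open(index_path, "r", encoding="utf-8") as f:
        sharded_optim_index = json.load(f)
    # The save path only records ckpt_quant_stage when quantization is on,
    # so a missing key is also treated as "O0".
    return sharded_optim_index.get("ckpt_quant_stage", "O0")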

        # save opt index json if checkpoint quantization is on.
-       if self.args.ckpt_quant_stage != "O0":
+       if self.args.ckpt_quant_stage != "O0" and "quant_reach_limit" not in infohub:
            sharded_optim_index = {"ckpt_quant_stage": self.args.ckpt_quant_stage}

This comment was marked as resolved.

@@ -257,7 +267,7 @@ def save_non_merge_optimizer(self, model, optim_state_dict, master_weights, outp
            signal_path=signal_dir,
            is_sync=is_sync_save,
            state_dict_type="optimizer_weight",
-           ckpt_quant_stage=self.args.ckpt_quant_stage,
+           ckpt_quant_stage=self.args.ckpt_quant_stage if "quant_reach_limit" not in infohub else "O0",

This comment was marked as resolved.

@@ -379,7 +389,7 @@ def save_unified_optimizer(self, model, optimizer, output_dir, signal_dir):
            signal_path=signal_dir,
            is_sync=is_sync_save,
            state_dict_type="optimizer_weight",
-           ckpt_quant_stage=self.args.ckpt_quant_stage,
+           ckpt_quant_stage=self.args.ckpt_quant_stage if "quant_reach_limit" not in infohub else "O0",

This comment was marked as resolved.
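Both hunks above apply the same pattern: once "quant_reach_limit" has been recorded in infohub, the optimizer state is written with stage "O0" so no further compressed checkpoints are produced. A small helper could express that decision once instead of repeating the conditional expression; this is a refactoring sketch, not code from the PR.

def effective_ckpt_quant_stage(args, infohub):
    """Quantization stage to use when saving optimizer state: fall back to "O0"
    (no compression) once the resume-count limit has been recorded in infohub."""
    if "quant_reach_limit" in infohub:
        return "O0"
    return args.ckpt_quant_stage

The two call sites above would then pass ckpt_quant_stage=effective_ckpt_quant_stage(self.args, infohub).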

@DesmonDay DesmonDay (Contributor) left a comment

LGTM

@wawltor wawltor merged commit 2985f90 into PaddlePaddle:develop Nov 29, 2024
9 of 12 checks passed
wtmlon added a commit to wtmlon/PaddleNLP that referenced this pull request Nov 29, 2024
* support quant ckpt limit strategy

* bug fix

* bug fix

* fix bug

* add log, fix bug
Conflicts:
	paddlenlp/utils/env.py