
Auto sft #9728

Open · wants to merge 88 commits into base: develop

Conversation

blacksheep-Aristotle
Contributor

PR types

New features

PR changes

Others

Description

Adds SFT training for NLP models on top of the intermediate-level API / base API.
Supports full-parameter SFT fine-tuning.
Supports LoRA fine-tuning (a minimal usage sketch is shown below).
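
For context, here is a minimal sketch of what LoRA fine-tuning looks like with PaddleNLP's existing `paddlenlp.peft` API; the auto-parallel variant added in this PR (e.g. `LoRAAutoConfig`) is assumed to follow the same pattern. The model id and hyperparameters are placeholders, not values taken from this PR.

```python
# Sketch only: LoRA fine-tuning setup with PaddleNLP's peft API.
# The auto-parallel path added in this PR (LoRAAutoConfig, etc.) is assumed
# to mirror this flow; the model id and hyperparameters are placeholders.
from paddlenlp.peft import LoRAConfig, LoRAModel
from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("facebook/llama-7b")
tokenizer = AutoTokenizer.from_pretrained("facebook/llama-7b")

lora_config = LoRAConfig(
    target_modules=[".*q_proj.*", ".*v_proj.*"],  # regexes selecting layers to adapt
    r=8,
    lora_alpha=16,
)
model = LoRAModel(model, lora_config)
model.mark_only_lora_as_trainable()  # freeze base weights; only LoRA adapters train
model.print_trainable_parameters()
```

Full-parameter SFT would skip the LoRA wrapping and train the base model directly through the Trainer.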


paddle-bot bot commented Jan 6, 2025

Thanks for your contribution!

blacksheep-Aristotle and others added 12 commits February 20, 2025 20:34
#### Before submitting



```shell
# Install and register `pre-commit` in the project folder
pip install pre-commit && pre-commit install

# Process previous code files separately
pre-commit run --file XXXX.py
```




### Description

1. rebase
2. fix SFT file dependency errors and a LoRAAutoConfig error
@CLAassistant

CLAassistant commented Feb 21, 2025

CLA assistant check
All committers have signed the CLA.

liufengwei0103 and others added 12 commits February 22, 2025 14:30
#### Before submitting

- [ ] Lint code. If there are lint issues, please format the code first.

```shell
# Install and register `pre-commit` in the project folder
pip install pre-commit && pre-commit install

# Process previous code files separately
pre-commit run --file XXXX.py
```

- [ ] Add test cases into `tests` folder. If there are codecov issues, please add test cases first.


### Description
Add CI guard.
--auto_parallel_resume_form_hybrid_parallel true \
--sharding_parallel_config "enable_stage1_tensor_fusion" \
--tensor_parallel_config "enable_mp_async_allreduce" \
--pipeline_parallel_config "enable_sharding_comm_overlap enable_dp_comm_overlap enable_overlap_p2p_comm disable_p2p_cache_shape" \
Contributor

Do the files llm/run_sft_hand.sh and llm/run_lora_hand.sh need to be added?

Comment on lines +417 to +419
if "labels" in batch.keys():
value = batch.pop("labels")
batch["labels"] = value
Contributor

Why do we need to add this code? It seems that this code does not change batch?
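
One possible (assumed) reading is that the pop-and-reinsert is meant to move the "labels" key to the end of the batch dict, since Python dicts preserve insertion order; a minimal illustration:

```python
# Illustration only: pop + reassign moves a key to the end of a dict
# because Python dicts preserve insertion order. Whether downstream code
# in this PR actually relies on "labels" being last is an assumption.
batch = {"labels": [1, 2], "input_ids": [3, 4], "attention_mask": [5, 6]}
value = batch.pop("labels")
batch["labels"] = value
print(list(batch.keys()))  # ['input_ids', 'attention_mask', 'labels']
```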

Comment on lines +476 to +489
# #NOTE(zhangwl):this may move to wrap
# trainable_parameters = [p for p in model.parameters() if not p.stop_gradient]
# trainer.set_optimizer_grouped_parameters(trainable_parameters)

# def forward_pre_hook(layer, input):
# print(f"{layer} forward start")

# def forward_post_hook(layer, input, output):
# print(f"{layer} forward done")

# for layer in trainer.model.sublayers():
# layer.register_forward_pre_hook(forward_pre_hook)
# layer.register_forward_post_hook(forward_post_hook)
# Train
Contributor

Should it be deleted?

@@ -2769,6 +2769,7 @@ def merge_auto_dist_configs(self, configs):
},....
]
"""
# import pdb;pdb.set_trace()
Contributor

Should it be deleted?

@jeff41404
Contributor

The title and description are too simple. The title should state that this PR uses auto parallel to validate SFT/LoRA for the llama model, and the description should include a comparison of SFT/LoRA loss and performance between manual parallel and auto parallel.
