rebase and fix lora config and file dep bug #28

Merged · 112 commits · Feb 21, 2025

Commits (112)
b286544
[AutoParallel]ckpt support local views keys to global views keys (#9604)
xuxinyi389 Jan 8, 2025
1d74d62
[Model] Add XLMRoBERTaModel in paddlenlp (#9720)
jie-z-0607 Jan 8, 2025
fb60645
[AutoParallel] Fix ernie auto_trainer error (#9753)
blacksheep-Aristotle Jan 8, 2025
55e7e33
fix get_block_shape_and_split_kv_block (#9752)
lizhenyun01 Jan 9, 2025
9af0466
fix speculate_verify_and_update op (#9759)
Wanglongzhi2001 Jan 10, 2025
048387f
[Inference] merge speculate_step into step op (#9674)
Wanglongzhi2001 Jan 10, 2025
331131b
[NPU] Adapt to new flash_attention_npu api (#9762)
will-jl944 Jan 10, 2025
b4325b9
[Trainer] update sequence parallel (#9757)
DesmonDay Jan 10, 2025
cf5e3e7
[tokenizer] Fix AutoTokenizer (#9745)
DrownFish19 Jan 10, 2025
2c556e7
[LLM] Add DeepseekV3 (#9738)
DrownFish19 Jan 10, 2025
027b530
[AutoParallel] open tensor_fusion for benchmark (#9749)
AndSonder Jan 10, 2025
94e798f
fix loraga merge (#9765)
greycooker Jan 14, 2025
595e74f
Fix ernie ci auto trainer error (#9758)
blacksheep-Aristotle Jan 14, 2025
4bdb06c
Update README.md (#9766)
ZHUI Jan 14, 2025
7b53fec
[BugFix] Fix matryoshka norm loss (#9774)
DesmonDay Jan 14, 2025
efd07c0
[Distributed] support fuse optimizer (#9519) (#9777)
SylarTiaNII Jan 14, 2025
8fb109a
Update register_sequence_parallel_allreduce_hooks (#9782)
DesmonDay Jan 15, 2025
9e0b830
Fix ce error (#9783)
blacksheep-Aristotle Jan 15, 2025
1afb9b2
fix (#9779)
DrownFish19 Jan 15, 2025
13053a7
[MoE] fix expert parallel (#9760)
DesmonDay Jan 16, 2025
d039ad2
fix dpo pp criterion (#9786)
wtmlon Jan 16, 2025
fb3e4c0
[Infer] Add pir_model path for server infer. (#9790)
aooxin Jan 17, 2025
730a762
[LLM] support flash device on static model (#9619) (#9787)
SylarTiaNII Jan 21, 2025
7c1c9ba
[LLM Benchmark]update scripts (#9722)
Liujie0926 Jan 21, 2025
ac095f5
mergekit gpu 1226 (#9702)
Mangodadada Jan 21, 2025
bb0c9ad
[LLM] merge code from fastdeploy (#9791)
kevincheng2 Jan 23, 2025
bb103a3
[Inference] Support eagle for llama (#9812)
freeliuzc Jan 23, 2025
30fa8b9
[CI] Fix ci of small models (#9633)
ZHUI Jan 23, 2025
e247c85
[Trainer] Wrap model when lora is ON and only do evaluation. (#9803)
wtmlon Jan 23, 2025
2f85a64
[README] Update README.md for documention (#9785)
ZHUI Jan 23, 2025
3967f76
fix split_param (#9817)
DesmonDay Jan 24, 2025
96856bd
[LLM] Update model convert and fix TP for deepseekv3 (#9797)
DrownFish19 Jan 24, 2025
54b8882
[AutoParallel] add sharding tensor_fusion save load switch (#9810)
AndSonder Feb 5, 2025
fa98268
Fix handling of abnormal exits in multi-node benchmark jobs (#9651)
XieYunshen Feb 6, 2025
bad2240
Fix LLAMA arg parsing bug in pp (#9806)
will-jl944 Feb 6, 2025
a9d8648
[Readme] Update mixtral.md (#9829)
yuanlehome Feb 7, 2025
eab22f2
[XPU] Support empty_cache on XPUs (#9789)
will-jl944 Feb 8, 2025
55db2ff
[Inference] Fix multibatch inference (#9831)
DrownFish19 Feb 10, 2025
86286e0
Fix position_ids for infra (#9841)
DrownFish19 Feb 11, 2025
58fc49f
[LLM] Add pipeline and flashmask for Qwen2Moe and Deepseek (#9827)
DrownFish19 Feb 11, 2025
765ab8d
[Mergekit]update & add LoRA merge (#9811)
lugimzzz Feb 11, 2025
9ded9bf
[Unified Checkpoint] Fix expert parallel (#9821)
DesmonDay Feb 12, 2025
85b77f2
[Inference] Flask server compatible with OpenAI api. (#9828)
ZHUI Feb 12, 2025
763c59a
[LLM] fix checkpoint save for non flash mode (#9830)
SylarTiaNII Feb 12, 2025
d53e39d
[DSK] support deepseek-v3/r1 (mha/fp16/bf16/wint8/wint4) (#9769)
yuanlehome Feb 12, 2025
3900428
Solve the compatibility problem of type annotation Python version (#9…
zty-king Feb 13, 2025
5a1c4ac
save extra special tokens (#9837)
DesmonDay Feb 13, 2025
1ca1d59
[Bugfix] Fix dsk rope diff (#9859)
yuanlehome Feb 14, 2025
ca22425
[Optimization] Support lower memory cards. (#9804)
ZHUI Feb 14, 2025
98d27d6
Support XPU for auto-parallel LLaMa (#9796)
From00 Feb 14, 2025
5ebe42b
[XPU] Add xpu fused op for deepseek (#9854)
QingshuChen Feb 14, 2025
8f4e0f0
[Inference] Update deepseek (#9864)
DrownFish19 Feb 14, 2025
62eac0c
[PreTrain] Support deepseek mfu for pretraining and fix tflops for pr…
ZHUI Feb 15, 2025
235c24e
[Inference]Support mtp with deepseek-v3 (#9856)
freeliuzc Feb 15, 2025
5eeb7aa
[AutoParallel] Support deepseekv2 with DP/MP (#9862)
xuxinyi389 Feb 17, 2025
2e06a18
[LLM] move modeling.py and modeling_nv.py to transformers (#9676)
Li-Z-Q Feb 17, 2025
ac980d3
[Docs] fix docs for inference and servering (#9877)
ZHUI Feb 17, 2025
71bfc33
[Docs] news of DeepSeek (#9834)
DrownFish19 Feb 18, 2025
c42a3d6
[AutoParallel]support_ppo_ckpt (#9823)
xuxinyi389 Feb 18, 2025
8e4ff07
[Test] Support intermediate_api llama test (#9850)
liym27 Feb 18, 2025
775ed2a
fix (#9885)
lugimzzz Feb 19, 2025
60ef0af
[Server] Support multi machine deployment (#9872)
ltd0924 Feb 19, 2025
7e9052a
[SpecInfer] Fix low acceptance rate bug in InferenceWithReference (#9880)
Wanglongzhi2001 Feb 19, 2025
08183d7
[CI] Update the best conf for gpt-13b in dygraph mode (#9891)
liym27 Feb 19, 2025
1168406
[Inference] Fix deepseek_v3 with mtp in multi-gpu mode (#9894)
freeliuzc Feb 19, 2025
777fc27
[TaskFlow] Fix pir for taskflow (#9822)
DrownFish19 Feb 20, 2025
3a5e9b8
[LLM-IE] Add pp-uie to Taskflow (#9845)
Fantasy-02 Feb 20, 2025
2e1d57b
[DOC] Update README for PP-UIE (#9911)
DrownFish19 Feb 20, 2025
b49d9ff
[Benchmark] Align benchmark conf for static baichuan2 gpt3 (#9901)
liym27 Feb 20, 2025
7ea4228
[DOC] PP-UIE (#9913)
DrownFish19 Feb 20, 2025
347d77c
add gpu whl (#9890)
bukejiyu Feb 20, 2025
b8ebe3e
add count trained tokens (#9800)
lugimzzz Feb 20, 2025
14a0fc4
add single_model network and use intermediate api
blacksheep-Aristotle Nov 12, 2024
80a676b
[AutoParallel]: fix llama_model_network run error
blacksheep-Aristotle Nov 15, 2024
576c5f1
New version of auto config
FeixLiu Nov 19, 2024
6dc3345
fix gpt_network to use intermediate_api
blacksheep-Aristotle Dec 3, 2024
caa8a50
fix gpt_network to use intermediate_api
blacksheep-Aristotle Dec 5, 2024
c3b3ee6
update api
FeixLiu Dec 5, 2024
2d38858
update plan
FeixLiu Dec 9, 2024
f869aa0
qwen fit base api
FeixLiu Dec 9, 2024
4c79de3
[AutoParallel]:gpt single network support tp to share_embedding
blacksheep-Aristotle Dec 12, 2024
2b8992b
add intermediate ci
blacksheep-Aristotle Dec 18, 2024
098f454
add single_model network and use intermediate api
blacksheep-Aristotle Nov 12, 2024
3ec33ec
New version of auto config
FeixLiu Nov 19, 2024
c867b31
fix sharding
FeixLiu Nov 27, 2024
e314a49
fix gpt_network to use intermediate_api
blacksheep-Aristotle Dec 3, 2024
4017c25
fix gpt_network to use intermediate_api
blacksheep-Aristotle Dec 5, 2024
63a423d
update gpt run_pretrain_py
blacksheep-Aristotle Dec 19, 2024
882fe73
fix sharding error
blacksheep-Aristotle Dec 19, 2024
255c1f1
fix gpt format error
blacksheep-Aristotle Dec 19, 2024
fcb0593
[AutoParallel]:fix llama vpp ci error
blacksheep-Aristotle Dec 19, 2024
906c892
[AutoParallel]:fix ipp error
blacksheep-Aristotle Dec 24, 2024
763c00f
[AutoParallel]:fix a100 ci error
blacksheep-Aristotle Dec 24, 2024
1e9fc7a
[AutoParallel]:fix a100 ci error
blacksheep-Aristotle Dec 24, 2024
7f5b872
[AutoParallel]:add explanatory note
blacksheep-Aristotle Dec 27, 2024
ba7a3e7
Delete =1.0.0
blacksheep-Aristotle Dec 27, 2024
142d07f
[AutoParallel]:add run_finetune scripts
blacksheep-Aristotle Dec 23, 2024
587a936
[AutoParallel]:auto parallel support lora model
blacksheep-Aristotle Dec 27, 2024
b108cc0
update auto_lora_model
blacksheep-Aristotle Jan 2, 2025
f535993
update auto_lora_model
blacksheep-Aristotle Jan 2, 2025
24ed043
[AutoParallel]:nlp support run lora model with intermediate
blacksheep-Aristotle Jan 2, 2025
2e540c6
[AutoParallel]:update format
blacksheep-Aristotle Jan 6, 2025
981c246
[AutoParallel]:support input attentionmask
blacksheep-Aristotle Jan 7, 2025
bf1c607
[AutoParallel]:shard dataloader support multi inputs
blacksheep-Aristotle Jan 7, 2025
c259ff3
[AutoParallel]:auto_sft rebase develop
blacksheep-Aristotle Jan 8, 2025
8a0823e
[AutoParallel]:auto_sft rebase develop
blacksheep-Aristotle Jan 8, 2025
b8977d4
[AutoParallel]:fix lora parallel mode
blacksheep-Aristotle Jan 9, 2025
1d0585b
[AutoParallel]: fix bug about file dep circular and lora config error
liufengwei0103 Feb 21, 2025
059ba83
[AutoParallel]: recover default base api config in launch script
liufengwei0103 Feb 21, 2025
b3d0fa5
[AutoParallel]: add to do
liufengwei0103 Feb 21, 2025
d917f46
[AutoParallel]: add to do
liufengwei0103 Feb 21, 2025
95f7925
Merge branch 'auto_sft' into auto_sft
liufengwei0103 Feb 21, 2025
109 changes: 60 additions & 49 deletions README.md

Large diffs are not rendered by default.

8 changes: 6 additions & 2 deletions README_en.md
@@ -7,7 +7,7 @@
------------------------------------------------------------------------------------------

<p align="center">
<a href="./LICENSE"><img src="https://img.shields.io/badge/license-Apache%202-dfd.svg"></a>
<a href="https://paddlenlp.readthedocs.io/en/latest/?badge=latest"><img src="https://readthedocs.org/projects/paddlenlp/badge/?version=latest"></a>
<a href="https://github.com/PaddlePaddle/PaddleNLP/releases"><img src="https://img.shields.io/github/v/release/PaddlePaddle/PaddleNLP?color=ffa"></a>
<a href=""><img src="https://img.shields.io/badge/python-3.7+-aff.svg"></a>
<a href=""><img src="https://img.shields.io/badge/os-linux%2C%20win%2C%20mac-pink.svg"></a>
@@ -16,6 +16,7 @@
<a href="https://pypi.org/project/paddlenlp/"><img src="https://img.shields.io/pypi/dm/paddlenlp?color=9cf"></a>
<a href="https://github.com/PaddlePaddle/PaddleNLP/issues"><img src="https://img.shields.io/github/issues/PaddlePaddle/PaddleNLP?color=9cc"></a>
<a href="https://github.com/PaddlePaddle/PaddleNLP/stargazers"><img src="https://img.shields.io/github/stars/PaddlePaddle/PaddleNLP?color=ccf"></a>
<a href="./LICENSE"><img src="https://img.shields.io/badge/license-Apache%202-dfd.svg"></a>
</p>

<h4 align="center">
@@ -52,6 +53,9 @@ The fine-tuning algorithms are deeply integrated with zero-padding data streams

The high-performance inference module of the large model toolkit incorporates dynamic insertion and operator fusion strategies throughout the entire process, greatly accelerating parallel inference speed. The underlying implementation details are encapsulated, enabling out-of-the-box high-performance parallel inference capabilities.

## Documentation
For detailed documentation, visit the [PaddleNLP Documentation](https://paddlenlp.readthedocs.io/).

------------------------------------------------------------------------------------------

## Support Models
@@ -68,7 +72,7 @@ Detailed list 👉 [Supported Model List](https://github.com/PaddlePaddle/Paddle
### Pip Installation

```shell
pip install --upgrade paddlenlp==3.0.0b2
pip install --upgrade paddlenlp==3.0.0b3
```

or you can install the latest develop branch code with the following command:
7 changes: 5 additions & 2 deletions csrc/README.md
@@ -1,6 +1,9 @@
# PaddleNLP Custom OPs
# PaddleNLP High-Performance Custom Inference Operators for Large Models

This document describes how to compile and install the PaddleNLP custom OPs.
This document describes how to compile and install PaddleNLP's high-performance custom inference operators for large models.

Using these high-performance operators can substantially speed up large-model inference.
See [here](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/README.md#6-%E6%8E%A8%E7%90%86) for the large-model inference tutorials.

## Install C++ dependencies

37 changes: 27 additions & 10 deletions csrc/gpu/append_attention.cu
@@ -56,6 +56,7 @@ std::vector<paddle::Tensor> AppendAttentionKernel(
const std::string& cache_quant_type_str,
const bool use_neox_rotary_style,
const int max_input_length,
const float softmax_scale,
const float quant_max_bound,
const float quant_min_bound,
const float out_linear_in_scale,
@@ -97,21 +98,21 @@ std::vector<paddle::Tensor> AppendAttentionKernel(
if (out_linear_in_scale > 0.0) {
if (fabs(quant_max_bound - 127.0f) < 0.000001) {
fmha_out = GetEmptyTensor(
{meta_data.token_nums, meta_data.q_num_heads * meta_data.head_dims},
{meta_data.token_nums, meta_data.q_num_heads * meta_data.head_dims_v},
paddle::DataType::INT8,
qkv.place());
}
else if (fabs(quant_max_bound - 448.0f) < 0.000001) {
fmha_out = GetEmptyTensor(
{meta_data.token_nums, meta_data.q_num_heads * meta_data.head_dims},
{meta_data.token_nums, meta_data.q_num_heads * meta_data.head_dims_v},
paddle::DataType::FLOAT8_E4M3FN,
qkv.place());
}else{
PD_THROW("Only supported attr of quant_max_bound in ['127.0', '448.0'].");
}
} else {
fmha_out = GetEmptyTensor(
{meta_data.token_nums, meta_data.q_num_heads * meta_data.head_dims},
{meta_data.token_nums, meta_data.q_num_heads * meta_data.head_dims_v},
D,
qkv.place());
}
@@ -203,6 +204,7 @@ std::vector<paddle::Tensor> AppendAttentionKernel(
encoder_block_shape_q,
max_input_length,
max_enc_len_this_time_data,
softmax_scale,
quant_max_bound,
quant_min_bound,
out_linear_in_scale,
@@ -240,6 +242,7 @@ std::vector<paddle::Tensor> AppendAttentionKernel(
encoder_block_shape_q,
max_input_length,
max_enc_len_this_time_data,
softmax_scale,
quant_max_bound,
quant_min_bound,
out_linear_in_scale,
@@ -282,6 +285,7 @@ std::vector<paddle::Tensor> AppendAttentionKernel(
encoder_block_shape_q,
max_input_length,
max_enc_len_this_time_data,
softmax_scale,
quant_max_bound,
quant_min_bound,
out_linear_in_scale,
@@ -428,6 +432,7 @@ std::vector<paddle::Tensor> AppendAttentionKernel(
decoder_block_shape_q,
max_input_length,
max_len_kv_data,
softmax_scale,
quant_max_bound,
quant_min_bound,
out_linear_in_scale,
@@ -465,6 +470,7 @@ std::vector<paddle::Tensor> AppendAttentionKernel(
decoder_block_shape_q,
max_input_length,
max_len_kv_data,
softmax_scale,
quant_max_bound,
quant_min_bound,
out_linear_in_scale,
@@ -508,6 +514,7 @@ std::vector<paddle::Tensor> AppendAttentionKernel(
decoder_block_shape_q,
max_input_length,
max_len_kv_data,
softmax_scale,
quant_max_bound,
quant_min_bound,
out_linear_in_scale,
@@ -565,6 +572,7 @@ std::vector<paddle::Tensor> AppendAttention(
const std::string& cache_quant_type_str,
const bool use_neox_rotary_style,
const int max_input_length,
const float softmax_scale,
const float quant_max_bound,
const float quant_min_bound,
const float out_linear_in_scale,
@@ -578,9 +586,10 @@ std::vector<paddle::Tensor> AppendAttention(
meta_data.token_nums = qkv_dims[0];
meta_data.kv_num_heads = key_cache_dims[1];
meta_data.head_dims = key_cache_dims[3];
const int total_num_head =
qkv_dims[qkv_dims.size() - 1] / meta_data.head_dims;
meta_data.q_num_heads = total_num_head - 2 * meta_data.kv_num_heads;
meta_data.head_dims_v = value_cache.dims()[3];
const int q_hidden_size =
qkv_dims[qkv_dims.size() - 1] - meta_data.kv_num_heads * (meta_data.head_dims + meta_data.head_dims_v);
meta_data.q_num_heads = q_hidden_size / meta_data.head_dims;

meta_data.max_blocks_per_seq = block_tables.dims()[1];
meta_data.block_size = key_cache.dims()[2];
@@ -626,6 +635,7 @@ std::vector<paddle::Tensor> AppendAttention(
cache_quant_type_str,
use_neox_rotary_style,
max_input_length,
softmax_scale,
quant_max_bound,
quant_min_bound,
out_linear_in_scale,
@@ -672,6 +682,7 @@ std::vector<paddle::Tensor> AppendAttention(
cache_quant_type_str,
use_neox_rotary_style,
max_input_length,
softmax_scale,
quant_max_bound,
quant_min_bound,
out_linear_in_scale,
@@ -719,6 +730,7 @@ std::vector<paddle::Tensor> AppendAttention(
cache_quant_type_str,
use_neox_rotary_style,
max_input_length,
softmax_scale,
quant_max_bound,
quant_min_bound,
out_linear_in_scale,
@@ -764,6 +776,7 @@ std::vector<paddle::Tensor> AppendAttention(
cache_quant_type_str,
use_neox_rotary_style,
max_input_length,
softmax_scale,
quant_max_bound,
quant_min_bound,
out_linear_in_scale,
@@ -821,10 +834,12 @@ std::vector<std::vector<int64_t>> AppendAttentionInferShape(
const paddle::optional<std::vector<int64_t>>& out_linear_smooths_shape) {
const int token_num = qkv_shape[0];
const int kv_num_heads = key_cache_shape[1];
const int head_dim = key_cache_shape[3];
const int total_num_head = qkv_shape[qkv_shape.size() - 1] / head_dim;
const int num_heads = total_num_head - 2 * kv_num_heads;
return {{token_num, num_heads * head_dim}, qkv_shape};
const int head_dim_qk = key_cache_shape[3];
const int head_dim_v = value_cache_shape[3];
const int q_hidden_size =
qkv_shape[qkv_shape.size() - 1] - kv_num_heads * (head_dim_qk + head_dim_v);
const int num_heads = q_hidden_size / head_dim_qk;
return {{token_num, num_heads * head_dim_v}, qkv_shape};
}

std::vector<paddle::DataType> AppendAttentionInferDtype(
@@ -865,6 +880,7 @@ std::vector<paddle::DataType> AppendAttentionInferDtype(
const std::string& cache_quant_type_str,
const bool use_neox_rotary_style,
const int max_input_length,
const float softmax_scale,
const float quant_max_bound,
const float quant_min_bound,
const float out_linear_in_scale,
@@ -941,6 +957,7 @@ PD_BUILD_OP(append_attention)
"cache_quant_type: std::string",
"use_neox_rotary_style: bool",
"max_input_length: int",
"softmax_scale: float",
"quant_max_bound: float",
"quant_min_bound: float",
"out_linear_in_scale: float",
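Note on the `append_attention.cu` changes above: the recurring shape edits follow from letting K and V carry different per-head dims (`meta_data.head_dims` for QK vs the new `meta_data.head_dims_v` for V), as DeepSeek-style MLA attention requires, so `q_num_heads` is now recovered from the packed QKV width instead of assuming one shared head dim. The new `softmax_scale` attribute threads an explicit attention scale through every dispatch path (presumably so callers can override the usual 1/sqrt(head_dim) factor when the QK head dim includes extra RoPE dimensions). Below is a minimal standalone sketch of the head-count arithmetic; the concrete sizes (192 for QK, 128 for V, 16 heads) are hypothetical DeepSeek-V3-like numbers chosen for illustration, not values taken from this PR.

```cpp
// Hypothetical sketch of the packed-QKV head-count arithmetic used by
// AppendAttention / AppendAttentionInferShape after this PR.
// The example sizes below are assumptions for illustration only.
#include <cassert>
#include <cstdio>

int main() {
  const int head_dims    = 192;  // per-head dim of Q and K
  const int head_dims_v  = 128;  // per-head dim of V (differs from QK here)
  const int kv_num_heads = 16;
  const int q_num_heads  = 16;

  // Width of the packed QKV tensor's last axis: Q heads use head_dims,
  // and each KV head contributes head_dims (K) plus head_dims_v (V).
  const int qkv_last_dim =
      q_num_heads * head_dims + kv_num_heads * (head_dims + head_dims_v);

  // Recover q_num_heads the way the updated code does, instead of the old
  // total_num_head = qkv_last_dim / head_dims assumption.
  const int q_hidden_size =
      qkv_last_dim - kv_num_heads * (head_dims + head_dims_v);
  assert(q_hidden_size / head_dims == q_num_heads);

  // The attention output width follows the V head dim, hence the
  // {token_nums, q_num_heads * head_dims_v} allocations in the diff.
  std::printf("q_num_heads=%d, fmha_out hidden=%d\n",
              q_hidden_size / head_dims, q_num_heads * head_dims_v);
  return 0;
}
```

For the quantized-output branches, the `quant_max_bound` checks against 127.0 and 448.0 select the output dtype by its maximum representable magnitude: 127 for INT8 and 448 for FLOAT8_E4M3FN.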