Add RingFlashAttention for context parallel #8383
Conversation
Thanks for your contribution!
Force-pushed from 09fd62e to fbd16a1
Codecov Report
Attention: Patch coverage is

Additional details and impacted files
@@            Coverage Diff            @@
##           develop    #8383      +/-   ##
===========================================
- Coverage    54.22%   53.87%   -0.35%
===========================================
  Files          617      620       +3
  Lines        96203    97068     +865
===========================================
+ Hits         52164    52295     +131
- Misses       44039    44773     +734

View full report in Codecov by Sentry.
# if step != cp_size - 1:
#     comm_buffer.wait()
paddle.device.synchronize()
TODO: under the async stream used by batch_isend_irecv, wait() cannot be called; this needs to be fixed and currently hurts performance.
done~
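For context, here is a minimal sketch of what one send/receive step of the ring looks like under the constraint discussed above: the tasks returned by batch_isend_irecv are not waited on (waiting does not work on the async communication stream yet), so a full device synchronize is used as a conservative fallback. The function and variable names are assumptions for illustration, not the PR's exact code.

```python
import paddle
import paddle.distributed as dist


def ring_send_recv(tensor, group):
    """Send `tensor` to the next rank in the ring group and receive from the previous one."""
    group_rank = dist.get_rank(group)
    group_size = dist.get_world_size(group)
    send_dst = group.ranks[(group_rank + 1) % group_size]
    recv_src = group.ranks[(group_rank - 1) % group_size]

    recv_buffer = paddle.empty_like(tensor)
    ops = [
        dist.P2POp(dist.isend, tensor, send_dst, group),
        dist.P2POp(dist.irecv, recv_buffer, recv_src, group),
    ]
    tasks = dist.batch_isend_irecv(ops)

    # TODO: replace with per-task wait() once waiting works under the async comm stream;
    # synchronizing the whole device here costs performance.
    paddle.device.synchronize()
    return recv_buffer
```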
Force-pushed from f94a915 to 4e88520
Force-pushed from cf7d334 to 88bc460
block_out, _, block_lse, _ = _C_ops.flash_attn(
    local_query,
    block_k[:, : local_q_seq_len // 2, :, :],
    block_v[:, : local_q_seq_len // 2, :, :],
This approach may be slow; see whether the op can be called directly instead.
done~
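As background on why the raw flash attention op is used here: ring flash attention needs the per-block softmax log-sum-exp (block_lse) in addition to each block's output, so partial results from successive K/V blocks can be merged in a numerically stable way. A minimal sketch of that merge step follows; the names are illustrative and lse is assumed already reshaped to broadcast against the output, so the PR's actual implementation may differ.

```python
import paddle


def merge_block_attention(out, lse, block_out, block_lse):
    """Fold one K/V block's partial attention (block_out, block_lse) into the
    running accumulators (out, lse) using a stable log-sum-exp update."""
    if out is None:
        return block_out, block_lse
    # stable logaddexp: new_lse = log(exp(lse) + exp(block_lse))
    new_lse = paddle.maximum(lse, block_lse) + paddle.log1p(
        paddle.exp(-paddle.abs(lse - block_lse))
    )
    # rescale the old accumulator and the new block to the common normalizer
    out = paddle.exp(lse - new_lse) * out + paddle.exp(block_lse - new_lse) * block_out
    return out, new_lse
```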
if attn_mask is not None:
    attn_masks_list = paddle.split(attn_mask, num_or_sections=cp_size * 2, axis=3)
if is_causal:
    local_query_second_chunk = local_query[:, local_q_seq_len // 2 :, :, :].clone().contiguous()
contiguous? This is probably not needed. Prefer the tensor-splitting APIs rather than slicing via operator overloading.
done~
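As a follow-up to the splitting-API suggestion, a tiny sketch (hypothetical shapes) of taking the second half of the local query along the sequence axis with paddle.split rather than slice-and-clone:

```python
import paddle

# [batch, local_seq_len, num_heads, head_dim]; the shapes are illustrative
local_query = paddle.randn([2, 1024, 16, 64])
_, local_query_second_chunk = paddle.split(local_query, num_or_sections=2, axis=1)
print(local_query_second_chunk.shape)  # [2, 512, 16, 64]
```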
grad_comm_buffer = RingCommunicator(group, key_grad_buffer, value_grad_buffer)

if is_causal:
    local_query_second_chunk = local_query[:, local_q_seq_len // 2 :, :, :].clone().contiguous()
This was already computed in the forward pass; can we avoid recomputing it here?
This has been optimized.
def wait(self):
    # for req in self._reqs:
    #     req.wait()
    # self._reqs = None
Please make this a TODO rather than leaving commented-out code.
done~
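A short sketch of how the resolved wait() can read once the commented-out code becomes a TODO, as requested; the wording and the enclosing class are illustrative, following the synchronize() fallback shown earlier in this thread.

```python
import paddle


class RingCommunicator:
    def wait(self):
        # TODO: wait on the batch_isend_irecv tasks directly once waiting works under
        # the async communication stream; a full device synchronize is used until then.
        paddle.device.synchronize()
```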
Force-pushed from a360468 to ab562b7
LGTM
paddlenlp/trainer/training_args.py (Outdated)
@@ -583,6 +587,15 @@ class TrainingArguments:
        )
    },
)
cp_parallel_degree: int = field(
Rename this to context_parallel_degree?
Suggested change:
- cp_parallel_degree: int = field(
+ context_parallel_degree: int = field(
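For reference, a hedged illustration of the renamed argument as a dataclass field; the help text below is paraphrased, not the PR's exact wording.

```python
from dataclasses import dataclass, field


@dataclass
class TrainingArgumentsSketch:
    context_parallel_degree: int = field(
        default=-1,
        metadata={
            "help": "Degree of context parallelism. The sequence (context) dimension is "
            "split across this many ranks and attention runs as ring flash attention."
        },
    )
```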
paddlenlp/trainer/trainer.py (Outdated)
@@ -763,6 +764,8 @@ def train(
trainable_numel = int(trainable_numel_tensor.item()) // self.args.dataset_world_size
if self.args.sep_parallel_degree > 0:
    trainable_numel = trainable_numel // self.args.sep_parallel_degree
if self.args.cp_parallel_degree > 0:
    trainable_numel = trainable_numel // self.args.cp_parallel_degree
Which parameters does cp_parallel_degree shard?
paddlenlp/trainer/training_args.py (Outdated)
@@ -230,6 +230,10 @@ class TrainingArguments:
    The paddle sequence parallel strategy. It can reduce the GPU memory of activation to 1/sep, and it is orthogonal to
    data parallel, sharding stage1, tensor parallel and pipeline parallel strategy.
)
cp_parallel_degree (`int`, *optional*, defaults to `-1`)(
Please also document this argument in docs/trainer.md.
self.tensor_parallel_degree
* self.sep_parallel_degree
* self.cp_parallel_degree
* self.pipeline_parallel_degree
Has checkpoint saving been considered for this? Does it need an extra communication group?
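A small worked example (illustrative numbers; the PR computes this inside TrainingArguments) of how the implied data-parallel degree follows once context_parallel_degree joins the product above:

```python
world_size = 8
tensor_parallel_degree = 2
sep_parallel_degree = 1
context_parallel_degree = 2
pipeline_parallel_degree = 1

model_parallel_ranks = (
    tensor_parallel_degree
    * sep_parallel_degree
    * context_parallel_degree
    * pipeline_parallel_degree
)
assert world_size % model_parallel_ranks == 0
data_parallel_degree = world_size // model_parallel_ranks
print(data_parallel_degree)  # 2
```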
@@ -918,6 +931,7 @@
if world_size > 1:
    tensor_parallel_degree = max(self.tensor_parallel_degree, 1)
    sep_parallel_degree = max(self.sep_parallel_degree, 1)
    context_parallel_degree = max(self.context_parallel_degree, 1)
One more question: are context parallel and sep parallel mutually exclusive? Should we add a check for that, or can they be used together?
@@ -897,6 +900,8 @@
for step, inputs in enumerate(epoch_iterator):
    if self.args.use_hybrid_parallel and self.args.sep_parallel_degree > 1:
        inputs = split_inputs_sequence_dim(inputs)
    if self.args.use_hybrid_parallel and self.args.context_parallel_degree > 1:
Hmm, so with cp enabled, is it effectively one data stream that now maps to multiple full copies of the parameters?
cp=2, tp=2, 4 cards: two copies of the parameters, one data stream.
Forward and backward: with two parameter copies, what happens to the grads? Are they summed?
Warm-start the model to align precision: check the first-step precision, then compare the loss diff at the second step.
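To make the "one data stream, split context" point concrete, a minimal sketch of splitting one batch along the sequence dimension across the context-parallel group, so every cp rank consumes the same data stream but only a slice of the context. The helper name is hypothetical and the even 1/cp split is a simplification of the PR's causal load balancing.

```python
import paddle


def split_inputs_sequence_dim_for_cp(input_ids, cp_rank, cp_size):
    # input_ids: [batch, seq_len]; seq_len is assumed divisible by cp_size
    chunks = paddle.split(input_ids, num_or_sections=cp_size, axis=1)
    return chunks[cp_rank]


input_ids = paddle.arange(2 * 8).reshape([2, 8])
local_ids = split_inputs_sequence_dim_for_cp(input_ids, cp_rank=1, cp_size=2)
print(local_ids.shape)  # [2, 4]
```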
PR types
New features
PR changes
Models
Description
Add RingFlashAttention support for Fleet's context parallel.
Paddle compatibility:
Uses the existing sep group in Paddle; no changes to Paddle itself are required.
Convergence:
[Figure: convergence curves, cp (green) vs. sep (blue)]
cp is compared against sep. In theory the two should converge identically; in testing, their convergence curves are nearly identical. Green is cp, blue is sep.
Performance:
[Figure: throughput comparison, cp (green) vs. sep (blue)]
Tested on a single node with 8 GPUs and a small model at a 20k sequence length; the performance comparison is shown in the figure. Green is cp, blue is sep.
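A hedged usage sketch of enabling the new degree through TrainingArguments; only the parallelism-related fields are shown, a real run needs the rest of the training configuration, and the exact combination of degrees depends on the available cards.

```python
from paddlenlp.trainer import TrainingArguments

training_args = TrainingArguments(
    output_dir="./checkpoints",
    tensor_parallel_degree=2,
    context_parallel_degree=2,  # split the long context across 2 ranks with ring flash attention
)
```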