Integrate Flex Decoding #196
Conversation
model.py (Outdated)

```diff
@@ -89,7 +103,7 @@ def update(self, input_pos, k_val, v_val):
         return k_out, v_out
 
 class Transformer(nn.Module):
-    def __init__(self, config: ModelArgs) -> None:
+    def __init__(self, config: ModelArgs, get_mask_mod: Callable[[int], _mask_mod_signature]) -> None:
```
get_mask_mod shouldn't take an integer; it should take a mask_mod. We also don't need to pass it as an argument, just set it as an attribute within the module.
Specifically, you should be able to take any existing mask_mod and wrap it to make it automatically support an offset.
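A minimal sketch of what that wrapping could look like, assuming FlexAttention's mask_mod signature (batch, head, q_idx, kv_idx); the name get_mask_mod follows the review thread, but the exact signature here is an illustration rather than the PR's final API.

```python
from torch.nn.attention.flex_attention import _mask_mod_signature


def get_mask_mod(mask_mod: _mask_mod_signature, offset: int) -> _mask_mod_signature:
    # During single-token decoding the query index is always 0, so shift it by
    # the current position before delegating to the wrapped mask_mod.
    def offset_mask_mod(b, h, q_idx, kv_idx):
        return mask_mod(b, h, q_idx + offset, kv_idx)

    return offset_mask_mod
```

For example, get_mask_mod(causal_mask, input_pos) behaves like causal_mask, but as if the single query token sits at position input_pos.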
Mostly looks good (modulo some other nits), although we'll probably want to wait on landing this until at least the other PRs on pytorch core have landed.
This looks interesting. I would like to share some numbers we got with torch.compile + flashinfer in sglang. They can serve as good baselines. To run the 32k one, you need to edit the
You can find more numbers at sgl-project/sglang#1008
@merrymercy We run on nerfed H100s internally at Meta with only 2.4 TB/s of bandwidth, so these numbers aren't 1:1 comparable. But it's a good comparison :)
```python
logits = model(x, input_pos)
block_index = input_pos // block_mask.BLOCK_SIZE[0]
mask = block_mask[:, :, block_index]
mask.mask_mod = block_mask.mask_mod
```
We discussed offline that BlockMask.__getitem__ sets mask_mod to None, so the user needs to specify the correct mask_mod. In GPT-Fast, we rely on model.get_mask_mod to do so.
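A sketch of the flow that comment describes, combining the diff above with model.get_mask_mod; the exact call shape of model.get_mask_mod (and passing input_pos as the offset) is an assumption here, not the PR's final code.

```python
block_index = input_pos // block_mask.BLOCK_SIZE[0]
mask = block_mask[:, :, block_index]  # BlockMask.__getitem__ leaves mask_mod as None
# Reattach a mask_mod that accounts for the current decoding position.
mask.mask_mod = model.get_mask_mod(block_mask.mask_mod, input_pos)
```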
…compile (#134627)

Adds a helper function for getting the block mask for a specific row index during decoding. We need this change to avoid the pytree + torch.compile issue #134731. Tested in the gpt-fast PR (pytorch-labs/gpt-fast#196).

Pull Request resolved: #134627
Approved by: https://github.com/Chillee
```python
    return sample(logits, **sampling_kwargs)


def decode_n_tokens(model: Transformer, cur_token: torch.Tensor, input_pos: torch.Tensor, num_new_tokens: int, callback=lambda _: _, **sampling_kwargs):
    block_mask = create_block_mask(causal_mask, 1, 1, model.max_seq_length, model.max_seq_length, device="cuda")
```
Try doing `create_block_mask_compile = torch.compile(create_block_mask)` as a global.
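A sketch of that suggestion: compile create_block_mask once at module scope and reuse it when building the mask, rather than calling the eager helper inside decode_n_tokens. The causal_mask definition and the shape arguments are illustrative.

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask


def causal_mask(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx


# Compile once as a global so repeated calls reuse the compiled artifact.
create_block_mask_compile = torch.compile(create_block_mask)

# Later, when setting up generation:
# block_mask = create_block_mask_compile(
#     causal_mask, 1, 1, model.max_seq_length, model.max_seq_length, device="cuda"
# )
```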
This PR integrates flex decoding with gpt-fast.

End-to-end performance gain of Llama2-7b
Device: H100
Unit: tokens/sec
command:

Please also set ModelArgs.block_size = 65536 to repeat the result. We expect to see a larger speedup at longer context lengths.
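A hypothetical sketch of that config tweak; block_size is gpt-fast's ModelArgs field and 65536 is the value quoted above, but the construction below is illustrative rather than the exact setup used for the benchmark.

```python
from model import ModelArgs

# Raise the maximum context length so the longer-sequence runs fit.
config = ModelArgs(block_size=65536)
```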