Only the first shard is prefilled #1
Funnily enough, this "bug" is also kind of a feature: it speeds up the prefill stage significantly (which is typically compute-bound), and you still get good results out of the model, so it's hard to notice. It tends to fail once you have a very long context and you see context collapse (see e.g. exo-explore/exo#23).
I'm not sure I understand the issue correctly. Currently we are doing pipeline parallelism, and having only the first shard process the embedding looks fine unless I'm missing something: prompt -> first shard -> second shard -> logits -> token-by-token generation, shard by shard.
You're right, I was overcomplicating things -- neither exo nor mlx_sharding has this issue :)
First of all, awesome repo, really love what you did here!
This is a bug that exo also has (see exo-explore/exo#12).
The issue is that the prompt is only loaded (prefilled) into the layers of the first shard, see https://github.com/mzbac/mlx_sharding/blob/main/server/model/deepseek_v2.py#L441 - this is the wrong condition for taking a prompt. The prompt should be prefilled into all layers, and only then should generation begin. An interesting corollary is that prefill can happen in parallel across all shards, since there's no dependency - just the prompt.
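To make the failure mode concrete, here is a minimal sketch of the difference between prefilling only the first shard and prefilling all of them. The `Shard`, `prefill`, and `buggy_prefill` names are hypothetical stand-ins, not the mlx_sharding API, and the per-layer "KV cache" is reduced to a plain list of token ids:

```python
class Shard:
    """A pipeline stage owning a contiguous slice of layers with per-layer KV caches."""
    def __init__(self, name, num_layers):
        self.name = name
        self.caches = [[] for _ in range(num_layers)]  # one KV cache per layer

    def forward(self, tokens):
        # Stand-in for attention: record the processed tokens in every
        # layer's cache, then pass the result on to the next stage.
        for cache in self.caches:
            cache.extend(tokens)
        return tokens  # simplified "hidden states" for the next shard

def prefill(shards, prompt):
    # Correct behavior: the prompt flows through *every* shard, so all
    # layers' KV caches are populated before generation starts.
    hidden = prompt
    for shard in shards:
        hidden = shard.forward(hidden)

def buggy_prefill(shards, prompt):
    # The reported bug: only the first shard sees the prompt, so later
    # shards begin generation with empty caches.
    shards[0].forward(prompt)

shards = [Shard("first", 2), Shard("second", 2)]
buggy_prefill(shards, [1, 2, 3])
print(len(shards[1].caches[0]))  # 0 -> the second shard never saw the prompt

shards = [Shard("first", 2), Shard("second", 2)]
prefill(shards, [1, 2, 3])
print(len(shards[1].caches[0]))  # 3 -> every layer holds the prompt's entries
```

With the buggy condition the model still generates tokens, since later shards simply attend over whatever (empty) context they have, which is why the bug is easy to miss on short prompts.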
You can most easily reproduce or notice this bug with very long contexts, or by making the first shard contain only one layer.
exo is offering a $100 bounty if you want to also fix it there :)