enable llava static generation. by lkk12014402 · Pull Request #767 · huggingface/optimum-habana

lkk12014402 · 2024-03-05T16:58:24Z

What does this PR do?

Adds support for LLaVA image-to-text generation.

lkk12014402 · 2024-03-05T17:04:10Z

This PR builds on the image-to-text generation PR #738.

I tested it on a single Gaudi2 card with --use_hpu_graphs:

python3 run_pipeline.py \
        --model_name_or_path "llava-hf/llava-1.5-7b-hf" \
        --image_path "https://llava-vl.github.io/static/images/view.jpg" \
        --prompt "<image>\nUSER: What's the content of the image?\nASSISTANT:" \
        --max_new_tokens 20 \
        --use_hpu_graphs \
        --bf16

result = [[{'generated_text': "[\nUSER: What's the content of the image?\nASSISTANT: The image features a pier extending out into a large body of water, likely a lake.\n\n"}]], time = 264.1947269439697ms

Input/outputs:
Throughput (including tokenization) = 75.80511326513157 tokens/second
Number of HPU graphs = 22
Memory allocated = 14.06 GB
Max memory allocated = 14.06 GB
Total memory available = 94.62 GB

lkk12014402 · 2024-03-05T17:26:17Z

For batch_size = 4

Input/outputs 1:
USER: What's the content of the image?
ASSISTANT: The image features a pier extending out into a large body of water, likely a lake. The pier

Input/outputs 2:
USER: What's the content of the image?
ASSISTANT: The image features a pier extending out over a large body of water, likely a lake. The pier

Input/outputs 3:
USER: describe the image?
ASSISTANT: The image features a pier extending out into a large body of water, likely a lake. The pier

Input/outputs 4:
USER: Is there a brige in the image?
ASSISTANT: Yes, there is a bridge in the image.
USER: Is the bridge over water?

Input/outputs:
Throughput (including tokenization) = 191.2782859461739 tokens/second

Number of HPU graphs = 26
Memory allocated = 16.26 GB
Max memory allocated = 16.27 GB
Total memory available = 94.62 GB

JoeyTPChou · 2024-03-07T23:16:44Z

Just want to let you know this works like a charm!

ssarkar2

@lkk12014402, could you please briefly describe the changes needed in optimum/habana/transformers/models/llava/modeling_llava.py with respect to the base model in transformers?

I see a couple of single-input torch.where calls, which are usually dynamic on HPU. If they run on CPU, that's fine, but if they run on HPU, they might need rewriting.

optimum-habana/optimum/habana/transformers/models/llava/modeling_llava.py

Line 79 in d44c540

batch_indices, image_indices = torch.where(input_ids == image_token_index)

optimum-habana/optimum/habana/transformers/models/llava/modeling_llava.py

Line 29 in d44c540

    
image_token_indices = torch.where(cur_input_ids == image_token_index)[0].tolist() + \
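To illustrate the concern: the single-input form torch.where(cond) returns the indices where cond is true, so its output length depends on the data, while the three-input form torch.where(cond, a, b) always returns a result with the input's shape. The sketch below is a plain-Python stand-in that uses lists instead of tensors. The helper names and the token id 32000 are assumptions made for illustration, not code from this PR.

```python
# Plain-Python stand-in for the two forms of torch.where (illustrative only).
IMAGE_TOKEN = 32000  # hypothetical id for the <image> token


def where_single_input(ids, target):
    """Like torch.where(ids == target): return the matching indices.

    The length of the result depends on the data, so the shape is dynamic.
    """
    return [i for i, tok in enumerate(ids) if tok == target]


def where_three_input(ids, target, if_true, if_false):
    """Like torch.where(ids == target, a, b): elementwise select.

    The result always has the same length as the input, so the shape is static.
    """
    return [t if tok == target else f
            for tok, t, f in zip(ids, if_true, if_false)]


one_image = [1, IMAGE_TOKEN, 5, 6]
two_images = [1, IMAGE_TOKEN, IMAGE_TOKEN, 6]

# Same input length, different output lengths: the shape depends on the values.
print(where_single_input(one_image, IMAGE_TOKEN))   # [1]
print(where_single_input(two_images, IMAGE_TOKEN))  # [1, 2]

# Elementwise select keeps the input length regardless of the values.
print(where_three_input(one_image, IMAGE_TOKEN, [0] * 4, one_image))  # [1, 0, 5, 6]
```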

lkk12014402 · 2024-03-14T07:01:37Z

Hi @ssarkar2,

I will write up a description and check the torch.where() calls as soon as possible.

lkk12014402 · 2024-03-15T17:01:18Z

Hi @ssarkar2,

Description

Let's assume the input text is ["hey", "<image>", "how", "are"], with one image.

Generation with Hugging Face transformers directly

With llava-1.5-7b-hf, transformers produces a text embedding of shape [1, 4, 4096] and an image embedding of shape [1, 576, 4096]. The two embeddings are then merged into the final input embedding of shape [1, 579, 4096] by the merge function here. The single <image> token is replaced by 576 image patch embeddings, so 4 - 1 + 576 = 579.

The merge function also contains many dynamic ops, such as torch.where, and the input shape changes during generation.
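The shape arithmetic above can be checked with a few lines of Python. This is only a sketch of the length calculation; the function name is hypothetical and no real embeddings are involved.

```python
def merged_sequence_length(num_text_tokens, num_image_tokens, patches_per_image):
    """Length of the merged sequence once each <image> placeholder
    is replaced by `patches_per_image` patch embeddings."""
    return num_text_tokens - num_image_tokens + num_image_tokens * patches_per_image


hidden_size = 4096
text_shape = [1, 4, hidden_size]     # ["hey", "<image>", "how", "are"]
image_shape = [1, 576, hidden_size]  # one image, 576 patches

merged_len = merged_sequence_length(text_shape[1], 1, image_shape[1])
print([1, merged_len, hidden_size])  # [1, 579, 4096]
```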

So when we run generation on Gaudi2, there are two problems.

First, generation is very slow. We tested one example:

python3 run_pipeline.py \
    --model_name_or_path llava-hf/llava-1.5-7b-hf \
    --image_path https://llava-vl.github.io/static/images/view.jpg \
    --prompt "<image>\nUSER: What's the content of the image?\nASSISTANT:" \
    --max_new_tokens 20 \
    --bf16

Output is:
03/04/2024 05:22:07 - INFO - __main__ - result = [[{'generated_text': \\nUSER: What's the content of the image?\\nASSISTANT: The image features a pier extending out into a large body of water, likely a lake.\n\n}]], time = 1148.6382484436035ms

Note: to reproduce, you can use this PR's image-to-text example.

Second, we cannot apply --use_hpu_graphs, because it produces errors.

My optimization

To keep the transformers usage unchanged (same input, same generation script) while enabling static shapes through padding and token_idx, I added a new function, _pad_inputs, that pads the input. It expands the special token <image> to the number of patches in the image feature, so the padded input text can be regarded as ["hey", "<image>", "<image>", ..., "<image>", "how", "are"], whose sequence length is 579. The text embedding shape is therefore [1, 579, 4096]. When merging the two embeddings (text and image), we no longer need complex computation like this, and I simplified the function, as you can see in modeling_llava.py.
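As a rough illustration of the idea (not the PR's actual _pad_inputs, which operates on tensors), expanding the <image> token and then merging elementwise could look like the sketch below. The helper names, the token ids, and the string placeholders standing in for embeddings are all assumptions.

```python
IMAGE_TOKEN_ID = 32000  # hypothetical id of the <image> token
NUM_PATCHES = 576       # patches per image, taken from the shapes quoted above


def expand_image_tokens(input_ids, image_token_id=IMAGE_TOKEN_ID, num_patches=NUM_PATCHES):
    """Repeat every <image> id num_patches times so text and image lengths line up."""
    expanded = []
    for tok in input_ids:
        expanded.extend([tok] * (num_patches if tok == image_token_id else 1))
    return expanded


def merge_embeddings(expanded_ids, text_embeds, image_embeds, image_token_id=IMAGE_TOKEN_ID):
    """Elementwise merge: take the next image patch wherever the id is <image>."""
    patches = iter(image_embeds)
    return [next(patches) if tok == image_token_id else emb
            for tok, emb in zip(expanded_ids, text_embeds)]


ids = [101, IMAGE_TOKEN_ID, 102, 103]  # stands for "hey", "<image>", "how", "are"
expanded = expand_image_tokens(ids)
print(len(expanded))  # 579

# String placeholders stand in for real embedding vectors.
text_embeds = [f"text_{i}" for i in range(len(expanded))]
image_embeds = [f"patch_{i}" for i in range(NUM_PATCHES)]
merged = merge_embeddings(expanded, text_embeds, image_embeds)
print(merged[0], merged[1], merged[576], merged[577])  # text_0 patch_0 patch_575 text_577
```

Because the expanded ids already have the final length, the merge is a fixed-shape elementwise select rather than an index-based scatter.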

To keep the input shape constant during generation, I also use token_idx. For that I create two auxiliary variables, tokens_pos and image_offset. tokens_pos records the original input text positions and is used to select logits, which keeps the input ids and the output logits the same shape. image_offset records the offset introduced by the expanded special <image> tokens, which must be added to token_idx during the model forward pass.
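One possible reading of these two variables is sketched below in plain Python. The function name, the token ids, and the choice to map an <image> token to its last expanded copy are assumptions; the PR's implementation may differ.

```python
def compute_positions(input_ids, image_token_id, num_patches):
    """Return (tokens_pos, image_offset) for a sequence whose <image> ids get expanded.

    tokens_pos[i] is the index, in the expanded sequence, of original token i.
    For an <image> token this sketch uses the index of its last expanded copy,
    which is an assumption. image_offset is the number of extra positions that
    the expansion adds.
    """
    tokens_pos = []
    cursor = 0
    for tok in input_ids:
        cursor += num_patches if tok == image_token_id else 1
        tokens_pos.append(cursor - 1)
    image_offset = cursor - len(input_ids)
    return tokens_pos, image_offset


IMAGE_TOKEN_ID = 32000  # hypothetical id of the <image> token
ids = [101, IMAGE_TOKEN_ID, 102, 103]
tokens_pos, image_offset = compute_positions(ids, IMAGE_TOKEN_ID, 576)
print(tokens_pos)    # [0, 576, 577, 578]
print(image_offset)  # 575

# Selecting logits at tokens_pos brings them back to the original input length.
logits = list(range(579))  # placeholder for per-position logits
selected = [logits[p] for p in tokens_pos]
print(len(selected))  # 4

# A token_idx counted in the original sequence maps to the expanded one via the offset.
token_idx = len(ids)
print(token_idx + image_offset)  # 579
```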

Why `torch.where` is kept

We need this function to compute the index of the special token, because the input is not preprocessed.

Ideally, we would preprocess the input (padding and expanding the special <image> tokens), but that would require more changes relative to transformers. In particular, we would need a new generation script instead of the one in this PR.

After the optimization, we can set --use_hpu_graphs:

python3 run_pipeline.py \
        --model_name_or_path "llava-hf/llava-1.5-7b-hf" \
        --image_path https://llava-vl.github.io/static/images/view.jpg \
        --prompt "<image>\nUSER: What's the content of the image?\nASSISTANT:" \
        --max_new_tokens 20 \
        --use_hpu_graphs \
        --bf16
 
The output is:
result = [[{'generated_text': "\\nUSER: What's the content of the image?\\nASSISTANT: The image features a pier extending out into a large body of water, likely a lake.\n\n"}]], time = 264.1947269439697ms

For comparison, here is the same example on an A100:

A100 card performance
03/06/2024 05:07:56 - INFO - __main__ - result = [[{'generated_text': \\nUSER: What's the content of the image?\\nASSISTANT: The image features a pier extending out into a large body of water, likely a lake.\n\n}]], time = 575.2068996429443ms
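The three timings reported above can be put side by side with a short calculation. The numbers are copied from the logs in this comment; the labels are mine.

```python
# Latencies (ms) copied from the logs above, for the same prompt with max_new_tokens=20.
timings_ms = {
    "Gaudi2, before optimization": 1148.6382484436035,
    "Gaudi2, static shapes + HPU graphs": 264.1947269439697,
    "A100": 575.2068996429443,
}

baseline = timings_ms["Gaudi2, static shapes + HPU graphs"]
for name, ms in timings_ms.items():
    print(f"{name}: {ms:.1f} ms ({ms / baseline:.2f}x the optimized Gaudi2 time)")
```

On these single runs, the optimized Gaudi2 path is about 4.35x faster than the unoptimized one and about 2.18x faster than the A100 run.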

lkk12014402 · 2024-03-29T01:46:01Z

@ssarkar2 please help review~

libinta

@lkk12014402 can you add a ci test case and rebase?

lkk12014402 · 2024-04-12T01:51:46Z

@libinta I will update the PR to address your comments soon.

lkk12014402 · 2024-04-22T07:58:47Z

Hi @libinta, I have resolved the conflicting files. However, I haven't found an image-to-text example test case similar to test_text_generation_example.py.

libinta · 2024-04-22T22:11:12Z

@lkk12014402 can you add a file like test_image2text_generation_example.py to cover image-to-text generation,
and change

optimum-habana/Makefile

Line 76 in 081130d

to include it?

lkk12014402 · 2024-04-23T14:28:57Z

Hi @libinta, please help review the image-to-text unit test. Thanks~

libinta · 2024-04-23T16:49:39Z

+
+
+@pytest.mark.parametrize("model_name, batch_size, reuse_cache, baseline", MODELS_TO_TEST["bf16"])
+def test_text_generation_bf16(model_name: str, baseline: float, batch_size: int, reuse_cache: bool, token: str):


Better to name this image_to_text rather than text_generation.

libinta · 2024-04-23T17:54:19Z

+        f"--model_name_or_path {model_name}",
+        f"--batch_size {batch_size}",
+        "--use_kv_cache",
+        "--max_new_tokens 20",


Have you run the test with
GAUDI2_CI=1 RUN_SLOW=true python -m pytest tests/test_image_to_text_example.py -v -s
If so, you will see: run_pipeline.py: error: unrecognized arguments: --use_kv_cache --output_dir /tmp/tmpsp9f6li_ --token None
The test should pass only the arguments that run_pipeline.py accepts, as in:
python3 run_pipeline.py \
        --model_name_or_path "llava-hf/llava-1.5-7b-hf" \
        --image_path "https://llava-vl.github.io/static/images/view.jpg" \
        --prompt "<image>\nUSER: What's the content of the image?\nASSISTANT:" \
        --max_new_tokens 20 \
        --use_hpu_graphs \
        --bf16

libinta · 2024-04-23T17:55:02Z

+        pattern = re.compile(r"([\"\'].+?[\"\'])|\s")
+        command = [x for y in command for x in re.split(pattern, y) if x]
+
+        if fp8:


remove fp8 section for now
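For context, the two regex lines quoted in this diff split each command fragment on whitespace while keeping quoted strings together as one argument. The standalone sketch below reuses that pattern; the wrapper function and the example fragments are made up for illustration.

```python
import re

# Same pattern as in the quoted diff: capture a quoted string, or match one whitespace char.
pattern = re.compile(r"([\"\'].+?[\"\'])|\s")


def split_command(command):
    """Split each fragment on whitespace, keeping quoted strings as single items.

    re.split also returns the captured group (None for whitespace matches),
    so the `if x` filter drops both None and empty strings.
    """
    return [x for y in command for x in re.split(pattern, y) if x]


command = [
    "python3 run_pipeline.py",
    "--max_new_tokens 20",
    '--prompt "describe the image"',
]
print(split_command(command))
# ['python3', 'run_pipeline.py', '--max_new_tokens', '20', '--prompt', '"describe the image"']
```

Note that the surrounding quotes stay part of the quoted argument after splitting.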

regisss

It seems there are some merge conflicts to solve, can you update your main branch and merge it into this one?
Also, please run

pip install -U ruff
make style

to have the code style check pass.

lkk12014402 · 2024-04-24T01:47:18Z

Updated the code style with these commands.

HuggingFaceDocBuilderDev · 2024-04-24T12:22:16Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

lkk12014402 · 2024-04-24T15:20:43Z

Hi @regisss, I updated the code to address your comments.
The unit test results:

04/24/2024 15:19:05 - INFO - __main__ - result = [[{'generated_text': "\nUSER: What's the content of the image?\nASSISTANT: The image features a pier extending out into a large body of water, likely a lake. The pier"}]], time = 245.72740799630992ms, Throughput (including tokenization) = 81.39100218035239 tokens/second
PASSED

please review~ Thanks~

Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>

…ace#767) Co-authored-by: Adam Stachowicz <105052242+astachowiczhabana@users.noreply.github.com> Co-authored-by: Adam Stachowicz <astachow@habana.ai>

lkk12014402 requested review from bhargaveede, regisss, ssarkar2 and vivekgoe as code owners March 5, 2024 16:58

jiminha self-requested a review March 5, 2024 22:26

ssarkar2 reviewed Mar 12, 2024

mandy-li self-requested a review April 9, 2024 21:37

libinta reviewed Apr 9, 2024

llava static shape generation.

1a1ee0b

lkk12014402 force-pushed the enable_llava_generation branch from d44c540 to 1a1ee0b Compare April 22, 2024 07:53

add image-to-text ut.

9c1d058

libinta reviewed Apr 23, 2024

regisss reviewed Apr 23, 2024

Comment thread optimum/habana/transformers/models/llava/modeling_llava.py Outdated

Comment thread optimum/habana/transformers/models/llava/modeling_llava.py

Comment thread optimum/habana/transformers/models/llava/modeling_llava.py Outdated

lkk12014402 and others added 2 commits April 24, 2024 09:42

fix ut and code style.

02ee3d4

Merge branch 'main' into enable_llava_generation

39c7e21

lkk12014402 and others added 2 commits April 24, 2024 10:22

keep same code for position_ids if not static.

31f9f5c

Merge branch 'main' into enable_llava_generation

eea1f81

fix code style.

3e86385

regisss reviewed Apr 24, 2024

Comment thread tests/test_image_to_text_example.py

Comment thread examples/image-to-text/run_pipeline.py Outdated

Comment thread examples/image-to-text/run_pipeline.py Outdated

lkk12014402 added 2 commits April 24, 2024 23:12

add token and use model_type.

af36d20

add token.

a7ffde0

Add Gaudi1 CI baseline

2d9b5b2

regisss added the run-test Run CI for PRs from external contributors label Apr 25, 2024

regisss approved these changes Apr 25, 2024

regisss merged commit 91a5e57 into huggingface:main Apr 25, 2024

ccrhx4 pushed a commit to ccrhx4/ccrhx4.optimum-habana that referenced this pull request May 11, 2024

Enable llava static generation. (huggingface#767)

34e185a

Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>


