[Bug]: Pixtral fails when limit_mm_per_prompt not set #8382
Comments
@patrickvonplaten it looks like
I also encounter the same error when processing this image. What's really weird is that once I resize it (as in the code further below), the error goes away.

Error:
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/dongfuj/WorkSpace/LMM-Engines/test_vllm_pixtral.py", line 34, in <module>
[rank0]: outputs = llm.chat(messages, sampling_params=sampling_params)
[rank0]: File "/home/dongfuj/.conda/envs/lmm-engines/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 422, in chat
[rank0]: return self.generate(
[rank0]: File "/home/dongfuj/.conda/envs/lmm-engines/lib/python3.10/site-packages/vllm/utils.py", line 1032, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/home/dongfuj/.conda/envs/lmm-engines/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 348, in generate
[rank0]: outputs = self._run_engine(use_tqdm=use_tqdm)
[rank0]: File "/home/dongfuj/.conda/envs/lmm-engines/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 720, in _run_engine
[rank0]: step_outputs = self.llm_engine.step()
[rank0]: File "/home/dongfuj/.conda/envs/lmm-engines/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1600, in step
[rank0]: outputs = self.model_executor.execute_model(
[rank0]: File "/home/dongfuj/.conda/envs/lmm-engines/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 130, in execute_model
[rank0]: output = self.driver_worker.execute_model(execute_model_req)
[rank0]: File "/home/dongfuj/.conda/envs/lmm-engines/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 327, in execute_model
[rank0]: output = self.model_runner.execute_model(
[rank0]: File "/home/dongfuj/.conda/envs/lmm-engines/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/dongfuj/.conda/envs/lmm-engines/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1543, in execute_model
[rank0]: hidden_or_intermediate_states = model_executable(
[rank0]: File "/home/dongfuj/.conda/envs/lmm-engines/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/dongfuj/.conda/envs/lmm-engines/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/dongfuj/.conda/envs/lmm-engines/lib/python3.10/site-packages/vllm/model_executor/models/pixtral.py", line 181, in forward
[rank0]: inputs_embeds = merge_multimodal_embeddings(
[rank0]: File "/home/dongfuj/.conda/envs/lmm-engines/lib/python3.10/site-packages/vllm/model_executor/models/pixtral.py", line 117, in merge_multimodal_embeddings
[rank0]: assert (seq_len == N_txt +
[rank0]: AssertionError: seq_len 12 should be equal to N_txt + N_img (12, 4032, 0)
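For context on what this assertion checks: the model needs to scatter vision-encoder embeddings into the image-placeholder positions of the prompt, so every position in the scheduled chunk must be covered by exactly one text embedding or one image embedding. Below is a simplified sketch of that bookkeeping; it is illustrative only, not vLLM's actual pixtral.py code, and the placeholder token id is made up.

import torch

IMAGE_TOKEN_ID = 10  # illustrative placeholder id, not Pixtral's real one

def merge_multimodal_embeddings_sketch(input_ids: torch.Tensor,
                                       text_embeds: torch.Tensor,
                                       image_embeds: torch.Tensor) -> torch.Tensor:
    """Scatter image embeddings into the placeholder positions of a prompt chunk."""
    seq_len = input_ids.numel()
    n_txt = text_embeds.shape[0]   # embeddings for the text tokens in this chunk
    n_img = image_embeds.shape[0]  # embeddings produced by the vision encoder
    # The invariant from the traceback: every position must be accounted for by
    # exactly one text or image embedding. If the scheduler hands the model only
    # part of the prompt (e.g. with chunked prefill), the counts stop adding up
    # and the assertion fires even though the inputs themselves are valid.
    assert seq_len == n_txt + n_img, (seq_len, n_txt, n_img)

    merged = torch.empty(seq_len, text_embeds.shape[-1], dtype=text_embeds.dtype)
    image_positions = input_ids == IMAGE_TOKEN_ID
    merged[~image_positions] = text_embeds
    merged[image_positions] = image_embeds.to(text_embeds.dtype)
    return merged

In the failing case above, the counts reported by the assertion do not add up for the scheduled chunk, which matches the chunked-prefill explanation given later in this thread.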
I also tried to resize the image to
Can you try out #8399 and see if it fixes the issue which you've encountered?
Hello @jdf-prog! Just to confirm, you were able to launch the server, but only this particular image ran into an issue, correct?
Hmm, I was able to run inference with that image without any resizing.
Double-checking this command:

BTW there is no need to pass --trust-remote-code here.
Ah yes, I see when passing --max_num_batched_tokens, but not
Yes, only this particular image. The following is the code with which I encounter this error. It should be simple to reproduce.

from vllm import LLM
from vllm.sampling_params import SamplingParams
model_name = "mistralai/Pixtral-12B-2409"
sampling_params = SamplingParams(max_tokens=8192)
llm = LLM(model=model_name, tokenizer_mode="mistral", max_model_len=65536, limit_mm_per_prompt={"image":4})
prompt = "Can you derive Equation 6 from the image?"
image_url="https://f2c628843e9892f5c7.gradio.live/file=/tmp/gradio/3036880890cf17b59a0cc838afc217dcd4d91ba5bc294ff42a99f6a2090f8bf2/equation.png"
messages = [
    {
        "role": "user",
        "content": [{"type": "text", "text": prompt}, {"type": "image_url", "image_url": {"url": image_url}}]
    },
]
outputs = llm.chat(messages, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
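As a quick sanity check, printing the dimensions of the offending image can be useful, since the number of image tokens Pixtral produces grows with resolution, and a very large image is more likely to blow past the default prefill token budget. A minimal sketch of that check follows; the URL is the temporary Gradio link from the snippet above and may no longer resolve.

from io import BytesIO

import requests
from PIL import Image

# Temporary Gradio link from the report above; it may have expired by now.
image_url = "https://f2c628843e9892f5c7.gradio.live/file=/tmp/gradio/3036880890cf17b59a0cc838afc217dcd4d91ba5bc294ff42a99f6a2090f8bf2/equation.png"

response = requests.get(image_url, timeout=30)
response.raise_for_status()
image = Image.open(BytesIO(response.content))

# Larger images produce more image tokens after Pixtral's preprocessing, so very
# high resolutions are more likely to exceed the per-step token budget.
print(image.size, image.mode)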
And this is the code where it works again after the resize:

from vllm import LLM
from vllm.sampling_params import SamplingParams
from PIL import Image
from io import BytesIO
import base64
import requests
def encode_image(image: Image.Image, image_format="PNG") -> str:
    im_file = BytesIO()
    image.save(im_file, format=image_format)
    im_bytes = im_file.getvalue()
    im_64 = base64.b64encode(im_bytes).decode("utf-8")
    return im_64
model_name = "mistralai/Pixtral-12B-2409"
sampling_params = SamplingParams(max_tokens=8192)
llm = LLM(model=model_name, tokenizer_mode="mistral", max_model_len=65536, limit_mm_per_prompt={"image":4})
prompt = "Can you derive Equation 6 from the image?"
image_url="https://f2c628843e9892f5c7.gradio.live/file=/tmp/gradio/3036880890cf17b59a0cc838afc217dcd4d91ba5bc294ff42a99f6a2090f8bf2/equation.png"
image = Image.open(BytesIO(requests.get(image_url).content))
image = image.resize((3844, 2408))
new_image_url = f"data:image/png;base64,{encode_image(image, image_format='PNG')}"
messages = [
    {
        "role": "user",
        "content": [{"type": "text", "text": prompt}, {"type": "image_url", "image_url": {"url": new_image_url}}]
    },
]
outputs = llm.chat(messages, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
@jdf-prog I'm pretty certain this is due to the fact that chunked prefill is working pretty flakily with VLMs. By default, when the

In the meantime, can you modify your model initialization similar to what's in

model_name = "mistralai/Pixtral-12B-2409"
max_img_per_msg = 5
max_tokens_per_img = 4096
sampling_params = SamplingParams(max_tokens=8192, temperature=0.7)
llm = LLM(
    model=model_name,
    tokenizer_mode="mistral",
    limit_mm_per_prompt={"image": max_img_per_msg},
    max_num_batched_tokens=max_img_per_msg * max_tokens_per_img,
)
Thanks, I tried this and it seems to work. Thanks for the help!
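For anyone who still hits the assertion after raising max_num_batched_tokens, a variation of the same idea is to turn chunked prefill off explicitly. The sketch below assumes the LLM constructor accepts enable_chunked_prefill in your vLLM version; that knob is not something confirmed in this thread.

from vllm import LLM
from vllm.sampling_params import SamplingParams

model_name = "mistralai/Pixtral-12B-2409"
max_img_per_msg = 5
max_tokens_per_img = 4096

llm = LLM(
    model=model_name,
    tokenizer_mode="mistral",
    limit_mm_per_prompt={"image": max_img_per_msg},
    # Keep the prefill budget large enough to cover all image tokens in one step.
    max_num_batched_tokens=max_img_per_msg * max_tokens_per_img,
    # Assumed knob: explicitly disable chunked prefill; check that your vLLM
    # version accepts this argument before relying on it.
    enable_chunked_prefill=False,
)
sampling_params = SamplingParams(max_tokens=8192, temperature=0.7)

The intent of the workaround above appears to be to make the scheduler's per-step token budget large enough to cover all of a request's image placeholder tokens at once.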
Your current environment
The output of `python collect_env.py`
🐛 Describe the bug
The command below does not work:

It leads to this error:

But the one below works (following Hugging Face):