Conversation

@shenoyvvarun (Contributor)

Purpose

This PR fixes issue #21344, where large image requests hang in vLLM waiting for the request to time out.
Bug: MultiModalProfiler counts only patch tokens, but the input also contains bookkeeping tokens such as the tile separator, image_start, and image_end tokens. This makes the computed encoder_budget slightly lower than the actual budget. Whenever an image that uses all tiles is sent, vLLM accepts the request, but the scheduler can never schedule it because there is not enough encoder budget. The failure is silent: no error is produced.

Fix: During profiling, use the length field of PlaceholderRange, which reflects the real token budget, instead of counting only the embedding tokens.
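To make the mismatch concrete, here is a minimal sketch (illustrative only, not the actual vLLM diff; the field names follow vLLM's PlaceholderRange, while the encoder_budget helper is hypothetical):

from dataclasses import dataclass
from typing import Optional

import torch

@dataclass
class PlaceholderRange:
    offset: int   # index of the first placeholder token in the prompt
    length: int   # total placeholder tokens, including tile separators etc.
    is_embed: Optional[torch.Tensor] = None  # bool mask: which tokens get embeddings

def encoder_budget(placeholders: list[PlaceholderRange]) -> int:
    # Buggy behavior: counting only embedding tokens makes the profiled
    # budget slightly smaller than what the scheduler must later fit,
    # so a max-tile image is accepted but never scheduled.
    # return sum(int(p.is_embed.sum()) if p.is_embed is not None
    #            else p.length
    #            for p in placeholders)

    # Fixed behavior: use the full placeholder length during profiling.
    return sum(p.length for p in placeholders)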

Test Plan

  1. Large image (4k x 4k) - Llama Guard 4
  2. Full context length text + large image - Llama Guard 4
  3. Sanity test with a 7k x 4k image

Test Result

  1. Large image (4k x 4k)
{
    "id": "chatcmpl-464e45d8a7834c41aa96b93c388f5be6",
    "object": "chat.completion",
    "created": 1753798054,
    "model": "vllm-model",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "\n\nsafe",
                "refusal": null,
                "annotations": null,
                "audio": null,
                "function_call": null,
                "tool_calls": [],
                "reasoning_content": null
            },
            "logprobs": null,
            "finish_reason": "stop",
            "stop_reason": null
        }
    ],
    "service_tier": null,
    "system_fingerprint": null,
    "usage": {
        "prompt_tokens": 2669,
        "total_tokens": 2672,
        "completion_tokens": 3,
        "prompt_tokens_details": null
    },
    "prompt_logprobs": null,
    "kv_transfer_params": null
}
  2. Full context length text + large image (encoder budget is 2467)
{
    "id": "chatcmpl-9406bb1facd54ea194a270fed2577ca8",
    "object": "chat.completion",
    "created": 1753798730,
    "model": "vllm-model",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "0",
                "refusal": null,
                "annotations": null,
                "audio": null,
                "function_call": null,
                "tool_calls": [],
                "reasoning_content": null
            },
            "logprobs": null,
            "finish_reason": "stop",
            "stop_reason": 200001
        }
    ],
    "service_tier": null,
    "system_fingerprint": null,
    "usage": {
        "prompt_tokens": 131069,
        "total_tokens": 131071,
        "completion_tokens": 2,
        "prompt_tokens_details": null
    },
    "prompt_logprobs": null,
    "kv_transfer_params": null
}
  3. Sanity test
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/8/85/Portas_da_Cidade%2C_Ponta_Delgada%2C_isla_de_San_Miguel%2C_Azores%2C_Portugal%2C_2020-07-29%2C_DD_123-125_HDR.jpg"
                    }
                },
                {
                    "type": "text",
                    "text": "Describe this image in detail and also specify where this can be found."
                }
            ]
        }
    ],
    "model": "vllm"
}'
{
    "id": "chatcmpl-86908cb63b644decbcfcbc63d26fbb36",
    "object": "chat.completion",
    "created": 1753799599,
    "model": "vllm",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The image depicts a large, ornate archway in the center of a city square at dusk. The archway is made of stone and features two large arches with a decorative top section. It is illuminated by purple lights on either side.\n\nHere are the main points describing the image:\n\n* **Archway**\n\t+ Made of stone\n\t+ Features two large arches\n\t+ Decorative top section with a crown-like design\n\t+ Illuminated by purple lights on either side\n* **City Square**\n\t+ Located in front of the archway\n\t+ Features a statue in the center\n\t+ Surrounded by buildings on all sides\n\t+ Streetlights and cars visible\n* **Buildings**\n\t+ White with brown trim\n\t+ Feature arched windows and doors\n\t+ Have balconies on the second floor\n\t+ Appear to be old and historic\n* **Statue**\n\t+ Located in the center of the square\n\t+ Depicts a person standing on a pedestal\n\t+ Not clearly visible due to distance\n* **Streetlights and Cars**\n\t+ Streetlights line the square and surrounding streets\n\t+ Cars are parked along the streets and driving through the area\n\t+ Traffic lights visible in the distance\n* **Sky**\n\t+ Dark blue and cloudy\n\t+ Indicates that it is dusk or evening\n\nIn summary, the image shows a beautiful and historic archway in a city square, surrounded by old buildings and a statue. The archway is illuminated by purple lights, and the square is bustling with streetlights and cars. The dark blue sky suggests that it is dusk or evening. \n\nThis archway can be found in Horta, Faial, Azores, Portugal. The archway is known as the Portão da Cidade (City Gate) and is a iconic landmark in Horta.",
                "refusal": null,
                "annotations": null,
                "audio": null,
                "function_call": null,
                "tool_calls": [],
                "reasoning_content": null
            },
            "logprobs": null,
            "finish_reason": "stop",
            "stop_reason": null
        }
    ],
    "service_tier": null,
    "system_fingerprint": null,
    "usage": {
        "prompt_tokens": 2346,
        "total_tokens": 2732,
        "completion_tokens": 386,
        "prompt_tokens_details": null
    },
    "prompt_logprobs": null,
    "kv_transfer_params": null
}

(Optional) Documentation Update

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default; only fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist bot left a comment

Code Review

This pull request primarily addresses a bug in MultiModalProfiler that caused large image requests to hang due to an incorrect encoder budget calculation. The fix correctly uses the full placeholder length, which is a sound approach. Additionally, the PR refactors the Llama 3/4 tool parser. While reviewing this refactoring, I found a critical issue where the new implementation unsafely splits JSON strings by semicolons, which can fail if a semicolon is part of a string value within the JSON. I've provided a more robust solution using json.JSONDecoder to correctly parse multiple tool calls.

Comment on lines +81 to +99
            json_str = match.group(0)
            # Split by semicolon and strip whitespace
            json_objects = [obj.strip() for obj in json_str.split(';')]

            tool_calls: list[ToolCall] = []
            for json_obj in json_objects:
                if not json_obj:  # Skip empty strings
                    continue
                obj = json.loads(json_obj)
                tool_calls.append(
                    ToolCall(
                        type="function",
                        function=FunctionCall(
                            name=obj["name"],
                            # function call args are JSON but as a string
                            arguments=json.dumps(obj["arguments"] \
                                    if "arguments" in obj \
                                    else obj["parameters"])))
                )

critical

The use of json_str.split(';') to separate multiple JSON objects is unsafe. A semicolon can legally appear within a string value inside a JSON object, which would cause this split to break the JSON and lead to parsing errors. This could result in failed or incorrect tool calls.

A more robust approach is to iteratively parse JSON objects from the string using json.JSONDecoder().raw_decode(), which correctly handles such cases.

            json_str = match.group(0)
            tool_calls: list[ToolCall] = []
            decoder = json.JSONDecoder()
            idx = 0
            while idx < len(json_str):
                # Skip whitespace and semicolons before the next JSON object
                while idx < len(json_str) and (json_str[idx].isspace() or json_str[idx] == ';'):
                    idx += 1
                if idx >= len(json_str):
                    break

                try:
                    obj, end = decoder.raw_decode(json_str, idx)
                    idx = end
                except json.JSONDecodeError:
                    # This can happen if there's trailing text after the JSONs
                    # that the regex matched. We can ignore it.
                    break

                tool_calls.append(
                    ToolCall(
                        type="function",
                        function=FunctionCall(
                            name=obj["name"],
                            # function call args are JSON but as a string
                            arguments=json.dumps(obj["arguments"] \
                                    if "arguments" in obj \
                                    else obj["parameters"])))
                )
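To see why this matters, here is a small standalone check (illustrative only, not part of the PR; the payload and tool names are made up). raw_decode recovers both tool calls even though the first one carries a semicolon inside a string value, while a naive split(';') cuts that object in half:

import json

# Hypothetical payload with a semicolon embedded in a JSON string value.
payload = '{"name": "search", "arguments": {"q": "a;b"}}; {"name": "open", "arguments": {}}'

decoder = json.JSONDecoder()
objs, idx = [], 0
while idx < len(payload):
    # Skip whitespace and semicolons between objects, as in the fix above.
    while idx < len(payload) and (payload[idx].isspace() or payload[idx] == ';'):
        idx += 1
    if idx >= len(payload):
        break
    obj, idx = decoder.raw_decode(payload, idx)
    objs.append(obj)

print([o["name"] for o in objs])  # ['search', 'open'] -- both calls parsed
print(payload.split(';')[0])      # '{"name": "search", "arguments": {"q": "a' -- broken JSON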
