[Bug]: Multistep with n>1 Fails #7968

Open
1 task done
robertgshaw2-redhat opened this issue Aug 28, 2024 · 10 comments · May be fixed by #8637
Labels: bug (Something isn't working), stale


@robertgshaw2-redhat
Collaborator

Your current environment

The output of `python collect_env.py`
Your output of `python collect_env.py` here

🐛 Describe the bug

Launched server with:

vllm serve $MODEL --num-scheduler-steps 8

Sent the following request:

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

# Completion API
stream = False
completion = client.completions.create(
    model=model,
    prompt="A robot may not injure a human being",
    echo=False,
    n=2,
    stream=stream)

print("Completion results:")
if stream:
    for c in completion:
        print(c)
else:
    print(completion)

Got the following output:

INFO:     Finished server process [1668044]
INFO 08-28 19:29:45 server.py:222] vLLM ZMQ RPC Server was interrupted.
Future exception was never retrieved
future: <Future finished exception=RuntimeError('shape mismatch: value tensor of shape [2] cannot be broadcast to indexing result of shape [1, 1]')>
Traceback (most recent call last):
  File "/home/rshaw/vllm/vllm/entrypoints/openai/rpc/server.py", line 111, in generate
    async for request_output in results_generator:
  File "/home/rshaw/vllm/vllm/engine/async_llm_engine.py", line 1050, in generate
    async for output in await self.add_request(
  File "/home/rshaw/vllm/vllm/engine/async_llm_engine.py", line 110, in generator
    raise result
  File "/home/rshaw/vllm/vllm/engine/async_llm_engine.py", line 52, in _log_task_completion
    return_value = task.result()
                   ^^^^^^^^^^^^^
  File "/home/rshaw/vllm/vllm/engine/async_llm_engine.py", line 916, in run_engine_loop
    result = task.result()
             ^^^^^^^^^^^^^
  File "/home/rshaw/vllm/vllm/engine/async_llm_engine.py", line 859, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rshaw/vllm/vllm/engine/async_llm_engine.py", line 346, in step_async
    output = await self.model_executor.execute_model_async(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rshaw/vllm/vllm/executor/gpu_executor.py", line 178, in execute_model_async
    output = await make_async(self.driver_worker.execute_model
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rshaw/.pyenv/versions/3.11.9/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rshaw/vllm/vllm/worker/worker_base.py", line 327, in execute_model
    output = self.model_runner.execute_model(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rshaw/vllm/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/rshaw/vllm/vllm/worker/multi_step_model_runner.py", line 275, in execute_model
    output = self._base_model_runner.execute_model(frozen_model_input,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rshaw/vllm/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/rshaw/vllm/vllm/worker/model_runner.py", line 1489, in execute_model
    output: SamplerOutput = self.model.sample(
                            ^^^^^^^^^^^^^^^^^^
  File "/home/rshaw/vllm/vllm/model_executor/models/llama.py", line 447, in sample
    next_tokens = self.sampler(logits, sampling_metadata)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rshaw/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rshaw/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rshaw/vllm/vllm/model_executor/layers/sampler.py", line 153, in forward
    sample_results, maybe_sampled_tokens_tensor = _sample(
                                                  ^^^^^^^^
  File "/home/rshaw/vllm/vllm/model_executor/layers/sampler.py", line 771, in _sample
    return _sample_with_torch(
           ^^^^^^^^^^^^^^^^^^^
  File "/home/rshaw/vllm/vllm/model_executor/layers/sampler.py", line 633, in _sample_with_torch
    sampled_token_ids_tensor[long_sample_indices] = \
    ~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
RuntimeError: shape mismatch: value tensor of shape [2] cannot be broadcast to indexing result of shape [1, 1]
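
The final frame is an indexed assignment whose value does not fit the slots being written. The snippet below is a minimal standalone sketch of that failure mode in plain PyTorch. It is not vLLM code: the tensor names and the `(1, 1)` output layout are assumptions inferred from the shapes in the error message (one output slot for the request, two sampled tokens because `n=2`).

```python
import torch

# Assumed layout: one row per sequence group, one token slot per row.
sampled_token_ids = torch.zeros((1, 1), dtype=torch.long)

# A single sequence group is selected for writing.
row_indices = torch.tensor([0])

# With n=2 the sampler yields two tokens for that one group.
two_samples = torch.tensor([11, 22])

try:
    # The indexing result has shape [1, 1]; a value of shape [2]
    # cannot be broadcast into it, so PyTorch raises RuntimeError.
    sampled_token_ids[row_indices] = two_samples
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)

print(error_message)
```

If this reading is right, the output tensor is sized for one token per sequence group, which would explain why `n=1` works and `n=2` does not.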

@robertgshaw2-redhat robertgshaw2-redhat added the bug Something isn't working label Aug 28, 2024
@robertgshaw2-redhat robertgshaw2-redhat changed the title from "[Bug]: Multistep with n>1 Failes" to "[Bug]: Multistep with n>1 Fails" Aug 28, 2024
@SolitaryThinker
Contributor

I will take a look later today

@tjohnson31415
Contributor

tjohnson31415 commented Sep 17, 2024

Looks like @tdoublep encountered this issue a while ago in the context of speculative decoding and has a PR with a fix (that would need to be rebased):

I also found a couple other issues for the same crash:

@robertgshaw2-redhat
Collaborator Author

cc @afeldman-nm

@m-harmonic

I'm running into the same issue. Does anyone know of a workaround? We don't need `best_of` or `use_beam_search`.

We can reproduce it using vLLM's provided benchmark_throughput.py:

This runs ok:

python benchmarks/benchmark_throughput.py --input-len=768 --output-len=256 --model=codellama/CodeLlama-7b-hf --max-model-len=1024 --num-prompts=1 --num-scheduler-steps=2 --n=1

This crashes:

python benchmarks/benchmark_throughput.py --input-len=768 --output-len=256 --model=codellama/CodeLlama-7b-hf --max-model-len=1024 --num-prompts=1 --num-scheduler-steps=2 --n=2

The error I'm getting is:

[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1633, in execute_model
[rank0]:     output: SamplerOutput = self.model.sample(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 466, in sample
[rank0]:     next_tokens = self.sampler(logits, sampling_metadata)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/sampler.py", line 274, in forward
[rank0]:     maybe_deferred_sample_results, maybe_sampled_tokens_tensor = _sample(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/sampler.py", line 879, in _sample
[rank0]:     return _sample_with_torch(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/sampler.py", line 826, in _sample_with_torch
[rank0]:     sampled_token_ids_tensor[long_sample_indices] = \
[rank0]: RuntimeError: shape mismatch: value tensor of shape [2] cannot be broadcast to indexing result of shape [1, 1]

@m-harmonic

m-harmonic commented Oct 2, 2024

@comaniac Hi, just wondering if someone working on vLLM can provide an update on this. We want to use the multi-step scheduler because its throughput is much better for our needs, but we also need to set n > 1. Simply disabling multistep in that case won't work for us. Thanks!

@comaniac
Collaborator

comaniac commented Oct 2, 2024

Sorry, we're busy with a company event (Ray Summit) until this week. We will try to find some time after the event to look into it. @SolitaryThinker, could you also take a look if you get a chance?

@robertgshaw2-redhat
Collaborator Author

@afeldman-nm has a WIP branch for this

@m-harmonic

> @afeldman-nm has a WIP branch for this

Thanks. Are you referring to the branch linked above that disables the multi-step scheduler?

@robertgshaw2-redhat
Collaborator Author

[Bugfix] Handle best_of>1 & use_beam_search by disabling multi-step scheduling. #8637

Yes, to avoid crashing the server.

We are not planning to support both multistep and beam search at the same time. Instead, we are working on rearchitecting vLLM to use asynchronous scheduling, which will achieve the same throughput goal as multistep while making it easier to support the other features.

However, if you have an idea for how to do this with multistep, feel free to open a PR.
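
The fallback described in the PR title (disable multi-step scheduling when `best_of>1` or `use_beam_search` is set) can be pictured roughly as follows. This is an illustrative sketch only, not the code in #8637. `SamplingRequest` and `effective_scheduler_steps` are hypothetical names, and the assumption that `best_of` defaults to `n` when unset is made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SamplingRequest:
    """Hypothetical stand-in for the per-request sampling options."""
    n: int = 1
    best_of: Optional[int] = None
    use_beam_search: bool = False


def effective_scheduler_steps(request: SamplingRequest,
                              configured_steps: int) -> int:
    """Fall back to single-step scheduling for unsupported requests."""
    # Assumption for this sketch: best_of defaults to n when unset.
    best_of = request.best_of if request.best_of is not None else request.n
    needs_fallback = best_of > 1 or request.use_beam_search
    if configured_steps > 1 and needs_fallback:
        return 1
    return configured_steps


print(effective_scheduler_steps(SamplingRequest(n=1), 8))
print(effective_scheduler_steps(SamplingRequest(n=2), 8))
```

A guard of this kind avoids the crash but gives up the multi-step throughput gain for exactly the `n > 1` requests discussed above.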


github-actions bot commented Jan 1, 2025

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

@github-actions github-actions bot added the stale label Jan 1, 2025
5 participants