[Core] Multi-Step + Single Step Prefills via Chunked Prefill code path #8378

varun-sundar-rabindranath · 2024-09-11T18:50:48Z

Adds support for scheduling prompts with decodes in Multi-Step. This PR uses the Chunked-Prefill code path to this effect.

Adding chunked-prompts to multi-step, in realization of a full chunked-prefill support, is a future-work and needs some investigation as it is not clear that it will have a positive performance impact.

With this PR decode sequences can run in every step and will not have to be interrupted by prefills. This has an effect of reduced TPOT

Idea:

When Chunked-Prefill is enabled with Multi-Step, each scheduler iteration executes for num_scheduler_steps. Let num_scheduler_steps be 8,
In scheduling, a sequence can be,

Single Step Prompt : These are sequences that enter the multi-step as prefills and exit as decodes. Specifically, they are processed as a prefill in the 1st step and are treated as decodes in the rest of the steps.
Example 1: If a prompt has 3 token ids, it is scheduled with a token_chunk_size of 3. Step 1, treats this sequence as a prefill and processes all the 3 tokens. Step 2 - 8, treat it like a decode with a effective token_chunk_size of 1. At the end of 8 steps, we will have processed 10 (3+7) tokens and have generated 8 tokens (1 for each step).
Decode: These are the usual decode sequences.

Both sequence types listed above have a way of being active throughout the multi-step and can be scheduled as required.

Sample logging output:

Benchmark Serving with 1000 num_prompts and inf QPS
Main with --num-scheduler-steps 8

Scheduler step 0: waiting 0, running 1
    - 1st step : total input toks: 13 |  num prompt toks 13 |  num prompt to decode toks 0 | num decode toks 0
   - Seq Types 1: P 1 + Ps 0 + D 0 
Scheduler step 1: waiting 965, running 35
    - 1st step : total input toks: 7452 |  num prompt toks 7452 |  num prompt to decode toks 0 | num decode toks 0
   - Seq Types 34: P 34 + Ps 0 + D 0 
Scheduler step 2: waiting 932, running 68
    - 1st step : total input toks: 8070 |  num prompt toks 8070 |  num prompt to decode toks 0 | num decode toks 0
   - Seq Types 33: P 33 + Ps 0 + D 0 
Scheduler step 3: waiting 896, running 93
    - 1st step : total input toks: 7947 |  num prompt toks 7947 |  num prompt to decode toks 0 | num decode toks 0
   - Seq Types 36: P 36 + Ps 0 + D 0 
Scheduler step 4: waiting 857, running 122
    - 1st step : total input toks: 8006 |  num prompt toks 8006 |  num prompt to decode toks 0 | num decode toks 0
   - Seq Types 39: P 39 + Ps 0 + D 0 
Scheduler step 5: waiting 812, running 153
    - 1st step : total input toks: 8114 |  num prompt toks 8114 |  num prompt to decode toks 0 | num decode toks 0
   - Seq Types 45: P 45 + Ps 0 + D 0 
Scheduler step 6: waiting 777, running 177
    - 1st step : total input toks: 7476 |  num prompt toks 7476 |  num prompt to decode toks 0 | num decode toks 0
   - Seq Types 35: P 35 + Ps 0 + D 0 
Scheduler step 7: waiting 739, running 200
    - 1st step : total input toks: 8170 |  num prompt toks 8170 |  num prompt to decode toks 0 | num decode toks 0
   - Seq Types 38: P 38 + Ps 0 + D 0 
Scheduler step 8: waiting 694, running 231
    - 1st step : total input toks: 8162 |  num prompt toks 8162 |  num prompt to decode toks 0 | num decode toks 0
   - Seq Types 45: P 45 + Ps 0 + D 0 
Scheduler step 9: waiting 654, running 256
    - 1st step : total input toks: 6559 |  num prompt toks 6559 |  num prompt to decode toks 0 | num decode toks 0
   - Seq Types 40: P 40 + Ps 0 + D 0 
Scheduler step 10: waiting 642, running 256
    - 1st step : total input toks: 2386 |  num prompt toks 2386 |  num prompt to decode toks 0 | num decode toks 0
   - Seq Types 12: P 12 + Ps 0 + D 0 
Scheduler step 11: waiting 629, running 256
    - 1st step : total input toks: 2056 |  num prompt toks 2056 |  num prompt to decode toks 0 | num decode toks 0
   - Seq Types 13: P 13 + Ps 0 + D 0 
Scheduler step 12: waiting 628, running 256
    - 1st step : total input toks: 289 |  num prompt toks 289 |  num prompt to decode toks 0 | num decode toks 0
   - Seq Types 1: P 1 + Ps 0 + D 0 
Scheduler step 13: waiting 625, running 256
    - 1st step : total input toks: 619 |  num prompt toks 619 |  num prompt to decode toks 0 | num decode toks 0
   - Seq Types 3: P 3 + Ps 0 + D 0 
Scheduler step 14: waiting 625, running 256
    - 1st step : total input toks: 256 |  num prompt toks 0 |  num prompt to decode toks 0 | num decode toks 256
   - Seq Types 256: P 0 + Ps 0 + D 256 
Scheduler step 22: waiting 593, running 256
    - 1st step : total input toks: 5725 |  num prompt toks 5725 |  num prompt to decode toks 0 | num decode toks 0
   - Seq Types 32: P 32 + Ps 0 + D 0 
  ....

This PR (with --num-scheduler-steps 8 --enable-chunked-prefill)

Scheduler step 0: waiting 853, running 36, finished 0, swapped 0
    - 1st step : total input toks: 7654 |  num prompt toks 0 |  num prompt to decode toks 7654 | num decode toks 0
    - Rest steps : total input toks: 36 |  num prompt toks 0 |  num prompt to decode toks 36 | num decode toks 0
   - Seq Types 36: P 0 + Ps 36 + D 0 
Scheduler step 8: waiting 932, running 53, finished 15, swapped 0
    - 1st step : total input toks: 7963 |  num prompt toks 0 |  num prompt to decode toks 7942 | num decode toks 21
    - Rest steps : total input toks: 53 |  num prompt toks 0 |  num prompt to decode toks 32 | num decode toks 21
   - Seq Types 53: P 0 + Ps 32 + D 21 
Scheduler step 16: waiting 896, running 75, finished 29, swapped 0
    - 1st step : total input toks: 8052 |  num prompt toks 0 |  num prompt to decode toks 8013 | num decode toks 39
    - Rest steps : total input toks: 75 |  num prompt toks 0 |  num prompt to decode toks 36 | num decode toks 39
   - Seq Types 75: P 0 + Ps 36 + D 39 
Scheduler step 24: waiting 856, running 95, finished 49, swapped 0
    - 1st step : total input toks: 7877 |  num prompt toks 0 |  num prompt to decode toks 7822 | num decode toks 55
    - Rest steps : total input toks: 95 |  num prompt toks 0 |  num prompt to decode toks 40 | num decode toks 55
   - Seq Types 95: P 0 + Ps 40 + D 55 
Scheduler step 32: waiting 811, running 127, finished 62, swapped 0
    - 1st step : total input toks: 8188 |  num prompt toks 0 |  num prompt to decode toks 8106 | num decode toks 82
    - Rest steps : total input toks: 127 |  num prompt toks 0 |  num prompt to decode toks 45 | num decode toks 82
   - Seq Types 127: P 0 + Ps 45 + D 82 
 ....

Implementation Details:

Handling Single Step Prompts:

The scheduler output orders the single-step prompts before the decodes (as it does usually). After the first multi-step, we merge the Single Step Prompts into Decodes with some ModelInput and AttentionMetadata updates.
The Single Step Prompts required lookahead slots allocation.

Scheduling Logic:

We try to allocate entire prompt sequences with decodes until the token-budget allows. When a full prompt sequence cannot be scheduled we exit the scheduler call and and defer the prompt's scheduling to the next scheduler call.

Benchmarks:

Machine : H100

Benchmark Throughput
Command : python3 benchmarks/benchmark_throughput.py --model meta-llama/Meta-Llama-3-8B --use-v2-block-manager --tensor-parallel-size ${tp} --gpu-memory-utilization 0.90 --num-scheduler-steps ${ms} --max-num-batched-tokens 8192 --num-prompts 1000 --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json

Branch
	Multi-Step	Chunked Prefill	Requests / Sec	Tokens / Sec
main - TP 1	8	No	44.27	18305
PR 8378 - TP 1	8	Yes	43.66	18053
main - TP 2	8	No	59.97	24801
PR 8378 - TP 2	8	Yes	60.35	24955

Benchmark Serving

Server command : python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B --port 9000 --swap-space 16 --disable-log-requests --use-v2-block-manager --tensor-parallel-size ${tp} --worker-use-ray --pipeline-parallel-size 1 --gpu-memory-utilization 0.90 --num-scheduler-steps 8 --enable-chunked-prefill --max-num-batched-tokens 8192
Client command : python3 benchmarks/benchmark_serving.py --backend vllm --model meta-llama/Meta-Llama-3-8B --dataset-name sharegpt --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --port 9000 --num-prompts 1000 --request-rate ${qps}

####### Config : TP = 1 , MS = 8

main

============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  23.41     
Total input tokens:                      215196    
Total generated tokens:                  128169    
Request throughput (req/s):              42.72     
Output token throughput (tok/s):         5475.78   
Total Token throughput (tok/s):          14669.62  
---------------Time to First Token----------------
Mean TTFT (ms):                          4627.92   
Median TTFT (ms):                        3641.24   
P99 TTFT (ms):                           15246.19  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          42.38     
Median TPOT (ms):                        32.78     
P99 TPOT (ms):                           208.42    
---------------Inter-token Latency----------------
Mean ITL (ms):                           227.65    
Median ITL (ms):                         232.94    
P99 ITL (ms):                            642.93    
==================================================

This PR with --enable-chunked-prefill --max-num-batched-tokens 8192

============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  22.02     
Total input tokens:                      215196    
Total generated tokens:                  128079    
Request throughput (req/s):              45.41     
Output token throughput (tok/s):         5815.49   
Total Token throughput (tok/s):          15586.56  
---------------Time to First Token----------------
Mean TTFT (ms):                          4685.13   
Median TTFT (ms):                        4024.45   
P99 TTFT (ms):                           14324.25  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          22.01     
Median TPOT (ms):                        25.09     
P99 TPOT (ms):                           34.38     
---------------Inter-token Latency----------------
Mean ITL (ms):                           192.59    
Median ITL (ms):                         211.36    
P99 ITL (ms):                            337.90    
==================================================

####### Config : TP = 2, MS = 8
main

============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  19.96     
Total input tokens:                      215196    
Total generated tokens:                  126602    
Request throughput (req/s):              50.11     
Output token throughput (tok/s):         6343.54   
Total Token throughput (tok/s):          17126.18  
---------------Time to First Token----------------
Mean TTFT (ms):                          3899.19   
Median TTFT (ms):                        3045.08   
P99 TTFT (ms):                           13019.69  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          35.41     
Median TPOT (ms):                        28.81     
P99 TPOT (ms):                           162.29    
---------------Inter-token Latency----------------
Mean ITL (ms):                           195.12    
Median ITL (ms):                         178.06    
P99 ITL (ms):                            484.17    
==================================================

This PR with --enable-chunked-prefill --max-num-batched-tokens 8192

============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  20.17     
Total input tokens:                      215196    
Total generated tokens:                  126059    
Request throughput (req/s):              49.57     
Output token throughput (tok/s):         6249.33   
Total Token throughput (tok/s):          16917.60  
---------------Time to First Token----------------
Mean TTFT (ms):                          4297.16   
Median TTFT (ms):                        3625.66   
P99 TTFT (ms):                           13590.24  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          21.16     
Median TPOT (ms):                        24.71     
P99 TPOT (ms):                           29.83     
---------------Inter-token Latency----------------
Mean ITL (ms):                           188.39    
Median ITL (ms):                         235.91    
P99 ITL (ms):                            288.30    
==================================================

Note : Since the prefills are scheduled together with decodes, may multi-steps run in eager-mode. This offsets some performance benefits. #8645 , based out of this PR, fixes this. Please look at #8645 PR description for performance numbers with the cuda-graph changes.

Next Steps (future PRs) :

~~Support TP~~
Support PP
~~Support LLMEngine class (at the moment only supports AsyncLLMEngine)~~
Support CUDA-Graph when decodes are scheduled with prompts. At the moment, all the steps in multi-step runs in eager-mode when prompts are scheduled with decodes. But except for the first step, all the other steps can run in cuda-graph mode. done in [Core] CUDA Graphs for Multi-Step + Chunked-Prefill #8645
Investigate performance with chunked-prompts.

PR Checklist (Click to Expand)

Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:

[Bugfix] for bug fixes.
[CI/Build] for build or continuous integration improvements.
[Doc] for documentation fixes and improvements.
[Model] for adding a new model or improving an existing model. Model name should appear in the title.
[Frontend] For changes on the vLLM frontend (e.g., OpenAI API server, LLM class, etc.)
[Kernel] for changes affecting CUDA kernels or other compute kernels.
[Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.)
[Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
[Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR need to meet the following code quality standards:

We adhere to Google Python style guide and Google C++ style guide.
Pass all linter checks. Please use format.sh to format your code.
The code need to be well-documented to ensure future contributors can easily understand the code.
Include sufficient tests to ensure the project to stay correct and robust. This includes both unit tests and integration tests.
Please add documentation to docs/source/ if the PR modifies the user-facing behaviors of vLLM. It helps vLLM user understand and utilize the new features or changes.

Notes for Large Changes

Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with rfc-required and might not go through the PR.

What to Expect for the Reviews

The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feel confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:

After the PR is submitted, the PR will be assigned to a reviewer. Every reviewer will pick up the PRs based on their expertise and availability.
After the PR is assigned, the reviewer will provide status update every 2-3 days. If the PR is not reviewed within 7 days, please feel free to ping the reviewer or the vLLM team.
After the review, the reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.
Please respond to all comments within a reasonable time frame. If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion.

Thank You

Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!

github-actions · 2024-09-11T18:51:08Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

comaniac

First batch of comments for tests and scheduler

tests/multi_step/test_correctness_async_llm.py

vllm/core/block/block_table.py

vllm/core/scheduler.py

vllm/worker/model_runner.py

vllm/worker/multi_step_model_runner.py

vllm/engine/llm_engine.py

sam-h-bean · 2024-09-17T19:59:20Z

Just FYI @varun-sundar-rabindranath I went ahead and built an image from this branch so I could load test it and while I am able to get SOTA QPS for my setup (very long inputs ~4k tokens) the server quickly crashes with this message

ERROR 09-17 12:32:53 async_llm_engine.py:58] Engine background task failed
ERROR 09-17 12:32:53 async_llm_engine.py:58] Traceback (most recent call last):
ERROR 09-17 12:32:53 async_llm_engine.py:58]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 48, in _log_task_completion
ERROR 09-17 12:32:53 async_llm_engine.py:58]     return_value = task.result()
ERROR 09-17 12:32:53 async_llm_engine.py:58]                    ^^^^^^^^^^^^^
ERROR 09-17 12:32:53 async_llm_engine.py:58]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 736, in run_engine_loop
ERROR 09-17 12:32:53 async_llm_engine.py:58]     result = task.result()
ERROR 09-17 12:32:53 async_llm_engine.py:58]              ^^^^^^^^^^^^^
ERROR 09-17 12:32:53 async_llm_engine.py:58]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 676, in engine_step
ERROR 09-17 12:32:53 async_llm_engine.py:58]     request_outputs = await self.engine.step_async(virtual_engine)
ERROR 09-17 12:32:53 async_llm_engine.py:58]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-17 12:32:53 async_llm_engine.py:58]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 340, in step_async
ERROR 09-17 12:32:53 async_llm_engine.py:58]     outputs = await self.model_executor.execute_model_async(
ERROR 09-17 12:32:53 async_llm_engine.py:58]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-17 12:32:53 async_llm_engine.py:58]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/gpu_executor.py", line 185, in execute_model_async
ERROR 09-17 12:32:53 async_llm_engine.py:58]     output = await make_async(self.driver_worker.execute_model
ERROR 09-17 12:32:53 async_llm_engine.py:58]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-17 12:32:53 async_llm_engine.py:58]   File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
ERROR 09-17 12:32:53 async_llm_engine.py:58]     result = self.fn(*self.args, **self.kwargs)
ERROR 09-17 12:32:53 async_llm_engine.py:58]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-17 12:32:53 async_llm_engine.py:58]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 327, in execute_model
ERROR 09-17 12:32:53 async_llm_engine.py:58]     output = self.model_runner.execute_model(
ERROR 09-17 12:32:53 async_llm_engine.py:58]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-17 12:32:53 async_llm_engine.py:58]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 09-17 12:32:53 async_llm_engine.py:58]     return func(*args, **kwargs)
ERROR 09-17 12:32:53 async_llm_engine.py:58]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 09-17 12:32:53 async_llm_engine.py:58]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/multi_step_model_runner.py", line 464, in execute_model
ERROR 09-17 12:32:53 async_llm_engine.py:58]     model_input = self._advance_step(
ERROR 09-17 12:32:53 async_llm_engine.py:58]                   ^^^^^^^^^^^^^^^^^^^
ERROR 09-17 12:32:53 async_llm_engine.py:58]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/multi_step_model_runner.py", line 586, in _advance_step
ERROR 09-17 12:32:53 async_llm_engine.py:58]     attn_metadata.advance_step(
ERROR 09-17 12:32:53 async_llm_engine.py:58]   File "/usr/local/lib/python3.12/dist-packages/vllm/attention/backends/flash_attn.py", line 402, in advance_step
ERROR 09-17 12:32:53 async_llm_engine.py:58]     ops.advance_step_flashattn(num_seqs=num_seqs,
ERROR 09-17 12:32:53 async_llm_engine.py:58]   File "/usr/local/lib/python3.12/dist-packages/vllm/_custom_ops.py", line 32, in wrapper
ERROR 09-17 12:32:53 async_llm_engine.py:58]     return fn(*args, **kwargs)
ERROR 09-17 12:32:53 async_llm_engine.py:58]            ^^^^^^^^^^^^^^^^^^^
ERROR 09-17 12:32:53 async_llm_engine.py:58]   File "/usr/local/lib/python3.12/dist-packages/vllm/_custom_ops.py", line 198, in advance_step_flashattn
ERROR 09-17 12:32:53 async_llm_engine.py:58]     return torch.ops._C.advance_step_flashattn(num_seqs, num_queries,
ERROR 09-17 12:32:53 async_llm_engine.py:58]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-17 12:32:53 async_llm_engine.py:58]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1061, in __call__
ERROR 09-17 12:32:53 async_llm_engine.py:58]     return self_._op(*args, **(kwargs or {}))
ERROR 09-17 12:32:53 async_llm_engine.py:58]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-17 12:32:53 async_llm_engine.py:58] RuntimeError: tensor: name = sampled_token_ids, shape = [55, 1] is_cont = 1, type = long int is not as expected: shape = [56, 1], type = Long
Exception in callback functools.partial(<function _log_task_completion at 0x7a5cb81df060>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7a5cac0846b0>>)
handle: <Handle functools.partial(<function _log_task_completion at 0x7a5cb81df060>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7a5cac0846b0>>)>

Hoping this is useful for your development! Can't wait for this to land in a stable release.

varun-sundar-rabindranath · 2024-09-17T22:41:27Z

Hey @sam-h-bean thanks for testing this out ! Can you please share the commands you used to test ? It'd help get a repo quickly.

sam-h-bean · 2024-09-18T02:39:22Z

@varun-sundar-rabindranath the script was a garden variety locust test

locust -f vllm_server.py --headless --master --expect-workers=1 -r 2 -u 200 -t 3m --host http://vllm-sft-service.vllm-sft:8000

the invocation is just a post against /v1/chat/completions

and the k8s config is also pretty standard

      containers:
      - name: vllm-sft-container
        image: {{ .Values.custom_vllm_image }}
        args:
        - "--model"
        - "{{ .model }}"
        - "--served-model-name"
        - "sft-llama"
        - "--disable-log-requests"
        - "--allow-credentials"
        - "--enable-prefix-caching"
        - "--enable-chunked-prefill"
        - "--max-num-batched-tokens"
        - "32768"
        - "--num-scheduler-steps"
        - "10"
        - "--gpu-memory-utilization"
        - "0.95"
        - "--tensor-parallel-size"
        - "{{ index .resources.limits "nvidia.com/gpu" }}"
        {{- if .extraArgs }}
        {{- range .extraArgs }}
        - "{{ . }}"
        {{- end }}
        {{- end }}

I will note that this doesn't show up until we get close to maximum load...

Sorry I can't share more. Hopefully you can still glean some useful debug info from this

varun-sundar-rabindranath · 2024-09-18T03:06:27Z

Thanks for sharing @sam-h-bean 👍 I'll check it out !
[edit] I noticed you use --enable-prefix-caching with --enable-chunked-prefill - I haven't tested them together as the PR only adds supports for Chunked-Prefill with Multi-Step (--num-scheduler-steps). That could be a follow up once this lands.
Also, I put in some performance bug fixes for --tensor-parallel-size > 1 recently.

Can you please try without --enable-prefix-caching to see if that solves the issue ? Thanks.

sam-h-bean · 2024-09-18T16:30:18Z

Thanks for sharing @sam-h-bean 👍 I'll check it out ! [edit] I noticed you use --enable-prefix-caching with --enable-chunked-prefill - I haven't tested them together as the PR only adds supports for Chunked-Prefill with Multi-Step (--num-scheduler-steps). That could be a follow up once this lands. Also, I put in some performance bug fixes for --tensor-parallel-size > 1 recently.

Can you please try without --enable-prefix-caching to see if that solves the issue ? Thanks.

this did get the error to go away!

sam-h-bean · 2024-09-18T17:07:58Z

Thanks for sharing @sam-h-bean 👍 I'll check it out ! [edit] I noticed you use --enable-prefix-caching with --enable-chunked-prefill - I haven't tested them together as the PR only adds supports for Chunked-Prefill with Multi-Step (--num-scheduler-steps). That could be a follow up once this lands. Also, I put in some performance bug fixes for --tensor-parallel-size > 1 recently.

Can you please try without --enable-prefix-caching to see if that solves the issue ? Thanks.

disabling prefix caching fixes the issue!

varun-sundar-rabindranath · 2024-09-18T18:34:01Z

vllm/attention/backends/abstract.py

+    class CountsUpdate:
+        num_prefills: int
+        num_prefill_tokens: int
+        num_decode_tokens: int


This data structure could be expanded when more sophisticated techniques are used (like inserting prefills at arbitrary steps)

I'm a bit concern about this design:

It's extremely confusing when someone came to read AttentionMetadata for the first time. At least we need better naming for both CountUpdate and its attributes.

We may not need to embed this dataclass in AttentionMetadata. After all, AttentionMetadata doesn't have count_update. It seems to me that CountUpdate is used to update AttentionMetadata instead of a part of its attribute.

@comaniac I removed this data structure. Instead introduced an argument turn_prefills_into_decode to the advance_step function.

sam-h-bean · 2024-09-18T18:44:37Z

@varun-sundar-rabindranath I am running into other issues with a similar setup

INFO 09-18 11:41:30 server.py:228] vLLM ZMQ RPC Server was interrupted.
Future exception was never retrieved
future: <Future finished exception=IndexError('list index out of range')>
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
    async for request_output in results_generator:
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 862, in generate
    async for output in await self.add_request(
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 106, in generator
    raise result
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
    async for request_output in results_generator:
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 862, in generate
    async for output in await self.add_request(
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 106, in generator
    raise result
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
    async for request_output in results_generator:
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 862, in generate
    async for output in await self.add_request(
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 106, in generator
    raise result
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 48, in _log_task_completion
    return_value = task.result()
                   ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 736, in run_engine_loop
    result = task.result()
             ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 676, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 370, in step_async
    step_num=self._current_step(seq_group_metadata_list) - 1)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1356, in _current_step
    current_step = seq_group_metadata_list[0].state.current_step
                   ~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range

with this k8s setup

        - "--model"
        - "neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8"
        - "--served-model-name"
        - "cai-llama"
        - "--disable-log-requests"
        - "--allow-credentials"
        - "--enable-chunked-prefill"
        - "--max-model-len"
        - "8000"
        - "--num-scheduler-steps"
        - "10"
        - "--quantization"
        - "compressed-tensors"
        - "--tensor-parallel-size"
        - "{{ index .resources.limits "nvidia.com/gpu" }}"

curious if it is the combination of this experimental config and fp8 quantization

varun-sundar-rabindranath · 2024-09-18T19:47:38Z

@varun-sundar-rabindranath I am running into other issues with a similar setup

INFO 09-18 11:41:30 server.py:228] vLLM ZMQ RPC Server was interrupted.
Future exception was never retrieved
future: <Future finished exception=IndexError('list index out of range')>
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
    async for request_output in results_generator:
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 862, in generate
    async for output in await self.add_request(
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 106, in generator
    raise result
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
    async for request_output in results_generator:
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 862, in generate
    async for output in await self.add_request(
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 106, in generator
    raise result
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
    async for request_output in results_generator:
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 862, in generate
    async for output in await self.add_request(
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 106, in generator
    raise result
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 48, in _log_task_completion
    return_value = task.result()
                   ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 736, in run_engine_loop
    result = task.result()
             ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 676, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 370, in step_async
    step_num=self._current_step(seq_group_metadata_list) - 1)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1356, in _current_step
    current_step = seq_group_metadata_list[0].state.current_step
                   ~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range

with this k8s setup

        - "--model"
        - "neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8"
        - "--served-model-name"
        - "cai-llama"
        - "--disable-log-requests"
        - "--allow-credentials"
        - "--enable-chunked-prefill"
        - "--max-model-len"
        - "8000"
        - "--num-scheduler-steps"
        - "10"
        - "--quantization"
        - "compressed-tensors"
        - "--tensor-parallel-size"
        - "{{ index .resources.limits "nvidia.com/gpu" }}"

curious if it is the combination of this experimental config and fp8 quantization

@sam-h-bean - I have been updating this PR with fixes - I fixed this particular issue this morning. Can you pull the PR again and try. Sorry about the trouble and thanks for testing 🙌

sam-h-bean · 2024-09-18T21:38:31Z

Thanks for sharing @sam-h-bean 👍 I'll check it out ! [edit] I noticed you use --enable-prefix-caching with --enable-chunked-prefill - I haven't tested them together as the PR only adds supports for Chunked-Prefill with Multi-Step (--num-scheduler-steps). That could be a follow up once this lands. Also, I put in some performance bug fixes for --tensor-parallel-size > 1 recently.
Can you please try without --enable-prefix-caching to see if that solves the issue ? Thanks.

disabling prefix caching fixes the issue!

I might suggest throwing an error at startup time if someone tries enabling prefix caching + prefill + scheduler steps in this case

vllm/engine/arg_utils.py

varun-sundar-rabindranath · 2024-09-18T21:43:11Z

Thanks for sharing @sam-h-bean 👍 I'll check it out ! [edit] I noticed you use --enable-prefix-caching with --enable-chunked-prefill - I haven't tested them together as the PR only adds supports for Chunked-Prefill with Multi-Step (--num-scheduler-steps). That could be a follow up once this lands. Also, I put in some performance bug fixes for --tensor-parallel-size > 1 recently.
Can you please try without --enable-prefix-caching to see if that solves the issue ? Thanks.

disabling prefix caching fixes the issue!

I might suggest throwing an error at startup time if someone tries enabling prefix caching + prefill + scheduler steps in this case

Yup. Added in this commit.

sam-h-bean · 2024-09-18T23:18:47Z

Seeing some new interersting behavior once I pulled your latest and reran the load test. Seems that requests just stack up in pending after a few get through. I get 3 completions then it just hangs

INFO 09-18 16:10:53 metrics.py:351] Avg prompt throughput: 67.0 tokens/s, Avg generation throughput: 0.4 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO:     10.0.0.99:58210 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     10.0.0.99:58222 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     10.0.0.99:58224 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 09-18 16:10:58 metrics.py:351] Avg prompt throughput: 70.6 tokens/s, Avg generation throughput: 5.6 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 12 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     10.64.214.1:45782 - "GET /health HTTP/1.1" 200 OK
INFO:     10.65.89.48:37148 - "GET /metrics HTTP/1.1" 200 OK
INFO:     10.64.145.165:48936 - "GET /metrics HTTP/1.1" 200 OK
INFO 09-18 16:11:03 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 22 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-18 16:11:08 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 32 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     10.64.214.1:37464 - "GET /health HTTP/1.1" 200 OK
INFO 09-18 16:11:13 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 42 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-18 16:11:18 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 52 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     10.64.214.1:39902 - "GET /health HTTP/1.1" 200 OK
INFO 09-18 16:11:23 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 62 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-18 16:11:28 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 72 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     10.64.214.1:37332 - "GET /health HTTP/1.1" 200 OK
INFO:     10.65.89.48:40538 - "GET /metrics HTTP/1.1" 200 OK
INFO:     10.64.145.165:40110 - "GET /metrics HTTP/1.1" 200 OK
INFO 09-18 16:11:33 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 82 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-18 16:11:38 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 92 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     10.64.214.1:33236 - "GET /health HTTP/1.1" 200 OK
INFO 09-18 16:11:43 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 102 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-18 16:11:48 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 112 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     10.64.214.1:53916 - "GET /health HTTP/1.1" 200 OK
INFO 09-18 16:11:53 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 122 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-18 16:11:58 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 132 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     10.64.214.1:60554 - "GET /health HTTP/1.1" 200 OK
INFO:     10.65.89.48:43442 - "GET /metrics HTTP/1.1" 200 OK
INFO:     10.64.145.165:55592 - "GET /metrics HTTP/1.1" 200 OK
INFO 09-18 16:12:03 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 142 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-18 16:12:08 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 152 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     10.64.214.1:47454 - "GET /health HTTP/1.1" 200 OK
INFO 09-18 16:12:13 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 162 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-18 16:12:18 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 172 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     10.64.214.1:48256 - "GET /health HTTP/1.1" 200 OK
INFO 09-18 16:12:23 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 182 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-18 16:12:28 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 192 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     10.64.214.1:44174 - "GET /health HTTP/1.1" 200 OK
INFO:     10.65.89.48:54242 - "GET /metrics HTTP/1.1" 200 OK
INFO:     10.64.145.165:58168 - "GET /metrics HTTP/1.1" 200 OK
INFO 09-18 16:12:33 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 200 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-18 16:12:38 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 200 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     10.64.214.1:47416 - "GET /health HTTP/1.1" 200 OK
INFO 09-18 16:12:43 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 200 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 09-18 16:12:48 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 200 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.

sam-h-bean · 2024-09-19T03:40:27Z

I pulled a trace during the hanging load test if it is of help.
VLLM_TRACE_FUNCTION2.log

varun-sundar-rabindranath · 2024-09-19T04:07:44Z

Thanks for sharing the trace @sam-h-bean, Ill take a look.

Also I pushed some changes based on what I thought likely was happening - When a input prompt length is > the user set --max-num-batched-tokens the PR stalls because it thinks it has scheduled enough input tokens. But in reality, it could never schedule that sequence. One way to get around this is to increase the --max-num-batched-tokens. However, I pushed a change to ignore such sequences with a user message and continue with the rest of the sequences. Can you try it out please ?

The correct way to handle this is to process the prompt in multiple chunks. I am working on it now.

sam-h-bean · 2024-09-19T04:10:23Z

Thanks for sharing the trace @sam-h-bean, Ill take a look.

Also I pushed some changes based on what I thought likely was happening - When a input prompt length is > the user set --max-num-batched-tokens the PR stalls because it thinks it has scheduled enough input tokens. But in reality, it could never schedule that sequence. One way to get around this is to increase the --max-num-batched-tokens. However, I pushed a change to ignore such sequences with a user message and continue with the rest of the sequences. Can you try it out please ?

The correct way to handle this is to process the prompt in multiple chunks. I am working on it now.

I did indeed see this problem go away when I set max-num-batched-tokens so you are probably right!

comaniac

The code looks much cleaner now. Thanks for the efforts!
My last batch of comments. Mostly nits and refactoring. Will approve after they are addressed.

Also leave to @SolitaryThinker and @alexm-neuralmagic to review and sign off

vllm/core/scheduler.py

vllm/engine/llm_engine.py

vllm/model_executor/sampling_metadata.py

comaniac · 2024-09-25T21:39:05Z

vllm/worker/multi_step_model_runner.py

+def _get_supported_attention_backends(chunked_prefill_enabled: bool) \
+    -> List[str]:
+    if chunked_prefill_enabled:
+        return MULTI_STEP_CHUNKED_PREFILL_ATTENTION_BACKENDS
+    else:
+        return MULTI_STEP_ATTENTION_BACKENDS


Do we really need this? IIRC now only flash attention in MULTI_STEP_ATTENTION_BACKENDS supports chunked prefill, so I suppose if users specify the other backend with chunked prefill and multi-step, it will error out from the attention backend first.

The very first scheduler call, schedules all prefills.

Without this check, the first-step in multi-step will run fine. In the second step, in advance_step(), we will hit the assert here

vllm/vllm/attention/backends/flashinfer.py

Line 424 in f4a2886

assert not turn_prefills_into_decodes, \

With this check, we instead raise a ValueError during the initialization itself.

The check here seems better. What do you think ?

Hmm for this case I'd prefer to raise exception even before the engine is initialized (when loading the config), but I don't have a strong preference on this

vllm/worker/multi_step_model_runner.py

comaniac · 2024-09-25T22:00:57Z

vllm/model_executor/sampling_metadata.py

    ) -> None:
        self.seq_groups = seq_groups
        self.selected_token_indices = selected_token_indices
        self.categorized_sample_indices = categorized_sample_indices
        self.num_prompts = num_prompts
        self.skip_sampler_cpu_output = skip_sampler_cpu_output
        self.reuse_sampling_tensors = reuse_sampling_tensors
+        self.selected_token_indices_multistep = selected_token_indices_multistep
+
+    def prepare_multistep_tensors(self, num_queries: int, device: str,


I now understand the purpose of this function. Although it simply just creates a list(range(num_queries)), you kicked off this function asyncly to hide the latency in advance_step. If possible, can we consider the following alternative to simply the code here:

Inline prepare_multistep_tensors to advance_step, but still do async transfer.

[Need check] I suppose we don't need a barrier for self.selected_token_indices, because the kernel that uses this tensor will need to wait for it.

Yeah. I believe we can move this into multi_step_model_runner's advance_step and localize the changes to that file.

alexm-redhat

@varun-sundar-rabindranath went over the PR in detail. LGTM. Left some nit comments.

alexm-redhat · 2024-09-20T15:46:46Z

csrc/prepare_inputs/advance_step.cu

@@ -211,7 +211,7 @@ void advance_step_flashinfer(
    printf("  num_seqs = %d\n", num_seqs);
    printf("  num_queries = %d\n", num_queries);
    printf("  block_size = %d\n", block_size);
-    printf("  block_tables.stride(0) = %d\n", block_tables.stride(0));
+    printf("  block_tables.stride(0) = %ld\n", block_tables.stride(0));


good catch!

alexm-redhat · 2024-09-27T15:38:25Z

vllm/config.py

+                    # for now. Have max_num_batched_tokens set to max_model_len
+                    # so we don't reject sequences on account of a short
+                    # max_num_batched_tokens.
+                    max_num_batched_tokens = max(max_model_len, 2048)


Why not simply set max_num_batched_tokens = max_model_len? What's the reason for 2048 here?

There is an else part to if enable_chunked_prefill - there we do,

# If max_model_len is too short, use 2048 as the default value # for higher throughput. max_num_batched_tokens = max(max_model_len, 2048)

I replicated the same. This argument is the token-buget in the Scheduler. I believe it is so we can schedule more prefills and not be limited by the small max_model_len value.

alexm-redhat · 2024-09-27T15:44:02Z

vllm/core/scheduler.py

            infeasible_seq_groups=infeasible_seq_groups,
        )

    def _get_prompt_limit(self, seq_group: SequenceGroup) -> int:
-        if self.scheduler_config.chunked_prefill_enabled:
+        if self.scheduler_config.chunked_prefill_enabled and \


If you would have max_num_batched_tokens = max_model_len, then this separation here won't be necessary.

Yes. But we only set max_num_batched_tokens = max_model_len when the user does not specify any max_num_batched_tokens. This check here protects against the case when the user sets max_num_batched_tokens < max_model_len.

Makes sense now

alexm-redhat · 2024-09-27T15:51:32Z

vllm/core/scheduler.py

-        is_prefill = False
+        is_prefill = seq_group.is_prefill()
+
+        # Appending prefill slots only happens chunked prefill is enabled.


is it only when chunked + multistep or just chunked?

True. It is only when chunked-prefill + multi-step - I made the assert stronger.

alexm-redhat

LGTM from my side

comaniac

LGTM. Thanks!

SolitaryThinker

Thanks

LiuXiaoxuanPKU · 2024-09-29T22:16:29Z

QQ: what's the definition of num_computed_tokens? For example, given a prompt [1,2,3,4,5], after the prefill phase (after process_output), one new token is generated, we get [1,2,3,4,5,6]
Before this PR: num_computed_tokens=5
After this PR: num_computed_tokens=6
I guess we should set num_computed_tokens to 5 as the last token is sampled not computed. Please let me know if my understanding is wrong.
@varun-sundar-rabindranath @comaniac

varun-sundar-rabindranath · 2024-09-29T23:18:06Z

QQ: what's the definition of num_computed_tokens? For example, given a prompt [1,2,3,4,5], after the prefill phase (after process_output), one new token is generated, we get [1,2,3,4,5,6] Before this PR: num_computed_tokens=5 After this PR: num_computed_tokens=6 I guess we should set num_computed_tokens to 5 as the last token is sampled not computed. Please let me know if my understanding is wrong. @varun-sundar-rabindranath @comaniac

Hi @LiuXiaoxuanPKU . I believe you are right! I looked up the definition of num_computed_tokens,

    def get_num_computed_tokens(self) -> int:
        """Return the number of prefill tokens that are already computed."""
        return self._num_computed_tokens

I misunderstood it as including both prefills and decode. I'll put up a PR.

comaniac · 2024-09-29T23:18:44Z

QQ: what's the definition of num_computed_tokens? For example, given a prompt [1,2,3,4,5], after the prefill phase (after process_output), one new token is generated, we get [1,2,3,4,5,6]

Before this PR: num_computed_tokens=5

After this PR: num_computed_tokens=6

I guess we should set num_computed_tokens to 5 as the last token is sampled not computed. Please let me know if my understanding is wrong.

@varun-sundar-rabindranath @comaniac

I guess this is related to my previous comment about adding computed_tokens by 1. @varun-sundar-rabindranath could you clarify?

varun-sundar-rabindranath · 2024-09-30T00:18:37Z

@LiuXiaoxuanPKU @comaniac I have a PR #8950 up with a fix that reverts the updates. My bad that I totally misunderstood the semantics of num_computed_tokens. Sorry for the inconvenience! Thanks @LiuXiaoxuanPKU for catching this !

vllm-project#8378) Co-authored-by: Varun Sundar Rabindranath <[email protected]>

* [Build/CI] Upgrade to gcc 10 in the base build Docker image (vllm-project#8814) * [Docs] Add README to the build docker image (vllm-project#8825) * [CI/Build] Fix missing ci dependencies (vllm-project#8834) * [misc][installation] build from source without compilation (vllm-project#8818) * [ci] Soft fail Entrypoints, Samplers, LoRA, Decoder-only VLM (vllm-project#8872) Signed-off-by: kevin <[email protected]> * [Bugfix] Include encoder prompts len to non-stream api usage response (vllm-project#8861) * [Misc] Change dummy profiling and BOS fallback warns to log once (vllm-project#8820) * [Bugfix] Fix print_warning_once's line info (vllm-project#8867) * fix validation: Only set tool_choice `auto` if at least one tool is provided (vllm-project#8568) * [Bugfix] Fixup advance_step.cu warning (vllm-project#8815) * [BugFix] Fix test breakages from transformers 4.45 upgrade (vllm-project#8829) * [Installation] Allow lower versions of FastAPI to maintain Ray 2.9 compatibility (vllm-project#8764) * [Feature] Add support for Llama 3.1 and 3.2 tool use (vllm-project#8343) Signed-off-by: Max de Bayser <[email protected]> * [Core] rename`PromptInputs` and `inputs` (vllm-project#8876) * [misc] fix collect env (vllm-project#8894) * [MISC] Fix invalid escape sequence '\' (vllm-project#8830) Signed-off-by: Peter Pan <[email protected]> * [Bugfix][VLM] Fix Fuyu batching inference with `max_num_seqs>1` (vllm-project#8892) * [TPU] Update pallas.py to support trillium (vllm-project#8871) * [torch.compile] use empty tensor instead of None for profiling (vllm-project#8875) * [Kernel] AQ AZP 4/4: Integrate asymmetric quantization to linear method (vllm-project#7271) * [Bugfix] fix for deepseek w4a16 (vllm-project#8906) Co-authored-by: mgoin <[email protected]> * [Core] Multi-Step + Single Step Prefills via Chunked Prefill code path (vllm-project#8378) Co-authored-by: Varun Sundar Rabindranath <[email protected]> * [misc][distributed] add VLLM_SKIP_P2P_CHECK flag (vllm-project#8911) * [Core] Priority-based scheduling in async engine (vllm-project#8850) * [misc] fix wheel name (vllm-project#8919) * [Bugfix][Intel] Fix XPU Dockerfile Build (vllm-project#7824) Signed-off-by: tylertitsworth <[email protected]> Co-authored-by: youkaichao <[email protected]> * [Misc] Remove vLLM patch of `BaichuanTokenizer` (vllm-project#8921) * [Bugfix] Fix code for downloading models from modelscope (vllm-project#8443) * [Bugfix] Fix PP for Multi-Step (vllm-project#8887) * [CI/Build] Update models tests & examples (vllm-project#8874) Co-authored-by: Roger Wang <[email protected]> * [Frontend] Make beam search emulator temperature modifiable (vllm-project#8928) Co-authored-by: Eduard Balzin <[email protected]> * [Bugfix] Support testing prefill throughput with benchmark_serving.py --hf-output-len 1 (vllm-project#8891) * [doc] organize installation doc and expose per-commit docker (vllm-project#8931) * [Core] Improve choice of Python multiprocessing method (vllm-project#8823) Signed-off-by: Russell Bryant <[email protected]> Co-authored-by: youkaichao <[email protected]> * [Bugfix] Block manager v2 with preemption and lookahead slots (vllm-project#8824) * [Bugfix] Fix Marlin MoE act order when is_k_full == False (vllm-project#8741) Co-authored-by: Tyler Michael Smith <[email protected]> * [CI/Build] Add test decorator for minimum GPU memory (vllm-project#8925) * [Build/CI] Set FETCHCONTENT_BASE_DIR to one location for better caching (vllm-project#8930) * [Model] Support Qwen2.5-Math-RM-72B (vllm-project#8896) * [Model][LoRA]LoRA support added for MiniCPMV2.5 (vllm-project#7199) * [BugFix] Fix seeded random sampling with encoder-decoder models (vllm-project#8870) Co-authored-by: Roger Wang <[email protected]> * [Misc] Fix typo in BlockSpaceManagerV1 (vllm-project#8944) * [Frontend] Added support for HF's new `continue_final_message` parameter (vllm-project#8942) * [Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels and Jamba model (vllm-project#8533) * [Model] support input embeddings for qwen2vl (vllm-project#8856) * [Misc][CI/Build] Include `cv2` via `mistral_common[opencv]` (vllm-project#8951) * [Model][LoRA]LoRA support added for MiniCPMV2.6 (vllm-project#8943) Co-authored-by: DarkLight1337 <[email protected]> * [Model] Expose InternVL2 max_dynamic_patch as a mm_processor_kwarg (vllm-project#8946) * [Core] Make scheduling policy settable via EngineArgs (vllm-project#8956) * [Misc] Adjust max_position_embeddings for LoRA compatibility (vllm-project#8957) * [ci] Add CODEOWNERS for test directories (vllm-project#8795) Signed-off-by: kevin <[email protected]> * [CI][SpecDecode] Fix spec decode tests, use flash attention backend for spec decode CI tests. (vllm-project#8975) * [Frontend][Core] Move guided decoding params into sampling params (vllm-project#8252) Signed-off-by: Joe Runde <[email protected]> Co-authored-by: Nick Hill <[email protected]> * [CI/Build] Fix machete generated kernel files ordering (vllm-project#8976) Signed-off-by: kevin <[email protected]> Co-authored-by: Cody Yu <[email protected]> * [torch.compile] fix tensor alias (vllm-project#8982) * [Misc] add process_weights_after_loading for DummyLoader (vllm-project#8969) * [Bugfix] Fix Fuyu tensor parallel inference (vllm-project#8986) * [Bugfix] Fix Token IDs Reference for MiniCPM-V When Images are Provided With No Placeholders (vllm-project#8991) Signed-off-by: Alex-Brooks <[email protected]> * [Core] [Frontend] Priority scheduling for embeddings and in the OpenAI-API (vllm-project#8965) * [Doc] Update list of supported models (vllm-project#8987) * Update benchmark_serving.py to read and write json-datasets, results in UTF8, for better compatibility with Windows (vllm-project#8997) * [Spec Decode] (1/2) Remove batch expansion (vllm-project#8839) * [Core] Combined support for multi-step scheduling, chunked prefill & prefix caching (vllm-project#8804) Co-authored-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Andrew Feldman <[email protected]> * [Misc] Update Default Image Mapper Error Log (vllm-project#8977) Signed-off-by: Alex-Brooks <[email protected]> Co-authored-by: Roger Wang <[email protected]> * [Core] CUDA Graphs for Multi-Step + Chunked-Prefill (vllm-project#8645) Co-authored-by: Varun Sundar Rabindranath <[email protected]> * [OpenVINO] Enable GPU support for OpenVINO vLLM backend (vllm-project#8192) * [Model] Adding Granite MoE. (vllm-project#8206) Co-authored-by: Nick Hill <[email protected]> * [Doc] Update Granite model docs (vllm-project#9025) * [Bugfix] example template should not add parallel_tool_prompt if tools is none (vllm-project#9007) * [Misc] log when using default MoE config (vllm-project#8971) * [BugFix] Enforce Mistral ToolCall id constraint when using the Mistral tool call parser (vllm-project#9020) * [Core] Make BlockSpaceManagerV2 the default BlockManager to use. (vllm-project#8678) * [Frontend] [Neuron] Parse literals out of override-neuron-config (vllm-project#8959) Co-authored-by: Jerzy Zagorski <[email protected]> * [misc] add forward context for attention (vllm-project#9029) * Fix failing spec decode test (vllm-project#9054) * [Bugfix] Weight loading fix for OPT model (vllm-project#9042) Co-authored-by: dvres <[email protected]> * [Frontend][Feature] support tool calling for internlm/internlm2_5-7b-chat model (vllm-project#8405) * [CI/Build] Per file CUDA Archs (improve wheel size and dev build times) (vllm-project#8845) * [Misc] Enable multi-step output streaming by default (vllm-project#9047) * [Models] Add remaining model PP support (vllm-project#7168) Signed-off-by: Muralidhar Andoorveedu <[email protected]> Signed-off-by: Murali Andoorveedu <[email protected]> Co-authored-by: DarkLight1337 <[email protected]> * [Misc] Move registry to its own file (vllm-project#9064) * [Bugfix] Reshape the dimensions of the input image embeddings in Qwen2VL (vllm-project#9071) * [Bugfix] Flash attention arches not getting set properly (vllm-project#9062) * [Model] add a bunch of supported lora modules for mixtral (vllm-project#9008) Signed-off-by: Prashant Gupta <[email protected]> * Remove AMD Ray Summit Banner (vllm-project#9075) * [Hardware][PowerPC] Make oneDNN dependency optional for Power (vllm-project#9039) Signed-off-by: Varad Ahirwadkar <[email protected]> * [Core][VLM] Test registration for OOT multimodal models (vllm-project#8717) Co-authored-by: DarkLight1337 <[email protected]> * Adds truncate_prompt_tokens param for embeddings creation (vllm-project#8999) Signed-off-by: Flavia Beo <[email protected]> * [Kernel] Zero point support in fused MarlinMoE kernel + AWQ Fused MoE (vllm-project#8973) Co-authored-by: Dipika <[email protected]> Co-authored-by: Dipika Sikka <[email protected]> * [CI] Update performance benchmark: upgrade trt-llm to r24.07, and add SGLang (vllm-project#7412) * [Misc] Improved prefix cache example (vllm-project#9077) * [Misc] Add random seed for prefix cache benchmark (vllm-project#9081) * [Misc] Fix CI lint (vllm-project#9085) * [Hardware][Neuron] Add on-device sampling support for Neuron (vllm-project#8746) Co-authored-by: Ashraf Mahgoub <[email protected]> * [torch.compile] improve allreduce registration (vllm-project#9061) * [Doc] Update README.md with Ray summit slides (vllm-project#9088) * [Bugfix] use blockmanagerv1 for encoder-decoder (vllm-project#9084) Co-authored-by: Roger Wang <[email protected]> * [Bugfix] Fixes Phi3v & Ultravox Multimodal EmbeddingInputs (vllm-project#8979) * [Model] Support Gemma2 embedding model (vllm-project#9004) * [Bugfix] Deprecate registration of custom configs to huggingface (vllm-project#9083) * [Bugfix] Fix order of arguments matters in config.yaml (vllm-project#8960) * [core] use forward context for flash infer (vllm-project#9097) * [Bugfix] Fix try-catch conditions to import correct Flash Attention Backend in Draft Model (vllm-project#9101) * [Frontend] API support for beam search (vllm-project#9087) Co-authored-by: youkaichao <[email protected]> * [Misc] Remove user-facing error for removed VLM args (vllm-project#9104) * [Model] PP support for embedding models and update docs (vllm-project#9090) Co-authored-by: Roger Wang <[email protected]> * [Bugfix] fix tool_parser error handling when serve a model not support it (vllm-project#8709) * [Bugfix] Fix incorrect updates to num_computed_tokens in multi-step scheduling (vllm-project#9038) Co-authored-by: Varun Sundar Rabindranath <[email protected]> * [Bugfix][Hardware][CPU] Fix CPU model input for decode (vllm-project#9044) * [BugFix][Core] Fix BlockManagerV2 when Encoder Input is None (vllm-project#9103) * [core] remove beam search from the core (vllm-project#9105) * [Model] Explicit interface for vLLM models and support OOT embedding models (vllm-project#9108) * [Hardware][CPU] Cross-attention and Encoder-Decoder models support on CPU backend (vllm-project#9089) * [Core] Refactor GGUF parameters packing and forwarding (vllm-project#8859) * [Model] Support NVLM-D and fix QK Norm in InternViT (vllm-project#9045) Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Isotr0py <[email protected]> * [Doc]: Add deploying_with_k8s guide (vllm-project#8451) * [CI/Build] Add linting for github actions workflows (vllm-project#7876) Signed-off-by: Russell Bryant <[email protected]> * [Doc] Include performance benchmark in README (vllm-project#9135) * [misc] fix comment and variable name (vllm-project#9139) * Add Slack to README (vllm-project#9137) * [misc] update utils to support comparing multiple settings (vllm-project#9140) * [Intel GPU] Fix xpu decode input (vllm-project#9145) * [misc] improve ux on readme (vllm-project#9147) * [Frontend] API support for beam search for MQLLMEngine (vllm-project#9117) * [Core][Frontend] Add Support for Inference Time mm_processor_kwargs (vllm-project#9131) Signed-off-by: Alex-Brooks <[email protected]> * Factor out common weight loading code * Fix EAGLE model loading * [Frontend] Add Early Validation For Chat Template / Tool Call Parser (vllm-project#9151) Signed-off-by: Alex-Brooks <[email protected]> * Improve efficiency * Rename * Update LLaVA-NeXT-Video * [CI/Build] Add examples folder into Docker image so that we can leverage the templates*.jinja when serving models (vllm-project#8758) Signed-off-by: Peter Pan <[email protected]> * [Bugfix] fix OpenAI API server startup with --disable-frontend-multiprocessing (vllm-project#8537) * Automatic loading and save memory * Rename * Update docstring * Simplify * Cleanup * Fully enable recursive loading * Clarify * [Doc] Update vlm.rst to include an example on videos (vllm-project#9155) Co-authored-by: Cyrus Leung <[email protected]> * Fix incorrect semantics * Move function * Update error message * Fix Ultravox loading * spacing * [Doc] Improve contributing and installation documentation (vllm-project#9132) Signed-off-by: Rafael Vasquez <[email protected]> * Fix server * [Bugfix] Try to handle older versions of pytorch (vllm-project#9086) --------- Signed-off-by: kevin <[email protected]> Signed-off-by: Max de Bayser <[email protected]> Signed-off-by: Peter Pan <[email protected]> Signed-off-by: tylertitsworth <[email protected]> Signed-off-by: Russell Bryant <[email protected]> Signed-off-by: Joe Runde <[email protected]> Signed-off-by: Alex-Brooks <[email protected]> Signed-off-by: Muralidhar Andoorveedu <[email protected]> Signed-off-by: Murali Andoorveedu <[email protected]> Signed-off-by: Prashant Gupta <[email protected]> Signed-off-by: Varad Ahirwadkar <[email protected]> Signed-off-by: Flavia Beo <[email protected]> Signed-off-by: Rafael Vasquez <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> Co-authored-by: Michael Goin <[email protected]> Co-authored-by: fyuan1316 <[email protected]> Co-authored-by: youkaichao <[email protected]> Co-authored-by: Kevin H. Luu <[email protected]> Co-authored-by: Pernekhan Utemuratov <[email protected]> Co-authored-by: Chirag Jain <[email protected]> Co-authored-by: Nick Hill <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Maximilien de Bayser <[email protected]> Co-authored-by: Peter Pan <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: Brittany <[email protected]> Co-authored-by: Luka Govedič <[email protected]> Co-authored-by: Lucas Wilkinson <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Sebastian Schoennenbeck <[email protected]> Co-authored-by: Tyler Titsworth <[email protected]> Co-authored-by: youkaichao <[email protected]> Co-authored-by: tastelikefeet <[email protected]> Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Edouard B. <[email protected]> Co-authored-by: Eduard Balzin <[email protected]> Co-authored-by: Chen Zhang <[email protected]> Co-authored-by: Russell Bryant <[email protected]> Co-authored-by: sroy745 <[email protected]> Co-authored-by: ElizaWszola <[email protected]> Co-authored-by: Zilin Zhu <[email protected]> Co-authored-by: Jee Jee Li <[email protected]> Co-authored-by: juncheoll <[email protected]> Co-authored-by: danieljannai21 <[email protected]> Co-authored-by: Mor Zusman <[email protected]> Co-authored-by: whyiug <[email protected]> Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Lily Liu <[email protected]> Co-authored-by: Joe Runde <[email protected]> Co-authored-by: Cody Yu <[email protected]> Co-authored-by: Divakar Verma <[email protected]> Co-authored-by: Alex Brooks <[email protected]> Co-authored-by: vlsav <[email protected]> Co-authored-by: afeldman-nm <[email protected]> Co-authored-by: Andrew Feldman <[email protected]> Co-authored-by: Sergey Shlyapnikov <[email protected]> Co-authored-by: Shawn Tan <[email protected]> Co-authored-by: Travis Johnson <[email protected]> Co-authored-by: Guillaume Calmettes <[email protected]> Co-authored-by: xendo <[email protected]> Co-authored-by: Jerzy Zagorski <[email protected]> Co-authored-by: Domen Vreš <[email protected]> Co-authored-by: dvres <[email protected]> Co-authored-by: 代君 <[email protected]> Co-authored-by: Murali Andoorveedu <[email protected]> Co-authored-by: Prashant Gupta <[email protected]> Co-authored-by: Simon Mo <[email protected]> Co-authored-by: Varad Ahirwadkar <[email protected]> Co-authored-by: Flávia Béo <[email protected]> Co-authored-by: Dipika <[email protected]> Co-authored-by: Dipika Sikka <[email protected]> Co-authored-by: Kuntai Du <[email protected]> Co-authored-by: Andy Dai <[email protected]> Co-authored-by: Chongming Ni <[email protected]> Co-authored-by: Ashraf Mahgoub <[email protected]> Co-authored-by: Zhuohan Li <[email protected]> Co-authored-by: hhzhang16 <[email protected]> Co-authored-by: Xin Yang <[email protected]> Co-authored-by: TJian <[email protected]> Co-authored-by: Brendan Wong <[email protected]> Co-authored-by: Yanyi Liu <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: TimWang <[email protected]> Co-authored-by: Kunshang Ji <[email protected]> Co-authored-by: Daniele <[email protected]> Co-authored-by: Sayak Paul <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Rafael Vasquez <[email protected]> Co-authored-by: bnellnm <[email protected]>

[Bugfix] Streamed tool calls now more strictly follow OpenAI's format; ensures Vercel AI SDK compatibility (vllm-project#8272) [Frontend] Add progress reporting to run_batch.py (vllm-project#8060) Co-authored-by: Adam Lugowski <[email protected]> [Bugfix] Correct adapter usage for cohere and jamba (vllm-project#8292) [Misc] GPTQ Activation Ordering (vllm-project#8135) [Misc] Fused MoE Marlin support for GPTQ (vllm-project#8217) Add NVIDIA Meetup slides, announce AMD meetup, and add contact info (vllm-project#8319) [Bugfix] Fix missing `post_layernorm` in CLIP (vllm-project#8155) [CI/Build] enable ccache/scccache for HIP builds (vllm-project#8327) [Frontend] Clean up type annotations for mistral tokenizer (vllm-project#8314) [CI/Build] Enabling kernels tests for AMD, ignoring some of then that fail (vllm-project#8130) Fix ppc64le buildkite job (vllm-project#8309) [Spec Decode] Move ops.advance_step to flash attn advance_step (vllm-project#8224) [Misc] remove peft as dependency for prompt models (vllm-project#8162) [MISC] Keep chunked prefill enabled by default with long context when prefix caching is enabled (vllm-project#8342) [Bugfix] lookahead block table with cuda graph max capture (vllm-project#8340) [Bugfix] Ensure multistep lookahead allocation is compatible with cuda graph max capture (vllm-project#8340) [Core/Bugfix] pass VLLM_ATTENTION_BACKEND to ray workers (vllm-project#8172) [CI/Build][Kernel] Update CUTLASS to 3.5.1 tag (vllm-project#8043) [Misc] Skip loading extra bias for Qwen2-MOE GPTQ models (vllm-project#8329) [Bugfix] Fix InternVL2 vision embeddings process with pipeline parallel (vllm-project#8299) [Hardware][NV] Add support for ModelOpt static scaling checkpoints. (vllm-project#6112) [model] Support for Llava-Next-Video model (vllm-project#7559) Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> [Frontend] Create ErrorResponse instead of raising exceptions in run_batch (vllm-project#8347) [Model][VLM] Add Qwen2-VL model support (vllm-project#7905) Co-authored-by: Roger Wang <[email protected]> Co-authored-by: DarkLight1337 <[email protected]> [Hardware][Intel] Support compressed-tensor W8A8 for CPU backend (vllm-project#7257) [CI/Build] Excluding test_moe.py from AMD Kernels tests for investigation (vllm-project#8373) [Bugfix] Add missing attributes in mistral tokenizer (vllm-project#8364) [Kernel][Misc] register ops to prevent graph breaks (vllm-project#6917) Co-authored-by: Sage Moore <[email protected]> [Misc] Move device options to a single place (vllm-project#8322) [Speculative Decoding] Test refactor (vllm-project#8317) Co-authored-by: youkaichao <[email protected]> Pixtral (vllm-project#8377) Co-authored-by: Roger Wang <[email protected]> Bump version to v0.6.1 (vllm-project#8379) [MISC] Dump model runner inputs when crashing (vllm-project#8305) [misc] remove engine_use_ray (vllm-project#8126) [TPU] Use Ray for default distributed backend (vllm-project#8389) Fix the AMD weight loading tests (vllm-project#8390) [Bugfix]: Fix the logic for deciding if tool parsing is used (vllm-project#8366) [Gemma2] add bitsandbytes support for Gemma2 (vllm-project#8338) [Misc] Raise error when using encoder/decoder model with cpu backend (vllm-project#8355) [Misc] Use RoPE cache for MRoPE (vllm-project#8396) [torch.compile] hide slicing under custom op for inductor (vllm-project#8384) [Hotfix][VLM] Fixing max position embeddings for Pixtral (vllm-project#8399) [Bugfix] Fix InternVL2 inference with various num_patches (vllm-project#8375) Co-authored-by: DarkLight1337 <[email protected]> [Model] Support multiple images for qwen-vl (vllm-project#8247) Signed-off-by: Alex-Brooks <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: DarkLight1337 <[email protected]> [BugFix] lazy init _copy_stream to avoid torch init wrong gpu instance (vllm-project#8403) [BugFix] Fix Duplicate Assignment in Hermes2ProToolParser (vllm-project#8423) [Bugfix] Offline mode fix (vllm-project#8376) Signed-off-by: Joe Runde <[email protected]> [multi-step] add flashinfer backend (vllm-project#7928) [Core] Add engine option to return only deltas or final output (vllm-project#7381) [Bugfix] multi-step + flashinfer: ensure cuda graph compatible (vllm-project#8427) [Hotfix][Core][VLM] Disable chunked prefill by default and prefix caching for multimodal models (vllm-project#8425) [CI/Build] Disable multi-node test for InternVL2 (vllm-project#8428) [Hotfix][Pixtral] Fix multiple images bugs (vllm-project#8415) [Bugfix] Fix weight loading issue by rename variable. (vllm-project#8293) [Misc] Update Pixtral example (vllm-project#8431) [BugFix] fix group_topk (vllm-project#8430) [Core] Factor out input preprocessing to a separate class (vllm-project#7329) [Bugfix] Mapping physical device indices for e2e test utils (vllm-project#8290) [Bugfix] Bump fastapi and pydantic version (vllm-project#8435) [CI/Build] Update pixtral tests to use JSON (vllm-project#8436) [Bugfix] Fix async log stats (vllm-project#8417) [bugfix] torch profiler bug for single gpu with GPUExecutor (vllm-project#8354) bump version to v0.6.1.post1 (vllm-project#8440) [CI/Build] Enable InternVL2 PP test only on single node (vllm-project#8437) [doc] recommend pip instead of conda (vllm-project#8446) [Misc] Skip loading extra bias for Qwen2-VL GPTQ-Int8 (vllm-project#8442) [misc][ci] fix quant test (vllm-project#8449) [Installation] Gate FastAPI version for Python 3.8 (vllm-project#8456) [plugin][torch.compile] allow to add custom compile backend (vllm-project#8445) [CI/Build] Reorganize models tests (vllm-project#7820) [Doc] Add oneDNN installation to CPU backend documentation (vllm-project#8467) [HotFix] Fix final output truncation with stop string + streaming (vllm-project#8468) bump version to v0.6.1.post2 (vllm-project#8473) [Hardware][intel GPU] bump up ipex version to 2.3 (vllm-project#8365) Co-authored-by: Yan Ma <[email protected]> [Kernel][Hardware][Amd]Custom paged attention kernel for rocm (vllm-project#8310) [Model] support minicpm3 (vllm-project#8297) Co-authored-by: DarkLight1337 <[email protected]> [torch.compile] fix functionalization (vllm-project#8480) [torch.compile] add a flag to disable custom op (vllm-project#8488) [TPU] Implement multi-step scheduling (vllm-project#8489) [Bugfix][Model] Fix Python 3.8 compatibility in Pixtral model by updating type annotations (vllm-project#8490) [Bugfix][Kernel] Add `IQ1_M` quantization implementation to GGUF kernel (vllm-project#8357) [Kernel] Enable 8-bit weights in Fused Marlin MoE (vllm-project#8032) Co-authored-by: Dipika <[email protected]> [Frontend] Expose revision arg in OpenAI server (vllm-project#8501) [BugFix] Fix clean shutdown issues (vllm-project#8492) [Bugfix][Kernel] Fix build for sm_60 in GGUF kernel (vllm-project#8506) [Kernel] AQ AZP 3/4: Asymmetric quantization kernels (vllm-project#7270) [doc] update doc on testing and debugging (vllm-project#8514) [Bugfix] Bind api server port before starting engine (vllm-project#8491) [perf bench] set timeout to debug hanging (vllm-project#8516) [misc] small qol fixes for release process (vllm-project#8517) [Bugfix] Fix 3.12 builds on main (vllm-project#8510) Signed-off-by: Joe Runde <[email protected]> [refactor] remove triton based sampler (vllm-project#8524) [Frontend] Improve Nullable kv Arg Parsing (vllm-project#8525) Signed-off-by: Alex-Brooks <[email protected]> [Misc][Bugfix] Disable guided decoding for mistral tokenizer (vllm-project#8521) [torch.compile] register allreduce operations as custom ops (vllm-project#8526) [Misc] Limit to ray[adag] 2.35 to avoid backward incompatible change (vllm-project#8509) Signed-off-by: Rui Qiao <[email protected]> [Benchmark] Support sample from HF datasets and image input for benchmark_serving (vllm-project#8495) [Encoder decoder] Add cuda graph support during decoding for encoder-decoder models (vllm-project#7631) [Feature][kernel] tensor parallelism with bitsandbytes quantization (vllm-project#8434) [Model] Add mistral function calling format to all models loaded with "mistral" format (vllm-project#8515) Co-authored-by: Cyrus Leung <[email protected]> [Misc] Don't dump contents of kvcache tensors on errors (vllm-project#8527) [Bugfix] Fix TP > 1 for new granite (vllm-project#8544) Signed-off-by: Joe Runde <[email protected]> [doc] improve installation doc (vllm-project#8550) Co-authored-by: Andy Dai <[email protected]> [CI/Build] Excluding kernels/test_gguf.py from ROCm (vllm-project#8520) [Kernel] Change interface to Mamba causal_conv1d_update for continuous batching (vllm-project#8012) [CI/Build] fix Dockerfile.cpu on podman (vllm-project#8540) [Misc] Add argument to disable FastAPI docs (vllm-project#8554) [CI/Build] Avoid CUDA initialization (vllm-project#8534) [CI/Build] Update Ruff version (vllm-project#8469) Signed-off-by: Aaron Pham <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> [Core][Bugfix][Perf] Introduce `MQLLMEngine` to avoid `asyncio` OH (vllm-project#8157) Co-authored-by: Nick Hill <[email protected]> Co-authored-by: [email protected] <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Co-authored-by: Simon Mo <[email protected]> [Core] *Prompt* logprobs support in Multi-step (vllm-project#8199) [Core] zmq: bind only to 127.0.0.1 for local-only usage (vllm-project#8543) Signed-off-by: Russell Bryant <[email protected]> [Model] Support Solar Model (vllm-project#8386) Co-authored-by: Michael Goin <[email protected]> [AMD][ROCm]Quantization methods on ROCm; Fix _scaled_mm call (vllm-project#8380) Co-authored-by: Alexei-V-Ivanov-AMD <[email protected]> Co-authored-by: Michael Goin <[email protected]> [Kernel] Change interface to Mamba selective_state_update for continuous batching (vllm-project#8039) [BugFix] Nonzero exit code if MQLLMEngine startup fails (vllm-project#8572) [Bugfix] add `dead_error` property to engine client (vllm-project#8574) Signed-off-by: Joe Runde <[email protected]> [Kernel] Remove marlin moe templating on thread_m_blocks (vllm-project#8573) Co-authored-by: [email protected] [Bugfix] [Encoder-Decoder] Bugfix for encoder specific metadata construction during decode of encoder-decoder models. (vllm-project#8545) Revert "[Misc][Bugfix] Disable guided decoding for mistral tokenizer" (vllm-project#8593) [Bugfix] fixing sonnet benchmark bug in benchmark_serving.py (vllm-project#8616) [MISC] remove engine_use_ray in benchmark_throughput.py (vllm-project#8615) [Frontend] Use MQLLMEngine for embeddings models too (vllm-project#8584) [Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention (vllm-project#8577) [Core] simplify logits resort in _apply_top_k_top_p (vllm-project#8619) [Doc] Add documentation for GGUF quantization (vllm-project#8618) Create SECURITY.md (vllm-project#8642) [CI/Build] Re-enabling Entrypoints tests on ROCm, excluding ones that fail (vllm-project#8551) [Misc] guard against change in cuda library name (vllm-project#8609) [Bugfix] Fix Phi3.5 mini and MoE LoRA inference (vllm-project#8571) [bugfix] [AMD] add multi-step advance_step to ROCmFlashAttentionMetadata (vllm-project#8474) [Core] Support Lora lineage and base model metadata management (vllm-project#6315) [Model] Add OLMoE (vllm-project#7922) [CI/Build] Removing entrypoints/openai/test_embedding.py test from ROCm build (vllm-project#8670) [Bugfix] Validate SamplingParam n is an int (vllm-project#8548) [Misc] Show AMD GPU topology in `collect_env.py` (vllm-project#8649) [Bugfix] Config got an unexpected keyword argument 'engine' (vllm-project#8556) [Bugfix][Core] Fix tekken edge case for mistral tokenizer (vllm-project#8640) [Doc] neuron documentation update (vllm-project#8671) Signed-off-by: omrishiv <[email protected]> [Hardware][AWS] update neuron to 2.20 (vllm-project#8676) Signed-off-by: omrishiv <[email protected]> [Bugfix] Fix incorrect llava next feature size calculation (vllm-project#8496) [Core] Rename `PromptInputs` and `inputs`(vllm-project#8673) [MISC] add support custom_op check (vllm-project#8557) Co-authored-by: youkaichao <[email protected]> [Core] Factor out common code in `SequenceData` and `Sequence` (vllm-project#8675) [beam search] add output for manually checking the correctness (vllm-project#8684) [Kernel] Build flash-attn from source (vllm-project#8245) [VLM] Use `SequenceData.from_token_counts` to create dummy data (vllm-project#8687) [Doc] Fix typo in AMD installation guide (vllm-project#8689) [Kernel][Triton][AMD] Remove tl.atomic_add from awq_gemm_kernel, 2-5x speedup MI300, minor improvement for MI250 (vllm-project#8646) [dbrx] refactor dbrx experts to extend FusedMoe class (vllm-project#8518) [Kernel][Bugfix] Delete some more useless code in marlin_moe_ops.cu (vllm-project#8643) [Bugfix] Refactor composite weight loading logic (vllm-project#8656) [ci][build] fix vllm-flash-attn (vllm-project#8699) [Model] Refactor BLIP/BLIP-2 to support composite model loading (vllm-project#8407) [Misc] Use NamedTuple in Multi-image example (vllm-project#8705) Signed-off-by: Alex-Brooks <[email protected]> [MISC] rename CudaMemoryProfiler to DeviceMemoryProfiler (vllm-project#8703) [Model][VLM] Add LLaVA-Onevision model support (vllm-project#8486) Co-authored-by: litianjian <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Roger Wang <[email protected]> Co-authored-by: DarkLight1337 <[email protected]> [SpecDec][Misc] Cleanup, remove bonus token logic. (vllm-project#8701) [build] enable existing pytorch (for GH200, aarch64, nightly) (vllm-project#8713) [misc] upgrade mistral-common (vllm-project#8715) [Bugfix] Avoid some bogus messages RE CUTLASS's revision when building (vllm-project#8702) [Bugfix] Fix CPU CMake build (vllm-project#8723) Co-authored-by: Yuan <[email protected]> [Bugfix] fix docker build for xpu (vllm-project#8652) [Core][Frontend] Support Passing Multimodal Processor Kwargs (vllm-project#8657) Signed-off-by: Alex-Brooks <[email protected]> [Hardware][CPU] Refactor CPU model runner (vllm-project#8729) [Bugfix][CPU] fix missing input intermediate_tensors in the cpu_model_runner (vllm-project#8733) [Model] Support pp for qwen2-vl (vllm-project#8696) [VLM] Fix paligemma, fuyu and persimmon with transformers 4.45 : use config.text_config.vocab_size (vllm-project#8707) [CI/Build] use setuptools-scm to set __version__ (vllm-project#4738) Co-authored-by: youkaichao <[email protected]> [Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin (vllm-project#7701) Co-authored-by: mgoin <[email protected]> Co-authored-by: Divakar Verma <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> [Kernel][LoRA] Add assertion for punica sgmv kernels (vllm-project#7585) [Core] Allow IPv6 in VLLM_HOST_IP with zmq (vllm-project#8575) Signed-off-by: Russell Bryant <[email protected]> Fix typical acceptance sampler with correct recovered token ids (vllm-project#8562) Add output streaming support to multi-step + async while ensuring RequestOutput obj reuse (vllm-project#8335) [Hardware][AMD] ROCm6.2 upgrade (vllm-project#8674) Fix tests in test_scheduler.py that fail with BlockManager V2 (vllm-project#8728) re-implement beam search on top of vllm core (vllm-project#8726) Co-authored-by: Brendan Wong <[email protected]> Revert "[Core] Rename `PromptInputs` to `PromptType`, and `inputs` to `prompt`" (vllm-project#8750) [MISC] Skip dumping inputs when unpicklable (vllm-project#8744) [Core][Model] Support loading weights by ID within models (vllm-project#7931) [Model] Expose Phi3v num_crops as a mm_processor_kwarg (vllm-project#8658) Signed-off-by: Alex-Brooks <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: DarkLight1337 <[email protected]> [Bugfix] Fix potentially unsafe custom allreduce synchronization (vllm-project#8558) [Kernel] Split Marlin MoE kernels into multiple files (vllm-project#8661) Co-authored-by: mgoin <[email protected]> [Frontend] Batch inference for llm.chat() API (vllm-project#8648) Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Roger Wang <[email protected]> [Bugfix] Fix torch dynamo fixes caused by `replace_parameters` (vllm-project#8748) [CI/Build] fix setuptools-scm usage (vllm-project#8771) [misc] soft drop beam search (vllm-project#8763) [[Misc]Upgrade bitsandbytes to the latest version 0.44.0 (vllm-project#8768) [Core][Bugfix] Support prompt_logprobs returned with speculative decoding (vllm-project#8047) Signed-off-by: Travis Johnson <[email protected]> [Core] Adding Priority Scheduling (vllm-project#5958) [Bugfix] Use heartbeats instead of health checks (vllm-project#8583) Fix test_schedule_swapped_simple in test_scheduler.py (vllm-project#8780) [Bugfix][Kernel] Implement acquire/release polyfill for Pascal (vllm-project#8776) Fix tests in test_chunked_prefill_scheduler which fail with BlockManager V2 (vllm-project#8752) [BugFix] Propagate 'trust_remote_code' setting in internvl and minicpmv (vllm-project#8250) [Hardware][CPU] Enable mrope and support Qwen2-VL on CPU backend (vllm-project#8770) [Bugfix] load fc bias from config for eagle (vllm-project#8790) [Frontend] OpenAI server: propagate usage accounting to FastAPI middleware layer (vllm-project#8672) [Bugfix] Ray 2.9.x doesn't expose available_resources_per_node (vllm-project#8767) Signed-off-by: darthhexx <[email protected]> [Misc] Fix minor typo in scheduler (vllm-project#8765) [CI/Build][Bugfix][Doc][ROCm] CI fix and doc update after ROCm 6.2 upgrade (vllm-project#8777) [Kernel] Fullgraph and opcheck tests (vllm-project#8479) [[Misc]] Add extra deps for openai server image (vllm-project#8792) [VLM][Bugfix] internvl with num_scheduler_steps > 1 (vllm-project#8614) rename PromptInputs and inputs with backward compatibility (vllm-project#8760) [Frontend] MQLLMEngine supports profiling. (vllm-project#8761) [Misc] Support FP8 MoE for compressed-tensors (vllm-project#8588) Revert "rename PromptInputs and inputs with backward compatibility (vllm-project#8760) (vllm-project#8810) [Model] Add support for the multi-modal Llama 3.2 model (vllm-project#8811) Co-authored-by: simon-mo <[email protected]> Co-authored-by: Chang Su <[email protected]> Co-authored-by: Simon Mo <[email protected]> Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Roger Wang <[email protected]> [Doc] Update doc for Transformers 4.45 (vllm-project#8817) [Misc] Support quantization of MllamaForCausalLM (vllm-project#8822) [Misc] Update config loading for Qwen2-VL and remove Granite (vllm-project#8837) [Build/CI] Upgrade to gcc 10 in the base build Docker image (vllm-project#8814) [Docs] Add README to the build docker image (vllm-project#8825) [CI/Build] Fix missing ci dependencies (vllm-project#8834) [misc][installation] build from source without compilation (vllm-project#8818) [ci] Soft fail Entrypoints, Samplers, LoRA, Decoder-only VLM (vllm-project#8872) Signed-off-by: kevin <[email protected]> [Bugfix] Include encoder prompts len to non-stream api usage response (vllm-project#8861) [Misc] Change dummy profiling and BOS fallback warns to log once (vllm-project#8820) [Bugfix] Fix print_warning_once's line info (vllm-project#8867) fix validation: Only set tool_choice `auto` if at least one tool is provided (vllm-project#8568) [Bugfix] Fixup advance_step.cu warning (vllm-project#8815) [BugFix] Fix test breakages from transformers 4.45 upgrade (vllm-project#8829) [Installation] Allow lower versions of FastAPI to maintain Ray 2.9 compatibility (vllm-project#8764) [Feature] Add support for Llama 3.1 and 3.2 tool use (vllm-project#8343) Signed-off-by: Max de Bayser <[email protected]> [Core] rename`PromptInputs` and `inputs` (vllm-project#8876) [misc] fix collect env (vllm-project#8894) [MISC] Fix invalid escape sequence '\' (vllm-project#8830) Signed-off-by: Peter Pan <[email protected]> [Bugfix][VLM] Fix Fuyu batching inference with `max_num_seqs>1` (vllm-project#8892) [TPU] Update pallas.py to support trillium (vllm-project#8871) [torch.compile] use empty tensor instead of None for profiling (vllm-project#8875) [Kernel] AQ AZP 4/4: Integrate asymmetric quantization to linear method (vllm-project#7271) [Bugfix] fix for deepseek w4a16 (vllm-project#8906) Co-authored-by: mgoin <[email protected]> [Core] Multi-Step + Single Step Prefills via Chunked Prefill code path (vllm-project#8378) Co-authored-by: Varun Sundar Rabindranath <[email protected]> [misc][distributed] add VLLM_SKIP_P2P_CHECK flag (vllm-project#8911) [Core] Priority-based scheduling in async engine (vllm-project#8850) [misc] fix wheel name (vllm-project#8919) [Bugfix][Intel] Fix XPU Dockerfile Build (vllm-project#7824) Signed-off-by: tylertitsworth <[email protected]> Co-authored-by: youkaichao <[email protected]> [Misc] Remove vLLM patch of `BaichuanTokenizer` (vllm-project#8921) [Bugfix] Fix code for downloading models from modelscope (vllm-project#8443) [Bugfix] Fix PP for Multi-Step (vllm-project#8887) [CI/Build] Update models tests & examples (vllm-project#8874) Co-authored-by: Roger Wang <[email protected]> [Frontend] Make beam search emulator temperature modifiable (vllm-project#8928) Co-authored-by: Eduard Balzin <[email protected]> [Bugfix] Support testing prefill throughput with benchmark_serving.py --hf-output-len 1 (vllm-project#8891) [doc] organize installation doc and expose per-commit docker (vllm-project#8931) [Core] Improve choice of Python multiprocessing method (vllm-project#8823) Signed-off-by: Russell Bryant <[email protected]> Co-authored-by: youkaichao <[email protected]> [Bugfix] Block manager v2 with preemption and lookahead slots (vllm-project#8824) [Bugfix] Fix Marlin MoE act order when is_k_full == False (vllm-project#8741) Co-authored-by: Tyler Michael Smith <[email protected]> [CI/Build] Add test decorator for minimum GPU memory (vllm-project#8925) [Build/CI] Set FETCHCONTENT_BASE_DIR to one location for better caching (vllm-project#8930) [Model] Support Qwen2.5-Math-RM-72B (vllm-project#8896) [Model][LoRA]LoRA support added for MiniCPMV2.5 (vllm-project#7199) [BugFix] Fix seeded random sampling with encoder-decoder models (vllm-project#8870) Co-authored-by: Roger Wang <[email protected]> [Misc] Fix typo in BlockSpaceManagerV1 (vllm-project#8944) [Frontend] Added support for HF's new `continue_final_message` parameter (vllm-project#8942) [Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels and Jamba model (vllm-project#8533)

commit 94bf9ae4e9b8199636668ccbe4dabcdc3b9e5ae6 Author: Andy Dai <[email protected]> Date: Thu Oct 10 17:33:16 2024 -0700 [Misc] Fix sampling from sonnet for long context case (#9235) commit f990bab2a4198c4de6b5b349d35fc74bf0f36f3e Author: omrishiv <[email protected]> Date: Thu Oct 10 16:36:32 2024 -0700 [Doc][Neuron] add note to neuron documentation about resolving triton issue (#9257) Signed-off-by: omrishiv <[email protected]> commit e00c094f15e79c5a113fdf975df1ee9018cb65b3 Author: youkaichao <[email protected]> Date: Thu Oct 10 15:54:23 2024 -0700 [torch.compile] generic decorators (#9258) commit a78c6ba7c88a7bb42b38410f9dcfa5b342b95b57 Author: Kevin H. Luu <[email protected]> Date: Thu Oct 10 15:45:09 2024 -0700 [ci/build] Add placeholder command for custom models test (#9262) commit fb870fd491482cfe5a41648b8c081d1bd6941205 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Thu Oct 10 13:30:46 2024 -0700 Bump actions/setup-python from 3 to 5 (#9195) Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit 270953bafb1ccf444f2018d1c0a88c51472de22e Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Thu Oct 10 13:30:35 2024 -0700 Bump actions/checkout from 3 to 4 (#9196) Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit 9cc811c4ff3d5200cc23f16709f540821531b77c Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Thu Oct 10 13:30:24 2024 -0700 Bump actions/github-script from 6 to 7 (#9197) Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit e4d652ea3ed9b2a60c1582cb2e2605695e61280f Author: youkaichao <[email protected]> Date: Thu Oct 10 12:39:36 2024 -0700 [torch.compile] integration with compilation control (#9058) commit 78c0b4166cb097de749993970b51cb7b8becba58 Author: Simon Mo <[email protected]> Date: Thu Oct 10 12:29:24 2024 -0700 Suggest codeowners for the core componenets (#9210) commit 21efb603f5f88a0d78ad11e4fbc6e18fe83916d4 Author: jordanyono <[email protected]> Date: Thu Oct 10 14:18:18 2024 -0400 [CI/Build] Make the `Dockerfile.cpu` file's `PIP_EXTRA_INDEX_URL` Configurable as a Build Argument (#9252) commit 055f3270d40bbc492630d0f2c96ec8b64823ba34 Author: Rafael Vasquez <[email protected]> Date: Thu Oct 10 13:48:51 2024 -0400 [Doc] Improve debugging documentation (#9204) Signed-off-by: Rafael Vasquez <[email protected]> commit 18511aeda64b473314bb7727a97a220565e0af41 Author: Lucas Wilkinson <[email protected]> Date: Thu Oct 10 13:39:56 2024 -0400 [Bugfix] Fix Machete unittests failing with `NotImplementedError` (#9218) commit 83ea5c72b9a287b65c9f7b95fbd868b3f613e6f5 Author: Ilya Lavrenov <[email protected]> Date: Thu Oct 10 21:18:58 2024 +0400 [OpenVINO] Use torch 2.4.0 and newer optimim version (#9121) Co-authored-by: DarkLight1337 <[email protected]> commit 04de9057ab8099291e66ad876e78693c7c2f2ce5 Author: whyiug <[email protected]> Date: Thu Oct 10 23:00:47 2024 +0800 [Model] support input image embedding for minicpmv (#9237) commit 07c11cf4d4b9a913fa52142fe134849f1e25e393 Author: Isotr0py <[email protected]> Date: Thu Oct 10 21:11:56 2024 +0800 [Bugfix] Fix lm_head weights tying with lora for llama (#9227) commit f3a507f1d31e13a99c4fc8ac02738a73c3e3136f Author: sroy745 <[email protected]> Date: Wed Oct 9 23:17:17 2024 -0700 [Core] Add an environment variable which needs to be set explicitly to allow BlockSpaceManagerV1 (#9149) commit a64e7b940734b68d849ed2b07ca1bc3824713555 Author: Lucas Wilkinson <[email protected]> Date: Thu Oct 10 02:16:17 2024 -0400 [Bugfix] Machete garbage results for some models (large K dim) (#9212) commit ce00231a8bfb5eae85167b5a3def1b7304c723b6 Author: Michael Goin <[email protected]> Date: Thu Oct 10 02:15:40 2024 -0400 [Bugfix] Fix Weight Loading Multiple GPU Test - Large Models (#9213) commit de895f1697d22ea19a5a4d4ab3dc17037a3e9af3 Author: youkaichao <[email protected]> Date: Wed Oct 9 21:58:27 2024 -0700 [misc] improve model support check in another process (#9208) commit cf25b93bddb607077e52cbe4681332ca61aff189 Author: Russell Bryant <[email protected]> Date: Thu Oct 10 00:10:09 2024 -0400 [Core] Fix invalid args to _process_request (#9201) Signed-off-by: Russell Bryant <[email protected]> commit d5fbb8706d2c7fd00b64cff2efbe7c771fe82c3c Author: Michael Goin <[email protected]> Date: Wed Oct 9 14:51:47 2024 -0400 [CI/Build] Update Dockerfile install+deploy image to ubuntu 22.04 (#9130) Co-authored-by: DarkLight1337 <[email protected]> commit cdca8994bd856a234112875a92746c5782837768 Author: Russell Bryant <[email protected]> Date: Wed Oct 9 13:15:28 2024 -0400 [CI/Build] mypy: check vllm/entrypoints (#9194) Signed-off-by: Russell Bryant <[email protected]> commit ca77dd7a44f2bc103c668560818918ac0335835a Author: Li, Jiang <[email protected]> Date: Thu Oct 10 00:28:08 2024 +0800 [Hardware][CPU] Support AWQ for CPU backend (#7515) commit 7dea289066eaed35538e74dfadafd1fea1dbe05d Author: Ewout ter Hoeven <[email protected]> Date: Wed Oct 9 17:16:26 2024 +0200 Add Dependabot configuration for GitHub Actions updates (#1217) Co-authored-by: DarkLight1337 <[email protected]> commit cfaa6008e666d4e9bb5131ece68f8609b6f94ee4 Author: Cyrus Leung <[email protected]> Date: Wed Oct 9 22:59:57 2024 +0800 [Bugfix] Access `get_vocab` instead of `vocab` in tool parsers (#9188) commit 21906a6f50ee0edf49ede856a82e8840bab41471 Author: Ahmad Fahadh Ilyas <[email protected]> Date: Wed Oct 9 05:10:44 2024 -0700 [Bugfix] Fix lora loading for Compressed Tensors in #9120 (#9179) commit dc4aea677ab0520d91ff4979e80340cb5a090095 Author: Jiangtao Hu <[email protected]> Date: Wed Oct 9 16:59:42 2024 +0800 [Doc] Fix VLM prompt placeholder sample bug (#9170) commit c8627cd41b10747da393b76c382de5ef0eb635a2 Author: youkaichao <[email protected]> Date: Wed Oct 9 00:38:40 2024 -0700 [ci][test] use load dummy for testing (#9165) commit 8bfaa4e31eb63d41499fec933e68969ebbedb01f Author: Cyrus Leung <[email protected]> Date: Wed Oct 9 15:36:55 2024 +0800 [Bugfix] fix composite weight loading and EAGLE weight loading (#9160) commit 0b5b5d767e7fdc0b1070b37319de749e46a4d42a Author: AlpinDale <[email protected]> Date: Wed Oct 9 07:03:14 2024 +0000 [Frontend] Log the maximum supported concurrency (#8831) commit cdc72e3c80b7029c49de9667150f68481f386956 Author: Hui Liu <[email protected]> Date: Tue Oct 8 23:43:06 2024 -0700 [Model] Remap FP8 kv_scale in CommandR and DBRX (#9174) commit 7627172bf42b9cd628402c98845c6ac3de80859a Author: Joe Rowell <[email protected]> Date: Wed Oct 9 06:43:34 2024 +0100 [Bugfix][Doc] Report neuron error in output (#9159) commit 480b7f40cfa9a900e03ea4e825abc1a46b5d085b Author: Travis Johnson <[email protected]> Date: Tue Oct 8 22:54:48 2024 -0600 [Misc] Improve validation errors around best_of and n (#9167) Signed-off-by: Travis Johnson <[email protected]> commit acce7630c1dd655ca95a9f1abff23d92ef76262c Author: Yuan Tang <[email protected]> Date: Tue Oct 8 23:58:49 2024 -0400 Update link to KServe deployment guide (#9173) commit ffc4b27ea8924b4b5add13552063c93d0a14fb85 Author: Yuan Tang <[email protected]> Date: Tue Oct 8 22:30:48 2024 -0400 Add classifiers in setup.py (#9171) commit 2f4117c38e101ee63b65521c93b22efe3526f77e Author: chenqianfzh <[email protected]> Date: Tue Oct 8 18:52:19 2024 -0700 support bitsandbytes quantization with more models (#9148) commit 9ba0bd6aa6a9a3cefa5c320800ea736a0abbaf36 Author: Michael Goin <[email protected]> Date: Tue Oct 8 21:22:31 2024 -0400 Add `lm-eval` directly to requirements-test.txt (#9161) commit 2a131965a8144d571a4a211a44d1fc32e202ae10 Author: Russell Bryant <[email protected]> Date: Tue Oct 8 18:08:22 2024 -0400 mypy: check additional directories (#9162) Signed-off-by: Russell Bryant <[email protected]> commit bd37b9fbe274e28e12c0687cb9a8111dda270936 Author: bnellnm <[email protected]> Date: Tue Oct 8 17:28:12 2024 -0400 [Bugfix] Try to handle older versions of pytorch (#9086) commit de24046fcd24e8faa81de34b17351887bcdfbe51 Author: Rafael Vasquez <[email protected]> Date: Tue Oct 8 16:22:08 2024 -0400 [Doc] Improve contributing and installation documentation (#9132) Signed-off-by: Rafael Vasquez <[email protected]> commit 1874c6a1b0ae0f9eb2b485653b4e17ed1d861a32 Author: Sayak Paul <[email protected]> Date: Tue Oct 8 23:42:29 2024 +0530 [Doc] Update vlm.rst to include an example on videos (#9155) Co-authored-by: Cyrus Leung <[email protected]> commit 9a94ca4a5d31c0ba57ca67fc1c252233d3284012 Author: Daniele <[email protected]> Date: Tue Oct 8 18:38:40 2024 +0200 [Bugfix] fix OpenAI API server startup with --disable-frontend-multiprocessing (#8537) commit cfba685bd462f360994da7ac0d33f9759589506e Author: Peter Pan <[email protected]> Date: Wed Oct 9 00:37:34 2024 +0800 [CI/Build] Add examples folder into Docker image so that we can leverage the templates*.jinja when serving models (#8758) Signed-off-by: Peter Pan <[email protected]> commit 069d3bd8d01a72e93c0a5b51f8b567e8aaddc6e9 Author: Alex Brooks <[email protected]> Date: Tue Oct 8 08:31:26 2024 -0600 [Frontend] Add Early Validation For Chat Template / Tool Call Parser (#9151) Signed-off-by: Alex-Brooks <[email protected]> commit a3691b6b5eb7e60039a8ff34550be5a7e8365394 Author: Alex Brooks <[email protected]> Date: Tue Oct 8 08:12:56 2024 -0600 [Core][Frontend] Add Support for Inference Time mm_processor_kwargs (#9131) Signed-off-by: Alex-Brooks <[email protected]> commit 8c746226c956f7c8a4672689fee91c7d22befed6 Author: Brendan Wong <[email protected]> Date: Mon Oct 7 22:51:43 2024 -0700 [Frontend] API support for beam search for MQLLMEngine (#9117) commit e1faa2a59876bba99d804c0a94d427cee87b0995 Author: youkaichao <[email protected]> Date: Mon Oct 7 22:26:25 2024 -0700 [misc] improve ux on readme (#9147) commit 80b57f00d554db8a2126d351bb5374c190b56699 Author: Kunshang Ji <[email protected]> Date: Tue Oct 8 11:51:14 2024 +0800 [Intel GPU] Fix xpu decode input (#9145) commit 04c12f81572be22c819018c2fcbddac5f08715d0 Author: youkaichao <[email protected]> Date: Mon Oct 7 19:51:49 2024 -0700 [misc] update utils to support comparing multiple settings (#9140) commit 8eeb85708428b7735bbd1156c81692431fd5ff34 Author: Simon Mo <[email protected]> Date: Mon Oct 7 17:06:21 2024 -0700 Add Slack to README (#9137) commit fa45513a5189b3a9f73a59730c9ac65d061e1311 Author: youkaichao <[email protected]> Date: Mon Oct 7 16:07:05 2024 -0700 [misc] fix comment and variable name (#9139) commit c0d9a98d0c7182b73c2e7f88508e690a186bf0e3 Author: Kuntai Du <[email protected]> Date: Mon Oct 7 15:04:06 2024 -0700 [Doc] Include performance benchmark in README (#9135) commit e0dbdb013dfe5cdbe044317b4d7d55644d6399b3 Author: Russell Bryant <[email protected]> Date: Mon Oct 7 17:18:10 2024 -0400 [CI/Build] Add linting for github actions workflows (#7876) Signed-off-by: Russell Bryant <[email protected]> commit 93cf74a8a7b0b483becdba95e3056adbf201b7b2 Author: TimWang <[email protected]> Date: Tue Oct 8 04:31:45 2024 +0800 [Doc]: Add deploying_with_k8s guide (#8451) commit 151ef4efd2fb52554f4d30408aca619e181ea751 Author: Cyrus Leung <[email protected]> Date: Mon Oct 7 19:55:12 2024 +0800 [Model] Support NVLM-D and fix QK Norm in InternViT (#9045) Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Isotr0py <[email protected]> commit f19da64871065510691cd4fcaa5f4096b661dcec Author: Isotr0py <[email protected]> Date: Mon Oct 7 18:01:46 2024 +0800 [Core] Refactor GGUF parameters packing and forwarding (#8859) commit 4f95ffee6f40198911ee824ed06d645fe9678511 Author: Isotr0py <[email protected]> Date: Mon Oct 7 14:50:35 2024 +0800 [Hardware][CPU] Cross-attention and Encoder-Decoder models support on CPU backend (#9089) commit 8c6de96ea1e6e51e49a170c28ad3efc16db9413e Author: Cyrus Leung <[email protected]> Date: Mon Oct 7 14:10:35 2024 +0800 [Model] Explicit interface for vLLM models and support OOT embedding models (#9108) commit 18b296fdb2248e8a65bf005e7193ebd523b875b6 Author: youkaichao <[email protected]> Date: Sun Oct 6 22:47:04 2024 -0700 [core] remove beam search from the core (#9105) commit c8f26bb63694adb4202ab275efb0759c13edcaa8 Author: sroy745 <[email protected]> Date: Sun Oct 6 20:52:42 2024 -0700 [BugFix][Core] Fix BlockManagerV2 when Encoder Input is None (#9103) commit 487678d046fe56560ff5dc6c91c3f3c31af7de6f Author: Isotr0py <[email protected]> Date: Mon Oct 7 10:14:27 2024 +0800 [Bugfix][Hardware][CPU] Fix CPU model input for decode (#9044) commit cb3b2b9ba4a95c413a879e30e2b8674187519a93 Author: Varun Sundar Rabindranath <[email protected]> Date: Sun Oct 6 15:48:11 2024 -0400 [Bugfix] Fix incorrect updates to num_computed_tokens in multi-step scheduling (#9038) Co-authored-by: Varun Sundar Rabindranath <[email protected]> commit fdf59d30eaf1a62979b2a13016b4f47f28f12f88 Author: Yanyi Liu <[email protected]> Date: Sun Oct 6 20:51:08 2024 +0800 [Bugfix] fix tool_parser error handling when serve a model not support it (#8709) commit b22b79847153ae10710523cdb4a5fb98ac864cf4 Author: Cyrus Leung <[email protected]> Date: Sun Oct 6 16:35:27 2024 +0800 [Model] PP support for embedding models and update docs (#9090) Co-authored-by: Roger Wang <[email protected]> commit f22619fe96c842ee2406638678d2b60009d8ff14 Author: Cyrus Leung <[email protected]> Date: Sun Oct 6 16:33:52 2024 +0800 [Misc] Remove user-facing error for removed VLM args (#9104) commit 168cab6bbfb733f97defc8c1aa13df90c5319f19 Author: Brendan Wong <[email protected]> Date: Sat Oct 5 23:39:03 2024 -0700 [Frontend] API support for beam search (#9087) Co-authored-by: youkaichao <[email protected]> commit 23fea8714a1e90f018163e0eee59d73bc5a500e7 Author: TJian <[email protected]> Date: Sat Oct 5 22:00:04 2024 -0700 [Bugfix] Fix try-catch conditions to import correct Flash Attention Backend in Draft Model (#9101) commit f4dd830e0945300dbe2039af79d1994f074ffcbb Author: youkaichao <[email protected]> Date: Sat Oct 5 19:37:31 2024 -0700 [core] use forward context for flash infer (#9097) commit 5df183489537a155bbaad9232f25b8e57694d7b8 Author: Andy Dai <[email protected]> Date: Sat Oct 5 10:35:11 2024 -0700 [Bugfix] Fix order of arguments matters in config.yaml (#8960) commit cfadb9c68798c0cc4d674de19970a8e3b5ea1273 Author: Chen Zhang <[email protected]> Date: Sat Oct 5 06:56:40 2024 -0700 [Bugfix] Deprecate registration of custom configs to huggingface (#9083) commit 15986f598c7b1f2969918c92f5c4cf7e28d5c0df Author: Xin Yang <[email protected]> Date: Fri Oct 4 23:57:05 2024 -0700 [Model] Support Gemma2 embedding model (#9004) commit 53b3a330273967a3c4124cbfef2cacac92f553ba Author: hhzhang16 <[email protected]> Date: Fri Oct 4 22:05:37 2024 -0700 [Bugfix] Fixes Phi3v & Ultravox Multimodal EmbeddingInputs (#8979) commit dac914b0d6bc36de4eb4bf70a9d20954560893ea Author: Chen Zhang <[email protected]> Date: Fri Oct 4 21:45:38 2024 -0700 [Bugfix] use blockmanagerv1 for encoder-decoder (#9084) Co-authored-by: Roger Wang <[email protected]> commit a95354a36ee65523a499b3eb42f70a4a0ea4322d Author: Zhuohan Li <[email protected]> Date: Fri Oct 4 19:54:45 2024 -0700 [Doc] Update README.md with Ray summit slides (#9088) commit 663874e048d88aa7bf087628430d50f9f5245175 Author: youkaichao <[email protected]> Date: Fri Oct 4 16:43:50 2024 -0700 [torch.compile] improve allreduce registration (#9061) commit cc90419e89c358f906e17a5ec484fbe04092c277 Author: Chongming Ni <[email protected]> Date: Fri Oct 4 16:42:20 2024 -0700 [Hardware][Neuron] Add on-device sampling support for Neuron (#8746) Co-authored-by: Ashraf Mahgoub <[email protected]> commit 27302dd5841d4b0fa4788076ad9ff2993e133409 Author: Cody Yu <[email protected]> Date: Fri Oct 4 16:07:54 2024 -0700 [Misc] Fix CI lint (#9085) commit 0cc566ca8fd2d21a94f3a8e48bf5c5b60d42b59f Author: Andy Dai <[email protected]> Date: Fri Oct 4 14:58:57 2024 -0700 [Misc] Add random seed for prefix cache benchmark (#9081) commit 05c531be476e8a864a1ab83a65f7e056315ea1fc Author: Andy Dai <[email protected]> Date: Fri Oct 4 14:38:42 2024 -0700 [Misc] Improved prefix cache example (#9077) commit fbb74420e7018bf0cc1bc81e6fd71a2392347227 Author: Kuntai Du <[email protected]> Date: Fri Oct 4 14:01:44 2024 -0700 [CI] Update performance benchmark: upgrade trt-llm to r24.07, and add SGLang (#7412) commit 05d686432f2e13296127962861b21c25cdcdfc8b Author: ElizaWszola <[email protected]> Date: Fri Oct 4 20:34:44 2024 +0200 [Kernel] Zero point support in fused MarlinMoE kernel + AWQ Fused MoE (#8973) Co-authored-by: Dipika <[email protected]> Co-authored-by: Dipika Sikka <[email protected]> commit 0dcc8cbe5abd4f2fafd495bd1c65fdd75d8dd919 Author: Flávia Béo <[email protected]> Date: Fri Oct 4 15:31:40 2024 -0300 Adds truncate_prompt_tokens param for embeddings creation (#8999) Signed-off-by: Flavia Beo <[email protected]> commit 26aa325f4ffe8bf1d9b921535cc02fb31d80a96d Author: Roger Wang <[email protected]> Date: Fri Oct 4 10:38:25 2024 -0700 [Core][VLM] Test registration for OOT multimodal models (#8717) Co-authored-by: DarkLight1337 <[email protected]> commit e5dc713c2343b3549b43d6e2764a1036e4052bf8 Author: Varad Ahirwadkar <[email protected]> Date: Fri Oct 4 22:54:42 2024 +0530 [Hardware][PowerPC] Make oneDNN dependency optional for Power (#9039) Signed-off-by: Varad Ahirwadkar <[email protected]> commit 36eecfbddb9ac2c491174c86b28ee83c4773eb5e Author: Simon Mo <[email protected]> Date: Fri Oct 4 10:17:16 2024 -0700 Remove AMD Ray Summit Banner (#9075) commit 9ade8bbc8dc63c03b9399f05e85a0d0ddc6f5788 Author: Prashant Gupta <[email protected]> Date: Fri Oct 4 09:24:40 2024 -0700 [Model] add a bunch of supported lora modules for mixtral (#9008) Signed-off-by: Prashant Gupta <[email protected]> commit 22482e495e00d409c9b5c78dade6e672ddf7fbc2 Author: Lucas Wilkinson <[email protected]> Date: Fri Oct 4 11:43:15 2024 -0400 [Bugfix] Flash attention arches not getting set properly (#9062) commit 3d826d2c52242f4f78789adcb7c02938c84ed18b Author: whyiug <[email protected]> Date: Fri Oct 4 22:34:58 2024 +0800 [Bugfix] Reshape the dimensions of the input image embeddings in Qwen2VL (#9071) commit 0e36fd4909780392a9c5d0e367b0a84250d55fa8 Author: Cyrus Leung <[email protected]> Date: Fri Oct 4 18:01:37 2024 +0800 [Misc] Move registry to its own file (#9064) commit 0f6d7a9a347944bffd2204cbf9686299e9dd6557 Author: Murali Andoorveedu <[email protected]> Date: Thu Oct 3 19:56:58 2024 -0700 [Models] Add remaining model PP support (#7168) Signed-off-by: Muralidhar Andoorveedu <[email protected]> Signed-off-by: Murali Andoorveedu <[email protected]> Co-authored-by: DarkLight1337 <[email protected]> commit 303d44790a2ccab86257f1b6097e67795f0845d4 Author: Michael Goin <[email protected]> Date: Thu Oct 3 22:55:42 2024 -0400 [Misc] Enable multi-step output streaming by default (#9047) commit aeb37c2a725554791ff6f258b1e18830867a3ab9 Author: Lucas Wilkinson <[email protected]> Date: Thu Oct 3 22:55:25 2024 -0400 [CI/Build] Per file CUDA Archs (improve wheel size and dev build times) (#8845) commit 3dbb215b38c010c050f7fde3528fe2c6673f7a07 Author: 代君 <[email protected]> Date: Fri Oct 4 10:36:39 2024 +0800 [Frontend][Feature] support tool calling for internlm/internlm2_5-7b-chat model (#8405) commit 2838d6b38e1e37b303b01f2af0a9ddee2dd66f39 Author: Domen Vreš <[email protected]> Date: Fri Oct 4 01:53:29 2024 +0200 [Bugfix] Weight loading fix for OPT model (#9042) Co-authored-by: dvres <[email protected]> commit 91add85ec409a3628d01a1e4d4b3230e0fd3aa3f Author: sroy745 <[email protected]> Date: Thu Oct 3 16:07:29 2024 -0700 Fix failing spec decode test (#9054) commit 9aaf14c62e16a7c74b5192a44d01a78125dab2fc Author: youkaichao <[email protected]> Date: Thu Oct 3 12:09:42 2024 -0700 [misc] add forward context for attention (#9029) commit 63e39937f990818e2f22a9b821a4aa22387057a7 Author: xendo <[email protected]> Date: Thu Oct 3 20:02:07 2024 +0200 [Frontend] [Neuron] Parse literals out of override-neuron-config (#8959) Co-authored-by: Jerzy Zagorski <[email protected]> commit f5d72b2fc6771de19c351945f1fbbb0198d53b8e Author: sroy745 <[email protected]> Date: Thu Oct 3 09:44:21 2024 -0700 [Core] Make BlockSpaceManagerV2 the default BlockManager to use. (#8678) commit 83caf35e082b2657dce5f71ff965a13653a763b0 Author: Guillaume Calmettes <[email protected]> Date: Thu Oct 3 10:44:52 2024 +0200 [BugFix] Enforce Mistral ToolCall id constraint when using the Mistral tool call parser (#9020) commit 01843c89b8ddae00d4a0f0f56b8aa7fbaa3efc42 Author: Divakar Verma <[email protected]> Date: Wed Oct 2 23:31:07 2024 -0500 [Misc] log when using default MoE config (#8971) commit 19a4dd09904975d121a10e5e3f707927f3e09faa Author: Travis Johnson <[email protected]> Date: Wed Oct 2 21:04:17 2024 -0600 [Bugfix] example template should not add parallel_tool_prompt if tools is none (#9007) commit 18c2e30c5754dc83f86d9b8c75af0499a77e4b3f Author: Nick Hill <[email protected]> Date: Thu Oct 3 03:42:24 2024 +0100 [Doc] Update Granite model docs (#9025) commit 19f0d2579695e518c9bfc166544cf23775772bf8 Author: Shawn Tan <[email protected]> Date: Wed Oct 2 21:33:57 2024 -0400 [Model] Adding Granite MoE. (#8206) Co-authored-by: Nick Hill <[email protected]> commit f58d4fccc9b270838be438f5f0db71bea156a56d Author: Sergey Shlyapnikov <[email protected]> Date: Thu Oct 3 01:50:01 2024 +0400 [OpenVINO] Enable GPU support for OpenVINO vLLM backend (#8192) commit afb050b29d0cac27c32c19c8206a9ac2a4662de2 Author: Varun Sundar Rabindranath <[email protected]> Date: Wed Oct 2 15:44:39 2024 -0400 [Core] CUDA Graphs for Multi-Step + Chunked-Prefill (#8645) Co-authored-by: Varun Sundar Rabindranath <[email protected]> commit 7f60520deb05d2e097b408e3310f1d383fbf1de6 Author: Alex Brooks <[email protected]> Date: Wed Oct 2 05:44:38 2024 -0600 [Misc] Update Default Image Mapper Error Log (#8977) Signed-off-by: Alex-Brooks <[email protected]> Co-authored-by: Roger Wang <[email protected]> commit 563649aafe7d4b9cb0047bba60d6f58efa53fd28 Author: afeldman-nm <[email protected]> Date: Wed Oct 2 03:52:20 2024 -0400 [Core] Combined support for multi-step scheduling, chunked prefill & prefix caching (#8804) Co-authored-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Andrew Feldman <[email protected]> commit 15702038642192002cd8973cf8948751b750fd07 Author: Lily Liu <[email protected]> Date: Tue Oct 1 16:04:42 2024 -0700 [Spec Decode] (1/2) Remove batch expansion (#8839) commit 22f5851b807376a836eb3551903c7fc6c81eaa9b Author: vlsav <[email protected]> Date: Tue Oct 1 21:07:06 2024 +0300 Update benchmark_serving.py to read and write json-datasets, results in UTF8, for better compatibility with Windows (#8997) commit 4f341bd4bf35c5b431dc523bab86e4ae210baaf8 Author: Cyrus Leung <[email protected]> Date: Wed Oct 2 00:35:39 2024 +0800 [Doc] Update list of supported models (#8987) commit 35bd2151684ffb20cdad825abe33e0e6f0cc005a Author: Sebastian Schoennenbeck <[email protected]> Date: Tue Oct 1 11:58:06 2024 +0200 [Core] [Frontend] Priority scheduling for embeddings and in the OpenAI-API (#8965) commit 1fe0a4264aa94ceeccc7e8d99ac0d72f0560f541 Author: Alex Brooks <[email protected]> Date: Tue Oct 1 03:52:44 2024 -0600 [Bugfix] Fix Token IDs Reference for MiniCPM-V When Images are Provided With No Placeholders (#8991) Signed-off-by: Alex-Brooks <[email protected]> commit bc4eb65b5492b4f84a1b714bfc14bcff73d401f1 Author: Isotr0py <[email protected]> Date: Tue Oct 1 17:51:41 2024 +0800 [Bugfix] Fix Fuyu tensor parallel inference (#8986) commit 82f3937e599a4f088a62e59abe81d51e11bb8f83 Author: Divakar Verma <[email protected]> Date: Mon Sep 30 22:46:41 2024 -0500 [Misc] add process_weights_after_loading for DummyLoader (#8969) commit 7da2487591888da043254f8c7045a48d5dbcc753 Author: youkaichao <[email protected]> Date: Mon Sep 30 20:40:48 2024 -0700 [torch.compile] fix tensor alias (#8982) commit aaccca2b4d3895d64d34b123e61731404c8fc2c0 Author: Kevin H. Luu <[email protected]> Date: Mon Sep 30 20:33:12 2024 -0700 [CI/Build] Fix machete generated kernel files ordering (#8976) Signed-off-by: kevin <[email protected]> Co-authored-by: Cody Yu <[email protected]> commit 062c89e7c9c6fa9fd7fb2d28fd50321c6f78f389 Author: Joe Runde <[email protected]> Date: Mon Sep 30 19:34:25 2024 -0600 [Frontend][Core] Move guided decoding params into sampling params (#8252) Signed-off-by: Joe Runde <[email protected]> Co-authored-by: Nick Hill <[email protected]> commit bce324487a8e36140143ea37f4b27d273a0fd661 Author: Lily Liu <[email protected]> Date: Mon Sep 30 17:51:40 2024 -0700 [CI][SpecDecode] Fix spec decode tests, use flash attention backend for spec decode CI tests. (#8975) commit 1425a1bcf9c53e24fe5f4812acc5b656f2aa02f3 Author: Kevin H. Luu <[email protected]> Date: Mon Sep 30 17:47:08 2024 -0700 [ci] Add CODEOWNERS for test directories (#8795) Signed-off-by: kevin <[email protected]> commit 1cabfcefb64a489c8ff9dcb289b4dd47cf8f89cf Author: Jee Jee Li <[email protected]> Date: Mon Sep 30 20:57:39 2024 +0800 [Misc] Adjust max_position_embeddings for LoRA compatibility (#8957) commit be76e5aabf8c026e1a82028ad70167e8c652cee9 Author: Sebastian Schoennenbeck <[email protected]> Date: Mon Sep 30 14:28:44 2024 +0200 [Core] Make scheduling policy settable via EngineArgs (#8956) commit 2ae25f79cf1e8d21f7bcba097e4c039463c22be4 Author: Isotr0py <[email protected]> Date: Mon Sep 30 13:01:20 2024 +0800 [Model] Expose InternVL2 max_dynamic_patch as a mm_processor_kwarg (#8946) commit 8e60afa15eb9a0540ce6c453b974a945adff3320 Author: Jee Jee Li <[email protected]> Date: Mon Sep 30 12:31:55 2024 +0800 [Model][LoRA]LoRA support added for MiniCPMV2.6 (#8943) Co-authored-by: DarkLight1337 <[email protected]> commit b6d7392579286b6dbd8ca96c0bcb4cc6f7c3c4a0 Author: Roger Wang <[email protected]> Date: Sun Sep 29 21:28:26 2024 -0700 [Misc][CI/Build] Include `cv2` via `mistral_common[opencv]` (#8951) commit e01ab595d897698c9a5fe9eaebd983eb3e23470a Author: whyiug <[email protected]> Date: Mon Sep 30 11:16:10 2024 +0800 [Model] support input embeddings for qwen2vl (#8856) commit f13a07b1f8c11ddbdc53b40f1fbb24bf3166b900 Author: Mor Zusman <[email protected]> Date: Mon Sep 30 00:35:58 2024 +0300 [Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels and Jamba model (#8533) commit 6c9ba48fdebe2f44c82eabfe136dc8dc6ad6f4ed Author: danieljannai21 <[email protected]> Date: Sun Sep 29 20:59:47 2024 +0300 [Frontend] Added support for HF's new `continue_final_message` parameter (#8942) commit 1fb9c1b0bf8e65e6576ff4c45f5623d233d7194b Author: juncheoll <[email protected]> Date: Mon Sep 30 00:05:54 2024 +0900 [Misc] Fix typo in BlockSpaceManagerV1 (#8944) commit 31f46a0d35da80118bac5f80c533019cd50ddd9a Author: Nick Hill <[email protected]> Date: Sun Sep 29 10:43:14 2024 +0100 [BugFix] Fix seeded random sampling with encoder-decoder models (#8870) Co-authored-by: Roger Wang <[email protected]> commit 3d49776bbb25927abf91bb7c5537e0006c199c16 Author: Jee Jee Li <[email protected]> Date: Sun Sep 29 14:59:45 2024 +0800 [Model][LoRA]LoRA support added for MiniCPMV2.5 (#7199) commit bc2ef1f77c1578612198f60ec392731efb3847c5 Author: Zilin Zhu <[email protected]> Date: Sun Sep 29 12:19:39 2024 +0800 [Model] Support Qwen2.5-Math-RM-72B (#8896) commit 2e7fe7e79f41e294eeed2f484eeb791284ec48a2 Author: Tyler Michael Smith <[email protected]> Date: Sat Sep 28 23:13:01 2024 -0400 [Build/CI] Set FETCHCONTENT_BASE_DIR to one location for better caching (#8930) commit 26a68d5d7e7dd47c7d8538a326493c8a171f5016 Author: Cyrus Leung <[email protected]> Date: Sun Sep 29 10:50:51 2024 +0800 [CI/Build] Add test decorator for minimum GPU memory (#8925) commit d081da0064b5cda9e344f0fd519d67523a437a39 Author: ElizaWszola <[email protected]> Date: Sun Sep 29 03:19:40 2024 +0200 [Bugfix] Fix Marlin MoE act order when is_k_full == False (#8741) Co-authored-by: Tyler Michael Smith <[email protected]> commit 5bf8789b2a28df1305f92b9999fe60264f839caa Author: sroy745 <[email protected]> Date: Sat Sep 28 18:17:45 2024 -0700 [Bugfix] Block manager v2 with preemption and lookahead slots (#8824) commit d1537039ce7e6018db510d0c0d9b0c0fccb62b63 Author: Russell Bryant <[email protected]> Date: Sat Sep 28 21:17:07 2024 -0400 [Core] Improve choice of Python multiprocessing method (#8823) Signed-off-by: Russell Bryant <[email protected]> Co-authored-by: youkaichao <[email protected]> commit cc276443b5ac0732b00a88472f4bc4330aa14606 Author: youkaichao <[email protected]> Date: Sat Sep 28 17:48:41 2024 -0700 [doc] organize installation doc and expose per-commit docker (#8931) commit e585b583a92903c9a5cc8055a444a208f4387891 Author: Chen Zhang <[email protected]> Date: Sat Sep 28 11:51:22 2024 -0700 [Bugfix] Support testing prefill throughput with benchmark_serving.py --hf-output-len 1 (#8891) commit 090e945e36cfe849b484db5414f64df96e97d678 Author: Edouard B. <[email protected]> Date: Sat Sep 28 20:30:21 2024 +0200 [Frontend] Make beam search emulator temperature modifiable (#8928) Co-authored-by: Eduard Balzin <[email protected]> commit e1a3f5e831a467b2867a66e0e56ac0f70ed44394 Author: Cyrus Leung <[email protected]> Date: Sun Sep 29 00:54:35 2024 +0800 [CI/Build] Update models tests & examples (#8874) Co-authored-by: Roger Wang <[email protected]> commit 19d02ff93812fb6a28f0f1a0a0f9233e9388d616 Author: Varun Sundar Rabindranath <[email protected]> Date: Sat Sep 28 11:52:46 2024 -0400 [Bugfix] Fix PP for Multi-Step (#8887) commit 39d3f8d94fd2691b70ee809e7565402f8a061c6b Author: tastelikefeet <[email protected]> Date: Sat Sep 28 23:24:12 2024 +0800 [Bugfix] Fix code for downloading models from modelscope (#8443) commit b0298aa8cc4a54bde659e57271778630785abc9b Author: Cyrus Leung <[email protected]> Date: Sat Sep 28 16:11:25 2024 +0800 [Misc] Remove vLLM patch of `BaichuanTokenizer` (#8921) commit 260024a3749fb6856625dfee28560a98a92dd339 Author: Tyler Titsworth <[email protected]> Date: Fri Sep 27 23:45:50 2024 -0700 [Bugfix][Intel] Fix XPU Dockerfile Build (#7824) Signed-off-by: tylertitsworth <[email protected]> Co-authored-by: youkaichao <[email protected]> commit d86f6b2afb006ea4b4b14a49a58f64bf3b952de6 Author: youkaichao <[email protected]> Date: Fri Sep 27 22:10:44 2024 -0700 [misc] fix wheel name (#8919) commit bd429f2b75f3622fabaf9c9470ca2e921f6f56ca Author: Sebastian Schoennenbeck <[email protected]> Date: Sat Sep 28 00:07:10 2024 +0200 [Core] Priority-based scheduling in async engine (#8850) commit 18e60d7d1394541b48bf48b0a57a546a93607ac2 Author: youkaichao <[email protected]> Date: Fri Sep 27 14:27:56 2024 -0700 [misc][distributed] add VLLM_SKIP_P2P_CHECK flag (#8911) commit c2ec430ab5713d0626c1a7809718ef6c4eebf389 Author: Varun Sundar Rabindranath <[email protected]> Date: Fri Sep 27 16:32:07 2024 -0400 [Core] Multi-Step + Single Step Prefills via Chunked Prefill code path (#8378) Co-authored-by: Varun Sundar Rabindranath <[email protected]> commit c5d55356f9d2b2075ac53cf20453358c1e2b7bde Author: Lucas Wilkinson <[email protected]> Date: Fri Sep 27 15:12:34 2024 -0400 [Bugfix] fix for deepseek w4a16 (#8906) Co-authored-by: mgoin <[email protected]> commit 172d1cd27634e9e7adc9cb9feac73552cfae1b24 Author: Luka Govedič <[email protected]> Date: Fri Sep 27 14:25:10 2024 -0400 [Kernel] AQ AZP 4/4: Integrate asymmetric quantization to linear method (#7271) commit a9b15c606fea67a072416ea0ea115261a2756058 Author: youkaichao <[email protected]> Date: Fri Sep 27 08:11:32 2024 -0700 [torch.compile] use empty tensor instead of None for profiling (#8875) commit 8df2dc3c8812c0abb97ce3e2913411d88524e59f Author: Brittany <[email protected]> Date: Fri Sep 27 01:16:55 2024 -0700 [TPU] Update pallas.py to support trillium (#8871) commit 6d792d2f31b2cfb335d1a4a7c45fe4ce143c203a Author: Isotr0py <[email protected]> Date: Fri Sep 27 16:15:58 2024 +0800 [Bugfix][VLM] Fix Fuyu batching inference with `max_num_seqs>1` (#8892) commit 0e088750af2e8035c07d356b56c03393cfb56004 Author: Peter Pan <[email protected]> Date: Fri Sep 27 16:13:25 2024 +0800 [MISC] Fix invalid escape sequence '\' (#8830) Signed-off-by: Peter Pan <[email protected]> commit dc4e3df5c23282b2ebaead95f179c25c9d7ec4d8 Author: youkaichao <[email protected]> Date: Fri Sep 27 00:26:38 2024 -0700 [misc] fix collect env (#8894) commit 3b00b9c26c91e9f9ada12975b613555698054e39 Author: Cyrus Leung <[email protected]> Date: Fri Sep 27 11:35:15 2024 +0800 [Core] rename`PromptInputs` and `inputs` (#8876) commit 344cd2b6f4c22bf278cff96066001d216ec1fe82 Author: Maximilien de Bayser <[email protected]> Date: Thu Sep 26 21:01:42 2024 -0300 [Feature] Add support for Llama 3.1 and 3.2 tool use (#8343) Signed-off-by: Max de Bayser <[email protected]> commit 1b49148e474d4d18731e159ea0460145ae52e220 Author: Cyrus Leung <[email protected]> Date: Fri Sep 27 07:54:09 2024 +0800 [Installation] Allow lower versions of FastAPI to maintain Ray 2.9 compatibility (#8764) commit 4b377d6febed7ddd964f1b96079d7e78c231325e Author: Nick Hill <[email protected]> Date: Fri Sep 27 00:46:43 2024 +0100 [BugFix] Fix test breakages from transformers 4.45 upgrade (#8829) commit 71d21c73abfb9b12ea402ce6b11c1b8e31eddf4c Author: Tyler Michael Smith <[email protected]> Date: Thu Sep 26 19:23:45 2024 -0400 [Bugfix] Fixup advance_step.cu warning (#8815) commit ee2da3e9efb38add804e2023d47e9f42f38bd638 Author: Chirag Jain <[email protected]> Date: Fri Sep 27 04:53:17 2024 +0530 fix validation: Only set tool_choice `auto` if at least one tool is provided (#8568) commit e2f6f26e8636b8a23e5c0cda533a70c40ade01ec Author: Tyler Michael Smith <[email protected]> Date: Thu Sep 26 19:18:26 2024 -0400 [Bugfix] Fix print_warning_once's line info (#8867) commit b28d2104dea6ba80c0f1f6c4596b5703d7ef923d Author: Michael Goin <[email protected]> Date: Thu Sep 26 19:18:14 2024 -0400 [Misc] Change dummy profiling and BOS fallback warns to log once (#8820) commit 93d364da3406f5523e5e4772ffbc3c72dac7bbf4 Author: Pernekhan Utemuratov <[email protected]> Date: Thu Sep 26 15:47:00 2024 -0700 [Bugfix] Include encoder prompts len to non-stream api usage response (#8861) commit d9cfbc891e2e1d62d74c7aae93bde436a29bd574 Author: Kevin H. Luu <[email protected]> Date: Thu Sep 26 15:02:16 2024 -0700 [ci] Soft fail Entrypoints, Samplers, LoRA, Decoder-only VLM (#8872) Signed-off-by: kevin <[email protected]> commit 70de39f6b46f6b90aecba52358825127a50b3921 Author: youkaichao <[email protected]> Date: Thu Sep 26 13:19:04 2024 -0700 [misc][installation] build from source without compilation (#8818) commit 68988d4e0d8765901c51f07f9bfbda58f35f6f63 Author: fyuan1316 <[email protected]> Date: Fri Sep 27 02:04:39 2024 +0800 [CI/Build] Fix missing ci dependencies (#8834) commit 520db4dbc10cfc60be65e85ff4ef3a6aeeeb7836 Author: Michael Goin <[email protected]> Date: Thu Sep 26 14:02:52 2024 -0400 [Docs] Add README to the build docker image (#8825) commit f70bccac75a0aecc0a5fc934859158a3e1f019a5 Author: Tyler Michael Smith <[email protected]> Date: Thu Sep 26 13:07:18 2024 -0400 [Build/CI] Upgrade to gcc 10 in the base build Docker image (#8814) commit 4bb98f2190aaf408cb063df5184829fb54ee5f81 Author: Roger Wang <[email protected]> Date: Thu Sep 26 07:45:30 2024 -0700 [Misc] Update config loading for Qwen2-VL and remove Granite (#8837) commit 7193774b1ff8603ad5bf4598e5efba0d9a39b436 Author: Michael Goin <[email protected]> Date: Wed Sep 25 17:46:22 2024 -0400 [Misc] Support quantization of MllamaForCausalLM (#8822) commit e2c6e0a8291126c868b669f631837c7781646fdc Author: Roger Wang <[email protected]> Date: Wed Sep 25 13:29:48 2024 -0700 [Doc] Update doc for Transformers 4.45 (#8817) commit 770ec6024fc00cd696899f5c6fdc53b7148876e6 Author: Chen Zhang <[email protected]> Date: Wed Sep 25 13:29:32 2024 -0700 [Model] Add support for the multi-modal Llama 3.2 model (#8811) Co-authored-by: simon-mo <[email protected]> Co-authored-by: Chang Su <[email protected]> Co-authored-by: Simon Mo <[email protected]> Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Roger Wang <[email protected]> commit 4f1ba0844b83b4e7d0ff1672b7ba502ce8732f95 Author: Simon Mo <[email protected]> Date: Wed Sep 25 10:36:26 2024 -0700 Revert "rename PromptInputs and inputs with backward compatibility (#8760) (#8810) commit 873edda6cf8a2902e8b08eea0bf8f8f6d73704a8 Author: Michael Goin <[email protected]> Date: Wed Sep 25 12:43:36 2024 -0400 [Misc] Support FP8 MoE for compressed-tensors (#8588) commit 64840dfae48621c5c2004eb8f1cb7fba49f9b24e Author: 科英 <[email protected]> Date: Thu Sep 26 00:37:41 2024 +0800 [Frontend] MQLLMEngine supports profiling. (#8761) commit 28e1299e60e565a56a2db41396380f74b8d29e57 Author: Cyrus Leung <[email protected]> Date: Thu Sep 26 00:36:47 2024 +0800 rename PromptInputs and inputs with backward compatibility (#8760) commit 0c4d2ad5e641de145682674066a84ffc632e714e Author: DefTruth <[email protected]> Date: Thu Sep 26 00:35:53 2024 +0800 [VLM][Bugfix] internvl with num_scheduler_steps > 1 (#8614) commit c6f2485c823b5cd76cca70798e653c6eadb811de Author: Jee Jee Li <[email protected]> Date: Thu Sep 26 00:35:23 2024 +0800 [[Misc]] Add extra deps for openai server image (#8792) commit 300da09177477d0a4d2b55790addefd971f52ae0 Author: bnellnm <[email protected]> Date: Wed Sep 25 10:35:52 2024 -0400 [Kernel] Fullgraph and opcheck tests (#8479) commit 1c046447a6d1ac3c99b9f453796f0d355d673deb Author: Hongxia Yang <[email protected]> Date: Wed Sep 25 10:26:37 2024 -0400 [CI/Build][Bugfix][Doc][ROCm] CI fix and doc update after ROCm 6.2 upgrade (#8777) commit 8fae5ed7f6bfd63b81310fcb24b310d9205c9687 Author: Woo-Yeon Lee <[email protected]> Date: Wed Sep 25 16:53:03 2024 +0900 [Misc] Fix minor typo in scheduler (#8765) commit 3368c3ab36436af1342a3156971412e9efdb6419 Author: David Newman <[email protected]> Date: Wed Sep 25 17:52:26 2024 +1000 [Bugfix] Ray 2.9.x doesn't expose available_resources_per_node (#8767) Signed-off-by: darthhexx <[email protected]> commit 1ac3de09cd87290f7494ce6337623d6edd3f8667 Author: Adam Tilghman <[email protected]> Date: Wed Sep 25 00:49:26 2024 -0700 [Frontend] OpenAI server: propagate usage accounting to FastAPI middleware layer (#8672) commit 3e073e66f1790f7ce339dad71514983e6e402f30 Author: sohamparikh <[email protected]> Date: Wed Sep 25 02:16:30 2024 -0400 [Bugfix] load fc bias from config for eagle (#8790) commit c23953675f78bc85045d66fa98aea7d0581c2167 Author: Isotr0py <[email protected]> Date: Wed Sep 25 14:16:11 2024 +0800 [Hardware][CPU] Enable mrope and support Qwen2-VL on CPU backend (#8770) commit e3dd0692fa2c803cd6f59a88d2fdf8bca26d8d96 Author: zifeitong <[email protected]> Date: Tue Sep 24 22:53:43 2024 -0700 [BugFix] Propagate 'trust_remote_code' setting in internvl and minicpmv (#8250) commit fc3afc20df410dd523f94967b98836084f561ab7 Author: sroy745 <[email protected]> Date: Tue Sep 24 21:26:36 2024 -0700 Fix tests in test_chunked_prefill_scheduler which fail with BlockManager V2 (#8752) commit b4522474a32b6e0bf5573a9b6a6830cb787dfb63 Author: sasha0552 <[email protected]> Date: Wed Sep 25 04:26:33 2024 +0000 [Bugfix][Kernel] Implement acquire/release polyfill for Pascal (#8776) commit ee777d9c30418ffa9d98f98dd27c0ddea346c49c Author: sroy745 <[email protected]> Date: Tue Sep 24 21:26:18 2024 -0700 Fix test_schedule_swapped_simple in test_scheduler.py (#8780) commit 6e0c9d6bd07464b311eb098e2dac8196eed16721 Author: Joe Runde <[email protected]> Date: Tue Sep 24 21:37:38 2024 -0600 [Bugfix] Use heartbeats instead of health checks (#8583) commit 6da1ab6b4134d76391a0c31a048e5d04b6283769 Author: Archit Patke <[email protected]> Date: Tue Sep 24 21:50:50 2024 -0500 [Core] Adding Priority Scheduling (#5958) commit 01b6f9e1f0530a7cb81486ff34d3d935e4f75d28 Author: Travis Johnson <[email protected]> Date: Tue Sep 24 18:29:56 2024 -0600 [Core][Bugfix] Support prompt_logprobs returned with speculative decoding (#8047) Signed-off-by: Travis Johnson <[email protected]> commit 13f9f7a3d0373421ee9fd7498e450214e134aa6c Author: Jee Jee Li <[email protected]> Date: Wed Sep 25 08:08:55 2024 +0800 [[Misc]Upgrade bitsandbytes to the latest version 0.44.0 (#8768) commit 1e7d5c01f5c35424eede1bbe6f723dd8781120f0 Author: youkaichao <[email protected]> Date: Tue Sep 24 15:48:39 2024 -0700 [misc] soft drop beam search (#8763) commit 2467b642dd9bde32a334fe5967efd78a53aa49da Author: Daniele <[email protected]> Date: Tue Sep 24 21:38:12 2024 +0200 [CI/Build] fix setuptools-scm usage (#8771) commit 72fc97a0f100b92f1ff6c6a16e27d12f1c7569aa Author: Lucas Wilkinson <[email protected]> Date: Tue Sep 24 14:33:21 2024 -0400 [Bugfix] Fix torch dynamo fixes caused by `replace_parameters` (#8748) commit 2529d09b5a4a124a316b6976e7d782f54e0bddde Author: Andy <[email protected]> Date: Tue Sep 24 12:44:11 2024 -0400 [Frontend] Batch inference for llm.chat() API (#8648) Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Roger Wang <[email protected]> commit a928ded99519f803d4cf6389df6acc707239a5cc Author: ElizaWszola <[email protected]> Date: Tue Sep 24 18:31:42 2024 +0200 [Kernel] Split Marlin MoE kernels into multiple files (#8661) Co-authored-by: mgoin <[email protected]> commit cc4325b66ac49e403ed9e1a8c38156a5324e1174 Author: Hanzhi Zhou <[email protected]> Date: Tue Sep 24 01:08:14 2024 -0700 [Bugfix] Fix potentially unsafe custom allreduce synchronization (#8558) commit 8ff7ced996d5dc8b682913471f36c9fefb0e843f Author: Alex Brooks <[email protected]> Date: Tue Sep 24 01:36:46 2024 -0600 [Model] Expose Phi3v num_crops as a mm_processor_kwarg (#8658) Signed-off-by: Alex-Brooks <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: DarkLight1337 <[email protected]> commit 3f06bae9079ee495a34cfadcd9c1ef2a23636084 Author: Peter Salas <[email protected]> Date: Tue Sep 24 00:14:15 2024 -0700 [Core][Model] Support loading weights by ID within models (#7931) commit b8747e8a7c318ab774862f94ccbdbba5b7d9dd4a Author: Cody Yu <[email protected]> Date: Mon Sep 23 23:10:03 2024 -0700 [MISC] Skip dumping inputs when unpicklable (#8744) commit 3185fb0ccae73816018d0936c03171b7cf1ba2f8 Author: Simon Mo <[email protected]> Date: Mon Sep 23 22:45:20 2024 -0700 Revert "[Core] Rename `PromptInputs` to `PromptType`, and `inputs` to `prompt`" (#8750) commit 0250dd68c5df12ead29d2ec7d922855c9a257b06 Author: youkaichao <[email protected]> Date: Mon Sep 23 22:08:12 2024 -0700 re-implement beam search on top of vllm core (#8726) Co-authored-by: Brendan Wong <[email protected]> commit 88577ac92808cfd9468e4b54b757d5fcbe9aa486 Author: sroy745 <[email protected]> Date: Mon Sep 23 21:43:13 2024 -0700 Fix tests in test_scheduler.py that fail with BlockManager V2 (#8728) commit 530821d00cb2beeb8dc62f74f0e4e0003868dc93 Author: Hongxia Yang <[email protected]> Date: Mon Sep 23 21:52:39 2024 -0400 [Hardware][AMD] ROCm6.2 upgrade (#8674) commit 1a2aef3e59f5429299618bd3b242833cb377f554 Author: Alexander Matveev <[email protected]> Date: Mon Sep 23 18:38:04 2024 -0400 Add output streaming support to multi-step + async while ensuring RequestOutput obj reuse (#8335) commit 5f7bb584272ee15147a411b887e7ababd6b9b9d0 Author: jiqing-feng <[email protected]> Date: Tue Sep 24 03:32:27 2024 +0800 Fix typical acceptance sampler with correct recovered token ids (#8562) commit b05f5c9238c3e0c3a98080b4ffc90acfa33f9e1f Author: Russell Bryant <[email protected]> Date: Mon Sep 23 15:15:41 2024 -0400 [Core] Allow IPv6 in VLLM_HOST_IP with zmq (#8575) Signed-off-by: Russell Bryant <[email protected]> commit 9b0e3ec970f6a19427be358848a2ed663fd735e1 Author: Jee Jee Li <[email protected]> Date: Tue Sep 24 02:57:42 2024 +0800 [Kernel][LoRA] Add assertion for punica sgmv kernels (#7585) commit 86e9c8df29a954a7a2fc46e9985fecc2a2e15ae8 Author: Lucas Wilkinson <[email protected]> Date: Mon Sep 23 13:46:26 2024 -0400 [Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin (#7701) Co-authored-by: mgoin <[email protected]> Co-authored-by: Divakar Verma <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> commit ee5f34b1c2c71b2d56054a5ca23fe1c50c1458bb Author: Daniele <[email protected]> Date: Mon Sep 23 18:44:26 2024 +0200 [CI/Build] use setuptools-scm to set __version__ (#4738) Co-authored-by: youkaichao <[email protected]> commit f2bd246c17ba67d7749a2560a30711f74cd19177 Author: Jani Monoses <[email protected]> Date: Mon Sep 23 17:43:09 2024 +0300 [VLM] Fix paligemma, fuyu and persimmon with transformers 4.45 : use config.text_config.vocab_size (#8707) commit a79e5229843e2800956956d0668b1b4858dbb61e Author: Yanyi Liu <[email protected]> Date: Mon Sep 23 21:46:59 2024 +0800 [Model] Support pp for qwen2-vl (#8696) commit 3e83c12b5caa466bf533b144a9ec7944a9ce9d49 Author: Li, Jiang <[email protected]> Date: Mon Sep 23 21:15:16 2024 +0800 [Bugfix][CPU] fix missing input intermediate_tensors in the cpu_model_runner (#8733) commit e551ca1555b64ba1ecb2310ea658f3e25c62571d Author: Isotr0py <[email protected]> Date: Mon Sep 23 20:12:20 2024 +0800 [Hardware][CPU] Refactor CPU model runner (#8729) commit 9b8c8ba1198cbcd311d28b7647f0f8d5dcdc9212 Author: Alex Brooks <[email protected]> Date: Mon Sep 23 01:44:48 2024 -0600 [Core][Frontend] Support Passing Multimodal Processor Kwargs (#8657) Signed-off-by: Alex-Brooks <[email protected]> commit d23679eb9960ad2a876b88ebd0028dbe55c3172a Author: Yan Ma <[email protected]> Date: Mon Sep 23 13:54:18 2024 +0800 [Bugfix] fix docker build for xpu (#8652) commit 57a0702e63d9dc477ab7a82e686a30d14fb6c69d Author: Luka Govedič <[email protected]> Date: Sun Sep 22 23:40:46 2024 -0400 [Bugfix] Fix CPU CMake build (#8723) Co-authored-by: Yuan <[email protected]> commit 3dda7c22502033854e963fef3826c1f64627e33b Author: Tyler Michael Smith <[email protected]> Date: Sun Sep 22 22:24:59 2024 -0400 [Bugfix] Avoid some bogus messages RE CUTLASS's revision when building (#8702) commit 92ba7e7477619ec81464ccb64a17226f3d5047bb Author: youkaichao <[email protected]> Date: Sun Sep 22 15:41:59 2024 -0700 [misc] upgrade mistral-common (#8715) commit d4a2ac830291305f202a85e157bff3a07b58e616 Author: youkaichao <[email protected]> Date: Sun Sep 22 12:47:54 2024 -0700 [build] enable existing pytorch (for GH200, aarch64, nightly) (#8713) commit c6bd70d7728b50f358cb5cb6e66e02b75aeb3d20 Author: Lily Liu <[email protected]> Date: Sun Sep 22 12:34:14 2024 -0700 [SpecDec][Misc] Cleanup, remove bonus token logic. (#8701) commit 5b59532760c82a9d91f65a3e227524da2af7d4ef Author: litianjian <[email protected]> Date: Mon Sep 23 01:51:44 2024 +0800 [Model][VLM] Add LLaVA-Onevision model support (#8486) Co-authored-by: litianjian <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Roger Wang <[email protected]> Co-authored-by: DarkLight1337 <[email protected]> commit ca2b628b3c25b014b9951731c0331b75262a59e0 Author: Huazhong Ji <[email protected]> Date: Mon Sep 23 01:44:09 2024 +0800 [MISC] rename CudaMemoryProfiler to DeviceMemoryProfiler (#8703) commit 8ca5051b9afb6f8d2b3ae1b71d45d84e5d1c6f57 Author: Alex Brooks <[email protected]> Date: Sun Sep 22 06:56:20 2024 -0600 [Misc] Use NamedTuple in Multi-image example (#8705) Signed-off-by: Alex-Brooks <[email protected]> commit 06ed2815e2be50e527839c7ab09ce2639b7910b6 Author: Cyrus Leung <[email protected]> Date: Sun Sep 22 20:24:21 2024 +0800 [Model] Refactor BLIP/BLIP-2 to support composite model loading (#8407) commit 0e40ac9b7b5d953dfe38933bc7d2fb0a6c8da53c Author: youkaichao <[email protected]> Date: Sat Sep 21 23:24:58 2024 -0700 [ci][build] fix vllm-flash-attn (#8699) commit 13d88d4137f97b8cf3c79f39d7df5e4c8348603a Author: Isotr0py <[email protected]> Date: Sun Sep 22 12:33:27 2024 +0800 [Bugfix] Refactor composite weight loading logic (#8656) commit d66ac62854e04c8fda83506dc93ef7971ebf593a Author: Tyler Michael Smith <[email protected]> Date: Sat Sep 21 19:45:02 2024 -0400 [Kernel][Bugfix] Delete some more useless code in marlin_moe_ops.cu (#8643) commit 9dc7c6c7f332ac6c08311c7a946c6945e0782701 Author: Divakar Verma <[email protected]> Date: Sat Sep 21 16:09:39 2024 -0500 [dbrx] refactor dbrx experts to extend FusedMoe class (#8518) commit ec4aaad8124baadc7954e30c612ca9444b22d7e7 Author: rasmith <[email protected]> Date: Sat Sep 21 04:20:54 2024 -0500 [Kernel][Triton][AMD] Remove tl.atomic_add from awq_gemm_kernel, 2-5x speedup MI300, minor improvement for MI250 (#8646) commit 4dfdf4319676c3dca72cdfba20470ac76d0cadf4 Author: Andy Dai <[email protected]> Date: Sat Sep 21 00:24:12 2024 -0700 [Doc] Fix typo in AMD installation guide (#8689) commit 5e85f4f82a5b6eaad6869198d6ac76a0c12cf6d0 Author: Cyrus Leung <[email protected]> Date: Sat Sep 21 14:28:56 2024 +0800 [VLM] Use `SequenceData.from_token_counts` to create dummy data (#8687) commit 71c60491f287d8a23bed1743513b4b3e7927c69e Author: Luka Govedič <[email protected]> Date: Sat Sep 21 02:27:10 2024 -0400 [Kernel] Build flash-attn from source (#8245) commit 0faab90eb006c677add65cd4c2d0f740a63e064d Author: youkaichao <[email protected]> Date: Fri Sep 20 19:55:33 2024 -0700 [beam search] add output for manually checking the correctness (#8684) commit 0455c46ed434d70f0a6219204e89ee04f1d01336 Author: Cyrus Leung <[email protected]> Date: Sat Sep 21 10:30:39 2024 +0800 [Core] Factor out common code in `SequenceData` and `Sequence` (#8675) commit d4bf085ad064ba68a77862e2022f37c33a66e94a Author: Kunshang Ji <[email protected]> Date: Sat Sep 21 10:03:55 2024 +0800 [MISC] add support custom_op check (#8557) Co-authored-by: youkaichao <[email protected]> commit 0057894ef7f8db0d51385aa7254219d7fbd6c784 Author: Cyrus Leung <[email protected]> Date: Sat Sep 21 10:00:54 2024 +0800 [Core] Rename `PromptInputs` and `inputs`(#8673) commit 0f961b3ce9ac3d3fd13e201c4358884bc094905e Author: zyddnys <[email protected]> Date: Fri Sep 20 18:48:32 2024 -0400 [Bugfix] Fix incorrect llava next feature size calculation (#8496) commit 7f9c8902e3d50a9d715b38e0531280a58d2bbe14 Author: omrishiv <[email protected]> Date: Fri Sep 20 15:19:44 2024 -0700 [Hardware][AWS] update neuron to 2.20 (#8676) Signed-off-by: omrishiv <[email protected]> commit 7c8566aa4ff16b79a576436fbb50f03643febf07 Author: omrishiv <[email protected]> Date: Fri Sep 20 15:04:37 2024 -0700 [Doc] neuron documentation update (#8671) Signed-off-by: omrishiv <[email protected]> commit b4e4eda92e1d3a013fc4007db64b69d8604264ff Author: Patrick von Platen <[email protected]> Date: Fri Sep 20 23:33:03 2024 +0200 [Bugfix][Core] Fix tekken edge case for mistral tokenizer (#8640) commit 2874bac618052a079efd837fc82cf3f3519079c7 Author: Pastel！ <[email protected]> Date: Sat Sep 21 05:00:45 2024 +0800 [Bugfix] Config got an unexpected keyword argument 'engine' (#8556) commit 035fa895ecedea87810889aabbe50ba8a2ad7d5d Author: Cyrus Leung <[email protected]> Date: Sat Sep 21 04:52:19 2024 +0800 [Misc] Show AMD GPU topology in `collect_env.py` (#8649) commit b28298f2f4bd4ec6d1020c10b923a9eb7993dc89 Author: saumya-saran <[email protected]> Date: Fri Sep 20 12:46:02 2024 -0700 [Bugfix] Validate SamplingParam n is an int (#8548) commit 2940afa04e39fa9f248c565687d9a2acf7401355 Author: Alexey Kondratiev(AMD) <[email protected]> Date: Fri Sep 20 13:27:44 2024 -0400 [CI/Build] Removing entrypoints/openai/test_embedding.py test from ROCm build (#8670) commit 3b63de9353ce51ba6c1c167ae8d4b87b8bcf9c9e Author: Niklas Muennighoff <[email protected]> Date: Fri Sep 20 09:31:41 2024 -0700 [Model] Add OLMoE (#7922) commit 260d40b5ea48df9421325388abcc8d907a560fc5 Author: Jiaxin Shan <[email protected]> Date: Thu Sep 19 23:20:56 2024 -0700 [Core] Support Lora lineage and base model metadata management (#6315) commit 9e5ec35b1f8239453b1aaab28e7a02307db4ab1f Author: William Lin <[email protected]> Date: Thu Sep 19 20:49:54 2024 -0700 [bugfix] [AMD] add multi-step advance_step to ROCmFlashAttentionMetadata (#8474) commit 18ae428a0d8792d160d811a9cd5bb004d68ea8bd Author: Amit Garg <[email protected]> Date: Thu Sep 19 17:54:02 2024 -0700 [Bugfix] Fix Phi3.5 mini and MoE LoRA inference (#8571) commit de6f90a13d7b98c4958ba107ec16cb6f95efb10f Author: bnellnm <[email protected]> Date: Thu Sep 19 18:36:30 2024 -0400 [Misc] guard against change in cuda library name (#8609) commit 6cb748e190a94e20987314025614b8bd806602f2 Author: Alexey Kondratiev(AMD) <[email protected]> Date: Thu Sep 19 16:06:32 2024 -0400 [CI/Build] Re-enabling Entrypoints tests on ROCm, excluding ones that fail (#8551) commit 9e99407e3ccbb290bae77af230da38c70a52a055 Author: Simon Mo <[email protected]> Date: Thu Sep 19 12:16:28 2024 -0700 Create SECURITY.md (#8642) commit ea4647b7d77c4738c5ed2ab77a2c9f5ad335f6fb Author: Isotr0py <[email protected]> Date: Fri Sep 20 03:15:55 2024 +0800 [Doc] Add documentation for GGUF quantization (#8618) commit e42c634acbd1b86b5becca51e8b8108a32a438d5 Author: 盏一 <[email protected]> Date: Fri Sep 20 02:28:25 2024 +0800 [Core] simplify logits resort in _apply_top_k_top_p (#8619) commit 9cc373f39036af789fb1ffc1e06b23766996d3f4 Author: Charlie Fu <[email protected]> Date: Thu Sep 19 12:37:57 2024 -0500 [Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention (#8577) commit 76515f303b44cb3ffc6de63c49148d5081a77119 Author: Nick Hill <[email protected]> Date: Thu Sep 19 17:51:06 2024 +0100 [Frontend] Use MQLLMEngine for embeddings models too (#8584) commit 855c8ae2c9a4085b1ebd66d9a978fb23f47f822c Author: Kunshang Ji <[email protected]> Date: Thu Sep 19 13:33:20 2024 +0800 [MISC] remove engine_use_ray in benchmark_throughput.py (#8615) commit c52ec5f03471008fa1312d82fb17d40b95a3ca5d Author: Kuntai Du <[email protected]> Date: Wed Sep 18 22:24:24 2024 -0700 [Bugfix] fixing sonnet benchmark bug in benchmark_serving.py (#8616) commit 02c9afa2d04a85269faa2760e9af30527a61d7f6 Author: Roger Wang <[email protected]> Date: Wed Sep 18 21:14:28 2024 -0700 Revert "[Misc][Bugfix] Disable guided decoding for mistral tokenizer" (#8593) commit 3118f63385c0d767fba8b6d2039fc35440678da9 Author: sroy745 <[email protected]> Date: Wed Sep 18 19:24:15 2024 -0700 [Bugfix] [Encoder-Decoder] Bugfix for encoder specific metadata construction during decode of encoder-decoder models. (#8545) commit 4c34ce8916da0e4967eadefcb7f91eb58dd7ac61 Author: Tyler Michael Smith <[email protected]> Date: Wed Sep 18 21:42:49 2024 -0400 [Kernel] Remove marlin moe templating on thread_m_blocks (#8573) Co-authored-by: [email protected] commit 0d47bf3bf40edfe9fcfd7e5cd909388497535bc5 Author: Joe Runde <[email protected]> Date: Wed Sep 18 16:10:01 2024 -0600 [Bugfix] add `dead_error` property to engine client (#8574) Signed-off-by: Joe Runde <[email protected]> commit d9cd78eb718c233ebc5b84377fc2226af7ef0fa2 Author: Nick Hill <[email protected]> Date: Wed Sep 18 21:17:55 2024 +0100 [BugFix] Nonzero exit code if MQLLMEngine startup fails (#8572) commit db9120cdedba5033037432775417df0b6117495d Author: Tyler Michael Smith <[email protected]> Date: Wed Sep 18 16:05:06 2024 -0400 [Kernel] Change interface to Mamba selective_state_update for continuous batching (#8039) commit b3195bc9e4d57b6107af2222afea26c51475e262 Author: Gregory Shtrasberg <[email protected]> Date: Wed Sep 18 13:41:08 2024 -0400 [AMD][ROCm]Quantization methods on ROCm; Fix _scaled_mm call (#8380) Co-authored-by: Alexei-V-Ivanov-AMD <[email protected]> Co-authored-by: Michael Goin <[email protected]> commit e18749ff09c277f7cdab278895ebdd9b1041b6e8 Author: Geun, Lim <[email protected]> Date: Thu Sep 19 02:04:00 2024 +0900 [Model] Support Solar Model (#8386) Co-authored-by: Michael Goin <[email protected]> commit d65798f78c76f03f068fc2f69a68cff430ee6b6f Author: Russell Bryant <[email protected]> Date: Wed Sep 18 12:10:27 2024 -0400 [Core] zmq: bind only to 127.0.0.1 for local-only usage (#8543) Signed-off-by: Russell Bryant <[email protected]> commit a8c1d161a7d87dbc6c7cccfce303dcbe2e4ed6be Author: afeldman-nm <[email protected]> Date: Wed Sep 18 11:38:43 2024 -0400 [Core] *Prompt* logprobs support in Multi-step (#8199) commit 7c7714d856eee6fa94aade729b67f00584f72a4c Author: Alexander Matveev <[email protected]> Date: Wed Sep 18 09:56:58 2024 -0400 [Core][Bugfix][Perf] Introduce `MQLLMEngine` to avoid `asyncio` OH (#8157) Co-authored-by: Nick Hill <[email protected]> Co-authored-by: [email protected] <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Co-authored-by: Simon Mo <[email protected]> commit 9d104b5beb7bbb51c64b680e007f39169489ea86 Author: Aaron Pham <[email protected]> Date: Wed Sep 18 07:00:56 2024 -0400 [CI/Build] Update Ruff version (#8469) Signed-off-by: Aaron Pham <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> commit 6ffa3f314c59e42238f1c5f923ff2839e0af9698 Author: Cyrus Leung <[email protected]> Date: Wed Sep 18 18:38:11 2024 +0800 [CI/Build] Avoid CUDA initialization (#8534) commit e351572900f7d87e14fe203ea3a49c1c7ddae0d6 Author: Jiaxin Shan <[email protected]> Date: Wed Sep 18 02:51:59 2024 -0700 [Misc] Add argument to disable FastAPI docs (#8554) commit 95965d31b6ac2c9557816a6ffabe4a3117a5ccb2 Author: Daniele <[email protected]> Date: Wed Sep 18 04:49:53 2024 +0200 [CI/Build] fix Dockerfile.cpu on podman (#8540) commit 8110e44529f431d54b02060528601c0d3e3f7d02 Author: Tyler Michael Smith <[email protected]> Date: Tue Sep 17 19:44:27 2024 -0400 [Kernel] Change interface to Mamba causal_conv1d_update for continuous batching (#8012) commit 09deb4721f830602d0417604c7e18b7e384f9594 Author: Alexey Kondratiev(AMD) <[email protected]> Date: Tue Sep 17 19:40:29 2024 -0400 [CI/Build] Excluding kernels/test_gguf.py from ROCm (#8520) commit fa0c114fad4e2b807503e78d5110558cfee92ba4 Author: youkaichao <[email protected]> Date: Tue Sep 17 16:24:06 2024 -0700 [doc] improve installation doc (#8550) Co-authored-by: Andy Dai <[email protected]> commit 98f9713399bd602ff954a83e6e6abcb4cf8b8864 Author: Joe Runde <[email protected]> Date: Tue Sep 17 17:17:08 2024 -0600 [Bugfix] Fix TP > 1 for new granite (#8544) Signed-off-by: Joe Runde <[email protected]> commit 56c3de018c35580fd088655c2f9951cd4da5335d Author: Nick Hill <[email protected]> Date: Tue Sep 17 20:24:29 2024 +0100 [Misc] Don't dump contents of kvcache tensors on errors (#8527) commit a54ed8024953dc6b59906072a7a89cd4791ec4f0 Author: Patrick von Platen <[email protected]> Date: Tue Sep 17 19:50:37 2024 +0200 [Model] Add mistral function calling format to all models loaded with "mistral" format (#8515) Co-authored-by: Cyrus Leung <[email protected]> commit 9855b99502c7537db5ef018129e603650800ac46 Author: chenqianfzh <[email protected]> Date: Tue Sep 17 08:09:12 2024 -0700 [Feature][kernel] tensor parallelism with bitsandbytes quantization (#8434) commit 1009e93c5d634c724eeff3d4e453369337f502d4 Author: sroy745 <[email protected]> Date: Tue Sep 17 07:35:01 2024 -0700 [Encoder decoder] Add cuda graph support during decoding for encoder-decoder models (#7631) commit 1b6de8352b878348974b3f117cbb68ed18daa609 Author: Isotr0py <[email protected]> Date: Tue Sep 17 15:34:27 2024 +0800 [Benchmark] Support sample from HF datasets and image input for benchmark_serving (#8495) commit cbdb25225914a04d94e8830f4e739faca8ff3b9d Author: Rui Qiao <[email protected]> Date: Tue Sep 17 00:06:26 2024 -0700 [Misc] Limit to ray[adag] 2.35 to avoid backward incompatible change…

vllm-project#8378) Co-authored-by: Varun Sundar Rabindranath <[email protected]> Signed-off-by: Alvant <[email protected]>

vllm-project#8378) Co-authored-by: Varun Sundar Rabindranath <[email protected]> Signed-off-by: Amit Garg <[email protected]>

vllm-project#8378) Co-authored-by: Varun Sundar Rabindranath <[email protected]> Signed-off-by: Sumit Dubey <[email protected]>

vllm-project#8378) Co-authored-by: Varun Sundar Rabindranath <[email protected]>

varun-sundar-rabindranath marked this pull request as draft September 11, 2024 18:50

This was referenced Sep 11, 2024

[Core] Chunked Prefill support for Multi Step Scheduling #7814

Closed

[WIP] Multi Step Chunked Prefill - Prefill Steps #8001

Closed

comaniac mentioned this pull request Sep 13, 2024

[Tracking issue] [Help wanted]: Multi-step scheduling follow-ups #7528

Closed

17 tasks

varun-sundar-rabindranath force-pushed the varun/p4-multi-step-chunked-prefill branch from 500a2d3 to f240b10 Compare September 16, 2024 05:01

varun-sundar-rabindranath marked this pull request as ready for review September 16, 2024 05:06

comaniac self-assigned this Sep 16, 2024

varun-sundar-rabindranath changed the title ~~Multi-Step + Chunked Prefill with Prefill Stepping~~ Multi-Step + Chunked Prefill with Single Step Prefills Sep 16, 2024

comaniac reviewed Sep 16, 2024

View reviewed changes

SolitaryThinker reviewed Sep 16, 2024

View reviewed changes

vllm/worker/model_runner.py Outdated Show resolved Hide resolved

SolitaryThinker reviewed Sep 16, 2024

View reviewed changes

vllm/worker/multi_step_model_runner.py Show resolved Hide resolved

varun-sundar-rabindranath commented Sep 17, 2024

View reviewed changes

vllm/engine/llm_engine.py Outdated Show resolved Hide resolved

varun-sundar-rabindranath force-pushed the varun/p4-multi-step-chunked-prefill branch from 0ffa5fd to dd2932e Compare September 18, 2024 03:07

varun-sundar-rabindranath commented Sep 18, 2024

View reviewed changes

vllm/engine/arg_utils.py Outdated Show resolved Hide resolved

comaniac reviewed Sep 25, 2024

View reviewed changes

Varun Sundar Rabindranath added 2 commits September 27, 2024 15:01

Multi-Step Chunked-Prefill Support

e37b9db

review comments

89e790c

varun-sundar-rabindranath force-pushed the varun/p4-multi-step-chunked-prefill branch from f4a2886 to 89e790c Compare September 27, 2024 15:45

alexm-redhat reviewed Sep 27, 2024

View reviewed changes

Varun Sundar Rabindranath added 3 commits September 27, 2024 16:37

Update selected_token_indices directly

d874ce6

make can_append_slots assert stronger

b4650b6

format

7e8f66b

alexm-redhat approved these changes Sep 27, 2024

View reviewed changes

comaniac approved these changes Sep 27, 2024

View reviewed changes

SolitaryThinker approved these changes Sep 27, 2024

View reviewed changes

comaniac merged commit c2ec430 into vllm-project:main Sep 27, 2024
76 checks passed

njhill mentioned this pull request Sep 28, 2024

[Core] Combined support for multi-step scheduling, chunked prefill & prefix caching #8804

Merged

varun-sundar-rabindranath mentioned this pull request Sep 30, 2024

[Bugfix] Revert incorrect updates to num_computed_tokens #8950

Closed

MengqingCao pushed a commit to MengqingCao/vllm that referenced this pull request Sep 30, 2024

[Core] Multi-Step + Single Step Prefills via Chunked Prefill code path (

9c39d47

vllm-project#8378) Co-authored-by: Varun Sundar Rabindranath <[email protected]>

siddharth9820 pushed a commit to axonn-ai/vllm that referenced this pull request Sep 30, 2024

[Core] Multi-Step + Single Step Prefills via Chunked Prefill code path (

3ad6a87

vllm-project#8378) Co-authored-by: Varun Sundar Rabindranath <[email protected]>

liuyanyi pushed a commit to liuyanyi/vllm that referenced this pull request Oct 6, 2024

[Core] Multi-Step + Single Step Prefills via Chunked Prefill code path (

991f76b

vllm-project#8378) Co-authored-by: Varun Sundar Rabindranath <[email protected]>

tjtanaa mentioned this pull request Oct 6, 2024

[Bug]: AMD MultiStep Feature Issue. Missing argument: 'turn_prefills_into_decodes' in advance_step() #9111

Closed

1 task

allenwang28 mentioned this pull request Oct 9, 2024

[Bugfix] Sets is_first_step_output for TPUModelRunner #9202

Merged

KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024

[Core] Multi-Step + Single Step Prefills via Chunked Prefill code path (

599ab99

vllm-project#8378) Co-authored-by: Varun Sundar Rabindranath <[email protected]>

[Core] Multi-Step + Single Step Prefills via Chunked Prefill code path #8378

[Core] Multi-Step + Single Step Prefills via Chunked Prefill code path #8378

Conversation

varun-sundar-rabindranath commented Sep 11, 2024 • edited Loading

Idea:

Sample logging output:

Implementation Details:

Scheduling Logic:

Benchmarks:

Benchmark Serving

Next Steps (future PRs) :

PR Title and Classification

Code Quality

Notes for Large Changes

What to Expect for the Reviews

Thank You

github-actions bot commented Sep 11, 2024

comaniac left a comment

Choose a reason for hiding this comment

sam-h-bean commented Sep 17, 2024

varun-sundar-rabindranath commented Sep 17, 2024

sam-h-bean commented Sep 18, 2024

varun-sundar-rabindranath commented Sep 18, 2024 • edited Loading

sam-h-bean commented Sep 18, 2024

sam-h-bean commented Sep 18, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sam-h-bean commented Sep 18, 2024 • edited Loading

varun-sundar-rabindranath commented Sep 18, 2024

sam-h-bean commented Sep 18, 2024

varun-sundar-rabindranath commented Sep 18, 2024

sam-h-bean commented Sep 18, 2024

sam-h-bean commented Sep 19, 2024

varun-sundar-rabindranath commented Sep 19, 2024

sam-h-bean commented Sep 19, 2024

comaniac left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alexm-redhat left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alexm-redhat left a comment

Choose a reason for hiding this comment

comaniac left a comment

Choose a reason for hiding this comment

SolitaryThinker left a comment

Choose a reason for hiding this comment

LiuXiaoxuanPKU commented Sep 29, 2024

varun-sundar-rabindranath commented Sep 29, 2024 • edited Loading

comaniac commented Sep 29, 2024

varun-sundar-rabindranath commented Sep 30, 2024 • edited Loading

varun-sundar-rabindranath commented Sep 11, 2024 •

edited

Loading

varun-sundar-rabindranath commented Sep 18, 2024 •

edited

Loading

sam-h-bean commented Sep 18, 2024 •

edited

Loading

varun-sundar-rabindranath commented Sep 29, 2024 •

edited

Loading

varun-sundar-rabindranath commented Sep 30, 2024 •

edited

Loading