
[CPU] Add head sizes 80 and 112 with vec16 fallback #31968

Merged
bigPYJ1151 merged 1 commit into vllm-project:main from R3hankhan123:cpu_attn on Jan 9, 2026
Conversation

@R3hankhan123 (Contributor) commented Jan 8, 2026

Purpose

Reintroduce support for head dimensions 80 and 112 in the CPU attention backend. These were removed in #27954, but they are commonly used by Granite models deployed on IBM Z architectures. Since these head sizes are not friendly to the Intel AMX instruction set, the implementation now falls back to the vec16 path.
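A minimal Python sketch of the fallback rule, assuming the selection criterion is "the head size is 16-aligned but not 32-aligned" (the function name and signature here are illustrative, not vLLM's actual API):

```python
# Illustrative sketch only: the name and signature are hypothetical,
# not the actual vLLM CPU attention backend API.
def select_attn_isa(head_size: int, amx_available: bool = True) -> str:
    """Pick a CPU attention ISA path for a given head dimension."""
    if head_size % 32 != 0 and head_size % 16 == 0:
        # 80 and 112 are 16-aligned but not 32-aligned, so they are
        # not AMX-friendly and fall back to the 16-wide vector path.
        return "vec16"
    return "amx" if amx_available else "vec"
```

For example, 80 % 32 == 16 while 80 % 16 == 0, so head size 80 takes the vec16 path, whereas 128 stays on the AMX path when AMX is available.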

Test Plan

Build a Docker image and test with the ibm-granite/granite-3b-code-base-2k model, which has a head size of 80.

Test Result

Server Logs

 docker run --rm -it -p 8000:8000   quay.io/r3hankhan/vllm:torch2.9.1-v5  ibm-granite/granite-3b-code-base-2k --dtype=bfloat16
INFO 01-08 12:17:54 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
(APIServer pid=1) INFO 01-08 12:17:59 [api_server.py:1277] vLLM API server version 0.1.dev12640+g03bb0ff93.d20260108
(APIServer pid=1) INFO 01-08 12:17:59 [utils.py:253] non-default args: {'model_tag': 'ibm-granite/granite-3b-code-base-2k', 'model': 'ibm-granite/granite-3b-code-base-2k', 'dtype': 'bfloat16'}
config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 680/680 [00:00<00:00, 3.70MB/s]
(APIServer pid=1) INFO 01-08 12:18:10 [model.py:522] Resolved architecture: LlamaForCausalLM
(APIServer pid=1) INFO 01-08 12:18:10 [model.py:1510] Using max model len 2048
(APIServer pid=1) INFO 01-08 12:18:10 [arg_utils.py:1952] Chunked prefill is not supported for ARM and POWER, S390X and RISC-V CPUs; disabling it for V1 backend.
(APIServer pid=1) INFO 01-08 12:18:10 [arg_utils.py:1958] Prefix caching is not supported for ARM and POWER, S390X and RISC-V CPUs; disabling it for V1 backend.
(APIServer pid=1) WARNING 01-08 12:18:10 [cpu.py:157] VLLM_CPU_KVCACHE_SPACE not set. Using 171.88 GiB for KV cache.
(APIServer pid=1) INFO 01-08 12:18:10 [vllm.py:635] Disabling NCCL for DP synchronization when using async scheduling.
(APIServer pid=1) INFO 01-08 12:18:10 [vllm.py:640] Asynchronous scheduling is enabled.
tokenizer_config.json: 4.13kB [00:00, 26.0MB/s]
tokenizer.json: 2.06MB [00:00, 45.6MB/s]
special_tokens_map.json: 1.02kB [00:00, 9.49MB/s]
generation_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████| 137/137 [00:00<00:00, 2.09MB/s]
INFO 01-08 12:18:19 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
(EngineCore_DP0 pid=41) INFO 01-08 12:18:23 [core.py:96] Initializing a V1 LLM engine (v0.1.dev12640+g03bb0ff93.d20260108) with config: model='ibm-granite/granite-3b-code-base-2k', speculative_config=None, tokenizer='ibm-granite/granite-3b-code-base-2k', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cpu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False), seed=0, served_model_name=ibm-granite/granite-3b-code-base-2k, enable_prefix_caching=False, enable_chunked_prefill=False, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.DYNAMO_TRACE_ONCE: 2>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': None, 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'dce': True, 'size_asserts': False, 'nan_asserts': False, 'epilogue_fusion': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 
'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': None, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False}, 'local_cache_dir': None}
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:180] auto thread-binding list (id, physical core): [(0, 0), (1, 0), (2, 1), (3, 1), (8, 4), (9, 4), (10, 5), (11, 5), (16, 8), (17, 8), (18, 9), (19, 9), (24, 12), (25, 12), (26, 13), (27, 13), (32, 16), (33, 16), (34, 17), (35, 17), (40, 20), (41, 20), (42, 21), (43, 21), (48, 24), (49, 24)]
get_mempolicy: Operation not permitted
[W108 12:18:24.534092355 utils.cpp:76] Warning: numa_migrate_pages failed. errno: 1 (function init_cpu_threads_env)
set_mempolicy: Operation not permitted
[W108 12:18:24.534110399 utils.cpp:100] Warning: numa_set_membind failed. errno: 1 (function init_cpu_threads_env)
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] OMP threads binding of Process 41:
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 41, core 0
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 52, core 1
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 53, core 2
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 54, core 3
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 55, core 8
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 56, core 9
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 57, core 10
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 58, core 11
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 59, core 16
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 60, core 17
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 61, core 18
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 62, core 19
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 63, core 24
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 64, core 25
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 65, core 26
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 66, core 27
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 67, core 32
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 68, core 33
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 69, core 34
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 70, core 35
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 71, core 40
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 72, core 41
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 73, core 42
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 74, core 43
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 75, core 48
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 76, core 49
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [parallel_state.py:1214] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.17.0.2:58689 backend=gloo
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [parallel_state.py:1425] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_model_runner.py:55] Starting to load model ibm-granite/granite-3b-code-base-2k...
model.safetensors.index.json: 41.6kB [00:00, 244MB/s]
model-00002-of-00002.safetensors: 100%|███████████████████████████████████████████████████████████████████| 1.99G/1.99G [01:26<00:00, 22.9MB/s]
model-00001-of-00002.safetensors: 100%|███████████████████████████████████████████████████████████████████| 4.97G/4.97G [02:09<00:00, 38.5MB/s]
(EngineCore_DP0 pid=41) INFO 01-08 12:20:35 [weight_utils.py:510] Time spent downloading weights for ibm-granite/granite-3b-code-base-2k: 129.624123 seconds
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:01<00:01,  1.25s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:04<00:00,  2.36s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:04<00:00,  2.19s/it]
(EngineCore_DP0 pid=41) 
(EngineCore_DP0 pid=41) INFO 01-08 12:20:39 [default_loader.py:308] Loading weights took 4.39 seconds
(EngineCore_DP0 pid=41) INFO 01-08 12:20:39 [kv_cache_utils.py:1305] GPU KV cache size: 563,200 tokens
(EngineCore_DP0 pid=41) INFO 01-08 12:20:39 [kv_cache_utils.py:1310] Maximum concurrency for 2,048 tokens per request: 275.00x
(EngineCore_DP0 pid=41) INFO 01-08 12:20:42 [cpu_model_runner.py:65] Warming up model for the compilation...
(EngineCore_DP0 pid=41) INFO 01-08 12:21:35 [cpu_model_runner.py:75] Warming up done.
(EngineCore_DP0 pid=41) INFO 01-08 12:21:35 [core.py:273] init engine (profile, create kv cache, warmup model) took 56.12 seconds
(EngineCore_DP0 pid=41) INFO 01-08 12:21:36 [vllm.py:640] Asynchronous scheduling is disabled.
(EngineCore_DP0 pid=41) WARNING 01-08 12:21:36 [vllm.py:671] Inductor compilation was disabled by user settings,Optimizations settings that are only active duringInductor compilation will be ignored.
(EngineCore_DP0 pid=41) WARNING 01-08 12:21:36 [cpu.py:157] VLLM_CPU_KVCACHE_SPACE not set. Using 171.88 GiB for KV cache.
(APIServer pid=1) INFO 01-08 12:21:37 [api_server.py:1020] Supported tasks: ['generate']
(APIServer pid=1) INFO 01-08 12:21:37 [serving_chat.py:178] Warming up chat template processing...
(APIServer pid=1) INFO 01-08 12:21:38 [chat_utils.py:599] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1) ERROR 01-08 12:21:38 [serving_chat.py:218] Chat template warmup failed
(APIServer pid=1) ERROR 01-08 12:21:38 [serving_chat.py:218] Traceback (most recent call last):
(APIServer pid=1) ERROR 01-08 12:21:38 [serving_chat.py:218]   File "/opt/vllm/lib64/python3.12/site-packages/vllm/entrypoints/openai/serving_chat.py", line 197, in warmup
(APIServer pid=1) ERROR 01-08 12:21:38 [serving_chat.py:218]     await self._preprocess_chat(
(APIServer pid=1) ERROR 01-08 12:21:38 [serving_chat.py:218]   File "/opt/vllm/lib64/python3.12/site-packages/vllm/entrypoints/openai/serving_engine.py", line 1209, in _preprocess_chat
(APIServer pid=1) ERROR 01-08 12:21:38 [serving_chat.py:218]     request_prompt = apply_hf_chat_template(
(APIServer pid=1) ERROR 01-08 12:21:38 [serving_chat.py:218]                      ^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 01-08 12:21:38 [serving_chat.py:218]   File "/opt/vllm/lib64/python3.12/site-packages/vllm/entrypoints/chat_utils.py", line 1826, in apply_hf_chat_template
(APIServer pid=1) ERROR 01-08 12:21:38 [serving_chat.py:218]     raise ChatTemplateResolutionError(
(APIServer pid=1) ERROR 01-08 12:21:38 [serving_chat.py:218] vllm.entrypoints.chat_utils.ChatTemplateResolutionError: As of transformers v4.44, default chat template is no longer allowed, so you must provide a chat template if the tokenizer does not define one.
(APIServer pid=1) /opt/vllm/lib64/python3.12/site-packages/vllm/entrypoints/openai/serving_chat.py:218: RuntimeWarning: coroutine 'AsyncMultiModalItemTracker.all_mm_data' was never awaited
(APIServer pid=1)   logger.exception("Chat template warmup failed")
(APIServer pid=1) RuntimeWarning: Enable tracemalloc to get the object allocation traceback
(APIServer pid=1) INFO 01-08 12:21:38 [api_server.py:1351] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:38] Available routes are:
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /pause, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /resume, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /is_paused, Methods: GET
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=1) INFO:     Started server process [1]
(APIServer pid=1) INFO:     Waiting for application startup.
(APIServer pid=1) INFO:     Application startup complete.
(APIServer pid=1) INFO 01-08 12:24:09 [loggers.py:257] Engine 000: Avg prompt throughput: 1.0 tokens/s, Avg generation throughput: 2.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 01-08 12:24:19 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 3.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 01-08 12:24:29 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 3.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 01-08 12:24:39 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 3.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 01-08 12:24:49 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 3.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 01-08 12:24:59 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 3.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO:     172.17.0.1:39584 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 01-08 12:25:09 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 01-08 12:25:19 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
^C(APIServer pid=1) INFO 01-08 12:38:56 [launcher.py:110] Shutting down FastAPI HTTP server.

curl request:

[root@b314lp81 ~]# curl http://localhost:8000/v1/completions   -H "Content-Type: application/json"   -d '{
    "model": "ibm-granite/granite-3b-code-base-2k",
    "prompt": "Write a C function to reverse a linked list.",
    "max_tokens": 200
  }' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1353  100  1212  100   141     21      2  0:01:10  0:00:55  0:00:15   323
{
  "id": "cmpl-a2cc400bd3a91cec",
  "object": "text_completion",
  "created": 1767876545,
  "model": "ibm-granite/granite-3b-code-base-2k",
  "choices": [
    {
      "index": 0,
      "text": "\n\nYou have been given a list of pointers to objects. Each object has a value (vint)\nand a pointer to the next object. The objects are sorted ascending by their values.\n\nThe two intermediary objects (a head and a tail pointer are assumed to have\nbeen provided). Return the head of the modified list.\n\n```c\n    ListNode *reverseList(ListNode *head, ListNode *tail, item_t *data)\n```\n\n\n01_reverse_linked_list.c\n\n\n\n# 05\n## Implement an algorithm to determine if a string has all unique characters. \n\nCan be done in O(n) time.\n\n```c\n    isUniqueString(const char *string)\n```\n\n02_string_unival.c\n\n\n\n\n\n\n# 06\n## Implement circular_arr to get a pointer to the start of the circular array, and move \nthe pointer to the next element",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "token_ids": null,
      "prompt_logprobs": null,
      "prompt_token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 10,
    "total_tokens": 210,
    "completion_tokens": 200,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null
}

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Note

Expands CPU attention compatibility and adjusts ISA dispatching.

  • Adds 80 and 112 to head-dim dispatch and CPUAttentionBackend.get_supported_head_sizes
  • Passes head_dim into _get_attn_isa(dtype, block_size, head_size); returns "vec16" when head_size % 32 != 0 and head_size % 16 == 0
  • NEON/AMX: relax compile-time asserts (commented out) and note NEON prefers head dims multiple of 32
  • Core execution paths unchanged (AMX/NEON/VEC/VEC16) aside from new head sizes and updated ISA hint

Written by Cursor Bugbot for commit 9b9584dd9b67c8ad9608c23897becb04d0b57c51.


@mergify bot added the cpu (Related to CPU backends) and v1 labels on Jan 8, 2026
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request reintroduces support for head dimensions 80 and 112 in the CPU attention backend, with a fallback to vec16 since these sizes are not optimal for AMX instructions. The changes look good overall, correctly implementing the fallback logic in the C++ backend.

My review includes two main points:

  1. There is redundant logic in the Python code for determining the ISA, which is already handled robustly in the C++ backend. I've suggested removing the Python-side logic to maintain a single source of truth.
  2. The automated tests have not been updated to include the new head sizes (80 and 112). I've recommended adding them to the test suite to ensure the new functionality is properly verified.

Addressing these points will improve the maintainability and robustness of the code.

Comment on lines 44 to +45

 def get_supported_head_sizes(cls) -> list[int]:
-    return [32, 64, 96, 128, 160, 192, 224, 256]
+    return [32, 64, 80, 96, 112, 128, 160, 192, 224, 256]
Severity: high

While support for head sizes 80 and 112 has been added, the automated tests in tests/kernels/attention/test_cpu_attn.py have not been updated to include these new sizes. Please update the HEAD_SIZES list in the test file to ensure the new functionality is covered.
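As a sketch of what that coverage could look like, the fallback cases can be derived from the extended list; `HEAD_SIZES` and `requires_vec16_fallback` here are illustrative Python stand-ins (the real list lives in tests/kernels/attention/test_cpu_attn.py, and the real helper is C++ in csrc/cpu/cpu_attn.cpp):

```python
# Illustrative stand-ins: the real HEAD_SIZES list lives in
# tests/kernels/attention/test_cpu_attn.py, and the real
# requires_vec16_fallback helper is C++ in csrc/cpu/cpu_attn.cpp.
HEAD_SIZES = [32, 64, 80, 96, 112, 128, 160, 192, 224, 256]

def requires_vec16_fallback(head_size: int) -> bool:
    # True for sizes that are 16-aligned but not 32-aligned.
    return head_size % 32 != 0 and head_size % 16 == 0

# The two newly reintroduced sizes are exactly the fallback cases.
fallback_sizes = [h for h in HEAD_SIZES if requires_vec16_fallback(h)]
```

Under this rule, `fallback_sizes` works out to `[80, 112]`, so a test parametrized over `HEAD_SIZES` would exercise both new sizes alongside the existing AMX-friendly ones.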

Comment on lines +487 to +491

def _get_attn_isa(
    dtype: torch.dtype, block_size: int, head_size: int | None = None
) -> str:
    if head_size in (80, 112):
        return "vec16"
Severity: high

The logic to force vec16 for specific head sizes is duplicated here and in the C++ backend. The C++ code in csrc/cpu/cpu_attn.cpp already handles this fallback in get_scheduler_metadata, cpu_attn_reshape_and_cache, and cpu_attention_with_kv_cache using the requires_vec16_fallback helper.

To maintain a single source of truth and avoid redundancy, this logic should only reside in the C++ backend. Please remove the head_size check from this function and revert its signature.

Also, the call to this function in CPUAttentionMetadataBuilder.__init__ (line 140) should be reverted to self.isa = _get_attn_isa(self.dtype, self.block_size).

def _get_attn_isa(dtype: torch.dtype, block_size: int) -> str:

@@ -15,6 +15,7 @@

 #ifdef __aarch64__
 #include "cpu_attn_neon.hpp"
+// NEON requires head_dim to be a multiple of 32
Member:

Ok... now we have created too many redundant template instantiations; the dispatch procedure needs to be reorganized.

@R3hankhan123 (Contributor, Author) replied Jan 9, 2026:

Maybe a future PR?

Member:

Yes

@bigPYJ1151 added the ready (ONLY add when PR is ready to merge/full CI is needed) label on Jan 9, 2026
@R3hankhan123 (Contributor, Author) commented Jan 9, 2026

@bigPYJ1151 looks like static assertion is causing this failure, is it ok to remove this assertion for now?

[2026-01-09T10:32:15Z] #16 206.4 /workspace/vllm/csrc/cpu/cpu_attn_amx.hpp:380:55: error: static assertion failed

@bigPYJ1151 (Member) replied:

@bigPYJ1151 looks like static assertion is causing this failure, is it ok to remove this assertion for now?

[2026-01-09T10:32:15Z] #16 206.4 /workspace/vllm/csrc/cpu/cpu_attn_amx.hpp:380:55: error: static assertion failed

Yes, there are some static checks related to head size in the AMX and Vec implementations that need to be disabled temporarily.

R3hankhan123 force-pushed the cpu_attn branch 2 times, most recently from 3d38760 to 9b9584d, on January 9, 2026 at 10:44
Reintroduce support for head dimensions 80 and 112 in the CPU attention backend.
These sizes bypass the Intel AMX, NEON, and wide-vector (vec) optimizations,
forcing the vec16 path.

Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
@bigPYJ1151 bigPYJ1151 merged commit 8e27663 into vllm-project:main Jan 9, 2026
58 checks passed
@R3hankhan123 R3hankhan123 deleted the cpu_attn branch January 9, 2026 14:20
@fadara01 (Contributor) commented:

@bigPYJ1151 @R3hankhan123 could you please create issues for the items that need to be addressed in future PRs, so that we do not forget about them?

Also, why were no tests added for the newly supported head dimensions?

akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Jan 16, 2026
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026

Labels

cpu (Related to CPU backends), ready (ONLY add when PR is ready to merge/full CI is needed), v1


3 participants