
[CPU] Add head sizes 80 and 112 with vec16 fallback #31968

Merged
bigPYJ1151 merged 1 commit into vllm-project:main from R3hankhan123:cpu_attn on Jan 9, 2026
Conversation

@R3hankhan123 (Contributor) commented Jan 8, 2026

Purpose

Reintroduce support for head dimensions 80 and 112 in the CPU attention backend. These were removed in #27954, but they are commonly used by Granite models deployed on IBM Z architectures. Since these head sizes are not friendly to the Intel AMX instruction set, the implementation now falls back to the vec16 path.
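A minimal Python sketch of the fallback rule, assuming the selection criterion is "the head size is 16-aligned but not 32-aligned" (the function name and signature here are illustrative, not vLLM's actual API):

```python
# Illustrative sketch only: the name and signature are hypothetical,
# not the actual vLLM CPU attention backend API.
def select_attn_isa(head_size: int, amx_available: bool = True) -> str:
    """Pick a CPU attention ISA path for a given head dimension."""
    if head_size % 32 != 0 and head_size % 16 == 0:
        # 80 and 112 are 16-aligned but not 32-aligned, so they are
        # not AMX-friendly and fall back to the 16-wide vector path.
        return "vec16"
    return "amx" if amx_available else "vec"
```

For example, 80 % 32 == 16 while 80 % 16 == 0, so head size 80 takes the vec16 path, whereas 128 stays on the AMX path when AMX is available.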

Test Plan

Build a Docker image and test with the ibm-granite/granite-3b-code-base-2k model, which has a head size of 80.

Test Result

Server Logs

 docker run --rm -it -p 8000:8000   quay.io/r3hankhan/vllm:torch2.9.1-v5  ibm-granite/granite-3b-code-base-2k --dtype=bfloat16
INFO 01-08 12:17:54 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
(APIServer pid=1) INFO 01-08 12:17:59 [api_server.py:1277] vLLM API server version 0.1.dev12640+g03bb0ff93.d20260108
(APIServer pid=1) INFO 01-08 12:17:59 [utils.py:253] non-default args: {'model_tag': 'ibm-granite/granite-3b-code-base-2k', 'model': 'ibm-granite/granite-3b-code-base-2k', 'dtype': 'bfloat16'}
config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 680/680 [00:00<00:00, 3.70MB/s]
(APIServer pid=1) INFO 01-08 12:18:10 [model.py:522] Resolved architecture: LlamaForCausalLM
(APIServer pid=1) INFO 01-08 12:18:10 [model.py:1510] Using max model len 2048
(APIServer pid=1) INFO 01-08 12:18:10 [arg_utils.py:1952] Chunked prefill is not supported for ARM and POWER, S390X and RISC-V CPUs; disabling it for V1 backend.
(APIServer pid=1) INFO 01-08 12:18:10 [arg_utils.py:1958] Prefix caching is not supported for ARM and POWER, S390X and RISC-V CPUs; disabling it for V1 backend.
(APIServer pid=1) WARNING 01-08 12:18:10 [cpu.py:157] VLLM_CPU_KVCACHE_SPACE not set. Using 171.88 GiB for KV cache.
(APIServer pid=1) INFO 01-08 12:18:10 [vllm.py:635] Disabling NCCL for DP synchronization when using async scheduling.
(APIServer pid=1) INFO 01-08 12:18:10 [vllm.py:640] Asynchronous scheduling is enabled.
tokenizer_config.json: 4.13kB [00:00, 26.0MB/s]
tokenizer.json: 2.06MB [00:00, 45.6MB/s]
special_tokens_map.json: 1.02kB [00:00, 9.49MB/s]
generation_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████| 137/137 [00:00<00:00, 2.09MB/s]
INFO 01-08 12:18:19 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
(EngineCore_DP0 pid=41) INFO 01-08 12:18:23 [core.py:96] Initializing a V1 LLM engine (v0.1.dev12640+g03bb0ff93.d20260108) with config: model='ibm-granite/granite-3b-code-base-2k', speculative_config=None, tokenizer='ibm-granite/granite-3b-code-base-2k', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cpu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False), seed=0, served_model_name=ibm-granite/granite-3b-code-base-2k, enable_prefix_caching=False, enable_chunked_prefill=False, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.DYNAMO_TRACE_ONCE: 2>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': None, 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'dce': True, 'size_asserts': False, 'nan_asserts': False, 'epilogue_fusion': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 
'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': None, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False}, 'local_cache_dir': None}
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:180] auto thread-binding list (id, physical core): [(0, 0), (1, 0), (2, 1), (3, 1), (8, 4), (9, 4), (10, 5), (11, 5), (16, 8), (17, 8), (18, 9), (19, 9), (24, 12), (25, 12), (26, 13), (27, 13), (32, 16), (33, 16), (34, 17), (35, 17), (40, 20), (41, 20), (42, 21), (43, 21), (48, 24), (49, 24)]
get_mempolicy: Operation not permitted
[W108 12:18:24.534092355 utils.cpp:76] Warning: numa_migrate_pages failed. errno: 1 (function init_cpu_threads_env)
set_mempolicy: Operation not permitted
[W108 12:18:24.534110399 utils.cpp:100] Warning: numa_set_membind failed. errno: 1 (function init_cpu_threads_env)
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] OMP threads binding of Process 41:
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 41, core 0
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 52, core 1
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 53, core 2
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 54, core 3
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 55, core 8
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 56, core 9
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 57, core 10
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 58, core 11
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 59, core 16
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 60, core 17
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 61, core 18
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 62, core 19
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 63, core 24
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 64, core 25
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 65, core 26
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 66, core 27
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 67, core 32
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 68, core 33
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 69, core 34
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 70, core 35
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 71, core 40
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 72, core 41
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 73, core 42
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 74, core 43
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 75, core 48
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 	OMP tid: 76, core 49
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_worker.py:86] 
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [parallel_state.py:1214] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.17.0.2:58689 backend=gloo
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [parallel_state.py:1425] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
(EngineCore_DP0 pid=41) INFO 01-08 12:18:24 [cpu_model_runner.py:55] Starting to load model ibm-granite/granite-3b-code-base-2k...
model.safetensors.index.json: 41.6kB [00:00, 244MB/s]
model-00002-of-00002.safetensors: 100%|███████████████████████████████████████████████████████████████████| 1.99G/1.99G [01:26<00:00, 22.9MB/s]
model-00001-of-00002.safetensors: 100%|███████████████████████████████████████████████████████████████████| 4.97G/4.97G [02:09<00:00, 38.5MB/s]
(EngineCore_DP0 pid=41) INFO 01-08 12:20:35 [weight_utils.py:510] Time spent downloading weights for ibm-granite/granite-3b-code-base-2k: 129.624123 seconds
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:01<00:01,  1.25s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:04<00:00,  2.36s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:04<00:00,  2.19s/it]
(EngineCore_DP0 pid=41) 
(EngineCore_DP0 pid=41) INFO 01-08 12:20:39 [default_loader.py:308] Loading weights took 4.39 seconds
(EngineCore_DP0 pid=41) INFO 01-08 12:20:39 [kv_cache_utils.py:1305] GPU KV cache size: 563,200 tokens
(EngineCore_DP0 pid=41) INFO 01-08 12:20:39 [kv_cache_utils.py:1310] Maximum concurrency for 2,048 tokens per request: 275.00x
(EngineCore_DP0 pid=41) INFO 01-08 12:20:42 [cpu_model_runner.py:65] Warming up model for the compilation...
(EngineCore_DP0 pid=41) INFO 01-08 12:21:35 [cpu_model_runner.py:75] Warming up done.
(EngineCore_DP0 pid=41) INFO 01-08 12:21:35 [core.py:273] init engine (profile, create kv cache, warmup model) took 56.12 seconds
(EngineCore_DP0 pid=41) INFO 01-08 12:21:36 [vllm.py:640] Asynchronous scheduling is disabled.
(EngineCore_DP0 pid=41) WARNING 01-08 12:21:36 [vllm.py:671] Inductor compilation was disabled by user settings,Optimizations settings that are only active duringInductor compilation will be ignored.
(EngineCore_DP0 pid=41) WARNING 01-08 12:21:36 [cpu.py:157] VLLM_CPU_KVCACHE_SPACE not set. Using 171.88 GiB for KV cache.
(APIServer pid=1) INFO 01-08 12:21:37 [api_server.py:1020] Supported tasks: ['generate']
(APIServer pid=1) INFO 01-08 12:21:37 [serving_chat.py:178] Warming up chat template processing...
(APIServer pid=1) INFO 01-08 12:21:38 [chat_utils.py:599] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1) ERROR 01-08 12:21:38 [serving_chat.py:218] Chat template warmup failed
(APIServer pid=1) ERROR 01-08 12:21:38 [serving_chat.py:218] Traceback (most recent call last):
(APIServer pid=1) ERROR 01-08 12:21:38 [serving_chat.py:218]   File "/opt/vllm/lib64/python3.12/site-packages/vllm/entrypoints/openai/serving_chat.py", line 197, in warmup
(APIServer pid=1) ERROR 01-08 12:21:38 [serving_chat.py:218]     await self._preprocess_chat(
(APIServer pid=1) ERROR 01-08 12:21:38 [serving_chat.py:218]   File "/opt/vllm/lib64/python3.12/site-packages/vllm/entrypoints/openai/serving_engine.py", line 1209, in _preprocess_chat
(APIServer pid=1) ERROR 01-08 12:21:38 [serving_chat.py:218]     request_prompt = apply_hf_chat_template(
(APIServer pid=1) ERROR 01-08 12:21:38 [serving_chat.py:218]                      ^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 01-08 12:21:38 [serving_chat.py:218]   File "/opt/vllm/lib64/python3.12/site-packages/vllm/entrypoints/chat_utils.py", line 1826, in apply_hf_chat_template
(APIServer pid=1) ERROR 01-08 12:21:38 [serving_chat.py:218]     raise ChatTemplateResolutionError(
(APIServer pid=1) ERROR 01-08 12:21:38 [serving_chat.py:218] vllm.entrypoints.chat_utils.ChatTemplateResolutionError: As of transformers v4.44, default chat template is no longer allowed, so you must provide a chat template if the tokenizer does not define one.
(APIServer pid=1) /opt/vllm/lib64/python3.12/site-packages/vllm/entrypoints/openai/serving_chat.py:218: RuntimeWarning: coroutine 'AsyncMultiModalItemTracker.all_mm_data' was never awaited
(APIServer pid=1)   logger.exception("Chat template warmup failed")
(APIServer pid=1) RuntimeWarning: Enable tracemalloc to get the object allocation traceback
(APIServer pid=1) INFO 01-08 12:21:38 [api_server.py:1351] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:38] Available routes are:
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /pause, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /resume, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /is_paused, Methods: GET
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=1) INFO 01-08 12:21:38 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=1) INFO:     Started server process [1]
(APIServer pid=1) INFO:     Waiting for application startup.
(APIServer pid=1) INFO:     Application startup complete.
(APIServer pid=1) INFO 01-08 12:24:09 [loggers.py:257] Engine 000: Avg prompt throughput: 1.0 tokens/s, Avg generation throughput: 2.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 01-08 12:24:19 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 3.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 01-08 12:24:29 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 3.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 01-08 12:24:39 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 3.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 01-08 12:24:49 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 3.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 01-08 12:24:59 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 3.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO:     172.17.0.1:39584 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 01-08 12:25:09 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 01-08 12:25:19 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
^C(APIServer pid=1) INFO 01-08 12:38:56 [launcher.py:110] Shutting down FastAPI HTTP server.

curl request:

[root@b314lp81 ~]# curl http://localhost:8000/v1/completions   -H "Content-Type: application/json"   -d '{
    "model": "ibm-granite/granite-3b-code-base-2k",
    "prompt": "Write a C function to reverse a linked list.",
    "max_tokens": 200
  }' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1353  100  1212  100   141     21      2  0:01:10  0:00:55  0:00:15   323
{
  "id": "cmpl-a2cc400bd3a91cec",
  "object": "text_completion",
  "created": 1767876545,
  "model": "ibm-granite/granite-3b-code-base-2k",
  "choices": [
    {
      "index": 0,
      "text": "\n\nYou have been given a list of pointers to objects. Each object has a value (vint)\nand a pointer to the next object. The objects are sorted ascending by their values.\n\nThe two intermediary objects (a head and a tail pointer are assumed to have\nbeen provided). Return the head of the modified list.\n\n```c\n    ListNode *reverseList(ListNode *head, ListNode *tail, item_t *data)\n```\n\n\n01_reverse_linked_list.c\n\n\n\n# 05\n## Implement an algorithm to determine if a string has all unique characters. \n\nCan be done in O(n) time.\n\n```c\n    isUniqueString(const char *string)\n```\n\n02_string_unival.c\n\n\n\n\n\n\n# 06\n## Implement circular_arr to get a pointer to the start of the circular array, and move \nthe pointer to the next element",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "token_ids": null,
      "prompt_logprobs": null,
      "prompt_token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 10,
    "total_tokens": 210,
    "completion_tokens": 200,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null
}

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Note

Expands CPU attention compatibility and adjusts ISA dispatching.

  • Adds 80 and 112 to head-dim dispatch and CPUAttentionBackend.get_supported_head_sizes
  • Passes head_dim into _get_attn_isa(dtype, block_size, head_size); returns "vec16" when head_size % 32 != 0 and head_size % 16 == 0
  • NEON/AMX: relax compile-time asserts (commented out) and note NEON prefers head dims multiple of 32
  • Core execution paths unchanged (AMX/NEON/VEC/VEC16) aside from new head sizes and updated ISA hint

Written by Cursor Bugbot for commit 9b9584dd9b67c8ad9608c23897becb04d0b57c51.


@mergify bot added the cpu (Related to CPU backends) and v1 labels on Jan 8, 2026
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request reintroduces support for head dimensions 80 and 112 in the CPU attention backend, with a fallback to vec16 since these sizes are not optimal for AMX instructions. The changes look good overall, correctly implementing the fallback logic in the C++ backend.

My review includes two main points:

  1. There is redundant logic in the Python code for determining the ISA, which is already handled robustly in the C++ backend. I've suggested removing the Python-side logic to maintain a single source of truth.
  2. The automated tests have not been updated to include the new head sizes (80 and 112). I've recommended adding them to the test suite to ensure the new functionality is properly verified.

Addressing these points will improve the maintainability and robustness of the code.

Comment on lines 44 to +45

 def get_supported_head_sizes(cls) -> list[int]:
-    return [32, 64, 96, 128, 160, 192, 224, 256]
+    return [32, 64, 80, 96, 112, 128, 160, 192, 224, 256]
Severity: high

While support for head sizes 80 and 112 has been added, the automated tests in tests/kernels/attention/test_cpu_attn.py have not been updated to include these new sizes. Please update the HEAD_SIZES list in the test file to ensure the new functionality is covered.
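As a sketch of what that coverage could look like, the fallback cases can be derived from the extended list; `HEAD_SIZES` and `requires_vec16_fallback` here are illustrative Python stand-ins (the real list lives in tests/kernels/attention/test_cpu_attn.py, and the real helper is C++ in csrc/cpu/cpu_attn.cpp):

```python
# Illustrative stand-ins: the real HEAD_SIZES list lives in
# tests/kernels/attention/test_cpu_attn.py, and the real
# requires_vec16_fallback helper is C++ in csrc/cpu/cpu_attn.cpp.
HEAD_SIZES = [32, 64, 80, 96, 112, 128, 160, 192, 224, 256]

def requires_vec16_fallback(head_size: int) -> bool:
    # True for sizes that are 16-aligned but not 32-aligned.
    return head_size % 32 != 0 and head_size % 16 == 0

# The two newly reintroduced sizes are exactly the fallback cases.
fallback_sizes = [h for h in HEAD_SIZES if requires_vec16_fallback(h)]
```

Under this rule, `fallback_sizes` works out to `[80, 112]`, so a test parametrized over `HEAD_SIZES` would exercise both new sizes alongside the existing AMX-friendly ones.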

Comment on lines +487 to +491

def _get_attn_isa(
    dtype: torch.dtype, block_size: int, head_size: int | None = None
) -> str:
    if head_size in (80, 112):
        return "vec16"
Severity: high

The logic to force vec16 for specific head sizes is duplicated here and in the C++ backend. The C++ code in csrc/cpu/cpu_attn.cpp already handles this fallback in get_scheduler_metadata, cpu_attn_reshape_and_cache, and cpu_attention_with_kv_cache using the requires_vec16_fallback helper.

To maintain a single source of truth and avoid redundancy, this logic should only reside in the C++ backend. Please remove the head_size check from this function and revert its signature.

Also, the call to this function in CPUAttentionMetadataBuilder.__init__ (line 140) should be reverted to self.isa = _get_attn_isa(self.dtype, self.block_size).

def _get_attn_isa(dtype: torch.dtype, block_size: int) -> str:

@@ -15,6 +15,7 @@

 #ifdef __aarch64__
 #include "cpu_attn_neon.hpp"
+// NEON requires head_dim to be a multiple of 32
Member:

Ok... now we have created too many redundant template instantiations; the dispatch procedure needs to be reorganized.

@R3hankhan123 (Contributor, Author) replied Jan 9, 2026:

Maybe a future PR?

Member:

Yes

@bigPYJ1151 added the ready (ONLY add when PR is ready to merge/full CI is needed) label on Jan 9, 2026
@R3hankhan123 (Contributor, Author) commented Jan 9, 2026

@bigPYJ1151 looks like static assertion is causing this failure, is it ok to remove this assertion for now?

[2026-01-09T10:32:15Z] #16 206.4 /workspace/vllm/csrc/cpu/cpu_attn_amx.hpp:380:55: error: static assertion failed

@bigPYJ1151 (Member) replied:

@bigPYJ1151 looks like static assertion is causing this failure, is it ok to remove this assertion for now?

[2026-01-09T10:32:15Z] #16 206.4 /workspace/vllm/csrc/cpu/cpu_attn_amx.hpp:380:55: error: static assertion failed

Yes, there are some static checks related to head size in the AMX and Vec implementations that need to be disabled temporarily.

R3hankhan123 force-pushed the cpu_attn branch 2 times, most recently from 3d38760 to 9b9584d, on January 9, 2026 at 10:44
Reintroduce support for head dimensions 80 and 112 in the CPU attention backend.
These sizes bypass the Intel AMX, NEON, and wide-vector (vec) optimizations,
forcing the vec16 path.

Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
@bigPYJ1151 bigPYJ1151 merged commit 8e27663 into vllm-project:main Jan 9, 2026
58 checks passed
@R3hankhan123 R3hankhan123 deleted the cpu_attn branch January 9, 2026 14:20
@fadara01 (Contributor) commented:

@bigPYJ1151 @R3hankhan123 could you please create issues for the items that need to be addressed in future PRs, so that we do not forget about them?

Also, why were no tests added for the newly supported head dimensions?

akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Jan 16, 2026
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026

Labels

cpu (Related to CPU backends), ready (ONLY add when PR is ready to merge/full CI is needed), v1


3 participants