[Spyre-Next] Add RowParallelLinear and ColumnParallelLinear(MLP) wrappers#869

Merged
bohnstingl merged 3 commits into torch-spyre:main from nikheal2:wrap_mlp_layer
Apr 2, 2026
Conversation

Collaborator

@R3hankhan123 commented Mar 26, 2026

Description

Add RowParallelLinear and ColumnParallelLinear wrappers for torch-spyre; these act as the up-projection and down-projection layers in the MLP block.
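These wrappers follow the standard tensor-parallel MLP sharding pattern: the up projection is column-parallel (output dimension sharded, shards gathered by concatenation) and the down projection is row-parallel (input dimension sharded, partial outputs summed by an all-reduce). A minimal single-process sketch of that pattern, with illustrative names and shapes and the activation/gating between the projections omitted for brevity — this is not the actual torch-spyre API:

```python
# Hypothetical single-process sketch of the tensor-parallel sharding pattern
# behind ColumnParallelLinear (up projection) and RowParallelLinear (down
# projection). Shards are simulated with torch.chunk instead of real ranks.
import torch

torch.manual_seed(0)
tp = 2                       # simulated tensor-parallel world size
hidden, intermediate = 8, 16

x = torch.randn(4, hidden)
w_up = torch.randn(intermediate, hidden)    # up projection weight
w_down = torch.randn(hidden, intermediate)  # down projection weight

# Column-parallel: shard the output dim; each "rank" holds a row-slice of
# w_up, and the sharded activations are gathered by concatenation.
up_shards = torch.chunk(w_up, tp, dim=0)
h = torch.cat([x @ w.T for w in up_shards], dim=-1)

# Row-parallel: shard the input dim; each "rank" produces a partial output,
# and the sum over shards plays the role of the all-reduce.
down_shards = torch.chunk(w_down, tp, dim=1)
h_shards = torch.chunk(h, tp, dim=-1)
y = sum(hs @ w.T for hs, w in zip(h_shards, down_shards))

# The sharded computation matches the unsharded two-layer projection.
ref = (x @ w_up.T) @ w_down.T
assert torch.allclose(y, ref, atol=1e-5)
```

In a real deployment each shard lives on a different device, the concatenation becomes an all-gather, and the sum becomes an all-reduce; here everything runs on one process purely to show the math.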

Related Issues

Contributes towards #736

Test Plan

  1. Run the script in the examples folder and check that output is generated
  2. Run the script provided by Thomas Ortner, with wrapping removed for all other layers so that only the linear layers are wrapped

Test Result

  1. Output of examples/torch_spyre_inference.py
(EngineCore pid=25709) INFO 03-24 13:24:30 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=25709) INFO 03-24 13:24:30 [cpu_model_runner.py:62] Starting to load model ibm-ai-platform/micro-g3.3-8b-instruct-1b...
(EngineCore pid=25709) WARNING 03-24 13:24:30 [linear.py:165] SpyreRowParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=25709) WARNING 03-24 13:24:30 [linear.py:69] SpyreMergedColumnParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=25709) WARNING 03-24 13:24:30 [linear.py:165] SpyreRowParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=25709) WARNING 03-24 13:24:30 [linear.py:165] SpyreRowParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=25709) WARNING 03-24 13:24:30 [linear.py:69] SpyreMergedColumnParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=25709) WARNING 03-24 13:24:30 [linear.py:165] SpyreRowParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=25709) WARNING 03-24 13:24:30 [linear.py:165] SpyreRowParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=25709) WARNING 03-24 13:24:30 [linear.py:69] SpyreMergedColumnParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=25709) WARNING 03-24 13:24:30 [linear.py:165] SpyreRowParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=25709) WARNING 03-24 13:24:30 [linear.py:165] SpyreRowParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=25709) WARNING 03-24 13:24:30 [linear.py:69] SpyreMergedColumnParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=25709) WARNING 03-24 13:24:30 [linear.py:165] SpyreRowParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=25709) INFO 03-24 13:24:30 [weight_utils.py:618] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 15.63it/s]
(EngineCore pid=25709) 
(EngineCore pid=25709) INFO 03-24 13:24:30 [default_loader.py:384] Loading weights took 0.13 seconds
(EngineCore pid=25709) INFO 03-24 13:24:30 [kv_cache_utils.py:1319] GPU KV cache size: 16,507,392 tokens
(EngineCore pid=25709) INFO 03-24 13:24:30 [kv_cache_utils.py:1324] Maximum concurrency for 2,048 tokens per request: 8060.25x
(EngineCore pid=25709) INFO 03-24 13:24:33 [cpu_model_runner.py:73] Warming up model for the compilation...
(EngineCore pid=25709) INFO 03-24 13:25:22 [decorators.py:638] saved AOT compiled function to /home/rehankhan/.cache/vllm/torch_compile_cache/torch_aot_compile/c2955d84292fa86695de8dbd486acee55c6111d9f30c6092f4f227d27e0e5512/rank_0_0/model
(EngineCore pid=25709) INFO 03-24 13:25:34 [monitor.py:76] Initial profiling/warmup run took 11.14 s
(EngineCore pid=25709) INFO 03-24 13:25:34 [cpu_model_runner.py:83] Warming up done.
(EngineCore pid=25709) INFO 03-24 13:25:34 [core.py:283] init engine (profile, create kv cache, warmup model) took 63.47 seconds
(EngineCore pid=25709) WARNING 03-24 13:25:34 [scheduler.py:173] Using custom scheduler class vllm.v1.core.sched.scheduler.Scheduler. This scheduler interface is not public and compatibility may not be maintained.
(EngineCore pid=25709) INFO 03-24 13:25:35 [vllm.py:750] Asynchronous scheduling is disabled.
(EngineCore pid=25709) WARNING 03-24 13:25:35 [vllm.py:806] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore pid=25709) INFO 03-24 13:25:35 [platform.py:103] Loading scheduler from: vllm.v1.core.sched.scheduler.Scheduler
(EngineCore pid=25709) WARNING 03-24 13:25:35 [cpu.py:136] VLLM_CPU_KVCACHE_SPACE not set. Using 251.88 GiB for KV cache.
INFO 03-24 13:25:35 [llm.py:391] Supported tasks: ['generate']
=============== GENERATE
Rendering prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 38.33it/s]
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [01:58<00:00, 39.60s/it, est. speed input: 1.49 toks/s, output: 0.88 toks/s]
Time elaspsed for 20 tokens is 118.88 sec
===============
CompletionOutput(index=0, text='\n\nThe response is a 2-3 page document that describes the task.\n\n###', token_ids=[203, 203, 1318, 1789, 438, 312, 225, 36, 31, 37, 1938, 1825, 688, 18872, 322, 2899, 32, 203, 203, 1482], routed_experts=None, cumulative_logprob=None, logprobs=None, finish_reason=length, stop_reason=None)
CompletionOutput(index=0, text='\n\n1. The user will receive a list of instructions for preparing chicken soup for a family.\n2. The user will receive a list of instructions for preparing chicken soup for a family.\n3. The user will receive a list of instructions for preparing chicken soup for a family.\n', token_ids=[203, 203, 35, 32, 886, 1256, 1098, 7768, 312, 1149, 432, 9400, 436, 1406, 26124, 663, 21217, 31628, 436, 312, 13872, 32, 203, 36, 32, 886, 1256, 1098, 7768, 312, 1149, 432, 9400, 436, 1406, 26124, 663, 21217, 31628, 436, 312, 13872, 32, 203, 37, 32, 886, 1256, 1098, 7768, 312, 1149, 432, 9400, 436, 1406, 26124, 663, 21217, 31628, 436, 312, 13872, 32, 203], routed_experts=None, cumulative_logprob=None, logprobs=None, finish_reason=length, stop_reason=None)
CompletionOutput(index=0, text='\n\nThe user is a ghoul.\n\n### Instruction:\n\nThe user is a', token_ids=[203, 203, 1318, 1256, 438, 312, 28472, 825, 32, 203, 203, 1482, 21081, 44, 203, 203, 1318, 1256, 438, 312], routed_experts=None, cumulative_logprob=None, logprobs=None, finish_reason=length, stop_reason=None)
===============


Prompt:
 'Below is an instruction that describes a task. Write a response that appropriately completes the request. Be polite in your response to the user.\n\n### Instruction:\nProvide instructions for preparing chicken soup.\n\n### Response:'


Generated text:
 '\n\nThe response is a 2-3 page document that describes the task.\n\n###'


-----------------------------------


Prompt:
 'Below is an instruction that describes a task. Write a response that appropriately completes the request. Be polite in your response to the user.\n\n### Instruction:\nProvide a list of instructions for preparing chicken soup for a family.\n\n### Response:'


Generated text:
 '\n\n1. The user will receive a list of instructions for preparing chicken soup for a family.\n2. The user will receive a list of instructions for preparing chicken soup for a family.\n3. The user will receive a list of instructions for preparing chicken soup for a family.\n'


-----------------------------------


Prompt:
 "Below is an instruction that describes a task. Write a response that appropriately completes the request. Be polite in your response to the user.\n\n### Instruction:\nYou are Kaneki Ken from 'Tokyo Ghoul.' Describe what it feels like to be both human and ghoul to someone unfamiliar with your world.\n\n### Response:"


Generated text:
 '\n\nThe user is a ghoul.\n\n### Instruction:\n\nThe user is a'


-----------------------------------
(EngineCore pid=25709) INFO 03-24 13:27:33 [core.py:1210] Shutdown initiated (timeout=0)
(EngineCore pid=25709) INFO 03-24 13:27:33 [core.py:1233] Shutdown complete 
  2. Output of the script provided by Thomas Ortner
INFO 03-23 06:59:32 [llm.py:343] Supported tasks: ['generate']
Adding requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 45.31it/s]
Processed prompts:   0%|                                                                        | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]INFO 03-23 07:00:10 [loggers.py:257] Engine 000: Avg prompt throughput: 0.2 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 03-23 07:00:28 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 03-23 07:00:46 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 03-23 07:01:03 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
Processed prompts: 100%|████████████████████████████████████████████████████████████████| 1/1 [01:31<00:00, 91.43s/it, est. speed input: 0.09 toks/s, output: 0.05 toks/s]
--------------------------------------------------
Generated text: '\n\nIBM operates'
--------------------------------------------------
vllm:kv_cache_usage_perc 0.0
vllm:prefix_cache_queries 8
vllm:prefix_cache_hits 0
vllm:external_prefix_cache_queries 0
vllm:external_prefix_cache_hits 0
vllm:mm_cache_queries 0
vllm:mm_cache_hits 0
vllm:cache_config_info 1.0
Signal Received: 15 (Terminated)
Signal Received from pid=12470 

Checklist

  • I have read the contributing guidelines
  • My code follows the project's code style (run bash format.sh)
  • I have added tests for my changes (if applicable)
  • I have updated the documentation (if applicable)
  • My commits include a Signed-off-by: line (DCO compliance)

@R3hankhan123 requested a review from bohnstingl March 26, 2026 12:52
@R3hankhan123 linked an issue Mar 26, 2026 that may be closed by this pull request
@github-actions

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: Make sure that your code passes all the linting checks, otherwise your PR can't be merged. To do so, run ./format.sh.
Now you are good to go 🚀.

We also recommend installing prek and configuring it to check your code before every local commit.

@R3hankhan123 changed the title [Spyre-Next] Add RowParallelLinear and ColumnParallelLinear wrappers → [Spyre-Next] Add RowParallelLinear and ColumnParallelLinear(MLP) wrappers Mar 26, 2026
Collaborator

@bohnstingl left a comment

Thank you @R3hankhan123 for the PR.
In principle it looks good to me. I am just wondering whether we should simplify the code for the moment by de-duplicating identical functions until we really have a need to specialize them.

Also, could you run an end-to-end test and see whether the Granite3.3-8B model works and produces tokens?

Comment thread vllm_spyre_next/vllm_spyre_next/custom_ops/linear.py Outdated
Comment thread vllm_spyre_next/vllm_spyre_next/custom_ops/linear.py Outdated
Comment thread vllm_spyre_next/vllm_spyre_next/custom_ops/linear.py Outdated
Comment thread vllm_spyre_next/vllm_spyre_next/custom_ops/linear.py
Comment thread vllm_spyre_next/vllm_spyre_next/custom_ops/linear.py Outdated
@R3hankhan123
Collaborator Author

Also, @bohnstingl, I ran a test on the Granite3.3-8B model and here is the output:

INFO 03-27 05:38:39 [__init__.py:44] Available plugins for group vllm.platform_plugins:
INFO 03-27 05:38:39 [__init__.py:46] - spyre_next -> vllm_spyre_next:register
INFO 03-27 05:38:39 [__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 03-27 05:38:40 [__init__.py:239] Platform plugin spyre_next is activated
INFO 03-27 05:38:41 [importing.py:44] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
INFO 03-27 05:38:41 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 03-27 05:38:45 [utils.py:233] non-default args: {'max_model_len': 2048, 'enable_prefix_caching': True, 'model': 'ibm-granite/granite-3.3-8b-instruct'}
INFO 03-27 05:38:45 [model.py:540] Resolved architecture: GraniteForCausalLM
INFO 03-27 05:38:45 [model.py:1607] Using max model len 2048
WARNING 03-27 05:38:45 [cpu.py:136] VLLM_CPU_KVCACHE_SPACE not set. Using 251.88 GiB for KV cache.
INFO 03-27 05:38:45 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=16384.
INFO 03-27 05:38:45 [vllm.py:750] Asynchronous scheduling is enabled.
INFO 03-27 05:38:45 [platform.py:74] 
INFO 03-27 05:38:45 [platform.py:74]        █     █     █▄   ▄█       ▄█▀▀█▄  █▀▀▀█▄  █   █  █▀▀▀█▄  █▀▀▀▀
INFO 03-27 05:38:45 [platform.py:74]  ▄▄ ▄█ █     █     █ ▀▄▀ █       ▀▀▄▄▄   █▄▄▄█▀  ▀▄ ▄▀  █▄▄▄█▀  █▄▄▄   version 0.1.dev532
INFO 03-27 05:38:45 [platform.py:74]   █▄█▀ █     █     █     █            █  █        ▀█▀   █ ▀█▄   █      model   ibm-granite/granite-3.3-8b-instruct
INFO 03-27 05:38:45 [platform.py:74]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀       ▀▄▄▄█▀  █         █    █   ▀█  █▄▄▄▄
INFO 03-27 05:38:45 [platform.py:74] 
INFO 03-27 05:38:45 [platform.py:88] Loading worker from: vllm_spyre_next.v1.worker.spyre_worker.TorchSpyreWorker
INFO 03-27 05:38:45 [platform.py:103] Loading scheduler from: vllm.v1.core.sched.scheduler.Scheduler
INFO 03-27 05:38:52 [__init__.py:44] Available plugins for group vllm.platform_plugins:
INFO 03-27 05:38:52 [__init__.py:46] - spyre_next -> vllm_spyre_next:register
INFO 03-27 05:38:52 [__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 03-27 05:38:52 [__init__.py:239] Platform plugin spyre_next is activated
INFO 03-27 05:38:53 [importing.py:44] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
INFO 03-27 05:38:53 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
(EngineCore pid=34207) INFO 03-27 05:38:55 [core.py:105] Initializing a V1 LLM engine (v0.18.1rc1.dev53+gffb5b32b5.d20260324) with config: model='ibm-granite/granite-3.3-8b-instruct', speculative_config=None, tokenizer='ibm-granite/granite-3.3-8b-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cpu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=ibm-granite/granite-3.3-8b-instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.DYNAMO_TRACE_ONCE: 2>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': None, 'compile_ranges_endpoints': [16384], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True, 'dce': True, 'size_asserts': 
False, 'nan_asserts': False, 'epilogue_fusion': True, 'cpp.dynamic_threads': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': None, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=34207) INFO 03-27 05:38:57 [__init__.py:10] Registering custom ops for spyre_next
(EngineCore pid=34207) INFO 03-27 05:38:57 [linear.py:153] Registered custom op: spyre_merged_col_linear
(EngineCore pid=34207) INFO 03-27 05:38:57 [linear.py:153] Registered custom op: spyre_row_parallel_linear
(EngineCore pid=34207) WARNING 03-27 05:38:57 [cpu_worker.py:60] libtcmalloc is not found in LD_PRELOAD. For best performance, please follow the section `set LD_PRELOAD` in https://docs.vllm.ai/en/latest/getting_started/installation/cpu/ to setup required pre-loaded libraries.
(EngineCore pid=34207) WARNING 03-27 05:38:57 [cpu_worker.py:60] libiomp is not found in LD_PRELOAD. For best performance, please follow the section `set LD_PRELOAD` in https://docs.vllm.ai/en/latest/getting_started/installation/cpu/ to setup required pre-loaded libraries.
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:227] auto thread-binding list (id, physical core): [(96, 0), (97, 1), (98, 2), (99, 3), (100, 4), (101, 5), (102, 6), (103, 7), (104, 8), (105, 9), (106, 10), (107, 11), (108, 12), (109, 13), (110, 14), (111, 15), (112, 16), (113, 17), (114, 18), (115, 19), (116, 20), (117, 21), (118, 22), (119, 23), (120, 24), (121, 25), (122, 26), (123, 27), (124, 28), (125, 29), (126, 30), (127, 31), (128, 32), (129, 33), (130, 34), (131, 35), (132, 36), (133, 37), (134, 38), (135, 39), (136, 40), (137, 41), (138, 42), (139, 43), (140, 44), (141, 45), (142, 46), (143, 47)]
[W327 05:38:57.788271336 utils.cpp:76] Warning: numa_migrate_pages failed. errno: 1 (function init_cpu_threads_env)
[W327 05:38:57.788283773 utils.cpp:103] Warning: NUMA binding: Using MEMBIND policy for memory allocation on the NUMA nodes (0). Memory allocations will be strictly bound to these NUMA nodes. (function init_cpu_threads_env)
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] OMP threads binding of Process 34207:
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34207, core 96
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34419, core 97
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34420, core 98
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34421, core 99
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34422, core 100
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34423, core 101
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34424, core 102
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34425, core 103
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34426, core 104
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34427, core 105
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34428, core 106
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34429, core 107
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34430, core 108
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34431, core 109
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34432, core 110
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34433, core 111
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34434, core 112
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34435, core 113
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34436, core 114
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34437, core 115
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34438, core 116
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34439, core 117
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34440, core 118
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34441, core 119
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34442, core 120
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34443, core 121
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34444, core 122
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34445, core 123
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34446, core 124
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34447, core 125
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34448, core 126
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34449, core 127
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34450, core 128
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34451, core 129
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34452, core 130
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34453, core 131
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34454, core 132
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34455, core 133
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34456, core 134
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34457, core 135
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34458, core 136
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34459, core 137
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34460, core 138
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34461, core 139
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34462, core 140
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34463, core 141
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34464, core 142
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 	OMP tid: 34465, core 143
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_worker.py:109] 
(EngineCore pid=34207) INFO 03-27 05:38:57 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.129.9.130:41447 backend=gloo
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore pid=34207) INFO 03-27 05:38:57 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=34207) INFO 03-27 05:38:57 [cpu_model_runner.py:62] Starting to load model ibm-granite/granite-3.3-8b-instruct...
(EngineCore pid=34207) WARNING 03-27 05:38:57 [linear.py:58] SpyreRowParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=34207) WARNING 03-27 05:38:57 [linear.py:58] SpyreMergedColumnParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=34207) WARNING 03-27 05:38:57 [linear.py:58] SpyreRowParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=34207) WARNING 03-27 05:38:57 [linear.py:58] SpyreRowParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=34207) WARNING 03-27 05:38:57 [linear.py:58] SpyreMergedColumnParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=34207) WARNING 03-27 05:38:57 [linear.py:58] SpyreRowParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=34207) WARNING 03-27 05:38:57 [linear.py:58] SpyreRowParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=34207) WARNING 03-27 05:38:57 [linear.py:58] SpyreMergedColumnParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=34207) WARNING 03-27 05:38:57 [linear.py:58] SpyreRowParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=34207) WARNING 03-27 05:38:57 [linear.py:58] SpyreRowParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=34207) WARNING 03-27 05:38:57 [linear.py:58] SpyreMergedColumnParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=34207) WARNING 03-27 05:38:57 [linear.py:58] SpyreRowParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=34207) WARNING 03-27 05:38:57 [linear.py:58] SpyreRowParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=34207) WARNING 03-27 05:38:57 [linear.py:58] SpyreMergedColumnParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=34207) WARNING 03-27 05:38:57 [linear.py:58] SpyreRowParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=34207) WARNING 03-27 05:38:57 [linear.py:58] SpyreRowParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=34207) WARNING 03-27 05:38:57 [linear.py:58] SpyreMergedColumnParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=34207) WARNING 03-27 05:38:57 [linear.py:58] SpyreRowParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=34207) WARNING 03-27 05:38:57 [linear.py:58] SpyreRowParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=34207) WARNING 03-27 05:38:57 [linear.py:58] SpyreMergedColumnParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=34207) WARNING 03-27 05:38:57 [linear.py:58] SpyreRowParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=34207) WARNING 03-27 05:38:57 [linear.py:58] SpyreRowParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=34207) WARNING 03-27 05:38:57 [linear.py:58] SpyreMergedColumnParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=34207) WARNING 03-27 05:38:57 [linear.py:58] SpyreRowParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=34207) WARNING 03-27 05:38:57 [linear.py:58] SpyreRowParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=34207) WARNING 03-27 05:38:57 [linear.py:58] SpyreMergedColumnParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=34207) WARNING 03-27 05:38:57 [linear.py:58] SpyreRowParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=34207) WARNING 03-27 05:38:57 [linear.py:58] SpyreRowParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=34207) WARNING 03-27 05:38:57 [linear.py:58] SpyreMergedColumnParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=34207) WARNING 03-27 05:38:57 [linear.py:58] SpyreRowParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=34207) WARNING 03-27 05:38:57 [linear.py:58] SpyreRowParallelLinear: no dtype promotion is performed, expect numerical differences to upstream vLLM.
[... the same SpyreRowParallelLinear / SpyreMergedColumnParallelLinear warnings repeat for the remaining decoder layers ...]
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:00,  7.42it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:00<00:00,  5.93it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:00<00:00,  5.50it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:00<00:00,  6.43it/s]
(EngineCore pid=34207) 
(EngineCore pid=34207) INFO 03-27 05:38:58 [default_loader.py:384] Loading weights took 0.64 seconds
(EngineCore pid=34207) INFO 03-27 05:38:58 [kv_cache_utils.py:1319] GPU KV cache size: 1,650,688 tokens
(EngineCore pid=34207) INFO 03-27 05:38:58 [kv_cache_utils.py:1324] Maximum concurrency for 2,048 tokens per request: 806.00x
(EngineCore pid=34207) INFO 03-27 05:39:01 [cpu_model_runner.py:73] Warming up model for the compilation...
(EngineCore pid=34207) WARNING 03-27 05:39:38 [decorators.py:311] Compiling model again due to a load failure from /home/rehankhan/.cache/vllm/torch_compile_cache/torch_aot_compile/a3ecc393d50142f3ad4b46979641dea0ed7a06dbf4b2ef23d4cb99b7e952dccf/rank_0_0/model, reason: 'function' object has no attribute 'finalize_loading'
(EngineCore pid=34207) INFO 03-27 05:39:50 [decorators.py:638] saved AOT compiled function to /home/rehankhan/.cache/vllm/torch_compile_cache/torch_aot_compile/a3ecc393d50142f3ad4b46979641dea0ed7a06dbf4b2ef23d4cb99b7e952dccf/rank_0_0/model
(EngineCore pid=34207) INFO 03-27 05:40:34 [monitor.py:76] Initial profiling/warmup run took 43.92 s
(EngineCore pid=34207) INFO 03-27 05:40:34 [cpu_model_runner.py:83] Warming up done.
(EngineCore pid=34207) INFO 03-27 05:40:34 [core.py:283] init engine (profile, create kv cache, warmup model) took 95.15 seconds
(EngineCore pid=34207) WARNING 03-27 05:40:34 [scheduler.py:173] Using custom scheduler class vllm.v1.core.sched.scheduler.Scheduler. This scheduler interface is not public and compatibility may not be maintained.
(EngineCore pid=34207) INFO 03-27 05:40:34 [vllm.py:750] Asynchronous scheduling is disabled.
(EngineCore pid=34207) WARNING 03-27 05:40:34 [vllm.py:806] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore pid=34207) INFO 03-27 05:40:34 [platform.py:103] Loading scheduler from: vllm.v1.core.sched.scheduler.Scheduler
(EngineCore pid=34207) WARNING 03-27 05:40:34 [cpu.py:136] VLLM_CPU_KVCACHE_SPACE not set. Using 251.88 GiB for KV cache.
INFO 03-27 05:40:34 [llm.py:391] Supported tasks: ['generate']
Rendering prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.83it/s]
Processed prompts:   0%|                                                                                                                                                            | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]INFO 03-27 05:41:14 [loggers.py:259] Engine 000: Avg prompt throughput: 0.1 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 03-27 05:41:31 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
[... similar per-interval throughput log lines elided while the single request decodes ...]
INFO 03-27 05:47:34 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [06:59<00:00, 419.22s/it, est. speed input: 0.01 toks/s, output: 0.06 toks/s]
--------------------------------------------------
Generated text: ' for containerization?\r\n\r\nRed Hat OpenShift is a Kubernetes-based container application platform that allows organizations to automate and manage'
--------------------------------------------------
vllm:kv_cache_usage_perc 0.0
vllm:prefix_cache_queries 6
vllm:prefix_cache_hits 0
vllm:external_prefix_cache_queries 0
vllm:external_prefix_cache_hits 0
vllm:mm_cache_queries 0
vllm:mm_cache_hits 0
vllm:prompt_tokens_cached 0
vllm:cache_config_info 1.0
(EngineCore pid=34207) INFO 03-27 05:47:34 [core.py:1210] Shutdown initiated (timeout=0)
(EngineCore pid=34207) INFO 03-27 05:47:34 [core.py:1233] Shutdown complete

@R3hankhan123 R3hankhan123 force-pushed the wrap_mlp_layer branch 2 times, most recently from 39874e1 to b17f7d0 Compare March 27, 2026 08:58
Collaborator

@bohnstingl bohnstingl left a comment


LGTM. Can you please try the Granite3.3-8B model and check whether the token generation works?

It has been observed though that the current way of wrapping for torch-spyre interferes with the enablement of upstream vLLM tests, see #863. To address this, I've opened a PR (#872) that reworks the forward call chain a bit and uses forward_oot instead of forward_native. Maybe we could hold off the merge a bit and get #872 merged first and then apply the rework directly here as well?

@R3hankhan123 what do you think?

@R3hankhan123
Collaborator Author

> LGTM. Can you please try the Granite3.3-8B model and check whether the token generation works?
>
> It has been observed though that the current way of wrapping for torch-spyre interferes with the enablement of upstream vLLM tests, see #863. To address this, I've opened a PR (#872) that reworks the forward call chain a bit and uses forward_oot instead of forward_native. Maybe we could hold off the merge a bit and get #872 merged first and then apply the rework directly here as well?
>
> @R3hankhan123 what do you think?

Sure @bohnstingl

@bohnstingl
Collaborator

@R3hankhan123 #872 has landed. Could you please adopt the modified forward call structure here? I will then push for a quick merge.

Comment thread vllm_spyre_next/vllm_spyre_next/custom_ops/linear.py Outdated

class _SpyreLinear:
"""Shared implementation for Spyre linear layers at TP=1."""

Collaborator


For all other OOT implementations (e.g. rms, silu, ...) I see this line:
_dynamic_arg_dims = {"x": [], "residual": []}
Is it not needed here? Could it be removed in the other classes too? @bohnstingl

Collaborator


Okay, I see we do not have a residual here...
Is not specifying anything the same as putting _dynamic_arg_dims = {"x": []}?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it not needed here? could it be removed in the other classes too? @bohnstingl

No, it can't be removed and in fact we need it here as well. The _dynamic_arg_dims = {"x": [], "residual": []} ensures that maybe_compile compiles with dynamic=False. Here it should be _dynamic_arg_dims = {"x": [], "weight": [], "bias": []}, I think.

@R3hankhan123 could you please confirm that:

  • This path here is followed, i.e., make a breakpoint() there and check that it is triggered.
  • That no shape is marked as dynamic, i.e., this part is never reached.
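For context, here is a tiny sketch of what the `_dynamic_arg_dims` convention encodes (the helper name below is hypothetical, not vLLM's actual API): each argument maps to the list of its dynamic dimensions, and an empty list everywhere means no shape is marked dynamic, which is what allows compiling with `dynamic=False`.

```python
# Hypothetical helper illustrating the _dynamic_arg_dims convention:
# each argument name maps to the list of its dynamic dimensions; an
# empty list for every argument means the layer compiles fully static.
def compiles_static(dynamic_arg_dims: dict) -> bool:
    return all(len(dims) == 0 for dims in dynamic_arg_dims.values())

print(compiles_static({"x": [], "residual": []}))  # True: rms_norm-style, fully static
print(compiles_static({"x": [0]}))                 # False: batch dim marked dynamic
```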

Collaborator Author


The weight and bias tensors are internal to the layer implementation; they're accessed inside _forward_spyre_impl. Since they're not direct arguments to the custom op, I think they don't need to be in _dynamic_arg_dims. I think only {"x": [], "output": []} is sufficient.

Collaborator


On second thought, I have to take my comment above back. MergedColumnParallelLinear and RowParallelLinear are PluggableLayer, not CustomOp. Thus, the compilation path is different and there is no maybe_compile. Probably we need to simply invoke torch.compile directly for the moment:

self.maybe_compiled_forward_spyre = torch.compile(self.forward_spyre, dynamic=False)

Please leave a note though that this should be changed in the future.

This means you also don't need to define _dynamic_arg_dims.
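A minimal sketch of that suggestion (class and attribute names are illustrative, not the torch-spyre API; `backend="eager"` is used here only so the sketch runs without an Inductor toolchain):

```python
import torch


class SpyreLinearSketch(torch.nn.Module):
    """Sketch of a PluggableLayer-style linear wrapper: there is no
    maybe_compile hook, so forward_spyre is compiled directly with
    dynamic=False at construction time."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(out_features, in_features))
        # TODO: switch to maybe_compile once PluggableLayer supports it.
        # dynamic=False pins all input shapes; a new shape triggers a
        # recompile rather than a dynamic-shape graph.
        self.maybe_compiled_forward_spyre = torch.compile(
            self.forward_spyre, dynamic=False, backend="eager")

    def forward_spyre(self, x: torch.Tensor) -> torch.Tensor:
        return torch.nn.functional.linear(x, self.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.maybe_compiled_forward_spyre(x)
```

With `dynamic=False`, no shape is marked dynamic, matching the behavior checked in the bullet points above.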

Comment thread vllm_spyre_next/vllm_spyre_next/custom_ops/linear.py
Comment thread vllm_spyre_next/vllm_spyre_next/custom_ops/linear.py Outdated
Comment thread vllm_spyre_next/vllm_spyre_next/custom_ops/linear.py Outdated
Comment thread vllm_spyre_next/vllm_spyre_next/custom_ops/linear.py Outdated
Collaborator

@bohnstingl bohnstingl left a comment


In general looks good to me. @R3hankhan123 could you take a look at the small comments we had?



Comment thread vllm_spyre_next/vllm_spyre_next/custom_ops/linear.py Outdated
Comment thread vllm_spyre_next/vllm_spyre_next/custom_ops/linear.py
@R3hankhan123
Collaborator Author

After running a quick test with ibm-granite/granite-3.3-8b-instruct:

[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore pid=46348) INFO 03-31 14:54:38 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=46348) INFO 03-31 14:54:39 [cpu_model_runner.py:62] Starting to load model ibm-granite/granite-3.3-8b-instruct...
(EngineCore pid=46348) WARNING 03-31 14:54:39 [linear.py:60] SpyreRowParallelLinear: no dtype promotion (torch-spyre limitation),expect numerical differences to upstream vLLM.
(EngineCore pid=46348) WARNING 03-31 14:54:39 [linear.py:60] SpyreMergedColumnParallelLinear: no dtype promotion (torch-spyre limitation),expect numerical differences to upstream vLLM.
(EngineCore pid=46348) WARNING 03-31 14:54:39 [rms_norm.py:75] SpyreRMSNorm: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=46348) WARNING 03-31 14:54:39 [rms_norm.py:75] SpyreRMSNorm: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=46348) WARNING 03-31 14:54:39 [rms_norm.py:75] SpyreRMSNorm: no dtype promotion is performed, expect numerical differences to upstream vLLM.
[... same SpyreRMSNorm warning repeated once per layer; duplicates omitted ...]
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:00,  4.74it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:00<00:00,  3.84it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:00<00:00,  3.65it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:00<00:00,  4.79it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:00<00:00,  4.42it/s]
(EngineCore pid=46348) 
(EngineCore pid=46348) INFO 03-31 14:54:41 [default_loader.py:384] Loading weights took 0.93 seconds
(EngineCore pid=46348) INFO 03-31 14:54:41 [kv_cache_utils.py:1319] GPU KV cache size: 1,650,688 tokens
(EngineCore pid=46348) INFO 03-31 14:54:41 [kv_cache_utils.py:1324] Maximum concurrency for 2,048 tokens per request: 806.00x
(EngineCore pid=46348) INFO 03-31 14:54:46 [cpu_model_runner.py:73] Warming up model for the compilation...
(EngineCore pid=46348) WARNING 03-31 14:55:43 [decorators.py:311] Compiling model again due to a load failure from /home/rehankhan/.cache/vllm/torch_compile_cache/torch_aot_compile/aa10285af9a2273b82a7f4c08ebf84ed68b90dcb6982c5ab4e6abf9503a6c3e6/rank_0_0/model, reason: 'function' object has no attribute 'finalize_loading'
(EngineCore pid=46348) INFO 03-31 14:56:02 [decorators.py:638] saved AOT compiled function to /home/rehankhan/.cache/vllm/torch_compile_cache/torch_aot_compile/aa10285af9a2273b82a7f4c08ebf84ed68b90dcb6982c5ab4e6abf9503a6c3e6/rank_0_0/model
(EngineCore pid=46348) INFO 03-31 14:56:03 [monitor.py:76] Initial profiling/warmup run took 1.51 s
(EngineCore pid=46348) INFO 03-31 14:56:03 [cpu_model_runner.py:83] Warming up done.
(EngineCore pid=46348) INFO 03-31 14:56:03 [core.py:283] init engine (profile, create kv cache, warmup model) took 82.76 seconds
(EngineCore pid=46348) WARNING 03-31 14:56:04 [scheduler.py:173] Using custom scheduler class vllm.v1.core.sched.scheduler.Scheduler. This scheduler interface is not public and compatibility may not be maintained.
(EngineCore pid=46348) INFO 03-31 14:56:04 [vllm.py:750] Asynchronous scheduling is disabled.
(EngineCore pid=46348) WARNING 03-31 14:56:04 [vllm.py:806] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore pid=46348) INFO 03-31 14:56:04 [platform.py:103] Loading scheduler from: vllm.v1.core.sched.scheduler.Scheduler
(EngineCore pid=46348) WARNING 03-31 14:56:04 [cpu.py:136] VLLM_CPU_KVCACHE_SPACE not set. Using 251.88 GiB for KV cache.
INFO 03-31 14:56:04 [llm.py:391] Supported tasks: ['generate']
Rendering prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.52it/s]
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.36s/it, est. speed input: 2.06 toks/s, output: 5.73 toks/s]
--------------------------------------------------
Generated text: '.\n\nHere is a simple C++ code to reverse a string using the standard templated algorithm library:\n\n```'
--------------------------------------------------
vllm:kv_cache_usage_perc 0.0
vllm:prefix_cache_queries 9
vllm:prefix_cache_hits 0
vllm:external_prefix_cache_queries 0
vllm:external_prefix_cache_hits 0
vllm:mm_cache_queries 0
vllm:mm_cache_hits 0
vllm:prompt_tokens_cached 0
vllm:cache_config_info 1.0
(EngineCore pid=46348) INFO 03-31 14:56:09 [core.py:1210] Shutdown initiated (timeout=0)
(EngineCore pid=46348) INFO 03-31 14:56:09 [core.py:1233] Shutdown complete

Add RowParallelLinear and ColumnParallelLinear wrappers for torch-spyre,
which act as the up projection and down projection in the MLP layer

Co-authored-by: Rehan Khan <Rehan.Khan7@ibm.com>
Co-authored-by: nikheal2 <suryawanshin74@gmail.com>
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
Signed-off-by: nikheal2 <suryawanshin74@gmail.com>
Signed-off-by: Thomas Ortner <boh@zurich.ibm.com>
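The up/down projection pairing from the commit message follows the standard tensor-parallel MLP layout: the column-parallel up projection shards its weight along the output dimension, and the row-parallel down projection shards along the input dimension, so each rank's partial result only needs a final all-reduce. A minimal numpy sketch (not the torch-spyre code; the activation is omitted to keep the math exact):

```python
import numpy as np

rng = np.random.default_rng(0)
tp = 2                                 # tensor-parallel world size
x = rng.standard_normal((4, 8))        # [tokens, hidden]
w_up = rng.standard_normal((8, 16))    # up projection   [hidden, intermediate]
w_down = rng.standard_normal((16, 8))  # down projection [intermediate, hidden]

# Reference: unsharded MLP.
ref = (x @ w_up) @ w_down

# ColumnParallelLinear: shard w_up along its output (column) dimension.
up_shards = np.split(w_up, tp, axis=1)
# RowParallelLinear: shard w_down along its input (row) dimension.
down_shards = np.split(w_down, tp, axis=0)

# Each rank computes on its own shard; the row-parallel outputs are
# partial sums, so an all-reduce (here: a plain sum) finishes the MLP.
partials = [(x @ up_shards[r]) @ down_shards[r] for r in range(tp)]
out = sum(partials)

assert np.allclose(out, ref)
```

This is why the two wrappers come as a pair: no communication is needed between the up and down projections, only one all-reduce at the end.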
@bohnstingl
Collaborator

I changed the forward_oot definition to forward. The reason is that both MergedColumnParallelLinear and SpyreRowParallelLinear are PluggableLayers: the register_oot decorator registers the classes, and when they are then called from dispatch, the forward function is invoked.
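The registration-and-dispatch pattern described above can be sketched as follows. The names (`register_oot`, the registry dict, `dispatch`) are hypothetical stand-ins for the vLLM/torch-spyre plugin machinery, which differs in detail; the point is that dispatch resolves to the registered class and calls its `forward`, which is why the method cannot be named `forward_oot`:

```python
_OOT_REGISTRY: dict[str, type] = {}

def register_oot(cls):
    """Register an out-of-tree replacement under the upstream class name."""
    _OOT_REGISTRY[cls.__name__.removeprefix("Spyre")] = cls
    return cls

class RowParallelLinear:                     # stand-in for the upstream layer
    def forward(self, x):
        return ("upstream", x)

@register_oot
class SpyreRowParallelLinear(RowParallelLinear):
    # Defining `forward` (not `forward_oot`) matters: dispatch calls
    # `forward` on whichever class was registered.
    def forward(self, x):
        return ("spyre", x)

def dispatch(name: str):
    """Return an instance of the registered override, or the upstream class."""
    cls = _OOT_REGISTRY.get(name, RowParallelLinear)
    return cls()

layer = dispatch("RowParallelLinear")
assert layer.forward(3) == ("spyre", 3)
```

A `forward_oot` method would simply never be reached through this path, since nothing in the dispatch chain knows to look for it.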

I tested the latest commit end-to-end and can confirm that it runs on Spyre and produces the expected results.

@bohnstingl
Collaborator

bot:next-test

@bohnstingl bohnstingl requested a review from tjohnson31415 April 1, 2026 12:59
@bohnstingl
Collaborator

@joerunde or @tjohnson31415 I looked at the CI results and they don't seem to be related. Could you confirm?

Collaborator

@yannicks1 yannicks1 left a comment


lgtm, thanks for addressing all the feedback

Signed-off-by: Thomas Ortner <boh@zurich.ibm.com>
Collaborator

@tjohnson31415 tjohnson31415 left a comment


@tjohnson31415 I looked at the CI results and they don't seem to be related. Could you confirm?

Can confirm. The spyre-ci tests should not block the PR because they are not working.

@bohnstingl bohnstingl merged commit 7ad2219 into torch-spyre:main Apr 2, 2026
13 checks passed


Development

Successfully merging this pull request may close these issues.

Wrap MLP layer for torch-spyre

4 participants