
[Spyre-Next] [Feature] Wrap RoPE layer on Spyre#881

Draft
dilipgb wants to merge 9 commits into torch-spyre:main from dilipgb:main

Conversation


@dilipgb dilipgb commented Mar 31, 2026

Description

Adds SpyreRotaryEmbedding, a Spyre-optimized out-of-tree (OOT) replacement for vLLM's RotaryEmbedding, following the custom op pattern from #842 (same as SpyreRMSNorm and SpyreSiluAndMul).
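For reviewers unfamiliar with the op being wrapped, here is a minimal numpy sketch of what a rotary embedding computes, using the NeoX-style half-split pair layout (this is an illustrative reference, not the Spyre implementation; the function name and layout choice are assumptions):

```python
import numpy as np

def rope_half_split(x, pos, base=10000.0):
    """Apply rotary position embedding, NeoX-style half-split layout.

    x:   (..., d) query/key vector with even head dim d
    pos: integer token position
    """
    d = x.shape[-1]
    half = d // 2
    # Per-pair rotation frequency: theta_i = base^(-2i/d)
    inv_freq = base ** (-np.arange(half) * 2.0 / d)
    angles = pos * inv_freq
    cos, sin = np.cos(angles), np.sin(angles)
    # Pair element i with element i + half and rotate each pair by pos * theta_i
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```

The defining property (and the useful invariant for any replacement op) is that the rotation is norm-preserving and q·k after rotation depends only on the relative distance between the two positions.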

Related Issues

Fixes #820

Test Plan

Ran a couple of tests to confirm it works on Spyre. Also looking into adapting the upstream rotary embedding tests; that work is in progress.
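Since the log below warns that SpyreRotaryEmbedding performs no dtype promotion ("expect numerical differences to upstream vLLM"), an upstream-style parity test would compare the reduced-precision path against an fp32 reference within a tolerance rather than exactly. A rough sketch of that check, with numpy float16 standing in for the device's low-precision path (all names and tolerances here are hypothetical, not the actual test):

```python
import numpy as np

def rope_ref(x, pos, base=10000.0):
    # fp32/fp64 reference: NeoX-style half-split rotary embedding
    d = x.shape[-1]
    half = d // 2
    inv_freq = base ** (-np.arange(half) * 2.0 / d)
    cos, sin = np.cos(pos * inv_freq), np.sin(pos * inv_freq)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def rope_lowp(x, pos, base=10000.0):
    # Same math with every intermediate held in float16,
    # modeling a path that never promotes to fp32.
    d = x.shape[-1]
    half = d // 2
    inv_freq = (base ** (-np.arange(half) * 2.0 / d)).astype(np.float16)
    ang = np.float16(pos) * inv_freq
    cos, sin = np.cos(ang), np.sin(ang)
    x1 = x[..., :half].astype(np.float16)
    x2 = x[..., half:].astype(np.float16)
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```

The tolerance has to absorb both the rounding of the angles and of the inputs, which is why exact equality against upstream is not a reasonable assertion here.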

examples/torch_spyre_inference.py

python examples/torch_spyre_inference.py 
INFO 03-31 17:58:31 [__init__.py:44] Available plugins for group vllm.platform_plugins:
INFO 03-31 17:58:31 [__init__.py:46] - spyre_next -> vllm_spyre_next:register
INFO 03-31 17:58:31 [__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 03-31 17:58:31 [__init__.py:239] Platform plugin spyre_next is activated
INFO 03-31 17:58:32 [importing.py:44] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
INFO 03-31 17:58:32 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 03-31 17:58:39 [utils.py:233] non-default args: {'tokenizer': 'ibm-ai-platform/micro-g3.3-8b-instruct-1b', 'max_model_len': 2048, 'max_num_batched_tokens': 1024, 'max_num_seqs': 2, 'disable_log_stats': True, 'model': 'ibm-ai-platform/micro-g3.3-8b-instruct-1b'}
INFO 03-31 17:58:39 [model.py:540] Resolved architecture: GraniteForCausalLM
INFO 03-31 17:58:39 [model.py:1607] Using max model len 2048
WARNING 03-31 17:58:39 [cpu.py:136] VLLM_CPU_KVCACHE_SPACE not set. Using 251.88 GiB for KV cache.
INFO 03-31 17:58:39 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=1024.
INFO 03-31 17:58:39 [vllm.py:750] Asynchronous scheduling is enabled.
INFO 03-31 17:58:39 [platform.py:74] 
INFO 03-31 17:58:39 [platform.py:74]        █     █     █▄   ▄█       ▄█▀▀█▄  █▀▀▀█▄  █   █  █▀▀▀█▄  █▀▀▀▀
INFO 03-31 17:58:39 [platform.py:74]  ▄▄ ▄█ █     █     █ ▀▄▀ █       ▀▀▄▄▄   █▄▄▄█▀  ▀▄ ▄▀  █▄▄▄█▀  █▄▄▄   version 0.1.dev536
INFO 03-31 17:58:39 [platform.py:74]   █▄█▀ █     █     █     █            █  █        ▀█▀   █ ▀█▄   █      model   ibm-ai-platform/micro-g3.3-8b-instruct-1b
INFO 03-31 17:58:39 [platform.py:74]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀       ▀▄▄▄█▀  █         █    █   ▀█  █▄▄▄▄
INFO 03-31 17:58:39 [platform.py:74] 
INFO 03-31 17:58:39 [platform.py:88] Loading worker from: vllm_spyre_next.v1.worker.spyre_worker.TorchSpyreWorker
INFO 03-31 17:58:39 [platform.py:103] Loading scheduler from: vllm.v1.core.sched.scheduler.Scheduler
INFO 03-31 17:58:50 [__init__.py:44] Available plugins for group vllm.platform_plugins:
INFO 03-31 17:58:50 [__init__.py:46] - spyre_next -> vllm_spyre_next:register
INFO 03-31 17:58:50 [__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 03-31 17:58:50 [__init__.py:239] Platform plugin spyre_next is activated
INFO 03-31 17:58:50 [importing.py:44] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
INFO 03-31 17:58:50 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
(EngineCore pid=53257) INFO 03-31 17:58:53 [core.py:105] Initializing a V1 LLM engine (v0.18.1rc1.dev53+gffb5b32b5.d20260324) with config: model='ibm-ai-platform/micro-g3.3-8b-instruct-1b', speculative_config=None, tokenizer='ibm-ai-platform/micro-g3.3-8b-instruct-1b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cpu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=ibm-ai-platform/micro-g3.3-8b-instruct-1b, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.DYNAMO_TRACE_ONCE: 2>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': None, 'compile_ranges_endpoints': [1024], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True, 'dce': True, 
'size_asserts': False, 'nan_asserts': False, 'epilogue_fusion': True, 'cpp.dynamic_threads': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': None, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=53257) INFO 03-31 17:58:55 [__init__.py:13] Registering custom ops for spyre_next
(EngineCore pid=53257) INFO 03-31 17:58:55 [rms_norm.py:236] Registered custom op: SpyreRMSNorm
(EngineCore pid=53257) INFO 03-31 17:58:55 [silu_and_mul.py:169] Registered custom op: SpyreSiluAndMul
(EngineCore pid=53257) INFO 03-31 17:58:55 [linear.py:157] Registered custom op: spyre_merged_col_linear
(EngineCore pid=53257) INFO 03-31 17:58:55 [linear.py:157] Registered custom op: spyre_row_parallel_linear
(EngineCore pid=53257) INFO 03-31 17:58:55 [rotary_embedding.py:305] Registered custom op: SpyreRotaryEmbedding
(EngineCore pid=53257) WARNING 03-31 17:58:55 [cpu_worker.py:60] libtcmalloc is not found in LD_PRELOAD. For best performance, please follow the section `set LD_PRELOAD` in https://docs.vllm.ai/en/latest/getting_started/installation/cpu/ to setup required pre-loaded libraries.
(EngineCore pid=53257) WARNING 03-31 17:58:55 [cpu_worker.py:60] libiomp is not found in LD_PRELOAD. For best performance, please follow the section `set LD_PRELOAD` in https://docs.vllm.ai/en/latest/getting_started/installation/cpu/ to setup required pre-loaded libraries.
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:227] auto thread-binding list (id, physical core): [(96, 0), (97, 1), (98, 2), (99, 3), (100, 4), (101, 5), (102, 6), (103, 7), (104, 8), (105, 9), (106, 10), (107, 11), (108, 12), (109, 13), (110, 14), (111, 15), (112, 16), (113, 17), (114, 18), (115, 19), (116, 20), (117, 21), (118, 22), (119, 23), (120, 24), (121, 25), (122, 26), (123, 27), (124, 28), (125, 29), (126, 30), (127, 31), (128, 32), (129, 33), (130, 34), (131, 35), (132, 36), (133, 37), (134, 38), (135, 39), (136, 40), (137, 41), (138, 42), (139, 43), (140, 44), (141, 45), (142, 46), (143, 47)]
[W331 17:58:55.613566406 utils.cpp:76] Warning: numa_migrate_pages failed. errno: 1 (function init_cpu_threads_env)
[W331 17:58:55.613579509 utils.cpp:103] Warning: NUMA binding: Using MEMBIND policy for memory allocation on the NUMA nodes (0). Memory allocations will be strictly bound to these NUMA nodes. (function init_cpu_threads_env)
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] OMP threads binding of Process 53257:
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53257, core 96
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53469, core 97
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53470, core 98
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53471, core 99
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53472, core 100
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53473, core 101
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53474, core 102
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53475, core 103
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53476, core 104
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53477, core 105
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53478, core 106
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53479, core 107
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53480, core 108
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53481, core 109
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53482, core 110
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53483, core 111
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53484, core 112
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53485, core 113
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53486, core 114
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53487, core 115
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53488, core 116
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53489, core 117
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53490, core 118
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53491, core 119
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53492, core 120
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53493, core 121
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53494, core 122
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53495, core 123
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53496, core 124
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53497, core 125
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53498, core 126
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53499, core 127
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53500, core 128
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53501, core 129
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53502, core 130
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53503, core 131
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53504, core 132
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53505, core 133
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53506, core 134
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53507, core 135
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53508, core 136
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53509, core 137
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53510, core 138
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53511, core 139
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53512, core 140
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53513, core 141
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53514, core 142
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53515, core 143
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 
(EngineCore pid=53257) INFO 03-31 17:58:55 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.129.9.130:56013 backend=gloo
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore pid=53257) INFO 03-31 17:58:55 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_model_runner.py:62] Starting to load model ibm-ai-platform/micro-g3.3-8b-instruct-1b...
(EngineCore pid=53257) WARNING 03-31 17:58:55 [linear.py:60] SpyreRowParallelLinear: no dtype promotion (torch-spyre limitation),expect numerical differences to upstream vLLM.
(EngineCore pid=53257) WARNING 03-31 17:58:55 [rotary_embedding.py:89] SpyreRotaryEmbedding: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=53257) WARNING 03-31 17:58:55 [linear.py:60] SpyreMergedColumnParallelLinear: no dtype promotion (torch-spyre limitation),expect numerical differences to upstream vLLM.
(EngineCore pid=53257) WARNING 03-31 17:58:55 [rms_norm.py:75] SpyreRMSNorm: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=53257) WARNING 03-31 17:58:55 [rms_norm.py:75] SpyreRMSNorm: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=53257) WARNING 03-31 17:58:55 [rms_norm.py:75] SpyreRMSNorm: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=53257) WARNING 03-31 17:58:55 [rms_norm.py:75] SpyreRMSNorm: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=53257) WARNING 03-31 17:58:55 [rms_norm.py:75] SpyreRMSNorm: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=53257) WARNING 03-31 17:58:55 [rms_norm.py:75] SpyreRMSNorm: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=53257) WARNING 03-31 17:58:55 [rms_norm.py:75] SpyreRMSNorm: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=53257) WARNING 03-31 17:58:55 [rms_norm.py:75] SpyreRMSNorm: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=53257) WARNING 03-31 17:58:55 [rms_norm.py:75] SpyreRMSNorm: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=53257) INFO 03-31 17:58:55 [weight_utils.py:618] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 14.90it/s]
(EngineCore pid=53257) 
(EngineCore pid=53257) INFO 03-31 17:58:55 [default_loader.py:384] Loading weights took 0.13 seconds
(EngineCore pid=53257) INFO 03-31 17:58:56 [kv_cache_utils.py:1319] GPU KV cache size: 16,507,392 tokens
(EngineCore pid=53257) INFO 03-31 17:58:56 [kv_cache_utils.py:1324] Maximum concurrency for 2,048 tokens per request: 8060.25x
(EngineCore pid=53257) INFO 03-31 17:58:59 [cpu_model_runner.py:73] Warming up model for the compilation...
(EngineCore pid=53257) WARNING 03-31 17:59:41 [decorators.py:311] Compiling model again due to a load failure from /home/rehankhan/.cache/vllm/torch_compile_cache/torch_aot_compile/582fa9fdd760fed36d5e8fe8543c1366cc78037643cc9a0fd374f222ca452ed8/rank_0_0/model, reason: Source code has changed since the last compilation. Recompiling the model.
(EngineCore pid=53257) INFO 03-31 17:59:50 [decorators.py:638] saved AOT compiled function to /home/rehankhan/.cache/vllm/torch_compile_cache/torch_aot_compile/582fa9fdd760fed36d5e8fe8543c1366cc78037643cc9a0fd374f222ca452ed8/rank_0_0/model
(EngineCore pid=53257) INFO 03-31 17:59:50 [monitor.py:76] Initial profiling/warmup run took 0.03 s
(EngineCore pid=53257) INFO 03-31 17:59:50 [cpu_model_runner.py:83] Warming up done.
(EngineCore pid=53257) INFO 03-31 17:59:50 [core.py:283] init engine (profile, create kv cache, warmup model) took 54.55 seconds
(EngineCore pid=53257) WARNING 03-31 17:59:51 [scheduler.py:173] Using custom scheduler class vllm.v1.core.sched.scheduler.Scheduler. This scheduler interface is not public and compatibility may not be maintained.
(EngineCore pid=53257) INFO 03-31 17:59:51 [vllm.py:750] Asynchronous scheduling is disabled.
(EngineCore pid=53257) WARNING 03-31 17:59:51 [vllm.py:806] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore pid=53257) INFO 03-31 17:59:51 [platform.py:103] Loading scheduler from: vllm.v1.core.sched.scheduler.Scheduler
(EngineCore pid=53257) WARNING 03-31 17:59:51 [cpu.py:136] VLLM_CPU_KVCACHE_SPACE not set. Using 251.88 GiB for KV cache.
INFO 03-31 17:59:51 [llm.py:391] Supported tasks: ['generate']
=============== GENERATE
Rendering prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 30.27it/s]
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  2.60it/s, est. speed input: 153.71 toks/s, output: 91.18 toks/s]
Time elaspsed for 20 tokens is 1.25 sec
===============
CompletionOutput(index=0, text='\n\nThe response is a 2-3 page document that describes the task.\n\n###', token_ids=[203, 203, 1318, 1789, 438, 312, 225, 36, 31, 37, 1938, 1825, 688, 18872, 322, 2899, 32, 203, 203, 1482], routed_experts=None, cumulative_logprob=None, logprobs=None, finish_reason=length, stop_reason=None)
CompletionOutput(index=0, text='\n\n1. The user will receive a list of instructions for preparing chicken soup for a family.\n2. The user will receive a list of instructions for preparing chicken soup for a family.\n3. The user will receive a list of instructions for preparing chicken soup for a family.\n', token_ids=[203, 203, 35, 32, 886, 1256, 1098, 7768, 312, 1149, 432, 9400, 436, 1406, 26124, 663, 21217, 31628, 436, 312, 13872, 32, 203, 36, 32, 886, 1256, 1098, 7768, 312, 1149, 432, 9400, 436, 1406, 26124, 663, 21217, 31628, 436, 312, 13872, 32, 203, 37, 32, 886, 1256, 1098, 7768, 312, 1149, 432, 9400, 436, 1406, 26124, 663, 21217, 31628, 436, 312, 13872, 32, 203], routed_experts=None, cumulative_logprob=None, logprobs=None, finish_reason=length, stop_reason=None)
CompletionOutput(index=0, text='\n\nKaneki Ken is a human.\n\n### Instruction:\n\nDescribe what it', token_ids=[203, 203, 61, 2600, 7319, 48487, 438, 312, 13462, 32, 203, 203, 1482, 21081, 44, 203, 203, 8591, 2769, 561], routed_experts=None, cumulative_logprob=None, logprobs=None, finish_reason=length, stop_reason=None)
===============

Prompt:
 'Below is an instruction that describes a task. Write a response that appropriately completes the request. Be polite in your response to the user.\n\n### Instruction:\nProvide instructions for preparing chicken soup.\n\n### Response:'

Generated text:
 '\n\nThe response is a 2-3 page document that describes the task.\n\n###'

-----------------------------------

Prompt:
 'Below is an instruction that describes a task. Write a response that appropriately completes the request. Be polite in your response to the user.\n\n### Instruction:\nProvide a list of instructions for preparing chicken soup for a family.\n\n### Response:'

Generated text:
 '\n\n1. The user will receive a list of instructions for preparing chicken soup for a family.\n2. The user will receive a list of instructions for preparing chicken soup for a family.\n3. The user will receive a list of instructions for preparing chicken soup for a family.\n'

-----------------------------------

Prompt:
 "Below is an instruction that describes a task. Write a response that appropriately completes the request. Be polite in your response to the user.\n\n### Instruction:\nYou are Kaneki Ken from 'Tokyo Ghoul.' Describe what it feels like to be both human and ghoul to someone unfamiliar with your world.\n\n### Response:"

Generated text:
 '\n\nKaneki Ken is a human.\n\n### Instruction:\n\nDescribe what it'

-----------------------------------
(EngineCore pid=53257) INFO 03-31 17:59:52 [core.py:1210] Shutdown initiated (timeout=0)
(EngineCore pid=53257) INFO 03-31 17:59:52 [core.py:1233] Shutdown complete

vllm_spyre_next/examples/Offline_demo.py

(rehankhan) [rehankhan@rehankhan-spyre-dev-pf vllm_spyre_next]$ python examples/test.py 
INFO 03-31 18:05:01 [__init__.py:44] Available plugins for group vllm.platform_plugins:
INFO 03-31 18:05:01 [__init__.py:46] - spyre_next -> vllm_spyre_next:register
INFO 03-31 18:05:01 [__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 03-31 18:05:01 [__init__.py:239] Platform plugin spyre_next is activated
INFO 03-31 18:05:03 [importing.py:44] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
INFO 03-31 18:05:03 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 03-31 18:05:10 [utils.py:233] non-default args: {'enable_prefix_caching': True, 'attention_config': AttentionConfig(backend=<AttentionBackendEnum.CUSTOM: None>, flash_attn_version=None, use_prefill_decode_attention=False, flash_attn_max_num_splits_for_cuda_graph=32, use_cudnn_prefill=False, use_trtllm_ragged_deepseek_prefill=False, use_trtllm_attention=None, disable_flashinfer_prefill=True, disable_flashinfer_q_quantization=False, use_prefill_query_quantization=False), 'model': 'ibm-granite/granite-3.3-8b-instruct'}
INFO 03-31 18:05:10 [model.py:540] Resolved architecture: GraniteForCausalLM
INFO 03-31 18:05:10 [model.py:1607] Using max model len 131072
WARNING 03-31 18:05:10 [cpu.py:136] VLLM_CPU_KVCACHE_SPACE not set. Using 251.88 GiB for KV cache.
INFO 03-31 18:05:11 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=16384.
INFO 03-31 18:05:11 [vllm.py:750] Asynchronous scheduling is enabled.
INFO 03-31 18:05:11 [platform.py:74] 
INFO 03-31 18:05:11 [platform.py:74]        █     █     █▄   ▄█       ▄█▀▀█▄  █▀▀▀█▄  █   █  █▀▀▀█▄  █▀▀▀▀
INFO 03-31 18:05:11 [platform.py:74]  ▄▄ ▄█ █     █     █ ▀▄▀ █       ▀▀▄▄▄   █▄▄▄█▀  ▀▄ ▄▀  █▄▄▄█▀  █▄▄▄   version 0.1.dev536
INFO 03-31 18:05:11 [platform.py:74]   █▄█▀ █     █     █     █            █  █        ▀█▀   █ ▀█▄   █      model   ibm-granite/granite-3.3-8b-instruct
INFO 03-31 18:05:11 [platform.py:74]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀       ▀▄▄▄█▀  █         █    █   ▀█  █▄▄▄▄
INFO 03-31 18:05:11 [platform.py:74] 
INFO 03-31 18:05:11 [platform.py:88] Loading worker from: vllm_spyre_next.v1.worker.spyre_worker.TorchSpyreWorker
INFO 03-31 18:05:11 [platform.py:103] Loading scheduler from: vllm.v1.core.sched.scheduler.Scheduler
INFO 03-31 18:05:18 [__init__.py:44] Available plugins for group vllm.platform_plugins:
INFO 03-31 18:05:18 [__init__.py:46] - spyre_next -> vllm_spyre_next:register
INFO 03-31 18:05:18 [__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 03-31 18:05:18 [__init__.py:239] Platform plugin spyre_next is activated
INFO 03-31 18:05:18 [importing.py:44] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
INFO 03-31 18:05:18 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
(EngineCore pid=53786) INFO 03-31 18:05:21 [core.py:105] Initializing a V1 LLM engine (v0.18.1rc1.dev53+gffb5b32b5.d20260324) with config: model='ibm-granite/granite-3.3-8b-instruct', speculative_config=None, tokenizer='ibm-granite/granite-3.3-8b-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cpu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=ibm-granite/granite-3.3-8b-instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.DYNAMO_TRACE_ONCE: 2>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': None, 'compile_ranges_endpoints': [16384], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True, 'dce': True, 
'size_asserts': False, 'nan_asserts': False, 'epilogue_fusion': True, 'cpp.dynamic_threads': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': None, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=53786) INFO 03-31 18:05:23 [__init__.py:13] Registering custom ops for spyre_next
(EngineCore pid=53786) INFO 03-31 18:05:23 [rms_norm.py:236] Registered custom op: SpyreRMSNorm
(EngineCore pid=53786) INFO 03-31 18:05:23 [silu_and_mul.py:169] Registered custom op: SpyreSiluAndMul
(EngineCore pid=53786) INFO 03-31 18:05:23 [linear.py:157] Registered custom op: spyre_merged_col_linear
(EngineCore pid=53786) INFO 03-31 18:05:23 [linear.py:157] Registered custom op: spyre_row_parallel_linear
(EngineCore pid=53786) INFO 03-31 18:05:23 [rotary_embedding.py:305] Registered custom op: SpyreRotaryEmbedding
(EngineCore pid=53786) WARNING 03-31 18:05:23 [cpu_worker.py:60] libtcmalloc is not found in LD_PRELOAD. For best performance, please follow the section `set LD_PRELOAD` in https://docs.vllm.ai/en/latest/getting_started/installation/cpu/ to setup required pre-loaded libraries.
(EngineCore pid=53786) WARNING 03-31 18:05:23 [cpu_worker.py:60] libiomp is not found in LD_PRELOAD. For best performance, please follow the section `set LD_PRELOAD` in https://docs.vllm.ai/en/latest/getting_started/installation/cpu/ to setup required pre-loaded libraries.
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:227] auto thread-binding list (id, physical core): [(96, 0), (97, 1), (98, 2), (99, 3), (100, 4), (101, 5), (102, 6), (103, 7), (104, 8), (105, 9), (106, 10), (107, 11), (108, 12), (109, 13), (110, 14), (111, 15), (112, 16), (113, 17), (114, 18), (115, 19), (116, 20), (117, 21), (118, 22), (119, 23), (120, 24), (121, 25), (122, 26), (123, 27), (124, 28), (125, 29), (126, 30), (127, 31), (128, 32), (129, 33), (130, 34), (131, 35), (132, 36), (133, 37), (134, 38), (135, 39), (136, 40), (137, 41), (138, 42), (139, 43), (140, 44), (141, 45), (142, 46), (143, 47)]
[W331 18:05:23.797197746 utils.cpp:76] Warning: numa_migrate_pages failed. errno: 1 (function init_cpu_threads_env)
[W331 18:05:23.797213401 utils.cpp:103] Warning: NUMA binding: Using MEMBIND policy for memory allocation on the NUMA nodes (0). Memory allocations will be strictly bound to these NUMA nodes. (function init_cpu_threads_env)
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] OMP threads binding of Process 53786:
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 53786, core 96
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 53998, core 97
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 53999, core 98
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54000, core 99
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54001, core 100
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54002, core 101
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54003, core 102
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54004, core 103
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54005, core 104
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54006, core 105
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54007, core 106
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54008, core 107
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54009, core 108
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54010, core 109
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54011, core 110
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54012, core 111
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54013, core 112
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54014, core 113
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54015, core 114
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54016, core 115
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54017, core 116
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54018, core 117
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54019, core 118
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54020, core 119
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54021, core 120
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54022, core 121
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54023, core 122
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54024, core 123
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54025, core 124
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54026, core 125
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54027, core 126
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54028, core 127
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54029, core 128
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54030, core 129
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54031, core 130
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54032, core 131
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54033, core 132
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54034, core 133
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54035, core 134
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54036, core 135
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54037, core 136
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54038, core 137
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54039, core 138
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54040, core 139
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54041, core 140
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54042, core 141
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54043, core 142
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54044, core 143
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 
(EngineCore pid=53786) INFO 03-31 18:05:23 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.129.9.130:42787 backend=gloo
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore pid=53786) INFO 03-31 18:05:23 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_model_runner.py:62] Starting to load model ibm-granite/granite-3.3-8b-instruct...
(EngineCore pid=53786) WARNING 03-31 18:05:23 [linear.py:60] SpyreRowParallelLinear: no dtype promotion (torch-spyre limitation),expect numerical differences to upstream vLLM.
(EngineCore pid=53786) WARNING 03-31 18:05:23 [rotary_embedding.py:89] SpyreRotaryEmbedding: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu.py:112] Cannot use AttentionBackendEnum.CUSTOM backend on CPU.
(EngineCore pid=53786) WARNING 03-31 18:05:23 [linear.py:60] SpyreMergedColumnParallelLinear: no dtype promotion (torch-spyre limitation),expect numerical differences to upstream vLLM.
(EngineCore pid=53786) WARNING 03-31 18:05:23 [rms_norm.py:75] SpyreRMSNorm: no dtype promotion is performed, expect numerical differences to upstream vLLM.
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:00,  7.10it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:00<00:00,  5.45it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:00<00:00,  5.18it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:00<00:00,  6.18it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:00<00:00,  5.95it/s]
(EngineCore pid=53786) 
(EngineCore pid=53786) INFO 03-31 18:05:26 [default_loader.py:384] Loading weights took 0.69 seconds
(EngineCore pid=53786) INFO 03-31 18:05:26 [kv_cache_utils.py:1319] GPU KV cache size: 1,650,688 tokens
(EngineCore pid=53786) INFO 03-31 18:05:26 [kv_cache_utils.py:1324] Maximum concurrency for 131,072 tokens per request: 12.59x
(EngineCore pid=53786) INFO 03-31 18:05:29 [cpu_model_runner.py:73] Warming up model for the compilation...
(EngineCore pid=53786) INFO 03-31 18:06:35 [decorators.py:638] saved AOT compiled function to /home/rehankhan/.cache/vllm/torch_compile_cache/torch_aot_compile/90015b427fdcee783eafbdf0a8b1043f9709aae600b70b2fe774c2104edbe0a1/rank_0_0/model
(EngineCore pid=53786) INFO 03-31 18:06:36 [monitor.py:76] Initial profiling/warmup run took 1.40 s
(EngineCore pid=53786) INFO 03-31 18:06:36 [cpu_model_runner.py:83] Warming up done.
(EngineCore pid=53786) INFO 03-31 18:06:36 [core.py:283] init engine (profile, create kv cache, warmup model) took 70.62 seconds
(EngineCore pid=53786) WARNING 03-31 18:06:37 [scheduler.py:173] Using custom scheduler class vllm.v1.core.sched.scheduler.Scheduler. This scheduler interface is not public and compatibility may not be maintained.
(EngineCore pid=53786) INFO 03-31 18:06:38 [vllm.py:750] Asynchronous scheduling is disabled.
(EngineCore pid=53786) WARNING 03-31 18:06:38 [vllm.py:806] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore pid=53786) INFO 03-31 18:06:38 [platform.py:103] Loading scheduler from: vllm.v1.core.sched.scheduler.Scheduler
(EngineCore pid=53786) WARNING 03-31 18:06:38 [cpu.py:136] VLLM_CPU_KVCACHE_SPACE not set. Using 251.88 GiB for KV cache.
INFO 03-31 18:06:38 [llm.py:391] Supported tasks: ['generate']
Rendering prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 14.33it/s]
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.55it/s, est. speed input: 12.43 toks/s, output: 7.77 toks/s]
--------------------------------------------------
Generated text: '\n\nIBM operates'
--------------------------------------------------
vllm:kv_cache_usage_perc 0.0
vllm:prefix_cache_queries 8
vllm:prefix_cache_hits 0
vllm:external_prefix_cache_queries 0
vllm:external_prefix_cache_hits 0
vllm:mm_cache_queries 0
vllm:mm_cache_hits 0
vllm:prompt_tokens_cached 0
vllm:cache_config_info 1.0
(EngineCore pid=53786) INFO 03-31 18:06:39 [core.py:1210] Shutdown initiated (timeout=0)
(EngineCore pid=53786) INFO 03-31 18:06:39 [core.py:1233] Shutdown complete

Checklist

  • [x] I have read the contributing guidelines
  • [ ] My code follows the project's code style (run bash format.sh)
  • [ ] I have added tests for my changes (if applicable)
  • [ ] I have updated the documentation (if applicable)
  • [x] My commits include a Signed-off-by: line (DCO compliance)

@github-actions

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: Make sure that your code passes all the linting checks, otherwise your PR won't be able to be merged. To do so, run ./format.sh.
Now you are good to go 🚀.

We also recommend installing prek and configuring it to check your code before every local commit.

Collaborator

@bohnstingl bohnstingl left a comment

Thank you @dilipgb. I made a first pass through the PR and left some comments.

In general, I think you need to merge in the latest main branch, which would remove the changes from pyproject.toml and uv.lock. At least the changes there are not related and should be removed.

Also, could you please adopt the new call chain from #872?

Comment on lines +21 to +23
- No dtype promotion (torch-spyre limitation)
- rope_scaling not yet implemented
- Expect numerical differences from upstream vLLM
Collaborator

Ad 1) Where is the dtype promotion happening upstream?
I see that there is enable_fp32_compute, but I believe for Granite it is False?

Ad 2) Can we have an assert / raise Exception to ensure that this code path is only reached when scaling_type == "default", "mrope_section" not in rope_parameters, and "use_fope" not in rope_parameters or not rope_parameters["use_fope"]?

Ad 3) Is the dtype promotion the source of the numerical differences, or is there anything apart from that?
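
Such a guard could look roughly like the following sketch. The names (`rope_parameters`, `rope_type`, `mrope_section`, `use_fope`) are taken from this discussion and are not verified against the actual code; this is illustrative, not the implementation:

```python
def check_supported_rope_config(rope_parameters: dict) -> None:
    """Reject RoPE configurations the Spyre path does not implement yet.

    Illustrative guard based on the conditions discussed above; the real
    parameter names and defaults may differ.
    """
    scaling_type = rope_parameters.get("rope_type", "default")
    if scaling_type != "default":
        raise NotImplementedError(
            f"SpyreRotaryEmbedding: rope scaling '{scaling_type}' is not supported")
    if "mrope_section" in rope_parameters:
        raise NotImplementedError(
            "SpyreRotaryEmbedding: mrope_section is not supported")
    if rope_parameters.get("use_fope"):
        raise NotImplementedError(
            "SpyreRotaryEmbedding: use_fope is not supported")
```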

Collaborator Author

  1. dtype promotion upstream is optional and only enabled when enable_fp32_compute is set. We could add another condition to check for this? But Granite's dtype is fp16 and we use it without upcasting; that is the thought process here.
  2. addressed.
  3. Yes, since trig functions and other intermediate operations in upstream vLLM are upcast for better precision.

Comment thread vllm_spyre_next/vllm_spyre_next/custom_ops/rotary_embedding.py Outdated
Comment thread vllm_spyre_next/vllm_spyre_next/custom_ops/rotary_embedding.py Outdated
Comment on lines +84 to +101
# Use float16 directly - no dynamic dimensions (Spyre constraint)
compute_dtype = torch.float16

# Compute inverse frequencies: base^(-2i/rotary_dim)
# Using negative exponent for numerical stability
exponents = -torch.arange(0, self.rotary_dim, 2, dtype=compute_dtype) / self.rotary_dim
inv_freq = torch.pow(self.base, exponents)

# Create position indices [0, 1, 2, ..., max_position_embeddings-1]
t = torch.arange(self.max_position_embeddings, dtype=compute_dtype)

# Compute frequencies for each position: pos * inv_freq
# Shape: [max_position_embeddings, rotary_dim // 2]
freqs = torch.outer(t, inv_freq)

# Duplicate frequencies for interleaved pattern
# Shape: [max_position_embeddings, rotary_dim]
emb = torch.cat([freqs, freqs], dim=-1)
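For reference, the cache construction quoted above can be checked with a few lines of NumPy (a stand-in for the torch code; `rotary_dim`, `base`, and `max_pos` here are made-up toy sizes, not the model's):

```python
import numpy as np

rotary_dim, base, max_pos = 8, 10000.0, 16  # toy sizes for illustration

# inv_freq[i] = base^(-2i/rotary_dim), matching the diff above
exponents = -np.arange(0, rotary_dim, 2, dtype=np.float32) / rotary_dim
inv_freq = np.power(base, exponents)           # shape: [rotary_dim // 2]

t = np.arange(max_pos, dtype=np.float32)       # positions 0..max_pos-1
freqs = np.outer(t, inv_freq)                  # [max_pos, rotary_dim // 2]
emb = np.concatenate([freqs, freqs], axis=-1)  # [max_pos, rotary_dim]

cos_cache, sin_cache = np.cos(emb), np.sin(emb)
```

At position 0 every angle is 0, so cos is all ones and sin all zeros, which is a quick sanity check on the layout.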
Collaborator

Can we make a comment that these ops are currently happening on CPU?

torch-spyre has had some more ops added lately and torch.cat should now work on spyre. So we might want to try and convert some of these operations to be happening on spyre.

Collaborator Author

I tried torch.arange and torch.outer, which are not yet implemented on Spyre. Although torch.cat is implemented on Spyre, the emb calculation would still fall back to CPU, so we would have to move data back and forth between CPU and card multiple times just to run torch.cat on the device.

Collaborator

torch.outer may indeed not be supported. However, torch.cat and torch.arange should be.

Although torch.cat and torch.arange might have CPU fallbacks, I think we should still try to use them with the spyre device, because once those operations are supported through torch-spyre, they will just work in vllm-spyre.

Comment thread vllm_spyre_next/vllm_spyre_next/custom_ops/rotary_embedding.py Outdated
Comment thread vllm_spyre_next/vllm_spyre_next/custom_ops/rotary_embedding.py Outdated
Comment thread vllm_spyre_next/vllm_spyre_next/custom_ops/rotary_embedding.py Outdated
Comment on lines +194 to +197
query_rot = query[..., :rotary_dim]
query_pass = query[..., rotary_dim:]
key_rot = key[..., :rotary_dim]
key_pass = key[..., rotary_dim:]
Collaborator

Same here, tensor slicing shouldn't currently work with the tensors on spyre? Can you confirm that the tensors are indeed on spyre?

# Retrieve cos/sin for the given positions
# positions shape: [batch_size, seq_len] or [total_tokens]
cos = cos_cache[positions] # [..., rotary_dim]
sin = sin_cache[positions] # [..., rotary_dim]
Collaborator

I am surprised this is actually working when cos_cache and sin_cache are on spyre?

Collaborator

@dilipgb I am still puzzled by how this can work for you. I tested it locally and the tensor slicing fails, as this is not yet supported in torch-spyre. At least not in eager mode. Can you confirm that?

Therefore, I think we should restructure this function a bit overall: do the slicing in _forward_spyre_impl on CPU, pass in the two halves, apply RoPE to each individually, return them, and then recombine them in _forward_spyre_impl on CPU.
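
The proposed restructuring can be sketched roughly as follows. This is a NumPy stand-in for the torch code; `_forward_spyre_impl` and the half-splitting convention come from the discussion above, everything else (names, shapes) is illustrative:

```python
import numpy as np

def rotate_half(x):
    # [-x2, x1] over the two halves of the last dimension
    x1, x2 = np.split(x, 2, axis=-1)
    return np.concatenate([-x2, x1], axis=-1)

def apply_rope(x_rot, cos, sin):
    # Pure elementwise math: this part is a candidate to run on Spyre
    return x_rot * cos + rotate_half(x_rot) * sin

def forward_impl(query, cos_cache, sin_cache, positions, rotary_dim):
    # Slicing and indexing stay on CPU (not yet supported on Spyre)
    q_rot, q_pass = query[..., :rotary_dim], query[..., rotary_dim:]
    cos, sin = cos_cache[positions], sin_cache[positions]
    # The elementwise rotation could be dispatched to the device
    q_rot = apply_rope(q_rot, cos, sin)
    # Recombine on CPU
    return np.concatenate([q_rot, q_pass], axis=-1)
```

With cos = 1 and sin = 0 the rotation is the identity, which makes the round trip easy to sanity-check.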

dilipgb and others added 6 commits April 3, 2026 12:14
Signed-off-by: Dilip Gowda Bhagavan <dilip.bhagavan@ibm.com>

This PR bumps the lower bound of foundation-model-stack dependency from
1.7.0 to 1.8.0 which includes Llama bug fixes for torch 2.10.


---------

Signed-off-by: Daniel Schenker <daniel.schenker@ibm.com>
Signed-off-by: Dilip Gowda Bhagavan <dilip.bhagavan@ibm.com>
Signed-off-by: Dilip Gowda Bhagavan <dilip.bhagavan@ibm.com>
Signed-off-by: Dilip Gowda Bhagavan <dilip.bhagavan@ibm.com>
Signed-off-by: Dilip Gowda Bhagavan <dilip.bhagavan@ibm.com>
Signed-off-by: Dilip Gowda Bhagavan <dilip.bhagavan@ibm.com>
Collaborator

@bohnstingl bohnstingl left a comment

@dilipgb Can you please have a look at my comments and also merge in the latest main?

Comment on lines +252 to +255
# Transfer cos/sin cache to Spyre device if not already there
# if self.cos_cache.device != self._target_device:
# self.cos_cache = convert(self.cos_cache, self._target_device, self._target_dtype)
# self.sin_cache = convert(self.sin_cache, self._target_device, self._target_dtype)
Collaborator

Can we remove that?

Rotated tensor [..., rotary_dim]
"""
x1, x2 = x.chunk(2, dim=-1)
return torch.cat([-x2, x1], dim=-1)
Collaborator

I think we can rework this to be supported on spyre. In particular,


Signed-off-by: Dilip Gowda Bhagavan <dilip.bhagavan@ibm.com>
Signed-off-by: Dilip Gowda Bhagavan <110233170+dilipgb@users.noreply.github.com>
Collaborator

@bohnstingl bohnstingl left a comment

Some minor details so that I don't forget them.
As discussed offline, it would be good to run as many operations as possible on Spyre; for example, torch.cat should be supported now.
Indexing operations, such as slicing, still need to stay on CPU for the moment.

Tq, q_hidden = query.shape
Tk, k_hidden = key.shape

assert Tq == Tk, f"Query/Key sequence mismatch: {Tq} != {Tk}"
Collaborator

# Compile the forward kernel
self.maybe_compiled_forward_spyre = self.maybe_compile(self.forward_spyre)
self._layer_name = register_layer(self, "spyre_rotary_embedding")

Collaborator

We recently introduced additional logging. Please include something like:

logger.debug_once(
    "SpyreRotaryEmbedding: Dispatch: enabled=%s, Forward method=%s, Compiled=%s",
    self.enabled(),
    self._forward_method.__name__,
    self.maybe_compiled_forward_spyre is not self.forward_spyre,
)

Comment on lines +262 to +263
assert cos_q.shape == query_rot.shape, f"{cos_q.shape} != {query.shape}"
assert sin_q.shape == query_rot.shape
Collaborator

Can we have more descriptive error messages here?


Development

Successfully merging this pull request may close these issues.

[Feature]: Wrap the RoPE op

3 participants