
[Spyre-Next] [Feature] Wrap RoPE layer on Spyre#881

Draft
dilipgb wants to merge 9 commits into torch-spyre:main from dilipgb:main

Conversation


@dilipgb dilipgb commented Mar 31, 2026

Description

Adds SpyreRotaryEmbedding, a Spyre-optimized out-of-tree (OOT) replacement for vLLM's RotaryEmbedding, following the custom op pattern from #842 (same as SpyreRMSNorm and SpyreSiluAndMul).
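For reviewers unfamiliar with the op being wrapped, here is a minimal numpy sketch of what a rotary embedding computes, using the NeoX-style half-split pair layout (this is an illustrative reference, not the Spyre implementation; the function name and layout choice are assumptions):

```python
import numpy as np

def rope_half_split(x, pos, base=10000.0):
    """Apply rotary position embedding, NeoX-style half-split layout.

    x:   (..., d) query/key vector with even head dim d
    pos: integer token position
    """
    d = x.shape[-1]
    half = d // 2
    # Per-pair rotation frequency: theta_i = base^(-2i/d)
    inv_freq = base ** (-np.arange(half) * 2.0 / d)
    angles = pos * inv_freq
    cos, sin = np.cos(angles), np.sin(angles)
    # Pair element i with element i + half and rotate each pair by pos * theta_i
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```

The defining property (and the useful invariant for any replacement op) is that the rotation is norm-preserving and q·k after rotation depends only on the relative distance between the two positions.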

Related Issues

Fixes #820

Test Plan

Ran a couple of tests to confirm it works on Spyre. Also looking into adapting the upstream rotary embedding tests; that work is in progress.
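Since the log below warns that SpyreRotaryEmbedding performs no dtype promotion ("expect numerical differences to upstream vLLM"), an upstream-style parity test would compare the reduced-precision path against an fp32 reference within a tolerance rather than exactly. A rough sketch of that check, with numpy float16 standing in for the device's low-precision path (all names and tolerances here are hypothetical, not the actual test):

```python
import numpy as np

def rope_ref(x, pos, base=10000.0):
    # fp32/fp64 reference: NeoX-style half-split rotary embedding
    d = x.shape[-1]
    half = d // 2
    inv_freq = base ** (-np.arange(half) * 2.0 / d)
    cos, sin = np.cos(pos * inv_freq), np.sin(pos * inv_freq)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def rope_lowp(x, pos, base=10000.0):
    # Same math with every intermediate held in float16,
    # modeling a path that never promotes to fp32.
    d = x.shape[-1]
    half = d // 2
    inv_freq = (base ** (-np.arange(half) * 2.0 / d)).astype(np.float16)
    ang = np.float16(pos) * inv_freq
    cos, sin = np.cos(ang), np.sin(ang)
    x1 = x[..., :half].astype(np.float16)
    x2 = x[..., half:].astype(np.float16)
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```

The tolerance has to absorb both the rounding of the angles and of the inputs, which is why exact equality against upstream is not a reasonable assertion here.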

examples/torch_spyre_inference.py

python examples/torch_spyre_inference.py 
INFO 03-31 17:58:31 [__init__.py:44] Available plugins for group vllm.platform_plugins:
INFO 03-31 17:58:31 [__init__.py:46] - spyre_next -> vllm_spyre_next:register
INFO 03-31 17:58:31 [__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 03-31 17:58:31 [__init__.py:239] Platform plugin spyre_next is activated
INFO 03-31 17:58:32 [importing.py:44] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
INFO 03-31 17:58:32 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 03-31 17:58:39 [utils.py:233] non-default args: {'tokenizer': 'ibm-ai-platform/micro-g3.3-8b-instruct-1b', 'max_model_len': 2048, 'max_num_batched_tokens': 1024, 'max_num_seqs': 2, 'disable_log_stats': True, 'model': 'ibm-ai-platform/micro-g3.3-8b-instruct-1b'}
INFO 03-31 17:58:39 [model.py:540] Resolved architecture: GraniteForCausalLM
INFO 03-31 17:58:39 [model.py:1607] Using max model len 2048
WARNING 03-31 17:58:39 [cpu.py:136] VLLM_CPU_KVCACHE_SPACE not set. Using 251.88 GiB for KV cache.
INFO 03-31 17:58:39 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=1024.
INFO 03-31 17:58:39 [vllm.py:750] Asynchronous scheduling is enabled.
INFO 03-31 17:58:39 [platform.py:74] 
INFO 03-31 17:58:39 [platform.py:74]        █     █     █▄   ▄█       ▄█▀▀█▄  █▀▀▀█▄  █   █  █▀▀▀█▄  █▀▀▀▀
INFO 03-31 17:58:39 [platform.py:74]  ▄▄ ▄█ █     █     █ ▀▄▀ █       ▀▀▄▄▄   █▄▄▄█▀  ▀▄ ▄▀  █▄▄▄█▀  █▄▄▄   version 0.1.dev536
INFO 03-31 17:58:39 [platform.py:74]   █▄█▀ █     █     █     █            █  █        ▀█▀   █ ▀█▄   █      model   ibm-ai-platform/micro-g3.3-8b-instruct-1b
INFO 03-31 17:58:39 [platform.py:74]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀       ▀▄▄▄█▀  █         █    █   ▀█  █▄▄▄▄
INFO 03-31 17:58:39 [platform.py:74] 
INFO 03-31 17:58:39 [platform.py:88] Loading worker from: vllm_spyre_next.v1.worker.spyre_worker.TorchSpyreWorker
INFO 03-31 17:58:39 [platform.py:103] Loading scheduler from: vllm.v1.core.sched.scheduler.Scheduler
INFO 03-31 17:58:50 [__init__.py:44] Available plugins for group vllm.platform_plugins:
INFO 03-31 17:58:50 [__init__.py:46] - spyre_next -> vllm_spyre_next:register
INFO 03-31 17:58:50 [__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 03-31 17:58:50 [__init__.py:239] Platform plugin spyre_next is activated
INFO 03-31 17:58:50 [importing.py:44] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
INFO 03-31 17:58:50 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
(EngineCore pid=53257) INFO 03-31 17:58:53 [core.py:105] Initializing a V1 LLM engine (v0.18.1rc1.dev53+gffb5b32b5.d20260324) with config: model='ibm-ai-platform/micro-g3.3-8b-instruct-1b', speculative_config=None, tokenizer='ibm-ai-platform/micro-g3.3-8b-instruct-1b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cpu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=ibm-ai-platform/micro-g3.3-8b-instruct-1b, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.DYNAMO_TRACE_ONCE: 2>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': None, 'compile_ranges_endpoints': [1024], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True, 'dce': True, 
'size_asserts': False, 'nan_asserts': False, 'epilogue_fusion': True, 'cpp.dynamic_threads': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': None, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=53257) INFO 03-31 17:58:55 [__init__.py:13] Registering custom ops for spyre_next
(EngineCore pid=53257) INFO 03-31 17:58:55 [rms_norm.py:236] Registered custom op: SpyreRMSNorm
(EngineCore pid=53257) INFO 03-31 17:58:55 [silu_and_mul.py:169] Registered custom op: SpyreSiluAndMul
(EngineCore pid=53257) INFO 03-31 17:58:55 [linear.py:157] Registered custom op: spyre_merged_col_linear
(EngineCore pid=53257) INFO 03-31 17:58:55 [linear.py:157] Registered custom op: spyre_row_parallel_linear
(EngineCore pid=53257) INFO 03-31 17:58:55 [rotary_embedding.py:305] Registered custom op: SpyreRotaryEmbedding
(EngineCore pid=53257) WARNING 03-31 17:58:55 [cpu_worker.py:60] libtcmalloc is not found in LD_PRELOAD. For best performance, please follow the section `set LD_PRELOAD` in https://docs.vllm.ai/en/latest/getting_started/installation/cpu/ to setup required pre-loaded libraries.
(EngineCore pid=53257) WARNING 03-31 17:58:55 [cpu_worker.py:60] libiomp is not found in LD_PRELOAD. For best performance, please follow the section `set LD_PRELOAD` in https://docs.vllm.ai/en/latest/getting_started/installation/cpu/ to setup required pre-loaded libraries.
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:227] auto thread-binding list (id, physical core): [(96, 0), (97, 1), (98, 2), (99, 3), (100, 4), (101, 5), (102, 6), (103, 7), (104, 8), (105, 9), (106, 10), (107, 11), (108, 12), (109, 13), (110, 14), (111, 15), (112, 16), (113, 17), (114, 18), (115, 19), (116, 20), (117, 21), (118, 22), (119, 23), (120, 24), (121, 25), (122, 26), (123, 27), (124, 28), (125, 29), (126, 30), (127, 31), (128, 32), (129, 33), (130, 34), (131, 35), (132, 36), (133, 37), (134, 38), (135, 39), (136, 40), (137, 41), (138, 42), (139, 43), (140, 44), (141, 45), (142, 46), (143, 47)]
[W331 17:58:55.613566406 utils.cpp:76] Warning: numa_migrate_pages failed. errno: 1 (function init_cpu_threads_env)
[W331 17:58:55.613579509 utils.cpp:103] Warning: NUMA binding: Using MEMBIND policy for memory allocation on the NUMA nodes (0). Memory allocations will be strictly bound to these NUMA nodes. (function init_cpu_threads_env)
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] OMP threads binding of Process 53257:
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53257, core 96
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53469, core 97
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53470, core 98
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53471, core 99
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53472, core 100
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53473, core 101
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53474, core 102
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53475, core 103
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53476, core 104
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53477, core 105
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53478, core 106
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53479, core 107
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53480, core 108
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53481, core 109
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53482, core 110
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53483, core 111
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53484, core 112
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53485, core 113
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53486, core 114
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53487, core 115
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53488, core 116
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53489, core 117
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53490, core 118
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53491, core 119
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53492, core 120
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53493, core 121
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53494, core 122
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53495, core 123
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53496, core 124
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53497, core 125
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53498, core 126
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53499, core 127
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53500, core 128
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53501, core 129
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53502, core 130
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53503, core 131
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53504, core 132
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53505, core 133
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53506, core 134
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53507, core 135
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53508, core 136
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53509, core 137
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53510, core 138
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53511, core 139
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53512, core 140
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53513, core 141
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53514, core 142
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 	OMP tid: 53515, core 143
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_worker.py:109] 
(EngineCore pid=53257) INFO 03-31 17:58:55 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.129.9.130:56013 backend=gloo
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore pid=53257) INFO 03-31 17:58:55 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=53257) INFO 03-31 17:58:55 [cpu_model_runner.py:62] Starting to load model ibm-ai-platform/micro-g3.3-8b-instruct-1b...
(EngineCore pid=53257) WARNING 03-31 17:58:55 [linear.py:60] SpyreRowParallelLinear: no dtype promotion (torch-spyre limitation),expect numerical differences to upstream vLLM.
(EngineCore pid=53257) WARNING 03-31 17:58:55 [rotary_embedding.py:89] SpyreRotaryEmbedding: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=53257) WARNING 03-31 17:58:55 [linear.py:60] SpyreMergedColumnParallelLinear: no dtype promotion (torch-spyre limitation),expect numerical differences to upstream vLLM.
(EngineCore pid=53257) WARNING 03-31 17:58:55 [rms_norm.py:75] SpyreRMSNorm: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=53257) WARNING 03-31 17:58:55 [rms_norm.py:75] SpyreRMSNorm: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=53257) WARNING 03-31 17:58:55 [rms_norm.py:75] SpyreRMSNorm: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=53257) WARNING 03-31 17:58:55 [rms_norm.py:75] SpyreRMSNorm: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=53257) WARNING 03-31 17:58:55 [rms_norm.py:75] SpyreRMSNorm: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=53257) WARNING 03-31 17:58:55 [rms_norm.py:75] SpyreRMSNorm: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=53257) WARNING 03-31 17:58:55 [rms_norm.py:75] SpyreRMSNorm: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=53257) WARNING 03-31 17:58:55 [rms_norm.py:75] SpyreRMSNorm: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=53257) WARNING 03-31 17:58:55 [rms_norm.py:75] SpyreRMSNorm: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=53257) INFO 03-31 17:58:55 [weight_utils.py:618] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 14.90it/s]
(EngineCore pid=53257) 
(EngineCore pid=53257) INFO 03-31 17:58:55 [default_loader.py:384] Loading weights took 0.13 seconds
(EngineCore pid=53257) INFO 03-31 17:58:56 [kv_cache_utils.py:1319] GPU KV cache size: 16,507,392 tokens
(EngineCore pid=53257) INFO 03-31 17:58:56 [kv_cache_utils.py:1324] Maximum concurrency for 2,048 tokens per request: 8060.25x
(EngineCore pid=53257) INFO 03-31 17:58:59 [cpu_model_runner.py:73] Warming up model for the compilation...
(EngineCore pid=53257) WARNING 03-31 17:59:41 [decorators.py:311] Compiling model again due to a load failure from /home/rehankhan/.cache/vllm/torch_compile_cache/torch_aot_compile/582fa9fdd760fed36d5e8fe8543c1366cc78037643cc9a0fd374f222ca452ed8/rank_0_0/model, reason: Source code has changed since the last compilation. Recompiling the model.
(EngineCore pid=53257) INFO 03-31 17:59:50 [decorators.py:638] saved AOT compiled function to /home/rehankhan/.cache/vllm/torch_compile_cache/torch_aot_compile/582fa9fdd760fed36d5e8fe8543c1366cc78037643cc9a0fd374f222ca452ed8/rank_0_0/model
(EngineCore pid=53257) INFO 03-31 17:59:50 [monitor.py:76] Initial profiling/warmup run took 0.03 s
(EngineCore pid=53257) INFO 03-31 17:59:50 [cpu_model_runner.py:83] Warming up done.
(EngineCore pid=53257) INFO 03-31 17:59:50 [core.py:283] init engine (profile, create kv cache, warmup model) took 54.55 seconds
(EngineCore pid=53257) WARNING 03-31 17:59:51 [scheduler.py:173] Using custom scheduler class vllm.v1.core.sched.scheduler.Scheduler. This scheduler interface is not public and compatibility may not be maintained.
(EngineCore pid=53257) INFO 03-31 17:59:51 [vllm.py:750] Asynchronous scheduling is disabled.
(EngineCore pid=53257) WARNING 03-31 17:59:51 [vllm.py:806] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore pid=53257) INFO 03-31 17:59:51 [platform.py:103] Loading scheduler from: vllm.v1.core.sched.scheduler.Scheduler
(EngineCore pid=53257) WARNING 03-31 17:59:51 [cpu.py:136] VLLM_CPU_KVCACHE_SPACE not set. Using 251.88 GiB for KV cache.
INFO 03-31 17:59:51 [llm.py:391] Supported tasks: ['generate']
=============== GENERATE
Rendering prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 30.27it/s]
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  2.60it/s, est. speed input: 153.71 toks/s, output: 91.18 toks/s]
Time elaspsed for 20 tokens is 1.25 sec
===============
CompletionOutput(index=0, text='\n\nThe response is a 2-3 page document that describes the task.\n\n###', token_ids=[203, 203, 1318, 1789, 438, 312, 225, 36, 31, 37, 1938, 1825, 688, 18872, 322, 2899, 32, 203, 203, 1482], routed_experts=None, cumulative_logprob=None, logprobs=None, finish_reason=length, stop_reason=None)
CompletionOutput(index=0, text='\n\n1. The user will receive a list of instructions for preparing chicken soup for a family.\n2. The user will receive a list of instructions for preparing chicken soup for a family.\n3. The user will receive a list of instructions for preparing chicken soup for a family.\n', token_ids=[203, 203, 35, 32, 886, 1256, 1098, 7768, 312, 1149, 432, 9400, 436, 1406, 26124, 663, 21217, 31628, 436, 312, 13872, 32, 203, 36, 32, 886, 1256, 1098, 7768, 312, 1149, 432, 9400, 436, 1406, 26124, 663, 21217, 31628, 436, 312, 13872, 32, 203, 37, 32, 886, 1256, 1098, 7768, 312, 1149, 432, 9400, 436, 1406, 26124, 663, 21217, 31628, 436, 312, 13872, 32, 203], routed_experts=None, cumulative_logprob=None, logprobs=None, finish_reason=length, stop_reason=None)
CompletionOutput(index=0, text='\n\nKaneki Ken is a human.\n\n### Instruction:\n\nDescribe what it', token_ids=[203, 203, 61, 2600, 7319, 48487, 438, 312, 13462, 32, 203, 203, 1482, 21081, 44, 203, 203, 8591, 2769, 561], routed_experts=None, cumulative_logprob=None, logprobs=None, finish_reason=length, stop_reason=None)
===============

Prompt:
 'Below is an instruction that describes a task. Write a response that appropriately completes the request. Be polite in your response to the user.\n\n### Instruction:\nProvide instructions for preparing chicken soup.\n\n### Response:'

Generated text:
 '\n\nThe response is a 2-3 page document that describes the task.\n\n###'

-----------------------------------

Prompt:
 'Below is an instruction that describes a task. Write a response that appropriately completes the request. Be polite in your response to the user.\n\n### Instruction:\nProvide a list of instructions for preparing chicken soup for a family.\n\n### Response:'

Generated text:
 '\n\n1. The user will receive a list of instructions for preparing chicken soup for a family.\n2. The user will receive a list of instructions for preparing chicken soup for a family.\n3. The user will receive a list of instructions for preparing chicken soup for a family.\n'

-----------------------------------

Prompt:
 "Below is an instruction that describes a task. Write a response that appropriately completes the request. Be polite in your response to the user.\n\n### Instruction:\nYou are Kaneki Ken from 'Tokyo Ghoul.' Describe what it feels like to be both human and ghoul to someone unfamiliar with your world.\n\n### Response:"

Generated text:
 '\n\nKaneki Ken is a human.\n\n### Instruction:\n\nDescribe what it'

-----------------------------------
(EngineCore pid=53257) INFO 03-31 17:59:52 [core.py:1210] Shutdown initiated (timeout=0)
(EngineCore pid=53257) INFO 03-31 17:59:52 [core.py:1233] Shutdown complete

vllm_spyre_next/examples/Offline_demo.py

(rehankhan) [rehankhan@rehankhan-spyre-dev-pf vllm_spyre_next]$ python examples/test.py 
INFO 03-31 18:05:01 [__init__.py:44] Available plugins for group vllm.platform_plugins:
INFO 03-31 18:05:01 [__init__.py:46] - spyre_next -> vllm_spyre_next:register
INFO 03-31 18:05:01 [__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 03-31 18:05:01 [__init__.py:239] Platform plugin spyre_next is activated
INFO 03-31 18:05:03 [importing.py:44] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
INFO 03-31 18:05:03 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 03-31 18:05:10 [utils.py:233] non-default args: {'enable_prefix_caching': True, 'attention_config': AttentionConfig(backend=<AttentionBackendEnum.CUSTOM: None>, flash_attn_version=None, use_prefill_decode_attention=False, flash_attn_max_num_splits_for_cuda_graph=32, use_cudnn_prefill=False, use_trtllm_ragged_deepseek_prefill=False, use_trtllm_attention=None, disable_flashinfer_prefill=True, disable_flashinfer_q_quantization=False, use_prefill_query_quantization=False), 'model': 'ibm-granite/granite-3.3-8b-instruct'}
INFO 03-31 18:05:10 [model.py:540] Resolved architecture: GraniteForCausalLM
INFO 03-31 18:05:10 [model.py:1607] Using max model len 131072
WARNING 03-31 18:05:10 [cpu.py:136] VLLM_CPU_KVCACHE_SPACE not set. Using 251.88 GiB for KV cache.
INFO 03-31 18:05:11 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=16384.
INFO 03-31 18:05:11 [vllm.py:750] Asynchronous scheduling is enabled.
INFO 03-31 18:05:11 [platform.py:74] 
INFO 03-31 18:05:11 [platform.py:74]        █     █     █▄   ▄█       ▄█▀▀█▄  █▀▀▀█▄  █   █  █▀▀▀█▄  █▀▀▀▀
INFO 03-31 18:05:11 [platform.py:74]  ▄▄ ▄█ █     █     █ ▀▄▀ █       ▀▀▄▄▄   █▄▄▄█▀  ▀▄ ▄▀  █▄▄▄█▀  █▄▄▄   version 0.1.dev536
INFO 03-31 18:05:11 [platform.py:74]   █▄█▀ █     █     █     █            █  █        ▀█▀   █ ▀█▄   █      model   ibm-granite/granite-3.3-8b-instruct
INFO 03-31 18:05:11 [platform.py:74]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀       ▀▄▄▄█▀  █         █    █   ▀█  █▄▄▄▄
INFO 03-31 18:05:11 [platform.py:74] 
INFO 03-31 18:05:11 [platform.py:88] Loading worker from: vllm_spyre_next.v1.worker.spyre_worker.TorchSpyreWorker
INFO 03-31 18:05:11 [platform.py:103] Loading scheduler from: vllm.v1.core.sched.scheduler.Scheduler
INFO 03-31 18:05:18 [__init__.py:44] Available plugins for group vllm.platform_plugins:
INFO 03-31 18:05:18 [__init__.py:46] - spyre_next -> vllm_spyre_next:register
INFO 03-31 18:05:18 [__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 03-31 18:05:18 [__init__.py:239] Platform plugin spyre_next is activated
INFO 03-31 18:05:18 [importing.py:44] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
INFO 03-31 18:05:18 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
(EngineCore pid=53786) INFO 03-31 18:05:21 [core.py:105] Initializing a V1 LLM engine (v0.18.1rc1.dev53+gffb5b32b5.d20260324) with config: model='ibm-granite/granite-3.3-8b-instruct', speculative_config=None, tokenizer='ibm-granite/granite-3.3-8b-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cpu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=ibm-granite/granite-3.3-8b-instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.DYNAMO_TRACE_ONCE: 2>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': None, 'compile_ranges_endpoints': [16384], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True, 'dce': True, 
'size_asserts': False, 'nan_asserts': False, 'epilogue_fusion': True, 'cpp.dynamic_threads': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': None, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=53786) INFO 03-31 18:05:23 [__init__.py:13] Registering custom ops for spyre_next
(EngineCore pid=53786) INFO 03-31 18:05:23 [rms_norm.py:236] Registered custom op: SpyreRMSNorm
(EngineCore pid=53786) INFO 03-31 18:05:23 [silu_and_mul.py:169] Registered custom op: SpyreSiluAndMul
(EngineCore pid=53786) INFO 03-31 18:05:23 [linear.py:157] Registered custom op: spyre_merged_col_linear
(EngineCore pid=53786) INFO 03-31 18:05:23 [linear.py:157] Registered custom op: spyre_row_parallel_linear
(EngineCore pid=53786) INFO 03-31 18:05:23 [rotary_embedding.py:305] Registered custom op: SpyreRotaryEmbedding
(EngineCore pid=53786) WARNING 03-31 18:05:23 [cpu_worker.py:60] libtcmalloc is not found in LD_PRELOAD. For best performance, please follow the section `set LD_PRELOAD` in https://docs.vllm.ai/en/latest/getting_started/installation/cpu/ to setup required pre-loaded libraries.
(EngineCore pid=53786) WARNING 03-31 18:05:23 [cpu_worker.py:60] libiomp is not found in LD_PRELOAD. For best performance, please follow the section `set LD_PRELOAD` in https://docs.vllm.ai/en/latest/getting_started/installation/cpu/ to setup required pre-loaded libraries.
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:227] auto thread-binding list (id, physical core): [(96, 0), (97, 1), (98, 2), (99, 3), (100, 4), (101, 5), (102, 6), (103, 7), (104, 8), (105, 9), (106, 10), (107, 11), (108, 12), (109, 13), (110, 14), (111, 15), (112, 16), (113, 17), (114, 18), (115, 19), (116, 20), (117, 21), (118, 22), (119, 23), (120, 24), (121, 25), (122, 26), (123, 27), (124, 28), (125, 29), (126, 30), (127, 31), (128, 32), (129, 33), (130, 34), (131, 35), (132, 36), (133, 37), (134, 38), (135, 39), (136, 40), (137, 41), (138, 42), (139, 43), (140, 44), (141, 45), (142, 46), (143, 47)]
[W331 18:05:23.797197746 utils.cpp:76] Warning: numa_migrate_pages failed. errno: 1 (function init_cpu_threads_env)
[W331 18:05:23.797213401 utils.cpp:103] Warning: NUMA binding: Using MEMBIND policy for memory allocation on the NUMA nodes (0). Memory allocations will be strictly bound to these NUMA nodes. (function init_cpu_threads_env)
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] OMP threads binding of Process 53786:
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 53786, core 96
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 53998, core 97
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 53999, core 98
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54000, core 99
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54001, core 100
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54002, core 101
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54003, core 102
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54004, core 103
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54005, core 104
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54006, core 105
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54007, core 106
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54008, core 107
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54009, core 108
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54010, core 109
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54011, core 110
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54012, core 111
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54013, core 112
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54014, core 113
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54015, core 114
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54016, core 115
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54017, core 116
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54018, core 117
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54019, core 118
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54020, core 119
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54021, core 120
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54022, core 121
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54023, core 122
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54024, core 123
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54025, core 124
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54026, core 125
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54027, core 126
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54028, core 127
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54029, core 128
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54030, core 129
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54031, core 130
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54032, core 131
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54033, core 132
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54034, core 133
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54035, core 134
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54036, core 135
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54037, core 136
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54038, core 137
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54039, core 138
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54040, core 139
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54041, core 140
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54042, core 141
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54043, core 142
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 	OMP tid: 54044, core 143
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_worker.py:109] 
(EngineCore pid=53786) INFO 03-31 18:05:23 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.129.9.130:42787 backend=gloo
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore pid=53786) INFO 03-31 18:05:23 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu_model_runner.py:62] Starting to load model ibm-granite/granite-3.3-8b-instruct...
(EngineCore pid=53786) WARNING 03-31 18:05:23 [linear.py:60] SpyreRowParallelLinear: no dtype promotion (torch-spyre limitation),expect numerical differences to upstream vLLM.
(EngineCore pid=53786) WARNING 03-31 18:05:23 [rotary_embedding.py:89] SpyreRotaryEmbedding: no dtype promotion is performed, expect numerical differences to upstream vLLM.
(EngineCore pid=53786) INFO 03-31 18:05:23 [cpu.py:112] Cannot use AttentionBackendEnum.CUSTOM backend on CPU.
(EngineCore pid=53786) WARNING 03-31 18:05:23 [linear.py:60] SpyreMergedColumnParallelLinear: no dtype promotion (torch-spyre limitation),expect numerical differences to upstream vLLM.
(EngineCore pid=53786) WARNING 03-31 18:05:23 [rms_norm.py:75] SpyreRMSNorm: no dtype promotion is performed, expect numerical differences to upstream vLLM.
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:00,  7.10it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:00<00:00,  5.45it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:00<00:00,  5.18it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:00<00:00,  6.18it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:00<00:00,  5.95it/s]
(EngineCore pid=53786) 
(EngineCore pid=53786) INFO 03-31 18:05:26 [default_loader.py:384] Loading weights took 0.69 seconds
(EngineCore pid=53786) INFO 03-31 18:05:26 [kv_cache_utils.py:1319] GPU KV cache size: 1,650,688 tokens
(EngineCore pid=53786) INFO 03-31 18:05:26 [kv_cache_utils.py:1324] Maximum concurrency for 131,072 tokens per request: 12.59x
(EngineCore pid=53786) INFO 03-31 18:05:29 [cpu_model_runner.py:73] Warming up model for the compilation...
(EngineCore pid=53786) INFO 03-31 18:06:35 [decorators.py:638] saved AOT compiled function to /home/rehankhan/.cache/vllm/torch_compile_cache/torch_aot_compile/90015b427fdcee783eafbdf0a8b1043f9709aae600b70b2fe774c2104edbe0a1/rank_0_0/model
(EngineCore pid=53786) INFO 03-31 18:06:36 [monitor.py:76] Initial profiling/warmup run took 1.40 s
(EngineCore pid=53786) INFO 03-31 18:06:36 [cpu_model_runner.py:83] Warming up done.
(EngineCore pid=53786) INFO 03-31 18:06:36 [core.py:283] init engine (profile, create kv cache, warmup model) took 70.62 seconds
(EngineCore pid=53786) WARNING 03-31 18:06:37 [scheduler.py:173] Using custom scheduler class vllm.v1.core.sched.scheduler.Scheduler. This scheduler interface is not public and compatibility may not be maintained.
(EngineCore pid=53786) INFO 03-31 18:06:38 [vllm.py:750] Asynchronous scheduling is disabled.
(EngineCore pid=53786) WARNING 03-31 18:06:38 [vllm.py:806] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore pid=53786) INFO 03-31 18:06:38 [platform.py:103] Loading scheduler from: vllm.v1.core.sched.scheduler.Scheduler
(EngineCore pid=53786) WARNING 03-31 18:06:38 [cpu.py:136] VLLM_CPU_KVCACHE_SPACE not set. Using 251.88 GiB for KV cache.
INFO 03-31 18:06:38 [llm.py:391] Supported tasks: ['generate']
Rendering prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 14.33it/s]
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.55it/s, est. speed input: 12.43 toks/s, output: 7.77 toks/s]
--------------------------------------------------
Generated text: '\n\nIBM operates'
--------------------------------------------------
vllm:kv_cache_usage_perc 0.0
vllm:prefix_cache_queries 8
vllm:prefix_cache_hits 0
vllm:external_prefix_cache_queries 0
vllm:external_prefix_cache_hits 0
vllm:mm_cache_queries 0
vllm:mm_cache_hits 0
vllm:prompt_tokens_cached 0
vllm:cache_config_info 1.0
(EngineCore pid=53786) INFO 03-31 18:06:39 [core.py:1210] Shutdown initiated (timeout=0)
(EngineCore pid=53786) INFO 03-31 18:06:39 [core.py:1233] Shutdown complete

Checklist

  • [x] I have read the contributing guidelines
  • [ ] My code follows the project's code style (run bash format.sh)
  • [ ] I have added tests for my changes (if applicable)
  • [ ] I have updated the documentation (if applicable)
  • [x] My commits include a Signed-off-by: line (DCO compliance)

@github-actions

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: Make sure that your code passes all the linting checks, otherwise your PR won't be able to be merged. To do so, run ./format.sh.
Now you are good to go 🚀.

We also recommend installing prek and configuring it to check your code before every local commit.

Collaborator

@bohnstingl bohnstingl left a comment

Thank you @dilipgb. I made a first pass through the PR and left some comments.

In general, I think you need to merge in the latest main branch, which would remove the changes from pyproject.toml and uv.lock. At least the changes there are not related and should be removed.

Also, could you please adopt the new call chain from #872?

Comment on lines +21 to +23
- No dtype promotion (torch-spyre limitation)
- rope_scaling not yet implemented
- Expect numerical differences from upstream vLLM
Collaborator

Ad 1) Where is the dtype promotion happening upstream?
I see that there is enable_fp32_compute, but I believe for Granite it is False?

Ad 2) Can we have an assert / raise Exception to ensure that this code path is only reached when scaling_type == "default", "mrope_section" not in rope_parameters, and "use_fope" not in rope_parameters or not rope_parameters["use_fope"]?

Ad 3) Is the dtype promotion the source of the numerical differences, or is there anything apart from that?
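
Such a guard could look roughly like the following sketch. The names (`rope_parameters`, `rope_type`, `mrope_section`, `use_fope`) are taken from this discussion and are not verified against the actual code; this is illustrative, not the implementation:

```python
def check_supported_rope_config(rope_parameters: dict) -> None:
    """Reject RoPE configurations the Spyre path does not implement yet.

    Illustrative guard based on the conditions discussed above; the real
    parameter names and defaults may differ.
    """
    scaling_type = rope_parameters.get("rope_type", "default")
    if scaling_type != "default":
        raise NotImplementedError(
            f"SpyreRotaryEmbedding: rope scaling '{scaling_type}' is not supported")
    if "mrope_section" in rope_parameters:
        raise NotImplementedError(
            "SpyreRotaryEmbedding: mrope_section is not supported")
    if rope_parameters.get("use_fope"):
        raise NotImplementedError(
            "SpyreRotaryEmbedding: use_fope is not supported")
```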

Collaborator Author

  1. dtype promotion upstream is optional and only enabled when enable_fp32_compute is set. We could add another condition to check for this? But Granite's dtype is fp16 and we use it without upcasting; that is the thought process here.
  2. addressed.
  3. Yes, since trig functions and other intermediate operations in upstream vLLM are upcast for better precision.

Comment thread vllm_spyre_next/vllm_spyre_next/custom_ops/rotary_embedding.py Outdated
Comment thread vllm_spyre_next/vllm_spyre_next/custom_ops/rotary_embedding.py Outdated
Comment on lines +84 to +101
# Use float16 directly - no dynamic dimensions (Spyre constraint)
compute_dtype = torch.float16

# Compute inverse frequencies: base^(-2i/rotary_dim)
# Using negative exponent for numerical stability
exponents = -torch.arange(0, self.rotary_dim, 2, dtype=compute_dtype) / self.rotary_dim
inv_freq = torch.pow(self.base, exponents)

# Create position indices [0, 1, 2, ..., max_position_embeddings-1]
t = torch.arange(self.max_position_embeddings, dtype=compute_dtype)

# Compute frequencies for each position: pos * inv_freq
# Shape: [max_position_embeddings, rotary_dim // 2]
freqs = torch.outer(t, inv_freq)

# Duplicate frequencies for interleaved pattern
# Shape: [max_position_embeddings, rotary_dim]
emb = torch.cat([freqs, freqs], dim=-1)
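For reference, the cache construction quoted above can be checked with a few lines of NumPy (a stand-in for the torch code; `rotary_dim`, `base`, and `max_pos` here are made-up toy sizes, not the model's):

```python
import numpy as np

rotary_dim, base, max_pos = 8, 10000.0, 16  # toy sizes for illustration

# inv_freq[i] = base^(-2i/rotary_dim), matching the diff above
exponents = -np.arange(0, rotary_dim, 2, dtype=np.float32) / rotary_dim
inv_freq = np.power(base, exponents)           # shape: [rotary_dim // 2]

t = np.arange(max_pos, dtype=np.float32)       # positions 0..max_pos-1
freqs = np.outer(t, inv_freq)                  # [max_pos, rotary_dim // 2]
emb = np.concatenate([freqs, freqs], axis=-1)  # [max_pos, rotary_dim]

cos_cache, sin_cache = np.cos(emb), np.sin(emb)
```

At position 0 every angle is 0, so cos is all ones and sin all zeros, which is a quick sanity check on the layout.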
Collaborator

Can we make a comment that these ops are currently happening on CPU?

torch-spyre has had some more ops added lately and torch.cat should now work on spyre. So we might want to try and convert some of these operations to be happening on spyre.

Collaborator Author

I tried torch.arange and torch.outer, which are not yet implemented on Spyre. Although torch.cat is implemented on Spyre, the emb calculation would still fall back to CPU, so we would have to move data back and forth between CPU and card multiple times just to run torch.cat on the device.

Collaborator

torch.outer may indeed not be supported. However, torch.cat and torch.arange should be.

Although torch.cat and torch.arange might have CPU fallbacks, I think we should still try to use them with the spyre device, because once those operations are supported through torch-spyre, they will just work in vllm-spyre.

Comment thread vllm_spyre_next/vllm_spyre_next/custom_ops/rotary_embedding.py Outdated
Comment thread vllm_spyre_next/vllm_spyre_next/custom_ops/rotary_embedding.py Outdated
Comment thread vllm_spyre_next/vllm_spyre_next/custom_ops/rotary_embedding.py Outdated
Comment on lines +194 to +197
query_rot = query[..., :rotary_dim]
query_pass = query[..., rotary_dim:]
key_rot = key[..., :rotary_dim]
key_pass = key[..., rotary_dim:]
Collaborator

Same here, tensor slicing shouldn't currently work with the tensors on spyre? Can you confirm that the tensors are indeed on spyre?

# Retrieve cos/sin for the given positions
# positions shape: [batch_size, seq_len] or [total_tokens]
cos = cos_cache[positions] # [..., rotary_dim]
sin = sin_cache[positions] # [..., rotary_dim]
Collaborator

I am surprised this is actually working when cos_cache and sin_cache are on spyre?

Collaborator

@dilipgb I am still puzzled by how this can work for you. I tested it locally and the tensor slicing fails, as this is not yet supported in torch-spyre. At least not in eager mode. Can you confirm that?

Therefore, I think we should restructure this function a bit overall: do the slicing in _forward_spyre_impl on CPU, pass in the two halves, apply RoPE to each individually, return them, and then recombine them in _forward_spyre_impl on CPU.
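
The proposed restructuring can be sketched roughly as follows. This is a NumPy stand-in for the torch code; `_forward_spyre_impl` and the half-splitting convention come from the discussion above, everything else (names, shapes) is illustrative:

```python
import numpy as np

def rotate_half(x):
    # [-x2, x1] over the two halves of the last dimension
    x1, x2 = np.split(x, 2, axis=-1)
    return np.concatenate([-x2, x1], axis=-1)

def apply_rope(x_rot, cos, sin):
    # Pure elementwise math: this part is a candidate to run on Spyre
    return x_rot * cos + rotate_half(x_rot) * sin

def forward_impl(query, cos_cache, sin_cache, positions, rotary_dim):
    # Slicing and indexing stay on CPU (not yet supported on Spyre)
    q_rot, q_pass = query[..., :rotary_dim], query[..., rotary_dim:]
    cos, sin = cos_cache[positions], sin_cache[positions]
    # The elementwise rotation could be dispatched to the device
    q_rot = apply_rope(q_rot, cos, sin)
    # Recombine on CPU
    return np.concatenate([q_rot, q_pass], axis=-1)
```

With cos = 1 and sin = 0 the rotation is the identity, which makes the round trip easy to sanity-check.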

dilipgb and others added 6 commits April 3, 2026 12:14
Signed-off-by: Dilip Gowda Bhagavan <dilip.bhagavan@ibm.com>

This PR bumps the lower bound of foundation-model-stack dependency from
1.7.0 to 1.8.0 which includes Llama bug fixes for torch 2.10.


---------

Signed-off-by: Daniel Schenker <daniel.schenker@ibm.com>
Signed-off-by: Dilip Gowda Bhagavan <dilip.bhagavan@ibm.com>
Signed-off-by: Dilip Gowda Bhagavan <dilip.bhagavan@ibm.com>
Signed-off-by: Dilip Gowda Bhagavan <dilip.bhagavan@ibm.com>
Signed-off-by: Dilip Gowda Bhagavan <dilip.bhagavan@ibm.com>
Signed-off-by: Dilip Gowda Bhagavan <dilip.bhagavan@ibm.com>
Collaborator

@bohnstingl bohnstingl left a comment

@dilipgb Can you please have a look at my comments and also merge in the latest main?

Comment on lines +252 to +255
# Transfer cos/sin cache to Spyre device if not already there
# if self.cos_cache.device != self._target_device:
# self.cos_cache = convert(self.cos_cache, self._target_device, self._target_dtype)
# self.sin_cache = convert(self.sin_cache, self._target_device, self._target_dtype)
Collaborator

Can we remove that?

Rotated tensor [..., rotary_dim]
"""
x1, x2 = x.chunk(2, dim=-1)
return torch.cat([-x2, x1], dim=-1)
Collaborator

I think we can rework this to be supported on spyre. In particular,


Signed-off-by: Dilip Gowda Bhagavan <dilip.bhagavan@ibm.com>
Signed-off-by: Dilip Gowda Bhagavan <110233170+dilipgb@users.noreply.github.com>
Collaborator

@bohnstingl bohnstingl left a comment

Some minor details so that I don't forget them.
As discussed offline, it would be good to run as many operations as possible on Spyre; for example, torch.cat should be supported now.
Indexing operations, such as slicing, still need to stay on CPU for the moment.

Tq, q_hidden = query.shape
Tk, k_hidden = key.shape

assert Tq == Tk, f"Query/Key sequence mismatch: {Tq} != {Tk}"
Collaborator

# Compile the forward kernel
self.maybe_compiled_forward_spyre = self.maybe_compile(self.forward_spyre)
self._layer_name = register_layer(self, "spyre_rotary_embedding")

Collaborator

We recently introduced additional logging. Please include something like:

logger.debug_once(
    "SpyreRotaryEmbedding: Dispatch: enabled=%s, Forward method=%s, Compiled=%s",
    self.enabled(),
    self._forward_method.__name__,
    self.maybe_compiled_forward_spyre is not self.forward_spyre,
)

Comment on lines +262 to +263
assert cos_q.shape == query_rot.shape, f"{cos_q.shape} != {query.shape}"
assert sin_q.shape == query_rot.shape
Collaborator

Can we have more descriptive error messages here?


Development

Successfully merging this pull request may close these issues.

[Feature]: Wrap the RoPE op

3 participants