I am using the latest vLLM version. I need to apply RoPE scaling to Llama-3.1-8B and Gemma-2-9B to extend the maximum context length from 8K up to 128K.
I am using the command shown below (together with its output).
I don't know how to use or tune the --rope-scaling and --rope-theta arguments to make sure I can serve 128K tokens per request (i.e. that the context length is actually extended to 128K tokens).
What do they mean?
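For reference, a minimal sketch of how these two flags are typically passed on the vLLM CLI; the values below are illustrative assumptions, not settings I have verified for 128K.

# Sketch only: --rope-scaling takes a JSON dict selecting the scaling method
# ('rope_type', e.g. 'dynamic') and its 'factor'; --rope-theta overrides the
# RoPE base frequency (Llama 3.1 ships with 500000 by default).
$ python -m vllm.entrypoints.openai.api_server \
    --model NousResearch/Meta-Llama-3.1-8B-Instruct \
    --rope-scaling '{"rope_type": "dynamic", "factor": 8.0}' \
    --rope-theta 500000 \
    --max-model-len 131072

Note that the Llama 3.1 config already ships with its own 'llama3' RoPE scaling (factor 8.0 over an original 8K window), so that model advertises a 128K context even without overriding these flags; Gemma 2 has a plain 8K context, so extending it would need an explicit override, with no guarantee of quality at 128K.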
Hardware: an A10 GPU with 24 GB of VRAM.
By the way, I got this error:
$ python -m vllm.entrypoints.openai.api_server --model NousResearch/Meta-Llama-3.1-8B-Instruct --max-model-len 120000 --gpu-memory-utilization 0.95 --rope-scaling '{"factor": 8.0, "type": "dynamic"}'
INFO 11-21 13:28:24 api_server.py:585] vLLM API server version 0.6.4.post1
INFO 11-21 13:28:24 api_server.py:586] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='NousResearch/Meta-Llama-3.1-8B-Instruct', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=120000, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling={'factor': 8.0, 'type': 'dynamic'}, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
INFO 11-21 13:28:24 api_server.py:175] Multiprocessing frontend to use ipc:/tmp/2a3bc2d6-292b-410b-bb5d-f83383760498 for IPC Path.
INFO 11-21 13:28:24 api_server.py:194] Started engine process with PID 29655
INFO 11-21 13:28:24 config.py:112] Replacing legacy 'type' key with 'rope_type'
INFO 11-21 13:28:29 config.py:350] This model supports multiple tasks: {'embedding', 'generate'}. Defaulting to 'generate'.
WARNING 11-21 13:28:29 arg_utils.py:1013] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
WARNING 11-21 13:28:29 arg_utils.py:1075] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
INFO 11-21 13:28:29 config.py:1136] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 11-21 13:28:29 config.py:112] Replacing legacy 'type' key with 'rope_type'
INFO 11-21 13:28:37 config.py:350] This model supports multiple tasks: {'generate', 'embedding'}. Defaulting to 'generate'.
WARNING 11-21 13:28:37 arg_utils.py:1013] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
WARNING 11-21 13:28:37 arg_utils.py:1075] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
INFO 11-21 13:28:37 config.py:1136] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 11-21 13:28:37 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='NousResearch/Meta-Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='NousResearch/Meta-Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=120000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=NousResearch/Meta-Llama-3.1-8B-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=True multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)
INFO 11-21 13:28:40 selector.py:135] Using Flash Attention backend.
INFO 11-21 13:28:43 model_runner.py:1072] Starting to load model NousResearch/Meta-Llama-3.1-8B-Instruct...
INFO 11-21 13:28:44 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:02, 1.43it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:01<00:01, 1.16it/s]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:02<00:00, 1.10it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00, 1.39it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00, 1.30it/s]
INFO 11-21 13:28:48 model_runner.py:1077] Loading model weights took 15.2075 GB
INFO 11-21 13:28:49 worker.py:232] Memory profiling results: total_gpu_memory=21.99GiB initial_memory_usage=15.53GiB peak_torch_memory=16.38GiB memory_usage_post_profile=15.55GiB non_torch_memory=0.33GiB kv_cache_size=4.17GiB gpu_memory_utilization=0.95
INFO 11-21 13:28:49 gpu_executor.py:113] # GPU blocks: 2136, # CPU blocks: 2048
INFO 11-21 13:28:49 gpu_executor.py:117] Maximum concurrency for 120000 tokens per request: 0.28x
ERROR 11-21 13:28:49 engine.py:366] The model's max seq len (120000) is larger than the maximum number of tokens that can be stored in KV cache (34176). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
ERROR 11-21 13:28:49 engine.py:366] Traceback (most recent call last):
ERROR 11-21 13:28:49 engine.py:366] File "/home/vllm-env/lib/python3.9/site-packages/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine
ERROR 11-21 13:28:49 engine.py:366] engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
ERROR 11-21 13:28:49 engine.py:366] File "/home/vllm-env/lib/python3.9/site-packages/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args
ERROR 11-21 13:28:49 engine.py:366] return cls(ipc_path=ipc_path,
ERROR 11-21 13:28:49 engine.py:366] File "/home/vllm-env/lib/python3.9/site-packages/vllm/engine/multiprocessing/engine.py", line 71, in __init__
ERROR 11-21 13:28:49 engine.py:366] self.engine = LLMEngine(*args, **kwargs)
ERROR 11-21 13:28:49 engine.py:366] File "/home/vllm-env/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 350, in __init__
ERROR 11-21 13:28:49 engine.py:366] self._initialize_kv_caches()
ERROR 11-21 13:28:49 engine.py:366] File "/home/vllm-env/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 500, in _initialize_kv_caches
ERROR 11-21 13:28:49 engine.py:366] self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
ERROR 11-21 13:28:49 engine.py:366] File "/home/vllm-env/lib/python3.9/site-packages/vllm/executor/gpu_executor.py", line 120, in initialize_cache
ERROR 11-21 13:28:49 engine.py:366] self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
ERROR 11-21 13:28:49 engine.py:366] File "/home/vllm-env/lib/python3.9/site-packages/vllm/worker/worker.py", line 268, in initialize_cache
ERROR 11-21 13:28:49 engine.py:366] raise_if_cache_size_invalid(num_gpu_blocks,
ERROR 11-21 13:28:49 engine.py:366] File "/home/vllm-env/lib/python3.9/site-packages/vllm/worker/worker.py", line 498, in raise_if_cache_size_invalid
ERROR 11-21 13:28:49 engine.py:366] raise ValueError(
ERROR 11-21 13:28:49 engine.py:366] ValueError: The model's max seq len (120000) is larger than the maximum number of tokens that can be stored in KV cache (34176). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
Process SpawnProcess-1:
Traceback (most recent call last):
File "/home/vllm-env/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/vllm-env/lib/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/vllm-env/lib/python3.9/site-packages/vllm/engine/multiprocessing/engine.py", line 368, in run_mp_engine
raise e
File "/home/vllm-env/lib/python3.9/site-packages/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine
engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
File "/home/vllm-env/lib/python3.9/site-packages/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args
return cls(ipc_path=ipc_path,
File "/home/vllm-env/lib/python3.9/site-packages/vllm/engine/multiprocessing/engine.py", line 71, in __init__
self.engine = LLMEngine(*args, **kwargs)
File "/home/vllm-env/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 350, in __init__
self._initialize_kv_caches()
File "/home/vllm-env/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 500, in _initialize_kv_caches
self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
File "/home/vllm-env/lib/python3.9/site-packages/vllm/executor/gpu_executor.py", line 120, in initialize_cache
self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
File "/home/vllm-env/lib/python3.9/site-packages/vllm/worker/worker.py", line 268, in initialize_cache
raise_if_cache_size_invalid(num_gpu_blocks,
File "/home/vllm-env/lib/python3.9/site-packages/vllm/worker/worker.py", line 498, in raise_if_cache_size_invalid
raise ValueError(
ValueError: The model's max seq len (120000) is larger than the maximum number of tokens that can be stored in KV cache (34176). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
[rank0]:[W1121 13:28:49.681361114 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
Task exception was never retrieved
future: <Task finished name='Task-2' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /home/vllm-env/lib/python3.9/site-packages/vllm/engine/multiprocessing/client.py:178> exception=ZMQError('Operation not supported')>
Traceback (most recent call last):
File "/home/vllm-env/lib/python3.9/site-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop
while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT
File "/home/vllm-env/lib/python3.9/site-packages/zmq/_future.py", line 382, in poll
raise _zmq.ZMQError(_zmq.ENOTSUP)
zmq.error.ZMQError: Operation not supported
Traceback (most recent call last):
File "/home/vllm-env/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/vllm-env/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/vllm-env/lib/python3.9/site-packages/vllm/entrypoints/openai/api_server.py", line 643, in <module>
uvloop.run(run_server(args))
File "/home/vllm-env/lib/python3.9/site-packages/uvloop/__init__.py", line 82, in run
return loop.run_until_complete(wrapper())
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
return future.result()
File "/home/vllm-env/lib/python3.9/site-packages/uvloop/__init__.py", line 61, in wrapper
return await main
File "/home/vllm-env/lib/python3.9/site-packages/vllm/entrypoints/openai/api_server.py", line 609, in run_server
async with build_async_engine_client(args) as engine_client:
File "/home/vllm-env/lib/python3.9/contextlib.py", line 181, in __aenter__
return await self.gen.__anext__()
File "/home/vllm-env/lib/python3.9/site-packages/vllm/entrypoints/openai/api_server.py", line 113, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/home/vllm-env/lib/python3.9/contextlib.py", line 181, in __aenter__
return await self.gen.__anext__()
File "/home/vllm-env/lib/python3.9/site-packages/vllm/entrypoints/openai/api_server.py", line 210, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
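As a rough cross-check, the 34176-token limit reported above is consistent with the 4.17 GiB of KV-cache memory in the profiling line: assuming Llama-3.1-8B's architecture (32 layers, 8 KV heads, head dim 128) and bf16 cache entries, each token needs 32 * 8 * 128 * 2 bytes * 2 (K and V) = 128 KiB of cache, and 4.17 GiB / 128 KiB ≈ 34k tokens. Below is a hedged sketch of a launch that stays within that budget on a 24 GB A10; the exact values are assumptions, not something I have tested.

# Sketch only: cap max_model_len near the ~34k tokens of KV cache that fit next to
# the 15.2 GiB of bf16 weights, or shrink the cache with an fp8 KV-cache dtype
# (available dtypes depend on the vLLM build and GPU).
$ python -m vllm.entrypoints.openai.api_server \
    --model NousResearch/Meta-Llama-3.1-8B-Instruct \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.95 \
    --kv-cache-dtype fp8

Reaching a full 128K context on a single 24 GB A10 would presumably also require quantizing the model weights, since the bf16 checkpoint alone occupies about 15.2 GiB.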