I am using the latest vLLM version. I need to apply RoPE scaling to Llama-3.1-8B and Gemma-2-9B to extend the maximum context length from 8K up to 128K.
I am using the command shown below (together with its output).
I don't know how to use or tune the --rope-scaling and --rope-theta arguments to make sure I can serve 128K tokens per request (i.e. that the context length is actually extended to 128K tokens).
What do they mean?
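For reference, a minimal sketch of how these two flags are typically passed on the vLLM CLI; the values below are illustrative assumptions, not settings I have verified for 128K.

# Sketch only: --rope-scaling takes a JSON dict selecting the scaling method
# ('rope_type', e.g. 'dynamic') and its 'factor'; --rope-theta overrides the
# RoPE base frequency (Llama 3.1 ships with 500000 by default).
$ python -m vllm.entrypoints.openai.api_server \
    --model NousResearch/Meta-Llama-3.1-8B-Instruct \
    --rope-scaling '{"rope_type": "dynamic", "factor": 8.0}' \
    --rope-theta 500000 \
    --max-model-len 131072

Note that the Llama 3.1 config already ships with its own 'llama3' RoPE scaling (factor 8.0 over an original 8K window), so that model advertises a 128K context even without overriding these flags; Gemma 2 has a plain 8K context, so extending it would need an explicit override, with no guarantee of quality at 128K.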
Hardware: an A10 GPU with 24 GB of VRAM.
By the way, I got this error:
$ python -m vllm.entrypoints.openai.api_server --model NousResearch/Meta-Llama-3.1-8B-Instruct --max-model-len 120000 --gpu-memory-utilization 0.95 --rope-scaling '{"factor": 8.0, "type": "dynamic"}'
INFO 11-21 13:28:24 api_server.py:585] vLLM API server version 0.6.4.post1
INFO 11-21 13:28:24 api_server.py:586] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='NousResearch/Meta-Llama-3.1-8B-Instruct', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=120000, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling={'factor': 8.0, 'type': 'dynamic'}, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
INFO 11-21 13:28:24 api_server.py:175] Multiprocessing frontend to use ipc:/tmp/2a3bc2d6-292b-410b-bb5d-f83383760498 for IPC Path.
INFO 11-21 13:28:24 api_server.py:194] Started engine process with PID 29655
INFO 11-21 13:28:24 config.py:112] Replacing legacy 'type' key with 'rope_type'
INFO 11-21 13:28:29 config.py:350] This model supports multiple tasks: {'embedding', 'generate'}. Defaulting to 'generate'.
WARNING 11-21 13:28:29 arg_utils.py:1013] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
WARNING 11-21 13:28:29 arg_utils.py:1075] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
INFO 11-21 13:28:29 config.py:1136] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 11-21 13:28:29 config.py:112] Replacing legacy 'type' key with 'rope_type'
INFO 11-21 13:28:37 config.py:350] This model supports multiple tasks: {'generate', 'embedding'}. Defaulting to 'generate'.
WARNING 11-21 13:28:37 arg_utils.py:1013] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
WARNING 11-21 13:28:37 arg_utils.py:1075] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
INFO 11-21 13:28:37 config.py:1136] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 11-21 13:28:37 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='NousResearch/Meta-Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='NousResearch/Meta-Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=120000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=NousResearch/Meta-Llama-3.1-8B-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=True multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)
INFO 11-21 13:28:40 selector.py:135] Using Flash Attention backend.
INFO 11-21 13:28:43 model_runner.py:1072] Starting to load model NousResearch/Meta-Llama-3.1-8B-Instruct...
INFO 11-21 13:28:44 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:02, 1.43it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:01<00:01, 1.16it/s]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:02<00:00, 1.10it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00, 1.39it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00, 1.30it/s]
INFO 11-21 13:28:48 model_runner.py:1077] Loading model weights took 15.2075 GB
INFO 11-21 13:28:49 worker.py:232] Memory profiling results: total_gpu_memory=21.99GiB initial_memory_usage=15.53GiB peak_torch_memory=16.38GiB memory_usage_post_profile=15.55GiB non_torch_memory=0.33GiB kv_cache_size=4.17GiB gpu_memory_utilization=0.95
INFO 11-21 13:28:49 gpu_executor.py:113] # GPU blocks: 2136, # CPU blocks: 2048
INFO 11-21 13:28:49 gpu_executor.py:117] Maximum concurrency for 120000 tokens per request: 0.28x
ERROR 11-21 13:28:49 engine.py:366] The model's max seq len (120000) is larger than the maximum number of tokens that can be stored in KV cache (34176). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
ERROR 11-21 13:28:49 engine.py:366] Traceback (most recent call last):
ERROR 11-21 13:28:49 engine.py:366] File "/home/vllm-env/lib/python3.9/site-packages/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine
ERROR 11-21 13:28:49 engine.py:366] engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
ERROR 11-21 13:28:49 engine.py:366] File "/home/vllm-env/lib/python3.9/site-packages/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args
ERROR 11-21 13:28:49 engine.py:366] return cls(ipc_path=ipc_path,
ERROR 11-21 13:28:49 engine.py:366] File "/home/vllm-env/lib/python3.9/site-packages/vllm/engine/multiprocessing/engine.py", line 71, in __init__
ERROR 11-21 13:28:49 engine.py:366] self.engine = LLMEngine(*args, **kwargs)
ERROR 11-21 13:28:49 engine.py:366] File "/home/vllm-env/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 350, in __init__
ERROR 11-21 13:28:49 engine.py:366] self._initialize_kv_caches()
ERROR 11-21 13:28:49 engine.py:366] File "/home/vllm-env/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 500, in _initialize_kv_caches
ERROR 11-21 13:28:49 engine.py:366] self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
ERROR 11-21 13:28:49 engine.py:366] File "/home/vllm-env/lib/python3.9/site-packages/vllm/executor/gpu_executor.py", line 120, in initialize_cache
ERROR 11-21 13:28:49 engine.py:366] self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
ERROR 11-21 13:28:49 engine.py:366] File "/home/vllm-env/lib/python3.9/site-packages/vllm/worker/worker.py", line 268, in initialize_cache
ERROR 11-21 13:28:49 engine.py:366] raise_if_cache_size_invalid(num_gpu_blocks,
ERROR 11-21 13:28:49 engine.py:366] File "/home/vllm-env/lib/python3.9/site-packages/vllm/worker/worker.py", line 498, in raise_if_cache_size_invalid
ERROR 11-21 13:28:49 engine.py:366] raise ValueError(
ERROR 11-21 13:28:49 engine.py:366] ValueError: The model's max seq len (120000) is larger than the maximum number of tokens that can be stored in KV cache (34176). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
Process SpawnProcess-1:
Traceback (most recent call last):
File "/home/vllm-env/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/vllm-env/lib/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/vllm-env/lib/python3.9/site-packages/vllm/engine/multiprocessing/engine.py", line 368, in run_mp_engine
raise e
File "/home/vllm-env/lib/python3.9/site-packages/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine
engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
File "/home/vllm-env/lib/python3.9/site-packages/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args
return cls(ipc_path=ipc_path,
File "/home/vllm-env/lib/python3.9/site-packages/vllm/engine/multiprocessing/engine.py", line 71, in __init__
self.engine = LLMEngine(*args, **kwargs)
File "/home/vllm-env/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 350, in __init__
self._initialize_kv_caches()
File "/home/vllm-env/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 500, in _initialize_kv_caches
self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
File "/home/vllm-env/lib/python3.9/site-packages/vllm/executor/gpu_executor.py", line 120, in initialize_cache
self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
File "/home/vllm-env/lib/python3.9/site-packages/vllm/worker/worker.py", line 268, in initialize_cache
raise_if_cache_size_invalid(num_gpu_blocks,
File "/home/vllm-env/lib/python3.9/site-packages/vllm/worker/worker.py", line 498, in raise_if_cache_size_invalid
raise ValueError(
ValueError: The model's max seq len (120000) is larger than the maximum number of tokens that can be stored in KV cache (34176). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
[rank0]:[W1121 13:28:49.681361114 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
Task exception was never retrieved
future: <Task finished name='Task-2' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /home/vllm-env/lib/python3.9/site-packages/vllm/engine/multiprocessing/client.py:178> exception=ZMQError('Operation not supported')>
Traceback (most recent call last):
File "/home/vllm-env/lib/python3.9/site-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop
while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT
File "/home/vllm-env/lib/python3.9/site-packages/zmq/_future.py", line 382, in poll
raise _zmq.ZMQError(_zmq.ENOTSUP)
zmq.error.ZMQError: Operation not supported
Traceback (most recent call last):
File "/home/vllm-env/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/vllm-env/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/vllm-env/lib/python3.9/site-packages/vllm/entrypoints/openai/api_server.py", line 643, in <module>
uvloop.run(run_server(args))
File "/home/vllm-env/lib/python3.9/site-packages/uvloop/__init__.py", line 82, in run
return loop.run_until_complete(wrapper())
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
return future.result()
File "/home/vllm-env/lib/python3.9/site-packages/uvloop/__init__.py", line 61, in wrapper
return await main
File "/home/vllm-env/lib/python3.9/site-packages/vllm/entrypoints/openai/api_server.py", line 609, in run_server
async with build_async_engine_client(args) as engine_client:
File "/home/vllm-env/lib/python3.9/contextlib.py", line 181, in __aenter__
return await self.gen.__anext__()
File "/home/vllm-env/lib/python3.9/site-packages/vllm/entrypoints/openai/api_server.py", line 113, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/home/vllm-env/lib/python3.9/contextlib.py", line 181, in __aenter__
return await self.gen.__anext__()
File "/home/vllm-env/lib/python3.9/site-packages/vllm/entrypoints/openai/api_server.py", line 210, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
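As a rough cross-check, the 34176-token limit reported above is consistent with the 4.17 GiB of KV-cache memory in the profiling line: assuming Llama-3.1-8B's architecture (32 layers, 8 KV heads, head dim 128) and bf16 cache entries, each token needs 32 * 8 * 128 * 2 bytes * 2 (K and V) = 128 KiB of cache, and 4.17 GiB / 128 KiB ≈ 34k tokens. Below is a hedged sketch of a launch that stays within that budget on a 24 GB A10; the exact values are assumptions, not something I have tested.

# Sketch only: cap max_model_len near the ~34k tokens of KV cache that fit next to
# the 15.2 GiB of bf16 weights, or shrink the cache with an fp8 KV-cache dtype
# (available dtypes depend on the vLLM build and GPU).
$ python -m vllm.entrypoints.openai.api_server \
    --model NousResearch/Meta-Llama-3.1-8B-Instruct \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.95 \
    --kv-cache-dtype fp8

Reaching a full 128K context on a single 24 GB A10 would presumably also require quantizing the model weights, since the bf16 checkpoint alone occupies about 15.2 GiB.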