Your current environment
collect_env.txt
How would you like to use vllm
model: meta-llama/Llama3-8B-Instruct
quantization: none
tensor_parallel_size: 2
GPUs: 2xA30
vllm: 0.6.4.post1 (also tried with 0.5.4 and 0.5.0)
Strongly related to this issue: #6152
I can't run the script with multiple GPUs (it works on a single GPU). The following error occurs:
(VllmWorkerProcess pid=13824) ERROR 11-20 12:20:01 multiproc_worker_utils.py:226] RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
I tried setting the env variable to "spawn" and also tried the latest vllm version:
os.environ["VLLM_WORKER_MULTIPROC_METHOD"]="spawn"
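(For context, I set it at the very top of the script, before vllm is imported; whether that placement matters is my assumption, the intent is just to have it in place before the worker processes are created.)

import os

# Assumption: setting this before importing vllm ensures the workers
# are started with "spawn" instead of "fork".
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM  # imported only after the variable is set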
The problem is that the calling script then runs the script again as a child process, which is not the desired behavior. And, as expected, the following error is thrown:
================================================================================
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:

    if __name__ == '__main__':
        freeze_support()
        ...

The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
================================================================================
The code that loads the LLM is along these lines:
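(A minimal sketch rather than my full script; the model path, dtype, and tensor_parallel_size are taken from the engine log below, everything else is simplified.)

from vllm import LLM


def load_llm():
    # Mirrors the engine config in the log below: TP=2, bfloat16, seed=0.
    return LLM(
        model="models/meta-llama/Meta-Llama-3.1-8B-Instruct",
        tensor_parallel_size=2,
        dtype="bfloat16",
        seed=0,
    )


if __name__ == "__main__":
    # The guard that the freeze_support() traceback above is asking for.
    llm = load_llm()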
and for inference:
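(Again only a sketch; the prompt and sampling parameters here are placeholders, not my real ones. llm is the object created in the loading snippet above.)

from vllm import SamplingParams

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

# llm is the LLM instance created above.
outputs = llm.generate(["Hello, my name is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)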
More info:
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
INFO 11-20 13:28:39 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='models/meta-llama/Meta-Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='models/meta-llama/Meta-Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=models/meta-llama/Meta-Llama-3.1-8B-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
The same issue occurs on other GPUs as well (V100). What is the problem here?
I attach the results of the collect_env.py script.