Follow the tutorial and download Mistral-7B-Instruct.
vllm serve ~/finetuning/models/7B
INFO 08-22 14:29:18 api_server.py:339] vLLM API server version 0.5.4
INFO 08-22 14:29:18 api_server.py:340] args: Namespace(model_tag='/home/barbatus/finetuning/models/7B', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='/home/barbatus/finetuning/models/7B', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7fbd2bdbd510>)
WARNING 08-22 14:29:18 config.py:1454] Casting torch.bfloat16 to torch.float16.
INFO 08-22 14:29:18 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='/home/barbatus/finetuning/models/7B', speculative_config=None, tokenizer='/home/barbatus/finetuning/models/7B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/home/barbatus/finetuning/models/7B, use_v2_block_manager=False, enable_prefix_caching=False)
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
INFO 08-22 14:29:19 model_runner.py:720] Starting to load model /home/barbatus/finetuning/models/7B...
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Process Process-1:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/server.py", line 217, in run_rpc_server
server = AsyncEngineRPCServer(async_engine_args, usage_context, port)
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/server.py", line 25, in __init__
self.engine = AsyncLLMEngine.from_engine_args(async_engine_args,
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 471, in from_engine_args
engine = cls(
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 381, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 552, in _init_engine
return engine_class(*args, **kwargs)
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 249, in __init__
self.model_executor = executor_class(
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
self._init_executor()
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 36, in _init_executor
self.driver_worker.load_model()
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 139, in load_model
self.model_runner.load_model()
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 722, in load_model
self.model = get_model(model_config=self.model_config,
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
return loader.load_model(model_config=model_config,
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 327, in load_model
model.load_weights(
File "/home/barbatus/.local/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 513, in load_weights
param = params_dict[name]
KeyError: 'layers.0.attention.wk.weight'
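The failing key suggests the checkpoint stores weights under the original Mistral naming scheme (layers.N.attention.wk.weight), while vLLM's Llama loader looks up Hugging Face-style names (model.layers.N.self_attn.k_proj.weight). A minimal sketch to check which convention a shard uses; the shard filename below is an assumption, adjust it to whatever file actually sits in the model directory:
# Sketch: print a few tensor names from the checkpoint to see which naming it uses.
# The shard filename is an assumption; adjust it to the file in ~/finetuning/models/7B.
from pathlib import Path
from safetensors import safe_open

shard = Path.home() / "finetuning/models/7B/model.safetensors"  # assumed filename
with safe_open(str(shard), framework="pt") as f:
    for i, name in enumerate(f.keys()):
        print(name)
        if i >= 9:
            break
# Hugging Face naming looks like: model.layers.0.self_attn.k_proj.weight
# Original Mistral naming looks like: layers.0.attention.wk.weight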
Expected Behavior
The model should load and the vLLM API server should start.
Additional Context
I am trying to launch vLLM with my fine-tuned model. Before plugging in the LoRA layers, I am trying to serve the base model, but it fails with the KeyError above. Can you help?
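In case it helps narrow things down, this is the minimal offline load I would expect to hit the same weight-loading path as the server (just a sketch, using the same model directory):
# Sketch: load the same directory with vLLM's offline LLM class to rule out
# anything specific to the OpenAI-compatible server frontend.
from vllm import LLM

llm = LLM(model="/home/barbatus/finetuning/models/7B")
print(llm.generate("Hello"))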
Thank you
Suggested Solutions
No response