Compatible with Decapoda Research llama hf version #251
Merged
zhuohan123 merged 1 commit into vllm-project:main on Jun 26, 2023
Conversation
For the Decapoda Research llama HF version, the model's config.json contains:
"architectures": ["LLaMAForCausalLM"]
This spelling can be treated as an alias for the Llama architecture.
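In practice, this means treating the legacy "LLaMAForCausalLM" spelling as an alias for the canonical Llama architecture when resolving config.json. A minimal sketch of that idea, assuming a simple registry dict (the names below are illustrative, not the exact vLLM internals):

```python
# Illustrative sketch: map both architecture spellings to the same model class.
from transformers import AutoConfig

_ARCH_ALIASES = {
    "LlamaForCausalLM": "LlamaForCausalLM",   # canonical HF spelling
    "LLaMAForCausalLM": "LlamaForCausalLM",   # legacy Decapoda Research spelling
}


def resolve_architecture(model_name_or_path: str) -> str:
    """Return the canonical architecture name for a checkpoint's config.json."""
    config = AutoConfig.from_pretrained(model_name_or_path)
    for arch in config.architectures or []:
        if arch in _ARCH_ALIASES:
            return _ARCH_ALIASES[arch]
    raise ValueError(f"Unsupported architectures: {config.architectures}")
```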
zhuohan123 approved these changes on Jun 26, 2023
Member
zhuohan123 left a comment
LGTM! Thank you for your contribution to vLLM!
michaelfeil pushed a commit to michaelfeil/vllm that referenced this pull request on Jul 1, 2023
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request on Feb 13, 2024
yukavio pushed a commit to yukavio/vllm that referenced this pull request on Jul 3, 2024
SUMMARY:
* make `magic-wand` version check robust
TEST PLAN:
Runs on remote push. NIGHTLY and RELEASE will be triggered manually against this branch.
```bash
andy@waldorf:~$ cat test.sh
#!/bin/bash
set -euo pipefail
MAGIC_WAND=$(pip3 show nm-magic-wand-nightly | grep "Version" | cut -d' ' -f2) || echo "nightly not installed"
if [ -z "$MAGIC_WAND" ]; then
MAGIC_WAND=$(pip3 show nm-magic-wand | grep "Version" | cut -d' ' -f2)
fi
echo ${MAGIC_WAND}
andy@waldorf:~$ ./test.sh
WARNING: Package(s) not found: nm-magic-wand-nightly
nightly not installed
0.2.2
andy@waldorf:~$ echo $?
0
```
when "nightly" is installed ...
```bash
andy@waldorf:~$ ./test.sh
0.2.2.20240520
```
---------
Co-authored-by: andy-neuma <andy@neuralmagic.com>
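For comparison, the same nightly-then-stable fallback can be written in Python with importlib.metadata. This is only an illustrative sketch; the helper function is hypothetical and just the package names come from the script above.

```python
# Illustrative sketch mirroring test.sh: prefer the nightly package, fall back
# to the stable one, and fail loudly only if neither is installed.
from importlib.metadata import PackageNotFoundError, version


def magic_wand_version() -> str:
    for package in ("nm-magic-wand-nightly", "nm-magic-wand"):
        try:
            return version(package)
        except PackageNotFoundError:
            continue
    raise RuntimeError("neither nm-magic-wand-nightly nor nm-magic-wand is installed")


if __name__ == "__main__":
    print(magic_wand_version())
```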
jikunshang pushed a commit to jikunshang/vllm that referenced this pull request on Sep 11, 2024
This PR fixes crashes observed on older Synapse builds, introduced with HabanaAI#227. Setting PT_COMPILE_ONLY_MODE is not supported in current or older public Synapse builds, but we should not crash because of it; instead, we should advise the user to use the latest build.

Previous behavior:

```
...
INFO 09-06 17:08:37 habana_executor.py:85] # HPU blocks: 10761, # CPU blocks: 910
INFO 09-06 17:08:37 habana_worker.py:201] Initializing cache engine took 47.29 GiB of device memory (54.34 GiB/94.62 GiB used) and -159.6 MiB of host memory (414.9 GiB/1007 GiB used)
[rank0]: Traceback (most recent call last):
[rank0]:   File "/software/users/kzawora/vllm-utils/vllm_hpu_simple_test.py", line 9, in <module>
[rank0]:     llm = LLM(model="facebook/opt-125m")
[rank0]:   File "/software/users/kzawora/vllm-fork/vllm/entrypoints/llm.py", line 155, in __init__
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(
[rank0]:   File "/software/users/kzawora/vllm-fork/vllm/engine/llm_engine.py", line 456, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/software/users/kzawora/vllm-fork/vllm/engine/llm_engine.py", line 266, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/software/users/kzawora/vllm-fork/vllm/engine/llm_engine.py", line 378, in _initialize_kv_caches
[rank0]:     self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]:   File "/software/users/kzawora/vllm-fork/vllm/executor/habana_executor.py", line 89, in initialize_cache
[rank0]:     self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]:   File "/software/users/kzawora/vllm-fork/vllm/worker/habana_worker.py", line 202, in initialize_cache
[rank0]:     self._warm_up_model()
[rank0]:   File "/software/users/kzawora/vllm-fork/vllm/worker/habana_worker.py", line 220, in _warm_up_model
[rank0]:     self.model_runner.warmup_model(self.hpu_cache[0])
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/software/users/kzawora/vllm-fork/vllm/worker/habana_model_runner.py", line 1412, in warmup_model
[rank0]:     with compile_only_mode_context():
[rank0]:   File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
[rank0]:     return next(self.gen)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/internal/bridge_config.py", line 20, in env_setting
[rank0]:     get_func = globals()['get_' + var.lower()]
[rank0]: KeyError: 'get_pt_compile_only_mode'
inc shutdown
inc shutdown
inc shutdown
inc shutdown
```

Current behavior:

```
...
INFO 09-06 17:06:42 habana_executor.py:85] # HPU blocks: 10761, # CPU blocks: 910
INFO 09-06 17:06:43 habana_worker.py:201] Initializing cache engine took 47.29 GiB of device memory (54.34 GiB/94.62 GiB used) and -143.7 MiB of host memory (415 GiB/1007 GiB used)
WARNING 09-06 17:06:43 habana_model_runner.py:1419] Cannot use PT_COMPILE_ONLY_MODE. Warmup time will be negatively impacted. Please update Gaudi Software Suite.
INFO 09-06 17:06:43 habana_model_runner.py:1336] [Warmup][Prompt][1/23] batch_size:2 seq_len:1024 free_mem:40.28 GiB
...
```
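The gist of the fix described above is to degrade gracefully: if the compile-only setting is unavailable, log a warning and continue instead of crashing during warmup. A minimal sketch of that pattern, assuming a placeholder `enter_compile_only_mode` callable that stands in for the Synapse bridge_config context manager seen in the traceback (this is an illustration, not the actual vllm-fork code):

```python
# Illustrative "warn instead of crash" wrapper; `enter_compile_only_mode`
# is a placeholder for the real Synapse compile-only-mode context manager.
import contextlib
import logging

logger = logging.getLogger(__name__)


@contextlib.contextmanager
def maybe_compile_only_mode(enter_compile_only_mode):
    """Enable compile-only mode when supported; otherwise warn and proceed."""
    with contextlib.ExitStack() as stack:
        try:
            stack.enter_context(enter_compile_only_mode())
        except KeyError:
            # Older public Synapse builds do not expose this setting.
            logger.warning(
                "Cannot use PT_COMPILE_ONLY_MODE. Warmup time will be "
                "negatively impacted. Please update Gaudi Software Suite.")
        yield
```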
billishyahao pushed a commit to billishyahao/vllm that referenced this pull request on Dec 31, 2024
* ROCm support for the MoE tuning script
  - add the ROCm Triton search space and pruning
  - Ray fix: use the device id for multi-GPU tuning
* use current_platform.is_rocm(), not is_navi()
wuhuikx pushed a commit to wuhuikx/vllm that referenced this pull request on Mar 27, 2025
…ct#251) (vllm-project#270)

### What this PR does / why we need it?
Backport: vllm-project/vllm-ascend#251

Add a dispatch job to assign jobs to devices dynamically. It includes the two stages below; the dispatch job spends roughly an extra `10s * parallel number + 30s` waiting for other jobs to launch their containers and release the lock.

- **Stage 1: Acquire lock.** Add a dispatch job that uses lockfile to acquire locks and then obtains a device number dynamically.
- **Stage 2.1: Launch container with dynamic device.** Pass the device number via the job output and start the container job with that device.
- **Stage 2.2: Release lock.** Once the job has started, release the lock.

In the backend, we use multiple paths to set up multiple self-hosted runners as a load balancer:

```
$ pwd
/home/action
$ ll | grep actions
drwx------ 6 action action 4096 Mar 7 08:55 actions-runner-01
drwx------ 6 action action 4096 Mar 7 08:55 actions-runner-02
drwx------ 6 action action 4096 Mar 7 08:55 actions-runner-03
drwx------ 6 action action 4096 Mar 7 08:56 actions-runner-04
drwx------ 4 action action 4096 Jan 24 22:08 actions-runner-05
drwx------ 4 action action 4096 Jan 24 22:08 actions-runner-06
```

```
adduser -G docker action
su action
pip3 install docker prettytable
sudo yum install procmail
```

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
- CI passed
- E2E tested manually by triggering 3 jobs in parallel:
  - [1st job](https://github.com/vllm-project/vllm-ascend/actions/runs/13711345757/job/38348309297) dispatched to /dev/davinci2
  - [2nd job](https://github.com/vllm-project/vllm-ascend/actions/runs/13711348739/job/38348316250) dispatched to /dev/davinci3
  - [3rd job](https://github.com/vllm-project/vllm-ascend/actions/runs/13711351493/job/38348324551) dispatched to /dev/davinci4

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
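The dispatch scheme above (grab one free device under a lock, hand its index to the container job, release once the job has started) can be sketched in a few lines. This is only an illustration of the locking idea, assuming flock-style file locks and a hypothetical helper; the real workflow uses the `lockfile` utility inside a GitHub Actions dispatch job, and only the device paths come from the test log above.

```python
# Illustrative sketch of per-device lock dispatch; not the actual CI code.
import fcntl
import os
from contextlib import contextmanager

DEVICES = ["/dev/davinci2", "/dev/davinci3", "/dev/davinci4"]


@contextmanager
def acquire_free_device(lock_dir="/tmp/npu-locks"):
    """Yield the first device whose lock file can be taken without blocking."""
    os.makedirs(lock_dir, exist_ok=True)
    for dev in DEVICES:
        lock_path = os.path.join(lock_dir, os.path.basename(dev) + ".lock")
        fd = os.open(lock_path, os.O_CREAT | os.O_RDWR)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            os.close(fd)
            continue  # device is busy, try the next one
        try:
            yield dev  # pass the device number to the container job
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
            os.close(fd)
        return
    raise RuntimeError("no free device available")
```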
amy-why-3459 pushed a commit to amy-why-3459/vllm that referenced this pull request on Sep 15, 2025
### What this PR does / why we need it?
Add a dispatch job to assign jobs to devices dynamically. It includes the two stages below; the dispatch job spends roughly an extra `10s * parallel number + 30s` waiting for other jobs to launch their containers and release the lock.

- **Stage 1: Acquire lock.** Add a dispatch job that uses lockfile to acquire locks and then obtains a device number dynamically.
- **Stage 2.1: Launch container with dynamic device.** Pass the device number via the job output and start the container job with that device.
- **Stage 2.2: Release lock.** Once the job has started, release the lock.

In the backend, we use multiple paths to set up multiple self-hosted runners as a load balancer:

```
$ pwd
/home/action
$ ll | grep actions
drwx------ 6 action action 4096 Mar 7 08:55 actions-runner-01
drwx------ 6 action action 4096 Mar 7 08:55 actions-runner-02
drwx------ 6 action action 4096 Mar 7 08:55 actions-runner-03
drwx------ 6 action action 4096 Mar 7 08:56 actions-runner-04
drwx------ 4 action action 4096 Jan 24 22:08 actions-runner-05
drwx------ 4 action action 4096 Jan 24 22:08 actions-runner-06
```

```
adduser -G docker action
su action
pip3 install docker prettytable
sudo yum install procmail
```

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
- CI passed
- E2E tested manually by triggering 3 jobs in parallel:
  - [1st job](https://github.com/vllm-project/vllm-ascend/actions/runs/13711345757/job/38348309297) dispatched to /dev/davinci2
  - [2nd job](https://github.com/vllm-project/vllm-ascend/actions/runs/13711348739/job/38348316250) dispatched to /dev/davinci3
  - [3rd job](https://github.com/vllm-project/vllm-ascend/actions/runs/13711351493/job/38348324551) dispatched to /dev/davinci4

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
iwooook pushed a commit to moreh-dev/vllm that referenced this pull request on Nov 29, 2025
…llm-project#251)

* Check if the qwen-vl-utils import succeeded, and print a nice warning if not.
* Dependency added.

Co-authored-by: Salar Hosseini <159165450+skhorasganiTT@users.noreply.github.com>
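The import guard mentioned in that commit follows a common pattern; a minimal sketch, assuming the flag name and warning text (only the qwen-vl-utils package name comes from the commit message):

```python
# Illustrative import guard: warn once instead of failing at import time.
import logging

logger = logging.getLogger(__name__)

try:
    import qwen_vl_utils  # noqa: F401
    HAS_QWEN_VL_UTILS = True
except ImportError:
    HAS_QWEN_VL_UTILS = False
    logger.warning(
        "qwen-vl-utils is not installed; Qwen-VL preprocessing helpers will "
        "be unavailable. Install it with `pip install qwen-vl-utils`.")
```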