[Bugfix] add truncate_prompt_tokens to work offline, directly from LLM class. #4598
yecohn wants to merge 3 commits into vllm-project:main
Conversation
@tdoublep can you help review this?

@simon-mo sure, will try to get to that later today
```python
def _validate_prompt(
    self,
    prompt: Optional[str] = None,
    prompt_token_ids: Optional[List[int]] = None,
    truncate_prompt_tokens: Optional[Annotated[int, Field(ge=1)]] = None
) -> Tuple[Optional[str], Optional[List[int]]]:
```
This function re-implements the same logic as the equivalent function in the completions API. It's worth thinking about whether we should implement it in a single place and re-use it accordingly. I think that would be preferable, unless there is a strong reason to have it duplicated.
I'm also studying #3512 to understand if that needs to be merged before this one.
Yes, this is a good point. Let me know what you think is best and I'll implement it.
For now let's just make sure that the behaviour is equivalent (see the below discussion re: using truncation from the tokenizer). I think it would make more sense to think about refactoring it after #3512 is merged because after that point the completions API will be using the tokenizer from the engine anyway.
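To make the refactoring discussion concrete, here is a minimal sketch of what a single shared helper could look like. The function name and exact signature are hypothetical, not existing vLLM APIs, and it assumes a tokenizer configured to truncate from the left (see the truncation-side discussion below).

```python
from typing import List, Optional, Tuple


def _truncate_prompt(
    tokenizer,
    prompt: Optional[str],
    prompt_token_ids: Optional[List[int]],
    truncate_prompt_tokens: Optional[int],
) -> Tuple[Optional[str], Optional[List[int]]]:
    """Hypothetical shared helper: limit the prompt to at most
    truncate_prompt_tokens tokens, mirroring the completions API."""
    if truncate_prompt_tokens is None:
        return prompt, prompt_token_ids
    if prompt_token_ids is None:
        # Text prompt: let the tokenizer truncate (assumes the tokenizer
        # was created with truncation_side="left").
        prompt_token_ids = tokenizer(
            prompt,
            truncation=True,
            max_length=truncate_prompt_tokens).input_ids
    else:
        # Caller supplied token ids: keep only the last N tokens.
        prompt_token_ids = prompt_token_ids[-truncate_prompt_tokens:]
    return prompt, prompt_token_ids
```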
```python
if truncate_prompt_tokens is not None:
    if prompt_token_ids is None:
        prompt_token_ids = self.llm_engine.tokenizer.tokenizer(
            prompt).input_ids
    prompt_token_ids = prompt_token_ids[-truncate_prompt_tokens:]
```
This isn't exactly the same behaviour as the equivalent code in the completions API. There, we use the tokenizer's built-in truncation when the user provides a text prompt, and only implement the truncation ourselves when the user provides prompt ids. I think it would be better to do the same here to be consistent.
Yes, indeed. One thing I noticed, though, is that calling the tokenizer with kwargs does a right truncation rather than a left one, effectively doing `prompt_token_ids[:truncate_prompt_tokens]`. I'll recheck to make sure I'm not mistaken.
Here is the traceback I got:
```
Preparing working directory (2s)
# Creating "/workspace/build/buildkite/vllm/ci"
$ cd /workspace/build/buildkite/vllm/ci
$ git clone -v -- https://github.com/vllm-project/vllm.git .
Cloning into '.'...
POST git-upload-pack (175 bytes)
POST git-upload-pack (gzip 2402 to 1225 bytes)
remote: Enumerating objects: 16905, done.
remote: Counting objects: 100% (3599/3599), done.
remote: Compressing objects: 100% (711/711), done.
remote: Total 16905 (delta 3181), reused 2987 (delta 2887), pack-reused 13306
Receiving objects: 100% (16905/16905), 8.71 MiB | 20.47 MiB/s, done.
Resolving deltas: 100% (12703/12703), done.
$ git clean -ffxdq
# Fetch and checkout pull request head from GitHub
$ git fetch -- origin refs/pull/4598/head
remote: Enumerating objects: 15, done.
remote: Counting objects: 100% (15/15), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 15 (delta 7), reused 13 (delta 5), pack-reused 0
Unpacking objects: 100% (15/15), 3.66 KiB | 749.00 KiB/s, done.
From https://github.com/vllm-project/vllm
 * branch            refs/pull/4598/head -> FETCH_HEAD
# FETCH_HEAD is now 968d5e9651ca28b90498e282e3095f9654ceebe0
$ git checkout -f 968d5e9
Note: switching to '968d5e9651ca28b90498e282e3095f9654ceebe0'.
You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.
If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:
  git switch -c <new-branch-name>
Or undo this operation with:
  git switch -
Turn off this advice by setting config variable advice.detachedHead to false
HEAD is now at 968d5e9 fix dependency injection
# Cleaning again to catch any post-checkout changes
$ git clean -ffxdq
# Checking to see if Git data needs to be sent to Buildkite
$ buildkite-agent meta-data exists buildkite:git:commit
Running commands
$ trap 'kill -- $$' INT TERM QUIT; cp -r ~/.ssh /workspace/.ssh && chmod -R 777 /workspace
Some containers had unknown exit statuses. Perhaps they were in ImagePullBackOff.
```
Regarding the build, I got an error: buildkite/ci/pr/distributed-tests-multiple-groups — Failed (exit status -10). Do you have any idea what it could be, and how I could debug it?
@yecohn Regarding the truncation-side issue: you are right that, by default, tokenizers truncate from the right. That's why we added this argument when calling get_tokenizer, to set the truncation side to left.

Re: the build error, I don't think it looks related to your changes.
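A small, self-contained illustration of the truncation-side behaviour described above, using a Hugging Face tokenizer directly (the model name is just an example):

```python
from transformers import AutoTokenizer

# Tokenizers truncate from the right by default; passing
# truncation_side="left" keeps the *last* max_length tokens instead,
# which is what prompt truncation wants here.
tok_right = AutoTokenizer.from_pretrained("gpt2")
tok_left = AutoTokenizer.from_pretrained("gpt2", truncation_side="left")

text = "one two three four five six seven eight"
print(tok_right(text, truncation=True, max_length=4).input_ids)  # first 4 tokens
print(tok_left(text, truncation=True, max_length=4).input_ids)   # last 4 tokens
```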
Hi @tdoublep, @yecohn, and @simon-mo, I'd like to see this feature implemented. In my case, I need the truncation to happen inside vLLM, and I would also like to experiment with the generated KV caches. Are there any plans to support this feature? Can I help in any way? Thanks!
Seems weird that we expose it in sampling_params for offline inference, but it doesn't work there.
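For reference, this is the offline usage the comment above seems to expect, with truncate_prompt_tokens passed through SamplingParams; whether it is actually honoured depends on the vLLM version, per the reports below. The model and prompt are placeholders.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # example model
params = SamplingParams(
    max_tokens=16,
    # Intended behaviour: keep only the last 128 prompt tokens.
    truncate_prompt_tokens=128,
)
outputs = llm.generate(["<a very long prompt>"], params)
print(outputs[0].outputs[0].text)
```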
Another footgun today! It seems offline prompt truncation with the vLLM engine is still not implemented as of vLLM 0.8.5.
The `truncate_prompt_tokens` feature remains non-functional in vLLM 0.11.0.
Fixes #4507.