
[Bugfix] add truncate_prompt_tokens to work offline, directly from LLM class.#4598

Closed
yecohn wants to merge 3 commits intovllm-project:mainfrom
yecohn:truncate_usage_pr

Conversation


@yecohn yecohn commented May 4, 2024

Fixes #4507.

  • Added a `_validate_prompt` function to make sure the prompt is in the right format and to run some validations.
  • `truncate_prompt_tokens` should truncate the prompt from the left, as seen in vllm/entrypoints/openai/serving_engine.py line 182.
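
The intended left-truncation behaviour can be sketched roughly as follows (a minimal standalone sketch; the actual PR adds this logic to the `LLM` class, and the helper name here is illustrative):

```python
from typing import List, Optional

def left_truncate(prompt_token_ids: List[int],
                  truncate_prompt_tokens: Optional[int]) -> List[int]:
    """Keep only the last `truncate_prompt_tokens` ids, mirroring the
    left-side truncation used by the OpenAI-compatible server."""
    if truncate_prompt_tokens is None:
        return prompt_token_ids
    return prompt_token_ids[-truncate_prompt_tokens:]

print(left_truncate([10, 11, 12, 13, 14], 3))  # [12, 13, 14]
```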


simon-mo commented May 4, 2024

@tdoublep can you help review this?


tdoublep commented May 6, 2024

@simon-mo sure, will try to get to that later today

Comment on lines +261 to +266
def _validate_prompt(
    self,
    prompt: Optional[str] = None,
    prompt_token_ids: Optional[List[int]] = None,
    truncate_prompt_tokens: Optional[Annotated[int, Field(ge=1)]] = None
) -> Tuple[Optional[str], Optional[List[int]]]:
Member

This function re-implements the same logic as this function. It's worth thinking about whether we should have it implemented in a single place and re-used accordingly. I think that would be preferable, unless there is some strong reason to have it duplicated.

Member

I'm also studying #3512 to understand if that needs to be merged before this one.

Contributor Author

Yes, this is a good point. Let me know what you think is best and I'll implement it.

Member

For now let's just make sure that the behaviour is equivalent (see the below discussion re: using truncation from the tokenizer). I think it would make more sense to think about refactoring it after #3512 is merged because after that point the completions API will be using the tokenizer from the engine anyway.

Comment on lines +273 to +277
if truncate_prompt_tokens is not None:
    if prompt_token_ids is None:
        prompt_token_ids = self.llm_engine.tokenizer.tokenizer(
            prompt).input_ids
    prompt_token_ids = prompt_token_ids[-truncate_prompt_tokens:]
Member

This isn't exactly the same behaviour as the equivalent code in the completion API. In that case, we use the built-in capability of the tokenizer to perform truncation in the case that the user provides a prompt, and only implement the truncation ourselves in the case when the user provides prompt ids. I think it would be better to do the same here to be consistent.
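
The dual-path behaviour described above can be sketched like this (a hypothetical helper, not vLLM's actual code; `tokenizer` stands in for a Hugging Face tokenizer configured with `truncation_side="left"`):

```python
from typing import List, Optional

def truncate_like_completions_api(
        tokenizer,
        prompt: Optional[str] = None,
        prompt_token_ids: Optional[List[int]] = None,
        truncate_prompt_tokens: Optional[int] = None) -> List[int]:
    """Text prompt: let the tokenizer truncate (from the left, given
    truncation_side="left"). Token-id prompt: slice manually."""
    if prompt_token_ids is None:
        return tokenizer(prompt,
                         truncation=True,
                         max_length=truncate_prompt_tokens).input_ids
    if truncate_prompt_tokens is None:
        return prompt_token_ids
    return prompt_token_ids[-truncate_prompt_tokens:]
```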

Contributor Author

@yecohn yecohn May 6, 2024

Yes, indeed. The thing I noticed, though, is that using the tokenizer with kwargs does a right truncation rather than a left one, basically doing `prompt_token_ids[:truncate_prompt_tokens]`. I'll recheck to make sure I'm not mistaken.
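
The difference between the two truncation sides, in plain slicing terms:

```python
ids = [101, 7592, 2088, 102, 999]

right_truncated = ids[:3]   # keeps the start of the prompt
left_truncated = ids[-3:]   # keeps the end (the behaviour wanted here)

print(right_truncated)  # [101, 7592, 2088]
print(left_truncated)   # [2088, 102, 999]
```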

Here is the build log I got:

```
Preparing working directory 2s
  # Creating "/workspace/build/buildkite/vllm/ci"
  $ cd /workspace/build/buildkite/vllm/ci
  $ git clone -v -- https://github.com/vllm-project/vllm.git .
  Cloning into '.'...
  POST git-upload-pack (175 bytes)
  POST git-upload-pack (gzip 2402 to 1225 bytes)
  remote: Enumerating objects: 16905, done.
  remote: Counting objects: 100% (3599/3599), done.
  remote: Compressing objects: 100% (711/711), done.
  remote: Total 16905 (delta 3181), reused 2987 (delta 2887), pack-reused 13306
  Receiving objects: 100% (16905/16905), 8.71 MiB | 20.47 MiB/s, done.
  Resolving deltas: 100% (12703/12703), done.
  $ git clean -ffxdq
  # Fetch and checkout pull request head from GitHub
  $ git fetch -- origin refs/pull/4598/head
  remote: Enumerating objects: 15, done.
  remote: Counting objects: 100% (15/15), done.
  remote: Compressing objects: 100% (10/10), done.
  remote: Total 15 (delta 7), reused 13 (delta 5), pack-reused 0
  Unpacking objects: 100% (15/15), 3.66 KiB | 749.00 KiB/s, done.
  From https://github.com/vllm-project/vllm
  * branch refs/pull/4598/head -> FETCH_HEAD
  # FETCH_HEAD is now 968d5e9651ca28b90498e282e3095f9654ceebe0
  $ git checkout -f 968d5e9
  Note: switching to '968d5e9651ca28b90498e282e3095f9654ceebe0'.

  You are in 'detached HEAD' state. You can look around, make experimental
  changes and commit them, and you can discard any commits you make in this
  state without impacting any branches by switching back to a branch.

  If you want to create a new branch to retain commits you create, you may
  do so (now or later) by using -c with the switch command. Example:

  git switch -c

  Or undo this operation with:

  git switch -

  Turn off this advice by setting config variable advice.detachedHead to false

  HEAD is now at 968d5e9 fix dependency injection
  # Cleaning again to catch any post-checkout changes
  $ git clean -ffxdq
  # Checking to see if Git data needs to be sent to Buildkite
  $ buildkite-agent meta-data exists buildkite:git:commit
  Running commands
  $ trap 'kill -- $$' INT TERM QUIT; cp -r ~/.ssh /workspace/.ssh && chmod -R 777 /workspace
  Some containers had unknown exit statuses. Perhaps they were in ImagePullBackOff.
```

Contributor Author

Regarding the build, I got an error: buildkite/ci/pr/distributed-tests-multiple-groups — Failed (exit status -10). Do you have any idea what it could be, and how I could debug it?

Member

@yecohn Regarding the truncation-side issue: you are right that, by default, tokenizers truncate from the right. That's why we added this argument when calling get_tokenizer, to set the truncation side to left.

Re: the build error, I don't think it looks related to your changes.

@yecohn yecohn closed this Jul 23, 2024
@yecohn yecohn deleted the truncate_usage_pr branch July 23, 2024 16:16
@JackChuang

Hi @tdoublep, @yecohn, and @simon-mo,
It looks like this feature has not been merged yet. According to my tests, truncate_prompt_tokens still does not work for offline inference.

I’d like to see this feature implemented. In my case, I need the truncation to occur in vllm, and I would also like to experiment with the generated KV caches.

Are there any plans to support this feature? Can I help in any way?
Alternatively, are there any other ways to perform prompt truncation while running offline inference in vLLM?
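
One possible workaround, until this is supported natively, is to truncate prompts manually before calling generate. This is a sketch under the assumption that `LLM.generate` accepts pre-tokenized prompts; the helper below is hypothetical and the model name is illustrative:

```python
from typing import List

def left_truncate_for_offline(tokenizer, prompt: str,
                              max_prompt_len: int) -> List[int]:
    """Tokenize and keep only the last max_prompt_len ids, so the most
    recent context survives; feed the result to LLM.generate."""
    ids = tokenizer(prompt).input_ids
    return ids[-max_prompt_len:]

# Usage sketch (not run here):
#   llm = LLM(model="facebook/opt-125m")
#   ids = left_truncate_for_offline(llm.get_tokenizer(), long_prompt, 512)
#   outputs = llm.generate(prompt_token_ids=[ids], sampling_params=params)
```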

Thanks,


allanj commented Dec 30, 2024

It seems weird that we accept it as part of sampling_params for offline inference, but it doesn't work there.

@YunfanZhang42

Another footgun today! It seems offline prompt truncation with the vLLM engine is still not implemented as of vLLM 0.8.5.

@Galaxy-Husky

The `truncate_prompt_tokens` feature remains non-functional in vLLM 0.11.0.



Development

Successfully merging this pull request may close these issues.

[Bug]: truncate_prompt_tokens in SamplingParams only available for openai entrypoints, not for offline vLLM engine
