Remove threadsafe #2907

Open · wants to merge 12 commits into main
Conversation

@grimoire (Collaborator) commented Dec 17, 2024

  • Thread-safe mode has been removed.
  • asyncio.Queue -> asyncio.Event (see the sketch below).
  • Better host performance.

Note that with this PR, the EOS token is included in the output.
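A rough illustration of the Queue -> Event change (names here are illustrative, not the actual engine internals): the producer appends outputs and sets an event; the consumer waits once and then drains everything, instead of awaiting the queue once per item.

```python
import asyncio


class RequestState:
    """Per-request state: the producer appends outputs, then sets the event."""

    def __init__(self):
        self.outputs = []
        self.event = asyncio.Event()

    def put(self, output):
        # Producer side: record the new output and wake any waiter.
        self.outputs.append(output)
        self.event.set()

    async def drain(self):
        # Consumer side: wait until at least one output has arrived, then take them all.
        await self.event.wait()
        self.event.clear()
        outputs, self.outputs = self.outputs, []
        return outputs
```

Waking on an event and draining a whole batch avoids one scheduling round-trip per output, which is presumably where the lower host overhead comes from.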

@lvhan028 (Collaborator)

We have users who use the PyTorch engine in a multi-threaded environment.
Please provide a guide for them on migrating to the non-thread-safe PyTorch engine.
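As a starting point for such a guide, one common pattern (placeholder names below, not LMDeploy's actual API) is to run a single event loop in a dedicated thread and submit generation coroutines to it from worker threads:

```python
import asyncio
import threading

# One event loop in a background thread; all engine work happens on this loop.
loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()


async def generate(prompt: str) -> str:
    # Stand-in for an async engine call such as async_engine.generate(...).
    await asyncio.sleep(0.01)
    return f'completion for: {prompt}'


def generate_from_any_thread(prompt: str) -> str:
    # Worker threads hand the coroutine to the engine's loop and block on the result.
    future = asyncio.run_coroutine_threadsafe(generate(prompt), loop)
    return future.result()


print(generate_from_any_thread('hello'))
```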

@lvhan028 (Collaborator)

Please add a WARNING stating that thread-safe mode has been removed.
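For example, something along these lines (the `thread_safe` keyword and the message below are only illustrative):

```python
import warnings


def warn_thread_safe_removed(thread_safe: bool = False) -> None:
    # Illustrative guard: warn callers that still pass the removed flag.
    if thread_safe:
        warnings.warn(
            'thread_safe mode of the PyTorch engine has been removed; '
            'please drive the engine through its async API from a single event loop.',
            DeprecationWarning,
            stacklevel=2,
        )
```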

@lvhan028 (Collaborator)

"Better host performance", so what's the performance now?

@grimoire (Collaborator, Author)

"Better host performance", so what's the performance now?

llama3-8b, tp=1, 3000 prompt, 256 concurrency

concurrency: 256
elapsed_time: 133.107s

first token latency(s)(min, max, ave): 0.119, 4.574, 0.621
per-token latency(s) percentile(50, 75, 95, 99): [0.028, 0.03, 0.284, 0.47]

number of prompt tokens: 676779
number of completion tokens: 612685
token throughput (completion token): 4602.956 token/s
token throughput (prompt + completion token): 9687.436 token/s
RPS (request per second): 22.538 req/s
RPM (request per minute): 1352.297 req/min

llama3-8b, tp=1, 10000 prompt, 512 concurrency

concurrency: 512
elapsed_time: 386.856s

first token latency(s)(min, max, ave): 0.259, 7.529, 0.823
per-token latency(s) percentile(50, 75, 95, 99): [0, 0.055, 0.894, 1.138]

number of prompt tokens: 2238358
number of completion tokens: 1995438
token throughput (completion token): 5158.094 token/s
token throughput (prompt + completion token): 10944.123 token/s
RPS (request per second): 25.849 req/s
RPM (request per minute): 1550.966 req/min

@lvhan028 (Collaborator)

"Note that EOS would be output in this PR."
@lzhangzz will the tm refactoring you are working on output the EOS and stop_token_id to async_engine?

@RunningLeon (Collaborator) left a comment

LGTM

@lzhangzz (Collaborator)

"Note that EOS would be output in this PR." @lzhangzz will the tm refactoring you are working on output the EOS and stop_token_id to async_engine?

We need to discuss how EOS/stop_token_ids should be skipped in the async engine.

For some models, EOS is part of the chat template; we may exclude the token from the response, but step should not be rewound (i.e., the token is kept in the kv cache).

However, for models like vicuna, EOS must be excluded from both the response and the kv cache (rewind step to the token before EOS).
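A schematic of the two policies (hypothetical helper, not the async engine's real interface):

```python
def finalize_output(token_ids, step, stop_token_id, keep_stop_in_cache):
    """Sketch of the two stop-token policies discussed above."""
    if not token_ids or token_ids[-1] != stop_token_id:
        return token_ids, step

    # In both cases the stop token is stripped from the response.
    trimmed = token_ids[:-1]

    if keep_stop_in_cache:
        # Case 1: EOS belongs to the chat template, so it stays in the kv cache
        # and step is unchanged.
        return trimmed, step
    # Case 2 (e.g. vicuna): EOS must also leave the kv cache, so rewind step
    # to the position just before it.
    return trimmed, step - 1
```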
