
feat(vLLM): support async generator #746

Open · wants to merge 2 commits into base: dev

Conversation

@fengyizhu fengyizhu commented Sep 9, 2024

I have provided a draft version of the vLLM integration, with the following key changes:

  1. Protocol changes: aligned with OpenAI /v1/audio/speech (a request sketch follows below).
  2. Removed the custom inference path, keeping the inference logic consistent between streaming and non-streaming.
  3. Fixed some potential inconsistencies caused by streaming inference.

These changes may have a significant impact. Feel free to leave comments to guide me in further improvements.
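
For reference, here is a minimal client sketch of the aligned endpoint. It is only a sketch under assumptions: the example API server from this PR is assumed to listen on http://localhost:8000 and to accept an OpenAI-style /v1/audio/speech payload, and the model and voice values below are placeholders, not verified names.

import requests

payload = {
    "model": "chattts",          # placeholder model name
    "input": "Hello from the streaming speech endpoint.",
    "voice": "default",          # placeholder voice / speaker id
    "response_format": "wav",
}

with requests.post(
    "http://localhost:8000/v1/audio/speech", json=payload, stream=True, timeout=600
) as resp:
    resp.raise_for_status()
    with open("speech.wav", "wb") as f:
        # When the server streams, audio bytes arrive chunk by chunk.
        for chunk in resp.iter_content(chunk_size=4096):
            if chunk:
                f.write(chunk)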

@github-actions github-actions bot changed the base branch from main to dev September 9, 2024 02:09
Member

@fumiama fumiama left a comment

  • Please note that vLLM cannot and will never replace the original infer code.
  • You should open a separate PR for the changes to examples/api instead of merging them here.

@fumiama fumiama added the enhancement (New feature or request), algorithm (Algorithm improvements & issues), and performance (Running speed & quality) labels Sep 9, 2024
Member

@fumiama fumiama left a comment

Please do not change the default behavior unless you must. If you have a reason for changing the behavior, please explain it.

Note: you shouldn't refer to the content written in the vLLM code and port it into main, because the vLLM code is contributed by the community and hasn't been fully checked.

@IrisSally
Contributor

This version is amazingly powerful. Testing on a 4090, the first chunk from the streaming API arrives in 65 ms, and the voice timbre stays fixed. It's so fast, so fast, that I suspected my computer was broken.

@ZaymeShaw
Contributor

Thanks for sharing. Can this version keep the same voice timbre as the original? This is mentioned in issue #640.

@fengyizhu
Author

fengyizhu commented Sep 9, 2024

Thanks for sharing. Can this version keep the same voice timbre as the original? This is mentioned in issue #640.

It is fully compatible in principle.

@LLongIsland

This version is amazingly powerful. Testing on a 4090, the first chunk from the streaming API arrives in 65 ms, and the voice timbre stays fixed. It's so fast, so fast, that I suspected my computer was broken.

On a 4090, text that takes 3.3 s with compile=True takes about 1.6 s with the main-branch vLLM acceleration, but the timbre can't be adjusted; with this PR's version the timbre can be adjusted, but it takes about 2.6 s.

Member

@fumiama fumiama left a comment

Let me say it again: your modification to examples/api concerns another feature, so you SHOULD move those changes to ANOTHER PR. Thanks for your understanding.

# Filter both rows and columns using slicing
yield new_wavs[:][:, keep_cols]
# Hack: check if there are any silent segments; if so, take the last segment. Otherwise, try waiting for another loop.
import librosa
Member

We don't want to introduce librosa. Modifying the original code makes sense, e.g. replace

keep_cols = np.sum(new_wavs != 0, axis=0) > 0

with

# pseudo code without testing, just a hint
keep_cols = np.sum(abs(new_wavs) > 1e-6, axis=0) > 0
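
For reference, a self-contained sketch of this tolerance-based filter (the new_wavs values below are toy data for illustration; the 1e-6 threshold just follows the hint above and is not a tuned constant):

import numpy as np

# Toy batch: 2 waveforms x 8 samples, where the trailing columns are (near-)silent.
new_wavs = np.array([
    [0.10, -0.20, 0.05, 0.0, 1e-9, 0.0, 0.0, 0.0],
    [0.30,  0.15, 0.00, 0.0, 0.0,  0.0, 0.0, 0.0],
])

# Keep a column when any row has a sample whose magnitude exceeds the tolerance.
keep_cols = np.sum(np.abs(new_wavs) > 1e-6, axis=0) > 0

trimmed = new_wavs[:, keep_cols]
print(trimmed.shape)  # (2, 3): only the leading non-silent columns remain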

),
]

emb = self.embed(input_ids, text_mask)
Member

If you move a line up, don't change the variable name when keeping the original one is fine. Renaming makes it difficult to see your changes.

Member

@fumiama fumiama left a comment

I noticed that you had some trouble moving your modifications to examples/api and tools/audio into another PR 😂. Here are brief instructions.

  1. Add a new commit to this PR removing examples/api/main.py and tools/audio/np.py.
  2. Create a new branch based on the dev branch of THIS MAIN REPO.
  3. Apply your modifications to examples/api/main.py and tools/audio/np.py on that new branch.
  4. Open a new PR from that branch to dev.

@fumiama fumiama changed the title feat:support async vllm generator feat(vLLM): support async generator Sep 12, 2024
@fumiama fumiama mentioned this pull request Sep 15, 2024
@niuzheng168
Contributor

niuzheng168 commented Sep 19, 2024

I am curious why we chose to move the vLLM code into this project instead of supporting the LLM part in vLLM.
Supporting the model in vLLM would make such features automatically available, like streaming generation and continuous batching.

Also pinging @fumiama for this question.

@fumiama
Member

fumiama commented Sep 19, 2024

I am curious why we chose to move the vLLM code into this project instead of supporting the LLM part in vLLM.

Supporting the model in vLLM would make such features automatically available, like streaming generation and continuous batching.

Also pinging @fumiama for this question.

This part of the code was contributed by the community, so you can ask @ylzz1997 about it. I'm sorry, but I'm not familiar with vLLM 😂.

@ylzz1997
Contributor

I am curious why we chose to move the vLLM code into this project instead of supporting the LLM part in vLLM. Supporting the model in vLLM would make such features automatically available, like streaming generation and continuous batching.

Also pinging @fumiama for this question.

Because vLLM doesn't support some features:

  1. a custom lm-head
  2. a multi-codebook sampler (custom sampler)
  3. sampling without a tokenizer

For the sampler to support vLLM features such as continuous batching and streaming generation, the post-model (lm_head and others) has to be packaged into the vLLM scheduler. This requires rewriting the vLLM part of the code.

That's all.
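
For readers unfamiliar with the points above, here is a minimal PyTorch sketch of a multi-codebook lm-head; the class name, hidden size, and codebook sizes are illustrative placeholders, not taken from the ChatTTS or vLLM sources. Each decoding step yields one token per codebook instead of a single token, which is the part a stock vLLM sampler does not handle.

import torch
import torch.nn as nn

class MultiCodebookHead(nn.Module):
    """Illustrative multi-head lm_head: one logits projection per audio codebook."""

    def __init__(self, hidden_size: int, codebook_size: int, num_codebooks: int):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(hidden_size, codebook_size, bias=False)
            for _ in range(num_codebooks)
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, hidden_size) for the current decoding step.
        # Returns (batch, num_codebooks, codebook_size) logits, so sampling yields
        # one token per codebook per step rather than a single int token id.
        return torch.stack([head(hidden) for head in self.heads], dim=1)

# Placeholder sizes: 4 codebooks of 626 entries over a 768-dim hidden state.
head = MultiCodebookHead(hidden_size=768, codebook_size=626, num_codebooks=4)
logits = head(torch.randn(2, 768))
tokens = logits.argmax(dim=-1)  # shape (2, 4): one token id per codebook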

@fumiama
Member

fumiama commented Sep 20, 2024


Thanks for your explanation.

@niuzheng168
Contributor


Yes, we cannot use vLLM directly, as it requires some code changes. I have implemented a solution here; I am new to Torch and Python, so feel free to leave any comments.
It covers only the llama part:
Model: https://github.com/niuzheng168/vllm/blob/dev/vllm/model_executor/models/chattts.py
Sample usage: https://github.com/niuzheng168/vllm/blob/dev/chattts_sample.py

For the issues you listed above:

  • We can create multiple lm_heads.
  • I noticed it runs slowly when we run sampling N times, so I made a lite version of the multi-head sampler.
  • Sampling without a tokenizer is already supported in vLLM by setting detokenize=False (see the sketch after this comment).

One of the main challenges is that vLLM assumes the model output is always a single token, which is just an int value. However, a TTS system, whether chattts or fishtts, generates multi-head tokens in one decoding step. This means the model output is a token list, which breaks the fundamental design. I had to use many if/else statements to keep the whole pipeline working.

Overall, compared to moving the vLLM code here, implementing the model in vLLM will save effort on features beyond core model inference, like sampling, scheduling, continuous batch processing, etc.
I have also replied in the vLLM roadmap thread to see whether vLLM can support the model officially; I believe more and more multi-modal models will use a similar model architecture, especially gpt-4o-like models.
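
As a concrete illustration of the detokenize point above, a minimal vLLM usage sketch (the model id is a placeholder; ChatTTS itself is not a stock vLLM model and still needs the custom code discussed in this thread):

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model id, not ChatTTS
params = SamplingParams(max_tokens=32, detokenize=False)

outputs = llm.generate(["An example prompt"], params)
# With detokenize=False only raw token ids are produced; no tokenizer-based text.
token_ids = outputs[0].outputs[0].token_ids
print(token_ids)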

@fumiama
Member

fumiama commented Oct 1, 2024


Thanks for your great effort. We welcome you to contribute your code to this repo if you like.

Labels
algorithm (Algorithm improvements & issues) · enhancement (New feature or request) · good first issue (Good for newcomers) · performance (Running speed & quality)

7 participants