Support v1/chat/completions #50
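For context, a minimal sketch of what a request to an OpenAI-compatible `v1/chat/completions` endpoint looks like; the base URL, port, and model name below are placeholders, not taken from this PR:

```python
# Minimal sketch of an OpenAI-compatible chat completions request.
# The base URL, port, and model name are placeholders/assumptions.
import requests

payload = {
    "model": "my-chat-model",  # placeholder model name
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    "temperature": 0.7,
}

resp = requests.post("http://localhost:30000/v1/chat/completions", json=payload)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```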
Conversation
For the chat template, can we use the Hugging Face tokenizer by default? https://huggingface.co/docs/transformers/main/en/chat_templating#how-do-i-use-chat-templates
Good idea. Let me add this.
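For context, the approach being suggested renders a conversation with the tokenizer's built-in chat template, roughly like the sketch below; the model name is only an illustrative choice.

```python
# Sketch of rendering a prompt with the tokenizer's built-in chat
# template, per the linked Hugging Face docs; the model name is
# only an illustrative choice.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

# Render the messages into a single prompt string and append the
# assistant header so generation continues as the assistant.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```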
This template is actually incorrect for this model, so you will get the following response in the unit test:

The following response is expected with the ChatML template:

Another issue is
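For readers less familiar with ChatML, here is a rough, illustrative sketch of the format being referred to; the exact special tokens and whitespace come from the model's own template, so treat this only as an approximation:

```python
# Rough illustration of ChatML-style formatting. The exact special
# tokens and whitespace depend on the model's own chat template,
# so this is only an approximation, not the template used in this PR.
def render_chatml(messages):
    prompt = ""
    for m in messages:
        prompt += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    # End with the assistant header so the model continues from there.
    prompt += "<|im_start|>assistant\n"
    return prompt


print(render_chatml([{"role": "user", "content": "Hello!"}]))
```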
@@ -0,0 +1,381 @@
# Adapted from
# https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py
We can consider importing instead of copying later.
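As a rough sketch of what that could look like if FastChat were taken as a dependency instead of copying the file (the template name below is only illustrative):

```python
# Sketch of importing FastChat's conversation templates instead of
# vendoring conversation.py; assumes fastchat is installed and that
# the chosen template name exists in its registry (illustrative here).
from fastchat.conversation import get_conv_template

conv = get_conv_template("chatml")  # illustrative template name
conv.append_message(conv.roles[0], "Hello!")
conv.append_message(conv.roles[1], None)  # placeholder for the reply
print(conv.get_prompt())
```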
…roject#50)

* [Transpiler] Add config class and interface for the transpiler
* [Misc] Add a static checking for STensor::num_elements
* [Transpiler] Implement all kernel operators and their runtime
* Format code
* Add support for broadcast in ElemwiseBinaryKernel
* Refine Python interface for transpiler & Add layout resolve for DTensor
* Finish layout resolution for threadblock level ops
* Add basic support for threadblock level transpiling
* Merge Python JIT frontend (sgl-project#40). Add support for Python JIT frontend
* temp frontend
* bug fix
* bug fix
* frontend with output shapes/strides
* add jit demo
* Add checking for MIRAGE_ROOT

---------

Co-authored-by: Shengyu Liu <[email protected]>

* Add unit test for the transpiler
* Add document for the transpiler
* Add threadblock level matmul operator
* Add threadblock-level reduction op
* clang format
* nits
* Remove nonexist examples
* Merge from main
* Add tb scheduling & Add support for forloop accumulator
* Add support for tb operator fusion
* TB_FORLOOP_ACCUM_OP->TB_FORLOOP_ACCUM_NO_RED_OP
* Refine documents
* nits
* nits
* Add support for chunked copy and async copy
* fix Python JIT compilation errors
* [CUDA Transpiler] Fix Python JIT compilation errors (sgl-project#51)
* fix Python JIT compilation errors
* Remove __getattr__ from wrapper

---------

Co-authored-by: SpiritedAwayCN <[email protected]>

* checkpoint
* Bugfix in async copy
* Optimize matmul
* Add support for output chunked copy
* Python interface for creating threadblock graph
* rename python objects
* Optimize ClearAccumulatorKernel
* Add testcases for IO
* Optimize threadblock input ops
* support customized
* Small optimization
* Change memory alignment to 128B for dtensor and stensors
* bug fixes
* remove the Py prefix for tensor objects in Python
* fix typo
* fix typo
* Modify lib.h to adapt to PR sgl-project#53
* Bugfix in testcase
* Support in-register accumulation for matmul
* Rewrite tb scheduling
* Add support for advanced memory planning interface & algos
* Allocate software pipeline buffers in memory planner too
* Code formatting
* Fix test script
* Add doc for TB scheduling and memory planning
* Add doc for register-backed accumulator
* Refine tb elementwise binary operator for broadcast support
* Add test for tb elementwise binary operator with broadcast
* Fix a subtle bug in matmul
* Add some comments
* Refine TB input and output ops (do not rely on stride)
* Refine TB reduction kernel to avoid using stride
* nits
* Refine matmul operator: do not rely on stride
* Slightly reorganize procedure in Transpiler
* Change matmul perf args to align with tb matmul perf
* Rename a file
* Add support for swizzling (XOR and SHIFT) (has a bug)
* Add doc for swizzling
* Add a test for the SHIFT swizzling
* Fix issue (sgl-project#56)
* Format boolean variable as `true` and `false` for better readability
* nits
* nits
* Hotfix CuTe's bug Ref: NVIDIA/cutlass#1766
* Bump Cutlass's version and remove the workaround in the last commit
* Remove debug info
* Update doc
* Bugfix
* [Transpiler] Add more Python examples for transpiler testing (sgl-project#57)
* bug fixes
* fix compile issue
* minor updates

---------

Co-authored-by: interestingLSY <[email protected]>
Co-authored-by: Chun'an Shi <[email protected]>
Co-authored-by: SpiritedAwayCN <[email protected]>
* remove vllm as a dependency revert registry
* Isort format
  Signed-off-by: Daniel Huang <[email protected]>
* Fix isort
  Signed-off-by: Daniel Huang <[email protected]>
* Fix skip vllm import logic
  Signed-off-by: Daniel Huang <[email protected]>
* Fix more vllm imports
  Signed-off-by: Daniel Huang <[email protected]>
* Update to a newer vllm-hpu-extension version that does not rely on vllm
  Signed-off-by: Daniel Huang <[email protected]>
* Add ray dep
  Signed-off-by: Daniel Huang <[email protected]>
* Add einops
  Signed-off-by: Daniel Huang <[email protected]>

---------

Signed-off-by: Daniel Huang <[email protected]>
Co-authored-by: Yang Wang <[email protected]>
Close #26