Hi @BBuf, could you please take a look and let me know which benchmark I should run?
"csrc/allreduce/custom_all_reduce.hip",
"csrc/allreduce/deterministic_all_reduce.hip",
"csrc/allreduce/quick_all_reduce.cu",
"csrc/common_extension_rocm.cc",
The JIT kernel does not support ROCm yet. Maybe just keep the original HIP code for now?
Hi @DarkSharpness, I've kept the HIP code and added a _IS_ROCM guard to allow non-NVIDIA GPUs to use the original AOT kernel. Please let me know if this meets the requirements.
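The guard described in this reply can be sketched roughly as follows. This is a minimal illustration rather than the PR's actual code: only the `_IS_ROCM` flag name comes from the comment above, while `select_activation_backend` and the two stub kernels are hypothetical names introduced for the example.

```python
from typing import Callable, List, Sequence

# Type alias for a kernel entry point (hypothetical, for illustration only).
Kernel = Callable[[Sequence[float]], List[float]]


def select_activation_backend(is_rocm: bool, jit_impl: Kernel, aot_impl: Kernel) -> Kernel:
    """Pick the AOT kernel on ROCm and the JIT kernel everywhere else."""
    return aot_impl if is_rocm else jit_impl


# Stand-ins for the real kernels; they only record which path was taken.
def jit_stub(x: Sequence[float]) -> List[float]:
    return [v * 1.0 for v in x]


def aot_stub(x: Sequence[float]) -> List[float]:
    return [v * 1.0 for v in x]


# Assumption: in the real module this flag is computed once at import time.
_IS_ROCM = True
activation = select_activation_backend(_IS_ROCM, jit_stub, aot_stub)
print(activation.__name__)
```

In a PyTorch setting such a flag is often derived from `torch.version.hip is not None`; whether this PR computes `_IS_ROCM` that way is not shown in the thread.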
@weimin023 Thanks for the contribution! I adapted this PR in #21766 (which should be a superset of this PR) and added you as co-author. Feel free to reopen the PR if something is still missing.
Motivation
Add JIT-compiled CUDA kernels for activation functions.
Modifications
Accuracy Tests
pytest /sgl-workspace/sglang/python/sglang/jit_kernel/tests/test_activation.py
platform linux -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0
rootdir: /sgl-workspace/sglang/python
configfile: pyproject.toml
plugins: anyio-4.12.1, typeguard-4.4.4
collected 48 items
python/sglang/jit_kernel/tests/test_activation.py ................................................ [100%]
============================== 48 passed in 29.37s ==============================
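The PR text does not say which activations `test_activation.py` covers or how it compares results. As a hedged illustration of what an accuracy check of this kind typically does, the sketch below assumes a SiLU-and-multiply gated activation and compares a candidate output against a plain Python reference within a tolerance. The names `silu_and_mul_ref` and `allclose` are hypothetical and are not taken from the repository.

```python
import math
from typing import List, Sequence


def silu(x: float) -> float:
    """SiLU (swish): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def silu_and_mul_ref(row: Sequence[float]) -> List[float]:
    """Reference gated activation (assumed layout: gate half, then up half).

    out[i] = silu(row[i]) * row[i + d], where d = len(row) // 2.
    """
    d = len(row) // 2
    return [silu(row[i]) * row[d + i] for i in range(d)]


def allclose(a: Sequence[float], b: Sequence[float],
             rtol: float = 1e-3, atol: float = 1e-3) -> bool:
    """Elementwise |a - b| <= atol + rtol * |b|, with matching lengths."""
    return len(a) == len(b) and all(
        abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b)
    )


# A kernel under test would produce `candidate`; here it is the reference
# itself plus a small perturbation, standing in for reduced-precision output.
row = [0.0, 1.0, 2.0, 3.0]
expected = silu_and_mul_ref(row)
candidate = [v + 1e-4 for v in expected]
print(allclose(candidate, expected))
```

The tolerances here are placeholders; a real test would choose them per dtype.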
Benchmarking and Profiling
Test the accuracy:
python3 -m sglang.test.few_shot_gsm8k --num-questions 200
Accuracy: 0.820
Invalid: 0.000
Latency: 10.294 s
Output throughput: 2829.049 token/s
Benchmark the speed:
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci