Merged
36 commits
62d6b1c
add HunYuan V3 model
mpjlu Mar 25, 2026
cf771ff
Merge remote-tracking branch 'upstream/main'
Qiaolin-Yu Mar 26, 2026
b52c5d3
perf optimization: tune moe kernels, moe two stream overlap, change m…
Qiaolin-Yu Apr 4, 2026
b6eb605
upd
Qiaolin-Yu Apr 7, 2026
05bc84c
naive ep
Qiaolin-Yu Apr 8, 2026
09cecfe
upd
Qiaolin-Yu Apr 8, 2026
e87be68
Merge pull request #2 from Qiaolin-Yu/mtp
Qiaolin-Yu Apr 8, 2026
373a554
Add Hunyuan tool call parser and reasoning parser
JustinTong0323 Apr 9, 2026
8819588
Merge pull request #3 from Qiaolin-Yu/hunyuan-tool-call-reasoning-parser
JustinTong0323 Apr 10, 2026
a753002
Merge branch 'upstream' into main
JustinTong0323 Apr 10, 2026
994a0d7
fix(hunyuan_v3): align model code with tencent HYV3Config changes
JustinTong0323 Apr 13, 2026
ecf82b4
add hy3 preview tuned moe triton config and usage document
Apr 16, 2026
ae9965b
Revert "Merge branch 'upstream' into main"
JustinTong0323 Apr 17, 2026
42c2e7e
Merge pull request #4 from Qiaolin-Yu/hunyuan-model-fixed
JustinTong0323 Apr 17, 2026
1a0e37c
Reapply "Merge branch 'upstream' into main"
JustinTong0323 Apr 17, 2026
c7eff03
Merge remote-tracking branch 'upstream/main' into revert-merge-main
JustinTong0323 Apr 17, 2026
6b085f1
Merge remote-tracking branch 'upstream/main' into revert-merge-main
JustinTong0323 Apr 17, 2026
fa56402
chore: apply pre-commit formatting
JustinTong0323 Apr 17, 2026
e8aea3d
fix doc
Apr 17, 2026
86557ce
feat(function_call): streaming HYV3 tool parser
JustinTong0323 Apr 20, 2026
83ff3c9
Merge pull request #5 from Qiaolin-Yu/feat/hunyuan-streaming-tool-parser
JustinTong0323 Apr 20, 2026
4238416
add HY3-preview H20-3e(H20 with 140GB) tuned moe config
Apr 20, 2026
7d27ca8
fix HY3-preview usage document and H20-3e(H20 with 140GB) bf16 tuned …
Apr 20, 2026
9713de8
rename HY3-Preview to Hy3-preview
Apr 21, 2026
e40081d
refine doc
Apr 21, 2026
fcfef0a
Hy3-preview doc add reasoning_effort introduction
Apr 21, 2026
dc83e31
fix Hy3_preview model read rope_theta from config bug
Apr 21, 2026
f3da589
Merge pull request #6 from Qiaolin-Yu/fix_hy3_preview_rope_theta
JustinTong0323 Apr 21, 2026
af4c9cb
Hy3-preview usage document from fp8 to bf16
Apr 22, 2026
644e525
update doc
Apr 22, 2026
035c03d
update doc
Apr 22, 2026
5fca30f
fix(reasoning): treat Hunyuan "no_think" reasoning_effort as non-reas…
JustinTong0323 Apr 22, 2026
dad3080
Merge pull request #8 from Qiaolin-Yu/fix/hunyuan-no-think
JustinTong0323 Apr 22, 2026
b108a5d
Merge branch 'main' into support-hy3-preview
JustinTong0323 Apr 23, 2026
b79b46b
style(hunyuan_v3): isort fix (PEP 8 two blank lines before class)
JustinTong0323 Apr 23, 2026
0553ce8
Merge branch 'main' into support-hy3-preview
JustinTong0323 Apr 23, 2026
`benchmark/kernels/fused_moe_triton/common_utils.py` — 4 additions, 0 deletions

```diff
@@ -129,6 +129,10 @@ def get_model_config(
         E = config.num_experts // ep_size
         topk = config.num_experts_per_tok
         intermediate_size = config.moe_intermediate_size
+    elif architecture == "HYV3ForCausalLM":
+        E = config.num_experts // ep_size
+        topk = config.num_experts_per_tok
+        intermediate_size = config.expert_hidden_dim
     elif architecture == "NemotronHForCausalLM":
         E = config.n_routed_experts // ep_size
         topk = config.num_experts_per_tok
```
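The new `HYV3ForCausalLM` branch follows the benchmark's existing per-architecture dispatch; the only difference is that HYV3 stores its expert FFN width under `expert_hidden_dim` instead of `moe_intermediate_size`. A minimal standalone sketch of that lookup (the attribute names come from the diff; the mock config values are purely illustrative):

```python
from types import SimpleNamespace

def moe_shape_for(architecture, config, ep_size=1):
    """Return (num_local_experts, topk, intermediate_size) for a MoE model.

    Mirrors the dispatch in get_model_config: with expert parallelism,
    each rank holds num_experts // ep_size experts.
    """
    if architecture == "HYV3ForCausalLM":
        return (
            config.num_experts // ep_size,
            config.num_experts_per_tok,
            config.expert_hidden_dim,
        )
    raise ValueError(f"unsupported architecture: {architecture}")

# Mocked HYV3-style config (illustrative values, not the real checkpoint's).
cfg = SimpleNamespace(num_experts=64, num_experts_per_tok=8, expert_hidden_dim=3072)
print(moe_shape_for("HYV3ForCausalLM", cfg, ep_size=4))  # (16, 8, 3072)
```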
`docs/basic_usage/hy3_preview.md` — 191 additions, 0 deletions (new file)
# Hy3-preview Usage

Hy3-preview is a large-scale mixture-of-experts language model (295B total parameters, 21B active) from the Tencent Hunyuan team. SGLang supports serving Hy3-preview. This guide describes how to run Hy3-preview with native BF16.

## Installation

### Docker

```bash
docker pull lmsysorg/sglang:hy3-preview
```

### Build from Source

```bash
# Install SGLang
git clone https://github.com/sgl-project/sglang
cd sglang
pip3 install pip --upgrade
pip3 install "transformers>=5.6.0"
pip3 install -e "python"
```
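Since the model needs a recent `transformers`, a quick stdlib-only check (a sketch; the `5.6.0` floor comes from the install command above) can catch a stale environment before launching the server:

```python
import importlib.metadata

def version_tuple(v: str) -> tuple:
    # Keep only the leading numeric components ("5.6.0.dev0" -> (5, 6, 0)).
    parts = []
    for p in v.split("."):
        if not p.isdigit():
            break
        parts.append(int(p))
    return tuple(parts)

def transformers_ok(required: str = "5.6.0") -> bool:
    """True if an installed transformers meets the required minimum."""
    try:
        installed = importlib.metadata.version("transformers")
    except importlib.metadata.PackageNotFoundError:
        return False
    return version_tuple(installed) >= version_tuple(required)

print(version_tuple("5.6.0"))  # (5, 6, 0)
```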

## Launch Hy3-preview with SGLang

Serve the [Hy3-preview](https://huggingface.co/tencent/Hy3-preview) model on 8 GPUs. On 8x 96 GB H20 GPUs, SGLang can barely fit the BF16 model and can only handle small batch sizes or short requests; use larger-memory GPUs such as the H20-3e (H20 with 140 GB) when possible.

```bash
python3 -m sglang.launch_server \
--model tencent/Hy3-preview \
--tp 8 \
--tool-call-parser hunyuan \
--reasoning-parser hunyuan \
--served-model-name hy3-preview
```

### EAGLE Speculative Decoding

**Description**: SGLang supports Hy3-preview models with [EAGLE speculative decoding](https://docs.sglang.io/advanced_features/speculative_decoding.html#eagle-decoding).

**Usage**:
Add `--speculative-algorithm`, `--speculative-num-steps`, `--speculative-eagle-topk`, and `--speculative-num-draft-tokens` to enable this feature. For example:

```bash
python3 -m sglang.launch_server \
--model tencent/Hy3-preview \
--tp 8 \
--tool-call-parser hunyuan \
--reasoning-parser hunyuan \
--speculative-num-steps 1 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 2 \
--speculative-algorithm EAGLE \
--served-model-name hy3-preview
```
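As a rule of thumb (an assumption about typical EAGLE configurations, not a constraint the server necessarily enforces), a chain-style draft with `--speculative-eagle-topk 1` verifies the drafted tokens plus the one token the target model emits, which is why the example pairs `--speculative-num-steps 1` with `--speculative-num-draft-tokens 2`:

```python
def chain_draft_tokens(num_steps: int) -> int:
    """Tokens verified per decode step for a topk=1 (chain) EAGLE draft:
    num_steps drafted tokens plus the one token the target model emits."""
    return num_steps + 1

print(chain_draft_tokens(1))  # 2, matching --speculative-num-draft-tokens 2
```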

## OpenAI Client Example

First, install the OpenAI Python client:

```bash
uv pip install -U openai
```

You can use the OpenAI client as follows to verify thinking-mode responses.

```python
from openai import OpenAI

# If running SGLang locally with its default OpenAI-compatible port:
# http://localhost:30000/v1
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:30000/v1"

client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello."},
]

# Thinking mode is disabled by default (no need to pass chat_template_kwargs).
resp = client.chat.completions.create(
model="hy3-preview",
messages=messages,
temperature=1,
max_tokens=4096,
)
print(resp.choices[0].message.content)

# Thinking mode is enabled only if 'reasoning_effort' and 'interleaved_thinking' are set in 'chat_template_kwargs'.
# 'reasoning_effort' supports: 'high', 'low', 'no_think'.
resp_think = client.chat.completions.create(
model="hy3-preview",
messages=messages,
temperature=1,
max_tokens=4096,
extra_body={
"chat_template_kwargs": {
"reasoning_effort": "high",
"interleaved_thinking": True
},
},
)
output_msg = resp_think.choices[0].message
# thinking content
print(output_msg.reasoning_content)
# response content
print(output_msg.content)
```
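Because the server is launched with `--tool-call-parser hunyuan`, the same client can issue OpenAI-style function calls. A sketch (the `get_weather` tool is a made-up example, and whether the model actually calls it depends on the prompt; the request itself requires a running server):

```python
import json

# Standard OpenAI function-calling payload; `get_weather` is a hypothetical tool.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def ask_with_tools(client, model="hy3-preview"):
    """Issue a function-calling request; needs a running SGLang server."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "What is the weather in Shenzhen?"}],
        tools=[weather_tool],
    )
    # tool_calls is None when the model answers directly instead of calling a tool.
    return resp.choices[0].message.tool_calls

print(json.dumps(weather_tool["function"]["name"]))
```

With the `client` from the example above, `ask_with_tools(client)` returns the parsed tool calls (name and JSON arguments), or `None` if the model answered in plain text.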

### cURL Usage

```bash
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "hy3-preview",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello."}
],
"temperature": 1,
"max_tokens": 4096
}'
```

## Benchmarking Results

For benchmarking, disable prefix caching by adding `--disable-radix-cache` to the server command.

The following example runs the benchmark on 8 H20 GPUs with 96 GB memory each.

```bash
python3 -m sglang.bench_serving \
--backend sglang \
--flush-cache \
--dataset-name random \
--random-range-ratio 1.0 \
--random-input-len 4096 \
--random-output-len 4096 \
--num-prompts 5 \
--max-concurrency 1 \
--output-file hy3_preview_h20.jsonl \
--model tencent/Hy3-preview \
--served-model-name hy3-preview
```

If successful, you will see output similar to the following (exact numbers vary with hardware and load).

```shell
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 5
Benchmark duration (s): 176.41
Total input tokens: 20480
Total input text tokens: 20480
Total generated tokens: 20480
Total generated tokens (retokenized): 20480
Request throughput (req/s): 0.03
Input token throughput (tok/s): 116.09
Output token throughput (tok/s): 116.09
Peak output token throughput (tok/s): 118.00
Peak concurrent requests: 2
Total token throughput (tok/s): 232.19
Concurrency: 1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 35279.06
Median E2E Latency (ms): 35275.60
P90 E2E Latency (ms): 35294.13
P99 E2E Latency (ms): 35294.41
---------------Time to First Token----------------
Mean TTFT (ms): 355.93
Median TTFT (ms): 309.28
P99 TTFT (ms): 518.36
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 8.53
Median TPOT (ms): 8.54
P99 TPOT (ms): 8.54
---------------Inter-Token Latency----------------
Mean ITL (ms): 8.53
Median ITL (ms): 8.54
P95 ITL (ms): 8.62
P99 ITL (ms): 8.74
Max ITL (ms): 31.70
==================================================
```
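The headline numbers above are internally consistent; a quick sanity check of the arithmetic (all values copied from the report — 5 prompts of 4096 input and 4096 output tokens each):

```python
duration_s = 176.41
input_tokens = output_tokens = 20480  # 5 prompts x 4096 tokens each

input_tps = input_tokens / duration_s
total_tps = (input_tokens + output_tokens) / duration_s
# Mean time per output token, excluding the first token (covered by TTFT).
mean_tpot_ms = (35279.06 - 355.93) / (4096 - 1)

print(round(input_tps, 2))     # ~116.09, matching the reported throughput
print(round(total_tps, 2))     # ~232.19
print(round(mean_tpot_ms, 2))  # ~8.53
```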