[feat] add Qwen3-ASR model support and related configurations by AgainstEntropy · Pull Request #22071 · sgl-project/sglang

AgainstEntropy · 2026-04-03T19:29:11Z

Motivation

#22025

Modifications

Introduced Qwen3-ASR model with configuration and processor classes.
Updated entry points to handle Qwen3-ASR in the transcription endpoint.
Enhanced multimodal processing to support Qwen3-ASR.
Added tests for Qwen3-ASR transcription functionality.
Updated existing files to include Qwen3ASR in relevant imports and configurations.
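
For anyone who wants to exercise the transcription endpoint by hand, here is a minimal client sketch using only the Python standard library. The host, port, served model name, and the `/v1/audio/transcriptions` route are taken from the test log in this PR. The multipart field names (`file`, `model`, `language`) and the `text` key in the JSON response are assumptions based on the OpenAI-style transcription API, and the helper names `build_transcription_request` and `transcribe` are hypothetical, not part of this PR.

```python
import json
import os
import urllib.request
import uuid


def build_transcription_request(
    audio_bytes,
    filename,
    base_url="http://127.0.0.1:21000",  # host/port from the test log
    model="qwen3-asr",                  # --served-model-name from the test log
    language=None,
):
    """Build a multipart/form-data POST for /v1/audio/transcriptions.

    Field names follow the OpenAI-style API (assumption, not confirmed here).
    """
    boundary = uuid.uuid4().hex
    fields = {"model": model}
    if language:
        fields["language"] = language

    parts = []
    for name, value in fields.items():
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            ).encode("utf-8")
        )
    # The audio payload goes in the "file" field as raw bytes.
    parts.append(
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode("utf-8")
        + audio_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))

    return urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions",
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def transcribe(audio_path, **kwargs):
    """Send a local audio file to a running server and return the transcript."""
    with open(audio_path, "rb") as f:
        req = build_transcription_request(
            f.read(), os.path.basename(audio_path), **kwargs
        )
    with urllib.request.urlopen(req) as resp:
        # Assumes an OpenAI-style JSON body with a "text" key.
        return json.load(resp)["text"]
```

With a server launched as in the log below, `transcribe("sample.wav", language="en")` would send one request. The file name `sample.wav` is only a placeholder.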

Accuracy Tests

Official qwen_asr (reference output)

$ python test_qwen3_asr_official.py 
Loading checkpoint shards: 100%|██████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.42it/s]
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
English
Uh huh. Oh yeah, yeah. He wasn't even that big when I started listening to him, but and his solo music didn't do overly well, but he did very well when he started writing for other people.
Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
Chinese
甚至出现交易几乎停滞的情况。

This PR (served through SGLang)

$ python test_qwen3_asr.py 
command=sglang serve --model-path Qwen/Qwen3-ASR-1.7B --served-model-name qwen3-asr --trust-remote-code --disable-cuda-graph --device cuda --host 127.0.0.1 --port 21000
CI_OFFLINE: Launching server HF_HUB_OFFLINE=0 model=Qwen/Qwen3-ASR-1.7B
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'mrope_section', 'mrope_interleaved', 'interleaved'}
`torch_dtype` is deprecated! Use `dtype` instead!
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
/data/yihao/repos/sglang/python/sglang/srt/entrypoints/http_server.py:172: FastAPIDeprecationWarning: ORJSONResponse is deprecated, FastAPI now serializes data directly to JSON bytes via Pydantic when a return type or response model is set, which is faster and doesn't need a custom response class. Read more in the FastAPI docs: https://fastapi.tiangolo.com/advanced/custom-response/#orjson-or-response-model and https://fastapi.tiangolo.com/tutorial/response-model/
  from sglang.srt.utils.json_response import (
[2026-04-03 19:21:26] server_args=ServerArgs(model_path='Qwen/Qwen3-ASR-1.7B', tokenizer_path='Qwen/Qwen3-ASR-1.7B', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=True, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=21000, fastapi_root_path='', grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_keyfile_password=None, enable_ssl_refresh=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, rl_quant_profile=None, mem_fraction_static=0.907, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=8192, enable_dynamic_chunking=False, max_prefill_tokens=16384, prefill_max_requests=None, schedule_policy='fcfs', enable_priority_scheduling=False, disable_priority_preemption=False, default_priority_value=None, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', enable_prefill_delayer=False, prefill_delayer_max_delay_passes=30, prefill_delayer_token_usage_low_watermark=None, prefill_delayer_forward_passes_buckets=None, prefill_delayer_wait_seconds_buckets=None, device='cuda', tp_size=1, pp_size=1, pp_max_micro_batch_size=None, pp_async_batch_depth=0, stream_interval=1, incremental_streaming_output=False, enable_streaming_session=False, random_seed=222246292, constrained_json_whitespace_pattern=None, 
constrained_json_disable_any_whitespace=False, watchdog_timeout=300, soft_watchdog_timeout=None, dist_timeout=None, download_dir=None, model_checksum=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, use_ray=False, custom_sigquit_handler=None, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, log_requests_format='text', log_requests_target=None, uvicorn_access_log_exclude_prefixes=[], crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_mfu_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, extra_metric_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', export_metrics_to_file=False, export_metrics_to_file_dir=None, api_key=None, admin_api_key=None, served_model_name='qwen3-asr', weight_version='default', chat_template=None, hf_chat_template_name=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', attn_cp_size=1, moe_dp_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, enable_lora_overlap_loading=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, experts_shared_outer_loras=None, attention_backend='fa3', decode_attention_backend=None, prefill_attention_backend=None, 
sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, fp8_gemm_runner_backend='auto', fp4_gemm_runner_backend='auto', nsa_prefill_backend=None, nsa_decode_backend=None, disable_flashinfer_autotune=False, mamba_backend='triton', speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_draft_attention_backend=None, speculative_moe_runner_backend='auto', speculative_moe_a2a_backend=None, speculative_draft_model_quantization=None, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_max_trie_depth=18, speculative_ngram_capacity=10000000, enable_multi_layer_eagle=False, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, enforce_disable_flashinfer_allreduce_fusion=False, enable_aiter_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm=None, init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, enable_elastic_expert_backup=False, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype=None, mamba_full_memory_ratio=0.9, mamba_scheduler_strategy='no_buffer', mamba_track_interval=256, linear_attn_backend='triton', linear_attn_decode_backend=None, 
linear_attn_prefill_backend=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_hisparse=False, hisparse_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, dllm_algorithm=None, dllm_algorithm_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=256, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256], disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_layerwise_nvtx_marker=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, pre_warm_nccl=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, disable_piecewise_cuda_graph=True, enforce_piecewise_cuda_graph=False, enable_torch_compile_debug_mode=False, torch_compile_max_bs=32, 
piecewise_cuda_graph_max_tokens=8192, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 576, 640, 704, 768, 832, 896, 960, 1024, 1280, 1536, 1792, 2048, 2304, 2560, 2816, 3072, 3328, 3584, 3840, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, enable_draft_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, enable_return_routed_experts=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_attn_tp_input_scattered=False, gc_threshold=None, enable_nsa_prefill_context_parallel=False, nsa_prefill_cp_mode='round-robin-split', enable_fused_qk_norm_rope=False, enable_precise_embedding_interpolation=False, enable_fused_moe_sum_all_reduce=False, enable_prefill_context_parallel=False, prefill_cp_mode='in-seq-split', enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, num_reserved_decode_tokens=512, 
disaggregation_decode_polling_interval=1, encoder_only=False, language_only=False, encoder_transfer_backend='zmq_to_scheduler', encoder_urls=[], enable_adaptive_dispatch_to_encoder=False, custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, remote_instance_weight_loader_backend='nccl', remote_instance_weight_loader_start_seed_via_transfer_engine=False, engine_info_bootstrap_port=6789, modelexpress_config=None, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, enable_broadcast_mm_inputs_process=False, enable_prefix_mm_cache=False, mm_enable_dp_encoder=False, mm_process_config={}, limit_mm_data_per_request=None, enable_mm_global_cache=False, decrypted_config_file=None, decrypted_draft_config_file=None, forward_hooks=None)
[2026-04-03 19:21:26] mp.set_executable /data/yihao/repos/sglang/.venv/bin/python3 -> /tmp/sglang_temp_file_1775244086.9585037_837831.sh (script='#!/bin/sh\nexec numactl --cpunodebind=0 --membind=0 /data/yihao/repos/sglang/.venv/bin/python3 "$@"')
[2026-04-03 19:21:26] mp.set_executable revert to /data/yihao/repos/sglang/.venv/bin/python3
2026-04-03 19:21:27.102 DEBUG Persistent cache disabled, using in-memory JIT cache
[2026-04-03 19:21:27] Persistent cache disabled, using in-memory JIT cache
2026-04-03 19:21:27.102 DEBUG Persistent cache disabled, using in-memory JIT cache
[2026-04-03 19:21:27] Persistent cache disabled, using in-memory JIT cache
2026-04-03 19:21:27.102 DEBUG Persistent cache disabled, using in-memory JIT cache
[2026-04-03 19:21:27] Persistent cache disabled, using in-memory JIT cache
2026-04-03 19:21:27.102 DEBUG Persistent cache disabled, using in-memory JIT cache
[2026-04-03 19:21:27] Persistent cache disabled, using in-memory JIT cache
2026-04-03 19:21:27.102 DEBUG Persistent cache disabled, using in-memory JIT cache
[2026-04-03 19:21:27] Persistent cache disabled, using in-memory JIT cache
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'mrope_section', 'mrope_interleaved', 'interleaved'}
[2026-04-03 19:21:27] thinker_config is None. Initializing Qwen3-ASR thinker with default values
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'mrope_section', 'mrope_interleaved', 'interleaved'}
[2026-04-03 19:21:27] thinker_config is None. Initializing Qwen3-ASR thinker with default values
Fetching 7 files: 100%|█████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 8279.79it/s]
Download complete: : 0.00B [00:00, ?B/s]                                                  | 0/7 [00:00<?, ?it/s]
[2026-04-03 19:21:28] No HuggingFace chat template found
[2026-04-03 19:21:28] No chat template found, defaulting to 'string' content format
[2026-04-03 19:21:34] NUMA affinity is already constrained for process, skipping NUMA node configuration for GPU. Remove your constraints to allow automatic configuration.
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'mrope_interleaved', 'interleaved', 'mrope_section'}
[2026-04-03 19:21:34] thinker_config is None. Initializing Qwen3-ASR thinker with default values
`torch_dtype` is deprecated! Use `dtype` instead!
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'mrope_interleaved', 'interleaved', 'mrope_section'}
[2026-04-03 19:21:34] thinker_config is None. Initializing Qwen3-ASR thinker with default values
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'mrope_interleaved', 'interleaved', 'mrope_section'}
[2026-04-03 19:21:34] thinker_config is None. Initializing Qwen3-ASR thinker with default values
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'mrope_interleaved', 'interleaved', 'mrope_section'}
[2026-04-03 19:21:34] thinker_config is None. Initializing Qwen3-ASR thinker with default values
Fetching 7 files: 100%|████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 10138.17it/s]
Download complete: : 0.00B [00:00, ?B/s]                                                  | 0/7 [00:00<?, ?it/s]
2026-04-03 19:21:35.519 DEBUG Persistent cache disabled, using in-memory JIT cache
[2026-04-03 19:21:35] Persistent cache disabled, using in-memory JIT cache
2026-04-03 19:21:35.519 DEBUG Persistent cache disabled, using in-memory JIT cache
[2026-04-03 19:21:35] Persistent cache disabled, using in-memory JIT cache
2026-04-03 19:21:35.519 DEBUG Persistent cache disabled, using in-memory JIT cache
[2026-04-03 19:21:35] Persistent cache disabled, using in-memory JIT cache
2026-04-03 19:21:35.519 DEBUG Persistent cache disabled, using in-memory JIT cache
[2026-04-03 19:21:35] Persistent cache disabled, using in-memory JIT cache
2026-04-03 19:21:35.519 DEBUG Persistent cache disabled, using in-memory JIT cache
[2026-04-03 19:21:35] Persistent cache disabled, using in-memory JIT cache
[2026-04-03 19:21:36] Init torch distributed begin.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2026-04-03 19:21:36] Init torch distributed ends. elapsed=0.24 s, mem usage=0.09 GB
[2026-04-03 19:21:36] Load weight begin. avail mem=139.12 GB
[2026-04-03 19:21:36] Multimodal attention backend not set. Use fa3.
[2026-04-03 19:21:36] Using fa3 as multimodal attention backend.
[2026-04-03 19:21:36] Found local HF snapshot for Qwen/Qwen3-ASR-1.7B at /home/radixark/.cache/huggingface/hub/models--Qwen--Qwen3-ASR-1.7B/snapshots/7278e1e70fe206f11671096ffdd38061171dd6e5; skipping download.
Multi-thread loading shards: 100% Completed | 2/2 [00:00<00:00,  4.40it/s]
[2026-04-03 19:21:37] Load weight end. elapsed=0.85 s, type=Qwen3ASRForConditionalGeneration, avail mem=135.21 GB, mem usage=3.91 GB.
[2026-04-03 19:21:37] Using KV cache dtype: torch.bfloat16
[2026-04-03 19:21:37] KV Cache is allocated. #tokens: 1144694, K size: 61.13 GB, V size: 61.13 GB
[2026-04-03 19:21:37] Memory pool end. avail mem=11.91 GB
[2026-04-03 19:21:37] Disable piecewise CUDA graph because --disable-piecewise-cuda-graph is set
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'mrope_interleaved', 'interleaved', 'mrope_section'}
[2026-04-03 19:21:37] thinker_config is None. Initializing Qwen3-ASR thinker with default values
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'mrope_interleaved', 'interleaved', 'mrope_section'}
[2026-04-03 19:21:37] thinker_config is None. Initializing Qwen3-ASR thinker with default values
Fetching 7 files: 100%|████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 74329.44it/s]
Download complete: : 0.00B [00:00, ?B/s]
[2026-04-03 19:21:38] max_total_num_tokens=1144694, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=4096, context_len=65536, available_gpu_mem=11.81 GB
[2026-04-03 19:21:38] INFO:     Started server process [2931704]
[2026-04-03 19:21:38] INFO:     Waiting for application startup.
[2026-04-03 19:21:38] Using default chat sampling params from model generation config: {'temperature': 1e-06}
[2026-04-03 19:21:38] INFO:     Application startup complete.
[2026-04-03 19:21:38] INFO:     Uvicorn running on http://127.0.0.1:21000 (Press CTRL+C to quit)
[2026-04-03 19:21:38] INFO:     127.0.0.1:53444 - "GET /health_generate HTTP/1.1" 503 Service Unavailable
[2026-04-03 19:21:39] INFO:     127.0.0.1:53446 - "GET /model_info HTTP/1.1" 200 OK
Download complete: : 0.00B [00:01, ?B/s]
2026-04-03 19:21:40,391 - CUTE_DSL - WARNING - [handle_import_error] - Unexpected error during package walk: cutlass.cute.experimental
[2026-04-03 19:21:40] Unexpected error during package walk: cutlass.cute.experimental
[2026-04-03 19:21:41] Prefill batch, #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 0.00
[2026-04-03 19:21:41] INFO:     127.0.0.1:53454 - "POST /generate HTTP/1.1" 200 OK
[2026-04-03 19:21:41] The server is fired up and ready to roll!
[2026-04-03 19:21:48] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 0.76
[2026-04-03 19:21:49] INFO:     127.0.0.1:34368 - "GET /health_generate HTTP/1.1" 200 OK
test_chinese_transcription (__main__.TestQwen3ASRTranscription.test_chinese_transcription)
Test Chinese audio transcription. ... [CI Test Method] TestQwen3ASRTranscription.test_chinese_transcription
[2026-04-03 19:21:50] Prefill batch, #new-seq: 1, #new-token: 65, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 0.68
[2026-04-03 19:21:50] INFO:     127.0.0.1:34384 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
[ZH Transcription] 甚至出现交易几乎停滞的情况。
ok
test_english_transcription (__main__.TestQwen3ASRTranscription.test_english_transcription)
Test English audio transcription. ... [CI Test Method] TestQwen3ASRTranscription.test_english_transcription
[2026-04-03 19:21:50] Prefill batch, #new-seq: 1, #new-token: 202, #cached-token: 4, token usage: 0.00, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 231.21
[2026-04-03 19:21:51] Decode batch, #running-req: 1, #token: 227, token usage: 0.00, cuda graph: False, gen throughput (token/s): 2.40, #queue-req: 0
[2026-04-03 19:21:51] INFO:     127.0.0.1:34400 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
[EN Transcription] Uh huh. Oh yeah, yeah. He wasn't even that big when I started listening to him, but and his solo music didn't do overly well, but he did very well when he started writing for other people.
ok
test_multiple_requests_consistency (__main__.TestQwen3ASRTranscription.test_multiple_requests_consistency)
Test that repeated requests produce consistent output. ... [CI Test Method] TestQwen3ASRTranscription.test_multiple_requests_consistency
[2026-04-03 19:21:51] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 205, token usage: 0.00, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 251.48
[2026-04-03 19:21:51] Decode batch, #running-req: 1, #token: 218, token usage: 0.00, cuda graph: False, gen throughput (token/s): 60.06, #queue-req: 0
[2026-04-03 19:21:52] INFO:     127.0.0.1:34406 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
[2026-04-03 19:21:52] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 205, token usage: 0.00, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 1.26
[2026-04-03 19:21:52] Decode batch, #running-req: 1, #token: 209, token usage: 0.00, cuda graph: False, gen throughput (token/s): 60.93, #queue-req: 0
[2026-04-03 19:21:52] Decode batch, #running-req: 1, #token: 249, token usage: 0.00, cuda graph: False, gen throughput (token/s): 65.79, #queue-req: 0
[2026-04-03 19:21:53] INFO:     127.0.0.1:34414 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
[2026-04-03 19:21:53] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 205, token usage: 0.00, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 1.26
[2026-04-03 19:21:53] Decode batch, #running-req: 1, #token: 240, token usage: 0.00, cuda graph: False, gen throughput (token/s): 60.92, #queue-req: 0
[2026-04-03 19:21:53] INFO:     127.0.0.1:34418 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
[Consistency] All 3 requests match: Uh huh. Oh yeah, yeah. He wasn't even that big when I started listening to him, ...
ok

----------------------------------------------------------------------
Ran 3 tests in 34.964s

OK
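
The comparison above is done by eye: the SGLang transcripts are identical to the reference output, and the three repeated requests agree with each other. A small sketch of how that check could be automated is shown here. The helper names `normalize`, `transcripts_match`, and `all_consistent` are hypothetical and are not taken from the test file in this PR, and comparing only after whitespace and Unicode normalization is an assumption about what counts as a match.

```python
import re
import unicodedata


def normalize(text):
    """Apply NFKC normalization, trim, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text).strip()
    return re.sub(r"\s+", " ", text)


def transcripts_match(reference, candidate):
    """True if two transcripts are equal after normalization."""
    return normalize(reference) == normalize(candidate)


def all_consistent(outputs):
    """True if every transcript in a non-empty list normalizes to the same string."""
    return len({normalize(o) for o in outputs}) == 1


# Transcripts copied from the logs above.
official_zh = "甚至出现交易几乎停滞的情况。"
sglang_zh = "甚至出现交易几乎停滞的情况。"
print(transcripts_match(official_zh, sglang_zh))
```

An exact match is a strict criterion. For longer or noisier audio a word or character error rate would be a more forgiving measure, but that is outside what this PR reports.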

Speed Tests and Profiling

Checklist

Format your code according to the "Format code with pre-commit" guide.
Add unit tests according to the "Run and add unit tests" guide.
Update documentation according to the "Write documentations" guide.
Provide accuracy and speed benchmark results according to the "Test the accuracy" and "Benchmark the speed" guides.
Follow the SGLang code style guidance.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.


gemini-code-assist · 2026-04-03T19:29:15Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

AgainstEntropy · 2026-04-06T04:46:08Z

Duplicate of #22073. Closing.

AgainstEntropy closed this Apr 6, 2026

AgainstEntropy deleted the feat/qwen3-asr branch April 9, 2026 22:56

AgainstEntropy wants to merge 1 commit into sgl-project:main from AgainstEntropy:feat/qwen3-asr
