[feat] add Qwen3-ASR model support and related configurations#22071

Closed
AgainstEntropy wants to merge 1 commit into sgl-project:main from AgainstEntropy:feat/qwen3-asr

Conversation

@AgainstEntropy
Collaborator

Motivation

#22025

Modifications

  • Introduced Qwen3-ASR model with configuration and processor classes.
  • Updated entry points to handle Qwen3-ASR in the transcription endpoint.
  • Enhanced multimodal processing to support Qwen3-ASR.
  • Added tests for Qwen3-ASR transcription functionality.
  • Updated existing files to include Qwen3ASR in relevant imports and configurations.
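
For readers who want to try the transcription endpoint by hand, here is a minimal sketch of how a client request could be assembled. The endpoint path, port, and served model name are taken from the launch command and server log in this PR; the `model` and `file` form fields are an assumption based on the OpenAI-style transcription API, and `build_transcription_request` and `sample.wav` are hypothetical names used only for illustration.

```python
import urllib.request
import uuid


def build_transcription_request(
    audio_bytes: bytes,
    filename: str = "sample.wav",              # hypothetical file name
    model: str = "qwen3-asr",                  # --served-model-name in the launch command
    base_url: str = "http://127.0.0.1:21000",  # --host / --port in the launch command
) -> urllib.request.Request:
    """Build (but do not send) a multipart POST for /v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    # Assumed form fields: "model" (text) and "file" (binary audio payload).
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/transcriptions",
        data=head + audio_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

With a server running, the request could be sent with `urllib.request.urlopen(build_transcription_request(open(path, "rb").read()))`. The shape of the response body is not shown in this PR, so it is left unparsed here.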

Accuracy Tests

  • Official qwen_asr (reference output)
$ python test_qwen3_asr_official.py 
Loading checkpoint shards: 100%|██████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.42it/s]
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
English
Uh huh. Oh yeah, yeah. He wasn't even that big when I started listening to him, but and his solo music didn't do overly well, but he did very well when he started writing for other people.
Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
Chinese
甚至出现交易几乎停滞的情况。
  • This PR (SGLang server)
$ python test_qwen3_asr.py 
command=sglang serve --model-path Qwen/Qwen3-ASR-1.7B --served-model-name qwen3-asr --trust-remote-code --disable-cuda-graph --device cuda --host 127.0.0.1 --port 21000
CI_OFFLINE: Launching server HF_HUB_OFFLINE=0 model=Qwen/Qwen3-ASR-1.7B
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'mrope_section', 'mrope_interleaved', 'interleaved'}
`torch_dtype` is deprecated! Use `dtype` instead!
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
/data/yihao/repos/sglang/python/sglang/srt/entrypoints/http_server.py:172: FastAPIDeprecationWarning: ORJSONResponse is deprecated, FastAPI now serializes data directly to JSON bytes via Pydantic when a return type or response model is set, which is faster and doesn't need a custom response class. Read more in the FastAPI docs: https://fastapi.tiangolo.com/advanced/custom-response/#orjson-or-response-model and https://fastapi.tiangolo.com/tutorial/response-model/
  from sglang.srt.utils.json_response import (
[2026-04-03 19:21:26] server_args=ServerArgs(model_path='Qwen/Qwen3-ASR-1.7B', tokenizer_path='Qwen/Qwen3-ASR-1.7B', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=True, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=21000, fastapi_root_path='', grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_keyfile_password=None, enable_ssl_refresh=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, rl_quant_profile=None, mem_fraction_static=0.907, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=8192, enable_dynamic_chunking=False, max_prefill_tokens=16384, prefill_max_requests=None, schedule_policy='fcfs', enable_priority_scheduling=False, disable_priority_preemption=False, default_priority_value=None, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', enable_prefill_delayer=False, prefill_delayer_max_delay_passes=30, prefill_delayer_token_usage_low_watermark=None, prefill_delayer_forward_passes_buckets=None, prefill_delayer_wait_seconds_buckets=None, device='cuda', tp_size=1, pp_size=1, pp_max_micro_batch_size=None, pp_async_batch_depth=0, stream_interval=1, incremental_streaming_output=False, enable_streaming_session=False, random_seed=222246292, constrained_json_whitespace_pattern=None, 
constrained_json_disable_any_whitespace=False, watchdog_timeout=300, soft_watchdog_timeout=None, dist_timeout=None, download_dir=None, model_checksum=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, use_ray=False, custom_sigquit_handler=None, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, log_requests_format='text', log_requests_target=None, uvicorn_access_log_exclude_prefixes=[], crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_mfu_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, extra_metric_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', export_metrics_to_file=False, export_metrics_to_file_dir=None, api_key=None, admin_api_key=None, served_model_name='qwen3-asr', weight_version='default', chat_template=None, hf_chat_template_name=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', attn_cp_size=1, moe_dp_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, enable_lora_overlap_loading=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, experts_shared_outer_loras=None, attention_backend='fa3', decode_attention_backend=None, prefill_attention_backend=None, 
sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, fp8_gemm_runner_backend='auto', fp4_gemm_runner_backend='auto', nsa_prefill_backend=None, nsa_decode_backend=None, disable_flashinfer_autotune=False, mamba_backend='triton', speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_draft_attention_backend=None, speculative_moe_runner_backend='auto', speculative_moe_a2a_backend=None, speculative_draft_model_quantization=None, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_max_trie_depth=18, speculative_ngram_capacity=10000000, enable_multi_layer_eagle=False, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, enforce_disable_flashinfer_allreduce_fusion=False, enable_aiter_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm=None, init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, enable_elastic_expert_backup=False, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype=None, mamba_full_memory_ratio=0.9, mamba_scheduler_strategy='no_buffer', mamba_track_interval=256, linear_attn_backend='triton', linear_attn_decode_backend=None, 
linear_attn_prefill_backend=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_hisparse=False, hisparse_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, dllm_algorithm=None, dllm_algorithm_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=256, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256], disable_cuda_graph=True, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_layerwise_nvtx_marker=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, pre_warm_nccl=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, disable_piecewise_cuda_graph=True, enforce_piecewise_cuda_graph=False, enable_torch_compile_debug_mode=False, torch_compile_max_bs=32, 
piecewise_cuda_graph_max_tokens=8192, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 576, 640, 704, 768, 832, 896, 960, 1024, 1280, 1536, 1792, 2048, 2304, 2560, 2816, 3072, 3328, 3584, 3840, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, enable_draft_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, enable_return_routed_experts=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_attn_tp_input_scattered=False, gc_threshold=None, enable_nsa_prefill_context_parallel=False, nsa_prefill_cp_mode='round-robin-split', enable_fused_qk_norm_rope=False, enable_precise_embedding_interpolation=False, enable_fused_moe_sum_all_reduce=False, enable_prefill_context_parallel=False, prefill_cp_mode='in-seq-split', enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, num_reserved_decode_tokens=512, 
disaggregation_decode_polling_interval=1, encoder_only=False, language_only=False, encoder_transfer_backend='zmq_to_scheduler', encoder_urls=[], enable_adaptive_dispatch_to_encoder=False, custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, remote_instance_weight_loader_backend='nccl', remote_instance_weight_loader_start_seed_via_transfer_engine=False, engine_info_bootstrap_port=6789, modelexpress_config=None, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, enable_broadcast_mm_inputs_process=False, enable_prefix_mm_cache=False, mm_enable_dp_encoder=False, mm_process_config={}, limit_mm_data_per_request=None, enable_mm_global_cache=False, decrypted_config_file=None, decrypted_draft_config_file=None, forward_hooks=None)
[2026-04-03 19:21:26] mp.set_executable /data/yihao/repos/sglang/.venv/bin/python3 -> /tmp/sglang_temp_file_1775244086.9585037_837831.sh (script='#!/bin/sh\nexec numactl --cpunodebind=0 --membind=0 /data/yihao/repos/sglang/.venv/bin/python3 "$@"')
[2026-04-03 19:21:26] mp.set_executable revert to /data/yihao/repos/sglang/.venv/bin/python3
2026-04-03 19:21:27.102 DEBUG Persistent cache disabled, using in-memory JIT cache
[2026-04-03 19:21:27] Persistent cache disabled, using in-memory JIT cache
2026-04-03 19:21:27.102 DEBUG Persistent cache disabled, using in-memory JIT cache
[2026-04-03 19:21:27] Persistent cache disabled, using in-memory JIT cache
2026-04-03 19:21:27.102 DEBUG Persistent cache disabled, using in-memory JIT cache
[2026-04-03 19:21:27] Persistent cache disabled, using in-memory JIT cache
2026-04-03 19:21:27.102 DEBUG Persistent cache disabled, using in-memory JIT cache
[2026-04-03 19:21:27] Persistent cache disabled, using in-memory JIT cache
2026-04-03 19:21:27.102 DEBUG Persistent cache disabled, using in-memory JIT cache
[2026-04-03 19:21:27] Persistent cache disabled, using in-memory JIT cache
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'mrope_section', 'mrope_interleaved', 'interleaved'}
[2026-04-03 19:21:27] thinker_config is None. Initializing Qwen3-ASR thinker with default values
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'mrope_section', 'mrope_interleaved', 'interleaved'}
[2026-04-03 19:21:27] thinker_config is None. Initializing Qwen3-ASR thinker with default values
Fetching 7 files: 100%|█████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 8279.79it/s]
Download complete: : 0.00B [00:00, ?B/s]                                                  | 0/7 [00:00<?, ?it/s]
[2026-04-03 19:21:28] No HuggingFace chat template found
[2026-04-03 19:21:28] No chat template found, defaulting to 'string' content format
[2026-04-03 19:21:34] NUMA affinity is already constrained for process, skipping NUMA node configuration for GPU. Remove your constraints to allow automatic configuration.
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'mrope_interleaved', 'interleaved', 'mrope_section'}
[2026-04-03 19:21:34] thinker_config is None. Initializing Qwen3-ASR thinker with default values
`torch_dtype` is deprecated! Use `dtype` instead!
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'mrope_interleaved', 'interleaved', 'mrope_section'}
[2026-04-03 19:21:34] thinker_config is None. Initializing Qwen3-ASR thinker with default values
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'mrope_interleaved', 'interleaved', 'mrope_section'}
[2026-04-03 19:21:34] thinker_config is None. Initializing Qwen3-ASR thinker with default values
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'mrope_interleaved', 'interleaved', 'mrope_section'}
[2026-04-03 19:21:34] thinker_config is None. Initializing Qwen3-ASR thinker with default values
Fetching 7 files: 100%|████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 10138.17it/s]
Download complete: : 0.00B [00:00, ?B/s]                                                  | 0/7 [00:00<?, ?it/s]
2026-04-03 19:21:35.519 DEBUG Persistent cache disabled, using in-memory JIT cache
[2026-04-03 19:21:35] Persistent cache disabled, using in-memory JIT cache
2026-04-03 19:21:35.519 DEBUG Persistent cache disabled, using in-memory JIT cache
[2026-04-03 19:21:35] Persistent cache disabled, using in-memory JIT cache
2026-04-03 19:21:35.519 DEBUG Persistent cache disabled, using in-memory JIT cache
[2026-04-03 19:21:35] Persistent cache disabled, using in-memory JIT cache
2026-04-03 19:21:35.519 DEBUG Persistent cache disabled, using in-memory JIT cache
[2026-04-03 19:21:35] Persistent cache disabled, using in-memory JIT cache
2026-04-03 19:21:35.519 DEBUG Persistent cache disabled, using in-memory JIT cache
[2026-04-03 19:21:35] Persistent cache disabled, using in-memory JIT cache
[2026-04-03 19:21:36] Init torch distributed begin.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2026-04-03 19:21:36] Init torch distributed ends. elapsed=0.24 s, mem usage=0.09 GB
[2026-04-03 19:21:36] Load weight begin. avail mem=139.12 GB
[2026-04-03 19:21:36] Multimodal attention backend not set. Use fa3.
[2026-04-03 19:21:36] Using fa3 as multimodal attention backend.
[2026-04-03 19:21:36] Found local HF snapshot for Qwen/Qwen3-ASR-1.7B at /home/radixark/.cache/huggingface/hub/models--Qwen--Qwen3-ASR-1.7B/snapshots/7278e1e70fe206f11671096ffdd38061171dd6e5; skipping download.
Multi-thread loading shards: 100% Completed | 2/2 [00:00<00:00,  4.40it/s]
[2026-04-03 19:21:37] Load weight end. elapsed=0.85 s, type=Qwen3ASRForConditionalGeneration, avail mem=135.21 GB, mem usage=3.91 GB.
[2026-04-03 19:21:37] Using KV cache dtype: torch.bfloat16
[2026-04-03 19:21:37] KV Cache is allocated. #tokens: 1144694, K size: 61.13 GB, V size: 61.13 GB
[2026-04-03 19:21:37] Memory pool end. avail mem=11.91 GB
[2026-04-03 19:21:37] Disable piecewise CUDA graph because --disable-piecewise-cuda-graph is set
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'mrope_interleaved', 'interleaved', 'mrope_section'}
[2026-04-03 19:21:37] thinker_config is None. Initializing Qwen3-ASR thinker with default values
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'mrope_interleaved', 'interleaved', 'mrope_section'}
[2026-04-03 19:21:37] thinker_config is None. Initializing Qwen3-ASR thinker with default values
Fetching 7 files: 100%|████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 74329.44it/s]
Download complete: : 0.00B [00:00, ?B/s]
[2026-04-03 19:21:38] max_total_num_tokens=1144694, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=4096, context_len=65536, available_gpu_mem=11.81 GB
[2026-04-03 19:21:38] INFO:     Started server process [2931704]
[2026-04-03 19:21:38] INFO:     Waiting for application startup.
[2026-04-03 19:21:38] Using default chat sampling params from model generation config: {'temperature': 1e-06}
[2026-04-03 19:21:38] INFO:     Application startup complete.
[2026-04-03 19:21:38] INFO:     Uvicorn running on http://127.0.0.1:21000 (Press CTRL+C to quit)
[2026-04-03 19:21:38] INFO:     127.0.0.1:53444 - "GET /health_generate HTTP/1.1" 503 Service Unavailable
[2026-04-03 19:21:39] INFO:     127.0.0.1:53446 - "GET /model_info HTTP/1.1" 200 OK
Download complete: : 0.00B [00:01, ?B/s]
2026-04-03 19:21:40,391 - CUTE_DSL - WARNING - [handle_import_error] - Unexpected error during package walk: cutlass.cute.experimental
[2026-04-03 19:21:40] Unexpected error during package walk: cutlass.cute.experimental
[2026-04-03 19:21:41] Prefill batch, #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 0.00
[2026-04-03 19:21:41] INFO:     127.0.0.1:53454 - "POST /generate HTTP/1.1" 200 OK
[2026-04-03 19:21:41] The server is fired up and ready to roll!
[2026-04-03 19:21:48] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 0.76
[2026-04-03 19:21:49] INFO:     127.0.0.1:34368 - "GET /health_generate HTTP/1.1" 200 OK
test_chinese_transcription (__main__.TestQwen3ASRTranscription.test_chinese_transcription)
Test Chinese audio transcription. ... [CI Test Method] TestQwen3ASRTranscription.test_chinese_transcription
[2026-04-03 19:21:50] Prefill batch, #new-seq: 1, #new-token: 65, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 0.68
[2026-04-03 19:21:50] INFO:     127.0.0.1:34384 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
[ZH Transcription] 甚至出现交易几乎停滞的情况。
ok
test_english_transcription (__main__.TestQwen3ASRTranscription.test_english_transcription)
Test English audio transcription. ... [CI Test Method] TestQwen3ASRTranscription.test_english_transcription
[2026-04-03 19:21:50] Prefill batch, #new-seq: 1, #new-token: 202, #cached-token: 4, token usage: 0.00, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 231.21
[2026-04-03 19:21:51] Decode batch, #running-req: 1, #token: 227, token usage: 0.00, cuda graph: False, gen throughput (token/s): 2.40, #queue-req: 0
[2026-04-03 19:21:51] INFO:     127.0.0.1:34400 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
[EN Transcription] Uh huh. Oh yeah, yeah. He wasn't even that big when I started listening to him, but and his solo music didn't do overly well, but he did very well when he started writing for other people.
ok
test_multiple_requests_consistency (__main__.TestQwen3ASRTranscription.test_multiple_requests_consistency)
Test that repeated requests produce consistent output. ... [CI Test Method] TestQwen3ASRTranscription.test_multiple_requests_consistency
[2026-04-03 19:21:51] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 205, token usage: 0.00, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 251.48
[2026-04-03 19:21:51] Decode batch, #running-req: 1, #token: 218, token usage: 0.00, cuda graph: False, gen throughput (token/s): 60.06, #queue-req: 0
[2026-04-03 19:21:52] INFO:     127.0.0.1:34406 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
[2026-04-03 19:21:52] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 205, token usage: 0.00, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 1.26
[2026-04-03 19:21:52] Decode batch, #running-req: 1, #token: 209, token usage: 0.00, cuda graph: False, gen throughput (token/s): 60.93, #queue-req: 0
[2026-04-03 19:21:52] Decode batch, #running-req: 1, #token: 249, token usage: 0.00, cuda graph: False, gen throughput (token/s): 65.79, #queue-req: 0
[2026-04-03 19:21:53] INFO:     127.0.0.1:34414 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
[2026-04-03 19:21:53] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 205, token usage: 0.00, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 1.26
[2026-04-03 19:21:53] Decode batch, #running-req: 1, #token: 240, token usage: 0.00, cuda graph: False, gen throughput (token/s): 60.92, #queue-req: 0
[2026-04-03 19:21:53] INFO:     127.0.0.1:34418 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
[Consistency] All 3 requests match: Uh huh. Oh yeah, yeah. He wasn't even that big when I started listening to him, ...
ok

----------------------------------------------------------------------
Ran 3 tests in 34.964s

OK
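
The consistency test in the log above sends the same audio three times and checks that all transcriptions match. The actual test code is not shown in this conversation, so the following is only a sketch of that kind of check; `check_consistency` and the `transcribe` callable are hypothetical stand-ins for whatever the test file uses.

```python
from typing import Callable, List


def check_consistency(transcribe: Callable[[], str], runs: int = 3) -> str:
    """Call `transcribe` `runs` times and require identical outputs.

    `transcribe` is a hypothetical zero-argument callable that sends one
    transcription request and returns the recognized text.
    """
    outputs: List[str] = [transcribe() for _ in range(runs)]
    distinct = set(outputs)
    if len(distinct) != 1:
        raise AssertionError(
            f"{len(distinct)} distinct outputs across {runs} runs: {sorted(distinct)}"
        )
    return outputs[0]
```

Passing the request function as a callable keeps the check independent of the HTTP client, so it can be exercised with a stub before pointing it at a live server.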

Speed Tests and Profiling

More tests will be posted here.
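
As a rough sanity check on the memory figures in the startup log above (`#tokens: 1144694, K size: 61.13 GB`, KV cache dtype `torch.bfloat16`), the K size can be reproduced with simple arithmetic. The layer count, KV head count, and head dimension below are assumptions chosen because they are consistent with the logged numbers; they were not read from the model config. The log's "GB" is treated as GiB (2^30 bytes).

```python
# Assumed decoder shape (NOT read from the model config).
NUM_LAYERS = 28
NUM_KV_HEADS = 8
HEAD_DIM = 128

BYTES_PER_ELEM = 2      # bfloat16, per "Using KV cache dtype: torch.bfloat16"
NUM_TOKENS = 1_144_694  # "#tokens" in the "KV Cache is allocated" log line


def k_cache_gib(num_tokens: int) -> float:
    """Size of the K half of the KV cache in GiB (the V half is the same size)."""
    bytes_per_token = NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM
    return num_tokens * bytes_per_token / 2**30


print(f"K size: {k_cache_gib(NUM_TOKENS):.2f} GiB")
```

Under these assumptions each token takes 57,344 bytes in the K cache, which yields the logged 61.13 for 1,144,694 tokens.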

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@AgainstEntropy
Collaborator Author

Duplicate of #22073. Closing.

@AgainstEntropy deleted the feat/qwen3-asr branch on April 9, 2026, 22:56