
[serve.llm] Port collisions in multi-replica TP deployments with NIXL #58072

@nrghosh


What happened + What you expected to happen

When Ray Serve LLM is deployed with multiple replicas (num_replicas ≥ 2) and tensor parallelism (tensor_parallel_size ≥ 2), TP workers from different replicas collide on ports when the NIXL KV transfer backend is used.
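One plausible way such a collision can arise is sketched below. It assumes, purely for illustration, that every replica derives its worker ports from the same base port using only the TP rank, and that the replicas share a node. `BASE_PORT`, `naive_worker_port`, and `find_collisions` are hypothetical names, not Ray or vLLM APIs, and the real allocation logic may differ.

```python
BASE_PORT = 5600  # hypothetical shared base port, not taken from the issue


def naive_worker_port(tp_rank: int, base_port: int = BASE_PORT) -> int:
    """Hypothetical scheme: the port depends only on the TP rank."""
    return base_port + tp_rank


def find_collisions(num_replicas: int, tp_size: int) -> dict[int, list[tuple[int, int]]]:
    """Map each contested port to the (replica, tp_rank) pairs that claim it."""
    claims: dict[int, list[tuple[int, int]]] = {}
    for replica in range(num_replicas):
        for tp_rank in range(tp_size):
            port = naive_worker_port(tp_rank)
            claims.setdefault(port, []).append((replica, tp_rank))
    # Keep only ports claimed by more than one worker.
    return {port: owners for port, owners in claims.items() if len(owners) > 1}


# Two replicas with two TP workers each, all on one node.
collisions = find_collisions(num_replicas=2, tp_size=2)
for port, owners in sorted(collisions.items()):
    print(port, owners)
```

Under this assumption, rank 0 of every replica asks for the same port, and likewise for rank 1, so whichever replica starts second cannot bind.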

What happened:
The second replica fails to start with NIXL_ERR_BACKEND or ZMQ port-binding errors.
Logs show workers trying to bind to ports that are already in use.
Autoscaling from 1 to 2 or more replicas fails.

Expected:
Each TP worker gets a unique port.
Autoscaling works reliably.
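The expected behavior can be illustrated with a minimal sketch that offsets each worker's port by both a replica index and the TP rank. This is not the fix adopted in Ray or vLLM; `unique_worker_port`, the base port 5600, and the notion of a stable per-node `replica_index` are assumptions made for the example.

```python
def unique_worker_port(base_port: int, replica_index: int, tp_rank: int, tp_size: int) -> int:
    """Give each (replica, tp_rank) pair its own port in a contiguous block.

    Hypothetical helper: assumes every replica knows a stable, zero-based
    replica_index and that all replicas use the same tp_size.
    """
    if replica_index < 0:
        raise ValueError("replica_index must be non-negative")
    if not 0 <= tp_rank < tp_size:
        raise ValueError("tp_rank must be in [0, tp_size)")
    # Each replica owns a block of tp_size consecutive ports.
    return base_port + replica_index * tp_size + tp_rank


# Three replicas with two TP workers each.
ports = [
    unique_worker_port(5600, replica, rank, tp_size=2)
    for replica in range(3)
    for rank in range(2)
]
print(ports)
```

A scheme like this only avoids collisions among workers that share the same base port and agree on `replica_index`. Asking the operating system for a free port (binding to port 0) is a common alternative when no such index is available.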

Related Issues
PR #57771
Issue #55775

Versions / Dependencies

Ray: 2.50+ (after PR #57771)
vLLM: 0.11+
Python: 3.11+

Reproduction script

#serve_llama_3dot1_8b_quantized_tp1_2p6d.yaml
applications:
  - args:
      prefill_config:
        model_loading_config:
          model_id: neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16
        accelerator_type: L4
        engine_kwargs:
          max_model_len: 8192
          tensor_parallel_size: 1
          enforce_eager: true
          kv_transfer_config:
            kv_connector: NixlConnector
            kv_role: kv_both
        deployment_config:
          autoscaling_config:
            min_replicas: 2
            max_replicas: 2
      decode_config:
        model_loading_config:
          model_id: neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16
        accelerator_type: L4
        engine_kwargs:
          max_model_len: 8192
          tensor_parallel_size: 1
          enforce_eager: true
          kv_transfer_config:
            kv_connector: NixlConnector
            kv_role: kv_both
        deployment_config:
          autoscaling_config:
            min_replicas: 6
            max_replicas: 6
    import_path: ray.serve.llm:build_pd_openai_app
    name: llm-endpoint
    route_prefix: /

Error output

zmq.error.ZMQError: Address already in use

Issue Severity

Medium: It is a significant difficulty but I can work around it.

Labels

bug, llm, serve, stability, triage
