Skip to content

Fixed the IPV6 socket issue in encode server of EPD#21493

Open
hhzguo wants to merge 2 commits intosgl-project:mainfrom
hhzguo:hhzguo/ipv6-encoder
Open

Fixed the IPV6 socket issue in encode server of EPD#21493
hhzguo wants to merge 2 commits intosgl-project:mainfrom
hhzguo:hhzguo/ipv6-encoder

Conversation

@hhzguo
Copy link
Copy Markdown
Contributor

@hhzguo hhzguo commented Mar 26, 2026

Motivation

When run EPD on node with IPV6 address, the zmq for IPV6 has the following issues:

1 ipv6 address is not handled correctly in get_zmq_socket_host function if host is none
2 encode server doesn't handle IPV6, which causes the communication problems between encode server and prefill server

Modifications

add zmq.IPV6 to socket

Accuracy Tests

lm-eval run result:
image

Benchmarking and Profiling

The change doesn't impact the performance.
But I post the benchmark result here for ref.
1 encode server, 1 prefill and 1 decode on the same h200 node.

image

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces IPv6 support for ZMQ communication and refactors network address handling using the NetworkAddress utility. It also adds error handling for socket operations in the encode_server. A critical issue was identified where exceptions in the socket sending logic are caught and logged but not re-raised, which could lead to silent data loss and make debugging difficult.

@hhzguo hhzguo force-pushed the hhzguo/ipv6-encoder branch from c8e680b to 7bd5d75 Compare March 26, 2026 23:14
@ShangmingCai
Copy link
Copy Markdown
Collaborator

Maybe covered in #21236?

@hhzguo
Copy link
Copy Markdown
Contributor Author

hhzguo commented Mar 27, 2026

Maybe covered in #21236?

partially. You may ask the owner of 21236 to merge the change in network.py to his/her and I will close mine. Thank you

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants