
[autoscaler][Bug] Raylet address already in use during worker node startup #23152

Closed
1 of 2 tasks
bveeramani opened this issue Mar 14, 2022 · 6 comments
Labels
  • bug: Something that is supposed to be working, but isn't
  • needs-repro-script: Issue needs a runnable script to be reproduced
  • stale: The issue is stale. It will be closed within 7 days unless there is further conversation
  • triage: Needs triage (e.g. priority, bug/not-bug, and owning component)

Comments

@bveeramani
Member

bveeramani commented Mar 14, 2022

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Clusters

Issue Severity

Medium: It contributes significant difficulty to completing my task, but I can work around it and get it resolved.

What happened + What you expected to happen

I ran my training script and received an esoteric error.

(scheduler +28s) Restarting 2 nodes of type ray.worker.default (lost contact with raylet).
(raylet, ip=10.0.2.101) E0301 08:45:44.685832418   28476 server_chttp2.cc:48]        {"created":"@1646124344.685770642","description":"No address added out of total 1 resolved","file":"external/com_github_grpc_grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":897,"referenced_errors":[{"created":"@1646124344.685766022","description":"Failed to add any wildcard listeners","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_posix.cc","file_line":348,"referenced_errors":[{"created":"@1646124344.685748525","description":"Unable to configure socket","fd":30,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1646124344.685743165","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]},{"created":"@1646124344.685764680","description":"Unable to configure socket","fd":30,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1646124344.685761637","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]}]}]}
(raylet, ip=10.0.2.101) [2022-03-01 08:45:44,730 C 28476 28476] grpc_server.cc:93:  Check failed: server_ Failed to start the grpc server. The specified port is 8076. This means that Ray's core components will not be able to function correctly. If the server startup error message is `Address already in use`, it indicates the server fails to start because the port is already used by other processes (such as --node-manager-port, --object-manager-port, --gcs-server-port, and ports between --min-worker-port, --max-worker-port). Try running lsof -i :8076 to check if there are other processes listening to the port.
(raylet, ip=10.0.2.101) *** StackTrace Information ***
(raylet, ip=10.0.2.101)     ray::SpdLogMessage::Flush()
(raylet, ip=10.0.2.101)     ray::RayLog::~RayLog()
(raylet, ip=10.0.2.101)     ray::rpc::GrpcServer::Run()
(raylet, ip=10.0.2.101)     ray::ObjectManager::ObjectManager()
(raylet, ip=10.0.2.101)     ray::raylet::NodeManager::NodeManager()
(raylet, ip=10.0.2.101)     ray::raylet::Raylet::Raylet()
(raylet, ip=10.0.2.101)     main::{lambda()#1}::operator()()
(raylet, ip=10.0.2.101)     std::_Function_handler<>::_M_invoke()
(raylet, ip=10.0.2.101)     std::_Function_handler<>::_M_invoke()
(raylet, ip=10.0.2.101)     std::_Function_handler<>::_M_invoke()
(raylet, ip=10.0.2.101)     ray::rpc::ClientCallImpl<>::OnReplyReceived()
(raylet, ip=10.0.2.101)     std::_Function_handler<>::_M_invoke()
(raylet, ip=10.0.2.101)     boost::asio::detail::completion_handler<>::do_complete()
(raylet, ip=10.0.2.101)     boost::asio::detail::scheduler::do_run_one()
(raylet, ip=10.0.2.101)     boost::asio::detail::scheduler::run()
(raylet, ip=10.0.2.101)     boost::asio::io_context::run()
(raylet, ip=10.0.2.101)     main
(raylet, ip=10.0.2.101)     __libc_start_main
(raylet, ip=10.0.2.101)
(raylet, ip=10.0.2.101) E0301 08:46:25.409726834   28711 server_chttp2.cc:48]        {"created":"@1646124385.409664672","description":"No address added out of total 1 resolved","file":"external/com_github_grpc_grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":897,"referenced_errors":[{"created":"@1646124385.409660201","description":"Failed to add any wildcard listeners","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_posix.cc","file_line":348,"referenced_errors":[{"created":"@1646124385.409642471","description":"Unable to configure socket","fd":30,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1646124385.409637365","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]},{"created":"@1646124385.409658882","description":"Unable to configure socket","fd":30,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1646124385.409655976","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]}]}]}
(raylet, ip=10.0.2.101) [2022-03-01 08:46:25,453 C 28711 28711] grpc_server.cc:93:  Check failed: server_ Failed to start the grpc server. The specified port is 8076. This means that Ray's core components will not be able to function correctly. If the server startup error message is `Address already in use`, it indicates the server fails to start because the port is already used by other processes (such as --node-manager-port, --object-manager-port, --gcs-server-port, and ports between --min-worker-port, --max-worker-port). Try running lsof -i :8076 to check if there are other processes listening to the port.
(raylet, ip=10.0.2.101) *** StackTrace Information ***
(raylet, ip=10.0.2.101)     ray::SpdLogMessage::Flush()
(raylet, ip=10.0.2.101)     ray::RayLog::~RayLog()
(raylet, ip=10.0.2.101)     ray::rpc::GrpcServer::Run()
(raylet, ip=10.0.2.101)     ray::ObjectManager::ObjectManager()
(raylet, ip=10.0.2.101)     ray::raylet::NodeManager::NodeManager()
(raylet, ip=10.0.2.101)     ray::raylet::Raylet::Raylet()
(raylet, ip=10.0.2.101)     main::{lambda()#1}::operator()()
(raylet, ip=10.0.2.101)     std::_Function_handler<>::_M_invoke()
(raylet, ip=10.0.2.101)     std::_Function_handler<>::_M_invoke()
(raylet, ip=10.0.2.101)     std::_Function_handler<>::_M_invoke()
(raylet, ip=10.0.2.101)     ray::rpc::ClientCallImpl<>::OnReplyReceived()
(raylet, ip=10.0.2.101)     std::_Function_handler<>::_M_invoke()
(raylet, ip=10.0.2.101)     boost::asio::detail::completion_handler<>::do_complete()
(raylet, ip=10.0.2.101)     boost::asio::detail::scheduler::do_run_one()
(raylet, ip=10.0.2.101)     boost::asio::detail::scheduler::run()
(raylet, ip=10.0.2.101)     boost::asio::io_context::run()
(raylet, ip=10.0.2.101)     main
(raylet, ip=10.0.2.101)     __libc_start_main
(raylet, ip=10.0.2.101)
(raylet, ip=10.0.2.25) E0301 08:45:44.666379347   22255 server_chttp2.cc:48]        {"created":"@1646124344.666318750","description":"No address added out of total 1 resolved","file":"external/com_github_grpc_grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":897,"referenced_errors":[{"created":"@1646124344.666313631","description":"Failed to add any wildcard listeners","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_posix.cc","file_line":348,"referenced_errors":[{"created":"@1646124344.666295384","description":"Unable to configure socket","fd":30,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1646124344.666289842","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]},{"created":"@1646124344.666312054","description":"Unable to configure socket","fd":30,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1646124344.666309500","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]}]}]}
(raylet, ip=10.0.2.25) [2022-03-01 08:45:44,710 C 22255 22255] grpc_server.cc:93:  Check failed: server_ Failed to start the grpc server. The specified port is 8076. This means that Ray's core components will not be able to function correctly. If the server startup error message is `Address already in use`, it indicates the server fails to start because the port is already used by other processes (such as --node-manager-port, --object-manager-port, --gcs-server-port, and ports between --min-worker-port, --max-worker-port). Try running lsof -i :8076 to check if there are other processes listening to the port.
(raylet, ip=10.0.2.25) *** StackTrace Information ***
(raylet, ip=10.0.2.25)     ray::SpdLogMessage::Flush()
(raylet, ip=10.0.2.25)     ray::RayLog::~RayLog()
(raylet, ip=10.0.2.25)     ray::rpc::GrpcServer::Run()
(raylet, ip=10.0.2.25)     ray::ObjectManager::ObjectManager()
(raylet, ip=10.0.2.25)     ray::raylet::NodeManager::NodeManager()
(raylet, ip=10.0.2.25)     ray::raylet::Raylet::Raylet()
(raylet, ip=10.0.2.25)     main::{lambda()#1}::operator()()
(raylet, ip=10.0.2.25)     std::_Function_handler<>::_M_invoke()
(raylet, ip=10.0.2.25)     std::_Function_handler<>::_M_invoke()
(raylet, ip=10.0.2.25)     std::_Function_handler<>::_M_invoke()
(raylet, ip=10.0.2.25)     ray::rpc::ClientCallImpl<>::OnReplyReceived()
(raylet, ip=10.0.2.25)     std::_Function_handler<>::_M_invoke()
(raylet, ip=10.0.2.25)     boost::asio::detail::completion_handler<>::do_complete()
(raylet, ip=10.0.2.25)     boost::asio::detail::scheduler::do_run_one()
(raylet, ip=10.0.2.25)     boost::asio::detail::scheduler::run()
(raylet, ip=10.0.2.25)     boost::asio::io_context::run()
(raylet, ip=10.0.2.25)     main
(raylet, ip=10.0.2.25)     __libc_start_main
(raylet, ip=10.0.2.25)
(raylet, ip=10.0.2.25) E0301 08:46:25.435618107   22489 server_chttp2.cc:48]        {"created":"@1646124385.435553285","description":"No address added out of total 1 resolved","file":"external/com_github_grpc_grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":897,"referenced_errors":[{"created":"@1646124385.435548451","description":"Failed to add any wildcard listeners","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_posix.cc","file_line":348,"referenced_errors":[{"created":"@1646124385.435527347","description":"Unable to configure socket","fd":30,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1646124385.435519868","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]},{"created":"@1646124385.435547238","description":"Unable to configure socket","fd":30,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1646124385.435544036","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]}]}]}
(raylet, ip=10.0.2.25) [2022-03-01 08:46:25,479 C 22489 22489] grpc_server.cc:93:  Check failed: server_ Failed to start the grpc server. The specified port is 8076. This means that Ray's core components will not be able to function correctly. If the server startup error message is `Address already in use`, it indicates the server fails to start because the port is already used by other processes (such as --node-manager-port, --object-manager-port, --gcs-server-port, and ports between --min-worker-port, --max-worker-port). Try running lsof -i :8076 to check if there are other processes listening to the port.
(raylet, ip=10.0.2.25) *** StackTrace Information ***
(raylet, ip=10.0.2.25)     ray::SpdLogMessage::Flush()
(raylet, ip=10.0.2.25)     ray::RayLog::~RayLog()
(raylet, ip=10.0.2.25)     ray::rpc::GrpcServer::Run()
(raylet, ip=10.0.2.25)     ray::ObjectManager::ObjectManager()
(raylet, ip=10.0.2.25)     ray::raylet::NodeManager::NodeManager()
(raylet, ip=10.0.2.25)     ray::raylet::Raylet::Raylet()
(raylet, ip=10.0.2.25)     main::{lambda()#1}::operator()()
(raylet, ip=10.0.2.25)     std::_Function_handler<>::_M_invoke()
(raylet, ip=10.0.2.25)     std::_Function_handler<>::_M_invoke()
(raylet, ip=10.0.2.25)     std::_Function_handler<>::_M_invoke()
(raylet, ip=10.0.2.25)     ray::rpc::ClientCallImpl<>::OnReplyReceived()
(raylet, ip=10.0.2.25)     std::_Function_handler<>::_M_invoke()
(raylet, ip=10.0.2.25)     boost::asio::detail::completion_handler<>::do_complete()
(raylet, ip=10.0.2.25)     boost::asio::detail::scheduler::do_run_one()
(raylet, ip=10.0.2.25)     boost::asio::detail::scheduler::run()
(raylet, ip=10.0.2.25)     boost::asio::io_context::run()
(raylet, ip=10.0.2.25)     main
(raylet, ip=10.0.2.25)     __libc_start_main
(raylet, ip=10.0.2.25) 
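The key line in each trace is the bind failure with errno 98 (EADDRINUSE): the raylet's object manager gRPC server cannot bind port 8076 because another process, possibly a leftover raylet from the restarted node, still holds it. Alongside the `lsof -i :8076` check the error message suggests, the same condition can be probed with a few lines of stdlib Python (the helper name `port_in_use` is mine, not part of Ray):

```python
import socket

def port_in_use(port, host="127.0.0.1"):
    """Return True if `port` is already bound on `host`.

    Mirrors the raylet's failure mode: bind() raising OSError with
    errno 98 (EADDRINUSE on Linux) means another process holds the port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return False
        except OSError:
            return True
```

For example, `port_in_use(8076)` on the worker node would tell you whether the object manager port is still held before the raylet restarts.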

Versions / Dependencies

Python: 3.7.7
Ray: 1.11
OS: Ubuntu 18.04

Reproduction script

cluster_name: balajis-cluster

min_workers: 2
max_workers: 2

provider:
    type: aws
    region: us-west-2

auth:
    ssh_user: ubuntu
    ssh_private_key: ~/.ssh/id

available_node_types:
    ray.head.default:
        node_config:
            KeyName: balajis-key-pair
            InstanceType: p3.2xlarge
            ImageId: latest_dlami
            BlockDeviceMappings:
            - DeviceName: /dev/sda1
              Ebs:
                  VolumeSize: 600
    ray.worker.default:
        min_workers: 2
        node_config:
            KeyName: balajis-key-pair
            InstanceType: p3.16xlarge
            ImageId: latest_dlami
            BlockDeviceMappings:
            - DeviceName: /dev/sda1
              Ebs:
                  VolumeSize: 600

head_node_type: ray.head.default

setup_commands:
    - pip install 'ray[default]' torch torchvision

Cannot reproduce.

Anything else

May be a duplicate of #20402.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@bveeramani bveeramani added the "bug" (Something that is supposed to be working, but isn't) and "triage" (Needs triage, e.g. priority, bug/not-bug, and owning component) labels Mar 14, 2022
@bveeramani bveeramani changed the title from "[Bug] Failed to start the grpc server." to "[Bug] Failed to start the grpc server" Mar 14, 2022
@stephanie-wang stephanie-wang added the "needs-repro-script" (Issue needs a runnable script to be reproduced) label Mar 15, 2022
@stephanie-wang
Contributor

Can you provide a cluster config? Otherwise I don't think we'll be able to reproduce this issue and will have to close it.

@stephanie-wang stephanie-wang changed the title from "[Bug] Failed to start the grpc server" to "[autoscaler][Bug] Raylet address already in use during worker node startup" Mar 15, 2022
@bveeramani
Member Author

I've added my cluster configuration, although it might still be hard to reproduce with the configuration alone. I ran into the error seemingly at random while running some Train benchmarks, so I don't know how to reproduce it.

@bveeramani
Member Author

I'll try reproducing later today.

@david-waterworth

david-waterworth commented Mar 31, 2022

I'm seeing this exact same error running ray start --head on my Ubuntu 20.04 workstation.

Edit: My issue is that Ray fails to start because redis-server was already running. I noticed this by running ray stop and getting the message

Could not terminate "/usr/bin/redis-server 127.0.0.1:6379" "" "" "" "" "" "" "" due to (pid=8310, name='redis-server')

So I stopped redis-server with /etc/init.d/redis-server stop, and then ray start --head worked (but then my Redis-queue-based code started to fail). It appears that Ray depends on a version of redis-server that is different from the one required by the Python rq package, which is causing my issue.
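The workaround above amounts to a pre-flight check before `ray start --head`: fail fast if something (such as a system redis-server) already holds the port Ray wants. A minimal stdlib sketch, assuming the default Redis port 6379 that Ray 1.x used (the helper name and wording are illustrative, not Ray API):

```python
import socket
import sys

def assert_port_free(port, host="127.0.0.1"):
    """Exit with a diagnostic if `port` is already bound on `host`.

    Ray 1.x starts its own Redis server (default port 6379); a system
    redis-server on the same port makes `ray start --head` fail as above.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))  # succeeds only if nothing holds the port
        except OSError as e:
            sys.exit(
                f"Port {port} is already in use ({e}). Stop the conflicting "
                f"service (e.g. /etc/init.d/redis-server stop) before "
                f"running `ray start --head`."
            )
```

For example, calling `assert_port_free(6379)` in a launch script would surface the redis-server conflict before Ray's own startup produces the opaque gRPC error.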

@stale

stale bot commented Jul 30, 2022

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

  • If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
  • If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.

@stale stale bot added the "stale" (The issue is stale. It will be closed within 7 days unless there is further conversation) label Jul 30, 2022
@stale

stale bot commented Sep 24, 2022

Hi again! The issue will be closed because there has been no further activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

@stale stale bot closed this as completed Sep 24, 2022