
Long-lived connections cause the server to malfunction #490

Closed

Flandoll opened this issue Apr 10, 2021 · 101 comments

Comments

@Flandoll

After being connected for a while, the server's CPU usage rises abnormally and it becomes impossible to connect to the server.

In my environment, the easiest way to trigger this is to open several YouTube video tabs, play some of the videos a few times, and then leave the browser open for a long time (the computer may enter sleep mode in the meantime).

This started after I upgraded the server to 1.9.0 and has persisted through 1.10.1, which I am using now. I don't know whether 1.8.x is affected.

2021-04-05T21:24:47.366094002+00:00 INFO shadowsocks 1.10.1
2021-04-05T21:24:47.367802311+00:00 INFO shadowsocks tcp server listening on *.*.*.*:*, inbound address *.*.*.*:*
2021-04-06T08:57:47.521403191+00:00 WARN handshake failed, maybe wrong method or key, or under reply attacks. peer: *.*.*.*:*, error: unexpected end of file
2021-04-06T08:57:47.770084864+00:00 WARN handshake failed, maybe wrong method or key, or under reply attacks. peer: *.*.*.*:*, error: unexpected end of file
2021-04-06T08:58:15.940106176+00:00 WARN handshake failed, maybe wrong method or key, or under reply attacks. peer: *.*.*.*:*, error: unexpected end of file
2021-04-07T19:00:25.211424806+00:00 WARN handshake failed, maybe wrong method or key, or under reply attacks. peer: *.*.*.*:*, error: unexpected end of file
2021-04-09T06:49:23.888023166+00:00 ERROR tcp server accept failed with error: Too many open files (os error 24)
2021-04-09T06:49:24.888402494+00:00 ERROR tcp server accept failed with error: Too many open files (os error 24)
2021-04-09T06:51:59.888296564+00:00 ERROR tcp server accept failed with error: Too many open files (os error 24)
2021-04-09T06:52:00.890276714+00:00 ERROR tcp server accept failed with error: Too many open files (os error 24)
...
2021-04-09T23:33:54.829875812+00:00 ERROR tcp server accept failed with error: Too many open files (os error 24)
2021-04-09T23:33:55.903900128+00:00 ERROR tcp server accept failed with error: Too many open files (os error 24)
2021-04-09T23:33:56.929962562+00:00 ERROR tcp server accept failed with error: Too many open files (os error 24)
2021-04-09T23:33:58.038965784+00:00 ERROR tcp server accept failed with error: Too many open files (os error 24)
2021-04-09T23:33:59.084612733+00:00 ERROR tcp server accept failed with error: Too many open files (os error 24)

This is the log from the most recent failure. The subsequent entries are identical except for the timestamp, repeated more than fifty thousand times.

@zonyitoo
Collaborator

zonyitoo commented Apr 10, 2021

Your error logs are actually not reported that frequently; there is at least one second between each of them. So I don't think they are related to your high CPU usage.

But from your logs, there are several problems:

handshake failed, maybe wrong method or key, or under reply attacks.

This indicates that other clients are trying to connect to your port with the wrong method or key. You should track them down.

tcp server accept failed with error: Too many open files (os error 24)

You have too many connections alive. You should raise the file descriptor limit with -n or ulimit, for example:

ssserver -n 10240 --udp-max-associations 1024

--udp-max-associations will also help limit the number of file descriptors consumed by UDP associations.
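
For reference, here is a minimal, hypothetical sketch of how a process can raise its own RLIMIT_NOFILE on Linux, which is roughly what the -n option achieves. It assumes the libc crate and is not shadowsocks-rust's actual code:

// Hypothetical sketch (Linux-oriented, assumes the `libc` crate): raise the
// soft RLIMIT_NOFILE limit, roughly what `ssserver -n <N>` does at startup.
fn raise_nofile_limit(target: u64) -> std::io::Result<()> {
    unsafe {
        let mut rlim = libc::rlimit { rlim_cur: 0, rlim_max: 0 };
        if libc::getrlimit(libc::RLIMIT_NOFILE, &mut rlim) != 0 {
            return Err(std::io::Error::last_os_error());
        }
        // The soft limit cannot exceed the hard limit set by the administrator.
        rlim.rlim_cur = target.min(rlim.rlim_max);
        if libc::setrlimit(libc::RLIMIT_NOFILE, &rlim) != 0 {
            return Err(std::io::Error::last_os_error());
        }
    }
    Ok(())
}

Raising the soft limit this way only works up to the hard limit, so ulimit (or systemd's LimitNOFILE) still matters.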

zonyitoo added a commit that referenced this issue Apr 10, 2021
@Chandler-Lu

Chandler-Lu commented Apr 13, 2021

Your error logs are actually not reported that frequently; there is at least one second between each of them. So I don't think they are related to your high CPU usage.

But from your logs, there are several problems:

handshake failed, maybe wrong method or key, or under reply attacks.

This indicates that other clients are trying to connect to your port with the wrong method or key. You should track them down.

This problem does seem to exist (#428). Concretely, after a period with no connections, the first new connection cannot communicate with the server. And the "handshake failed, maybe wrong method or key, or under reply attacks" message appears if and only if I initiate a connection.
Personally I don't think it looks like other clients connecting or a replay attack, because whenever the problem occurs, immediately switching to libev with the same configuration works perfectly fine.

@zonyitoo
Collaborator

because whenever the problem occurs, immediately switching to libev with the same configuration

Did you switch the server to libev, or the client?

@Chandler-Lu

Chandler-Lu commented Apr 13, 2021 via email

@DuckSoft
Contributor

DuckSoft commented Apr 13, 2021

immediately switching to libev

When you switch server implementations, the IV replay filter is not carried over, so replayed/abnormal packets will pass through "normally". I strongly advise against doing this.

Please keep in mind: whenever you restart a Shadowsocks server, you must change the password.

1. Which port is the server listening on? Is it possible someone scanned it with a tool?

2. In handshake failed, maybe wrong method or key, or under reply attacks. peer: *.*.*.*:*, error: unexpected end of file, is the peer an IP address you recognize / your own client / the remote IP?

@Chandler-Lu

immediately switching to libev

When you switch server implementations, the IV replay filter is not carried over, so replayed/abnormal packets will pass through "normally". I strongly advise against doing this.

Please keep in mind: whenever you restart a Shadowsocks server, you must change the password.

1. Which port is the server listening on? Is it possible someone scanned it with a tool?

2. In handshake failed, maybe wrong method or key, or under reply attacks. peer: *.*.*.*:*, error: unexpected end of file, is the peer an IP address you recognize / your own client / the remote IP?

The server listens on 21030, and iptables only allows connections from the relay IP. I don't think it's likely that I'm being scanned, because "maybe wrong method or key" only appears when I connect myself; it can't be that someone else happens to scan me exactly when I'm connecting.

@DuckSoft
Contributor

because "maybe wrong method or key" only appears when I connect myself

let target_addr = match Address::read_from(&mut self.stream).await {

Looking at the source code, this error can only occur when the Address of a new connection cannot be read. If even the Address cannot be read, you are very likely under a targeted replay attack, or the relay tunnel has transport problems.

What is your relay setup? Are you using a SIP003 plugin such as simple-obfs?

@Chandler-Lu

because "maybe wrong method or key" only appears when I connect myself

let target_addr = match Address::read_from(&mut self.stream).await {

Looking at the source code, this error can only occur when the Address of a new connection cannot be read. If even the Address cannot be read, you are very likely under a targeted replay attack, or the relay tunnel has transport problems.

What is your relay setup? Are you using a SIP003 plugin such as simple-obfs?

I'm not sure about the internal tunnel setup; it's provided by the hosting provider. The exit -> landing hop is relayed with iptables, with no SIP003 plugin. During testing I kept the configuration as minimal as possible.

Concretely, when measuring latency with tcping (clients: QX and Clash), if a connection is initiated after a long idle period, TCP retransmissions may occur (latency in the thousands of milliseconds), together with the "wrong method or key" error, whereas the libev server does not show this behavior.

@zonyitoo
Collaborator

zonyitoo commented Apr 13, 2021

WARN handshake failed, maybe wrong method or key, or under reply attacks. peer: ...:*, error: unexpected end of file

As we can see, the error is unexpected end of file, which means Address::read_from got EOF before it finished reading the Address.

As far as I know, libev won't print any logs about this error. Maybe that's why you think everything works fine with libev.

Libev doesn't treat EOF as an error.

@zonyitoo
Collaborator

zonyitoo commented Apr 13, 2021

If you can be sure that no attacker is trying to replay your connections, then:

  1. When you initiate a new connection to the server, why does the server's read() return EOF before a complete target Address is received? Normally EOF is triggered by an RST; could you run tcpdump and check whether ssserver received an RST?
  2. If it didn't receive an RST, you could run strace on ssserver and see what exactly happens when read() returns EOF.

@zonyitoo
Collaborator

zonyitoo commented Apr 13, 2021

Concretely, when measuring latency with tcping (clients: QX and Clash), if a connection is initiated after a long idle period, TCP retransmissions may occur (latency in the thousands of milliseconds)

Could you test again without tcping?

I took a glance at tcping's code, and these lines will definitely trigger the error log you mentioned above.
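
As an illustration (hypothetical code, not tcping's actual implementation), a tcping-style probe completes the TCP handshake and then closes the socket without ever sending a Shadowsocks header, so the server's first read of the target Address sees EOF:

// Hypothetical sketch of a tcping-style latency probe: connect, then close
// immediately. No payload is sent, so ssserver reads EOF before any Address.
use std::net::TcpStream;
use std::time::{Duration, Instant};

fn tcp_ping(addr: &str) -> std::io::Result<Duration> {
    let start = Instant::now();
    let stream = TcpStream::connect(addr)?; // latency is just the handshake time
    let elapsed = start.elapsed();
    drop(stream); // closed without writing anything
    Ok(elapsed)
}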

@zonyitoo zonyitoo reopened this Apr 13, 2021
@zonyitoo
Collaborator

zonyitoo commented Apr 13, 2021

diff --git a/crates/shadowsocks-service/src/server/tcprelay.rs b/crates/shadowsocks-service/src/server/tcprelay.rs
index f54aea8..9e7d79a 100644
--- a/crates/shadowsocks-service/src/server/tcprelay.rs
+++ b/crates/shadowsocks-service/src/server/tcprelay.rs
@@ -13,7 +13,7 @@ use shadowsocks::{
     crypto::v1::CipherKind,
     net::{AcceptOpts, TcpStream as OutboundTcpStream},
     relay::{
-        socks5::Address,
+        socks5::{Address, Error as Socks5Error},
         tcprelay::{
             utils::{copy_from_encrypted, copy_to_encrypted},
             ProxyServerStream,
@@ -89,6 +89,13 @@ impl TcpServerClient {
     async fn serve(mut self) -> io::Result<()> {
         let target_addr = match Address::read_from(&mut self.stream).await {
             Ok(a) => a,
+            Err(Socks5Error::IoError(ref err)) if err.kind() == ErrorKind::UnexpectedEof => {
+                debug!(
+                    "handshake failed, received EOF before a complete target Address, peer: {}",
+                    self.peer_addr
+                );
+                return Ok(());
+            }
             Err(err) => {
                 // https://github.com/shadowsocks/shadowsocks-rust/issues/292
                 //

Here is an easy solution for staying compatible with tcping. It may possibly become a regression of #292. What do you think? @studentmain

@Flandoll
Author

The server just malfunctioned again, and as soon as I try to connect to it, its CPU usage rises.

[screenshot]

CPU usage stayed at 100% for a very long time.

@zonyitoo
Collaborator

In the meantime, did you see any error logs?

zonyitoo added a commit that referenced this issue Apr 15, 2021
- ref #490
- Enable clippy on github action
@Flandoll
Author

In the meantime, did you see any error logs?

The logs are the same as before; there is no additional information.

@zonyitoo
Collaborator

zonyitoo commented Apr 15, 2021

@Flandoll

  1. Did you increase fd limits?
  2. Could you run perf top on this ssserver process and see what exactly it was doing when CPU usage is high?
  3. Is your password long? How many bytes does your password have?

@dev4u

dev4u commented Apr 15, 2021

If possible, you could downgrade to an earlier version to verify whether the problem is caused by the version.

@Flandoll
Author

@zonyitoo

  1. Did you increase fd limits?

It's the system default.

  2. Could you run perf top on this ssserver process and see what exactly it was doing when CPU usage is high?

[screenshot]
I ran perf top and lsof while CPU usage was high.

  3. Is your password long? How many bytes does your password have?

The password is 13 characters: a mix of upper- and lowercase letters, symbols, and digits.

@zonyitoo
Collaborator

zonyitoo commented Apr 18, 2021

@Flandoll

As I said, you have to raise your fd limit, because you have already seen: tcp server accept failed with error: Too many open files (os error 24)

It is obvious that you have already exhausted your fds:

[screenshot]

Just start the server with -n 10240 --udp-max-associations 1024 and see whether you still encounter the same problem.

Test with the latest v1.10.7.

@zonyitoo
Collaborator

zonyitoo commented Apr 18, 2021

If you can still reproduce it after increasing the fd limits, could you tell me how I can reproduce it in my local environment?

What are your reproduction steps?

@Flandoll
Author

What are your reproduction steps?

I don't have a quick way to reproduce it either. Usually I just open multiple tabs in Firefox playing YouTube videos and leave the tabs open for a long time.
The fds are usually exhausted after a few dozen hours, though sometimes it takes several days.

@zonyitoo
Collaborator

zonyitoo commented Apr 19, 2021

@Flandoll Have you increased your fd limit? Running out of fds is quite normal when you are running with the system's default limit.

@Flandoll
Author

Flandoll commented Apr 19, 2021

About 11 hours ago I restarted the server with -n 10240 --udp-max-associations 1024; the fd count has since risen to 340.

Although the fd count occasionally drops slightly, the overall trend is upward. If this keeps up, the fds will eventually be exhausted.

@zonyitoo
Collaborator

the fds will eventually be exhausted

It shouldn't. 10240 should be enough for normal users.

@zonyitoo
Collaborator

Did you see the same problem again?

@Flandoll
Author

The server has not become unreachable so far, but I've observed that the fd count keeps rising; it has reached 2814 as of now.

@zonyitoo
Collaborator

zonyitoo commented Jun 5, 2021

Please try the nightly build.

And I am looking for a way to handle the situation I just found above.

@zonyitoo
Collaborator

zonyitoo commented Jun 5, 2021

Here is another build https://github.com/shadowsocks/shadowsocks-rust/actions/runs/909993185 that tries to purge those half-open connections.

CLIENT --xxxx-- PROXY ------- ssserver -------- TARGET

When the CLIENT kills itself without sending FIN or RST to the PROXY, the two connections PROXY - ssserver and ssserver - TARGET won't be closed actively, because neither side knows when to close.

But most TARGETs will close the connection if there is no data activity, so that should help purge the PROXY - ssserver connection.

The PROXY may be a router.
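
To illustrate the idea of purging half-open tunnels, here is a hypothetical sketch using tokio's timeout and copy_bidirectional. It is not the project's actual relay code, and the time limit is only an example:

use std::time::Duration;
use tokio::{io::copy_bidirectional, net::TcpStream, time::timeout};

// Hypothetical sketch: bound a tunnel's lifetime so that half-open connections
// eventually get dropped and their file descriptors are released.
async fn relay(mut inbound: TcpStream, mut outbound: TcpStream) -> std::io::Result<()> {
    match timeout(
        Duration::from_secs(24 * 60 * 60), // example cap; a real idle timeout is more refined
        copy_bidirectional(&mut inbound, &mut outbound),
    )
    .await
    {
        Ok(result) => result.map(|_| ()),
        Err(_elapsed) => Ok(()), // timed out: both sockets are closed when dropped here
    }
}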

@Flandoll
Author

Flandoll commented Jun 5, 2021

I noticed something: with the x86_64-unknown-linux-gnu package, as well as a binary I built myself on CentOS, the Timer column in the output of netstat -alnpo is always off.

tcp        0      0 10.1.0.5:54750          40.100.54.34:443        ESTABLISHED 10912/ssserver       off (0.00/0/0)
tcp        0      0 10.1.0.5:51708          52.42.216.19:443        ESTABLISHED 10912/ssserver       off (0.00/0/0)
tcp        0      0 10.1.0.5:53408          52.98.85.194:443        ESTABLISHED 10912/ssserver       off (0.00/0/0)

Only when using the executable from the x86_64-unknown-linux-musl package does the Timer column show keepalive.

tcp        0      0 10.0.0.5:36366          172.217.24.142:443      ESTABLISHED 25480/ssserver-02    keepalive (7.87/0/0)
tcp        0      0 10.0.0.5:57732          54.148.233.111:443      ESTABLISHED 25480/ssserver-02    keepalive (5.47/0/0)
tcp        0      0 10.0.0.5:48594          91.213.30.173:443       ESTABLISHED 25480/ssserver-02    keepalive (13.08/0/0)

@zonyitoo
Collaborator

zonyitoo commented Jun 6, 2021

The nightly build only contains x86_64-unknown-linux-musl.

Did you compile with the latest master branch?

@Flandoll
Author

Flandoll commented Jun 6, 2021

Did you compile with the latest master branch?

Yes, commit e32c869 on master.

@zonyitoo
Collaborator

zonyitoo commented Jun 6, 2021

It shouldn't be off. Maybe that is the root cause?

Could you run strace on the ssserver process and see whether it actually enables SO_KEEPALIVE?

@Flandoll
Author

Flandoll commented Jun 6, 2021

I just tested these four prebuilt ssserver binaries: 1.11.0 x86_64-unknown-linux-gnu/musl and 1.11.1 x86_64-unknown-linux-gnu/musl.

For 1.11.0, Timer only shows off; for 1.11.1, Timer shows all three states: on, off, and keepalive.
I also rebuilt from commit d61d102, and its Timer states are the same as 1.11.1.

strace output for 1.11.0:

[neko@* ~]$ sudo strace -p 161079 -e network
strace: Process 161079 attached
accept4(12, {sa_family=AF_INET, sin_port=htons(55878), sin_addr=inet_addr("*.*.*.*")}, [128->16], SOCK_CLOEXEC|SOCK_NONBLOCK) = 19
accept4(12, 0x7ffc290470b8, [128], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(12, {sa_family=AF_INET, sin_port=htons(55883), sin_addr=inet_addr("*.*.*.*")}, [128->16], SOCK_CLOEXEC|SOCK_NONBLOCK) = 21
accept4(12, 0x7ffc290470b8, [128], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(12, {sa_family=AF_INET, sin_port=htons(55886), sin_addr=inet_addr("*.*.*.*")}, [128->16], SOCK_CLOEXEC|SOCK_NONBLOCK) = 23
accept4(12, 0x7ffc290470b8, [128], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
...

strace output for d61d102:

[neko@* ~]$ sudo strace -p 161493 -e network
strace: Process 161493 attached
accept4(9, {sa_family=AF_INET, sin_port=htons(52915), sin_addr=inet_addr("*.*.*.*")}, [128->16], SOCK_CLOEXEC|SOCK_NONBLOCK) = 13
setsockopt(13, SOL_TCP, TCP_NODELAY, [0], 4) = 0
setsockopt(13, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
setsockopt(13, SOL_TCP, TCP_KEEPIDLE, [15], 4) = 0
setsockopt(13, SOL_TCP, TCP_KEEPINTVL, [15], 4) = 0
accept4(9, 0x7ffdadd87638, [128], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(9, {sa_family=AF_INET, sin_port=htons(52919), sin_addr=inet_addr("*.*.*.*")}, [128->16], SOCK_CLOEXEC|SOCK_NONBLOCK) = 15
setsockopt(15, SOL_TCP, TCP_NODELAY, [0], 4) = 0
setsockopt(15, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
setsockopt(15, SOL_TCP, TCP_KEEPIDLE, [15], 4) = 0
setsockopt(15, SOL_TCP, TCP_KEEPINTVL, [15], 4) = 0
accept4(9, 0x7ffdadd87638, [128], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
accept4(9, {sa_family=AF_INET, sin_port=htons(52923), sin_addr=inet_addr("*.*.*.*")}, [128->16], SOCK_CLOEXEC|SOCK_NONBLOCK) = 17
setsockopt(17, SOL_TCP, TCP_NODELAY, [0], 4) = 0
setsockopt(17, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
setsockopt(17, SOL_TCP, TCP_KEEPIDLE, [15], 4) = 0
setsockopt(17, SOL_TCP, TCP_KEEPINTVL, [15], 4) = 0
accept4(9, 0x7ffdadd87638, [128], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
...

@zonyitoo
Collaborator

zonyitoo commented Jun 7, 2021

Yes, SO_KEEPALIVE is not enabled by default. So v1.11.1 enables SO_KEEPALIVE and sets the timeouts to a default of 15 seconds.

Does it fix your issue?
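
For context, here is a minimal sketch of enabling keepalive with 15-second probes on an accepted socket, matching the SO_KEEPALIVE / TCP_KEEPIDLE / TCP_KEEPINTVL calls visible in the strace output above. It assumes the socket2 crate (0.5-style API) and is an illustration, not the project's exact code:

use std::net::TcpStream;
use std::time::Duration;
use socket2::{SockRef, TcpKeepalive};

// Illustrative sketch: turn on SO_KEEPALIVE with a 15s idle time (TCP_KEEPIDLE)
// and a 15s probe interval (TCP_KEEPINTVL) on an accepted stream.
fn enable_keepalive(stream: &TcpStream) -> std::io::Result<()> {
    let keepalive = TcpKeepalive::new()
        .with_time(Duration::from_secs(15))
        .with_interval(Duration::from_secs(15));
    SockRef::from(stream).set_tcp_keepalive(&keepalive)
}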

@zonyitoo
Collaborator

zonyitoo commented Jun 7, 2021

One interesting finding: Azure's Firewall drops idle TCP connections with an RST after 4 minutes of no activity.

https://docs.microsoft.com/en-us/azure/firewall/firewall-faq#how-does-azure-firewall-handle-idle-timeouts

How does Azure Firewall handle idle timeouts?

When a connection has an idle timeout (four minutes of no activity), Azure Firewall gracefully terminates the connection by sending a TCP RST packet.

So, technically, even if SO_KEEPALIVE is not enabled, Azure's Firewall will purge dead connections for you.

@Flandoll
Author

Flandoll commented Jun 7, 2021

I noticed Azure's connection timeout earlier, but in fact these half-open connections were not purged. I even suspected that the firewall did not send an RST and simply dropped the connection instead.

@1fd

1fd commented Jun 7, 2021

How was this problem resolved in the end? I seem to have the same issue.

@zonyitoo
Collaborator

zonyitoo commented Jun 7, 2021

Just use v1.11.1.

@1fd

1fd commented Jun 7, 2021

I've already updated to 1.11.1, using shadowsocks-v1.11.1.x86_64-unknown-linux-gnu.tar.xz.

Is this normal? https://z3.ax1x.com/2021/06/07/20HJmQ.png

@zonyitoo
Collaborator

zonyitoo commented Jun 7, 2021

As the log says: wrong method or key.

@Klaaktu

Klaaktu commented Jun 9, 2021

...by sending a TCP RST packet.

Doesn't Shadowsocks ignore RST?

@zonyitoo
Collaborator

zonyitoo commented Jun 9, 2021

Since this project uses the network stack provided by the kernel, how could it ignore RST?

@Klaaktu

Klaaktu commented Jun 9, 2021

Ah ok, I think I saw somewhere that the original project did but couldn't find it...
Btw, can the new keepalive option be set to the kernel default in some way?

@zonyitoo
Collaborator

zonyitoo commented Jun 9, 2021

Well, currently there is no way to do that. But in most cases there is no reason to use the system's default configuration.

@zonyitoo
Collaborator

@Flandoll Does v1.11.1/2 fix your issue?

@Flandoll
Author

I haven't observed any abnormal fd counts so far.

@zonyitoo
Collaborator

So it should be considered fixed. Great.

@pexcn pexcn mentioned this issue Jun 17, 2021
atkdef pushed a commit to atkdef/shadowsocks-rust that referenced this issue Sep 24, 2021
- default timeout is 24 hours
- ref shadowsocks#490
- refactored tunnel copy with copy_bidirectional to avoid unnecessary
  splits
atkdef pushed a commit to atkdef/shadowsocks-rust that referenced this issue Sep 24, 2021
- ref shadowsocks#490
- fixed bug, ssserver slient-drop doesn't work
atkdef pushed a commit to atkdef/shadowsocks-rust that referenced this issue Sep 24, 2021
atkdef pushed a commit to atkdef/shadowsocks-rust that referenced this issue Sep 24, 2021
atkdef pushed a commit to atkdef/shadowsocks-rust that referenced this issue Sep 24, 2021
atkdef pushed a commit to atkdef/shadowsocks-rust that referenced this issue Sep 24, 2021