Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

callhome getting a timeout after 10 mins #201

Open
vishalearnz opened this issue Jan 17, 2020 · 14 comments
Open

callhome getting a timeout after 10 mins #201

vishalearnz opened this issue Jan 17, 2020 · 14 comments
Labels
information Includes some important information, perhaps in addition to a bug report or question.

Comments

@vishalearnz
Copy link

vishalearnz commented Jan 17, 2020

Hi Michal,

I am implemented callhome feature where I am using API's of Libnetconf2.
I am noticing that every 10 mins the device is getting disconnected . Can you tell me , if it can be made persistent i.e I dont want to get it disconnected.
Also I want to know , in which of the files this 10 min timer is present , so that If I want to extend it , I can do .

PS: I am using latest libnetconf2 tag (v0.12-r2) , ".so" gets created with name libnetconf2.so.0.12.55

Thanks & Regards,
Vishal

@michalvasko
Copy link
Member

Hi Vishal,
could you please provide some more information? What does the client/server prints, so you at least know whether the server terminates the session or the client? Also, you are hardly using the latest libnetconf2 tag, we have released version 1.1.3 with the same tag.

Regards,
Michal

@vishalearnz
Copy link
Author

Hi Michal,
At the server end we have set a callback which calls
nc_ps_poll(nc_ps, -1, &ncs);
after 10 mins nc_ps_poll returns "NC_PSPOLL_SESSION_TERM" and disconnects the callhome device

As the client we are using ODL ,
Logs at the ODL says :
6:55:02.764 INFO [nioEventLoopGroup-4-56] Removing Uri{_value=NodeCallHome3} from Netconf Topology.
16:55:02.764 WARN [nioEventLoopGroup-4-56] RemoteDevice{NodeCallHome3}: Session terminated Session closed
16:55:02.770 INFO [nioEventLoopGroup-4-56] Removing Uri{_value=NodeCallHome3} from Netconf Topology.
16:55:02.773 INFO [opendaylight-cluster-data-notification-dispatcher-185] Setting failed status for callhome device id:Uri{_value=NodeCallHome3}.
16:55:02.773 INFO [opendaylight-cluster-data-notification-dispatcher-181] Setting failed status for callhome device id:Uri{_value=NodeCallHome3}.
16:55:02.773 INFO [opendaylight-cluster-data-notification-dispatcher-184] Setting failed status for callhome device id:Uri{_value=NodeCallHome3}.
16:55:02.774 WARN [nioEventLoopGroup-4-56] RemoteDevice{NodeCallHome3}: Session went down
java.io.IOException: End of input detected. Close the session.
at org.opendaylight.netconf.nettyutil.AbstractNetconfSession.endOfInput(AbstractNetconfSession.java:94) [281:org.opendaylight.netconf.netty-util:1.6.1]
at org.opendaylight.netconf.nettyutil.AbstractNetconfSession.channelInactive(AbstractNetconfSession.java:174) [281:org.opendaylight.netconf.netty-util:1.6.1]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:242) [104:io.netty.transport:4.1.34.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:228) [104:io.netty.transport:4.1.34.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:221) [104:io.netty.transport:4.1.34.Final]
at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:390) [99:io.netty.codec:4.1.34.Final]
at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:355) [99:io.netty.codec:4.1.34.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:242) [104:io.netty.transport:4.1.34.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:228) [104:io.netty.transport:4.1.34.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:221) [104:io.netty.transport:4.1.34.Final]
at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:390) [99:io.netty.codec:4.1.34.Final]
at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:355) [99:io.netty.codec:4.1.34.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:242) [104:io.netty.transport:4.1.34.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:228) [104:io.netty.transport:4.1.34.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:221) [104:io.netty.transport:4.1.34.Final]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1403) [104:io.netty.transport:4.1.34.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:242) [104:io.netty.transport:4.1.34.Final]
at io.netty.channel.AbstractChannelHandlerContext.access$300(AbstractChannelHandlerContext.java:38) [104:io.netty.transport:4.1.34.Final]
at io.netty.channel.AbstractChannelHandlerContext$4.run(AbstractChannelHandlerContext.java:233) [104:io.netty.transport:4.1.34.Final]
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) [101:io.netty.common:4.1.34.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404) [101:io.netty.common:4.1.34.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:495) [104:io.netty.transport:4.1.34.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:905) [101:io.netty.common:4.1.34.Final]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [101:io.netty.common:4.1.34.Final]
at java.lang.Thread.run(Thread.java:745) [?:?]

Regards,
Vishal

@michalvasko
Copy link
Member

Hi Vishal,
okay, it would seem the server is terminating the connection, could you enable some (-v2) messages? The server should print the reason for why it is disconnecting. But the only default 10 min timeout I found in the old code was hello timeout. If that is the one elapsed, there is a problem with the client, but it is too soon to say, please increase verbosity of the server and then I can help you more.

Regards,
Michal

@vishalearnz
Copy link
Author

vishalearnz commented Jan 21, 2020

Hi Michal,

I am not using netopeer so -v2 i cant do But I have added verbosity to the logs using the api's
'nc_verbosity(NC_VERB_DEBUG)' and a callback function added using 'nc_set_print_clb()'
Following are the logs after enabling verbose mode:

After 10 mins logs:

Jan 21 04:43:45 6 2 0 netconfAgent:NoName D [19:39:59] LibNetConfOriginator.cpp:264
Jan 21 04:43:45 6 2 0 netconfAgent:NoName D LIBNETCONF : Session 1: SSH channel poll error (Received SSH_MSG_DISCONNECT: 2:User session has timed out idling after 600000 ms.).
Jan 21 04:43:45 6 2 0 netconfAgent:NoName D [19:39:59] LibNetConfOriginator.cpp:126
Jan 21 04:43:45 6 2 0 netconfAgent:NoName D Got nc_poll_sess_termination.Deleting and exiting the worker thread
Jan 21 04:43:46 6 2 0 netconfAgent:NoName D [19:40:00] LibNetConfOriginator.cpp:105
Jan 21 04:43:46 6 2 0 netconfAgent:NoName D Inside free_ds

After a min again logs:

Jan 21 04:44:26 6 2 0 netconfAgent:NoName D LIBNETCONF : Session 1: Call Home thread failed to wake up in a timely manner, fatal synchronization problem.
Jan 21 04:44:26 6 2 0 netconfAgent:NoName D [19:40:40] LibNetConfOriginator.cpp:264
Jan 21 04:44:26 6 2 0 netconfAgent:NoName D LIBNETCONF : Pthread condition timedwait failed (Operation not permitted).
Jan 21 04:44:26 6 2 0 netconfAgent:NoName D [19:40:40] LibNetConfOriginator.cpp:264
Jan 21 04:44:26 6 2 0 netconfAgent:NoName D LIBNETCONF : Call Home client "OdlClient" thread exit

Inference

It seems that, the ssh is disconnecting because keepalive is not there which should tell the ssh to be online. So is there a way in libnetconf2 to send keepalive messages to ssh which will not stop the termination of connection?

version of libssh used - libssh-0.9.0

Thanks & Regards,
Vishal

@michalvasko
Copy link
Member

Hi Vishal,
I was right, this is not a libnetconf2 timeout but an SSH one. Keep-alive should be enabled for all the sockets (both server and client) but it is on TCP-level, this seems like a specific SSH one. You will have to configure this yourself. It may even be in the configuration of your distribution, libssh applies this configuration since version 0.9.0 (files in /etc/ssh).

Regards,
Michal

@vishalearnz
Copy link
Author

vishalearnz commented Jan 22, 2020

Hi Michal,

As per your suggestion, and surfing the internet I found out that, we can add "ClientAliveInterval"/"ServerAliveInterval" option to /etc/ssh/sshd_config file so that the the session can be alive

Reference : - https://unix.stackexchange.com/questions/3026/what-options-serveraliveinterval-and-clientaliveinterval-in-sshd-config-exac

But this din't work at all . After looking through the libssh code(0.9.0) I found out that the above two options are not present and only following are the options available:

enum ssh_bind_config_opcode_e {
32     /* Known but not allowed in Match block */
33     BIND_CFG_NOT_ALLOWED_IN_MATCH = -4,
34     /* Unknown opcode */
35     BIND_CFG_UNKNOWN = -3,
36     /* Known and not applicable to libssh */
37     BIND_CFG_NA = -2,
38     /* Known but not supported by current libssh version */
39     BIND_CFG_UNSUPPORTED = -1,
40     BIND_CFG_INCLUDE,
41     BIND_CFG_HOSTKEY,
42     BIND_CFG_LISTENADDRESS,
43     BIND_CFG_PORT,
44     BIND_CFG_LOGLEVEL,
45     BIND_CFG_CIPHERS,
46     BIND_CFG_MACS,
47     BIND_CFG_KEXALGORITHMS,
48     BIND_CFG_MATCH,
49     BIND_CFG_PUBKEY_ACCEPTED_KEY_TYPES,
50     BIND_CFG_HOSTKEY_ALGORITHMS,
51 
 52     BIND_CFG_MAX /* Keep this one last in the list */
53 };

Also when I enabled the logs in libssh and redirected it to print it to file following are the results I got:

Unknown option: Protocol, line: 3
Unknown option: Banner, line: 4
Unknown option: RhostsRSAAuthentication, line: 5
Unknown option: HostbasedAuthentication, line: 6
Unknown option: PasswordAuthentication, line: 7
Unknown option: PermitEmptyPasswords, line: 8
Unknown option: ChallengeResponseAuthentication, line: 9
Unknown option: AllowTcpForwarding, line: 10
Unknown option: PidFile, line: 12
Unknown option: TCPKeepAlive, line: 13
Unknown option: ClientAliveCountMax, line: 14
Unknown option: ClientAliveInterval, line: 15
Unknown option: MaxStartups, line: 16 
Unknown option: ForceCommand, line: 46
Count 46
Poll callback on socket 12 (POLLOUT ), out buffer 22
Received POLLOUT in connecting state
ssh_handle_key_exchange: current state : 2
Poll callback on socket 12 (POLLOUT ), out buffer 22
Enabling POLLOUT for socket
ssh_handle_key_exchange: current state : 2
Poll callback on socket 12 (POLLIN POLLOUT ), out buffer 0
Received banner: SSH-2.0-OpenSSH_5.3
SSH client banner: SSH-2.0-OpenSSH_5.3
Analyzing banner: SSH-2.0-OpenSSH_5.3
We are talking to an OpenSSH client version: 5.3 (50300)
kex algos: ecdh-sha2-nistp256,diffie-hellman-group14-sha1,diffie-hellman-group-exchange-sha1                                                                                    
server host key algo: rsa-sha2-512,rsa-sha2-256,ssh-rsa                                                                                                                         
encryption client->server: aes128-ctr,aes256-ctr,aes192-ctr,[email protected],[email protected],[email protected]                                         
encryption server->client: aes128-ctr,aes256-ctr,aes192-ctr,[email protected],[email protected],[email protected]                                         
mac algo client->server: hmac-sha2-512,hmac-sha2-256,[email protected],[email protected],hmac-sha1,[email protected]
mac algo server->client: hmac-sha2-512,hmac-sha2-256,[email protected],[email protected],hmac-sha1,[email protected]    
compression algo client->server: none                                                                                                                                           
compression algo server->client: none                                                                                                                                           
languages client->server:                                                                                                                                                       
languages server->client:                                                                                                                                                       
packet: wrote [type=20, len=652, padding_size=5, comp=646, payload=646]                                                                                                         
SSH_MSG_KEXINIT sent                                                                                                                                                            
Enabling POLLOUT for socket                                                                                                                                                     
ssh_handle_key_exchange: current state : 4                                                                                                                                      
Poll callback on socket 12 (POLLIN POLLOUT ), out buffer 0                                                                                                                      
packet: read type 20 [len=868,padding=8,comp=859,payload=859]
Dispatching handler for packet type 20
kex algos: diffie-hellman-group-exchange-sha1
server host key algo: [email protected],[email protected],[email protected],[email protected],ssh-rsa,ssh-dss                       
encryption client->server: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,[email protected]
encryption server->client: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,[email protected]
mac algo client->server: hmac-md5,hmac-sha1,[email protected],hmac-sha2-256,hmac-sha2-512,hmac-ripemd160,[email protected],hmac-sha1-96,hmac-md5-96                  
mac algo server->client: hmac-md5,hmac-sha1,[email protected],hmac-sha2-256,hmac-sha2-512,hmac-ripemd160,[email protected],hmac-sha1-96,hmac-md5-96                  
compression algo client->server: none,[email protected],zlib                                                                                                                     
compression algo server->client: none,[email protected],zlib                                                                                                                     
languages client->server:                                                                                                                                                       
languages server->client:
Negotiated diffie-hellman-group-exchange-sha1,ssh-rsa,aes128-ctr,aes128-ctr,hmac-sha1,hmac-sha1,none,none,,
Set output algorithm aes128-ctr                                                                                                                                                 
Set HMAC output algorithm to hmac-sha1                                                                                                                                          
Set input algorithm aes128-ctr                                                                                                                                                  
Set HMAC input algorithm to hmac-sha1                                                                                                                                           
sending control flow event                                                                                                                                                      
sending channel_write_wontblock callback                                                                                                                                        
ssh_handle_key_exchange: current state : 6                                                                                                                                      
Poll callback on socket 12 (POLLOUT ), out buffer 0                                                                                                                             
sending control flow event                                                                                                                                                      
sending channel_write_wontblock callback                                                                                                                                        
ssh_handle_key_exchange: current state : 6                                                                                                                                      
ssh_handle_key_exchange: current state : 6                                                                                                                                      
ssh_handle_key_exchange: current state : 6                                                                                                                                      
ssh_handle_key_exchange: current state : 6                                                                                                                                      
ssh_handle_key_exchange: current state : 6                                                                                                                                      
ssh_handle_key_exchange: current state : 6                                                                                                                                      
ssh_handle_key_exchange: current state : 6                                                                                                                                      
ssh_handle_key_exchange: current state : 6                                                                                                                                      
Poll callback on socket 12 (POLLIN ), out buffer 0                                                                                                                              
packet: read type 34 [len=20,padding=6,comp=13,payload=13]                                                                                                                      
Dispatching handler for packet type 34

Every other website which I saw is saying the same thing to add those options . But those options are not available in libssh code at all.
Can you pl help me on this , or if you can tell me whom should be the better person to ask this.

Thanks & Regards.
Vishal

@michalvasko
Copy link
Member

Hi Vishal,
this quite interesting information for any future reference but like you said, you cannot set it in libssh so I do not think it is relevant for libnetconf2. My guess is that your NETCONF client (ODL) is the one disconnecting, which is supported by this message

Jan 21 04:43:45 6 2 0 netconfAgent:NoName D LIBNETCONF : Session 1: SSH channel poll error (Received SSH_MSG_DISCONNECT: 2:User session has timed out idling after 600000 ms.).

It says that it has failed to poll a channel because it has disconnected and it then prints out the reason that the client sent. So, you need to configure this timeout in your client somehow, it is not a libnetconf2 issue.

Regards,
Michal

@vishalearnz
Copy link
Author

vishalearnz commented Jan 28, 2020

Hi Michal,

This is just an FYI .
As you said that problem was not with libnetconf2 .
libssh terminates the connection after 10 minutes if it doesnot gets response from client.
Also the parameters like "ClientAliveInterval"/"ServerAliveInterval" is provided by openssh in /etc/config/sshd_config file and libssh is not using these as they term it as notsafe .
So to keep the server alive one needs to send an unsolicited dummy message before the timeout to keep it alive . For this one can use ssh_send_ignore() api of libssh.
I tried it and it worked for me .

For reference pl refer to the link : https://bugs.libssh.org/T212

Also to add to this , this issue of connection termination at fixed intervals(10 min) is with ODL client , for MGSoft client it works fine (without changing the code) and doesn't gets disconnected.

Thanks & Regards,
Vishal

@michalvasko
Copy link
Member

Hi Vishal,
thanks for the information. So it is like I said and it seems to me the correct solution would be to change the ODL client configuration, right? I just want to make sure I understand the problem, you are probably not able to change anything in the client.

Regards,
Michal

@vishalearnz
Copy link
Author

Hi Michal ,

In a way yes . I am not changing the the ODL client configuration . I don't have much info on what to change in ODL client for the same.

But still as per the RFC 8071 for netconf callhome

This states that :
If a persistent connection is desired, the NETCONF/RESTCONF server, as the connection initiator, SHOULD actively test the aliveness of the connection using a keep-alive mechanism. For SSH-based connections, per Section 4 of [RFC4254], the server SHOULD
send an SSH_MSG_GLOBAL_REQUEST message with a purposely nonexistent "request name" value(e.g. [email protected]) and the "want reply" value set to '1'.

So to do this it has to be a change in server side , which i think would be correct .
Is there anything you can suggest me on this to implement it.

Regards,
Vishal

@michalvasko
Copy link
Member

Hi Vishal,
libnetconf2 supports sending keep-alive messages but since the spec you quoted cannot be satisfied with the libraries we use (OpenSSL issue openssl/openssl#4856, libssh does not provide the API as seen in your bug) to my knowledge, it uses TCP packets and it worked fine for most use cases. But it seems ODL has a really aggressive config and implements a separate SSH timeout, which needs SSH keep-alive so there is nothing we can do about that in libnetconf2. I suppose your solution of sending some other dummy SSH message is fine but I would prefer not to add it into the library. Also, it is the server that should send keep-alives based on the specs, not the client.

Regards,
Michal

@michalvasko michalvasko added the information Includes some important information, perhaps in addition to a bug report or question. label Jan 28, 2020
@tonic12
Copy link

tonic12 commented Jan 31, 2020

Hi, Michal!

ODL for netconf-callhome use (https://tools.ietf.org/html/rfc8071) in section-4.1 point S7 we looks:

If a persistent connection is desired, the NETCONF/RESTCONF server, as the connection initiator, SHOULD actively test the aliveness of the connection using a keep-alive mechanism. For TLS-based connections, the NETCONF/RESTCONF server SHOULD send HeartbeatRequest messages, as defined by [RFC6520]. For SSH-based connections, per Section 4 of [RFC4254], the server SHOULD send an SSH_MSG_GLOBAL_REQUEST message with a purposely nonexistent "request name" value (e.g., [email protected]) and the "want reply" value set to '1'.

I think it's netopeer2 problem (libSSH), not ODL. Сorrect me if I'm wrong.

@michalvasko
Copy link
Member

michalvasko commented Jan 31, 2020

Hi,
well, I suppose, but like I tried to explain there is no way of implementing this keep-alive in libnetconf2 because libssh and OpenSSL do not provide API for this functionality.

However, note that the specification is talking about what a NETCONF server as the connection initiator should do, there is nothing about what the client should do. Obviously, it can expect to receive these messages because the server should send them, but that is still technically a proprietary functionality and I would expect it can at least be configured/turned off.

Regards,
Michal

@ihrasko
Copy link

ihrasko commented Aug 25, 2022

FYI: https://jira.opendaylight.org/browse/NETCONF-832 and https://jira.opendaylight.org/browse/NETCONF-681 have been fixed.

IMO: they could be related to this problem.

The first one has fixed call-home devices heart-beats so they are not disconnected after 10 minutes any more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
information Includes some important information, perhaps in addition to a bug report or question.
Projects
None yet
Development

No branches or pull requests

4 participants