-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[chassis] Fix issues regarding database service failure handling and mid-plane connectivity for namespace. #10500
Conversation
df46ed418e661a9bccdb2639d8873def356f8ba0 (HEAD -> master, origin/master, origin/HEAD) Fix the LLDP_LOC_CHASSIS not getting populated if no remote neighbors are present (sonic-net#39) e487532e11cc0e97cfce573b6b997fdd0beeb660 [CI] Set up CI&PR with Azure Pipelines (sonic-net#38) 3c9f488490a1dbded20dbf2d8a88a5ab4dbda8df Replace swsssdk's SonicV2Connector with swsscommon's implementation (sonic-net#35) Signed-off-by: Abhishek Dosi <[email protected]>
78f167e4728f939712b3f3ea550949e2ea675fec With the changes in PR:sonic-net#5289 access to redis unix socket is given to the redis group members. Many of sonic-util commands (especially in multi-asic) case use redis unix socket to connect to DB and thus those comamnd fails without providing sudo. This PR is continuation of PR: sonic-net#7002 where we default to use TCP for Redis if user is not root Signed-off-by: Abhishek Dosi <[email protected]>
mid-plane connectivity for namespace. Signed-off-by: Abhishek Dosi <[email protected]>
cc @anamehra |
@ysmanman @minionatwork can you please review this too, |
1 similar comment
@ysmanman @minionatwork can you please review this too, |
@@ -152,7 +175,8 @@ function postStartAction() | |||
# then we catch python exception of file not valid | |||
# that comes to syslog which is unwanted so wait till database | |||
# config is ready and then ping | |||
until [[ ($(docker exec -i database$DEV pgrep -x -c supervisord) -gt 0) && ($($SONIC_DB_CLI PING | grep -c PONG) -gt 0) ]]; do | |||
until [[ ($(docker exec -i database$DEV pgrep -x -c supervisord) -gt 0) && ($($SONIC_DB_CLI PING | grep -c PONG) -gt 0) && | |||
($(docker exec -i database$DEV sonic-db-cli PING | grep -c PONG) -gt 0) ]]; do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't line 178 also doing SONIC_DB_CLI PING and line 179 also doing sonic-db-cli PING using docker exec? Whats the difference here?.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we are in namespace context SONIC_DB_CLI maps as sonic-db-cli -n asicx
|
||
|
||
if [[ -n "$midplane_subnet" ]]; then | ||
# Use /16 for loopback interface |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont see code to cleanup all these linux networking state when database docker restart?.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ip netns delete
should remove all the resources.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
add @nikitha6179 for viz |
@minionatwork @judyjoseph this pr enforces not to have Updated Nokia chassisdb.conf file with same. |
@ysmanman please review also. |
your code is still using midplane_subnet bash variable. Where is it set now?. |
it should be present in line-card |
/azp run Azure.sonic-buildimage |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run Azure.sonic-buildimage |
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change looks good to me.
…anch Related work items: #52, #71, #73, #75, #77, sonic-net#1306, sonic-net#1588, sonic-net#1991, sonic-net#2031, sonic-net#2040, sonic-net#2053, sonic-net#2066, sonic-net#2069, sonic-net#2087, sonic-net#2107, sonic-net#2110, sonic-net#2112, sonic-net#2113, sonic-net#2117, sonic-net#2124, sonic-net#2125, sonic-net#2126, sonic-net#2128, sonic-net#2130, sonic-net#2131, sonic-net#2132, sonic-net#2133, sonic-net#2134, sonic-net#2135, sonic-net#2136, sonic-net#2137, sonic-net#2138, sonic-net#2139, sonic-net#2140, sonic-net#2143, sonic-net#2158, sonic-net#2161, sonic-net#2233, sonic-net#2243, sonic-net#2250, sonic-net#2254, sonic-net#2260, sonic-net#2261, sonic-net#2267, sonic-net#2278, sonic-net#2282, sonic-net#2285, sonic-net#2288, sonic-net#2289, sonic-net#2292, sonic-net#2294, sonic-net#8887, sonic-net#9279, sonic-net#9390, sonic-net#9511, sonic-net#9700, sonic-net#10025, sonic-net#10322, sonic-net#10479, sonic-net#10484, sonic-net#10493, sonic-net#10500, sonic-net#10580, sonic-net#10595, sonic-net#10628, sonic-net#10634, sonic-net#10635, sonic-net#10644, sonic-net#10670, sonic-net#10691, sonic-net#10716, sonic-net#10731, sonic-net#10750, sonic-net#10751, sonic-net#10752, sonic-net#10761, sonic-net#10769, sonic-net#10775, sonic-net#10776, sonic-net#10779, sonic-net#10786, sonic-net#10792, sonic-net#10793, sonic-net#10800, sonic-net#10806, sonic-net#10826, sonic-net#10839, sonic-net#10840, sonic-net#10842, sonic-net#10844, sonic-net#10847, sonic-net#10849, sonic-net#10852, sonic-net#10865, sonic-net#10872, sonic-net#10877, sonic-net#10886, sonic-net#10889, sonic-net#10903, sonic-net#10904, sonic-net#10905, sonic-net#10913, sonic-net#10914, sonic-net#10916, sonic-net#10919, sonic-net#10925, sonic-net#10926, sonic-net#10929, sonic-net#10933, sonic-net#10934, sonic-net#10937, sonic-net#10941, sonic-net#10947, sonic-net#10952, sonic-net#10953, sonic-net#10957, sonic-net#10959, sonic-net#10971, sonic-net#10972, sonic-net#10980
@@ -180,7 +203,8 @@ function postStartAction() | |||
# then we catch python exception of file not valid | |||
# that comes to syslog which is unwanted so wait till database | |||
# config is ready and then ping | |||
until [[ ($(docker exec -i database$DEV pgrep -x -c supervisord) -gt 0) && ($($SONIC_DB_CLI PING | grep -c PONG) -gt 0) ]]; do | |||
until [[ ($(docker exec -i database$DEV pgrep -x -c supervisord) -gt 0) && ($($SONIC_DB_CLI PING | grep -c PONG) -gt 0) && | |||
($(docker exec -i database$DEV sonic-db-cli PING | grep -c PONG) -gt 0) ]]; do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't you have an issue in the supervisor's database containers here, since midplane_subnet
is not defined the midplane interface doesn't exist in the container, so trying to PING the databases from within the container will fail?
What/Why I did:
Issue1: By setting up of ipvlan interface in interface-config.sh we are not tolerant to failures. Reason being interface-config.service is one-shot and do not have restart capability.
Scenario: For example if let's say database service goes in fail state then interface-services also gets failed because of dependency check but later database service gets restart but interface service will remain in stuck state and the ipvlan interface nevers get created.
Solution: Moved all the logic in database service from interface-config service which looks more align logically also since the namespace is created here and all the network setting (sysctl) are happening here.With this if database starts we recreate the interface.
Issue 2: Use of IPVLAN vs MACVLAN
Currently we are using ipvlan mode. However above failure scenario is not handle correctly by ipvlan mode. Once the ipvlan interface is created and ip address assign to it and if we restart interface-config or database (new PR) service Linux Kernel gives error "Error: Address already assigned to an ipvlan device." based on this:https://github.com/torvalds/linux/blob/master/drivers/net/ipvlan/ipvlan_main.c#L978Reason being if we do not do cleanup of ip address assignment (need to be unique for IPVLAN) it remains in Kernel Database and never goes to free pool even though namespace is deleted.
Solution: Considering this hard dependency of unique ip macvlan mode is better for us and since everything is managed by Linux Kernel and no dependency for on user configured IP address.
Issue3: Namespace database Service do not check reachability to Supervisor Redis Chassis Server.
Currently there is no explicit check as we never do Redis PING from namespace to Supervisor Redis Chassis Server. With this check it's possible we will start database and all other docker even though there is no connectivity and will hit the error/failure late in cycle
Solution: Added explicit PING from namespace that will check this reachability.
Issue 4:flushdb give exception when trying to accces Chassis Server DB over Unix Sokcet.
Solution: Handle gracefully via try..except and log the message.
Verification:
Did manual verification by killing database service and see post restart all redis connectivity and database service are healthy.