[chassis] Fix issues regarding database service failure handling and mid-plane connectivity for namespace. (#10500)

What/Why I did:

Issue 1: Setting up the ipvlan interface in interfaces-config.sh is not tolerant of failures, because interfaces-config.service is a one-shot service with no restart capability.

Scenario: If the database service enters a failed state, the interfaces-config service also fails because of its dependency check. When the database service is later restarted, the interfaces-config service remains stuck and the ipvlan interface never gets created.

Solution: Moved all of this logic from the interfaces-config service into the database service. This is also a more logical home for it, since the namespace is created there and all the network settings (sysctl) are applied there. With this change, the interface is recreated whenever the database service starts.
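
For reference, the interface that the database service recreates gets an address derived from `midplane_subnet`, the line-card slot, and the namespace index. The following is a minimal Python sketch of that arithmetic only; the change itself does it in shell with awk and parameter expansion. The function name and the sample slot number are hypothetical; the subnet value is the one that appears in chassisdb.conf in this diff.

```python
def namespace_midplane_address(midplane_subnet: str, slot_id: int, asic_id: int) -> str:
    """Build '<octet1>.<octet2>.<slot>.<asic + 10>/<prefix>' for a namespace's eth1.

    Mirrors the shell logic: keep the first two octets of the subnet,
    append the slot id and (asic id + 10), and reuse the subnet's prefix length.
    """
    network, prefix_len = midplane_subnet.split("/")
    first, second = network.split(".")[:2]
    return f"{first}.{second}.{slot_id}.{asic_id + 10}/{prefix_len}"


if __name__ == "__main__":
    # Slot 3 is an assumed value for illustration.
    print(namespace_midplane_address("10.6.0.0/16", 3, 0))  # 10.6.3.10/16
```

Because the address depends only on the slot and namespace index, recreating the interface on every database start yields the same address each time.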

Issue 2: Use of IPVLAN vs MACVLAN

Currently we use ipvlan mode. However, ipvlan mode does not handle the failure scenario above correctly. Once the ipvlan interface has been created and an IP address assigned to it, restarting the interfaces-config service or the database service (with this PR) makes the Linux kernel return "Error: Address already assigned to an ipvlan device.", based on this: https://github.com/torvalds/linux/blob/master/drivers/net/ipvlan/ipvlan_main.c#L978

The reason is that unless the IP address assignment (which must be unique for ipvlan) is cleaned up, it remains in the kernel database and never returns to the free pool, even if the namespace is deleted.

Solution: Given this hard dependency on a unique IP address, macvlan mode is a better fit for us: everything is managed by the Linux kernel, with no dependency on the user-configured IP address.

Issue 3: The namespace database service does not check reachability to the Supervisor Redis Chassis Server.

Currently there is no explicit check, as we never send a Redis PING from the namespace to the Supervisor Redis Chassis Server. Without this check, we may start the database and all other dockers even though there is no connectivity, and only hit the error/failure late in the cycle.

Solution: Added an explicit PING from the namespace to check this reachability.
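
The wait-until-reachable pattern behind this check can be sketched in Python as below. This is an illustration under stated assumptions, not the code in this change: the actual change extends a shell `until` loop that polls `sonic-db-cli PING` inside the database container and sleeps one second between tries, with no upper bound. The names `wait_until_reachable`, `probe`, and `max_attempts` are hypothetical, and the attempt limit is an addition of this sketch.

```python
import time
from typing import Callable


def wait_until_reachable(probe: Callable[[], bool],
                         interval: float = 1.0,
                         max_attempts: int = 30,
                         sleep: Callable[[float], None] = time.sleep) -> int:
    """Call probe() until it returns True and return the number of attempts used.

    Raises TimeoutError after max_attempts failed probes. The bound is an
    assumption of this sketch; the shell loop in the commit retries forever.
    """
    for attempt in range(1, max_attempts + 1):
        if probe():
            return attempt
        sleep(interval)  # back off before the next probe
    raise TimeoutError(f"not reachable after {max_attempts} attempts")


if __name__ == "__main__":
    # Fake probe that succeeds on the third call, standing in for a PING/PONG check.
    replies = iter([False, False, True])
    attempts = wait_until_reachable(lambda: next(replies), interval=0.0)
    print(attempts)  # 3
```

Failing early at this point surfaces a broken mid-plane link before the other containers start, rather than late in the cycle.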

Issue 4: flushdb raises an exception when trying to access the Chassis Server DB over a Unix socket.

Solution: Handle it gracefully via try..except and log the message.
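
A minimal, library-independent sketch of this log-and-continue handling follows. The helper `flush_quietly` and its parameters are hypothetical. Python's built-in `ConnectionError` is used only as a stand-in so the example runs without redis installed; the change in this commit catches `redis.exceptions.ConnectionError` and logs through `syslog`.

```python
from typing import Callable, Tuple, Type


def flush_quietly(flush: Callable[[], object],
                  target: str,
                  log: Callable[[str], None] = print,
                  ignored: Tuple[Type[Exception], ...] = (ConnectionError,)) -> bool:
    """Run flush(); if it raises an ignored error, log it and return False.

    Any exception type not listed in `ignored` still propagates to the caller.
    """
    try:
        flush()
    except ignored as exc:
        log(f"flushdb: connection error for {target}: {exc}")
        return False
    return True


if __name__ == "__main__":
    def unreachable() -> None:
        # Simulates a database whose Unix socket cannot be reached.
        raise ConnectionError("socket not reachable")

    print(flush_quietly(unreachable, "chassis db"))   # logs the error, then prints False
    print(flush_quietly(lambda: None, "local db"))    # prints True
```

Catching only the connection error keeps other failures visible while letting the loop move on to the next database.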
abdosi authored May 24, 2022
1 parent a477dbb commit 0285bfe
Showing 4 changed files with 40 additions and 53 deletions.
1 change: 0 additions & 1 deletion device/nokia/x86_64-nokia_ixr7250e_sup-r0/chassisdb.conf
@@ -2,4 +2,3 @@ start_chassis_db=1
 chassis_db_address=10.6.0.100
 lag_id_start=1
 lag_id_end=512
-midplane_subnet=10.6.0.0/16
8 changes: 6 additions & 2 deletions dockers/docker-database/flush_unused_database
@@ -3,6 +3,7 @@ import swsssdk
 import redis
 import subprocess
 import time
+import syslog

 while(True):
     output = subprocess.Popen(['sonic-db-cli', 'PING'], stdout=subprocess.PIPE, text=True).communicate()[0]
@@ -24,5 +25,8 @@ for instname, v in instlists.items():
         if dbinst == instname:
             continue

-        r = redis.Redis(host=insthost, unix_socket_path=instsocket, db=dbid)
-        r.flushdb()
+        try:
+            r = redis.Redis(host=insthost, unix_socket_path=instsocket, db=dbid)
+            r.flushdb()
+        except (redis.exceptions.ConnectionError):
+            syslog.syslog(syslog.LOG_INFO,"flushdb:Redis Unix Socket connection error for path {} and dbaname {}".format(instsocket, dbname))
42 changes: 34 additions & 8 deletions files/build_templates/docker_image_ctl.j2
@@ -118,12 +118,8 @@ function preStartAction()

 function setPlatformLagIdBoundaries()
 {
-    CHASSIS_CONF=/usr/share/sonic/device/$PLATFORM/chassisdb.conf
-    if [ -f "$CHASSIS_CONF" ]; then
-        source $CHASSIS_CONF
-        docker exec -i ${DOCKERNAME} $SONIC_DB_CLI CHASSIS_APP_DB SET "SYSTEM_LAG_ID_START" "$lag_id_start"
-        docker exec -i ${DOCKERNAME} $SONIC_DB_CLI CHASSIS_APP_DB SET "SYSTEM_LAG_ID_END" "$lag_id_end"
-    fi
+    docker exec -i ${DOCKERNAME} $SONIC_DB_CLI CHASSIS_APP_DB SET "SYSTEM_LAG_ID_START" "$lag_id_start"
+    docker exec -i ${DOCKERNAME} $SONIC_DB_CLI CHASSIS_APP_DB SET "SYSTEM_LAG_ID_END" "$lag_id_end"
 }
 function waitForAllInstanceDatabaseConfigJsonFilesReady()
 {
@@ -158,13 +154,40 @@ sleep 1
 function postStartAction()
 {
 {%- if docker_container_name == "database" %}
+    CHASSISDB_CONF="/usr/share/sonic/device/$PLATFORM/chassisdb.conf"
+    [ -f $CHASSISDB_CONF ] && source $CHASSISDB_CONF
     if [ "$DEV" ]; then
         # Enable the forwarding on eth0 interface in namespace.
         SYSCTL_NET_CONFIG="/etc/sysctl.d/sysctl-net.conf"
         docker exec -i database$DEV sed -i -e "s/^net.ipv4.conf.eth0.forwarding=0/net.ipv4.conf.eth0.forwarding=1/;
                                                s/^net.ipv6.conf.eth0.forwarding=0/net.ipv6.conf.eth0.forwarding=1/" $SYSCTL_NET_CONFIG
         docker exec -i database$DEV sysctl --system -e
         link_namespace $DEV
+
+
+        if [[ -n "$midplane_subnet" ]]; then
+            # Use /16 for loopback interface
+            ip netns exec "$NET_NS" ip addr add 127.0.0.1/16 dev lo
+            ip netns exec "$NET_NS" ip addr del 127.0.0.1/8 dev lo
+
+            # Create eth1 in database instance
+            ip link add name ns-eth1"$NET_NS" link eth1-midplane type macvlan mode bridge
+            ip link set dev ns-eth1"$NET_NS" netns "$NET_NS"
+            ip netns exec "$NET_NS" ip link set ns-eth1"$NET_NS" name eth1
+
+            # Configure IP address and enable eth1
+            lc_slot_id=$(python3 -c 'import sonic_platform.platform; platform_chassis = sonic_platform.platform.Platform().get_chassis(); print(platform_chassis.get_my_slot())' 2>/dev/null)
+            lc_ip_address=`echo $midplane_subnet | awk -F. '{print $1 "." $2}'`.$lc_slot_id.$(($DEV + 10))
+            lc_subnet_mask=${midplane_subnet#*/}
+            ip netns exec "$NET_NS" ip addr add $lc_ip_address/$lc_subnet_mask dev eth1
+            ip netns exec "$NET_NS" ip link set dev eth1 up
+
+            # Allow localnet routing on the new interfaces if midplane is using a
+            # subnet in the 127/8 range.
+            if [[ "${midplane_subnet#127}" != "$midplane_subnet" ]]; then
+                ip netns exec "$NET_NS" bash -c "echo 1 > /proc/sys/net/ipv4/conf/eth1/route_localnet"
+            fi
+        fi
     fi
     # Setup ebtables configuration
     ebtables_config
@@ -180,7 +203,8 @@ function postStartAction()
         # then we catch python exception of file not valid
         # that comes to syslog which is unwanted so wait till database
         # config is ready and then ping
-        until [[ ($(docker exec -i database$DEV pgrep -x -c supervisord) -gt 0) && ($($SONIC_DB_CLI PING | grep -c PONG) -gt 0) ]]; do
+        until [[ ($(docker exec -i database$DEV pgrep -x -c supervisord) -gt 0) && ($($SONIC_DB_CLI PING | grep -c PONG) -gt 0) &&
+                 ($(docker exec -i database$DEV sonic-db-cli PING | grep -c PONG) -gt 0) ]]; do
             sleep 1;
         done
         if [[ ("$BOOT_TYPE" == "warm" || "$BOOT_TYPE" == "fastfast") && -f $WARM_DIR/dump.rdb ]]; then
@@ -222,7 +246,9 @@ function postStartAction()
                  ($(docker exec -i ${DOCKERNAME} $SONIC_DB_CLI CHASSIS_APP_DB PING | grep -c True) -gt 0) ]]; do
             sleep 1
         done
-        setPlatformLagIdBoundaries
+        if [[ -n "$lag_id_start" && -n "$lag_id_end" ]]; then
+            setPlatformLagIdBoundaries
+        fi
         REDIS_SOCK="/var/run/redis-chassis/redis_chassis.sock"
     fi
     chgrp -f redis $REDIS_SOCK && chmod -f 0760 $REDIS_SOCK
42 changes: 0 additions & 42 deletions files/image_config/interfaces/interfaces-config.sh
@@ -60,48 +60,6 @@ for intf_pid in $(ls -1 /var/run/dhclient*.Ethernet*.pid 2> /dev/null); do
     [[ -f ${intf_pid} ]] && kill `cat ${intf_pid}` && rm -f ${intf_pid}
 done

-
-# Setup eth1 if we connect to a remote chassis DB.
-PLATFORM=${PLATFORM:-`sonic-cfggen -H -v DEVICE_METADATA.localhost.platform`}
-CHASSISDB_CONF="/usr/share/sonic/device/$PLATFORM/chassisdb.conf"
-[[ -f $CHASSISDB_CONF ]] && source $CHASSISDB_CONF
-
-ASIC_CONF="/usr/share/sonic/device/$PLATFORM/asic.conf"
-[[ -f $ASIC_CONF ]] && source $ASIC_CONF
-
-if [[ -n "$midplane_subnet" && ($NUM_ASIC -gt 1) ]]; then
-    for asic_id in `seq 0 $((NUM_ASIC - 1))`; do
-        NET_NS="asic$asic_id"
-
-        PIDS=`ip netns pids "$NET_NS" 2>/dev/null`
-        if [[ "$?" -ne "0" ]]; then # namespace doesn't exist
-            continue
-        fi
-
-        # Use /16 for loopback interface
-        ip netns exec $NET_NS ip addr add 127.0.0.1/16 dev lo
-        ip netns exec $NET_NS ip addr del 127.0.0.1/8 dev lo
-
-        # Create eth1 in database instance
-        ip link add name ns-eth1 link eth1-midplane type ipvlan mode l2
-        ip link set dev ns-eth1 netns $NET_NS
-        ip netns exec $NET_NS ip link set ns-eth1 name eth1
-
-        # Configure IP address and enable eth1
-        lc_slot_id=$(python3 -c 'import sonic_platform.platform; platform_chassis = sonic_platform.platform.Platform().get_chassis(); print(platform_chassis.get_my_slot())' 2>/dev/null)
-        lc_ip_address=`echo $midplane_subnet | awk -F. '{print $1 "." $2}'`.$lc_slot_id.$((asic_id + 10))
-        lc_subnet_mask=${midplane_subnet#*/}
-        ip netns exec $NET_NS ip addr add $lc_ip_address/$lc_subnet_mask dev eth1
-        ip netns exec $NET_NS ip link set dev eth1 up
-
-        # Allow localnet routing on the new interfaces if midplane is using a
-        # subnet in the 127/8 range.
-        if [[ "${midplane_subnet#127}" != "$midplane_subnet" ]]; then
-            ip netns exec $NET_NS bash -c "echo 1 > /proc/sys/net/ipv4/conf/eth1/route_localnet"
-        fi
-    done
-fi
-
 # Read sysctl conf files again
 sysctl -p /etc/sysctl.d/90-dhcp6-systcl.conf

