Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[20191130.82] intfmgrd failed to assign IPv4 address to PortChannel #10336

Open
yozhao101 opened this issue Mar 23, 2022 · 8 comments
Open

[20191130.82] intfmgrd failed to assign IPv4 address to PortChannel #10336

yozhao101 opened this issue Mar 23, 2022 · 8 comments

Comments

@yozhao101
Copy link
Contributor

yozhao101 commented Mar 23, 2022

Description

Recently we observed the testing of teamd auto-restart feature consistently failed on str-s6000-on-2 in our nightly test. The failure is due to the BGP sessions with 5 T2 upstream VMs were not up.

When we did RCA for this BGP session issue, BGP sessions were not up is due to there is no IPv4 address assigned to these 5 PortChannels. We found the root cause is there is a race condition between the processes intfmgrd from SwSS and teammgrd from teamd. Details are as following:

intfmgrd will assign an IPv4 address to each PortChannel while teammgrd will bring up each PortChannel and add interfaces into each PortChannel.

If teammgrd finishes its job first, then intfmgrd can successfully assign IPv4 address to each PortChannel. However, if intfmgrd did its job first before teammgrd brings PortChannel up and adds interfaces into it, then there will be no IPv4 address assigned to each PortChannel.

We attach the syslog file from device for reference and Bing also shared the error messages from device.

Steps to reproduce the issue:

  1. Checks whether the autorestart feature was enabled for teamd on device by running command show feature status.
    If it was not enabled, we can issue the command sudo config feature autorestart teamd enabled to enable it.
  2. Terminates the process teamsyncd or teammgrd in teamd container by running the sudo kill -SIGKILL <pid> command.
  3. Checks the output of command show ip bgp summary to see whether BGP sessions with T2 devices are up.
  4. Checks the output of command ifconfig PortChannel102 to see whether IPv4 address was assigned to that PortChannel.

Describe the results you received:

    root@str-s6000-on-2:~# ifconfig PortChannel102
    PortChannel102: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9100
    inet6 fe80::92b1:1cff:fef4:9d40 prefixlen 64 scopeid 0x20<link>
    inet6 fe80::92b1:1cff:fef4:9d40 prefixlen 64 scopeid 0x20<link>
    ether 90:b1:1c:f4:9d:40 txqueuelen 1000 (Ethernet)
    RX packets 198 bytes 13592 (13.2 KiB)
    RX errors 0 dropped 73 overruns 0 frame 0
    TX packets 4 bytes 520 (520.0 B)
    TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

Describe the results you expected:

    root@str-s6000-on-2:~# ifconfig PortChannel102
    PortChannel102: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9100
    inet 10.0.0.0 netmask 255.255.255.254 broadcast 0.0.0.0
    inet6 fe80::92b1:1cff:fef4:9d40 prefixlen 64 scopeid 0x20<link>
    inet6 fe80::92b1:1cff:fef4:9d40 prefixlen 64 scopeid 0x20<link>
    ether 90:b1:1c:f4:9d:40 txqueuelen 1000 (Ethernet)
    RX packets 198 bytes 13592 (13.2 KiB)
    RX errors 0 dropped 73 overruns 0 frame 0
    TX packets 4 bytes 520 (520.0 B)
    TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

Output of show version:

        admin@str-s6000-on-2:~$ show version
        
        SONiC Software Version: SONiC.20191130.82
        Distribution: Debian 9.13
        Kernel: 4.9.0-14-2-amd64
        Build commit: bc95bcd3f2
        Build date: Fri Mar 4 05:08:09 UTC 2022
        Built by: cloudtest@a7a9ad96c000001
        
        Platform: x86_64-dell_s6000_s1220-r0
        HwSKU: Force10-S6000
        ASIC: broadcom
        Serial Number: NA
        Uptime: 18:46:05 up 16:39, 1 user, load average: 20.60, 15.91, 14.94
        
        Docker images:
        REPOSITORY TAG IMAGE ID SIZE
        docker-fpm-frr 20191130.82 c13e64bb9ba9 326MB
        docker-fpm-frr latest c13e64bb9ba9 326MB
        docker-orchagent 20191130.82 58ad294c2e42 335MB
        docker-orchagent latest 58ad294c2e42 335MB
        docker-teamd 20191130.82 a46001dd7f0e 316MB
        docker-teamd latest a46001dd7f0e 316MB
        docker-syncd-brcm 20191130.82 089b07e3dab2 438MB
        docker-syncd-brcm latest 089b07e3dab2 438MB
        docker-acms 20191130.82 1bcf52c56143 184MB
        docker-acms latest 1bcf52c56143 184MB
        docker-snmp-sv2 20191130.82 4f0691a29d60 353MB
        docker-snmp-sv2 latest 4f0691a29d60 353MB
        docker-platform-monitor 20191130.82 5c9eb22cd5c2 359MB
        docker-platform-monitor latest 5c9eb22cd5c2 359MB
        docker-lldp-sv2 20191130.82 8a2f6b793b75 313MB
        docker-lldp-sv2 latest 8a2f6b793b75 313MB
        docker-sonic-telemetry 20191130.82 d7251e568bcd 354MB
        docker-sonic-telemetry latest d7251e568bcd 354MB
        docker-router-advertiser 20191130.82 4a54e3fb2884 290MB
        docker-router-advertiser latest 4a54e3fb2884 290MB
        docker-dhcp-relay 20191130.82 ea8ffd93f757 300MB
        docker-dhcp-relay latest ea8ffd93f757 300MB
        docker-database 20191130.82 59984da19b6c 290MB
        docker-database latest 59984da19b6c 290MB
        k8s.gcr.io/pause 3.2 80d28bedfe5d 683kB

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

syslog.txt

@bingwang-ms
Copy link
Contributor

bingwang-ms commented Mar 24, 2022

Error logs

# grep setIntfIp /var/log/syslog
Mar 23 06:48:27.867907 str-s6000-on-2 ERR swss#intfmgrd: :- setIntfIp: Command '/sbin/ip address "add" "10.0.0.24/31" dev "PortChannel1020"' failed with rc 2
Mar 23 06:48:28.080929 str-s6000-on-2 ERR swss#intfmgrd: :- setIntfIp: Command '/sbin/ip -6 address "add" "fc00::31/126" broadcast "fc00::33" dev "PortChannel1020"' failed with rc 2
Mar 23 06:48:28.172293 str-s6000-on-2 ERR swss#intfmgrd: :- setIntfIp: Command '/sbin/ip address "add" "10.0.0.28/31" dev "PortChannel1023"' failed with rc 2
Mar 23 06:48:28.333535 str-s6000-on-2 ERR swss#intfmgrd: :- setIntfIp: Command '/sbin/ip -6 address "add" "fc00::39/126" broadcast "fc00::3b" dev "PortChannel1023"' failed with rc 2
Mar 23 06:48:28.458087 str-s6000-on-2 ERR swss#intfmgrd: :- setIntfIp: Command '/sbin/ip address "add" "10.0.0.0/31" dev "PortChannel102"' failed with rc 2
Mar 23 06:48:29.053525 str-s6000-on-2 ERR swss#intfmgrd: :- setIntfIp: Command '/sbin/ip -6 address "add" "fc00::1/126" broadcast "fc00::3" dev "PortChannel102"' failed with rc 2
Mar 23 06:48:31.154541 str-s6000-on-2 ERR swss#intfmgrd: :- setIntfIp: Command '/sbin/ip address "add" "10.0.0.4/31" dev "PortChannel105"' failed with rc 2
Mar 23 06:48:31.296070 str-s6000-on-2 ERR swss#intfmgrd: :- setIntfIp: Command '/sbin/ip -6 address "add" "fc00::9/126" broadcast "fc00::b" dev "PortChannel105"' failed with rc 2
Mar 23 06:48:31.478231 str-s6000-on-2 ERR swss#intfmgrd: :- setIntfIp: Command '/sbin/ip address "add" "10.0.0.8/31" dev "PortChannel108"' failed with rc 2
Mar 23 06:48:31.620110 str-s6000-on-2 ERR swss#intfmgrd: :- setIntfIp: Command '/sbin/ip -6 address "add" "fc00::11/126" broadcast "fc00::13" dev "PortChannel108"' failed with rc 2

@yozhao101
Copy link
Contributor Author

yozhao101 commented Mar 28, 2022

Update: We also observed the auto-restart testing failure of teamd on device str-e1031-acs-3.

    admin@str-e1031-acs-3:~$ show version
    
    SONiC Software Version: SONiC.20201231.57
    Distribution: Debian 10.11
    Kernel: 4.19.0-12-2-amd64
    Build commit: c449752173
    Build date: Sat Mar 12 09:26:29 UTC 2022
    Built by: cloudtest@45a5a407c000000
    
    Platform: x86_64-cel_e1031-r0
    HwSKU: Celestica-E1031-T48S4
    ASIC: broadcom
    ASIC Count: 1
    Serial Number: R0882F2B039723BY000014
    Uptime: 21:25:55 up 1:06, 1 user, load average: 3.75, 4.40, 4.47

the failure is due to the race condition between processes intfmgrd and teammgrd. Following are log messages from device:

    7012 Mar 28 06:08:39.295430 str-e1031-acs-3 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:PortChannel102 admin:1 oper:0 addr:00:e0:ec:84:9c:6a ifindex:133 master:0 type:team
    7013 Mar 28 06:08:39.295430 str-e1031-acs-3 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:PortChannel103 admin:0 oper:0 addr:00:e0:ec:84:9c:6a ifindex:70 master:0 type:team
    7014 Mar 28 06:08:39.305753 str-e1031-acs-3 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet50 admin:0 oper:0 addr:00:e0:ec:84:9c:6a ifindex:128 master:70
    7015 Mar 28 06:08:39.305753 str-e1031-acs-3 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet50 admin:0 oper:0 addr:00:e0:ec:84:9c:6a ifindex:128 master:0
    7016 Mar 28 06:08:39.309084 str-e1031-acs-3 INFO kernel: [15348.251900] PortChannel103 (unregistering): Port device Ethernet50 removed
    7017 Mar 28 06:08:39.325924 str-e1031-acs-3 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet50 admin:0 oper:0 addr:00:e0:ec:84:9c:6a ifindex:128 master:0
    7018 Mar 28 06:08:39.325924 str-e1031-acs-3 NOTICE swss#portsyncd: :- onMsg: nlmsg type:17 key:PortChannel103 admin:0 oper:0 addr:00:e0:ec:84:9c:6a ifindex:70 master:0 type:team
    7019 Mar 28 06:08:39.332406 str-e1031-acs-3 DEBUG bgp#bgpcfgd: Received message : '('PortChannel102|10.0.0.58/31', 'SET', (('state', 'ok'),))'
    7020 Mar 28 06:08:39.347869 str-e1031-acs-3 ERR swss#intfmgrd: :- setIntfIp: Command '/sbin/ip address "add" "10.0.0.60/31" dev "PortChannel103"' failed with rc 2
    7021 Mar 28 06:08:39.367808 str-e1031-acs-3 INFO systemd-udevd[219174]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
    7022 Mar 28 06:08:39.381589 str-e1031-acs-3 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:PortChannel103 admin:0 oper:0 addr:be:bc:57:27:23:9d ifindex:135 master:0 type:team
    7023 Mar 28 06:08:39.386025 str-e1031-acs-3 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:PortChannel103 admin:0 oper:0 addr:be:bc:57:27:23:9d ifindex:135 master:0 type:team
    7024 Mar 28 06:08:39.386466 str-e1031-acs-3 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:PortChannel103 admin:0 oper:0 addr:00:e0:ec:84:9c:6a ifindex:135 master:0 type:team
    7025 Mar 28 06:08:39.402756 str-e1031-acs-3 INFO kernel: [15348.344417] PortChannel103: Mode changed to "loadbalance"
    7026 Mar 28 06:08:39.428449 str-e1031-acs-3 INFO swss#/supervisord: intfmgrd RTNETLINK answers: No such device
    7027 Mar 28 06:08:39.527631 str-e1031-acs-3 NOTICE teamd#teammgrd: :- addLag: Start port channel PortChannel103 with teamd

@yozhao101
Copy link
Contributor Author

Update: See the same race condition issue on device str2-7215-acs-1:

After teamd, swss and syncd were restarted, BGP sessions with T2 devices were not up:

    admin@str2-7215-acs-1:~$ show ip bgp summary
    
    IPv4 Unicast Summary:
    BGP router identifier 10.1.0.32, local AS number 65100 vrf-id 0
    BGP table version 6729
    RIB entries 1923, using 246144 bytes of memory
    Peers 4, using 72256 KiB of memory
    Peer groups 4, using 128 bytes of memory
    
    Neighbhor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd NeighborName
    ----------- --- ----- --------- --------- -------- ----- ------ --------- -------------- --------------
    10.0.0.57 4 64600 1162 1169 0 0 0 00:24:07 960 ARISTA01T1
    10.0.0.59 4 64600 1162 1651 0 0 0 00:24:08 960 ARISTA02T1
    10.0.0.61 4 64600 653 1622 0 0 0 00:24:39 Active ARISTA03T1
    10.0.0.63 4 64600 653 1875 0 0 0 00:24:38 Active ARISTA04T1
    Total number of neighbors 4
    
    admin@str2-7215-acs-1:~$ show interfaces portchannel
    Flags: A - active, I - inactive, Up - up, Dw - Down, N/A - not available,
    S - selected, D - deselected, * - not synced
    No. Team Dev Protocol Ports
    ----- -------------- ----------- -------------
    101 PortChannel101 LACP(A)(Up) Ethernet48(S)
    102 PortChannel102 LACP(A)(Up) Ethernet49(S)
    103 PortChannel103 LACP(A)(Up) Ethernet50(S)
    104 PortChannel104 LACP(A)(Up) Ethernet51(S)
    
    admin@str2-7215-acs-1:~$ sudo ifconfig PortChannel104
    PortChannel104: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9100
    inet6 fe80::52e0:efff:fe7a:f1d1 prefixlen 64 scopeid 0x20<link>
    inet6 fc00::7d prefixlen 126 scopeid 0x0<global>
    ether 50:e0:ef:7a:f1:d1 txqueuelen 1000 (Ethernet)
    RX packets 319 bytes 73005 (71.2 KiB)
    RX errors 0 dropped 53 overruns 0 frame 0
    TX packets 132 bytes 113745 (111.0 KiB)
    TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

Syslog messages are as following:

    2462 Mar 28 21:57:10.191592 str2-7215-acs-1 ERR swss#intfmgrd: :- setIntfIp: Command '/sbin/ip address "add" "10.0.0.62/31" dev "PortChannel104"' failed with rc 2
    2463 Mar 28 21:57:10.208147 str2-7215-acs-1 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:PortChannel103 admin:1 oper:0 addr:50:e0:ef:7a:f1:d1 ifindex:70 master:0 type:team
    2464 Mar 28 21:57:10.219028 str2-7215-acs-1 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:PortChannel104 admin:0 oper:0 addr:50:e0:ef:7a:f1:d1 ifindex:7 master:0 type:team
    2465 Mar 28 21:57:10.228669 str2-7215-acs-1 DEBUG bgp#bgpcfgd: Received message : '('PortChannel104|10.0.0.62/31', 'SET', (('state', 'ok'),))'
    2466 Mar 28 21:57:10.231346 str2-7215-acs-1 NOTICE swss#portsyncd: :- onMsg: nlmsg type:17 key:PortChannel104 admin:0 oper:0 addr:50:e0:ef:7a:f1:d1 ifindex:7 master:0 type:team
    2467 Mar 28 21:57:10.268135 str2-7215-acs-1 INFO syncd#/supervisord: syncd Port Init: portNum=24 Speed=2 IF=13 IsPhy=1 PhyAddr=6 smiIf=0 xsmiif=4
    2468 Mar 28 21:57:10.288978 str2-7215-acs-1 INFO swss#/supervisord: intfmgrd RTNETLINK answers: File exists
    2469 Mar 28 21:57:10.289427 str2-7215-acs-1 INFO swss#/supervisord: tunnelmgrd delete tunnel "tun0" failed: No such device
    2470 Mar 28 21:57:10.289791 str2-7215-acs-1 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:PortChannel104 admin:0 oper:0 addr:b6:c3:0d:c9:f4:8e ifindex:71 master:0 type:team
    2471 Mar 28 21:57:10.319042 str2-7215-acs-1 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:PortChannel104 admin:0 oper:0 addr:b6:c3:0d:c9:f4:8e ifindex:71 master:0 type:team
    2472 Mar 28 21:57:10.319498 str2-7215-acs-1 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:PortChannel104 admin:0 oper:0 addr:50:e0:ef:7a:f1:d1 ifindex:71 master:0 type:team
    2473 Mar 28 21:57:10.334480 str2-7215-acs-1 INFO kernel: [10310.155465] PortChannel104: Mode changed to "loadbalance"
    2474 Mar 28 21:57:10.414005 str2-7215-acs-1 NOTICE teamd#teammgrd: :- addLag: Start port channel PortChannel104 with teamd
    2475 Mar 28 21:57:10.452720 str2-7215-acs-1 INFO syncd#/supervisord: syncd Port Init: portNum=23 Speed=2 IF=13 IsPhy=1 PhyAddr=7 smiIf=0 xsmiif=4
    2476 Mar 28 21:57:10.546531 str2-7215-acs-1 INFO kernel: [10310.366995] IPv6: ADDRCONF(NETDEV_UP): PortChannel104: link is not ready
    2477 Mar 28 21:57:10.546553 str2-7215-acs-1 INFO kernel: [10310.367006] 8021q: adding VLAN 0 to HW filter on device PortChannel104
    2478 Mar 28 21:57:10.591152 str2-7215-acs-1 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:PortChannel104 admin:1 oper:0 addr:50:e0:ef:7a:f1:d1 ifindex:71 master:0 type:team
    2479 Mar 28 21:57:10.634905 str2-7215-acs-1 NOTICE teamd#tlm_teamd: :- try_add_lag: The LAG 'PortChannel104' has been added.
    2480 Mar 28 21:57:10.640513 str2-7215-acs-1 NOTICE teamd#teammgrd: :- setLagAdminStatus: Set port channel PortChannel104 admin status to up
    2481 Mar 28 21:57:10.646134 str2-7215-acs-1 NOTICE teamd#teammgrd: :- setLagMtu: Set port channel PortChannel104 MTU to 9100

@judyjoseph
Copy link
Contributor

From logs I see this is a completely different scenario where the portchannels were not cleaned up on exit

-  In this autorestart testcase, SIGKILL is being sent to teammgrd
-  teammgrd/teamsyncd don’t hande the SIGKILL. Only SIGTERM is handled and teamd processes and portchannels are cleaned up.
-  So with SIGKILL -- the teammgrd process exits, but the portchannels and teamd processes are still present. 
-  This will affect the teammgrd/teamsyncd when they are spawned again.

So can we try updating the autorestart testcase to use SIGTERM and see the behavior ?

@zhangyanzhao
Copy link
Collaborator

@yozhao101 can you please try Judy's suggestion? Thanks.

@volodymyrsamotiy
Copy link
Collaborator

This issue still exists and I observe it on all branches (I checked on "master", "202205", "202012").
@yozhao101, @judyjoseph, is there any plan for fixing it?
Thanks!

@vivekrnv
Copy link
Contributor

Also observed on 202211. Can be recoverable through a config reload or reboot.

@dgsudharsan
Copy link
Collaborator

Observed in 202311 as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants