
[202411] Fix dynamic buffer issue for x86_64-nvidia_sn4280-r0 platform #17360

Open · wants to merge 234 commits into base: master

Conversation

JibinBao
Contributor

@JibinBao JibinBao commented Mar 5, 2025

Description of PR

For x86_64-nvidia_sn4280-r0, some ports are NPU-DPU ports that do not carry lossless traffic, so remove those ports from the set of tested ports.

Summary:
Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202012
  • 202205
  • 202305
  • 202311
  • 202405
  • 202411

Approach

What is the motivation for this PR?

Fix the QoS buffer test issue for x86_64-nvidia_sn4280-r0

How did you do it?

Remove NPU DPU ports from the tested ports
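
A minimal sketch of the filtering idea, assuming a hypothetical list of tested ports and a hypothetical list of NPU-DPU port names; the actual PR changes the QoS/buffer test's port selection logic:

def filter_lossless_test_ports(tested_ports, dpu_port_names):
    """Keep only the ports that can carry lossless traffic (illustrative helper)."""
    excluded = set(dpu_port_names)
    return [port for port in tested_ports if port not in excluded]

# Example with made-up port names:
tested = ["Ethernet0", "Ethernet8", "Ethernet248", "Ethernet256"]
dpu_ports = ["Ethernet248", "Ethernet256"]  # NPU-DPU ports, no lossless traffic
print(filter_lossless_test_ports(tested, dpu_ports))  # ['Ethernet0', 'Ethernet8']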

How did you verify/test it?

Run the buffer tests on x86_64-nvidia_sn4280-r0

Any platform specific information?

x86_64-nvidia_sn4280-r0

Supported testbed topology if it's a new test case?

Documentation

wsycqyz and others added 30 commits December 10, 2024 15:30
…ven't been added to yang model (sonic-net#16152)

[202411] Temporary skip subnet decap test with issue since feature haven't been added to yang model
Description of PR
Fix acl/test_stress_acl.py using bad interface name for ACL table creation

Summary:
Fixes # (issue)
In acl/test_stress_acl.py, the test attempts to retrieve an interface that can be used to create an ACL table. DUTs with and without PortChannels require different methods.

Currently, it decides by filtering on the topology name. However, some topology flags can map to configurations either with or without PortChannels, so the topology name is no longer a sufficient check - in some topos the test will fail with:

Error: Failed to parse ACL table config: exception=Cannot bind ACL to specified port Ethernet136
Reproducible by manually running the following on the DUT:

config acl add table DATAACL L3 -s ingress -p Ethernet0
^FAILS
config acl add table DATAACL L3 -s ingress -p PortChannel101
^WORKS
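
A hedged sketch of the interface-selection idea behind the commands above, based on the minigraph facts structure; the helper name and exact keys are illustrative, not the PR's code:

def pick_acl_bind_interface(mg_facts):
    """Prefer a PortChannel for the ACL table binding when the DUT has one."""
    portchannels = mg_facts.get("minigraph_portchannels", {})
    if portchannels:
        return sorted(portchannels.keys())[0]   # e.g. "PortChannel101"
    return sorted(mg_facts.get("minigraph_ports", {}).keys())[0]  # e.g. "Ethernet0"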
* Add ERR log ignore for ECMP/LAG hash not supported

On Broadcom platforms, ECMP/LAG hash is not supported. Adding ERR log
ignore as it is expected for this error log to print.

* Added platform checks to log ignore extension

* Remove extraneous file change
What is the motivation for this PR?
Fix the ValueError in test_generic_hash.py

How did you do it?
Check if the field exists in the list before removing it
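
A minimal sketch of that guard, with illustrative names; removing only when the field is present avoids the ValueError raised by list.remove():

def remove_hash_field(hash_fields, field):
    if field in hash_fields:
        hash_fields.remove(field)
    return hash_fields

print(remove_hash_field(["SRC_IP", "DST_IP"], "IN_PORT"))  # ['SRC_IP', 'DST_IP'], no ValueError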

How did you verify/test it?
Run test_generic_hash.py

Signed-off-by: Kevin Wang <[email protected]>
* Skip test_dhcp_relay for Cisco 8122 BE deployment

* Revise

* Revise

* Revise
…cd and orchagent (sonic-net#15911)

What is the motivation for this PR?
Add a new test case to address sonic-net#15889

How did you do it?
Restart syncd and orchagent, then check lldp entries in APPL_DB

How did you verify/test it?
run test_lldp_syncd.py
What is the motivation for this PR?
Fix the issue on Cisco platform

How did you do it?
Remove forward action

How did you verify/test it?
Run the test on Cisco platform

Signed-off-by: Kevin Wang <[email protected]>
* Fix orchagent crash when setting up mock dualtor environment for t0

When running dualtor tests on t0 topo, the DUT has to enter a mocked
dualtor state. Part of this setup is adding a tunnel table to CONFIG_DB,
which involves a Broadcom SAI attribute that is only supported when
`sai_tunnel_support=1` is set in `syncd:/etc/sai.d/config.bcm` - this
attribute is not set until `apply_peer_switch_table_to_dut()` is run.

Changing an unsupported Broadcom SAI attribute will cause orchagent to crash.

Fix this issue by first running the setup function
`apply_peer_switch_table_to_dut()` that will set
`sai_tunnel_support=1`, before adding the tunnel table with
`apply_tunnel_table_to_dut()`.

* Fix dualtor tests overwriting /etc/sonic/config_db.json

When running dualtor tests on a t0 topology, the test will overwrite
`/etc/sonic/config_db.json` during the test, causing `config_reload()`
at the end of the test to not restore the pre-test state of CONFIG_DB.

Fix by adding a fixture to back up `/etc/sonic/config_db.json` before
the test, then restore it and run `config reload -y` after the test.
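
A hedged sketch of that backup/restore idea as a pytest fixture; the fixture name, backup path, and restore steps are assumptions, not the PR's exact implementation:

import pytest

@pytest.fixture
def backup_and_restore_config_db(duthost):
    backup_path = "/etc/sonic/config_db.json.dualtor_bak"  # illustrative path
    duthost.shell("cp /etc/sonic/config_db.json {}".format(backup_path))
    yield
    duthost.shell("mv {} /etc/sonic/config_db.json".format(backup_path))
    duthost.shell("config reload -y")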

* Change to reuse old config restore and use running_golden_config source
What is the motivation for this PR?
In PR sonic-net#16152, we need to skip the subnet decap test in PR testing while the issue is unresolved, but the conditions should be combined with or, not and

How did you do it?
Set conditions_logical_operator to or

How did you verify/test it?
…and fix range related issues (sonic-net#16139)

Regression tests were failing because it was not able to calculate the correct range

Description of PR
Support for a single queue such as 'Ethernet128|1' was missing; the core was modified to support both
'Ethernet128|1' and 'Ethernet128|0-2'
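
An illustrative parser for the two queue-spec forms mentioned above (not the PR's code):

def parse_queue_spec(spec):
    port, queues = spec.split("|")
    if "-" in queues:
        start, end = (int(x) for x in queues.split("-"))
        return port, list(range(start, end + 1))
    return port, [int(queues)]

print(parse_queue_spec("Ethernet128|1"))    # ('Ethernet128', [1])
print(parse_queue_spec("Ethernet128|0-2"))  # ('Ethernet128', [0, 1, 2])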

Summary:
Fixes # (issue)

co-authorized by: [email protected]
What is the motivation for this PR?
To ensure proper Dual ToR setups in test_dhcp_relay_stress testcase.

How did you do it?
Added the Dual ToR-related fixture.

How did you verify/test it?
Validated that Dual ToR setups successfully pass the test_dhcp_relay_stress test case.
… test (sonic-net#15976)

What is the motivation for this PR?
There are flaky failures in this case because the default route is missing

How did you do it?
Increase wait time for it.
Add log for triage

How did you verify/test it?
Run test
For 202411 branch PR testing, branch "202411" instead of "master" should be used.

Signed-off-by: Xin Wang <[email protected]>
…#16230)

What is the motivation for this PR?
vm_topology builds up the testbed connections (veth links, OVS bridges, etc.) on the test server by running Linux commands, which involves a lot of waiting for I/O operations.

vm_topology runtime statistics with restart-ptf:
real    18m50.615s
user    0m0.009s
sys     0m0.099s

Given its I/O-bound nature, the vm_topology runtime can be greatly decreased by using a thread pool to parallelize the I/O operations.

How did you do it?
Introduce a thread pool to vm_topology to run the time-consuming functions in parallel.
* restart-ptf on dualtor-120 vm_topology profile statistics:

Top three functions by total run time:

function name	total run time
add_host_ports	1040s
bind_fp_ports	96.3s
init	        16.7s

* remove-topo on dualtor-120 vm_topology profile statistics:

Top three functions by total run time:

function name	                        total run time
remove_host_ports	                    165s
unbind_fp_ports	                        40.6s
remove_injected_fp_ports_from_docker	3.3s

Use a thread pool to run in parallel the following functions, which account for most of the time in the statistics above:

* add_host_ports
* remove_host_ports
* bind_fp_ports
* unbind_fp_ports

Two new classes are introduced to support this feature (a minimal sketch of the worker idea follows the list):

* class VMTopologyWorker: a worker class that supports running work in either single-thread mode or thread-pool mode.
* class ThreadBufferHandler: a logging handler that buffers logs from each task submitted to the VMTopologyWorker and flushes them when the task ends, so vm_topology logs stay grouped by task and logs from different tasks are not mixed together.
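
A minimal sketch of the worker idea (not the actual VMTopologyWorker): run the I/O-bound per-port functions either serially or through a thread pool.

from concurrent.futures import ThreadPoolExecutor

class SimpleTopologyWorker:
    def __init__(self, use_thread_pool=False, max_workers=13):
        self.pool = ThreadPoolExecutor(max_workers=max_workers) if use_thread_pool else None

    def run(self, func, items):
        if self.pool is None:
            return [func(item) for item in items]   # single-thread mode
        return list(self.pool.map(func, items))     # thread-pool mode

def add_host_port(port):  # stand-in for the real I/O-bound work
    return "added {}".format(port)

worker = SimpleTopologyWorker(use_thread_pool=True)
print(worker.run(add_host_port, ["eth0", "eth1", "eth2"]))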

How did you verify/test it?

Tested this PR on a dualtor-120 testbed; the thread pool has 13 worker threads.

operation	    vm_topology run time without this PR	vm_topology run time with this PR
remove-topo	    3m19.786s	                            1m18.430s
restart-ptf	    18m50.615s	                            3m58.963s

* restart-ptf with-this-PR vm_topology profile statistics:
function name	total run time without this PR	total run time with this PR
add_host_ports	1040s	                        169s
bind_fp_ports	96.3s	                        39.3s

* remove-topo with-this-PR vm_topology profile statistics:

function name	    total run time without this PR	total run time with this PR
remove_host_ports	165s	                        68.8s
unbind_fp_ports	    40.6s	                        8.4s

Signed-off-by: Longxiang <[email protected]>
…ner check tests (sonic-net#16252)

Description of PR
Summary:
Noticed quite a lot of flaky failures in PR tests:
https://elastictest.org/scheduler/testplan/676e75562c6c7b8d3e3bd8bf?testcase=process_monitoring%2Ftest_critical_process_monitoring.py&type=console
Simply waiting for 120s is not enough for multi-asic KVM testbeds; enhance config_reload to use safe_reload to make sure the testbed is healthy.

Approach
What is the motivation for this PR?
Noticed quite a lot of flaky failures in PR tests:
https://elastictest.org/scheduler/testplan/676e75562c6c7b8d3e3bd8bf?testcase=process_monitoring%2Ftest_critical_process_monitoring.py&type=console

How did you do it?
Simply waiting for 120s is not enough for multi-asic KVM testbeds; enhance config_reload to use safe_reload to make sure the testbed is healthy.

How did you verify/test it?
PR test will verify it.

authorized by: [email protected]
…sonic-net#16235)

What is the motivation for this PR?
BGP routes are set up during add-topo https://github.com/sonic-net/sonic-mgmt/blob/master/ansible/roles/vm_set/tasks/add_topo.yml#L276.
But there are scenarios where the routes on the DUT have been messed up while all BGP sessions are up; the sanity check would treat the DUT as healthy and would not take action to recover it:

If the Loopback IPv4 address has been replaced, all kernel routes learned from BGP would be missing.
In some test cases, routes are announced or withdrawn from the PTF but not restored afterwards (e.g. test_stress_routes).
Healthy status:

admin@sonic:~$ ip route show default
default nhid 282 proto bgp src 10.1.0.32 metric 20 
        nexthop via 10.0.0.57 dev PortChannel101 weight 1 
        nexthop via 10.0.0.59 dev PortChannel103 weight 1 
        nexthop via 10.0.0.61 dev PortChannel105 weight 1 
        nexthop via 10.0.0.63 dev PortChannel106 weight 1 
admin@sonic:~$ show ip bgp sum

IPv4 Unicast Summary:
BGP router identifier 10.1.0.32, local AS number 65100 vrf-id 0
BGP table version 2890
RIB entries 2893, using 648032 bytes of memory
Peers 6, using 4451856 KiB of memory
Peer groups 4, using 256 bytes of memory


Neighbhor      V     AS    MsgRcvd    MsgSent    TblVer    InQ    OutQ  Up/Down      State/PfxRcd  NeighborName
-----------  ---  -----  ---------  ---------  --------  -----  ------  ---------  --------------  --------------
10.0.0.57      4  65200        763        764         0      0       0  11:46:17             1439  ARISTA01M1
10.0.0.59      4  65200        763        765         0      0       0  11:46:17             1439  ARISTA02M1
10.0.0.61      4  65200        763        765         0      0       0  11:46:17             1439  ARISTA03M1
10.0.0.63      4  65200        763        765         0      0       0  11:46:17             1439  ARISTA04M1
10.0.0.65      4  64001        712        761         0      0       0  11:46:15                2  ARISTA01MX
10.0.0.67      4  64002        712        761         0      0       0  11:46:15                2  ARISTA02MX

Total number of neighbors 6
Issue status: no default route, but show ip bgp sum looks good

admin@sonic:~$ ip route show default
admin@sonic:~$ show ip bgp sum

IPv4 Unicast Summary:
BGP router identifier 10.1.0.32, local AS number 65100 vrf-id 0
BGP table version 2892
RIB entries 2893, using 648032 bytes of memory
Peers 6, using 4451856 KiB of memory
Peer groups 4, using 256 bytes of memory


Neighbhor      V     AS    MsgRcvd    MsgSent    TblVer    InQ    OutQ  Up/Down      State/PfxRcd  NeighborName
-----------  ---  -----  ---------  ---------  --------  -----  ------  ---------  --------------  --------------
10.0.0.57      4  65200        764        767         0      0       0  11:47:14             1439  ARISTA01M1
10.0.0.59      4  65200        764        768         0      0       0  11:47:14             1439  ARISTA02M1
10.0.0.61      4  65200        764        768         0      0       0  11:47:14             1439  ARISTA03M1
10.0.0.63      4  65200        764        768         0      0       0  11:47:14             1439  ARISTA04M1
10.0.0.65      4  64001        713        764         0      0       0  11:47:12                2  ARISTA01MX
10.0.0.67      4  64002        713        764         0      0       0  11:47:12                2  ARISTA02MX

Total number of neighbors 6
How did you do it?
Add a default route check to the sanity check, and re-announce routes if the issue happens
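
A hedged sketch of the default-route check idea; the real sanity check differs in structure and also covers IPv6:

def has_v4_default_route(duthost):
    output = duthost.shell("ip route show default")["stdout"].strip()
    return output.startswith("default")

# If the default route is missing while all BGP sessions are up, the recovery
# action would be to re-announce routes from the PTF rather than a full reload.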

How did you verify/test it?
Run sanity check
…rsion doesn't support dhcp_server (sonic-net#16086)

What is the motivation for this PR?
The golden config DB variable was changed from a JSON-like string to a dict by PR sonic-net#15922.
But for MX devices whose OS version doesn't support dhcp_server, an empty JSON-like string would still be generated, which would cause a failure.

How did you do it?
Change the empty golden config generated for MX devices whose OS version doesn't support dhcp_server

How did you verify/test it?
Deploy-mg
What is the motivation for this PR?
default_pfcwd_status should be 'disable' for m0 and mx

How did you do it?
In golden config_db, disable pfcwd for m0 and mx

How did you verify/test it?
Deploy topo
…c-net#16264)

What is the motivation for this PR?
The default route check in the sanity check was added by sonic-net#16235.
It only supports single-asic for now, but the single-asic constraint is missing in the recover stage.
This would cause a KeyError on multi-asic devices if there is a BGP sanity check failure:

    def adaptive_recover(dut, localhost, fanouthosts, nbrhosts, tbinfo, check_results, wait_time):
        outstanding_action = None
        for result in check_results:
            if result['failed']:
                if result['check_item'] == 'interfaces':
                    action = _recover_interfaces(dut, fanouthosts, result, wait_time)
                elif result['check_item'] == 'services':
                    action = _recover_services(dut, result)
                elif result['check_item'] == 'bgp':
                    # If there is only default route missing issue, only need to re-announce routes to recover
>                   if ("no_v4_default_route" in result['bgp'] and len(result['bgp']) == 1 or
                        "no_v6_default_route" in result['bgp'] and len(result['bgp']) == 1 or
                        ("no_v4_default_route" in result['bgp'] and "no_v6_default_route" in result['bgp'] and
                         len(result['bgp']) == 2)):
E                        KeyError: 'bgp'

check_results = [{'bgp0': {'down_neighbors': ['2603:10e2:400:1::5', 'fc00::2']}, 'bgp3': {'down_neighbors': ['2603:10e2:400:1::6']}, 'check_item': 'bgp', 'failed': True, ...}]
dut        = <MultiAsicSonicHost vlab-08>
fanouthosts = {}
localhost  = <tests.common.devices.local.Localhost object at 0x77f1b9270a90>
nbrhosts   = {'ARISTA01T0': <EosHost VM0129>, 'ARISTA01T2': <EosHost VM0128>}
outstanding_action = None
result     = {'bgp0': {'down_neighbors': ['2603:10e2:400:1::5', 'fc00::2']}, 'bgp3': {'down_neighbors': ['2603:10e2:400:1::6']}, 'check_item': 'bgp', 'failed': True, ...}
tbinfo     = {'auto_recover': 'False', 'comment': 'Tests multi-asic virtual switch vm', 'conf-name': 'vms-kvm-four-asic-t1-lag', 'duts': ['vlab-08'], ...}
wait_time  = 30
How did you do it?
Add the single-asic constraint in the recover stage
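
An illustrative guard for the failure above: the 'bgp' key only exists in the single-asic result layout (multi-asic results use 'bgp0', 'bgp1', ...), so the default-route-only recovery path must check for it first. A hedged sketch, not the PR's exact code:

def default_route_only_failure(result):
    if "bgp" not in result:   # multi-asic layout, skip this recovery path
        return False
    issues = set(result["bgp"])
    return bool(issues) and issues <= {"no_v4_default_route", "no_v6_default_route"}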

How did you verify/test it?
Run test
…d telemetry test (sonic-net#16112)

What is the motivation for this PR?
The client MAC of DHCP packets should be the PTF interface MAC rather than the DUT interface MAC

How did you do it?
Modify the test case to use the PTF interface MAC

How did you verify/test it?
Run tests in m0/t0 topo
Description of PR
Add BFD up count check and MAC entries count check to sanity check for T2 topo.

Summary:
Fixes # (issue) Microsoft ADO 29825439 & 29825466

Approach
What is the motivation for this PR?
During our T2 nightly runs, we found that there is a chance that the port channel connection between 2 ASICs is up but the MAC address was not learned and the BFD session between them is down. Therefore, we need a sanity check to make sure all BFD sessions are up and all MAC addresses are learned; otherwise issues like this will affect the test results and can impact the production environment.

How did you do it?
Added a check_bfd_up_count() function to the sanity check for the T2 topo only. This check takes ~4 seconds to run on a T2 device with 3 LCs (frontend nodes).
Added a check_mac_entry_count() function to the sanity check for the T2 supervisor only. This check takes ~17 seconds to finish on a T2 device whose supervisor has 10 ASICs.
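
A hedged sketch of the counting pattern behind both checks: collect the current count, compare it against the expected value, and report it in the sanity-check result. The collection command and result layout are assumptions, not the PR's code:

def check_up_count(duthost, collect_cmd, expected, item_name):
    lines = duthost.shell(collect_cmd)["stdout_lines"]   # collect_cmd is illustrative
    actual = sum(1 for line in lines if "Up" in line.split())
    return {"check_item": item_name, "failed": actual < expected,
            "actual": actual, "expected": expected}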
How did you verify/test it?
I ran the updated code on T2 with multiple test modules and can confirm it's checking the BFD up count and MAC entries count properly. Elastictest link: https://elastictest.org/scheduler/testplan/676bbfe8ab42af53500adb8d?leftSideViewMode=detail

Besides, I can also confirm that these 2 checks will be skipped on non-T2 devices.

Any platform specific information?
Supported testbed topology if it's a new test case?
T2

co-authorized by: [email protected]
…d and enable_pfcwd functions. (sonic-net#16176)

Description of PR
Summary:
Even after sonic-net#15863, the problem is still seen on the RP. The issue is that the sort() function only sorts alphabetically, not numerically, so we need to use the natsort module instead of the normal sort() in these functions.
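
Why the alphabetical sort is a problem for interface names, shown with natsort (a small illustrative example, not the PR's code):

from natsort import natsorted

ports = ["Ethernet16", "Ethernet112", "Ethernet0", "Ethernet8"]
print(sorted(ports))     # ['Ethernet0', 'Ethernet112', 'Ethernet16', 'Ethernet8']
print(natsorted(ports))  # ['Ethernet0', 'Ethernet8', 'Ethernet16', 'Ethernet112']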

Approach
What is the motivation for this PR?
Seeing a lot of loganalyzer messages, but in RP only.

How did you do it?
Change sort to natsort().

How did you verify/test it?
Running in my TB, I don't see the LA error anymore.

Any platform specific information?
T2 multi-ASIC only.

co-authorized by: [email protected]
Description of PR
Summary:
Fixes # (issue) 304717824

Type of change
For devices that don't have a console port (e.g. on Cisco 8800 the console only exists on the RP), we should skip this test for that DUT. Otherwise we are going to get KeyError: 'ManagementIp'

Approach
What is the motivation for this PR?
How did you do it?
We skip if the key does not exist. If the key does not exist, the graph entry will be empty. For example:

 "dut-lc1-1": {},
How did you verify/test it?
verified on physical testbed

Signed-off-by: Austin Pham <[email protected]>
Description of PR
Summary:
Fixes # (issue) 30457143

Currently this is using the old method of capturing queue counters:

sudo ip netns exec asic1 show queue counters Ethernet128
This fails on some testbeds with an error saying Ethernet128 does not exist (haven't had a chance to confirm why).

However, since we now have support for -n (sonic-net/sonic-utilities#2439), we should be using that instead

Tested by running manual commands

admin@str3-8800-lc4-1:~$ sudo ip netns exec asic1 show queue counters Ethernet128
Port doesn't exist! Ethernet128
admin@str3-8800-lc4-1:~$ show queue counters Ethernet128 -n asic1
For namespace asic1:
       Port    TxQ    Counter/pkts    Counter/bytes    Drop/pkts    Drop/bytes
-----------  -----  --------------  ---------------  -----------  ------------
Ethernet128    UC0               0                0            0             0
...
Type of change

Approach
What is the motivation for this PR?
How did you do it?
Update the test to use new APIs that support -n
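
An illustrative helper for that change: build the namespace-aware command instead of wrapping it in ip netns exec (the exact test code may differ):

def queue_counters_cmd(port, namespace=None):
    cmd = "show queue counters {}".format(port)
    if namespace:
        cmd += " -n {}".format(namespace)
    return cmd

print(queue_counters_cmd("Ethernet128", "asic1"))  # show queue counters Ethernet128 -n asic1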

How did you verify/test it?
Ran manually; needs to be verified on an available testbed.

Signed-off-by: Austin Pham <[email protected]>
mssonicbld and others added 17 commits March 3, 2025 13:46
What is the motivation for this PR?
The current logic simply checks for the keyword "status" in service_info without considering the context. If the keyword appears in 'last output' or another unintended section of the data, the logic misinterprets it.

How did you do it?
Ensuring "status" detection is specific to the intended line. Using a stricter parsing approach to avoid unintended matches.

How did you verify/test it?
Ran the test_pretest.py on kvm-t0 testbed. Captured the get_monit_services_status() output
…n for autonegotiation (sonic-net#17191) (sonic-net#17277)

[ansible] Correct the formatting for linkmetadata minigraph generation for autonegotiation
chore: update test for t2

Description of PR
Summary:
Previously this test was skipped for EOS neighbors for legacy reasons. Now we should be able to enable it; this PR adds the necessary changes to make the test compatible.

Fixes # (issue) 28839241

Approach
What is the motivation for this PR?
Described above

How did you do it?
Adjust the relevant code to be compatible with eos neighbor.

How did you verify/test it?
Verified compatibility on a T2 testbed.

Signed-off-by: Austin Pham <[email protected]>
Two improvements:
1. force replace the dst arp responder config.
2. add show arp and neighbor check in I/O utility to help debug.

Signed-off-by: Longxiang Lyu <[email protected]>
…ness (sonic-net#17040)

Description of PR
Summary:
Fixes # (issue)
test_tunnel_memory_leak is flaky; sometimes it fails with:
Failed: Test failed because there is memory leak on dut

Based on the calculation:

21/11/2024 20:15:33 test_tunnel_memory_leak.test_tunnel_memo L0167 INFO   | tunnel_packet_handler.py original MEM USAGE:61.8046875

21/11/2024 20:24:33 test_tunnel_memory_leak._check_memory    L0102 ERROR  | tunnel_packet_handler.py MEM usage exceeds the threshold. current usage:64.3671875, target usage: 63.658828125
64.3671875 / 61.8046875 - 1 = 0.041461
The actual increase is more than 4%.

The current threshold is 3%.
Confirmed with the tunnel feature owner that it's safe to increase it to 5% to reduce the flakiness.
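
The threshold arithmetic from the log above, as a small worked example (the 3% and 5% factors are the only inputs):

original = 61.8046875
current = 64.3671875
print(round(current / original - 1, 6))  # 0.041461 -> about a 4.1% increase
print(current > original * 1.03)         # True  -> fails with the 3% threshold (63.658828125)
print(current > original * 1.05)         # False -> passes with the 5% threshold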
What is the motivation for this PR?
We are preparing to deploy impact-area PR testing in the 202411 branch, which won't always run all test sets. That means some tests may not be run for a long time, and if they are flaky we would not find out. So a baseline test is needed to find flaky tests in 202411.

How did you do it?
Deploy baseline test in 202411
…-net#17136)

Description of PR
Summary:
Fix the version check logic: only disable the feature for versions older than 202411

Approach
What is the motivation for this PR?
Fix the version check logic: only disable the feature for versions older than 202411

How did you do it?
Refactor the version check logic

How did you verify/test it?
verified via mgmt test
Description of PR
Add a 60s sleep to the end of platform_tests/link_flap/test_cont_link_flap.py test

Summary:
Fixes # (issue)

Approach
What is the motivation for this PR?
There is a small chance that the sudo route_check.py run during the teardown stage of the platform_tests/link_flap/test_cont_link_flap.py test would fail

co-authorized by: [email protected]
Summary:
This PR enables the bmp feature related tests in the KVM default test path, so that the new feature can be tested more.

What is the motivation for this PR?
This PR enables the bmp feature related tests in the KVM default test path, so that the new feature can be tested more.

How did you do it?
Added relevant test cases into kvmtest.sh

How did you verify/test it?
Verified in pre-commit.
What is the motivation for this PR?
In PR sonic-net#15605, the platform_api_conn fixture was relocated to a common location. The fixture start_platform_api_service used by it is not imported in this script, which causes the error "fixture 'start_platform_api_service' not found". This PR addresses the issue by moving the fixture start_platform_api_service to a common place, tests/conftest.py.

How did you do it?
This PR addresses the issue by moving fixture start_platform_api_service to a common place tests/conftest.py.

How did you verify/test it?
Tested locally by running the test script test_reload_dpu.py; there is no such import error.
…7305)

The feature is not ready on the BRCM platform; we need to xfail it on BRCM before the fix is in place.

What is the motivation for this PR?
To xfail the test_lag_member_forwarding on BRCM platform.

How did you do it?
Xfail the test_lag_member_forwarding in conditional mark file.

How did you verify/test it?
Run test.

Any platform specific information?
Only xfail for BRCM platform.
) (sonic-net#17245)

The current hash logic cannot support LAG-only and ECMP-only hash

Signed-off-by: Kevin Wang <[email protected]>
Co-authored-by: Kevin Wang <[email protected]>
@mssonicbld
Collaborator

/azp run


Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

@JibinBao JibinBao closed this Mar 5, 2025
@JibinBao JibinBao reopened this Mar 5, 2025
@mssonicbld
Collaborator

/azp run


Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

@mssonicbld
Collaborator

/azp run


Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.
