[202411] Fix dynamic buffer issue for x86_64-nvidia_sn4280-r0 platform #17360
Open
JibinBao wants to merge 234 commits into sonic-net:master from JibinBao:cp_buffer_4280_202411
Conversation
[202411] Temporarily skip subnet decap test since the feature hasn't been added to the YANG model (sonic-net#16152)
Description of PR Fix acl/test_stress_acl.py using a bad interface name for ACL table creation Summary: Fixes # (issue) acl/test_stress_acl.py attempts to retrieve an interface that can be used to create an ACL table. DUTs with and without PortChannels require different methods. Currently the test decides by filtering on the topo name. However, some topologies may or may not have PortChannels configured, so the topo name is no longer a sufficient check; in some topos the test fails with: Error: Failed to parse ACL table config: exception=Cannot bind ACL to specified port Ethernet136
Reproducible by manually running the following on the DUT:
config acl add table DATAACL L3 -s ingress -p Ethernet0          ^FAILS
config acl add table DATAACL L3 -s ingress -p PortChannel101     ^WORKS
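A minimal sketch of the selection idea (illustrative helper, not the exact fixture from acl/test_stress_acl.py): decide from the DUT's own configuration whether a PortChannel or a plain Ethernet port can be bound to the ACL table, instead of guessing from the topology name.

```python
# Illustrative sketch: pick a port that can be bound to an ACL table based on
# what the DUT actually has configured, not on the topology name.
def select_acl_bind_port(mg_facts):
    """Return a port name suitable for 'config acl add table ... -p <port>'."""
    portchannels = mg_facts.get("minigraph_portchannels", {})
    if portchannels:
        # On DUTs with PortChannels, member Ethernet ports cannot be bound
        # directly, so bind the ACL table to a PortChannel instead.
        return sorted(portchannels.keys())[0]
    # No PortChannels configured: any front-panel Ethernet port will do.
    return sorted(mg_facts["minigraph_ports"].keys())[0]
```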
* Add ERR log ignore for ECMP/LAG hash not supported On Broadcom platforms, ECMP/LAG hash is not supported. Adding ERR log ignore as it is expected for this error log to print. * Added platform checks to log ignore extension * Remove extraneous file change
What is the motivation for this PR? Fix the ValueError in test_generic_hash.py How did you do it? Check whether the field exists in the list before removing it How did you verify/test it? Run test_generic_hash.py Signed-off-by: Kevin Wang <[email protected]>
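A minimal illustration of the guard that avoids the ValueError (field names below are placeholders, not the actual hash fields used by the test):

```python
# list.remove() raises ValueError when the element is absent, so guard the call.
hash_fields = ["SRC_IP", "DST_IP", "L4_SRC_PORT"]

field_to_remove = "INNER_SRC_IP"
if field_to_remove in hash_fields:
    hash_fields.remove(field_to_remove)
```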
* Skip test_dhcp_relay for Cisco 8122 BE deployment * Revise * Revise * Revise
…cd and orchagent (sonic-net#15911) What is the motivation for this PR? Add a new test case to address sonic-net#15889 How did you do it? Restart syncd and orchagent, then check lldp entries in APPL_DB How did you verify/test it? run test_lldp_syncd.py
What is the motivation for this PR? Fix the issue on Cisco platform How did you do it? Remove forward action How did you verify/test it? Run the test on Cisco platform Signed-off-by: Kevin Wang <[email protected]>
* Fix orchagent crash when setting up the mock dualtor environment for t0: When running dualtor tests on the t0 topo, the DUT has to enter a mocked dualtor state. Part of this setup is adding a tunnel table to CONFIG_DB, which involves a Broadcom SAI attribute that is only supported when `sai_tunnel_support=1` is set in `syncd:/etc/sai.d/config.bcm`; this attribute is not set until `apply_peer_switch_table_to_dut()` is run. Changing an unsupported Broadcom SAI attribute causes orchagent to crash. Fix this by first running the setup function `apply_peer_switch_table_to_dut()`, which sets `sai_tunnel_support=1`, before adding the tunnel table with `apply_tunnel_table_to_dut()`.
* Fix dualtor tests overwriting /etc/sonic/config_db.json: When running dualtor tests on a t0 topology, the test overwrites `/etc/sonic/config_db.json` during the test, so the `config_reload()` at the end of the test does not restore the pre-test state of CONFIG_DB. Fix by adding a fixture to back up `/etc/sonic/config_db.json` before the test, then restore it and run `config reload -y` after the test (see the sketch below).
* Change to reuse old config restore and use running_golden_config source
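A hedged sketch of the backup/restore idea; fixture and path names are illustrative and may differ from the actual dualtor fixture:

```python
import pytest

CONFIG_DB = "/etc/sonic/config_db.json"
BACKUP = "/etc/sonic/config_db.json.dualtor_bak"

@pytest.fixture
def backup_and_restore_config_db(duthost):
    """Keep a pristine copy of config_db.json across a test that rewrites it."""
    duthost.shell("cp {} {}".format(CONFIG_DB, BACKUP))
    yield
    duthost.shell("mv {} {}".format(BACKUP, CONFIG_DB))
    # Reload so the restored file becomes the running configuration again.
    duthost.shell("config reload -y")
```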
What is the motivation for this PR? In PR sonic-net#16152, we need to skip subnet decap test in PR test if issue not resolved, but conditions should be or, not and How did you do it? Set conditions_logical_operator to or How did you verify/test it?
…and fix range related issues (sonic-net#16139) Regression tests were failing because the code was not able to calculate the correct range. Description of PR Support for a single queue such as 'Ethernet128|1' was missing; modified the core to support both 'Ethernet128|1' and 'Ethernet128|0-2'. Summary: Fixes # (issue) co-authorized by: [email protected]
What is the motivation for this PR? To ensure proper Dual ToR setups in test_dhcp_relay_stress testcase. How did you do it? Added the Dual ToR-related fixture. How did you verify/test it? Validated that Dual ToR setups successfully pass the test_dhcp_relay_stress test case.
Signed-off-by: Kevin Wang <[email protected]>
…net#15898) The error log for an incorrect default route is ambiguous
… test (sonic-net#15976) What is the motivation for this PR? There are flaky failures in this case because the default route is missing How did you do it? Increase the wait time and add logs for triage How did you verify/test it? Run test
For 202411 branch PR testing, branch "202411" instead of "master" should be used. Signed-off-by: Xin Wang <[email protected]>
Signed-off-by: Xu Chen <[email protected]>
…#16230) What is the motivation for this PR? vm_topology builds up the testbed connections (veth links, OVS bridges, etc.) on the test server by running Linux commands, which involves a lot of waiting on I/O operations. vm_topology running statistics with restart-ptf: real 18m50.615s, user 0m0.009s, sys 0m0.099s. Given the I/O-bound nature, vm_topology runtime can be greatly decreased by using a thread pool to parallelize the I/O operations.
How did you do it? Introduce a thread pool to vm_topology to run the time-consuming functions in parallel.
* restart-ptf on dualtor-120, vm_topology profile statistics (top three functions by total run time): add_host_ports 1040s, bind_fp_ports 96.3s, init 16.7s
* remove-topo on dualtor-120, vm_topology profile statistics (top three functions by total run time): remove_host_ports 165s, unbind_fp_ports 40.6s, remove_injected_fp_ports_from_docker 3.3s
Use the thread pool to run in parallel the following functions, which take most of the time in the statistics above: add_host_ports, remove_host_ports, bind_fp_ports, unbind_fp_ports.
Two new classes are introduced to support this feature:
* class VMTopologyWorker: a worker class that supports either single-thread mode or thread-pool mode.
* class ThreadBufferHandler: a logging handler that buffers logs from each task submitted to the VMTopologyWorker and flushes them when the task ends, so vm_topology logs stay grouped by task and logs from different tasks are not mixed together.
How did you verify/test it? Tested this PR on a dualtor-120 testbed, with a thread pool of 13 worker threads.
* remove-topo: 3m19.786s without this PR, 1m18.430s with this PR
* restart-ptf: 18m50.615s without this PR, 3m58.963s with this PR
* restart-ptf profile: add_host_ports 1040s -> 169s, bind_fp_ports 96.3s -> 39.3s
* remove-topo profile: remove_host_ports 165s -> 68.8s, unbind_fp_ports 40.6s -> 8.4s
Signed-off-by: Longxiang <[email protected]>
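A simplified sketch of the parallelization idea: submit the per-port setup calls to a thread pool because they are I/O bound (shelling out to ip/ovs commands). The real VMTopologyWorker in vm_topology is more elaborate; the names below are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def add_host_ports_parallel(ports, add_one_port, max_workers=13):
    """Run add_one_port(port) for every port using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(add_one_port, port) for port in ports]
        # Propagate the first failure, just like a serial loop would.
        for future in futures:
            future.result()
```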
…ner check tests (sonic-net#16252) Description of PR Summary: Noticed quite a lot of flaky failures in PR test, https://elastictest.org/scheduler/testplan/676e75562c6c7b8d3e3bd8bf?testcase=process_monitoring%2Ftest_critical_process_monitoring.py&type=console Simply waiting 120s is not enough for multi-asic KVM testbeds; enhance the config_reload to use safe_reload to make sure the testbed is healthy. Approach What is the motivation for this PR? Noticed quite a lot of flaky failures in PR test, https://elastictest.org/scheduler/testplan/676e75562c6c7b8d3e3bd8bf?testcase=process_monitoring%2Ftest_critical_process_monitoring.py&type=console How did you do it? Simply waiting 120s is not enough for multi-asic KVM testbeds; enhance the config_reload to use safe_reload to make sure the testbed is healthy. How did you verify/test it? The PR test will verify it. authorized by: [email protected]
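A hedged sketch of the change: instead of a fixed sleep after reload, use the safe_reload option of the existing config_reload helper so the call returns only once the DUT is reported healthy.

```python
from tests.common.config_reload import config_reload

def reload_and_wait_healthy(duthost):
    # safe_reload=True makes config_reload wait for the system to recover
    # instead of relying on a fixed 120s wait.
    config_reload(duthost, safe_reload=True)
```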
…sonic-net#16235) What is the motivation for this PR? BGP routes are set up during add-topo (https://github.com/sonic-net/sonic-mgmt/blob/master/ansible/roles/vm_set/tasks/add_topo.yml#L276). But there are scenarios where the routes on the DUT have been messed up while all BGP sessions are still up; sanity check treats the DUT as healthy and takes no action to recover it:
* The Loopback IPv4 address has been replaced, causing all kernel routes learned from BGP to go missing.
* Some test cases announce or withdraw routes from the PTF but fail to recover them (e.g. test_stress_routes).
Healthy status:
admin@sonic:~$ ip route show default
default nhid 282 proto bgp src 10.1.0.32 metric 20
        nexthop via 10.0.0.57 dev PortChannel101 weight 1
        nexthop via 10.0.0.59 dev PortChannel103 weight 1
        nexthop via 10.0.0.61 dev PortChannel105 weight 1
        nexthop via 10.0.0.63 dev PortChannel106 weight 1
admin@sonic:~$ show ip bgp sum
IPv4 Unicast Summary:
BGP router identifier 10.1.0.32, local AS number 65100 vrf-id 0
BGP table version 2890
RIB entries 2893, using 648032 bytes of memory
Peers 6, using 4451856 KiB of memory
Peer groups 4, using 256 bytes of memory
Neighbhor    V   AS     MsgRcvd  MsgSent  TblVer  InQ  OutQ  Up/Down   State/PfxRcd  NeighborName
10.0.0.57    4   65200  763      764      0       0    0     11:46:17  1439          ARISTA01M1
10.0.0.59    4   65200  763      765      0       0    0     11:46:17  1439          ARISTA02M1
10.0.0.61    4   65200  763      765      0       0    0     11:46:17  1439          ARISTA03M1
10.0.0.63    4   65200  763      765      0       0    0     11:46:17  1439          ARISTA04M1
10.0.0.65    4   64001  712      761      0       0    0     11:46:15  2             ARISTA01MX
10.0.0.67    4   64002  712      761      0       0    0     11:46:15  2             ARISTA02MX
Total number of neighbors 6
Issue status: no default route, but show ip bgp sum looks good
admin@sonic:~$ ip route show default
admin@sonic:~$ show ip bgp sum
IPv4 Unicast Summary:
BGP router identifier 10.1.0.32, local AS number 65100 vrf-id 0
BGP table version 2892
RIB entries 2893, using 648032 bytes of memory
Peers 6, using 4451856 KiB of memory
Peer groups 4, using 256 bytes of memory
Neighbhor    V   AS     MsgRcvd  MsgSent  TblVer  InQ  OutQ  Up/Down   State/PfxRcd  NeighborName
10.0.0.57    4   65200  764      767      0       0    0     11:47:14  1439          ARISTA01M1
10.0.0.59    4   65200  764      768      0       0    0     11:47:14  1439          ARISTA02M1
10.0.0.61    4   65200  764      768      0       0    0     11:47:14  1439          ARISTA03M1
10.0.0.63    4   65200  764      768      0       0    0     11:47:14  1439          ARISTA04M1
10.0.0.65    4   64001  713      764      0       0    0     11:47:12  2             ARISTA01MX
10.0.0.67    4   64002  713      764      0       0    0     11:47:12  2             ARISTA02MX
Total number of neighbors 6
How did you do it? Add a default route check to the sanity check, and re-announce routes if the issue happens. How did you verify/test it? Run sanity check
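A hedged sketch of the added check: an all-Established BGP summary is not enough on its own, so also verify the kernel default routes exist. Helper names are illustrative, not the exact sanity_check implementation.

```python
def check_default_route(duthost):
    """Return a dict of detected default-route problems (empty means healthy)."""
    v4 = duthost.shell("ip route show default")["stdout"].strip()
    v6 = duthost.shell("ip -6 route show default")["stdout"].strip()
    results = {}
    if not v4:
        results["no_v4_default_route"] = True
    if not v6:
        results["no_v6_default_route"] = True
    return results
```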
…rsion doesn't support dhcp_server (sonic-net#16086) What is the motivation for this PR? The golden config db var was changed from a JSON-like string to a dict by PR sonic-net#15922. But for MX devices whose OS version doesn't support dhcp_server, it would still generate an empty JSON-like string, which would cause a failure How did you do it? Change the empty golden config generated for MX devices that don't support dhcp_server How did you verify/test it? Deploy-mg
What is the motivation for this PR? default_pfcwd_status should be disabled for m0 and mx How did you do it? In golden config_db, disable pfcwd for m0 and mx How did you verify/test it? Deploy topo
…c-net#16264) What is the motivation for this PR? The default route check in sanity was added by sonic-net#16235. It only supports single ASIC for now, but the single-ASIC constraint in the recover stage was missed, which causes a KeyError on multi-ASIC devices when there is a BGP sanity check failure:
def adaptive_recover(dut, localhost, fanouthosts, nbrhosts, tbinfo, check_results, wait_time):
    outstanding_action = None
    for result in check_results:
        if result['failed']:
            if result['check_item'] == 'interfaces':
                action = _recover_interfaces(dut, fanouthosts, result, wait_time)
            elif result['check_item'] == 'services':
                action = _recover_services(dut, result)
            elif result['check_item'] == 'bgp':
                # If there is only default route missing issue, only need to re-announce routes to recover
>               if ("no_v4_default_route" in result['bgp'] and len(result['bgp']) == 1
                        or "no_v6_default_route" in result['bgp'] and len(result['bgp']) == 1
                        or ("no_v4_default_route" in result['bgp'] and "no_v6_default_route" in result['bgp'] and len(result['bgp']) == 2)):
E               KeyError: 'bgp'
check_results = [{'bgp0': {'down_neighbors': ['2603:10e2:400:1::5', 'fc00::2']}, 'bgp3': {'down_neighbors': ['2603:10e2:400:1::6']}, 'check_item': 'bgp', 'failed': True, ...}]
dut = <MultiAsicSonicHost vlab-08>
fanouthosts = {}
localhost = <tests.common.devices.local.Localhost object at 0x77f1b9270a90>
nbrhosts = {'ARISTA01T0': <EosHost VM0129>, 'ARISTA01T2': <EosHost VM0128>}
outstanding_action = None
result = {'bgp0': {'down_neighbors': ['2603:10e2:400:1::5', 'fc00::2']}, 'bgp3': {'down_neighbors': ['2603:10e2:400:1::6']}, 'check_item': 'bgp', 'failed': True, ...}
tbinfo = {'auto_recover': 'False', 'comment': 'Tests multi-asic virtual switch vm', 'conf-name': 'vms-kvm-four-asic-t1-lag', 'duts': ['vlab-08'], ...}
wait_time = 30
How did you do it? Add the single-ASIC constraint in recover How did you verify/test it? Run test
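An illustrative guard for the recover path (not the exact code in this PR): only take the "re-announce routes" shortcut when the result actually carries a single-ASIC 'bgp' entry; on multi-ASIC DUTs the keys are 'bgp0', 'bgp1', ... and result['bgp'] raises KeyError.

```python
def is_default_route_only_failure(result, dut):
    """True only on single-ASIC DUTs where the bgp failure is just a missing default route."""
    if dut.is_multi_asic or "bgp" not in result:
        return False
    bgp = result["bgp"]
    default_route_keys = {"no_v4_default_route", "no_v6_default_route"}
    # All reported problems must be default-route related, and at least one must be present.
    return bool(bgp) and set(bgp.keys()) <= default_route_keys
```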
…d telemetry test (sonic-net#16112) What is the motivation for this PR? Client mac of DHCP packets should be ptf interface mac rather than DUT interface mac How did you do it? Modify test case to use ptf interface mac How did you verify/test it? Run tests in m0/t0 topo
Description of PR Add BFD up count check and MAC entries count check to sanity check for T2 topo. Summary: Fixes # (issue) Microsoft ADO 29825439 & 29825466 Approach What is the motivation for this PR? During our T2 Nightly run, we found that there will be a chance that the port channel connection between 2 ASICs is up but MAC address was not learned and the BFD session between them is down. Therefore, we need to have sanity check to make sure BFD are all up and all MAC addresses are learned, otherwise issue like this will affect the test result and can impact production env. How did you do it? Added check_bfd_up_count() function to sanity check for T2 topo only. This check will take ~4 seconds to run on a T2 device with 3 LC (frontend nodes). Added check_mac_entry_count() function to sanity check for T2 supervisor only. This check will take ~17 seconds to finish on a T2 device where its supervisor has 10 ASICs. How did you verify/test it? I ran the updated code on T2 with multiple test modules and can confirm it's checking the BFD up count and MAC entries count properly. Elastictest link: https://elastictest.org/scheduler/testplan/676bbfe8ab42af53500adb8d?leftSideViewMode=detail Besides, I can also confirm that these 2 checks will be skipped on non-T2 devices. Any platform specific information? Supported testbed topology if it's a new test case? T2 co-authorized by: [email protected]
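A rough sketch of the BFD-up-count idea for T2; the command and expected-count derivation are simplified, and the real check_bfd_up_count() in sanity check is more involved.

```python
def check_bfd_up_count(duthost, expected_up_count):
    """Count BFD sessions in Up state and compare against the expected count."""
    output = duthost.shell('show bfd summary | grep -c "Up"', module_ignore_errors=True)
    up_count = int(output["stdout"].strip() or 0)
    return {
        "check_item": "bfd_up_count",
        "failed": up_count != expected_up_count,
        "bfd_up_count": up_count,
    }
```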
…d and enable_pfcwd functions. (sonic-net#16176) Description of PR Summary: Even after sonic-net#15863, the problem is still seen in RP. The issue is that the sort() function only sorts alphabetically, not numerically. So we need to use natsort module instead of normal sort() in these functions. Approach What is the motivation for this PR? Seeing a lot of loganalyzer messages, but in RP only. How did you do it? Change sort to natsort(). How did you verify/test it? Running in my TB, I don't see the LA error anymore. Any platform specific information? T2 mAsic only. co-authorized by: [email protected]
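A short example of why natsort matters here: plain sort() orders port names lexically, so "Ethernet12" sorts before "Ethernet4", while natsorted() compares the numeric suffix as a number.

```python
from natsort import natsorted

ports = ["Ethernet12", "Ethernet4", "Ethernet128", "Ethernet0"]
print(sorted(ports))     # ['Ethernet0', 'Ethernet12', 'Ethernet128', 'Ethernet4']
print(natsorted(ports))  # ['Ethernet0', 'Ethernet4', 'Ethernet12', 'Ethernet128']
```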
Description of PR Summary: Fixes # (issue) 304717824 Type of change For devices that don't have a console port (e.g. on Cisco 8800 the console only exists on the RP), we should skip this test for that DUT. Otherwise we will hit KeyError: 'ManagementIp' Approach What is the motivation for this PR? How did you do it? We skip if the key does not exist. If the key does not exist, the graph entry will be empty, for example: "dut-lc1-1": {} How did you verify/test it? Verified on physical testbed Signed-off-by: Austin Pham <[email protected]>
Description of PR Summary: Fixes # (issue) 30457143 Currently this test uses the old method of capturing queue counters:
sudo ip netns exec asic1 show queue counters Ethernet128
This throws an error on some testbeds, saying Ethernet128 does not exist (haven't had a chance to confirm why). However, since sonic-net/sonic-utilities#2439 added support for -n, we should be using that instead. Tested by running the commands manually:
admin@str3-8800-lc4-1:~$ sudo ip netns exec asic1 show queue counters Ethernet128
Port doesn't exist! Ethernet128
admin@str3-8800-lc4-1:~$ show queue counters Ethernet128 -n asic1
For namespace asic1:
Port         TxQ    Counter/pkts    Counter/bytes    Drop/pkts    Drop/bytes
-----------  -----  --------------  ---------------  -----------  ------------
Ethernet128  UC0    0               0                0            0
...
Type of change Approach What is the motivation for this PR? How did you do it? Update the test to use the new API that supports -n How did you verify/test it? Manually run; needs to be verified with an available testbed. Signed-off-by: Austin Pham <[email protected]>
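A hedged sketch of the command change in the test: build the namespace-aware form of the command instead of wrapping it in `ip netns exec` (the exact helper names in the test may differ).

```python
def queue_counters_cmd(port, namespace=None):
    """Build the 'show queue counters' command, namespace-aware when needed."""
    cmd = "show queue counters {}".format(port)
    if namespace:
        # Newer sonic-utilities accept the ASIC namespace via -n.
        cmd += " -n {}".format(namespace)
    return cmd

# queue_counters_cmd("Ethernet128", "asic1") -> "show queue counters Ethernet128 -n asic1"
```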
[PR:17265] Update dualtor memory to 6G
What is the motivation for this PR? The current logic simply checks for the keyword "status" in service_info without considering the context. If the keyword appears in 'last output' or another unintended section of the data, the logic misinterprets it. How did you do it? Ensuring "status" detection is specific to the intended line. Using a stricter parsing approach to avoid unintended matches. How did you verify/test it? Ran the test_pretest.py on kvm-t0 testbed. Captured the get_monit_services_status() output
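An illustrative stricter parse (placeholder for what get_monit_services_status() actually handles): take the state only from the line that begins with "status", rather than matching the word anywhere in the monit output.

```python
def parse_monit_status(service_block):
    """Extract the service state from one monit status block."""
    for line in service_block.splitlines():
        stripped = line.strip()
        # Only the line that actually begins with "status" carries the state;
        # "last output" or "monitoring status" lines may also contain the word.
        if stripped.startswith("status "):
            return stripped.split(None, 1)[1]
    return ""
```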
…n for autonegotiation (sonic-net#17191) (sonic-net#17277) [ansible] Correct the formatting for linkmetadata minigraph generation for autonegotiation
chore: update test for t2 Description of PR Summary: Previously this test skipped for eos neighbor due to legacy. Now we should be able to enable this. This PR added the necessary changes to make this compatible Fixes # (issue) 28839241 Approach What is the motivation for this PR? Described above How did you do it? Adjust the relevant code to be compatible with eos neighbor. How did you verify/test it? Verified on T2 testbed compatibility. Signed-off-by: Austin Pham <[email protected]>
Two improvements: 1. force replace the dst arp responder config. 2. add show arp and neighbor check in I/O utility to help debug. Signed-off-by: Longxiang Lyu <[email protected]>
…sonic-net#17174) Signed-off-by: Longxiang <[email protected]>
…ness (sonic-net#17040) Description of PR Summary: Fixes # (issue) test_tunnel_memory_leak is flaky; sometimes it fails with: Failed: Test failed because there is memory leak on dut Based on the calculation:
21/11/2024 20:15:33 test_tunnel_memory_leak.test_tunnel_memo L0167 INFO | tunnel_packet_handler.py original MEM USAGE:61.8046875
21/11/2024 20:24:33 test_tunnel_memory_leak._check_memory L0102 ERROR | tunnel_packet_handler.py MEM usage exceeds the threshold. current usage:64.3671875, target usage: 63.658828125
64.3671875 / 61.8046875 - 1 = 0.041461, so the actual increase is more than 4%, while the current threshold is 3%. Confirmed with the tunnel feature owner that it is safe to increase the threshold to 5% to reduce the flakiness.
What is the motivation for this PR? We are preparing to deploy impact-area PR testing in the 202411 branch, which won't always run all test sets. This means some tests may not be run for a long time, and if they are flaky we won't be able to find out. A baseline test is needed to find flaky tests in 202411. How did you do it? Deploy baseline test in 202411
Signed-off-by: Longxiang <[email protected]>
…-net#17136) Description of PR Summary: Fix the version check logic: only disable the feature for versions older than 202411 Approach What is the motivation for this PR? Fix the version check logic, only disabling the feature for versions older than 202411 How did you do it? Refactor the version check logic How did you verify/test it? Verified via mgmt test
Description of PR Add a 60s sleep to the end of platform_tests/link_flap/test_cont_link_flap.py test Summary: Fixes # (issue) Approach What is the motivation for this PR? There could be a small chance that the sudo route_check.py ran during the teardown stage of platform_tests/link_flap/test_cont_link_flap.py test would fail co-authorized by: [email protected]
Summary: This PR will enable bmp feature related test in KVM default test path, so that the new feature could be tested more. What is the motivation for this PR? This PR will enable bmp feature related test in KVM default test path, so that the new feature could be tested more. How did you do it? Added relevant test cases into kvmtest.sh How did you verify/test it? Verified in pre-commit.
What is the motivation for this PR? In PR sonic-net#15605, the platform_api_conn fixture was relocated to a common location. The fixture start_platform_api_service that it uses is not imported in this script, which will cause the error fixture 'start_platform_api_service' not found. How did you do it? This PR addresses the issue by moving fixture start_platform_api_service to a common place, tests/conftest.py. How did you verify/test it? Tested locally by running the test script test_reload_dpu.py; there is no such import error.
…7305) The feature is not ready for BRCM platform, need to xfail for BRCM before the fix is in place. What is the motivation for this PR? To xfail the test_lag_member_forwarding on BRCM platform. How did you do it? Xfail the test_lag_member_forwarding in conditional mark file. How did you verify/test it? Run test. Any platform specific information? Only xfail for BRCM platform.
) (sonic-net#17245) The current hash logic cannot support LAG-only and ECMP-only hash Signed-off-by: Kevin Wang <[email protected]> Co-authored-by: Kevin Wang <[email protected]>
Description of PR
On x86_64-nvidia_sn4280-r0, some ports are NPU-DPU ports that do not carry lossless traffic, so remove those ports from the tested ports.
Summary:
Fixes # (issue)
Type of change
Back port request
Approach
What is the motivation for this PR?
Fix qos buffer test issue for x86_64-nvidia_sn4280-r0
How did you do it?
Remove NPU-DPU ports from the tested ports (see the sketch at the end of this description)
How did you verify/test it?
Run the buffer tests on x86_64-nvidia_sn4280-r0
Any platform specific information?
x86_64-nvidia_sn4280-r0
Supported testbed topology if it's a new test case?
Documentation
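A minimal sketch of the idea behind this PR (helper and fact names are illustrative, not the exact change in the QoS/buffer test): drop the ports facing the DPUs from the set of ports the buffer tests exercise, since those ports carry no lossless traffic on x86_64-nvidia_sn4280-r0.

```python
def filter_out_dpu_ports(tested_ports, mg_facts):
    """Keep only ports whose minigraph neighbor is not a DPU."""
    neighbors = mg_facts.get("minigraph_neighbors", {})
    return [
        port for port in tested_ports
        if "dpu" not in neighbors.get(port, {}).get("name", "").lower()
    ]
```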