[BUG] Extremely slow "add policy route" #4822
Comments
I found a similar closed issue that might be relevant to your problem: underlay subnet has an unexpected policy route. That issue was resolved by a pull request that fixed the logical router policy for the default subnet with logical gateway enabled [1]. Regarding the duplicate "add policy route" operations you're experiencing, Kube-OVN has dedicated functions for handling policy routes that are meant to prevent duplicate entries. If you're seeing a large number of duplicate operations, it might be worth checking whether these functions are being used correctly in your setup, or whether a specific configuration issue is causing this behavior.
Any reason for the slowness?
The slow "add policy route" operation in Kube-OVN is primarily due to two issues:
These factors contribute to the slow decrease in the "Work Queue Depth" for "Subnet," with occasional increases due to re-processing after events like a controller restart following an Out of Memory (OOM) issue [1].
@zsxsoft This may be related to the logic of adding policies. When adding a policy, kube-ovn-controller needs to check whether the policy already exists by performing a long string match. In your environment, each policy addition requires approximately 260 × 10 checks, which amounts to a total of about 260 × 260 × 10 checks.
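As a rough illustration of that cost, here is a minimal Go sketch assuming a naive linear existence check; the PolicyRoute type and the listAllPolicies / policyExists helpers are hypothetical stand-ins, not the actual kube-ovn-controller code:

```go
package main

import "fmt"

// PolicyRoute is a simplified stand-in for a logical router policy entry.
type PolicyRoute struct {
	Priority int
	Match    string // e.g. "ip4.src == 192.168.103.0/24"
}

// listAllPolicies stands in for fetching every policy on the logical router;
// with N subnets and M nodes it returns roughly N*M entries.
func listAllPolicies() []PolicyRoute { return nil }

// policyExists does a linear scan with a string comparison per entry,
// so a single "add policy route" costs O(N*M) comparisons.
func policyExists(existing []PolicyRoute, p PolicyRoute) bool {
	for _, e := range existing {
		if e.Priority == p.Priority && e.Match == p.Match {
			return true
		}
	}
	return false
}

func main() {
	existing := listAllPolicies() // ~subnets*nodes entries in this environment
	_ = policyExists(existing, PolicyRoute{Priority: 3368, Match: "ip4.src == 192.168.103.0/24"})

	// Re-adding the policies for every subnet after a controller restart
	// therefore costs roughly subnets * subnets * nodes comparisons.
	subnets, nodes := 260, 10
	fmt.Println("approx. comparisons for a full resync:", subnets*subnets*nodes)
}
```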
@oilbeater See kube-ovn/pkg/controller/subnet.go, line 2565 at commit 8701625. It seems that each time
I see #4538; is the patch useful? I didn't measure the time.
@zsxsoft It should help. I tested v1.14.0 with this patch and v1.12.28 without it in a cluster with 5 nodes and 250 subnets. On v1.14.0 it takes about 10 ms to process a listLogicalRouterPoliciesByFilter call, while on v1.12.28 it takes about 600 ms.
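For reference, a per-call latency like the numbers above can be measured with a simple wrapper; this is only a sketch with a stubbed stand-in, since the real listLogicalRouterPoliciesByFilter is internal to kube-ovn-controller and has its own signature:

```go
package main

import (
	"log"
	"time"
)

// listPolicies stands in for listLogicalRouterPoliciesByFilter;
// the sleep is a placeholder for the actual OVN NB lookup.
func listPolicies() {
	time.Sleep(10 * time.Millisecond)
}

func main() {
	start := time.Now()
	listPolicies()
	// Reported numbers: ~10 ms on v1.14.0 with the patch, ~600 ms on v1.12.28 without it.
	log.Printf("list call took %v", time.Since(start))
}
```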
@oilbeater Oh, I see, I read the wrong version of the code. I cherry-picked a6f13a6 into v1.12.30 with the help of b76c044, but it still took 27 minutes to create 100 subnets. Is that expected?
Log: It seems it will delay 10s before
ACLs may have the same problem as policy routes:
```yaml
apiVersion: kubeovn.io/v1
kind: Subnet
metadata:
  name: net-a
spec:
  acls:
    - action: drop
      direction: from-lport
      match: ip.dst == 192.168.103.2
      priority: 1021
  cidrBlock: 192.168.103.0/24
  default: false
  dhcpV4Options: lease_time=3600,router=192.168.103.1,server_id=192.168.103.1,mtu=1400
  enableDHCP: true
  enableLb: true
  excludeIps:
    - 192.168.103.1
  gateway: 192.168.103.1
  gatewayNode: ""
  gatewayType: distributed
  mtu: 1400
  natOutgoing: true
  policyRoutingPriority: 3368
  policyRoutingTableID: 3368
  private: false
  protocol: IPv4
  provider: ovn
  vpc: ovn-cluster
```
I am unable to reproduce this issue in the cluster I’m working on. Do you have a large number of ACL rules configured in your subnet? After reviewing the code, the only suspicious part I found in the logs is this: subnet.go, line 872. |
@oilbeater No, only one ACL is applied, but every subnet contains it. I have a lot of SecurityGroups (~300); does that affect it?
Kube-OVN Version
v1.12.28
Kubernetes Version
v1.31.2
Operation-system/Kernel Version
TencentOS Server 4.2
6.6.47-12.tl4.x86_64
Description
This issue describes two problems.
I have a cluster with 10 nodes, 260 subnets in one VPC, and ~5k ports. Today I discovered that ovs-ovn on some nodes was killed due to OOM, so I increased the memory limit and restarted kube-ovn-controller. Then I found that my Work Queue Latency has remained at a very high level (>10 min).
I noticed in the logs that the controller was continuously performing "add policy route" operations at a VERY SLOW pace (approximately 1-3 seconds per entry). This is the first problem.
I understand that after restarting the KubeOVN controller, it needs to traverse all 10 nodes and 260 subnets, so I expected the number of "add policy route" operations to be ~2600:
[root@vm-master-1 a]# cat 2.log | grep 'add policy route' | wc -l
3558
However, after waiting for a long time, I found that this number far exceeded it, and there appeared to be a large number of duplicate operations (same node, same subnet, but executed twice).
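One way to confirm such duplicates from the log might look like the following Go sketch; the file name 2.log and the assumption that the node/subnet details follow the literal text "add policy route" are taken from the grep above, not from the controller's documented log format:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("2.log") // assumed log file name from the grep above
	if err != nil {
		panic(err)
	}
	defer f.Close()

	counts := map[string]int{}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		if idx := strings.Index(line, "add policy route"); idx >= 0 {
			// Key on everything from "add policy route" onward, so the same
			// node/subnet pair logged at different times counts as one entry.
			counts[line[idx:]]++
		}
	}
	for msg, n := range counts {
		if n > 1 {
			fmt.Printf("%d x %s\n", n, msg)
		}
	}
}
```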
Now I'm unable to create new subnets, so I plan to wait overnight and check again the next day to see if the operations have completed. If more information is needed, please contact me.
Steps To Reproduce
/
Current Behavior
/
Expected Behavior
/