[BUG] Pod bound to a security group at creation cannot ping the gateway from inside the pod #4742

Open
QEDQCD opened this issue Nov 18, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@QEDQCD
Contributor

QEDQCD commented Nov 18, 2024

Kube-OVN Version

v1.13.0

Kubernetes Version

Client Version: v1.29.3
Server Version: v1.29.3

Operation-system/Kernel Version

/etc/os-release
"CentOS Stream 9"
uname -r
5.14.0-407.el9.x86_64
sbctl version
kubectl-ko sbctl --version
ovn-sbctl 24.03.5
Open vSwitch Library 3.3.3
DB Schema 20.33.0
nbctl version
kubectl-ko nbctl --version
ovn-nbctl 24.03.5
Open vSwitch Library 3.3.3
DB Schema 7.3.0

Description

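1. Create a security group that allows all addresses (0.0.0.0/0).
2. Create a pod and bind it to that security group.
3. Enter the pod with kubectl exec; pinging the gateway address fails.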

Steps To Reproduce

1. Create the security group, sg.yaml:

```yaml
apiVersion: kubeovn.io/v1
kind: SecurityGroup
metadata:
  name: sg-example
spec:
  allowSameGroupTraffic: true
  egressRules:
    - ipVersion: ipv4
      policy: allow
      priority: 1
      protocol: all
      remoteAddress: 0.0.0.0/0
      remoteType: address
  ingressRules:
    - ipVersion: ipv4
      policy: allow
      priority: 1
      protocol: all
      remoteAddress: 0.0.0.0/0
      remoteType: address
```

2. Create a pod bound to that security group, pod.yaml:

```yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: static
  annotations:
    ovn.kubernetes.io/port_security: 'true'
    ovn.kubernetes.io/security_groups: 'sg-example'
  name: sg-pod
  namespace: default
spec:
  nodeName: worker-1
  containers:
    - image: registry.xxx.com/ubuntu:22.04
      imagePullPolicy: IfNotPresent
      command: ["sleep"]
      args: ["infinity"]
      name: test
```

Current Behavior

Enter the pod:
kubectl exec -it sg-pod -- bash
Try to ping the gateway address; the ping fails:
ping 240.0.0.1
PING 240.0.0.1 (240.0.0.1): 56 data bytes
^C--- 240.0.0.1 ping statistics ---
8 packets transmitted, 0 packets received, 100% packet loss

As soon as the pod is detached from the security group, the ping succeeds.

Expected Behavior

With the security group bound, the gateway address can be pinged from inside the pod.

QEDQCD added the bug label Nov 18, 2024

dosubot bot commented Nov 18, 2024

@QEDQCD Hello! I'm Dosu, a bot that can help you solve bugs, answer questions, and become a contributor. I'm here to help while you wait for a human maintainer.

I found a similar closed issue that might be relevant to your problem. The issue titled "[BUG][KIND] Pod inside Custom VPC+NatGateway cannot access Internet using SNAT" (#4241) was resolved by adding a logical router static route in the VPC specification. Although your issue involves security groups, the solution might provide some insight into routing configurations that could affect connectivity.

In your case, since the pod can ping the gateway when the security group is removed, it suggests that the security group rules might not be applied correctly or there might be an issue with how the security group interacts with the gateway. Here are a few things you could check:

  1. Security Group Rules: Ensure that the security group rules are correctly defined and applied. The rules should allow traffic to and from the gateway IP address.

  2. AllowSameGroupTraffic: Verify that the allowSameGroupTraffic setting is correctly configured and that the security group status reflects this setting. This feature ensures that traffic within the same security group is allowed [1].

  3. Network Policies: Check if there are any other network policies or configurations that might be affecting the pod's ability to communicate with the gateway.

  4. Logs and Events: Review the logs and events for any errors or warnings related to the security group or network configuration.

If these steps do not resolve the issue, further investigation into the specific interactions between security groups and the gateway in your environment might be necessary.

To continue talking to Dosu, mention @dosu.

@QEDQCD
Contributor Author

QEDQCD commented Nov 21, 2024

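After detailed investigation, this may be an OVN problem; it occurs intermittently.
The security group's rules in OVN are shown in the screenshot below:
kubectl-ko nbctl acl-list ovn.sg.sg.example
image
The from-lport and to-lport rules at priority 2100 allow all IPv4 traffic, and their priority is higher than the default deny rule that kube-ovn adds at priority 2003, so they should be matched first. In practice, however, the default deny rule is matched, as shown below.

Run the command:
kubectl-ko ovn-trace default/toolbox-2 240.0.0.1 icmp
image
In OVN the default deny rule is matched, so ICMP is blocked and the ping fails, which is not the expected behavior.

See the kubectl-ko command output for details:
kubectl-ko命令执行结果.txt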

@zbb88888
Collaborator

Does the port group ovn.sg.sg.example contain the port ID of this pod's NIC?

@QEDQCD
Contributor Author

QEDQCD commented Nov 22, 2024

> Does the port group ovn.sg.sg.example contain the port ID of this pod's NIC?

Yes, it does.
Query the port_group:
Command: kubectl-ko nbctl list port_group
Result:
_uuid : afad0cba-e0f2-4093-bd8e-45d595b775e5
acls : [1829da71-6480-46ad-987a-d03cb88eac0c, 215ee263-189c-457e-bebf-d36f75a191a3, 33d884c8-1efb-46fb-9277-27c6047a6681, 37d1d8af-fba6-4986-8b64-f9362b2feaed, 3bb5aad3-6e8a-4c38-9c6d-3316677dce0f, 676b7261-b0f9-4101-bbcf-14929f8bfb75, 6f6674d0-7a93-4af5-9829-4daed948714e, 88ac7a21-2069-438a-bfef-91c0bb7aef5d, 8f5e7ebf-9b76-46da-9599-0d8550b11d19, 99c49324-2edd-4370-85d7-fb81d5edf02a, 9ba9933f-ea2f-450b-8652-d20194226072, ae63de7d-927e-4173-bf83-c57830a23301, b172130b-7b39-4ba5-bf7e-49cd5a850954, dcfa51ee-768d-419d-baf7-35c3246f0472, ef9a2737-87c2-492b-844c-ed7ff319a1e1, fc2c869f-8dd3-4d69-8268-d9328a7352f2]
external_ids : {sg=sg-example, type=security_group}
name : ovn.sg.sg.example
ports : [ba339dc8-979e-4614-bd3c-5065d446a7fb]

The record contains ports : [ba339dc8-979e-4614-bd3c-5065d446a7fb]

The same UUID also shows up when grepping the logical_switch_port query:
Command: kubectl-ko nbctl --data=bare --no-heading --columns=name,addresses,_uuid find logical_switch_port |grep ba339dc8-979e-4614-bd3c-5065d446a7fb
Result: ba339dc8-979e-4614-bd3c-5065d446a7fb
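
The membership check above can also be done programmatically. The following is a minimal sketch, assuming the `list port_group` output has the `column : value` layout quoted in this comment; the helper names `parse_ports` and `port_in_group` are hypothetical and not part of kubectl-ko or OVN.

```python
import re
from typing import List

# Matches a UUID such as ba339dc8-979e-4614-bd3c-5065d446a7fb.
UUID_RE = re.compile(r"[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}")

# Sample record copied from the comment above (acls column omitted for brevity).
RECORD = """\
_uuid : afad0cba-e0f2-4093-bd8e-45d595b775e5
external_ids : {sg=sg-example, type=security_group}
name : ovn.sg.sg.example
ports : [ba339dc8-979e-4614-bd3c-5065d446a7fb]
"""


def parse_ports(record_text: str) -> List[str]:
    """Return the UUIDs listed in the 'ports' column of one port_group record."""
    for line in record_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "ports":
            return UUID_RE.findall(value)
    return []


def port_in_group(record_text: str, port_uuid: str) -> bool:
    """Check whether a logical switch port UUID is a member of the port group."""
    return port_uuid in parse_ports(record_text)


if __name__ == "__main__":
    print(port_in_group(RECORD, "ba339dc8-979e-4614-bd3c-5065d446a7fb"))
```

A port that is in the group and still has its traffic dropped points away from a membership problem and toward how the ACLs themselves are evaluated.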

@zbb88888
Collaborator

@zhangzujian this looks like a bug.

@hackerain
Contributor

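This problem is caused by upgrading from v1.12.x to v1.13.0. In v1.12.x the security group ACL rules used tier 0; after the upgrade they use tier 2, so all newly created security group rules are in tier 2. The deny-all ACL rules, however, are still in tier 0. Because tier 0 takes precedence over tier 2, the deny-all rules, which were meant to have the lowest priority, now have the highest.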
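
To illustrate why a priority-2003 drop can win over a priority-2100 allow, here is a toy model of tiered ACL evaluation. It is a sketch under stated assumptions, not OVN's implementation: the `Acl` class and `evaluate` function are hypothetical, actions are reduced to "allow" and "drop", and the only behavior modeled is that lower tiers are evaluated first and priority only orders rules within a tier.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional


@dataclass(frozen=True)
class Acl:
    tier: int                         # lower tier is evaluated first
    priority: int                     # higher priority wins within a tier
    action: str                       # simplified to "allow" or "drop"
    matches: Callable[[Dict], bool]   # stand-in for the ACL match expression


def evaluate(acls: Iterable[Acl], packet: Dict) -> Optional[str]:
    """Return the action of the first matching ACL, walking tiers in ascending order."""
    acls = list(acls)
    for tier in sorted({a.tier for a in acls}):
        in_tier = sorted(
            (a for a in acls if a.tier == tier),
            key=lambda a: a.priority,
            reverse=True,
        )
        for acl in in_tier:
            if acl.matches(packet):
                return acl.action  # verdict reached, later tiers are never consulted
    return None  # no ACL matched in any tier


def is_ip(packet: Dict) -> bool:
    return bool(packet.get("ip"))


# After the upgrade: the deny-all rule is left in tier 0, the new allow rule is in tier 2.
legacy_deny = Acl(tier=0, priority=2003, action="drop", matches=is_ip)
sg_allow = Acl(tier=2, priority=2100, action="allow", matches=is_ip)

# After the deny-all rule is recreated in the same tier as the allow rule.
recreated_deny = Acl(tier=2, priority=2003, action="drop", matches=is_ip)

if __name__ == "__main__":
    icmp = {"ip": True}
    print(evaluate([sg_allow, legacy_deny], icmp))     # drop
    print(evaluate([sg_allow, recreated_deny], icmp))  # allow
```

In this model the stale tier-0 rule decides the verdict before the tier-2 rules are looked at, which matches the ovn-trace result reported earlier in the thread.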

A temporary workaround is to delete the two deny-all rules and then restart kube-ovn-controller so that it recreates them:

[root@master-1 ~]# kubectl-ko nbctl acl-list  ovn.sg.kubeovn_deny_all
from-lport  2003 (inport == @ovn.sg.kubeovn_deny_all && ip) drop
  to-lport  2003 (outport == @ovn.sg.kubeovn_deny_all && ip) drop
[root@master-1 ~]# kubectl-ko nbctl acl-del ovn.sg.kubeovn_deny_all  from-lport 2003 "inport == @ovn.sg.kubeovn_deny_all && ip"
[root@master-1 ~]# kubectl-ko nbctl acl-del ovn.sg.kubeovn_deny_all  to-lport 2003 "outport == @ovn.sg.kubeovn_deny_all && ip"
[root@master-1 ~]# kubectl rollout restart  deployment/kube-ovn-controller -n kube-system
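
The two `acl-del` invocations above mirror the lines printed by `acl-list`. As a rough sketch, the mapping can be automated as a dry run that only builds the argument lists and never executes them. It assumes each `acl-list` line has exactly the `direction priority (match) action` shape shown in this comment; other output variants are not handled, and the function name `acl_del_commands` is hypothetical.

```python
import re
import shlex
from typing import List

# One acl-list line, e.g.:
#   from-lport  2003 (inport == @ovn.sg.kubeovn_deny_all && ip) drop
ACL_LINE = re.compile(r"^\s*(from-lport|to-lport)\s+(\d+)\s+\((.*)\)\s+(\S+)\s*$")

# Sample output copied from the comment above.
ACL_LIST_OUTPUT = """\
from-lport  2003 (inport == @ovn.sg.kubeovn_deny_all && ip) drop
  to-lport  2003 (outport == @ovn.sg.kubeovn_deny_all && ip) drop
"""


def acl_del_commands(port_group: str, acl_list_output: str) -> List[List[str]]:
    """Build one acl-del argument list per ACL line; nothing is executed."""
    commands = []
    for line in acl_list_output.splitlines():
        m = ACL_LINE.match(line)
        if m is None:
            continue  # skip lines that do not look like an ACL entry
        direction, priority, match, _action = m.groups()
        commands.append(
            ["kubectl-ko", "nbctl", "acl-del", port_group, direction, priority, match]
        )
    return commands


if __name__ == "__main__":
    for argv in acl_del_commands("ovn.sg.kubeovn_deny_all", ACL_LIST_OUTPUT):
        # shlex.join quotes the match expression so the line can be pasted into a shell.
        print(shlex.join(argv))
```

Printing the commands for review before running them keeps the destructive step manual, which seems prudent when editing the northbound database by hand.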

hackerain added a commit to hackerain/kube-ovn that referenced this issue Nov 27, 2024
The ACLs in v1.13.x are in tier 2 rather than tier 0 as in v1.12.x, so the
legacy deny-all security group drops all traffic once a pod is bound to a
security group, because ACLs in tier 0 have the highest priority. We should
recreate the ACLs in the deny-all security group when upgrading to v1.13.x.

Signed-off-by: Rain Suo <[email protected]>
@hackerain
Contributor

@bobz965 @zhangzujian does this approach look OK? #4768

@zbb88888
Collaborator

> @bobz965 @zhangzujian does this approach look OK? #4768

Thanks, I've had a look.

hackerain added a commit to hackerain/kube-ovn that referenced this issue Dec 2, 2024
The ACLs in v1.13.x are in tier 2 rather than tier 0 as in v1.12.x. The legacy ACLs
may cause unexpected behavior because ACLs in tier 0 have the highest priority.
We should delete the legacy ACLs and recreate them when upgrading to v1.13.x.

Signed-off-by: Rain Suo <[email protected]>
hackerain added a commit to hackerain/kube-ovn that referenced this issue Dec 4, 2024
The ACLs in v1.13.x are in tier 2 rather than tier 0 as in v1.12.x. The legacy ACLs
may cause unexpected behavior because ACLs in tier 0 have the highest priority.
We should delete the legacy ACLs and recreate them when upgrading to v1.13.x.

Signed-off-by: suo <[email protected]>
hackerain added a commit to hackerain/kube-ovn that referenced this issue Dec 5, 2024
The ACLs in v1.13.x are in tier 2 rather than tier 0 as in v1.12.x. The legacy ACLs
may cause unexpected behavior because ACLs in tier 0 have the highest priority.
We should delete the legacy ACLs and recreate them when upgrading to v1.13.x.

Signed-off-by: suo <[email protected]>