
Repeatedly deleting and re-creating a Pod eventually leaves the newly created Pod stuck in Pending #201

Open
liufangpeng opened this issue Feb 15, 2023 · 0 comments


liufangpeng commented Feb 15, 2023

The Pod is created normally on the first deployment, but after repeatedly deleting and re-creating the same Pod it fails to schedule.

Command used: kubectl -n test-testgpu get event

LAST SEEN   TYPE      REASON             OBJECT                           MESSAGE
3m4s        Warning   FailedScheduling   pod/binpack-3-7b8684575d-cqntk   0/1 nodes are available: 1 Insufficient GPU Memory in one device.
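For context, the binpack-3 Pod presumably requests shared GPU memory through the gpushare extended resource (aliyun.com/gpu-mem), which is what the extender's filter checks against each device's free memory. The sketch below reconstructs such a container spec with the client-go types; the quantity, image, and exact manifest are assumptions, not taken from the original report.

```go
package main

import (
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Hypothetical reconstruction of the binpack-3 pod: the limit on the
	// extended resource aliyun.com/gpu-mem is what the gpushare filter
	// compares against each device's remaining memory. "3" (GiB) and the
	// image are placeholders, not values from the issue.
	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "binpack-3",
			Namespace: "test-testgpu",
		},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "binpack-3",
				Image: "cheyang/gpu-player:v2", // placeholder image
				Resources: corev1.ResourceRequirements{
					Limits: corev1.ResourceList{
						"aliyun.com/gpu-mem": resource.MustParse("3"),
					},
				},
			}},
		},
	}

	out, _ := json.MarshalIndent(pod, "", "  ")
	fmt.Println(string(out))
}
```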

Command used: nvidia-smi

Wed Feb 15 14:57:18 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02 Driver Version: 510.85.02 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A40-4Q On | 00000000:02:00.0 Off | 0 |
| N/A N/A P0 N/A / N/A | 0MiB / 4096MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

Command used: kubectl -n kube-system logs -f gpushare-schd-extender-594b9bc6d6-lh8w9

[ debug ] 2023/02/15 06:58:39 routes.go:162: /gpushare-scheduler/filter response=&{0xc42047e1e0 0xc420548300 0xc420355b80 0x565b70 true false false false 0xc4200aa580 {0xc42037a1c0 map[Content-Type:[application/json]] false false} map[Content-Type:[application/json]] true 111 -1 200 false false [] 0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [0 0 0 0 0 0 0 0 0 0] [0 0 0] 0xc420348070 0}
[ debug ] 2023/02/15 06:58:58 controller.go:295: No need to update pod name binpack-3-7b8684575d-9ksk4 in ns test-testgpu and old status is Pending, new status is Pending; its old annotation map[ovn.kubernetes.io/logical_switch:ovn-default ovn.kubernetes.io/mac_address:00:00:00:B9:31:56 ovn.kubernetes.io/network_type:geneve ovn.kubernetes.io/pod_nic_type:veth-pair ovn.kubernetes.io/gateway:10.183.0.1 ovn.kubernetes.io/logical_router:ovn-cluster kubernetes.io/psp:20-user-restricted ovn.kubernetes.io/allocated:true ovn.kubernetes.io/cidr:10.183.0.0/16 ovn.kubernetes.io/ip_address:10.183.0.113 cpaas.io/creator:admin kubernetes.io/limit-ranger:LimitRanger plugin set: cpu, memory request for container binpack-3; cpu, memory limit for container binpack-3] and new annotation map[ovn.kubernetes.io/logical_router:ovn-cluster ovn.kubernetes.io/logical_switch:ovn-default ovn.kubernetes.io/mac_address:00:00:00:B9:31:56 ovn.kubernetes.io/network_type:geneve ovn.kubernetes.io/pod_nic_type:veth-pair ovn.kubernetes.io/gateway:10.183.0.1 kubernetes.io/limit-ranger:LimitRanger plugin set: cpu, memory request for container binpack-3; cpu, memory limit for container binpack-3 kubernetes.io/psp:20-user-restricted ovn.kubernetes.io/allocated:true ovn.kubernetes.io/cidr:10.183.0.0/16 ovn.kubernetes.io/ip_address:10.183.0.113 cpaas.io/creator:admin]
[ debug ] 2023/02/15 06:59:28 controller.go:295: No need to update pod name binpack-3-7b8684575d-9ksk4 in ns test-testgpu and old status is Pending, new status is Pending; its old annotation map[ovn.kubernetes.io/logical_router:ovn-cluster ovn.kubernetes.io/logical_switch:ovn-default ovn.kubernetes.io/mac_address:00:00:00:B9:31:56 ovn.kubernetes.io/network_type:geneve ovn.kubernetes.io/pod_nic_type:veth-pair ovn.kubernetes.io/gateway:10.183.0.1 kubernetes.io/limit-ranger:LimitRanger plugin set: cpu, memory request for container binpack-3; cpu, memory limit for container binpack-3 kubernetes.io/psp:20-user-restricted ovn.kubernetes.io/allocated:true ovn.kubernetes.io/cidr:10.183.0.0/16 ovn.kubernetes.io/ip_address:10.183.0.113 cpaas.io/creator:admin] and new annotation map[ovn.kubernetes.io/gateway:10.183.0.1 ovn.kubernetes.io/logical_router:ovn-cluster ovn.kubernetes.io/logical_switch:ovn-default ovn.kubernetes.io/mac_address:00:00:00:B9:31:56 ovn.kubernetes.io/network_type:geneve ovn.kubernetes.io/pod_nic_type:veth-pair cpaas.io/creator:admin kubernetes.io/limit-ranger:LimitRanger plugin set: cpu, memory request for container binpack-3; cpu, memory limit for container binpack-3 kubernetes.io/psp:20-user-restricted ovn.kubernetes.io/allocated:true ovn.kubernetes.io/cidr:10.183.0.0/16 ovn.kubernetes.io/ip_address:10.183.0.113]

After recreating the gpushare-scheduler-extender, Pods could be created normally again, but after a few more delete/create cycles the Pod goes back to Pending.
I have not yet found the root cause.
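
Since recreating gpushare-schd-extender temporarily fixes scheduling, the symptom looks consistent with the extender keeping per-device GPU-memory allocations in an in-process cache and not always releasing the entry of a deleted Pod. This is only a guess at the mechanism; the simplified sketch below uses hypothetical names (it is not the project's actual code) to show how such a cache would keep reporting "Insufficient GPU Memory in one device" even while nvidia-smi shows the device idle.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// deviceCache is a hypothetical, simplified stand-in for the extender's
// in-memory bookkeeping of GPU memory on a single device.
type deviceCache struct {
	mu          sync.Mutex
	totalMiB    int64
	allocations map[string]int64 // pod UID -> MiB reserved on this device
}

func (d *deviceCache) usedMiB() int64 {
	var used int64
	for _, m := range d.allocations {
		used += m
	}
	return used
}

// Filter mimics the extender filter: a pod fits only if this single device
// still has enough free memory according to the cache.
func (d *deviceCache) Filter(podUID string, requestMiB int64) error {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.totalMiB-d.usedMiB() < requestMiB {
		return errors.New("Insufficient GPU Memory in one device")
	}
	d.allocations[podUID] = requestMiB
	return nil
}

// Release must run when a pod is deleted; if the delete event is missed,
// the stale entry keeps that memory "reserved" until the process restarts.
func (d *deviceCache) Release(podUID string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	delete(d.allocations, podUID)
}

func main() {
	dev := &deviceCache{totalMiB: 4096, allocations: map[string]int64{}}

	// Repeatedly create/delete the "same" pod, but only release every other
	// time, simulating missed delete events in the cache.
	for i := 0; i < 3; i++ {
		uid := fmt.Sprintf("binpack-3-%d", i)
		err := dev.Filter(uid, 3072)
		fmt.Printf("create %s: %v\n", uid, err)
		if i%2 == 0 {
			dev.Release(uid)
		}
	}
	// The GPU itself stays idle (nvidia-smi shows 0MiB used), yet the stale
	// reservation makes the third pod unschedulable until the extender is
	// restarted and its cache rebuilt.
}
```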
