Pod eviction
Problem description
Pods on a node were evicted.
Cause
1. Check the node and the pods on it
The node status is Ready, but listing all pods on the node shows evicted pods and an nvidia-device-plugin pod stuck in Pending.
root@host:~$ kgpoallowide |grep 192.168.1.1
department-56 173e397c-ea35-4aac-85d8-07106e55d7b7 0/1 Evicted 0 52d <none> 192.168.1.1 <none>
kube-system nvidia-device-plugin-daemonset-d58d2 0/1 Pending 0 1s <none> 192.168.1.1 <none>
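For reference, kgpoallowide above is presumably a local shell alias (an assumption about this environment); roughly equivalent plain kubectl commands for checking the node and the pods scheduled on it would look like the following sketch, with the node name 192.168.1.1 taken from the output above:
# Check node status and conditions (DiskPressure should show up here)
kubectl get node 192.168.1.1
kubectl describe node 192.168.1.1 | grep -A 10 Conditions
# List all pods on that node across namespaces
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=192.168.1.1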
2. Check the kubelet logs on the node
W0905 15:42:13.182280 23506 eviction_manager.go:142] Failed to admit pod rdma-device-plugin-daemonset-8nwb8_kube-system(acc28a85-cfb0-11e9-9729-6c92bf5e2432) - node has conditions: [DiskPressure]
I0905 15:42:14.827343 23506 kubelet.go:1836] SyncLoop (ADD, "api"): "nvidia-device-plugin-daemonset-88sm6_kube-system(adbd9227-cfb0-11e9-9729-6c92bf5e2432)"
W0905 15:42:14.827372 23506 eviction_manager.go:142] Failed to admit pod nvidia-device-plugin-daemonset-88sm6_kube-system(adbd9227-cfb0-11e9-9729-6c92bf5e2432) - node has conditions: [DiskPressure]
I0905 15:42:15.722378 23506 kubelet_node_status.go:607] Update capacity for nvidia.com/gpu-share to 0
I0905 15:42:16.692488 23506 kubelet.go:1852] SyncLoop (DELETE, "api"): "rdma-device-plugin-daemonset-8nwb8_kube-system(acc28a85-cfb0-11e9-9729-6c92bf5e2432)"
W0905 15:42:16.698445 23506 status_manager.go:489] Failed to delete status for pod "rdma-device-plugin-daemonset-8nwb8_kube-system(acc28a85-cfb0-11e9-9729-6c92bf5e2432)": pod "rdma-device-plugin-daemonset-8nwb8" not found
I0905 15:42:16.698490 23506 kubelet.go:1846] SyncLoop (REMOVE, "api"): "rdma-device-plugin-daemonset-8nwb8_kube-system(acc28a85-cfb0-11e9-9729-6c92bf5e2432)"
I0905 15:42:16.699267 23506 kubelet.go:2040] Failed to delete pod "rdma-device-plugin-daemonset-8nwb8_kube-system(acc28a85-cfb0-11e9-9729-6c92bf5e2432)", err: pod not found
W0905 15:42:16.777355 23506 eviction_manager.go:332] eviction manager: attempting to reclaim nodefs
I0905 15:42:16.777384 23506 eviction_manager.go:346] eviction manager: must evict pod(s) to reclaim nodefs
E0905 15:42:16.777390 23506 eviction_manager.go:357] eviction manager: eviction thresholds have been met, but no pods are active to evict
The logs show pod-eviction activity, and the reason given is node has conditions: [DiskPressure].
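Assuming kubelet runs as a systemd service on this node, the log lines above can be pulled and filtered with journalctl, and the node condition can be confirmed with kubectl, for example:
# Filter kubelet logs for eviction-related entries
journalctl -u kubelet --since "1 hour ago" | grep -E "eviction_manager|DiskPressure"
# Confirm the DiskPressure condition reported for the node
kubectl get node 192.168.1.1 -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}'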
3. Check disk usage
[root@host /]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 126G 0 126G 0% /dev
tmpfs 126G 0 126G 0% /dev/shm
tmpfs 126G 27M 126G 1% /run
tmpfs 126G 0 126G 0% /sys/fs/cgroup
/dev/sda1 20G 19G 0 100% / # root filesystem is full
/dev/nvme1n1 3.0T 191G 2.8T 7% /data2
/dev/nvme0n1 3.0T 1.3T 1.7T 44% /data1
/dev/sda4 182G 95G 87G 53% /data
/dev/sda3 20G 3.8G 15G 20% /usr/local
tmpfs 26G 0 26G 0% /run/user/0
The root filesystem is full, so next check which files are taking up the space.
[root@host ~/kata]# du -sh ./*
1.0M ./log
944K ./netlink
6.6G ./kernel3
There are also about 7G of logs under /var/log/. After cleaning up those logs and other unneeded files, space on the root filesystem is recovered.
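A hedged sketch of commands for locating and reclaiming the space (the exact files to remove depend on the environment; journalctl --vacuum-size only applies if the logs are journald-managed):
# Find the largest top-level directories on the root filesystem only (-x stays on this filesystem)
du -xh --max-depth=1 / 2>/dev/null | sort -rh | head -n 15
# Cap the journald log size, if applicable
journalctl --vacuum-size=500M
# List old rotated logs before deleting them (example pattern only)
find /var/log -name "*.gz" -mtime +7 -print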
[root@host /data]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 126G 0 126G 0% /dev
tmpfs 126G 0 126G 0% /dev/shm
tmpfs 126G 27M 126G 1% /run
tmpfs 126G 0 126G 0% /sys/fs/cgroup
/dev/sda1 20G 5.8G 13G 32% / # root filesystem back to normal
/dev/nvme1n1 3.0T 191G 2.8T 7% /data2
Check the pod status on the node again: the device-plugin pods are back to Running.
root@host:~$ kgpoallowide |grep 192.168.1.1
kube-system nvidia-device-plugin-daemonset-h4pjc 1/1 Running 0 16m 192.168.1.1 192.168.1.1 <none>
kube-system rdma-device-plugin-daemonset-xlkbv 1/1 Running 0 16m 192.168.1.1 192.168.1.1 <none>
4. Check the kubelet configuration
Looking at the kubelet flags related to pod eviction, the node's kubelet has eviction enabled; in our normal setup this flag should be disabled.
ExecStart=/usr/local/bin/kubelet \
...
--eviction-hard=nodefs.available<1% \
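With --eviction-hard=nodefs.available<1%, kubelet reports DiskPressure and starts evicting as soon as less than 1% of the node filesystem is free, which a 100% full root filesystem triggers immediately. A minimal sketch for confirming the flags actually in effect and for restarting kubelet after changing them (assuming the systemd unit shown above):
# Show the eviction flags currently configured in the unit file and on the running process
systemctl cat kubelet | grep -i eviction
ps -ef | grep '[k]ubelet' | tr ' ' '\n' | grep eviction
# After editing the flags, reload systemd and restart kubelet
systemctl daemon-reload
systemctl restart kubelet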
Solution
To summarize the cause: kubelet had pod eviction enabled and the root filesystem hit 100% usage, so pods on the node were evicted and new pods could no longer be admitted on that node.
The fix:
1. Disable the kubelet eviction mechanism.
2. Clean up files on the root filesystem to recover space, and add disk-usage monitoring for the root filesystem going forward (a minimal sketch follows below).
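As a starting point for item 2, a minimal monitoring sketch; the 90% threshold and the logger destination are arbitrary examples, and in practice this would hook into whatever alerting system is already in place:
#!/bin/bash
# Alert when root filesystem usage crosses a threshold (90% is an arbitrary example value)
THRESHOLD=90
USAGE=$(df -P / | awk 'NR==2 {print $5}' | tr -d '%')
if [ "$USAGE" -ge "$THRESHOLD" ]; then
    echo "root filesystem usage is ${USAGE}%, above ${THRESHOLD}%" | logger -t disk-monitor
fi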