Description
What happened?
- The system disk became full, and then the router container died.
- The kubelet liveness and readiness probes now fail because the container's PID cannot be found:
[root@yf-master-0-0 crio]# journalctl -u kubelet.service --since "2025-06-23 14:08:28" | grep "router-default-8d89fcb6d-pf8tx\|913720ce-6f30"
Jun 23 14:08:28 yf-master-0-0 hyperkube[3381926]: E0623 14:08:28.608550 3381926 prober.go:104] "Probe errored" err="rpc error: code = NotFound desc = container is not created or running: checking if PID of c4743be5a13d8787a72acb82642057c03a9fd076dbfb36809fef960f287b1b29 is running failed: open /proc/2563009/stat: no such file or directory: container process not found" probeType="Liveness" pod="ccos-ingress/router-default-8d89fcb6d-pf8tx" podUID="913720ce-6f30-4612-8ba0-d451dc991303" containerName="router"
Jun 23 14:08:28 yf-master-0-0 hyperkube[3381926]: E0623 14:08:28.608625 3381926 prober.go:104] "Probe errored" err="rpc error: code = NotFound desc = container is not created or running: checking if PID of c4743be5a13d8787a72acb82642057c03a9fd076dbfb36809fef960f287b1b29 is running failed: open /proc/2563009/stat: no such file or directory: container process not found" probeType="Readiness" pod="ccos-ingress/router-default-8d89fcb6d-pf8tx" podUID="913720ce-6f30-4612-8ba0-d451dc991303" containerName="router"
Jun 23 14:08:38 yf-master-0-0 hyperkube[3381926]: E0623 14:08:38.608556 3381926 prober.go:104] "Probe errored" err="rpc error: code = NotFound desc = container is not created or running: checking if PID of c4743be5a13d8787a72acb82642057c03a9fd076dbfb36809fef960f287b1b29 is running failed: open /proc/2563009/stat: no such file or directory: container process not found" probeType="Liveness" pod="ccos-ingress/router-default-8d89fcb6d-pf8tx" podUID="913720ce-6f30-4612-8ba0-d451dc991303" containerName="router"
Jun 23 14:08:38 yf-master-0-0 hyperkube[3381926]: E0623 14:08:38.608612 3381926 prober.go:104] "Probe errored" err="rpc error: code = NotFound desc = container is not created or running: checking if PID of c4743be5a13d8787a72acb82642057c03a9fd076dbfb36809fef960f287b1b29 is running failed: open /proc/2563009/stat: no such file or directory: container process not found" probeType="Readiness" pod="ccos-ingress/router-default-8d89fcb6d-pf8tx" podUID="913720ce-6f30-4612-8ba0-d451dc991303" containerName="router"
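To confirm what the probe errors above report, a small helper can pull the PID out of a "Probe errored" line and check whether that process still exists under /proc. This is only a sketch: the function names `extract_probe_pid` and `pid_is_running` are hypothetical, and the sample line is abbreviated from the log above.

```shell
#!/bin/sh
# Hypothetical helper: print the PID found in the "open /proc/<pid>/stat"
# fragment of a kubelet "Probe errored" log line.
extract_probe_pid() {
  printf '%s\n' "$1" | sed -n 's|.*open /proc/\([0-9][0-9]*\)/stat.*|\1|p'
}

# Hypothetical helper: succeed if /proc/<pid>/stat exists, i.e. the process is alive.
pid_is_running() {
  [ -e "/proc/$1/stat" ]
}

# Abbreviated sample taken from the log output in this report.
line='checking if PID of c4743be5... is running failed: open /proc/2563009/stat: no such file or directory'
pid="$(extract_probe_pid "$line")"

if pid_is_running "$pid"; then
  echo "PID $pid is still running"
else
  echo "PID $pid is gone, but CRI-O still tracks the container"
fi
```

If the PID is gone while `crictl ps` still lists the container as running, that matches the stale state described here.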
What did you expect to happen?
I know the exit file is written by conmon. Because the system disk was full, conmon may not have been able to write the exit code to this file, so the container status is never updated again.
Is there another way to recover the container besides restarting CRI-O?
How can we reproduce it (as minimally and precisely as possible)?
Fill the system disk, then kill the container's process (possibly stopping the conmon process at the same time).
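The steps above could be scripted roughly as follows. This is a sketch under assumptions, not a verified reproducer: the filler path, the container name `router`, and the `REPRO_RUN` switch are illustrative, and it assumes `crictl` is configured for CRI-O on the node. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Reproduction sketch. Prints commands unless REPRO_RUN=1 is set.
# FILLER and NAME are assumptions for illustration; adjust them for the node.
FILLER="${FILLER:-/var/lib/containers/filler.img}"
NAME="${NAME:-router}"

# Execute the command for real, or just echo it in dry-run mode.
run() {
  if [ "${REPRO_RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

reproduce() {
  # 1. Fill the system disk; dd stops with an error once no space is left.
  run dd if=/dev/zero of="$FILLER" bs=1M

  # 2. Look up the container ID and its PID (placeholders in dry-run mode).
  if [ "${REPRO_RUN:-0}" = "1" ]; then
    cid="$(crictl ps --name "$NAME" -q | head -n 1)"
    pid="$(crictl inspect -o go-template --template '{{.info.pid}}' "$cid")"
  else
    cid='<container-id>'
    pid='<container-pid>'
  fi

  # 3. Kill the container process while the disk is still full.
  run kill -9 "$pid"
}

reproduce
```

After a real run, the kubelet journal can be checked with the same `journalctl` filter shown in the report. Remove the filler file afterwards to free the disk.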
Anything else we need to know?
No response
CRI-O and Kubernetes version
$ crio --version
1.29.11
$ kubectl version --output=json
1.29.6
OS version
# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here