Description
What happened?
Starting with CRI-O 1.31 and 1.32, we are experiencing an intermittent issue where kubelet attempts to start a container and receives a "context deadline exceeded" error. The container then fails to start.
Jan 09 22:00:32 kubelet[46938]: E0109 22:00:32.615351 46938 log.go:32] "CreateContainer in sandbox from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" podSandboxID="b61b3b06381b6e030e0db6d126e613c7391b2d46947a2307fbf2061b237e340a"
Jan 09 22:00:32 kubelet[46938]: E0109 22:00:32.621327 46938 kuberuntime_manager.go:1274] "Unhandled Error" err="init container &Container{Name:install-cni-plugin,Image:docker.io/flannel/flannel-cni-plugin:v1.5.1-flannel2,Command:[cp],Args:[-f /flannel /opt/cni/bin/flannel],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{},Resources:ResourceRequirements{Limits:ResourceList{cpu: {{100 -3} {} 100m DecimalSI},},Requests:ResourceList{cpu: {{100 -3} {} 100m DecimalSI},memory: {{52428800 0} {} 50Mi BinarySI},},Claims:[]ResourceClaim{},},VolumeMounts:[]VolumeMount{VolumeMount{Name:cni-plugin,ReadOnly:false,MountPath:/opt/cni/bin,SubPath:,MountPropagation:nil,SubPathExpr:,RecursiveReadOnly:nil,},VolumeMount{Name:kube-api-access-pqww5,ReadOnly:true,MountPath:/var/run/secrets/kubernetes.io/serviceaccount,SubPath:,MountPropagation:nil,SubPathExpr:,RecursiveReadOnly:nil,},},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:Never,SecurityContext:&SecurityContext{Capabilities:&Capabilities{Add:[],Drop:[ALL],},Privileged:*false,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:*false,AllowPrivilegeEscalation:*false,RunAsGroup:nil,ProcMount:nil,WindowsOptions:nil,SeccompProfile:nil,AppArmorProfile:nil,},Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,ResizePolicy:[]ContainerResizePolicy{},RestartPolicy:nil,} start failed in pod kube-flannel-ds-mz4j6_kube-flannel(f90de2e5-9c6b-4b33-b001-ad8724d7475f): CreateContainerError: context deadline exceeded"
Jan 09 22:00:32 kubelet[46938]: E0109 22:00:32.622800 46938 pod_workers.go:1301] "Error syncing pod, skipping" err="failed to "StartContainer" for "install-cni-plugin" with CreateContainerError: "context deadline exceeded"" pod="kube-flannel/kube-flannel-ds-mz4j6" podUID="f90de2e5-9c6b-4b33-b001-ad8724d7475f"
What did you expect to happen?
The container to start.
How can we reproduce it (as minimally and precisely as possible)?
Using Kubernetes 1.32 and CRI-O 1.31 or 1.32, attempt to start a container with kubelet or crictl. The issue is intermittent; we have seen it with static pods as well as DaemonSets and ReplicaSets.
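For reference, a minimal manual reproduction attempt with crictl could look like the following. This is a sketch only: pod-config.json and container-config.json are hypothetical placeholder spec files, not taken from our setup.

```shell
# Hypothetical placeholder configs for a sandbox and container spec
# matching the failing pod (e.g. the flannel init container).
POD_ID=$(crictl runp pod-config.json)
CTR_ID=$(crictl create "$POD_ID" container-config.json pod-config.json)
# With the affected CRI-O versions, container creation intermittently
# hangs and the client eventually reports "context deadline exceeded".
crictl start "$CTR_ID"
```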
Anything else we need to know?
When the issue occurs, we are starting roughly 10 pods, and usually only one or two of them fail to start. The issue is not seen when running CRI-O 1.30. Monitoring CPU and memory usage of crio and of the system as a whole shows no spikes while the failure is happening.
It looks like kubelet retries the CreateContainer call to crio a couple of times. Here are some potentially relevant crio logs:
This is from the crio logs when it is first asked to start kube-flannel:
Jan 09 21:50:29 crio[46784]: time="2025-01-09 21:50:29.093653563Z" level=info msg="Running pod sandbox: kube-flannel/kube-flannel-ds-mz4j6/POD" id=e2ce965c-d3db-461c-8f00-9c51ff5f1c7e name=/runtime.v1.RuntimeService/RunPodSandbox
Jan 09 21:50:29 crio[46784]: time="2025-01-09 21:50:29.093787912Z" level=info msg="Allowed annotations are specified for workload [io.containers.trace-syscall]"
Jan 09 21:50:29 crio[46784]: time="2025-01-09 21:50:29.107788922Z" level=info msg="Running pod sandbox: kube-system/kube-proxy-zzllq/POD" id=4401a111-f18b-42e3-af00-549331cbe255 name=/runtime.v1.RuntimeService/RunPodSandbox
Jan 09 21:50:29 crio[46784]: time="2025-01-09 21:50:29.117524470Z" level=info msg="Allowed annotations are specified for workload [io.containers.trace-syscall]"
Jan 09 21:50:29 crio[46784]: time="2025-01-09 21:50:29.132781722Z" level=info msg="Ran pod sandbox bb61e19a5991be550d1562a5443f4ba7bf1348fd051a9186b42295c21353c709 with infra container: kube-system/kube-proxy-zzllq/POD" id=4401a111-f18b-42e3-af00-549331cbe255 name=/runtime.v1.RuntimeService/RunPodSandbox
Jan 09 21:50:29 crio[46784]: time="2025-01-09 21:50:29.151638239Z" level=info msg="Ran pod sandbox b61b3b06381b6e030e0db6d126e613c7391b2d46947a2307fbf2061b237e340a with infra container: kube-flannel/kube-flannel-ds-mz4j6/POD" id=e2ce965c-d3db-461c-8f00-9c51ff5f1c7e name=/runtime.v1.RuntimeService/RunPodSandbox
Jan 09 21:50:29 crio[46784]: time="2025-01-09 21:50:29.181753348Z" level=info msg="Creating container: kube-flannel/kube-flannel-ds-mz4j6/install-cni-plugin" id=e999c0fd-92da-496a-b4db-e880da769baf name=/runtime.v1.RuntimeService/CreateContainer
I don't see any log message after the CreateContainer line for the first attempt at starting the pod.
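One way to confirm this is to pull every journal line for that request id (a hypothetical journalctl invocation, assuming crio logs to the systemd journal as shown above):

```shell
# Request id taken from the "Creating container" log line above; no
# further lines appear for it until kubelet retries the request.
journalctl -u crio --no-pager | grep 'e999c0fd-92da-496a-b4db-e880da769baf'
```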
This happens 2 minutes later when kubelet retries:
Jan 09 22:00:33 crio[46784]: time="2025-01-09 22:00:33.574948975Z" level=info msg="Creating container: kube-flannel/kube-flannel-ds-mz4j6/install-cni-plugin" id=c27dcf88-11cf-4006-b1b5-3f1c8cef1f88 name=/runtime.v1.RuntimeService/CreateContainer
Jan 09 22:00:33 crio[46784]: time="2025-01-09 22:00:33.579815435Z" level=warning msg="error reserving ctr name k8s_install-cni-plugin_kube-flannel-ds-mz4j6_kube-flannel_f90de2e5-9c6b-4b33-b001-ad8724d7475f_1 for id e9fdcd7512c85806a06e4b0926b6477e170ec7bd563e70c0f3e1a81ed542060b: name is reserved"
Jan 09 22:00:33 crio[46784]: time="2025-01-09 22:00:33.580207554Z" level=info msg="Creation of container k8s_install-cni-plugin_kube-flannel-ds-mz4j6_kube-flannel_f90de2e5-9c6b-4b33-b001-ad8724d7475f_1 not yet finished. Currently at stage container creating. Waiting up to 5m59.994481685s for it to finish" id=c27dcf88-11cf-4006-b1b5-3f1c8cef1f88 name=/runtime.v1.RuntimeService/CreateContainer
CRI-O and Kubernetes version
$ crio --version
crio version 1.31.3
GitCommit: 5865976a62cd3abd1b56ec21ea4fde19a730fe87
GitCommitDate: 2024-12-02T10:48:49Z
GitTreeState: dirty
BuildDate: 1970-01-01T00:00:00Z
GoVersion: go1.22.5
Compiler: gc
Platform: linux/s390x
Linkmode: static
BuildTags:
static
netgo
osusergo
exclude_graphdriver_btrfs
seccomp
apparmor
selinux
exclude_graphdriver_devicemapper
LDFlags: unknown
SeccompEnabled: true
AppArmorEnabled: false
and
$ crio --version
crio version 1.32.0
GitCommit: b7f3c240bcbda6fae8d43561694d18317e09e167
GitCommitDate: 2024-12-19T15:54:01Z
GitTreeState: dirty
BuildDate: 1970-01-01T00:00:00Z
GoVersion: go1.23.3
Compiler: gc
Platform: linux/s390x
Linkmode: static
BuildTags:
static
netgo
osusergo
exclude_graphdriver_btrfs
seccomp
apparmor
selinux
exclude_graphdriver_devicemapper
LDFlags: unknown
SeccompEnabled: true
AppArmorEnabled: false
$ kubectl version --output=json
{
"clientVersion": {
"major": "1",
"minor": "32",
"gitVersion": "v1.32.0",
"gitCommit": "70d3cc986aa8221cd1dfb1121852688902d3bf53",
"gitTreeState": "clean",
"buildDate": "2024-12-11T18:05:36Z",
"goVersion": "go1.23.3",
"compiler": "gc",
"platform": "linux/s390x"
},
"kustomizeVersion": "v5.5.0"
}
OS version
# On Linux:
$ cat /etc/os-release
RHEL9.4
$ uname -a
Linux host 5.14.0-427.37.1.el9.s390x #1 SMP Thu Oct 17 04:26:45 CDT 2024 s390x s390x s390x GNU/Linux