csi-rbdplugin is crashing on one of the k8s nodes. Here is the log of the csi-rbdplugin container:
Defaulted container "csi-rbdplugin" out of: csi-rbdplugin, driver-registrar, liveness-prometheus
I0826 09:34:01.412332 57346 cephcsi.go:191] Driver version: v3.11.0 and Git version: bc24b5eca87626d690a29effa9d7420cc0154a7a
I0826 09:34:01.413253 57346 cephcsi.go:268] Initial PID limit is set to 256123
I0826 09:34:01.413510 57346 cephcsi.go:274] Reconfigured PID limit to -1 (max)
I0826 09:34:01.414051 57346 cephcsi.go:223] Starting driver type: rbd with name: rbd.csi.ceph.com
I0826 09:34:01.438534 57346 mount_linux.go:282] Detected umount with safe 'not mounted' behavior
I0826 09:34:01.453157 57346 rbd_attach.go:242] nbd module loaded
I0826 09:34:01.453253 57346 rbd_attach.go:256] kernel version "6.6.43-flatcar" supports cookie feature
I0826 09:34:01.497897 57346 rbd_attach.go:272] rbd-nbd tool supports cookie feature
I0826 09:34:01.498969 57346 server.go:114] listening for CSI-Addons requests on address: &net.UnixAddr{Name:"/csi/csi-addons.sock", Net:"unix"}
I0826 09:34:01.499266 57346 server.go:117] Listening for connections on address: &net.UnixAddr{Name:"//csi/csi.sock", Net:"unix"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0x1bedd56]
goroutine 75 [running]:
github.com/ceph/ceph-csi/internal/rbd.RunVolumeHealer(0xc0007e7ea0, 0x3b27aa0)
/go/src/github.com/ceph/ceph-csi/internal/rbd/rbd_healer.go:199 +0x3d6
github.com/ceph/ceph-csi/internal/rbd/driver.(*Driver).Run.func1()
/go/src/github.com/ceph/ceph-csi/internal/rbd/driver/driver.go:191 +0x1f
created by github.com/ceph/ceph-csi/internal/rbd/driver.(*Driver).Run in goroutine 1
/go/src/github.com/ceph/ceph-csi/internal/rbd/driver/driver.go:189 +0x749
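As an aside, the fault address addr=0x38 is consistent with reading a field at a small offset through a nil struct pointer. The snippet below is not ceph-csi code, just a minimal standalone sketch of that failure class using the k8s.io/api types; it matches what I later found with the debugger (see the delve output further down).

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func main() {
	// A PV provisioned by the in-tree RBD plugin: Spec.RBD is set, while
	// Spec.CSI stays nil even after the migrated-to annotation is added.
	pv := &v1.PersistentVolume{
		Spec: v1.PersistentVolumeSpec{
			PersistentVolumeSource: v1.PersistentVolumeSource{
				RBD: &v1.RBDPersistentVolumeSource{
					RBDImage: "kubernetes-dynamic-pvc-4a5f348c-35c3-11e7-a683-005056011766",
				},
			},
		},
	}

	// Reading any field through the nil CSI pointer panics with
	// "invalid memory address or nil pointer dereference".
	fmt.Println(pv.Spec.PersistentVolumeSource.CSI.VolumeHandle)
}
```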
I ran a debugger in the container and traced the crash to this PV:
Name: pvc-4a532f6e-35c3-11e7-870a-00505601176d
Labels: <none>
Annotations: pv.kubernetes.io/bound-by-controller: yes
pv.kubernetes.io/migrated-to: rbd.csi.ceph.com
pv.kubernetes.io/provisioned-by: kubernetes.io/rbd
Finalizers: [kubernetes.io/pv-protection external-provisioner.volume.kubernetes.io/finalizer]
StorageClass: fast
Status: Bound
Claim: default/devops-compair-deploy-staging-redis-pvc
Reclaim Policy: Delete
Access Modes: RWO
VolumeMode: Filesystem
Capacity: 2Gi
Node Affinity: <none>
Message:
Source:
Type: RBD (a Rados Block Device mount on the host that shares a pod's lifetime)
CephMonitors: [10.93.1.100:6789]
RBDImage: kubernetes-dynamic-pvc-4a5f348c-35c3-11e7-a683-005056011766
FSType:
RBDPool: rbd
RadosUser: kube
Keyring: /etc/ceph/keyring
SecretRef: &SecretReference{Name:ceph-secret-user,Namespace:default,}
ReadOnly: false
Events: <none>
I also traced that pv.Spec.PersistentVolumeSource.CSI is nil on this line.
Here is the value of pv.Spec.PersistentVolumeSource just before the crash:
(dlv) print pv.Spec.PersistentVolumeSource
k8s.io/api/core/v1.PersistentVolumeSource {
GCEPersistentDisk: *k8s.io/api/core/v1.GCEPersistentDiskVolumeSource nil,
AWSElasticBlockStore: *k8s.io/api/core/v1.AWSElasticBlockStoreVolumeSource nil,
HostPath: *k8s.io/api/core/v1.HostPathVolumeSource nil,
Glusterfs: *k8s.io/api/core/v1.GlusterfsPersistentVolumeSource nil,
NFS: *k8s.io/api/core/v1.NFSVolumeSource nil,
RBD: *k8s.io/api/core/v1.RBDPersistentVolumeSource {
CephMonitors: []string len: 1, cap: 4, [
"10.93.1.100:6789",
],
RBDImage: "kubernetes-dynamic-pvc-4a5f348c-35c3-11e7-a683-005056011766",
FSType: "",
RBDPool: "rbd",
RadosUser: "kube",
Keyring: "/etc/ceph/keyring",
SecretRef: *(*"k8s.io/api/core/v1.SecretReference")(0xc00061c9a0),
ReadOnly: false,},
ISCSI: *k8s.io/api/core/v1.ISCSIPersistentVolumeSource nil,
Cinder: *k8s.io/api/core/v1.CinderPersistentVolumeSource nil,
CephFS: *k8s.io/api/core/v1.CephFSPersistentVolumeSource nil,
FC: *k8s.io/api/core/v1.FCVolumeSource nil,
Flocker: *k8s.io/api/core/v1.FlockerVolumeSource nil,
FlexVolume: *k8s.io/api/core/v1.FlexPersistentVolumeSource nil,
AzureFile: *k8s.io/api/core/v1.AzureFilePersistentVolumeSource nil,
VsphereVolume: *k8s.io/api/core/v1.VsphereVirtualDiskVolumeSource nil,
Quobyte: *k8s.io/api/core/v1.QuobyteVolumeSource nil,
AzureDisk: *k8s.io/api/core/v1.AzureDiskVolumeSource nil,
PhotonPersistentDisk: *k8s.io/api/core/v1.PhotonPersistentDiskVolumeSource nil,
PortworxVolume: *k8s.io/api/core/v1.PortworxVolumeSource nil,
ScaleIO: *k8s.io/api/core/v1.ScaleIOPersistentVolumeSource nil,
Local: *k8s.io/api/core/v1.LocalVolumeSource nil,
StorageOS: *k8s.io/api/core/v1.StorageOSPersistentVolumeSource nil,
CSI: *k8s.io/api/core/v1.CSIPersistentVolumeSource nil,}
It looks like the healer should either reference the RBD source instead of CSI here, or VolumeHealer should skip in-tree RBD volumes entirely? A rough sketch of such a skip is below.
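This is only a sketch against the k8s.io/api types, assuming the healer iterates over PVs before touching their CSI source; the helper name is mine, not something that exists in ceph-csi:

```go
package healer

import (
	v1 "k8s.io/api/core/v1"
)

// isCSIManagedPV reports whether the PV carries a CSI source owned by the
// given driver. PVs provisioned by the in-tree kubernetes.io/rbd plugin
// keep Spec.CSI == nil even when annotated
// pv.kubernetes.io/migrated-to: rbd.csi.ceph.com, so the healer could
// skip them instead of dereferencing the nil pointer.
func isCSIManagedPV(pv *v1.PersistentVolume, driverName string) bool {
	csi := pv.Spec.PersistentVolumeSource.CSI
	return csi != nil && csi.Driver == driverName
}
```

The healer loop could then continue past any PV for which this returns false before reading pv.Spec.CSI fields.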
NOTE: the PV was provisioned by the in-tree kubernetes.io/rbd provisioner and later migrated to CSI.
I don't have enough knowledge of either the ceph-csi code base or the volume healer to continue debugging on my own. Any suggestions? Thanks!
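In case it helps anyone else hitting this, here is a small client-go sketch (my own, not from ceph-csi) for listing other PVs in a cluster that match what I assume is the trigger condition: a pv.kubernetes.io/migrated-to annotation pointing at rbd.csi.ceph.com while Spec.CSI is still nil.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Load the default kubeconfig; in-cluster config would work as well.
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	pvs, err := cs.CoreV1().PersistentVolumes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}

	for _, pv := range pvs.Items {
		migratedTo := pv.Annotations["pv.kubernetes.io/migrated-to"]
		// PVs that were migrated to the RBD CSI driver but still carry only
		// the in-tree RBD source (Spec.CSI == nil) match the crashing case.
		if migratedTo == "rbd.csi.ceph.com" && pv.Spec.CSI == nil {
			fmt.Println(pv.Name)
		}
	}
}
```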
Environment details
- Image/version of Ceph CSI driver : v3.11.0 and v3.12.1
- Helm chart version : v3.11.0 and v3.12.1
- Kernel version : 6.1.90-flatcar
- Mounter used for mounting PVC (for cephFS its
fuseorkernel. for rbd its
krbdorrbd-nbd) : - Kubernetes cluster version : v1.27.16
- Ceph cluster version : v16.2.15