
csi-rbdplugin on the node crashes on start with nil pointer #4807

@xcompass

Description

csi-rbdplugin is crashing on one of the k8s nodes. Here is the log of the csi-rbdplugin container:

Defaulted container "csi-rbdplugin" out of: csi-rbdplugin, driver-registrar, liveness-prometheus
I0826 09:34:01.412332   57346 cephcsi.go:191] Driver version: v3.11.0 and Git version: bc24b5eca87626d690a29effa9d7420cc0154a7a
I0826 09:34:01.413253   57346 cephcsi.go:268] Initial PID limit is set to 256123
I0826 09:34:01.413510   57346 cephcsi.go:274] Reconfigured PID limit to -1 (max)
I0826 09:34:01.414051   57346 cephcsi.go:223] Starting driver type: rbd with name: rbd.csi.ceph.com
I0826 09:34:01.438534   57346 mount_linux.go:282] Detected umount with safe 'not mounted' behavior
I0826 09:34:01.453157   57346 rbd_attach.go:242] nbd module loaded
I0826 09:34:01.453253   57346 rbd_attach.go:256] kernel version "6.6.43-flatcar" supports cookie feature
I0826 09:34:01.497897   57346 rbd_attach.go:272] rbd-nbd tool supports cookie feature
I0826 09:34:01.498969   57346 server.go:114] listening for CSI-Addons requests on address: &net.UnixAddr{Name:"/csi/csi-addons.sock", Net:"unix"}
I0826 09:34:01.499266   57346 server.go:117] Listening for connections on address: &net.UnixAddr{Name:"//csi/csi.sock", Net:"unix"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0x1bedd56]

goroutine 75 [running]:
github.com/ceph/ceph-csi/internal/rbd.RunVolumeHealer(0xc0007e7ea0, 0x3b27aa0)
        /go/src/github.com/ceph/ceph-csi/internal/rbd/rbd_healer.go:199 +0x3d6
github.com/ceph/ceph-csi/internal/rbd/driver.(*Driver).Run.func1()
        /go/src/github.com/ceph/ceph-csi/internal/rbd/driver/driver.go:191 +0x1f
created by github.com/ceph/ceph-csi/internal/rbd/driver.(*Driver).Run in goroutine 1
        /go/src/github.com/ceph/ceph-csi/internal/rbd/driver/driver.go:189 +0x749
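
The panic points at rbd_healer.go:199. A minimal sketch of the failure mode (checkPV is a hypothetical stand-in, not the actual ceph-csi code): dereferencing pv.Spec.CSI on a PV whose CSI source is nil reproduces the same SIGSEGV:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// checkPV is a hypothetical stand-in for the healer's per-PV logic.
// When pv.Spec.CSI is nil, the field access below dereferences a nil
// pointer and panics with the same SIGSEGV as in the log above.
func checkPV(pv *corev1.PersistentVolume) {
	fmt.Println(pv.Spec.CSI.Driver)
}

func main() {
	// A PV shaped like the one below: RBD source set, CSI source nil.
	pv := &corev1.PersistentVolume{
		Spec: corev1.PersistentVolumeSpec{
			PersistentVolumeSource: corev1.PersistentVolumeSource{
				RBD: &corev1.RBDPersistentVolumeSource{RBDPool: "rbd"},
				// CSI left nil, matching the dlv dump below
			},
		},
	}
	checkPV(pv) // panic: runtime error: invalid memory address or nil pointer dereference
}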

I ran a debugger in the container and traced the crash to this PV:

Name:            pvc-4a532f6e-35c3-11e7-870a-00505601176d
Labels:          <none>
Annotations:     pv.kubernetes.io/bound-by-controller: yes
                 pv.kubernetes.io/migrated-to: rbd.csi.ceph.com
                 pv.kubernetes.io/provisioned-by: kubernetes.io/rbd
Finalizers:      [kubernetes.io/pv-protection external-provisioner.volume.kubernetes.io/finalizer]
StorageClass:    fast
Status:          Bound
Claim:           default/devops-compair-deploy-staging-redis-pvc
Reclaim Policy:  Delete
Access Modes:    RWO
VolumeMode:      Filesystem
Capacity:        2Gi
Node Affinity:   <none>
Message:
Source:
    Type:          RBD (a Rados Block Device mount on the host that shares a pod's lifetime)
    CephMonitors:  [10.93.1.100:6789]
    RBDImage:      kubernetes-dynamic-pvc-4a5f348c-35c3-11e7-a683-005056011766
    FSType:
    RBDPool:       rbd
    RadosUser:     kube
    Keyring:       /etc/ceph/keyring
    SecretRef:     &SecretReference{Name:ceph-secret-user,Namespace:default,}
    ReadOnly:      false
Events:            <none>

I also traced that pv.Spec.PersistentVolumeSource.CSI is nil at this line (rbd_healer.go:199 from the stack trace above).

Here is the value of pv.Spec.PersistentVolumeSource before the crash:

(dlv) print pv.Spec.PersistentVolumeSource
k8s.io/api/core/v1.PersistentVolumeSource {
        GCEPersistentDisk: *k8s.io/api/core/v1.GCEPersistentDiskVolumeSource nil,
        AWSElasticBlockStore: *k8s.io/api/core/v1.AWSElasticBlockStoreVolumeSource nil,
        HostPath: *k8s.io/api/core/v1.HostPathVolumeSource nil,
        Glusterfs: *k8s.io/api/core/v1.GlusterfsPersistentVolumeSource nil,
        NFS: *k8s.io/api/core/v1.NFSVolumeSource nil,
        RBD: *k8s.io/api/core/v1.RBDPersistentVolumeSource {
                CephMonitors: []string len: 1, cap: 4, [
                        "10.93.1.100:6789",
                ],
                RBDImage: "kubernetes-dynamic-pvc-4a5f348c-35c3-11e7-a683-005056011766",
                FSType: "",
                RBDPool: "rbd",
                RadosUser: "kube",
                Keyring: "/etc/ceph/keyring",
                SecretRef: *(*"k8s.io/api/core/v1.SecretReference")(0xc00061c9a0),
                ReadOnly: false,},
        ISCSI: *k8s.io/api/core/v1.ISCSIPersistentVolumeSource nil,
        Cinder: *k8s.io/api/core/v1.CinderPersistentVolumeSource nil,
        CephFS: *k8s.io/api/core/v1.CephFSPersistentVolumeSource nil,
        FC: *k8s.io/api/core/v1.FCVolumeSource nil,
        Flocker: *k8s.io/api/core/v1.FlockerVolumeSource nil,
        FlexVolume: *k8s.io/api/core/v1.FlexPersistentVolumeSource nil,
        AzureFile: *k8s.io/api/core/v1.AzureFilePersistentVolumeSource nil,
        VsphereVolume: *k8s.io/api/core/v1.VsphereVirtualDiskVolumeSource nil,
        Quobyte: *k8s.io/api/core/v1.QuobyteVolumeSource nil,
        AzureDisk: *k8s.io/api/core/v1.AzureDiskVolumeSource nil,
        PhotonPersistentDisk: *k8s.io/api/core/v1.PhotonPersistentDiskVolumeSource nil,
        PortworxVolume: *k8s.io/api/core/v1.PortworxVolumeSource nil,
        ScaleIO: *k8s.io/api/core/v1.ScaleIOPersistentVolumeSource nil,
        Local: *k8s.io/api/core/v1.LocalVolumeSource nil,
        StorageOS: *k8s.io/api/core/v1.StorageOSPersistentVolumeSource nil,
        CSI: *k8s.io/api/core/v1.CSIPersistentVolumeSource nil,}

It looks like the healer should either read the RBD source instead of CSI here, or the VolumeHealer should skip volumes that have no CSI source (in-tree RBD PVs)?
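
A minimal sketch of the skip option, assuming a hypothetical shouldHeal guard (not the actual ceph-csi code), could look like this:

package main

import corev1 "k8s.io/api/core/v1"

// shouldHeal is a hypothetical guard the healer could apply before
// touching pv.Spec.CSI. It skips PVs whose CSI source is unset, e.g.
// in-tree RBD PVs that were only annotated with
// pv.kubernetes.io/migrated-to, as well as PVs owned by a different
// CSI driver.
func shouldHeal(pv *corev1.PersistentVolume, driverName string) bool {
	csi := pv.Spec.PersistentVolumeSource.CSI
	if csi == nil {
		return false
	}
	return csi.Driver == driverName
}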

NOTE: the PV was provisioned by the in-tree kubernetes.io/rbd provisioner and migrated to CSI.
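
For anyone hitting the same crash, a small client-go sketch (illustrative only; assumes in-cluster configuration) can list PVs that carry the migrated-to annotation but still have a nil CSI source:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	pvs, err := client.CoreV1().PersistentVolumes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pv := range pvs.Items {
		// Flag PVs annotated as migrated to the RBD CSI driver but
		// still carrying only an in-tree source (Spec.CSI == nil).
		if pv.Annotations["pv.kubernetes.io/migrated-to"] == "rbd.csi.ceph.com" && pv.Spec.CSI == nil {
			fmt.Printf("PV %s would crash the healer: migrated-to is set but Spec.CSI is nil\n", pv.Name)
		}
	}
}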

I don't have enough knowledge of either the ceph-csi code base or the volume healer to continue debugging. Any suggestions? Thanks!

Environment details

  • Image/version of Ceph CSI driver : v3.11.0 and v3.12.1
  • Helm chart version : v3.11.0 and v3.12.1
  • Kernel version : 6.1.90-flatcar
  • Mounter used for mounting PVC (for cephFS it's fuse or kernel; for rbd
    it's krbd or rbd-nbd) :
  • Kubernetes cluster version : v1.27.16
  • Ceph cluster version : v16.2.15
