Trident self-healing triggers read-only remount but no recovery #1039

@ysakashita

Describe the bug

When the connection to the storage is down for an extended period, the volume ends up mounted read-only.
This occurs because Trident's self-healing logs out the iSCSI session, after which ext4 remounts the filesystem read-only (ro).

In this state, self-healing does not restore read-write access, even after the storage connection is restored.

Environment

  • Trident version: 25.02.1
  • Trident installation flags used: silenceAutosupport: true (Trident Operator)
  • Container runtime: containerd://1.7.27
  • Kubernetes version: K8S v1.32.5
  • Kubernetes orchestrator: Kubernetes
  • Kubernetes enabled feature gates: none
  • OS: Ubuntu 20.04.5 LTS
  • NetApp backend types: ONTAP AFF 9.15.1P7
  • Other:

To Reproduce

  1. Confirm that the self-healing feature is enabled
$ kubectl get tridentorchestrators trident -o yaml |grep -i heal
      {"apiVersion":"trident.netapp.io/v1","kind":"TridentOrchestrator","metadata":{"annotations":{},"name":"trident"},"spec":{"autosupportImage":"apj-zlab-docker-local.edge.artifactory.corp.yahoo.co.jp:4443/team-stateful/trident-autosupport:25.02","debug":true,"imageRegistry":"apj-zlab-docker-local.edge.artifactory.corp.yahoo.co.jp:4443/team-stateful/k8scsi","iscsiSelfHealingInterval":"300s","namespace":"trident","silenceAutosupport":true,"tridentImage":"apj-zlab-docker-local.edge.artifactory.corp.yahoo.co.jp:4443/team-stateful/trident:25.02.1"}}
  iscsiSelfHealingInterval: 300s
    iscsiSelfHealingInterval: 300s
    iscsiSelfHealingWaitTime: 7m0s
  1. Deploy a Pod with PV
$ kubectl get pod,pvc,pv -o wide
NAME                                        READY   STATUS    RESTARTS   AGE     IP           NODE                                       NOMINATED NODE   READINESS GATES
...
pod/random-io-loader-ontap-block-2          1/1     Running   0          5h55m   10.26.2.15   demo-sts-mpath5-w-default-e4bdf350-8ztwd   <none>           <none>

NAME                                                                 STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE    VOLUMEMODE
...
persistentvolumeclaim/volume-random-io-loader-ontap-block-2          Bound    pvc-10697ddb-7af6-4ebb-a7bc-96ae91d556fd   10Gi       RWO            ontap-block    <unset>                 5d8h   Filesystem

NAME                                                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                                 STORAGECLASS   VOLUMEATTRIBUTESCLASS   REASON   AGE    VOLUMEMODE
...
persistentvolume/pvc-10697ddb-7af6-4ebb-a7bc-96ae91d556fd   10Gi       RWO            Delete           Bound    loader/volume-random-io-loader-ontap-block-2          ontap-block    <unset>                          5d8h   Filesystem
  1. Storage connection failure occurs (both storage controllers were down)
root@demo-sts-mpath5-w-default-e4bdf350-8ztwd:/home/ubuntu# multipath -ll
3600a09803831386f385d585974355433 dm-0 NETAPP,LUN C-Mode
size=10G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=0 status=enabled
| `- 3:0:0:0 sdb 8:16 failed faulty running
`-+- policy='service-time 0' prio=0 status=enabled
  `- 2:0:0:0 sda 8:0  failed faulty running
  1. Trident logs out the iSCSI session (self-healing)
$ kubectl logs trident-node-linux-xgjn6 
...
time="2025-07-02T08:11:13Z" level=debug msg="Logout is successful." requestID=b8150647-283c-4860-aaf5-6d3e12fa5e2c requestSource=Internal
...
  1. ext4 remounts the filesystem read-only

We checked the dmesg logs on the node.

root@demo-sts-mpath5-w-default-e4bdf350-8ztwd:/home/ubuntu# dmesg
...
[ 7733.853879] blk_update_request: I/O error, dev dm-0, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
[ 7733.853880] JBD2: Detected IO errors while flushing file data on dm-0-8
[ 7733.853882] Buffer I/O error on dev dm-0, logical block 0, lost sync page write
[ 7733.853883] EXT4-fs (dm-0): I/O error while writing superblock
[ 7733.853885] EXT4-fs (dm-0): previous I/O error to superblock detected
[ 7733.853886] EXT4-fs error (device dm-0): ext4_journal_check_start:61: Detected aborted journal
[ 7733.853887] EXT4-fs (dm-0): Remounting filesystem read-only
...

As you may know, ext4 tries to remount the filesystem read-only to prevent corruption when the superblock
is changed while the storage is not writable.

  • Confirm read-only mount:
root@demo-sts-mpath5-w-default-e4bdf350-8ztwd:/home/ubuntu# mount |grep pvc
/dev/mapper/3600a09803831386f385d585974355433 on /var/lib/kubelet/pods/01d6fe9b-7f32-4c16-ae97-2b7d3fe9a023/volumes/kubernetes.io~csi/pvc-10697ddb-7af6-4ebb-a7bc-96ae91d556fd/mount type ext4 (ro,relatime,discard,stripe=16)
/dev/mapper/3600a09803831386f385d585974355433 on /var/lib/kubelet/pods/01d6fe9b-7f32-4c16-ae97-2b7d3fe9a023/volumes/kubernetes.io~csi/pvc-10697ddb-7af6-4ebb-a7bc-96ae91d556fd/mount type ext4 (ro,relatime,discard,stripe=16)
  1. Recover storage connection

The iSCSI sessions were restored by Trident self-healing after the connection to the storage was restored.

time="2025-07-02T09:26:14Z" level=debug msg=">>>> command.Execute." args="[-m session]" command=iscsiadm logLayer=csi_frontend requestID=6fabb14e-bedf-4829-885b-0571f07a6273 requestSource=Periodic workflow="node_server=heal_iscsi"
time="2025-07-02T09:26:14Z" level=debug msg="<<<< Execute." command=iscsiadm error="<nil>" logLayer=csi_frontend output="tcp: [1] 100.78.187.5:3260,1038 iqn.1992-08.com.netapp:sn.7dded402aad511ef892cd039ea54a52f:vs.6 (non-flash)\ntcp: [2] 100.78.187.6:3260,1039 iqn.1992-08.com.netapp:sn.7dded402aad511ef892cd039ea54a52f:vs.6 (non-flash)" requestID=6fabb14e-bedf-4829-885b-0571f07a6273 requestSource=Periodic workflow="node_server=heal_iscsi"
time="2025-07-02T09:26:14Z" level=debug msg="Adding iSCSI session info." Portal="100.78.187.5:3260,1038" PortalIP=100.78.187.5 SID=1 TargetName="iqn.1992-08.com.netapp:sn.7dded402aad511ef892cd039ea54a52f:vs.6" logLayer=csi_frontend requestID=6fabb14e-bedf-4829-885b-0571f07a6273 requestSource=Periodic workflow="node_server=heal_iscsi"
time="2025-07-02T09:26:14Z" level=debug msg="Adding iSCSI session info." Portal="100.78.187.6:3260,1039" PortalIP=100.78.187.6 SID=2 TargetName="iqn.1992-08.com.netapp:sn.7dded402aad511ef892cd039ea54a52f:vs.6" logLayer=csi_frontend requestID=6fabb14e-bedf-4829-885b-0571f07a6273 requestSource=Periodic workflow="node_server=heal_iscsi"
  1. Confirm path and mount
  • Multipath is recovered
root@demo-sts-mpath5-w-default-e4bdf350-8ztwd:/home/ubuntu# multipath -ll
3600a09803831386f385d585974355433 dm-0 NETAPP,LUN C-Mode
size=10G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| `- 3:0:0:0 sdb 8:16 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  `- 2:0:0:0 sda 8:0  active ready running
  • Mount is still read-only
root@demo-sts-mpath5-w-default-e4bdf350-8ztwd:/home/ubuntu# mount |grep pvc
/dev/mapper/3600a09803831386f385d585974355433 on /var/lib/kubelet/pods/01d6fe9b-7f32-4c16-ae97-2b7d3fe9a023/volumes/kubernetes.io~csi/pvc-10697ddb-7af6-4ebb-a7bc-96ae91d556fd/mount type ext4 (ro,relatime,discard,stripe=16)
/dev/mapper/3600a09803831386f385d585974355433 on /var/lib/kubelet/pods/01d6fe9b-7f32-4c16-ae97-2b7d3fe9a023/volumes/kubernetes.io~csi/pvc-10697ddb-7af6-4ebb-a7bc-96ae91d556fd/mount type ext4 (ro,relatime,discard,stripe=16)
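
The read-only check in the steps above can also be done programmatically. The following is a minimal sketch, assuming the text format of `mount` output shown in this report; the function name `read_only_csi_mounts` is hypothetical and is not part of Trident.

```python
import re

# One line of `mount` output: "<source> on <target> type <fstype> (<options>)".
MOUNT_LINE = re.compile(
    r"^(?P<source>\S+) on (?P<target>\S+) type (?P<fstype>\S+) \((?P<options>[^)]*)\)$"
)


def read_only_csi_mounts(mount_output):
    """Return (source, target) pairs for ext4 CSI volume mounts that carry the `ro` option.

    Hypothetical helper for illustration only; it inspects text, it does not touch the system.
    """
    found = []
    for line in mount_output.splitlines():
        match = MOUNT_LINE.match(line.strip())
        if match is None:
            continue  # Not a line in the expected format.
        options = match.group("options").split(",")
        # Assumption: kubelet CSI volume mounts contain this path component, as in this report.
        is_csi = "/volumes/kubernetes.io~csi/" in match.group("target")
        if match.group("fstype") == "ext4" and is_csi and "ro" in options:
            found.append((match.group("source"), match.group("target")))
    return found
```

Note that a volume mounted read-only on purpose would also match, so this check alone does not prove that the filesystem was remounted after an I/O error.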

Expected behavior

I expect self-healing to restore read-write access, even for volumes that have already been remounted read-only.
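
To make the stuck state concrete, here is a minimal sketch of the condition that would need to be detected: every multipath path is healthy again, yet the mount still carries `ro`. It assumes the `multipath -ll` text format shown in this report; `path_states` and `needs_rw_recovery` are hypothetical names, not Trident functions, and the sketch only detects the condition without attempting any recovery.

```python
import re

# A path line from `multipath -ll`, e.g. "| `- 3:0:0:0 sdb 8:16 active ready running".
PATH_LINE = re.compile(
    r"(?P<hctl>\d+:\d+:\d+:\d+)\s+(?P<dev>\S+)\s+\d+:\d+\s+(?P<state>.+)$"
)


def path_states(multipath_output):
    """Map each path device (e.g. "sdb") to its state string with whitespace normalized."""
    states = {}
    for line in multipath_output.splitlines():
        match = PATH_LINE.search(line)
        if match:
            states[match.group("dev")] = " ".join(match.group("state").split())
    return states


def needs_rw_recovery(multipath_output, mount_options):
    """Return True when all paths are healthy but the mount options still include `ro`.

    `mount_options` is the comma-separated option string from `mount`,
    for example "ro,relatime,discard,stripe=16".
    """
    states = path_states(multipath_output)
    # Require at least one path; an empty map means the output was not understood.
    all_healthy = bool(states) and all(
        state == "active ready running" for state in states.values()
    )
    return all_healthy and "ro" in mount_options.split(",")
```

With the recovered `multipath -ll` output and the mount options from this report, `needs_rw_recovery` returns `True`; with the failed-path output from the outage it returns `False`.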

Additional context

The iSCSI session logout does not seem to be necessary in Trident self-healing.
