Description
Describe the bug
When the connection to the storage is down for an extended period of time, the volume ends up mounted read-only.
This happens because Trident's self-healing logs out the iSCSI session, after which ext4 remounts the filesystem read-only (ro).
In this state, self-healing does not restore read-write access even after the storage connection is restored.
Environment
- Trident version: 25.02.1
- Trident installation flags used: silenceAutosupport: true (Trident Operator)
- Container runtime: containerd://1.7.27
- Kubernetes version: K8S v1.32.5
- Kubernetes orchestrator: Kubernetes
- Kubernetes enabled feature gates: none
- OS: Ubuntu 20.04.5 LTS
- NetApp backend types: ONTAP AFF 9.15.1P7
- Other:
To Reproduce
- Confirm that the self-healing feature is enabled
$ kubectl get tridentorchestrators trident -o yaml |grep -i heal
{"apiVersion":"trident.netapp.io/v1","kind":"TridentOrchestrator","metadata":{"annotations":{},"name":"trident"},"spec":{"autosupportImage":"apj-zlab-docker-local.edge.artifactory.corp.yahoo.co.jp:4443/team-stateful/trident-autosupport:25.02","debug":true,"imageRegistry":"apj-zlab-docker-local.edge.artifactory.corp.yahoo.co.jp:4443/team-stateful/k8scsi","iscsiSelfHealingInterval":"300s","namespace":"trident","silenceAutosupport":true,"tridentImage":"apj-zlab-docker-local.edge.artifactory.corp.yahoo.co.jp:4443/team-stateful/trident:25.02.1"}}
iscsiSelfHealingInterval: 300s
iscsiSelfHealingInterval: 300s
iscsiSelfHealingWaitTime: 7m0s
- Deploy a Pod with PV
$ kubectl get pod,pvc,pv -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
...
pod/random-io-loader-ontap-block-2 1/1 Running 0 5h55m 10.26.2.15 demo-sts-mpath5-w-default-e4bdf350-8ztwd <none> <none>
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE VOLUMEMODE
...
persistentvolumeclaim/volume-random-io-loader-ontap-block-2 Bound pvc-10697ddb-7af6-4ebb-a7bc-96ae91d556fd 10Gi RWO ontap-block <unset> 5d8h Filesystem
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS VOLUMEATTRIBUTESCLASS REASON AGE VOLUMEMODE
...
persistentvolume/pvc-10697ddb-7af6-4ebb-a7bc-96ae91d556fd 10Gi RWO Delete Bound loader/volume-random-io-loader-ontap-block-2 ontap-block <unset> 5d8h Filesystem
- Storage connection failure occurs (both storage controllers were down)
root@demo-sts-mpath5-w-default-e4bdf350-8ztwd:/home/ubuntu# multipath -ll
3600a09803831386f385d585974355433 dm-0 NETAPP,LUN C-Mode
size=10G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=0 status=enabled
| `- 3:0:0:0 sdb 8:16 failed faulty running
`-+- policy='service-time 0' prio=0 status=enabled
`- 2:0:0:0 sda 8:0 failed faulty running
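As a side note, the failed state above is easy to detect from a script. A minimal sketch (the helper name is ours, not part of Trident or multipath-tools) that counts paths reported as "failed faulty":

```shell
# Minimal sketch: count paths that `multipath -ll` reports as "failed faulty".
# Expects the `multipath -ll` output on stdin.
count_failed_paths() {
  grep -c 'failed faulty'
}

# Demo on a captured line from this report; on a node you would instead pipe
# in the live output:  multipath -ll | count_failed_paths
printf '%s\n' ' `- 2:0:0:0 sda 8:0 failed faulty running' | count_failed_paths   # prints 1
```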
- iSCSI session is logged out by Trident (self-healing)
$ kubectl logs trident-node-linux-xgjn6
...
time="2025-07-02T08:11:13Z" level=debug msg="Logout is successful." requestID=b8150647-283c-4860-aaf5-6d3e12fa5e2c requestSource=Internal
...
- Filesystem is remounted read-only by ext4
We checked the dmesg log on the node.
root@demo-sts-mpath5-w-default-e4bdf350-8ztwd:/home/ubuntu# dmesg
...
[ 7733.853879] blk_update_request: I/O error, dev dm-0, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
[ 7733.853880] JBD2: Detected IO errors while flushing file data on dm-0-8
[ 7733.853882] Buffer I/O error on dev dm-0, logical block 0, lost sync page write
[ 7733.853883] EXT4-fs (dm-0): I/O error while writing superblock
[ 7733.853885] EXT4-fs (dm-0): previous I/O error to superblock detected
[ 7733.853886] EXT4-fs error (device dm-0): ext4_journal_check_start:61: Detected aborted journal
[ 7733.853887] EXT4-fs (dm-0): Remounting filesystem read-only
...
As you may know, ext4 remounts the filesystem read-only to prevent corruption when writes to the superblock fail
while the storage is not writable.
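Whether a mount has entered this state can be checked from the option string printed by `mount` (or `findmnt -no OPTIONS`), where `ro`/`rw` is the first option. A minimal sketch (the function name is ours):

```shell
# Minimal sketch: decide from a mount's option string, as printed by `mount`
# or `findmnt -no OPTIONS`, whether the mount is read-only.
is_ro() {
  case "$1" in
    ro|ro,*) return 0 ;;   # "ro" is the first option of a read-only mount
    *)       return 1 ;;
  esac
}

is_ro "ro,relatime,discard,stripe=16" && echo "read-only"   # prints "read-only"
```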
- Confirm read-only mount:
root@demo-sts-mpath5-w-default-e4bdf350-8ztwd:/home/ubuntu# mount |grep pvc
/dev/mapper/3600a09803831386f385d585974355433 on /var/lib/kubelet/pods/01d6fe9b-7f32-4c16-ae97-2b7d3fe9a023/volumes/kubernetes.io~csi/pvc-10697ddb-7af6-4ebb-a7bc-96ae91d556fd/mount type ext4 (ro,relatime,discard,stripe=16)
/dev/mapper/3600a09803831386f385d585974355433 on /var/lib/kubelet/pods/01d6fe9b-7f32-4c16-ae97-2b7d3fe9a023/volumes/kubernetes.io~csi/pvc-10697ddb-7af6-4ebb-a7bc-96ae91d556fd/mount type ext4 (ro,relatime,discard,stripe=16)
- Recover storage connection
The iSCSI sessions were restored by Trident self-healing after the connection to the storage was restored:
time="2025-07-02T09:26:14Z" level=debug msg=">>>> command.Execute." args="[-m session]" command=iscsiadm logLayer=csi_frontend requestID=6fabb14e-bedf-4829-885b-0571f07a6273 requestSource=Periodic workflow="node_server=heal_iscsi"
time="2025-07-02T09:26:14Z" level=debug msg="<<<< Execute." command=iscsiadm error="<nil>" logLayer=csi_frontend output="tcp: [1] 100.78.187.5:3260,1038 iqn.1992-08.com.netapp:sn.7dded402aad511ef892cd039ea54a52f:vs.6 (non-flash)\ntcp: [2] 100.78.187.6:3260,1039 iqn.1992-08.com.netapp:sn.7dded402aad511ef892cd039ea54a52f:vs.6 (non-flash)" requestID=6fabb14e-bedf-4829-885b-0571f07a6273 requestSource=Periodic workflow="node_server=heal_iscsi"
time="2025-07-02T09:26:14Z" level=debug msg="Adding iSCSI session info." Portal="100.78.187.5:3260,1038" PortalIP=100.78.187.5 SID=1 TargetName="iqn.1992-08.com.netapp:sn.7dded402aad511ef892cd039ea54a52f:vs.6" logLayer=csi_frontend requestID=6fabb14e-bedf-4829-885b-0571f07a6273 requestSource=Periodic workflow="node_server=heal_iscsi"
time="2025-07-02T09:26:14Z" level=debug msg="Adding iSCSI session info." Portal="100.78.187.6:3260,1039" PortalIP=100.78.187.6 SID=2 TargetName="iqn.1992-08.com.netapp:sn.7dded402aad511ef892cd039ea54a52f:vs.6" logLayer=csi_frontend requestID=6fabb14e-bedf-4829-885b-0571f07a6273 requestSource=Periodic workflow="node_server=heal_iscsi"
- Confirm path and mount
- Multipath is recovered
root@demo-sts-mpath5-w-default-e4bdf350-8ztwd:/home/ubuntu# multipath -ll
3600a09803831386f385d585974355433 dm-0 NETAPP,LUN C-Mode
size=10G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| `- 3:0:0:0 sdb 8:16 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
`- 2:0:0:0 sda 8:0 active ready running
- The mount still remains read-only
root@demo-sts-mpath5-w-default-e4bdf350-8ztwd:/home/ubuntu# mount |grep pvc
/dev/mapper/3600a09803831386f385d585974355433 on /var/lib/kubelet/pods/01d6fe9b-7f32-4c16-ae97-2b7d3fe9a023/volumes/kubernetes.io~csi/pvc-10697ddb-7af6-4ebb-a7bc-96ae91d556fd/mount type ext4 (ro,relatime,discard,stripe=16)
/dev/mapper/3600a09803831386f385d585974355433 on /var/lib/kubelet/pods/01d6fe9b-7f32-4c16-ae97-2b7d3fe9a023/volumes/kubernetes.io~csi/pvc-10697ddb-7af6-4ebb-a7bc-96ae91d556fd/mount type ext4 (ro,relatime,discard,stripe=16)
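For reference, a hedged sketch of a manual recovery (device and pod names are taken from this report; this is not something Trident does today). Note that after a journal abort, ext4 may refuse an in-place `mount -o remount,rw`, in which case the volume has to be fully unmounted and remounted so the journal is replayed:

```shell
# Manual recovery sketch, assuming `multipath -ll` already shows "active ready":
DEV=/dev/mapper/3600a09803831386f385d585974355433

# 1) Try an in-place remount (may be refused after a journal abort):
#      mount -o remount,rw "$DEV"
# 2) Otherwise force a full unmount/mount cycle so the journal is replayed,
#    e.g. by recreating the pod from the cluster side:
#      kubectl delete pod random-io-loader-ontap-block-2 -n loader
```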
Expected behavior
I expect self-healing to restore read-write access, including for volumes that have been remounted read-only.
Additional context
The iSCSI session logout performed by Trident's self-healing does not seem to be necessary in this case.