Describe the bug
The Trident CSI Node plugin (`csi.trident.netapp.io`) on one node became unregistered after the Kubernetes version was updated from v1.18.9 to v1.19.4. Pods on this node can no longer mount or unmount Trident volumes.
Error messages
We see the following messages in the kubelet log.
`csi.trident.netapp.io` was unregistered because the registration socket (`/var/lib/kubelet/plugins_registry/csi.trident.netapp.io-reg.sock`) had been removed.
```
I1119 05:47:53.162305 6550 reconciler.go:139] operationExecutor.UnregisterPlugin started for plugin at "/var/lib/kubelet/plugins_registry/csi.trident.netapp.io-reg.sock" (plugin details: &{/var/lib/kubelet/plugins_registry/csi.trident.netapp.io-reg.sock 2020-11-04 05:08:19.553684094 +0000 UTC m=+38.893901704 0x704c200 csi.trident.netapp.io})
I1119 05:47:53.163390 6550 csi_plugin.go:177] kubernetes.io/csi: registrationHandler.DeRegisterPlugin request for plugin csi.trident.netapp.io
I1119 05:47:54.246972 6550 plugin_watcher.go:212] Removing socket path /var/lib/kubelet/plugins_registry/csi.trident.netapp.io-reg.sock from desired state cache
```
The pod could not unmount the volume because `csi.trident.netapp.io` was not found.
```
E1119 09:02:52.819122 6550 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/csi.trident.netapp.io^pvc-75a6fd7f-7aee-45e8-a5fa-d4500272528e podName:ad18a7d1-4090-4e0c-9e71-cba46dfc3657 nodeName:}" failed. No retries permitted until 2020-11-19 09:04:54.819071328 +0000 UTC m=+1310234.159288938 (durationBeforeRetry 2m2s). Error: "UnmountVolume.TearDown failed for volume "data" (UniqueName: "kubernetes.io/csi/csi.trident.netapp.io^pvc-75a6fd7f-7aee-45e8-a5fa-d4500272528e") pod "ad18a7d1-4090-4e0c-9e71-cba46dfc3657" (UID: "ad18a7d1-4090-4e0c-9e71-cba46dfc3657") : kubernetes.io/csi: mounter.SetUpAt failed to get CSI client: driver name csi.trident.netapp.io not found in the list of registered CSI drivers"
```
Two trident-csi pods were running simultaneously
We found that two `trident-csi` (Node Plugin) pods on this node were running simultaneously for a very short time, and that the old `driver-registrar` had stopped after the new one had started.
`driver-registrar` removes the registration socket (`/var/lib/kubelet/plugins_registry/csi.trident.netapp.io-reg.sock`) when it receives SIGTERM (node_register.go#L113-L116). Removing the socket causes the kubelet to unregister the Trident plugin. I believe this is the cause of the problem.
The DaemonSet was recreated after the update
trident-csi (Node Plugin) pods are managed by a DaemonSet, so normally only one pod runs on each node. After Kubernetes was updated, however, the trident-csi DaemonSet was recreated by `trident-operator`. Deleting and recreating the DaemonSet allows two pods (the old and the new) to run simultaneously for a short time.
We confirmed this in the `trident-operator` log. Here, the `trident-csi` DaemonSet was deleted:
```
time="2020-11-19T05:47:45Z" level=debug msg="Deleted Kubernetes DaemonSet." DaemonSet=trident-csi namespace=trident
```
The `trident-csi` DaemonSet was then recreated soon after:
```
time="2020-11-19T05:47:45Z" level=debug msg="Creating object." kind=DaemonSet name=trident-csi namespace=trident
```
After Kubernetes was updated, the `shouldUpdate` flag was set to true (controller.go#L1110). It seems that the `shouldUpdate` flag causes the `trident-csi` DaemonSet to be deleted (installer.go#L1489-L1494).
Environment
- Trident version: 20.10.0 with trident-operator
- Trident installation flags used: `silenceAutosupport: true` (Trident Operator)
- Container runtime: Docker 19.03.13
- Kubernetes version: v1.19.4
- Kubernetes orchestrator: Kubernetes
- Kubernetes enabled feature gates:
- OS: Ubuntu 18.04
- NetApp backend types: ONTAP AFF 9.1P14
- Other:
To Reproduce
Updating the Kubernetes version may reproduce this problem. Since updating Kubernetes takes a long time and does not always trigger the bug, we instead confirmed the following two behaviors, which together cause the problem, through separate demonstrations.
Two trident-csi pods cause the kubelet to unregister the Trident plugin
- Confirm that the Trident CSI driver is registered on the node.

```
$ kubectl describe csinodes.storage.k8s.io <NODE_NAME>
...
Spec:
  Drivers:
    csi.trident.netapp.io:
      Node ID:        <NODE_NAME>
      Topology Keys:  [topology.kubernetes.io/zone]
```
- Copy the `trident-csi` DaemonSet so that two trident-csi pods run on each node.

```
$ kubectl get ds -n trident trident-csi -o json | jq '.metadata.name|="trident-csi-2"' | kubectl apply -f -
```
- Wait for them to run, then delete the copied `trident-csi-2` DaemonSet.

```
$ kubectl delete ds -n trident trident-csi-2
```
- Confirm that the Trident CSI driver has disappeared from the Drivers section on the node. (This takes some time.)

```
$ kubectl describe csinodes.storage.k8s.io <NODE_NAME>
Spec:
```
Recreating the DaemonSet allows two pods (old and new) to run simultaneously
- Delete the `trident-csi` DaemonSet. It will be recreated soon after by the trident-operator.

```
$ kubectl delete ds -n trident trident-csi
```
- You will see two `trident-csi` pods on each node.

```
$ kubectl get pods -n trident -o wide
```
Expected behavior
Pods can mount and unmount Trident volumes after the Kubernetes version is updated.
Additional context
None