I'm seeing regular failures in clone operations against a SolidFire backend due to this hardcoded 30 second timeout:
https://github.com/NetApp/trident/blob/v25.02.1/storage_drivers/solidfire/api/volume.go#L78
This has been observed with several ~530GB volumes containing ~350GB of data; from the reports I've received, this seems to only be an issue for larger volumes. I haven't measured how long the cluster takes to clone volumes larger than this, but I suspect the time increases with size, and there are surely other factors that influence how long these clone operations take to complete. The two cases I looked at completed after ~33 seconds and ~37 seconds.
After reviewing logs and the code, this appears to be the sequence of events:
- `CloneVolume` is called by Trident to initiate the cloning of a volume; Trident passes the ID of the source volume to be cloned, the desired name for the clone, etc.
- The SolidFire API accepts the request, creates an async task, and responds with the `asyncHandle`, `volumeID`, etc.
- `CloneVolume` calls `WaitForVolumeByID`.
- `WaitForVolumeByID` begins polling the SolidFire API for the volume ID associated with the new volume, as returned by the API in step 2.
- `WaitForVolumeByID` eventually hits the hardcoded 30 second timeout and declares that the clone operation failed (Logs: `Could not find volume after 30.00 seconds.`). In response, Trident initiates a second clone operation, which starts at step 1 again and requests the same name for the cloned volume.
- Going by the durations I see in the logs, ~3 seconds later the first clone completes on the SolidFire cluster and the volume is available for use.
- After the second clone operation hits the 30 second timeout in `WaitForVolumeByID` (Logs: `Could not find volume after 30.00 seconds.`), additional checks in code I haven't reviewed kick in: Trident sees there's already a volume with its desired name and just loops complaining that the volume name is used (Logs: `Found existing volume pvc-xxxx, aborting clone operation.`). Volume names don't need to be unique on SolidFire (only IDs are unique), but I don't care for duplicate volume names anyway.
- Going by the durations I see in the logs, ~7 seconds later the second clone completes on the SolidFire cluster and the second volume is available for use.
Each time, this all results in the k8s PV/PVC never getting provisioned and 2 orphaned volumes on the SolidFire cluster that need to be cleaned up.
Possible solutions, from best to worst:
- Trident uses the `asyncHandle` returned in step 2 to call `GetAsyncResult` and monitor the progress of the clone operation, rather than just polling for the volume by ID up to a hardcoded timeout.
- Make the timeout in `WaitForVolumeByID` configurable.
- Extend the timeout in `WaitForVolumeByID` to 60 seconds, which may just increase the size of volume that this workflow will support.