I'm seeing regular failures in clone operations against a SolidFire backend due to this hardcoded 30 second timeout:
https://github.com/NetApp/trident/blob/v25.02.1/storage_drivers/solidfire/api/volume.go#L78
This has been observed with several ~530GB volumes containing ~350GB of data; from the reports I've received, this seems to only be an issue for larger volumes. I haven't measured how long the cluster takes to clone volumes larger than this, but I suspect the time increases with size, and there are surely other factors that influence how long these clone operations take to complete. The two cases I looked at completed after ~33 seconds and ~37 seconds.
After reviewing logs and the code, this appears to be the sequence of events:
- `CloneVolume` is called by Trident to initiate the cloning of a volume; Trident passes the ID of the source volume to be cloned, the desired name for the clone, etc.
- The SolidFire API accepts the request, creates an async task, and responds with the `asyncHandle`, `volumeID`, etc.
- `CloneVolume` calls `WaitForVolumeByID`.
- `WaitForVolumeByID` begins polling the SolidFire API for the volume ID associated with the new volume, as returned by the API in step 2.
- `WaitForVolumeByID` eventually hits the hardcoded 30 second timeout and declares that the clone operation failed (Logs: `Could not find volume after 30.00 seconds.`). In response, Trident initiates a second clone operation, which starts at step 1 again and requests the same name for the cloned volume.
- Going by the durations I see in the logs, ~3 seconds later the first clone completes on the SolidFire cluster and the volume is available for use.
- After the second clone operation hits the 30 second timeout in `WaitForVolumeByID` (Logs: `Could not find volume after 30.00 seconds.`), additional checks in code I haven't reviewed kick in: Trident sees there's already a volume with its desired name and just loops complaining that the volume name is used (Logs: `Found existing volume pvc-xxxx, aborting clone operation.`). Volume names don't need to be unique on SolidFire (only IDs are unique), but I don't care for duplicate volume names anyway.
- Going by the durations I see in the logs, ~7 seconds later the second clone completes on the SolidFire cluster and the second volume is available for use.
Each time, this all results in the k8s PV/PVC never getting provisioned and 2 orphaned volumes on the SolidFire cluster that need to be cleaned up.
Possible solutions, from best to worst:
- Trident uses the `asyncHandle` returned in step 2 to call `GetAsyncResult` and monitor the progress of the clone operation, rather than just polling for the volume by ID up to a hardcoded timeout.
- Make the timeout in `WaitForVolumeByID` configurable.
- Extend the timeout in `WaitForVolumeByID` to 60 seconds, which may just increase the size of volume that this workflow will support.