
SolidFire: Timeout in function WaitForVolumeByID when cloning a volume #1008

@colinfarley

Description

I'm seeing regular failures in clone operations against a SolidFire backend due to this hardcoded 30-second timeout:

https://github.com/NetApp/trident/blob/v25.02.1/storage_drivers/solidfire/api/volume.go#L78

This has been observed with several ~530GB volumes holding ~350GB of data. From the reports I've received, it seems to affect only larger volumes. I haven't evaluated how long the cluster takes to clone volumes larger than this, but I suspect the duration goes up with size, and I'm sure other factors also influence how long these clone operations take. The two cases I looked at completed after ~33 seconds and ~37 seconds.

After reviewing logs and the code, this appears to be the sequence of events:

  1. Trident calls CloneVolume to initiate the clone, passing the ID of the source volume, the desired name for the clone, etc.
  2. The SolidFire API accepts the request, creates an async task, and responds with the asyncHandle, volumeID, etc.
  3. CloneVolume calls WaitForVolumeByID.
  4. WaitForVolumeByID begins polling the SolidFire API for the new volume's ID, which the API returned in step 2.
  5. WaitForVolumeByID eventually hits the hardcoded 30-second timeout and declares that the clone operation failed (Logs: Could not find volume after 30.00 seconds.). In response, Trident initiates a second clone operation, which starts again at step 1 and requests the same name for the cloned volume.
  6. Going by the durations I see in the logs, ~3 seconds later the first clone completes on the SolidFire cluster and the volume is available for use.
  7. The second clone operation also hits the 30-second timeout in WaitForVolumeByID (Logs: Could not find volume after 30.00 seconds.). After that, additional checks (in code I haven't reviewed) find that a volume with the desired name already exists, and Trident just loops, complaining that the volume name is in use (Logs: Found existing volume pvc-xxxx, aborting clone operation.). Volume names don't need to be unique on SolidFire, only IDs do, but I don't want duplicate volume names anyway.
  8. Going by the durations I see in the logs, ~7 seconds later the second clone completes on the SolidFire cluster and the second volume is available for use.

Each time, the result is a k8s PV/PVC that never gets provisioned and two orphaned volumes on the SolidFire cluster that need to be cleaned up.
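
To make the failure mode concrete, here is a minimal Go sketch of a fixed-deadline poll racing against a clone that takes slightly longer than the deadline. It is not Trident's actual code: `fakeCluster`, `waitForVolume`, `cloneSucceeds`, and `errNotFound` are hypothetical names, and a simulated clock replaces real sleeping so the example runs instantly. The 1-second poll interval is also an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotFound is a hypothetical sentinel for "volume not visible yet".
var errNotFound = errors.New("volume not found")

// fakeCluster stands in for the SolidFire cluster: the cloned volume only
// becomes visible once cloneDuration has elapsed on the simulated clock.
type fakeCluster struct {
	now           time.Duration // simulated elapsed time
	cloneDuration time.Duration // how long the clone takes to finish
}

// getVolumeByID mimics a lookup that fails until the clone has finished.
func (c *fakeCluster) getVolumeByID() error {
	if c.now < c.cloneDuration {
		return errNotFound
	}
	return nil
}

// waitForVolume polls until the volume appears or the timeout elapses.
// The clock is advanced manually by interval on every failed poll.
func waitForVolume(c *fakeCluster, timeout, interval time.Duration) error {
	start := c.now
	for {
		if err := c.getVolumeByID(); err == nil {
			return nil
		}
		if c.now-start >= timeout {
			return fmt.Errorf("could not find volume after %.2f seconds",
				(c.now - start).Seconds())
		}
		c.now += interval
	}
}

// cloneSucceeds reports whether a clone of the given duration is seen
// before the given timeout, polling once per simulated second.
func cloneSucceeds(cloneSeconds, timeoutSeconds int) bool {
	c := &fakeCluster{cloneDuration: time.Duration(cloneSeconds) * time.Second}
	timeout := time.Duration(timeoutSeconds) * time.Second
	return waitForVolume(c, timeout, time.Second) == nil
}

func main() {
	for _, s := range []int{20, 33, 37} {
		fmt.Printf("clone takes %ds, 30s timeout: success=%v\n",
			s, cloneSucceeds(s, 30))
	}
}
```

With a 30-second timeout, the simulated 33-second and 37-second clones are reported as failures even though they would have completed a few seconds later, which matches the durations seen in the logs.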

Possible solutions from best to worst:

  1. Trident uses the asyncHandle returned in step 2 to call GetAsyncResult and monitor the progress of the clone operation, rather than just polling for the volume by ID up to a hardcoded timeout
  2. Make the timeout in WaitForVolumeByID configurable
  3. Extend the timeout in WaitForVolumeByID to 60 seconds, which may only raise the volume size at which this workflow starts to fail
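
As a rough illustration of options 1 and 2 together, the sketch below waits on the asyncHandle instead of polling for the volume, and takes the timeout as a parameter rather than hardcoding it. Everything here is an assumption for illustration: the `asyncGetter` interface, the `asyncResult` struct, `waitForAsyncResult`, `fakeClient`, and `waitWithFake` are hypothetical and are not Trident's client API. The only thing taken from the Element API is that GetAsyncResult reports a status of "running" or "complete"; the flattened `Err` field is a simplification.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// asyncResult holds the parts of a GetAsyncResult response used here.
// Err is a hypothetical flattening of the API's error object.
type asyncResult struct {
	Status string // "running" or "complete"
	Err    string // non-empty if the async operation failed
}

// asyncGetter is a hypothetical interface, not Trident's real client.
type asyncGetter interface {
	GetAsyncResult(ctx context.Context, asyncHandle int64) (asyncResult, error)
}

// waitForAsyncResult polls the async handle until the operation completes,
// fails, or the caller-supplied timeout expires.
func waitForAsyncResult(ctx context.Context, c asyncGetter, handle int64,
	timeout, interval time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		res, err := c.GetAsyncResult(ctx, handle)
		if err != nil {
			return fmt.Errorf("GetAsyncResult failed: %w", err)
		}
		if res.Status == "complete" {
			if res.Err != "" {
				return errors.New(res.Err) // clone finished with an error
			}
			return nil // clone finished successfully
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("async handle %d still running after %v: %w",
				handle, timeout, ctx.Err())
		case <-ticker.C: // wait one interval, then poll again
		}
	}
}

// fakeClient reports "running" until it has been polled pollsUntilDone times.
type fakeClient struct {
	pollsUntilDone int
	calls          int
}

func (f *fakeClient) GetAsyncResult(_ context.Context, _ int64) (asyncResult, error) {
	f.calls++
	if f.calls < f.pollsUntilDone {
		return asyncResult{Status: "running"}, nil
	}
	return asyncResult{Status: "complete"}, nil
}

// waitWithFake runs the wait loop against fakeClient with a 1ms poll interval.
func waitWithFake(pollsUntilDone, timeoutMillis int) error {
	c := &fakeClient{pollsUntilDone: pollsUntilDone}
	return waitForAsyncResult(context.Background(), c, 42,
		time.Duration(timeoutMillis)*time.Millisecond, time.Millisecond)
}

func main() {
	fmt.Println("finishes in time:", waitWithFake(3, 200))
	fmt.Println("never finishes:  ", waitWithFake(1<<30, 20))
}
```

The point of this shape is that a timeout on a handle that still reports "running" is distinguishable from a real failure, so the caller could keep waiting or surface the error instead of starting a second clone with the same name.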
