
[QST][CuTeDSL] Why does caching a CuTe tensor prevent the original torch tensor from being garbage collected? #2479

@alexwl

Description

The CuTe DSL documentation recommends caching CuTe tensors to avoid the overhead of repeated from_dlpack calls.

However, experiments with caching show that the original torch tensors are not garbage collected.
For example, this program:

from cutlass.cute.runtime import from_dlpack
import torch

cache = {}
for i in range(0, 5):
    torch_tensor = torch.empty(1024, dtype=torch.int8, device="cuda")
    cute_tensor = from_dlpack(torch_tensor)
    cache[i] = cute_tensor
    del torch_tensor # explicit deletion to ensure GC opportunity
    print(f"allocated {torch.cuda.memory_allocated()} bytes")

produces:

allocated 1024 bytes
allocated 2048 bytes
allocated 3072 bytes
allocated 4096 bytes
allocated 5120 bytes

When the line cache[i] = cute_tensor is commented out, the program prints:

allocated 1024 bytes
allocated 1024 bytes
allocated 1024 bytes
allocated 1024 bytes
allocated 1024 bytes

This suggests that the CuTe tensor returned by from_dlpack retains an internal reference to the original torch tensor.
Is this the intended behavior?
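
One way to check whether the torch tensor object itself stays alive, rather than inferring it from memory_allocated, is a small weakref sketch along the same lines (weakref and gc are standard library; nothing here is CuTe-specific):

import gc
import weakref

import torch
from cutlass.cute.runtime import from_dlpack

torch_tensor = torch.empty(1024, dtype=torch.int8, device="cuda")
ref = weakref.ref(torch_tensor)  # a weak reference does not keep the tensor alive by itself

cute_tensor = from_dlpack(torch_tensor)
del torch_tensor
gc.collect()

# If from_dlpack (or the DLPack capsule it consumes) holds a strong
# reference, the weak reference still resolves to the tensor here.
print("torch tensor still alive:", ref() is not None)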
