CuTe DSL [documentation](https://docs.nvidia.com/cutlass/media/docs/pythonDSL/cute_dsl_general/framework_integration.html#when-to-use-explicit-conversion) recommends caching CuTe tensors to avoid the overhead of repeated `from_dlpack` calls. However, experiments with caching show that the original torch tensors are not garbage collected. For example, this program:

```python
from cutlass.cute.runtime import from_dlpack
import torch

cache = {}
for i in range(0, 5):
    torch_tensor = torch.empty(1024, dtype=torch.int8, device="cuda")
    cute_tensor = from_dlpack(torch_tensor)
    cache[i] = cute_tensor
    del torch_tensor  # explicit deletion to ensure a GC opportunity
    print(f"allocated {torch.cuda.memory_allocated()} bytes")
```

produces:

```
allocated 1024 bytes
allocated 2048 bytes
allocated 3072 bytes
allocated 4096 bytes
allocated 5120 bytes
```

When the line `cache[i] = cute_tensor` is commented out, the program prints:

```
allocated 1024 bytes
allocated 1024 bytes
allocated 1024 bytes
allocated 1024 bytes
allocated 1024 bytes
```

This suggests that the CuTe tensor returned by `from_dlpack` retains an internal reference to the original torch tensor, keeping the device allocation alive for as long as the CuTe tensor is cached. Is this the intended behavior?
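
One way to probe where the extra reference lives is to compare the Python-level reference count of the torch tensor before and after the `from_dlpack` call. This is a minimal diagnostic sketch, assuming the retained reference (if any) is an ordinary Python reference rather than one held only inside a DLPack capsule at the C level; the absolute counts vary by interpreter version, so only the difference matters.

```python
import sys

import torch
from cutlass.cute.runtime import from_dlpack

torch_tensor = torch.empty(1024, dtype=torch.int8, device="cuda")

# sys.getrefcount reports one extra reference for its own argument,
# so compare the two values rather than reading them in isolation.
before = sys.getrefcount(torch_tensor)
cute_tensor = from_dlpack(torch_tensor)
after = sys.getrefcount(torch_tensor)

print(f"refcount before from_dlpack: {before}")
print(f"refcount after  from_dlpack: {after}")
# If `after > before`, the CuTe tensor (or something it owns) holds a
# Python-level reference to the torch tensor; if the counts are equal,
# the allocation is more likely kept alive through the DLPack capsule.
```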