CuTe DSL [documentation](https://docs.nvidia.com/cutlass/media/docs/pythonDSL/cute_dsl_general/framework_integration.html#when-to-use-explicit-conversion) recommends caching CuTe tensors to avoid the overhead of repeated `from_dlpack` calls. However, experiments with caching show that the original torch tensors are not garbage collected. For example, this program:

```python
from cutlass.cute.runtime import from_dlpack
import torch

cache = {}
for i in range(0, 5):
    torch_tensor = torch.empty(1024, dtype=torch.int8, device="cuda")
    cute_tensor = from_dlpack(torch_tensor)
    cache[i] = cute_tensor
    del torch_tensor  # explicit deletion to ensure a GC opportunity
    print(f"allocated {torch.cuda.memory_allocated()} bytes")
```

produces:

```
allocated 1024 bytes
allocated 2048 bytes
allocated 3072 bytes
allocated 4096 bytes
allocated 5120 bytes
```

When the line `cache[i] = cute_tensor` is commented out, the program prints:

```
allocated 1024 bytes
allocated 1024 bytes
allocated 1024 bytes
allocated 1024 bytes
allocated 1024 bytes
```

This suggests that the CuTe tensor returned by `from_dlpack` retains an internal reference to the original torch tensor, keeping the device allocation alive for as long as the CuTe tensor is cached. Is this the intended behavior?
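
One way to probe where the extra reference lives is to compare the Python-level reference count of the torch tensor before and after the `from_dlpack` call. This is a minimal diagnostic sketch, assuming the retained reference (if any) is an ordinary Python reference rather than one held only inside a DLPack capsule at the C level; the absolute counts vary by interpreter version, so only the difference matters.

```python
import sys

import torch
from cutlass.cute.runtime import from_dlpack

torch_tensor = torch.empty(1024, dtype=torch.int8, device="cuda")

# sys.getrefcount reports one extra reference for its own argument,
# so compare the two values rather than reading them in isolation.
before = sys.getrefcount(torch_tensor)
cute_tensor = from_dlpack(torch_tensor)
after = sys.getrefcount(torch_tensor)

print(f"refcount before from_dlpack: {before}")
print(f"refcount after  from_dlpack: {after}")
# If `after > before`, the CuTe tensor (or something it owns) holds a
# Python-level reference to the torch tensor; if the counts are equal,
# the allocation is more likely kept alive through the DLPack capsule.
```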