While implementing General Matrix Multiplication (GEMM) with serial split-K, we noticed that the semaphore.release function in the current code does not call a memory fence (__threadfence). We are concerned that this could cause data inconsistency when several thread blocks read and write the same data: a block that acquires the semaphore might not yet see the results written by the block that released it. Is this design sound? Can it lead to data visibility anomalies? If not, which synchronization guarantees does it rely on?
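To make the question concrete, here is a minimal sketch of the pattern we are asking about. It is not the actual code: `semaphore_wait`, `semaphore_release`, and `serial_splitk_reduce` are hypothetical names, the partial result is a stand-in value, and the use of a release store in place of an explicit `__threadfence()` is only our assumption about how such a release might be ordered. It assumes compute capability 7.0 or newer and a grid small enough that all blocks are resident at once.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical wait: thread 0 spins with an acquire load, then the block syncs.
__device__ void semaphore_wait(int* lock, int expected) {
  if (threadIdx.x == 0) {
    int state;
    do {
      asm volatile("ld.global.acquire.gpu.b32 %0, [%1];"
                   : "=r"(state) : "l"(lock) : "memory");
    } while (state != expected);
  }
  __syncthreads();
}

// Hypothetical release: no __threadfence(); a block barrier followed by a
// release store from a single thread (assumed pattern, see lead-in).
__device__ void semaphore_release(int* lock, int status) {
  __syncthreads();
  if (threadIdx.x == 0) {
    asm volatile("st.global.release.gpu.b32 [%0], %1;"
                 : : "l"(lock), "r"(status) : "memory");
  }
}

// Each block stands for one K slice and accumulates into `out` in slice order.
__global__ void serial_splitk_reduce(float* out, int* lock, int n) {
  int slice = blockIdx.x;
  semaphore_wait(lock, slice);
  for (int i = threadIdx.x; i < n; i += blockDim.x) {
    float partial = float(slice + 1);  // stand-in for this slice's partial sum
    out[i] = (slice == 0) ? partial : out[i] + partial;
  }
  // The last slice resets the lock to 0; the others hand off to the next slice.
  int next = (slice + 1 == (int)gridDim.x) ? 0 : slice + 1;
  semaphore_release(lock, next);
}

int main() {
  const int n = 256, slices = 4;
  float* out;
  int* lock;
  cudaMalloc(&out, n * sizeof(float));
  cudaMalloc(&lock, sizeof(int));
  cudaMemset(out, 0, n * sizeof(float));
  cudaMemset(lock, 0, sizeof(int));

  // Small grid so every block is co-resident and the spin loop cannot deadlock.
  serial_splitk_reduce<<<slices, 128>>>(out, lock, n);
  cudaDeviceSynchronize();

  float host[n];
  cudaMemcpy(host, out, n * sizeof(float), cudaMemcpyDeviceToHost);
  bool ok = true;
  for (int i = 0; i < n; ++i) ok = ok && (host[i] == 10.0f);  // 1 + 2 + 3 + 4
  printf("all elements equal 10: %s\n", ok ? "yes" : "no");

  cudaFree(out);
  cudaFree(lock);
  return 0;
}
```

If the real release has this shape, is the `__syncthreads()` plus release-store pair what makes a separate `__threadfence()` unnecessary, and does that guarantee cover writes made by the other threads in the block, not only the thread that performs the store?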