CN116192877A - Computing device, synchronization method, electronic device, and storage medium - Google Patents
- Publication number
- CN116192877A (application CN202310193983.4A)
- Authority
- CN
- China
- Prior art keywords
- computing
- broadcast
- cluster
- broadcast group
- clusters
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
- H04L67/1095—Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Multi Processors (AREA)
- Hardware Redundancy (AREA)
- Mobile Radio Communication Systems (AREA)
Abstract
The present disclosure provides a computing device, a synchronization method for the computing device, an electronic device, and a computer-readable storage medium. The computing device includes: a plurality of computing clusters divided into a plurality of broadcast groups; and one or more intra-cluster interfaces located in each of the plurality of computing clusters. A computing cluster in each broadcast group, when acting as a consumer computing cluster, is configured to send a broadcast synchronization barrier message to the intra-cluster interface when the data space for all producer computing clusters in the broadcast group is ready, the broadcast synchronization barrier message including a broadcast indicator and an identifier of the broadcast group. The intra-cluster interface is configured to expand the broadcast synchronization barrier message into a synchronization barrier message for each producer computing cluster in the broadcast group and to send each synchronization barrier message to the corresponding producer computing cluster.
Description
Technical Field
The present disclosure relates generally to the field of processors and, more specifically, to a computing device, a synchronization method for the computing device, an electronic device, and a computer-readable storage medium.
Background
The structure of computing devices is becoming increasingly complex, with more and more hardware levels inside a device, so the synchronization of data exchange between these levels has to be addressed. For example, in a computing device that contains multiple computing clusters, if the computing clusters that produce data need to write that data into one common computing cluster so that this cluster can use it for subsequent processing, the cluster that uses the data needs to tell the clusters that produce the data whether it has prepared space to receive the data.
In the prior art, a warp-based synchronization barrier mechanism is used: the computing cluster that uses the data sends synchronization barrier messages one by one to each computing cluster that produces data, telling it whether space has been prepared to receive the data it produces. However, this mechanism costs the data-using computing cluster a large number of instruction cycles, especially in the broadcast case.
Summary of the Invention
To address the above problem, the present disclosure provides a scheme for space synchronization in which the computing cluster that uses the data sends a broadcast synchronization barrier message and the intra-cluster interface of that computing cluster expands it, which greatly reduces the instruction cycles the computing cluster needs to send synchronization barriers.
According to one aspect of the present disclosure, a computing device is provided. The computing device includes: a plurality of computing clusters divided into a plurality of broadcast groups; and one or more intra-cluster interfaces located in each of the plurality of computing clusters. A computing cluster in each broadcast group, when acting as a consumer computing cluster, is configured to send a broadcast synchronization barrier message to the intra-cluster interface when the data space for all producer computing clusters in the broadcast group is ready, the broadcast synchronization barrier message including a broadcast indicator and an identifier of the broadcast group. The intra-cluster interface is configured to expand the broadcast synchronization barrier message into a synchronization barrier message for each producer computing cluster in the broadcast group and to send each synchronization barrier message to the corresponding producer computing cluster.
In some implementations, the intra-cluster interface is configured to determine the corresponding producer computing clusters based on the identifier of the broadcast group in the broadcast synchronization barrier message.
In some implementations, the intra-cluster interface includes a plurality of mask registers, each mask register corresponding to one of the plurality of broadcast groups and indicating which of the plurality of computing clusters belong to that broadcast group. The intra-cluster interface is configured to determine the mask register corresponding to the broadcast group based on the identifier of the broadcast group in the broadcast synchronization barrier message, and to determine the corresponding producer computing clusters based on that mask register.
In some implementations, the consumer computing cluster is further configured to: determine whether the data space for all producer computing clusters in the broadcast group is ready; in response to the data space for all producer computing clusters in the broadcast group being ready, generate the broadcast synchronization barrier message; and in response to the data space for some of the producer computing clusters in the broadcast group not yet being ready, continue waiting until the space for those producer computing clusters is ready.
In some implementations, each producer computing cluster in the broadcast group is configured to: determine whether synchronization barrier messages have been received from all other computing clusters in the broadcast group; in response to having received synchronization barrier messages from all other computing clusters in the broadcast group, send the data it has produced to those other computing clusters; and in response to not yet having received synchronization barrier messages from all other computing clusters in the broadcast group, continue waiting.
In some implementations, the intra-cluster interface is configured to send the synchronization barrier messages one after another to the corresponding producer computing clusters through an inter-cluster interface.
According to another aspect of the present disclosure, a synchronization method for a computing device is provided. The computing device includes a plurality of computing clusters and one or more intra-cluster interfaces located in each of the plurality of computing clusters, and the plurality of computing clusters are divided into a plurality of broadcast groups. The synchronization method includes: sending, by a computing cluster acting as a consumer computing cluster in each broadcast group, a broadcast synchronization barrier message to the intra-cluster interface when the data space for all producer computing clusters in the broadcast group is ready, the broadcast synchronization barrier message including a broadcast indicator and an identifier of the broadcast group; and expanding, by the intra-cluster interface, the broadcast synchronization barrier message into a synchronization barrier message for each producer computing cluster in the broadcast group, and sending each synchronization barrier message to the corresponding producer computing cluster.
In some implementations, expanding, by the intra-cluster interface, the broadcast synchronization barrier message into a synchronization barrier message for each producer computing cluster in the broadcast group includes: determining, by the intra-cluster interface, the corresponding producer computing clusters based on the identifier of the broadcast group in the broadcast synchronization barrier message.
In some implementations, the intra-cluster interface includes a plurality of mask registers, each mask register corresponding to one of the plurality of broadcast groups and indicating which of the plurality of computing clusters belong to that broadcast group. Determining, by the intra-cluster interface, the corresponding producer computing clusters based on the identifier of the broadcast group in the broadcast synchronization barrier message includes: determining, by the intra-cluster interface, the mask register corresponding to the broadcast group based on the identifier of the broadcast group in the broadcast synchronization barrier message, and determining the corresponding producer computing clusters based on that mask register.
In some implementations, the synchronization method further includes: determining, by the consumer computing cluster, whether the data space for all producer computing clusters in the broadcast group is ready; in response to the data space for all producer computing clusters in the broadcast group being ready, generating the broadcast synchronization barrier message; and in response to the data space for some of the producer computing clusters in the broadcast group not yet being ready, continuing to wait until the space for those producer computing clusters is ready.
In some implementations, the synchronization method further includes: determining, by each producer computing cluster in the broadcast group, whether synchronization barrier messages have been received from all other computing clusters in the broadcast group; in response to having received synchronization barrier messages from all other computing clusters in the broadcast group, sending the data it has produced to those other computing clusters; and in response to not yet having received synchronization barrier messages from all other computing clusters in the broadcast group, continuing to wait.
According to yet another aspect of the present disclosure, an electronic device is provided, including: a memory storing computer-executable instructions in a non-transitory manner; and a processor configured to run the computer-executable instructions, wherein the computer-executable instructions, when run by the processor, implement the synchronization method described above.
According to still another aspect of the present disclosure, a computer-readable storage medium is provided, on which computer program code is stored, the computer program code performing the synchronization method described above when run.
Brief Description of the Drawings
The present disclosure will be better understood, and its other objects, details, features and advantages will become more apparent, from the following description of specific embodiments given with reference to the accompanying drawings.
FIG. 1A shows an exemplary structural diagram of a computing device.
FIG. 1B shows a schematic diagram of a computing device in which multiple computing clusters are connected through a multi-level inter-cluster interface.
FIG. 2 shows a detailed structural diagram of a computing cluster.
FIG. 3 shows an exemplary schematic diagram of the mask registers in an intra-cluster interface according to an embodiment of the present invention.
FIG. 4 shows a flowchart of a synchronization method for a computing device according to an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show preferred embodiments of the present disclosure, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the present disclosure is thorough and complete and fully conveys its scope to those skilled in the art.
As used herein, the term "include" and its variants denote open-ended inclusion, that is, "including but not limited to". Unless otherwise stated, the term "or" means "and/or". The term "based on" means "based at least in part on". The terms "one embodiment" and "some embodiments" mean "at least one example embodiment". The term "another embodiment" means "at least one further embodiment". The terms "first", "second" and so on may refer to different objects or to the same object.
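FIG. 1A shows an exemplary structural diagram of a computing device 100. As shown in FIG. 1A, the computing device 100 may include a plurality of computing clusters 110 (FIG. 1A schematically shows eight computing clusters 110-1, 110-2, ... 110-8).
The computing clusters 110 in the computing device 100 may be further divided into a plurality of broadcast groups 120. In FIG. 1A, for example, the eight computing clusters 110 are divided into two broadcast groups 120-1 and 120-2, where computing clusters 110-1, 110-2, 110-5 and 110-6 belong to broadcast group 120-1 and computing clusters 110-3, 110-4, 110-7 and 110-8 belong to broadcast group 120-2. Within a broadcast group, a computing cluster 110 that produces data needs to send that data to all other computing clusters in the same broadcast group for them to consume. The division of computing clusters into broadcast groups may be based, for example, on the physical distribution of the computing clusters or on application requirements.
Each computing cluster 110 may include one or more intra-cluster interfaces 112 (FIG. 1A schematically shows only one intra-cluster interface 112 per computing cluster 110). The computing clusters 110, or more specifically their intra-cluster interfaces 112, are also connected to one another through an inter-cluster interface 130. Each computing cluster 110 can exchange data and instructions with other computing clusters 110 through its own intra-cluster interface 112 and the inter-cluster interface 130. The intra-cluster interface 112 may be a submodule of the computing cluster 110 or a separate hardware module attached to the computing cluster 110.
The inter-cluster interface 130 may include one or more levels of hardware interfaces. FIG. 1B shows a schematic diagram of a computing device 100 in which multiple computing clusters 110 are connected through a multi-level inter-cluster interface 130. As shown in FIG. 1B, the whole computing device 100 may consist of multiple device components (devices) 140, each device component 140 may include multiple dies 150, and each die 150 may include multiple computing clusters 110. FIG. 1B schematically shows three levels of inter-cluster interface 130: a first-level inter-cluster interface 130-1 that directly connects the intra-cluster interfaces 112 of the computing clusters 110, sometimes called a network on chip (NoC); a second-level inter-cluster interface 130-2 that connects the dies 150, sometimes called a D2D interface; and a third-level inter-cluster interface 130-3 that connects the device components 140, sometimes called a P2P interface.
Note that FIG. 1B shows, purely by way of example, the computing device 100 as including two device components 140, each device component 140 as including two dies 150, and each die 150 as including two computing clusters 110. Those skilled in the art will understand that the numbers shown in FIG. 1B are only examples: each computing device 100 may include more or fewer device components 140, each device component 140 may include more or fewer dies 150, and/or each die 150 may include more or fewer computing clusters 110.
In addition, the division into broadcast groups 120 does not necessarily correspond in any fixed way to the computing clusters 110, dies 150 and device components 140. For example, the number of computing clusters 110 in one broadcast group may exceed the number of computing clusters 110 in one die 150, and may, for example, reach the scale of a node containing multiple device components 140 (such as 8 devices). Furthermore, one computing cluster 110 may be in multiple broadcast groups 120, that is, different broadcast groups 120 may contain the same computing cluster 110.
Herein, a computing cluster 110 that produces data is also called a producer computing cluster, and a computing cluster 110 that consumes data is also called a consumer computing cluster. If a computing cluster 110 both produces data and uses the data it produces, it is both a producer computing cluster and a consumer computing cluster.
FIG. 2 shows a detailed structural diagram of a computing cluster 110. As shown in FIG. 2, in addition to the intra-cluster interface 112, the computing cluster 110 may include one or more compute units (CU) 114, and each compute unit 114 may include one or more execution units (EU) 116. One or more warps may run on each execution unit 116, but a given warp runs on only one execution unit 116. A warp is a basic scheduling unit commonly used in GPUs for parallel processing; each warp may contain a specific number of threads, for example 32 or 16, to run various instructions.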
In the computing device 100 shown in FIG. 1A, suppose that within one broadcast group 120 (say broadcast group 120-1) computing cluster 110-6 acts as the consumer computing cluster and the three computing clusters 110-1, 110-2 and 110-5, acting as producer computing clusters, all need to write the data they produce to consumer computing cluster 110-6. Before writing, the producer computing clusters need to know whether consumer computing cluster 110-6 has prepared the data space they require. In the conventional warp-based synchronization barrier mechanism, consumer computing cluster 110-6 sends a separate synchronization barrier message to each of the producer computing clusters 110-1, 110-2 and 110-5 to tell it that the required data space has been prepared for it, that is, that the data space is ready.
In this approach, however, consumer computing cluster 110-6 has to send a separate synchronization barrier message, such as bar_to_spc_1, bar_to_spc_2 and bar_to_spc_5, to every other computing cluster in broadcast group 120-1, which costs computing cluster 110-6 three instruction cycles. For a broadcast group containing N computing clusters, N-1 instruction cycles are needed for space synchronization. When the number N of computing clusters in the broadcast group is large, this cost becomes very high. The computing clusters 110 in the computing device 100 carry complex computing tasks, so spending too many instruction cycles on space synchronization inevitably reduces the instruction cycles available for other computing tasks and lowers the overall efficiency of the computing device.
To address the above problem, the present disclosure provides a method in which the consumer computing cluster sends a single broadcast synchronization barrier message and the intra-cluster interface then expands that message into the individual synchronization barrier messages, saving the instruction cycles the computing clusters need for space synchronization.
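The saving can be illustrated with a small cost model. This is only a sketch: it assumes one instruction cycle per barrier message issued by the consumer computing cluster (consistent with the three-cycle example for the four-cluster group 120-1), and the function name `barrier_issue_cycles` is hypothetical rather than part of the disclosure.

```python
def barrier_issue_cycles(group_size: int, broadcast: bool) -> int:
    """Instruction cycles the consumer cluster spends issuing barrier messages.

    Assumption: issuing one message costs one instruction cycle.
    """
    if group_size < 1:
        raise ValueError("a broadcast group needs at least one cluster")
    producers = group_size - 1  # every other cluster in the group
    if broadcast:
        # One broadcast message; the intra-cluster interface does the fan-out.
        return min(producers, 1)
    return producers  # conventional scheme: one message per producer


for n in (4, 16, 64):
    print(n, barrier_issue_cycles(n, broadcast=False),
          barrier_issue_cycles(n, broadcast=True))
```

Under this assumption the conventional cost grows linearly with the group size, while the broadcast cost seen by the computing cluster stays constant.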
Specifically, consumer computing cluster 110-6 may be configured to determine whether the data space for all producer computing clusters 110-1, 110-2 and 110-5 in broadcast group 120-1 is ready. If it determines that the data space for some of the producer computing clusters in broadcast group 120-1 (such as computing cluster 110-1) is not yet ready, consumer computing cluster 110-6 continues to wait until the space for those producer computing clusters is ready. If it determines that the data space for all producer computing clusters 110-1, 110-2 and 110-5 in broadcast group 120-1 is ready, consumer computing cluster 110-6 may generate one broadcast synchronization barrier message and send it to the intra-cluster interface 112 of consumer computing cluster 110-6.
In other words, consumer computing cluster 110-6 does not send a synchronization barrier to a producer computing cluster as soon as the data space for that particular producer is ready; it generates and sends the broadcast synchronization barrier message only once the space for all producer computing clusters is ready.
Because the intra-cluster interface 112 can be used to communicate with all computing clusters 110 in the computing device 100, the broadcast synchronization barrier message should contain at least a broadcast indicator and the identifier of broadcast group 120-1 so that the intra-cluster interface 112 can process it correctly. For example, the broadcast synchronization barrier message may be written as bar_broadcast_groupID, where "bar" indicates that this is a synchronization barrier, "broadcast" is the broadcast indicator that tells the intra-cluster interface 112 this is a broadcast message, and "groupID" indicates the identifier of the broadcast group (such as broadcast group 120-1).
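The consumer-side behavior described above can be sketched as follows. The class `BroadcastBarrierMessage`, the function `try_issue` and the rendering of the mnemonic as `bar_broadcast_group1` are illustrative assumptions; the disclosure only specifies that the message carries a broadcast indicator and the broadcast group identifier.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class BroadcastBarrierMessage:
    broadcast: bool  # broadcast indicator
    group_id: str    # identifier of the broadcast group, e.g. "group1"

    def mnemonic(self) -> str:
        # Hypothetical textual form modeled on "bar_broadcast_groupID".
        return f"bar_broadcast_{self.group_id}"


def try_issue(space_ready: Dict[int, bool],
              group_id: str) -> Optional[BroadcastBarrierMessage]:
    """Return the single broadcast message only when every producer's
    data space is ready; otherwise return None (keep waiting)."""
    if space_ready and all(space_ready.values()):
        return BroadcastBarrierMessage(broadcast=True, group_id=group_id)
    return None


# Consumer 110-6 tracking space for producers 110-1, 110-2 and 110-5.
ready = {1: True, 2: True, 5: False}
print(try_issue(ready, "group1"))   # space for 110-5 is not ready yet
ready[5] = True
print(try_issue(ready, "group1").mnemonic())
```

The point of the sketch is that no per-producer message is issued early: the consumer holds back until the last data space becomes ready and then issues exactly one message.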
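The intra-cluster interface 112 is configured to expand the broadcast synchronization barrier message into a synchronization barrier message for each producer computing cluster in broadcast group 120-1 and to send each synchronization barrier message to the corresponding producer computing cluster.
Specifically, the intra-cluster interface 112 may determine the corresponding producer computing clusters, such as producer computing clusters 110-1, 110-2 and 110-5, based on the identifier of the broadcast group (groupID) in the broadcast synchronization barrier message.
In some embodiments, the intra-cluster interface 112 may use mask registers to determine the producer computing clusters in a broadcast group. FIG. 3 shows an exemplary schematic diagram of the mask registers in the intra-cluster interface 112 according to an embodiment of the present invention. As shown in FIG. 3, the intra-cluster interface 112 may include a plurality of mask registers 1122 (FIG. 3 shows two mask registers 1122-1 and 1122-2 by way of example), each mask register 1122 corresponding to one of the plurality of broadcast groups 120 and indicating which of the computing clusters 110 of the computing device 100 belong to that broadcast group. The number of bits in each mask register 1122 should therefore be at least equal to the number of computing clusters 110 in the computing device 100. More specifically, in the mask register 1122 for a broadcast group 120, computing clusters 110 that belong to the broadcast group 120 and those that do not should be represented differently, for example by 1 and 0 respectively.
As shown in FIG. 3, assume that, as in FIG. 1A, computing clusters 110-1, 110-2, 110-5 and 110-6 belong to broadcast group 120-1 and computing clusters 110-3, 110-4, 110-7 and 110-8 belong to broadcast group 120-2, that mask register 1122-1 corresponds to broadcast group 120-1 (identifier group1), and that mask register 1122-2 corresponds to broadcast group 120-2 (identifier group2). The bitmap of mask register 1122-1 is then 11001100 and the bitmap of mask register 1122-2 is 00110011. Note that this grouping is only an example; in practice the computing clusters contained in different broadcast groups may overlap. For example, a computing cluster 110 may belong to both broadcast group 120-1 and broadcast group 120-2, that is, different broadcast groups 120 may contain the same computing cluster 110. A computing cluster 110 in the computing device 100 may also belong to no broadcast group at all.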
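The bitmaps in this example can be reproduced with a short sketch. It assumes that the leftmost bit stands for computing cluster 110-1 and the rightmost for 110-8, which is the ordering that matches the bitmaps given above; the helper names are hypothetical.

```python
from typing import Iterable, List

NUM_CLUSTERS = 8  # computing clusters 110-1 ... 110-8 in the FIG. 1A example


def mask_bitmap(members: Iterable[int], num_clusters: int = NUM_CLUSTERS) -> str:
    """Build a mask-register bitmap; bit i (1-based, from the left) is 1
    when computing cluster 110-i belongs to the broadcast group."""
    member_set = set(members)
    return "".join("1" if i in member_set else "0"
                   for i in range(1, num_clusters + 1))


def members_of(bitmap: str) -> List[int]:
    """Decode a bitmap back into the 1-based cluster indices it marks."""
    return [i for i, bit in enumerate(bitmap, start=1) if bit == "1"]


group1 = mask_bitmap([1, 2, 5, 6])   # broadcast group 120-1
group2 = mask_bitmap([3, 4, 7, 8])   # broadcast group 120-2
print(group1, group2)
print(members_of(group1))
```

Because each group has its own register, overlapping groups need no special handling: a cluster that belongs to two groups simply has its bit set in both bitmaps.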
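The intra-cluster interface 112 is configured to determine the mask register corresponding to the broadcast group based on the identifier of the broadcast group in the broadcast synchronization barrier message, and to determine the corresponding producer computing clusters based on that mask register.
For example, if the groupID in the broadcast synchronization barrier message bar_broadcast_groupID received by the intra-cluster interface 112 is group1, the message is directed at broadcast group 120-1, whose identifier is group1, and the intra-cluster interface 112 can determine that the mask register corresponding to broadcast group 120-1 is mask register 1122-1. The bitmap of mask register 1122-1 is 11001100, which indicates that broadcast group 120-1 includes the four computing clusters 110-1, 110-2, 110-5 and 110-6, so the producer computing clusters corresponding to consumer computing cluster 110-6, which sent the broadcast synchronization barrier message, are computing clusters 110-1, 110-2 and 110-5.
The intra-cluster interface 112 can then expand the broadcast synchronization barrier message bar_broadcast_groupID into synchronization barrier messages for the three producer computing clusters 110-1, 110-2 and 110-5, such as bar_to_spc_1, bar_to_spc_2 and bar_to_spc_5, and send them to the corresponding producer computing clusters 110-1, 110-2 and 110-5.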
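A software model of this expansion step might look like the sketch below. In the disclosure the expansion is performed by the intra-cluster interface hardware; the dictionary `MASK_REGISTERS`, the function `expand_broadcast` and the string encoding of the messages are assumptions made purely for illustration.

```python
from typing import Dict, List

# Assumed register contents from the FIG. 3 example (leftmost bit = 110-1).
MASK_REGISTERS: Dict[str, str] = {
    "group1": "11001100",  # mask register 1122-1
    "group2": "00110011",  # mask register 1122-2
}

PREFIX = "bar_broadcast_"


def expand_broadcast(message: str, sender: int) -> List[str]:
    """Expand one broadcast barrier message into per-producer messages.

    `sender` is the 1-based index of the consumer cluster that issued the
    broadcast; it is skipped because it does not notify itself.
    """
    if not message.startswith(PREFIX):
        raise ValueError("not a broadcast synchronization barrier message")
    group_id = message[len(PREFIX):]
    bitmap = MASK_REGISTERS[group_id]  # select the register by group identifier
    return [f"bar_to_spc_{i}"
            for i, bit in enumerate(bitmap, start=1)
            if bit == "1" and i != sender]


print(expand_broadcast("bar_broadcast_group1", sender=6))
```

For consumer 110-6 in group1 this yields the three per-producer messages named in the text, without the computing cluster itself issuing more than one instruction.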
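Here, the intra-cluster interface 112 may be configured to send the expanded synchronization barrier messages one after another to the corresponding producer computing clusters through the inter-cluster interface 130. For example, as shown in FIG. 1B, when the scope of the broadcast group 120 is smaller than one die 150, the inter-cluster interface 130 involved may include only the first-level inter-cluster interface 130-1 between the intra-cluster interfaces 112 of the computing clusters 110. When the scope of the broadcast group 120 is larger than one die 150 but smaller than one device component 140, the inter-cluster interface 130 involved may include the first-level inter-cluster interface 130-1 between the intra-cluster interfaces 112 of the computing clusters 110 and the second-level inter-cluster interface 130-2 between dies. When the scope of the broadcast group 120 is larger than one device component 140 but smaller than the whole computing device 100 (sometimes also called a node), the inter-cluster interface 130 involved may include the first-level inter-cluster interface 130-1 between the intra-cluster interfaces 112 of the computing clusters 110, the second-level inter-cluster interface 130-2 between dies, and the third-level inter-cluster interface 130-3 between device components 140.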
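The relationship between the scope of a broadcast group and the interface levels involved can be sketched as below. The `ClusterLocation` type and the function `interface_levels` are hypothetical; the sketch simply treats a group as spanning several dies or device components when its members carry different die or device indices.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ClusterLocation:
    device: int   # index of the device component 140
    die: int      # index of the die 150 within that device component
    cluster: int  # index of the computing cluster 110 within that die


def interface_levels(members: Iterable[ClusterLocation]) -> List[str]:
    """List the inter-cluster interface levels a broadcast group's
    barrier messages may traverse, based on where its members sit."""
    members = list(members)
    levels = ["130-1"]  # first level (NoC) is always involved
    if len({(m.device, m.die) for m in members}) > 1:
        levels.append("130-2")  # group spans more than one die (D2D)
    if len({m.device for m in members}) > 1:
        levels.append("130-3")  # group spans more than one device (P2P)
    return levels


same_die = [ClusterLocation(0, 0, 0), ClusterLocation(0, 0, 1)]
two_dies = [ClusterLocation(0, 0, 0), ClusterLocation(0, 1, 0)]
two_devices = [ClusterLocation(0, 0, 0), ClusterLocation(1, 0, 0)]
for group in (same_die, two_dies, two_devices):
    print(interface_levels(group))
```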
In this way, in the broadcast case, the instruction cycles the computing cluster spends on space synchronization are reduced; at the same time, having the intra-cluster interface, a hardware module, expand the broadcast synchronization barrier message into synchronization barrier messages for the individual producer computing clusters places only a small additional burden on the hardware.
The above describes, from the perspective of the consumer computing cluster, the synchronization barrier broadcast operation for data space readiness. In the many-to-many broadcast case, every computing cluster in a broadcast group sends synchronization barrier messages to all other computing clusters in that group, and each producer computing cluster sends the data it has produced to all consumer computing clusters only when the warp count for a given synchronization barrier equals the number of computing clusters in the broadcast group. From the perspective of a producer computing cluster, therefore, before sending the data it has produced to all of its consumer computing clusters, it also needs to determine whether it has received the data-space-ready messages for the same synchronization barrier from all of those consumer computing clusters.
Specifically, taking FIG. 1A again as an example, suppose computing cluster 110-1 in broadcast group 120-1 acts as a producer computing cluster; it is configured to determine whether it has received synchronization barrier messages from all the other computing clusters 110-2, 110-5 and 110-6 in broadcast group 120-1. The synchronization barrier messages of computing clusters 110-2 and 110-5 may likewise be sent in the manner described above for computing cluster 110-6.
Computing cluster 110-1 may count the synchronization barrier messages for the same synchronization barrier from all other computing clusters in the same broadcast group. For example, in addition to the broadcast indicator and the broadcast group identifier, the synchronization barrier message may include an identifier of the targeted synchronization barrier (such as barID) and an arrival indication.
If it determines that synchronization barrier messages have been received from all other computing clusters in the broadcast group, computing cluster 110-1 may send the data it has produced to those other computing clusters.
If it determines that synchronization barrier messages have not yet been received from all other computing clusters in the broadcast group, computing cluster 110-1 may continue to wait for the remaining synchronization barrier messages.
In some embodiments, computing cluster 110-1 may execute a synchronization barrier wait instruction such as "bar_groupID_wait barID, N", where "groupID" indicates the identifier of the targeted broadcast group (such as the identifier group1 of broadcast group 120-1), "wait" indicates a wait instruction, "barID" indicates the targeted synchronization barrier, and "N" indicates the synchronization barrier count (that is, the number of computing clusters in the broadcast group). The count N comprises N-1 warps from the N-1 consumer computing clusters and one warp from the producer computing cluster itself (the warp running the synchronization barrier wait instruction). In the example of FIG. 1A, broadcast group 120-1 contains four computing clusters, so N = 4.
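The producer-side wait can be modeled with a simple arrival counter. This is a sketch under stated assumptions: the class `BarrierWait` and the barrier identifier "bar0" are hypothetical, each cluster contributes one arrival per barrier, and the expected count follows the definition given above (the number of computing clusters in the broadcast group, so 4 for a four-cluster group such as 120-1).

```python
from typing import Set


class BarrierWait:
    """Models a wait of the form "bar_groupID_wait barID, N" on one producer."""

    def __init__(self, bar_id: str, expected: int) -> None:
        self.bar_id = bar_id
        self.expected = expected      # N: clusters in the broadcast group
        self.arrived: Set[int] = set()

    def arrive(self, bar_id: str, cluster: int) -> None:
        # Arrivals for other barriers are ignored by this wait.
        if bar_id == self.bar_id:
            self.arrived.add(cluster)

    def released(self) -> bool:
        # The producer may send its data once all N arrivals are counted.
        return len(self.arrived) >= self.expected


wait = BarrierWait("bar0", expected=4)
wait.arrive("bar0", 1)            # producer 110-1's own waiting warp
for consumer in (2, 5):
    wait.arrive("bar0", consumer)
print(wait.released())            # message from 110-6 still outstanding
wait.arrive("bar0", 6)
print(wait.released())
```

Using a set rather than a bare integer is a design choice of the sketch: it keeps a duplicated message from the same cluster from being counted twice.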
图4示出了根据本发明实施例的用于计算设备100的同步方法400的流程图。FIG. 4 shows a flowchart of a
如图4中所示,在方框410,由每个广播组(如广播组120-1)中作为消费者计算簇的一个计算簇(如计算簇110-6),在针对该广播组120-1中的所有生产者计算簇(如计算簇110-1、110-2和110-5)的数据空间就绪时,向该消费者计算簇的簇内接口112发送广播同步屏障消息。该广播同步屏障消息包括广播指示符和该广播组120-1的标识符。As shown in Figure 4, in block 410, by a computing cluster (as computing cluster 110-6) as consumer computing cluster in each broadcasting group (as broadcasting group 120-1), for this broadcasting group 120 When the data spaces of all producer computing clusters in -1 (such as computing clusters 110-1, 110-2 and 110-5) are ready, they send a broadcast synchronization barrier message to the
在方框420,由簇内接口112将该广播同步屏障消息展开为针对该广播组120-1中的每个生产者计算簇(如计算簇110-1、110-2和110-5)的同步屏障消息,并且将每个同步屏障消息发送给相应的生产者计算簇。At
本领域技术人员可以理解,上图所示的计算设备仅是示意性的。在一些实施例中,计算设备可以包含更多或更少的组成部分。Those skilled in the art can understand that the computing device shown in the above figure is only illustrative. In some embodiments, a computing device may contain more or fewer components.
以上结合附图对根据本公开的计算设备及其包含的计算簇的同步操作进行了描述。然而本领域技术人员可以理解,计算设备及其同步操作的执行并不局限于图中所示和以上所述的顺序,而是可以以任何其他合理的顺序来执行。此外,计算设备也不必须包括图中所示的所有组件,其可以仅仅包括执行本公开中所述的功能所必须的其中一些组件或更多组件,并且这些组件的连接方式也不局限于图中所示的形式。The synchronous operation of the computing device and the computing cluster it contains according to the present disclosure has been described above with reference to the accompanying drawings. However, those skilled in the art can understand that the execution of the computing device and its synchronous operations is not limited to the order shown in the figure and described above, but can be executed in any other reasonable order. In addition, the computing device does not necessarily include all the components shown in the figure, and it may only include some or more components necessary to perform the functions described in the present disclosure, and the connection mode of these components is not limited to that shown in the figure. in the form shown.
本公开可以实现为方法、计算设备、系统和/或计算机程序产品。计算机程序产品可以包括计算机可读存储介质,其上载有用于执行本公开的各个方面的计算机可读程序指令。计算设备可以包括至少一个处理器和耦合到该至少一个处理器的至少一个存储器,该存储器可以存储用于由至少一个处理器执行的指令。该指令在由该至少一个处理器执行时,该计算设备可以执行上述非对称同步方法。The present disclosure can be implemented as a method, computing device, system and/or computer program product. A computer program product may include a computer readable storage medium having computer readable program instructions thereon for carrying out various aspects of the present disclosure. A computing device may include at least one processor and at least one memory coupled to the at least one processor, the memory may store instructions for execution by the at least one processor. When the instructions are executed by the at least one processor, the computing device may perform the asymmetric synchronization method described above.
在一个或多个示例性设计中,可以用硬件、软件、固件或它们的任意组合来实现本公开所述的功能。例如,如果用软件来实现,则可以将所述功能作为一个或多个指令或代码存储在计算机可读介质上,或者作为计算机可读介质上的一个或多个指令或代码来传输。In one or more exemplary designs, the functions described in this disclosure may be implemented in hardware, software, firmware, or any combination thereof. For example, if implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
本文公开的装置的各个单元可以使用分立硬件组件来实现,也可以集成地实现在一个硬件组件,如处理器上。例如,可以用通用处理器、数字信号处理器(DSP)、专用集成电路(ASIC)、现场可编程门阵列(FPGA)或其它可编程逻辑器件、分立门或者晶体管逻辑、分立硬件组件或用于执行本文所述的功能的任意组合来实现或执行结合本公开所描述的各种示例性的逻辑块、模块和电路。Each unit of the device disclosed herein can be implemented using discrete hardware components, or integrated on one hardware component, such as a processor. For example, a general purpose processor, digital signal processor (DSP), application specific integrated circuit (ASIC), field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components or for The various illustrative logical blocks, modules, and circuits described in connection with the present disclosure are implemented or performed by performing any combination of the functions described herein.
Those of ordinary skill in the art will also appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments of the present disclosure may be implemented as electronic hardware, computer software, or a combination of both.
The above description of the present disclosure is provided to enable any person of ordinary skill in the art to make or use the present disclosure. Various modifications to the present disclosure will be readily apparent to those of ordinary skill in the art, and the general principles defined herein may be applied to other variations without departing from the spirit and scope of the present disclosure. Thus, the present disclosure is not limited to the examples and designs described herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (13)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310193983.4A CN116192877A (en) | 2023-03-02 | 2023-03-02 | Computing device, synchronization method, electronic device, and storage medium |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310193983.4A CN116192877A (en) | 2023-03-02 | 2023-03-02 | Computing device, synchronization method, electronic device, and storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN116192877A (en) | 2023-05-30 |
Family ID: 86448525
Family Applications (1)
| Application Number | Status | Priority Date | Filing Date | Title |
|---|---|---|---|---|
| CN202310193983.4A, CN116192877A (en) | Pending | 2023-03-02 | 2023-03-02 | Computing device, synchronization method, electronic device, and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN116192877A (en) |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101739381A (en) * | 2008-11-19 | 2010-06-16 | 富士通株式会社 | Barrier synchronization apparatus, barrier synchronization process system and method |
| US20100269027A1 (en) * | 2009-04-16 | 2010-10-21 | International Business Machines Corporation | User level message broadcast mechanism in distributed computing environment |
| US20170357513A1 (en) * | 2016-06-09 | 2017-12-14 | International Business Machines Corporation | Transmitting data between execution slices of a multi-slice processor |
| CN110121699A (en) * | 2017-10-20 | 2019-08-13 | 图核有限公司 | Synchronization in a multi-tile, multi-chip processing arrangement |
| CN114546928A (en) * | 2020-11-24 | 2022-05-27 | 北京灵汐科技有限公司 | Core cluster synchronization method, control method and device, core and medium |
| CN114706813A (en) * | 2022-05-05 | 2022-07-05 | 上海壁仞智能科技有限公司 | Multi-core heterogeneous system on chip, asymmetric synchronization method, computing device and medium |
| CN115543641A (en) * | 2021-06-29 | 2022-12-30 | 辉达公司 | synchronization barrier |
2023-03-02: CN202310193983.4A (CN116192877A), status: active, Pending
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| US20230153163A1 (en) | Computational Partition for a Multi-Threaded, Self-Scheduling Reconfigurable Computing Fabric | |
| US9971635B2 (en) | Method and apparatus for a hierarchical synchronization barrier in a multi-node system | |
| US11288074B2 (en) | Loop execution control for a multi-threaded, self-scheduling reconfigurable computing fabric using a reenter queue | |
| US20190303145A1 (en) | Efficient Loop Execution for a Multi-Threaded, Self-Scheduling Reconfigurable Computing Fabric | |
| US20230153258A1 (en) | Multi-Threaded, Self-Scheduling Reconfigurable Computing Fabric | |
| US20190303144A1 (en) | Backpressure Control Using a Stop Signal for a Multi-Threaded, Self-Scheduling Reconfigurable Computing Fabric | |
| US20190303147A1 (en) | Execution Control of a Multi-Threaded, Self-Scheduling Reconfigurable Computing Fabric | |
| EP2159694B1 (en) | Method and device for barrier synchronization, and multicore processor | |
| CN111858626B (en) | Parallel execution-based data synchronization method and device | |
| US20090125907A1 (en) | System and method for thread handling in multithreaded parallel computing of nested threads | |
| CN101729410B (en) | Synchronization method and device of media access control (MAC) address table | |
| CN103810223B (en) | A kind of memory data organization querying method based on packet | |
| CN113568718B (en) | Task allocation method, device, electronic device and computer readable storage medium | |
| CN106033442A (en) | A Parallel Breadth-First Search Method Based on Shared Memory Architecture | |
| CN110119375B (en) | A control method for chaining multiple scalar cores into a single-core vector processing array | |
| JP2013008270A (en) | Parallel arithmetic unit and microcomputer | |
| CN116192877A (en) | Computing device, synchronization method, electronic device, and storage medium | |
| CN114706813A (en) | Multi-core heterogeneous system on chip, asymmetric synchronization method, computing device and medium | |
| WO2022151970A1 (en) | Data transmission method, system, and computing node | |
| CN116663639B (en) | Gradient data synchronization method, system, device and medium | |
| CN104932947B (en) | A kind of fence synchronous method and equipment | |
| CN116957902A (en) | NoC arbitration method for GPU | |
| CN103631659A (en) | Schedule optimization method for communication energy consumption in on-chip network | |
| CN103246497A (en) | Real-time parallel data processing method based on data partitioning | |
| CN113688090A (en) | Data transmission method, processor system, readable storage medium and electronic device |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | CB02 | Change of applicant information | Country or region after: China. Address after: 201114 room 1302, 13 / F, building 16, 2388 Chenhang Road, Minhang District, Shanghai. Applicant after: Shanghai Bi Ren Technology Co.,Ltd. Address before: 201114 room 1302, 13 / F, building 16, 2388 Chenhang Road, Minhang District, Shanghai. Applicant before: Shanghai Bilin Intelligent Technology Co.,Ltd. Country or region before: China. |