WO2022110567A1 - Data reuse memory access conflict elimination method for coarse-grained reconfigurable structure - Google Patents
- Publication number: WO2022110567A1
- Application number: PCT/CN2021/079524
- Authority
- WO
- WIPO (PCT)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Definitions
- the present application relates to the field of coarse-grained reconfigurable structure compilers, and in particular, to a cyclic transformation model for maximizing effective data reuse for reconfigurable structures and a configuration file modification strategy for the code generation stage.
- Coarse-Grained Reconfigurable Architecture (CGRA) is a new hardware architecture with a higher energy-efficiency ratio. It is mainly used for hardware acceleration of computation-intensive applications such as video processing and neural network operations. Research shows that application execution time is mainly spent in the loop kernels of the program.
- the reference ADRES[1] model gives a typical CGRA structure, as shown in Figure 1.
- the compiler abstracts the program loop kernel as a data flow graph (DFG) and maps it onto different processing elements (PEs) in the coarse-grained processing element array (PEA), and it is executed through software pipelining.
- the machine cycle required between two loop iterations is called the initiation interval (II), and the smaller the II, the better the acceleration performance.
- One of the main goals of a compiler for coarse-grained reconfigurable architectures is to efficiently choose scheduling and mapping strategies to minimize the II of the loop kernel.
- ORF on-chip register file
- the ORF is connected to each column of PEA through the crossbar and the column bus, and different PEs can access each storage bank of the on-chip memory through the column bus where they are located.
- when multiple operations occupy the same hardware resource in the same cycle (for example, simultaneously accessing the same bank of a multi-bank on-chip memory, or exceeding the bandwidth limit of the column bus when accessing on-chip memory through it), the software pipeline stalls and II increases.
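The stall condition described above can be made concrete with a small sketch. This is illustrative only and not from the patent: the bank count, the simple modulo address-to-bank mapping, and the column-bus limit are all hypothetical parameters.

```python
# Illustrative sketch: detect cycles of a software-pipelined kernel that
# would stall because of same-bank collisions or column-bus over-subscription.
from collections import Counter

def conflict_cycles(accesses_per_cycle, num_banks=4, bus_limit=2):
    """accesses_per_cycle: list of lists of memory addresses issued per cycle.
    Returns the cycles that stall."""
    stalls = []
    for cycle, addrs in enumerate(accesses_per_cycle):
        banks = Counter(addr % num_banks for addr in addrs)
        # Two accesses mapped to the same bank, or more accesses than the
        # column bus can carry, both stall the pipeline.
        if any(n > 1 for n in banks.values()) or len(addrs) > bus_limit:
            stalls.append(cycle)
    return stalls

# Addresses 0 and 4 both map to bank 0 in cycle 0, so cycle 0 stalls.
print(conflict_cycles([[0, 4, 1], [2, 3]]))  # → [0]
```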
- cyclic transformation based on the polyhedral model is an emerging loop optimization method; compared with the traditional unimodular-matrix model, using affine transformation as the cyclic transformation offers a wider range of application, stronger expressive ability, and a larger optimization space.
- compilers in various fields, such as Google MLIR [2], Polly [3], and TVM [4], all use the polyhedral model, but little work has applied the polyhedral compilation model to the compilation process for coarse-grained reconfigurable architectures.
- the PolyMap proposed in reference [5] utilizes the polyhedral compilation model, adjusts the hierarchical structure of the loop kernel through analysis of the entire mapping flow, and parallelizes the inner loops of the loop kernel through loop unrolling.
- Reference [6] uses the polyhedral model to represent the program and its transformation process, and applies a genetic algorithm to the cyclic transformation to achieve higher computational efficiency.
- none of the above studies considered applying the polyhedral model to increase data reuse between loop iterations so as to reduce the number of redundant memory access operations.
- Reference [7] divides the data into multiple clusters so that the probability of different clusters being accessed is as equal as possible. This resolves conflicts at the granularity of whole arrays, so the optimization effect of this coarse-grained approach is limited.
- Reference [8] analyzes all memory access addresses to a single array by the loop kernel in a single control cycle, and uses linear transformation to map different addresses to different storage locations in different on-chip memories.
- Reference [9] further subdivides the structure of the storage bank on the basis of reference [8], adds data partitions on the block, and increases the flexibility of linear transformation.
- Reference [10] upgrades the selection of linear transformation parameters from the analysis of single array access in a single control cycle to different array accesses in different control cycles, further reducing multi-bank conflicts.
- a memory bank merging algorithm is proposed to improve the area-performance ratio of the memory bank.
- Reference [11], on the basis of reference [10], changes the execution time of each memory access operator in the scheduling process to further reduce access conflicts.
- a dual-forced scheduling scheme is proposed, and then the memory access operations are divided into different control cycles as much as possible.
- none of the above studies consider the method of reducing the number of memory access operations to reduce memory access conflicts.
- the purpose of this application is to develop, for the CGRA architecture, a compilation-stage method that optimizes the software loop code so as to maximize the data reuse available between iterations at run time, together with a method for eliminating redundant memory accesses in data-reuse pairs, thereby reducing memory access conflicts during loop execution.
- Step 1: use the CGRA compiler front end to compile the loop kernel part of the source program into a first intermediate representation;
- Step 2: perform a cyclic transformation for effective data reuse between iterations to obtain a transformed second intermediate representation;
- Step 3: generate a data flow graph and obtain a mapping result;
- Step 4: select a modification strategy for data reuse;
- Step 5: generate a configuration file.
- the step 2 further includes:
- Step 2.1: analyze the dependency relationships and data reuse relationships in the original code to obtain a dependency set and a reuse set; a reuse relationship means that two operations in different iterations access the same memory location; the two operations in a reuse relationship form a producer/consumer pair that includes a producer and a consumer, and the type of the producer/consumer pair is one of the Load-Load reuse pair, Load-Store reuse pair, Store-Load reuse pair and Store-Store reuse pair;
- Step 2.2: constrain the search space of the innermost time vector according to the dependency set; search the remaining candidates that satisfy the constraints in turn, and find the innermost time vector with the largest number of effective reuses in the reuse set; from the obtained optimal innermost time vector, recursively derive the outer-layer time vectors;
- Step 2.3: generate the second intermediate representation according to the time vectors of each layer.
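Steps 2.1 to 2.3 can be sketched as a constrained search. The sketch below is a hedged simplification: the legality predicate and the validity test are stand-ins for formulas (1)-(3) (which are not reproduced in this text), dependencies and reuses are modelled as plain distance vectors, and the threshold value is hypothetical.

```python
# Illustrative sketch of Step 2.2: among candidate innermost time vectors,
# keep those legal under the dependencies and pick the one that makes the
# most reuse pairs effective.

def count_valid_reuses(t, reuse_set, threshold=2):
    # A reuse pair is modelled by its reuse distance vector d; it counts as
    # "valid" when the reuse lifetime t . d is positive and under the threshold
    # (a stand-in for formula (3)).
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    return sum(1 for d in reuse_set if 0 < dot(t, d) <= threshold)

def best_innermost_vector(candidates, dep_set, reuse_set):
    def legal(t):
        # Stand-in for the cone constraints of formulas (1)-(2): every
        # dependence must be carried non-negatively by t.
        return all(sum(x * y for x, y in zip(t, d)) >= 0 for d in dep_set)
    legal_cands = [t for t in candidates if legal(t)]
    return max(legal_cands, key=lambda t: count_valid_reuses(t, reuse_set))

deps = [(1, 0)]
reuses = [(0, 1), (1, 1)]
print(best_innermost_vector([(1, 0), (0, 1), (1, 1)], deps, reuses))  # → (0, 1)
```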
- the step 4 further includes:
- Step 4.1: analyze the data reuse relationships in the loop kernel to obtain the available data reuse sets; check in turn whether each reuse relationship satisfies the hardware constraints, and record the finally available data reuse relationships in the producer and consumer node information of the reuse relationship;
- Step 4.2 When the configuration file is generated, select the corresponding modification strategy according to the reuse relationship to which the node belongs.
- a local register LRF (Local register file) is used to buffer the output of the producer, and the consumer loads data from the LRF.
- a global register GRF (global register file)
- in step 2, a method based on affine transformation is used to arrange the execution order of the loop iterations.
- a first strategy is used to select cyclic transformations, wherein the first strategy refers to a strategy for narrowing the selection range of optional cyclic transformations based on dependencies, which includes the following steps:
- cone(M_R) represents the cone formed by the elements of M_R
- cone(M_R - i_r) represents the cone formed by the elements other than i_r
- formula (1) expresses that the line containing the currently selected time vector must not lie inside the cone formed by the dependency vectors
- formula (2) expresses that the currently selected time vector may be collinear with a vector on the surface of the cone formed by all dependency vectors, and the dependency vector must be greater than or equal to the time vector, where c is a real number with 0 < c < 1.
- in step 2.2, the innermost time vector is iteratively selected using the first strategy, starting from the innermost loop; after an innermost time vector is selected, all dependency vectors are projected onto the normal plane of that time vector, and the next time vector is selected iteratively; finally, a cyclic transformation that satisfies the loop dependencies is obtained.
- step 2.2 includes a second strategy, and the second strategy refers to a strategy for selecting cyclic transformations to maximize effective data reuse, which includes the following steps:
- a memory access operation p is represented by the on-chip memory address it accesses in iteration i; a given reuse pair is valid when the corresponding formula (3) holds, where c is a constant threshold that helps select valid reuse pairs:
- each reuse pair has a corresponding formula (3); the innermost time vector that makes formula (3) hold for as many reuse pairs as possible is the optimal time vector; the optimal-transformation search algorithm first searches for the optimal innermost time vector, and then recursively computes the outer-layer time vectors from the inner ones using formula (2).
- the modification strategy includes:
- the Store-Store reuse pair is not limited by the iteration interval between producer and consumer; the producer in such a reuse pair is eliminated.
- the modification strategy includes:
- data is transferred from producers to consumers using interconnect resources.
- the modification strategy includes:
- the modification strategy includes:
- when the clock-cycle interval I(r) between the producer and the consumer satisfies 0 < I(r) ≤ 1, and there is an interconnection resource between the PE where the producer is located and the PE where the consumer is located, the consumer of the reuse pair is converted into a routing operation that reads the data from the producer's output register.
- the modification strategy includes:
- when the clock-cycle interval I(r) between the producer and the consumer satisfies I(r) > 1, register resources are used to temporarily store the data.
- the modification strategy includes:
- the reused data is stored in the LRF, and the consumer is converted into a routing operation that obtains the data from the LRF, where II is the initiation interval of the software pipeline;
- pe(s) and pe(t) represent the PE where the producer and consumer are located, respectively.
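The choice among the modification strategies above can be viewed as a dispatch on the reuse-pair type and the interval I(r). The sketch below is an illustrative stand-in, not the patent's exact conditions: the predicate names (`same_pe`, `connected`) and the LRF-capacity flag are hypothetical.

```python
# Illustrative dispatch over the modification strategies described above.

def pick_strategy(kind, I, same_pe, connected, lrf_free=True):
    if kind == "store-store":
        return "eliminate-producer"            # redundant store is removed
    if 0 < I <= 1 and connected:
        return "route-from-output-register"    # interconnect resources suffice
    if I > 1 and same_pe and lrf_free:
        return "buffer-in-LRF"                 # local register file buffering
    if I > 1:
        return "buffer-in-GRF"                 # fall back to the global RF
    return "keep-memory-access"                # no safe reuse: keep the load

print(pick_strategy("load-load", I=1, same_pe=False, connected=True))
print(pick_strategy("store-load", I=3, same_pe=True, connected=False))
```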
- modification strategy includes:
- GRF is then used to temporarily store data.
- modification strategy includes:
- the output operation sends data to the input operation according to the modification policy of the Store-Load reuse pair, and memory access operations during accumulation are converted to no-ops.
- the present application innovatively uses the data reuse relationships between loop iterations to modify the configuration file in the configuration-file generation stage, so as to reduce the number of memory access operations of the CGRA loop kernel and thereby avoid memory access conflicts.
- this method is orthogonal to existing methods for the scheduling and mapping phase and for data rearrangement and storage, and has strong scalability; it can be combined directly with existing data placement, scheduling, and mapping schemes to achieve higher application speedup.
- Fig. 1 is a typical architecture diagram of a 4x4 CGRA in the prior art
- FIG. 2 is a schematic diagram of a compiler backend flow diagram of an embodiment of the present application
- Fig. 3 is the cyclic transformation model process of the embodiment of the present application.
- FIG. 4 is a schematic diagram of a configuration file modification strategy according to an embodiment of the present application.
- Figure 5 shows the running time of an embodiment of the present application on 22 kernels on a 4x4 PEA.
- FIG. 2 is a schematic diagram of the compiler back-end flow according to an embodiment of the present application, wherein:
- Step 201 adopts the CGRA compiler front-end based on LLVM (Low Level Virtual Machine) to compile the loop kernel part of the source program into the form of intermediate representation (IR);
- LLVM Low Level Virtual Machine
- Steps 202-204: for the IR generated by the front end, the cyclic transformation model proposed by the present application, which maximizes effective data reuse between iterations, is applied to obtain a new affine-transformed loop kernel IR;
- Step 202 analyze the dependency relationship and data reuse relationship in the original code to obtain a dependency set and a reuse set;
- Step 203: constrain the search space of the innermost time vector according to the dependency set; search the remaining candidates that satisfy the constraints in turn, and find the innermost time vector with the largest number of effective reuses in the reuse set; from the optimal innermost time vector, recursively derive the outer-layer time vectors;
- Step 204: generate a new intermediate representation according to the time vectors of each layer;
- a data flow graph is generated from the modified intermediate representation, and a mapping result is obtained using the existing scheduling, data placement, and spatial mapping schemes of the prior art; these schemes can be obtained from references [12] and [13], which are incorporated into this application by reference and are listed at the end of the detailed description section.
- Steps 211-212 selecting a modification strategy by adopting the configuration file modification scheme considering data reuse proposed in this application;
- Step 211: analyze all data reuse relationships in the loop kernel to obtain the available data reuse sets; check in turn whether each reuse relationship satisfies the hardware constraints, and record the finally available data reuse relationships in the producer and consumer node information of the reuse relationship;
- Step 212 when the configuration file is generated, select a corresponding modification strategy according to the reuse relationship to which the node belongs;
- Step 213 generating a configuration file according to the previous modification policy.
- reducing the number of redundant memory access operations so as to reduce memory access conflicts during loop-kernel execution is what distinguishes the present application from other studies.
- the present application can be combined orthogonally with existing scheduling, data placement, and mapping strategies to further improve the performance of existing optimization work.
- applying the optimization for reducing memory access conflicts to the front-end loop transformation and configuration-file generation stages enables the present application to be combined with most existing methods for reducing memory access conflicts, which is another difference between the present application and other research.
- Reuse pairs can be divided into four categories, namely Load-Load (two load operations), Load-Store (load first, then write), Store-Load (write first, then load), and Store-Store (two write operations) reuse.
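The four categories can be recovered mechanically from a per-iteration access trace: any two operations in different iterations that touch the same address form a producer/consumer pair named by their operation kinds. The sketch below is illustrative, with a hypothetical trace format.

```python
# Illustrative classification of reuse pairs from a memory access trace.

def reuse_pairs(trace):
    """trace: list of (iteration, op_id, 'load'|'store', address) tuples.
    Returns (category, producer, consumer, iteration_distance) tuples."""
    by_addr = {}
    pairs = []
    for it, op, kind, addr in sorted(trace):
        # Every earlier access to the same address in a different iteration
        # is the producer of a reuse pair with this access as consumer.
        for p_it, p_op, p_kind in by_addr.get(addr, []):
            if p_it != it:
                pairs.append((f"{p_kind}-{kind}", p_op, op, it - p_it))
        by_addr.setdefault(addr, []).append((it, op, kind))
    return pairs

trace = [(0, "s0", "store", 8), (1, "l0", "load", 8)]
print(reuse_pairs(trace))  # → [('store-load', 's0', 'l0', 1)]
```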
- LRF local register file
- GRF global register file
- our compiler should guarantee that the reused data can be preserved in the LRF or GRF for its entire lifetime. Therefore, the goal of our loop transformation model is to use an affine-transformation-based approach to orchestrate the execution order of loop iterations so as to minimize the lifetime of data reused between iterations.
- in step 203 shown in FIG. 2, the present application proposes a strategy for narrowing the selection range of candidate cyclic transformations according to dependencies.
- cone(M_R) represents the cone formed by the elements of M_R
- cone(M_R - i_r) represents the cone formed by the elements other than i_r.
- formula (1) indicates that the line containing the currently selected time vector cannot lie inside the cone composed of the dependency vectors; formula (2) indicates that the currently selected time vector may be collinear with a vector on the surface of the cone composed of all dependency vectors, and the dependency vector must be greater than or equal to the time vector, where c is a real number with 0 < c < 1.
- the present application iteratively applies this strategy, starting from the innermost loop, to select a legal innermost time vector. After an innermost time vector is selected, all dependency vectors are projected onto the normal plane of that time vector, and the next time vector is selected iteratively. Finally, a cyclic transformation that satisfies the loop dependencies is obtained.
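The projection step described above admits a direct sketch: once an innermost time vector t is fixed, each dependency vector d is replaced by its component orthogonal to t, i.e. d - (d·t / t·t)·t. The helper below is illustrative.

```python
# Illustrative projection of a dependency vector d onto the normal
# (hyper)plane of a chosen time vector t.

def project_onto_normal_plane(d, t):
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    k = dot(d, t) / dot(t, t)          # component of d along t
    return tuple(di - k * ti for di, ti in zip(d, t))

# A dependence collinear with t vanishes; one orthogonal to t is unchanged.
print(project_onto_normal_plane((2, 0), (1, 0)))  # → (0.0, 0.0)
print(project_onto_normal_plane((0, 3), (1, 0)))  # → (0.0, 3.0)
```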
- in step 203 shown in FIG. 2, the present application proposes a strategy for selecting cyclic transformations that maximize effective data reuse. Whether a Load-Load or Store-Load reuse pair is valid depends on the innermost time vector when searching for the optimal transformation.
- a memory access operation p is represented by the on-chip memory address it accesses in iteration i. A given reuse pair is valid when the corresponding formula (3) holds.
- c is a constant threshold used to help select valid reuse pairs.
- each reuse pair has a corresponding formula (3); the innermost time vector that makes formula (3) hold for as many reuse pairs as possible is the optimal time vector.
- the optimal-transformation search algorithm first searches for the optimal time vector of the innermost layer, and then uses formula (2) to recursively calculate the outer-layer time vectors from the inner ones.
- FIG. 3 in order to further illustrate the above-mentioned cyclic transformation method, the program shown in FIG. 3( a ) is taken as an example.
- the goal of the optimization problem is to maximize the number of equations established in Figure 3(e).
- the nodes in Fig. 3(g) represent the innermost candidate time vectors, where the feasible solutions of the optimization problem are highlighted as gray nodes.
- Figure 3(f) shows the time vector for each recurrent layer.
- Figure 3(h) shows the transformed program, while the innermost loop is split into two layers in order to traverse all iterations.
- FIG. 4(a) shows the PE structure, where Op represents an operation placed on the PE; P represents a producer in a reuse pair, C represents a consumer, and R represents a routing node.
- the LRF owned by each PE is also marked, and the LRF mark drawn with a diagonal pattern indicates that the LRF stores data.
- Store-Store reuse is not limited by interval iterations between producers and consumers. If such a reuse relationship exists, the producer in the reuse relationship pair needs to be eliminated.
- Figure 4(b) shows the modification strategy for Store-Store reuse.
- interconnect resources or register resources should be used to move data from producers to consumers.
- a reuse pair is denoted r = ⟨s, t, d⟩, where s is the producer in the reuse pair, t is the consumer, and d is the number of iterations between s and t.
- I(r) = d · II + (time(t) - time(s)), where II is the initiation interval of the software pipeline and time(·) is the scheduled cycle of an operation.
- when I(r) > 1, register resources must be used to temporarily store the data.
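A worked instance of the interval formula above, I(r) = d·II + (time(t) - time(s)), with illustrative values:

```python
# Worked example of the reuse-interval formula. All numbers are illustrative.

def reuse_interval(d, II, time_s, time_t):
    """Cycles the reused datum must survive between producer s and consumer t."""
    return d * II + (time_t - time_s)

# Producer scheduled at cycle 2, consumer at cycle 1, one iteration apart,
# II = 3: the data must survive 1*3 + (1 - 2) = 2 cycles. Since I(r) > 1,
# register buffering is required rather than direct interconnect routing.
print(reuse_interval(d=1, II=3, time_s=2, time_t=1))  # → 2
```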
- the selected strategy will be recorded, and in the configuration file generation stage of step 213, the generated configuration file will be modified through the selected strategy.
- the test integrates the cyclic transformation model and the configuration-file modification strategy proposed in this application on a test set of 22 typical computation-intensive application kernels.
- the 22 compute-intensive kernels are selected from digital signal processing, computer vision, and dynamic programming workloads in existing benchmark suites such as EEMBC, Polybench, and MachSuite.
- Figure 5 compares the run times generated by the modulo-scheduling compiler integrating the present application and by the original compiler (that is, without the present application, using the compilation flow based on references [12] and [13] below, denoted THP+DP).
- Table 1 shows the comparison of the number of memory access operations before and after the implementation of the solution described in this application:
Description
The present application relates to the field of coarse-grained reconfigurable structure compilers, and in particular, to a cyclic transformation model for maximizing effective data reuse for reconfigurable structures and a configuration file modification strategy for the code generation stage.
Coarse-Grained Reconfigurable Architecture (CGRA) is a new hardware architecture with a higher energy-efficiency ratio. It is mainly used for hardware acceleration of computation-intensive applications such as video processing and neural network operations. Research shows that application execution time is mainly spent in the loop kernels of the program. The ADRES model of reference [1] gives a typical CGRA structure, as shown in Figure 1. The compiler abstracts the program loop kernel as a data flow graph (DFG) and maps it onto different processing elements (PEs) in the coarse-grained processing element array (PEA), and it is executed through software pipelining. The number of machine cycles between two successive loop iterations is called the initiation interval (II); the smaller the II, the better the acceleration performance. One of the main goals of a compiler for coarse-grained reconfigurable architectures is to choose scheduling and mapping strategies effectively so as to minimize the II of the loop kernel.
However, software pipelining increases the demand for parallel data access to the on-chip memory (ORF), and pipeline stalls caused by memory access conflicts become the bottleneck of CGRA efficiency. The ORF is connected to each column of the PEA through a crossbar and column buses, and different PEs can access each bank of the on-chip memory through the column bus of their column. When multiple operations occupy the same hardware resource in the same cycle (for example, simultaneously accessing the same bank of a multi-bank on-chip memory, or exceeding the bandwidth limit of the column bus when accessing on-chip memory through it), the software pipeline stalls and II increases.
A study of 22 commonly used computing kernels found that memory access operations account for 52.2% of all operations, and the latency caused by memory access conflicts between accesses in the same control cycle accounts for 68.4% of the total running time. Existing work on eliminating memory access conflicts mostly focuses on adjusting the position of memory access operations in the scheduling and mapping stages and adjusting where data is stored in on-chip memory, but it does not address the root cause of memory access conflicts: the excessive number of memory access operations introduced by software pipelining.
Some related studies and analyses are as follows:
1. Research on applying the polyhedral model to coarse-grained reconfigurable architectures
Cyclic transformation based on the polyhedral model is an emerging loop optimization method. Compared with the traditional unimodular-matrix model, using affine transformation as the cyclic transformation offers a wider range of application, stronger expressive ability, and a larger optimization space. Compilers in various fields, such as Google MLIR [2], Polly [3], and TVM [4], all use the polyhedral model, but little work has applied the polyhedral compilation model to the compilation process for coarse-grained reconfigurable architectures. The PolyMap proposed in reference [5] utilizes the polyhedral compilation model, adjusts the hierarchical structure of the loop kernel through analysis of the entire mapping flow, and parallelizes the inner loops of the loop kernel through loop unrolling. Reference [6] uses the polyhedral model to represent the program and its transformation process, and applies a genetic algorithm to the cyclic transformation to achieve higher computational efficiency. However, none of the above studies considered applying the polyhedral model to increase data reuse between loop iterations so as to reduce the number of redundant memory access operations.
2. Research on reducing memory access conflicts on CGRA
Memory access conflicts mainly arise when a large number of memory access operations access memory simultaneously. Existing research on reducing memory access conflicts focuses on optimizing the scheduling and mapping processes and the way data is stored in on-chip memory, so as to reduce multi-bank conflicts. Reference [7] divides the data into multiple clusters so that the probability of different clusters being accessed is as equal as possible; this resolves conflicts at the granularity of whole arrays, so the optimization effect of this coarse-grained approach is limited. Reference [8] analyzes all memory access addresses of the loop kernel to a single array within a single control cycle, and uses a linear transformation to map different addresses to different storage locations in different on-chip memories. Reference [9], building on reference [8], further subdivides the structure of the memory banks, adds data partitioning on blocks, and increases the flexibility of the linear transformation. Reference [10] extends the selection of linear-transformation parameters from the analysis of single-array access in a single control cycle to different array accesses in different control cycles, further reducing multi-bank conflicts, and also proposes a memory-bank merging algorithm to improve the area-performance ratio of the banks. Reference [11], building on reference [10], changes the execution time of each memory access operator during scheduling to further reduce access conflicts, and proposes a dual-forced scheduling scheme that distributes memory access operations across different control cycles as much as possible. However, none of the above studies considered reducing the number of memory access operations as a way to reduce memory access conflicts.
The references mentioned above are as follows:
[1] Y. Park, J. J. K. Park, and S. Mahlke. 2012. Efficient performance scaling of future CGRAs for mobile applications. In International Conference on Field-Programmable Technology (FPT). 335-342.
[2] Lattner C, Amini M, Bondhugula U, et al. MLIR: A Compiler Infrastructure for the End of Moore's Law [J]. 2020.
[3] Grosser, T., Groesslinger, A., & Lengauer, C. (2012). Polly - Performing polyhedral optimizations on a low-level intermediate representation. Parallel Processing Letters, 22(4), 1-28.
[4] Chen, Tianqi, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan et al. "TVM: An automated end-to-end optimizing compiler for deep learning." In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 578-594. 2018.
[5] Liu, D., Yin, S., Peng, Y., Liu, L., & Wei, S. (2015). Optimizing Spatial Mapping of Nested Loop for Coarse-Grained Reconfigurable Architectures. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 23(11), 2581-2594.
[6] Ganser S, Grösslinger A, Siegmund N, et al. Speeding up Iterative Polyhedral Schedule Optimization with Surrogate Performance Models [J]. ACM Transactions on Architecture & Code Optimization, 2018, 15(4): 1-27.
[7] Kim, Y., Lee, J., Shrivastava, A., & Paek, Y. (2010). Operation and data mapping for CGRAs with multi-bank memory. Proceedings of the ACM SIGPLAN Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), 17-25.
[8] Wang Y, Li P, Zhang P, et al. Memory partitioning for multidimensional arrays in high-level synthesis [C]// ACM/EDAC/IEEE Design Automation Conference. IEEE, 2013.
[9] Wang, Y., Li, P., & Cong, J. (2014). Theory and algorithm for generalized memory partitioning in high-level synthesis. ACM/SIGDA International Symposium on Field Programmable Gate Arrays - FPGA, 199-208.
[10] Yin S, Xie Z, Meng C, et al. Multibank memory optimization for parallel data access in multiple data arrays [C]// IEEE/ACM International Conference on Computer-Aided Design. IEEE, 2017.
[11] Park, Hyunchul & Fan, Kevin & Mahlke, Scott & Oh, Taewook & Kim, Heeseok & Kim, Hong-seok. (2008). Edge-centric modulo scheduling for coarse-grained reconfigurable architectures. Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT. 166-176. 10.1145/1454115.1454140.
SUMMARY OF THE INVENTION
In view of the above defects of the prior art, the purpose of this application is to develop a new method based on the CGRA architecture that, at the compilation stage, optimizes the software loop code so as to maximize the data reuse available between iterations at run time and to eliminate the redundant memory access operations in data-reuse pairs, thereby reducing memory access conflicts during loop execution.
The method includes the following steps:
Step 1: compile the loop kernel of the source program into a first intermediate representation using the CGRA compiler front end;
Step 2: perform a loop transformation for effective inter-iteration data reuse to obtain a transformed second intermediate representation;
Step 3: generate the dataflow graph and obtain the mapping result;
Step 4: select the modification strategies for data reuse;
Step 5: generate the configuration file.
Step 2 further includes:
Step 2.1: analyze the dependence and data reuse relationships in the original code to obtain the dependence set and the reuse set. A reuse relationship exists when two operations in different iterations access the same memory location. The two operations in a reuse relationship form a producer/consumer pair, which consists of a producer and a consumer; the pair types include the Load-Load reuse pair, the Load-Store reuse pair, the Store-Load reuse pair and the Store-Store reuse pair.
Step 2.2: constrain the search space of the innermost time vector according to the dependence set; search the remaining candidates that satisfy the constraints in turn and find the innermost time vector that validates the largest number of reuses in the reuse set; from the optimal innermost time vector, derive the outer time vectors recursively.
Step 2.3: generate the second intermediate representation from the time vectors of all loop levels.
Step 4 further includes:
Step 4.1: analyze the data reuse relationships in the loop kernel to obtain the sets of available data reuses; check in turn whether each reuse relationship satisfies the hardware constraints, and record the finally available data reuse relationships in the node information of the producer and consumer of that relationship.
Step 4.2: when the configuration file is generated, select the corresponding modification strategy according to the reuse relationship to which each node belongs.
Further, for the Load-Load reuse pair and the Store-Load reuse pair, in Step 2 a local register file (LRF) is used to buffer the producer's output, and the consumer loads the data from the LRF.
Further, for the Load-Load reuse pair and the Store-Load reuse pair, in Step 2 a global register file (GRF) is used to buffer the producer's output, and the consumer loads the data from the GRF.
Further, in Step 2, an affine-transformation-based method is used to arrange the execution order of the loop iterations.
Further, in Step 2.2, a first strategy is used to select the loop transformation, where the first strategy narrows the selection range of candidate loop transformations according to the dependences; it includes the following steps:
Define t_m = (x_m1, x_m2, ..., x_mn) as the time vector of the m-th dimension, expressing the change of all loop indices between two consecutive iterations at that level, where x_mi means that at the m-th level of the perfect loop nest, the i-th loop variable increases by x_mi per iteration. Let the dependence set of the perfect loop nest be R = {<s, d>}, meaning that loop iteration s must execute before loop iteration d. The difference d - s of each element of the dependence set is called a dependence vector and is stored in M_R. For a dependence vector set M_R, the innermost time vector t_1 of the selected time vector group must satisfy at least one of formula (1) and formula (2):

(1)  c * t_1 is not in cone(M_R) for every real number c != 0;

(2)  t_1 = c * i_r for some i_r in M_R with i_r not in cone(M_R - i_r) and 0 < c <= 1;

where cone(M_R) denotes the cone spanned by the elements of M_R, and cone(M_R - i_r) denotes the cone spanned by the elements other than i_r. Formula (1) states that the line on which the selected time vector lies must not fall inside the cone formed by the dependence vectors; formula (2) states that the selected time vector may be collinear with a vector on the surface of the cone formed by all dependence vectors, provided that dependence vector is greater than or equal to the time vector, i.e. t_1 = c * i_r, where c is a real number with 0 < c <= 1.
Further, in Step 2.2, the first strategy is applied iteratively starting from the innermost loop to select the innermost time vector. After an innermost time vector is selected, all dependence vectors are projected onto the plane normal to that time vector, and the next time vector is selected iteratively; this finally yields a loop transformation that satisfies the loop dependences.
Further, Step 2.2 includes a second strategy, which selects the loop transformation so as to maximize effective data reuse; it includes the following steps:
A memory access operation p is represented as addr_p(i), meaning that operation p accesses on-chip memory address addr_p(i) in iteration i. A given reuse pair is valid when its corresponding formula (3) holds: under the candidate innermost time vector, the consumer touches the address produced by the producer within at most c subsequent innermost iterations, where c is a constant threshold used to help select valid reuse pairs.
Each reuse pair has a corresponding instance of formula (3); the innermost time vector that makes formula (3) hold for as many reuse pairs as possible is the optimal time vector. The optimal transformation search algorithm first searches for the optimal innermost time vector, and then recursively computes the outer time vectors from the inner ones using formula (2).
Further, in Step 4.2, the modification strategies include:
the Store-Store reuse pair is not limited by the number of iterations between the producer and the consumer, and the producer in this reuse pair is eliminated.
Further, in Step 4.2, the modification strategies include:
for the Store-Load reuse pair and the Load-Load reuse pair, the data is transferred from the producer to the consumer.
Further, interconnect resources are used to transfer the data from the producer to the consumer.
Further, register resources are used to transfer the data from the producer to the consumer.
Further, in Step 4.2, for the Store-Load reuse pair and the Load-Load reuse pair, the modification strategies include:
if the number of clock cycles I(r) between the producer and the consumer satisfies I(r) <= 0, the reuse pair is invalid.
Further, in Step 4.2, for the Store-Load reuse pair and the Load-Load reuse pair, the modification strategies include:
if the number of clock cycles I(r) between the producer and the consumer satisfies 1 >= I(r) > 0, and an interconnect resource connects the PE holding the producer with the PE holding the consumer, the consumer of the reuse pair is converted into a routing operation that reads the data from the producer's output register.
Further, in Step 4.2, for the Store-Load reuse pair and the Load-Load reuse pair, the modification strategies include:
if the number of clock cycles I(r) between the producer and the consumer satisfies I(r) > 1, register resources are used to temporarily store the data.
Further, in Step 4.2, for the Store-Load reuse pair and the Load-Load reuse pair, the modification strategies include:
if pe(s) = pe(t) and the remaining LRF space on pe(s) satisfies the space constraint determined by I(r) and the initiation interval II of the software pipeline,
the reused data is stored in the LRF, and the consumer is converted into a routing operation that fetches the data from the LRF;
where pe(s) and pe(t) denote the PEs holding the producer and the consumer, respectively.
Further, the modification strategies include:
if no LRF-based strategy is selected and the remaining GRF space is sufficient to hold the reused data,
the GRF is used to temporarily store the data.
Further, the modification strategies include:
for accumulation reuse, the output operation sends its data to the input operation according to the modification strategy of the Store-Load reuse pair, and the memory access operations during the accumulation are converted into no-ops.
Further, in Step 4.2, for the Store-Load reuse pair and the Load-Load reuse pair, the modification strategies include:
if Con(r) = true and pe(s) != pe(t), additional routing nodes are added to move the data through the LRF;
where pe(s) and pe(t) denote the PEs holding the producer and the consumer, respectively, and Con(r) = true means that an interconnect resource connects pe(s) and pe(t).
Further, the modification strategies include:
if no LRF-based strategy is selected and the remaining GRF space is sufficient to hold the reused data,
the GRF is used to temporarily store the data.
Further, the modification strategies include:
for accumulation reuse, the output operation sends its data to the input operation according to the modification strategy of the Store-Load reuse pair, and the memory access operations during the accumulation are converted into no-ops.
Compared with the prior art, the beneficial effects of the present application are as follows:
1. Compared with existing schemes for reducing memory access conflicts, the present application innovatively exploits the data reuse relationships between loop iterations and modifies the configuration file during the configuration file generation stage, reducing the number of memory access operations in the CGRA loop kernel and thereby avoiding memory access conflicts.
2. The method is orthogonal to existing methods targeting the scheduling/mapping stage and data rearrangement in storage, and is highly extensible: it can simply cooperate with existing data placement, scheduling and mapping schemes to achieve a higher application speedup.
FIG. 1 is a typical 4x4 CGRA architecture diagram of the prior art;
FIG. 2 is a schematic diagram of the compiler back-end flow of an embodiment of the present application;
FIG. 3 shows the loop transformation model process of an embodiment of the present application;
FIG. 4 is a schematic diagram of the configuration file modification strategies of an embodiment of the present application;
FIG. 5 shows the run time of an embodiment of the present application on 22 kernels on a 4x4 PEA.
The preferred embodiments of the present application are described below with reference to the accompanying drawings so that the technical content is clearer and easier to understand. The present application can be embodied in many different forms, and its protection scope is not limited to the embodiments mentioned herein.
The concept, specific structure and technical effects of the present application are further described below so that its purpose, features and effects can be fully understood; however, the protection of the present application is not limited thereto.
As shown in FIG. 2, the compiler back-end flow of an embodiment of the present application is as follows:
Step 201: the loop kernel of the source program is compiled into an intermediate representation (IR) by an LLVM (Low Level Virtual Machine)-based CGRA compiler front end;
Steps 202-204: the IR generated by the front end is transformed with the loop transformation model proposed in this application, which maximizes effective inter-iteration data reuse, yielding a new affine-transformed loop kernel IR, where:
Step 202: the dependence and data reuse relationships in the original code are analyzed to obtain the dependence set and the reuse set;
Step 203: the search space of the innermost time vector is constrained according to the dependence set; the remaining candidates that satisfy the constraints are searched in turn to find the innermost time vector that validates the largest number of reuses in the reuse set; from the optimal innermost time vector, the outer time vectors are derived recursively;
Step 204: a new intermediate representation is generated from the time vectors of all loop levels;
Steps 205 to 210: the dataflow graph is generated from the modified intermediate representation, and the mapping result is obtained using existing scheduling, data partitioning and spatial mapping schemes; these schemes can be taken from references [12] and [13], which are hereby incorporated by reference and listed at the end of the detailed description section;
Steps 211-212: the modification strategy is selected using the reuse-aware configuration file modification scheme proposed in this application;
Step 211: all data reuse relationships in the loop kernel are analyzed to obtain the sets of available data reuses; each reuse relationship is checked in turn against the hardware constraints, and the finally available data reuse relationships are recorded in the node information of the producer and consumer of that relationship;
Step 212: when the configuration file is generated, the corresponding modification strategy is selected according to the reuse relationship to which each node belongs;
Step 213: the configuration file is generated according to the previously selected modification strategies.
In the above flow, the number of memory access conflicts during loop kernel execution is reduced by reducing the number of redundant memory access operations, which distinguishes this application from other studies. In addition, unlike existing work on reducing memory access conflicts, which targets the placement of data in on-chip memory and the scheduling and mapping strategies, this application can be combined orthogonally with existing scheduling, data placement and mapping strategies to further improve the performance of existing optimizations. Applying the conflict-reduction optimization to the front-end loop transformation and configuration file generation stages allows this application to be combined with most existing methods for reducing memory access conflicts, which is another distinction from other studies.
Loop transformation
If two operations in different iterations access the same memory location, a reuse relationship exists between them. Depending on the order in which the data is accessed, the two operations of a reuse relationship form a producer/consumer pair, also called a reuse pair. Reuse pairs fall into four categories: Load-Load (two loads), Load-Store (load then store), Store-Load (store then load) and Store-Store (two stores). For a Load-Load or Store-Load reuse pair, we buffer the producer's output in a local register file (LRF) or global register file (GRF), so the consumer only needs to load the data from the LRF or GRF instead of accessing the OGM, which reduces memory access operations. However, because the hardware provides GRFs and LRFs of limited size, our compiler must guarantee that the reused data survives in the LRF or GRF for its whole lifetime. The goal of our loop transformation model is therefore to use an affine-transformation-based method to arrange the execution order of loop iterations so as to minimize the lifetime of data reused across iterations.
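To illustrate this classification, the following sketch (Python; all names, the toy address functions and the `max_distance` bound are assumptions of this example, not part of the patent) enumerates producer/consumer pairs that touch the same address in nearby iterations and labels them with the four categories:

```python
from itertools import permutations

# Toy model: each access is (name, kind, addr_fn), where addr_fn(i) gives the
# address touched in iteration i. Two accesses form a reuse pair when a later
# iteration of one touches an address an earlier iteration of the other touched.

def classify(producer_kind, consumer_kind):
    """Map (producer, consumer) access kinds to the four reuse categories."""
    return {
        ("load", "load"): "Load-Load",
        ("load", "store"): "Load-Store",
        ("store", "load"): "Store-Load",
        ("store", "store"): "Store-Store",
    }[(producer_kind, consumer_kind)]

def find_reuse_pairs(accesses, iterations, max_distance):
    """Enumerate producer/consumer pairs hitting the same address within
    max_distance iterations; returns (producer, consumer, distance, category)."""
    pairs = []
    for p, q in permutations(accesses, 2):
        p_name, p_kind, p_addr = p
        q_name, q_kind, q_addr = q
        for i in iterations:
            for d in range(1, max_distance + 1):
                if i + d in iterations and p_addr(i) == q_addr(i + d):
                    pairs.append((p_name, q_name, d, classify(p_kind, q_kind)))
                    break
            else:
                continue  # no reuse found at this iteration; try the next one
            break  # record each (p, q) pair at most once
    return pairs

# Example: a[i] is stored in iteration i and loaded again in iteration i + 1,
# giving a Store-Load reuse pair at distance 1.
accesses = [
    ("store a[i]", "store", lambda i: i),
    ("load a[i-1]", "load", lambda i: i - 1),
]
pairs = find_reuse_pairs(accesses, range(8), max_distance=2)
```

The single pair found here is exactly the Store-Load case for which the text buffers the producer's output in a register file.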
In step 203 shown in FIG. 2, this application proposes a strategy that narrows the selection range of candidate loop transformations according to the dependences.
Define t_m = (x_m1, x_m2, ..., x_mn) as the time vector of the m-th dimension, expressing the change of all loop indices between two consecutive iterations at that level, where x_mi means that at the m-th level of the perfect loop nest, the i-th loop variable increases by x_mi per iteration. For the loop program in FIG. 3(a), a group of time vectors corresponds to one definite loop execution order: applying an affine transformation to a group of time vectors adjusts the execution order of the iterations inside the loop, but the transformation must not violate the dependences of the original loop. Let the dependence set of the perfect loop nest be R = {<s, d>}, meaning that loop iteration s must execute before loop iteration d. The difference d - s of each element of the dependence set is called a dependence vector and is stored in M_R. For a dependence vector set M_R, the time vector group can be legal only if its innermost time vector t_1 satisfies at least one of formula (1) and formula (2):

(1)  c * t_1 is not in cone(M_R) for every real number c != 0;

(2)  t_1 = c * i_r for some i_r in M_R with i_r not in cone(M_R - i_r) and 0 < c <= 1;

where cone(M_R) denotes the cone spanned by the elements of M_R, and cone(M_R - i_r) denotes the cone spanned by the elements other than i_r. Formula (1) states that the line on which the selected time vector lies must not fall inside the cone formed by the dependence vectors; formula (2) states that the selected time vector may be collinear with a vector on the surface of the cone formed by all dependence vectors, provided that dependence vector is greater than or equal to the time vector, i.e. t_1 = c * i_r, where c is a real number with 0 < c <= 1.
Based on the above selection strategy, this application iteratively applies it starting from the innermost loop to select a legal innermost time vector. After an innermost time vector is selected, all dependence vectors can be projected onto the plane normal to that time vector, and the next time vector is selected iteratively. This finally yields a loop transformation that satisfies the loop dependences.
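The projection step described above can be sketched as follows (Python; the use of a plain Euclidean projection onto the hyperplane normal to the time vector is an assumption of this sketch, since the source does not spell out the projection formula):

```python
def project_onto_normal_plane(dep, t):
    """Project dependence vector dep onto the hyperplane normal to time
    vector t: dep_proj = dep - ((dep . t) / (t . t)) * t."""
    dot_dt = sum(d * x for d, x in zip(dep, t))
    dot_tt = sum(x * x for x in t)
    return tuple(d - (dot_dt / dot_tt) * x for d, x in zip(dep, t))

# Example: with innermost time vector (0, 1), the dependence (1, 1) projects
# to (1, 0): its component along the chosen time vector is removed, and the
# residual is what the next (outer) time vector must still respect.
deps = [(1, 1), (2, 0)]
t_inner = (0, 1)
projected = [project_onto_normal_plane(d, t_inner) for d in deps]
```

Iterating this over each loop level reduces the legality check for the outer time vectors to the same cone conditions on the projected dependence vectors.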
In step 203 shown in FIG. 2, this application also proposes a strategy that selects the loop transformation so as to maximize effective data reuse. When searching for the optimal transformation, whether a Load-Load or Store-Load reuse pair is valid depends on the innermost time vector. A memory access operation p can be represented as addr_p(i), meaning that operation p accesses on-chip memory address addr_p(i) in iteration i. A given reuse pair is valid when its corresponding formula (3) holds: under the candidate innermost time vector, the consumer touches the address produced by the producer within at most c subsequent innermost iterations, where c is a constant threshold used to help select valid reuse pairs.
Each reuse pair has a corresponding instance of formula (3); the innermost time vector that makes formula (3) hold for as many reuse pairs as possible is the optimal time vector. The optimal transformation search algorithm first searches for the optimal innermost time vector, and then recursively computes the outer time vectors from the inner ones using formula (2).
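A minimal sketch of this innermost search is shown below (Python). The validity test here is a stand-in for formula (3): it assumes a reuse pair counts as valid when its iteration-space reuse vector equals k * t for an integer k with 1 <= k <= c, i.e. the data is reused within c consecutive innermost iterations. Function names and the toy search space are this example's assumptions.

```python
def is_valid_under(t, reuse_vec, c):
    """Stand-in for formula (3): the reuse pair is valid if its reuse vector
    is k * t for some integer 1 <= k <= c (reused within c innermost steps)."""
    for k in range(1, c + 1):
        if all(r == k * x for r, x in zip(reuse_vec, t)):
            return True
    return False

def best_innermost_vector(candidates, reuse_vecs, c):
    """Pick the legal candidate time vector that validates the most reuse pairs."""
    return max(candidates,
               key=lambda t: sum(is_valid_under(t, r, c) for r in reuse_vecs))

# Toy search space: unit steps along either axis of a 2-deep loop nest.
candidates = [(0, 1), (1, 0)]
reuse_vecs = [(0, 1), (0, 2), (1, 0)]   # iteration-space reuse distances
best = best_innermost_vector(candidates, reuse_vecs, c=2)
```

Here (0, 1) validates two of the three reuse pairs while (1, 0) validates only one, so the search would fix (0, 1) as the innermost time vector and then derive the outer vectors recursively.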
As shown in FIG. 3, the program in FIG. 3(a) is again taken as an example to further illustrate the above loop transformation method. The goal of the optimization problem is to maximize the number of equations in FIG. 3(e) that hold. The nodes in FIG. 3(g) represent the candidate innermost time vectors, with the feasible solutions of the optimization problem highlighted as gray nodes. FIG. 3(f) shows the time vector of each loop level. FIG. 3(h) shows the transformed program, in which the innermost loop is split into two levels so that all iterations are traversed.
Configuration file modification
During configuration file generation, we propose a configuration file modification strategy that eliminates unnecessary memory access operations by exploiting inter-iteration data reuse. The algorithm first checks whether each reuse pair satisfies the timing and register space constraints, and then applies different modification strategies to reduce memory access operations. The modification strategies are shown in FIG. 4. FIG. 4(a) shows the PE representation: Op denotes an operation placed on the PE, P the producer of a reuse pair, C the consumer and R a routing node. The LRF owned by each PE is also marked; an LRF symbol drawn with a diagonal pattern indicates that the LRF holds data.
Store-Store reuse is not limited by the number of iterations between producer and consumer. If such a reuse relationship exists, the producer of the pair is eliminated. FIG. 4(b) shows the modification strategy for Store-Store reuse.
For Store-Load and Load-Load reuse, interconnect or register resources should be used to move the data from the producer to the consumer. Given a data reuse pair r = <s, t, d>, where s is the producer, t is the consumer and d is the number of iterations between s and t, and assuming operation s is mapped onto the pe(s)-th PE at control cycle time(s) of the software pipeline, the number of clock cycles I(r) between the producer and the consumer can be computed as I(r) = d * II + (time(t) - time(s)), where II is the initiation interval of the software pipeline.
If I(r) <= 0, the producer executes after or at the same time as the consumer, and the reuse pair is invalid.
When I(r) > 0, the space constraints of the reuse pair should be checked. If an interconnect resource connects pe(s) and pe(t), where the producer and consumer are located, Con(r) is true. FIG. 4(d) shows the strategy selected when I(r) = 1 and Con(r) = true: the consumer of the reuse pair is converted into a routing operation that reads the data from the producer's output register.
When I(r) > 1, register resources must be used to temporarily store the data.
FIG. 4(c) shows the case pe(s) = pe(t). If the remaining LRF space on pe(s) is sufficient to hold the reused data for the I(r) cycles between producer and consumer, the reused data should be stored in the LRF, and the consumer is converted into a routing operation that fetches the data from the LRF.
When I(r) >= 1, Con(r) = true and pe(s) != pe(t), additional routing nodes must be added to move the data. Suppose there are two no-op slots np and nc satisfying pe(np) = pe(s), pe(nc) = pe(t) and (time(nc) - time(np)) mod II = 1; the reused data is then moved by these two routing nodes through the LRF. Notably, the producer and the consumer themselves can also be chosen as np and nc. FIG. 4(e) shows the different strategies, including those that choose the producer or the consumer as a routing node; the four diagrams, from left to right, show the configuration file modification strategies when the producer is chosen as np, when the consumer is chosen as nc, and when idle nodes other than the producer and consumer are chosen as np and nc.
If none of the above strategies is selected and the remaining GRF space is sufficient to hold the reused data, the GRF is used to temporarily store the data. The modification strategy for this case is shown in FIG. 4(f).
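The case analysis above can be summarized as a dispatch function (Python sketch; the function and parameter names, the strategy labels, and the use of ceil(I(r)/II) as the number of register slots needed are assumptions of this sketch, not the patent's exact space constraint):

```python
from math import ceil

def pick_strategy(d, II, time_s, time_t, same_pe, connected, lrf_free, grf_free):
    """Choose a configuration-modification strategy for a Store-Load or
    Load-Load reuse pair r = <s, t, d>, following the case analysis above,
    with I(r) = d * II + (time(t) - time(s))."""
    I = d * II + (time_t - time_s)
    if I <= 0:
        return "invalid"                      # consumer not strictly after producer
    copies = ceil(I / II)                     # in-flight copies under pipelining
    if I == 1 and connected:
        return "route-from-output-register"   # consumer reads producer's output reg
    if same_pe and lrf_free >= copies:
        return "buffer-in-LRF"                # FIG. 4(c) case
    if connected and lrf_free >= copies:
        return "route-via-LRF"                # extra routing nodes move the data
    if grf_free >= copies:
        return "buffer-in-GRF"                # FIG. 4(f) case
    return "keep-memory-access"               # no resource fits: leave the load

strategy = pick_strategy(d=2, II=2, time_s=0, time_t=1, same_pe=True,
                         connected=False, lrf_free=4, grf_free=8)
```

With d = 2 and II = 2, I(r) = 5, so the data must be buffered across iterations; since producer and consumer share a PE and the LRF has room, the LRF strategy is chosen.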
In addition, there is accumulation reuse, i.e. the computation repeatedly accumulates into the same memory location; the accumulation can be carried out in a local register instead of repeatedly accessing main memory. Given a Store-Load reuse pair r = <s, t, d>, if r' = <t, s, d> is a Load-Store reuse pair, we call r an accumulation reuse. The modification strategies for accumulation reuse are shown in FIG. 4(g) and FIG. 4(h), where the rectangular boxes containing the I and O blocks represent the set of all operations other than memory access operations. The nodes marked I and O are the operations directly connected to the load and store operations (called the input and output operations). The output operation sends its data to the input operation according to the Store-Load reuse modification strategy, and the memory access operations during the accumulation are converted into no-ops.
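The accumulation-reuse test described here (a Store-Load pair whose mirrored pair is a Load-Store pair) can be sketched as follows (Python; the tuple representation of a reuse pair is this example's assumption):

```python
def is_accumulation_reuse(pair, all_pairs):
    """r = (s, t, d, 'Store-Load') is an accumulation reuse when the mirrored
    pair (t, s, d, 'Load-Store') also exists, i.e. the loop keeps loading,
    updating and storing the same location."""
    s, t, d, kind = pair
    return kind == "Store-Load" and (t, s, d, "Load-Store") in all_pairs

# Example: acc is stored each iteration and reloaded the next iteration,
# so both the Store-Load pair and its Load-Store mirror are present.
pairs = {
    ("store acc", "load acc", 1, "Store-Load"),
    ("load acc", "store acc", 1, "Load-Store"),
}
acc = is_accumulation_reuse(("store acc", "load acc", 1, "Store-Load"), pairs)
```

Once a pair is flagged this way, its load and store can be replaced by register traffic between the output and input operations, as in FIG. 4(g) and FIG. 4(h).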
The selected strategies are recorded, and in the configuration file generation stage of step 213 the generated configuration file is modified according to them.
Result evaluation
利用针对背景技术节中提到的以ADRES为基础的典型CGRA结构设计实现的仿真环境,测试集成了本申请提出的循环变换模型和配置文件修改策略在22个典型计算密集型应用测试集,22个计算密集型内核从EEMBC、Polybench和Machsuite等现有基准等来源于数字信号处理、计算机视觉和动态规划的数据集中选择。图5展示了集成本申请的模调度编译器和原始编译器(即不使用本申请,而是使用的基于以下的参考文献[12]和[13]的编译流,记为THP+DP)生成的配置包执行时间比较。结果显示,本申请生成的配置信息可以获得平均44.4%的性能提升,说明本申请所述方案较已有方案可以有效提升粗粒度可重构架构编译器生成的配置信息包的性能。Using the simulation environment designed and implemented for the typical CGRA structure based on ADRES mentioned in the Background Art section, the test integrates the cyclic transformation model and configuration file modification strategy proposed in this application in 22 typical calculation-intensive application test sets, 22 Compute-intensive cores are selected from datasets derived from digital signal processing, computer vision, and dynamic programming, such as existing benchmarks such as EEMBC, Polybench, and Machsuite. Figure 5 shows the integrated modulo scheduling compiler of the present application and the original compiler (that is, not using the present application, but using the compilation flow based on the following references [12] and [13], denoted as THP+DP) generation The configuration package execution time comparison. The results show that the configuration information generated by this application can achieve an average performance improvement of 44.4%, indicating that the solution described in this application can effectively improve the performance of the configuration information package generated by the coarse-grained reconfigurable architecture compiler compared with the existing solutions.
Table 1 compares the number of memory-access operations before and after applying the scheme described in this application:
The average number of memory-access operations is reduced to 55.4% of the previous value, showing that our model greatly reduces the number of memory accesses. In particular, the loop transformation model uncovers potential data-reuse relationships in aes3, bezier1, and strassen1: the number of data-reuse pairs between two consecutive iterations in these three kernels increases from 0, 1, and 14 to 4, 9, and 23, respectively. The other kernels already achieve maximum data reuse in their default form, so the loop transformation model effectively maximizes all available reuse relationships between loop iterations.
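The inter-iteration reuse counted above can be pictured with a 3-point stencil written in plain Python — a hand-written sketch of the idea, not the patent's CGRA mapping. Values loaded in one iteration are rotated through registers and reused one and two iterations later, so per-iteration loads drop from three to one:

```python
def stencil_naive(a):
    # Each iteration loads a[i-1], a[i], a[i+1]: three loads per iteration,
    # even though two of them were already loaded by earlier iterations.
    return [a[i - 1] + a[i] + a[i + 1] for i in range(1, len(a) - 1)]

def stencil_reuse(a):
    # Keep the two previously loaded values in "registers": only the
    # rightmost element is a fresh load, exploiting the load-load reuse
    # pairs of distance 1 and 2 between consecutive iterations.
    out = []
    left, mid = a[0], a[1]
    for i in range(2, len(a)):
        right = a[i]            # the only memory load this iteration
        out.append(left + mid + right)
        left, mid = mid, right  # rotate values across iterations
    return out
```

Both functions compute the same result; the second issues roughly one third of the loads once the pipeline is warm, which is the kind of reduction Table 1 measures at the kernel level.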
References
[12] Z. Zhao et al., "Towards Higher Performance and Robust Compilation for CGRA Modulo Scheduling," IEEE Transactions on Parallel and Distributed Systems, vol. 31, no. 9, pp. 2201-2219, 1 Sept. 2020, doi: 10.1109/TPDS.2020.2989149.
[13] Z. Zhao, Y. Liu, W. Sheng, T. Krishna, Q. Wang, and Z. Mao, "Optimizing the data placement and transformation for multi-bank CGRA computing system," in DATE, 2018, pp. 1087-1092.
The preferred specific embodiments of the present application are described in detail above. It should be understood that a person of ordinary skill in the art can make many modifications and variations according to the concept of the present application without creative effort. Therefore, any technical solution that a person skilled in the art can obtain, on the basis of the prior art, through logical analysis, reasoning, or limited experimentation in accordance with the concept of the present application shall fall within the protection scope determined by the claims.
Claims (20)
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011377746 | 2020-11-30 | ||
CN202011377746.6 | 2020-11-30 | ||
CN202110061629.7 | 2021-01-18 | ||
CN202110061629.7A CN112631610B (en) | 2020-11-30 | 2021-01-18 | A Data Reuse Method to Eliminate Fetch Conflicts for Coarse-grained Reconfigurable Structures |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022110567A1 true WO2022110567A1 (en) | 2022-06-02 |
Family
ID=75294506
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/079524 WO2022110567A1 (en) | 2020-11-30 | 2021-03-08 | Data reuse memory access conflict elimination method for coarse-grained reconfigurable structure |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112631610B (en) |
WO (1) | WO2022110567A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20230195439A1 (en) * | 2021-12-22 | 2023-06-22 | Samsung Electronics Co., Ltd. | Apparatus and method with neural network computation scheduling |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN119225976A (en) * | 2024-11-26 | 2024-12-31 | 天开智算(天津)科技有限公司 | Method, device and storage medium for realizing computing acceleration based on CGRA |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102200924A (en) * | 2011-05-17 | 2011-09-28 | 北京北大众志微系统科技有限责任公司 | Modulus-scheduling-based compiling method and device for realizing circular instruction scheduling |
CN103106067A (en) * | 2013-03-01 | 2013-05-15 | 清华大学 | Optimization method and system of cyclic mapping of processor |
CN105260222A (en) * | 2015-10-13 | 2016-01-20 | 哈尔滨工程大学 | Optimization method for initiation interval between circulating pipeline iterations in reconfigurable compiler |
US20160092285A1 (en) * | 2014-09-25 | 2016-03-31 | Intel Corporation | Method and Apparatus for Approximating Detection of Overlaps Between Memory Ranges |
Family Cites Families (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101101992B1 (en) * | 2010-04-30 | 2012-01-13 | 광운대학교 산학협력단 | Method and apparatus for optimizing application mapping in coarse grained reconstruction array |
CN102043659A (en) * | 2010-12-08 | 2011-05-04 | 上海交通大学 | Compiling device for eliminating memory access conflict and implementation method thereof |
CN102156666B (en) * | 2011-04-20 | 2012-11-28 | 上海交通大学 | Temperature optimizing method for resource scheduling of coarse reconfigurable array processor |
EP2523120A1 (en) * | 2011-05-12 | 2012-11-14 | Imec | Microcomputer architecture for low power efficient baseband processing |
CN102855197A (en) * | 2011-11-08 | 2013-01-02 | 东南大学 | Storage system implementing method for large-scale coarse-grained reconfigurable system |
KR101293701B1 (en) * | 2012-02-23 | 2013-08-06 | 국립대학법인 울산과학기술대학교 산학협력단 | Method and apparatus of executing nested loop on coarse-grained reconfigurable array |
CN103377035A (en) * | 2012-04-12 | 2013-10-30 | 浙江大学 | Pipeline parallelization method for coarse-grained streaming application |
CN203706197U (en) * | 2014-02-10 | 2014-07-09 | 东南大学 | Coarse-granularity dynamic and reconfigurable data regularity control unit structure |
CN103942082B (en) * | 2014-04-02 | 2017-03-29 | 南阳理工学院 | A kind of compiling optimization method of the internal storage access operation for eliminating redundancy |
CN103984560B (en) * | 2014-05-30 | 2017-09-19 | 东南大学 | Large-scale coarse-grained embedded reconfigurable system and its processing method |
CN105528243B (en) * | 2015-07-02 | 2019-01-11 | 中国科学院计算技术研究所 | A kind of priority packet dispatching method and system using data topology information |
CN105302624B (en) * | 2015-09-17 | 2018-10-26 | 哈尔滨工程大学 | Start spacing automatic analysis method between cycle flowing water iteration in a kind of reconfigurable compiling device |
CN105302525B (en) * | 2015-10-16 | 2018-01-05 | 上海交通大学 | Method for parallel processing for the reconfigurable processor of multi-level heterogeneous structure |
US10528356B2 (en) * | 2015-11-04 | 2020-01-07 | International Business Machines Corporation | Tightly coupled processor arrays using coarse grained reconfigurable architecture with iteration level commits |
CN105468568B (en) * | 2015-11-13 | 2018-06-05 | 上海交通大学 | Efficient coarseness restructurable computing system |
CN105335331B (en) * | 2015-12-04 | 2018-08-21 | 东南大学 | A kind of SHA256 realization method and systems based on extensive coarseness reconfigurable processor |
CN105700933A (en) * | 2016-01-12 | 2016-06-22 | 上海交通大学 | Parallelization and loop optimization method and system for a high-level language of reconfigurable processor |
CN105718245B (en) * | 2016-01-18 | 2018-08-28 | 清华大学 | Reconfigurable Computation cyclic mapping optimization method |
CN105867994A (en) * | 2016-04-20 | 2016-08-17 | 上海交通大学 | Instruction scheduling optimization method for coarse-grained reconfigurable architecture complier |
WO2019059927A1 (en) * | 2017-09-22 | 2019-03-28 | Intel Corporation | Loop nest reversal |
2021
- 2021-01-18 CN CN202110061629.7A patent/CN112631610B/en active Active
- 2021-03-08 WO PCT/CN2021/079524 patent/WO2022110567A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN112631610A (en) | 2021-04-09 |
CN112631610B (en) | 2022-04-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112465108B (en) | A Neural Network Compilation Method for Storage and Computing Integrated Platform | |
Lu et al. | Optimizing depthwise separable convolution operations on gpus | |
US11669443B2 (en) | Data layout optimization on processing in memory architecture for executing neural network model | |
Che et al. | Dymaxion: Optimizing memory access patterns for heterogeneous systems | |
Baskaran et al. | Optimizing sparse matrix-vector multiplication on GPUs | |
CN108509270B (en) | A high-performance parallel implementation method of K-means algorithm on domestic Shenwei 26010 many-core processor | |
US20140149719A1 (en) | Arithmetic processing apparatus, control method of arithmetic processing apparatus, and a computer-readable storage medium storing a control program for controlling an arithmetic processing apparatus | |
US20120331278A1 (en) | Branch removal by data shuffling | |
Coarfa et al. | Co-array Fortran performance and potential: An NPB experimental study | |
Jiang et al. | Boyi: A systematic framework for automatically deciding the right execution model of OpenCL applications on FPGAs | |
EP2761435A1 (en) | Cache and/or socket sensitive multi-processor cores breadth-first traversal | |
CN112631610B (en) | A Data Reuse Method to Eliminate Fetch Conflicts for Coarse-grained Reconfigurable Structures | |
US6324629B1 (en) | Method for determining an optimized data organization | |
Ma et al. | Acceleration by inline cache for memory-intensive algorithms on FPGA via high-level synthesis | |
Ye et al. | Hida: A hierarchical dataflow compiler for high-level synthesis | |
Ferry et al. | Increasing fpga accelerators memory bandwidth with a burst-friendly memory layout | |
US20220350863A1 (en) | Technology to minimize the negative impact of cache conflicts caused by incompatible leading dimensions in matrix multiplication and convolution kernels without dimension padding | |
Su et al. | Accelerating inclusion-based pointer analysis on heterogeneous CPU-GPU systems | |
Nakano | Sequential memory access on the unified memory machine with application to the dynamic programming | |
Danckaert et al. | Platform Independent Data Transfer and Storage Exploration Illustrated on Parallel Cavity Detection Algorithm. | |
Xie et al. | MPU: Towards bandwidth-abundant SIMT processor via near-bank computing | |
US20220365891A1 (en) | Accelerator and electronic device including the same | |
Chen et al. | Reducing memory access conflicts with loop transformation and data reuse on coarse-grained reconfigurable architecture | |
Ozdal | Improving efficiency of parallel vertex-centric algorithms for irregular graphs | |
Zhuang et al. | A framework for parallelizing load/stores on embedded processors |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21896087 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 21896087 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 150124) |
|