
WO2022110567A1 - Data-reuse method for eliminating memory access conflicts for a coarse-grained reconfigurable structure - Google Patents

Data-reuse method for eliminating memory access conflicts for a coarse-grained reconfigurable structure

Info

Publication number
WO2022110567A1
WO2022110567A1 · PCT/CN2021/079524 · CN2021079524W
Authority
WO
WIPO (PCT)
Prior art keywords
reuse
load
data
pair
producer
Prior art date
Application number
PCT/CN2021/079524
Other languages
English (en)
Chinese (zh)
Inventor
绳伟光
陈雨歌
蒋剑飞
景乃锋
王琴
毛志刚
Original Assignee
Shanghai Jiao Tong University (上海交通大学)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiao Tong University (上海交通大学)
Publication of WO2022110567A1


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of compilers for coarse-grained reconfigurable architectures, and in particular to a loop transformation model that maximizes effective data reuse on reconfigurable structures and a configuration-file modification strategy for the code generation stage.
  • Coarse-Grained Reconfigurable Architecture (CGRA) is a new hardware architecture with high energy efficiency, mainly used for hardware acceleration of compute-intensive applications such as video processing and neural network operations. Research and investigation show that application execution time is mostly spent in the loop kernels of the program.
  • the reference ADRES[1] model gives a typical CGRA structure, as shown in Figure 1.
  • the compiler abstracts the loop kernel of the program as a data flow graph (DFG) and maps it onto the processing elements (PEs) of the coarse-grained processing element array (PEA), where it is executed using software pipelining.
  • the number of machine cycles between the starts of two consecutive loop iterations is called the initiation interval (II); the smaller the II, the better the acceleration performance.
  • One of the main goals of a compiler for coarse-grained reconfigurable architectures is to efficiently choose scheduling and mapping strategies to minimize the II of the loop kernel.
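For background, the II that the compiler tries to minimize is commonly bounded below by a resource constraint (operations must fit on the PEs each cycle) and a recurrence constraint (loop-carried dependence cycles). The sketch below is standard modulo-scheduling folklore, not an algorithm from this patent; the operation counts and dependence cycles are invented for illustration.

```python
# Illustrative lower bounds on the initiation interval (II) of a
# modulo-scheduled loop kernel; numbers below are made up.

def res_mii(num_ops, num_pes):
    """Resource-constrained bound: all ops of one iteration must fit on the PEA."""
    return -(-num_ops // num_pes)  # ceiling division

def rec_mii(cycles):
    """Recurrence-constrained bound over (total latency, dependence distance) cycles."""
    return max(-(-lat // dist) for lat, dist in cycles)

def min_ii(num_ops, num_pes, cycles):
    return max(res_mii(num_ops, num_pes), rec_mii(cycles))

# 20 operations on a 4x4 (16-PE) array, two dependence cycles.
print(min_ii(20, 16, [(3, 1), (4, 2)]))  # → 3
```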
  • ORF on-chip register file
  • the ORF is connected to each column of PEA through the crossbar and the column bus, and different PEs can access each storage bank of the on-chip memory through the column bus where they are located.
  • if multiple operations occupy the same hardware resource in the same cycle (for example, by accessing the same bank of a multi-bank on-chip memory simultaneously, or by exceeding the bandwidth limit of the column bus when accessing on-chip memory), the software pipeline stalls and the II increases.
  • loop transformation based on the polyhedral model is a newer loop optimization method. Compared with the traditional unimodular-matrix model, its use of affine transformations gives it the advantages of wider applicability, stronger expressive power, and a larger optimization space.
  • compilers in various fields, such as Google MLIR [2], Polly [3], and TVM [4], all use the polyhedral model, but little work has applied polyhedral compilation to the compilation flow for coarse-grained reconfigurable architectures.
  • PolyMap, proposed in reference [5], uses the polyhedral compilation model to adjust the hierarchical structure of the loop kernel based on an analysis of the entire mapping flow, realizing parallelization by unrolling the inner loops of the kernel.
  • reference [6] uses the polyhedral model to represent the program and its transformations, and applies a genetic algorithm to select loop transformations that achieve higher computational efficiency.
  • none of the above studies considered applying the polyhedral model to increase data reuse between loop iterations and thereby reduce the number of redundant memory access operations.
  • reference [7] divides the data into multiple clusters so that different clusters are accessed with probabilities as equal as possible. This resolves conflicts at the granularity of whole arrays, and such a coarse-grained approach yields limited optimization.
  • Reference [8] analyzes all memory access addresses to a single array by the loop kernel in a single control cycle, and uses linear transformation to map different addresses to different storage locations in different on-chip memories.
  • Reference [9] further subdivides the structure of the storage bank on the basis of reference [8], adds data partitions on the block, and increases the flexibility of linear transformation.
  • Reference [10] upgrades the selection of linear transformation parameters from the analysis of single array access in a single control cycle to different array accesses in different control cycles, further reducing multi-bank conflicts.
  • a memory bank merging algorithm is proposed to improve the area-performance ratio of the memory bank.
  • reference [11], building on reference [10], changes the execution time of each memory access operator during scheduling to further reduce access conflicts.
  • a dual force-directed scheduling scheme is proposed that spreads the memory access operations across different control cycles as much as possible.
  • none of the above studies considers reducing the number of memory access operations as a way to reduce memory access conflicts.
  • the purpose of this application is to develop a new method for the CGRA architecture that, in the compilation stage, optimizes the software loop code to maximize the data reuse available between iterations at run time, together with a method for eliminating redundant memory accesses in data-reuse pairs, thereby reducing memory access conflicts during loop execution.
  • Step 1: use the CGRA compiler front end to compile the loop kernel of the source program into a first intermediate representation;
  • Step 2: perform a loop transformation for effective data reuse between iterations to obtain a transformed second intermediate representation;
  • Step 3: generate a data flow graph and obtain a mapping result;
  • Step 4: select a modification strategy for data reuse;
  • Step 5: generate a configuration file.
  • the step 2 further includes:
  • Step 2.1: analyze the dependency and data-reuse relationships in the original code to obtain a dependency set and a reuse set. A reuse relationship exists when two operations in different iterations access the same memory location; the two operations then form a producer/consumer pair comprising a producer and a consumer. The pair types are the Load-Load reuse pair, Load-Store reuse pair, Store-Load reuse pair, and Store-Store reuse pair;
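A minimal sketch of this reuse analysis, under the assumption of one-dimensional affine access functions (address = base + stride × iteration). The operation names and constants below are invented for illustration; the patent's analysis operates on the dependency and reuse sets of the real intermediate representation.

```python
# Classify producer/consumer reuse pairs by brute force over a small
# iteration space. Two accesses in different iterations that touch the
# same address form a reuse pair named "<producer kind>-<consumer kind>".

from itertools import permutations

def addr(op, i):
    """Assumed affine access function of op in iteration i."""
    return op["base"] + op["stride"] * i

def reuse_pairs(ops, iterations):
    pairs = []
    for a, b in permutations(ops, 2):          # a is the earlier access
        for i in iterations:
            for j in iterations:
                if i < j and addr(a, i) == addr(b, j):
                    pairs.append((a["kind"] + "-" + b["kind"],
                                  a["name"], b["name"], j - i))
    return pairs

ops = [
    {"name": "st_A", "kind": "Store", "base": 1, "stride": 1},  # A[i+1] = ...
    {"name": "ld_A", "kind": "Load",  "base": 0, "stride": 1},  # ... = A[i]
]
# The store in iteration i and the load in iteration i+1 hit the same address:
print(reuse_pairs(ops, range(3)))  # two Store-Load pairs at iteration distance 1
```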
  • Step 2.2: constrain the search space of the innermost time vector according to the dependency set; search the remaining candidates that satisfy the constraints in turn, and find the innermost time vector with the largest number of valid reuses in the reuse set; from the obtained optimal innermost time vector, recursively derive the outer-layer time vectors;
  • Step 2.3: generate the second intermediate representation according to the time vector of each layer.
  • step 4 further comprises:
  • Step 4.1: analyze the data reuse relationships in the loop kernel to obtain the available data reuse sets; check in turn whether each reuse relationship satisfies the hardware constraints, and record the finally available reuse relationships in the node information of the producer and consumer of each relationship;
  • Step 4.2: when the configuration file is generated, select the corresponding modification strategy according to the reuse relationship to which each node belongs.
  • a local register LRF (Local register file) is used to buffer the output of the producer, and the consumer loads data from the LRF.
  • alternatively, a global register file (GRF) is used to buffer the data.
  • in step 2, an affine-transformation-based method is used to arrange the execution order of the loop iterations.
  • a first strategy is used to select loop transformations, wherein the first strategy narrows the selection range of candidate loop transformations based on the dependencies, and includes the following steps:
  • cone(M_R) denotes the cone formed by the elements of M_R;
  • cone(M_R − i_r) denotes the cone formed by the elements other than i_r;
  • formula (1) states that the line containing the currently selected time vector must not lie inside the cone formed by the dependency vectors;
  • formula (2) states that the currently selected time vector may be collinear with a vector on the surface of the cone formed by all dependency vectors, with the dependency vector greater than or equal to the time vector (i.e. the time vector equals c times the dependency vector), where c is a real number with 0 < c ≤ 1.
  • in step 2.2, the innermost time vector is selected iteratively using the first strategy, starting from the innermost loop; after an innermost time vector is selected, all dependency vectors are projected onto the normal plane of that time vector, and the next time vector is chosen iteratively; finally, a loop transformation satisfying the loop dependencies is obtained.
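The projection step mentioned here is plain linear algebra and can be written down directly; the vectors in the example are invented for illustration.

```python
# Project a dependency vector d onto the hyperplane normal to the chosen
# time vector t, removing the component of d already satisfied by t.

def project_onto_normal_plane(d, t):
    """Return d - (d.t / t.t) * t."""
    dt = sum(a * b for a, b in zip(d, t))
    tt = sum(a * a for a in t)
    return tuple(di - dt / tt * ti for di, ti in zip(d, t))

print(project_onto_normal_plane((1, 1), (0, 1)))  # → (1.0, 0.0)
```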
  • step 2.2 includes a second strategy, namely a strategy for selecting loop transformations that maximize effective data reuse, which includes the following steps:
  • a memory access operation p is represented by an access function that gives the on-chip memory address accessed by p in iteration i; a given reuse pair is valid when its corresponding formula (3) holds, where c is a constant threshold that helps select valid reuse pairs:
  • each reuse pair has a corresponding instance of formula (3); the optimal innermost time vector is the one that makes formula (3) hold for as many reuse pairs as possible. The optimal-transformation search algorithm first searches for the optimal innermost time vector, then recursively computes the outer-layer time vectors from the inner ones using formula (2).
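Formula (3) itself is not reproduced in this text, so the sketch below substitutes an assumed validity predicate: a reuse pair counts as valid when the reuse distance measured along the candidate innermost time vector is positive and at most a threshold c. The dependence-legality test (t · d > 0) and the small integer search bound are likewise illustrative stand-ins.

```python
# Toy version of the innermost-time-vector search: enumerate small integer
# candidates, keep the dependence-legal ones, and maximize the number of
# "valid" reuse pairs under an assumed stand-in for formula (3).

from itertools import product

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def best_innermost(dep_vectors, reuse_vectors, c=1, bound=2):
    best, best_count = None, -1
    for t in product(range(-bound, bound + 1), repeat=2):
        if t == (0, 0):
            continue
        if any(dot(t, d) <= 0 for d in dep_vectors):  # would break a dependence
            continue
        count = sum(1 for r in reuse_vectors if 0 < dot(t, r) <= c)
        if count > best_count:
            best, best_count = t, count
    return best, best_count

deps = [(1, 0)]               # iteration (i, j) must precede (i+1, j)
reuses = [(0, 1), (1, -1)]    # illustrative reuse distance vectors
print(best_innermost(deps, reuses))  # → ((2, 1), 2): both pairs made valid
```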
  • the modification strategy includes:
  • since the Store-Store reuse pair is not limited by the iteration interval between producer and consumer, the producer in this reuse pair is eliminated.
  • the modification strategy includes:
  • data is transferred from producers to consumers using interconnect resources.
  • the modification strategy includes:
  • the modification strategy includes:
  • if the clock-cycle interval I(r) between the producer and the consumer satisfies 0 < I(r) ≤ 1, and there is an interconnect resource connecting the PE of the producer with the PE of the consumer, the consumer of the reuse pair is translated into a routing operation that reads the data from the producer's output register.
  • the modification strategy includes:
  • if the clock-cycle interval between the producer and the consumer satisfies I(r) > 1, register resources are used to temporarily store the data.
  • the modification strategy includes:
  • the reused data is stored in the LRF, and the consumer is transformed into a routing operation that obtains the data from the LRF, where II is the initiation interval of the software pipeline;
  • pe(s) and pe(t) represent the PE where the producer and consumer are located, respectively.
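Pulling the preceding clauses together, the selection logic can be sketched as one decision function. The condition names, the LRF threshold (I(r) ≤ II with producer and consumer on the same PE), and the returned strings are paraphrases and assumptions, not the patent's literal algorithm.

```python
# Assumed decision procedure for choosing a configuration-file modification
# strategy for a reuse pair r = <s, t, d> with cycle interval I.

def pick_strategy(kind, I, II, connected, same_pe):
    if kind == "Store-Store":
        return "eliminate producer store"             # no interval limit
    if 0 < I <= 1 and connected:
        return "route from producer output register"  # direct interconnect
    if 1 < I <= II and same_pe:
        return "buffer in LRF, consumer becomes routing op"
    if I > 1:
        return "buffer in GRF"
    return "keep original memory accesses"

print(pick_strategy("Store-Load", I=2, II=4, connected=True, same_pe=True))
```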
  • modification strategy includes:
  • GRF is then used to temporarily store data.
  • modification strategy includes:
  • the output operation sends the data to the input operation according to the modification policy of the Store-Load reuse pair, and the memory access operations during accumulation are converted into no-ops.
  • the modification strategy includes:
  • modification strategy includes:
  • GRF is then used to temporarily store data.
  • modification strategy includes:
  • the output operation sends the data to the input operation according to the modification policy of the Store-Load reuse pair, and the memory access operations during accumulation are converted into no-ops.
  • the present application innovatively uses the data reuse relationships between loop iterations to modify the configuration file in the configuration-file generation stage, reducing the number of memory access operations of the loop kernel running on the CGRA and thereby avoiding memory access conflicts.
  • this method is orthogonal to existing methods for the scheduling and mapping phases and for data rearrangement and placement, and it scales well. It can simply be combined with existing data placement, scheduling, and mapping schemes to achieve higher application speedup.
  • Fig. 1 is a typical architecture diagram of a 4x4 CGRA in the prior art
  • FIG. 2 is a schematic diagram of a compiler backend flow diagram of an embodiment of the present application
  • Fig. 3 shows the loop transformation model flow of the embodiment of the present application.
  • FIG. 4 is a schematic diagram of a configuration file modification strategy according to an embodiment of the present application.
  • Figure 5 shows the running time of an embodiment of the present application on 22 kernels on a 4x4 PEA.
  • FIG. 2 is a schematic diagram of the compiler back-end flow according to an embodiment of the present application, wherein:
  • Step 201 adopts the CGRA compiler front-end based on LLVM (Low Level Virtual Machine) to compile the loop kernel part of the source program into the form of intermediate representation (IR);
  • LLVM Low Level Virtual Machine
  • Steps 202-204: for the IR generated by the front end, apply the loop transformation model proposed in this application, which maximizes effective data reuse between iterations, to obtain a new affine-transformed loop-kernel IR;
  • Step 202 analyze the dependency relationship and data reuse relationship in the original code to obtain a dependency set and a reuse set;
  • Step 203: constrain the search space of the innermost time vector according to the dependency set; search the remaining candidates that satisfy the constraints in turn, and find the innermost time vector with the largest number of valid reuses in the reuse set; from the optimal innermost time vector, recursively derive the outer-layer time vectors;
  • Step 204 generating a new intermediate representation according to each layer time vector
  • a data flow graph is generated from the modified intermediate representation, and a mapping result is obtained using existing scheduling, data placement, and spatial mapping schemes; these schemes can be obtained from references [12] and [13], which are incorporated into this application by reference and listed at the end of the detailed description.
  • Steps 211-212: select a modification strategy using the data-reuse-aware configuration file modification scheme proposed in this application;
  • Step 211: analyze all data reuse relationships in the loop kernel to obtain the available data reuse sets; check in turn whether each reuse relationship satisfies the hardware constraints, and record the finally available reuse relationships in the node information of the producer and consumer of each relationship;
  • Step 212 when the configuration file is generated, select a corresponding modification strategy according to the reuse relationship to which the node belongs;
  • Step 213 generating a configuration file according to the previous modification policy.
  • reducing the number of redundant memory access operations in order to reduce memory access conflicts during loop-kernel execution is what distinguishes the present application from other studies.
  • the present application can be combined orthogonally with existing scheduling, data placement, and mapping strategies to further improve the performance of existing optimization work.
  • because the optimization for reducing memory access conflicts is applied in the front-end loop transformation and configuration-file generation stages, the present application can be combined with most existing methods for reducing memory access conflicts, which is another difference between the present application and other research.
  • Reuse pairs can be divided into four categories, namely Load-Load (two load operations), Load-Store (load first, then write), Store-Load (write first, then load), and Store-Store (two write operations) reuse.
  • LRF local register file
  • GRF global register file
  • our compiler should guarantee that the reused data remains available in the LRF or GRF for its whole lifetime. Therefore, the goal of our loop transformation model is to use an affine-transformation-based approach to orchestrate the execution order of loop iterations so as to minimize the lifetime of data reused between iterations.
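A toy illustration of why the iteration order controls reuse lifetime (the loop bounds and access pattern are invented for this example): an access A[j] inside an i/j nest reuses each element once per i-iteration, and interchanging which loop runs innermost changes how many iterations separate the two uses.

```python
# Count the maximum number of iterations between two accesses to the same
# address under two iteration orders of the same 4x8 loop nest.

def reuse_gap(order, Ni=4, Nj=8):
    if order == "j_inner":
        visits = [(i, j) for i in range(Ni) for j in range(Nj)]
    else:  # "i_inner": interchange so that i runs innermost
        visits = [(i, j) for j in range(Nj) for i in range(Ni)]
    last_seen, gaps = {}, []
    for step, (i, j) in enumerate(visits):  # the address accessed is A[j]
        if j in last_seen:
            gaps.append(step - last_seen[j])
        last_seen[j] = step
    return max(gaps)

print(reuse_gap("j_inner"), reuse_gap("i_inner"))  # → 8 1
```

With j innermost, reuses of A[j] are a full inner-loop length apart (lifetime 8 iterations); with i innermost they are adjacent (lifetime 1), which is the kind of lifetime reduction the transformation model targets.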
  • in step 203 shown in FIG. 2, the present application proposes a strategy for narrowing the selection range of candidate loop transformations according to the dependencies.
  • cone(M_R) denotes the cone formed by the elements of M_R;
  • cone(M_R − i_r) denotes the cone formed by the elements other than i_r;
  • formula (1) indicates that the line containing the currently selected time vector must not lie inside the cone formed by the dependency vectors; formula (2) indicates that the currently selected time vector may be collinear with a vector on the surface of the cone formed by all dependency vectors, with the dependency vector greater than or equal to the time vector (i.e. the time vector equals c times the dependency vector), where c is a real number with 0 < c ≤ 1.
  • the present application iteratively applies this strategy, starting from the innermost loop, to select a legal innermost time vector. After an innermost time vector is selected, all dependency vectors are projected onto the normal plane of that time vector, and the next time vector is selected iteratively. Finally, a loop transformation satisfying the loop dependencies is obtained.
  • in step 203 shown in FIG. 2, the present application proposes a strategy for selecting loop transformations that maximize effective data reuse. Whether a Load-Load or Store-Load reuse pair is valid depends on the innermost time vector chosen when searching for the optimal transformation.
  • a memory access operation p is represented by an access function that gives the on-chip memory address accessed by p in iteration i. A given reuse pair is valid when its corresponding formula (3) holds, where c is a constant threshold used to help select valid reuse pairs.
  • each reuse pair has a corresponding instance of formula (3); the optimal innermost time vector is the one for which formula (3) holds for as many reuse pairs as possible. The optimal-transformation search algorithm first searches for the optimal innermost time vector, then uses formula (2) to recursively compute the outer-layer time vectors from the inner ones.
  • to further illustrate the above loop transformation method, the program shown in FIG. 3(a) is taken as an example.
  • the goal of the optimization problem is to maximize the number of equations established in Figure 3(e).
  • the nodes in Fig. 3(g) represent the candidate innermost time vectors; the feasible solutions of the optimization problem are highlighted as gray nodes.
  • Figure 3(f) shows the time vector for each recurrent layer.
  • Figure 3(h) shows the transformed program, in which the innermost loop is split into two layers in order to traverse all iterations.
  • FIG. 4(a) shows the PE structure, where Op denotes an operation placed on a PE: P denotes the producer of a reuse pair, C the consumer, and R a routing node.
  • the LRF owned by each PE is also marked; an LRF drawn with a diagonal pattern indicates that it stores data.
  • Store-Store reuse is not limited by the iteration interval between producer and consumer. If such a reuse relationship exists, the producer in the reuse pair needs to be eliminated.
  • Figure 4(b) shows the modification strategy for Store-Store reuse.
  • interconnect resources or register resources should be used to move the data from the producer to the consumer.
  • a reuse pair is denoted r = ⟨s, t, d⟩, where s is the producer, t is the consumer, and d is the iteration distance between s and t.
  • I(r) = d × II + (time(t) − time(s)), where II is the initiation interval of the software pipeline.
  • when I(r) > 1, register resources must be used to temporarily store the data.
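The interval used in these clauses, I(r) = d·II + (time(t) − time(s)), counts the machine cycles separating producer s and consumer t that are d iterations apart in a software pipeline with initiation interval II. A one-line check (the schedule times below are invented):

```python
# Cycle interval between producer s and consumer t of a reuse pair that are
# d iterations apart in a software pipeline with initiation interval II.

def interval(d, II, time_s, time_t):
    return d * II + (time_t - time_s)

# Consumer scheduled 2 iterations and one time slot after the producer:
print(interval(d=2, II=3, time_s=1, time_t=2))  # → 7
```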
  • the selected strategy will be recorded, and in the configuration file generation stage of step 213, the generated configuration file will be modified through the selected strategy.
  • the test integrates the loop transformation model and the configuration-file modification strategy proposed in this application on a test set of 22 typical compute-intensive applications.
  • the 22 compute-intensive kernels are selected from datasets covering digital signal processing, computer vision, and dynamic programming, drawn from existing benchmark suites such as EEMBC, Polybench, and MachSuite.
  • Figure 5 compares the code generated by the modulo-scheduling compiler integrating the present application with that of the original compiler (i.e., without the present application, using the compilation flow based on references [12] and [13] below, denoted THP+DP).
  • Table 1 shows the comparison of the number of memory access operations before and after the implementation of the solution described in this application:

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

Disclosed is a data-reuse method for eliminating memory access conflicts for a coarse-grained reconfigurable structure. A loop transformation model, applied to a perfectly nested loop kernel, maximizes effective data reuse so as to maximize the data reuse available between iterations during program execution, and a configuration-file modification strategy is applied in the code generation stage. Redundant memory access operations in data-reuse pairs are eliminated, reducing memory access conflicts during loop execution.
PCT/CN2021/079524 2020-11-30 2021-03-08 Data-reuse method for eliminating memory access conflicts for a coarse-grained reconfigurable structure WO2022110567A1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202011377746 2020-11-30
CN202011377746.6 2020-11-30
CN202110061629.7 2021-01-18
CN202110061629.7A CN112631610B (zh) 2020-11-30 2021-01-18 Data-reuse method for eliminating memory access conflicts for a coarse-grained reconfigurable structure

Publications (1)

Publication Number Publication Date
WO2022110567A1 true WO2022110567A1 (fr) 2022-06-02

Family

ID=75294506

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/079524 WO2022110567A1 (fr) 2020-11-30 2021-03-08 Data-reuse method for eliminating memory access conflicts for a coarse-grained reconfigurable structure

Country Status (2)

Country Link
CN (1) CN112631610B (fr)
WO (1) WO2022110567A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230195439A1 (en) * 2021-12-22 2023-06-22 Samsung Electronics Co., Ltd. Apparatus and method with neural network computation scheduling

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119225976A (zh) * 2024-11-26 2024-12-31 天开智算(天津)科技有限公司 Method, device, and storage medium for implementing computation acceleration based on CGRA

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102200924A (zh) * 2011-05-17 2011-09-28 北京北大众志微系统科技有限责任公司 Compiling method and device for implementing loop instruction scheduling based on modulo scheduling
CN103106067A (zh) * 2013-03-01 2013-05-15 清华大学 Optimization method and system for processor loop mapping
CN105260222A (zh) * 2015-10-13 2016-01-20 哈尔滨工程大学 Optimization method for the inter-iteration initiation interval of loop pipelining in a reconfigurable compiler
US20160092285A1 (en) * 2014-09-25 2016-03-31 Intel Corporation Method and Apparatus for Approximating Detection of Overlaps Between Memory Ranges

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101101992B1 (ko) * 2010-04-30 2012-01-13 광운대학교 산학협력단 Method and apparatus for optimizing application mapping in a coarse-grained reconfigurable array
CN102043659A (zh) * 2010-12-08 2011-05-04 上海交通大学 Compiling device for eliminating memory access conflicts and implementation method thereof
CN102156666B (zh) * 2011-04-20 2012-11-28 上海交通大学 Temperature optimization method for resource scheduling of a coarse-grained reconfigurable array processor
EP2523120A1 (fr) * 2011-05-12 2012-11-14 Imec Microcomputer architecture for efficient low-power baseband processing
CN102855197A (zh) * 2011-11-08 2013-01-02 东南大学 Implementation method for the storage system of a large-scale coarse-grained reconfigurable system
KR101293701B1 (ko) * 2012-02-23 2013-08-06 국립대학법인 울산과학기술대학교 산학협력단 Apparatus and method for executing nested loop statements in a coarse-grained reconfigurable array
CN103377035A (zh) * 2012-04-12 2013-10-30 浙江大学 Pipeline parallelization method for coarse-grained streaming applications
CN203706197U (zh) * 2014-02-10 2014-07-09 东南大学 Coarse-grained dynamically reconfigurable data-regularization control unit structure
CN103942082B (zh) * 2014-04-02 2017-03-29 南阳理工学院 Compilation optimization method for eliminating redundant memory access operations
CN103984560B (zh) * 2014-05-30 2017-09-19 东南大学 Large-scale coarse-grained embedded reconfigurable system and processing method thereof
CN105528243B (zh) * 2015-07-02 2019-01-11 中国科学院计算技术研究所 Priority grouping scheduling method and system using data topology information
CN105302624B (zh) * 2015-09-17 2018-10-26 哈尔滨工程大学 Automatic analysis method for the inter-iteration initiation interval of loop pipelining in a reconfigurable compiler
CN105302525B (zh) * 2015-10-16 2018-01-05 上海交通大学 Parallel processing method for reconfigurable processors with a multi-level heterogeneous structure
US10528356B2 (en) * 2015-11-04 2020-01-07 International Business Machines Corporation Tightly coupled processor arrays using coarse grained reconfigurable architecture with iteration level commits
CN105468568B (zh) * 2015-11-13 2018-06-05 上海交通大学 Efficient coarse-grained reconfigurable computing system
CN105335331B (zh) * 2015-12-04 2018-08-21 东南大学 SHA256 implementation method and system based on a large-scale coarse-grained reconfigurable processor
CN105700933A (zh) * 2016-01-12 2016-06-22 上海交通大学 Parallelization and loop optimization method and system for high-level languages on a reconfigurable processor
CN105718245B (zh) * 2016-01-18 2018-08-28 清华大学 Loop mapping optimization method for reconfigurable computing
CN105867994A (zh) * 2016-04-20 2016-08-17 上海交通大学 Instruction scheduling optimization method for a coarse-grained reconfigurable architecture compiler
WO2019059927A1 (fr) * 2017-09-22 2019-03-28 Intel Corporation Loop nesting inversion



Also Published As

Publication number Publication date
CN112631610A (zh) 2021-04-09
CN112631610B (zh) 2022-04-26

Similar Documents

Publication Publication Date Title
CN112465108B (zh) Neural network compilation method for a compute-in-memory platform
Lu et al. Optimizing depthwise separable convolution operations on gpus
US11669443B2 (en) Data layout optimization on processing in memory architecture for executing neural network model
Che et al. Dymaxion: Optimizing memory access patterns for heterogeneous systems
Baskaran et al. Optimizing sparse matrix-vector multiplication on GPUs
CN108509270B (zh) 一种国产申威26010众核处理器上K-means算法的高性能并行实现方法
US20140149719A1 (en) Arithmetic processing apparatus, control method of arithmetic processing apparatus, and a computer-readable storage medium storing a control program for controlling an arithmetic processing apparatus
US20120331278A1 (en) Branch removal by data shuffling
Coarfa et al. Co-array Fortran performance and potential: An NPB experimental study
Jiang et al. Boyi: A systematic framework for automatically deciding the right execution model of OpenCL applications on FPGAs
EP2761435A1 (fr) Parcours en largeur d'abord de multiples c urs de processeur sensible au cache et/ou à la prise
CN112631610B (zh) Data-reuse method for eliminating memory access conflicts for a coarse-grained reconfigurable structure
US6324629B1 (en) Method for determining an optimized data organization
Ma et al. Acceleration by inline cache for memory-intensive algorithms on FPGA via high-level synthesis
Ye et al. Hida: A hierarchical dataflow compiler for high-level synthesis
Ferry et al. Increasing fpga accelerators memory bandwidth with a burst-friendly memory layout
US20220350863A1 (en) Technology to minimize the negative impact of cache conflicts caused by incompatible leading dimensions in matrix multiplication and convolution kernels without dimension padding
Su et al. Accelerating inclusion-based pointer analysis on heterogeneous CPU-GPU systems
Nakano Sequential memory access on the unified memory machine with application to the dynamic programming
Danckaert et al. Platform Independent Data Transfer and Storage Exploration Illustrated on Parallel Cavity Detection Algorithm.
Xie et al. MPU: Towards bandwidth-abundant SIMT processor via near-bank computing
US20220365891A1 (en) Accelerator and electronic device including the same
Chen et al. Reducing memory access conflicts with loop transformation and data reuse on coarse-grained reconfigurable architecture
Ozdal Improving efficiency of parallel vertex-centric algorithms for irregular graphs
Zhuang et al. A framework for parallelizing load/stores on embedded processors

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21896087

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21896087

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 21896087

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 150124)

122 Ep: pct application non-entry in european phase

Ref document number: 21896087

Country of ref document: EP

Kind code of ref document: A1
