
WO2009001368A2 - Method and System on Chip fabric - Google Patents

Method and System on Chip fabric

Info

Publication number
WO2009001368A2
WO2009001368A2 (PCT/IN2007/000262, IN2007000262W)
Authority
WO
WIPO (PCT)
Prior art keywords
fabric
clusters
cluster
execution
units
Prior art date
Application number
PCT/IN2007/000262
Other languages
English (en)
Other versions
WO2009001368A3 (fr)
Inventor
Soumitra Kumar Nandy
Ranjani Narayan
Keshavan Varadarajan
Mythri Alle
Amar Nath Satrawala
Shimoga Janakiram Adarsha Rao
Original Assignee
Indian Institute Of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Indian Institute Of Science filed Critical Indian Institute Of Science
Priority to PCT/IN2007/000262 priority Critical patent/WO2009001368A2/fr
Publication of WO2009001368A2 publication Critical patent/WO2009001368A2/fr
Publication of WO2009001368A3 publication Critical patent/WO2009001368A3/fr

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7867Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This invention relates to a System on Chip fabric in which compute, storage and communication resources can be aggregated at runtime to perform specific application tasks, along with a method and system to complement this.
  • SoC platforms are programmable platforms comprising a variety of processing cores (RISC, DSP, application engines, coprocessors etc.), for which applications are developed in a High Level Language (HLL).
  • HLL High Level Language
  • Application development in HLL relies largely on the compiler infrastructure.
  • the micro-architectural resources are exposed to the compiler, so that functionalities/modules (of the application) are expressed as an execution sequence of instructions such that the structural and execution semantics imposed by the architecture are adhered to.
  • This is achieved by following an execution pipeline in terms of instruction fetch/decode, execute and memory write-back stages. Consequently, it becomes necessary to follow the process of application decomposition into modules that can be implemented on cores.
  • the diktat of the given platform and execution paradigm determines the application decomposition and execution.
  • the shortcomings manifest as compromised performance, and inefficient utilization of micro-architectural resources, since the application is tailored to fit the given platform.
  • Multiprocessor System-on-Chip With regard to Multiprocessor System-on-Chip (MP-SoCs), the key to performance is to arrive at an optimal partition of the application into modules, which can be assigned to the heterogeneous processing cores. While software solutions offer flexibility in the realization of applications within certain performance levels, they cannot ensure scalability both in terms of performance and application enhancements. Hardware solutions, as in ASICs, on the other hand guarantee performance at the cost of flexibility and the large NRE cost associated with silicon fabrication of the device.
  • Field Programmable Gate Arrays FPGAs
  • FPGAs offer a platform in which computational structures can be composed in terms of Look-up Tables (LUTs), which serve as logic elements. LUTs in the FPGA are used as universal logic gates that can emulate any logical operation.
  • the composition of computational structures i.e.
  • ASIC targeted application synthesis This involves taking a high-level language description (typically C) of the application and transforming the same into RTL to make it hardware synthesizable.
  • C high-level language description
  • the application synthesis is a process of customizing template processor architectures for the given application and addition of new instructions to the Instruction Set Architecture of the template processor.
  • the process of customizing involves determining the number of processors connected together, the interconnection bandwidth between these processors, the buffer sizes used in communication, the instructions to be included/excluded in the processor etc.
  • This technique of ASIC targeted application synthesis reduces the Non Recurring Engineering (NRE) costs associated with design time and the number of designers required to accomplish the task. However, this does not address the NRE costs associated with the back end process of design (i.e.
  • NRE Non Recurring Engineering
  • this technique is not the preferred route if the number of instances of the final chip to be produced is a very small number, since it is not cost effective to manufacture ASICs in small numbers. Even though the end product may be technically superior (w.r.t performance and power characteristics) it does not make economic sense in the age of "short lifetime" gadgets.
  • the application substructures chosen for this hardware platform must be optimal, for the given hardware platform to improve performance of the application.
  • the application substructures are chosen to match the given hardware. Solutions in this space include DAPDNA from IPFlex [7], Stretch from Stretch Inc and MOLEN [9].
  • the DAPDNA platform contains 376 ALUs packed in 6 segments.
  • the application is expressed in a language called Dataflow C and then converted into a hardware configuration through the process of compilation.
  • the greatest limitation of this solution was the design entry point.
  • the language Dataflow C is very restrictive.
  • another limitation of the DAPDNA approach is the time required to reconfigure the fabric. It takes about 200 cycles to load a new configuration.
  • MOLEN [9] and the solution from Stretch Inc. [13] are identical in their approaches. There is a core processor in both cases.
  • An improved design must address the following design constraints, in order to overcome the limitations of prior art: • Any solution that needs to extract as much performance as possible from a given application attempts to place communicating entities closer together, and tries to reduce the overhead of communication. This is one of the optimization criteria for the floor planning/placement step during back end processing to generate a die from RTL (for ASIC manufacture). The same optimization criterion is also used on FPGAs, when performing placement and routing.
  • ASICs are the most efficient platforms with regard to performance and power efficiency.
  • NRE Non Recurring Engineering
  • the platform must comprise homogeneous building blocks (BBs), similar to an FPGA.
  • BBs homogeneous building blocks
  • the granularity of the BBs, unlike FPGAs, must be amenable to structural reorganization (runtime reconfiguration) so that application specific combinational/sequential circuits can be realized.
  • the BBs must satisfy the universality criteria, i.e. they must support all possible elementary operations.
  • the universal nature of BBs will help reduce the design and development time for any new application and help support application scalability. Further a regular interconnect connecting the BBs would maintain regularity in their access pattern.
  • the platform must have high fault resilience to support fault-free operation, with the working subset of resources. These characteristics make the job of composition tractable, since there are no additional restrictions placed by hardware (other than the constraints imposed by the application due to computation characteristics).
  • FPGAs can support run time reconfiguration, but in practice, the very large configuration data makes it infeasible.
  • RTL independence The identification of application modules/functionalities and their mapping onto the platform must be independent of RTL, to enable early prototyping and to achieve application synthesis from HLL specification.
  • HLL Application development in HLL: The platform must support a synthesis methodology by which applications developed in HLL can be directly realized on the platform.
  • Scalability The platform must support hardware scalability, i.e. increasing capacity/capability of the platform by increasing the number of building blocks. In this invention, the use of run time configurable hardware as a potential method to satisfy the above-cited requirements is proposed.
  • It is another object of the present invention to provide a system relating to System on Chip fabric comprising, (a) A scheduler, (b) A cluster configuration store that contains the configuration for all possible clusters (which are defined as partitions that are disjoint sub-graphs of the dataflow graph or application graph) of the application, similar to an instruction store in a traditional architecture.
  • the cluster configuration store cannot be overwritten during the course of execution, (c) an execution fabric containing a plurality of computational resources, referred to as Operation Service Units (OSUs), storage units and switches, which are connected through a regular interconnection wherein an additional overlay network is available to facilitate communication between two resources which are not directly connected by the interconnection, (d) a resource binding agent, which is the logic that maps virtually bound clusters (a group of instructions that have strong producer-consumer relationship) to the execution fabric.
  • OSUs Operation Service Units
  • the binding determines unoccupied OSUs onto which the operations are mapped, the cluster configuration for the "Will Fire" clusters being obtained from the Cluster Configuration Store, (e) a Load Store Unit that handles all memory operations generated by the execution fabric, wherein a Controlled Dataflow paradigm is used wherein the memory is primarily used to store global variables, non-scalar variables and for pointer based manipulations, and (f) Store Destination Decision Logic (SDDL) that is responsible for determining where the output of a given cluster must be written to, wherein if the output data is meant for a cluster for which no input data is yet available then a new line is allocated within the scheduler, and if the output data is meant for a cluster for which some of the inputs have arrived, then the new data operand is written in the line already allocated to the cluster instance.
  • SDDL Store Destination Decision Logic
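  • As an illustrative sketch only (the data structures, field names and sizes below are assumptions and are not taken from the patent text), the SDDL decision described above can be modelled in C: when an output operand arrives for a destination cluster instance, a new scheduler line is allocated if none exists yet; otherwise the operand is written into the line already allocated to that instance.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_SCHED_LINES 64   /* assumed scheduler capacity          */
#define MAX_OPERANDS     8   /* assumed maximum inputs per cluster  */

/* One scheduler line gathers the inputs of one cluster instance. */
typedef struct {
    bool     in_use;
    uint32_t cluster_id;     /* compiler-generated cluster number   */
    uint32_t instance_id;    /* dynamic instance of that cluster    */
    uint32_t operands[MAX_OPERANDS];
    uint32_t avail_bitmap;   /* bit i set => operand i has arrived  */
} sched_line_t;

static sched_line_t scheduler[MAX_SCHED_LINES];

/* Store Destination Decision Logic (sketch): route one produced operand
 * to the scheduler line of its destination cluster instance.            */
void sddl_deliver(uint32_t cluster_id, uint32_t instance_id,
                  unsigned operand_slot, uint32_t value)
{
    int free_line = -1;

    /* Look for a line already allocated to this cluster instance. */
    for (int i = 0; i < MAX_SCHED_LINES; ++i) {
        if (!scheduler[i].in_use) {
            if (free_line < 0) free_line = i;
            continue;
        }
        if (scheduler[i].cluster_id == cluster_id &&
            scheduler[i].instance_id == instance_id) {
            /* Some inputs have already arrived: write into the same line. */
            scheduler[i].operands[operand_slot] = value;
            scheduler[i].avail_bitmap |= 1u << operand_slot;
            return;
        }
    }

    /* No input has arrived yet for this instance: allocate a new line. */
    if (free_line >= 0) {
        sched_line_t *l = &scheduler[free_line];
        l->in_use       = true;
        l->cluster_id   = cluster_id;
        l->instance_id  = instance_id;
        l->operands[operand_slot] = value;
        l->avail_bitmap = 1u << operand_slot;
    }
    /* A real design would also handle the "scheduler full" case. */
}
```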
  • This SoC fabric in which compute, storage and communication resources can be aggregated at runtime to perform specific application tasks, is the focus of this invention.
  • Traditional compiler infrastructures are not capable of realizing application modules as an execution sequence on composed resources.
  • This invention proposes a methodology in which modules are compiled into a Virtual Instruction Set Architecture (VISA) from which dataflow graphs corresponding to these modules are generated and executed on computational structures composed at runtime on the REDEFINE fabric.
  • VISA Virtual Instruction Set Architecture
  • Fig. 1 shows a self-addressed active storage unit, which determines the next token to be issued to the OSU based on which it can fire.
  • Fig. 2 shows a regular tessellation formed using equilateral triangles.
  • Fig. 3 shows rectangular and hexagonal tessellations meeting specified constraints.
  • Fig. 4 shows the system of the present invention.
  • Fig. 5 shows the method of the present invention.
  • Fig. 5a shows the step of controlled dataflow execution, in the method of the present invention, in more detail.
  • Fig. 5b shows a cluster that has a fixed number of inputs and outputs.
  • Fig. 6a describes the templates showing mapping of a monadic operation.
  • Fig. 6b describes the templates showing mapping of a dyadic operation.
  • Fig. 7 shows the C language description of matrix vector multiply function.
  • Fig. 8 shows the Data Dependence graph of a matrix vector multiply kernel.
  • Fig 9 shows the mapping of the data dependence graph onto a hexagonal fabric.
  • Fig. 10 shows the Data flow graph of a Fibonacci kernel.
  • Fig.11 shows an unoptimized mapping of the data flow graph on the hexagonal fabric.
  • Fig.12 shows the optimized Mapping of Fibonacci function on the hexagonal fabric.
  • a Computational Structure is the subset of the hardware resources provisioned for execution of a subgraph from the dataflow graph of the application.
  • The fabric proposed in the present invention, also referred to as the REDEFINE fabric, is a regular interconnection of resources comprising computation units, storage units and switches.
  • Each FU is capable of running a single operation. Examples of typical operations include add, multiply, AND, OR, NOT etc. In this case, the granularity of the application graph is in terms of these primitive operations.
  • Arithmetic Logic Units Unlike FUs, each ALU is capable of executing several operations. The use of ALUs instead of FUs makes the process of mapping the application graph onto the fabric simpler, since each ALU supports more than one operation.
  • a FU has only the necessary and sufficient logic required to execute a particular operation.
  • the use of FUs in a fabric increases utilization, since it is not overloaded with logic to execute different operations.
  • identification of subgraphs of the application graph to match the level of granularity imposed by the FUs and their interconnection is complex.
  • ALUs on the other hand have the logic for generic computations and hence make the problem of identification and mapping of subgraphs simpler.
  • the choice of computation unit is dependent on the domain of applications that the fabric targets, parameters of optimization (viz. power, utilization) etc. We refer to the chosen computation unit as the Operation Service Unit (OSU).
  • Storage Units
  • the storage units serve as placeholders for the input data, the control information for the FU and the intermediate results from OSUs. Any traditional control driven computational paradigm can be supported with a simple passive storage.
  • a distributed token matching unit is maintained in the storage units, necessitating active storage units.
  • Active storage units are small SRAMs/Register Files.
  • Each line accommodates the operands and predicates of an operation.
  • a bitmap, called operand availability bitmap maintains which operations have all inputs ready. All operations whose inputs are ready "can fire”.
  • One among the "Can Fire” operations, called the "Will Fire” operation is chosen for execution on the OSU. The choice of "Will Fire" operations is made by the priority encoder.
  • Each line may additionally contain the control information for the OSU (viz. opcode in case ALUs are used, destination storage unit, tag for output generated).
  • the storage unit serves as a wait and match unit of the dataflow architecture.
  • Fig. 1 shows a self-addressed active storage unit, which determines the next operation to be issued to the OSU based on which it can fire.
  • the self-addressed storage unit is used inside the execution fabric for scheduling operations on the OSU (42 in Fig. 4).
  • the operand availability bitmap 1 is connected to a priority encoder 2, which in turn is connected to a row decoder 3 to compute the exact row in which the data 4 is available. 4 is capable of holding all the inputs for an operation. For each operand present in 4, there is a bit in 1. This bit indicates whether the operand has arrived or not. Once all the bits are set to 1, an input along that line to 2 (the priority encoder) is held high.
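  • A minimal software model of this self-addressed behaviour is sketched below (purely illustrative; the field names, widths and line count are assumptions): each line's operand-availability bits are compared against the operand mask of its operation, and a priority encoder picks the lowest-numbered line that "Can Fire" as the line to issue to the OSU.

```c
#include <stdint.h>

#define LINES 16   /* assumed number of lines in the active storage unit */

typedef struct {
    uint8_t  needed_mask;  /* bit i set => operand i is required           */
    uint8_t  avail_mask;   /* bit i set => operand i has arrived           */
    uint32_t operands[3];  /* up to two data operands and a predicate      */
    uint8_t  opcode;       /* control information for the OSU (if an ALU)  */
} storage_line_t;

/* Priority-encoder model: return the index of the first line whose
 * required operands are all available ("Will Fire"), or -1 if no line
 * "Can Fire" yet. The row decoder would then drive that row to the OSU. */
int select_will_fire(const storage_line_t lines[LINES])
{
    for (int i = 0; i < LINES; ++i) {
        if (lines[i].needed_mask != 0 &&
            (lines[i].avail_mask & lines[i].needed_mask) == lines[i].needed_mask)
            return i;
    }
    return -1;
}
```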
  • a regular interconnection network is defined in terms of switches such that data communication can be effected among OSUs and storage units; the switches also serve as forwarding agents for data.
  • Among regular interconnection networks, we consider only those which can yield a planar interconnection of switches to enable easy realization in VLSI.
  • Regular fabric is a tessellation of a single unit (or the geometric transformations of this single unit viz. its rotation, its reflection) repeated in 2D space.
  • Fig. 2 shows a regular tessellation formed using equilateral triangles 10, 11. Tessellations using squares and hexagons are shown in Fig. 3 wherein the corners 21a and 21b represent an OSU and a storage unit, respectively. A pentagon or any other polygon does not satisfy this property.
  • the fabric needs to contain a good mix of OSU and Storage Units placed optimally and interconnected in a way that mimics execution flow in hardware.
  • the interconnection between OSUs and storage units is akin to a graph having edges between nodes, where nodes represent either OSUs or storage units. There can only be three kinds of edges in this graph, namely:
    o type 1: between an OSU and a Storage Unit
    o type 2: between two OSUs
    o type 3: between two Storage Units.
  • type 2 and type 3 edges are not desired because
  • Two OSUs placed adjacent to each other will increase the complexity of the FSM of the combined unit.
  • Two OSUs placed adjacent to each other are tantamount to having two synchronous OSUs, which may have to be maintained as a 2-stage pipeline or a vector unit. In either case, a storage unit separating the two OSUs becomes necessary.
  • any regular structure with only type 1 edges is bipartite.
  • a triangular tessellation is not bipartite, whereas both square and hexagonal structures are, as shown in Fig 3.
  • the square tessellation i.e. mesh structure is more prevalent in designs.
  • the hexagon and square tessellations are chosen based on the nature of the application and its communication characteristics.
  • Triangular tessellation is not bipartite; however, such networks can be employed with an appropriately designed OSU that has storage units integrated into it.
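  • A small illustration of why only type-1 edges arise on such fabrics (the coordinates and fabric size are assumptions, not from the patent): on a square (mesh) tessellation, a checkerboard 2-colouring with OSUs on one colour and storage units on the other guarantees that every neighbouring pair is an OSU-storage pair, i.e. the structure is bipartite.

```c
#include <stdio.h>

/* On a square tessellation, cells where (x + y) is even hold OSUs and odd
 * cells hold storage units, so every horizontal/vertical neighbour pair is
 * a type-1 (OSU <-> storage) edge.                                        */
int main(void)
{
    const int N = 6;                       /* assumed fabric size */
    for (int y = 0; y < N; ++y) {
        for (int x = 0; x < N; ++x)
            putchar(((x + y) % 2 == 0) ? 'O' : 'S');  /* O = OSU, S = storage */
        putchar('\n');
    }
    return 0;
}
```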
  • HLL is compiled into computational structures that directly execute on the fabric as combinational/sequential circuits. The details of the various steps of the execution orchestration are given below:
  • the HLL description is translated into an intermediate representation (IR) with an orthogonal instruction set in Static Single Assignment (SSA) Form.
  • SSA Static Single Assignment
  • the orthogonal set of instructions is referred to as the Virtual Instruction Set Architecture (VISA).
  • VISA Virtual Instruction Set Architecture
  • the use of an orthogonal instruction set makes the dataflow graph architecture independent (i.e. ISA independent)
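  • For illustration only (this example is not taken from the patent), the effect of SSA conversion on a small C fragment is shown below: every value is defined exactly once, so each definition becomes one node of the dataflow graph and each use becomes an edge.

```c
#include <stdio.h>

/* Non-SSA form: 'x' is assigned twice. */
static int kernel(int a, int b, int c)
{
    int x = a + b;
    x = x * c;
    return x - a;
}

/* Equivalent SSA form: each name is defined once, so the def-use chain
 * (a,b) -> x1 -> x2 -> y1 reads directly as a dataflow graph.          */
static int kernel_ssa(int a, int b, int c)
{
    int x1 = a + b;
    int x2 = x1 * c;
    int y1 = x2 - a;
    return y1;
}

int main(void)
{
    printf("%d %d\n", kernel(2, 3, 4), kernel_ssa(2, 3, 4));  /* both print 18 */
    return 0;
}
```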
  • Compilation into Clusters The Dataflow Graph of the application closely mimics the flow of signals in hardware.
  • the dataflow graph, referred to henceforth as the application graph, is partitioned into disjoint subgraphs called clusters. In the initial state, every node of the dataflow graph may be considered as an independent cluster.
  • the criteria for merging two clusters to obtain bigger aggregations are:
    o Two communicating clusters are candidates for merger into a single cluster.
    o The number of nodes in a cluster may not exceed a pre-determined threshold.
    o The number of inputs and outputs to the cluster may not exceed a pre-determined threshold.
    o Two communicating nodes that are separated by several levels in the dataflow graph cannot be merged if the absolute difference between their levels exceeds a certain threshold.
    o Two clusters cannot be merged if there exist more than a certain predefined maximum number of nodes guarded by complementary predicates.
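  • A hedged sketch of these merge checks is given below (the thresholds, the cluster representation and the field names are assumptions made for illustration; the patent does not prescribe this code): two clusters are merged only if they communicate and none of the listed limits would be violated.

```c
#include <stdbool.h>

/* Illustrative thresholds; the actual values are design parameters. */
#define MAX_NODES_PER_CLUSTER   16
#define MAX_CLUSTER_INPUTS       8
#define MAX_CLUSTER_OUTPUTS      8
#define MAX_LEVEL_DIFFERENCE     4
#define MAX_COMPLEMENTARY_PREDS  2

typedef struct {
    int num_nodes;
    int num_inputs;
    int num_outputs;
    int min_level;             /* levels come from a topological sort        */
    int max_level;
    int complementary_preds;   /* nodes guarded by complementary predicates  */
    int edges_to_other;        /* > 0 if the two clusters communicate        */
} cluster_t;

/* Decide whether clusters a and b may be merged into a single cluster.
 * Summing the I/O counts is conservative: edges internal to the merged
 * cluster would actually become invisible at the cluster boundary.       */
bool can_merge(const cluster_t *a, const cluster_t *b)
{
    if (a->edges_to_other == 0)                       /* must communicate */
        return false;
    if (a->num_nodes + b->num_nodes > MAX_NODES_PER_CLUSTER)
        return false;
    if (a->num_inputs + b->num_inputs > MAX_CLUSTER_INPUTS)
        return false;
    if (a->num_outputs + b->num_outputs > MAX_CLUSTER_OUTPUTS)
        return false;

    /* Nodes too far apart in the level order are not merged. */
    int hi = a->max_level > b->max_level ? a->max_level : b->max_level;
    int lo = a->min_level < b->min_level ? a->min_level : b->min_level;
    if (hi - lo > MAX_LEVEL_DIFFERENCE)
        return false;

    if (a->complementary_preds + b->complementary_preds > MAX_COMPLEMENTARY_PREDS)
        return false;
    return true;
}
```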
  • the communication within the cluster is maximized and the communication across clusters is minimized.
  • the clustering algorithm is independent of the high level language syntaxes viz. loops as subgraphs, functions as subgraphs.
  • the application graph is compiled into a set of clusters.
  • the optimal computational structure for the cluster is determined based on the interactions of nodes within the cluster.
  • the communication patterns between instructions are mapped onto the fabric to compose computational structures at runtime. It casts the communication pattern of the application as a subset of the communication capability available in the fabric. This mimics the construction of a circuit at a grosser level.
  • the composition of the cluster is independent of the high level language syntaxes viz. loops as subgraphs, functions as subgraphs.
  • Levels are assigned by performing a topological sort. The composition of the clusters and their mapping on the fabric are determined at compile time; the actual binding of clusters to resources on the fabric takes place at runtime.
  • Controlled Dataflow Execution The Controlled Dataflow Execution Paradigm is used to schedule and execute clusters of operations onto REDEFINE. Even though the overall schedule of the clusters is guided by the application, different execution paradigms use different scheduling strategies in order to maximize resource utilization and instruction throughput, and hence application performance. In Controlled Dataflow, the scheduling of clusters is based on a Dataflow schedule akin to the scheduling of instructions in a traditional dataflow machine. The cluster is treated as a "hyper" operation with multiple inputs and outputs. A scheduler identifies the clusters to be launched for execution. The Scheduler identifies as many clusters as possible for scheduling to maximally utilize resources on the fabric. Depending on the availability of resources on the fabric, a subset of clusters ready for execution is selected.
  • Dataflow execution semantics identifies clusters, which "Can Fire".
  • the scheduler determines which of these clusters "Will Fire”.
  • the primary difference over a traditional Dataflow architecture is the use of hierarchical scheduling. At the higher level clusters are scheduled using dataflow semantics. Once a cluster is chosen for being scheduled, the instructions contained in a cluster are also chosen using a dataflow schedule.
  • the cluster scheduling follows the dynamic dataflow scheme allowing multiple cluster instances to execute simultaneously, while at the instruction level a static dataflow scheme is used.
  • static dataflow scheduling at the level of instructions does not have the disadvantages of the traditional static dataflow machine. Static dataflow machines cannot support execution of multiple instances of the same instruction simultaneously, which prevents them from exploiting inter-iteration parallelism that may exist in a loop.
  • Hierarchical scheduling with dynamic dataflow based cluster scheduling helps keep the number of operands from exploding during high ILP periods.
  • the use of clusters also helps in reducing the communication overheads. This is because data produced and consumed within a cluster has no visibility outside the cluster instance. Due to the reduced number of "visible" outputs (at cluster level), the complexity of the Wait Match Unit (used in traditional dataflow architectures) is reduced. Only data writes to global load/stores and writes to other clusters are made visible. The entire cluster executes as an atomic operation.
  • Issue logic The "Will Fire" clusters are issued to the fabric by the issue logic. In order to issue a cluster, the issue logic needs to identify the resources on the fabric where the given cluster will be mapped. Once the resources on the fabric are identified, the issue logic does the following:
    o Writes the cluster inputs into the identified storage locations.
    o Writes the configuration information (viz. opcode) for the OSU into the related storage unit.
  • The process of issuing clusters and their assignment to hardware is called binding. The decision to bind clusters at runtime increases the complexity of the hardware (when compared to a VLIW architecture), but the hardware is no longer as complex as the instruction execution pipeline in a Superscalar processor.
  • the high-level block diagram of the REDEFINE platform is shown in Figure 4. 41, 42, 43 and 49 form a part of the issue logic described previously. 45 is built of computation units and storage units interconnected through a regular network (as described above).
  • Scheduler (42): The Scheduler is responsible for scheduling clusters. The scheduler determines which clusters "Can Fire" and which cluster "Will Fire". Any cluster that has all inputs available "Can Fire". In order to choose one cluster to fire, priority is used. The compiler infrastructure suggests a priority for a cluster, based on whether the cluster appears on the critical path. Several other factors are also considered for determining the priority. The design of the scheduler is very similar to the storage unit described previously (Figure 1). Unlike the operation storage unit, it takes in more operands. Hence, more operand slots and operand availability bits are required.
  • Cluster Configuration Store (41): The cluster configuration store contains the configuration for all possible clusters of the application. This is similar to an instruction store in a traditional architecture.
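  • The cluster-level selection made by the scheduler can be pictured with the sketch below (illustrative only; the priority scheme and data layout are assumptions): among all scheduler lines whose inputs are complete ("Can Fire"), the line with the highest compiler-suggested priority is chosen as the "Will Fire" cluster.

```c
#include <stdint.h>

#define MAX_SCHED_LINES 64   /* assumed scheduler capacity */

typedef struct {
    int      in_use;
    uint32_t cluster_id;
    uint32_t needed_mask;   /* which cluster inputs are required            */
    uint32_t avail_mask;    /* which cluster inputs have arrived            */
    int      priority;      /* compiler-suggested priority (critical path)  */
} cluster_line_t;

/* Return the index of the "Will Fire" cluster line, i.e. the highest
 * priority line among those that "Can Fire", or -1 if none is ready.   */
int pick_will_fire(const cluster_line_t lines[], int n)
{
    int best = -1;
    for (int i = 0; i < n; ++i) {
        if (!lines[i].in_use)
            continue;
        int can_fire = (lines[i].avail_mask & lines[i].needed_mask)
                       == lines[i].needed_mask;
        if (can_fire && (best < 0 || lines[i].priority > lines[best].priority))
            best = i;
    }
    return best;
}
```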
  • Execution Fabric The execution fabric contains OSU and storage units connected through a regular interconnection. An additional overlay network is available to facilitate communication between two resources, which are not directly connected by the interconnection.
  • Load Store Unit handles all memory operations generated by the execution fabric.
  • the memory is primarily used to store global variables, non-scalar variables and for performing pointer based memory operations.
  • SDDL Store Destination Decision Logic
  • 42 selects a cluster for execution.
  • 42 provides the cluster number to be scheduled to 41 and 43.
  • 41 contains the instructions included in a cluster. It also includes the resource requirement specification for the cluster along with the mapping of cluster input data to instruction input data. The resource requirement specification is a kxk matrix that indicates how many OSUs are needed for the execution of the cluster and the positions in the fabric where they are needed.
  • 42 also provides the input operands for the cluster instance, which is chosen for execution.
  • 41 supplies the resource requirement specification.
  • 43 has data structures which indicate the regions of 45 that are not being used. 43 matches the cluster requirement specification with the available resources and tries to find a match. If the resource requirement can be satisfied then the instructions are mapped on to the respective OSUs.
  • 45 is a collection of OSUs, storage units and switches which are interconnected in a predetermined manner. Once the region within 45 is identified, the cluster inputs are mapped to the instruction operands, and they are forwarded to the appropriate OSUs. On 45, each OSU executes all the instructions that have been mapped to it. Once the OSU completes execution of all its instructions, a message (44) is sent back to 43 indicating that the OSU is free. The results (48) of computation that need to be sent to other cluster instances are relayed to 49. 49 looks at the destination cluster identifier (which is a compiler-generated number associated with a cluster). Several instances of that cluster may be created during the execution time of the program. 49 determines the right cluster instance for which the data is destined and then forwards the data to the right location within 42.
  • the destination cluster identifier, which is a compiler-generated number associated with a cluster.
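  • A simplified model of the matching step performed by the resource binding agent (43) is sketched below (the kxk requirement matrix and the free-resource map representation are assumptions for illustration): the requirement matrix is slid over the map of unoccupied OSUs and the cluster is bound at the first position where every required OSU is free.

```c
#include <stdbool.h>

#define FABRIC_DIM 8   /* assumed fabric size                          */
#define K          3   /* assumed size of the kxk requirement matrix   */

/* free_map[r][c] is true if the OSU at (r,c) on the fabric is unoccupied. */
/* req[i][j]      is true if the cluster needs an OSU at that relative spot. */
bool bind_cluster(const bool free_map[FABRIC_DIM][FABRIC_DIM],
                  const bool req[K][K],
                  int *out_row, int *out_col)
{
    for (int r = 0; r + K <= FABRIC_DIM; ++r) {
        for (int c = 0; c + K <= FABRIC_DIM; ++c) {
            bool fits = true;
            for (int i = 0; fits && i < K; ++i)
                for (int j = 0; fits && j < K; ++j)
                    if (req[i][j] && !free_map[r + i][c + j])
                        fits = false;
            if (fits) {
                /* Bind here: the issue logic would now write operands and
                 * opcodes into the storage units of this region.           */
                *out_row = r;
                *out_col = c;
                return true;
            }
        }
    }
    return false;   /* the cluster must wait for resources to free up */
}
```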
  • the method of the present invention is depicted in Fig. 5 wherein compute, storage and communication resources can be aggregated at runtime to carry out specific application tasks for maximizing power and performance efficiency in a scalable fashion.
  • This one time design followed by subsequent runtime aggregation ameliorates the NRE costs associated with back end process of designing an ASIC.
  • This comprises translating the High Level Language (HLL) applications 152 for SoC platforms to an intermediate representation 153.
  • the intermediate form is an orthogonal instruction set in Static Single Assignment (SSA) Form 154, wherein the orthogonal set of instructions forms the Virtual Instruction Set Architecture (VISA) 154.
  • the SSA VISA form is then converted into dataflow graphs 155.
  • the data flow graphs are then compiled into clusters 157.
  • Clusters comprise closely interacting groups of nodes of the dataflow graph. Several such disjoint subgraphs are grouped to form clusters. Each cluster has a certain number of inputs and outputs 180.
  • the clustered data flow graph 158 is then translated into an executable 159 that can be executed on the computational structures composed at runtime. The executable 160 is now available for execution.
  • the executable thus created is executed using controlled dataflow execution paradigm.
  • clusters ready for execution are first identified 170.
  • Such clusters are called Can Fire clusters 171.
  • one cluster is chosen for execution 172.
  • This cluster is called the Will Fire Cluster 173.
  • the issue logic checks for availability of resources as specified in the resource requirements for this cluster 174.
  • the cluster configuration data is transferred to the storage units identified for the execution of this cluster 175.
  • the control information too is transferred 176.
  • the execution of the operations in the cluster ensues 177.
  • the dataflow graph contains only dyadic and monadic operations. This can be easily mapped to the hexagonal structure as shown in the templates in Figure 6.
  • Figure 6a is a mapping of a monadic operation onto the fabric. 60 is the producer and 61 is the consumer.
  • Figure 6b is a mapping of a dyadic operation onto the fabric. 62 and 63 are producers. 64 is the consumer.
  • Matrix Vector Multiply kernel
  • Figure 7 shows the kernel of the matrix vector multiply function. This code is passed through LLVM with optimization level set to 3. The resulting bytecode is then hand coded into a dataflow graph (refer Figure 8).
  • Figure 8 shows the data dependence graph of a matrix vector multiplication kernel. Nodes 80 and 81 are multiplies whose result gets added in 82 and the sum of operations 82 and 83 is done at 84. This dataflow graph is then mapped onto the hexagonal fabric as shown in Figure 9.
  • Figure 7: The C language description of the matrix vector multiply function.
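  • Figure 7 itself is not reproduced in this text; a representative C kernel of the kind it describes (an assumed reconstruction of a plain matrix-vector multiply, not the patent's exact listing) is shown below. Each multiply and add in the loop body becomes an operation node of the kind labelled 80-84 in the data dependence graph of Fig. 8.

```c
/* Matrix-vector multiply: y = A * x, with A of size n x n (C99 VLAs).
 * The products a[i][j] * x[j] and the additions that accumulate them are
 * the nodes that appear in the dataflow graph (e.g. two multiplies feeding
 * an add, whose result is added again, as in Fig. 8).                    */
void matvec(int n, const float a[n][n], const float x[n], float y[n])
{
    for (int i = 0; i < n; ++i) {
        float sum = 0.0f;
        for (int j = 0; j < n; ++j)
            sum += a[i][j] * x[j];
        y[i] = sum;
    }
}
```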
  • Figure 10 shows the dataflow graph for iterative Fibonacci sequence generator.
  • the basic blocks are marked by dotted ellipse encircling the operation nodes.
  • the basic blocks shown in this example are 110, 111, 112 and 113.
  • the mapping of the basic blocks 111 and 112, to the execution fabric with a hexagonal interconnection is shown in Figure 11.
  • the nodes 120-126 in Fig. 11 indicate the unoptimized mapping of this data dependence graph onto a fabric with hexagonal interconnection.
  • the mapping of the dataflow graph (refer Figure 10) onto the hexagonal fabric is shown in Figure 11. It is important to note in Figure 11, there are non-neighbor producer consumers. Presence of these requires special communication mechanisms to transfer data from one portion of the fabric to another.
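  • Figure 10's dataflow graph corresponds to an iterative Fibonacci sequence generator; a representative C version is given below (an assumed reconstruction, since the source listing is not included in this text). The add and the value-shifting copies in the loop body are the kind of operation nodes grouped into basic blocks such as 110-113 in Fig. 10.

```c
/* Iterative Fibonacci kernel: fib(0)=0, fib(1)=1, fib(n)=fib(n-1)+fib(n-2). */
unsigned long fib(unsigned n)
{
    unsigned long prev = 0, curr = 1;
    for (unsigned i = 0; i < n; ++i) {
        unsigned long next = prev + curr;   /* the add node              */
        prev = curr;                        /* copy / move nodes         */
        curr = next;
    }
    return prev;
}
```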

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Devices For Executing Special Programs (AREA)
  • Stored Programmes (AREA)

Abstract

The present invention relates to a fabric within an SoC framework, together with a system and a method, in which resources can be composed as computational structures that best match the needs of the application. The fabric of the invention contains compute, storage and communication resources that can be aggregated at runtime to perform specific application tasks. The system comprises a scheduler, a cluster configuration store, an execution fabric containing a plurality of computational resources, a resource binding agent, a Load Store Unit and Store Destination Decision Logic (SDDL). The method of the invention proceeds through the steps of developing High Level Language (HLL) descriptions of the application modules, converting the HLL description of the application modules to an intermediate representation, compiling into clusters using the dataflow graph of the application, performing the binding operations, and performing controlled dataflow execution, wherein a set of clusters is scheduled and executed on the fabric.
PCT/IN2007/000262 2007-06-28 2007-06-28 Method and System on Chip fabric WO2009001368A2 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/IN2007/000262 WO2009001368A2 (fr) 2007-06-28 2007-06-28 Method and System on Chip fabric

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/IN2007/000262 WO2009001368A2 (fr) 2007-06-28 2007-06-28 Method and System on Chip fabric

Publications (2)

Publication Number Publication Date
WO2009001368A2 true WO2009001368A2 (fr) 2008-12-31
WO2009001368A3 WO2009001368A3 (fr) 2009-09-24

Family

ID=40186127

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IN2007/000262 WO2009001368A2 (fr) 2007-06-28 2007-06-28 Method and System on Chip fabric

Country Status (1)

Country Link
WO (1) WO2009001368A2 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10089761B2 (en) 2016-04-29 2018-10-02 Hewlett Packard Enterprise Development Lp Graph processing using a shared memory
CN116501594A (zh) * 2023-06-27 2023-07-28 上海燧原科技有限公司 系统建模评估方法、装置、电子设备及存储介质

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6175948B1 (en) * 1998-02-05 2001-01-16 Motorola, Inc. Method and apparatus for a waveform compiler
AU2003237005A1 (en) * 2002-06-28 2004-01-19 Koninklijke Philips Electronics N.V. Integrated circuit having building blocks
US7353362B2 (en) * 2003-07-25 2008-04-01 International Business Machines Corporation Multiprocessor subsystem in SoC with bridge between processor clusters interconnetion and SoC system bus
JP4275013B2 (ja) * 2004-06-21 2009-06-10 三洋電機株式会社 データフローグラフ処理装置、処理装置、リコンフィギュラブル回路。

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10089761B2 (en) 2016-04-29 2018-10-02 Hewlett Packard Enterprise Development Lp Graph processing using a shared memory
CN116501594A (zh) * 2023-06-27 2023-07-28 上海燧原科技有限公司 系统建模评估方法、装置、电子设备及存储介质
CN116501594B (zh) * 2023-06-27 2023-09-08 上海燧原科技有限公司 系统建模评估方法、装置、电子设备及存储介质

Also Published As

Publication number Publication date
WO2009001368A3 (fr) 2009-09-24

Similar Documents

Publication Publication Date Title
Liu et al. A survey of coarse-grained reconfigurable architecture and design: Taxonomy, challenges, and applications
Weng et al. A hybrid systolic-dataflow architecture for inductive matrix algorithms
JP6059413B2 (ja) 再構成可能命令セル・アレイ
Chen et al. Graph minor approach for application mapping on cgras
CN102782672B (zh) 用于高效嵌入式同类多核平台的基于瓦片的处理器架构模型
Parashar et al. Efficient spatial processing element control via triggered instructions
Liu et al. OverGen: Improving FPGA usability through domain-specific overlay generation
Alle et al. Redefine: Runtime reconfigurable polymorphic asic
Silva et al. Ready: A fine-grained multithreading overlay framework for modern cpu-fpga dataflow applications
Lottarini et al. Master of none acceleration: A comparison of accelerator architectures for analytical query processing
Hatanaka et al. A modulo scheduling algorithm for a coarse-grain reconfigurable array template
Bates et al. Exploiting tightly-coupled cores
Sarbazi-Azad et al. Large Scale Network-Centric Distributed Systems
WO2009001368A2 (fr) Method and System on Chip fabric
Vadivel et al. Towards efficient code generation for exposed datapath architectures
Cardoso Dynamic loop pipelining in data-driven architectures
Paul Programmers' views of SoCs
Alle et al. Synthesis of application accelerators on runtime reconfigurable hardware
Chattopadhyay et al. Language-driven exploration and implementation of partially re-configurable ASIPs
Satrawala et al. Redefine: Architecture of a soc fabric for runtime composition of computation structures
Wijtvliet et al. CGRA background and related work
Petersen et al. Reducing control overhead in dataflow architectures
Cathey et al. A reconfigurable distributed computing fabric exploiting multilevel parallelism
Saha et al. A methodology for automating co-scheduling for reconfigurable computing systems
Khan et al. Exploring the Approaches to Data Flow Computing.

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07805632

Country of ref document: EP

Kind code of ref document: A2

DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101)
NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 7659/CHENP/2009

Country of ref document: IN

122 Ep: pct application non-entry in european phase

Ref document number: 07805632

Country of ref document: EP

Kind code of ref document: A2
