HK40008643A - Processor with reconfigurable algorithmic pipelined core and algorithmic matching pipelined compiler - Google Patents
- Publication number
- HK40008643A (application HK19123964.9A)
- Authority
- HK
- Hong Kong
- Prior art keywords
- processor
- core
- reconfigurable
- compiler
- icat
- Prior art date
Description
Cross Reference to Related Applications
The present application claims priority from U.S. provisional application 62/287,265 entitled "Processor With Reconfigurable Algorithmic Pipelined Core And Algorithmic Matching Pipelined Compiler", filed January 26, 2016, the entire contents of which are incorporated herein by reference.
Technical Field
The field relates to computer programming and microprocessor design and programming, and in particular to reconfigurable, pipelined, and parallel processing of general purpose software instructions.
Background
FIG. 1A illustrates a compiler of a conventional processor. Conventional processors, such as Intel microprocessors and ARM microprocessors, are well known. For example, a conceptual illustration of a conventional processor is shown in FIG. 1B. These processors are the core of central processing units for modern computers and devices, and are used to process algorithms. One problem with conventional processors is that they are generic and cannot be reconfigured in any practical way to enhance their performance for a particular application. Another problem is that program execution control adds a large amount of overhead to algorithmic functional processes, such as mathematical operations and logical decisions that change the process flow. A conventional processor may be programmed using a higher level programming language, and a compiler converts instructions in the higher level programming language into machine code for a particular processor architecture. The machine code is provided to a memory location accessible to the processor and, along with any BIOS or other calls provided by the system architecture, provides instructions for the operation of the processor hardware. In most cases, mathematical and logical processing instructions are directed to an Arithmetic Logic Unit (ALU), which returns a solution to the program execution control portion of the processor; that portion manages overhead, such as directing the processor through the correct ordering of mathematical operations, logical decisions, data handling, and so forth. Machine code instructions are continuously fetched from program memory to control the processing of data. This overhead severely limits machine performance.
For example, the following shows the steps of a conventional compiler compiling a mathematical operation written in the "C" programming language, an example of a higher level programming language that can be compiled to create machine code for a particular conventional processor. The simple declarations "var i1;", "var i2;", and "var s;" define the data storage locations for variables i1, i2, and result s. Then, the instruction "s = i1 + i2;" can be used to sum the variables assigned to data locations i1 and i2. The compiler (a) first allocates storage locations (e.g., i1, i2, and s) for the data, and (b) translates the source code into machine code. A conventional processor will retrieve all or a portion of the machine code from the storage location where the code is stored. The conventional processor will then execute the machine code. For this example, the Central Processing Unit (CPU) loads the i1 data into a memory location and sends it to the ALU, loads the i2 data into a memory location and sends it to the ALU, and instructs the ALU to add the data located in i1 and i2. Only then does the ALU perform the addition of the values located in data locations i1 and i2. The addition itself is the useful-work step; the setup performed by the CPU is overhead. The CPU may then obtain the ALU result from the data location for "s" and may send the ALU result to the input and output controllers. If the result is not an intermediate step in the calculation, this is a necessary step to present the result. Conventional processor evolution stems from the desire to save computer program development time by allowing higher level programming languages to be compiled for various architectures of central processing units and peripheral devices. In addition, all processes executed by the CPU may share the common ALU when it is shared by the various programs operating in the system environment.
Application Specific Integrated Circuits (ASICs) built as hardware electronic circuits capable of quickly executing calculations for specific functions are known. These application specific integrated circuits reduce overhead by hardwiring certain functions into the hardware.
Some Field Programmable Gate Arrays (FPGAs) having a large number of logic gates and Random Access Memory (RAM) blocks are known. These FPGAs are used to implement complex numerical calculations. Such FPGA designs can employ very fast input/output buses and bidirectional data buses, but it is difficult to verify correct timing of valid data within setup and hold times. Floor planning enables resource allocation within the FPGA to meet these time constraints. An FPGA can be used to implement any logic function that an ASIC chip can perform. Compared with ASIC designs, the ability to update functionality after delivery, partial reconfiguration of a portion of the design, and low non-recurring engineering costs provide advantages for certain applications (even when the generally higher unit cost is taken into account).
However, penetration of FPGA architectures has been limited to narrow niche products. An FPGA virtual computer that executes a set of instructions by sequentially reconfiguring an array of FPGAs in response to a series of program instructions is disclosed in U.S. Patent No. 5,684,980. Fig. 2 shows the structure of the FPGA architecture. The issued patent includes an FPGA array that continuously changes configuration during execution of successive algorithms or instructions. The configuration of the FPGA array allows an entire algorithm or instruction set to be executed without waiting for individual instructions to be downloaded during the performance of individual computational steps.
The development and integration of FPGAs with processors has long promised the ability to reprogram at "run time", but in practice, reconfigurable computing or reconfigurable systems that adapt to the task at hand have thus far not been realized in practical applications, due to the difficulty of programming and configuring these architectures for this purpose.
Fig. 2 shows a block diagram of a virtual computer including a field programmable gate array and a field programmable interconnect device (FPIN) or crossbar that frees internal resources of the field programmable gate array from any external connection tasks, as disclosed in U.S. patent No.5,684,980, the disclosure and drawings of which are incorporated herein in their entirety for the purpose of disclosing the knowledge of those skilled in the art of FPGAs.
FIG. 2 illustrates an array of field programmable gate arrays and field programmable interconnect devices arranged and used as coprocessors, to enhance performance within a host computer or virtual computer processor, to perform a sequential algorithm. The sequential algorithm must be programmed to correspond to a conventional series of instructions that would normally be executed in a conventional microprocessor. The FPGA/FPIN array then performs certain computational tasks of the sequential algorithm at a much faster rate than the corresponding instructions executed by a conventional microprocessor. The virtual computer of FIG. 2 must contain a reconfigurable control section that controls the reconfiguration of the FPGA/FPIN array. A configuration bit file must be generated for the reconfigurable control section using a software package designed for this purpose. The configuration bit file must then be sent to the corresponding FPGA/FPIN array in the virtual computer. FIG. 2 shows how the array and a dual-port Random Access Memory (RAM) are connected by pins to the reconfigurable control section, the bus interface, and the computer main memory. The bus interface is connected to a system bus.
U.S. Patent No. 5,684,980 shows how pins, including clock pins, connect the reconfigurable control section to the FPGA/FPIN array, and shows an example of the reconfigurable control section.
U.S. Patent No. 4,291,372 discloses a microprocessor system with dedicated instruction formatting that works with external applications slaved to logic blocks, which handle the specific requirements for sending data to and receiving data from peripheral devices. The microprocessor provides a program memory having a dedicated instruction format. The instruction word format provides a single-bit field for selecting a program counter or memory reference register as the source of a memory address, a function field defining the path over which data transfer is to occur, and source and destination fields for addressing source and target locations. Previously, peripheral controller units loaded the processor and control circuitry in the base module of the system to handle these specific requirements.
A Digital Signal Processing (DSP) unit or array of DSP processors may be hardwired as a parallel array that optimizes performance in certain graphics-intensive tasks, such as pixel processing for generating images on output screens such as monitors and televisions. These are custom made and include a BIOS dedicated to the graphics acceleration environment created for the digital signal processors to do their work.
Matrix Bus Switching (MBS) is known. For example, the user guide "AXI4™, AXI4-Lite™, and AXI4-Stream™ Protocol Assertions, Revision: r0p1, User Guide", copyright 2010, 2012 (referenced as ARM DUI 0534B, ID072312), teaches a system for matrix bus switching that is high speed and can be implemented by one of ordinary skill in the art. The user guide is written for system designers, system integrators, and verification engineers who wish to confirm that a design conforms to the relevant AMBA 4 protocol. For example, the AMBA 4 protocol may be AXI4, AXI4-Lite, or AXI4-Stream. All trademarks are registered trademarks of ARM in the European Union and elsewhere. In addition, this reference is incorporated by reference herein in its entirety. MBS is a high speed bus for data input and output, and this reference teaches system engineers exemplary methods and hardware for integrating MBS in processor system architectures.
All of this is known in the art, but none of the examples in the prior art eliminates almost all of the overhead generated by conventional processing systems while maintaining the flexibility to handle various algorithms and to develop software for the processing system using standard higher level programming languages (such as "C").
Disclosure of Invention
An on-chip pipelined parallel processor includes a processing unit and an array of reconfigurable field programmable gates programmed by an algorithm-matching pipeline compiler, which may be a pre-compiler. The algorithm-matching pipeline compiler (referred to as AMPC or ASML) pre-compiles source code, written to run on a standard processor without parallel processing, for processing by the processing unit, and the processing unit and the AMPC configure the field programmable gates to run as a pipelined parallel processor. For example, the processor may be referred to as a Reusable Algorithmic Pipelined Core (RAPC). The parallel processor is configured to complete its task without any further overhead from the processing unit, such as overhead for controlling an arithmetic processing unit.
In one example, the reusable algorithmic pipeline processor includes a computer pool configured to process algorithms in parallel using standard higher level software languages (such as "C", "C++", etc.). For example, the computer pool is reprogrammed to run different algorithms, based on the output of the AMPC created using the available RAPC resources, according to the needs of a particular computation.
For example, a Reusable Algorithmic Pipelined Core (RAPC) may include three modules: an intelligent bus controller or Logic Decision Processor (LDP), a Digital Signal Processor (DSP), and a Matrix Bus Switch (MBS). The Logic Decision Processor (LDP) includes reconfigurable logic functions, is reprogrammable as needed, and is used to control the Matrix Bus Switch (MBS). The DSP includes a reconfigurable math processor for performing mathematical operations. In one example, all mathematical operations processed by the RAPC are processed by the DSP. In one example, all logical functions handled by the RAPC are handled by the LDP. A matrix bus router or matrix bus switch (MBR or MBS) is defined as a reconfigurable programmable circuit that routes data and results from one RAPC to another RAPC, and from/to an input/output controller and/or interrupt generator, as needed to complete an algorithm without any further intervention from a central or peripheral processor during algorithm processing. Thus, overhead is greatly reduced by pipelining, as compared with static, non-reconfigurable hardware that requires intervention by a central or peripheral processor to direct data and results into and out of the arithmetic processing unit. In one example, the LDP processes logical decisions and iterative loops, and result storage is provided by the LDP for learning algorithms.
In one example, all mathematical operations processed by the RAPC are processed by the DSP and all logical functions are processed by the LDP. In one example, a plurality of RAPCs are configured as a core pool, and each RAPC in the core pool can be individually reconfigured by programming, without any change to the hardware. For example, all RAPCs may be configured to process algorithms in parallel. In one example, the LDP uses memory blocks as look-up tables (LUTs) and registers for constant or learned values. An n-bit LUT can encode any n-input Boolean logic function as a truth table stored in the LUT established by the LDP.
In one example, an Algorithm Matching Pipeline Compiler (AMPC) generates machine code from a higher level, compilable software language such as "C", "C++", Pascal, Basic, etc. Standard source code written for a conventional non-reconfigurable and non-pipelined general-purpose computer processor may be processed by the AMPC to generate machine code for configuring one or more RAPCs. For example, the AMPC generates machine code from standard, pre-existing code for a conventional ARM processor or a conventional Intel processor, and the machine code generated by the AMPC pre-compiler configures the RAPCs using the ARM processor or the Intel processor. Thus, the new computer system includes a legacy processor (such as an existing ARM processor, Intel processor, AMD processor, etc.) and a plurality of RAPCs, each RAPC including, for example, a DSP, LDP, and MBS. Unlike existing coprocessors or accelerators, however, RAPCs are not merely peripheral coprocessors. Rather, after the pre-compiler or AMPC configures a RAPC to complete its work, the RAPC is reconfigured to independently solve complex mathematical and logical algorithms without further intervention by the conventional processor. Values are input into the configured RAPCs and solutions are output to the MBS. In one example, multiple RAPCs are arranged on a single chip (such as a reconfigurable ASIC). Reconfigurable ASIC refers to a chip designed to contain RAPCs such that each RAPC can be reprogrammed for specific operations by the AMPC and an existing general purpose processor architecture (such as an ARM processor, AMD processor, or Intel processor). In this way, such a reconfigurable ASIC may contain 2000 RAPCs and may operate at 360 trillion instructions per second at a clock speed of 500 MHz. Thus, a single reconfigurable ASIC containing 2000 RAPCs can operate 100 times faster than any conventional general purpose processor today.
All RAPCs can run in parallel in a pipelined configuration whenever data is available. A single RAPC can execute instructions 100 times faster than a standard processor. A reconfigurable ASIC containing 20 RAPCs, running at a clock speed of 500 MHz, can execute 300 million instructions per second. A single chip may contain up to 2000 RAPCs in a custom-sized ASIC. Thus, a conventionally-sized ASIC containing 2000 RAPCs can execute instructions 200,000 times faster than a conventional processing system without resorting to a dedicated programming language. Instead, existing programs can be ported to operate with a reconfigurable ASIC that includes multiple RAPCs and benefit from pipelined (parallel) execution of instructions without substantially rewriting the existing high-level programming. In one example, the AMPC precompiles existing code for an ARM general purpose processor architecture embedded on a reconfigurable ASIC containing multiple RAPCs. This new processor architecture (ICAT) achieves surprising and unexpected performance by integrating an ARM processor architecture and multiple RAPCs on one chip. The embedded ARM processor on the ICAT chip executes machine code instructions generated by the AMPC from a pre-existing program written in a high level programming language such as "C", and the AMPC configures the multiple RAPCs on the ICAT chip to execute instructions at a surprising rate per second. The ARM processor also controls intelligent monitoring, diagnostics, and communication with peripheral devices external to the ICAT chip. Therefore, the ICAT chip appears to the outside world as a very fast ARM processor that does not require a math coprocessor.
In an alternative example, the ICAT chip may be embedded in the AMD processor and may appear to the outside world as being an AMD processor.
In yet another example, the ICAT chip may be embedded in an Intel processor and may appear to the outside world as an Intel processor.
Surprisingly, while the ICAT chip appears to the outside world as a standard non-reconfigurable, non-pipelined processor, which would be expected to execute instructions only at rates comparable to standard processors, the ICAT chip executes instructions at a rate that is, surprisingly and unexpectedly, 100 times to 200,000 times faster than standard processors, without programs written for standard processors having to be rewritten. This eliminates the cumbersome task of rewriting code to run on an FPGA, finally making such hardware available to the average programmer. In one example, the AMPC does not generate run-time code for the ICAT chip. Instead, it precompiles the program and separates out the instructions that are best suited for the RAPCs. The AMPC then generates code for setting up individual RAPCs of the plurality of RAPCs on an ICAT chip (or elsewhere; in one example, multiple ICAT chips operate in parallel), and the RAPCs then operate pipelined and in parallel. Alternatively, the RAPCs may be reconfigured in real time based on an instruction received by the ICAT chip or on historical instructions previously received by the ICAT chip. Therefore, if the ICAT chip is always used for a similar purpose, the ICAT chip can learn to operate at an increasingly faster speed over time. This occurs naturally if the RAPCs are configured in real time to solve new logical and mathematical equations while appropriately retaining the old logical and mathematical configurations based on, for example, a first-in-first-out or least-recently-used reconfiguration policy. For example, under a least-recently-used policy, a set of RAPCs reconfigured for a frequently used purpose will not be reconfigured for another purpose until no other RAPCs are available for a new algorithm requiring a RAPC. In this way, the most common algorithms need not be configured anew, but are pre-configured from previous uses.
The RAPC, once configured by the central processing unit, operates without overhead, executing instructions until the mathematical, logical, and iterative instructions for which the RAPC has been configured have been completed.
In one example, the ICAT chip includes setting registers, and the AMPC generates instructions for loading the setting registers of the ICAT chip, thereby configuring the RAPCs to complete specific instructions. Once initialized, the RAPCs run continuously without further supervision by the central processing unit. In one example, the AMPC receives RAPC hardware data from a hardware compiler (such as a Verilog or Vivado hardware compiler). A hardware file may be generated by the hardware compiler and may be used by the AMPC to generate code that writes configuration data to the setting registers of the ICAT chip (or, in one example, the setting registers of multiple ICAT chips).
In one example, the AMPC extracts configuration data for the setting registers of the ICAT chip from a program written in a high-level programming language (such as "C") for a standard processing architecture. For example, the AMPC ignores overhead instructions and generates the code for the setting registers of the ICAT chip from the following program elements: 1) arithmetic instructions and data; 2) logical decisions and data; 3) branch or call/return instructions and targets; 4) iterative loops, decisions, and data; 5) DSP setup routines and data; and 6) code entry point tags for loops and branches. For example, the AMPC uses these instructions to configure the ICAT's setting registers: to configure the DSP to perform mathematical algorithms, to configure the LDP for logical decisions and the values of the LDP look-up tables, and to configure the MBS for branch, call, and return target tags that map to entry points and allocated addresses in the various processing algorithms in the ICAT hardware. For example, a RAPC hardware table is constructed for each RAPC and contains DSP, LDP, and MBS configuration tables. For example, the DSP, LDP, and MBS are configured for common use in a RAPC, but when the DSP or LDP is not needed, the RAPC can be reconfigured, and even such common structures of the AMPC and ICAT architectures can be omitted. Thus, while DSPs, LDPs, and MBSs exist in some RAPCs, other RAPCs may have different structures specific to the code to be run on an ICAT processor.
In one example, the ICAT architecture and the AMPC depend on each other through hardware that can be reconfigured by the AMPC. For example, if the target RAPC is nearby, or data is being routed from the DSP to the LDP or vice versa, the AMPC may implement a branch or call of the target within the ICAT architecture by directly connecting the result or data to the target, e.g., making the result or data directly available for execution of instructions without any overhead. Alternatively, the AMPC may use the MBS to implement a branch or call of the target, and the result and data are transmitted over a high-speed streaming interface to the target, which may be another RAPC or another target, so that the data are available over the high-speed streaming interface for further execution of instructions at the target.
In one example, the AMPC is aware of RAPC resources that are allocated by the AMPC when pre-compiling code written in a high-level programming language. Thus, the ICAT architecture may be configured by the AMPC to optimize the use of RAPC resources, such as by minimizing the length of interconnections between instructions executed by multiple RAPCs. This optimization can be done by an iterative method or a trial and error method. In one example, the AMPC includes a learning algorithm that improves optimization based on historical usage patterns of certain instructions (such as by minimizing the use of MBS's branches or calls for targets of a common instruction set), whether mathematical, logical, or a combination of mathematical and logical algorithms. For an example of MBS implementation, see the example of ARM MBS in the background art.
In one example, the RAPC is integrated into a chip having a conventional processor for configuring the RAPC and an AMPC for compiling conventional high-level source code into instructions for the conventional processor to build the RAPC. The RAPC includes a DSP, an LDP, and an MBS. In this example, each DSP has a setup interface for programming any of a plurality of operations, e.g., integer and floating point mathematical operations such as multiply, divide, add, subtract, and other mathematical functions. The DSP may have inputs for operand data that may be concatenated with, or operated on by, various combinations of mathematical functions determined by the setting data. In this example, each DSP has a 48-bit accumulator whose output provides result data and status data. The status data includes, for example, execute, equal, greater than, and less than. In this example, each LDP has a setup interface for programming look-up tables, loop counters, and constant registers. Each LDP has a "Loop Counter" that is used to detect when an iterative algorithm is complete. Each LDP has a register that can hold constant data for input to a look-up table. Each LDP has a memory block that can be used to perform a function. The look-up table functions may include: a look-up table that can be implemented and accessed sequentially using the loop counter; a look-up table for control purposes that can be implemented and accessed by DSP status, constant registers, or DSP result data; and a logic look-up table that can implement and output various logic signals, e.g., for control purposes. The LDP may pass result data from its input to its output. For example, the LDP may have a pipeline register at its output for result data. Alternatively, the LDP may have two pipeline registers that enable synchronous clearing of the result data at its output.
For example, the chip may be an ICAT chip containing a plurality of RAPCs, each including a DSP, LDP, and MBS, and each RAPC is set up by code that the AMPC provides to the legacy processor.
In one example, the AMPC includes a compiler with an input architecture for defining a number of the plurality of RAPCs and a location of the plurality of RAPCs. The AMPC filters high-level source code and identifies mathematical and logical algorithms that can be optimized by configuring one or more of the RAPCs. For example, if a video processing mathematical algorithm or logic algorithm is identified, the AMPC builds one or more DSPs, LDPs, and MBS of the RAPC to perform the video processing mathematical algorithm and/or logic algorithm. For example, the AMPC creates machine code from "C" language source code for operating a legacy processor (such as an ARM processor), and the ARM processor sets the portions of the respective DSPs, LDPs, and MBS of the respective RAPCs to be used for processing data input to the processor and outputting data from the processor.
For systems external to the processor, the ICAT processor will appear as an exceptionally fast conventional processor. Within the processor, the DSP, LDP, and MBS of each RAPC will process data tens, hundreds, or even thousands of times faster than a traditional single-core processor. For each RAPC, the DSP will perform its operation on the first clock, the LDP will test the result and output a control decision and result data on the second clock, and the MBS will route the result data to one of two targets, based on the control data, on the third clock. Thus, each RAPC will have a 3-clock delay from the DSP to the MBS. For streaming data, once started, the MBS may output data on each subsequent clock after the delay period.
In one example, a system for configuring a reconfigurable processor includes: a non-reconfigurable processor, a plurality of reconfigurable cores, and an algorithm matching pipeline compiler capable of accepting code written in a high-level programming language for the non-reconfigurable processor; wherein the compiler identifies code written in the high-level programming language that can benefit from pipelining available on one or more of the plurality of reconfigurable cores, and outputs code for the non-reconfigurable processor to set up one or more of the plurality of reconfigurable cores.
In one example, a processor includes a non-reconfigurable processor core and a plurality of reusable algorithm pipeline cores coupled to the non-reconfigurable processor core, such that the non-reconfigurable processor core is capable of configuring and reconfiguring each of the plurality of reusable algorithm pipeline cores as a result of instructions received from an algorithm matching pipeline compiler. For example, the processor is contained in a single chip. In one example, each reusable algorithm pipeline core includes a DSP, LDP, and MBS, and the DSP is pipelined to the LDP and the LDP is pipelined to the MBS, such that the non-reconfigurable processor does not control any processing occurring in each reusable algorithm pipeline core.
Definitions. An Algorithmic Matching Pipelined Compiler, or AMPC, is a compiler that can accept code written in a high-level programming language for a conventional non-reconfigurable processor, where the AMPC identifies code, written in the high-level programming language, that can benefit from pipelining technology available on a reconfigurable core or processor (such as a RAPC or field programmable gate array), and outputs code for the non-reconfigurable processor that instructs the non-reconfigurable processor to configure the reconfigurable core or processor before providing instructions that use the reconfigurable core or processor. A reusable (or reconfigurable) algorithmic pipelined core (or computer), or RAPC, is defined as a reconfigurable processing core having a pipeline structure comprising: a DSP including a setup interface for programming any of a plurality of operations (such as integer and floating point mathematical operations), the DSP having four inputs for operand data that may be concatenated with, or operated on by, various combinations of mathematical functions determined by the setup data, and the DSP including a 48-bit accumulator whose output provides result data and status data; an LDP with a setup interface for programming look-up tables, loop counters, and constant registers, and with memory blocks (which may be used to perform functions); and an MBS. The MBS is defined as a reconfigurable programmable circuit that routes data and results from one RAPC to another RAPC, and from/to an input/output controller and/or interrupt generator, as needed to complete an algorithm without any further intervention from a central or peripheral processor during algorithm processing.
Drawings
The following figures are illustrative examples and do not further limit any claims that may be finally issued.
FIG. 1A shows a flow diagram for a conventional compiler in the prior art.
FIG. 1B illustrates a prior art processor for a conventional computer.
Fig. 2 shows a block diagram of U.S. patent No.5,684,980.
Fig. 3 is a flowchart showing an example of an AMPC compiler for comparison with the flowchart in fig. 1A.
Fig. 4 is an example of the ICAT architecture.
Fig. 5 shows a flow chart of an example of how a programmer may use AMPC.
FIG. 6 is a schematic example of a reusable algorithmic pipeline computer.
Fig. 7 is a schematic diagram showing the hardware configuration resulting from compilation of Code Example 1 with the AMPC compiler.
FIG. 8 illustrates a significant raw processing benefit of the example of FIG. 7: real-time lossless data compression in consumer electronics devices.
When the same reference numbers are used, they refer to similar parts in the examples shown in the figures.
Detailed Description
For example, the ICAT architecture mimics any standard microprocessor unit architecture. The architecture takes advantage of pipelining and of the greater gate density of integrated circuits designed to be configured by customers or designers after manufacture, such as one or more Field Programmable Gate Arrays (FPGAs), and can achieve a 100:1 advantage in MIPS over a single standard microprocessor architecture in a 1:1 comparison at the same clock speed. An FPGA contains an array of programmable logic blocks and a reconfigurable interconnect hierarchy that allows the blocks to be "wired together," much as many logic gates can be wired to each other in different configurations. The logic blocks may be configured to perform complex combinatorial functions or merely simple logic gates (e.g., AND and XOR). In most FPGAs, the logic blocks also include memory elements, which may be simple flip-flops or more complete memory blocks.
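As a rough illustration of the logic-block idea described above: a 4-input LUT is simply a 16-bit truth table indexed by the input bits, and reconfiguring the block amounts to loading a new table. This C sketch is illustrative only, not how any particular FPGA vendor encodes its LUTs.

```c
#include <stdint.h>

/* Evaluate a 4-input lookup table: the truth table is a 16-bit word,
 * one output bit per possible input combination. */
static int lut4(uint16_t truth_table, int a, int b, int c, int d) {
    int index = (a & 1) | ((b & 1) << 1) | ((c & 1) << 2) | ((d & 1) << 3);
    return (truth_table >> index) & 1;
}

/* Truth table for 4-input AND: only index 15 (all inputs high) is 1. */
#define LUT_AND4   ((uint16_t)0x8000)
/* Truth table for (a XOR b), ignoring c and d: pattern 0110 repeated. */
#define LUT_XOR_AB ((uint16_t)0x6666)
```

Swapping `LUT_AND4` for `LUT_XOR_AB` changes the block's function without any rewiring, which is the essence of post-manufacture configurability.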
The tremendous increase in performance allows processors to be used in data intensive applications such as machine vision applications, video processing applications, audio processing applications, robotic control system applications, multi-axis control system applications, mobile communication applications, virtual reality applications, artificial intelligence applications, live broadcast applications, biometric monitoring applications, internet of things applications, supercomputing applications, quantum computing applications, aerospace control system applications, simulation and modeling applications for complex systems, and signal processing applications.
In one example, less power is used for the computationally intensive processing of algorithms. For example, the ICAT architecture provides an energy usage reduction of 100:1, and more preferably 1000:1, for the same computation implemented on a standard microprocessing unit, thereby reducing heat and power consumption.
In one example, ICAT can run in a configuration of as many parallel processors as needed by the application, improving performance even further compared to standard microprocessors. For example, multiple processor architectures may be run simultaneously. For example, legacy code may run on a virtual machine compatible with the legacy code, while a new virtual machine runs code written specifically for the new architecture. In one example, this reduces the need for extensive regression testing such as is required to adapt legacy code to new system architectures.
In one application, the speed and scalability of the ICAT architecture benefit legacy systems that cannot handle the amount of data required, for customers whose code and/or hardware have encountered limitations.
In one example, compiling the reconfiguration at or before power-up greatly simplifies planning, with little impact on final product performance. For example, an FPGA is the host hardware for the architecture. Millions of instructions per second (MIPS) can be added easily without significant rewriting of existing code. The existing code runs with little modification other than recompilation. For example, algorithms that require processing a large number of common inputs in parallel are ideal candidates for the ICAT architecture.
In one example, the new processor and the old processor run in parallel. Existing code can be recompiled and run almost unaffected, with minimal regression testing performed to ensure that no changes have occurred. The exceptions are timing effects on the operation of the architecture and changes to hardware peripherals. For example, the ICAT architecture can be used to increase raw computing speed, and acceleration of code can be achieved by converting it to hardware when needed.
In one example, the ICAT architecture includes a front-end pre-compiler that captures potential code incompatibility issues and resolves them automatically. For example, the ICAT architecture may emulate various processor architectures familiar to different developers. For example, the ICAT architecture can emulate more than one processor, allowing a project to be coded for multiple developers' preferred processors and the code to run on multiple different virtual processors simultaneously. In one example, multiple different processors run different sets of code in a multiprocessing environment, and a program developer compiles the code for one of the multiple domains that is compatible with the code.
In one example, the pre-compiler is an algorithm matching pipeline compiler that generates hardware configuration code required for various processing algorithms. Firmware for configuring the ICAT architecture can be generated from logic and mathematical equations for multiple processing tasks. For example, the plurality of processors may be configured as a matrix array for running mixed low-performance and high-performance tasks.
The ICAT architecture supports process code developed in a higher level language, because the ICAT architecture provides a raw speed advantage that exceeds any speed advantage gained from machine language programming, which is suitable only for one particular multiprocessing environment; this greatly reduces the time to complete a development project.
The ICAT architecture includes a compiler or pre-compiler that checks legacy code for hardware-specific commands and optimizes code written in a high-level programming language such as C or C++. For example, a comparison of FIG. 1A and FIG. 3 shows the additional steps included in an Algorithmic Matching Pipelined Compiler (AMPC).
In one example, the ICAT architecture provides a collection of standard multiprocessing/multitasking peripherals with built-in coordination. A Real Time Operating System (RTOS) may be employed. For example, a multitasking, real-time operating system is incorporated into the ICAT architecture. For example, the Micro-Controller Operating System (MicroC/OS) is a real-time operating system designed by embedded software developer Jean J. Labrosse. It is a priority-based preemptive real-time operating system for microprocessors, written primarily in the C programming language (a higher level programming language). For example, the raw speed of the ICAT architecture allows the use of such an RTOS. MicroC/OS allows multiple functions to be defined in C, each of which can run as a separate thread or task. Each task runs at a different priority, and each task behaves as though it owns its own virtual processor of the ICAT architecture. Lower priority tasks may be preempted by higher priority tasks at any time. Higher priority tasks may use operating system services, such as delays or events, to allow lower priority tasks to execute. Operating system services are provided for task management, inter-task communication, memory management, and timing in MicroC/OS. MicroC/OS is open source and adaptable to a variety of different processor architectures.
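The scheduling rule described for a priority-based preemptive RTOS can be reduced to a small sketch. This is not the actual MicroC/OS API; it merely illustrates the core rule "always run the highest-priority ready task," with lower numbers meaning higher priority as in MicroC/OS.

```c
#include <stddef.h>

/* Illustrative task record; field names are assumptions, not uC/OS types. */
typedef struct {
    int priority;   /* lower number = higher priority */
    int ready;      /* 1 if the task can run now */
} task_t;

/* Return the index of the highest-priority ready task, or -1 if none.
 * A preemptive kernel runs this decision whenever readiness changes. */
static int schedule(const task_t *tasks, size_t n) {
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (tasks[i].ready &&
            (best < 0 || tasks[i].priority < tasks[best].priority))
            best = (int)i;
    }
    return best;
}
```

When a high-priority task blocks on a delay or event, it clears its `ready` flag, and the next call to `schedule` lets a lower-priority task run, matching the cooperation described above.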
PCBA layout software and engineering tools are provided for the ICAT architecture to allow conversion of existing designs into the ICAT architecture.
In one example, the pipeline architecture is implemented using standard Verilog or VHDL code. For example, a 1024-word instruction cache, a data cache, and a multi-level memory cache architecture may be provided in the ICAT architecture. The pipeline of the ICAT architecture may include a learning algorithm that detects which path a branch in the decision-making process tends to take, making that path the default path for future passes through the learning algorithm. In another example, interrupt code is isolated, and each interrupt handler is dedicated to a particular input with a private code location. In one example, the ICAT architecture includes a multiprocessor debugger. For example, existing code may be processed by a pre-processing debugger to ensure that the existing code is well partitioned so that functions are separated. A single debugger may then be run on each independent thread of operation.
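The branch-learning idea mentioned above resembles a classic two-bit saturating branch predictor. The following C sketch is a generic illustration of such a predictor, not code from the ICAT design: states 0 and 1 predict the branch not taken, states 2 and 3 predict taken, and each actual outcome nudges the state, so a single deviation does not discard a well-learned default path.

```c
/* Generic two-bit saturating branch predictor (illustrative only). */
typedef struct { int state; /* 0..3 */ } predictor_t;

/* Predict taken when the counter is in the upper half. */
static int predict(const predictor_t *p) { return p->state >= 2; }

/* Train on the actual outcome, saturating at 0 and 3. */
static void train(predictor_t *p, int taken) {
    if (taken  && p->state < 3) p->state++;
    if (!taken && p->state > 0) p->state--;
}
```

The saturation gives hysteresis: after the predictor has learned a strongly taken branch, one not-taken outcome leaves the prediction unchanged.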
For example, a Reconfigurable Algorithmic Pipeline Core (RAPC) may be provided in a 2-inch chip package that provides MIPS and Mega FLOPS equivalent to over 1000 Intel i7 microprocessors, and more preferably over 10,000 Intel i7 microprocessors.
In one example, the ICAT architecture is compatible with existing debugging tools. In another example, the ICAT architecture is implemented to run existing legacy code that does not contain inter-processor communication. ICAT-specific hardware is consolidated into a single, well-debugged block that is common to all legacy code. For example, peripherals that fully mimic the main functions of a common multi-processing unit are cloned for the ICAT architecture. For example, superset peripherals allow the customer to easily adapt the hardware arrangement.
In one example, an ICAT architecture compiler or pre-compiler detects low-level code timing loops that count clock cycles, delays that allow instruction fetching, and other incompatible timing code, and manually or automatically marks these items for repair or replacement using compatible higher-level programming provided in the ICAT architecture.
In one example, the ICAT architecture provides a MIPS advantage of 4:1 over the legacy architecture. In another example, the advantage is at least 100:1.
In one example, the ICAT architecture includes an Algorithm Matching Pipeline Compiler (AMPC), which is a compiler that accepts processing algorithms in a standard source code format. The AMPC generates firmware for a conventional processing system operable with the ICAT architecture. The compiler generates instructions to configure the ICAT hardware so that the architecture processes the algorithm with improved performance compared to a conventional microprocessor that cannot be reconfigured by the AMPC. In particular, AMPC uses pipelining to optimize processor performance for applications that require algorithm-intensive computational processing. For example, the firmware may run on a conventional processing system to configure the ICAT hardware architecture(s) of the processing algorithm with optimal performance.
In one example, the AMPC provides a compiler that compiles conventional source code and generates code for operating the ICAT hardware, configuring the processor resources of the ICAT architecture to directly process algorithms. For example, the AMPC accepts source code compatible with a conventional compiler (such as C, C#, C++, Matlab, or another conventional language).
In one example, firmware generated by the AMPC runs on a main processing system of the ICAT architecture. For example, the main processing system is a legacy processor on the same chip as the rest of the ICAT architecture and operates seamlessly with it. In this example, the AMPC accepts source code written in a high-level programming language (such as C, C#, or C++), and outputs firmware for the ICAT architecture that runs on the host processing system. This simplifies coding for the ICAT architecture by allowing its firmware to be programmed in higher level programming languages familiar to developers. The raw speed of the ICAT architecture eliminates the penalty, and reduces any need, for programming machine-level code to optimize speed. Rather, the higher level programming language optimizes the firmware for performance based on the algorithm to be solved for a particular application. For example, the ICAT architecture is reconfigurable to allow optimal performance for a robotic vision system on at least one virtual machine defined in firmware.
Unlike conventional microprocessors, in one example, the AMPC of the ICAT architecture may compile a software syntax (e.g., if-then-else process) into firmware that reconfigures (e.g., using pipelining) the hardware of the ICAT architecture to optimally perform the process in fewer clock cycles. The ICAT architecture is configured by running firmware. In contrast, a traditional compiler builds firmware for use by all traditional processors, but the traditional processors are not reconfigured by the firmware. For example, the AMPC builds firmware for the ICAT architecture, configuring the ICAT architecture to achieve optimal operation in a particular application. In one example, the AMPC uses algorithms that are input structures for the processor hardware of the ICAT architecture to select and construct the configuration of the ICAT hardware.
For example, the AMPC optimizes the hardware architecture of the ICAT architecture, when configured by firmware generated by the AMPC, for the speed performance of a particular application. The AMPC can reconfigure the hardware of the ICAT architecture, whereas a conventional compiler can reconfigure neither the ICAT hardware nor the hardware of any microprocessor.
A standard system compiler cannot change the hardware architecture in a traditional processor system. However, in one example, the AMPC generates firmware that configures the ICAT architecture processor to directly perform pipeline processing and data routing based on previous results in hardware. For example, if-then-else logic statements input to AMPC will construct the hardware of the ICAT architecture to route the data results to the next ICAT. In this example, the AMPC generates a hardware configuration that eliminates the overhead of a traditional processing system, such as code acquisition, data loading, data storage, branching, and subroutines for the same if-then-else logic.
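The way hardware can resolve an if-then-else without code fetch or branch overhead can be illustrated by a multiplexer: both arms are always computed as wires, and the condition merely selects which result is routed onward. The following C analogy is a hedged sketch of that idea, not the AMPC's actual output.

```c
#include <stdint.h>

/* Branch-free selection, analogous to a hardware multiplexer: both
 * candidate results exist, and the condition picks one, so no
 * instruction fetch or branch penalty is involved. */
static int32_t mux_if_then_else(int cond, int32_t then_val, int32_t else_val) {
    int32_t mask = -(int32_t)(cond != 0);      /* all ones if cond is true */
    return (then_val & mask) | (else_val & ~mask);
}
```

In the ICAT description above, the selected result would then be routed by the MBS to the next RAPC in the pipeline rather than returned to a program counter.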
Fig. 4 shows an example of the ICAT architecture. In one example, a conventional compiler (such as Visual Studio) may be used to generate the ICAT configuration program running on the host processing system 101. This provides a method for configuring and reconfiguring a reprogrammable hardware pool that can be reconfigured to run and process various types of processing algorithms in a chip. Legacy processing systems (e.g., Intel, ARM, IBM, AMD microprocessors) cannot be reconfigured to run the various algorithms, because only the software, not the hardware, can be changed in a legacy processing system. By using the ICAT architecture, all of the overhead of fetching and executing code instructions in conventional processing systems is eliminated. The ICAT architecture of FIG. 4 provides reconfigurable hardware that can be configured for efficient processing of data using a pool of parallel processor resources implemented in a System On Chip (SOC) device 100.
For example, a math processor pool 107, followed by logical processors 108 and configurable matrix routing 109, implements the parallel processing resource pool 102. The architecture enables pipelined processing resources to optimize processing performance for a particular application. In one example, the processor pool 102 performs multiple processing tasks independently of the main processor 101, without receiving further instructions from the main processor. Each ICAT may be configured to process an entire algorithm as an independent processor system. Thus, an ICAT can be considered a system unto itself: once configured to execute an algorithm, it requires no overhead to complete the processing of the algorithm. For example, an ICAT may be configured to execute an if-then-else instruction set, and may later be reconfigured to execute an entirely different instruction set (such as a fast Fourier transform or another mathematical algorithm).
The ICAT architecture reduces power consumption, generates less heat, and increases the speed at which data is processed by reducing unnecessary active cycles compared to conventional processors. An ICAT resource 102 remains idle until it is configured and data is ready to be processed at its input. All processors remain idle when not needed, reducing the heat generated by unnecessary overhead. Each processor in the ICAT resource pool has less overhead than a traditional processor because the ICAT does not fetch and execute code. Rather, the hardware is configured to perform specific operations and is active only when data is provided that needs to be processed using the configured algorithm. In one example, a single ICAT processor uses a pool of math processors 107, logical processors 108, and outputs steered by configurable matrix routing 109.
This same ICAT processor can be used for simple processing tasks (such as if-then-else) or for very advanced, complex algorithms (such as those used in face recognition). The ICAT architecture can handle tasks that require multiple computations in a pipeline architecture (such as movement, shape, or identity detection) by using multiple ICAT resource groups or resource pools 102, each comprising math processors 107, logic processors 108, and outputs steered by configurable matrix routing 109.
In one example, the algorithm controls the interconnect bus structure of the ICAT processor, and the ICAT architecture processes an input data stream (such as video data, sensor data, or data from previous processing steps) from device 112. For example, previous results may be streamed from a data storage buffer, real-time input data, or any data from other processing steps 110, 111. The processing results may be output directly to device 113 (such as a control output or a video output).
A programmer may configure multiple RAPCs using the AMPC, as shown in the example of FIG. 5. Alternatively, the use of the AMPC may be automated and controlled on-chip by the system-on-chip. FIG. 5 shows a flow chart of six steps: the programmer initially inputs the original high-level source code into a first compiler (the AMPC, labeled ASML). In step 2, the ASML pre-compiler extracts code from the original source, which occurs automatically. The pre-compiler then outputs new source code to a second compiler. This step may be done automatically, or as a separate step by the programmer once the programmer is confident that the new source is debugged and optimized. The second compiler compiles firmware structured for the ICAT architecture. The firmware is then loaded into the ICAT architecture and configures the RAPCs of the ICAT architecture. For example, after the programmer is confident that the firmware is debugged and optimized, the programmer may upload the firmware into the ICAT architecture.
Alternatively, each step can be automatic and can occur without human intervention, except for loading the original source code into the ICAT architecture. By combining a legacy processor with multiple RAPCs and AMPCs, the entire process can be automated based on instructions contained in the original source code, such that the legacy processor runs the AMPC to recompile the original source code to generate firmware used by the legacy processor to build the RAPCs.
The ICAT resource pool may contain three types of processor modules, such as, for example, a math module, a logic module, and a result routing module. The math module executes math functions. The logic module performs logic functions. The result routing module executes branching and data routing functions. For example, in FIG. 6, a Reusable Algorithmic Pipelined Computer (RAPC) is schematically shown. The setup bus 109 is established by the AMPC configuring the setup registers of the ICAT architecture. The operands point to memory locations A, B, C, and D on the Digital Signal Processor (DSP) 110. The DSP is configured to perform mathematical algorithms. The results of the algorithm are directed to a Logical Decision Processor (LDP) 111. The LDP executes logic instructions. The result of the logic instruction is transmitted to the next RAPC, either directly or through a Matrix Bus Switch (MBS). The MBS directs the results to the next RAPC, or controls input, output, and interrupts to transfer the results on the high-speed streaming interface.
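One pass through the three modules just described (math, logic, result routing) might be modeled as follows. The specific operation chosen (absolute difference against a threshold) and all names and signatures are illustrative assumptions, not the actual hardware behavior.

```c
#include <stdint.h>

/* Result of one RAPC step: the routed value and a pass/suppress flag. */
typedef struct { int32_t result; int pass; } rapc_out_t;

/* Model of one clock step through DSP -> LDP -> MBS (illustrative). */
static rapc_out_t rapc_step(int32_t a, int32_t b, int32_t threshold) {
    int32_t diff = a - b;                       /* DSP: math operation     */
    if (diff < 0) diff = -diff;
    int pass = diff > threshold;                /* LDP: logical decision   */
    rapc_out_t out = { pass ? diff : 0, pass }; /* MBS: route or suppress  */
    return out;
}
```

In a pipeline, the `result` of one such step would feed the operand input of the next RAPC on the following clock cycle.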
Hardware resources can be configured into ICAT coprocessor systems that are interconnected in a pipeline structure for optimal performance. In one example, a method is provided for designing a pool of reprogrammable hardware resources that can be reconfigured to run and process multiple processing algorithms in a chip. Hardware resources for configuring ICAT processors may be designed into a chip, and the hardware resources in the chip may be reconfigured by the AMPC. For example, the architecture of an ICAT processing system is configured by the source code for a processing algorithm. Thus, code generated for a conventional processor can run more efficiently on the ICAT architecture, because the hardware of the ICAT processor is configured by the source code, e.g., using the AMPC, to execute algorithms independently of the processor. Thus, the ICAT architecture is capable of configuring the ICAT hardware with source code created for a conventional microprocessor, which is not known in the art. In one example, a hardware resource pool is created that can be configured and reconfigured by the processor into an algorithm matrix structure, and the hardware resource pool then processes multiple processing algorithms in the chip. In one example, the hardware resources process data independently of further commands from other processors by using pipelining.
In one example, the ICAT architecture and the algorithmic matching pipelined compiler combine to achieve computational speed and efficiency not known in the art. For example, the AMPC configures hardware resources for running multiple processing algorithms. The AMPC generates configuration setting firmware that configures processing algorithms from the ICAT resource pool in the ICAT chip. This provides a tool for programmers to take existing application source code designed for legacy processors, and new source code, and match and allocate ICAT hardware resources to create separate hardware processing algorithms within the ICAT architecture. During operation of the SOC for a particular purpose, the AMPC generates firmware that runs on the main processor to configure the ICAT hardware to execute a plurality of algorithms independently of the main processor.
Conventional processors use a similar architecture, including: program memory; fetch-and-execute hardware for executing program instructions step by step; data storage for bulk (heap) data and program stack structures; and instruction fetch-and-execute cycles, management of the program stack, and management of heap data storage, all of which create significant overhead in conventional processor architectures.
In contrast, in one example, the ICAT architecture eliminates almost all of the overhead of a conventional processor system. The ICAT hardware pool is configured by the AMPC and processes algorithms using the ICAT coprocessor architecture and pipelined streaming data structures. Thus, in one example, a method of using the ICAT architecture includes: the AMPC accesses an ICAT hardware compiler table defining the resources available in the chip; a hardware design language (such as Verilog) compiles the ICAT hardware pool 102 for a given processor; the hardware compiler outputs tables that define the structure of the ICAT resource pool in the chip; the AMPC uses these tables generated by the hardware compiler to determine the location and amount of ICAT resources in the chip; the AMPC allocates hardware resources, configures mathematical and logical operations, and creates interconnections for the various algorithms, wherein the source input syntax for the AMPC may include C# syntax or standard mathematical syntax, such as Matlab; the AMPC configures a pipeline structure for each algorithm from the pool of available ICAT hardware resources 103 … 111; and these pipeline structures form, for example, an ICAT coprocessor for each algorithm. For example, the AMPC outputs code that runs on the main processing system 101, which configures the control registers 103, 104, 105, 106 that run the resources of the algorithm on the parallel ICAT coprocessor(s) 102.
For example, the coprocessor system architecture may be configured with an ICAT resource pool 102, the ICAT resource pool 102 being responsive to input from the host processor 101. Alternatively, if the main processor architecture includes an input/output device separate from the main processor, the ICAT resource pool 102 may generate an interrupt and output data to the main processor 101 or the input/output device of the main processor 101. In one example, the ICAT resource pool 102 can be configured by the legacy processor 101, and then the ICAT resource pool 102 runs on its own until reconfigured.
Once the ICAT processors are configured by firmware, the processors of the ICAT architecture continue to process data streams in parallel on their own. In contrast, conventional systems require a process flow that endlessly accesses memory and fetches instructions to determine the various process steps. For example, the AMPC may assign sets of hardware resources (such as math, logic, and routing) to a particular ICAT processor of the ICAT architecture, for example, to perform the processing steps of a particular algorithm. A conventional compiler has no hardware architecture to select and configure in a microprocessor. For example, when the AMPC builds the hardware structure of the ICAT architecture, it can configure the hardware resources of the ICAT architecture into a pipeline architecture to speed up processing performance. A conventional compiler cannot do this.
In the example of FIG. 4, the ICAT control registers 104 are a set of registers for controlling processing functions. For example, a Digital Signal Processor (DSP) input mode register may include split input words, pre-adder control, input register bank select, and other DSP input functions; a DSP ALU mode register may control add, subtract, multiply, divide, right shift, left shift, rotate, AND, OR, XOR, NOR, NAND, and other logic processes; and a DSP multiplexer select may control shift and input select. The DSP may use one DSP48E1 for each ICAT. For example, the DSP48E1 device is provided in Xilinx 7 series field programmable gate arrays. For example, the ICAT memory and logic operations 105 may be used to control memory and memory logic operations.
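As a purely hypothetical illustration of how an ALU mode register could select among the operations listed above: the mode value picks the function applied to the operands. The actual DSP48E1 control fields are defined by Xilinx and differ from this sketch.

```c
#include <stdint.h>

/* Hypothetical ALU mode encoding; values are illustrative only. */
enum alu_mode { ALU_ADD, ALU_SUB, ALU_MUL, ALU_AND, ALU_OR,
                ALU_XOR, ALU_NOR, ALU_NAND, ALU_SHL, ALU_SHR };

/* Apply the operation selected by the mode register value. */
static int64_t alu_exec(enum alu_mode m, int64_t a, int64_t b) {
    switch (m) {
    case ALU_ADD:  return a + b;
    case ALU_SUB:  return a - b;
    case ALU_MUL:  return a * b;
    case ALU_AND:  return a & b;
    case ALU_OR:   return a | b;
    case ALU_XOR:  return a ^ b;
    case ALU_NOR:  return ~(a | b);
    case ALU_NAND: return ~(a & b);
    case ALU_SHL:  return a << (b & 63);
    case ALU_SHR:  return a >> (b & 63);
    }
    return 0;
}
```

Writing a different mode value into the register reconfigures the operation without changing any other part of the datapath.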
In one example, the motion detection algorithm is written in the C language for use on a general purpose computer.
Code example 1: motion detection algorithm written in C language (high level programming language)
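The listing for Code Example 1 does not appear in this text. The following is a hypothetical reconstruction in C, consistent with the description of FIG. 7: each live pixel is compared against the same pixel from the frame-delayed buffer, and movement is flagged when the difference exceeds a threshold. The threshold value and all names are assumptions.

```c
#include <stdint.h>
#include <stdlib.h>

#define THRESHOLD 16   /* assumed motion-detection threshold */

/* Compare a live frame against a one-frame-delayed copy; return 1 if
 * any pixel differs by more than THRESHOLD (movement detected). */
static int detect_motion(const uint8_t *live, const uint8_t *delayed,
                         size_t npixels) {
    for (size_t i = 0; i < npixels; i++) {
        int diff = (int)live[i] - (int)delayed[i];
        if (abs(diff) > THRESHOLD)
            return 1;
    }
    return 0;
}
```

On a conventional processor this loop executes many instructions per pixel; in the FIG. 7 configuration, the subtract, compare, and routing collapse into the DSP, LDP, and MBS stages of a single RAPC.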
Fig. 7 shows a schematic diagram of the hardware configuration resulting from compilation of Code Example 1 with an AMPC compiler. Video device 111 has two outputs: a real-time video pixel stream 113 and a frame delay buffer stream 112. For an RGB output, each pixel comprises red, green, and blue. DSP 115 compares the real-time feed with the delayed feed, and the result is pipelined 117 to LDP 116, which determines whether movement is detected. The result is output by the MBS of RAPC 114. A single RAPC is configured to implement three processing blocks that execute in parallel every clock cycle. In contrast, a conventional processing system must execute 37 instructions to process each pixel of the video to detect movement, and most of these instructions take 3 or more clock cycles on a conventional, non-reconfigurable, non-pipelined processor. Even assuming an average of 3 clock cycles per instruction, which is optimistic for a non-optimized general purpose processor, it takes 111 CPU clock cycles to process each pixel. With the increasing number of pixels on modern cameras, the cycle time available on modern single-core and multi-core processors is clearly insufficient to complete the work.
In contrast, the exemplary configuration of a single RAPC processor, configured by the AMPC compiler from Code Example 1, processes a continuous stream of pixels using the video's pixel clock. The three processing blocks (DSP, LDP, and MBS) are implemented as a pipelined streaming configuration of an FPGA with three clock cycles of latency, but once the pipeline is full (after the first three cycles of the video pixel clock), one pixel is output every clock cycle (one pixel per clock cycle, compared to one pixel per 111 clock cycles). Thus, a single RAPC executes at least 111 times faster than a single core of a conventional processing system: one pixel is processed per clock cycle on the ICAT, versus 37 instructions times 3 clock cycles per instruction, or 111 clock cycles, per pixel on a conventional processor. Since two thousand (or more) RAPC processors can be implemented on a single ICAT chip, the combined processing power can be at least 222,000 times that of a single-core legacy processor. Current legacy processors are limited to four cores or so, and adding cores to legacy processors is not free of additional overhead. More RAPCs can be added than legacy processing cores, and each RAPC may be reconfigured into a pipeline, alone or together with other RAPCs.
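The throughput arithmetic above can be made explicit. The figures (37 instructions, 3 cycles per instruction, 2000 RAPCs) come directly from the text:

```c
/* Cycles per pixel on a conventional core: instructions x cycles each. */
static long cycles_per_pixel(long instructions, long cycles_per_instr) {
    return instructions * cycles_per_instr;
}

/* Combined speedup when each RAPC handles one pixel per cycle:
 * rapc_count x (legacy cycles per pixel). */
static long combined_speedup(long rapc_count, long legacy_cycles_per_pixel) {
    return rapc_count * legacy_cycles_per_pixel;
}
```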
The point of Code Example 1 and FIG. 7 is that adding RAPCs is only a matter of the density and size of the chip; thousands of RAPCs can be added to an ASIC without increasing overhead. Each RAPC is a pipelined parallel processor. Therefore, adding cores, adding caches, and overclocking legacy processors will never bring them close to the performance of a single ICAT chip with tens of RAPCs. Furthermore, all efforts to push a conventional processor result in overheating, excessive cost, and excessive size for conventional, non-reconfigurable, non-pipelined processors. Needless to say, these same methods can also be used to improve the performance of the RAPCs of the ICAT architecture. In any event, adding RAPCs to the ICAT architecture will always significantly improve performance over traditional processor architectures, without requiring a programmer to program specifically for the ICAT architecture. This is a surprising and unexpected result. All attention has been focused on getting more from legacy processors, and little attention has been given to adding a programmable, reconfigurable architecture to legacy processors to enhance the performance of general purpose processors.
Furthermore, implementing the same solution as Code Example 1 on a standard FPGA would require more than merely recompiling standard high-level language code, as shown in this example. For example, to successfully develop a matrix multiplier, a PID, or any complex algorithm in a Xilinx FPGA, the following skills are required: working knowledge of circuit design using RTL and the Verilog language; advanced architectural skills (parallel processing, pipelining, dataflow, resource/performance tradeoffs, etc.); design experience with a wide variety of hardware building blocks (such as arithmetic, logic decisions, memory devices, controller devices, and peripheral interfaces); software design; working knowledge of various versions of higher level programming languages; working knowledge of mathematical algorithms for monitoring and control applications; and knowledge of how to use the Xilinx software tools, such as compiling "C" code into Xilinx hardware, verifying the hardware design and making architectural modifications as needed to meet performance goals, constructing a C code test bench, verifying hardware simulation results against test bench results, and implementing and testing the design in hardware. All of this makes a typical FPGA project time consuming and expensive, far beyond the capabilities of a person with ordinary high-level language programming skills. The prior art reserves FPGAs for niche processing where performance is paramount and the latency and cost of custom design and programming are acceptable.
In contrast, any competent high-level language programmer can program the ICAT technology, because the front-end microprocessor architecture is a familiar, general-purpose architecture. The RAPCs are configured by a general purpose processor and an AMPC that uses the standard architecture of each RAPC to reconfigure one or more RAPCs based on standard code for the front-end processor, for example as shown in FIG. 7. Thus, the ICAT technology, which includes a plurality of RAPCs and an AMPC for configuring and reconfiguring the RAPCs using a standard processor architecture, is a surprising and unexpected improvement over conventional processors and any known FPGA processors.
FIG. 8 illustrates an application of a microprocessor combining a reconfigurable algorithmic pipelined core with an algorithm matching pipeline compiler. Conventional microprocessors lack the speed for video processing, requiring a dedicated and expensive chip set or post-processing. As shown, a general purpose processor with RAPCs and an AMPC provides a solution for processing millions of pixels in real time, providing motion sensing, video compression, and faster upload and download speeds, for example, for video from a general purpose ICAT chip on a consumer electronics board.
Each RAPC may include a DSP, an LDP, and an MBS. The DSP may have a setup interface for programming the desired type of operation (e.g., integer and floating point multiply, divide, add, subtract, etc.). The DSP may have four inputs for operand data, which may be operated on serially or in various combinations of mathematical functions determined by the setup data, such as shown in FIG. 8. The DSP may have a 48-bit accumulator that outputs result data and status data. For example, status data includes execute, equal, greater than, and less than.
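To make the DSP element concrete, the following is a minimal software sketch of its behavior as described above: four operand inputs, a setup-selected operation, a 48-bit accumulator, and comparison status outputs. The function and operation names are illustrative assumptions, not taken from the patent.

```python
# Hypothetical model of a RAPC DSP element (illustrative names, not from
# the patent): four operand inputs, a setup-selected operation, a 48-bit
# accumulator, and status flags (equal / greater than / less than).
ACC_MASK = (1 << 48) - 1  # the accumulator wraps at 48 bits

def dsp_step(setup, a, b, c, d, acc=0):
    """Apply the operation selected by `setup`; return (result, status)."""
    ops = {
        "add": lambda: a + b + c + d,
        "sub": lambda: a - b,
        "mul": lambda: a * b,
        "mac": lambda: acc + a * b,   # multiply-accumulate
    }
    result = ops[setup]() & ACC_MASK
    status = {
        "equal": a == b,
        "greater": a > b,
        "less": a < b,
    }
    return result, status

result, status = dsp_step("mac", 3, 4, 0, 0, acc=10)  # 10 + 3*4 = 22
```

In the hardware described here, the status outputs would feed the LDP rather than a software caller, so no fetch/execute cycle is spent on the comparison.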
For example, the LDP may have a setup interface for programming lookup tables, loop counters, and constant registers. The LDP may have a loop counter that is used to detect when an iterative algorithm is completed. The LDP may have a register that may hold constant data for input to a lookup table. The LDP may have a block of memory that may be used to perform a function. The LUT functions may include: a lookup table that can be implemented and accessed sequentially using a loop counter; a lookup table for control purposes that can be implemented and accessed by DSP status, constant registers, or DSP result data; and a logic lookup table that can be implemented and outputs various logic signals for control purposes. The LDP may pass result data from its input to its output. For example, the LDP may have a pipeline register at its output for result data. Alternatively, the LDP may have two pipeline registers that enable synchronous clearing of the result data at its output.
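The LDP behavior described above can be sketched in software as a lookup table addressed by DSP status bits plus a loop counter that signals completion of an iterative algorithm. This is an illustrative sketch under assumed names, not the patent's implementation.

```python
# Illustrative sketch (not the patent's implementation) of an LDP-style
# logic element: a LUT addressed by status bits, and a loop counter that
# detects when an iterative algorithm has completed.
class LDP:
    def __init__(self, lut, loop_limit):
        self.lut = lut              # truth table: address -> control signal
        self.loop_limit = loop_limit
        self.count = 0

    def step(self, status_bits):
        """Pack status bits into a LUT address; advance the loop counter."""
        address = 0
        for bit in status_bits:
            address = (address << 1) | int(bit)
        self.count += 1
        done = self.count >= self.loop_limit
        return self.lut[address], done

# A 2-input LUT implementing AND of two status signals:
ldp = LDP(lut=[0, 0, 0, 1], loop_limit=3)
signal, done = ldp.step([True, True])   # signal == 1, done == False
```

In hardware, the returned control signal would drive routing or bus-switch logic directly, with no program-counter overhead.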
This detailed description provides examples including the features and elements of the claims for the purpose of enabling a person of ordinary skill in the art to make and use the invention as described in the claims. However, these examples are not intended to directly limit the scope of the claims. Rather, these examples provide the features and elements of the claims that have been disclosed in the description, claims, and drawings, and may be varied and combined in ways known in the art.
For example, without any limitation, 3325 RAPCs may be configured on a single Xilinx FPGA chip operating at a moderate clock rate of 100 MHz, where Xilinx is a trademark of Xilinx, Inc. Each RAPC can perform 1 or 2 logical and mathematical operations on each clock. Thus, this configuration produces 332 giga floating point operations per second (GigaFLOPS). For example, the configuration uses a lookup table (LUT) for each of four mathematical operations (e.g., addition, subtraction, multiplication, division) and four logical operations (e.g., greater than, less than, equal to, not equal to). The standard LUT memory size is 512 bytes. In addition, a "greater than configurable constant value" LUT may be provided in addition to the other logical operation LUTs. In one example, the output signals of the LUT are used to control bus multiplexer switches to route results between RAPCs. The AMPC compiler pre-compiles source code written in a higher level programming language for the von Neumann architecture, and selects the LUT for each operation performed by the RAPC, generating a non-von Neumann processor from source code written for the von Neumann architecture.
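The "greater than configurable constant value" LUT mentioned above can be illustrated in a few lines: the comparison is precomputed for every possible operand value, so evaluating it is a single table read whose output bit can drive a bus-multiplexer select line. This sketch assumes an 8-bit operand for brevity; the constant and function names are illustrative, not from the patent.

```python
# Sketch of a "greater than a configurable constant" LUT, assuming an
# 8-bit operand (a wider operand simply means a larger table, such as
# the 512-byte LUT mentioned in the text).
CONSTANT = 100  # illustrative configurable constant
gt_lut = bytes(1 if x > CONSTANT else 0 for x in range(256))

def gt_signal(operand):
    """One table read replaces a compare instruction; the output bit
    could control a bus multiplexer switch between RAPCs."""
    return gt_lut[operand & 0xFF]
```

Reconfiguring the comparison means rewriting the table contents, not changing any hardware, which is how the AMPC can retarget a RAPC from compiled source code.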
Compared to any conventional von Neumann processor, 332 GigaFLOPS is considerable, especially when it is understood that this is achieved without any special cooling requirements on the chip, where a GigaFLOPS is defined as one billion floating point operations per second. In contrast, conventional von Neumann processing systems require separate fetch and execute cycles for each mathematical operation, logical operation, and branch operation, whereas RAPCs do not.
In one example, the calculation indicates that a Xilinx Virtex ZU13 chip (where Xilinx and Virtex are trademarks of Xilinx, Inc.) at a clock speed of 741 MHz can be configured with 236,250 RAPCs, giving the chip a capability of greater than 175,000 GigaFLOPS, which is a surprising and unexpected result to those skilled in the art. This result is possible because the RAPC does not require separate fetch and execute cycles for each mathematical operation, logical operation, and branch operation performed. This and other problems caused by the von Neumann architecture of general-purpose computer processors are solved using the RAPCs and architecture described herein. It is a very surprising and unexpected result to a person skilled in the art, and even to an expert in the field, that a program written for a processor with a von Neumann architecture (i.e., all known modern general-purpose processors) runs on the described architecture without being rewritten.
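The throughput figures quoted in the two examples above follow directly from cores times clock rate, as this back-of-envelope check shows (assuming one operation retired per RAPC per clock; the text says 1 or 2):

```python
# Back-of-envelope check of the GigaFLOPS figures quoted above,
# assuming each RAPC retires one operation per clock.
def gigaflops(num_rapcs, clock_hz, ops_per_clock=1):
    return num_rapcs * clock_hz * ops_per_clock / 1e9

print(gigaflops(3_325, 100e6))     # 332.5, matching the ~332 GigaFLOPS figure
print(gigaflops(236_250, 741e6))   # 175061.25, i.e. > 175,000 GigaFLOPS
```

The linearity of this formula is the point of the architecture: performance scales with the number of RAPCs placed on the die, not with fetch/execute throughput.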
Claims (24)
1. A reconfigurable algorithmic pipelined core comprising:
a processing unit;
an array of reconfigurable field programmable gates, wherein the field programmable gates are programmed by an algorithm matching pipeline compiler such that the algorithm matching pipeline compiler precompiles, for processing by the processing unit, source code that is designed to run on a standard processor without parallel processing, and the processing unit and the algorithm matching pipeline compiler configure the field programmable gates to run as a pipelined parallel processor.
2. The core of claim 1 wherein the algorithm matching pipeline compiler is a pre-compiler.
3. The core of claim 2, wherein the pre-compiler is configured to pre-compile a standard higher-level software language written for a processor of a legacy, non-reconfigurable type other than the core, and the pre-compiler, using the processor of the legacy, non-reconfigurable type, generates machine code to configure the array of reconfigurable field programmable gates.
4. The core of claim 3, wherein the standard higher level software language is C or C++.
5. The core of claim 2, wherein the core comprises a computer pool configured to process algorithms required for a particular computation based on output from the pre-compiler.
6. The core of claim 1, wherein the field programmable gate is configured to complete a task without any further overhead from the processing unit.
7. The core of claim 5, further comprising an intelligent bus controller or logical processor, wherein the intelligent bus controller or logical processor performs all logical functions handled by the core.
8. The core of claim 5, further comprising a logical processor and a main bus switch, and the logical processor comprises a reconfigurable logic function to control the main bus switch.
9. The core of claim 7, further comprising a digital signal processor, wherein the digital signal processor comprises a reconfigurable math processor to perform math calculations.
10. The core of claim 8, wherein the master bus switch is a matrix bus router or matrix bus switch comprising circuitry reconfigurably programmable by the pre-compiler and the processing unit such that, during processing of an algorithm, data and results are routed from the core to another core to complete the algorithm without any further intervention from a central or peripheral processor, which reduces overhead by pipelining compared to static, non-reconfigurable hardware that requires central or peripheral processor intervention to direct data and results to and from an arithmetic processing unit.
11. The core of claim 9, wherein the logical processor processes logical decisions and iterative loops, and a result store is provided by the logical processor for a learning algorithm.
12. A system comprising a plurality of cores according to claim 1, comprising the steps of:
processing all mathematical operations using a digital signal processor of one or more of the plurality of cores; and
processing all logical functions using one or more logical processors of one or more of the plurality of cores.
13. The system of claim 11, further comprising the step of configuring the plurality of cores as a pool of cores, and each core in the pool of cores is individually reconfigurable by programming without requiring any changes to hardware.
14. The system of claim 12, wherein the step of configuring configures all of the plurality of cores to process the algorithm in parallel without any further intervention from a central or peripheral processor for importing and exporting data and results to and from the arithmetic processing unit.
15. The system of claim 13, wherein the algorithm matching pipeline compiler is a pre-compiler and the logical processor of each of the plurality of cores uses a memory block configured by the pre-compiler as a lookup table and a register for a constant value or a learned value.
16. The system of claim 14, further comprising setting, by the logic processor, the lookup table, wherein the lookup table is an n-bit lookup table, and the n-bit lookup table is used to encode an n-bit Boolean logic function as a truth table.
17. The system of claim 11, further comprising: generating, using the algorithm matching pipeline compiler, machine code for one or more of the plurality of cores from a standard higher-level software language written for a conventional, non-reconfigurable, and non-pipelined general-purpose computer processor.
18. The system of claim 16, wherein the standard higher level software language is written for a conventional non-reconfigurable type of processor, and the step of generating machine code comprises: the algorithm matching pipeline compiler as a pre-compiler generates machine code for configuring an array of reconfigurable field programmable gates of each core of the plurality of cores using the conventional non-reconfigurable type processor.
19. The system of claim 17, wherein the system comprises at least one processor of the legacy, non-reconfigurable type for which the standard higher-level software language is written, and the at least one processor of the legacy, non-reconfigurable type generates machine code for configuring the array of reconfigurable field programmable gates of each of the plurality of cores.
20. The system of claim 18, wherein each core of the plurality of cores is configured to independently process complex mathematical and logical algorithms without further intervention of the at least one processor of the legacy, non-reconfigurable type for which the standard higher-level software language is written.
21. The system of claim 19, further comprising inputting a value into the system, and the system outputting a solution to a main bus switch of the system without further intervention.
22. The system of claim 20, wherein the plurality of cores comprises 2000 cores.
23. The system of claim 21, wherein the system operates 360 trillion instructions per second at a clock speed of 500 MHz.
24. The system of claim 22, wherein the system has a delay period for input data, but pipelining reduces overhead such that the system is configured to perform computations and output results on each clock of each core after the delay period begins.
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US62/287,265 | 2016-01-26 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| HK40008643A true HK40008643A (en) | 2020-06-19 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US10970245B2 (en) | Processor with reconfigurable pipelined core and algorithmic compiler | |
| JP7183197B2 (en) | high throughput processor | |
| CN109597459B (en) | Processor and method for privilege configuration in a spatial array | |
| CN109213523B (en) | Processor, method and system for configurable spatial accelerator with memory system performance, power reduction and atomic support features | |
| Stitt | Are field-programmable gate arrays ready for the mainstream? | |
| Jain et al. | Coarse grained FPGA overlay for rapid just-in-time accelerator compilation | |
| Jesshope et al. | Design of SIMD microprocessor array | |
| Giefers et al. | An FPGA-based reconfigurable mesh many-core | |
| Li et al. | FPGA overlays: hardware-based computing for the masses | |
| Zhu et al. | A hybrid reconfigurable architecture and design methods aiming at control-intensive kernels | |
| HK40008643A (en) | Processor with reconfigurable algorithmic pipelined core and algorithmic matching pipelined compiler | |
| Yang et al. | Application software beyond exascale: challenges and possible trends | |
| Pfenning et al. | Transparent FPGA acceleration with tensorflow | |
| HK40027650A (en) | High throughput processors | |
| Elshimy et al. | A Near-Memory Dynamically Programmable Many-Core Overlay | |
| Guo et al. | Evaluation and tradeoffs for out-of-order execution on reconfigurable heterogeneous MPSoC | |
| Muñoz Hernandez | Enhancing single-instruction multiple-threads FPGA-based processors with high-bandwidth-memory | |
| Adário et al. | Reconfigurable computing: Viable applications and trends | |
| Guccione | Software for Reconfigurable Computing | |
| Schwiegelshohn et al. | Reconfigurable Processors and Multicore Architectures | |
| Chickerur et al. | Reconfigurable Computing: A Review | |
| YUE | FPGA OVERLAY ARCHITECTURES ON THE XILINX ZYNQ AS PROGRAMMABLE ACCELERATORS | |
| Miyajima | A Toolchain for Application Acceleration on Heterogeneous Platforms |