US20130185540A1 - Processor with multi-level looping vector coprocessor - Google Patents
- Publication number
- US20130185540A1 (U.S. application Ser. No. 13/548,924)
- Authority
- US
- United States
- Prior art keywords
- vector
- loop
- processor
- instructions
- core
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/8007—Single instruction multiple data [SIMD] multiprocessors
- G06F15/8053—Vector processors
- G06F9/3001—Arithmetic instructions
- G06F9/30021—Compare instructions, e.g. Greater-Than, Equal-To, MINMAX
- G06F9/30032—Movement instructions, e.g. MOVE, SHIFT, ROTATE, SHUFFLE
- G06F9/30036—Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
- G06F9/30043—LOAD or STORE instructions; Clear instruction
- G06F9/30065—Loop control instructions; iterative instructions, e.g. LOOP, REPEAT
- G06F9/30087—Synchronisation or serialisation instructions
- G06F9/30098—Register arrangements
- G06F9/3012—Organisation of register space, e.g. banked or distributed register file
- G06F9/3013—Organisation of register space according to data content, e.g. floating-point registers, address registers
- G06F9/345—Addressing or accessing the instruction operand or the result; addressing modes of multiple operands or results
- G06F9/3887—Concurrent instruction execution using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
Definitions
- processor designs include coprocessors that are intended to accelerate execution of a given set of processing tasks. Some such coprocessors achieve good performance/area in typical processing tasks, such as scaling, filtering, transformation, sum of absolute differences, etc., executed by a digital signal processor (DSP).
- processing tasks often require numerous passes of processing through a coprocessor, compromising power efficiency.
- access patterns required by DSP algorithms are becoming less regular, thereby negatively impacting the overall processing efficiency of coprocessors designed to accommodate more regular access patterns. Consequently, processor and coprocessor architectures that provide improved processing, power, and/or area efficiency are desirable.
- a processor that includes a control processor core and a vector processor core is disclosed herein.
- a processor includes a scalar processor core and a vector coprocessor core coupled to the scalar processor core.
- the scalar processor core includes a program memory interface through which the scalar processor retrieves instructions from a program memory.
- the instructions include scalar instructions executable by the scalar processor and vector instructions executable by the vector coprocessor core.
- the vector coprocessor core includes a plurality of execution units and a vector command buffer.
- the vector command buffer is configured to decode vector instructions passed by the scalar processor core, to determine whether vector instructions defining an instruction loop have been decoded, and to initiate execution of the instruction loop by one or more of the execution units based on a determination that all of the vector instructions of the instruction loop have been decoded.
- a vector coprocessor in another embodiment, includes a plurality of execution units and a vector command buffer.
- the execution units are configured to simultaneously apply an instruction specified operation to different data values.
- the vector command buffer is configured to decode instructions to be executed by the execution units, to identify an instruction loop in the instructions, and to provide the instructions to the execution units for execution based on a determination that all of the instructions of an identified instruction loop have been decoded.
- a processor in a further embodiment, includes a control processor core and a single-instruction multiple data (SIMD) coprocessor core coupled to the control processor core.
- the control processor core includes a program memory interface through which the control processor core retrieves instructions from a program memory.
- the instructions comprise instructions executable by the control processor core and SIMD instructions executable by the SIMD coprocessor core.
- the SIMD coprocessor core includes a plurality of execution units, a vector command buffer, and loop control logic coupled to the vector command buffer.
- the vector command buffer is configured to group SIMD instructions received from the control processor core into an instruction loop, and to initiate execution of the instruction loop based on a complete set of SIMD instructions of an instruction loop being received from the control processor core.
- the loop control logic is configured to manage execution of the instruction loop as a plurality of nested loops without loop overhead.
- FIG. 1 shows a block diagram of a processor in accordance with various embodiments
- FIG. 2 shows a block diagram of a processor in accordance with various embodiments
- FIG. 3 shows a block diagram of a vector coprocessor core in accordance with various embodiments
- FIG. 4 shows a block diagram of a vector command buffer of the vector coprocessor core in accordance with various embodiments
- FIG. 5 shows a diagram of scalar processor core and vector coprocessor core execution interaction in accordance with various embodiments
- FIGS. 6A-6F show load data distributions provided by a load unit of a vector coprocessor core in accordance with various embodiments
- FIG. 7 shows a table of load unit data distributions in accordance with various embodiments.
- FIG. 8 shows a table of store unit data distributions in accordance with various embodiments.
- the term “software” includes any executable code capable of running on a processor, regardless of the media used to store the software.
- code stored in memory (e.g., non-volatile memory), sometimes referred to as "embedded firmware," is included within the definition of software.
- the recitation “based on” is intended to mean “based at least in part on.” Therefore, if X is based on Y, X may be based on Y and any number of other factors.
- the terms “alternate,” “alternating” and the like are used to designate every other one of a series.
- FIG. 1 shows a block diagram of a processor 100 in accordance with various embodiments.
- the processor 100 includes a scalar processor core 102 , a vector coprocessor core 104 , a program memory 106 , a data memory 108 , a working buffer memory 110 , an A buffer memory 112 , and a B buffer memory 114 .
- the A and B buffer memories 112 , 114 are partitioned into a low and high A buffer memory ( 112 A, 112 B) and a low and high B buffer memory ( 114 A, 114 B) to allow simultaneous direct memory access (DMA) and access by the cores 102 , 104 .
- each of the working buffer memory 110 , A buffer memory 112 , and B buffer memory 114 may comprise N simultaneously accessible banks.
- the vector coprocessor core 104 is an 8-way single-instruction multiple-data (SIMD) core
- each of the working, A, and B buffers 110 , 112 , 114 may comprise 8 banks each of suitable word width (e.g., 32 bits or more wide) that are simultaneously accessible by the vector coprocessor core 104 .
- a switching network 118 provides signal routing between the memories 108, 110, 112, 114 and the various systems that share access to memory (e.g., DMA and the processor cores 102, 104).
- FIG. 2 shows a block diagram of the processor 100 including various peripherals, including DMA controller 202 , memory management units 204 , clock generator 206 , interrupt controller 208 , counter/time module 210 , trace port 214 , memory mapped registers 212 and various interconnect structures that link the components of the processor 100 .
- the scalar processor core 102 may be a reduced instruction set processor core, and include various components, such as execution units, registers, instruction decoders, peripherals, input/output systems and various other components and sub-systems.
- Embodiments of the scalar processor core 102 may include a plurality of execution units that perform data manipulation operations.
- an embodiment of the scalar processor core 102 may include five execution units: a first execution unit that performs logical, shift, rotation, extraction, reverse, clear, set, and equal operations; a second execution unit that performs data movement operations; a third execution unit that performs arithmetic operations; a fourth execution unit that performs multiplication; and a fifth execution unit that performs division.
- the scalar processor core 102 serves as a control processor for the processor 100 , and executes control operations, services interrupts, etc.
- the vector coprocessor core 104 serves as a signal processor for processing signal data (e.g., image signals) provided to the vector coprocessor core 104 via the memories 110 , 112 , 114 .
- the program memory 106 stores instructions to be executed by the scalar core 102 interspersed with instructions to be executed by the vector coprocessor core 104 .
- the scalar processor core 102 accesses the program memory 106 and retrieves therefrom an instruction stream comprising instructions to be executed by the scalar processor core 102 and instructions to be executed by the vector coprocessor core 104 .
- the scalar processor core 102 identifies instructions to be executed by the vector coprocessor core 104 and provides the instructions to the vector coprocessor core 104 via a coprocessor interface 116 .
- the scalar processor 102 provides vector instructions, control data, and/or loop instruction program memory addresses to the vector coprocessor core 104 via the coprocessor interface 116 .
- the loop instruction program memory addresses may be provided concurrently with a loop instruction, and the control data may be provided concurrently with a control register load instruction.
- the program memory 106 may be a cache memory that fetches instructions from a memory external to the processor 100 and provides the instructions to the scalar processor core 102 .
- FIG. 3 shows a block diagram of the vector coprocessor core 104 in accordance with various embodiments.
- the vector coprocessor core 104 may be an SIMD processor that executes instructions arranged as a loop. More specifically, the vector coprocessor core 104 executes vector instructions within a plurality of nested loops. In some embodiments, the vector coprocessor core 104 includes built-in looping control that executes instructions in four or more nested loops with zero looping overhead.
- the vector coprocessor core 104 includes a command decoder/buffer 302 , loop control logic 304 , a vector register file 306 , processing elements 308 , a table look-up unit 310 , a histogram unit 312 , load units 314 , store units 316 , and address generators 318 .
- the load units 314 and store units 316 access the working buffer memory 110 , an A buffer memory 112 , and a B buffer memory 114 through a memory interface 320 .
- the address generators 318 compute the addresses applied by the load and store units 314 , 316 for accessing memory.
- the memory interface 320 connects the vector coprocessor core 104 to the memories 110, 112, 114 via a lane of interconnect corresponding to each bank of each memory.
- a memory 110 , 112 , 114 having eight parallel banks connects to the vector coprocessor core 104 via eight parallel memory lanes, where each memory lane connects to a port of the memory interface 320 .
- Memory lanes that connect to adjacent ports of the memory interface 320 are termed adjacent memory lanes.
- the coprocessor core 104 includes N processing lanes, where each lane includes a processing element 308 and a set of registers of the vector register file 306 that provide operands to and store results generated by the processing element 308 .
- Each processing element 308 may include a plurality of function units that operate on (e.g., multiply, add, compare, etc.) the operands provided by the register file 306 .
- the register file 306 is N-way and includes storage of a plurality of entries.
- the register file 306 may be N×16, where the register file includes sixteen registers for each of the N ways of the vector coprocessor core 104. Corresponding registers of adjacent ways are termed adjacent registers. Thus, register R0 of SIMD way 0 is adjacent to register R0 of SIMD way 1. Similarly, register R0 of SIMD way 0 and register R0 of SIMD way 2 are alternate registers.
- the processing elements 308 and the registers of the register file 306 are sized to process data values of various sizes. In some embodiments, the processing elements 308 and the registers of the register file 306 are sized to process 40-bit and smaller data values (e.g., 32-bit, 16-bit, 8-bit). Other embodiments may be sized to process different data value sizes.
- the vector coprocessor core 104 repeatedly executes a vector instruction sequence (referred to as a vector command) within a nested loop.
- the nested looping is controlled by the loop control logic 304 .
- the scalar core 102 continues to decode and execute the instruction stream retrieved from program memory 106 , until execution of a coprocessor synchronization instruction (by the scalar core 102 ) forces the scalar core 102 to stall for vector coprocessor core 104 vector command completion. While the scalar core 102 is stalled, the scalar core 102 may service interrupts unless interrupt processing is disabled.
- the scalar core 102 executes instructions and services interrupts in parallel with vector coprocessor core 104 instruction execution.
- Instruction execution by the scalar core 102 may be synchronized with instruction execution by the vector coprocessor core 104 based on the scalar core 102 executing a synchronization instruction that causes the scalar core 102 to stall until the vector coprocessor core 104 asserts a synchronization signal indicating that vector processing is complete.
- the synchronization signal may be triggered by execution of a synchronization instruction by the vector coprocessor core 104 .
- the command decode/buffer 302 of the vector coprocessor core 104 includes an instruction buffer that provides temporary storage for vector instructions.
- FIG. 4 shows a block diagram of the command decode/buffer 302 of the vector coprocessor core 104 in accordance with various embodiments.
- the command decode/buffer 302 includes a pre-decode first-in first-out (FIFO) buffer 402 , a vector instruction decoder 404 , and vector command storage buffers 406 .
- Each vector command storage buffer 406 includes capacity to store a complete vector command of maximum size.
- Vector instructions flow from the scalar processor core 102 through the pre-decode FIFO 402 and are decoded by the vector instruction decoder 404 .
- the decoded vector instructions corresponding to a given vector command are stored in one of the vector command storage buffers 406 , and each stored vector command is provided for execution in sequence. Execution of a decoded vector command is initiated (e.g., the vector command is read out of the vector command storage buffer 406 ) only after the complete vector command is decoded and stored in a vector command storage buffer 406 .
- the command decode/buffer 302 loads a vector command into each of the vector command storage buffers 406 , and when the vector command storage buffers 406 are occupied additional vector instructions received by the command decode/buffer 302 are stored in the pre-decode buffer 402 until execution of a vector command is complete, at which time the FIFO buffered vector command may be decoded and loaded into the emptied vector command storage buffer 406 previously occupied by the executed vector command.
- FIG. 5 shows a diagram of scalar processor core 102 and vector coprocessor core 104 interaction in accordance with various embodiments.
- vector instructions i 0 -i 3 form a first exemplary vector command
- vector instructions i 4 -i 7 form a second exemplary vector command
- vector instructions i 8 -i 11 form a third exemplary vector command.
- the scalar processor core 102 recognizes vector instructions in the instruction stream fetched from program memory 106 .
- the scalar processor core 102 asserts the vector valid signal (vec_valid) and passes the identified vector instructions to the vector coprocessor core 104 .
- the first vector command has been transferred to the vector coprocessor core 104 , and the vector coprocessor core 104 initiates execution of the first vector command while the scalar processor core 102 continues to transfer the vector instructions of the second vector command to the vector coprocessor core 104 .
- transfer of the second vector command to the vector coprocessor core 104 is complete, and the execution of the first vector command is ongoing. Consequently, the vector coprocessor core 104 negates the ready signal (vec_rdy) which causes the scalar processor core 102 to discontinue vector instruction transfer.
- execution of the first vector command is complete, and execution of the second vector command begins.
- vector coprocessor core 104 With completion of the first vector command, vector coprocessor core 104 asserts the ready signal, and the command decode/buffer 302 receives the vector instructions of the third vector command. At time T 5 , the vector coprocessor core 104 completes execution of the second vector command. At time T 6 , transfer of the third vector command is complete, and the vector coprocessor core 104 initiates execution of the third vector command.
- a VWDONE instruction follows the last instruction of the third vector command. The VWDONE instruction causes the scalar processor core 102 to stall pending completion of the third vector command by the vector coprocessor core 104 .
- the vector coprocessor core 104 executes the VWDONE command which causes the vector coprocessor core 104 to assert the vector done signal (vec_done). Assertion of the vector done signal allows the scalar processor core 102 to resume execution, thus providing core synchronization.
- for simplicity, the example of FIG. 5 excludes the pre-decode FIFO 402. In embodiments including the pre-decode FIFO 402, the vector coprocessor core 104 delays negation of vec_rdy until the pre-decode FIFO 402 is full.
- operations of vector command execution can be represented as sequential load, arithmetic operation, store, and pointer update stages, where a number of operations may be executed in each stage.
- the following listing shows a skeleton of the nested loop model for a four-loop embodiment of the vector coprocessor core 104.
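- a minimal C-style sketch of that model follows; the loop-counter names i1-i4, the terminal-count names lpend1-lpend4, and the reduction of each stage to a comment are assumptions for illustration, not the patent's actual listing:

```c
/* Skeleton of the four-level nested loop model executed for a vector
 * command; the hardware loop control incurs no branch overhead.      */
void vector_command(int lpend1, int lpend2, int lpend3, int lpend4)
{
    for (int i1 = 0; i1 <= lpend1; i1++)
        for (int i2 = 0; i2 <= lpend2; i2++)
            for (int i3 = 0; i3 <= lpend3; i3++)
                for (int i4 = 0; i4 <= lpend4; i4++) {
                    /* load stage:    vector loads from buffer memory */
                    /* compute stage: SIMD arithmetic on registers    */
                    /* store stage:   vector stores to buffer memory  */
                    /* pointer update stage: address generator steps  */
                }
}
```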
- Each loop variable is incremented from 0 to its terminal count, lpend1 . . . lpend4.
- Each iteration of the innermost loop (i 4 ) executes in a number of cycles equal to the maximal number of cycles spent in execution of loads, arithmetic operations, and stores within the loop.
- Cycle count for the arithmetic operations is constant for each iteration, but cycle count for load and store operations can change depending on pointer update, loop level, and read/write memory contention.
- Embodiments define a vector command with a loop initiation instruction, VLOOP.
- the vector instructions following VLOOP initialize the registers and address generators of the vector coprocessor core 104, and specify the load operations, arithmetic and data manipulation operations, and store operations to be performed within the nested loops.
- the parameters applicable to execution of a vector command (e.g., loop counts, address pointers to arrays, constants used in the computation, round/truncate shift count, saturation bounds, etc.) are stored in a parameter file referenced by a parameter pointer.
- while embodiments of the vector coprocessor core 104 may always execute a fixed number of nested loops (e.g., 4 as shown in the model above), with loop terminal counts of zero or greater, some embodiments include an optional outermost loop (e.g., an optional fifth loop).
- the optional outermost loop encompasses the fixed number of nested loops associated with the VLOOP instruction, and may be instantiated separately from the fixed number of nested loops.
- execution of the optional outermost loop requires no looping overhead.
- Each iteration of the optional outermost loop may advance a parameter pointer associated with the nested loops.
- the parameter pointer may be advanced by param_len provided in the VLOOP instruction.
- the parameter pointer references the parameter file that contains the parameters applicable to execution of the vector command as explained above (loop counts, etc.).
- embodiments of the vector coprocessor core 104 can apply the vector command to objects/structures/arrays of varying dimension or having varying inter-object spacing. For example, changing loop counts for the nested loops allows the vector coprocessor core 104 to process objects of varying dimensions with a single vector command, and without the overhead of a software loop.
- the loop count of the optional outer loop and the parameter pointer may be set by execution of an instruction by the vector coprocessor core 104 .
- the instruction may load a parameter, such as the outer loop count or the parameter pointer, into a control register of the core 104.
- Execution of a vector command may be complete when a total number of iterations specified in the parameter file for each loop of the vector command are complete. Because it is advantageous in some situations to terminate the vector command prior to execution of all specified loop iterations, the vector coprocessor core 104 provides early termination of a vector command. Early termination is useful when, for example, the vector command has identified a condition in the data being processed that makes additional processing of the data superfluous. Early termination of a vector command is provided for by execution, in the vector command, of a loop early exit instruction, VEXITNZ.
- Execution of the VEXITNZ instruction causes the vector coprocessor core 104 to examine the value contained in the register src1 (e.g., associated with a given SIMD lane), and to schedule loop termination if the value is non-zero. Other embodiments may schedule loop termination based on other conditions of the value (e.g., zero, particular bit set, etc.). If the level parameter indicates that the vector command is to be exited, then the vector coprocessor core 104 schedules the nested loops associated with the vector command to terminate after completion of the current iteration of the innermost of the nested loops. Thus, if the level parameter indicates that the vector command is to be exited, any optional outermost loop encompassing the vector command is not exited, and a next iteration of the vector command may be executed.
- if the level parameter indicates that the optional outermost loop is to be exited, the vector coprocessor core 104 schedules the optional outermost loop to terminate after completion of all remaining iterations of the nested loops associated with the vector command encompassed by the optional outermost loop.
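- a hedged C model of the VEXITNZ scheduling behavior described above (the structure, enum, and field names are assumptions; the actual instruction operates on state in the loop control logic 304):

```c
#include <stdbool.h>

enum exit_level { EXIT_VECTOR_COMMAND, EXIT_OUTERMOST_LOOP };

struct loop_ctl {
    bool end_after_inner_iteration; /* terminate nested loops after the
                                       current innermost iteration      */
    bool end_after_nested_loops;    /* terminate the optional outermost
                                       loop after remaining iterations  */
};

void vexitnz(struct loop_ctl *ctl, int src1, enum exit_level level)
{
    if (src1 == 0)
        return;                     /* condition not met: no exit scheduled */
    if (level == EXIT_VECTOR_COMMAND)
        ctl->end_after_inner_iteration = true; /* outer loop still advances */
    else
        ctl->end_after_nested_loops = true;
}
```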
- the load units 314 move data from the memories 110 , 112 , 114 to the registers of the vector register file 306 , and include routing circuitry that distributes data values retrieved from the memories 110 , 112 , 114 to the registers in various patterns that facilitate efficient processing.
- Load instructions executed by the vector coprocessor core 104 specify how the data is to be distributed to the registers.
- FIGS. 6A-6F show load data distributions provided by the load unit 314 of the vector coprocessor core 104 in accordance with various embodiments. While the illustrative distributions of FIGS. 6A-6F are directed to loading data values of a given size (e.g., 16 bits), embodiments of the load units 314 may apply similar distributions to data values of other sizes (e.g., 8 bits, 32 bits, etc.).
- the load units 314 may move data from memory 110 , 112 , 114 to the vector registers 306 with instruction specified distribution in a single instruction cycle.
- FIG. 6A shows a load unit 314 retrieving a data value from each of eight locations of a memory 110, 112, 114 (e.g., a value from each of eight banks) via eight adjacent lanes and distributing the retrieved data values to eight adjacent registers of the vector register file 306 (e.g., a register corresponding to each SIMD lane). More generally, the load unit 314 moves a value from memory via each of a plurality of adjacent lanes, and distributes the data values to a plurality of adjacent registers of the vector register file 306 in a single instruction cycle.
- FIG. 6B shows a load unit 314 retrieving a data value from a single location of a memory 110 , 112 , 114 , and distributing the retrieved data value to each of eight adjacent registers of the vector register file 306 . More generally, the load unit 314 moves a value from a single location of a memory 110 , 112 , 114 , and distributes the data value to a plurality of adjacent registers of the vector register file 306 in a single instruction cycle. Thus, the load unit 314 may distribute a single value from memory 110 , 112 , 114 to each of N ways of the vector coprocessor core 104 .
- FIG. 6C shows a load unit 314 retrieving a data value from each of two locations of a memory 110, 112, 114 via adjacent lanes, and distributing the retrieved data values to each of four adjacent pairs of registers of the vector register file 306. More generally, the load unit 314 moves a value from each of two locations of a memory 110, 112, 114 via adjacent lanes, and distributes the data values to a plurality of adjacent pairs of registers of the vector register file 306 in a single instruction cycle. That is, each value of the pair of values is written to alternate registers of the register file 306 (e.g., one value to odd indexed registers and the other value to even indexed registers). Thus, the load unit 314 may distribute a pair of values from memory 110, 112, 114 to each of N/2 way pairs of the vector coprocessor core 104.
- FIG. 6D shows a load unit 314 retrieving a data value from each of eight locations of a memory 110 , 112 , 114 via alternate lanes (e.g., from odd indexed locations or even indexed locations), and distributing the retrieved data values to eight adjacent registers of the vector register file 306 . More generally, the load unit 314 moves a value from each of a plurality of locations of a memory 110 , 112 , 114 via alternate lanes, and distributes the data values to a plurality of adjacent registers of the vector register file 306 in a single instruction cycle. Thus, the load unit 314 provides down-sampling of the data stored in memory by a factor of two.
- FIG. 6E shows a load unit 314 retrieving a data value from each of four locations of a memory 110, 112, 114 via adjacent lanes, and distributing each of the retrieved data values to two adjacent registers of the vector register file 306. More generally, the load unit 314 moves a value from each of a plurality of locations of a memory 110, 112, 114 via adjacent lanes, and distributes each of the data values to two adjacent registers of the vector register file 306 in a single instruction cycle. Thus, the load unit 314 provides up-sampling of the data stored in memory by a factor of two.
- FIG. 6F shows a load unit 314 retrieving a data value from each of sixteen locations of a memory 110, 112, 114 via adjacent lanes, and distributing each of the retrieved data values to registers of the vector register file 306 such that data values retrieved via even numbered lanes are distributed to adjacent registers and data values retrieved via odd numbered lanes are distributed to adjacent registers. More generally, the load unit 314 moves a value from each of a plurality of locations of a memory 110, 112, 114 via adjacent lanes, and distributes the data values in deinterleaved fashion to two sets of adjacent registers of the vector register file 306. Thus, the load unit 314 provides deinterleaving of data values across registers M and M+1, where register M encompasses a given register of each way of the N-way vector coprocessor core 104, in a single instruction cycle.
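- the addressing behind these distributions can be summarized in a short sketch; this is a hedged model of the FIG. 6A-6F patterns, not the load unit's implementation, with an 8-way core and 16-bit elements assumed:

```c
#include <stdint.h>

#define N 8                                  /* SIMD ways (assumed) */

void load_adjacent(int16_t v[N], const int16_t *mem)    /* FIG. 6A */
{
    for (int w = 0; w < N; w++) v[w] = mem[w];
}

void load_broadcast(int16_t v[N], const int16_t *mem)   /* FIG. 6B */
{
    for (int w = 0; w < N; w++) v[w] = mem[0];
}

void load_downsample2(int16_t v[N], const int16_t *mem) /* FIG. 6D */
{
    for (int w = 0; w < N; w++) v[w] = mem[2 * w];      /* alternate lanes */
}

void load_upsample2(int16_t v[N], const int16_t *mem)   /* FIG. 6E */
{
    for (int w = 0; w < N; w++) v[w] = mem[w / 2];      /* each value twice */
}

void load_deinterleave(int16_t v0[N], int16_t v1[N],
                       const int16_t *mem)              /* FIG. 6F */
{
    for (int w = 0; w < N; w++) {
        v0[w] = mem[2 * w];                             /* even lanes */
        v1[w] = mem[2 * w + 1];                         /* odd lanes  */
    }
}
```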
- Some embodiments of the load unit 314 also provide custom distribution. With custom distribution, the load unit 314 distributes one or more data values retrieved from a memory 110 , 112 , 114 to registers of the vector register file 306 in accordance with a distribution pattern specified by an instruction loaded distribution control register or a distribution control structure retrieved from memory. Load with custom distribution can move data from memory to the vector register file 306 in a single instruction cycle.
- the custom distribution may be arbitrary. Custom distribution allows the number of values read from memory, the number of registers of the register file 306 loaded, and the distribution of data to the registers to be specified. In some embodiments of the load unit 314 , custom distribution allows loading of data across multiple rows of the vector register file 306 with instruction defined distribution.
- execution of a single custom load instruction may cause a load unit 314 to move values from memory locations 0 - 7 to registers V[ 0 ][ 0 - 7 ] and move values from memory locations 3 - 10 to registers V[ 1 ][ 0 - 7 ].
- Such data loading may be applied to facilitate motion estimation searching in a video system.
- Some embodiments of the load unit 314 further provide for loading with expansion.
- the load unit 314 retrieves a compacted (collated) array from a memory 110, 112, 114 and expands the array such that the elements of the array are repositioned (e.g., to precompacted locations) in registers of the vector register file 306.
- the positioning of each element of the array is determined by expansion information loaded into an expansion control register via instruction. For example, given array {A,B,C} retrieved from memory and expansion control information {0,0,1,0,1,1,0,0}, the retrieved array may be expanded to {0,0,A,0,B,C,0,0} and written to registers of the register file 306. Load with expansion moves data from memory to the vector register file 306 with expansion in a single instruction cycle.
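- a sketch of the expansion semantics under the stated example (the function and parameter names are assumptions):

```c
#include <stdint.h>

/* Expand a compacted array into flagged register positions; unflagged
 * positions receive zero.  ctl = {0,0,1,0,1,1,0,0} with compacted
 * {A,B,C} yields {0,0,A,0,B,C,0,0}.                                  */
void load_expand(int16_t dst[], const int16_t *compacted,
                 const uint8_t ctl[], int n_ways)
{
    int src = 0;
    for (int w = 0; w < n_ways; w++)
        dst[w] = ctl[w] ? compacted[src++] : 0;
}
```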
- FIG. 7 shows a table of data distributions that may be implemented by the load unit 314 in accordance with various embodiments. Operation of the load units 314 may be invoked by execution of a vector load instruction by the vector coprocessor core 104 .
- the vector load instruction may take a form specifying the desired data distribution, the memory source, and the destination vector registers.
- the timing of vector load instruction execution may be determined by the load units 314 (i.e., by hardware) based, for example, on when the data retrieved by the load is needed by the processing elements 308 , and memory interface availability. In contrast, the timing of the computations performed by the processing elements 308 may be determined by the sequence of vector instructions provided by the scalar processor core 102 .
- the store units 316 include routing circuitry that distributes data values retrieved from the registers of the vector register file 306 to locations in the memories 110 , 112 , 114 in various patterns that facilitate efficient processing.
- Store instructions executed by the vector coprocessor core 104 specify how the data is to be distributed to memory. At least some of the data distributions provided by the store unit 316 reverse the data distributions provided by the load units 314.
- the store units 316 may provide the data distributions described herein for data values of various lengths (e.g., 32, 16, 8 bit values).
- the store units 316 move data from the vector registers 306 to memory 110 , 112 , 114 with instruction specified distribution in a single instruction cycle.
- a store unit 316 may move data from a plurality of adjacent registers of the register file 306 to locations in memory 110 , 112 , 114 via adjacent memory lanes in a single instruction cycle. For example, data values corresponding to a given register of each of N-ways of the vector coprocessor core 104 may be moved to memory via adjacent memory lanes in a single instruction cycle.
- the store unit 316 may also move a value from a single given register of the register file 306 to a given location in memory 110 , 112 , 114 in a single instruction cycle.
- the store unit 316 may provide downsampling by a factor of two by storing data retrieved from alternate registers of the vector register file 306 (i.e., data from each of alternate ways of the vector coprocessor core 104 ) to locations of memory 110 , 112 , 114 via adjacent memory lanes. Thus, the store unit 316 may provide an operation that reverses the upsampling by two shown in FIG. 6E .
- the store unit 316 provides the movement of data from registers to memory with down sampling in a single instruction cycle.
- Embodiments of the store unit 316 may provide interleaving of data values retrieved from registers of the vector register file 306 while moving the data values to memory.
- the interleaving reverses the distribution shown in FIG. 6F such that data values retrieved from a first set of adjacent registers are written to memory locations via even indexed memory lanes and data values retrieved from a second set of adjacent registers are interleaved therewith by writing the data values to memory locations via odd indexed memory lanes.
- the store unit 316 provides the movement of data from registers to memory with interleaving in a single instruction cycle.
- Embodiments of the store unit 316 may provide for transposition of data values retrieved from registers of the vector register file 306 while moving the data values to memory, where, for example, the data values form a row or column of an array.
- Data values corresponding to each way of the vector coprocessor core 104 may be written to memory at an index corresponding to the index of the way providing the data value times the number of ways plus one (i.e., way index times (N+1)).
- reg[ 0 ] is written to mem[ 0 ]
- reg[ 1 ] is written to mem[ 9 ]
- reg[ 2 ] is written to mem[ 18 ], etc.
- the store unit 316 provides movement of N data values from registers to memory with transposition in a single instruction cycle.
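- the transposition addressing reduces to a fixed stride, as in this hedged sketch (names assumed):

```c
#include <stdint.h>

/* Store N way values at stride N+1: reg[0] -> mem[0], reg[1] -> mem[9],
 * reg[2] -> mem[18] for N = 8, the placement used for transposition of
 * a row or column of an array across the N memory banks.             */
void store_transpose(int16_t *mem, const int16_t reg[], int n_ways)
{
    for (int w = 0; w < n_ways; w++)
        mem[w * (n_ways + 1)] = reg[w];
}
```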
- Embodiments of the store unit 316 may provide collation of data values retrieved from registers of the vector register file 306 while moving the data values to memory.
- the collating reverses the expansion distribution provided by the load units 314 .
- the collation compacts the data retrieved from adjacent registers of the vector register file 306, by writing to locations of memory via adjacent memory lanes those data values identified in collation control information stored in a register. For example, given registers containing an array {0,0,A,0,B,C,0,0} and collation control information {0,0,1,0,1,1,0,0}, the store unit 316 stores {A,B,C} in memory.
- the store unit 316 provides the movement of data from registers to memory with collation in a single instruction cycle.
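- a sketch of the collation semantics, reversing the expansion example above (names assumed):

```c
#include <stdint.h>

/* Write only flagged way values, compacted to adjacent locations;
 * {0,0,A,0,B,C,0,0} with ctl {0,0,1,0,1,1,0,0} stores {A,B,C}.
 * Returns the number of values written.                              */
int store_collate(int16_t *mem, const int16_t reg[],
                  const uint8_t ctl[], int n_ways)
{
    int dst = 0;
    for (int w = 0; w < n_ways; w++)
        if (ctl[w])
            mem[dst++] = reg[w];
    return dst;
}
```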
- Embodiments of the store unit 316 may provide data-driven addressing (DDA) of data values retrieved from registers of the vector register file 306 while moving the data values to memory.
- the data-driven addressing generates a memory address for each of a plurality of adjacent registers of the vector register file 306 using offset values provided from a DDA control register.
- the DDA control register may be a register of the vector register file corresponding to the way of the register containing the value to be written to memory.
- Register data values corresponding to each of the N ways of the vector coprocessor core may be stored to memory in a single instruction cycle if the offsets specified in the DDA control register provide for the data values to be written to different memory banks.
- the store unit 316 may write the data values in a plurality of cycles selected to minimize the number of memory cycles used to write the register values to memory.
- Embodiments of the store unit 316 may provide for moving data values retrieved from a plurality of adjacent registers of the vector register file 306 to locations of the memory via alternate memory lanes, thus skipping every other memory location.
- the store units 316 may write the plurality of data values to alternate locations in memory 110 , 112 , 114 in a single instruction cycle.
- FIG. 8 shows a table of data distributions that may be implemented by the store unit 316 in accordance with various embodiments. Operation of the store units 316 may be invoked by execution of a vector store instruction by the vector coprocessor core 104 .
- the vector store instruction may take a form specifying the desired data distribution, the source vector registers, and the memory destination.
- the store units 316 provide selectable rounding and/or saturation of data values as the values are moved from the vector registers 306 to memory 110 , 112 , 114 .
- Application of rounding/saturation adds no additional cycles to the store operation.
- Embodiments may selectably enable or disable rounding. With regard to saturation, embodiments may perform saturation according to a number of selectable options.
- the timing of vector store instruction execution is determined by the store units 316 (i.e., by hardware) based, for example, on availability of the memories 110 , 112 , 114 .
- the timing of the computations performed by the processing elements 308 may be determined by the sequence of vector instructions provided by the scalar processor core 102 .
- the processing elements 308 of the vector coprocessor core 104 include logic that accelerates SIMD processing of signal data.
- Embodiments of the vector coprocessor core 104 improve SIMD processing efficiency by providing communication between the processing elements 308 of the SIMD lanes.
- Some embodiments of the vector coprocessor core 104 include logic that compares values stored in two registers of the vector register file 306 associated with each SIMD processing lane. That is, values of two registers associated with a first lane are compared, values of two registers associated with a second lane are compared, etc.
- the vector coprocessor core 104 packs the result of the comparison in each lane into a data value, and broadcasts (i.e., writes) the data value to a destination register associated with each SIMD lane.
- the processing element 308 of each SIMD lane is provided access to the results of the comparison for all SIMD lanes.
- the vector coprocessor core 104 performs the comparison, packing, and broadcasting as execution of a vector bit packing instruction.
- Some embodiments of the vector coprocessor core 104 include logic that copies a value of one register to another within each SIMD lane based on a packed array of flags, where each flag corresponds to an SIMD lane.
- each SIMD lane identifies the flag value corresponding to the lane (e.g., bit 0 of the register for lane 0 , bit 1 of the register for lane 1 , etc.). If the flag value is “1” then a specified source register of the lane is copied to a specified destination register of the lane. If the flag value is “0” then zero is written to the specified destination register of the lane.
- the vector coprocessor core 104 performs the unpacking of the flag value and the register copying as execution of a vector bit unpacking instruction.
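- a hedged C model of the pack/unpack pair described above (the comparison performed and all names are assumptions; N is taken to be at most 32 so the packed flags fit one register):

```c
#include <stdint.h>

/* Pack: per-lane comparison results collapse into one mask that is
 * broadcast to a destination register in every lane.                 */
uint32_t vbitpack(const int32_t a[], const int32_t b[], int n_ways)
{
    uint32_t mask = 0;
    for (int w = 0; w < n_ways; w++)
        if (a[w] > b[w])
            mask |= 1u << w;
    return mask;
}

/* Unpack: each lane tests its own flag bit; flag 1 copies src to dst,
 * flag 0 writes zero.                                                */
void vbitunpk(int32_t dst[], const int32_t src[], uint32_t mask, int n_ways)
{
    for (int w = 0; w < n_ways; w++)
        dst[w] = ((mask >> w) & 1u) ? src[w] : 0;
}
```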
- Some embodiments of the vector coprocessor core 104 include logic that transposes values of a given register across SIMD lanes. For example, as shown below, a given register in each of a 4-way vector coprocessor core 104 contains the values 8, 4, 0xC, and 2. The vector coprocessor core 104 transposes the bit values such that bit 0 values of each lane are written to the specified destination register of lane 0, bit 1 values of each lane are written to the specified destination register of lane 1, etc.
- the vector coprocessor core 104 transposes the bits of the source register across SIMD lanes.
- the vector coprocessor core 104 performs the transposition as execution of a vector bit transpose instruction.
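- the transpose treats one source register per lane as a row of an N x N bit matrix, as in this hedged sketch (names assumed):

```c
#include <stdint.h>

/* dst[b] collects bit b of every lane's source register; for the 4-way
 * example above, {8, 4, 0xC, 2} transposes to {0, 8, 6, 5}.          */
void vbittr(uint32_t dst[], const uint32_t src[], int n_ways)
{
    for (int b = 0; b < n_ways; b++) {
        uint32_t row = 0;
        for (int w = 0; w < n_ways; w++)
            row |= ((src[w] >> b) & 1u) << w;
        dst[b] = row;
    }
}
```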
- Some embodiments of the processing element 308 include logic that provides bit level interleaving and deinterleaving of values stored in registers of the vector register file 306 corresponding to the processing element 308 .
- the processing element 308 may provide bit interleaving as shown below. In bit interleaving the bit values of two specified source registers are interleaved in a destination register, such that successive bits of each source register are written to alternate bit locations of the destination register.
- the processing element 308 performs the interleaving as execution of a vector bit interleave instruction.
- the processing element 308 executes deinterleaving to reverse the interleaving operation described above. In deinterleaving, the processing element 308 writes even indexed bits of a specified source register to a first destination register and writes odd indexed bits to a second destination register.
- the processing element 308 performs the deinterleaving as execution of a vector bit deinterleave instruction.
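- a sketch of the per-lane bit interleave/deinterleave semantics (16-bit sources and the exact bit placement are assumptions):

```c
#include <stdint.h>

/* Interleave: successive bits of src1 and src2 alternate in dst
 * (src1 bits at even positions, src2 bits at odd positions).        */
uint32_t vbitintrlv(uint16_t src1, uint16_t src2)
{
    uint32_t dst = 0;
    for (int b = 0; b < 16; b++) {
        dst |= (uint32_t)((src1 >> b) & 1u) << (2 * b);
        dst |= (uint32_t)((src2 >> b) & 1u) << (2 * b + 1);
    }
    return dst;
}

/* Deinterleave: even-indexed bits to dst1, odd-indexed bits to dst2. */
void vbitdintrlv(uint32_t src, uint16_t *dst1, uint16_t *dst2)
{
    *dst1 = 0;
    *dst2 = 0;
    for (int b = 0; b < 16; b++) {
        *dst1 |= (uint16_t)(((src >> (2 * b)) & 1u) << b);
        *dst2 |= (uint16_t)(((src >> (2 * b + 1)) & 1u) << b);
    }
}
```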
- Embodiments of the vector coprocessor core 104 may also interleave register values across SIMD lanes. For example, for 8-way SIMD, the vector coprocessor core 104 may provide single element interleaving of two specified source registers.
- the vector coprocessor core 104 performs the interleaving as execution of a vector interleave instruction.
- the vector coprocessor core 104 may also interleave register values across SIMD lanes with 2-element frequency. For example, for 8-way SIMD, the vector coprocessor core 104 may provide 2-element interleaving of two specified source registers.
- the vector coprocessor core 104 performs the 2-element interleaving as execution of a vector interleave instruction.
- the vector coprocessor core 104 may also interleave register values across SIMD lanes with 4-element frequency. For example, for 8-way SIMD, the vector coprocessor core 104 may provide 4-element interleaving of two specified source registers.
- the vector coprocessor core 104 performs the 4-element interleaving as execution of a vector interleave instruction.
- Embodiments of the vector coprocessor core 104 provide deinterleaving of register values across SIMD lanes. Corresponding to the single element interleaving described above, the vector coprocessor core 104 provides single element deinterleaving. For example, for 8-way SIMD, the vector coprocessor core 104 may provide single element deinterleaving of two specified source registers.
- the vector coprocessor core 104 performs the deinterleaving as execution of a vector deinterleave instruction.
- similarly, the vector coprocessor core 104 may provide 2-element deinterleaving of two specified source registers, performed as execution of a vector deinterleave instruction.
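- the cross-lane variants differ only in element frequency; the following hedged sketch models them for an assumed 8-way core (the layout of the 2*N results across the destination registers is an assumption):

```c
#include <stdint.h>

#define N 8                                   /* SIMD ways (assumed) */

/* Interleave two N-lane sources with element frequency k = 1, 2, or 4,
 * producing 2*N result lanes: k elements of a, then k of b, repeating.
 * k = 1 gives a0,b0,a1,b1,...; k = 2 gives a0,a1,b0,b1,a2,a3,...     */
void vintrlv(int32_t dst[2 * N], const int32_t a[N],
             const int32_t b[N], int k)
{
    for (int i = 0; i < 2 * N; i++) {
        int pair = i / (2 * k);               /* which k-element pair */
        int pos  = i % (2 * k);               /* position within pair */
        const int32_t *src = (pos < k) ? a : b;
        dst[i] = src[pair * k + (pos % k)];
    }
}

/* Single-element deinterleave: even lanes to a, odd lanes to b.      */
void vdintrlv(int32_t a[N], int32_t b[N], const int32_t src[2 * N])
{
    for (int w = 0; w < N; w++) {
        a[w] = src[2 * w];
        b[w] = src[2 * w + 1];
    }
}
```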
- the processing elements 308 are configured to conditionally move data from a first register to a second register based on an iteration condition of the nested loops being true.
- the conditional move is performed in a single instruction cycle.
- the processing elements 308 perform the conditional move as execution of a conditional move instruction.
- the loop iteration condition (cond) may specify performing the move on a particular iteration of the nested loops (e.g., a first or last iteration of a given loop level).
- the processing elements 308 are configured to conditionally swap data values between two registers in a single instruction cycle based on a value contained in a specified condition register. Each processing element 308 executes the swap based on the condition register associated with the SIMD lane corresponding to the processing element 308 .
- the processing elements 308 perform the value swap as execution of a conditional swap instruction.
- the processing elements 308 are configured to sort two values contained in specified registers in a single instruction cycle.
- the processing element 308 compares the two values. The smaller of the values is written to a first register, and the larger of the two values is written to a second register.
- the processing elements 308 perform the value sort as execution of a sort instruction.
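- a hedged per-lane model of the conditional move, conditional swap, and sort operations described above (the condition encodings and names are assumptions):

```c
#include <stdint.h>

void vmove_cond(int32_t *dst, int32_t src, int loop_cond_true)
{
    if (loop_cond_true)          /* loop iteration condition evaluated
                                    by the loop control logic          */
        *dst = src;
}

void vswap_cond(int32_t *r1, int32_t *r2, int32_t cond_reg)
{
    if (cond_reg != 0) {         /* lane's condition register set */
        int32_t t = *r1; *r1 = *r2; *r2 = t;
    }
}

void vsort2(int32_t *r1, int32_t *r2)
{
    if (*r1 > *r2) {             /* smaller value to *r1, larger to *r2 */
        int32_t t = *r1; *r1 = *r2; *r2 = t;
    }
}
```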
- the processing elements 308 include logic that generates a result value from values contained in three specified registers.
- a processing element 308 may, in a single instruction cycle, add three register values, logically “and” three register values, logically “or” three register values, or add two register values and subtract a third register value.
- the processing elements 308 perform these operations as execution of corresponding three-operand instructions.
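- their semantics reduce to single-cycle three-operand expressions, as in this hedged sketch (names assumed):

```c
#include <stdint.h>

int32_t vadd3(int32_t a, int32_t b, int32_t c)   { return a + b + c; }
int32_t vand3(int32_t a, int32_t b, int32_t c)   { return a & b & c; }
int32_t vor3(int32_t a, int32_t b, int32_t c)    { return a | b | c; }
int32_t vaddsub(int32_t a, int32_t b, int32_t c) { return a + b - c; }
```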
- the table lookup unit 310 is a processing unit separate from the processing elements 308 and the histogram unit 312 .
- the table lookup unit 310 accelerates lookup of data values stored in tables in the memories 110 , 112 , 114 .
- the table lookup unit 310 can perform N lookups (where N is the number of SIMD lanes of the vector coprocessor core 104 ) per cycle.
- the table lookup unit 310 executes the table lookups in a nested loop.
- the table lookup loop is defined by a VLOOP instruction that specifies table lookup operation.
- the vector command specified by VLOOP and the associated vector instructions cause the table lookup unit 310 to retrieve a specified set of values from one or more tables stored in the memories 110 , 112 , 114 , and store the retrieved values in the memories 110 , 112 , 114 at a different specified location.
- a table lookup vector command initializes address generators used to access information defining which values are to be retrieved from a lookup table, to locate the lookup table in memory 110, 112, 114, and to define where the retrieved lookup table values are to be stored.
- the table lookup unit 310 retrieves information identifying the data to be fetched from the lookup table, applies the information in conjunction with the lookup table location to fetch the data, and stores the fetched data to memory 110 , 112 , 114 for subsequent access by a compute loop executing on the vector coprocessor core 104 .
- the table lookup unit 310 may fetch table data from memories 110 , 112 , 114 based on a vector load instruction as disclosed herein, and store the fetched data to memories 110 , 112 , 114 using a vector store instruction as disclosed herein. Embodiments of the table lookup unit 310 may also fetch data from memories 110 , 112 , 114 using a vector table load instruction, which may be defined as:
- the table lookup unit 310 may fetch one or more data values from one or more tables simultaneously, where each of the multiple tables is located in a different bank of memories 110 , 112 , 114 . Fetching multiple values from a table for a given index is advantageous when interpolation is to be applied to the values (e.g., bilinear or bicubic interpolation).
- Some embodiments of the table lookup unit 310 constrain the number of tables accessed and/or data values accessed in parallel. For example, the product of the number of tables accessed and the number of data values retrieved per table may be restricted to be less than the number of SIMD lanes of the vector coprocessor core 104 . In some embodiments, the number of data values retrieved per table access may be restricted to be 1, 2, or 4. Table 1 below shows allowable table and value number combinations for some embodiments of an 8-way SIMD vector coprocessor core 104 .
- The histogram unit 312 is a processing unit separate from the processing elements 308 and the table lookup unit 310. The histogram unit 312 accelerates construction of histograms in the memories 110, 112, 114. The histogram unit 312 provides construction of normal histograms, in which an addressed histogram bin entry is incremented by 1, and weighted histograms, in which an addressed histogram bin entry is incremented by a value provided as an element in a weight array input. The histogram unit 312 can perform N histogram bin updates (where N is the number of SIMD lanes of the vector coprocessor core 104) simultaneously.
- The histogram unit 312 executes the histogram bin updates in a nested loop. The histogram loop is defined by a VLOOP instruction that specifies histogram operation. The vector command specified by VLOOP and the associated vector instructions cause the histogram unit 312 to retrieve histogram bin values from one or more histograms stored in the memories 110, 112, 114, increment the retrieved values in accordance with a predetermined weight, and store the updated values in the memories 110, 112, 114 at the locations from which the values were retrieved.
- A histogram vector command initializes the increment value by which the retrieved histogram bin values are to be increased, loads an index to a histogram bin, fetches the value of the histogram bin from memory 110, 112, 114, adds the increment value to the histogram bin value, and stores the updated histogram bin value to memory 110, 112, 114. Bin values and weights may be signed or unsigned. Saturation may be applied to the updated histogram bin value in accordance with the type (e.g., signed/unsigned, data size, etc.) in conjunction with the store operation. Vector load instructions as disclosed herein may be used to initialize the increment value and load the bin index. Embodiments of the histogram unit 312 may fetch histogram bin values from memories 110, 112, 114 in accordance with a histogram load instruction, and may store updated histogram bin values to memories 110, 112, 114 in accordance with a histogram store instruction.
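- A C model of one round of weighted histogram updates across the SIMD lanes may clarify the operation described above (an illustrative sketch, not the disclosed implementation; saturation and signed/unsigned handling are omitted, and the names are assumptions):

#include <stdint.h>

#define N 8  /* SIMD lanes */

/* One round of histogram updates: lane i adds weight[i] to bins[idx[i]].
 * Modeled serially; the hardware performs the N updates simultaneously
 * when the addressed bins fall in distinct memory banks. For a normal
 * (unweighted) histogram, every weight[i] is 1. */
static void histogram_update(uint16_t *bins, const uint16_t idx[N],
                             const uint16_t weight[N])
{
    for (int lane = 0; lane < N; lane++)
        bins[idx[lane]] += weight[lane];  /* weighted increment */
}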
- Embodiments of the processor 100 may be applied to advantage in any number of devices and/or systems that employ real-time data processing. Embodiments may be particularly well suited for use in devices that employ image and/or vision processing, such as consumer devices that include imaging systems. Such devices may include an image sensor for acquiring image data and/or a display device for displaying acquired and/or processed image data. For example, embodiments of the processor 100 may be included in mobile telephones, tablet computers, and other mobile devices to provide image processing while reducing overall power consumption.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computer Hardware Design (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Mathematical Analysis (AREA)
- Computational Mathematics (AREA)
- Advance Control (AREA)
- Complex Calculations (AREA)
Abstract
A processor includes a scalar processor core and a vector coprocessor core coupled to the scalar processor core. The scalar processor core includes a program memory interface through which the scalar processor retrieves instructions from a program memory. The instructions include scalar instructions executable by the scalar processor and vector instructions executable by the vector coprocessor core. The vector coprocessor core includes a plurality of execution units and a vector command buffer. The vector command buffer is configured to decode vector instructions passed by the scalar processor core, to determine whether vector instructions defining an instruction loop have been decoded, and to initiate execution of the instruction loop by one or more of the execution units based on a determination that all of the vector instructions of the instruction loop have been decoded.
Description
- The present application claims priority to U.S. Provisional Patent Application No. 61/507,652, filed on Jul. 14, 2011 (Attorney Docket No. TI-70051 PS), which is hereby incorporated herein by reference in its entirety.
- Various processor designs include coprocessors that are intended to accelerate execution of a given set of processing tasks. Some such coprocessors achieve good performance/area in typical processing tasks, such as scaling, filtering, transformation, sum of absolute differences, etc., executed by a digital signal processor (DSP). However, as the complexity of digital signal processing algorithms increases, processing tasks often require numerous passes of processing through a coprocessor, compromising power efficiency. Furthermore, access patterns required by DSP algorithms are becoming less regular, thereby negatively impacting the overall processing efficiency of coprocessors designed to accommodate more regular access patterns. Consequently, processor and coprocessor architectures that provide improved processing, power, and/or area efficiency are desirable.
- A processor that includes a control processor core and a vector processor core is disclosed herein. In one embodiment, a processor includes a scalar processor core and a vector coprocessor core coupled to the scalar processor core. The scalar processor core includes a program memory interface through which the scalar processor retrieves instructions from a program memory. The instructions include scalar instructions executable by the scalar processor and vector instructions executable by the vector coprocessor core. The vector coprocessor core includes a plurality of execution units and a vector command buffer. The vector command buffer is configured to decode vector instructions passed by the scalar processor core, to determine whether vector instructions defining an instruction loop have been decoded, and to initiate execution of the instruction loop by one or more of the execution units based on a determination that all of the vector instructions of the instruction loop have been decoded.
- In another embodiment, a vector coprocessor includes a plurality of execution units and a vector command buffer. The execution units are configured to simultaneously apply an instruction specified operation to different data values. The vector command buffer is configured to decode instructions to be executed by the execution units, to identify an instruction loop in the instructions, and to provide the instructions to the execution units for execution based on a determination that all of the instructions of an identified instruction loop have been decoded.
- In a further embodiment, a processor includes a control processor core and a single-instruction multiple data (SIMD) coprocessor core coupled to the control processor core. The control processor core includes a program memory interface through which the control processor core retrieves instructions from a program memory. The instructions comprise instructions executable by the control processor core and SIMD instructions executable by the SIMD coprocessor core. The SIMD coprocessor core includes a plurality of execution units, a vector command buffer, and loop control logic coupled to the vector command buffer. The vector command buffer is configured to group SIMD instructions received from the control processor core into an instruction loop, and to initiate execution of the instruction loop based on a complete set of SIMD instructions of an instruction loop being received from the control processor core. The loop control logic is configured to manage execution of the instruction loop as a plurality of nested loops without loop overhead.
- For a detailed description of exemplary embodiments of the invention, reference will now be made to the accompanying drawings in which:
- FIG. 1 shows a block diagram of a processor in accordance with various embodiments;
- FIG. 2 shows a block diagram of a processor in accordance with various embodiments;
- FIG. 3 shows a block diagram of a vector coprocessor core in accordance with various embodiments;
- FIG. 4 shows a block diagram of a vector command buffer of the vector coprocessor core in accordance with various embodiments;
- FIG. 5 shows a diagram of scalar processor core and vector coprocessor core execution interaction in accordance with various embodiments;
- FIGS. 6A-6F show load data distributions provided by a load unit of a vector coprocessor core in accordance with various embodiments;
- FIG. 7 shows a table of load unit data distributions in accordance with various embodiments; and
- FIG. 8 shows a table of store unit data distributions in accordance with various embodiments.
- Certain terms are used throughout the following description and claims to refer to particular system components. As one skilled in the art will appreciate, companies may refer to a component by different names. This document does not intend to distinguish between components that differ in name but not function. In the following discussion and in the claims, the terms “including” and “comprising” are used in an open-ended fashion, and thus should be interpreted to mean “including, but not limited to . . . ” Also, the term “couple” or “couples” is intended to mean either an indirect or direct electrical connection. Thus, if a first device couples to a second device, that connection may be through a direct electrical connection, or through an indirect electrical connection via other devices and connections. Further, the term “software” includes any executable code capable of running on a processor, regardless of the media used to store the software. Thus, code stored in memory (e.g., non-volatile memory), and sometimes referred to as “embedded firmware,” is included within the definition of software. The recitation “based on” is intended to mean “based at least in part on.” Therefore, if X is based on Y, X may be based on Y and any number of other factors. The terms “alternate,” “alternating” and the like are used to designate every other one of a series.
- The following discussion is directed to various embodiments of the invention. Although one or more of these embodiments may be preferred, the embodiments disclosed should not be interpreted, or otherwise used, as limiting the scope of the disclosure, including the claims. In addition, one skilled in the art will understand that the following description has broad application, and the discussion of any embodiment is meant only to be exemplary of that embodiment, and not intended to intimate that the scope of the disclosure, including the claims, is limited to that embodiment.
- Embodiments of the processor disclosed herein provide improved performance without sacrificing area or power efficiency.
FIG. 1 shows a block diagram of a processor 100 in accordance with various embodiments. The processor 100 includes a scalar processor core 102, a vector coprocessor core 104, a program memory 106, a data memory 108, a working buffer memory 110, an A buffer memory 112, and a B buffer memory 114. The A and B buffer memories 112, 114 buffer data moving between the memories and the cores 102, 104. To support the SIMD processing of the vector coprocessor core 104, each of the working buffer memory 110, A buffer memory 112, and B buffer memory 114 may comprise N simultaneously accessible banks. For example, if the vector coprocessor core 104 is an 8-way single-instruction multiple-data (SIMD) core, then each of the working, A, and B buffers 110, 112, 114 may comprise eight banks simultaneously accessible by the vector coprocessor core 104. A switching network 118 provides signal routing between the memories 106, 108, 110, 112, 114 and the components of the processor 100 (e.g., the processor cores 102, 104).
FIG. 2 shows a block diagram of the processor 100 including various peripherals, including a DMA controller 202, memory management units 204, a clock generator 206, an interrupt controller 208, a counter/timer module 210, a trace port 214, memory mapped registers 212, and various interconnect structures that link the components of the processor 100.
scalar processor core 102 may be a reduced instruction set processor core, and include various components, such as execution units, registers, instruction decoders, peripherals, input/output systems and various other components and sub-systems. Embodiments of thescalar processor core 102 may include a plurality of execution units that perform data manipulation operations. For example, an embodiment of thescalar processor core 102 may include five execution units, a first execution unit performs the logical, shift, rotation, extraction, reverse, clear, set, and equal operations, a second execution unit performs data movement operations, a third execution unit performs arithmetic operations, a fourth execution unit performs multiplication, and a fifth execution unit performs division. In some embodiments, thescalar processor core 102 serves as a control processor for theprocessor 100, and executes control operations, services interrupts, etc., while thevector coprocessor core 104 serves as a signal processor for processing signal data (e.g., image signals) provided to thevector coprocessor core 104 via thememories - The
program memory 106 stores instructions to be executed by thescalar core 102 interspersed with instructions to be executed by thevector coprocessor core 104. Thescalar processor core 102 accesses theprogram memory 106 and retrieves therefrom an instruction stream comprising instructions to be executed by thescalar processor core 102 and instructions to be executed by thevector coprocessor core 104. Thescalar processor core 102 identifies instructions to be executed by thevector coprocessor core 104 and provides the instructions to thevector coprocessor core 104 via acoprocessor interface 116. In some embodiments, thescalar processor 102 provides vector instructions, control data, and/or loop instruction program memory addresses to thevector coprocessor core 104 via thecoprocessor interface 116. The loop instruction program memory addresses may be provided concurrently with a loop instruction, and the control data may be provided concurrently with a control register load instruction. In some embodiments, theprogram memory 106 may be a cache memory that fetches instructions from a memory external to theprocessor 100 and provides the instructions to thescalar processor core 102. -
FIG. 3 shows a block diagram of the vector coprocessor core 104 in accordance with various embodiments. The vector coprocessor core 104 may be an SIMD processor that executes instructions arranged as a loop. More specifically, the vector coprocessor core 104 executes vector instructions within a plurality of nested loops. In some embodiments, the vector coprocessor core 104 includes built-in looping control that executes instructions in four or more nested loops with zero looping overhead. The vector coprocessor core 104 includes a command decoder/buffer 302, loop control logic 304, a vector register file 306, processing elements 308, a table look-up unit 310, a histogram unit 312, load units 314, store units 316, and address generators 318. The load units 314 and store units 316 access the working buffer memory 110, the A buffer memory 112, and the B buffer memory 114 through a memory interface 320. The address generators 318 compute the addresses applied by the load and store units 314, 316. Each address generator 318 is capable of multi-dimensional addressing that computes an address based on the indices of the nested loops and corresponding constants (e.g., address = base + i1*const1 + i2*const2 + i3*const3 + i4*const4 for 4-dimensional addressing, where each in (n = 1...4) is the loop index of one of the four nested loops).
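- As a rough illustration, the multi-dimensional address computation described above can be modeled in C (a sketch of the stated address form only; the struct and function names are illustrative, not part of the disclosure):

#include <stdint.h>

/* Hypothetical model of one address generator (agen); the disclosure
 * specifies only the base + sum-of-products address form. */
typedef struct {
    uint32_t base;      /* base address                       */
    uint32_t konst[4];  /* const1..const4, one per loop level */
} agen_t;

/* Compute the address for loop indices i1..i4 (i[0]..i[3] here):
 * address = base + i1*const1 + i2*const2 + i3*const3 + i4*const4. */
static uint32_t agen_address(const agen_t *ag, const uint32_t i[4])
{
    uint32_t addr = ag->base;
    for (int n = 0; n < 4; n++)
        addr += i[n] * ag->konst[n];
    return addr;
}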
vector coprocessor core 104 via a lane of interconnect corresponding to each bank of each ofmemories memory vector coprocessor core 104 via eight parallel memory lanes, where each memory lane connects to a port of the memory interface 320. Memory lanes that connect to adjacent ports of the memory interface 320 are termed adjacent memory lanes. - The
vector coprocessor core 104 is N-way SIMD, where in the embodiment ofFIG. 3 , N=8. N may be different in other embodiments. Thus, thecoprocessor core 104 includes N processing lanes, where each lane includes aprocessing element 308 and a set of registers of thevector register file 306 that provide operands to and store results generated by theprocessing element 308. Eachprocessing element 308 may include a plurality of function units that operate on (e.g., multiply, add, compare, etc.) the operands provided by theregister file 306. Accordingly, theregister file 306 is N-way and includes storage of a plurality of entries. For example, theregister file 306 may be N×16 where the register file includes sixteen registers for each of the N ways of thevector coprocessor core 104. Corresponding registers of adjacent ways are termed adjacent registers. Thus, a register RO ofSIMD way 0 is adjacent to register RO ofSIMD way 1. Similarly, register RO ofSIMD way 0 and register 0 ofSIMD way 2 are alternate registers. Theprocessing elements 308 and the registers of theregister file 306 are sized to process data values of various sizes. In some embodiments, theprocessing elements 308 and the registers of theregister file 306 are sized to process 40 bit and smaller data values (e.g., 32 bit, 16 bit, 8, bit). Other embodiments may be sized to process different data value sizes. - As noted above, the
vector coprocessor core 104 repeatedly executes a vector instruction sequence (referred to as a vector command) within a nested loop. The nested looping is controlled by theloop control logic 304. While thevector coprocessor core 104 is executing vector commands, thescalar core 102 continues to decode and execute the instruction stream retrieved fromprogram memory 106, until execution of a coprocessor synchronization instruction (by the scalar core 102) forces thescalar core 102 to stall forvector coprocessor core 104 vector command completion. While thescalar core 102 is stalled, thescalar core 102 may service interrupts unless interrupt processing is disabled. Thus, thescalar core 102 executes instructions and services interrupts in parallel withvector coprocessor core 104 instruction execution. Instruction execution by thescalar core 102 may be synchronized with instruction execution by thevector coprocessor core 104 based on thescalar core 102 executing a synchronization instruction that causes thescalar core 102 to stall until thevector coprocessor core 104 asserts a synchronization signal indicating that vector processing is complete. Assertion the synchronization signal may be triggered by execution of a synchronization instruction by thevector coprocessor core 104. - The command decode/
buffer 302 of thevector coprocessor core 104 includes an instruction buffer that provides temporary storage for vector instructions.FIG. 4 shows a block diagram of the command decode/buffer 302 of thevector coprocessor core 104 in accordance with various embodiments. The command decode/buffer 302 includes a pre-decode first-in first-out (FIFO)buffer 402, avector instruction decoder 404, and vector command storage buffers 406. Each vectorcommand storage buffer 406 includes capacity to store a complete vector command of maximum size. Vector instructions flow from thescalar processor core 102 through thepre-decode FIFO 402 and are decoded by thevector instruction decoder 404. The decoded vector instructions corresponding to a given vector command are stored in one of the vector command storage buffers 406, and each stored vector command is provided for execution in sequence. Execution of a decoded vector command is initiated (e.g., the vector command is read out of the vector command storage buffer 406) only after the complete vector command is decoded and stored in a vectorcommand storage buffer 406. Thus, the command decode/buffer 302 loads a vector command into each of the vector command storage buffers 406, and when the vector command storage buffers 406 are occupied additional vector instructions received by the command decode/buffer 302 are stored in thepre-decode buffer 402 until execution of a vector command is complete, at which time the FIFO buffered vector command may be decoded and loaded into the emptied vectorcommand storage buffer 406 previously occupied by the executed vector command. -
FIG. 5 shows a diagram of scalar processor core 102 and vector coprocessor core 104 execution interaction in accordance with various embodiments. In FIG. 5, vector instructions i0-i3 form a first exemplary vector command, vector instructions i4-i7 form a second exemplary vector command, and vector instructions i8-i11 form a third exemplary vector command. At time T1, the scalar processor core 102 recognizes vector instructions in the instruction stream fetched from program memory 106. In response, the scalar processor core 102 asserts the vector valid signal (vec_valid) and passes the identified vector instructions to the vector coprocessor core 104. At time T2, the first vector command has been transferred to the vector coprocessor core 104, and the vector coprocessor core 104 initiates execution of the first vector command while the scalar processor core 102 continues to transfer the vector instructions of the second vector command to the vector coprocessor core 104. At time T3, transfer of the second vector command to the vector coprocessor core 104 is complete, and the execution of the first vector command is ongoing. Consequently, the vector coprocessor core 104 negates the ready signal (vec_rdy), which causes the scalar processor core 102 to discontinue vector instruction transfer. At time T4, execution of the first vector command is complete, and execution of the second vector command begins. With completion of the first vector command, the vector coprocessor core 104 asserts the ready signal, and the command decode/buffer 302 receives the vector instructions of the third vector command. At time T5, the vector coprocessor core 104 completes execution of the second vector command. At time T6, transfer of the third vector command is complete, and the vector coprocessor core 104 initiates execution of the third vector command. A VWDONE instruction follows the last instruction of the third vector command. The VWDONE instruction causes the scalar processor core 102 to stall pending completion of the third vector command by the vector coprocessor core 104. When the vector coprocessor core 104 completes execution of the third vector command, the vector coprocessor core 104 executes the VWDONE command, which causes the vector coprocessor core 104 to assert the vector done signal (vec_done). Assertion of the vector done signal allows the scalar processor core 102 to resume execution, thus providing core synchronization.
scalar processor core 102 andvector coprocessor core 104 interaction, the example ofFIG. 5 excludes thepre-decode FIFO 402. With inclusion of thepre-decode FIFO 402, thevector coprocessor core 104 delays negation of vec_rdy until thepre-decode FIFO 402 is full. - Within the multi-level nested loop executed by the
vector coprocessor core 104, operations of vector command execution can be represented as sequential load, arithmetic operation, store, and pointer update stages, where a number of operations may be executed in each stage. The following listing shows a skeleton of the nested loop model for a four loop embodiment of thevector coprocessor core 104. There are 4 loop variables, i1, i2, i3, and i4. Each loop variable is incremented from 0 to Ipend1 . . . 4. -
EVE_compute(...) {
  for (i1=0; i1<=lpend1; i1++) {
    for (i2=0; i2<=lpend2; i2++) {
      for (i3=0; i3<=lpend3; i3++) {
        for (i4=0; i4<=lpend4; i4++) {
          for (k=0; k<num_inits; k++)
            initialize_vreg_from_parameters(...);
          for (k=0; k<num_loads; k++)
            load_vreg_from_local_memory(...);
          for (k=0; k<num_ops; k++)
            op(...); // 2 functional units, executing 2 ops per cycle
          for (k=0; k<num_stores; k++)
            store_vreg_to_local_memory(...);
          for (k=0; k<num_agens; k++)
            update_agen(...);
        }
      }
    }
  }
}
- Embodiments define a vector command with a loop initiation instruction, VLOOP.
- VLOOP cmd_type, CL#: cmd_len, PL#: param_len
where:- cmd_type specifies the loop type: compute (executed by the processing elements), table lookup (executed by the table lookup unit), or histogram (executed by the histogram unit);
- cmd_len specifies the length of the vector command; and
- param_len specifies the length of the memory stored parameter file associated with the vector command.
- The vector instructions following VLOOP initialize the registers and address generators of the
vector coprocessor core 104, and specify the load operations, arithmetic and data manipulation operations, and store operations to be performed with the nested loops. The parameters applicable to execution of a vector command (e.g., loop counts, address pointers to arrays, constants used in the computation, round/truncate shift count, saturation bounds, etc.) may stored in memory (e.g., 110, 112, 114) by thescalar processor core 102 as a parameter file and retrieved by thevector coprocessor core 102 as part of loop initialization. - While embodiments of the
vector coprocessor core 104 may always execute a fixed number of nested loops (e.g., 4 as shown in the model above), with loop terminal counts of zero or greater, some embodiments include an optional outermost loop (e.g., an optional fifth loop). The optional outermost loop encompasses the fixed number of nested loops associated with the VLOOP instruction, and may be instantiated separately from the fixed number of nested loops. As with the nested loops associated with the VLOOP instruction, execution of the optional outermost loop requires no looping overhead. Each iteration of the optional outermost loop may advance a parameter pointer associated with the nested loops. For example, the parameter pointer may be advanced by param_len provided in the VLOOP instruction. The parameter pointer references the parameter file that contains the parameters applicable to execution of the vector command as explained above (loop counts, etc.). By changing the parameters of the vector command with each iteration of the outermost loop, embodiments of thevector coprocessor core 104 can apply the vector command to objects/structures/arrays of varying dimension or having varying inter-object spacing. For example, changing loop counts for the nested loops allows thevector coprocessor core 104 to processes objects of varying dimensions with a single vector command, and without the overhead of a software loop. - The loop count of the optional outer loop and the parameter pointer may be set by execution of an instruction by the
vector coprocessor core 104. The instruction may load a parameter into a control register of the core 104 as: - VCTRL <scalar_register>, <control_register>
where:- scalar_register specifies a register containg a value to loaded as an outermost loop count or parameter pointer; and
- control_register specifies a destination register, where the destination register may be the outermost loop end count register or the vector command parameter pointer register.
- Execution of a vector command may be complete when a total number of iterations specified in the parameter file for each loop of the vector command are complete. Because it is advantageous in some situations to terminate the vector command prior to execution of all specified loop iterations, the
vector coprocessor core 104 provides early termination of a vector command. Early termination is useful when, for example, the vector command has identified a condition in the data being processed that makes additional processing of the data superfluous. Early termination of a vector command is provided for by execution, in the vector command, of a loop early exit instruction defined as: - VEXITNZ level, src1
where:- level specifies whether a vector command (i.e., loops associated with a VLOOP instruction) or an optional outermost loop is to be exited; and
- src1 specifies a register containing a value that determines whether to perform the early exit.
- Execution of the VEXITNZ instruction causes the
vector coprocessor core 104 to examine the value contained in the register src1 (e.g., associated with a given SIMD lane), and to schedule loop termination if the value is non-zero. Other embodiments may schedule loop termination based on other conditions of the value (e.g., zero, particular bit set, etc.). If the level parameter indicates that the vector command is to be exited, then thevector coprocessor core 104 schedules the nested loops associated with the vector command to terminate after completion of the current iteration of the innermost of the nest loops. Thus, if the level parameter indicates that the vector command is to be exited, any optional outmost loop encompassing the vector command is not exited, and a next iteration of the vector command may be executed. - If the level parameter indicates that the optional outermost loop is to be exited, then, on identification of the terminal state of src1, the
vector coprocessor core 104 schedules the optional outermost loop to terminate after completion of all remaining iterations of the nested loops associated with the vector command encompassed by the optional outermost loop. - The
load units 314 move data from thememories vector register file 306, and include routing circuitry that distributes data values retrieved from thememories vector coprocessor core 104 specify how the data is to be distributed to the registers. FIGS. 6A-6FH show load data distributions provided by theload unit 314 of thevector coprocessor core 104 in accordance with various embodiments. While the illustrative distributions ofFIGS. 6A-6F are directed loading data values of a given size (e.g., 16 bits), embodiments of theload units 314 may apply similar distributions to data values of other sizes (e.g., 8 bits, 32 bits, etc.). Theload units 314 may move data frommemory -
FIG. 6A shows a load unit 314 retrieving a data value from each of eight locations of a memory 110, 112, 114 and distributing the retrieved values to eight adjacent registers of the vector register file 306. More generally, the load unit 314 moves a value from memory via each of a plurality of adjacent memory lanes, and distributes the data values to a plurality of adjacent registers of the vector register file 306 in a single instruction cycle.
FIG. 6B shows a load unit 314 retrieving a data value from a single location of a memory 110, 112, 114 and distributing the retrieved value to a plurality of registers of the vector register file 306. More generally, the load unit 314 moves a value from a single location of a memory 110, 112, 114 to a plurality of registers of the vector register file 306 in a single instruction cycle. Thus, the load unit 314 may distribute a single value from memory 110, 112, 114 to a register of each way of the N-way vector coprocessor core 104.
FIG. 6C shows a load unit 314 retrieving a data value from each of two locations of a memory 110, 112, 114 and distributing the two retrieved values to alternating registers of the vector register file 306. More generally, the load unit 314 moves a value from each of two locations of a memory 110, 112, 114 to alternate registers of the vector register file 306 in a single instruction cycle. That is, each value of the pair of values is written to alternate registers of the register file 306 (e.g., one value to odd indexed registers and the other value to even indexed registers). Thus, the load unit 314 may distribute a pair of values from memory 110, 112, 114 across the ways of the N-way vector coprocessor core 104.
FIG. 6D shows a load unit 314 retrieving a data value from each of eight locations of a memory 110, 112, 114 and distributing alternate retrieved data values to adjacent registers of the vector register file 306. More generally, the load unit 314 moves a value from each of a plurality of locations of a memory 110, 112, 114 to adjacent registers of the vector register file 306 in a single instruction cycle. Thus, the load unit 314 provides down-sampling of the data stored in memory by a factor of two.
FIG. 6E shows a load unit 314 retrieving a data value from each of four locations of a memory 110, 112, 114 and distributing each retrieved value to two adjacent registers of the vector register file 306. More generally, the load unit 314 moves a value from each of a plurality of locations of a memory 110, 112, 114 to a plurality of registers of the vector register file 306 in a single instruction cycle. Thus, the load unit 314 provides up-sampling of the data stored in memory by a factor of two.
FIG. 6F shows a load unit 314 retrieving a data value from each of sixteen locations of a memory 110, 112, 114 and distributing the retrieved values to registers of the vector register file 306 such that data values retrieved via even numbered lanes are distributed to adjacent registers and data values retrieved via odd numbered lanes are distributed to adjacent registers. More generally, the load unit 314 moves a value from each of a plurality of locations of a memory 110, 112, 114 to registers of the vector register file 306. Thus, the load unit 314 provides deinterleaving of data values across registers M and M+1, where register M encompasses a given register of each way of the N-way vector coprocessor core 104, in a single instruction cycle.
load unit 314 also provide custom distribution. With custom distribution, theload unit 314 distributes one or more data values retrieved from amemory vector register file 306 in accordance with a distribution pattern specified by an instruction loaded distribution control register or a distribution control structure retrieved from memory. Load with custom distribution can move data from memory to thevector register file 306 in a single instruction cycle. The custom distribution may be arbitrary. Custom distribution allows the number of values read from memory, the number of registers of theregister file 306 loaded, and the distribution of data to the registers to be specified. In some embodiments of theload unit 314, custom distribution allows loading of data across multiple rows of thevector register file 306 with instruction defined distribution. For example, execution of a single custom load instruction may cause aload unit 314 to move values from memory locations 0-7 to registers V[0][0-7] and move values from memory locations 3-10 to registers V[1][0-7]. Such data loading may be applied to facilitate motion estimation searching in a video system. - Some embodiments of the
load unit 314 further provide for loading with expansion. In loading with expansion, theload unit 314 retrieves a compacted (collated) array from amemory vector register file 306. The positioning of each element of the array is determined by expansion information loaded into an expansion control register via instruction. For example, given array {A,B,C} retrieved from memory and expansion control information {0,0,1,0,1,1,0,0}, the retrieved array may be expanded to {0,0,A,0,B,C,0,0} and written to registers of theregister file 306. Load with expansion moves data from memory to thevector register file 306 with expansion in a single instruction cycle. -
FIG. 7 shows a table of data distributions that may be implemented by the load unit 314 in accordance with various embodiments. Operation of the load units 314 may be invoked by execution of a vector load instruction by the vector coprocessor core 104. The vector load instruction may take the form of:
where:- type specifies the data size (e.g., byte, half-word, word, etc.);
- distribution specifies the data distribution option (described above) to be applied;
- base specifies a register containing an address;
- agen specifies an address generator for indexing; and
- vreg specifies a vector register to be loaded.
- The timing of vector load instruction execution may be determined by the load units 314 (i.e., by hardware) based, for example, on when the data retrieved by the load is needed by the
processing elements 308, and memory interface availability. In contrast, the timing of the computations performed by theprocessing elements 308 may be determined by the sequence of vector instructions provided by thescalar processor core 102. - The
store units 316 include routing circuitry that distributes data values retrieved from the registers of thevector register file 306 to locations in thememories vector coprocessor core 104 specify how the data is to be distributed to memory. At least some of the data distributions provide by thestore unit 316 reverse the data distributions provided by theload units 314. Thestore units 316 may provide the data distributions described herein for data values of various lengths (e.g., 32, 16, 8 bit values). Thestore units 316 move data from the vector registers 306 tomemory - A
store unit 316 may move data from a plurality of adjacent registers of theregister file 306 to locations inmemory vector coprocessor core 104 may be moved to memory via adjacent memory lanes in a single instruction cycle. Thestore unit 316 may also move a value from a single given register of theregister file 306 to a given location inmemory - The
store unit 316 may provide downsampling by a factor of two by storing data retrieved from alternate registers of the vector register file 306 (i.e., data from each of alternate ways of the vector coprocessor core 104) to locations ofmemory store unit 316 may provide an operation that reverses the upsampling by two shown inFIG. 6E . Thestore unit 316 provides the movement of data from registers to memory with down sampling in a single instruction cycle. - Embodiments of the
store unit 316 may provide interleaving of data values retrieved from registers of thevector register file 306 while moving the data values to memory. The interleaving reverses the distribution shown inFIG. 6F such that data values retrieved from a first set of adjacent registers are written to memory locations via even indexed memory lanes and data values retrieved from a second set of adjacent registers are interleaved therewith by writing the data values to memory locations via odd indexed memory lanes. Thestore unit 316 provides the movement of data from registers to memory with interleaving in a single instruction cycle. - Embodiments of the
store unit 316 may provide for transposition of data values retrieved from registers of thevector register file 306 while moving the data values to memory, where, for example, the data values form a row or column of an array. Data values corresponding to each way of thevector coprocessor core 104 may be written to memory at an index corresponding to the index of the register providing the data value times the number of ways plus one. Thus, for 8-way SIMD, reg[0] is written to mem[0], reg[1] is written to mem[9], reg[2] is written to mem[18], etc. Where, the transposed register values are written to different banks of memory, thestore unit 316 provides movement of N data values from registers to memory with transposition in a single instruction cycle. - Embodiments of the
store unit 316 may provide collation of data values retrieved from registers of thevector register file 306 while moving the data values to memory. The collating reverses the expansion distribution provided by theload units 314. The collation compacts the data retrieved from adjacent registers of thevector register file 306, by writing to locations of memory via adjacent memory lanes those data values identified in collation control information stored in a register. For example, given registers containing an array {0,0,A,0,B,C,0,0} and collation control information {0,0,1,0,1,1,0,0}, thestore unit 316 stores {A,B,C} in memory. Thestore unit 316 provides the movement of data from registers to memory with collation in a single instruction cycle. - Embodiments of the
store unit 316 may provide data-driven addressing (DDA) of data values retrieved from registers of thevector register file 306 while moving the data values to memory. The data-driven addressing generates a memory address for each of a plurality of adjacent registers of thevector register file 306 using offset values provided from a DDA control register. The DDA control register may be a register of the vector register file corresponding the way of the register containing the value to written to memory. Register data values corresponding to each of the N ways of the vector coprocessor core may be stored to memory in a single instruction cycle if the DDA control register specified offsets provide for the data values to be written to different memory banks. If the DDA control register specified offsets provide for the data values to be written to memory banks that preclude simultaneously writing all data values, then thestore unit 316 may write the data values in a plurality of cycles selected to minimize the number of memory cycles used to write the register values to memory. - Embodiments of the
store unit 316 may provide for moving data values retrieved from a plurality of adjacent registers of thevector register file 306 to locations of the memory via alternate memory lanes, thus skipping every other memory location. Thestore units 316 may write the plurality of data values to alternate locations inmemory -
FIG. 8 shows a table of data distributions that may be implemented by the store unit 316 in accordance with various embodiments. Operation of the store units 316 may be invoked by execution of a vector store instruction by the vector coprocessor core 104. The vector store instruction may take the form of:
where:- pred specifies a register containing a condition value that determines whether the store is performed;
- type specifies the data size (e.g., byte, half-word, word, etc.);
- distribution specifies the data distribution option to be applied;
- wr_loop specifies the nested loop level where the store is to be performed;
- vreg specifies a vector register to be stored;
- base specifies a register containing an address;
- agen specifies an address generator for indexing; and
- RND_SAT: rnd_sat_param specifies the rounding/saturation to be applied to the stored data.
- The
store units 316 provide selectable rounding and/or saturation of data values as the values are moved from the vector registers 306 tomemory -
- NO_SAT: no saturation performed;
- SYMM: signed symmetrical saturation [−bound, bound] (for unsigned store, [0, bound]);
- ASYMM: signed asymmetrical saturation [−bound−1, bound] (for unsigned store, [0, bound]), useful for fixed bit width. For example, when bound=1023, saturate to [−1024, 1023];
- 4PARAM: use 4 parameter registers to specify sat_high_cmp, sat_high_set, sat—low —cmp, sat —low —set;
- SYMM32: use 2 parameter registers to specify a 32-bit bound, then follow SYMM above; and
- ASYMM32: use 2 parameter registers to specify a 32-bit bound, then follow ASYMM above.
- The timing of vector store instruction execution is determined by the store units 316 (i.e., by hardware) based, for example, on availability of the
memories processing elements 308 may be determined by the sequence of vector instructions provided by thescalar processor core 102. - The
processing elements 308 of thevector coprocessor core 104 include logic that accelerates SIMD processing of signal data. In SIMD processing, each of the N processing lanes (e.g., the processing element of the lane) is generally isolated from each of the other processing lanes. Embodiments of thevector coprocessor core 104 improve SIMD processing efficiency by providing communication between theprocessing elements 308 of the SIMD lanes. - Some embodiments of the
vector coprocessor core 104 include logic that compares values stored in two registers of thevector register file 306 associated with each SIMD processing lane. That is values of two registers associated with a first lane are compared, values of two registers associated with a second lane are compared, etc. Thevector coprocessor core 104 packs the result of the comparison in each lane into a data value, and broadcasts (i.e., writes) the data value to a destination register associated with each SIMD lane. Thus, theprocessing element 308 of each SIMD lane is provided access to the results of the comparison for all SIMD lanes. Thevector coprocessor core 104 performs the comparison, packing, and broadcasting as execution of a vector bit packing instruction, which may be defined as: - VBITPK src1, src2, dst
where:- src1 and src2 specify the registers to be compared; and
- dst specifies the register to which the packed comparison results are to be written.
- Some embodiments of the
vector coprocessor core 104 include logic that copies a value of one register to another within each SIMD lane based on a packed array of flags, where each flag corresponds to an SIMD lane. Thus, given the packed flag value in a register, each SIMD lane identifies the flag value corresponding to the lane (e.g.,bit 0 of the register forlane 0,bit 1 of the register forlane 1, etc.). If the flag value is “1” then a specified source register of the lane is copied to a specified destination register of the lane. If the flag value is “0” then zero is written to the specified destination register of the lane. Thevector coprocessor core 104 performs the unpacking of the flag value and the register copying as execution of a vector bit unpacking instruction, which may be defined as: - VBITUNPK src1, src2, dst
where:- src1 specifies the register containing the packed per lane flag values;
- src2 specifies the register to be copied based on the flag value for the lane; and
- dst specifies the destination register to written.
- Some embodiments of the
vector coprocessor core 104 include logic that transposes values of a given register across SIMD lanes. For example, as shown below, a given register in each of a 4-wayvector coprocessor core 104 contains thevalues vector coprocessor core 104 transposes the bit values such thatbit 0 values of each lane are written to the specified destination register oflane 0,bit 1 values of each lane are written to the specified destination register oflane 1, etc. -
Source:

  lane   value   bit 0   bit 1   bit 2   bit 3
   0       1       1       0       0       0
   1       2       0       1       0       0
   2       3       1       1       0       0
   3       4       0       0       1       0
Destination:

  lane   value   bit 0   bit 1   bit 2   bit 3
   0       5       1       0       1       0
   1       6       0       1       1       0
   2       8       0       0       0       1
   3       0       0       0       0       0
Thus, the vector coprocessor core 104 transposes the bits of the source register across SIMD lanes. The vector coprocessor core 104 performs the transposition as execution of a vector bit transpose instruction, which may be defined as:
where:- src1 specifies the register containing the bits to be transposed; and
- dst specifies the register to which the transposed bits are written.
- Some embodiments of the
processing element 308 include logic that provides bit level interleaving and deinterleaving of values stored in registers of thevector register file 306 corresponding to theprocessing element 308. For example, theprocessing element 308 may provide bit interleaving as shown below. In bit interleaving the bit values of two specified source registers are interleaved in a destination register, such that successive bits of each source register are written to alternate bit locations of the destination register. -
src1=0x25 (0000_0000_0010_0101),
src2=0x11 (0000_0000_0001_0001),
dst=0x923 (0000_0000_0000_0000_0000_1001_0010_0011)
processing element 308 performs the interleaving as execution of a vector bit interleave instruction, which may be defined as: - VBITI src1, src2, dst
where:- src1 and src2 specify the registers containing the bits to be interleaved; and
- dst specifies the register to which the interleaved bits are written.
- The
processing element 308 executes deinterleaving to reverse the interleaving operation described above. In deinterleaving, theprocessing element 308 writes even indexed bits of a specified source register to a first destination register and writes odd indexed bits to a second destination register. For example: -
src=0x923 (0000_0000_0000_0000_0000_1001_0010_0011)
dst1=0x11 (0000_0000_0001_0001),
dst2=0x25 (0000_0000_0010_0101)
processing element 308 performs the deinterleaving as execution of a vector bit deinterleave instruction, which may be defined as: - VBITDI src,
dst 1,dst 2,
where:- src specifies the register containing the bits to be deinterleaved; and
- dst1 and dst2 specify the registers to which the deinterleaved bits are written.
- Embodiments of the
vector coprocessor core 104 may also interleave register values across SIMD lanes. For example, for 8-way SIMD, thevector coprocessor core 104 may provide single element interleaving of two specified source registers as: -
dst1[0]=src1[0]; -
dst1[1]=src2[0]; -
dst1[2]=src1[1]; -
dst1[3]=src2[1]; -
dst1[4]=src1[2]; -
dst1[5]=src2[2]; -
dst1[6]=src1[3]; -
dst1[7]=src2[3]; -
dst2[0]=src1[4]; -
dst2[1]=src2[4]; -
dst2[2]=src1[5]; -
dst2[3]=src2[5]; -
dst2[4]=src1[6]; -
dst2[5]=src2[6]; -
dst2[6]=src1[7]; -
dst2[7]=src2[7]; - where the bracketed index value refers the SIMD lane. The
vector coprocessor core 104 performs the interleaving as execution of a vector interleave instruction, which may be defined as - VINTRLV src1/dst1, src2/dst2,
where src1/dst1 and src2/dst2 specify source registers to be interleaved and the registers to be written. - The
vector coprocessor core 104 may also interleave register values across SIMD lanes with 2-element frequency. For example, for 8-way SIMD, thevector coprocessor core 104 may provide 2-element interleaving of two specified source registers as: -
dst1[0]=src1[0]; -
dst1[1]=src1[1]; -
dst1[2]=src2[0]; -
dst1[3]=src2[1]; -
dst1[4]=src1[2]; -
dst1[5]=src1[3]; -
dst1[6]=src2[2]; -
dst1[7]=src2[3]; -
dst2[0]=src1[4]; -
dst2[1]=src1[5]; -
dst2[2]=src2[4]; -
dst2[3]=src2[5]; -
dst2[4]=src1[6]; -
dst2[5]=src1[7]; -
dst2[6]=src2[6]; -
dst2[7]=src2[7]; - where the bracketed index value refers the SIMD lane. The
vector coprocessor core 104 performs the 2-element interleaving as execution of a vector interleave instruction, which may be defined as: - VINTRLV2 src1/dst1, src2/dst2,
where src1/dst1 and src2/dst2 specify source registers to be interleaved and the registers to be written. - The
vector coprocessor core 104 may also interleave register values across SIMD lanes with 4-element frequency. For example, for 8-way SIMD, thevector coprocessor core 104 may provide 4-element interleaving of two specified source registers as: -
dst1[0]=src1[0]; -
dst1[1]=src1[1]; -
dst1[2]=src1[2]; -
dst1[3]=src1[3]; -
dst1[4]=src2[0]; -
dst1[5]=src2[1]; -
dst1[6]=src2[2]; -
dst1[7]=src2[3]; -
dst2[0]=src1[4]; -
dst2[1]=src1[5]; -
dst2[2]=src1[6]; -
dst2[3]=src1[7]; -
dst2[4]=src2[4]; -
dst2[5]=src2[5]; -
dst2[6]=src2[6]; -
dst2[7]=src2[7]; - where the bracketed index value refers the SIMD lane. The
vector coprocessor core 104 performs the 4-element interleaving as execution of a vector interleave instruction, which may be defined as: - VINTRLV4 src1/dst1, src2/dst2,
where src1/dst1 and src2/dst2 specify source registers to be interleaved and the registers to be written. - Embodiments of the
vector coprocessor core 104 provide deinterleaving of register values across SIMD lanes. Corresponding to the single element interleaving described above, thevector coprocessor core 104 provides single element deinterleaving. For example, for 8-way SIMD, thevector coprocessor core 104 may provide single element deinterleaving of two specified source registers as: -
dst1[0]=src1[0]; -
dst2[0]=src1[1]; -
dst1[1]=src1[2]; -
dst2[1]=src1[3]; -
dst1[2]=src1[4]; -
dst2[2]=src1[5]; -
dst1[3]=src1[6]; -
dst2[3]=src1[7]; -
dst1[4]=src2[0]; -
dst2[4]=src2[1]; -
dst1[5]=src2[2]; -
dst2[5]=src2[3]; -
dst1[6]=src2[4]; -
dst2[6]=src2[5]; -
dst1[7]=src2[6]; -
dst2[7]=src2[7]; - The
vector coprocessor core 104 performs the deinterleaving as execution of a vector interleave instruction, which may be defined as: - VDINTRLV src1/dst1, src2/dst2,
where src1/dst1 and src2/dst2 specify source registers to be deinterleaved and the registers to be written. - Corresponding to the 2-element interleaving described above, the
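- The single-element cross-lane interleave and its inverse can be modeled in C (a sketch assuming N=8 as in the examples; the names are illustrative):

#include <stdint.h>

#define N 8  /* SIMD lanes */

/* VINTRLV model: alternate elements of src1 and src2 across the two
 * destination registers, matching the lane assignments listed above. */
static void vintrlv(const int16_t src1[N], const int16_t src2[N],
                    int16_t dst1[N], int16_t dst2[N])
{
    for (int k = 0; k < N / 2; k++) {
        dst1[2 * k]     = src1[k];          /* low halves  */
        dst1[2 * k + 1] = src2[k];
        dst2[2 * k]     = src1[k + N / 2];  /* high halves */
        dst2[2 * k + 1] = src2[k + N / 2];
    }
}

/* VDINTRLV model: the exact inverse of vintrlv. */
static void vdintrlv(const int16_t src1[N], const int16_t src2[N],
                     int16_t dst1[N], int16_t dst2[N])
{
    for (int k = 0; k < N / 2; k++) {
        dst1[k]         = src1[2 * k];      /* even elements */
        dst2[k]         = src1[2 * k + 1];  /* odd elements  */
        dst1[k + N / 2] = src2[2 * k];
        dst2[k + N / 2] = src2[2 * k + 1];
    }
}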
vector coprocessor core 104 provides 2-element deinterleaving. For example, for 8-way SIMD, thevector coprocessor core 104 may provide 2-element deinterleaving of two specified source registers as: -
dst1[0]=src1[0]; -
dst1[1]=src1[1]; -
dst2[0]=src1[2]; -
dst2[1]=src1[3]; -
dst1[2]=src1[4]; -
dst1[3]=src1[5]; -
dst2[2]=src1[6]; -
dst2[3]=src1[7]; -
dst1[4]=src2[0]; -
dst1[5]=src2[1]; -
dst2[4]=src2[2]; -
dst2[5]=src2[3]; -
dst1[6]=src2[4]; -
dst1[7]=src2[5]; -
dst2[6]=src2[6]; -
dst2[7]=src2[7]; - The
vector coprocessor core 104 performs the 2-element deinterleaving as execution of a vector interleave instruction, which may be defined as: - VDINTRLV2 src1/dst1, src2/dst2,
where src1/dst1 and src2/dst2 specify source registers to be deinterleaved and the registers to be written. - The
processing elements 308 are configured to conditionally move data from a first register to second register based on an iteration condition of the nested loops being true. The conditional move is performed in a single instruction cycle. Theprocessing elements 308 perform the conditional move as execution of a conditional move instruction, which may defined as: - VCMOV cond, src, dst
where:- src and dst specify the register from which and to which data is to be moved; and
- cond specifies the iteration condition of the nested loops under which the move is to be performed.
- The loop iteration condition (cond) may specify performing the move:
-
- on every iteration of the inner-most loop (loop M);
- on the final iteration of the inner-most loop;
- in loop M−1, prior to entering loop M;
- in loop M−2, prior to entering loop M−1;
- in loop M−3, prior to entering loop M−2;
- on the final iteration of loops M and M−1; or
- on the final iteration of loops M, M−1, and M−2.
- The
processing elements 308 are configured to conditionally swap data values between two registers in a single instruction cycle based on a value contained in a specified condition register. Eachprocessing element 308 executes the swap based on the condition register associated with the SIMD lane corresponding to theprocessing element 308. Theprocessing elements 308 perform the value swap as execution of a conditional swap instruction, which may defined as: - VSWAP cond, src1/dst1, src2/dst2
where:- src1/dst1 and src2/dst2 specify the registers having values to be swapped; and
- cond specifies the condition register that controls whether the swap is to be performed.
In some embodiments, the swap is performed if the least significant bit of the condition register is set.
- The
processing elements 308 are configured to sort two values contained in specified registers in a single instruction cycle. Theprocessing element 308 compares the two values. The smaller of the values is written to a first register, and the larger of the two values is written to a second register. Theprocessing elements 308 perform the value sort as execution of a sort instruction, which may defined as: - VSORT2 src1/dst1, src2/dst2
where src1/dst1 and src2/dst2 specify the registers having values to be sorted. The smaller of the two values is written to dst1, and the larger of the two values is written to dst2.
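A minimal C sketch of the per-lane semantics of VSWAP and VSORT2 follows; the lane count, element type, and in-place parameter style are illustrative assumptions.

```c
#include <stdint.h>

#define LANES 8  /* assumed 8-way SIMD */

/* Models VSWAP: lane i exchanges r1[i] and r2[i] when the least
   significant bit of that lane's condition register is set. */
static void vswap(const int16_t cond[LANES], int16_t r1[LANES], int16_t r2[LANES])
{
    for (int i = 0; i < LANES; i++) {
        if (cond[i] & 1) {
            int16_t t = r1[i];
            r1[i] = r2[i];
            r2[i] = t;
        }
    }
}

/* Models VSORT2: after execution, r1[i] holds the smaller value of the
   pair and r2[i] the larger, independently in each lane. */
static void vsort2(int16_t r1[LANES], int16_t r2[LANES])
{
    for (int i = 0; i < LANES; i++) {
        if (r1[i] > r2[i]) {
            int16_t t = r1[i];
            r1[i] = r2[i];
            r2[i] = t;
        }
    }
}
```

VSORT2 is, in effect, the compare-exchange primitive of a sorting network evaluated across all SIMD lanes at once, so a fixed sequence of such single-cycle steps can order short per-lane sequences.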
The processing elements 308 include logic that generates a result value from values contained in three specified registers. A processing element 308 may, in a single instruction cycle, add three register values, logically "and" three register values, logically "or" three register values, or add two register values and subtract a third register value. The processing elements 308 perform these operations as execution of instructions, which may be defined as:

- VADD3 src1, src2, src3, dst
where:

- src1, src2, and src3 specify the registers containing the values to be summed; and
- dst specifies the register to which the summation result is to be written.

- VAND3 src1, src2, src3, dst

where:

- src1, src2, and src3 specify the registers containing the values to be logically "and'd"; and
- dst specifies the register to which the "and" result is to be written.

- VOR3 src1, src2, src3, dst

where:

- src1, src2, and src3 specify the registers containing the values to be logically "or'd"; and
- dst specifies the register to which the "or" result is to be written.

- VADIF3 src1, src2, src3, dst

where:

- src1 and src3 specify the registers containing the values to be summed;
- src2 specifies the register containing the value to be subtracted from the sum of src1 and src3; and
- dst specifies the register to which the final result is to be written.
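The per-lane arithmetic of these four instructions reduces to the following C sketch; the lane count and 32-bit element type are illustrative assumptions.

```c
#include <stdint.h>

#define LANES 8  /* assumed 8-way SIMD */

/* Models VADD3: dst = src1 + src2 + src3, per lane. */
static void vadd3(const int32_t s1[LANES], const int32_t s2[LANES],
                  const int32_t s3[LANES], int32_t dst[LANES])
{
    for (int i = 0; i < LANES; i++)
        dst[i] = s1[i] + s2[i] + s3[i];
}

/* Models VAND3: dst = src1 & src2 & src3, per lane. */
static void vand3(const int32_t s1[LANES], const int32_t s2[LANES],
                  const int32_t s3[LANES], int32_t dst[LANES])
{
    for (int i = 0; i < LANES; i++)
        dst[i] = s1[i] & s2[i] & s3[i];
}

/* Models VOR3: dst = src1 | src2 | src3, per lane. */
static void vor3(const int32_t s1[LANES], const int32_t s2[LANES],
                 const int32_t s3[LANES], int32_t dst[LANES])
{
    for (int i = 0; i < LANES; i++)
        dst[i] = s1[i] | s2[i] | s3[i];
}

/* Models VADIF3: dst = src1 + src3 - src2, per lane (add two values,
   subtract the third). */
static void vadif3(const int32_t s1[LANES], const int32_t s2[LANES],
                   const int32_t s3[LANES], int32_t dst[LANES])
{
    for (int i = 0; i < LANES; i++)
        dst[i] = s1[i] + s3[i] - s2[i];
}
```

Collapsing a three-operand combination into one instruction cycle removes an intermediate register write that a two-operand instruction set would require.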
The table lookup unit 310 is a processing unit separate from the processing elements 308 and the histogram unit 312. The table lookup unit 310 accelerates lookup of data values stored in tables in the memories, and can perform N lookups (where N is the number of SIMD lanes of the vector coprocessor core 104) per cycle. The table lookup unit 310 executes the table lookups in a nested loop. The table lookup loop is defined by a VLOOP instruction that specifies a table lookup operation. The vector command specified by VLOOP and the associated vector instructions cause the table lookup unit 310 to retrieve a specified set of values from one or more tables stored in the memories and to store the retrieved values to the memories.

A table lookup vector command initializes address generators that are used to access information defining which values are to be retrieved from a lookup table, and to locate the lookup table in memory. Thereafter, the table lookup unit 310 retrieves information identifying the data to be fetched from the lookup table, applies the information in conjunction with the lookup table location to fetch the data, and stores the fetched data to memory for use by the vector coprocessor core 104. The table lookup unit 310 may fetch table data from multiple memory banks in parallel. The table lookup unit 310 performs the table load as execution of a vector table load instruction, which may be defined as:

- VTLD<type>_<m>TBL_<n>PT tbl_base[tbl_agen][V2], V0, RND_SAT: rnd_sat
where:

- type specifies the data size (e.g., byte, half-word, word, etc.);
- _<m>TBL specifies the number of lookup tables to be accessed in parallel;
- _<n>PT specifies the number of data items per lookup table to be loaded;
- tbl_base specifies a lookup table base address;
- tbl_agen specifies an address generator containing an offset to a given table;
- V2 specifies a vector register containing a data-item-specific offset into the given table;
- V0 specifies a vector register to which the retrieved table data is to be written; and
- RND_SAT: rnd_sat specifies a rounding/saturation mode to be applied to the table lookup indices.
As shown by the vector table lookup instruction, the table lookup unit 310 may fetch one or more data values from one or more tables simultaneously, where each of the multiple tables is located in a different bank of the memories. Embodiments of the table lookup unit 310 constrain the number of tables accessed and/or the number of data values accessed in parallel. For example, the product of the number of tables accessed and the number of data values retrieved per table may be restricted to be no greater than the number of SIMD lanes of the vector coprocessor core 104. In some embodiments, the number of data values retrieved per table access may be restricted to be 1, 2, or 4. Table 1 below shows allowable table and value number combinations for some embodiments of an 8-way SIMD vector coprocessor core 104; a sketch modeling a lookup under this constraint follows the table.
TABLE 1: Table Lookup Constraints

Table type | Num items per lookup, num_data_per_lu | Number of parallel tables, num_par_tbl = 1 | 2 | 4 | 8
---|---|---|---|---|---
Byte | 1 | ✓ | ✓ | ✓ | ✓
Byte | 2 | ✓ | ✓ | ✓ |
Byte | 4 | ✓ | ✓ | |
Byte | 8 | ✓ | | |
Half word | 1 | ✓ | ✓ | ✓ | ✓
Half word | 2 | ✓ | ✓ | ✓ |
Half word | 4 | ✓ | ✓ | |
Half word | 8 | ✓ | | |
Word | 1 | ✓ | ✓ | ✓ | ✓
Word | 2 | ✓ | ✓ | ✓ |
Word | 4 | ✓ | ✓ | |
Word | 8 | ✓ | | |
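As a rough model of a single VTLD-style access under the Table 1 constraint (num_par_tbl times num_data_per_lu no greater than the number of SIMD lanes), consider the C sketch below. The flat memory array, the half-word element type, the per-table index selection, and the lane-to-result mapping are illustrative assumptions.

```c
#include <stdint.h>

#define LANES 8  /* assumed 8-way SIMD */

/* Models one VTLD-style parallel table load: num_par_tbl tables are read
   simultaneously (each assumed to live in a different memory bank), with
   num_data_per_lu consecutive items fetched per table. tbl_base, tbl_off,
   and idx play the roles of tbl_base, tbl_agen, and V2 above. */
static void vtld(const int16_t *mem, uint32_t tbl_base,
                 const uint32_t tbl_off[], const uint32_t idx[],
                 int num_par_tbl, int num_data_per_lu,
                 int16_t v0[LANES])
{
    /* Constraint from Table 1: the results must fit the SIMD width. */
    if (num_par_tbl * num_data_per_lu > LANES)
        return;

    for (int t = 0; t < num_par_tbl; t++) {
        const int16_t *tbl = mem + tbl_base + tbl_off[t]; /* base of table t */
        for (int d = 0; d < num_data_per_lu; d++)
            v0[t * num_data_per_lu + d] = tbl[idx[t] + d]; /* n points/table */
    }
}
```

Placing each table in a separate memory bank is what lets all of the reads proceed in the same cycle without bank conflicts.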
The histogram unit 312 is a processing unit separate from the processing elements 308 and the table lookup unit 310. The histogram unit 312 accelerates construction of histograms in the memories. The histogram unit 312 provides construction of normal histograms, in which an addressed histogram bin entry is incremented by 1, and weighted histograms, in which an addressed histogram bin entry is incremented by a value provided as an element in a weight array input. The histogram unit 312 can perform N histogram bin updates (where N is the number of SIMD lanes of the vector coprocessor core 104) simultaneously. The histogram unit 312 executes the histogram bin updates in a nested loop. The histogram loop is defined by a VLOOP instruction that specifies a histogram operation. The vector command specified by VLOOP and the associated vector instructions cause the histogram unit 312 to retrieve histogram bin values from one or more histograms stored in the memories, update the bin values, and store the updated values back to the memories.

A histogram vector command initializes the increment value by which the retrieved histogram bin values are to be increased, loads an index to a histogram bin, fetches the value of the histogram bin from memory, and adds the increment to the fetched value. The histogram unit 312 may fetch histogram bin values from multiple memory banks in parallel. The histogram unit 312 performs the histogram bin load as execution of a vector histogram load instruction, which may be defined as:

- VHLD<type>_<m>HIST hist_base[hist_agen][V2], V0, RND_SAT: rnd_sat
where:

- type specifies the data size (e.g., byte, half-word, word, etc.);
- _<m>HIST specifies the number of histograms to be accessed in parallel;
- hist_base specifies a histogram base address;
- hist_agen specifies an address generator containing an offset to a given histogram;
- V2 specifies a vector register containing a histogram-bin-specific offset into the given histogram;
- V0 specifies a vector register to which the histogram bin value is to be written; and
- RND_SAT: rnd_sat specifies a rounding/saturation mode to be applied to the histogram indices.
Embodiments of the histogram unit 312 may store updated histogram bin values to the memories as execution of a vector histogram store instruction, which may be defined as:

- VHST<type>_<m>HIST V0, hist_base[hist_agen][V2]
where:

- type specifies the data size (e.g., byte, half-word, word, etc.);
- _<m>HIST specifies the number of histograms to be accessed in parallel;
- V0 specifies a vector register containing the histogram bin value to be written to memory;
- hist_base specifies a histogram base address;
- hist_agen specifies an address generator containing an offset to a given histogram; and
- V2 specifies a vector register containing a histogram-bin-specific offset into the given histogram.
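A round trip through the two histogram instructions can be modeled in C as below; the flat memory array, the 32-bit bin type, and the weight array are illustrative assumptions (a normal histogram corresponds to all weights equal to 1).

```c
#include <stdint.h>

#define LANES 8  /* assumed 8-way SIMD */

/* Models one VHLD/add/VHST round trip: gather one bin value per SIMD
   lane, add the per-lane weight, and scatter the updated values back.
   hist_base, hist_off, and bin play the roles of hist_base, hist_agen,
   and V2 above. */
static void histogram_update(int32_t *mem, uint32_t hist_base,
                             const uint32_t hist_off[LANES],
                             const uint32_t bin[LANES],
                             const int32_t weight[LANES])
{
    int32_t v0[LANES];

    for (int i = 0; i < LANES; i++)          /* VHLD: fetch bin values */
        v0[i] = mem[hist_base + hist_off[i] + bin[i]];

    for (int i = 0; i < LANES; i++)          /* increment: 1 or weight */
        v0[i] += weight[i];

    for (int i = 0; i < LANES; i++)          /* VHST: store updated bins */
        mem[hist_base + hist_off[i] + bin[i]] = v0[i];
}
```

Keeping a separate histogram copy per memory bank (one per lane group) is the usual way to let the N updates proceed in parallel; the copies are summed when the loop completes.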
Embodiments of the processor 100 may be applied to advantage in any number of devices and/or systems that employ real-time data processing. Embodiments may be particularly well suited for use in devices that employ image and/or vision processing, such as consumer devices that include imaging systems. Such devices may include an image sensor for acquiring image data and/or a display device for displaying acquired and/or processed image data. For example, embodiments of the processor 100 may be included in mobile telephones, tablet computers, and other mobile devices to provide image processing while reducing overall power consumption.

The above discussion is meant to be illustrative of the principles and various embodiments of the present invention. Numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.
Claims (29)
1. A processor, comprising:
a scalar processor core; and
a vector coprocessor core coupled to the scalar processor core;
the scalar processor core comprising:
a program memory interface through which the scalar processor retrieves instructions from a program memory, the instructions comprising scalar instructions executable by the scalar processor and vector instructions executable by the vector coprocessor core;
a coprocessor interface through which the scalar processor passes the vector instructions to the vector coprocessor;
the vector coprocessor core, comprising:
a plurality of execution units; and
a vector command buffer configured to:
decode the vector instructions passed by the scalar processor core;
determine whether vector instructions defining an instruction loop have been decoded; and
initiate execution of the instruction loop by one or more of the execution units based on a determination that all of the vector instructions of the instruction loop have been decoded.
2. The processor of claim 1, wherein the instruction loop comprises a plurality of nested loops and the vector coprocessor core is configured to execute the plurality of nested loops without looping overhead.
3. The processor of claim 1, wherein the vector coprocessor core is configured to execute vector instructions only by execution of the vector instructions within an instruction loop comprising a predetermined plurality of nested loops.
4. The processor of claim 1, wherein the scalar processor core is configured to execute the scalar instructions while the vector coprocessor core executes the vector instructions.
5. The processor of claim 1, wherein the vector command buffer comprises storage for a plurality of vector commands, each of the vector commands comprising nested loops of vector instructions; and the vector command buffer is configured to decode the vector instructions of a second vector command passed by the scalar processor core while the execution units execute a first vector command.
6. The processor of claim 1, wherein the scalar processor core is configured to service interrupts while stalled awaiting completion of execution of vector instructions by the vector coprocessor.
7. The processor of claim 1, wherein the vector coprocessor core further comprises an operand memory interface configured to simultaneously access a plurality of banks of memory, each of the banks organized as a plurality of sub-banks of memory.
8. The processor of claim 7, wherein the operand memory interface is configured to simultaneously access the plurality of sub-banks of memory.
9. The processor of claim 7, wherein the vector coprocessor core further comprises a plurality of address generators, each of the address generators configured to provide an address for accessing one of the sub-banks of memory.
10. The processor of claim 1, wherein the instruction loop comprises an outermost loop about a plurality of nested loops, wherein the vector coprocessor core is configured to change, with no overhead, at least one of a location in memory of data to be processed by the nested loops and a dimension of an array of data to be processed by the nested loops.
11. The processor of claim 1, wherein the vector coprocessor core is configured to exit the instruction loop prior to execution of a predetermined number of iterations of the instruction loop; wherein the vector coprocessor core is configured to schedule the exit from a predetermined number of nested loops of the instruction loop to occur at completion of a current iteration of an innermost loop of the nested loops based on a value stored in a register of the vector coprocessor core.
12. The processor of claim 1, wherein the vector coprocessor core is configured to exit an outermost loop of the instruction loop prior to execution of a predetermined number of iterations of the outermost loop; wherein the vector coprocessor core is configured to schedule the exit from the outermost loop to occur at completion of all iterations of loops within the outermost loop based on a value stored in a register of the vector coprocessor core.
13. A vector coprocessor, comprising:
a plurality of execution units configured to simultaneously apply an instruction specified operation to different data; and
a vector command buffer configured to:
decode instructions to be executed by the execution units;
identify an instruction loop in the instructions; and
provide the instructions to the execution units for execution based on a determination that all of the instructions of an identified instruction loop have been decoded.
14. The vector coprocessor of claim 13, further comprising loop control logic configured to control execution of a plurality of nested loops of the instruction loop without looping overhead.
15. The vector coprocessor of claim 14, wherein the loop control logic is configured to exit the instruction loop prior to execution of a predetermined number of iterations of the instruction loop; wherein the loop control logic is configured to schedule the exit from the predetermined number of nested loops of the instruction loop to occur at completion of a current iteration of an innermost loop of the nested loops based on a value read from a register of the vector coprocessor core during the current iteration of the innermost loop.
16. The vector coprocessor of claim 14, wherein the loop control logic is configured to exit an outermost loop of the instruction loop prior to execution of a predetermined number of iterations of the outermost loop; wherein the loop control logic is configured to schedule the exit from the outermost loop to occur at completion of all iterations of nested loops within the outermost loop based on a value read from a register of the vector coprocessor core during a current iteration of an innermost loop of the nested loops.
17. The vector coprocessor of claim 14, wherein the instruction loop comprises an outermost loop about a plurality of nested loops, wherein the loop control logic is configured to change, with no overhead, at least one of a distance to a next object in memory to be processed by the nested loops and a dimension of the next object to be processed by the nested loops.
18. The vector coprocessor of claim 13, wherein the vector command buffer comprises storage for a plurality of vector commands, each of the vector commands comprising nested loops of instructions; and the vector command buffer is configured to decode the instructions of a second vector command while the execution units execute a first vector command.
19. The vector coprocessor of claim 13, further comprising an operand memory interface configured to simultaneously access a plurality of banks of memory, each of the banks organized as a plurality of sub-banks of memory.
20. The vector coprocessor of claim 16, wherein the operand memory interface is configured to simultaneously access the plurality of sub-banks of memory.
21. The vector coprocessor of claim 16, further comprising a plurality of address generators, each of the address generators configured to provide an address for accessing one or more of the sub-banks of memory.
22. The vector coprocessor of claim 13, wherein the vector command buffer is configured to identify the instruction loop based on detection of a looping instruction in the instructions, wherein the looping instruction is indicative of a start of the instruction loop and specifies a length of the instruction loop.
23. The vector coprocessor of claim 13 , further comprising:
a processor interface through which the vector processor receives from a different processor:
instructions to be executed by the vector processor;
memory storage addresses of the instructions to be executed by the vector processor, wherein each address is provided concurrently with one of the instructions; and
data values to be written to a register of the vector processor, wherein each data value is provided concurrently with one of the instructions.
24. A processor, comprising:
a control processor core; and
a single-instruction multiple data (SIMD) coprocessor core coupled to the control processor core;
the control processor core comprising:
a program memory interface through which the control processor retrieves instructions from a program memory, the instructions comprising instructions executable by the control processor and SIMD instructions executable by the SIMD coprocessor core;
the SIMD coprocessor core, comprising:
a plurality of execution units; and
a vector command buffer configured to:
group SIMD instructions received from the control processor core into an instruction loop; and
initiate execution of the instruction loop based on a complete set of SIMD instructions of an instruction loop being received from the control processor core; and
loop control logic coupled to the vector command buffer, the loop control logic configured to manage execution of the instruction loop as a plurality of nested loops without loop overhead.
25. The processor of claim 24, wherein the control processor core is configured to:
execute instructions while the SIMD coprocessor core executes the instruction loop; and
service interrupts while stalled awaiting completion of execution of the instruction loop.
26. The processor of claim 24, wherein the vector command buffer comprises:
an SIMD instruction decoder; and
storage for a plurality of vector commands, each vector command comprising an instruction loop;
wherein the SIMD instruction decoder is configured to decode a vector command while a previously decoded vector command is being executed.
27. The processor of claim 24, further comprising:
a plurality of banks of memory coupled to the control processor core and to the SIMD coprocessor core, wherein each bank of memory comprises a number of sub-banks equal to a number of SIMD processing lanes of the SIMD coprocessor core;
wherein the SIMD coprocessor core comprises a memory interface configured to simultaneously access all the sub-banks of a given one of the banks of memory.
28. The processor of claim 24, wherein the loop control logic is configured to:
schedule exit from a predetermined number of nested loops of the instruction loop to occur at completion of a current iteration of an innermost loop of the nested loops based on a value read from a register of the vector coprocessor core during the current iteration of the innermost loop; and
schedule exit from an outermost loop of the instruction loop to occur at completion of all iterations of the nested loops within the outermost loop based on the value read from the register of the vector coprocessor core during the current iteration of the innermost loop;
wherein the current iteration is not the last iteration of the instruction set at initiation of instruction loop execution.
29. The processor of claim 24, wherein the loop control logic is configured to change, with no overhead, on execution of an outermost loop of the instruction loop, at least one of a distance to a next object in memory to be processed by the instruction loop and a dimension of the next object to be processed by the instruction loop.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/548,924 US20130185540A1 (en) | 2011-07-14 | 2012-07-13 | Processor with multi-level looping vector coprocessor |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201161507652P | 2011-07-14 | 2011-07-14 | |
US13/548,924 US20130185540A1 (en) | 2011-07-14 | 2012-07-13 | Processor with multi-level looping vector coprocessor |
Publications (1)
Publication Number | Publication Date |
---|---|
US20130185540A1 true US20130185540A1 (en) | 2013-07-18 |
Family
ID=48780833
Family Applications (6)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/548,955 Active 2037-01-27 US10803009B2 (en) | 2011-07-14 | 2012-07-13 | Processor with table lookup processing unit |
US13/548,933 Active 2035-07-07 US9519617B2 (en) | 2011-07-14 | 2012-07-13 | Processor with instruction variable data distribution |
US13/548,924 Abandoned US20130185540A1 (en) | 2011-07-14 | 2012-07-13 | Processor with multi-level looping vector coprocessor |
US13/548,945 Abandoned US20130185538A1 (en) | 2011-07-14 | 2012-07-13 | Processor with inter-processing path communication |
US17/026,412 Active US11468003B2 (en) | 2011-07-14 | 2020-09-21 | Vector table load instruction with address generation field to access table offset value |
US17/962,585 Pending US20230049454A1 (en) | 2011-07-14 | 2022-10-10 | Processor with table lookup unit |
Family Applications Before (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/548,955 Active 2037-01-27 US10803009B2 (en) | 2011-07-14 | 2012-07-13 | Processor with table lookup processing unit |
US13/548,933 Active 2035-07-07 US9519617B2 (en) | 2011-07-14 | 2012-07-13 | Processor with instruction variable data distribution |
Family Applications After (3)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/548,945 Abandoned US20130185538A1 (en) | 2011-07-14 | 2012-07-13 | Processor with inter-processing path communication |
US17/026,412 Active US11468003B2 (en) | 2011-07-14 | 2020-09-21 | Vector table load instruction with address generation field to access table offset value |
US17/962,585 Pending US20230049454A1 (en) | 2011-07-14 | 2022-10-10 | Processor with table lookup unit |
Country Status (1)
Country | Link |
---|---|
US (6) | US10803009B2 (en) |
Families Citing this family (45)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8577950B2 (en) * | 2009-08-17 | 2013-11-05 | International Business Machines Corporation | Matrix multiplication operations with data pre-conditioning in a high performance computing architecture |
US8650240B2 (en) * | 2009-08-17 | 2014-02-11 | International Business Machines Corporation | Complex matrix multiplication operations with data pre-conditioning in a high performance computing architecture |
US9600281B2 (en) * | 2010-07-12 | 2017-03-21 | International Business Machines Corporation | Matrix multiplication operations using pair-wise load and splat operations |
SE537423C2 (en) * | 2011-12-20 | 2015-04-21 | Mediatek Sweden Ab | Digital signal processor and method for addressing a memory in a digital signal processor |
US9268563B2 (en) * | 2012-11-12 | 2016-02-23 | International Business Machines Corporation | Verification of a vector execution unit design |
US9804839B2 (en) * | 2012-12-28 | 2017-10-31 | Intel Corporation | Instruction for determining histograms |
KR20150063745A (en) * | 2013-12-02 | 2015-06-10 | 삼성전자주식회사 | Method and apparatus for simd computation using register pairing |
US20160026607A1 (en) * | 2014-07-25 | 2016-01-28 | Qualcomm Incorporated | Parallelization of scalar operations by vector processors using data-indexed accumulators in vector register files, and related circuits, methods, and computer-readable media |
CN104298639B (en) * | 2014-09-23 | 2017-03-15 | 天津国芯科技有限公司 | Embedded method of attachment and the connecting interface of primary processor and some coprocessors |
EP3001306A1 (en) * | 2014-09-25 | 2016-03-30 | Intel Corporation | Bit group interleave processors, methods, systems, and instructions |
US20160124651A1 (en) | 2014-11-03 | 2016-05-05 | Texas Instruments Incorporated | Method for performing random read access to a block of data using parallel lut read instruction in vector processors |
KR102357863B1 (en) * | 2014-12-15 | 2022-02-04 | 삼성전자주식회사 | A memory access method and apparatus |
US9547881B2 (en) * | 2015-01-29 | 2017-01-17 | Qualcomm Incorporated | Systems and methods for calculating a feature descriptor |
GB2536069B (en) * | 2015-03-25 | 2017-08-30 | Imagination Tech Ltd | SIMD processing module |
US9875213B2 (en) * | 2015-06-26 | 2018-01-23 | Intel Corporation | Methods, apparatus, instructions and logic to provide vector packed histogram functionality |
US20170003966A1 (en) * | 2015-06-30 | 2017-01-05 | Microsoft Technology Licensing, Llc | Processor with instruction for interpolating table lookup values |
US9965275B2 (en) * | 2015-07-31 | 2018-05-08 | Arm Limited | Element size increasing instruction |
US20170046156A1 (en) * | 2015-08-14 | 2017-02-16 | Qualcomm Incorporated | Table lookup using simd instructions |
US20170177362A1 (en) * | 2015-12-22 | 2017-06-22 | Intel Corporation | Adjoining data element pairwise swap processors, methods, systems, and instructions |
US10564964B2 (en) | 2016-08-23 | 2020-02-18 | International Business Machines Corporation | Vector cross-compare count and sequence instructions |
GB2558220B (en) * | 2016-12-22 | 2019-05-15 | Advanced Risc Mach Ltd | Vector generating instruction |
US10908901B2 (en) * | 2017-06-29 | 2021-02-02 | Texas Instruments Incorporated | Streaming engine with early exit from loop levels supporting early exit loops and irregular loops |
US10108538B1 (en) * | 2017-07-31 | 2018-10-23 | Google Llc | Accessing prologue and epilogue data |
US10628295B2 (en) * | 2017-12-26 | 2020-04-21 | Samsung Electronics Co., Ltd. | Computing mechanisms using lookup tables stored on memory |
US10963256B2 (en) * | 2018-09-28 | 2021-03-30 | Intel Corporation | Systems and methods for performing instructions to transform matrices into row-interleaved format |
US11042468B2 (en) | 2018-11-06 | 2021-06-22 | Texas Instruments Incorporated | Tracking debug events from an autonomous module through a data pipeline |
US11403256B2 (en) | 2019-05-20 | 2022-08-02 | Micron Technology, Inc. | Conditional operations in a vector processor having true and false vector index registers |
US11340904B2 (en) | 2019-05-20 | 2022-05-24 | Micron Technology, Inc. | Vector index registers |
US11507374B2 (en) | 2019-05-20 | 2022-11-22 | Micron Technology, Inc. | True/false vector index registers and methods of populating thereof |
US11327862B2 (en) | 2019-05-20 | 2022-05-10 | Micron Technology, Inc. | Multi-lane solutions for addressing vector elements using vector index registers |
KR20210112949A (en) | 2020-03-06 | 2021-09-15 | 삼성전자주식회사 | Data bus, data processing method thereof and data processing apparatus |
GB202003257D0 (en) | 2020-03-06 | 2020-04-22 | Myrtle Software Ltd | Memory processing optimisation |
US12164926B2 (en) * | 2020-10-13 | 2024-12-10 | Carnegie Mellon University | Vector dataflow architecture for embedded systems |
US11573921B1 (en) | 2021-08-02 | 2023-02-07 | Nvidia Corporation | Built-in self-test for a programmable vision accelerator of a system on a chip |
US11593290B1 (en) | 2021-08-02 | 2023-02-28 | Nvidia Corporation | Using a hardware sequencer in a direct memory access system of a system on a chip |
US12118353B2 (en) | 2021-08-02 | 2024-10-15 | Nvidia Corporation | Performing load and permute with a single instruction in a system on a chip |
US11836527B2 (en) | 2021-08-02 | 2023-12-05 | Nvidia Corporation | Accelerating table lookups using a decoupled lookup table accelerator in a system on a chip |
US11636063B2 (en) | 2021-08-02 | 2023-04-25 | Nvidia Corporation | Hardware accelerated anomaly detection using a min/max collector in a system on a chip |
US11704067B2 (en) | 2021-08-02 | 2023-07-18 | Nvidia Corporation | Performing multiple point table lookups in a single cycle in a system on chip |
US11954496B2 (en) | 2021-08-02 | 2024-04-09 | Nvidia Corporation | Reduced memory write requirements in a system on a chip using automatic store predication |
US11593001B1 (en) | 2021-08-02 | 2023-02-28 | Nvidia Corporation | Using per memory bank load caches for reducing power use in a system on a chip |
US12099439B2 (en) | 2021-08-02 | 2024-09-24 | Nvidia Corporation | Performing load and store operations of 2D arrays in a single cycle in a system on a chip |
US11573795B1 (en) | 2021-08-02 | 2023-02-07 | Nvidia Corporation | Using a vector processor to configure a direct memory access system for feature tracking operations in a system on a chip |
US11741044B2 (en) | 2021-12-30 | 2023-08-29 | Microsoft Technology Licensing, Llc | Issuing instructions on a vector processor |
US12131157B2 (en) * | 2022-11-10 | 2024-10-29 | Azurengine Technologies Zhuhai Inc. | Mixed scalar and vector operations in multi-threaded computing |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5524223A (en) * | 1994-01-31 | 1996-06-04 | Motorola, Inc. | Instruction accelerator for processing loop instructions with address generator using multiple stored increment values |
US6269440B1 (en) * | 1999-02-05 | 2001-07-31 | Agere Systems Guardian Corp. | Accelerating vector processing using plural sequencers to process multiple loop iterations simultaneously |
US6598155B1 (en) * | 2000-01-31 | 2003-07-22 | Intel Corporation | Method and apparatus for loop buffering digital signal processing instructions |
US20030200423A1 (en) * | 2002-04-22 | 2003-10-23 | Ehlig Peter N. | Repeat block with zero cycle overhead nesting |
US20040078554A1 (en) * | 2000-02-29 | 2004-04-22 | International Business Machines Corporation | Digital signal processor with cascaded SIMD organization |
US6820194B1 (en) * | 2001-04-10 | 2004-11-16 | Mindspeed Technologies, Inc. | Method for reducing power when fetching instructions in a processor and related apparatus |
US20050102659A1 (en) * | 2003-11-06 | 2005-05-12 | Singh Ravi P. | Methods and apparatus for setting up hardware loops in a deeply pipelined processor |
US20050240644A1 (en) * | 2002-05-24 | 2005-10-27 | Van Berkel Cornelis H | Scalar/vector processor |
US20060107028A1 (en) * | 2002-11-28 | 2006-05-18 | Koninklijke Philips Electronics N.V. | Loop control circuit for a data processor |
US20070113058A1 (en) * | 2005-11-14 | 2007-05-17 | Texas Instruments Incorporated | Microprocessor with indepedent SIMD loop buffer |
US7272704B1 (en) * | 2004-05-13 | 2007-09-18 | Verisilicon Holdings (Cayman Islands) Co. Ltd. | Hardware looping mechanism and method for efficient execution of discontinuity instructions |
US20100169612A1 (en) * | 2007-06-26 | 2010-07-01 | Telefonaktiebolaget L M Ericsson (Publ) | Data-Processing Unit for Nested-Loop Instructions |
US8359462B1 (en) * | 2007-11-21 | 2013-01-22 | Marvell International Ltd. | Method and apparatus for programmable coupling between CPU and co-processor |
Family Cites Families (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4594682A (en) * | 1982-12-22 | 1986-06-10 | Ibm Corporation | Vector processing |
US5887183A (en) * | 1995-01-04 | 1999-03-23 | International Business Machines Corporation | Method and system in a data processing system for loading and storing vectors in a plurality of modes |
US5893159A (en) * | 1997-10-22 | 1999-04-06 | International Business Machines Corporation | Methods and apparatus for managing scratchpad memory in a multiprocessor data processing system |
US6192515B1 (en) * | 1998-07-17 | 2001-02-20 | Intel Corporation | Method for software pipelining nested loops |
US20020002666A1 (en) * | 1998-10-12 | 2002-01-03 | Carole Dulong | Conditional operand selection using mask operations |
US6745319B1 (en) * | 2000-02-18 | 2004-06-01 | Texas Instruments Incorporated | Microprocessor with instructions for shuffling and dealing data |
US6922716B2 (en) * | 2001-07-13 | 2005-07-26 | Motorola, Inc. | Method and apparatus for vector processing |
US6931511B1 (en) * | 2001-12-31 | 2005-08-16 | Apple Computer, Inc. | Parallel vector table look-up with replicated index element vector |
US7467287B1 (en) * | 2001-12-31 | 2008-12-16 | Apple Inc. | Method and apparatus for vector table look-up |
US20130212353A1 (en) * | 2002-02-04 | 2013-08-15 | Tibet MIMAR | System for implementing vector look-up table operations in a SIMD processor |
US20110087859A1 (en) * | 2002-02-04 | 2011-04-14 | Mimar Tibet | System cycle loading and storing of misaligned vector elements in a simd processor |
US7107512B2 (en) * | 2002-05-31 | 2006-09-12 | Broadcom Corporation | TTCM decoder design |
US7506135B1 (en) * | 2002-06-03 | 2009-03-17 | Mimar Tibet | Histogram generation with vector operations in SIMD and VLIW processor by consolidating LUTs storing parallel update incremented count values for vector data elements |
US7275148B2 (en) * | 2003-09-08 | 2007-09-25 | Freescale Semiconductor, Inc. | Data processing system using multiple addressing modes for SIMD operations and method thereof |
GB2409062C (en) * | 2003-12-09 | 2007-12-11 | Advanced Risc Mach Ltd | Aliasing data processing registers |
US9557994B2 (en) * | 2004-07-13 | 2017-01-31 | Arm Limited | Data processing apparatus and method for performing N-way interleaving and de-interleaving operations where N is an odd plural number |
US7627735B2 (en) * | 2005-10-21 | 2009-12-01 | Intel Corporation | Implementing vector memory operations |
US20070156685A1 (en) * | 2005-12-28 | 2007-07-05 | Hiroshi Inoue | Method for sorting data using SIMD instructions |
US7480787B1 (en) * | 2006-01-27 | 2009-01-20 | Sun Microsystems, Inc. | Method and structure for pipelining of SIMD conditional moves |
US7536532B2 (en) * | 2006-09-27 | 2009-05-19 | International Business Machines Corporation | Merge operations of data arrays based on SIMD instructions |
US7962718B2 (en) * | 2007-10-12 | 2011-06-14 | Freescale Semiconductor, Inc. | Methods for performing extended table lookups using SIMD vector permutation instructions that support out-of-range index values |
US8667250B2 (en) * | 2007-12-26 | 2014-03-04 | Intel Corporation | Methods, apparatus, and instructions for converting vector data |
US7945764B2 (en) * | 2008-01-11 | 2011-05-17 | International Business Machines Corporation | Processing unit incorporating multirate execution unit |
JP2011221605A (en) * | 2010-04-05 | 2011-11-04 | Sony Corp | Information processing apparatus, information processing method and program |
US20120254591A1 (en) * | 2011-04-01 | 2012-10-04 | Hughes Christopher J | Systems, apparatuses, and methods for stride pattern gathering of data elements and stride pattern scattering of data elements |
2012
- 2012-07-13 US US13/548,955 patent/US10803009B2/en active Active
- 2012-07-13 US US13/548,933 patent/US9519617B2/en active Active
- 2012-07-13 US US13/548,924 patent/US20130185540A1/en not_active Abandoned
- 2012-07-13 US US13/548,945 patent/US20130185538A1/en not_active Abandoned

2020
- 2020-09-21 US US17/026,412 patent/US11468003B2/en active Active

2022
- 2022-10-10 US US17/962,585 patent/US20230049454A1/en active Pending
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140244970A1 (en) * | 2011-10-18 | 2014-08-28 | Mediatek Sweden Ab | Digital signal processor and baseband communication device |
US20140281373A1 (en) * | 2011-10-18 | 2014-09-18 | Mediatek Sweden Ab | Digital signal processor and baseband communication device |
US20140188961A1 (en) * | 2012-12-27 | 2014-07-03 | Mikhail Plotnikov | Vectorization Of Collapsed Multi-Nested Loops |
US10838724B2 (en) | 2016-02-03 | 2020-11-17 | Google Llc | Accessing data in multi-dimensional tensors |
EP3226121A3 (en) * | 2016-02-03 | 2018-10-31 | Google LLC | Accessing data in multi-dimensional tensors |
US10228947B2 (en) | 2016-02-03 | 2019-03-12 | Google Llc | Accessing data in multi-dimensional tensors |
US20170337060A1 (en) * | 2016-05-23 | 2017-11-23 | Fujitsu Limited | Information processing apparatus and conversion method |
US10496408B2 (en) * | 2016-05-23 | 2019-12-03 | Fujitsu Limited | Information processing apparatus and conversion method |
US11481327B2 (en) * | 2016-12-20 | 2022-10-25 | Texas Instruments Incorporated | Streaming engine with flexible streaming engine template supporting differing number of nested loops with corresponding loop counts and loop offsets |
US11921636B2 (en) | 2016-12-20 | 2024-03-05 | Texas Instruments Incorporated | Streaming engine with flexible streaming engine template supporting differing number of nested loops with corresponding loop counts and loop offsets |
US11501144B2 (en) | 2017-08-11 | 2022-11-15 | Google Llc | Neural network accelerator with parameters resident on chip |
US11727259B2 (en) | 2017-08-11 | 2023-08-15 | Google Llc | Neural network accelerator with parameters resident on chip |
US10504022B2 (en) | 2017-08-11 | 2019-12-10 | Google Llc | Neural network accelerator with parameters resident on chip |
US11960891B2 (en) | 2019-05-27 | 2024-04-16 | Texas Instruments Incorporated | Look-up table write |
US11455169B2 (en) * | 2019-05-27 | 2022-09-27 | Texas Instruments Incorporated | Look-up table read |
US12093690B2 (en) | 2019-05-27 | 2024-09-17 | Texas Instruments Incorporated | Look-up table read |
US12242852B2 (en) | 2019-05-27 | 2025-03-04 | Texas Instruments Incorporated | Look-up table initialize |
US20230032335A1 (en) * | 2021-01-15 | 2023-02-02 | Cornell University | Content-addressable processing engine |
US11461097B2 (en) * | 2021-01-15 | 2022-10-04 | Cornell University | Content-addressable processing engine |
US20220229663A1 (en) * | 2021-01-15 | 2022-07-21 | Cornell University | Content-addressable processing engine |
US12001841B2 (en) * | 2021-01-15 | 2024-06-04 | Cornell University | Content-addressable processing engine |
US20220414051A1 (en) * | 2021-06-28 | 2022-12-29 | Silicon Laboratories Inc. | Apparatus for Array Processor with Program Packets and Associated Methods |
US12153921B2 (en) | 2021-06-28 | 2024-11-26 | Silicon Laboratories Inc. | Processor with macro-instruction achieving zero-latency data movement |
US12153542B2 (en) * | 2021-06-28 | 2024-11-26 | Silicon Laboratories Inc. | Apparatus for array processor with program packets and associated methods |
Also Published As
Publication number | Publication date |
---|---|
US20130185538A1 (en) | 2013-07-18 |
US10803009B2 (en) | 2020-10-13 |
US20130185539A1 (en) | 2013-07-18 |
US20210004349A1 (en) | 2021-01-07 |
US20230049454A1 (en) | 2023-02-16 |
US9519617B2 (en) | 2016-12-13 |
US20130185544A1 (en) | 2013-07-18 |
US11468003B2 (en) | 2022-10-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11468003B2 (en) | Vector table load instruction with address generation field to access table offset value | |
US8412917B2 (en) | Data exchange and communication between execution units in a parallel processor | |
US20240378056A1 (en) | Streaming engine with cache-like stream data storage and lifetime tracking | |
US7937559B1 (en) | System and method for generating a configurable processor supporting a user-defined plurality of instruction sizes | |
CN111381880B (en) | Processor, medium, and operation method of processor | |
JP7616757B2 (en) | Apparatus, method and system for matrix operation accelerator instructions - Patents.com | |
US6275920B1 (en) | Mesh connected computed | |
US6289434B1 (en) | Apparatus and method of implementing systems on silicon using dynamic-adaptive run-time reconfigurable circuits for processing multiple, independent data and control streams of varying rates | |
CN111095242A (en) | Vector calculation unit | |
CN108205448B (en) | Stream engine with multi-dimensional circular addressing selectable in each dimension | |
US12079470B2 (en) | Streaming engine with fetch ahead hysteresis | |
CN115562729A (en) | Data processing apparatus having a stream engine with read and read/forward operand encoding | |
WO1999053411A2 (en) | Mesh connected computer | |
WO2005026974A1 (en) | Data processing system for implementing simd operations and method thereof | |
JP2008003708A (en) | Video processing engine and video processing system including the same | |
US20180181347A1 (en) | Data processing apparatus and method for controlling vector memory accesses | |
CN113924550A (en) | Histogram operation | |
US7111155B1 (en) | Digital signal processor computation core with input operand selection from operand bus for dual operations | |
EP2267596B1 (en) | Processor core for processing instructions of different formats | |
US6785743B1 (en) | Template data transfer coprocessor | |
US11714641B2 (en) | Vector generating instruction for generating a vector comprising a sequence of elements that wraps as required | |
CN111984316A (en) | Method and apparatus for comparing source data in a processor | |
US6820189B1 (en) | Computation core executing multiple operation DSP instructions and micro-controller instructions of shorter length without performing switch operation | |
EP3757822B1 (en) | Apparatuses, methods, and systems for enhanced matrix multiplier architecture | |
US20200371793A1 (en) | Vector store using bit-reversed order |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | AS | Assignment | Owner name: TEXAS INSTRUMENTS INCORPORATED, TEXAS. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:HUNG, CHING-YU;INAMORI, SHINRI;SANKARAN, JAGADEESH;AND OTHERS;SIGNING DATES FROM 20120713 TO 20120809;REEL/FRAME:028779/0467 |
 | STCB | Information on status: application discontinuation | Free format text: ABANDONED -- AFTER EXAMINER'S ANSWER OR BOARD OF APPEALS DECISION |