WO2008124061A1 - Système pour calcul de convolution équipé de plusieurs processeurs informatiques - Google Patents
- Publication number: WO2008124061A1 (PCT application PCT/US2008/004402)
- Authority
- WO
- WIPO (PCT)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/15—Correlation function computation including computation of convolution operations
Definitions
- the present invention relates generally to electrical computers for arithmetic processing and calculating, and more particularly to such where a convolution integral is evaluated in a digital fashion.
- Transform is used to refer to a class of equation analysis techniques. The concept of the transform traces back to the functional analysis branch of mathematics, which primarily deals with the study of spaces of functions where a particular function has as its argument another function. Transforms thus can be used with an individual equation or with entire sets of equations, wherein the process of transformation is a one-to-one mapping of the original equation or equations represented in one domain into another equation or equations represented in a separate domain.
- the motivation for performing transformation is often straightforward. There are many equations that are difficult to solve in their original representations, yet which may be more easily solvable in one or more other representations. Thus, a transform may be performed, a solution found, and then an inverse transform performed to map the solution back into the original domain.
- the general form of an integral transform is defined as: F(a) = ∫ f(t)·K(a,t) dt, where K(a,t) is often referred to as the "integral kernel" of the transform.
- the Laplace transform is a subset of the class of transforms defined by equation (1) and it is often particularly useful. Given a simple mathematical or functional description of an input to or an output from a system, the Laplace transform can provide an alternative functional description that may simplify analyzing the behavior of the system.
- the general form of the Laplace transform is defined as: F(s) = ∫₀^∞ f(t)·e^(-st) dt.
- F(s) is not the transform of a single known function but can be represented as the product of two functions that are each the result of the transform of a known function, f(t) or g(t), respectively. That is, F(s) = L{f(t)}·L{g(t)}, which by the convolution theorem is the Laplace transform of the convolution of f(t) with g(t).
- the FIR filter is usually considered advantageous to use because it does not require internal feedback, which can, for example, cause an IIR filter to respond indefinitely to an impulse.
- the word "finite" in its name also implies another advantage of the FIR filter.
- the impulse response of such a filter ultimately settles to zero, and errors in the iterative summing calculations used do not propagate. That is, the error term stays constant throughout the entire calculation process. This is a distinct advantage over an IIR filter, for example, where error can potentially grow with each additional iterative output sum.
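The stability contrast above can be illustrated with a small sketch (Python, our own illustration, not from the patent; the filter taps and feedback coefficient are made up for the example):

```python
# A small stability illustration (our own, not from the patent): a 4-tap FIR
# filter's impulse response settles to exactly zero once the taps have passed,
# while a one-pole IIR filter with feedback keeps responding indefinitely.

def fir_response(taps, x):
    """Direct-form FIR: y[n] = sum(taps[k] * x[n - k])."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, c in enumerate(taps):
            if 0 <= n - k < len(x):
                acc += c * x[n - k]
        y.append(acc)
    return y

def iir_response(a, x):
    """One-pole IIR with feedback coefficient a: y[n] = x[n] + a * y[n - 1]."""
    y, prev = [], 0.0
    for sample in x:
        prev = sample + a * prev
        y.append(prev)
    return y

impulse = [1.0] + [0.0] * 9
fir = fir_response([0.4, 0.3, 0.2, 0.1], impulse)
iir = iir_response(0.5, impulse)

assert all(v == 0.0 for v in fir[4:])   # FIR settles to exactly zero
assert all(v != 0.0 for v in iir[4:])   # the feedback path never quite dies
```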
- Unfortunately for many applications a major limitation of a digital filter is that its speed is restricted by the speed of the processor or processors used for numerical calculations.
- one preferred embodiment of the present invention is a system for calculating a convolution of a data function with a filter function.
- An array of processors is provided that includes first and last processors and that each include a logic to multiply a coefficient value that is based on a derivation of the filter function and a data value that is representative of the data function to produce a current intermediate value.
- a logic is provided to receive a prior intermediate value that is representative of a previously performed calculation in another of the processors and to add that prior intermediate value to the current intermediate value.
- a logic is provided to send the data value and the current intermediate value to another processor.
- a logic is further provided to hold a prior intermediate value, if any, from the last processor as a prior partial value and to add this prior partial value to the current intermediate value from the last processor to produce a result value.
- the array of processors thus receives a series of the data values and produces a series of the result values that collectively are representative of the convolution of the data function with the filter function.
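The pipeline just described can be sketched as a simulation (Python, our own reconstruction; the function name and the per-step scheduling are ours, not the patent's code):

```python
# Our reconstruction (names and scheduling are ours): each processor in the
# pipeline holds one coefficient, multiplies it by its current data value, adds
# the partial sum received from the previous processor, and passes both the
# data value and the new partial sum along.

def pipeline_convolution(coefficients, data):
    """Simulate the array: feed the data values, then zeros to flush the
    pipeline, collecting one result value per step."""
    n = len(coefficients)
    held_data = [0.0] * n              # the data value each processor holds
    results = []
    for x in data + [0.0] * (n - 1):   # trailing zeros flush the pipeline
        held_data = [x] + held_data[:-1]   # data shifts down the pipeline
        partial = 0.0                      # first processor starts the sum
        for c, d in zip(coefficients, held_data):
            partial = partial + c * d      # multiply-accumulate, passed on
        results.append(partial)            # last processor emits the result
    return results

out = pipeline_convolution([1.0, 2.0, 3.0], [1.0, 0.0, -1.0, 2.0])
assert out == [1.0, 2.0, 2.0, 0.0, 1.0, 6.0]   # the full convolution
```

The result series matches an ordinary full convolution of the coefficient and data sequences, with length len(data) + len(coefficients) - 1.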
- another preferred embodiment of the present invention is a process for calculating a result value in a convolution of a data function with a filter function.
- a sequence of coefficient values are obtained that are based on a derivation of the filter function.
- the coefficient values are then used in a pipeline of computerized processors including a first and last processors to multiply one of the coefficient values and the data value to produce current intermediate values.
- a prior intermediate value that is representative of a previously performed calculation in another of the processors is added to the current intermediate value.
- the data value and current intermediate value are sent to a subsequent processor.
- a prior partial value, if any, is added to the current intermediate value from the last processor to produce a result value, wherein this prior partial value is a previous intermediate value from the last processor.
- the result value is output to a digital signal processor employing the process.
- another preferred embodiment of the present invention is a process for calculating a convolution of a data function with a filter function.
- a sequence of coefficient values are obtained that are based on a derivation of the filter function and a sequence of data values are obtained that are representative of the data function.
- the coefficient value and the data value are multiplied to produce a current intermediate value.
- a prior intermediate value that is representative of a previously performed calculation in another of the processors is then added to the current intermediate value.
- the data value and the current intermediate value are sent to a subsequent processor.
- a prior partial value, if any, is added to the current intermediate value from the last processor to produce a result value, wherein the prior partial value is a previous intermediate value from the last processor.
- these result values are accumulated as the convolution, and the convolution is output to a digital signal processor employing the process.
- another preferred embodiment of the present invention is an improved system for calculating a convolution of the type in which at least one processor multiplies a coefficient value that is representative of a filter function with a data value that is representative of a data function.
- the improvement comprises the coefficient value being based on a derivation of the filter function.
- another preferred embodiment of the present invention is an improved process for calculating a convolution in a computerized processor of the type in which coefficient values are representative of a filter function, data values are representative of a data function, and there is multiplying of the coefficient values with the data values to produce result values that are collectively representative of the convolution.
- the improvement comprises employing coefficient values that are based on a derivation of the filter function.
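Reading "derivation" as first differences of the filter coefficients, the improvement can be sketched as follows (our interpretation of the described mechanism, not the patent's exact algorithm; all names are ours). Convolving with the differenced coefficients yields partial values whose running sum reproduces the direct convolution:

```python
# Our interpretation of the improvement (not the patent's exact code): take
# first differences of the filter coefficients, convolve with those, and
# integrate (running-sum) the partial values to recover the direct result.

def convolve(h, x):
    """Plain full convolution, length len(h) + len(x) - 1."""
    y = [0.0] * (len(h) + len(x) - 1)
    for i, c in enumerate(h):
        for j, d in enumerate(x):
            y[i + j] += c * d
    return y

def derivation_convolve(h, x):
    """Convolve with the differenced coefficients, then integrate."""
    # difference kernel, including the trailing -h[-1] closing term
    dh = [h[0]] + [h[i] - h[i - 1] for i in range(1, len(h))] + [-h[-1]]
    out, acc = [], 0.0
    for p in convolve(dh, x):
        acc += p                # the "integrator" restores the direct result
        out.append(acc)
    return out[:-1]             # final running sum is zero; drop it

h = [1.0, 50.0, 99.0, 148.0, 99.0, 50.0, 1.0]   # made-up, slowly varying taps
x = [2.0, -1.0, 3.0]
direct = convolve(h, x)
derived = derivation_convolve(h, x)
assert len(direct) == len(derived)
assert all(abs(a - b) < 1e-9 for a, b in zip(direct, derived))
```

With these made-up taps the differenced values span at most 49 units while the direct coefficients reach 148, which is the reduced-bit-width advantage discussed later in the document.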
- FIG. 1 is a diagrammatic view depicting the inventive convolution system being employed in an array of computer processors
- FIG. 2 is a diagrammatic view (background art) of the major internal features of one of the processors in FIG. 1;
- FIGS. 3a-c are partial views of FIG. 1 depicting inbound, outbound, and internal communications using the processors in FIG. 1, wherein FIG. 3a shows how data is passed between the input device and a first processor, and between the first processor and a second processor, FIG. 3b shows how data is passed between a second to last processor and a last processor, and between the last processor and the output device, and FIG. 3c shows how data is passed between two exemplary processors being used centrally in the array;
- FIGS. 4a-f are block diagrams schematically depicting convolution calculation stages in an array of processors, such as that of FIG. 1;
- FIGS. 5a-f are block diagrams schematically depicting convolution calculation stages based on a new algorithm, again presented in an array of processors such as that of FIG. 1.
- FIGS. 6a-c are graphs depicting convolution performed using both of the approaches presented in FIGS. 4a-f and FIGS. 5a-f, wherein FIG. 6a shows a first trace representing the use of conventional convolution coefficients and a second trace representing the use of derivation convolution coefficients, FIG. 6b shows a first trace representing input data upon which convolution is performed and a second trace representing the use of derivation signal data, and FIG. 6c shows a single trace representing the results of the approaches discussed.
- FIG. 7 is a listing of code suitable for use in a direct filter.
- FIGS. 8a-b are a listing of code suitable for use in a derivation filter, wherein
- FIG. 8a shows the code performing the functions that are conceptually similar to those in FIG. 7 and FIG. 8b shows additional code used by the derivation-based algorithm.
- a preferred embodiment of the present invention is a system for convolution calculation performed with multiple computer processors. As illustrated in the various drawings herein, and particularly in the view of FIG. 1, preferred embodiments of the invention are depicted by the general reference character 10.
- the invention is an improved convolution system 10 for numerically approximating the solution to a convolution of a data function with a filtering function. Performing convolution calculations using numerical techniques inherently tends to involve large numbers of multiplication and addition operations.
- the present invention permits substantially reducing the overall time needed to perform such calculations in two particular manners. First, the invention permits completing large portions of the necessary calculations in parallel, rather than serially. Second, the invention permits employing a new class of algorithms which uses filter values and data values that can be expressed with fewer data bits, and thus which can be performed faster in view of the inherent limitations in processors.
- FIG. 1 is a diagrammatic view depicting the inventive convolution system 10 being employed in an array 12 of computer processors 14.
- external elements that support the array 12 have been omitted or represented generically. Those skilled in the art will appreciate, however, that such elements will be present in actual operating embodiments and that they typically can be conventional in nature.
- FIG. 1 omits all details related to powering the array 12 and includes generic forms of an external input device 16, an input bus 18, an output bus 20, and an external output device 22.
- general computational initialization and termination matters are also not initially discussed, and program instructions and convolution coefficient values are treated as being already loaded into the processors 14.
- the input device 16 here is considered only with respect to providing the input data values upon which convolution will be performed and the output device 22 here is considered only with respect to receiving the output data values upon which convolution will have been performed.
- FIG. 1 also stylistically shows a flow path 24. It should be appreciated, however, that other arrangements than this are easily possible. For instance, other starting and stopping positions are possible, and different paths than the depicted flow path 24 are also possible.
- the inventor's presently preferred hardware platform for the convolution system 10 is to have the array 12 of processors 14 in a single semiconductor die 26, such as the SEAforth-24A or the SEAforth-40A devices, by IntellaSys Corporation of Cupertino, California.
- the SEAforth-24A is used herein in most examples (and the processors 14 in these examples can properly be termed "cores" or "nodes").
- the members of the collective set of processors 14 are individually referenced as processors 14a-x, as shown, and each has busses 28 that permit intercommunication with other processors 14 that are present.
- while each processor 14 shown in FIG. 1 has busses 28 interconnecting it to all of its adjacent processors 14, from the route of the flow path 24 it can be seen that not all of the busses 28 are necessarily used.
- the embodiment of the convolution system 10 shown here could alternately be embodied in a single dimensional array of serially communicating processors (which we term a "pipeline" of processors).
- FIG. 2 is a diagrammatic view of the major internal features of one of the processors 14 in FIG. 1, i.e., of a SEAforth-24A processor core.
- Each of the processors 14a-x is generally an independently functioning computer, including an arithmetic logic unit (ALU 30), a quantity of read only memory (ROM 32), a quantity of random access memory (RAM 34), an instruction decode logic section 36, an instruction area 38, a data stack 40, and a return stack 42.
- each processor 14 also includes an 18-bit "A" register (A-register 44), a 9-bit "B" register (B-register 46), a 9-bit program counter register (P-register 48), and an 18-bit I/O control and status register (IOCS-register 50).
- each processor 14 further includes four communications ports (collectively ports 52, individually ports 52a-d). Except for edge and corner node cases, the ports 52 each connect to a respective bus 28 (and have 18 data lines, a read line, and a write line, not shown individually).
- the nodes in the SEAforth-24A device handle both communications and processing asynchronously, in particularly elegant and efficient manners, making this device highly suitable for use with embodiments of the inventive convolution system 10.
- FIGS. 3a-c are partial views of FIG. 1 depicting inbound, outbound, and internal communications using the processors 14.
- FIG. 3a shows how data is passed between the input device 16 and processor 14a, and between processor 14a and processor 14b.
- FIG. 3b shows how data is passed between processor 14w and processor 14x, and between processor 14x and the output device 22.
- FIG. 3c shows how data is passed between processor 14i and processor 14j.
- Each processor 14 in FIGS. 3a-c is represented as having generic key information holding elements.
- the SEAforth-24A device features RAM, ROM, registers, and ports which can all be used in programmatically performing calculations.
- the generic information holding elements which are about to be discussed, can be any of RAM, ROM, registers, and ports.
- in processor 14a, a signal data element 60 is the important information holding element.
- in processors 14b-w, one each of the signal data elements 60, the integral kernel filter elements 62, and the calculated elements 64 are the respective important information holding elements.
- in processor 14x, a result element 66 is the important information holding element.
- FIG. 3a shows how data can be entered into the array 12.
- processor 14a is dedicated to receiving data from the input device 16 and providing it to processor 14b. As such, processor 14a can receive and store data words from the input device 16, subject only to limitations by the capacity of its RAM 34 and whether it has been suitably programmed.
- Both communications and processing in the SEAforth-24A device are asynchronous so, once the processor 14a makes data available to processor 14b, the processing of the task at hand can, conceptually, "flow" through the rest of the array 12.
- FIG. 3b shows how data can be extracted from the array 12.
- the processor 14x here is dedicated to receiving data from processor 14w and providing it to the output device 22.
- processor 14x receives and stores data words from processor 14w, and then uses its result element 66 to provide data words to the output device 22, again all subject only to limitations by the capacity of its RAM 34 and whether it has been suitably programmed.
- FIG. 3c shows how the contents of the signal data elements 60 and the calculated elements 64 generally flow between the processors 14b-w, and also how the sum can be stored in each processor 14 as an accumulation and then passed all at once in the course of a convolution calculation. As described in detail presently, each of the processors 14b-w here can be performing an operation contributing to the overall calculation.
- processor 14b this operation uses a new input data value (in its signal data element 60) and a pre-stored convolution coefficient value (in its integral kernel filter element 62).
- processor 14b does not need a calculated element 64, since there is nothing "partial" yet from an earlier calculation stage. For program simplicity, however, processor 14b can have a calculated element 64 loaded with zero. Also, for applications where multiple convolution coefficients per node are processed (discussed presently), processor 14b may then have and use a calculated element 64.
- processor 14c-w each performs an operation contributing to the overall convolution calculation by using a pre-stored convolution coefficient value, an input data value which has come from its respective preceding processor 14 along the flow path 24, and an intermediate value which is also from the preceding processor 14.
- the convolution coefficient value is held in a respective integral kernel filter element 62, the input data value is held temporarily in a respective signal data element 60, and the intermediate value is held temporarily in a respective calculated element 64.
- FIG. 2 and FIGS. 3a-c can be used to see more generally how the ports and registers of the processors 14a-x in a SEAforth-24A device can be used as just described.
- processor 14a uses its port 52d to pass an input data value rightward to processor 14b, which can put it into its data stack 40.
- processor 14b will read the input data value that has arrived at its port 52c and put it onto its data stack 40. Then processor 14b performs an operation contributing to the convolution, using the input data value now in its data stack 40 and the convolution coefficient value already in its data stack 40.
- processor 14b then puts the intermediate data value from this at its port 52d.
- Similar operations can then occur along the flow path 24 in the processors 14b-w.
- the nodes in the SEAforth-24A device operate asynchronously, the operations in the processors 14b-w here can all conceptually be viewed as taking place in parallel. Accordingly, essentially contemporaneously with what has just been described for processor 14b, similar operations can be taking place in processors 14i and 14j, for example, only these will be using respective convolution coefficient values, processing intermediate data values, and using their ports 52 along the flow path 24. Also essentially contemporaneously, processor 14w will make available at its port 52c an output data value for processor 14x to handle as described above.
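The asynchronous, port-to-port style of communication can be modeled loosely with blocking queues (a toy Python sketch, not IntellaSys code; this single pass sends one sample through every node, so it illustrates only the communication pattern, not the full convolution schedule of FIGS. 4a-f):

```python
# A toy model (ours, not IntellaSys code) of asynchronous port-to-port flow:
# each node is a thread that blocks on its inbound queue, performs one
# multiply-accumulate, and writes to its outbound queue.

import queue
import threading

def node(coefficient, inbound, outbound):
    """Read (data, partial) pairs until a None termination marker arrives."""
    while True:
        msg = inbound.get()        # blocks, like a read on a SEAforth port
        if msg is None:
            outbound.put(None)     # pass the termination marker downstream
            return
        data, partial = msg
        outbound.put((data, partial + coefficient * data))

coefficients = [2.0, -1.0, 3.0]
links = [queue.Queue() for _ in range(len(coefficients) + 1)]
threads = [threading.Thread(target=node, args=(c, links[i], links[i + 1]))
           for i, c in enumerate(coefficients)]
for t in threads:
    t.start()

links[0].put((5.0, 0.0))           # one data sample, zero initial partial sum
links[0].put(None)
data, total = links[-1].get()
# one sample visits every node, so the sum is the sample times sum(coefficients)
assert (data, total) == (5.0, 5.0 * (2.0 - 1.0 + 3.0))
for t in threads:
    t.join()
```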
- FIGS. 4a-f are block diagrams schematically depicting convolution calculation stages in an array 12 of processors 14 such as that of FIG. 1. Generally, the stages here entail: (1) multiplying data sample values and convolution coefficient values in parallel;
- FIG. 4a shows a stage at which formal calculation is ready to commence.
- convolution coefficient values have been loaded into bins (c-bins 72 collectively and c-bins 72(0..n) specifically) and zeros have been loaded into other bins (r-bins 76 collectively and r-bins 76(0..2n-1) specifically).
- FIG. 4b shows a next stage at which a first data sample value (d0) has been received into d-bin 74(0). Calculation proceeds as shown, essentially contemporaneously and in parallel throughout the length of the pipeline, with a first result value (r0) being stored in r-bin 76(0).
- FIG. 4c shows a next stage at which the prior data sample value (d0) has been moved to d-bin 74(1) and a second data sample value (d1) has been received into d-bin 74(0).
- between FIG. 4c and FIG. 4d are n-2 stages that are conceptually much like the stage just described.
- FIG. 4d shows a stage at which the last data sample value (dn) has been received into d-bin 74(0). Again calculation proceeds as shown, with a result value (rn) now being stored in r-bin 76(n).
- FIG. 4e shows a next stage. Having now partially processed all of the n+1 data sample values (d0 ... dn), the last data sample value (dn) has been moved into d-bin 74(1) and a zero value is put into d-bin 74(0).
- FIG. 4f shows a stage at which the last data sample value (d n ) is finally finishing being processed.
- an (n+n-1)th result value is stored in r-bin 76(2n-1) and processing is complete.
- the r-bins 76(0..2n-1) now hold the full result of the convolution calculation performed here based on the n+1 data sample values (d0 ... dn) and the n+1 convolution coefficient values (c0 ... cn).
- FIGS. 5a-f are block diagrams schematically depicting convolution calculation stages based on a new algorithm, again presented in an array 12 of processors 14 such as that of FIG. 1.
- the new algorithm employs a derivation of the filtering function.
- FIG. 5a shows a stage at which formal calculation is ready to commence.
- derivation convolution coefficient values (c'0 ... c'm) have been loaded into bins (c'-bins 82 collectively and c'-bins 82(0..m) specifically), zeros have been loaded into other bins (d-bins 84 collectively and d-bins 84(0..m) specifically), and a single p-bin 86 and a set of result bins (r-bins 88 collectively and r-bins 88(0..2m-1) specifically) have contents which are initially unimportant.
- FIG. 5b shows a next stage at which a first data sample value (d0) has been received into d-bin 84(0). Calculation proceeds as shown, essentially contemporaneously and in parallel throughout the length of the pipeline, with a first part value (p0) being provided to the p-bin 86.
- a first data sample value (d0) is again received into d-bin 84(0).
- FIG. 5c shows a next stage at which the prior data sample value (d0) has been moved to d-bin 84(1) and a second data sample value (d1) has been received into d-bin 84(0).
- FIG. 5d shows a stage at which the last data sample value (dm) has been received into d-bin 84(0). Yet again, calculation proceeds as shown, with an mth result value (rm) now being stored in r-bin 88(m).
- FIG. 5e shows a next stage. Having now partially processed all of the m+1 data sample values (d0 ... dm), the last data sample value (dm) has been moved to d-bin 84(1) and a zero value is put into d-bin 84(0). Calculation proceeds and a result value (rm+1) is stored in r-bin 88(m+1).
- between FIG. 5e and FIG. 5f are another m-2 stages that are conceptually much like that just described.
- FIG. 5f shows a stage at which the last data sample value (dm) is finally finishing being processed.
- FIGS. 6a-c are graphs depicting convolution performed using both of the approaches presented above.
- FIG. 6a shows a first trace 92 representing the use of conventional convolution coefficients and a second trace 92' representing the use of derivation convolution coefficients, that is, ones usable by the new class of algorithms and which may be used in the present invention.
- FIG. 6b shows a single trace 94 representing input data upon which convolution is performed using both of the approaches so far presented (the other trace 94' here is discussed presently).
- FIG. 6c shows a single trace 96 representing the results of both of the so far discussed approaches.
- the trace 92 is represented by the equation:
- the trace 92' is represented by u '(t).
- the trace 94 is represented by the equation:
- FIGS. 6a-c show how the very same result can be achieved using either of the approaches so far discussed, that result being shown as trace 96 here.
- trace 92 and trace 94 have large amplitude ranges while trace 92' and trace 94' have markedly smaller amplitude ranges.
- while the SEAforth-24A device is actually quite outstanding among many other suitable candidates, we can and do continue using it here to illustrate points about how the inventive convolution system 10 can help overcome some of the inherent limitations in modern digital processors, generally.
- the convolution filter values in trace 92 might have to be expressed using 18-bit values, whereas the values used for the inventive approach in trace 92' might be expressed using 9-bit or even fewer-bit values.
- the data sample values used in trace 94 might have to be expressed using 18-bit values, whereas the values used for the inventive approach in trace 94' might be expressed using 9-bit or even fewer-bit values. It has been the present inventor's observation that using all 9-bit values can provide roughly a four fold (4X) speed increase in the inventive convolution system 10.
- the SEAforth-24A device is no exception to these general principles of digital processors. It employs the Forth language and relies on numeric values being represented with 18 bits for unsigned values (or 17 bits for signed values), or as being represented with 9 bits for unsigned values (or 8 bits for signed values). If a value requires 10 bits, for example, it effectively therefore must be treated the same as if it requires 18 bits. With reference briefly back to FIG. 2 it can be seen again that each processor 14 in the SEAforth-24A device has one 18-bit A-register 44, one 9-bit B-register 46, 18-bit wide words in ROM 32 and RAM 34, and 18-bit wide ports 52.
- the derivation approximation is an appropriate method; stated otherwise, the difference between successive direct coefficient values must be less than 256 units.
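A quick way to check whether a coefficient set qualifies (a hypothetical example with made-up values; the 255-unit limit follows from the "less than 256 units" condition above):

```python
# A hypothetical qualification check (coefficient values are made up): the
# direct coefficients exceed the 9-bit unsigned range, but every successive
# difference stays within the "less than 256 units" condition, so the
# derivation approach applies.

def fits_derivation(values, limit=255):
    """True when every successive difference is at most `limit` units."""
    return all(abs(b - a) <= limit for a, b in zip(values, values[1:]))

direct = [0, 200, 450, 680, 830, 900, 830, 680, 450, 200, 0]
assert max(direct) > 511          # too large for a 9-bit unsigned value
assert fits_derivation(direct)    # but the differences are small enough
```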
- FIG. 7 is a code listing of an exemplary direct filter 700.
- the programming language used here is Forth and the target hardware processor 14 is a SEAforth 24A device.
- An item 702 is a compiler instruction equating "IO" with the IOCS-register 50.
- An item 704 is a compiler instruction equating "H" with a coefficient value.
- An item 708 is a location label in the Forth language.
- An item 710 is a sequence of Forth instructions that initializes the processor 14 for the convolution calculation to follow. Specifically, IO is loaded into the top of the data stack; then popped from there into the B-register 46 so that it points to the IOCS-register 50; then H is loaded into the top of the data stack; and a nop pads the 18-bit instruction word used here to contain this instruction sequence.
- An item 712 specifies the beginning of a loop within which three cases are dealt with by conditional compilation. This programs the processor 14 depending upon whether is to be the first (processor 14b), a middle (any of processors 14c-v), or the last (processor 14w) in the pipeline. See also, FIGS. 3a-c.
- An item 714 here specifies the start of compilation of instructions for the most common of the three cases, a middle processor in the pipeline.
- a data sample value (s) is read from where the B-register 46 points and is pushed onto the data stack (h -- s h); then an accumulation value (a) is also read and pushed onto the data stack (s h -- a s h); then the top element on the data stack is popped and pushed onto the top of the return stack (D: a s h -- s h R: -- a); then the top element on the data stack is replicated and pushed onto the data stack (D: s h -- s s h R: a -- a).
- An item 718 follows, where the top element on the data stack is popped and pushed onto the return stack (D: s s h -- s h R: a -- s a).
- a large multiply (“MULT,” a definition provided in the BIOS of the SEAforth 24A device) is performed.
- the top two elements of the data stack are used here as a multiplier and a multiplicand, and the top element is replaced with the result (a') while the multiplicand is left as-is as the second element in the data stack (D: s h -- a' h R: s a -- s a).
- An item 722 follows, where the top two elements of the data stack are added together, the top is replaced with the new accumulated sum (a") and the second is replaced with the next lower element (D: a a' h -- a" h R: -- ).
- An item 724 follows, ending the conditional compilation of the code for the case here where the subject processor 14 is one of processors 14c-v.
- the case for processor 14b is simpler, because there is no "prior" accumulate value to be read and added.
- the case for processor 14w is also somewhat simpler, because the current data sample value does not need to be written to a "subsequent" processor.
- an item 726 is an instruction sequence that is compiled for all of processors 14b-w.
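The stack choreography of items 710 through 722 can be modeled in Python (our model, not the patent's code; the order in which the parked values return from the return stack between the multiply and the add is our inference from the stack-effect comments):

```python
# Our Python model of the stack choreography in items 710-722 for a middle
# processor; the order in which parked values return from the return stack is
# inferred from the stack-effect comments, not stated explicitly above.

def middle_processor_step(h, s, a):
    """h: stored coefficient; s: data sample read from the previous node;
    a: accumulated value read from the previous node.
    Returns (s, a2): the sample passed downstream and the new accumulation."""
    D = [h]                       # item 710: coefficient on the data stack
    D = [s] + D                   # item 716: read s       (h -- s h)
    D = [a] + D                   #   read a               (s h -- a s h)
    R = [D.pop(0)]                #   a to return stack    (R: -- a)
    D = [D[0]] + D                #   duplicate s          (s h -- s s h)
    R = [D.pop(0)] + R            # item 718: s to R       (R: a -- s a)
    D = [D[0] * D[1], D[1]]       # MULT: a' = s * h       (s h -- a' h)
    s_out = R.pop(0)              # parked s is written downstream
    D = [R.pop(0)] + D            # prior a back on top    (a' h -- a a' h)
    D = [D[0] + D[1], D[2]]       # item 722: a'' = a + a' (a a' h -- a'' h)
    return s_out, D[0]

assert middle_processor_step(h=3.0, s=2.0, a=10.0) == (2.0, 16.0)
```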
- FIGS. 8a-b are a code listing of an exemplary derivation filter 800, wherein FIG. 8a shows the code performing the functions that are conceptually similar (other than 9-bit versus 18-bit operation) to those in FIG. 7, and FIG. 8b shows additional code used by the derivation-based algorithm.
- As can be seen in FIG. 8a, much of the code here is essentially the same as in the direct filter 700, discussed above. One exception, however, is an item 802, where nine plus-star ("+*") operations are used to perform a small multiply (instead of a large multiply performed using the MULT definition).
- An item 804 is a sequence of Forth instructions that initializes a portion value (p) in the return stack.
- the current portion value (p) is popped off of the return stack and pushed onto the data stack.
- An item 808 then retains the accumulated sum (a") in the return stack as a next portion value (p').
- the accumulated sum (a") is replicated (D: a" h -- p' a" h R: -- ); the next portion value (p') is popped off of the data stack and pushed onto the return stack (D: a" h -- a" h R: -- p'); and two nops pad out the 18-bit instruction word.
- FIG. 8b shows additional code used for the "integrator" step.
- An item 810 is a comment in the Forth language and an item 812 is a location label in the Forth language.
- This code could be conditionally compiled, by adding appropriate compiler directives to the code in FIG. 8a, or it can be separately compiled.
- An item 814 is a sequence of Forth instructions that, first, fetches the value of IO into the B-register 46 and, second, fetches the value $3F (a port address) into the A-register 44.
- An item 816 is another sequence of Forth instructions, specifically one that zeros the data stack. The top element on the data stack is replicated and pushed onto the data stack, and then this is done again (it is irrelevant what that top element is). Then the two topmost elements are popped off of the data stack, XOR'ed, and the result (zero) pushed back onto the data stack.
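The zeroing idiom in item 816 works because x XOR x is zero for any x, so a zero can be produced without fetching a literal. A Python model of the stack effect (list with the top at the end; names are illustrative):

```python
def push_zero(stack):
    """Item 816 analogue: duplicate the top element twice, then pop the
    two copies and XOR them; since x ^ x == 0 for any x, a guaranteed
    zero lands on the stack whatever the original top element was."""
    stack.append(stack[-1])   # dup
    stack.append(stack[-1])   # dup again
    a = stack.pop()
    b = stack.pop()
    stack.append(a ^ b)       # a == b, so the result is always 0

# Demo: whatever value is on top, a zero ends up above it.
s = [7]
push_zero(s)                  # s is now [7, 0]
```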
- An item 818 specifies the beginning of a loop.
- An item 820 is another sequence of Forth instructions. Specifically, a value is read from where the B-register 46 points to and is pushed onto the data stack; then the top two elements on the data stack are added and replace the top element (and the second element is replaced with the next lower element); then the top element on the data stack is replicated and pushed onto the data stack; and then the top element is popped off of the data stack and written to where the A-register 44 points. The net result of this is that the sum is output while a copy is also kept (accumulated) for the next execution of the loop.
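The loop in item 820 is a running accumulator: read, add, duplicate, write. In Python, with the B- and A-register ports replaced by input and output lists purely for illustration:

```python
def integrator(inputs):
    """Item 820 analogue: each pass reads one value (from where the
    B-register points, in the patent), adds it to the running sum,
    writes the sum out (to where the A-register points), and keeps a
    copy of the sum for the next pass of the loop."""
    acc = 0
    outputs = []
    for v in inputs:           # read via the B-register port
        acc += v               # add to the kept copy of the sum
        outputs.append(acc)    # write via the A-register port; acc is kept
    return outputs

# integrator([1, 2, 3]) returns [1, 3, 6]
```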
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2010502137A JP2010524080A (ja) | 2007-04-06 | 2008-04-04 | Convolution calculation system using a plurality of computer processors
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US91062607P | 2007-04-06 | 2007-04-06 | |
US60/910,626 | 2007-04-06 | ||
US11/854,215 | 2007-09-12 | ||
US11/854,215 US20080250092A1 (en) | 2007-04-06 | 2007-09-12 | System for convolution calculation with multiple computer processors |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2008124061A1 (fr) | 2008-10-16 |
Family
ID=39827918
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2008/004402 WO2008124061A1 (fr) | 2007-04-06 | 2008-04-04 | System for convolution calculation with multiple computer processors
Country Status (4)
Country | Link |
---|---|
JP (1) | JP2009010925A (fr) |
CN (1) | CN101652770A (fr) |
TW (1) | TW200842699A (fr) |
WO (1) | WO2008124061A1 (fr) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10122346B2 (en) * | 2017-03-03 | 2018-11-06 | Synaptics Incorporated | Coefficient generation for digital filters |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030113031A1 (en) * | 1997-04-15 | 2003-06-19 | Wal Gooitzen Siemen Van Der | Parallel pipeline image processing system |
US20070041438A1 (en) * | 2004-01-30 | 2007-02-22 | Sony Corporation | Sampling rate conversion device and method, and audio device |
US20070070079A1 (en) * | 2002-01-17 | 2007-03-29 | University Of Washington | Programmable 3d graphics pipeline for multimedia applications |
2008
- 2008-03-27 TW TW97110966A patent/TW200842699A/zh unknown
- 2008-04-03 JP JP2008097204A patent/JP2009010925A/ja active Pending
- 2008-04-04 CN CN200880011407A patent/CN101652770A/zh active Pending
- 2008-04-04 WO PCT/US2008/004402 patent/WO2008124061A1/fr active Application Filing
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2564285A (en) * | 2016-07-01 | 2019-01-09 | Google Llc | Convolutional neural network on programmable two dimensional image processor |
GB2564285B (en) * | 2016-07-01 | 2019-07-17 | Google Llc | Processor and computer program product configured to cause performance of a two-dimensional convolution |
US10546211B2 (en) | 2016-07-01 | 2020-01-28 | Google Llc | Convolutional neural network on programmable two dimensional image processor |
US10789505B2 (en) | 2016-07-01 | 2020-09-29 | Google Llc | Convolutional neural network on programmable two dimensional image processor |
US12020027B2 (en) | 2016-07-01 | 2024-06-25 | Google Llc | Convolutional neural network on programmable two dimensional image processor |
Also Published As
Publication number | Publication date |
---|---|
TW200842699A (en) | 2008-11-01 |
CN101652770A (zh) | 2010-02-17 |
JP2009010925A (ja) | 2009-01-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
USRE46712E1 (en) | Data processing device and method of computing the cosine transform of a matrix | |
- JP3750820B2 (ja) | Apparatus for performing multiply-add operations on packed data | |
US7814297B2 (en) | Algebraic single instruction multiple data processing | |
- CN103092562B (zh) | Apparatus for controlling bit correction of shifted packed data | |
- CN111465924A (zh) | System and method for converting matrix input into vectorized input for a matrix processor | |
US20040122887A1 (en) | Efficient multiplication of small matrices using SIMD registers | |
- WO2015114305A1 (fr) | Data processing apparatus and method for performing a vector scan instruction | |
- CN100447777C (zh) | Processor | |
JPS6125188B2 (fr) | ||
- JPH11203272A (ja) | Fast Fourier transform processing apparatus, fast Fourier transform processing system, and fast Fourier transform processing method | |
US6675286B1 (en) | Multimedia instruction set for wide data paths | |
- EP3655851A1 (fr) | Register-based complex number processing | |
Bowman et al. | Efficient dealiased convolutions without padding | |
- EP1978450A2 (fr) | Signal processing | |
- JP3092534B2 (ja) | Block IIR processor | |
- WO2008124061A1 (fr) | System for convolution calculation with multiple computer processors | |
US7774399B2 (en) | Shift-add based parallel multiplication | |
- JP3709291B2 (ja) | Fast complex Fourier transform method and apparatus | |
- WO2024195694A1 (fr) | Processor device and calculation method | |
Kim et al. | AMD's 3DNow!/sup TM/vectorization for signal processing applications | |
- GR20170200090U (el) | Element-by-vector operations in a data processing apparatus | |
CA2339919A1 (fr) | Methode et systeme de traitement de matrices a nombres complexes | |
- GR20170200089U (el) | Multiply-accumulate in a data processing apparatus | |
- KR20080106754A (ko) | Computation method and computation apparatus of a microprocessor for image filtering | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WWE | Wipo information: entry into national phase |
Ref document number: 200880011407.7 Country of ref document: CN |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 08727290 Country of ref document: EP Kind code of ref document: A1 |
|
DPE1 | Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101) | ||
ENP | Entry into the national phase |
Ref document number: 2010502137 Country of ref document: JP Kind code of ref document: A |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 08727290 Country of ref document: EP Kind code of ref document: A1 |