CN110162742B - Floating point operation circuit implementation method for real matrix inversion - Google Patents
Floating point operation circuit implementation method for real matrix inversion
- Publication number
- CN110162742B (application CN201910254674.7A)
- Authority
- CN
- China
- Prior art keywords
- row
- matrix
- input
- data
- unit
- Prior art date
- Legal status
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/57—Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Pure & Applied Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Mathematical Physics (AREA)
- Computational Mathematics (AREA)
- Computing Systems (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Algebra (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Complex Calculations (AREA)
Abstract
The invention discloses a method for implementing a floating-point arithmetic circuit for inverting a real matrix, and aims to provide a square-matrix inversion method that is easy to implement in HDL and has short latency. The invention is realized by the following technical scheme: an N×N square matrix awaiting inversion is input, one row (or column) per beat, through a three-to-one multiplexer and stored sequentially in an internal RAM buffer; the matrix is then transformed row by row through N cyclic transformation operations using the in-place Gauss-Jordan elimination algorithm. During each cyclic transformation operation, a one-to-three demultiplexer reads the matrix row data of the k-th row from the RAM buffer unit; traversing all the row data of the matrix once constitutes one elimination cycle, and n such cycles are executed in total. Finally the resulting inverse matrix is output row by row. Time-division multiplexing is used to allocate the multipliers' operation time sensibly across the different stages of the computation, saving hardware resources and improving resource-utilization efficiency, while a local pipelining technique raises the data throughput and operation speed.
Description
Technical Field
The invention relates to a method for implementing an arithmetic circuit for real matrix inversion based on a hardware description language (Hardware Description Language, HDL for short), and in particular to a technique for implementing a floating-point arithmetic circuit structure for real matrix inversion.
Background
Matrix operations are a fundamental problem in scientific and engineering computation. They appear in a wide range of circuit computations and are applied deeply in DCT, LDPC encoding and decoding, and the adaptive array antenna technologies that have emerged in recent years, where they are used to solve key problems. For example, in algorithms such as array signal processing and radar imaging, computing weight coefficients involves a large amount of matrix inversion; because of the great complexity of the algorithms, in many cases only approximate weight coefficients can be obtained with least-mean-square or least-squares algorithms, sacrificing operation precision and convergence speed. Matrix inversion is one of the most important and difficult operations among matrix computations, and conventional matrix inversion algorithms are mostly implemented as serial computation in software on a processor. Computing software such as Matlab can invert real or complex matrices of arbitrary order with high precision, but its practical performance is not ideal. Because such computing software cannot be embedded in electronic systems with high real-time requirements, the inability to compute matrix inverses inside an electronic system, or to meet real-time constraints, has become an increasingly obvious shortcoming. With advances in integrated-circuit manufacturing processes, building multiprocessor parallel systems from large numbers of VLSI units and microprocessors has become an effective means of increasing computation speed. To accelerate matrix inversion, a hardware implementation is generally adopted; hardware implementation of matrix operations can fully exploit the speed and parallelism of the hardware. Much research has been devoted to this, in particular circuit structures built on systolic arrays. For example, matrix inversion has been implemented with QR decomposition; the proposed circuit structure is very compact and its multiplexing rate is very high, but the operations to be completed by the basic processing element are very complex, involving not only complex multiplication and division but also square-root operations, which increases both the hardware cost and the difficulty of the hardware implementation. The complexity of the operations in the basic processing element inevitably affects the clock frequency, so the actual operation time of the whole inversion is lengthened; the implementation complexity and computation load make the period required by the inversion relatively long, real-time requirements are hard to meet when the matrix order is large, and the improvement of computation speed is severely restricted. Matrix inversion implemented by programming in the HDL languages Verilog or VHDL is usually performed with fixed-point arithmetic, whose precision is lower than that of conventional floating-point matrix inversion; the shifting and truncation of fixed-point numbers must be considered carefully, the design difficulty is higher, and the range of application is not as wide as that of floating-point matrix inversion.
Although the latest versions of these languages include floating-point definitions, those definitions are not required to be synthesizable, so the languages are not well suited to supporting floating-point designs; and the conventional approach is very inefficient at integrating floating-point datapaths into FPGAs.
Floating-point numbers are used in computers to approximate arbitrary real numbers: an integer or fixed-point number (the mantissa) is multiplied by an integer power of some radix (usually 2 in a computer). A floating-point number a is represented by two numbers m and e as a = m × b^e (b raised to the power e). Floating-point computation refers to operations in which floating-point numbers participate; such operations are usually accompanied by approximation or rounding because the representation is inexact. Mathematically, matrix inversion is defined as follows: for an n×n nonsingular matrix A = [a_ij], its inverse is the n×n matrix A^{-1} satisfying A^{-1}A = AA^{-1} = I, where I is the identity matrix. There are many mathematical methods for matrix inversion, such as the adjugate matrix method, the elementary transformation method, the block matrix method and Gaussian elimination; most of them suffer from a complex calculation process and large storage-resource requirements, and are not suitable for hardware implementation. In particular, neither the adjugate matrix method nor the elementary transformation method is suitable for programming: the adjugate matrix method involves a large amount of computation, and the elementary transformation method is difficult to implement in a program. The time complexity of the adjugate matrix method derives mainly from computing determinants, which is O(n^2); since the whole algorithm must compute the algebraic cofactor of every element, the time complexity is directly enlarged by a factor of n^2 to O(n^4). LU decomposition: this method is mainly time-consuming in the decomposition step; solving the triangular systems has time complexity O(n^2) and the decomposition itself O(n^3). Overall the method is similar to Gaussian elimination, but it avoids the situation in which the pivot of Gaussian elimination is 0. To save space, the factors of the decomposition A = LU are stored in the storage of A itself. In addition, when the matrix dimension exceeds a certain value, the memory becomes insufficient. It follows that the commonly used matrix inversion methods are not easy to implement in engineering, and the computing software is difficult to fit into an electronic system.
The data to be processed in present-day scientific and engineering computation are mostly floating-point numbers, and the matrix elements must be represented as real numbers, so a circuit structure for floating-point operations has to be designed. At present, methods based on matrix decomposition are mainly used on hardware platforms, and they fall chiefly into three types: LU decomposition, QR decomposition and Cholesky decomposition.
LU decomposition factors the matrix into an upper triangular matrix and a lower triangular matrix. The existing inversion method based on LU decomposition must decompose the input square matrix into the two triangular matrices L and U, compute L^{-1} and U^{-1} separately using a triangular-matrix inversion algorithm, and finally compute U^{-1} × L^{-1} to obtain the inverse of the input square matrix; this requires an LU decomposition, two triangular-matrix inversions and a matrix multiplication, i.e., three operations in total. When implementing the algorithm in HDL, the three steps (LU decomposition, inversion of the L and U matrices, and U^{-1} × L^{-1}) must all be performed. If multipliers and adders are allocated independently for the multiply and add operations executed in each step, the utilization efficiency of hardware resources is greatly reduced; if time-division multiplexing is adopted instead, the many multiplexed ports greatly increase the circuit design difficulty, and the resulting timing-closure problems are even harder to solve. In addition, when inverting the two triangular matrices L and U, the existing approach generally transplants a software triangular-matrix inversion algorithm directly, in which the input of the next beat of computation depends on the result of the previous beat. Pipelined design can be guaranteed only when the latency of the basic multiply, divide and add/subtract operators is a single clock; if the latency exceeds one clock (which is common for floating-point operators), the triangular-matrix inversion circuit degrades into a non-pipelined design, reducing the data throughput and the speed of the inversion. Even if extra hardware resources are added to force the triangular-matrix inversion to pipeline, the different numbers of elements in each row of a triangular matrix leave some operators idle on some clocks, lowering the utilization efficiency of hardware resources. Based on this analysis, the LU-decomposition inversion method is only suitable for fixed-point arithmetic whose latency does not exceed one clock, and the use of fixed-point numbers forces the designer to carefully choose the shift and truncation bit widths of every operation, further increasing the design difficulty. Finally, because the L and U matrices must be generated separately, inverted separately and then multiplied, memory several times the size of the input matrix must additionally be allocated to store these intermediate matrices, greatly increasing the consumption of memory resources. The singular value decomposition method factors the input matrix into the product A = U S V^T of the orthogonal matrices U, V and the diagonal matrix S, whence the inverse is A^{-1} = V S^{-1} U^T. The elementary row-transformation method applies the same elementary row transformations simultaneously to the input square matrix and to an accompanying identity matrix; when the input square matrix has been transformed into the identity matrix, the accompanying matrix is its inverse.
The QR decomposition method gradually reduces the input matrix, using orthogonal similarity transformations, into the product of an orthogonal matrix Q and an upper triangular (or quasi-upper-triangular) matrix R. Since the orthogonal matrix satisfies Q^{-1} = Q^T, after inverting the upper triangular matrix R one computes R^{-1} Q^T to obtain the inverse of the input matrix. One such QR-decomposition algorithm is a matrix inversion written in ADSP-TS201S DSP assembly language. Although that method overcomes the difficulty and inefficiency of inverting high-order matrices and can invert a complex matrix of arbitrary order quickly, the algorithm involves complex transformations such as the Householder or Givens transformations as well as row transformations of the matrix; the calculation process is too complex and is not suitable for hardware implementation.
An algorithm often considered ideal for high-performance computing is Cholesky decomposition. It is frequently used in linear algebra, can solve many systems of equations efficiently, and can also implement matrix inversion. The algorithm is quite complex and generally requires a floating-point representation to obtain a reasonable result. Its computational requirements grow with the matrix dimension N, so the processing demands are usually high; the actual GFLOP/s depends on the matrix size and on the required matrix-processing throughput. Cholesky decomposition itself is comparatively simple, but it applies only to real symmetric positive-definite matrices, so its range of application is too narrow, and its square-root operations consume considerable hardware resources. The applicability conditions of LU decomposition are easy to satisfy and its computational complexity is moderate, so it is suitable for hardware implementation. All three decomposition methods produce triangular matrices.
The above algorithms are proposed only from the standpoint of mathematical theory: inverting a low-order matrix is easy, but manual calculation becomes infeasible when the order is too high, and the performance and precision needed for practical engineering are difficult to achieve.
The Gaussian elimination adopted by the prior art, commonly known as the addition-subtraction elimination method, is in its complete form mathematically known as Gauss-Jordan elimination. It is an algorithm in linear algebra for solving systems of linear equations, for computing the rank of a matrix, and for computing the inverse of an invertible square matrix. When used to invert a matrix, Gaussian elimination produces a row-reduced echelon form. The calculation process of the Gauss-Jordan matrix inversion algorithm is as follows: for an n×n nonsingular matrix A,
let A^(0) = A. Starting from A^(0), elementary row (or column) transformations are used to obtain a sequence of square matrices {A^(k)}, k = 1, 2, ..., n. Each transformation from A^(k-1) to A^(k) performs the following two-step calculation:
1. Row normalization. For the element in row k, column k:
a_kk^(k) = 1 / a_kk^(k-1);
for the other elements of row k (columns j ≠ k):
a_kj^(k) = a_kj^(k-1) / a_kk^(k-1).
2. Column zeroing. For the elements not in row k or column k (i ≠ k, j ≠ k):
a_ij^(k) = a_ij^(k-1) - a_ik^(k-1) · a_kj^(k);
for the other elements of column k (rows i ≠ k):
a_ik^(k) = -a_ik^(k-1) · a_kk^(k) = -a_ik^(k-1) / a_kk^(k-1).
The matrix A^(n) obtained by the above calculation is A^{-1}.
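For illustration only, the following is a minimal behavioral sketch, in Python rather than the HDL of the invention, of the in-place Gauss-Jordan inversion described by the two steps above; the function name and the singular-flag return convention are assumptions made for this sketch.

```python
def gauss_jordan_inverse_inplace(a):
    """In-place Gauss-Jordan inversion without pivoting.

    `a` is a list of n lists of n floats; after n elimination cycles it
    holds the inverse.  Returns (singular_flag, a).
    """
    n = len(a)
    for k in range(n):                                # one elimination cycle per pivot row
        pivot = a[k][k]
        if pivot == 0.0:                              # the circuit would set the matrix-singular flag here
            return True, a
        # Step 1, row normalization: row k is divided by the pivot,
        # and the pivot position itself becomes 1/pivot.
        for j in range(n):
            a[k][j] = (1.0 if j == k else a[k][j]) / pivot
        # Step 2, column zeroing: eliminate column k from every other row.
        for i in range(n):
            if i == k:
                continue
            factor = a[i][k]                          # old a[i][k], read before it is overwritten
            for j in range(n):
                minuend = 0.0 if j == k else a[i][j]  # the column-k minuend is replaced by 0
                a[i][j] = minuend - factor * a[k][j]
    return False, a
```

For example, gauss_jordan_inverse_inplace([[4.0, 7.0], [2.0, 6.0]]) returns the flag False and approximately [[0.6, -0.7], [-0.2, 0.4]], which is the inverse of the input matrix.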
Among the inversion methods discussed above, all except the elementary row-transformation method involve multiple steps and algorithms with different formulas, so writing and implementing them in HDL is relatively difficult. The Gauss-Jordan elimination method involves only basic division, multiplication and addition operations, its loop contains only two operation steps, it is easy to implement in HDL and, because it can operate in place, it saves considerable memory and multiplier resources compared with the other algorithms.
Floating-point addition, multiply-add and division are the basis of the hardware implementation. The precision of a hardware matrix-operation algorithm is an important factor in its success or failure, and fixed-point numbers very easily produce erroneous results through overflow and truncation during complex operations. A floating-point number system, however, needs dedicated hardware and its operations are complex. Many CPUs provide hardware floating-point support that can be called directly from software, but the drawback is that this cannot meet high-speed computation requirements. Modular matrix operations for floating-point arithmetic in different systems and applications are the foundation of numerous algorithms; matrix inversion in particular is a major difficulty for hardware implementation. In summary, the current LU-decomposition-based inversion method, when implemented in HDL, suffers from complex circuit design, low hardware-resource utilization and excessive computation latency. The computational efficiency after engineering implementation also differs greatly between methods, so the common methods of computing an inverse matrix are not feasible in engineering practice. How the computation time of the inverse matrix can be shortened directly affects the performance of the whole system.
Disclosure of Invention
Aiming at the technical problems of complex circuit design, low hardware-resource utilization efficiency and excessive computation latency in the existing methods, the invention provides a method for implementing a real-matrix-inversion floating-point arithmetic circuit that offers high operation precision, strong reusability, low implementation difficulty, a simpler circuit design and higher hardware-resource utilization.
In order to solve the above technical problems, the invention provides a method for implementing a floating-point arithmetic circuit for real matrix inversion, with the following technical features: in a matrix inversion circuit comprising an inversion-operation state control unit, a three-to-one multiplexer 3MUX1, a one-to-three demultiplexer 1MUX3, a row normalization unit, a column zeroing unit, a RAM buffer unit and a shared multiplier array unit, an N-order square matrix awaiting inversion is input from the input port of the matrix inversion circuit into the multiplexer 3MUX1 in row (or column) order, one beat at a time; the multiplexer 3MUX1 gates among the input matrix row, the "row normalization" output result and the "column zeroing" output result, and the selected row is output to the buffer area of the RAM buffer unit for storage. The demultiplexer 1MUX3 reads the matrix row data of the k-th row from the RAM buffer unit; traversing all the row (or column) data of the matrix once constitutes one elimination cycle, the read matrix row data are divided into two paths, and n such cycles are executed in total. One path is sent to the row normalization unit, which reads the output results of the floating-point multiplier array in the shared multiplier array unit, allocates the multipliers' operation time to the different stages of the computation, and stores and reassembles the "row normalization" result row data according to the positions of the corresponding floating-point-multiplier input data within the row. The other path is sent to the column zeroing unit, in which the matrix row data undergo N cyclic transformation operations row by row according to the Gauss-Jordan elimination method; the column zeroing unit reads the output results from the floating-point subtractor array that follows the output interface of the shared multiplier array unit, and stores and reassembles the "column zeroing" result row data according to the positions of the corresponding floating-point-subtractor input data within the row. The row normalization unit and the column zeroing unit feed their respective processing results back to the multiplexer 3MUX1, which replaces the corresponding input row data in the RAM with the "row normalization" and "column zeroing" result rows, so that an N-order inverse matrix is obtained in the buffer area of the RAM buffer unit. The demultiplexer 1MUX3 reads the data of the resulting inverse matrix from the RAM buffer unit and outputs them in sequence, outputting and holding the matrix-singular flag while the inverse-matrix data are output, then the next cycle begins; finally the resulting inverse matrix is output row by row.
Compared with the prior art, the invention has the following beneficial effects:
Small computation load and small operation latency. The invention adopts a matrix inversion circuit comprising an inversion-operation state control unit, a three-to-one multiplexer 3MUX1, a one-to-three demultiplexer 1MUX3, a row normalization unit, a column zeroing unit, a RAM buffer unit and a shared multiplier array unit; the circuit design is simpler, the utilization of hardware resources is higher, and the computation latency is more reasonable. The N-order square matrix awaiting inversion is input into the multiplexer 3MUX1 from the input port of the matrix inversion circuit in row (or column) order, one beat at a time; the N×N square matrix input row by row is first stored sequentially in the internal RAM buffer, then N cyclic transformation operations are performed on it row by row using the in-place Gauss-Jordan elimination algorithm, and finally the resulting inverse matrix is output row by row. This scheme has high execution efficiency, high speed and good integration. Except that the "column zeroing" step must wait for the "row normalization" step of the Gauss-Jordan method, no row's operation output is required as the operation input of another row, so the local pipelining technique can be used to accelerate the data throughput and thereby markedly reduce the operation delay. This overcomes the drawback of the prior-art triangular-matrix inversion algorithm, in which the next beat's multiplication depends on the previous beat's multiplication result: when applied to floating-point multiplication, which needs several clock cycles to complete, local pipelining cannot be used there, each operation cannot start until the previous one has finished, and a large amount of delay results. Compared with the existing LU-decomposition-based inversion method, which must decompose the input square matrix into the two triangular matrices L and U, compute L^{-1} and U^{-1} separately with a triangular-matrix inversion algorithm, and finally compute U^{-1} × L^{-1} to obtain the inverse of the input square matrix (an LU decomposition, two triangular-matrix inversions and one matrix multiplication, i.e., three operations in total), the invention applies an in-place Gauss-Jordan elimination algorithm to the input square matrix, so the computation load is greatly reduced.
High operation precision. The invention adopts a time-division multiplexing technique, applying the same group of m multipliers to the "row normalization" step and the "column zeroing" step by allocating them in time slices; within each cyclic transformation operation a local pipelining technique is used to raise the data throughput and operation speed, and the time-division multiplexing technique allocates the multipliers' operation time sensibly across the different stages of the computation, saving hardware resources and improving resource-utilization efficiency. The square matrix is processed in floating-point real-number format; relative to fixed-point numbers, floating-point numbers have a natural advantage in operation precision, and multiplier resources can be saved. Compared with square-matrix inversion methods that process fixed-point numbers, the method therefore has natural advantages in precision and resource utilization. Simulation shows that the improved algorithm adapts better and performs better under conditions of a lower snapshot count and a higher signal-to-noise ratio.
Strong reusability and low implementation difficulty. The matrix inversion circuit designed by the invention takes floating-point real numbers as its basic processing objects. Because the basic floating-point operation cores are highly mature, compared with square-matrix inversion methods that process fixed-point numbers, the developer does not need to worry about bit selection and truncation during the operations, so the HDL implementation difficulty of the inversion operation is greatly reduced and the reusability of the inversion module is markedly improved. Performing the real-number operations of the algorithm directly on the shared multiplier array unit further lowers the implementation difficulty. The result is more flexible, with better interference-suppression capability, higher output-signal gain and higher spatial resolution.
High hardware-resource utilization. In contrast to the existing LU-decomposition matrix inversion technique, the matrix row data output from the RAM buffer unit are divided into two paths and sent to the row normalization unit and the column zeroing unit respectively; the two units cooperate to realize the cyclic transformation operation of the Gauss-Jordan method, which uses only the two operation steps "row normalization" and "column zeroing". Since these two steps do not overlap in execution time, the multiplications (all of the basic operations except the division) can be time-division multiplexed between them, so the hardware-resource utilization is high and data synchronization under time-division multiplexing is simple. A matrix inversion method based on LU decomposition, by contrast, must allocate resources for one LU decomposition, two triangular-matrix inversions and one matrix multiplication, and these three operations cannot be multiplexed; even if the basic division, multiplication, addition and subtraction operations are time-division multiplexed, additional logic resources have to be allocated to synchronize the data, so the hardware-resource utilization is low and data synchronization is more complex.
Flexible application. By means of time-division multiplexing, the invention divides the matrix row data into several beats that are fed in turn to the "row normalization" and "column zeroing" operations, so the sizes of the parallel multiplier array and the parallel subtractor array can be reduced at the cost of a moderate increase in operation latency. The sizes of the parallel multiplier array and the parallel subtractor array can be controlled by configuring different numbers of time-division beats, so the trade-off between operation speed and hardware resources can be balanced flexibly.
Low implementation difficulty. In the n shared multipliers that are time-division multiplexed between the "row normalization" and "column zeroing" steps, the invention can use the time-division technique to split the row multiplication that would otherwise complete in one beat into m beats, further reducing the requirement from n shared multipliers to n/m, and thereby flexibly balancing hardware resources against operation speed. From the above description and analysis of the Gauss-Jordan elimination algorithm and its circuit implementation, compared with the traditional matrix inversion HDL implementation based on LU decomposition, the method has a small computation load and small computation delay, high computation precision, strong reusability and low implementation difficulty, can flexibly balance the conflict between operation speed and limited hardware resources, and is better suited to high-speed real-time signal processing. The invention can be used for the HDL implementation of floating-point real square-matrix inversion.
The invention can be used for, but is not limited to, the inversion of square matrices composed of floating-point real numbers; provided basic division, multiplication and subtraction cores supporting pipelining are available, it can be adapted flexibly to the inversion of square matrices of real numbers in other formats. Even fixed-point real square matrices can be inverted with this method once the basic division, multiplication and subtraction cores have been given reasonable bit-selection and truncation treatment. In theory, any square matrix to which the Gauss-Jordan elimination method applies can be inverted by this method, as long as the user supplies pipelined basic division, multiplication and subtraction cores that meet the user's own precision requirements.
Drawings
Fig. 1 is a logic block diagram of a floating-point real matrix inversion circuit according to the present invention.
Fig. 2 is a state transition diagram of the state control unit state machine of fig. 1.
Fig. 3 is a state change flow chart of the normalization row counter of fig. 2.
Fig. 4 is a state change flow chart of the zeroing row counter of fig. 2.
Fig. 5 is a state change flow chart of the three-to-one multiplexer 3MUX1 in fig. 1.
Fig. 6 is a state change flow chart of the one-to-three demultiplexer 1MUX3 in the floating-point real matrix inversion circuit according to the present invention.
Fig. 7 is a functional logic diagram of the row normalization unit in fig. 1.
Fig. 8 is a functional logic diagram of the column zeroing unit in fig. 1.
Fig. 9 is a functional logic diagram of the RAM cache unit of fig. 1.
Fig. 10 is a functional logic diagram of the shared multiplier array unit in fig. 1.
In order to better understand the above technical solutions, the following detailed description will refer to the accompanying drawings and specific embodiments.
Detailed Description
See fig. 1. According to the invention, in a matrix inversion circuit comprising an inversion-operation state control unit, a three-to-one multiplexer 3MUX1, a one-to-three demultiplexer 1MUX3, a row normalization unit, a column zeroing unit, a RAM buffer unit and a shared multiplier array unit, an N-order square matrix awaiting inversion is input from the input port of the matrix inversion circuit into the multiplexer 3MUX1 in row (or column) order, one beat at a time; the multiplexer 3MUX1 gates among the "input matrix row", the "row normalization" output result and the "column zeroing" output result, and the selected row is then written, under the control of the state control unit, into the buffer area of the RAM buffer unit for storage. The demultiplexer 1MUX3 reads the matrix row data of the k-th row from the RAM buffer unit; traversing all the row (or column) data of the matrix once constitutes one elimination cycle, the read matrix row data are divided into two paths, and n such cycles are executed in total. One path is sent to the row normalization unit, which reads the output results of the floating-point multiplier array in the shared multiplier array unit, allocates the multipliers' operation time to the different stages of the computation, and stores and reassembles the "row normalization" result row data according to the positions of the corresponding floating-point-multiplier input data within the row. The other path is sent to the column zeroing unit, in which the matrix row data undergo N cyclic transformation operations row by row according to the Gauss-Jordan elimination method; the column zeroing unit reads the output results from the output end of the floating-point subtractor array in the shared multiplier array unit, and stores and reassembles the "column zeroing" result row data according to the positions of the corresponding floating-point-subtractor input data within the row. The row normalization unit and the column zeroing unit feed their respective processing results back to the multiplexer 3MUX1, which replaces the corresponding input row data in the RAM with the "row normalization" and "column zeroing" result rows, so that an N-order inverse matrix is obtained in the buffer area of the RAM buffer unit. The demultiplexer 1MUX3 reads the data of the resulting inverse matrix from the RAM buffer unit and outputs them in sequence, outputting and holding the matrix-singular flag while the inverse-matrix data are output, then the next cycle begins; finally the resulting inverse matrix is output row by row.
Because inverting with the Gauss-Jordan elimination method entirely in pipeline mode would require n dividers and n multiply-subtract operators, consuming an enormous amount of hardware resources, the method of the invention refuses to accept new matrix data to be inverted from the moment all data of the current matrix have been input until the inversion result has been output; during the computation, a time-division multiplexing technique lets the row normalization unit and the column zeroing unit share one multiplier array, reducing the multiplier resource requirement, while a local pipelining technique accelerates the operation speed of the column zeroing unit.
The inversion circuit includes: a RAM buffer unit connected in series between the three-to-one multiplexer 3MUX1 and the one-to-three demultiplexer 1MUX3; an inversion-operation state control unit connected in parallel to the multiplexer 3MUX1 and the demultiplexer 1MUX3; and the row normalization unit, the column zeroing unit and the shared multiplier array unit connected end to end, the row normalization unit being connected to the shared multiplier array unit through interface A, interface B and interface R, and the shared multiplier array unit being connected to the column zeroing unit through interface C, interface D and interface O.
In an alternative data manipulation and transfer flow embodiment:
The matrix to be inverted is input from the input port of the matrix inversion circuit at the rate of one row per clock and stored in the internal RAM. After the input data have been received, the following n cycles are executed. In the k-th cycle (k = 0, 1, 2, ..., n-1), it is assumed that the number m of multipliers in the multiplier array is insufficient to compute all the data of one row in parallel, so one row of data must be split into t beats for multiplication. The RAM outputs the data of the k-th row of the matrix, a(k,*) (the subscript (k,*) denotes all the data of row k). In the row normalization unit, an n-select-1 multiplexer selects the k-th element a(k,k) from a(k,*), and the divider of the row normalization unit computes 1/a(k,k); if a(k,k) is 0, the matrix-singular flag bit is set. The other copy of a(k,*) is sent to the m parallel t-select-1 multiplexer arrays in front of the multiplier-array input interface of the row normalization unit, which gate the k-th row data in t beats (when matrix row data are multiplied, the time-division beat-splitting technique computes one row of data over several beats to reduce the hardware resource requirement). The input interface of the multiplier array blocks the data input flow until the input end of the multiplier array is idle, and then starts to feed 1/a(k,k) and the gated row elements to it. The output interface of the multiplier array reads the multiplier-array results, stores and reassembles the "row normalization" result row data a'(k,*) according to their positions within the row, and stores them in the row-normalization result register array. The result row data a'(k,*) output by the row normalization unit are written back to the RAM, replacing the input row data a(k,*). The column zeroing unit then reads from the RAM, in sequence, the data of the other rows i ≠ k, a(i,*); the element a(i,k) gated out by an n-select-1 multiplexer and the "row normalization" result row a'(k,*) held in the register array are selected together, through the t-select-1 multiplexer array of the multiplier-array input interface, as the data fed to the multiplier array. The input interface of the multiplier array blocks the data input flow until the input end of the multiplier array is idle, and then starts to feed a(i,k) and a'(k,*). The products output by the multiplier array are sent to one input of the subtractor array, while the row data a(i,*), selected beat by beat by a t-select-1 multiplexer array (when matrix row data are subtracted, the time-division beat-splitting technique again computes one row of data over several beats to reduce the hardware resource requirement), are delayed by the same number of beats as the processing latency of the multiplier and sent to the other input of the subtractor array. In particular, for the data of column k, the value supplied to the subtractor input should be replaced with 0. The results output by the subtractor array are stored and reassembled into the "column zeroing" result row data a''(i,*) according to the positions of the corresponding input data within the row. The column zeroing unit writes the column-zeroing result row a''(i,*) back into the RAM, replacing the matrix row data a(i,*). After all n cycles, the RAM outputs the data of the resulting inverse matrix in sequence, and the matrix-singular flag is output and held while the inverse-matrix data are being output.
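To make the beat-splitting idea concrete, here is a small timing-free behavioral sketch (Python, not HDL) of how a row of n elements is multiplied by a common factor using only m physical multipliers over ceil(n/m) beats; the function name and parameters are illustrative assumptions, and pipeline latencies are ignored.

```python
import math

def multiply_row_in_beats(row, factor, m):
    """Model of the shared m-wide multiplier array with time-division beat splitting.

    In "row normalization" the factor is 1/a(k,k); in "column zeroing" it is
    a(i,k) and the products then feed the subtractor array.  A row of n
    elements is processed in t = ceil(n/m) beats instead of one beat with n
    multipliers.
    """
    n = len(row)
    t = math.ceil(n / m)
    result = [0.0] * n
    for beat in range(t):                          # one pass of the m multipliers per beat
        lo, hi = beat * m, min((beat + 1) * m, n)
        for j in range(lo, hi):                    # the m multipliers working in parallel
            result[j] = row[j] * factor
    return result
```

With n = 40 and m = 20, for instance, each row multiplication takes t = 2 beats, which matches the embodiment described later.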
Circuit design description:
See fig. 2, fig. 3 and fig. 4. The inversion-operation state control unit (hereinafter the state control unit) comprises a state machine with the four states "input", "row normalization", "column zeroing" and "output", which performs the state transitions of the inversion operation. The state machine is reset to the data-input state when the circuit is initialized and leaves it when data input is complete. The "row normalization" state uses a normalization matrix row index counter (hereinafter the normalization row counter olcntr), and the "column zeroing" state uses a zeroing row index counter (hereinafter the zeroing row counter zlcntr); both count 0, 1, ..., N-1, where N is the total number of rows of the matrix, and the count value of the zeroing row counter is sent through the RAM read-address port to the RAM buffer unit to address the matrix data fed to the column zeroing unit. When the zeroing row counter zlcntr points to the last row and the normalization row counter olcntr also points to the last row, the state machine outputs the inverse matrix and returns to the "input" state after the inverse matrix has been output.
The state transitions of the normalization row counter olcntr follow the flow shown in fig. 3. The normalization row counter olcntr is reset to 0 at initialization, pointing to the row data of the first matrix row. If the state machine is in the "input" state, olcntr = 0; otherwise, if it is in the "output" state, olcntr = 0; otherwise, if it is in the "row normalization" state, olcntr remains unchanged; otherwise, if it is in the "column zeroing" state, it is checked whether the zeroing row counter zlcntr points to the last row of the matrix, and if not, the flow ends with olcntr unchanged. If zlcntr does point to the last row of the matrix, it is then checked whether the normalization row counter olcntr points to the last row of the matrix: if it does, olcntr is set to 0 and the decision flow ends; if it does not, olcntr is set to olcntr + 1 and the decision flow ends.
The state transitions of the zeroing row counter follow the flow shown in fig. 4. The zeroing row counter zlcntr is reset to 0 at initialization. When the state machine is in the "row normalization" state, zlcntr is initialized: after the row normalization unit outputs its result, zlcntr is reset to the index of the first row to be zeroed. When the state machine is in the "column zeroing" state, zlcntr is incremented with the clock up to the last row, and the row that has already been normalized is skipped during this accumulation. After the state machine switches to the "output" state, zlcntr is first reset to the index of the first row of the matrix, then accumulated clock by clock up to the index of the last row of the matrix, and finally triggers the state machine to switch to the "input" state.
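The following Python sketch is one possible behavioral reading of the state machine and the two counters in figs. 2 to 4; the state names and the generator interface are assumptions made for illustration, and clock-level timing and reset details are omitted.

```python
def control_sequence(n):
    """Yield (state, olcntr, zlcntr) tuples in the order implied by figs. 2-4.

    olcntr is the row currently being normalized; zlcntr addresses the row
    currently being read for input, zeroing, or output.
    """
    # "input": n matrix rows are written into the RAM buffer
    for zlcntr in range(n):
        yield "input", 0, zlcntr
    # n elimination cycles: "row normalization" then "column zeroing"
    for olcntr in range(n):
        yield "row normalization", olcntr, olcntr
        for zlcntr in range(n):
            if zlcntr == olcntr:              # the freshly normalized row is skipped
                continue
            yield "column zeroing", olcntr, zlcntr
    # "output": the inverse matrix is streamed out row by row
    for zlcntr in range(n):
        yield "output", n - 1, zlcntr
```

For a 3×3 matrix this produces 3 input beats, then for each of the 3 cycles one row-normalization step and 2 column-zeroing steps, then 3 output beats, after which the machine would return to the "input" state.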
See fig. 5. The three-to-one multiplexer 3MUX1 selects, according to the state of the state control unit, one of three data paths: the matrix A^(0) to be inverted, the "row normalization" result row, and the "column zeroing" result row; the state-dependent selection flow is executed as shown in fig. 5. When the state control unit is in the "input" state it gates the matrix A^(0) to be inverted, in the "row normalization" state it gates the row-normalization result row, in the "column zeroing" state it gates the column-zeroing result row, and in the "output" state it gates a default all-zero signal. The multiplexer 3MUX1 is reset to gate all zeros. It is then judged whether the state machine is in the "input" state; if so, the multiplexer 3MUX1 gates the input matrix A^(0) to be inverted. Otherwise it is judged whether the state machine is in the "row normalization" state; if so, the multiplexer 3MUX1 gates the row-normalization result row. Otherwise it is judged whether the state machine is in the "column zeroing" state; if so, the multiplexer 3MUX1 gates the column-zeroing result row. Otherwise it is judged whether the state machine is in the "output" state; if so, the multiplexer 3MUX1 gates the all-zero signal; otherwise the decision flow ends.
See fig. 6. The one-to-three demultiplexer 1MUX3 routes the matrix row data output by the RAM, according to the state of the state control unit, to the row normalization unit, the column zeroing unit or the external output interface; the state-dependent selection flow is executed as shown in fig. 6. The demultiplexer 1MUX3 suspends its output when the state control unit is in the "input" state, gates its output to the row normalization unit in the "row normalization" state, to the column zeroing unit in the "column zeroing" state, and to the external output interface in the "output" state. The demultiplexer 1MUX3 is reset to gate all zeros. It is then judged whether the state machine is in the "input" state; if so, the output remains suspended. Otherwise it is judged whether the state machine is in the "row normalization" state; if so, the output is gated to the row normalization unit. Otherwise it is judged whether the state machine is in the "column zeroing" state; if so, the output is gated to the column zeroing unit. Otherwise it is judged whether the state machine is in the "output" state; if so, the output is gated to the inversion output port; otherwise the decision flow ends.
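A compact way to summarize the gating rules of figs. 5 and 6 is the pair of selection functions sketched below (Python; the state strings mirror the state-machine sketch above and are illustrative only).

```python
def mux_3to1_source(state):
    """3MUX1 (fig. 5): which data path is written towards the RAM buffer."""
    return {"input": "matrix row to be inverted",
            "row normalization": "row-normalization result row",
            "column zeroing": "column-zeroing result row",
            "output": "all-zero signal"}[state]

def demux_1to3_sink(state):
    """1MUX3 (fig. 6): where the matrix row read from the RAM is routed."""
    return {"input": "suspended",
            "row normalization": "row normalization unit",
            "column zeroing": "column zeroing unit",
            "output": "external output port"}[state]
```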
See fig. 7. The row normalization unit executes the "row normalization" operation. When the state control unit is in the "row normalization" state, the matrix row data a(k,*) output by the RAM enter the row normalization unit after being gated by the one-to-three demultiplexer 1MUX3 and are divided into two signal paths. One path is sent, together with the normalization counter value k given by the state control unit, to an n-select-1 multiplexer that gates out a(k,k); the n-select-1 multiplexer is connected to the divisor input of the divider, while the dividend input of the divider is connected to a constant signal representing the real number "1", and the quotient 1/a(k,k) is fanned out into m paths connected to the m shared-multiplier-array input interfaces A. The other path is a(k,*) with its k-th element replaced by the real number "1"; it is then sent to t-select-1 multiplexers that gate out m matrix row elements per beat and, after a synchronization delay of T_dr beats (T_dr = t_Nmux1 + t_div, where t_Nmux1 is the processing delay of the n-select-1 multiplexer and t_div is the divider processing delay; the same below), these elements are connected to the m shared-multiplier-array input interfaces B. The m multiplication results output by the shared-multiplier-array output interface R are spliced together over a t-beat delay to form the row-normalization result of the k-th row, which is connected to the output of the row normalization unit together with a splice-completion signal VO. The exception flag given by the divider also serves as the matrix-singular flag and, after a synchronization delay of T_df beats (T_df = t_Nmux1 + t_div + t_mult + t_comb, where t_mult is the processing delay of the multiplier array and t_comb is the delay of the t-beat splicing; the same below), is connected to the output of the row normalization unit.
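Ignoring the multiplexer, divider and multiplier latencies, the data transformation performed by the row normalization unit can be sketched as follows (Python; the function name and return convention are assumptions made for this sketch).

```python
def row_normalization(row_k, k):
    """Behavioral model of fig. 7.

    The divider forms 1/a(k,k) (flagging a singular matrix when a(k,k) == 0),
    the k-th element fed to the multipliers is replaced by the constant 1,
    and the shared multiplier array scales the whole row by the reciprocal.
    """
    pivot = row_k[k]
    singular = (pivot == 0.0)
    reciprocal = 0.0 if singular else 1.0 / pivot
    multiplicand = list(row_k)
    multiplicand[k] = 1.0                  # so position (k, k) becomes 1 * (1/a(k,k))
    return singular, [x * reciprocal for x in multiplicand]
```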
See fig. 8. The column zeroing unit executes the "column zeroing" operation step. When the state control unit is in the "column zeroing" state, the result row a'(k,*) output by the row normalization unit is registered into the column zeroing unit, and the matrix row data a(i,*) output by the RAM enter the column zeroing unit after being gated by the one-to-three demultiplexer 1MUX3 and are divided into two signal paths. One path is gated by an n-select-1 multiplexer to give a(i,k), which, together with the m matrix row elements of a'(k,*) gated out per beat, is connected to the shared-multiplier-array input interfaces C and D respectively. The other path is a(i,*) with its k-th element replaced by the real number "0"; it is sent to t-select-1 multiplexers that gate out m matrix row elements per beat and, after a delay of T_dm beats (T_dm = t_tdm + t_mult, where t_tdm is the processing delay of the t-select-1 multiplexer; the same below), is connected to the subtractor-array input interface A. The m multiplication results output by the shared-multiplier-array output interface O are connected to the subtractor-array input interface B. The m subtraction results output by the subtractor-array output interface R are spliced together over a t-beat delay to form the column-zeroing result row, which is connected to the output of the column zeroing unit together with a splice-completion signal VZ.
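Likewise, the data transformation of the column zeroing unit, with the pipeline and beat-splitting details stripped away, can be sketched as below (Python; an illustrative assumption, not the HDL itself).

```python
def column_zeroing(row_i, norm_row_k, k):
    """Behavioral model of fig. 8 for one row i != k.

    Every element of the normalized row k is multiplied by a(i,k) (the k-th
    element of row i); the products are subtracted from row i, except that
    the column-k minuend is replaced by 0, so the new a(i,k) equals
    -a(i,k)/a(k,k).
    """
    factor = row_i[k]                                          # gated out by the n-select-1 mux
    products = [factor * x for x in norm_row_k]                # shared multiplier array
    minuends = [0.0 if j == k else v for j, v in enumerate(row_i)]
    return [m - p for m, p in zip(minuends, products)]         # subtractor array
```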
See fig. 9. The RAM buffer unit buffers the input matrix A^(0) to be computed, the intermediate matrices A^(k+1) generated during the calculation cycles, and the final resulting inverse matrix A^{-1}. The signal connected to the RAM write-address port of the RAM buffer unit is selected by a three-to-one multiplexer from the row index of the input matrix A^(0), the normalization row counter, and the delay-synchronized zeroing row counter. When the state control unit is in the "input" state the row index of the input matrix A^(0) is gated; in the "row normalization" state the normalization row counter is gated; in the "column zeroing" state the zeroing row counter value delayed by T_zl beats is gated (T_zl = t_Nmux1 + t_mult + t_sub + t_comb, where t_sub is the processing delay of the subtractor array). The RAM write-data port of the RAM buffer unit is connected to the matrix row data output by the multiplexer 3MUX1. The RAM write-enable port of the RAM buffer unit is gated by a four-to-one multiplexer among "1", "0", and the splice-completion signals VO and VZ: the state control unit gates "1" in the "input" state, "0" in the "output" state, the VO signal in the "row normalization" state, and the VZ signal in the "column zeroing" state. The RAM read-data port of the RAM buffer unit is connected to the matrix-row-data input of the demultiplexer 1MUX3. The RAM read-address port of the RAM buffer unit is connected to the output value of the zeroing row counter. The read enable of the RAM buffer unit is gated by a two-to-one multiplexer: the state control unit gates "0" when in the "input" state and "1" when not in the "input" state.
See fig. 10. The shared multiplier array unit handles, in time-shared fashion, the multiplication tasks of the row normalization unit and the column zeroing unit; its functional logic diagram is shown in fig. 10. The input ports of the shared multiplier array unit include: the shared-multiplier-array input interfaces A and B (abbreviated interface A and interface B in this paragraph) driven by the row normalization unit, the shared-multiplier-array input interfaces C and D (abbreviated interface C and interface D) driven by the column zeroing unit, and the state of the state control unit. Interface A and interface C, and interface B and interface D, are each gated by a two-to-one multiplexer onto the two input ports of the shared multiplier array. When the state of the state control unit is "row normalization", the two multiplexers gate interface A and interface B respectively; when the state is "column zeroing", they gate interface C and interface D respectively. The output ports of the shared multiplier array unit include: the shared-multiplier-array output interface R (abbreviated interface R) to the row normalization unit, and the shared-multiplier-array output interface O (abbreviated interface O) to the column zeroing unit. The output of the shared multiplier array is connected to interface R and interface O through a one-to-two demultiplexer. When the state of the state control unit is "row normalization", the demultiplexer gates the output to interface R; when the state is "column zeroing", it gates the output to interface O.
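The time-shared use of the single physical multiplier array can be summarized with the small routing model below (Python; the interface values are passed as equal-length lists, an illustrative simplification of the m parallel ports).

```python
def shared_multiplier_array(state, a, b, c, d):
    """Behavioral model of fig. 10.

    Interfaces A/B come from the row normalization unit and C/D from the
    column zeroing unit; the state selects which pair drives the multipliers
    and whether the products leave through interface R or interface O.
    Returns (r, o), with the unused interface left as None.
    """
    if state == "row normalization":
        products = [x * y for x, y in zip(a, b)]
        return products, None                  # products routed to interface R
    if state == "column zeroing":
        products = [x * y for x, y in zip(c, d)]
        return None, products                  # products routed to interface O
    return None, None                          # array idle in the "input"/"output" states
```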
In an alternative embodiment, the matrix to be inverted input into the inversion circuit is a 40×40 double-precision floating-point real square matrix, and the RAM buffer unit in the inversion circuit is configured with a RAM block of size 40×40 to store the input matrix data, which are updated during the calculation. Division, multiplication and subtraction operation cores for double-precision floating-point numbers supporting pipeline mode are configured, with operation latencies of 10, 5 and 7 clocks respectively. To process each matrix row in 2 parallel beats, this example requires 1 division core, 20 multiplication cores and 20 subtraction cores.
In the above embodiment, the inversion of a 40×40 double-precision floating-point matrix is realized on the basis of the Gauss-Jordan elimination algorithm. The local pipelining technique greatly increases the speed of the inversion (the inversion latency of this embodiment is calculated to be 4572 clock beats), and the time-division multiplexing technique markedly reduces the resource requirements of the floating-point multiplier and subtractor cores: the original algorithm would need 80 floating-point multipliers and 40 floating-point subtractors, whereas the inversion circuit designed according to the invention uses 20 of each, reducing the multiplier cores to 1/4 and the subtractor cores to 1/2 of the original requirement. The method of the invention therefore has small operation latency, can flexibly balance hardware resources against operation time, is suitable for high-speed real-time signal processing, and meets the requirements of engineering applications.
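The resource figures quoted above follow directly from the beat splitting; a few lines of arithmetic (Python, assuming the 40×40, 2-beat configuration of this embodiment) make the reduction explicit.

```python
import math

n, beats = 40, 2                            # matrix order and time-division beats per row
multipliers = math.ceil(n / beats)          # 20, versus the 80 multipliers of the fully parallel original
subtractors = math.ceil(n / beats)          # 20, versus the 40 subtractors of the fully parallel original
dividers = 1                                # one pivot reciprocal per elimination cycle
print(multipliers, subtractors, dividers)   # -> 20 20 1
```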
The embodiments of the present invention described above are combinations of elements and features of the present invention in a predetermined form. Elements or features may be considered optional unless specified otherwise. Each element or feature may be implemented without being combined with other elements or features. Moreover, embodiments of the invention may be constructed by combining portions of the elements and/or features. The order of operations described in embodiments of the present invention may be rearranged. Some configurations of any one embodiment may be included in another embodiment and may be replaced with corresponding configurations of another embodiment.
While the invention has been described in detail in connection with the drawings, it should be understood that the foregoing is only illustrative of the preferred embodiment of the invention and is not intended to limit the invention thereto, but rather that various modifications, equivalents, improvements and substitutions can be made therein by those skilled in the art without departing from the spirit and principles of the invention, and are intended to be included within the scope of the appended claims.
Claims (24)
1. A method for implementing a floating point operation circuit for real matrix inversion, characterized by comprising the following steps: in a matrix inversion circuit comprising an inversion operation state control unit, a one-out-of-three multiplexer 3MUX1, a one-to-three demultiplexer 1MUX3, a row normalization unit, a column zeroing unit, a RAM buffer unit and a shared multiplier array unit, an N-order square matrix awaiting inversion is input from the input port of the matrix inversion circuit into the multiplexer 3MUX1 in row-by-row or column-by-column beat order; the multiplexer 3MUX1 gates among the input matrix row, the row normalization output result and the column zeroing output result, and writes the selected data into the buffer area of the RAM buffer unit for storage; the demultiplexer 1MUX3 reads the matrix row data of the kth row from the RAM buffer unit, one traversal of all row or column data of the matrix constituting one elimination cycle, splits the read matrix row data into two paths, and n cycles are executed in total; one path is sent to the row normalization unit, which reads the output results of the floating-point multiplier array in the shared multiplier array unit, allocates the operation time of the multipliers among the different stages of the computation, and stores and reassembles the row normalization result row data according to the positions of the corresponding multiplier input data within the row; the other path is sent to the column zeroing unit, in which the matrix row data undergo N cyclic transformation operations row by row according to the Gaussian elimination method; the column zeroing unit reads the output results from the subtracter array that follows the output interface of the shared multiplier array unit, and stores and reassembles the column zeroing result row data according to the positions of the corresponding subtracter input data within the row; the row normalization unit and the column zeroing unit feed their respective processing results back to the one-out-of-three multiplexer 3MUX1, which replaces the corresponding input row data in the RAM with the row normalization result row data and the column zeroing result row data, so that an N-order inverse matrix is obtained in the buffer area of the RAM buffer unit; the demultiplexer 1MUX3 sequentially reads and outputs the data of the result inverse matrix from the RAM buffer unit, outputs and holds the matrix singular flag while the inverse matrix data are output, enters the next cycle, and finally outputs the result inverse matrix in row beat order.
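Claim 1 describes an in-place Gauss-Jordan style sweep: in cycle k the kth row is normalized by its pivot and every other row has its kth-column entry eliminated, with the results written back over the same RAM rows. The following Python reference model covers the arithmetic only; it is a software sketch, not the circuit (multiplexing, beat splitting and pipelining are omitted, and, like the described circuit, it performs no row pivoting):

```python
def invert_in_place(A):
    """In-place Gauss-Jordan inversion mirroring the n elimination cycles of claim 1.

    A is a list of n lists of floats; on return it holds the inverse of the input.
    A zero pivot raises ValueError, corresponding to the 'matrix singular' flag.
    """
    n = len(A)
    for k in range(n):                                  # one elimination cycle per row
        pivot = A[k][k]
        if pivot == 0.0:
            raise ValueError("matrix singular flag set")
        # "row normalization": kth element replaced by 1, then the row scaled by 1/pivot
        A[k][k] = 1.0
        A[k] = [x / pivot for x in A[k]]
        # "column zeroing": every other row i becomes A_i - a_ik * (normalized row k),
        # with its kth element first replaced by 0
        for i in range(n):
            if i == k:
                continue
            f = A[i][k]
            A[i][k] = 0.0
            A[i] = [a - f * b for a, b in zip(A[i], A[k])]
    return A
```

For example, invert_in_place([[1.0, 2.0], [3.0, 4.0]]) returns [[-2.0, 1.0], [1.5, -0.5]], which is the inverse of the input matrix.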
2. The method for implementing a floating point operation circuit for real matrix inversion of claim 1, wherein: the matrix inversion circuit comprises: a RAM buffer unit connected in series between the one-out-of-three multiplexer 3MUX1 and the one-to-three demultiplexer 1MUX3, and, connected in parallel with these two multiplexers, an inversion operation state control unit, a row normalization unit, a column zeroing unit and a shared multiplier array unit; the row normalization unit is connected to the shared multiplier array unit through interface A, interface B and interface R, the shared multiplier array unit is connected to the column zeroing unit through interface C, interface D and interface O, and the matrix to be inverted is input from the input port of the matrix inversion circuit at a rate of one row per clock and stored in the internal RAM.
3. The method for implementing a floating point operation circuit for real matrix inversion of claim 2, wherein: after the input of the matrix to be inverted has been received, the following n loops are executed: in the kth loop, k = 0, 1, 2, …, n-1, when the number m of multipliers in the multiplier array is not enough to compute all data of a row in parallel, one row of data is divided into several beats for multiplication; the RAM outputs the kth row data of the matrix A^(k), denoted A_k^(k), where the subscript k denotes all data of the kth row.
4. The method for implementing a floating point operation circuit for real matrix inversion of claim 3, wherein: in the row normalization unit, an n-select-one multiplexer selects the kth element a_kk^(k) from A_k^(k) and sends it to the divider of the row normalization unit for computation; if a_kk^(k) is 0, the matrix singular flag bit is set; the other signal path of the row normalization unit is sent to the m parallel t-select-one multiplexer arrays in front of the multiplier array input interface of the row normalization unit, where t-select-one gating is applied to the kth row data; when the matrix row data are multiplied, the time-division-multiplexing beat-splitting technique divides the row data into t beats to complete the computation so as to reduce the hardware resource requirement, t being a positive integer, and the gated data for the multiplier array are sent to the input of the multiplier array.
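As an illustration of claim 4's two signal paths, the sketch below models pivot selection, the singular check and the t-beat gating in software; it is a behavioral model under stated assumptions, not the circuit (the t-select-one multiplexers are represented simply by list slicing):

```python
def row_normalization_operands(row, k, m):
    """Model of claim 4's two paths for the kth row (illustrative only).

    Path 1: an n-select-one MUX picks the pivot a_kk for the divider; a zero pivot
            sets the matrix singular flag.
    Path 2: the row, with its kth element replaced by 1, is gated m elements per
            beat towards the multiplier array, giving t = ceil(n/m) beats.
    """
    pivot = row[k]
    singular = (pivot == 0.0)
    row_bar = list(row)
    row_bar[k] = 1.0
    beats = [row_bar[i:i + m] for i in range(0, len(row_bar), m)]   # t beats of <= m
    return pivot, singular, beats
```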
5. The method for implementing a floating point operation circuit for real matrix inversion of claim 4, wherein: the multiplier array input interface blocks the data input flow until the multiplier array input is idle, and then starts to output 1/a_kk^(k) and the kth row data to the multiplier array; the multiplier array output interface reads the output results of the multiplier array, stores and reassembles them at their positions within the row into the 'row normalization' result row data A_k^(k+1), and stores them in the row normalization result register array.
6. The method for implementing a floating point operation circuit for real matrix inversion of claim 5, wherein: the result data A_k^(k+1) output by the row normalization unit are the input data of the column zeroing unit; the column zeroing unit sequentially reads the data of the other, non-kth rows A_i^(k), i ≠ k, from the RAM and, together with the 'row normalization' result row A_k^(k+1) in the register array, selects the data to be input to the multiplier array through the t-select-one multiplexer arrays of the multiplier array input interface.
7. The method for implementing a floating point operation circuit for real matrix inversion of claim 6, wherein: the multiplier array input interface blocks the data input flow until the multiplier array input is idle, and then starts to output a_ik^(k) and A_k^(k+1) to the multiplier array; the results output by the multiplier array, and the data A_i^(k) selected by the t-select-one multiplexers of the input path after being delayed by the same number of beats as the multiplier processing delay, are sent to the inputs of the subtracter array.
8. The method for implementing a floating point operation circuit for real matrix inversion of claim 7, wherein: in the row data sent to the subtracter input alongside the kth-column output of the multiplier array, the kth element is replaced with 0; the output results of the subtracter array output end are stored and reassembled, according to the positions of their corresponding input data within the row, into the 'column zeroing' result row data A_i^(k+1); the column zeroing unit writes the 'column zeroing' result row data A_i^(k+1) into the RAM, replacing the matrix row data A_i^(k) in the RAM; the RAM sequentially outputs the data of the result inverse matrix A^(-1), and the matrix singular flag is output and held while the inverse matrix data are output.
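A behavioral sketch of one 'column zeroing' row update as described in claims 6 to 8 (software model only; the m-way beat splitting, delay matching and pipelining are omitted, and the names are assumptions):

```python
def column_zero_row(row_i, norm_row_k, k):
    """Update one non-kth row: (A_i with kth element 0) - a_ik * A_k_normalized.

    row_i      : row i of the current matrix A^(k)
    norm_row_k : the row normalization result row for row k (its kth element is 1/pivot)
    Returns the column zeroing result row; its kth element ends up as -a_ik/pivot,
    which is what the in-place Gauss-Jordan sweep stores in that position.
    """
    f = row_i[k]                         # a_ik, gated out by the n-select-one MUX
    minuend = list(row_i)
    minuend[k] = 0.0                     # kth element replaced by 0 (subtracter input A)
    products = [f * x for x in norm_row_k]              # shared multiplier array outputs
    return [a - b for a, b in zip(minuend, products)]   # subtracter array outputs
```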
9. The method for implementing a floating point operation circuit for real matrix inversion of claim 1, wherein: the inversion operation state control unit, referred to as the state control unit for short, comprises a state machine having the four states 'input', 'row normalization', 'column zeroing' and 'output' and performing transitions among them; the state machine is reset to the data 'input' state when the circuit is initialized and enters the 'row normalization' state after data input is completed; the normalization matrix row index counter used in the 'row normalization' state is referred to as the normalization row counter olcntr, and the zeroing row index counter used in the 'column zeroing' state is referred to as the zeroing row counter zlcntr, zlcntr = 0, 1, …, N-1, where N is the total number of matrix rows; when the zeroing row counter zlcntr points to the last row while the normalization row counter olcntr does not, the state machine returns to the 'row normalization' state; after the row normalization computation is completed, the count value of the zeroing row counter is sent to the RAM buffer unit through the RAM read address port for addressing the matrix data input to the column zeroing unit.
10. The method for implementing a floating point operation circuit for real matrix inversion of claim 9, wherein: when the zeroing row counter zlcntr points to the last row and the normalization row counter olcntr also points to the last row, the inverse matrix is output, and the state machine returns to the 'input' state after the inverse matrix has been output.
11. The method for implementing a floating point operation circuit for real matrix inversion of claim 10, wherein: the state change flow of the normalization row counter olcntr is as follows: olcntr is reset to 0 at initialization, pointing to the row data of the first matrix row; it is judged whether the state machine is in the 'input' state, and if so olcntr = 0; otherwise it is judged whether the state machine is in the 'output' state, and if so olcntr = 0; if not, it is judged whether the state machine is in the 'row normalization' state, and if so olcntr remains unchanged; if not, it is judged whether the state machine is in the 'column zeroing' state, and if not the flow ends; if so, it is judged whether the zeroing row counter zlcntr points to the last row of the matrix; if zlcntr does not point to the last row, olcntr remains unchanged; if zlcntr points to the last row, it is judged whether the normalization row counter olcntr points to the last row of the matrix: if so, olcntr = 0 and the judgment flow ends; if not, olcntr = olcntr + 1 and the judgment flow ends.
12. The method for implementing a floating point operation circuit for real matrix inversion of claim 11, wherein: the zeroing row counter zlcntr is reset to 0 at initialization; its value is initialized when the state machine is in the 'row normalization' state and is reset to the index of the first row to be zeroed after the row normalization unit outputs its result; when the state machine is in the 'column zeroing' state, the zeroing row counter zlcntr is clock-accumulated by one per beat up to the last row, and the row that has already been normalized is skipped during the accumulation of zlcntr; after the state machine switches to the 'output' state, the zeroing row counter zlcntr is first reset to the first row index of the matrix, is then clock-accumulated up to the last row index of the matrix, and finally triggers the state machine to switch to the 'input' state.
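The counter behaviour of claims 11 and 12 can be summarized by a small software model of the olcntr update rule; this is a sketch of the decision flow only, with the state names as assumptions, not an HDL description:

```python
def next_olcntr(state, olcntr, zlcntr, n):
    """Next value of the normalization row counter olcntr (per the flow of claim 11).

    state  : "input", "row_norm", "col_zero" or "output"
    n      : total number of matrix rows; row indices run from 0 to n-1
    """
    if state in ("input", "output"):
        return 0                          # reset, pointing at the first matrix row
    if state == "row_norm":
        return olcntr                     # hold its value during row normalization
    # state == "col_zero": advance only once the zeroing pass reaches the last row
    if zlcntr == n - 1:
        return 0 if olcntr == n - 1 else olcntr + 1
    return olcntr
```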
13. The method for implementing a floating point operation circuit for real matrix inversion of claim 12, wherein: according to the state of the state control unit, the one-out-of-three multiplexer 3MUX1 selects one of the following three data, namely the input matrix A^(0) to be inverted, the row normalization result row A_k^(k+1), and the corresponding column zeroing result row A_i^(k+1), and connects it to the write data port of the RAM.
14. The method for implementing a floating point operation circuit for real matrix inversion of claim 13, wherein: the state control unit gates the matrix A^(0) to be inverted when in the 'input' state, gates the row normalization result row A_k^(k+1) when in the 'row normalization' state, gates the column zeroing result row A_i^(k+1) when in the 'column zeroing' state, and gates a default all-zero signal when in the 'output' state; the one-out-of-three multiplexer 3MUX1 is reset to gate all zeros; it is judged whether the state machine is in the 'input' state, and if so the multiplexer 3MUX1 gates the input matrix A^(0) to be inverted; if not, it is judged whether the state machine is in the 'row normalization' state, and if so the multiplexer 3MUX1 gates the row normalization result row A_k^(k+1); otherwise it is judged whether the state machine is in the 'column zeroing' state, and if so the multiplexer 3MUX1 gates the column zeroing result matrix row A_i^(k+1); otherwise it is judged whether the state machine is in the 'output' state, and if so the multiplexer 3MUX1 gates the all-zero signal; if not, the judgment flow ends.
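A compact behavioral model of the write-data selection of claims 13 and 14 (a sketch under the assumption that the four states are encoded as the strings used below):

```python
def ram_write_data_mux(state, input_row, norm_result_row, zero_result_row, n):
    """3MUX1 gating towards the RAM write data port, per claims 13 and 14 (sketch)."""
    if state == "input":
        return input_row                  # a row of the matrix A^(0) to be inverted
    if state == "row_norm":
        return norm_result_row            # row normalization result row
    if state == "col_zero":
        return zero_result_row            # column zeroing result row
    return [0.0] * n                      # "output" state: default all-zero signal
```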
15. The method for implementing a floating point operation circuit for real matrix inversion of claim 14, wherein: the row normalization unit performs the row normalization operation when the state control unit is in the 'row normalization' state; the matrix row data A_k^(k) output by the RAM enter the row normalization unit after being gated by the one-out-of-three multiplexer 3MUX1 and are split into two signal paths: one path, together with the normalization counter value k given by the state control unit, is sent to an n-select-one multiplexer which gates out a_kk^(k); the n-select-one multiplexer is connected to the divisor input of the divider, the dividend input of the divider is connected to a constant signal representing the real number '1', and the quotient 1/a_kk^(k) output by the divider is fanned out into m paths connected to the m shared multiplier array input interfaces A; on the other path, the kth element of A_k^(k) is replaced by the real number '1', the resulting row is sent to the t-select-one multiplexers which gate m matrix row elements at a time, and after a synchronizing delay of T_dr beats the elements are connected to the m shared multiplier array input interfaces B; the m multiplication results output by the shared multiplier array output interface R are spliced over t beats of delay into the row normalization result row A_k^(k+1) of the kth row, which is connected to the output of the row normalization unit, and a splice completion signal VO is output at the same time; the 'divide-by-zero' exception flag given by the divider also serves as the matrix singular flag.
16. The method for implementing a floating point operation circuit for real matrix inversion of claim 15, wherein: when the state control unit is in the 'column zeroing' state, the registered output result row A_k^(k+1) of the row normalization unit and the matrix row data A_i^(k) output by the RAM and gated by the one-out-of-three multiplexer 3MUX1 are input to the column zeroing unit, the latter being split into two signal paths: on one path, an n-select-one multiplexer gates out a_ik^(k), which, together with the registered result row A_k^(k+1) gated m matrix row elements at a time by the t-select-one multiplexers, is connected to the shared multiplier array input interfaces C and D respectively; on the other path, the kth element of A_i^(k) is replaced by the real number '0', the resulting row is sent to the t-select-one multiplexers which gate m matrix row elements at a time, and after a synchronizing delay of T_dm beats the elements are connected to the subtracter array input interface A.
17. The method for implementing a floating point operation circuit for real matrix inversion of claim 16, wherein: the m multiplication results output by the shared multiplier array output interface O are connected to the subtracter array input interface B respectively; the m subtraction results output by the subtracter array output interface R are spliced over t beats of delay into the column zeroing result row A_i^(k+1), which is connected to the output of the column zeroing unit, and a splice completion signal VZ is output.
18. The method for implementing a floating point operation circuit for real matrix inversion of claim 17, wherein: the RAM buffer unit buffers the input matrix A^(0) to be computed, the intermediate matrices A^(k+1) generated during the computation cycles, and the final result inverse matrix A^(-1); the signal connected to the RAM write address port of the RAM buffer unit is gated by a one-out-of-three multiplexer among the row index of the input matrix A^(0) to be computed, the normalization row counter, and the zeroing row counter after delay synchronization; the row index of the matrix A^(0) to be computed is gated when the state control unit is in the 'input' state, the normalization row counter is gated in the 'row normalization' state, and the zeroing row counter delayed by T_zl beats is gated in the 'column zeroing' state, where T_zl = t_Nmux1 + t_mult + t_sub + t_comb, t_Nmux1 is the processing delay of the n-select-one multiplexer, t_mult is the processing delay of the multiplier array, t_sub is the processing delay of the subtracter array, and t_comb is the delay of the t-beat splicing process.
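The delay balancing of claim 18 is a simple sum of the stage delays along the column zeroing path. In the sketch below, t_mult and t_sub are taken from the embodiment above (5 and 7 clocks), while t_nmux1 and t_comb are hypothetical values chosen only for illustration:

```python
t_nmux1 = 1    # n-select-one multiplexer delay (assumed value)
t_mult  = 5    # multiplier array processing delay (from the embodiment)
t_sub   = 7    # subtracter array processing delay (from the embodiment)
t_comb  = 2    # t-beat splice delay (assumed value)

T_zl = t_nmux1 + t_mult + t_sub + t_comb   # beats by which zlcntr must be delayed
print(T_zl)                                # -> 15 with these illustrative numbers
```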
19. The method for implementing a floating point operation circuit for real matrix inversion of claim 18, wherein: the RAM write data port of the RAM buffer unit is connected to the matrix row data output by the multiplexer 3MUX1; the RAM write signal port of the RAM buffer unit is gated among '1', '0' and the splice completion signals VO and VZ by a four-select-one multiplexer: the state control unit gates '1' in the 'input' state, '0' in the 'output' state, the VO signal in the 'row normalization' state, and the VZ signal in the 'column zeroing' state.
20. The method for implementing a floating point operation circuit for real matrix inversion of claim 19, wherein: the RAM read data port of the RAM buffer unit is connected to the input matrix row data of the one-to-three demultiplexer 1MUX3; the RAM read address port of the RAM buffer unit is connected to the output value of the zeroing row counter; the RAM buffer unit is further gated by a two-select-one multiplexer: the state control unit gates '0' when in the 'input' state and '1' when not in the 'input' state.
21. The method for implementing a floating point operation circuit for real matrix inversion of claim 20, wherein: the input ports of the shared multiplier array unit include: the shared multiplier array input interfaces A and B for the outputs of the row normalization unit, the shared multiplier array input interfaces C and D for the outputs of the column zeroing unit, and the state of the state control unit.
22. The method for implementing a floating point operation circuit for real matrix inversion of claim 21, wherein: interface A and interface C, and interface B and interface D, are each gated by a two-select-one multiplexer and then connected to the two input ports of the shared multiplier array; when the state of the state control unit is 'row normalization', the two multiplexers gate interface A and interface B respectively, and when the state of the state control unit is 'column zeroing', the two multiplexers gate interface C and interface D respectively.
23. The method for implementing a floating point operation circuit for real matrix inversion of claim 22, wherein: the output ports of the shared multiplier array unit include: the shared multiplier array output interface R output to the row normalization unit and the shared multiplier array output interface O output to the column zeroing unit; the output interface of the shared multiplier array is connected to interface R and interface O through a one-to-two demultiplexer; when the state of the state control unit is 'row normalization', the demultiplexer gates the output to interface R, and when the state of the state control unit is 'column zeroing', the demultiplexer gates the output to interface O.
24. The method for implementing a floating point operation circuit for real matrix inversion of claim 1, wherein: the matrix input to the matrix inversion circuit is a 40×40 double-precision floating-point real square matrix, and the RAM buffer unit in the matrix inversion circuit is configured with a 40×40 RAM block for storing the input matrix data updated during the computation; division, multiplication and subtraction operation cores for double-precision floating-point numbers supporting pipelined operation are configured, with operation delays of 10, 5 and 7 clocks respectively; to process each matrix row in 2 beats in parallel, 1 division core, 20 multiplication cores and 20 subtraction cores are instantiated.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201910254674.7A CN110162742B (en) | 2019-03-31 | 2019-03-31 | Floating point operation circuit implementation method for real matrix inversion |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN110162742A CN110162742A (en) | 2019-08-23 |
| CN110162742B true CN110162742B (en) | 2023-09-15 |
Family
ID=67638375
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201910254674.7A Active CN110162742B (en) | 2019-03-31 | 2019-03-31 | Floating point operation circuit implementation method for real matrix inversion |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN110162742B (en) |
Families Citing this family (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110989970B (en) * | 2019-11-27 | 2023-04-11 | 广州海格通信集团股份有限公司 | Double-precision floating-point matrix operation processor and method |
| CN111199017B (en) * | 2020-01-06 | 2023-02-28 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Method for realizing multi-functional inverse IP core of hermitian symmetric matrix |
| CN111291320B (en) * | 2020-01-16 | 2023-12-15 | 西安电子科技大学 | Double-precision floating-point complex matrix operation optimization method based on HXDSP chip |
| CN111858169B (en) * | 2020-07-10 | 2023-07-25 | 山东云海国创云计算装备产业创新中心有限公司 | A data recovery method, system and related components |
| CN115169271A (en) * | 2022-06-23 | 2022-10-11 | 中国科学院电工研究所 | Verilog-based reusable real part non-singular complex matrix inversion module design method |
| CN115310035B (en) * | 2022-08-08 | 2025-07-25 | 昆仑芯(北京)科技有限公司 | Data processing method, device, electronic equipment, medium and chip |
| CN116662730B (en) * | 2023-08-02 | 2023-10-20 | 之江实验室 | Cholesky decomposition calculation acceleration system based on FPGA |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8543633B2 (en) * | 2010-09-24 | 2013-09-24 | Lockheed Martin Corporation | Modified Gram-Schmidt core implemented in a single field programmable gate array architecture |
-
2019
- 2019-03-31 CN CN201910254674.7A patent/CN110162742B/en active Active
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2014133864A2 (en) * | 2013-02-28 | 2014-09-04 | Caterpillar Inc. | A machine communication system and a communication unit |
| CN103678257A (en) * | 2013-12-20 | 2014-03-26 | 上海交通大学 | Positive definite matrix floating point inversion device based on FPGA and inversion method thereof |
| CN105426345A (en) * | 2015-12-25 | 2016-03-23 | 南京大学 | Matrix inverse operation method |
Non-Patent Citations (2)
| Title |
|---|
| An approach to determining nearshore bathymetry using remotely sensed ocean surface dynamics; Shubhra K. Misra et al.; Coastal Engineering; 2003-12-31; No. 47; full text * |
| Inverse of a tridiagonal matrix; Ran Ruisheng et al.; Journal of Harbin Institute of Technology; 2006-05-31; Vol. 38, No. 5; full text * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN110162742A (en) | 2019-08-23 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN110162742B (en) | Floating point operation circuit implementation method for real matrix inversion | |
| CN110050256A (en) | Block floating point for neural network implementation | |
| Lin et al. | K-means implementation on FPGA for high-dimensional data using triangle inequality | |
| Li et al. | VBSF: a new storage format for SIMD sparse matrix–vector multiplication on modern processors | |
| EP4318275A1 (en) | Matrix multiplier and method for controlling matrix multiplier | |
| Liu et al. | Application-specific instruction set processor for SoC implementation of modern signal processing algorithms | |
| Sun et al. | An I/O bandwidth-sensitive sparse matrix-vector multiplication engine on FPGAs | |
| CN118760651A (en) | A sparse on-chip training hardware accelerator architecture and implementation method thereof | |
| CN114047903B (en) | A mixed precision arithmetic unit for data stream driven reconfigurable array | |
| Im et al. | Lutein: Dense-sparse bit-slice architecture with radix-4 lut-based slice-tensor processing units | |
| Ebeling et al. | RaPiD-a configurable computing architecture for compute-intensive applications | |
| Zhang et al. | Mixed-precision block incomplete sparse approximate preconditioner on Tensor core | |
| Park et al. | FIGLUT: An Energy-Efficient Accelerator Design for FP-INT GEMM Using Look-Up Tables | |
| CN118132006A (en) | Base-4 number theory transformation acceleration structure capable of being configured during operation | |
| Chen et al. | Edge FPGA-based onsite neural network training | |
| Tai et al. | Scalable matrix decompositions with multiple cores on FPGAs | |
| Pang et al. | A self-timed ICT chip for image coding | |
| Huang et al. | An efficient parallel processor for dense tensor computation | |
| Zhao | Matrix inversion on a many-core platform | |
| Park et al. | Rare Computing: Removing Redundant Multiplications From Sparse and Repetitive Data in Deep Neural Networks | |
| Wang et al. | Design and implementation of high performance matrix inversion based on reconfigurable processor | |
| Chen et al. | Implementation of vector floating-point processing unit on FPGAs for high performance computing | |
| Shrawankar et al. | Vedic Mathematics Approach to Speedup Parallel Computations | |
| Zhou et al. | Fixed-point simulation technology for SAR real-time imaging system | |
| Wanhammar et al. | Implementation of Digital Filters |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |