Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a chip structure and a size conversion accelerator thereof, so as to improve the data read-write and caching of the chip and accelerate size conversion computation.
According to one aspect of the present invention, there is provided a size conversion accelerator comprising:
a control module comprising an interface register list and an output feature map coordinate control module, wherein the interface register list comprises a first register for configuring the scaling ratio of a feature map in a first direction and a second register for configuring the scaling ratio of the feature map in a second direction, the first direction being perpendicular to the second direction, and the output feature map coordinate control module is used for traversing the coordinates of the output feature map and generating the calculation parameters of the size conversion from those coordinates based on the interface register list;
a feature map input control module for connecting to the on-chip bus to acquire image data of the input feature map;
a feature map multiplexing control module for caching the image data of the input feature map and controlling the writing and reading of the cached image data;
a size conversion multiply-add array for multiply-add computation, in a pipelined manner, of the image data of the input feature map read by the feature map multiplexing control module and the calculation parameters of the size conversion, so as to convert the image data of the input feature map into the image data of the output feature map according to those calculation parameters; and
a feature map output control module for connecting to the on-chip bus to send the image data of the output feature map to the on-chip bus.
In some embodiments of the application, the output feature map coordinate control module includes a first output feature map coordinate control module and a second output feature map coordinate control module to traverse coordinates of the first direction and the second direction of the output feature map, respectively.
In some embodiments of the present application, the feature map multiplexing control module includes:
an input feature map caching module comprising a plurality of pairs of caching units, each pair comprising a first caching unit and a second caching unit, so as to respectively cache the two lines of input feature map data used to calculate one line of the current output feature map.
In some embodiments of the application, the feature map input control module comprises a direct memory access module, the feature map input control module being configured to:
judge, along a first traversal direction, whether the lines of the input feature map to be cached by the first caching units/second caching units of the pairs of caching units of the input feature map caching module are consecutively identical;
if so, start the direct memory access module to read the corresponding line data of the input feature map once and cache it into the consecutively identical first caching units/second caching units.
In some embodiments of the application, the feature map input control module comprises a direct memory access module, the feature map input control module being configured to:
judge whether a line of the input feature map to be cached by a first caching unit of the pairs of caching units of the input feature map caching module is identical to a line of the input feature map to be cached by one or more second caching units;
if so, start the direct memory access module to read that line of the input feature map once and cache it into the corresponding first caching unit and second caching units.
In some embodiments of the present application, the input feature map caching module includes a first input feature map caching module and a second input feature map caching module, each of the first input feature map caching module and the second input feature map caching module includes a plurality of pairs of the caching units, and the first input feature map caching module and/or the second input feature map caching module are configured to:
determine whether a line of the input feature map to be cached in the pairs of caching units of the present input feature map caching module already appears among the lines of the input feature map cached in the pairs of caching units of the other input feature map caching module;
if so, directly read that line of the input feature map from the pairs of caching units of the other input feature map caching module.
In some embodiments of the present application, the first input feature map caching module and the second input feature map caching module each comprise a first data reading interface and a second data reading interface, wherein the first data reading interface performs data transmission with the feature map input control module and the feature map output control module respectively, and the second data reading interface is used for data transmission between the first input feature map caching module and the second input feature map caching module.
In some embodiments of the present application, the first caching unit and the second caching unit each comprise an odd-column caching unit and an even-column caching unit, wherein:
in a first quantization mode, each column of the input feature map has a data bit width of N bits, the interface bit widths of the odd-column and even-column caching units are 2N bits, the odd-column caching unit caches the concatenated data of two adjacent odd columns of the input feature map, and the even-column caching unit caches the concatenated data of two adjacent even columns of the input feature map;
in a second quantization mode, each column of the input feature map has a data bit width of 2N bits, the interface bit widths of the odd-column and even-column caching units are 2N bits, the odd-column caching unit caches the data of one odd column of the input feature map, and the even-column caching unit caches the data of one even column of the input feature map,
wherein N is an integer greater than 0.
In some embodiments of the present application, the control module is further configured to slice the input feature map and the output feature map, the slicing direction being the row direction of the feature map.
According to still another aspect of the present application, there is also provided a chip structure comprising:
a size conversion accelerator as described above;
an on-chip bus; and
an on-chip processor for configuring the interface register list of the size conversion accelerator.
Compared with the prior art, the invention has the advantages that:
1) The control module of the size conversion accelerator traverses the coordinates of the output feature map and generates the size conversion calculation parameters from those coordinates based on the interface register list, so that the calculation parameters need not be supplied externally and data read-write steps are reduced; meanwhile, the interface registers provide scaling parameter settings in two directions, adapting to different scaling ratios;
2) The feature map input control module and the feature map multiplexing control module cache the image data of the input feature map and control the writing and reading of the cached data, making it convenient to exploit data multiplexing across multiple dimensions of the feature map and further reducing the time consumed by data reads;
Therefore, the application accelerates size conversion computation by improving the data read-write and caching of the chip.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein, but rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the exemplary embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
The application is mainly directed to the inference computation of the size transformation (resize) operator in neural networks. It focuses on efficient multiplexing of the original to-be-scaled data cached on chip, providing high-performance computation through a multi-dimensional multiplexing approach combined with ping-pong caching and pipelining. Corresponding on-chip parallel computing resources can be matched to different DRAM (Dynamic Random Access Memory) bandwidths, so that the accelerator's DRAM interface bandwidth and computing power can be flexibly adapted at the chip design stage. Based on actual measurements, the accelerator implemented on an FPGA improves the computational performance of this operator by roughly 100 times over a directly-used x86-architecture CPU at common image sizes.
Referring first to fig. 1, fig. 1 shows a block diagram of a chip architecture according to an embodiment of the application. The chip architecture includes a size conversion accelerator 141, an on-chip bus system 120, and an on-chip processor (general purpose processor) 110. Specifically, the size conversion accelerator 141 may be provided within the deep learning algorithm accelerator 140 as one of a plurality of accelerating IP cores (intellectual property cores) of the deep learning algorithm accelerator 140. The on-chip bus system 120 may be, for example, an AXI (Advanced eXtensible Interface) bus system, and the application is not limited in this regard. The on-chip processor 110 may configure the feature parameters of the size conversion accelerator 141 according to the requirements of the deep learning network model. Also shown in fig. 1 is an off-chip DRAM access interface 130, which is used to read from and write to off-chip memory.
The structure of the size conversion accelerator can be seen in fig. 2. The size conversion accelerator includes a control module 210, a feature map input control module 220, a feature map multiplexing control module 230, a size conversion multiply-add array 240, and a feature map output control module 250.
The control module 210 includes an interface register list and an output feature map coordinate control module. The interface register list includes a first register to configure the feature map's first-direction (e.g., row direction) scaling ratio and a second register to configure the feature map's second-direction (e.g., column direction) scaling ratio, the first direction being perpendicular to the second direction. Specifically, the first register and the second register are 32-bit interface registers storing the size scaling parameters. The application may also provide registers other than the first and second registers to store other parameters, thereby extending or reducing the configuration functions; the application is not limited in this regard.
The output feature map coordinate control module is used for traversing the coordinates of the output feature map and generating calculation parameters of size transformation based on the interface register list according to the coordinates of the output feature map. Specifically, the output feature map coordinate control module may include a first output feature map coordinate control module and a second output feature map coordinate control module to traverse coordinates of the first direction and the second direction of the output feature map, respectively.
The feature map input control module 220 is configured to connect with the on-chip bus 120 to obtain image data of an input feature map.
The feature map multiplexing control module 230 is configured to buffer the image data of the input feature map, and control writing and reading of the buffered image data of the input feature map.
The size conversion multiply-add array 240 multiplies and adds the image data of the input feature map read by the feature map multiplexing control module and the calculation parameters of the size conversion in a pipeline manner, so as to convert the image data of the input feature map into the image data of the output feature map according to the calculation parameters of the size conversion.
The feature map output control module 250 is configured to connect with the on-chip bus 120 to send the image data of the output feature map to the on-chip bus.
In the above chip structure and its size conversion accelerator, the control module of the size conversion accelerator traverses the coordinates of the output feature map and generates the size conversion calculation parameters from those coordinates based on the interface register list, so that the calculation parameters need not be provided externally and data read-write steps are reduced; meanwhile, the interface registers provide scaling parameter settings in two directions so as to adapt to different scaling ratios.
Specifically, the control module 210 may be further configured to slice the input feature map and the output feature map, the slicing direction being the row direction of the feature map. That is, the control module 210 slices the output feature map along the row direction; the column direction is not sliced. Calculating n complete lines of output feature map data per slice then requires preparing 2n complete lines of input feature map data (n being an integer greater than 0). While the slices are calculated in turn, the application can automatically analyze the scaling characteristics and automatically enable multiplexing of the input feature map data across multiple dimensions.
Specifically, the control module 210 may control the calculation of each slice of the feature map, and accordingly control the direct memory access (DMA) module of the feature map input control module 220 and the feature map multiplexing control module 230, supplying the corresponding start addresses for fetching the input feature map from, and writing the output feature map to, off-chip DRAM.
In particular, the application applies to scenarios in which a bilinear interpolation algorithm implements the size conversion. In bilinear interpolation, each pixel of the output feature map requires its own set of interpolation parameters, and the larger the output feature map, the larger this parameter data becomes. The application traverses the row and column coordinates of the output feature map through the first and second output feature map coordinate control modules respectively, so that the corresponding calculation parameters can be generated in real time while traversing the output coordinates; these parameters enter the size conversion multiply-add array 240 in a pipelined manner, matched in real time against the input feature map data entering the array. The size conversion accelerator therefore no longer needs to obtain calculation parameters from the on-chip bus interface.
In other words, the coefficients directly multiplied with the to-be-scaled image data in the size conversion multiply-add array 240 are generated automatically on chip by the first and second output feature map coordinate control modules, rather than being provided externally to the accelerator. Compared with schemes in which an upper computer writes externally calculated coefficients into the accelerator, this avoids, on the one hand, writing coefficients in advance and the resulting limitation that the accelerator works only at a fixed scaling ratio (once the ratio changes, the coefficients must be rewritten, so the accelerator cannot adapt in real time to rapid switching of scene tasks); on the other hand, it avoids a large-area Static Random-Access Memory (SRAM) for storing the calculated coefficients and thus excessive consumption of on-chip area.
In addition, with this approach, accelerated computation requires only the original to-be-scaled feature map data and the row-direction and column-direction scaling ratios. The scaling ratio of each direction can be configured in the interface register list.
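As a purely illustrative host-side sketch (the register offsets, the mapped base pointer and the Q16.16 fixed-point encoding below are assumptions, not the accelerator's actual register layout), configuring the two scaling registers could look like this:

```c
#include <stdint.h>

/* Hypothetical offsets of the two 32-bit interface registers. */
#define RESIZE_REG_SCALE_DIR1  0x00u  /* first register: first-direction scaling ratio  */
#define RESIZE_REG_SCALE_DIR2  0x04u  /* second register: second-direction scaling ratio */

static volatile uint32_t *resize_regs;  /* mapped interface register base (assumed) */

/* Assumed Q16.16 fixed-point encoding of a scaling ratio. */
static uint32_t to_q16_16(double ratio)
{
    return (uint32_t)(ratio * 65536.0 + 0.5);
}

/* ratio = input size / output size, so a 2.6x magnification uses 1/2.6. */
void configure_resize_scaling(double ratio_dir1, double ratio_dir2)
{
    resize_regs[RESIZE_REG_SCALE_DIR1 / 4] = to_q16_16(ratio_dir1);
    resize_regs[RESIZE_REG_SCALE_DIR2 / 4] = to_q16_16(ratio_dir2);
}
```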
Referring now to fig. 3, fig. 3 shows a schematic diagram of an input/output feature map according to an embodiment of the present application. The horizontal direction of the feature map is defined as the feature map column direction, with columns numbered from 0 so that even-numbered columns (0, 2, ...) and odd-numbered columns (1, 3, ...) alternate. The vertical direction of the feature map is defined as the feature map row direction. The following embodiments are described with this definition of the row and column directions.
Specifically, the feature map row coordinate calculation of the bilinear interpolation algorithm follows common open-source code, and proceeds as follows.
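The original listing is not reproduced in the text; the following C sketch reconstructs the standard pixel-center-aligned mapping from the variable descriptions below (the names src_h, scale_y, dst_index, src_y_0 and src_y_1 follow those descriptions, while the exact clamping order is an assumption chosen to match the sequences shown later):

```c
#include <math.h>

/* For one output row dst_index, compute the two input rows src_y_0 and
 * src_y_1 that bilinear interpolation needs. src_h is the maximum row
 * coordinate of the input feature map; scale_y = input rows / output rows. */
void map_dst_row_to_src_rows(int dst_index, float scale_y, int src_h,
                             int *src_y_0, int *src_y_1)
{
    float src_y = ((float)dst_index + 0.5f) * scale_y - 0.5f; /* align pixel centers */
    int y0 = (int)floorf(src_y);
    if (y0 < 0) y0 = 0;            /* clamp at the top edge */
    int y1 = y0 + 1;
    if (y1 > src_h) y1 = src_h;    /* clamp at the bottom edge, where  */
    if (y0 > src_h) y0 = src_h;    /* src_y_0 and src_y_1 become equal */
    *src_y_0 = y0;
    *src_y_1 = y1;
}
```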
This calculation maps row coordinates between the input feature map and the output feature map, showing the relationship between the row-direction coordinates of the output feature map and those of the input feature map.
Here src_h denotes the maximum row-direction coordinate of the input feature map, dst_h the maximum row-direction coordinate of the output feature map, scale_y the row-direction ratio of the input feature map to the output feature map, and dst_index the row coordinate of the output feature map. For image magnification, scale_y is less than 1. The resulting src_y_0 and src_y_1 are the two rows of the input feature map that must be provided to complete the calculation of output row dst_index. Typically src_y_1 is larger than src_y_0 by 1, but the two may be equal at the image edge.
As this analysis shows, calculating each line of the output feature map requires two lines of data from the input feature map, and the required lines can be obtained from the calculation above. Accordingly, the feature map multiplexing control module 230 of the present application may include input feature map caching modules 233, 234. The input feature map caching modules 233, 234 include a plurality of pairs of caching units, each pair comprising a first caching unit 235A and a second caching unit 235B so as to respectively cache the two lines of input feature map data used to calculate one line of the current output feature map.
Further, referring to fig. 4, fig. 4 shows a schematic diagram of an input feature map caching module according to an embodiment of the present invention. In this embodiment, the input feature map caching module includes a first input feature map caching module 233 and a second input feature map caching module 234.
The first input feature map caching module 233 and the second input feature map caching module 234 may have the same structure and may each be SRAM, so that they consume the same SRAM resources; this may also be referred to as a ping-pong cache architecture.
As in fig. 4, to calculate one line of the output feature map, the two required lines of the input feature map, src_y_0 and src_y_1, are stored in the first caching unit 235A and the second caching unit 235B, respectively. The first caching unit 235A and the second caching unit 235B serve as SRAM basic units and may be built from the SRAM IP library features provided by a chip manufacturer, or from the minimum-size single-chip SRAM provided by an FPGA manufacturer.
The input feature map is cached in the first input feature map caching module 233 or the second input feature map caching module 234. Each of them stores at most the input feature lines needed to calculate n lines of the output feature map, so together the two modules cache 2n×2 lines of input feature data. In the architecture design, the value of n can be chosen by weighing the design performance requirements against chip area and cost; n represents the parallel computing acceleration capability on the output feature map.
The following continues with n=6.
Suppose an input feature map is magnified by a non-integer factor, say 2.6 times in h, i.e., the row direction.
For the first 12 rows of the output feature map, the foregoing calculation yields the following src_y_0 and src_y_1 for each row:
dst_index:0,1,2,3,4,5,6,7,8,9,10,11
src_y_0:0,0,0,0,1,1,2,2,2,3,3,3
src_y_1:1,1,1,1,2,2,3,3,3,4,4,4
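For reference, calling the mapping sketched earlier in a loop reproduces this table (assuming scale_y = 1/2.6 and an input tall enough that the bottom edge never clamps):

```c
#include <stdio.h>

/* Reuses map_dst_row_to_src_rows() from the earlier sketch. */
int main(void)
{
    for (int dst_index = 0; dst_index < 12; dst_index++) {
        int y0, y1;
        map_dst_row_to_src_rows(dst_index, 1.0f / 2.6f, 999, &y0, &y1);
        printf("dst %2d -> src_y_0 %d, src_y_1 %d\n", dst_index, y0, y1);
    }
    return 0;
}
```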
The application flexibly supports any integer size configuration within the maximum ranges of the input and output feature map sizes; different scaling ratios produce irregular variations of the src_y_0 and src_y_1 sequences. The design starting point, and the most difficult design challenge, of the application is detection logic that, based only on the feature map sizes and scaling ratios configured in the accelerator interface registers, can first automatically generate the dst_index, src_y_0 and src_y_1 sequences, and second automatically analyze the coordinate relationships within any such sequences, thereby realizing on-chip multiplexing of the input feature map.
The various on-chip multiplexing modes of the present application will be described below in connection with fig. 5-8, respectively.
Fig. 5 shows a schematic diagram of a first acceleration dimension of the input feature map caching module according to an embodiment of the present invention. In this embodiment, the feature map input control module includes a direct memory access module. The feature map input control module may be configured to determine, along a first traversal direction, whether the lines of the input feature map to be cached by the first caching units/second caching units of the pairs of caching units of the input feature map caching module 233 are consecutively identical; if so, once the run of identical lines ends (i.e., a different required line is encountered), it starts the direct memory access module to read that line of the input feature map once and cache it into the consecutively identical first caching units/second caching units.
As shown in fig. 5, in the foregoing example (src_y_0: 0,0,0,0,1,1; src_y_1: 1,1,1,1,2,2), it is determined that line 0 of the input feature map is to be stored in the first caching unit of the first pair of caching units of the input feature map caching module 233, and, traversing from that pair along the first traversal direction (downward in fig. 5), that the lines required by the first caching units are identical through the fourth pair. The identical first caching units can be marked, so that the feature map input control module starts the direct memory access module once and, in that single start, reads the required line of the input feature map and caches it into the consecutively identical (marked) first caching units.
In this example, the input feature map rows required by the first four first caching units of the input feature map caching module 233 are the same. With this DMA data multiplexing method, one DMA of one line of data provides the on-chip caching of four lines of calculation data, improving DMA efficiency by 400%. A typical comparable accelerator would repeatedly read the same row 0 from DDR four times, greatly degrading bus and DMA efficiency.
This multiplexing method applies within each input feature map caching module 233/234, both among the first caching units and among the second caching units. Likewise, the dashed arrow in fig. 5 indicates a second DMA start using the same multiplexing.
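A behavioral sketch of this first reuse dimension follows (C pseudocode of what the hardware detection logic does; dma_read_row and broadcast_to_units are hypothetical stand-ins for the DMA start and the broadcast write):

```c
#include <stdio.h>

void dma_read_row(int src_row)               { printf("DMA: input row %d\n", src_row); }
void broadcast_to_units(int first, int last) { printf("  -> cache units %d..%d\n", first, last); }

/* One DMA per run of identical required input rows: src_y_0 = {0,0,0,0,1,1}
 * needs only two DMA starts instead of six. The same scan applies to the
 * second caching units using src_y_1. */
void reuse_dimension_1(const int *required_rows, int n)
{
    int i = 0;
    while (i < n) {
        int j = i;
        while (j + 1 < n && required_rows[j + 1] == required_rows[i])
            j++;                         /* extend the run of identical rows */
        dma_read_row(required_rows[i]);  /* a single DMA serves the whole run */
        broadcast_to_units(i, j);
        i = j + 1;
    }
}
```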
Referring now to fig. 6, fig. 6 is a schematic diagram illustrating a second acceleration dimension of an input feature map caching module according to an embodiment of the present invention.
In this embodiment, the feature map input control module includes a direct memory access module, and is configured to determine whether a line of the input feature map to be cached by a first caching unit of the pairs of caching units of the input feature map caching module is identical to a line to be cached by one or more second caching units; if so, it starts the direct memory access module to read that line of the input feature map once and cache it into the corresponding first caching unit and second caching units.
Combined with the first multiplexing mode: if, after applying the first mode, the detection logic finds that the input feature line of the current pending DMA resides in a first caching unit and the same line is also required by several second caching units, the search of the second caching units continues along the second traversal direction until the last matching second caching unit (that of the minimum coordinate line) is detected, and only then is the DMA of that input feature line started.
As shown in fig. 6, input feature line 1 is required by the first caching units corresponding to output feature lines 4 and 5, and simultaneously by the second caching units corresponding to output feature lines 3, 2, 1, 0. When detection of the last second caching unit along the arrow direction completes, the DMA of input feature line 1 is started and the data is written into the SRAMs of those first and second caching units simultaneously. One line of data fills six lines' worth of required caching, improving efficiency by 600%.
This multiplexing method is performed between the first caching units and the second caching units within each input feature map caching module 233/234.
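A behavioral sketch of this second dimension (shown standalone for clarity, although in hardware it combines with the first; the printed actions stand in for the DMA start and the simultaneous SRAM writes):

```c
#include <stdio.h>

/* For each input row destined for a first caching unit, also mark every
 * second caching unit of the same slice that needs the same row, so that a
 * single DMA write fills both; e.g. with src_y_0 = {0,0,0,0,1,1} and
 * src_y_1 = {1,1,1,1,2,2}, input row 1 fills first units 4,5 and second
 * units 0..3 in one DMA. */
void reuse_dimension_2(const int *src_y_0, const int *src_y_1, int n)
{
    for (int i = 0; i < n; i++) {
        if (i > 0 && src_y_0[i] == src_y_0[i - 1])
            continue;                      /* run already handled (dimension 1) */
        printf("DMA: input row %d -> first unit %d", src_y_0[i], i);
        for (int k = 0; k < n; k++)
            if (src_y_1[k] == src_y_0[i])  /* same row needed by a second unit */
                printf(", second unit %d", k);
        printf("\n");
    }
}
```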
Referring now to fig. 7, fig. 7 is a schematic diagram illustrating a third acceleration dimension of an input feature map caching module according to an embodiment of the present invention. In this embodiment, the input feature map caching module includes a first input feature map caching module 233 and a second input feature map caching module 234, each comprising a plurality of pairs of the caching units. The first input feature map caching module 233 and/or the second input feature map caching module 234 are configured to determine whether a line of the input feature map to be cached in their own pairs of caching units already appears among the lines cached in the pairs of caching units of the other input feature map caching module; if so, that line is read directly from the pairs of caching units of the other input feature map caching module.
This multiplexing mode exploits the repeated-data patterns that can commonly occur between the first input feature map caching module 233 and the second input feature map caching module 234: when checking a first caching unit of either module, if the required input feature line is still held in a second caching unit of the other module, that line is fetched directly through the second data reading interface of the other module, no DDR DMA operation is performed, and interface bandwidth consumption is reduced.
Further, the SRAM arrays in the first input feature map caching module 233 and the second input feature map caching module 234 can be implemented as true dual port SRAM (True Dual Port RAM), so that a single SRAM can support both read interfaces working independently. This ensures that direct data movement between the first input feature map caching module 233 and the second input feature map caching module 234 does not interfere with the next-stage multiply-add array module's use of their first data reading interfaces.
As shown in fig. 7, input feature line 2 is held in the second caching unit corresponding to output feature line 5 and is also required by the first caching units corresponding to output feature lines 6, 7 and 8. When detection of the last first caching unit along the arrow direction completes, an on-chip read-write of input feature line 2 is started: the line is read from the first input feature map caching module 233 and written simultaneously into the SRAMs of the corresponding first caching units of the second input feature map caching module 234. One line of data fills three lines' worth of required caching, improving efficiency by 300%, with no DDR interface access at all.
Thus, the third multiplexing method is implemented between the first input feature map buffer module 233 and the second input feature map buffer module 234.
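A behavioral sketch of this third dimension (the array of cached row indices and the copy call are hypothetical stand-ins for the detection logic and the dual-port SRAM transfer):

```c
#include <stdbool.h>
#include <stdio.h>

/* If the row needed by the current ping-pong caching module is still held in
 * the other module, move it on chip through the other module's second data
 * reading interface instead of issuing a DDR DMA. */
bool try_onchip_copy(int needed_row, const int *other_module_rows, int count)
{
    for (int k = 0; k < count; k++) {
        if (other_module_rows[k] == needed_row) {
            printf("on-chip copy: input row %d (no DDR access)\n", needed_row);
            return true;
        }
    }
    return false;  /* not cached in the other module: fall back to off-chip DMA */
}
```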
Referring now to fig. 8, fig. 8 is a schematic diagram illustrating a fourth acceleration dimension of an input feature map caching module according to an embodiment of the present invention. In this embodiment, the first caching unit and the second caching unit each include an odd-column caching unit (e.g., an odd-column SRAM) and an even-column caching unit (e.g., an even-column SRAM). In a first quantization mode, each column of the input feature map is N bits wide, the interface bit width of the odd-column and even-column caching units is 2N bits, the odd-column caching unit caches the concatenated data of two adjacent odd columns of the input image, and the even-column caching unit caches the concatenated data of two adjacent even columns. In a second quantization mode, each column of the input feature map is 2N bits wide, the interface bit width of the odd-column and even-column caching units is 2N bits, the odd-column caching unit caches the data of one odd column, and the even-column caching unit caches the data of one even column, where N is an integer greater than 0.
As shown in fig. 8, in either the first input feature map caching module 233 or the second input feature map caching module 234, for the group of caching units corresponding to any one output feature line, the SRAM cache in each first caching unit or second caching unit is composed of two basic SRAM units. One row of input feature data is stored into the odd-column SRAM and the even-column SRAM according to the odd/even column numbers. This embodiment supports both the first quantization mode (e.g., 8-bit quantization) and the second quantization mode (e.g., 16-bit quantization), so the read-write data interface of the SRAM basic unit may be 16 bits wide.
Based on the parity SRAM storage architecture of fig. 8, the following tables show how one line of feature map data is stored at the parity SRAM addresses in the two scenarios of 8-bit and 16-bit quantized data, respectively:
Addr (address) | Even-column SRAM | Odd-column SRAM
0              | 0, 2             | 1, 3
1              | 4, 6             | 5, 7
2              | 8, 10            | 9, 11
3              | 12, 14           | 13, 15
4              | 16, 18           | 17, 19
5              | 20, 22           | 21, 23
6              | 24, 26           | 25, 27
7              | 28, 30           | 29, 31
(8-bit quantized data)
Addr (address) | Even-column SRAM | Odd-column SRAM
0              | 0                | 1
1              | 2                | 3
2              | 4                | 5
3              | 6                | 7
4              | 8                | 9
5              | 10               | 11
6              | 12               | 13
7              | 14               | 15
(16-bit quantized data)
In the tables above, in the 8-bit quantization mode each column of data is 8 bits wide: the data of columns 0 and 2 are concatenated into one 16-bit word stored at even-column SRAM address 0, and the data of columns 1 and 3 are concatenated into one 16-bit word stored at odd-column SRAM address 0. Following this rule, each time the address increases by 1, the four stored column coordinates each increase by 4 relative to the 0, 2, 1, 3 pattern. For example, address 1 stores the concatenated data of even columns 4 and 6, and the concatenated data of odd columns 5 and 7.
In the 16-bit quantization mode, each column of data is 16 bits wide. For the same group of parity SRAM arrays, starting from address 0, each address increment of 1 increases the stored even-column coordinate by 2 and the stored odd-column coordinate by 2. For example, address 0 stores even column 0 and odd column 1; address 1 stores even column 2 and odd column 3.
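The address mapping implied by the two tables can be summarized in a small sketch (the assignment of the two packed 8-bit columns to byte lanes within a 16-bit word is an assumption):

```c
typedef struct {
    int odd_sram;  /* 1: odd-column SRAM, 0: even-column SRAM          */
    int addr;      /* SRAM address holding this column                 */
    int lane;      /* which 8-bit half of the 16-bit word (8-bit mode) */
} sram_loc_t;

/* Maps a column index to its parity-SRAM location per the tables above. */
sram_loc_t locate_column(int col, int mode_16bit)
{
    sram_loc_t loc;
    loc.odd_sram = col & 1;          /* column parity selects the SRAM       */
    if (mode_16bit) {
        loc.addr = col >> 1;         /* one 16-bit column per address        */
        loc.lane = 0;
    } else {
        loc.addr = col >> 2;         /* two 8-bit columns packed per address */
        loc.lane = (col >> 1) & 1;   /* e.g. columns 0,2 -> lanes 0,1        */
    }
    return loc;
}
```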
Further, the depth of the odd/even-column SRAMs determines the maximum cache size of the input feature map in the column direction; this depth can be defined according to the design targets, chip performance and cost, yielding the total SRAM area consumption.
Thus, the same set of two SRAM arrays can support two quantized data structures.
In general, the coordinates of the two input feature columns required to calculate one output feature column differ by 1, though they may be equal at the image boundaries. The parity cache architecture in this design ensures that any two adjacent parity columns of data can be read out simultaneously in any clock cycle, so that for each pixel column of the output feature map, the two required input feature columns can be read out together every clock cycle, improving the pipeline efficiency of the next-stage multiply-add engine. The parity cache architecture also solves the problem of fetching any two adjacent columns from the SRAM array simultaneously when the column coordinate sequence follows no obvious regular pattern.
For example, suppose the first four columns of a row of the output feature map are numbered 0, 1, 2, 3. If the scenario is image reduction, the input feature columns required by these four output columns might be, in order, 0 and 1, 3 and 4, 8 and 9, 13 and 14; the pair 3 and 4 is the special case, since its two columns fall at different SRAM addresses. Because one SRAM can access only one address per clock cycle, the parity SRAM array of the application ensures that the data required by such a sequence is still provided simultaneously every clock cycle. When columns 3 and 4 are to be fetched in one cycle, the even-column SRAM read address is set to 1 and the odd-column SRAM read address is set to 0; in the next cycle, the data of columns 3 and 4 then appear simultaneously on the data outputs of the odd-column and even-column SRAMs, respectively.
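In terms of the mapping sketch above, this example read is simply:

```c
void read_columns_3_and_4(void)
{
    /* 8-bit mode: columns 3 and 4 straddle an address boundary, yet both are
     * fetched in the same cycle because they live in different SRAMs. */
    sram_loc_t c3 = locate_column(3, 0);  /* -> odd-column SRAM,  address 0 */
    sram_loc_t c4 = locate_column(4, 0);  /* -> even-column SRAM, address 1 */
    (void)c3; (void)c4;
}
```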
Therefore, in this embodiment a pair of caching units requires at least four SRAMs. This realizes the fourth data multiplexing method of the application.
The multiplexing modes can be realized independently or in combination, and the application is not limited by the above.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any adaptations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.