CN114489805B - Processing method, processing device and related products - Google Patents
- Publication number
- CN114489805B (application CN202011272696.5A)
- Authority
- CN
- China
- Prior art keywords
- coordinate space
- space range
- data
- tensor data
- range
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The present disclosure discloses a processing method, a processing apparatus, and related products. The processing apparatus may be implemented so that a computing device is included in a combined processing device, which may also include an interface device and other processing devices. The computing device interacts with the other processing devices to jointly complete a computing operation specified by the user. The combined processing device may further include a storage device connected to the computing device and the other processing devices, respectively, for storing data of the computing device and the other processing devices. The solution of the present disclosure provides an instruction parallelism scheme that can improve the degree of instruction parallelism and thereby the processing efficiency of the machine.
Description
Technical Field
The present disclosure relates to the field of processors, and in particular, to a processing method, a processing apparatus, a chip, and a board.
Background
An instruction system is the interface for interaction between computer software and hardware, and is a critical part of the computer system architecture. With the continuous development of artificial intelligence technology, both the amount of data and the number of data dimensions to be processed keep growing. Therefore, how to reasonably and scientifically control the execution of instructions, and in particular how to improve the degree of instruction parallelism and machine performance, is an important problem in instruction design.
Disclosure of Invention
To address one or more of the technical problems mentioned above, the present disclosure proposes, in various aspects, a solution to enhance instruction parallelism. By the instruction system of the present disclosure, the degree of instruction parallelism can be increased, thereby increasing the processing efficiency of the machine.
In a first aspect, the present disclosure provides a processing method comprising: acquiring a first operation of a decoded instruction; determining a first coordinate space range of tensor data that the first operation is allowed to use; determining a second coordinate space range of the tensor data to be used when performing the first operation; and performing the first operation within a third coordinate space range determined by the intersection of the first coordinate space range and the second coordinate space range, wherein the first coordinate space range and the second coordinate space range are determined based at least in part on a predetermined division of a shape coordinate space of the tensor data.
In a second aspect, the present disclosure provides a processing apparatus comprising an operation acquisition unit configured to acquire a first operation of a decoded instruction, a first determination unit configured to determine a first coordinate space range of tensor data allowed to be used by the first operation, a second determination unit configured to determine a second coordinate space range of the tensor data to be used when the first operation is performed, and an execution unit configured to execute the first operation within a third coordinate space range determined by an intersection of the first coordinate space range and the second coordinate space range, wherein the first coordinate space range and the second coordinate space range are determined based at least in part on a predetermined division of a shape coordinate space of the tensor data.
In a third aspect, the present disclosure provides a chip comprising the processing apparatus of any one of the embodiments of the second aspect described above.
In a fourth aspect, the present disclosure provides a board comprising the chip of any one of the embodiments of the third aspect.
Through the processing apparatus, processing method, chip, and board provided by the embodiments of the present disclosure, the coordinate space range used by an operation is restricted during the execution of the instruction's operations, so that when the hardware executes in parallel, consistency of the execution order is ensured and the degree of operation parallelism is improved, thereby guaranteeing both the accuracy and the efficiency of processing.
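By way of illustration only (not the claimed hardware implementation; all names are the editor's), the intersection of coordinate space ranges described in the first aspect can be sketched as per-dimension clipping of interval bounds:

```python
def intersect_ranges(allowed, requested):
    """Each range is a list of (lo, hi) half-open bounds, one pair per dimension.
    Returns the per-dimension intersection, or None if it is empty."""
    result = []
    for (a_lo, a_hi), (r_lo, r_hi) in zip(allowed, requested):
        lo, hi = max(a_lo, r_lo), min(a_hi, r_hi)
        if lo >= hi:  # the ranges do not overlap in this dimension
            return None
        result.append((lo, hi))
    return result

# First range: what the operation is allowed to use; second: what it wants to
# use; third (the result): where the operation actually executes.
third = intersect_ranges([(0, 8), (0, 8)], [(4, 12), (2, 6)])  # [(4, 8), (2, 6)]
```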
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar or corresponding parts and in which:
FIG. 1A shows a schematic diagram of a data storage space according to an embodiment of the present disclosure;
FIG. 1B shows a schematic diagram of a data chunk in a data storage space, according to an embodiment of the present disclosure;
FIG. 2 shows a schematic block diagram of a processing device according to an embodiment of the present disclosure;
FIG. 3A shows a schematic flow chart of a processing method according to an embodiment of the disclosure;
FIG. 3B shows a schematic block diagram of a processing device according to an embodiment of the present disclosure;
FIGS. 4A-4B show schematic diagrams of coordinate space ranges according to embodiments of the present disclosure;
FIG. 5 shows a block diagram of a combined processing apparatus according to an embodiment of the disclosure, and
Fig. 6 shows a schematic structural diagram of a board according to an embodiment of the disclosure.
Detailed Description
The following description of the embodiments of the present disclosure is made clearly and completely with reference to the accompanying drawings. It is evident that the described embodiments are some, but not all, of the embodiments of the disclosure. Based on the embodiments in this disclosure, all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of protection of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," and the like, as may be used in the claims, specification, and drawings of the present disclosure, are used for distinguishing between different objects and not for describing a particular sequential order. The terms "comprises" and "comprising" when used in the specification and claims of the present disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present disclosure is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the present disclosure and claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the claims, the term "if" may be interpreted as "when", "once", "in response to a determination", or "in response to detecting", depending on the context. Similarly, the phrase "if it is determined" or "if a [described condition or event] is detected" may be interpreted, depending on the context, as meaning "upon determining", "in response to determining", "upon detecting the [described condition or event]", or "in response to detecting the [described condition or event]".
The computer processes various data by executing instructions. To indicate the source of the data, the direction of the result of the operation, and the operation being performed, an instruction typically contains the following information:
(1) An Operation Code (OP) is used to indicate the Operation (e.g., add, subtract, multiply, divide, transfer of data, etc.) to be performed by the instruction, and specifies the nature and function of the Operation. A computer may have tens to hundreds of instructions, each with a corresponding operation code, and the computer may perform different operations by recognizing the operation code.
(2) An operand describing an object of operation of the instruction. The operands may relate to the data type, memory access, addressing mode, etc. of the object being operated on. The operand may directly give the operated on object or indicate a memory address or register address (i.e. register name) of the operated on object.
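As a purely illustrative sketch (the field layout and widths below are assumptions of this example, not the patent's encoding), an instruction word can be split into its operation code and operand fields as follows:

```python
def decode(word):
    """Split a 32-bit instruction word into an opcode and two operand fields."""
    opcode = (word >> 24) & 0xFF  # top 8 bits select the operation to perform
    src = (word >> 12) & 0xFFF    # source operand (e.g. a register name)
    dst = word & 0xFFF            # destination operand
    return opcode, src, dst

word = (0x1A << 24) | (5 << 12) | 9
# decode(word) -> (0x1A, 5, 9)
```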
Instructions of a conventional processor are designed to perform basic single-data scalar operations, i.e., instructions in which each operand is a scalar datum. However, with the development of artificial intelligence technology, in tasks such as image processing and pattern recognition the operands involved tend to be multidimensional data (i.e., tensor data), and scalar operations alone do not allow the hardware to complete the computational task efficiently. Therefore, how to perform multidimensional tensor data processing efficiently is also a problem to be solved in the current computing field.
In an embodiment of the present disclosure, an instruction system is provided in which descriptors are included in the operands of an instruction; through the descriptor, information related to tensor data can be obtained quickly. In particular, the descriptor may indicate at least one of shape information and spatial information of the tensor data. The shape information of the tensor data may be used to determine the data address, in the data storage space, of the tensor data corresponding to the operand. The spatial information of the tensor data may be used to determine dependencies between instructions, which in turn may determine, for example, the order in which instructions are executed. The spatial information may be indicated by a spatial identification (ID). The spatial ID may also be referred to as a spatial alias; it refers to a spatial region for storing the corresponding tensor data, which may be a continuous space or a multi-segment space, and the present disclosure does not limit the specific composition of the spatial region. Different spatial IDs indicate that the spatial regions they point to have no dependency on each other; for example, it can be ensured that no dependency exists by making the spatial regions pointed to by different spatial IDs non-overlapping.
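As an illustrative sketch (not the patent's implementation; names are the editor's), the guarantee that non-overlapping spatial regions carry no dependency can be expressed as an interval-overlap check over the (possibly multi-segment) regions bound to two spatial IDs:

```python
def regions_overlap(r1, r2):
    """Each region is a (start, end) half-open address interval."""
    return max(r1[0], r2[0]) < min(r1[1], r2[1])

def has_dependency(space_a, space_b):
    # A spatial ID may cover several segments; any overlapping pair of
    # segments implies a dependency between instructions using the two IDs.
    return any(regions_overlap(a, b) for a in space_a for b in space_b)

# Disjoint regions -> the instructions may be reordered or run in parallel.
independent = not has_dependency([(0, 100)], [(100, 200)])
```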
Various possible implementations of shape information of tensor data are described in detail below in connection with the figures.
Tensors may take multiple forms of data composition. Tensors may be of different dimensions; e.g., a scalar may be regarded as a 0-dimensional tensor, a vector as a 1-dimensional tensor, and a matrix as a tensor of two or more dimensions. The shape of a tensor includes information such as the number of dimensions of the tensor and the size of each dimension. For example, for the three-dimensional tensor:
X3 = [[[1,2,3],[4,5,6]],[[7,8,9],[10,11,12]]]
the shape of the tensor may be represented as X3 = (2, 2, 3); that is, the three parameters indicate that the tensor is three-dimensional, with the first dimension of size 2, the second dimension of size 2, and the third dimension of size 3. When tensor data is stored in memory, its shape cannot be determined from the data address (or storage region) alone, and related information such as the relationships among multiple pieces of tensor data cannot be determined either, resulting in low efficiency when the processor accesses the tensor data.
In one possible implementation, the shape of tensor data in N dimensions may be indicated with a descriptor, N being a positive integer, e.g., N = 1, 2, or 3, or zero. The descriptor of the three-dimensional tensor in the above example may express its shape as (2, 2, 3). It is noted that the present disclosure does not limit the manner in which descriptors indicate tensor shapes.
In one possible implementation, the value of N may be determined according to the dimension (also called the order) of the tensor data, or may be set according to the usage requirement of the tensor data. For example, when N has a value of 3, the tensor data is three-dimensional tensor data, and the descriptor may be used to indicate the shape (e.g., offset, size, etc.) of the three-dimensional tensor data in three dimensions. It should be understood that the person skilled in the art may set the value of N according to actual needs, which is not limited by the present disclosure.
Although tensor data can be multidimensional, because the layout of the memory is always one-dimensional, there is a correspondence between tensors and storage on the memory. Tensor data is typically allocated in contiguous memory space, i.e., the tensor data is one-dimensionally expanded (e.g., row-first manner) and stored on memory.
This correspondence between tensors and the underlying storage may be represented by the offset of a dimension (offset), the size of a dimension (size), the step size of a dimension (stride), and so on. The offset of a dimension is the offset in that dimension from a reference position. The size of a dimension is the number of elements in that dimension. The step size of a dimension is the spacing between adjacent elements in that dimension; e.g., the strides of the above three-dimensional tensor are (6, 3, 1), i.e., the stride of the first dimension is 6, the stride of the second dimension is 3, and the stride of the third dimension is 1.
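The row-first correspondence and the stride example above can be sketched as follows (a minimal illustration, not the patent's mechanism):

```python
def row_major_strides(shape):
    """Strides of a contiguously stored, row-first tensor of the given shape."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides

def flat_offset(index, strides):
    """One-dimensional offset of a multi-dimensional index."""
    return sum(i * s for i, s in zip(index, strides))

# For the (2, 2, 3) tensor above, the strides are (6, 3, 1) as stated in the text.
strides = row_major_strides([2, 2, 3])  # [6, 3, 1]
```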
FIG. 1A shows a schematic diagram of a data storage space according to an embodiment of the present disclosure. As shown in FIG. 1A, the data storage space 21 stores two-dimensional data in a row-first manner, which can be represented by (X, Y) (where the X axis extends horizontally to the right and the Y axis vertically downward). The size in the X-axis direction (the size of each row, or the total number of columns) is ori_x (not shown in the figure), the size in the Y-axis direction (the total number of rows) is ori_y (not shown in the figure), and the start address PA_start (reference address) of the data storage space 21 is the physical address of the first data block 22. The data block 23 is part of the data in the data storage space 21; its offset 25 in the X-axis direction is denoted offset_x, its offset 24 in the Y-axis direction is denoted offset_y, its size in the X-axis direction is denoted size_x, and its size in the Y-axis direction is denoted size_y.
In one possible implementation, where a descriptor is used to define data block 23, the data reference point of the descriptor may use the first data block of data storage space 21, and the reference address of the descriptor may be agreed upon as the start address pa_start of data storage space 21. The content of the descriptor of the data block 23 can then be determined in combination with the dimension ori_x of the data storage space 21 in the X-axis, the dimension ori_y in the Y-axis, and the offset offset_y of the data block 23 in the Y-axis direction, the offset_x in the X-axis direction, the dimension size_x in the X-axis direction, and the dimension size_y in the Y-axis direction.
In one possible implementation, the following formula (1) may be used to represent the content of the descriptor:

D: { X direction: ori_x, offset_x, size_x; Y direction: ori_y, offset_y, size_y } (1)
it should be appreciated that while in the above examples the content of the descriptor represents a two-dimensional space, one skilled in the art may set the specific dimensions of the content representation of the descriptor according to the actual circumstances, which is not limited by the present disclosure.
In one possible implementation, a reference address of the data reference point of the descriptor in the data storage space may be agreed, and the content of the descriptor of the tensor data is determined based on the reference address according to the positions of at least two vertices located diagonally in the N-dimensional directions with respect to the data reference point.
For example, the reference address PA_base of the data reference point of the descriptor in the data storage space may be agreed upon. For example, one datum (e.g., the datum at position (2, 2)) may be selected in the data storage space 21 as the data reference point, and its physical address in the data storage space set as the reference address PA_base. The content of the descriptor of data block 23 in FIG. 1A may then be determined from the positions of two diagonally located vertices relative to the data reference point. First, the positions of at least two diagonally located vertices of data block 23 relative to the data reference point are determined, for example the vertices in the top-left-to-bottom-right direction, where the relative position of the top-left vertex is (x_min, y_min) and the relative position of the bottom-right vertex is (x_max, y_max); the content of the descriptor of data block 23 can then be determined from the reference address PA_base, the relative position (x_min, y_min) of the top-left vertex, and the relative position (x_max, y_max) of the bottom-right vertex.
In one possible implementation, the content of the descriptor, based on the reference address PA_base, may be represented using the following formula (2):

D: { PA_base; (x_min, y_min); (x_max, y_max) } (2)
it should be appreciated that while the above examples use two diagonally positioned vertices, the upper left corner and the lower right corner, to determine the content of the descriptor, those skilled in the art may set the specific vertices of at least two diagonally positioned vertices as actually needed, and this disclosure is not limited in this regard.
In one possible implementation, the content of the descriptor of the tensor data may be determined according to a reference address of the data reference point of the descriptor in the data storage space, and a mapping relationship between the data description location and the data address of the tensor data indicated by the descriptor. The mapping relationship between the data description location and the data address may be set according to actual needs, for example, when tensor data indicated by the descriptor is three-dimensional space data, the mapping relationship between the data description location and the data address may be defined by using a function f (x, y, z).
In one possible implementation, the following formula (3) may be used to represent the content of the descriptor:

D: { PA_base; f(x, y, z) } (3)
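A minimal sketch of this mapping-relationship form of descriptor content (the names and the example mapping function are illustrative only, not the patent's implementation):

```python
def make_descriptor(pa_base, f):
    """Descriptor content: a reference address plus a mapping f from a data
    description position (x, y, z) to an offset from that address."""
    return {"pa_base": pa_base, "f": f}

def address_of(desc, x, y, z):
    return desc["pa_base"] + desc["f"](x, y, z)

# Example: a contiguous row-first 3-D layout of shape (2, 2, 3).
desc = make_descriptor(0x1000, lambda x, y, z: (x * 2 + y) * 3 + z)
```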
In one possible implementation, the descriptor is further used to indicate an address of the tensor data in the N dimensions, where the content of the descriptor further includes at least one address parameter representing the address of the tensor data; for example, the content of the descriptor may be the following formula (4):

D: { X direction: ori_x, offset_x, size_x; Y direction: ori_y, offset_y, size_y; PA } (4)
Wherein PA is an address parameter. The address parameter may be a logical address or a physical address. When the descriptor is parsed, PA may be taken as any one of a vertex, the midpoint, or a preset point of the tensor shape, and the corresponding data address can be obtained by combining it with the shape parameters in the X and Y directions.
In one possible implementation, the address parameter of the tensor data includes a reference address of the data reference point of the descriptor in the data storage space of the tensor data, the reference address including a start address of the data storage space.
In one possible implementation, the descriptor may further include at least one address parameter representing the address of the tensor data; for example, the content of the descriptor may be the following formula (5):

D: { X direction: ori_x, offset_x, size_x; Y direction: ori_y, offset_y, size_y; PA_start } (5)
Here PA_start is a reference address parameter, which is not described again.
It should be understood that, the mapping relationship between the data description location and the data address may be set by those skilled in the art according to the actual situation, which is not limited by the present disclosure.
In one possible implementation, a common reference address may be agreed upon for a task and used by the descriptors in all instructions of that task, with the descriptor content including shape parameters based on that reference address. This reference address may be determined by setting the environment parameters of the task. The reference address is described and used in the manner set out in the embodiments above. In such an implementation, the content of the descriptor can be mapped to the data address more quickly.
In one possible implementation, the reference address may instead be included in the content of each descriptor, in which case the reference address of each descriptor may differ. Compared with the manner of setting a common reference address through environment parameters, each descriptor in this manner can describe data more flexibly and use a larger data address space.
In one possible implementation, a data address of data corresponding to an operand of a processing instruction in a data storage space may be determined based on the content of the descriptor. The calculation of the data address is automatically completed by hardware, and when the representation modes of the content of the descriptors are different, the calculation methods of the data address are also different. The specific calculation method of the data address is not limited by the present disclosure.
For example, if the content of the descriptor in the operand is expressed by formula (1), the offsets of the tensor data indicated by the descriptor in the data storage space are offset_x and offset_y, and the size is size_x × size_y, then the start data address PA1(x,y) of the tensor data indicated by the descriptor in the data storage space can be determined using the following formula (6):
PA1(x,y) = PA_start + (offset_y - 1) * ori_x + offset_x (6)
The storage region of the tensor data indicated by the descriptor in the data storage space can then be determined from the data start address PA1(x,y) given by formula (6) above, in combination with the sizes size_x and size_y of the storage region.
In one possible implementation, when the operand further includes a data description location for the descriptor, a data address of data corresponding to the operand in the data storage space may be determined according to the content of the descriptor and the data description location. In this way, part of the data (e.g., one or more data) in the tensor data indicated by the descriptor can be processed.
For example, if the content of the descriptor in the operand is expressed by formula (1), the offsets of the tensor data indicated by the descriptor in the data storage space are offset_x and offset_y, the size is size_x × size_y, and the operand further includes the data description position (x_q, y_q) for the descriptor, then the data address PA2(x,y) of the tensor data indicated by the descriptor in the data storage space can be determined using the following formula (7):
PA2(x,y) = PA_start + (offset_y + y_q - 1) * ori_x + (offset_x + x_q) (7)
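Formulas (6) and (7) above can be transcribed directly (a sketch using the text's parameter names, not the hardware's implementation):

```python
def pa1(pa_start, ori_x, offset_x, offset_y):
    # Formula (6): start data address of the tensor data indicated by the descriptor.
    return pa_start + (offset_y - 1) * ori_x + offset_x

def pa2(pa_start, ori_x, offset_x, offset_y, x_q, y_q):
    # Formula (7): address of the datum at data description position (x_q, y_q).
    return pa_start + (offset_y + y_q - 1) * ori_x + (offset_x + x_q)
```

Note that PA2 equals PA1 plus the row-first offset of (x_q, y_q), i.e. y_q * ori_x + x_q.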
In one possible implementation, the descriptor may indicate blocked data. In many applications, data blocking can effectively accelerate computation and improve processing efficiency. For example, in graphics processing, convolution operations often use data blocks for fast arithmetic processing.
FIG. 1B shows a schematic diagram of data partitioning in a data storage space according to an embodiment of the present disclosure. As shown in FIG. 1B, the data storage space 26 also stores two-dimensional data in a row-first manner, which can be represented by (X, Y) (where the X-axis is horizontally to the right and the Y-axis is vertically downward). The dimension in the X-axis direction (the dimension of each row, or the total number of columns) is ori_x (not shown in the figure), and the dimension in the Y-axis direction (the total number of rows) is ori_y (not shown in the figure). Unlike the tensor data of fig. 1A, the tensor data stored in fig. 1B includes a plurality of data blocks.
In this case, the descriptor requires more parameters to represent the data blocks. Taking the X axis (X dimension) as an example, the parameters may include: ori_x; x.tile.size (the size 27 within a block); x.tile.stride (the stride 28 between blocks, i.e., the distance from the first point of the first block to the first point of the second block); x.tile.num (the number of blocks, shown as 3 in FIG. 1B); and x.stride (the overall stride, i.e., the distance from the first point of one row to the first point of the next row). Other dimensions may similarly include corresponding parameters.
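A hedged sketch of the X-dimension blocking parameters named above (only the block size and block stride are used; the helper maps a block-spanning X coordinate to its flat offset within one row):

```python
def tiled_x_offset(x, tile_size, tile_stride):
    """x counts elements across the row as if the blocks were contiguous."""
    tile_index = x // tile_size  # which block (0 .. x.tile.num - 1) x falls in
    within = x % tile_size       # position inside that block
    return tile_index * tile_stride + within

# With blocks of size 4 laid out every 6 elements, element x = 5 (second block,
# position 1) sits at offset 1 * 6 + 1 = 7 within its row.
```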
In one possible implementation, the descriptor may include an identification of the descriptor and/or content of the descriptor. Wherein the identification of the descriptor is used to distinguish the descriptors, e.g. the identification of the descriptor may be a number thereof, and the content of the descriptor may comprise at least one shape parameter representing the shape of the tensor data. For example, the tensor data is 3-dimensional data, in which the shape parameters of two dimensions are fixed among three dimensions of the tensor data, and the content of the descriptor may include the shape parameters representing the other dimension of the tensor data.
In one possible implementation, the data addresses of the data storage space corresponding to each descriptor may be fixed addresses. For example, separate data storage spaces may be partitioned for tensor data, each tensor data having a one-to-one correspondence with a descriptor at a start address of the data storage space. In this case, circuitry or modules responsible for resolving the computing instruction (e.g., entities external to the computing device of the present disclosure) may determine the data address of the data corresponding to the operand in the data storage space from the descriptor.
In one possible implementation, when the data address of the data storage space corresponding to the descriptor is a variable address, the descriptor may further be used to indicate an address of the tensor data in the N dimensions, wherein the content of the descriptor may further include at least one address parameter representing the address of the tensor data. For example, when the tensor data is 3-dimensional and the descriptor points to its address, the content of the descriptor may include one address parameter representing the address of the tensor data, such as its start physical address, or may include several address parameters, such as the start address of the tensor data plus an address offset, or address parameters based on each dimension. The address parameters may be set by those skilled in the art as needed, and the present disclosure is not limited in this regard.
In one possible implementation, the address parameter of the tensor data may include a reference address of a data reference point of the descriptor in a data storage space of the tensor data. Wherein the reference address may be different according to a change of the data reference point. The present disclosure is not limited to the choice of data reference points.
In one possible implementation, the reference address may include the start address of the data storage space. When the data reference point of the descriptor is the first data block of the data storage space, the reference address of the descriptor is the start address of the data storage space. When the data reference point of the descriptor is a data block other than the first one in the data storage space, the reference address of the descriptor is the address of that data block in the data storage space.
In one possible implementation, the shape parameters of the tensor data include at least one of: the size of the data storage space in at least one of the N dimensional directions, the size of the storage region in at least one of the N dimensional directions, the offset of the storage region in at least one of the N dimensional directions, the positions of at least two diagonally located vertices of the N dimensional directions relative to the data reference point, and the mapping relationship between the data description position of the tensor data indicated by the descriptor and the data address. Here, the data description position is the mapped position of a point or region in the tensor data indicated by the descriptor; for example, when the tensor data is 3-dimensional, the descriptor may use three-dimensional space coordinates (x, y, z) to represent the shape of the tensor data, and the data description position of the tensor data may be the position of a point or region in three-dimensional space, represented by the coordinates (x, y, z).
It should be appreciated that the shape parameters representing tensor data may be selected by one of ordinary skill in the art based on the actual circumstances, and the disclosure is not limited in this regard. By using descriptors in the data access process, the association between data can be established, thereby reducing the complexity of data access and improving the instruction processing efficiency.
Fig. 2 shows a schematic block diagram of a processing device according to an embodiment of the disclosure. As shown in fig. 2, the processing device 200 includes a control module 210, an operation module 220, and a storage module 230.
The control module 210 may be configured to control the operation of the processing apparatus 200, such as reading instructions from memory or from outside, decoding the instructions through the decoder 211, and issuing micro-operation control signals to the corresponding components. Specifically, the control module 210 may be configured to control the operation module 220 to perform the corresponding processing according to the received instruction. The instructions may include, but are not limited to, data access instructions, operation instructions, descriptor management instructions, synchronization instructions, and the like. The present disclosure does not limit the specific type of instruction or the specific manner of decoding.
The decoded instruction includes an opcode and an operand. When an instruction involves processing tensor data, at least one operand of the instruction may include at least one descriptor indicating at least one of shape information of the tensor data and spatial information of the tensor data.
The operation module 220 is configured to execute specific instructions or operations under the control of the control module 210. The operation module 220 may include, for example, but not limited to, an Arithmetic Logic Unit (ALU), a memory access unit (memory access unit, MAU), an artificial intelligence operation unit (neural functional unit, NFU), and the like. Two functional units 221 and 222 are schematically shown in fig. 2. The present disclosure is not limited to a particular hardware type of functional unit.
The storage module 230 may be configured to store various information including, but not limited to, instructions, descriptor-associated information, tensor data, and the like. The storage module 230 may include various memory resources including, but not limited to, internal memory and external memory. The internal memory may include, for example, registers, on-chip SRAM, or other media caches. The external memory may include, for example, off-chip memory. The present disclosure is not limited to a particular implementation of the storage module.
Alternatively or additionally, the processing device 200 may further comprise a tensor interface unit (Tensor Interface Unit, TIU) 240. The tensor interface unit 240 may be configured to implement operations associated with the descriptors under control of the control module 210. These operations may include, but are not limited to, registration, modification, de-registration, parsing of descriptors, reading and writing of descriptor content, and the like. The present disclosure is not limited to a particular hardware type of tensor interface unit. In this way, operations associated with descriptors can be implemented by dedicated hardware, further improving the access efficiency of tensor data.
In some embodiments of the present disclosure, tensor interface unit 240 may be configured to parse descriptors included in operands of instructions. For example, the tensor interface unit may parse shape information of tensor data included in the descriptor to determine a data address of data corresponding to the operand in the data storage space.
Although control module 210 and tensor interface unit 240 are shown in fig. 2 as two separate modules, one of ordinary skill in the art will appreciate that these two modules/units may also be implemented as one or more modules, as the disclosure is not limited in this respect.
The processing device 200 may be implemented using a general-purpose processor (e.g., a Central Processing Unit (CPU), a Graphics Processing Unit (GPU)) and/or a special-purpose processor (e.g., an artificial intelligence processor, a scientific computing processor, a digital signal processor, etc.), and the present disclosure is not limited as to the particular type of processing device.
When hardware executes instructions in parallel, a dependency relationship between the instructions may lead to erroneous execution results. For example, if two instructions executed in parallel access the same memory location or the same data, and at least one of them writes to that location, then a dependency exists between the two instructions, such as a read-after-write, write-after-write, or write-after-read dependency. In that case, if the later instruction is executed before the earlier one, execution errors may result. It is then necessary to ensure that these instructions are executed in order, for example by forcing sequential execution, i.e., the later instruction must wait for the earlier instruction to complete before executing.
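The three dependency types above can be illustrated with a small classifier (a hypothetical sketch in Python; the representation of an operation as a (kind, address) pair and the function name are assumptions made for illustration, not part of the disclosure):

```python
def dependency(op1, op2):
    """Classify the dependency between two operations, where op1 issues first.

    Each operation is a (kind, address) pair with kind in {"read", "write"}.
    Returns None when the two operations may safely run in any order.
    """
    k1, a1 = op1
    k2, a2 = op2
    if a1 != a2 or (k1 == "read" and k2 == "read"):
        return None                      # different data, or two reads
    if k1 == "write" and k2 == "read":
        return "read-after-write"
    if k1 == "write" and k2 == "write":
        return "write-after-write"
    return "write-after-read"
```

Only the `None` case permits reordering; the other three require the hardware to preserve program order for the affected data.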
As is clear from the foregoing description, tensor data is typically a multidimensional array with a large amount of data, so instructions processing tensor data typically take longer than those processing scalar data. If tensor data were still processed in the strictly sequential mode described above, the processing time would be too long and the efficiency low. In view of this, the embodiments of the present disclosure provide an instruction parallelism scheme in which the parallel execution of operations is constrained based on the coordinate space ranges of the tensor data used by the operations of the instructions, so that when hardware executes in parallel, consistency of the execution order is ensured while the degree of parallelism is improved, thereby guaranteeing both the accuracy and the efficiency of processing.
Fig. 3A illustrates an exemplary flowchart of a processing method 300 according to an embodiment of the present disclosure. The processing method 300 may be implemented, for example, by the processing device 200 of fig. 2.
As shown in fig. 3A, the method 300 begins with step S310, in which a first operation of a decoded instruction is fetched. This step may be performed, for example, by the decoder 211 in the control module 210 of fig. 2. In some embodiments, the first operation may involve processing of tensor data.
It should be noted that the operations involved in the present disclosure may be basic operations supported by the processor hardware, or may be micro-operations (e.g., request signals) obtained by parsing such basic operations; the present disclosure is not limited to a particular type of operation. The processing apparatus of the present disclosure may execute two operations in parallel, or more than two, and the present disclosure does not limit the number of operations executed in parallel. The operations executed in parallel may belong to the same instruction or to different instructions; the disclosure is not limited in this respect.
Next, in step S320, a first coordinate space range of tensor data that is allowed to be used by the first operation is determined. This step may be performed, for example, by the control module 210 of fig. 2, or determined by the control module 210 controlling the tensor interface unit 240.
Next, in step S330, a second coordinate space range of the tensor data to be used when the first operation is performed is determined. In one implementation, this step may be performed, for example, by the control module 210 of fig. 2, or determined by the control module 210 controlling the tensor interface unit 240. In another implementation, this step may be performed, for example, by the operation module 220 of fig. 2, e.g., as determined by the respective functional unit 221 or 222.
Finally, in step S340, the first operation is performed within a third coordinate space range determined by the intersection of the first coordinate space range and the second coordinate space range. This step may be performed, for example, by the operation module 220 of fig. 2, in particular by the respective functional units 221, 222.
In some embodiments, the first operation may be blocked when the third coordinate space range is empty. By blocking the following operations, a plurality of operations having dependency relationships can be forced to be executed in a predetermined order, thereby ensuring the correctness of the result.
In the presently disclosed embodiments, by limiting the coordinate space ranges of tensor data that can be used when an operation is performed, such as limiting the operation to be performed within a third coordinate space range as described above, it is possible to ensure that accesses of instructions on each coordinate space range are sequential when the instructions are executed in parallel, thereby ensuring accuracy and efficiency of processing. In some embodiments, the first coordinate space range and the second coordinate space range are determined based at least in part on a predetermined division of a shape coordinate space of the tensor data for which the operation is intended.
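The flow of steps S310 through S340 can be sketched as follows (an illustrative Python sketch; representing a coordinate space range as a set of space block identifiers, and all names used here, are assumptions for illustration, not part of the disclosure):

```python
def execute_first_operation(first_range, second_range, run_on_block):
    """Restrict the first operation to the intersection of two ranges.

    first_range:  blocks the first operation is allowed to use (step S320)
    second_range: blocks the first operation intends to use (step S330)
    """
    third_range = first_range & second_range   # intersection (step S340)
    if not third_range:
        return None             # empty range: block the first operation
    for block in sorted(third_range):
        run_on_block(block)     # perform the first operation block by block
    return third_range
```

With the fig. 4A example below, a first range of {41, 43} and a second range of {41, 42} would yield a third range of {41}.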
In some embodiments, the first operation described above involves processing tensor data. Accordingly, the first coordinate space range and the second coordinate space range may be part of a shape coordinate space of a corresponding dimension of the tensor data, respectively. The shape coordinate space maps to a data storage area on the tensor data storage module 230. By dividing the shape coordinate space of the tensor data into several coordinate space ranges and restricting parallel execution of instructions based on the limitation of the coordinate space ranges (e.g., the coordinate space ranges of the front-back operations do not overlap), the parallelism of processing is improved and the processing time is reduced. Further, since programming on the software side typically uses spatial coordinates to reference data points or data blocks in tensor data, constraining parallel execution of operations by the coordinate space range of tensor data can simplify code programming on the software side, facilitating execution of instructions.
The present disclosure also provides an exemplary processing apparatus for implementing the processing method 300 of fig. 3A. Fig. 3B shows a schematic functional block diagram of a processing device according to an embodiment of the present disclosure.
As shown in fig. 3B, the processing apparatus 30 includes an operation acquisition unit 31, a first determination unit 32, a second determination unit 33, and an execution unit 34.
The operation acquisition unit 31 is configured to acquire a first operation of the decoded instruction. The first determination unit 32 is configured to determine a first coordinate space range of tensor data that is allowed to be used by the first operation. The second determination unit 33 is configured to determine a second coordinate space range of tensor data to be used when performing the first operation. The execution unit 34 is configured to execute the first operation within a third coordinate space range determined by an intersection of the first coordinate space range and the second coordinate space range.
Those skilled in the art will appreciate that the various units shown in fig. 3B are divided according to the functions they implement. Such division is merely exemplary; in an actual implementation, two or more functions may be implemented in a single hardware unit, and one function may likewise be split across two hardware units.
For example, in one implementation, the operation acquisition unit 31 and the first determination unit 32 may be included in the control module 210 of the processing apparatus 200 shown in fig. 2, and the second determination unit 33 and the execution unit 34 may be included in the operation module 220 of the processing apparatus 200.
For example, in another implementation, the operation acquisition unit 31, the first determination unit 32, and the second determination unit 33 may be included in the control module 210 of the processing apparatus 200 shown in fig. 2, and the execution unit 34 is included in the operation module 220 of the processing apparatus 200.
It should also be appreciated that the units contained in the processing device 30 correspond to the various steps in the method 300 described with reference to fig. 3A. Thus, the operations and features described herein with respect to the method are equally applicable to the processing device 30 and the units contained therein, and will not be described in detail.
In some embodiments of the present disclosure, the first coordinate space range and the second coordinate space range may be determined based at least in part on a predetermined division of a shape coordinate space of the tensor data.
More specifically, in some embodiments, the processing method further includes determining a prior operation being performed that accesses the same tensor data and that has a dependency relationship with the first operation, and determining the first coordinate space range based at least in part on an operational state of the prior operation. As mentioned previously, there may be three dependencies of the current operation and the prior operation, such as a read-after-write dependency, a write-after-write dependency, or a write-after-read dependency. The order consistency in the execution of these instructions must be ensured at this time, and therefore the first coordinate space range that is allowed to be used by the current operation (here, the first operation) is determined based on the operation state of the previous operation.
Fig. 4A schematically illustrates the division of coordinate space ranges according to an embodiment of the present disclosure. Fig. 4A is illustrated by way of example with two-dimensional data, however, those skilled in the art will appreciate that the same scheme may similarly be applied to tensor data in three or more dimensions.
As shown in fig. 4A, the shape coordinate space 400A of the two-dimensional tensor data is divided into four space blocks, 41, 42, 43, and 44, respectively. On each space block, the access to the data at each coordinate point in the space block is ensured to be sequential, namely, for each coordinate point, the prior operation with the dependency relationship is executed first, and then the first operation is executed, so that the correctness of the result is ensured.
In some embodiments, step S320 of FIG. 3A may further include determining a block of space in the shape coordinate space of the tensor data for which the previous operation was completed as the first coordinate space range.
In these embodiments, for example, when the previous operation has completed accessing the space blocks 41, 43 and the space block 42 is being used, the space range (i.e., the first coordinate space range) permitted to be used by the first operation at this time may include the space block 41 and the space block 43, as indicated by the hatched areas.
Alternatively or additionally, in some embodiments, step S330 of FIG. 3A may further include determining the block of space determined based on the coordinates of the tensor data to be accessed by the first operation as the second coordinate space range.
In these embodiments, for example, when the first operation is expected to use the spatial blocks 41 and 42 (e.g., estimated from the coordinates of the tensor data to be accessed), the spatial range in which the first operation is to be performed (i.e., the second coordinate space range) may be determined as the spatial blocks 41 and 42, as shown by the dot-filled region.
Then, according to the embodiment of the present disclosure, the range in which the first operation can be actually performed, that is, the third coordinate space range, is the intersection of the first coordinate space range and the second coordinate space range. As shown in fig. 4A, in the present example, the third coordinate space range is an area where both diagonal hatching and dot filling exist, that is, a space block 41 in fig. 4A.
Alternatively or additionally, in some embodiments, the first operation may be performed within the third coordinate space range in a predetermined spatial block order and/or a predetermined spatial coordinate order.
In some implementations, after the shape coordinate space of the tensor data to be operated on has been pre-partitioned into blocks, e.g., the four space blocks of fig. 4A, a space block order, i.e., the order in which operations visit the space blocks of the coordinate space, may be predetermined, for example the order of space blocks 41, 42, 43, and 44. In this case, if the operation target or usage space of two instructions having a dependency relationship is the whole tensor data, the instructions can be made to operate on the space blocks one by one in this order. For example, assume the preceding instruction 1 is to write the tensor data and the following instruction 2 is to read it. Instruction 1 may first write to space block 41 and then to space block 42, at which point instruction 2 may begin its read of space block 41. If the space blocks are divided so that the execution beat of instruction 2 matches that of instruction 1, then when instruction 1 later begins writing space block 43, instruction 2 will have finished reading space block 41 and will begin reading space block 42, and so on. The division into space blocks thus facilitates parallel execution of instructions, while the agreed order of the space blocks simplifies operation scheduling, shortens processing time, and improves processing efficiency.
Alternatively or additionally, in some implementations, the first operation may also be performed in a predetermined spatial coordinate order when performed within a single spatial block. When the operation ranges of the instructions executed in parallel are further controlled based on the coordinate points of the current operation within a single spatial block, this manner of execution in the predetermined spatial coordinate order is advantageous in simplifying operation scheduling, the principle of which is similar to the principle of execution in the predetermined spatial block order described above, and will not be described here again.
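The block-by-block pipelining of two dependent instructions described above can be sketched as follows (an illustrative Python sketch assuming equal execution beats; the reader is modeled as trailing the writer by exactly one space block, which is an assumption of this example):

```python
def pipeline(blocks, write_step, read_step):
    """Interleave a writing instruction and a dependent reading instruction.

    Both advance over the space blocks in the predetermined order; the
    reader may only read a block after the writer has finished it, so it
    trails the writer by one block while both run concurrently.
    """
    log = []
    reads_done = 0
    for i, b in enumerate(blocks):
        write_step(b)
        log.append(("write", b))
        if i >= 1:                        # reader starts one beat later
            read_step(blocks[reads_done])
            log.append(("read", blocks[reads_done]))
            reads_done += 1
    while reads_done < len(blocks):       # drain the remaining reads
        read_step(blocks[reads_done])
        log.append(("read", blocks[reads_done]))
        reads_done += 1
    return log
```

For blocks 41 through 44 this yields the interleaving described in the text: each block is read only after it has been written, yet the two instructions overlap in time.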
Although four spatial blocks divided equally are shown in fig. 4A, various numbers of spatial blocks of unequal sizes may be divided, and the present disclosure is not limited in the specific manner of division.
In some embodiments, the shape coordinate space of the tensor data may be pre-partitioned based on at least one of the processing capability of the hardware, preset parameters, and the size of the shape coordinate space of the tensor data. The processing capability of the hardware may include, for example, but is not limited to, the data bit width that the hardware can handle. Partitioning the shape coordinate space of the tensor data based on this bit width allows the processing capability of the hardware to be fully exploited, improving parallel processing efficiency. The preset parameters may, for example, directly specify the number of space blocks to be divided, the dimensional sizes of the space blocks, and so on. The shape coordinate space may also be partitioned according to its own size or dimensions. For example, when the tensor data is a two-dimensional matrix of M rows by N columns (M and N being positive integers), each row may be divided evenly into m parts and each column evenly into n parts, yielding m by n space blocks in total.
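An even partition of the kind just described might be computed as follows (a minimal Python sketch, assuming for simplicity that the block counts divide the dimensions exactly; the function name and return format are illustrative assumptions):

```python
def partition_blocks(M, N, m, n):
    """Divide an M-row by N-column shape coordinate space into m x n
    space blocks, returning half-open (row_lo, row_hi, col_lo, col_hi)
    coordinate ranges, one per block. Assumes m divides M and n divides N.
    """
    bh, bw = M // m, N // n   # block height and width
    return [(i * bh, (i + 1) * bh, j * bw, (j + 1) * bw)
            for i in range(m) for j in range(n)]
```

For instance, a 4 x 6 coordinate space split into 2 x 3 blocks gives six 2 x 2 ranges covering the whole space without overlap.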
In some embodiments, the first coordinate space range, the second coordinate space range, and the third coordinate space range may be characterized using the identification of the spatial blocks that are each included. For example, in the example shown in FIG. 4A, a first coordinate space range may be characterized using the identifications of space blocks 41 and 43, a second coordinate space range may be characterized using the identifications of space blocks 41 and 42, and a third coordinate space range may be characterized using the identifications of space blocks 41.
In most cases, tensor data is accessed dimension by dimension, with the access coordinates incrementing progressively so that the data units at each coordinate point of the tensor data are traversed from front to back.
Thus, in other embodiments, the first coordinate space range is characterized by an upper bound, in one or more dimensions of the tensor data, on the coordinates of the spatial blocks or partial spatial blocks that the first operation is allowed to use, and/or the second coordinate space range is characterized by a lower bound, in one or more dimensions of the tensor data, on the coordinates of the spatial blocks or partial spatial blocks that the first operation is expected to use. By exploiting the fact that tensor data is accessed in dimensional order, a coordinate upper or lower bound alone suffices to characterize the first or second coordinate space range, which simplifies the control information and the corresponding control method.
Fig. 4B schematically illustrates a representation of a coordinate space range according to an embodiment of the disclosure. Fig. 4B is still exemplarily illustrated with two-dimensional data, however, it will be appreciated by those skilled in the art that the same scheme may similarly be applied to tensor data in three or even more dimensions.
As shown in fig. 4B, the shape coordinate space 400B of the two-dimensional tensor data is divided into 12 space blocks. On each spatial block, access to the data at each coordinate point therein by a plurality of instructions having interdependence is guaranteed to be sequential. Any data element (e.g., data point) on the tensor data can be represented by two-dimensional spatial coordinates (X, Y) (where the X-axis is horizontally to the right and the Y-axis is vertically downward). Obviously, the coordinates of any data element on the tensor data do not exceed the maximum size of the shape coordinate space 400B.
As previously described, the first coordinate space range may be characterized by an upper bound on the coordinates, in one or more dimensions of the tensor data, of the spatial blocks that the first operation is allowed to use. For example, when the previous operation has completed its access to the 6 space blocks in the upper-left two rows and three columns, and is still using the 6 space blocks in the rightmost column and the bottom row, the space range permitted to the first operation (i.e., the first coordinate space range) may include the 6 upper-left space blocks, as indicated by the hatched area. In fig. 4B, this first coordinate space range may be characterized by an X upper bound 411 on the X axis and a Y upper bound 421 on the Y axis. In this example, these bounds indicate that the coordinates of the data accessed by the first operation cannot exceed the X upper bound 411 in the X dimension and cannot exceed the Y upper bound 421 in the Y dimension. It can be seen that these upper bounds correspond to the upper bounds, in each dimension, of the spatial blocks that the first operation is allowed to use.
Similarly, the second coordinate space range may be characterized by a lower bound on the coordinates, in one or more dimensions of the tensor data, of the spatial blocks that the first operation is expected to use. For example, when it is determined, based on the coordinates of the tensor data to be accessed by the first operation, that the first operation will use all space blocks other than the two leftmost blocks of the first row, the second coordinate space range may be determined to include the remaining 10 space blocks, as shown by the dot-filled region. In fig. 4B, this second coordinate space range may be characterized by an X lower bound 412 on the X axis and a Y lower bound 422 on the Y axis. In this example, the X lower bound and the Y lower bound indicate that data whose X coordinate is below the X lower bound and whose Y coordinate is below the Y lower bound is not accessed when the first operation is performed. As can be seen from fig. 4B, the second coordinate space range corresponds to the region of the shape coordinate space 400B in which either the X coordinate or the Y coordinate exceeds the corresponding lower bound (the X lower bound or the Y lower bound), which corresponds to the lower bounds, in each dimension, of the space blocks to be used by the first operation.
The range in which the first operation can be actually performed is a third coordinate space range, which is an intersection of the first coordinate space range and the second coordinate space range. As shown in fig. 4B, in the present example, the third coordinate space range is an area where both diagonal shading and dot filling exist, that is, an "inverted L-shaped" area in fig. 4B.
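The "inverted L" intersection of fig. 4B can be expressed as a simple membership test (an illustrative Python sketch; the concrete bound values in the usage below are invented for the example and do not come from the disclosure):

```python
def in_third_range(x, y, x_ub, y_ub, x_lb, y_lb):
    """Test whether point (x, y) lies in the third coordinate space range.

    First range  (allowed):  x <= x_ub and y <= y_ub  (coordinate upper bounds)
    Second range (expected): x >= x_lb or  y >= y_lb  (coordinate lower bounds)
    The intersection of the two is the 'inverted L' region of fig. 4B.
    """
    in_first = x <= x_ub and y <= y_ub
    in_second = x >= x_lb or y >= y_lb
    return in_first and in_second
```

Points inside both lower bounds (the already-consumed corner) and points beyond either upper bound are excluded; everything else in the allowed rectangle forms the inverted L.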
In some embodiments, the respective coordinate space ranges may be limited to only some of the dimensions thereof. For example, for two-dimensional tensor data, only the boundaries of the X dimension or Y dimension are limited.
The determination of the first coordinate space range and the second coordinate space range based on a predetermined division of the shape coordinate space of the tensor data has been described above. In addition, other factors may be considered in determining the first and second coordinate space ranges. When the space blocks are divided coarsely, the access time required to operate within one space block is correspondingly long. In that case, dividing the space range more finely within a single space block can further improve parallel processing efficiency.
In such an embodiment, when there is a space block for which the previous operation has completed only part of the coordinates, the partial space block constituted by those coordinates is dynamically determined based on the operation state of the previous operation and included in the first coordinate space range. Thus, the first coordinate space range may contain complete space blocks for which the prior operation has fully completed, as well as partial space blocks for which the prior operation has only partly completed.
Similarly, when there is a space block of which the first operation will access only part of the coordinates, the partial space block constituted by those coordinates is dynamically determined based on the operation state of the first operation and included in the second coordinate space range. Thus, the second coordinate space range may include complete space blocks that the first operation will access, as well as partial space blocks of which the first operation will access only a portion.
Here, the operation state may include at least one of a completion state of the operation with respect to a coordinate point in the space block, an execution range of the operation, and an access mode of the operation. For example, for a first coordinate space range, coordinate points within a spatial block that operate first to complete access may be included within the first coordinate space range. For another example, for a second coordinate space range, coordinate points within the spatial block that are not accessed by the first operation may be excluded from the second coordinate space range. The access mode of operation may include, for example, sequential access, regular access, and so forth. Based on the various access modes of operation, it may be determined accordingly which coordinate points within the spatial block may be included or excluded in the first and/or second coordinate space ranges. For the case of partial spatial blocks, additional information is needed to assist in identifying coordinate points in the partial spatial blocks, in addition to using spatial block identification for characterization. In some embodiments, such a coordinate space range containing partial space blocks may be characterized with reference to the upper and lower boundary representation of coordinates of FIG. 4B. It will be appreciated that for the case of sequential access to the coordinate space in a certain order, typically a partial space block is located at the edge of the coordinate space range to which it belongs, so that the coordinate space range can be characterized using the full space block or the upper and lower boundaries of the partial space block at the edge of the coordinate space range.
For example, in embodiments where the coordinate space range is characterized by an upper and lower boundary of the coordinate space, the lower boundary of the coordinate space of the tensor data used by the previous operation or instruction may be used as the upper boundary of the tensor data used by the current new instruction.
In one example, when the first operation (i.e., the current operation) is a read operation, the upper bound of its coordinate space is the lower bound of the coordinate space of the most recent write operation (i.e., the preceding operation) on the tensor data.
In another example, when the first operation is a write operation, the upper bound of its coordinate space is the minimum of the coordinate space lower bound of the most recent write operation on the tensor data and the coordinate space lower bounds of all read operations on the tensor data between the two write operations. Choosing the minimum ensures that executing the first operation does not affect the execution of any preceding operation.
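The two upper-bound rules above can be combined into one helper (a hypothetical Python sketch; modeling each per-dimension bound as a single scalar is an illustrative simplification):

```python
def coord_upper_bound(op_kind, last_write_lb, read_lbs_since_write=()):
    """Coordinate space upper bound for a new operation on tensor data.

    read:  bounded by the lower bound of the most recent write.
    write: bounded by the minimum of the most recent write's lower bound
           and the lower bounds of all reads since that write.
    """
    if op_kind == "read":
        return last_write_lb
    if op_kind == "write":
        return min([last_write_lb, *read_lbs_since_write])
    raise ValueError("unknown operation kind: " + op_kind)
```

Taking the minimum over the intervening reads is what guarantees a new write cannot overrun data a pending read still needs.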
Alternatively or additionally, the second coordinate space range may be determined based on at least one of the execution range of the operation, the access mode of the operation, and the current execution state of the operation. For example, in embodiments where a coordinate space range is characterized by upper and lower bounds of the coordinate space, the above factors may be considered together to determine the second coordinate space range, so as to ensure that, when the tensor data is accessed along a dimension, the coordinate in that dimension is not less than the coordinate space lower bound. Further, the lower bound should be set as large as possible, so that a larger accessible space range is left for subsequent operations or instructions.
In one example, when the access pattern of the first operation is sequential, continuous access, the coordinate space lower bound may be determined based on the minimum access coordinate of the first operation. For example, the lower bound may be set to the lower bound of the space block in which the minimum access coordinate is located. As shown in fig. 4B, when the first operation accesses data in the X dimension and the minimum X coordinate of the accessed data is A, located in the 2nd space block from the left, the X lower bound may be set to the lower bound of that 2nd space block; when the first operation accesses data in the Y dimension and the minimum Y coordinate of the accessed data is B, falling in the 3rd space block from the top, the Y lower bound may be set to the lower bound of that 3rd space block.
In another example, when the access pattern of the first operation is regular access, the coordinate space lower bound may be determined based on the rule. For example, in a convolution operation, it may be necessary to access tensor data in blocks, so the lower bound of the coordinate space may be determined according to the rule of the blocks of the convolution operation.
In yet another example, when the access mode of the first operation cannot be determined, the coordinate space lower bound may be determined based on a predetermined setting. For example, the lower bound of the coordinate space may be a default value, such as 0 or the size of 1 or more spatial blocks.
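The three lower-bound cases above can be summarized in one routine (an illustrative Python sketch; the per-mode arguments and the default value are assumptions made for the example):

```python
def coord_lower_bound(access_mode, min_coord=0, block_size=1,
                      rule_lower=0, default=0):
    """Coordinate space lower bound of the second range for an operation.

    sequential: lower bound of the space block containing the minimum
                coordinate the operation will access.
    regular:    supplied by the tiling rule (e.g., convolution blocking).
    otherwise:  a predetermined default (e.g., 0).
    """
    if access_mode == "sequential":
        return (min_coord // block_size) * block_size
    if access_mode == "regular":
        return rule_lower
    return default
```

Snapping a sequential access down to its block boundary keeps the bound conservative: no coordinate the operation will touch falls below it.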
Although fig. 4B illustrates the coordinate upper- and lower-bound representation for coordinate space ranges composed of complete spatial blocks, the same representation can be applied to partial spatial blocks. In that case, the upper and/or lower bounds of the coordinates are not the bounds of a complete spatial block but fall inside a spatial block. The specific manner of determination is similar to the foregoing and will not be repeated here.
The above describes a scheme of restricting the spatial range actually used by the operation in order to ensure sequential consistency of data processing while improving the parallel processing efficiency when the hardware performs the operation in parallel. Those skilled in the art will appreciate that the current operation (e.g., the first operation described above) and the previous operation (or the preceding operation) may each be an operation in a different instruction executed in parallel, and that the current operation and the previous operation may also each be a different operation executed in parallel in the same instruction, as the disclosure is not limited in this respect.
The processing method performed by the processing apparatus of the embodiments of the present disclosure has been described above with reference to flowcharts. As will be appreciated by those skilled in the art, since the operations performed in parallel are constrained based on the coordinate space ranges of the processed data, the degree of parallelism of the operations can be improved while ensuring the sequential consistency of their execution, thereby improving processing efficiency. It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present disclosure is not limited by the order of acts described, as some steps may occur in other orders or concurrently in accordance with the disclosure. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all optional embodiments, and that the acts and modules involved are not necessarily required by the present disclosure.
It should be further noted that, although the steps in the method flowcharts are shown in the sequence indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the steps are not strictly limited in their order of execution and may be executed in other orders. Moreover, at least some of the steps in a method flowchart may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; nor does the order in which these sub-steps or stages are performed necessarily proceed in sequence, as they may be performed in turn or alternately with at least a portion of the sub-steps or stages of other steps.
Fig. 5 is a block diagram illustrating a combination processing apparatus 500 according to an embodiment of the disclosure. As shown in fig. 5, the combined processing device 500 includes a computing processing device 502, an interface device 504, other processing devices 506, and a storage device 508. Depending on the application scenario, one or more computing devices 510 may be included in the computing processing device, which may be configured as the processing device 200 shown in FIG. 2 for performing the operations described herein in connection with FIG. 4.
In various embodiments, the computing processing means of the present disclosure may be configured to perform user-specified operations. In an exemplary application, the computing processing device may be implemented as a single-core artificial intelligence processor or as a multi-core artificial intelligence processor. Similarly, one or more computing devices included within a computing processing device may be implemented as an artificial intelligence processor core or as part of a hardware architecture of an artificial intelligence processor core. When multiple computing devices are implemented as artificial intelligence processor cores or portions of hardware structures of artificial intelligence processor cores, the computing processing devices of the present disclosure may be considered to have a single core structure or a homogeneous multi-core structure.
In an exemplary operation, the computing processing device of the present disclosure may interact with other processing devices through the interface device to collectively accomplish user-specified operations. Depending on the implementation, the other processing devices of the present disclosure may include one or more types of general-purpose and/or special-purpose processors, such as a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), and an artificial intelligence processor. These processors may include, but are not limited to, digital signal processors (Digital Signal Processor, DSP), application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), field-programmable gate arrays (Field-Programmable Gate Array, FPGA), or other programmable logic devices, discrete gate or transistor logic devices, and discrete hardware components, and their number may be determined according to actual needs. As previously mentioned, the computing processing device of the present disclosure, considered on its own, may be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing processing device and the other processing devices are considered together, the two may be regarded as forming a heterogeneous multi-core structure.
In one or more embodiments, the other processing device may serve as an interface between the computing processing device of the present disclosure (which may be embodied as an artificial intelligence computing device, for example one for neural network operations) and external data and control, performing basic control including, but not limited to, data handling and the starting and/or stopping of the computing device. In other embodiments, the other processing device may also cooperate with the computing processing device to jointly accomplish computational tasks.
In one or more embodiments, the interface device may be used to transfer data and control instructions between the computing processing device and other processing devices. For example, the computing processing device may obtain input data from other processing devices via the interface device and write the input data to a storage device (or memory) on the computing processing device. Further, the computing processing device may obtain control instructions from other processing devices via the interface device and write them into a control cache on the computing processing device chip. Alternatively or additionally, the interface device may also read data from a storage device of the computing processing device and transmit it to the other processing devices.
Additionally or alternatively, the combined processing apparatus of the present disclosure may further comprise a storage device. As shown in the figure, the storage means are connected to the computing processing means and the other processing means, respectively. In one or more embodiments, a storage device may be used to store data for the computing processing device and/or the other processing devices. For example, the data may be data that cannot be stored entirely within an internal or on-chip memory device of a computing processing device or other processing device.
In some embodiments, the present disclosure also discloses a chip (e.g., chip 602 shown in fig. 6). In one implementation, the chip is a system on chip (System on Chip, SoC) integrated with one or more combined processing devices as shown in fig. 5. The chip may be connected to other related components through an external interface device (such as external interface device 606 shown in fig. 6). The related component may be, for example, a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface. In some application scenarios, other processing units (e.g., video codecs) and/or interface modules (e.g., DRAM interfaces) may also be integrated on the chip. In some embodiments, the disclosure also discloses a chip packaging structure including the chip. In some embodiments, the disclosure further discloses a board card including the chip packaging structure. The board card will be described in detail with reference to fig. 6.
Fig. 6 is a schematic diagram illustrating the structure of a board card 600 according to an embodiment of the disclosure. As shown in fig. 6, the board card includes a storage device 604 for storing data, which includes one or more storage units 610. The storage device may be connected to the control device 608 and the chip 602 described above for data transfer via, for example, a bus. Further, the board card also includes an external interface device 606 configured for data relay or transfer between the chip (or a chip in a chip packaging structure) and an external device 612 (e.g., a server or a computer). For example, the data to be processed may be transferred from the external device to the chip through the external interface device. For another example, the computation result of the chip may be transmitted back to the external device via the external interface device. The external interface device may take different interface forms according to different application scenarios; for example, it may use a standard PCIe interface.
In one or more embodiments, the control device in the disclosed board card may be configured to regulate the state of the chip. To this end, in an application scenario, the control device may include a microcontroller (Micro Controller Unit, MCU) to regulate the working state of the chip.
From the above description in connection with fig. 5 and 6, those skilled in the art will appreciate that the present disclosure also discloses an electronic device or apparatus that may include one or more of the above-described boards, one or more of the above-described chips, and/or one or more of the above-described combination processing apparatuses.
According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, an intelligent terminal, a PC device, an internet of things terminal, a mobile terminal, a cell phone, a vehicle recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autonomous driving terminal, a vehicle, a household appliance, and/or a medical device. The vehicle includes an airplane, a ship, and/or a car; the household appliances include a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and a range hood; the medical device includes a nuclear magnetic resonance instrument, a B-mode ultrasound instrument, and/or an electrocardiograph. The electronic device or apparatus of the present disclosure may also be applied to the internet, the internet of things, data centers, energy, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, medical care, and the like. Further, the electronic device or apparatus of the present disclosure may also be used in cloud, edge, and terminal application scenarios related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, an electronic device or apparatus with high computing power according to aspects of the present disclosure may be applied to a cloud device (e.g., a cloud server), while an electronic device or apparatus with low power consumption may be applied to a terminal device and/or an edge device (e.g., a smart phone or a camera).
In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device, according to the hardware information of the terminal device and/or the edge device, to simulate the hardware resources of the terminal device and/or the edge device, thereby achieving unified management, scheduling, and collaborative work of terminal-cloud integration or edge-cloud integration.
It should be noted that, for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of actions and combinations thereof, but those skilled in the art will understand that the aspects of the present disclosure are not limited by the order of actions described. Thus, one of ordinary skill in the art will appreciate in light of the present disclosure or teachings that certain steps thereof may be performed in other sequences or concurrently. Further, those skilled in the art will appreciate that the embodiments described in this disclosure may be considered alternative embodiments, i.e., wherein the acts or modules involved are not necessarily required for the implementation of some or some aspects of this disclosure. In addition, the description of some embodiments of the present disclosure is also focused on, depending on the scenario. In view of this, those skilled in the art will appreciate that portions of one embodiment of the disclosure that are not described in detail may be referred to in connection with other embodiments.
In particular implementations, based on the disclosure and teachings of the present disclosure, one of ordinary skill in the art will appreciate that several embodiments of the disclosure disclosed herein may also be implemented in other ways not disclosed herein. For example, in terms of the foregoing embodiments of the electronic device or apparatus, the units are divided herein by taking into account the logic function, and there may be other manners of dividing the units when actually implemented. For another example, multiple units or components may be combined or integrated into another system, or some features or functions in the units or components may be selectively disabled. In terms of the connection relationship between different units or components, the connections discussed above in connection with the figures may be direct or indirect couplings between the units or components. In some scenarios, the foregoing direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustical, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, some or all of the units may be selected to achieve the objectives of the embodiments of the disclosure, as desired. In addition, in some scenarios, multiple units in embodiments of the disclosure may be integrated into one unit, or each unit may physically exist alone.
In some implementation scenarios, the above-described integrated units may be implemented in the form of software program modules. The integrated unit may be stored in a computer readable memory if implemented in the form of software program modules and sold or used as a stand alone product. In this regard, when the aspects of the present disclosure are embodied in a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory, which may include instructions for causing a computer device (e.g., a personal computer, a server, or a network device, etc.) to perform some or all of the steps of the methods described by the embodiments of the present disclosure. The aforementioned Memory may include, but is not limited to, a usb disk, a flash disk, a Read Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a removable hard disk, a magnetic disk, or an optical disk, etc. various media that can store program codes.
In other implementation scenarios, the integrated units may also be implemented in hardware, i.e., as specific hardware circuits, which may include digital circuits and/or analog circuits, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, devices such as transistors or memristors. In view of this, various types of devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as CPUs, GPUs, FPGAs, DSPs, and ASICs. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic storage media or magneto-optical storage media, etc.), which may be, for example, a resistive random access memory (Resistive Random Access Memory, RRAM), a dynamic random access memory (Dynamic Random Access Memory, DRAM), a static random access memory (Static Random Access Memory, SRAM), an enhanced dynamic random access memory (Enhanced Dynamic Random Access Memory, EDRAM), a high bandwidth memory (High Bandwidth Memory, HBM), a hybrid memory cube (Hybrid Memory Cube, HMC), ROM, or RAM.
The foregoing may be better understood in light of the following clauses:
Clause 1. A processing method, the method comprising:
obtaining a first operation of a decoded instruction;
determining a first coordinate space range of tensor data that the first operation is allowed to use;
determining a second coordinate space range of the tensor data to be used in performing the first operation; and
performing the first operation within a third coordinate space range determined by an intersection of the first coordinate space range and the second coordinate space range;
wherein the first coordinate space range and the second coordinate space range are determined based at least in part on a predetermined division of a shape coordinate space of the tensor data.
Clause 2. The processing method of clause 1, wherein the first coordinate space range and the second coordinate space range are each part of a shape coordinate space of the tensor data, the shape coordinate space being mapped to a data storage area on the tensor data storage module.
Clause 3. The processing method of any of clauses 1-2, further comprising:
determining a prior operation being performed that accesses the same tensor data and has a dependency relationship with the first operation; and
determining the first coordinate space range based at least in part on an operation state of the prior operation.
Clause 4. The processing method of clause 3, wherein the shape coordinate space of the tensor data is pre-divided into a number of spatial blocks, the method further comprising:
determining a spatial block in the shape coordinate space for which the prior operation has been completed as the first coordinate space range; and/or
determining a spatial block determined based on coordinates of tensor data to be accessed by the first operation as the second coordinate space range.
Clause 5. The processing method of clause 4, further comprising:
when there is a spatial block for which the prior operation has been completed with respect to only partial coordinates, determining a partial spatial block constituted by the partial coordinates based on an operation state of the prior operation, and including the partial spatial block in the first coordinate space range; and/or
when there is a spatial block of which the first operation is to access only partial coordinates, determining a partial spatial block constituted by the partial coordinates based on an operation state of the first operation, and including the partial spatial block in the second coordinate space range.
Clause 6. The processing method of clause 5, wherein the operation state comprises at least one of the following information:
a completion status of the operation with respect to coordinate points in the spatial block;
an execution scope of the operation; and
an access mode of the operation.
Clause 7. The processing method of any of clauses 5-6, wherein the first coordinate space range and the second coordinate space range are characterized using at least one of:
identifications of the spatial blocks respectively included in the first coordinate space range and the second coordinate space range;
an upper bound of coordinates, in one or more dimensions of the tensor data, of a spatial block or partial spatial block within the first coordinate space range; and/or
a lower bound of coordinates, in one or more dimensions of the tensor data, of a spatial block or partial spatial block within the second coordinate space range.
Clause 8. The method of any of clauses 4-7, further comprising:
performing the first operation within the third coordinate space range based on at least one of the following orders:
a predetermined spatial block order; and/or
a predetermined spatial coordinate order.
Clause 9. The method of any of clauses 1-8, further comprising:
blocking the first operation when the third coordinate space range is empty.
Clause 10. The processing method of any of clauses 1-9, wherein the pre-division of the shape coordinate space of the tensor data is performed based on at least one of:
a processing capability of the hardware;
a preset parameter; and
a size of the shape coordinate space of the tensor data.
Clause 11. The processing method of any of clauses 3-10, wherein:
the first operation and the preceding operation are respectively operations in different instructions executed in parallel, or
The first operation and the preceding operation are respectively different operations executed in parallel in the same instruction.
Clause 12. A processing device, comprising:
an operation acquisition unit configured to acquire a first operation of a decoded instruction;
A first determination unit configured to determine a first coordinate space range of tensor data that is allowed to be used by the first operation;
a second determination unit configured to determine a second coordinate space range of the tensor data to be used when the first operation is performed, and
An execution unit configured to execute the first operation within a third coordinate space range determined by an intersection of the first coordinate space range and the second coordinate space range;
wherein the first coordinate space range and the second coordinate space range are determined based at least in part on a predetermined division of a shape coordinate space of the tensor data.
Clause 13. The processing device of clause 12, wherein the first coordinate space range and the second coordinate space range are each part of a shape coordinate space of the tensor data, the shape coordinate space being mapped to a data storage area on the tensor data storage unit.
Clause 14. The processing device according to any of clauses 12-13, further comprising:
a third determination unit configured to determine a preceding operation being performed that accesses the same tensor data and has a dependency relationship with the first operation; and
The first determination unit is configured to determine the first coordinate space range based at least in part on an operational state of the prior operation.
Clause 15. The processing device of clause 14, wherein the shape coordinate space of the tensor data is pre-divided into a number of spatial blocks, and
the first determination unit is further configured to determine a spatial block in the shape coordinate space for which the preceding operation has been completed as the first coordinate space range; and/or
the second determination unit is further configured to determine a spatial block determined based on coordinates of tensor data to be accessed by the first operation as the second coordinate space range.
Clause 16. The processing device of clause 15, wherein:
the first determination unit is further configured to, when there is a spatial block for which the prior operation has been completed with respect to only partial coordinates, determine a partial spatial block constituted by the partial coordinates based on an operation state of the prior operation, and include the partial spatial block in the first coordinate space range; and/or
the second determination unit is further configured to, when there is a spatial block of which the first operation is to access only partial coordinates, determine a partial spatial block constituted by the partial coordinates based on an operation state of the first operation, and include the partial spatial block in the second coordinate space range.
Clause 17. The processing device of clause 16, wherein the operation state comprises at least one of the following information:
a completion status of the operation with respect to coordinate points in the spatial block;
an execution scope of the operation; and
an access mode of the operation.
Clause 18. The processing device of any of clauses 16-17, wherein the first coordinate space range and the second coordinate space range are characterized using at least one of:
identifications of the spatial blocks respectively included in the first coordinate space range and the second coordinate space range;
an upper bound of coordinates, in one or more dimensions of the tensor data, of a spatial block or partial spatial block within the first coordinate space range; and/or
a lower bound of coordinates, in one or more dimensions of the tensor data, of a spatial block or partial spatial block within the second coordinate space range.
Clause 19. The processing device of any of clauses 15-18, wherein:
within the third coordinate space range, the first operation is performed based on at least one of the following orders:
a predetermined spatial block order; and/or
a predetermined spatial coordinate order.
Clause 20. The processing device of any of clauses 12-19, wherein the pre-division of the shape coordinate space of the tensor data is performed based on at least one of:
a processing capability of the hardware;
a preset parameter; and
a size of the shape coordinate space of the tensor data.
Clause 21. The processing device of any of clauses 14-20, wherein:
the first operation and the preceding operation are respectively operations in different instructions executed in parallel, or
The first operation and the preceding operation are respectively different operations executed in parallel in the same instruction.
Clause 22. The processing device according to any of clauses 12-21, wherein the operation acquisition unit and the first determination unit are included in a control module of the processing device, and the second determination unit and the execution unit are included in an operation module of the processing device.
Clause 23. The processing device according to any of clauses 12-21, wherein the operation acquisition unit, the first determination unit, and the second determination unit are included in a control module of the processing device, and the execution unit is included in an operation module of the processing device.
Clause 24. A chip, characterized in that the chip comprises the processing device according to any of clauses 12-23.
Clause 25. A board card, wherein the board card comprises the chip of clause 24.
Claims (23)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202011272696.5A CN114489805B (en) | 2020-11-13 | Processing method, processing device and related products |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN114489805A CN114489805A (en) | 2022-05-13 |
| CN114489805B true CN114489805B (en) | 2025-10-10 |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5860082A (en) * | 1996-03-28 | 1999-01-12 | Datalight, Inc. | Method and apparatus for allocating storage in a flash memory |
| CN111782274A (en) * | 2019-04-04 | 2020-10-16 | 安徽寒武纪信息科技有限公司 | Data processing device and related product |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN112070202B (en) | Fusion graph generation method and device and computer readable storage medium | |
| US20210150325A1 (en) | Data processing method and apparatus, and related product | |
| US12112166B2 (en) | Data processing method and apparatus, and related product for increased efficiency of tensor processing | |
| CN114580606A (en) | Data processing method, apparatus, computer equipment and storage medium | |
| CN112799599B (en) | A data storage method, computing core, chip and electronic device | |
| CN116185274A (en) | Data processing method, computing device and related products | |
| CN114281560B (en) | Processing unit, synchronization method for processing unit and corresponding product | |
| CN111782274B (en) | Data processing device and related product | |
| CN114489805B (en) | Processing method, processing device and related products | |
| WO2022100345A1 (en) | Processing method, processing apparatus, and related product | |
| CN114489803A (en) | Processing device, processing method and related product | |
| CN114692844B (en) | Data processing device, data processing method and related products | |
| CN114281561B (en) | Processing unit, synchronization method for processing unit and corresponding product | |
| CN114282159B (en) | Data processing device, integrated circuit chip, equipment and implementation method thereof | |
| CN114692838B (en) | Data processing device, data processing method and related products | |
| WO2022062682A1 (en) | Data processing device, integrated circuit chip, device, and implementation method therefor | |
| CN114489805A (en) | Processing method, processing device and related products | |
| CN114489804A (en) | Processing method, processing device and related products | |
| CN113867800A (en) | Computing device, integrated circuit chip, board card, electronic equipment and computing method | |
| WO2022134872A1 (en) | Data processing apparatus, data processing method and related product | |
| CN114489790A (en) | Data processing device, data processing method and related product | |
| CN114489802A (en) | Data processing device, data processing method and related product | |
| CN114489789A (en) | Processing device, processing method and related product | |
| CN114692841B (en) | Data processing device, data processing method and related products | |
| JP7266121B2 (en) | Computing equipment, chips, board cards, electronic devices and computing methods |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant |