CN114359289B - Image processing method and related device - Google Patents

Image processing method and related device

Info

Publication number
CN114359289B
CN114359289B
Authority
CN
China
Prior art keywords
image
feature
processing
semantic segmentation
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011043640.2A
Other languages
Chinese (zh)
Other versions
CN114359289A (en)
Inventor
汪涛
宋风龙
任文琦
操晓春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Institute of Information Engineering of CAS
Original Assignee
Huawei Technologies Co Ltd
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd, Institute of Information Engineering of CAS filed Critical Huawei Technologies Co Ltd
Priority to CN202011043640.2A priority Critical patent/CN114359289B/en
Publication of CN114359289A publication Critical patent/CN114359289A/en
Application granted granted Critical
Publication of CN114359289B publication Critical patent/CN114359289B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The embodiment of the application discloses an image processing method which is applied to the field of artificial intelligence and comprises the steps of obtaining an image to be processed, processing the image to be processed through a first network to obtain a first feature, processing the image to be processed through a second network to obtain a second feature, wherein the first network is configured to at least extract features for image enhancement, generating a third feature according to the first feature and the second feature, obtaining a semantic segmentation result of the image to be processed, generating a fourth feature according to the third feature and the semantic segmentation result of the image to be processed, and carrying out image reconstruction on the fourth feature to obtain a target image. By introducing semantic features and semantic segmentation results of the image in the image enhancement processing process, different image enhancement intensities can be adopted for different semantic areas, texture details can be accurately kept, and the authenticity of the texture details after image enhancement is improved.

Description

Image processing method and related device
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to an image processing method and related apparatus.
Background
Artificial intelligence (AI) is the theory, method, technique, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine capable of reacting in a manner similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision making.
The deep learning method is a key driving force for the development of the field of artificial intelligence in recent years, and achieves remarkable effects on various tasks of computer vision. In the field of image enhancement (also referred to as image quality enhancement), methods based on deep learning have exceeded conventional methods.
However, the image enhancement network based on deep learning at present has unnatural enhancement effect on images, and the texture details of the images obtained after the image enhancement network processing are not true.
Disclosure of Invention
The embodiment of the application provides an image processing method and a related device, which are used for improving the image enhancement effect.
The first aspect of the application provides an image processing method, which includes: obtaining an image to be processed, where the image to be processed may be an image requiring image enhancement; processing the image to be processed through a first network to obtain a first feature, where the first network is configured to at least extract features for image enhancement, and the features for image enhancement may be image low-level features; processing the image to be processed through a second network to obtain a second feature, where the second network is configured to at least extract semantic segmentation features, and the semantic segmentation features may be image high-level features; generating a third feature according to the first feature and the second feature; obtaining a semantic segmentation result of the image to be processed; generating a fourth feature according to the third feature and the semantic segmentation result of the image to be processed; and performing image reconstruction on the fourth feature to obtain a target image.
According to the scheme, the semantic features and the semantic segmentation result of the image are introduced in the image enhancement processing process, and the semantic features and the semantic segmentation result are fused with the features for image enhancement, so that different image enhancement intensities can be adopted for different semantic areas, texture details can be accurately kept, and the authenticity of the texture details after image enhancement is improved.
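As a rough illustration of how such a pipeline could be wired together, the following PyTorch-style sketch follows the steps described above; the module structures, channel counts, number of classes, and fusion choices are assumptions for illustration only, not the networks actually claimed in this application.

```python
# Illustrative sketch only; module structures and shapes are assumptions, not the patent's exact networks.
import torch
import torch.nn as nn

class EnhancementWithSemantics(nn.Module):
    def __init__(self, channels=64, num_classes=21):
        super().__init__()
        self.first_net = nn.Sequential(               # extracts features for image enhancement (low-level)
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.second_net = nn.Sequential(               # extracts semantic segmentation features (high-level)
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.third_net = nn.Conv2d(channels, num_classes, 1)       # produces the semantic segmentation result
        self.fuse_seg = nn.Conv2d(channels + num_classes, channels, 1)
        self.reconstruct = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, x):
        f1 = self.first_net(x)                         # first feature
        f2 = self.second_net(x)                        # second feature
        f3 = f1 + f2                                   # third feature (here: summation fusion)
        seg = self.third_net(f3)                       # semantic segmentation result
        f4 = self.fuse_seg(torch.cat([f3, seg], 1))    # fourth feature (concatenation + convolution)
        return self.reconstruct(f4), seg               # target image and segmentation map
```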
Optionally, in one possible implementation manner, the obtaining the semantic segmentation result of the image to be processed includes processing the third feature through a third network to obtain the semantic segmentation result of the image to be processed.
The third feature is obtained by fusing the image low-level features related to image enhancement with the image high-level features related to semantic segmentation. Therefore, by processing the third feature, the image low-level features are introduced on top of the high-level features related to semantic segmentation; that is, the semantic segmentation result of the image to be processed is obtained based on features of different levels, which improves the accuracy of the obtained semantic segmentation result.
Optionally, in one possible implementation manner, the generating a third feature according to the first feature and the second feature includes performing feature fusion processing on the first feature and the second feature to obtain the third feature, and generating a fourth feature according to the third feature and a semantic segmentation result of the image to be processed includes performing feature fusion processing on the third feature and the semantic segmentation result of the image to be processed to obtain the fourth feature.
Optionally, in a possible implementation manner, the feature fusion process includes a summation process, a multiplication process, a cascade process, or a cascade process and a convolution process.
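A minimal sketch of these fusion options is shown below, assuming the two inputs are feature maps of compatible shapes; in a real model the 1x1 convolution used after concatenation would be a learned layer of the network rather than one created on the fly.

```python
# Sketch of the fusion options named above; tensor shapes are assumed to be compatible.
import torch
import torch.nn as nn

def fuse(a: torch.Tensor, b: torch.Tensor, mode: str = "concat_conv") -> torch.Tensor:
    if mode == "sum":                  # summation processing
        return a + b
    if mode == "mul":                  # multiplication processing
        return a * b
    if mode == "concat":               # cascade (concatenation) processing
        return torch.cat([a, b], dim=1)
    if mode == "concat_conv":          # cascade followed by convolution to restore the channel count
        conv = nn.Conv2d(a.shape[1] + b.shape[1], a.shape[1], kernel_size=1)
        return conv(torch.cat([a, b], dim=1))
    raise ValueError(mode)
```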
Optionally, in one possible implementation manner, before generating the fourth feature according to the third feature and the semantic segmentation result of the image to be processed, the method further includes processing the third feature to obtain a fifth feature, and generating the fourth feature according to the third feature and the semantic segmentation result of the image to be processed includes generating the fourth feature according to the fifth feature and the semantic segmentation result of the image to be processed.
That is, the image processing apparatus performs feature fusion based on the fifth feature obtained by further feature extraction and the semantic segmentation result of the image to be processed after performing feature extraction on the third feature. By further feature extraction processing of the third features obtained after feature fusion, features with finer granularity can be extracted on the basis of the third features, so that the accuracy of fourth features obtained by subsequent feature fusion is improved.
Optionally, in one possible implementation manner, the processing the image to be processed through the second network to obtain the second feature includes preprocessing the image to be processed to obtain a preprocessed feature, performing downsampling on the preprocessed feature to obtain a downsampled feature, processing the downsampled feature through the second network to obtain a sixth feature, and performing upsampling on the sixth feature to obtain the second feature of the image to be processed. Through the downsampling operation, the resolution of the preprocessing features is reduced, the calculated amount for extracting the semantic segmentation features can be reduced, and the computational power requirement on the image processing device is reduced.
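The following sketch illustrates this down-sample / process / up-sample flow around the second network; the downsampling factor of 4 and the use of bilinear interpolation are assumptions for illustration.

```python
# Sketch of the down/up-sampling around the second network; the factor of 4 is an assumption.
import torch.nn.functional as F

def second_branch(pre_feat, second_net, scale=4):
    down = F.interpolate(pre_feat, scale_factor=1.0 / scale, mode="bilinear", align_corners=False)
    sixth = second_net(down)                                           # semantic features at low resolution
    second = F.interpolate(sixth, size=pre_feat.shape[-2:], mode="bilinear", align_corners=False)
    return second                                                      # same resolution as the preprocessed feature
```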
Optionally, in one possible implementation, the method is used to implement at least one of image super-resolution reconstruction, image denoising, image defogging, image deblurring, image contrast enhancement, image demosaicing, image rain removal, image color enhancement, image brightness enhancement, image detail enhancement, and image dynamic range enhancement.
The second aspect of the application provides a model training method, which includes: obtaining a training sample pair, where the training sample pair includes a first image and a second image, and the quality of the first image is lower than that of the second image; processing the first image through an image processing model to be trained to obtain a predicted image, where the image processing model to be trained is configured to process the first image through a first network to obtain a first feature, the first network being configured to at least extract features for image enhancement, process the first image through a second network to obtain a second feature, the second network being configured to at least extract semantic segmentation features, generate a third feature according to the first feature and the second feature, obtain a semantic segmentation result of the first image, generate a fourth feature according to the third feature and the semantic segmentation result of the first image, and perform image reconstruction on the fourth feature to obtain the predicted image; obtaining a first loss according to the second image in the training sample pair and the predicted image, where the first loss is used to describe the difference between the second image and the predicted image; and updating model parameters of the image processing model to be trained at least according to the first loss until a model training condition is met, so as to obtain an image processing model.
Optionally, in a possible implementation manner, the image processing model to be trained is further configured to process the third feature through a third network, so as to obtain a semantic segmentation prediction result of the first image.
Optionally, in one possible implementation manner, the image processing model to be trained is further used for obtaining a semantic segmentation real result of the first image, obtaining a second loss according to the semantic segmentation prediction result and the semantic segmentation real result, wherein the second loss is used for describing the difference between the semantic segmentation prediction result and the semantic segmentation real result, and updating model parameters of the image processing model to be trained at least according to the first loss and the second loss until model training conditions are met, so as to obtain the image processing model.
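A hedged sketch of one training step combining the first and second losses might look as follows; the specific loss types (L1 for the image, cross-entropy for segmentation) and the loss weighting are assumptions, not choices stated in this application.

```python
# Sketch of one training step with both losses; loss types, weighting, and optimizer are assumptions.
import torch.nn.functional as F

def train_step(model, optimizer, first_img, second_img, seg_gt, seg_weight=0.1):
    pred_img, seg_pred = model(first_img)              # predicted image and predicted segmentation
    loss1 = F.l1_loss(pred_img, second_img)            # first loss: difference to the high-quality second image
    loss2 = F.cross_entropy(seg_pred, seg_gt)          # second loss: difference to the real segmentation labels
    loss = loss1 + seg_weight * loss2                  # combined objective (weighting is an assumption)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```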
Optionally, in one possible implementation manner, the image processing model to be trained is further used for performing feature fusion processing on the first feature and the second feature to obtain the third feature, and generating a fourth feature according to the third feature and the semantic segmentation result of the first image includes performing feature fusion processing on the third feature and the semantic segmentation result of the first image to obtain the fourth feature.
Optionally, in a possible implementation manner, the feature fusion process includes a summation process, a cascade process, or a cascade process and a convolution process.
Optionally, in a possible implementation manner, the image processing model to be trained is further configured to process the third feature to obtain a fifth feature, and to generate the fourth feature according to the fifth feature and the semantic segmentation result of the first image.
Optionally, in a possible implementation manner, the image processing model to be trained is further used for preprocessing the first image to obtain a preprocessed feature, downsampling the preprocessed feature to obtain a downsampled feature, processing the downsampled feature through the second network to obtain a sixth feature, and upsampling the sixth feature to obtain a second feature of the first image.
Optionally, in one possible implementation, the image processing model is configured to implement at least one of image super-resolution reconstruction, image denoising, image defogging, image deblurring, image contrast enhancement, image demosaicing, image rain removal, image color enhancement, image brightness enhancement, image detail enhancement, and image dynamic range enhancement.
The third aspect of the application provides an image processing device, which comprises an acquisition unit and a processing unit, wherein the acquisition unit is used for acquiring an image to be processed, the processing unit is used for processing the image to be processed through a first network to obtain a first feature, the first network is configured to at least extract a feature for image enhancement, the image to be processed is processed through a second network to obtain a second feature, the second network is configured to at least extract a semantic segmentation feature, a third feature is generated according to the first feature and the second feature, the acquisition unit is further used for acquiring a semantic segmentation result of the image to be processed, the processing unit is further used for generating a fourth feature according to the third feature and the semantic segmentation result of the image to be processed, and the fourth feature is subjected to image reconstruction to obtain a target image.
Optionally, in a possible implementation manner, the processing unit is further configured to process the third feature through a third network to obtain a semantic segmentation result of the image to be processed.
Optionally, in one possible implementation manner, the processing unit is further configured to perform feature fusion processing on the first feature and the second feature to obtain the third feature, and perform feature fusion processing on the third feature and a semantic segmentation result of the image to be processed to obtain the fourth feature.
Optionally, in one possible implementation, the feature fusion process includes at least one of a summation process, a multiplication process, a cascade process, and a cascade convolution process.
Optionally, in a possible implementation manner, the processing unit is further configured to process the third feature to obtain a fifth feature, and generate a fourth feature according to the fifth feature and a semantic segmentation result of the image to be processed.
Optionally, in a possible implementation manner, the processing unit is further configured to perform preprocessing on the image to be processed to obtain a preprocessed feature, perform downsampling on the preprocessed feature to obtain a downsampled feature, process the downsampled feature through the second network to obtain a sixth feature, and perform upsampling on the sixth feature to obtain a second feature of the image to be processed.
Optionally, in one possible implementation manner, the image processing device is used for at least one of image super-resolution reconstruction, image denoising, image defogging, image deblurring, image contrast enhancement, image demosaicing, image rain removal, image color enhancement, image brightness enhancement, image detail enhancement and image dynamic range enhancement.
A fourth aspect of the application provides a model training apparatus, which includes an acquisition unit and a training unit. The acquisition unit is configured to acquire a training sample pair, where the training sample pair includes a first image and a second image, and the quality of the first image is lower than that of the second image. The training unit is configured to process the first image through an image processing model to be trained to obtain a predicted image, where the image processing model to be trained is configured to: process the first image through a first network to obtain a first feature, the first network being configured to at least extract features for image enhancement; process the first image through a second network to obtain a second feature, the second network being configured to at least extract semantic segmentation features; generate a third feature according to the first feature and the second feature; obtain a semantic segmentation result of the first image; generate a fourth feature according to the third feature and the semantic segmentation result of the first image; and perform image reconstruction on the fourth feature to obtain the predicted image. The training unit is further configured to obtain a first loss according to the second image in the training sample pair and the predicted image, where the first loss is used to describe the difference between the second image and the predicted image, and to update model parameters of the image processing model to be trained at least according to the first loss until a model training condition is met, so as to obtain an image processing model.
Optionally, in a possible implementation manner, the training unit is further configured to process the third feature through a third network to obtain a semantic segmentation prediction result of the first image.
Optionally, in one possible implementation manner, the training unit is further configured to obtain a real semantic segmentation result of the first image, obtain a second loss according to the predicted semantic segmentation result and the real semantic segmentation result, where the second loss is used to describe a difference between the predicted semantic segmentation result and the real semantic segmentation result, and update model parameters of the image processing model to be trained at least according to the first loss and the second loss until a model training condition is met, so as to obtain an image processing model.
Optionally, in one possible implementation manner, the training unit is further configured to perform feature fusion processing on the first feature and the second feature to obtain the third feature, and perform feature fusion processing on the third feature and a semantic segmentation result of the first image to obtain the fourth feature.
Optionally, in one possible implementation, the feature fusion process includes at least one of a summation process, a multiplication process, a cascade process, and a cascade convolution process.
Optionally, in a possible implementation manner, the training unit is further configured to process the third feature to obtain a fifth feature, and to generate the fourth feature according to the fifth feature and the semantic segmentation result of the first image.
Optionally, in a possible implementation manner, the training unit is further configured to perform preprocessing on the first image to obtain a preprocessed feature, perform downsampling on the preprocessed feature to obtain a downsampled feature, process the downsampled feature through the second network to obtain a sixth feature, and perform upsampling on the sixth feature to obtain a second feature of the first image.
Optionally, in one possible implementation, the image processing model is configured to implement at least one of image super-resolution reconstruction, image denoising, image defogging, image deblurring, image contrast enhancement, image demosaicing, image rain removal, image color enhancement, image brightness enhancement, image detail enhancement, and image dynamic range enhancement.
A fifth aspect of the present application provides an image processing apparatus, which may comprise a processor coupled to a memory, the memory storing program instructions which, when executed by the processor, implement the method of the first aspect described above. For the steps in each possible implementation manner of the first aspect executed by the processor, reference may be specifically made to the first aspect, which is not described herein.
A sixth aspect of the application provides a model training apparatus, which may comprise a processor coupled to a memory, the memory storing program instructions which, when executed by the processor, implement the method of the second aspect. For the steps in each possible implementation manner of the second aspect executed by the processor, reference may be specifically made to the second aspect, which is not described herein.
A seventh aspect of the application provides a computer readable storage medium having a computer program stored therein, which when run on a computer causes the computer to perform the method of the first aspect described above.
An eighth aspect of the present application provides a computer readable storage medium having a computer program stored therein, which when run on a computer causes the computer to perform the method of the second aspect described above.
A ninth aspect of the application provides circuitry comprising processing circuitry configured to perform the method of the first aspect described above.
A tenth aspect of the application provides circuitry comprising processing circuitry configured to perform the method of the second aspect described above.
An eleventh aspect of the application provides a computer program which, when run on a computer, causes the computer to perform the method of the first aspect described above.
A twelfth aspect of the application provides a computer program which, when run on a computer, causes the computer to perform the method of the second aspect described above.
A thirteenth aspect of the present application provides a chip system, which includes a processor configured to support a server or a threshold value acquisition apparatus in implementing the functions involved in the above aspects, for example, sending or processing the data and/or information involved in the above methods. In one possible design, the chip system further includes a memory for storing the program instructions and data necessary for the server or the communication device. The chip system may consist of a chip, or may include a chip and other discrete devices.
Drawings
FIG. 1 is a schematic diagram of an artificial intelligence main body framework according to an embodiment of the present application;
FIG. 2a is a schematic diagram of an image processing system according to an embodiment of the present application;
FIG. 2b is a schematic diagram of another image processing system according to an embodiment of the present application;
FIG. 2c is a schematic diagram of an apparatus related to image processing according to an embodiment of the present application;
FIG. 3a is a schematic diagram of a system 100 architecture according to an embodiment of the present application;
FIG. 3b is a schematic diagram of semantic segmentation of an image according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of an image processing method according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a densely connected hole convolution network according to an embodiment of the present application;
FIG. 6a is a schematic diagram of an architecture for image processing according to an embodiment of the present application;
FIG. 6b is a schematic diagram of a network structure for image processing according to an embodiment of the present application;
FIG. 7 is a comparison diagram of objective indicators according to an embodiment of the present application;
FIG. 8 is another comparison diagram of objective indicators according to an embodiment of the present application;
FIG. 9 is a schematic diagram of an image comparison according to an embodiment of the present application;
FIG. 10 is a flow chart of a model training method according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of a model training device according to an embodiment of the present application;
FIG. 13 is a schematic structural diagram of an execution device according to an embodiment of the present application;
FIG. 14 is a schematic structural diagram of a training device according to an embodiment of the present application;
FIG. 15 is a schematic structural diagram of a chip according to an embodiment of the present application.
Detailed Description
Embodiments of the present invention are described below with reference to the accompanying drawings. The terminology used in describing the embodiments is intended only to describe particular embodiments and is not intended to limit the invention.
Embodiments of the present application are described below with reference to the accompanying drawings. As one of ordinary skill in the art can know, with the development of technology and the appearance of new scenes, the technical scheme provided by the embodiment of the application is also applicable to similar technical problems.
The terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish between similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that the terms so used are interchangeable under appropriate circumstances and are merely a way of distinguishing objects having the same attributes when describing the embodiments of the application. Furthermore, the terms "comprise", "include", and "have", and any variations thereof, are intended to cover a non-exclusive inclusion, so that a process, method, system, product, or apparatus that includes a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such a process, method, product, or apparatus.
Referring to FIG. 1, FIG. 1 shows a schematic structural diagram of an artificial intelligence main body framework, which is described below from the two dimensions of the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects a series of processes from data acquisition to data processing, for example, the general processes of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision making, and intelligent execution and output. In this process, the data undergoes a "data - information - knowledge - wisdom" refinement process. The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (providing and processing technology implementations) to the industrial ecological process of the system.
(1) Infrastructure of
The infrastructure provides computing capability support for the artificial intelligence system, enables communication with the outside world, and provides support through a base platform. Communication with the outside is achieved through sensors; computing capability is provided by intelligent chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs, and FPGAs); and the base platform includes related platform guarantees and support such as distributed computing frameworks and networks, and may include cloud storage and computing, interconnection networks, and the like. For example, a sensor communicates with the outside to obtain data, and the data is provided to intelligent chips in the distributed computing system provided by the base platform for computation.
(2) Data
The data of the upper layer of the infrastructure is used to represent the data source in the field of artificial intelligence. The data relate to graphics, images, voice and text, and also relate to the internet of things data of the traditional equipment, including service data of the existing system and sensing data such as force, displacement, liquid level, temperature, humidity and the like.
(3) Data processing
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
Wherein machine learning and deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Reasoning refers to the process of simulating human intelligent reasoning modes in a computer or an intelligent system, and carrying out machine thinking and problem solving by using formal information according to a reasoning control strategy, and typical functions are searching and matching.
Decision making refers to the process of making decisions after intelligent information is inferred, and generally provides functions of classification, sequencing, prediction and the like.
(4) General capability
After the data has been processed, some general-purpose capabilities can be formed based on the result of the data processing, such as algorithms or a general-purpose system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, etc.
(5) Intelligent product and industry application
Intelligent products and industry applications refer to the products and applications of the artificial intelligence system in various fields; they encapsulate the overall artificial intelligence solution, productize intelligent information decision making, and realize practical applications. The application fields mainly include intelligent terminals, intelligent transportation, intelligent medical care, autonomous driving, safe cities, and the like.
Next, several application scenarios of the present application are described.
Fig. 2a is a schematic diagram of an image processing system according to an embodiment of the present application, where the image processing system includes a user device and a data processing device. The user equipment comprises intelligent terminals such as a mobile phone, a personal computer or an information processing center. The user device is the initiating end of image processing, and is used as the initiator of image enhancement request, and the user typically initiates the request through the user device.
The data processing device may be a device or a server having a data processing function, such as a cloud server, a web server, an application server, or a management server. The data processing device receives an image enhancement request from the intelligent terminal through an interactive interface, and then performs image processing by means of machine learning, deep learning, searching, reasoning, decision making, and the like, using a memory for storing data and a processor for data processing. The memory in the data processing device may be a general term including a database for storing historical data, and the database may be located on the data processing device or on another network server.
In the image processing system shown in fig. 2a, the user device may receive an instruction from a user, for example, the user device may acquire an image input/selected by the user, and then initiate a request to the data processing device, so that the data processing device performs an image enhancement processing application (for example, image super resolution reconstruction, image denoising, image defogging, image deblurring, and image contrast enhancement) on the image obtained by the user device, thereby obtaining a corresponding processing result for the image. For example, the user device may acquire an image input by the user, and then initiate an image denoising request to the data processing device, so that the data processing device performs image denoising on the image, thereby obtaining a denoised image.
In fig. 2a, a data processing apparatus may perform the image processing method of the embodiment of the present application.
Fig. 2b is a schematic diagram of another image processing system according to an embodiment of the present application, in fig. 2b, a user device directly serves as a data processing device, and the user device can directly obtain an input from a user and directly process the input by hardware of the user device, and a specific process is similar to that of fig. 2a, and reference is made to the above description and will not be repeated here.
In the image processing system shown in fig. 2b, the user device may receive an instruction from the user, for example, the user device may acquire an image selected by the user in the user device, and then perform an image processing application (such as image super-resolution reconstruction, image denoising, image defogging, image deblurring, and image contrast enhancement) on the image by the user device itself, so as to obtain a corresponding processing result for the image.
In fig. 2b, the user equipment itself may perform the image processing method according to the embodiment of the present application.
Fig. 2c is a schematic diagram of an apparatus related to image processing according to an embodiment of the present application.
The user device in fig. 2a and 2b may be the local device 301 or the local device 302 in fig. 2c, and the data processing device in fig. 2a may be the executing device 210 in fig. 2c, where the data storage system 250 may store data to be processed of the executing device 210, and the data storage system 250 may be integrated on the executing device 210, or may be disposed on a cloud or other network server.
The processors in fig. 2a and 2b may perform data training/machine learning/deep learning through a neural network model or another model (for example, a model based on a support vector machine), and apply the model finally obtained by training or learning from the data to the image to perform the image processing application, thereby obtaining a corresponding processing result.
Fig. 3a is a schematic diagram of a system 100 architecture provided by an embodiment of the present application, in fig. 3a, an execution device 110 configures an input/output (I/O) interface 112 for data interaction with an external device, and a user may input data to the I/O interface 112 through a client device 140, where the input data may include tasks to be scheduled, callable resources, and other parameters in an embodiment of the present application.
In the preprocessing of the input data by the execution device 110, or in the process of performing a processing related to computation or the like (for example, performing a functional implementation of a neural network in the present application) by the computation module 111 of the execution device 110, the execution device 110 may call the data, the code or the like in the data storage system 150 for the corresponding processing, or may store the data, the instruction or the like obtained by the corresponding processing in the data storage system 150.
Finally, the I/O interface 112 returns the processing results to the client device 140 for presentation to the user.
It should be noted that the training device 120 may generate, based on different training data, a corresponding target model/rule for different targets or different tasks, where the corresponding target model/rule may be used to achieve the targets or complete the tasks, thereby providing the user with the desired result. Wherein the training data may be stored in database 130 and derived from training samples collected by data collection device 160.
In the case shown in fig. 3a, the user may manually give input data, which may be operated through an interface provided by the I/O interface 112. In another case, the client device 140 may automatically send the input data to the I/O interface 112, and if the client device 140 is required to automatically send the input data requiring the user's authorization, the user may set the corresponding permissions in the client device 140. The user may view the results output by the execution device 110 at the client device 140, and the specific presentation may be in the form of a display, a sound, an action, or the like. The client device 140 may also be used as a data collection terminal to collect input data of the input I/O interface 112 and output results of the output I/O interface 112 as new sample data as shown in the figure, and store the new sample data in the database 130. Of course, instead of being collected by the client device 140, the I/O interface 112 may directly store the input data input to the I/O interface 112 and the output result output from the I/O interface 112 as new sample data into the database 130.
It should be noted that fig. 3a is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship among devices, apparatuses, modules, etc. shown in the drawing is not limited in any way, for example, in fig. 3a, the data storage system 150 is an external memory with respect to the execution device 110, and in other cases, the data storage system 150 may be disposed in the execution device 110. As shown in fig. 3a, the neural network may be trained in accordance with the training device 120.
The embodiment of the application also provides a chip, which comprises the NPU. The chip may be provided in an execution device 110 as shown in fig. 3a for performing the calculation of the calculation module 111. The chip may also be provided in the training device 120 as shown in fig. 3a for completing the training work of the training device 120 and outputting the target model/rule.
The neural network processing unit (NPU) is mounted, as a coprocessor, onto a main central processing unit (host CPU), and the host CPU distributes tasks. The core part of the NPU is an operation circuit, and a controller controls the operation circuit to extract data from a memory (a weight memory or an input memory) and perform operations.
In some implementations, the arithmetic circuitry includes a plurality of processing units (PEs) internally. In some implementations, the operational circuit is a two-dimensional systolic array. The arithmetic circuitry may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, the operational circuitry is a general-purpose matrix processor.
For example, assume that there is an input matrix a, a weight matrix B, and an output matrix C. The arithmetic circuit takes the data corresponding to the matrix B from the weight memory and caches the data on each PE in the arithmetic circuit. The operation circuit takes the matrix A data and the matrix B from the input memory to perform matrix operation, and the obtained partial result or the final result of the matrix is stored in an accumulator (accumulator).
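The following toy Python snippet illustrates only the multiply-accumulate behaviour described above (partial results accumulated into C), not the NPU's actual data flow or circuitry.

```python
# Toy illustration of matrix multiply-accumulate; shapes are arbitrary assumptions.
import numpy as np

A = np.random.rand(4, 8)           # input matrix A
B = np.random.rand(8, 16)          # weight matrix B, cached per processing element in the real circuit
C = np.zeros((4, 16))              # accumulator holding partial results
for k in range(A.shape[1]):        # partial results are accumulated step by step
    C += np.outer(A[:, k], B[k, :])
assert np.allclose(C, A @ B)       # the accumulated result equals the full matrix product
```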
The vector calculation unit may further process the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, etc. For example, the vector calculation unit may be used for network calculations of non-convolutional/non-FC layers in a neural network, such as pooling (pooling), batch normalization (batch normalization), local response normalization (local response normalization), and the like.
In some implementations, the vector computation unit can store the vector of processed outputs to a unified buffer. For example, the vector calculation unit may apply a nonlinear function to an output of the arithmetic circuit, such as a vector of accumulated values, to generate the activation value. In some implementations, the vector calculation unit generates a normalized value, a combined value, or both. In some implementations, the vector of processed outputs can be used as an activation input to an arithmetic circuit, for example for use in subsequent layers in a neural network.
The unified memory is used for storing input data and output data.
A direct memory access controller (DMAC) transfers input data in the external memory to the input memory and/or the unified memory, stores weight data in the external memory into the weight memory, and stores data in the unified memory into the external memory.
And the bus interface unit (bus interface unit, BIU) is used for realizing interaction among the main CPU, the DMAC and the instruction fetch memory through a bus.
The instruction fetching memory (instruction fetch buffer) is connected with the controller and used for storing instructions used by the controller;
The controller is configured to invoke the instructions cached in the instruction fetch memory, so as to control the working process of the operation accelerator.
Typically, the unified memory, the input memory, the weight memory, and the instruction fetch memory are on-chip memories, and the external memory is a memory outside the NPU, which may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.
Because the embodiments of the present application relate to a large number of applications of neural networks, for convenience of understanding, related terms and related concepts of the neural networks related to the embodiments of the present application will be described below.
(1) Neural network
The neural network may be composed of neural units. A neural unit may refer to an arithmetic unit that takes xs and an intercept of 1 as inputs, and the output of the arithmetic unit may be:

$$h_{W,b}(x) = f(W^{T}x) = f\left(\sum_{s=1}^{n} W_{s} x_{s} + b\right)$$

where s = 1, 2, ..., n, n is a natural number greater than 1, Ws is the weight of xs, and b is the bias of the neural unit. f is the activation function of the neural unit, which introduces a nonlinear characteristic into the neural network to convert an input signal of the neural unit into an output signal. The output signal of the activation function may be used as the input of the next convolutional layer. The activation function may be a sigmoid function. A neural network is a network formed by connecting many of the above single neural units together, that is, the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to a local receptive field of the previous layer to extract features of the local receptive field, and the local receptive field may be an area composed of several neural units.
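A single neural unit of this kind can be sketched as follows, assuming a sigmoid activation for f:

```python
# Minimal sketch of one neural unit; the sigmoid activation is an assumed choice of f.
import numpy as np

def neural_unit(x, w, b):
    z = np.dot(w, x) + b                 # weighted sum of the inputs xs plus the bias b
    return 1.0 / (1.0 + np.exp(-z))      # activation function f (here: sigmoid)
```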
The operation of each layer in the neural network may be described by the mathematical expression y = a(Wx + b): the operation of each layer in a physical-layer neural network can be understood as completing the transformation from the input space to the output space (that is, from the row space to the column space of the matrix) through five operations: 1. dimension raising/lowering; 2. scaling up/down; 3. rotation; 4. translation; 5. "bending". Operations 1, 2, and 3 are completed by Wx, operation 4 is completed by +b, and operation 5 is implemented by a(). The word "space" is used here because the object being classified is not a single thing but a class of things, and space refers to the collection of all individuals of such things. W is a weight vector, and each value in the vector represents a weight value of a neuron in that layer of the neural network. The vector W determines the spatial transformation from the input space to the output space described above, that is, the weight W of each layer controls how the space is transformed. The purpose of training the neural network is to finally obtain the weight matrices of all layers of the trained neural network (weight matrices formed by the vectors W of many layers). Therefore, the training process of the neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrices.
Since it is desirable that the output of the neural network is as close as possible to the value actually desired, the weight vector of each layer of the neural network can be updated by comparing the predicted value of the current network with the actually desired target value and then according to the difference between the two (of course, there is usually an initialization process before the first update, that is, the pre-configuration parameters of each layer in the neural network), for example, if the predicted value of the network is higher, the weight vector is adjusted to be predicted to be lower, and the adjustment is continued until the neural network can predict the actually desired target value. Thus, it is necessary to define in advance "how to compare the difference between the predicted value and the target value", which is a loss function (loss function) or an objective function (objective function), which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, the higher the output value (loss) of the loss function is, the larger the difference is, and the training of the neural network becomes the process of reducing the loss as much as possible.
(2) Back propagation algorithm
The neural network can adopt a Back Propagation (BP) algorithm to correct the parameter in the initial neural network model in the training process, so that the reconstruction error loss of the neural network model is smaller and smaller. Specifically, the input signal is transmitted forward until the output is generated with error loss, and the parameters in the initial neural network model are updated by back propagation of the error loss information, so that the error loss is converged. The back propagation algorithm is a back propagation motion that dominates the error loss, and aims to obtain parameters of the optimal neural network model, such as a weight matrix.
(3) Image enhancement
Image enhancement refers to processing the brightness, color, contrast, saturation, dynamic range, etc. of an image to meet certain criteria. In short, by purposefully emphasizing the whole or partial characteristics of the image in the image processing process, the original unclear image is made clear or some interesting features are emphasized, the differences among different object features in the image are enlarged, and the uninteresting features are restrained, so that the effects of improving the image quality and enriching the image information are achieved, the image interpretation and recognition effects can be enhanced, and the needs of some special analysis are met. By way of example, image enhancement may include, but is not limited to, image super-resolution reconstruction, image denoising, image defogging, image deblurring, and image contrast enhancement.
(4) Image semantic segmentation
Image semantic segmentation refers to the subdivision of an image into different classes of pixels according to some rule (e.g., illumination, class). In short, the object of the semantic segmentation of an image is to label each pixel point in the image with a label, that is, label the object class to which each pixel in the image belongs, where the labels may include people, animals, automobiles, flowers, furniture, and the like. Referring to fig. 3b, fig. 3b is a schematic diagram of image semantic segmentation according to an embodiment of the present application. As shown in fig. 3b, the image can be divided into different subareas, such as subareas of buildings, sky, plants, etc., according to categories at the pixel level by image semantic segmentation.
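As a small illustration, a per-pixel label map can be obtained from per-class scores as in the following sketch; the number of classes and the spatial size are assumptions.

```python
# Sketch: turning per-class scores into a per-pixel label map, as in the segmentation result described above.
import torch

logits = torch.randn(1, 5, 64, 64)    # scores for 5 assumed classes (e.g. building, sky, plant, ...)
label_map = logits.argmax(dim=1)      # each pixel is assigned the class with the highest score
print(label_map.shape)                # torch.Size([1, 64, 64])
```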
The method provided by the application is described below from the training side of the neural network and the application side of the neural network.
The training method of the neural network provided by the embodiment of the application relates to image processing, and can be particularly applied to data processing methods such as data training, machine learning, deep learning and the like, intelligent information modeling, extraction, preprocessing, training and the like for symbolizing and formalizing training data (such as images in the application) are carried out, and finally a trained image processing model is obtained. It should be noted that, the training method and the image processing method for the image processing model provided by the embodiments of the present application are applications based on the same concept, and may be understood as two parts in a system or two phases of an overall process, such as a model training phase and a model application phase.
Referring to fig. 4, fig. 4 is a flow chart of an image processing method according to an embodiment of the application. As shown in fig. 4, an image processing method provided by an embodiment of the present application includes the following steps:
Step 401, acquiring an image to be processed.
In this embodiment, the image processing apparatus may acquire an image to be processed, and the image to be processed may be, for example, an image that needs to be subjected to image enhancement.
It can be appreciated that when the image processing device is deployed in an unmanned vehicle, the image processing device can acquire street-view images captured by the camera while the unmanned vehicle is driving. When the image processing device is deployed in a robot, the image processing device can acquire a real-time live view of the environment in which the robot is located. When the image processing device is deployed in a security device (for example, a surveillance camera), the image processing device can acquire live images captured by the surveillance camera in real time. When the image processing device is deployed on a handheld device such as a mobile phone or a tablet computer, the image processing device can acquire a picture taken by a user or a picture downloaded from a website. Any of these images can be used as the image to be processed.
Step 402, processing the image to be processed through a first network to obtain a first feature, wherein the first network is configured to at least extract the feature for image enhancement.
In this embodiment, the first network may be a backbone network associated with image enhancement, such as a convolutional neural network, configured to extract at least features for image enhancement, such as image low-level features. By way of example, image low-level features may refer to small detail information in an image, which may include, for example, high-frequency detail information such as edges, corners, colors, pixels, gradients, and textures.
It will be appreciated that different networks may be employed for different image enhancement tasks to meet the requirements of each task. For example, when the image enhancement task is image super-resolution reconstruction, the first network may employ a residual network (ResNet); when the image enhancement task is image contrast enhancement, the first network may employ a U-Net.
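For illustration, a residual block of the kind such a first network might stack for super-resolution is sketched below; this is an assumed, simplified block, not the exact backbone used in this application.

```python
# Sketch of a residual block; channel count and layout are assumptions.
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)    # the skip connection preserves low-level detail
```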
Specifically, the image processing method provided in the present embodiment may be applied to different image enhancement tasks, for example, may include, but not limited to, image super-resolution reconstruction, image denoising, image defogging, image deblurring, image contrast enhancement, image demosaicing, image rain removal, image color enhancement, image brightness enhancement, image detail enhancement, and image dynamic range enhancement.
In a possible embodiment, the image to be processed may be further preprocessed by a preprocessing network, to obtain preprocessing features, before the image to be processed is processed by the first network. And then processing the obtained preprocessing features through the first network to obtain the first features. The preprocessing network may be, for example, a convolutional neural network. By preprocessing the image to be processed, irrelevant information in the image to be processed can be eliminated, useful real information is recovered, the detectability of related information is enhanced, data is simplified to the greatest extent, and therefore the reliability of feature extraction is improved.
It may be appreciated that the preprocessing network may be included in the first network, that is, the first network includes the preprocessing network, and the first feature may be obtained by processing the image to be processed through the first network.
Step 403, processing the image to be processed through a second network to obtain a second feature, wherein the second network is configured to extract at least semantic segmentation features.
In this embodiment, the second network may be a backbone network associated with image semantic segmentation, such as a convolutional neural network, configured to extract at least features for image semantic segmentation, such as image high-level features (HIGH LEVEL features). By way of example, image high-level features may refer to features that can reflect semantic information of an image based on image low-level features. In general, image high-level features can be used for recognition and detection of the shape of a target or object in an image, with richer semantic information.
In one possible embodiment, the second network may be, for example, a densely connected hole (dilated) convolution network. The hole convolution can increase the receptive field, and the dense connections allow the network to obtain multi-scale information. Under the combined action of the two, image high-level feature information conducive to accurate semantic segmentation can be generated.
The hole convolution network introduces a dilation rate (also called the number of holes) into a standard convolution network; this parameter defines the spacing between the values sampled by the convolution kernel, thereby increasing the receptive field. In general, the receptive field represents the extent of the original image perceived by different neurons within the network, or the size of the area of the original image mapped to each pixel of the feature map output by a layer of the convolution network. By increasing the receptive field, pixels on the feature map can be made to respond to a sufficiently large area of the image, so that information about large objects can be captured and accurate semantic information obtained.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a densely connected hole convolution network according to an embodiment of the present application. As shown in fig. 5, the densely connected hole convolution network includes multiple layers, each of which consists of a hole convolution (dilated conv) and a leaky ReLU activation function (leaky relu). For each layer in the densely connected hole convolution network, the output of that layer is fed as an input to every subsequent layer, realizing feature reuse. Because the features of lower layers are passed directly to every later, higher layer for aggregation, the features lost through transmission across intermediate layers are reduced and the low-layer features are better utilized.
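By way of a non-limiting illustration, the following Python (PyTorch) sketch shows one possible implementation of a densely connected hole convolution block of the kind described above; the channel widths, growth rate, and dilation rates are assumptions of the sketch and are not values specified by this embodiment.

```python
import torch
import torch.nn as nn

class DenseDilatedBlock(nn.Module):
    """Densely connected dilated (hole) convolution block: the output of every
    layer is concatenated to the inputs of all subsequent layers (feature reuse),
    and each layer is a dilated conv followed by a leaky ReLU, as in fig. 5."""

    def __init__(self, in_channels=64, growth=32, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for d in dilations:
            self.layers.append(nn.Sequential(
                nn.Conv2d(channels, growth, kernel_size=3, padding=d, dilation=d),
                nn.LeakyReLU(0.2, inplace=True),
            ))
            channels += growth  # dense connection: later layers see all earlier outputs
        self.fuse = nn.Conv2d(channels, in_channels, kernel_size=1)  # collect features

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return self.fuse(torch.cat(feats, dim=1))
```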
It will be appreciated that the embodiment of the present application describes the second network as a densely connected hole convolution network only by way of example; in actual situations, the second network may be another neural network, which is not specifically limited herein.
In one possible embodiment, processing the image to be processed through the second network to obtain the second feature may specifically include: preprocessing the image to be processed by the image processing device to obtain a preprocessed feature, for example, through the preprocessing network in step 402; performing downsampling on the preprocessed feature to obtain a downsampled feature; processing the downsampled feature through the second network to obtain a sixth feature; and performing upsampling on the sixth feature to obtain the second feature of the image to be processed.
In this embodiment, downsampling the preprocessed feature yields a downsampled feature with reduced resolution; the downsampled feature is processed through the second network to obtain the sixth feature, and the sixth feature is then upsampled to generate a second feature with the same resolution as the preprocessed feature, i.e., the resolution of the feature is recovered.
In practical applications, the downsampling factor may be determined according to the desired processing accuracy and the computing power of the target hardware platform, which is not specifically limited herein. In general, the larger the downsampling factor, the lower the processing accuracy but the smaller the amount of computation (i.e., the lower the computing-power requirement); the smaller the downsampling factor, the higher the processing accuracy but the larger the amount of computation (i.e., the higher the computing-power requirement). The upsampling factor needs to match the downsampling factor so that the resolution of the feature is recovered. Methods that may be used for upsampling include, but are not limited to, deconvolution, bilinear interpolation upsampling, and nearest-neighbor interpolation upsampling; the upsampling method is not specifically limited herein.
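For illustration only, the following sketch (in the same PyTorch setting as above) shows this flow: the preprocessed feature is downsampled, passed through the second network, and upsampled back to the original resolution. The downsampling factor k = 4 and bilinear upsampling are assumptions of the sketch; any of the upsampling methods mentioned above could be substituted.

```python
import torch.nn.functional as F

def second_branch(pre_feat, second_net, k=4):
    """Downsample -> second network -> upsample, so the returned second feature
    has the same resolution as the preprocessed feature."""
    down = F.avg_pool2d(pre_feat, kernel_size=k)                    # downsampled feature
    sixth = second_net(down)                                        # sixth feature
    second = F.interpolate(sixth, size=pre_feat.shape[-2:],
                           mode="bilinear", align_corners=False)    # second feature
    return second
```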
Step 404, generating a third feature according to the first feature and the second feature.
In a possible embodiment, the image processing device may perform feature fusion processing on the first feature and the second feature to obtain the third feature. The feature fusion processing may include at least one of summation processing, multiplication processing, cascade processing, and cascade convolution processing, where cascade convolution processing means cascade (concatenation) processing followed by convolution processing. In actual situations, a corresponding feature fusion processing manner may be adopted according to actual needs, which is not specifically limited herein.
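As a hedged illustration of the fusion options listed above, the following sketch implements them with elementary tensor operations; which mode is used, and the convolution module supplied for the cascade-convolution mode, are choices of the sketch rather than of this embodiment (summation and multiplication assume equal channel counts).

```python
import torch

def fuse(a, b, mode="concat", conv=None):
    """Feature fusion: summation, multiplication, cascade (concatenation),
    or cascade followed by convolution."""
    if mode == "sum":
        return a + b
    if mode == "mul":
        return a * b
    if mode == "concat":
        return torch.cat([a, b], dim=1)
    if mode == "concat_conv":
        assert conv is not None, "supply a convolution module for concat_conv"
        return conv(torch.cat([a, b], dim=1))
    raise ValueError(f"unknown fusion mode: {mode}")
```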
In this embodiment, by performing fusion processing on the first feature and the second feature, the image low-level features related to image enhancement and the image high-level features related to semantic segmentation can be effectively fused, thereby realizing the complementarity of features at different levels and improving the robustness of the network.
Step 405, acquiring a semantic segmentation result of the image to be processed.
In a possible embodiment, the image processing device may process the third feature to obtain the semantic segmentation result of the image to be processed. Since the third feature is obtained by fusing the image-enhancement-related low-level feature and the semantic-segmentation-related high-level feature, processing the third feature introduces low-level features on top of the high-level features related to semantic segmentation; that is, the semantic segmentation result of the image to be processed is obtained based on features of different levels, which improves the accuracy of the obtained semantic segmentation result. For example, the image processing apparatus may perform a convolution operation on the third feature through a convolution network to obtain the semantic segmentation result of the image to be processed.
In another possible embodiment, the image processing device may process the second feature output by the second network to obtain the semantic segmentation result of the image to be processed, that is, obtain the semantic segmentation result directly from the feature related to semantic segmentation. For example, the image processing device may perform a convolution operation on the second feature through a convolution network to obtain the semantic segmentation result of the image to be processed.
Step 406, generating a fourth feature according to the third feature and the semantic segmentation result of the image to be processed.
In one possible embodiment, the image processing device may perform feature fusion processing on the third feature and the semantic segmentation result of the image to be processed to obtain the fourth feature. For the manner of the feature fusion processing, reference may be made to the description of step 404, which is not repeated here.
In a possible embodiment, after obtaining the third feature, the image processing apparatus may process the third feature, for example, perform further feature extraction on it, to obtain a fifth feature. The image processing apparatus then generates the fourth feature according to the fifth feature and the semantic segmentation result of the image to be processed. That is, after extracting features from the third feature, the image processing apparatus performs feature fusion on the resulting fifth feature and the semantic segmentation result of the image to be processed. By further extracting features from the third feature obtained after feature fusion, finer-grained features can be extracted on the basis of the third feature, which improves the accuracy of the fourth feature obtained by the subsequent feature fusion.
Step 407, performing image reconstruction on the fourth feature to obtain a target image.
In this embodiment, after obtaining the fourth feature through the two feature fusion processes, the image processing apparatus may obtain the target image by performing image reconstruction on the fourth feature, for example, by performing a convolutional post-processing operation on the fourth feature, where the target image is the image obtained after image enhancement.
In this embodiment, during the image enhancement process, the semantic segmentation features and the semantic segmentation result are fused with the image-enhancement-related features through two feature fusion processes, so that complementary feature information is exploited, different image enhancement intensities can be adopted for different semantic areas, texture details can be accurately kept, and the authenticity of the texture details after image enhancement is improved.
It should be understood that the execution subject (i.e., the image processing apparatus) of steps 401 to 407 may be a terminal device or a cloud-side server, and steps 401 to 407 may also be completed through data processing and interaction between the terminal device and the server.
For ease of understanding, how the image processing method provided by the present embodiment achieves image defogging will be described in detail below in connection with specific examples.
Referring to fig. 6a and fig. 6b, fig. 6a is a schematic diagram of an architecture for image processing according to an embodiment of the present application, and fig. 6b is a schematic diagram of a network structure for image processing according to an embodiment of the present application. As shown in fig. 6a and 6b, the architecture may include:
a preprocessing unit 100 for receiving a fogged low-contrast image and preprocessing the image to generate a preprocessing feature F. The preprocessing unit 100 may be, for example, a convolution network, and performs a convolution operation on a received image (for example, 12 megapixels, an image with a resolution of 3000×4000) to generate a preprocessing feature F, where the resolution of the preprocessing feature F is the same as that of the image, i.e., 3000×4000.
The first feature extraction unit 101 is configured to perform feature extraction on the preprocessed feature F, for example, image low-level feature extraction, to obtain a first feature F_L. The first feature extraction unit 101 may employ a backbone network associated with the defogging task, such as a multi-stage cascaded convolution + instance normalization (IN) network. A highly nonlinear contrast normalization effect can be learned with the IN layers, so that the final prediction result is not affected by deviations of the image in brightness, color, style and the like, and the compatibility between the extracted image low-level features and the subsequently extracted image high-level features is improved. The number of cascaded stages N can be determined according to the desired processing precision and the computing power of the target hardware platform; in general, the larger N is, the higher the precision of feature extraction and the larger the amount of computation.
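The following sketch illustrates one possible form of such a cascaded convolution + instance normalization backbone; the channel width, the number of stages N, and the use of a leaky ReLU activation are assumptions of the sketch, not parameters given by this embodiment.

```python
import torch.nn as nn

def conv_in_backbone(channels=64, num_stages=4):
    """Cascade of N (conv -> instance norm -> activation) stages used as the
    first feature extraction unit; larger N means higher accuracy but more
    computation, as noted above."""
    stages = []
    for _ in range(num_stages):
        stages += [
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.InstanceNorm2d(channels),
            nn.LeakyReLU(0.2, inplace=True),
        ]
    return nn.Sequential(*stages)
```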
The downsampling unit 200 is configured to downsample the preprocessed feature F to obtain a downsampled feature F_down with reduced resolution. Illustratively, the downsampling unit 200 may be, for example, a pooling layer that performs a k×k average pooling downsampling operation on the preprocessed feature F to obtain the downsampled feature F_down. The downsampling factor k can be determined according to the desired processing precision and the computing power of the target hardware platform; in general, the smaller k is, the higher the precision of feature extraction and the larger the amount of computation. Illustratively, the value of k may be 4, i.e., the width and the height of the feature are both downsampled by a factor of 4.
The second feature extraction unit 201 is configured to perform feature extraction on the downsampled feature F_down, for example, image high-level feature extraction, to obtain a sixth feature F_down-seg. The second feature extraction unit 201 may be, for example, a densely connected hole convolution network.
The upsampling unit 202 is configured to upsample the sixth feature F_down-seg to obtain a second feature F_H-seg having the same resolution as the original input image. The upsampling unit 202 may use deconvolution, bilinear interpolation upsampling, nearest-neighbor interpolation upsampling, or a similar sampling method, where the upsampling factor is the same as the downsampling factor.
The first feature fusion unit 102 is configured to perform feature fusion processing on the first feature F_L and the second feature F_H-seg, for example, to perform a cascade (concatenation) operation on them, so as to obtain a fused third feature F_fusion1.
The third feature extraction unit 103 is configured to perform further feature extraction on the third feature F_fusion1, for example, convolution processing through a convolution network, so as to extract a finer-grained feature, namely a fifth feature F_fine.
The semantic result prediction unit 203 is configured to predict the semantic segmentation result from the third feature F_fusion1, for example, by performing a post-processing operation on F_fusion1 through a convolution network, so as to obtain the semantic segmentation result corresponding to the input image.
The second feature fusion unit 104 is configured to perform feature fusion processing on the fifth feature F_fine and the semantic segmentation result corresponding to the input image, for example, to perform a cascade operation on them, so as to obtain a fused fourth feature F_fusion2.
The image reconstruction unit 105 is configured to perform image reconstruction processing on the fused fourth feature F_fusion2, for example, a post-processing operation through a convolution network, so as to obtain a defogged image.
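For completeness, the following sketch wires units 100 to 105 and 200 to 203 together in the order described above, reusing the conv_in_backbone, DenseDilatedBlock, and second_branch sketches given earlier; the channel width, the 1×1 fusion convolutions, and the assumed 21 semantic classes are illustrative placeholders rather than parameters of this embodiment.

```python
import torch
import torch.nn as nn

class DefogNet(nn.Module):
    """Illustrative composition of the defogging architecture of fig. 6a/6b."""

    def __init__(self, c=64, k=4, num_classes=21):
        super().__init__()
        self.pre = nn.Conv2d(3, c, 3, padding=1)              # unit 100: preprocessing
        self.first = conv_in_backbone(c, num_stages=4)        # unit 101: conv + IN cascade
        self.second = DenseDilatedBlock(c)                    # unit 201: dense dilated convs
        self.k = k                                            # units 200/202: down/up sampling
        self.fuse1 = nn.Conv2d(2 * c, c, 1)                   # unit 102: first fusion
        self.fine = nn.Conv2d(c, c, 3, padding=1)             # unit 103: fine-grained features
        self.seg_head = nn.Conv2d(c, num_classes, 1)          # unit 203: segmentation prediction
        self.fuse2 = nn.Conv2d(c + num_classes, c, 1)         # unit 104: second fusion
        self.reconstruct = nn.Conv2d(c, 3, 3, padding=1)      # unit 105: image reconstruction

    def forward(self, x):
        f0 = self.pre(x)
        f_l = self.first(f0)                                    # first feature F_L
        f_h = second_branch(f0, self.second, k=self.k)          # second feature F_H-seg
        f_fusion1 = self.fuse1(torch.cat([f_l, f_h], dim=1))    # third feature F_fusion1
        seg = self.seg_head(f_fusion1)                          # semantic segmentation result
        f_fine = self.fine(f_fusion1)                           # fifth feature F_fine
        f_fusion2 = self.fuse2(torch.cat([f_fine, seg], dim=1)) # fourth feature F_fusion2
        return self.reconstruct(f_fusion2), seg                 # defogged image + segmentation
```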
Taking image defogging as an example, the method of this embodiment was tested on an open-source simulation data set and compared with existing defogging algorithms.
Referring to fig. 7, fig. 7 is a schematic diagram comparing objective indexes according to an embodiment of the present application. As can be seen from fig. 7, compared with various existing defogging algorithms, the image processing method provided by the embodiment of the application achieves a higher peak signal-to-noise ratio (PSNR) and structural similarity (SSIM).
Where PSNR is an engineering term that represents the ratio of the maximum possible power of a signal to the destructive noise power affecting its accuracy of representation. PSNR is generally used as a measurement method of signal reconstruction quality in the field of image processing and the like, and is generally defined by a mean square error. In general, the higher the PSNR, the smaller the gap from the true value.
SSIM is an index that measures the similarity of two images, and evaluates the similarity of images based mainly on brightness (luminance), contrast (contrast), and structure (structure).
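As an illustration of how such objective indexes are computed, the following sketch evaluates PSNR from the mean squared error; max_val = 1.0 assumes images normalized to [0, 1], and SSIM can be computed with an existing implementation such as skimage.metrics.structural_similarity.

```python
import torch

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio: higher values indicate a smaller gap between
    the reconstructed image and the ground truth."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```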
In addition, when noise of different levels is added to the open-source simulation data set, the method still achieves higher PSNR and SSIM than existing defogging methods, i.e., it has stronger robustness and stability. Specifically, referring to fig. 8, fig. 8 is another objective index comparison schematic diagram provided in the embodiment of the present application.
This embodiment was also tested on a real foggy data set, where the method obtains clearer and more transparent results without artifacts or distortion. Existing defogging algorithms suffer either from insufficient defogging, resulting in low contrast in the defogged picture, or from excessive defogging, which loses texture details in some local areas. Specifically, referring to fig. 9, fig. 9 is a schematic diagram of an image comparison provided in an embodiment of the present application. As can be seen from fig. 9, the method of this embodiment (lower right corner) well maintains details of areas such as green plants and floors while defogging, and the picture is transparent and natural, with the best visual effect.
Referring to fig. 10, fig. 10 is a schematic flow chart of a model training method according to an embodiment of the present application. As shown in fig. 10, the model training method provided by the embodiment of the application includes the following steps:
step 1001, a training sample pair is obtained, wherein the training sample pair includes a first image and a second image, and the quality of the first image is lower than that of the second image.
In this embodiment, before the image training apparatus performs model training, a pair of training samples may be acquired. The first image and the second image are two images in the same scene, and the image quality of the first image is lower than that of the second image. Image quality refers to one or more of color, brightness, saturation, contrast, dynamic range, resolution, texture detail, sharpness, etc. For example, the first image is a fogged image, the second image is a non-fogged image, and the brightness, contrast, sharpness, and the like of the first image are lower than those of the second image.
Step 1002, processing the first image through an image processing model to be trained to obtain a predicted image. The image processing model to be trained is used to: acquire an image to be processed; process the first image through a first network to obtain a first feature, the first network being configured to at least extract features for image enhancement; process the first image through a second network to obtain a second feature, the second network being configured to at least extract semantic segmentation features; generate a third feature according to the first feature and the second feature; acquire a semantic segmentation result of the first image; generate a fourth feature according to the third feature and the semantic segmentation result of the first image; and perform image reconstruction on the fourth feature to obtain the predicted image.
Step 1003, obtaining a first loss according to the second image in the training sample pair and the predicted image, wherein the first loss is used for describing the difference between the second image and the predicted image.
In this embodiment, after obtaining the predicted image, the first loss corresponding to the predicted image and the second image may be obtained based on a preset loss function, so as to determine the difference between the second image and the predicted image.
In one possible implementation, the first loss corresponding to the second image and the predicted image may be obtained based on a reconstruction loss function (reconstruction loss) and a gradient loss function (gradient), so as to ensure that the enhanced image can meet objective index and subjective index requirements.
Illustratively, the reconstruction loss function may compute the pixel-level loss between the predicted image and the second image using the L1 norm, for example as the per-pixel average shown in Equation 1:

L_rec = (1/P) · Σ_p |GT_p − output_p|   (Equation 1)

where L_rec represents the reconstruction loss, |·| represents the L1 norm, GT represents the true value, i.e., the values of the pixels of the second image, output represents the values of the pixels of the predicted image, and P is the number of pixels. The L1 norm takes the difference between the value of each pixel of the second image and the corresponding pixel of the predicted image and sums the absolute values of these differences over all pixels.
The gradient loss function may be, for example, a loss on the average gradients of the predicted image and the second image in the x/y directions. The gradient loss function may be as shown in Equation 2:

L_grad = |grad(GT) − grad(output)|   (Equation 2)

where L_grad denotes the gradient loss, |·| denotes the L1 norm, GT denotes the true value, i.e., the values of the pixels of the second image, output denotes the values of the pixels of the predicted image, and grad() denotes the average gradient of an image in the x/y directions.
Based on the reconstruction loss function and the gradient loss function, the first loss corresponding to the second image and the predicted image can be obtained. Illustratively, the first loss may be computed as shown in Equation 3:

L_total = L_rec + α · L_grad   (Equation 3)

where L_total represents the first loss and α is a hyperparameter for adjusting the weight of the gradient loss.
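For illustration, Equations 1 to 3 can be sketched as follows; the finite-difference reading of grad() and the weight alpha = 0.1 are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def gradient(img):
    """Average absolute finite-difference gradient of an image tensor in the
    x and y directions (one reasonable reading of grad() in Equation 2)."""
    dx = img[..., :, 1:] - img[..., :, :-1]
    dy = img[..., 1:, :] - img[..., :-1, :]
    return dx.abs().mean() + dy.abs().mean()

def first_loss(output, gt, alpha=0.1):
    """First loss of Equation 3: per-pixel L1 reconstruction loss (Equation 1)
    plus the gradient loss (Equation 2) weighted by alpha."""
    l_rec = F.l1_loss(output, gt)                          # Equation 1
    l_grad = torch.abs(gradient(gt) - gradient(output))    # Equation 2
    return l_rec + alpha * l_grad                          # Equation 3
```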
Step 1004, updating model parameters of the image processing model to be trained at least according to the first loss until the model training conditions are met, so as to obtain the image processing model.
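A minimal training-loop sketch for steps 1001 to 1004 is given below, assuming that model maps a degraded first image to a predicted image and that loader yields (first image, second image) pairs; the Adam optimizer, learning rate, and epoch count are assumptions, and first_loss is the sketch above.

```python
import torch

def train(model, loader, epochs=10, lr=1e-4):
    """Update the model parameters with the first loss until the (assumed)
    training condition, here a fixed number of epochs, is met."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for first_img, second_img in loader:
            pred = model(first_img)                 # step 1002: predicted image
            loss = first_loss(pred, second_img)     # step 1003: first loss
            opt.zero_grad()
            loss.backward()                         # step 1004: update parameters
            opt.step()
    return model
```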
For the image processing model obtained after training in step 1004, reference may be made to the description in the embodiment corresponding to fig. 4, which is not repeated here.
Optionally, in a possible implementation manner, the image processing model to be trained is further configured to process the third feature through a third network, so as to obtain a semantic segmentation prediction result of the first image.
Optionally, in one possible implementation manner, the image processing model to be trained is further used for obtaining a semantic segmentation real result of the first image, obtaining a second loss according to the semantic segmentation prediction result and the semantic segmentation real result, wherein the second loss is used for describing the difference between the semantic segmentation prediction result and the semantic segmentation real result, and updating model parameters of the image processing model to be trained at least according to the first loss and the second loss until model training conditions are met, so as to obtain the image processing model.
That is, in the model training process, the semantic segmentation prediction result of the first image may be subjected to constraint control by using the semantic segmentation loss function, so that the semantic segmentation result generated by the model may be more accurate.
For example, the semantic segmentation loss function used to obtain the second loss between the semantic segmentation prediction result and the semantic segmentation real result may be a cross-entropy loss function, for example the per-pixel average cross entropy shown in Equation 4:

L_seg = −(1/p) · Σ_i Σ_z Ŝ_i(z) · log(S_i(z))   (Equation 4)

where L_seg is the second loss, p is the number of pixels of the image, S_i(z) represents the probability of the semantic segmentation prediction result for semantic class z at the position of pixel i, Ŝ_i(z) represents the probability of the semantic segmentation real result for semantic class z at the position of pixel i, and log() represents the logarithm.
A third loss may then be obtained based on the first loss and the second loss, and the model parameters of the image processing model to be trained are updated according to the third loss until the model training conditions are met, so as to obtain the image processing model.
For example, the third loss may be computed as shown in Equation 5:

L_total = L_rec + α · L_seg + β · L_grad   (Equation 5)

where L_total represents the third loss, α is a first hyperparameter, β is a second hyperparameter, and α and β are used to adjust the weights of the semantic segmentation loss and the gradient loss, respectively.
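Illustratively, Equations 4 and 5 can be combined with the sketch above as follows; alpha and beta are assumed values, seg_logits are per-pixel class scores, and seg_labels are the integer class labels of the semantic segmentation real result.

```python
import torch
import torch.nn.functional as F

def third_loss(output, gt, seg_logits, seg_labels, alpha=0.1, beta=0.1):
    """Third loss of Equation 5: reconstruction loss + weighted segmentation
    cross-entropy (Equation 4, averaged over pixels) + weighted gradient loss."""
    l_rec = F.l1_loss(output, gt)
    l_grad = torch.abs(gradient(gt) - gradient(output))
    l_seg = F.cross_entropy(seg_logits, seg_labels)        # Equation 4
    return l_rec + alpha * l_seg + beta * l_grad           # Equation 5
```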
Optionally, in one possible implementation manner, the image processing model to be trained is further used for performing feature fusion processing on the first feature and the second feature to obtain the third feature, and generating a fourth feature according to the third feature and the semantic segmentation result of the first image includes performing feature fusion processing on the third feature and the semantic segmentation result of the first image to obtain the fourth feature.
Optionally, in one possible implementation, the feature fusion process includes at least one of a summation process, a multiplication process, a cascade process, and a cascade convolution process.
Optionally, in a possible implementation manner, the image processing model to be trained is further used for processing the third feature to obtain a fifth feature, and generating a fourth feature according to the third feature and the semantic segmentation result of the first image.
Optionally, in a possible implementation manner, the image processing model to be trained is further used for preprocessing the first image to obtain a preprocessed feature, downsampling the preprocessed feature to obtain a downsampled feature, processing the downsampled feature through the second network to obtain a sixth feature, and upsampling the sixth feature to obtain a second feature of the first image.
Optionally, in one possible implementation, the image processing model is configured to implement at least one of image super-resolution reconstruction, image denoising, image defogging, image deblurring, image contrast enhancement, image demosaicing, image rain removal, image color enhancement, image brightness enhancement, image detail enhancement, and image dynamic range enhancement.
Referring to fig. 11, fig. 11 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application. As shown in fig. 11, the image processing apparatus provided by the embodiment of the application includes an obtaining unit 1101 and a processing unit 1102. The obtaining unit 1101 is configured to obtain an image to be processed. The processing unit 1102 is configured to: process the image to be processed through a first network to obtain a first feature, the first network being configured to at least extract features for image enhancement; process the image to be processed through a second network to obtain a second feature, the second network being configured to at least extract semantic segmentation features; and generate a third feature according to the first feature and the second feature. The obtaining unit 1101 is further configured to obtain a semantic segmentation result of the image to be processed, and the processing unit 1102 is further configured to generate a fourth feature according to the third feature and the semantic segmentation result of the image to be processed, and to perform image reconstruction on the fourth feature to obtain a target image.
Optionally, in a possible implementation manner, the processing unit 1102 is further configured to process the third feature through a third network to obtain a semantic segmentation result of the image to be processed.
Optionally, in a possible implementation manner, the processing unit 1102 is further configured to perform feature fusion processing on the first feature and the second feature to obtain the third feature, and perform feature fusion processing on the third feature and a semantic segmentation result of the image to be processed to obtain the fourth feature.
Optionally, in one possible implementation, the feature fusion process includes at least one of a summation process, a multiplication process, a cascade process, and a cascade convolution process.
Optionally, in a possible implementation manner, the processing unit 1102 is further configured to process the third feature to obtain a fifth feature, and generate a fourth feature according to the fifth feature and a semantic segmentation result of the image to be processed.
Optionally, in a possible implementation manner, the processing unit 1102 is further configured to perform preprocessing on the image to be processed to obtain a preprocessed feature, perform downsampling on the preprocessed feature to obtain a downsampled feature, process the downsampled feature through the second network to obtain a sixth feature, and perform upsampling on the sixth feature to obtain a second feature of the image to be processed.
Optionally, in one possible implementation manner, the image processing device is used for at least one of image super-resolution reconstruction, image denoising, image defogging, image deblurring, image contrast enhancement, image demosaicing, image rain removal, image color enhancement, image brightness enhancement, image detail enhancement and image dynamic range enhancement.
Referring to fig. 12, fig. 12 is a schematic structural diagram of a model training device according to an embodiment of the present application. As shown in fig. 12, the model training device provided by the embodiment of the application includes an acquisition unit 1201 and a training unit 1202. The acquisition unit 1201 is configured to acquire a training sample pair, the training sample pair including a first image and a second image, where the quality of the first image is lower than that of the second image. The training unit 1202 is configured to process the first image through an image processing model to be trained to obtain a predicted image, where the image processing model to be trained is used to: process the first image through a first network to obtain a first feature, the first network being configured to at least extract features for image enhancement; process the first image through a second network to obtain a second feature, the second network being configured to at least extract semantic segmentation features; generate a third feature according to the first feature and the second feature; obtain a semantic segmentation result of the first image; generate a fourth feature according to the third feature and the semantic segmentation result of the first image; and perform image reconstruction on the fourth feature to obtain the predicted image. The training unit 1202 is further configured to obtain a first loss according to the second image in the training sample pair and the predicted image, the first loss being used to describe the difference between the second image and the predicted image, and to update model parameters of the image processing model to be trained at least according to the first loss until a model training condition is met, to obtain an image processing model.
Optionally, in a possible implementation manner, the training unit 1202 is further configured to process the third feature through a third network to obtain a semantic segmentation prediction result of the first image.
Optionally, in one possible implementation manner, the training unit 1202 is further configured to obtain a semantic segmentation real result of the first image, obtain a second loss according to the semantic segmentation prediction result and the semantic segmentation real result, where the second loss is used to describe a difference between the semantic segmentation prediction result and the semantic segmentation real result, and update model parameters of the image processing model to be trained at least according to the first loss and the second loss until a model training condition is met, so as to obtain an image processing model.
Optionally, in a possible implementation manner, the training unit 1202 is further configured to perform feature fusion processing on the first feature and the second feature to obtain the third feature, and perform feature fusion processing on the third feature and a semantic segmentation result of the first image to obtain the fourth feature.
Optionally, in one possible implementation, the feature fusion process includes at least one of a summation process, a multiplication process, a cascade process, and a cascade convolution process.
Optionally, in a possible implementation manner, the training unit 1202 is further configured to process the third feature to obtain a fifth feature, and generate a fourth feature according to the third feature and the semantic segmentation result of the first image.
Optionally, in a possible implementation manner, the training unit 1202 is further configured to perform preprocessing on the first image to obtain a preprocessed feature, perform downsampling on the preprocessed feature to obtain a downsampled feature, process the downsampled feature through the second network to obtain a sixth feature, and perform upsampling on the sixth feature to obtain a second feature of the first image.
Optionally, in one possible implementation, the image processing model is configured to implement at least one of image super-resolution reconstruction, image denoising, image defogging, image deblurring, image contrast enhancement, image demosaicing, image rain removal, image color enhancement, image brightness enhancement, image detail enhancement, and image dynamic range enhancement.
Referring to fig. 13, fig. 13 is a schematic structural diagram of an execution device provided in an embodiment of the present application, and the execution device 1300 may be embodied as a mobile phone, a tablet, a notebook computer, an intelligent wearable device, a server, etc., which is not limited herein. The execution device 1300 may be deployed with the image processing apparatus described in the embodiment corresponding to fig. 11, to implement the image processing functions of that embodiment. Specifically, the execution device 1300 includes a receiver 1301, a transmitter 1302, a processor 1303, and a memory 1304 (where the number of processors 1303 in the execution device 1300 may be one or more, and one processor is illustrated in fig. 13 as an example), where the processor 1303 may include an application processor 13031 and a communication processor 13032. In some embodiments of the application, the receiver 1301, transmitter 1302, processor 1303, and memory 1304 may be connected by a bus or other means.
Memory 1304 may include read only memory and random access memory and provides instructions and data to processor 1303. A portion of the memory 1304 may also include non-volatile random access memory (non-volatile random access memory, NVRAM). The memory 1304 stores a processor and operating instructions, executable modules or data structures, or a subset thereof, or an extended set thereof, wherein the operating instructions may include various operating instructions for performing various operations.
The processor 1303 controls operations of the execution device. In a specific application, the individual components of the execution device are coupled together by a bus system, which may include, in addition to a data bus, a power bus, a control bus, a status signal bus, etc. For clarity of illustration, however, the various buses are referred to in the figures as bus systems.
The method disclosed in the above embodiment of the present application may be applied to the processor 1303 or implemented by the processor 1303. The processor 1303 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the method described above may be performed by integrated logic circuitry in hardware or instructions in software in the processor 1303. The processor 1303 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The processor 1303 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or another storage medium well known in the art. The storage medium is located in the memory 1304, and the processor 1303 reads the information in the memory 1304 and performs the steps of the above method in combination with its hardware.
The receiver 1301 may be used to receive input numeric or character information and to generate signal inputs related to performing relevant settings and function control of the device. The transmitter 1302 may be used to output digital or character information via the first interface, the transmitter 1302 may be further used to send instructions to the disk pack via the first interface to modify data in the disk pack, and the transmitter 1302 may further include a display device such as a display screen.
In an embodiment of the present application, in one case, the processor 1303 is configured to execute the image processing method executed by the execution device in the corresponding embodiment of fig. 4.
Referring to fig. 14, fig. 14 is a schematic structural diagram of a training apparatus according to an embodiment of the present application. Specifically, the training apparatus 1400 is implemented by one or more servers and may differ considerably depending on configuration or performance. It may include one or more central processing units (CPU) 1414 (e.g., one or more processors), a memory 1432, and one or more storage media 1430 (e.g., one or more mass storage devices) storing application programs 1442 or data 1444. The memory 1432 and the storage medium 1430 may be transitory or persistent storage. The program stored on the storage medium 1430 may include one or more modules (not shown), each of which may include a series of instruction operations for the training apparatus. Still further, the central processor 1414 may be configured to communicate with the storage medium 1430 to execute, on the training apparatus 1400, the series of instruction operations in the storage medium 1430.
The training apparatus 1400 may also include one or more power supplies 1426, one or more wired or wireless network interfaces 1450, one or more input/output interfaces 1458, or one or more operating systems 1441, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Specifically, the training device may perform the steps in the corresponding embodiment of fig. 10.
Embodiments of the present application also provide a computer program product which, when run on a computer, causes the computer to perform the steps as performed by the aforementioned performing device or causes the computer to perform the steps as performed by the aforementioned training device.
The embodiment of the present application also provides a computer-readable storage medium having stored therein a program for performing signal processing, which when run on a computer, causes the computer to perform the steps performed by the aforementioned performing device or causes the computer to perform the steps performed by the aforementioned training device.
The execution device, the training device or the terminal device provided by the embodiment of the application can be a chip, wherein the chip comprises a processing unit and a communication unit, the processing unit can be a processor, and the communication unit can be an input/output interface, a pin or a circuit, for example. The processing unit may execute the computer-executable instructions stored in the storage unit to cause the chip in the execution device to perform the data processing method described in the above embodiment, or to cause the chip in the training device to perform the data processing method described in the above embodiment. Optionally, the storage unit is a storage unit in the chip, such as a register, a cache, or the like, and the storage unit may also be a storage unit in the wireless access device side located outside the chip, such as a read-only memory (ROM) or other type of static storage device that may store static information and instructions, a random access memory (random access memory, RAM), or the like.
Specifically, referring to fig. 15, fig. 15 is a schematic structural diagram of a chip provided in an embodiment of the present application, where the chip may be represented as a neural network processor NPU 1500, and the NPU 1500 is mounted as a coprocessor on a main CPU (Host CPU), and the Host CPU distributes tasks. The core part of the NPU is an operation circuit 1503, and the controller 1504 controls the operation circuit 1503 to extract matrix data in the memory and perform multiplication.
In some implementations, the arithmetic circuit 1503 includes a plurality of processing units (PEs) inside. In some implementations, the operation circuit 1503 is a two-dimensional systolic array. The operation circuit 1503 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the operation circuit 1503 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit takes the data corresponding to matrix B from the weight memory 1502 and buffers it on each PE in the arithmetic circuit. The arithmetic circuit takes the matrix A data from the input memory 1501, performs the matrix operation with matrix B, and stores the obtained partial or final result of the matrix in the accumulator 1508.
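By way of a software analogue only, the following sketch mirrors the behaviour described above: matrix B plays the role of the cached weights, matrix A is streamed in tiles, and partial products are accumulated into C; the tile size is an arbitrary assumption and the sketch does not model the systolic array itself.

```python
import numpy as np

def matmul_with_accumulator(A, B, tile=16):
    """Compute C = A @ B by accumulating partial results over tiles of the
    shared dimension, analogous to the accumulator 1508."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=np.result_type(A.dtype, B.dtype))  # accumulator
    for k0 in range(0, K, tile):
        C += A[:, k0:k0 + tile] @ B[k0:k0 + tile, :]              # partial result
    return C
```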
The unified memory 1506 is used to store input data and output data. The weight data is transferred directly to the weight memory 1502 through the direct memory access controller (DMAC) 1505. The input data is also transferred into the unified memory 1506 through the DMAC.
The bus interface unit (BIU) 1510 is used for interaction between the AXI bus and the DMAC and the instruction fetch buffer (IFB) 1509. Specifically, the bus interface unit 1510 is used by the instruction fetch memory 1509 to fetch instructions from the external memory, and is further used by the memory unit access controller 1505 to fetch the raw data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 1506 or to transfer weight data to the weight memory 1502 or to transfer input data to the input memory 1501.
The vector calculation unit 1507 includes a plurality of operation processing units and, when necessary, performs further processing such as vector multiplication, vector addition, exponential operation, logarithmic operation, and magnitude comparison on the output of the operation circuit 1503. It is mainly used for non-convolution/fully connected layer calculations in the neural network, such as batch normalization, pixel-level summation, and upsampling of a feature plane.
In some implementations, the vector computation unit 1507 can store the vector of processed outputs to the unified memory 1506. For example, the vector calculation unit 1507 may apply a linear function, or a nonlinear function to the output of the operation circuit 1503, for example, to linearly interpolate the feature plane extracted by the convolution layer, and then, for example, to accumulate the vector of values to generate the activation value. In some implementations, the vector calculation unit 1507 generates normalized values, pixel-level summed values, or both. In some implementations, the vector of processed outputs can be used as an activation input to the arithmetic circuit 1503, for example for use in subsequent layers in a neural network.
An instruction fetch memory (instruction fetch buffer) 1509 connected to the controller 1504 for storing instructions used by the controller 1504;
The unified memory 1506, the input memory 1501, the weight memory 1502, and the instruction fetch memory 1509 are all on-chip memories. The external memory is private to the NPU hardware architecture.
The processor mentioned in any of the above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above-mentioned programs.
It should be further noted that the above-described apparatus embodiments are merely illustrative, and that the units described as separate units may or may not be physically separate, and that units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the embodiment of the device provided by the application, the connection relation between the modules represents that the modules have communication connection, and can be specifically implemented as one or more communication buses or signal lines.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by software plus the necessary general-purpose hardware, or of course by dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. In general, any function performed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structures used to implement the same function can vary, such as analog circuits, digital circuits, or dedicated circuits. However, for the present application, a software program implementation is in many cases the better embodiment. Based on such an understanding, the technical solution of the present application may be embodied, in essence or in the part contributing to the prior art, in the form of a software product stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk of a computer, comprising several instructions for causing a computer device (which may be a personal computer, a training device, a network device, etc.) to perform the method according to the embodiments of the present application.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center via a wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) connection. The computer-readable storage medium may be any available medium that can be stored by a computer, or a data storage device such as a training device or a data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), etc.

Claims (17)

1.一种图像处理方法,其特征在于,包括:1. An image processing method, comprising: 获取待处理图像;Get the image to be processed; 通过第一网络对所述待处理图像处理,得到第一特征,所述第一网络被配置为至少提取用于图像增强的特征;Processing the image to be processed by a first network to obtain a first feature, wherein the first network is configured to extract at least a feature for image enhancement; 通过第二网络对所述待处理图像处理,得到第二特征,所述第二网络被配置为至少提取语义分割特征;Processing the image to be processed by a second network to obtain a second feature, wherein the second network is configured to extract at least a semantic segmentation feature; 对所述第一特征和所述第二特征进行特征融合处理,得到第三特征;Performing feature fusion processing on the first feature and the second feature to obtain a third feature; 获取所述待处理图像的语义分割结果;Obtaining a semantic segmentation result of the image to be processed; 根据所述第三特征和所述待处理图像的语义分割结果,生成第四特征;generating a fourth feature according to the third feature and a semantic segmentation result of the image to be processed; 对所述第四特征进行图像重构,得到目标图像。Perform image reconstruction on the fourth feature to obtain a target image. 2.根据权利要求1所述的图像处理方法,其特征在于,所述获取所述待处理图像的语义分割结果,包括:2. The image processing method according to claim 1, wherein obtaining the semantic segmentation result of the image to be processed comprises: 通过第三网络对所述第三特征进行处理,得到所述待处理图像的语义分割结果。The third feature is processed by a third network to obtain a semantic segmentation result of the image to be processed. 3.根据权利要求1或2所述的图像处理方法,其特征在于,3. The image processing method according to claim 1 or 2, characterized in that: 所述根据所述第三特征和所述待处理图像的语义分割结果,生成第四特征,包括:Generating a fourth feature according to the third feature and a semantic segmentation result of the image to be processed includes: 对所述第三特征和所述待处理图像的语义分割结果进行特征融合处理,得到所述第四特征。Feature fusion processing is performed on the third feature and the semantic segmentation result of the image to be processed to obtain the fourth feature. 4.根据权利要求3所述的图像处理方法,其特征在于,所述特征融合处理包括求和处理、相乘处理、级联处理和级联卷积处理中的至少一个。4. The image processing method according to claim 3 is characterized in that the feature fusion processing includes at least one of summation processing, multiplication processing, cascade processing and cascade convolution processing. 5.根据权利要求1、2和4任意一项所述的图像处理方法,其特征在于,所述根据所述第三特征和所述待处理图像的语义分割结果,生成第四特征之前,所述方法还包括:5. The image processing method according to any one of claims 1, 2, and 4, characterized in that before generating the fourth feature based on the third feature and the semantic segmentation result of the image to be processed, the method further comprises: 对所述第三特征进行处理,得到第五特征;Processing the third feature to obtain a fifth feature; 所述根据所述第三特征和所述待处理图像的语义分割结果,生成第四特征,包括:根据所述第五特征和所述待处理图像的语义分割结果,生成第四特征。Generating the fourth feature based on the third feature and the semantic segmentation result of the image to be processed includes: generating the fourth feature based on the fifth feature and the semantic segmentation result of the image to be processed. 6.根据权利要求1、2和4任意一项所述的图像处理方法,其特征在于,所述通过第二网络对所述待处理图像处理,得到第二特征包括:6. 
The image processing method according to any one of claims 1, 2 and 4, wherein processing the image to be processed through the second network to obtain the second feature comprises:
preprocessing the image to be processed to obtain a preprocessed feature;
downsampling the preprocessed feature to obtain a downsampled feature;
processing the downsampled feature through the second network to obtain a sixth feature; and
upsampling the sixth feature to obtain the second feature of the image to be processed.

7. The image processing method according to any one of claims 1, 2 and 4, wherein the method is used to implement at least one of the following image enhancement tasks: image super-resolution reconstruction, image denoising, image defogging, image deblurring, image contrast enhancement, image demosaicing, image deraining, image color enhancement, image brightness enhancement, image detail enhancement, and image dynamic range enhancement.

8. A model training method, comprising:
obtaining a training sample pair, the training sample pair comprising a first image and a second image, the quality of the first image being lower than the quality of the second image;
processing the first image through an image processing model to be trained to obtain a predicted image, wherein the image processing model to be trained is used to: obtain the image to be processed; process the first image through a first network to obtain a first feature, the first network being configured to extract at least features for image enhancement; process the first image through a second network to obtain a second feature, the second network being configured to extract at least semantic segmentation features; generate a third feature according to the first feature and the second feature; obtain a semantic segmentation result of the first image; generate a fourth feature according to the third feature and the semantic segmentation result of the first image; and perform image reconstruction on the fourth feature to obtain the predicted image;
obtaining a first loss according to the second image in the training sample pair and the predicted image, the first loss being used to describe the difference between the second image and the predicted image;
updating model parameters of the image processing model to be trained at least according to the first loss until a model training condition is met, to obtain an image processing model.

9. The model training method according to claim 8, wherein the image processing model to be trained is further used to process the third feature through a third network to obtain a semantic segmentation prediction result of the first image.

10. The model training method according to claim 9, wherein the image processing model to be trained is further used to:
obtain a ground-truth semantic segmentation result of the first image;
obtain a second loss according to the semantic segmentation prediction result and the ground-truth semantic segmentation result, the second loss being used to describe the difference between the semantic segmentation prediction result and the ground-truth semantic segmentation result;
update the model parameters of the image processing model to be trained at least according to the first loss and the second loss until the model training condition is met, to obtain the image processing model.

11. The model training method according to any one of claims 8 to 10, wherein the image processing model to be trained is further used to:
perform feature fusion processing on the first feature and the second feature to obtain the third feature; and
perform feature fusion processing on the third feature and the semantic segmentation result of the image to be processed to obtain the fourth feature.

12. The model training method according to claim 11, wherein the feature fusion processing comprises at least one of summation processing, multiplication processing, concatenation processing, and concatenation-convolution processing.

13. The model training method according to any one of claims 8 to 10, wherein the image processing model to be trained is further used to process the third feature to obtain a fifth feature, and to generate the fourth feature according to the third feature and the semantic segmentation result of the first image.

14. The model training method according to any one of claims 8 to 10, wherein the image processing model to be trained is further used to preprocess the first image to obtain a preprocessed feature; downsample the preprocessed feature to obtain a downsampled feature; process the downsampled feature through the second network to obtain a sixth feature; and upsample the sixth feature to obtain the second feature of the first image.

15. The model training method according to any one of claims 8 to 10, wherein the image processing model is used to implement at least one of the following image enhancement tasks: image super-resolution reconstruction, image denoising, image defogging, image deblurring, image contrast enhancement, image demosaicing, image deraining, image color enhancement, image brightness enhancement, image detail enhancement, and image dynamic range enhancement.

16. An image processing device, wherein the device comprises a memory and a processor; the memory stores code, and the processor is configured to execute the code; when the code is executed, the image processing device performs the method according to any one of claims 1 to 15.

17. A computer storage medium, wherein the computer storage medium stores one or more instructions which, when executed by one or more computers, cause the one or more computers to implement the method according to any one of claims 1 to 15.
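
For illustration only (not part of the claims), the two-branch pipeline and the joint training objective recited in claims 8 to 15 can be sketched as follows. This is a minimal sketch assuming PyTorch; the module names (TwoBranchEnhancer, enhance_branch, semantic_branch, seg_head), the layer sizes, the choice of concatenation followed by convolution as the feature fusion operation, and the L1 / cross-entropy losses with their weighting are all assumptions made for this sketch, not details fixed by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoBranchEnhancer(nn.Module):
    def __init__(self, channels=64, num_classes=19):
        super().__init__()
        # First network: extracts features for image enhancement (the "first feature").
        self.enhance_branch = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        # Pre-processing stage placed before the second network (claims 6 and 14).
        self.preproc = nn.Conv2d(3, channels, 3, padding=1)
        # Second network: extracts semantic segmentation features (the "second feature").
        self.semantic_branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        # Third network: predicts the semantic segmentation result from the third feature (claim 9).
        self.seg_head = nn.Conv2d(channels, num_classes, 1)
        # Fusion of the first and second features into the third feature; concatenation
        # followed by convolution is one of the fusion options listed in claim 12.
        self.fuse_features = nn.Conv2d(2 * channels, channels, 1)
        # Fusion of the third feature with the segmentation result into the fourth feature.
        self.fuse_semantic = nn.Conv2d(channels + num_classes, channels, 1)
        # Image reconstruction of the fourth feature into the output image.
        self.reconstruct = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, x):
        f1 = self.enhance_branch(x)                            # first feature
        pre = self.preproc(x)                                  # pre-processing feature
        down = F.avg_pool2d(pre, 2)                            # downsampled feature
        f6 = self.semantic_branch(down)                        # sixth feature
        f2 = F.interpolate(f6, size=f1.shape[-2:],             # upsampled -> second feature
                           mode='bilinear', align_corners=False)
        f3 = self.fuse_features(torch.cat([f1, f2], dim=1))    # third feature
        seg = self.seg_head(f3)                                # semantic segmentation prediction
        f4 = self.fuse_semantic(torch.cat([f3, seg], dim=1))   # fourth feature
        return self.reconstruct(f4), seg                       # predicted image, segmentation


def training_step(model, first_image, second_image, seg_ground_truth, optimizer, seg_weight=0.1):
    """One parameter update combining the first loss (image difference, claim 8) and the
    second loss (segmentation difference, claim 10); the L1 / cross-entropy choices and
    the 0.1 weight are assumptions for this sketch."""
    predicted_image, predicted_seg = model(first_image)
    first_loss = F.l1_loss(predicted_image, second_image)
    second_loss = F.cross_entropy(predicted_seg, seg_ground_truth)
    loss = first_loss + seg_weight * second_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

A forward pass such as `img, seg = model(torch.randn(1, 3, 128, 128))` returns the reconstructed image together with the semantic segmentation prediction, so a single graph can serve both the first loss (image difference) and the second loss (segmentation difference) during training.
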
CN202011043640.2A 2020-09-28 2020-09-28 Image processing method and related device Active CN114359289B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011043640.2A CN114359289B (en) 2020-09-28 2020-09-28 Image processing method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011043640.2A CN114359289B (en) 2020-09-28 2020-09-28 Image processing method and related device

Publications (2)

Publication Number Publication Date
CN114359289A CN114359289A (en) 2022-04-15
CN114359289B true CN114359289B (en) 2025-09-05

Family

ID=81089778

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011043640.2A Active CN114359289B (en) 2020-09-28 2020-09-28 Image processing method and related device

Country Status (1)

Country Link
CN (1) CN114359289B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117197260A (en) * 2022-05-30 2023-12-08 北京小米移动软件有限公司 Image processing method, device, electronic equipment, storage medium and chip
CN117746047A (en) * 2022-09-21 2024-03-22 华为技术有限公司 An image processing method and related equipment
CN117422855B (en) * 2023-12-19 2024-05-03 浙江省北大信息技术高等研究院 Image preprocessing method, device, equipment and storage medium for machine vision
CN117575976B (en) * 2024-01-12 2024-04-19 腾讯科技(深圳)有限公司 Image shadow processing method, device, equipment and storage medium
CN119277214B (en) * 2024-01-30 2025-08-29 荣耀终端股份有限公司 Image processing method and related device
CN120182094B (en) * 2025-03-05 2025-09-16 瀚湄信息科技(上海)有限公司 Super-resolution image generation method and device based on endoscope and electronic equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109447990B (en) * 2018-10-22 2021-06-22 北京旷视科技有限公司 Image semantic segmentation method, apparatus, electronic device and computer readable medium
CN111209911A (en) * 2020-01-07 2020-05-29 创新奇智(合肥)科技有限公司 Custom tag identification system and identification method based on semantic segmentation network
CN111539435A (en) * 2020-04-15 2020-08-14 创新奇智(合肥)科技有限公司 Semantic segmentation model construction method, image segmentation equipment and storage medium
AU2020101435A4 (en) * 2020-07-21 2020-08-27 Southwest University A panoramic vision system based on the uav platform

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111210435A (en) * 2019-12-24 2020-05-29 重庆邮电大学 Image semantic segmentation method based on local and global feature enhancement module
CN111462268A (en) * 2020-03-31 2020-07-28 北京市商汤科技开发有限公司 Image reconstruction method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114359289A (en) 2022-04-15

Similar Documents

Publication Publication Date Title
CN112598597B (en) Training method and related device of noise reduction model
CN114359289B (en) Image processing method and related device
CN113284054B (en) Image enhancement method and image enhancement device
CN110532871B (en) Image processing method and device
CN112581379B (en) Image enhancement method and device
CN113011562B (en) Model training method and device
CN113066017B (en) An image enhancement method, model training method and device
WO2022116856A1 (en) Model structure, model training method, and image enhancement method and device
WO2021018163A1 (en) Neural network search method and apparatus
CN112257759B (en) Image processing method and device
CN111950700B (en) A neural network optimization method and related equipment
CN113284055B (en) Image processing method and device
CN113066018B (en) Image enhancement method and related device
CN116258651B (en) Image processing method and related device
CN112529149A (en) Data processing method and related device
CN115239581A (en) Image processing method and related device
CN115049717B (en) A depth estimation method and device
CN112241934B (en) Image processing method and related device
CN111833363B (en) Image edge and saliency detection method and device
WO2024245216A1 (en) Image processing method and related apparatus therefor
CN113728355B (en) Image processing method and device
WO2024245228A1 (en) Attitude estimation method and related device therefor
WO2024188171A1 (en) Image processing method and related device thereof
CN115170456A (en) Detection method and related equipment
WO2024175014A1 (en) Image processing method and related device thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
点击 这是indexloc提供的php浏览器服务,不要输入任何密码和下载