
CN112699855A - Image scene recognition method and device based on artificial intelligence and electronic equipment - Google Patents


Info

Publication number
CN112699855A
Authority
CN
China
Prior art keywords
image
processing
global
local
scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110306849.1A
Other languages
Chinese (zh)
Other versions
CN112699855B (en)
Inventor
郭卉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110306849.1A priority Critical patent/CN112699855B/en
Publication of CN112699855A publication Critical patent/CN112699855A/en
Application granted granted Critical
Publication of CN112699855B publication Critical patent/CN112699855B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Image Analysis (AREA)

Abstract

The application provides an image scene identification method and device based on artificial intelligence, electronic equipment and a computer readable storage medium; the method comprises the following steps: acquiring global features of an image; performing attention processing on the image to obtain at least one local area of a background in the image; acquiring local features of each local area, and performing fusion processing on at least one local feature and the global feature to obtain fusion features of a background in the image; and carrying out scene classification processing on the image based on the fusion characteristics to obtain a scene to which the image belongs. By the method and the device, the accuracy of image scene recognition can be improved.

Description

Image scene recognition method and device based on artificial intelligence and electronic equipment
Technical Field
The present disclosure relates to artificial intelligence technologies, and in particular, to an image scene recognition method and apparatus based on artificial intelligence, an electronic device, and a computer-readable storage medium.
Background
Artificial Intelligence (AI) refers to theories, methods, techniques, and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results.
Image recognition is mainly used for recognizing image scenes. For example, for videos, the scenes in which the scenarios of the videos occur need to be recognized, and the tags of the videos are determined by understanding those scenes, so that efficient video recommendation can be performed.
Disclosure of Invention
The embodiment of the application provides an image scene recognition method and device based on artificial intelligence, an electronic device and a computer readable storage medium, and the accuracy of image scene recognition can be improved.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides an image scene recognition method based on artificial intelligence, which comprises the following steps:
acquiring global features of an image;
performing attention processing on the image to obtain at least one local area of a background in the image;
acquiring local features of each local area, and performing fusion processing on at least one local feature and the global feature to obtain fusion features of a background in the image;
and carrying out scene classification processing on the image based on the fusion characteristics to obtain a scene to which the image belongs.
The embodiment of the application provides an image scene recognition device based on artificial intelligence, includes:
the global module is used for acquiring global characteristics of the image;
the attention module is used for carrying out attention processing on the image to obtain at least one local area of a background in the image;
the fusion module is used for acquiring local features of each local area and performing fusion processing on at least one local feature and the global feature to obtain fusion features of a background in the image;
and the classification module is used for carrying out scene classification processing on the image based on the fusion characteristics to obtain the scene to which the image belongs.
In the foregoing solution, the global module is further configured to: extracting global convolution characteristics of the image; performing pooling processing on the global convolution characteristics to obtain global pooling characteristics of the image; and carrying out residual error processing of multiple levels on the global pooling characteristics, and carrying out pooling processing on a characteristic extraction result obtained by the residual error processing to obtain the global characteristics of the image.
In the foregoing solution, the global module is further configured to: performing feature extraction processing on the input of an n-th residual error network in N cascaded residual error networks; and transmitting the n-th feature extraction result output by the n-th residual error network to the (n+1)-th residual error network to continue the feature extraction processing; wherein N is an integer greater than or equal to 2, n is an integer whose value increases from 1, and the value of n satisfies 1 ≤ n ≤ N-1; when the value of n is 1, the input of the n-th residual error network is the global pooling feature of the image, and when 2 ≤ n ≤ N-1, the input of the n-th residual error network is the feature extraction result of the (n-1)-th residual error network; and when the value of n is N-1, maximum pooling processing is performed on the feature extraction result output by the (n+1)-th residual error network.
In the foregoing solution, the global module is further configured to: performing fusion processing on the output of the (n-1) th residual error network and the input of the (n-1) th residual error network to obtain a fusion processing result; and performing activation processing on the fusion processing result, and performing multi-size convolution processing on the activation processing result through the convolution layer of the nth residual error network.
In the foregoing aspect, the attention module is further configured to: extracting global convolution characteristics of a background in the image; performing pooling processing on the global convolution characteristics to obtain global pooling characteristics of the image; and carrying out residual error processing of multiple levels on the global pooling characteristics, and carrying out local region prediction processing on a characteristic extraction result obtained by the residual error processing to obtain at least one local region.
In the foregoing aspect, the attention module is further configured to: pooling the feature extraction result, and predicting attention intensity of the pooled result to obtain the attention intensity of each space coordinate in the pooled result; backtracking each spatial coordinate to obtain a candidate region corresponding to each spatial coordinate; and performing non-maximum suppression processing on the candidate regions based on the attention intensities of the candidate regions to obtain at least one local region.
In the foregoing aspect, the attention module is further configured to: when the number of candidate regions is greater than a region number threshold, performing the following: sequencing the attention intensities of the candidate regions, and determining the candidate region with the highest attention intensity as the local region according to a sequencing result; for each candidate region except the candidate region with the highest attention intensity in the ranking result, performing the following processing: and determining the intersection ratio between each candidate region and the candidate region with the highest attention intensity in the sorting result, and marking the candidate region with the intersection ratio larger than the intersection ratio threshold as a non-candidate region.
In the foregoing solution, the fusion module is further configured to: extracting the local convolution characteristics of each local area in the image; pooling the local convolution characteristics to obtain pooling characteristics of each local area in the image; and performing residual error processing on the pooled feature of each local area in multiple layers, and performing pooled processing on a feature extraction result obtained by the residual error processing to obtain the local feature of each local area.
In the foregoing solution, the fusion module is further configured to: performing end-to-end processing on at least one local feature and the global feature to obtain a fusion feature of a background in the image; the classification module is further configured to: performing probability mapping processing on the fusion features to obtain the joint probability of the image belonging to each candidate scene; and determining the candidate scene corresponding to the maximum joint probability as the scene to which the image belongs.
In the above scheme, the scene classification processing for the image is implemented by a scene recognition model, and the scene recognition model is obtained by performing auxiliary training through an image recognition model and an attention localization model; the device further comprises: a training module configured to: train the image recognition model individually based on image samples and an image classification loss function; perform fusion processing on the image classification loss function, the joint classification loss function, and the positioning loss function to obtain an overall loss function; and train the scene recognition model, the separately trained image recognition model, and the attention localization model as a whole based on the image samples and the overall loss function; wherein the scene recognition model, the image recognition model, and the attention localization model share a feature extraction network.
In the foregoing solution, the training module is further configured to: executing the following processing in each iterative training process of the image recognition model: extracting global features of the image samples through the feature extraction network, and mapping the global features into predicted global probabilities belonging to pre-labeled categories through a global full-link layer of the image recognition model; and substituting the pre-marked category corresponding to the image sample and the prediction global probability into the image classification loss function to determine the parameters of the image identification model when the image classification loss function obtains the minimum value.
In the foregoing solution, the training module is further configured to: determining, by the scene recognition model, a predictive joint probability that the image sample belongs to a pre-labeled category; determining, by the image recognition model, a predicted global probability that the image sample belongs to the pre-labeled class; predicting a plurality of sample local regions of the image sample by the attention localization model to determine a predicted localization probability that image content in each of the sample local regions belongs to the pre-labeled class; and substituting the prediction joint probability, the prediction positioning probability, the prediction global probability and the pre-marking category into the overall loss function to determine parameters of the scene recognition model, the image recognition model and the attention positioning model when the overall loss function obtains a minimum value.
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the image scene recognition method based on artificial intelligence provided by the embodiment of the application when the executable instructions stored in the memory are executed.
The embodiment of the application provides a computer-readable storage medium storing executable instructions which, when executed by a processor, cause the processor to implement the artificial intelligence-based image scene recognition method provided by the embodiment of the application.
The embodiment of the application has the following beneficial effects:
the attention processing is carried out on the image to obtain at least one local area of the background in the image and the local feature of each local area, which is equivalent to the fact that the obvious feature of the image background is mined through an attention mechanism, then the at least one local feature and the global feature are subjected to fusion processing, the global feature and the local feature of the image are fully utilized to carry out scene classification, and the scene recognition accuracy is effectively improved.
Drawings
FIG. 1 is a logic diagram of an image scene recognition method in the related art;
FIG. 2A is a schematic structural diagram of an artificial intelligence-based image scene recognition system provided by an embodiment of the present application;
fig. 2B is a schematic structural diagram of an image scene recognition system based on a blockchain network according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of an electronic device provided in an embodiment of the present application;
FIG. 4A is a schematic flowchart of a training phase of an artificial intelligence-based image scene recognition method according to an embodiment of the present disclosure;
FIG. 4B is a flowchart illustrating an artificial intelligence based image scene recognition method according to an embodiment of the present disclosure;
FIG. 4C is a flowchart illustrating step 202 of an artificial intelligence based image scene recognition method according to an embodiment of the present application;
FIGS. 5A-5B are schematic diagrams of an architecture of an artificial intelligence based image scene recognition method provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of a residual error network of an artificial intelligence-based image scene recognition method according to an embodiment of the present application;
fig. 7 is a schematic diagram of region backtracking of an artificial intelligence-based image scene recognition method provided in an embodiment of the present application;
fig. 8 is a process flow diagram of an artificial intelligence-based image scene recognition method provided in an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application will be described in further detail below with reference to the attached drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before further detailed description of the embodiments of the present application, terms and expressions referred to in the embodiments of the present application will be described, and the terms and expressions referred to in the embodiments of the present application will be used for the following explanation.
1) Image recognition: a technology for classifying an image at a specific level. Generally, image recognition considers only the class of an object, regardless of the specific instance of the object, and gives the class to which the object belongs; for example, images are classified as person, dog, cat, bird, and the like. A model trained on the large-scale general object recognition source data set ImageNet can recognize which of 1000 classes a certain object belongs to.
2) Multi-label recognition task of images: identifying whether an image has multiple attribute tags. For example, when an image may carry multiple attribute tags, the multi-label recognition task judges which attribute tags the image has.
3) Noise recognition: training of the image recognition task based on noise samples, including samples with incorrect class labels and samples with inaccurate class labels, for example, where the image does not correspond exactly to the class label, the concepts of two class labels partially overlap, or the image has the properties of two class labels but carries only one class label.
4) ImageNet: a large-scale general object recognition source data set.
5) ImageNet pre-training model: a deep learning network model trained based on the large-scale general object recognition source data set ImageNet; the trained deep learning network model is the ImageNet pre-training model.
The primary task of video understanding in the related art is to identify the scenes in which the scenarios of a video occur, and scene identification requires high-level semantic recognition, so it is more difficult than general object recognition. Since scene features often lie in the background environment of an image, while an image recognition task or an image recognition pre-training model extracts features on specific objects or specific parts, scene recognition easily over-fits the foreground of the image; that is, the scene recognition model memorizes the foreground of the image (such as the clothing of foreground characters) but not the key objects in the background surrounding the foreground. The key objects in the background may be distributed in various ways, for example, concentrated in one place or scattered discretely. For instance, a classroom study room contains study tables and chairs, while a library study room contains study tables and chairs together with rows of bookshelves: the background of the classroom study room is tables and chairs distributed in a concentrated manner, whereas the background of the library study room is bookshelves and study tables and chairs distributed discretely. For the various distributions of key objects in the background, especially the case where key objects are distributed at multiple positions, the related art performs scene recognition through feature learning based on multi-scale local regions.
Referring to fig. 1, fig. 1 is a logic schematic diagram of an image scene identification method in the related art. Fig. 1 shows the extraction process of a local region (salient region) at a certain scale, where a local region is essentially a salient region of the background of an image, and not every region of the image can be used for scene identification. For an image X, the potential object density of each position in the scene is calculated according to the distribution of potential object frames; the object density in a window region of the image is calculated by using a sliding window, and a sliding window response is computed by combining the image X and the potential object density, so that the region with the highest potential object density is extracted as a local region. Feature learning is then performed based on the multi-scale local regions, and scene identification is performed based on the feature learning result.
When scene recognition is performed based on multi-scale local-region feature learning in the related art, the applicant found the following technical problems: 1. the scene recognition model in the related art is a two-stage model, and a target detection and positioning task needs to be completed in advance in both the training process and the inference process before the scene recognition task can be completed; 2. when training the model corresponding to the target detection and positioning task in the related art, all objects that may appear in a scene need to be labeled, which is time-consuming and labor-intensive; 3. in the related art, detection targets do not exist in all scenes; for example, seasides, forests and the like have no common detection target.
In view of the foregoing technical problems, embodiments of the present application provide an image scene recognition method and apparatus based on artificial intelligence, an electronic device, and a computer-readable storage medium, which extract local features from local regions with various distributions and combine them with global features for scene recognition, thereby effectively improving scene recognition accuracy.
The image scene recognition method provided by the embodiment of the application can be implemented by various electronic devices, for example, can be implemented by a terminal or a server alone, or can be implemented by the terminal and the server in a cooperation manner.
An exemplary application of the electronic device implemented as a server in an image scene recognition system is described below, referring to fig. 2A, fig. 2A is a schematic structural diagram of an artificial intelligence-based image scene recognition system provided in an embodiment of the present application, a terminal 400 is connected to the server 200 through a network 300, and the network 300 may be a wide area network or a local area network, or a combination of the two.
In some embodiments, the functions of the artificial intelligence based image scene recognition system are implemented based on the server 200. During use of the terminal 400 by a user, the terminal 400 collects image samples and sends them to the server 200, so that the server 200 trains a scene recognition model based on a plurality of loss functions; the trained scene recognition model is integrated in the server 200. In response to the terminal 400 receiving an image shot by the user, the terminal 400 sends the image to the server 200, and the server 200 determines the scene classification result of the image through the scene recognition model and sends the scene classification result to the terminal 400, so that the terminal 400 directly presents the scene classification result.
In some embodiments, when the image scene recognition system is applied to a video recommendation scene, the terminal 400 receives a video to be uploaded, the terminal 400 sends the video to the server 200, the server 200 determines a scene classification result of a video frame in the video through a scene recognition model to serve as a scene classification result of the video, and sends the scene classification result to the terminal 400, so that the terminal 400 directly presents the scene classification result of the corresponding video in a video recommendation home page, and the terminal uploading the video and the terminal presenting the scene classification result may be the same or different.
In some embodiments, when the image scene recognition system is applied to an image capturing scene, the terminal 400 receives an image captured by a user, the terminal 400 sends the captured image to the server 200, and the server 200 determines a scene classification result of the image through a scene recognition model and sends the scene classification result to the terminal 400, so that the terminal 400 directly presents the scene classification result and stores the captured image according to the corresponding scene classification result.
In other embodiments, when the image scene recognition method provided by the embodiment of the present application is implemented by a terminal alone, in the various application scenarios described above, the terminal may run a scene recognition model to determine a scene classification result of an image or a video, and directly present the scene classification result.
In some embodiments, the server 200 may be an independent physical server, may also be a server cluster or a distributed system formed by a plurality of physical servers, and may also be a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform. The terminal 400 may be a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, a smart television, a smart car device, and the like, and the terminal 400 may be provided with a client, for example, but not limited to, a video client, a browser client, an information flow client, an image capturing client, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiment of the present invention.
In some embodiments, referring to fig. 2B, fig. 2B is a schematic structural diagram of an image scene recognition system based on a blockchain network provided in an embodiment of the present application; an exemplary application based on the blockchain network is described below. The system includes the blockchain network 600 (which exemplarily shows the node 610-1 and the node 610-2 included in the blockchain network 600), the server 200, and the terminal 400, which are respectively described below.
The server 200 (mapped as node 610-2) and the terminal 400 (mapped as node 610-1) may each join the blockchain network 600 as a node therein, and the mapping of the terminal 400 as node 610-1 of the blockchain network 600 is exemplarily shown in fig. 2B, where each node (e.g., node 610-1, node 610-2) has a consensus function and an accounting (i.e., maintaining a state database, such as a key-value database) function.
The image of the terminal 400 and the scene classification result corresponding to the image are recorded in the state database of each node (e.g., the node 610-1), so that the terminal 400 can query the image recorded in the state database and the scene classification result corresponding to the image.
In some embodiments, in response to receiving the image, a plurality of servers 200 (each server is mapped to a node in the blockchain network) determine a scene classification result of the image; when, for a certain candidate scene classification result, the number of nodes reaching consensus exceeds a node number threshold, the consensus is determined to pass, and the server 200 (mapped to the node 610-2) sends the candidate scene classification result that passed the consensus to the terminal 400 (mapped to the node 610-1), presents it on the human-computer interaction interface of the terminal 400, and stores the image and the corresponding scene classification result on the chain. Because the scene classification result is obtained after consensus among a plurality of servers, the reliability of the scene classification result of the image can be effectively improved, and because of the tamper-resistant characteristic of the blockchain network, the image stored on the chain and the corresponding scene classification result cannot be maliciously tampered with.
Next, a structure of an electronic device for implementing an artificial intelligence based image scene recognition method provided by an embodiment of the present application is explained, and as described above, the electronic device provided by an embodiment of the present application may be the server 200 or the terminal 400 in fig. 2A. Referring to fig. 3, fig. 3 is a schematic structural diagram of an electronic device provided in the embodiment of the present application, and the electronic device is taken as a server 200 for example. The server 200 shown in fig. 3 includes: at least one processor 210, memory 250, at least one network interface 220. The various components in server 200 are coupled together by a bus system 240. It is understood that the bus system 240 is used to enable communications among the components. The bus system 240 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 240 in fig. 3.
The Processor 210 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.
The memory 250 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 250 optionally includes one or more storage devices physically located remotely from processor 210.
The memory 250 includes volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 250 described in embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 250 is capable of storing data, examples of which include programs, modules, and data structures, or a subset or superset thereof, to support various operations, as exemplified below.
An operating system 251 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks; a network communication module 252 for communicating to other computing devices via one or more (wired or wireless) network interfaces 220, exemplary network interfaces 220 including: bluetooth, wireless compatibility authentication (WiFi), and Universal Serial Bus (USB), among others.
In some embodiments, the artificial intelligence based image scene recognition device provided by the embodiments of the present application can be implemented in software, and fig. 3 shows an artificial intelligence based image scene recognition device 255 stored in a memory 250, which can be software in the form of programs and plug-ins, and the like, and includes the following software modules: a global module 2551, an attention module 2552, a fusion module 2553, a classification module 2554 and a training module 2555, which are logical and thus can be arbitrarily combined or further split according to the implemented functions, which will be explained below.
The artificial intelligence based image scene recognition method provided by the embodiment of the present application will be described in conjunction with an exemplary application and implementation of the server 200 provided by the embodiment of the present application.
Referring to fig. 5A, fig. 5A is a schematic diagram of an architecture of the artificial intelligence based image scene recognition method provided in an embodiment of the present application. The two convolutional neural networks in fig. 5A have identical network structures. After the image a is input to the convolutional neural network, the global features of the image a are obtained, and the global features are mapped through a global fully-connected layer to the probabilities (predicted global probabilities) that the image a belongs to each candidate category. The global loss is determined based on the predicted global probability that the image a belongs to the pre-labeled category (the pre-labeled category is one of the candidate categories) and the pre-labeled category of the image a. The attention intensity of each point in the feature matrix of the global features is determined through an attention network, and the position of at least one local region is obtained based on the attention intensity of each point, so that the content of each local region is obtained by combining the image a. The content of each local region is input to the convolutional neural network to obtain the local feature corresponding to each local region, and each local feature is mapped through a localization prediction fully-connected layer to the probability (predicted localization probability) of belonging to each candidate category. The localization loss of each local region is determined based on the probability (predicted localization probability) that the content of each local region belongs to the pre-labeled category and the pre-labeled category of the image a. The local feature of each local region and the global feature are input to a fusion network for fusion processing, the probability (predicted joint probability) that the image a belongs to each candidate category is determined based on the fusion processing result, and the joint loss is determined based on the probability (predicted joint probability) that the image a belongs to the pre-labeled category and the pre-labeled category of the image a. The parameters in the architecture of fig. 5A are updated based on the aggregate of the global loss, the joint loss, and the localization loss of each local region.
Referring to fig. 4A, fig. 4A is a schematic flowchart of the training phase of the artificial intelligence based image scene recognition method according to an embodiment of the present application, which will be described with reference to steps 101 to 103 shown in fig. 4A.
In step 101, an image recognition model is trained solely based on image samples and an image classification loss function.
As an example, the image samples are derived from an open source sample set, such as ImageNet, and the image classification penalty function can be a cross entropy penalty function. The scene classification processing for the images is realized through a scene recognition model, and the scene recognition model is obtained through auxiliary training of the image recognition model and an attention positioning model.
In some embodiments, the training of the image recognition model based on the image samples and the image classification loss function in step 101 may be implemented by the following technical solutions: the following processing is executed in each iterative training process of the image recognition model: extracting global features of the image samples through a feature extraction network, and mapping the global features into predicted global probabilities belonging to pre-labeled categories through a global full-link layer of an image recognition model; and substituting the pre-marked category and the prediction global probability of the corresponding image sample into the image classification loss function to determine the parameters of the image identification model when the image classification loss function obtains the minimum value.
As an example, referring to fig. 5B, fig. 5B is a schematic structural diagram of an artificial intelligence based image scene recognition method provided in this embodiment of the present application. The image recognition model includes a feature extraction network and a global fully-connected layer; the structure of the feature extraction network refers to Table 1 and the pooling-related structure in Table 2, and the global fully-connected layer refers to the fully-connected-related structure in Table 2. The predicted global probability is output through the global fully-connected layer, and the global loss (loss_cr) is determined based on the predicted global probability and the pre-labeled category, where M is the number of pre-labeled categories in the training sample set, M is an integer greater than or equal to 2, and the pre-labeled category is the pre-labeled scene category of the image.
TABLE 1 convolution layer structure table in ResNet-101
TABLE 2 ResNet-101 pooling layer and Global fully-connected layer Structure Table
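As a minimal, non-authoritative sketch of the separately trained global branch of step 101 (a torchvision ResNet-101 is used here only as a stand-in for the feature extraction network of Tables 1 and 2; the feature dimension, number of categories, and optimizer settings are assumptions for illustration):

```python
# Hypothetical sketch of step 101: feature extraction network + global fully-connected layer,
# trained with an image classification (cross-entropy) loss. Names and sizes are illustrative only.
import torch
import torch.nn as nn
import torchvision

class GlobalBranch(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        backbone = torchvision.models.resnet101()                       # stand-in backbone
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # conv + residual stages + pooling
        self.global_fc = nn.Linear(2048, num_classes)                   # global fully-connected layer

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.features(images).flatten(1)                        # global features of the image
        return self.global_fc(feats)                                    # logits for each candidate category

model = GlobalBranch(num_classes=10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
images = torch.randn(4, 3, 224, 224)                                    # dummy image samples
labels = torch.randint(0, 10, (4,))                                     # pre-labeled categories
loss_cr = nn.functional.cross_entropy(model(images), labels)            # global loss loss_cr
loss_cr.backward()
optimizer.step()
```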
In step 102, the image classification loss function, the joint classification loss function, and the positioning loss function are fused to obtain an overall loss function.
In some embodiments, the fusion processing is obtained by performing weighted summation on the image classification loss function, the joint classification loss function, and the positioning loss function based on the weight of each loss function. Of course, the fusion processing may also be combined with other operators; for example, the weighted summation result is used as the argument of a logarithm operator, where the base of the logarithm may be a preset value; alternatively, the weighted summation result is used as the exponent of an exponential operator, where the base of the exponential may be a preset value.
As an example, the weight of each loss function may be a preset value; or, the weights may be dynamically assigned and updated at different stages of the training process of the image recognition model according to the degree of emphasis on the loss of different classes; alternatively, the corresponding weights may be automatically assigned according to the degrees of emphasis on different losses in different application scenarios, for example, a data table of weights of different types of loss functions in different application scenarios may be preset, and the corresponding weights may be assigned by looking up the data table according to a specific application scenario of the image recognition model. Therefore, the method can meet the requirements of personalized application of different scenes, and improves the applicability of the scene recognition model.
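A minimal sketch of one such fusion, assuming a plain weighted summation with optional logarithm or exponential wrappers and preset weights (the weight values and the wrapper choice are assumptions, not the specific configuration of this application):

```python
# Illustrative fusion of the three loss terms into an overall loss; weights are assumed hyperparameters.
import torch

def overall_loss(loss_cr: torch.Tensor, loss_all: torch.Tensor, loss_locate: torch.Tensor,
                 w_cr: float = 1.0, w_all: float = 1.0, w_locate: float = 1.0,
                 wrap: str = "none") -> torch.Tensor:
    weighted = w_cr * loss_cr + w_all * loss_all + w_locate * loss_locate   # weighted summation
    if wrap == "log":          # weighted sum used as the argument of a logarithm operator
        return torch.log(weighted)
    if wrap == "exp":          # weighted sum used as the exponent of an exponential operator
        return torch.exp(weighted)
    return weighted
```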
In step 103, the scene recognition model, the individually trained image recognition model, and the attention localization model are trained as a whole based on the image samples and the overall loss function.
In some embodiments, in step 103, based on the image sample and the overall loss function, the scene recognition model, the separately trained image recognition model, and the attention localization model are trained as a whole, which may be implemented by the following technical solutions: determining a prediction joint probability that the image sample belongs to the pre-marked category through a scene recognition model; determining a predicted global probability that the image sample belongs to the pre-labeled category through an image recognition model; predicting a plurality of sample local areas of the image sample by an attention localization model to determine a prediction localization probability that image content in each sample local area belongs to a pre-labeled category; and substituting the prediction joint probability, the prediction positioning probability, the prediction global probability and the pre-marking category into the overall loss function to determine parameters of the scene recognition model, the image recognition model and the attention positioning model when the overall loss function obtains the minimum value.
As an example, referring to fig. 5B, the attention localization model includes an attention network, the structure shown in Table 1 of the feature extraction network, and a localization prediction fully-connected layer; the structure of the attention network refers to Table 3, and the structure of the localization prediction fully-connected layer refers to Table 4, where K is the number of local regions, K is an integer greater than or equal to 1, M is the number of pre-labeled categories, and M is an integer greater than or equal to 2. The feature extraction result of the image sample or the image is determined through the structure shown in Table 1 of the feature extraction network of the attention localization model. The feature extraction result is pooled through the down-sampling layer of the attention network of the attention localization model to obtain a pooling processing result (feature matrix), where the size of the feature matrix output by the down-sampling layer is bs × 128 × 19 × 31. The attention intensity of each point in the feature matrix is predicted through the attention intensity prediction layer of the attention network of the attention localization model; the attention intensity prediction layer outputs the predicted attention intensity of each position of the feature matrix, forming a matrix with a size of bs × 6 × 9 × 15. Through the region extraction layer of the attention network, hard non-maximum suppression processing is performed on the candidate regions traced back from each point to obtain the candidate regions corresponding to the K points with the largest attention intensity as sample local regions or local regions: the matrix formed by the attention intensities is first transformed to output an attention intensity list (bs × 810), the coordinate position corresponding to each point in the matrix formed by the attention intensities is traced back, and the top K coordinate positions with the largest attention intensity are obtained as output (bs × K). The local feature of each sample local region is extracted through the feature extraction network of the attention localization model; the localization prediction fully-connected layer of the attention localization model predicts, based on the local features, the probability (predicted localization probability) that each sample local region or local region belongs to the pre-labeled category, and the localization loss (loss_locate) is determined based on the predicted localization probability and the pre-labeled category.
TABLE 3 attention network
TABLE 4 Localization prediction fully-connected layer
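The attention network described above can be sketched roughly as follows; this is not the exact structure of Tables 3 and 4, and the input size, channel counts, kernel sizes, and strides are assumptions chosen only so that the bs × 128 × 19 × 31 and bs × 6 × 9 × 15 shapes from the example are reproduced:

```python
# Rough sketch of an attention network with a down-sampling layer, an attention intensity prediction
# layer, and top-K selection of the strongest positions. All layer hyperparameters are assumptions.
import torch
import torch.nn as nn

class AttentionNetwork(nn.Module):
    def __init__(self, in_channels: int = 1024, num_anchors: int = 6):
        super().__init__()
        # Down-sampling layer: pooling-like reduction of the backbone feature extraction result.
        self.downsample = nn.Sequential(
            nn.Conv2d(in_channels, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True))
        # Attention intensity prediction layer: one intensity per anchor at each spatial coordinate.
        self.intensity = nn.Conv2d(128, num_anchors, kernel_size=3, stride=2)

    def forward(self, feature_map: torch.Tensor, k: int = 4):
        x = self.downsample(feature_map)     # e.g. bs x 128 x 19 x 31 feature matrix
        a = self.intensity(x)                # e.g. bs x 6 x 9 x 15 attention intensities
        flat = a.flatten(1)                  # attention intensity list, e.g. bs x 810
        return flat.topk(k, dim=1)           # K strongest positions; each index is traced back
                                             # to a candidate region of the image

scores, idx = AttentionNetwork()(torch.randn(2, 1024, 38, 62), k=4)
```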
In some embodiments, referring to fig. 5B, the scene recognition model includes an attention network, the above-mentioned feature extraction network, a fusion network, and a joint fully-connected layer; the fusion network may include a concatenation operation, the structure of the joint fully-connected layer is shown in Table 5, K is the number of local regions, and M is the number of pre-labeled categories. The global features of the image sample or the image are extracted through the feature extraction network of the scene recognition model, a plurality of sample local regions of the image sample or the image are determined through the attention network of the scene recognition model, and the local feature of each sample local region is determined through the feature extraction network of the scene recognition model. The global feature and at least one local feature are fused through the fusion network, for example, by processing the plurality of features end to end; the joint fully-connected layer determines the predicted joint probability of the image sample, namely the probability that the image sample belongs to the pre-labeled category, based on the fusion processing result, and the joint loss (loss_all) is determined based on the predicted joint probability and the pre-labeled category.
TABLE 5 Joint fully-connected layer
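As an illustrative sketch of the fusion network and the joint fully-connected layer described above (the feature dimension, the number of local regions K, and the number of categories M are assumptions; concatenation is used as the end-to-end fusion):

```python
# Illustrative fusion of the global feature with K local features, followed by the joint
# fully-connected layer and probability mapping. Dimensions below are assumptions.
import torch
import torch.nn as nn

def classify_scene(global_feat: torch.Tensor, local_feats: list, joint_fc: nn.Linear) -> torch.Tensor:
    # End-to-end (concatenation) processing of the global feature and the K local features.
    fusion_feat = torch.cat([global_feat] + local_feats, dim=1)   # fusion feature of the background
    joint_prob = torch.softmax(joint_fc(fusion_feat), dim=1)      # predicted joint probability per category
    return joint_prob.argmax(dim=1)                               # candidate scene with the maximum joint probability

K, M, D = 4, 20, 2048                          # assumed number of local regions, scene categories, feature size
joint_fc = nn.Linear((K + 1) * D, M)           # joint fully-connected layer over (K + 1) concatenated features
global_feat = torch.randn(2, D)
local_feats = [torch.randn(2, D) for _ in range(K)]
scene = classify_scene(global_feat, local_feats, joint_fc)
```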
Referring to fig. 4B, fig. 4B is a schematic flowchart of an image scene recognition method based on artificial intelligence according to an embodiment of the present application, which will be described with reference to steps 201 to 203 shown in fig. 4B.
In step 201, global features of an image are acquired.
In some embodiments, the step 201 of obtaining the global features of the image may be implemented by the following technical solutions: extracting global convolution characteristics of the image; performing pooling processing on the global convolution characteristics to obtain global pooling characteristics of the image; and performing residual processing on the global pooling characteristics in multiple levels, and performing pooling processing on the characteristic extraction result obtained by the residual processing to obtain the global characteristics of the image.
As an example, the global features of the image are obtained through a feature extraction network, the feature extraction network comprises a convolution network, a pooling network and N (N is an integer greater than or equal to 2) cascaded residual error networks, and the global convolution features of the image are extracted through the convolution network; performing pooling treatment (maximum pooling treatment or average pooling treatment) on the global convolution characteristics through a pooling network to obtain global pooling characteristics of the image; and performing multi-level residual processing on the global pooling features through N cascaded residual networks, and performing pooling processing (maximum pooling processing or average pooling processing) on feature extraction results obtained by the residual processing to obtain the global features of the image.
In some embodiments, the performing residual processing on the global pooled features at multiple levels and performing pooling processing on the feature extraction result obtained by the residual processing may be implemented by the following technical solutions: performing feature extraction processing on the input of the n-th residual error network in the N cascaded residual error networks; transmitting the n-th feature extraction result output by the n-th residual error network to the (n+1)-th residual error network to continue the feature extraction processing; wherein n is an integer whose value increases from 1, and the value of n satisfies 1 ≤ n ≤ N-1; when the value of n is 1, the input of the n-th residual error network is the global pooling feature of the image, and when 2 ≤ n ≤ N-1, the input of the n-th residual error network is the feature extraction result of the (n-1)-th residual error network; and when the value of n is N-1, performing maximum pooling processing on the feature extraction result output by the (n+1)-th residual error network.
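A compact sketch of this cascade, under the assumption that each residual error network can be treated as a module taking the previous result as its input (placeholder blocks stand in for the real residual networks):

```python
# Sketch of N cascaded residual networks: the n-th output is passed to the (n+1)-th network,
# and the final feature extraction result is max-pooled. The blocks here are placeholders.
import torch
import torch.nn as nn

def cascaded_residual_features(global_pool_feat: torch.Tensor,
                               residual_nets: nn.ModuleList) -> torch.Tensor:
    x = global_pool_feat                     # input of the 1st residual network: global pooling feature
    for residual_net in residual_nets:       # each output becomes the next network's input
        x = residual_net(x)
    return nn.functional.max_pool2d(x, kernel_size=x.shape[-2:])   # pool the final extraction result

residual_nets = nn.ModuleList([nn.Identity() for _ in range(4)])   # placeholders for real residual blocks
feat = cascaded_residual_features(torch.randn(1, 64, 56, 56), residual_nets)   # -> 1 x 64 x 1 x 1
```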
In some embodiments, the above feature extraction processing on the input of the n-th residual error network by the n-th residual error network among the N cascaded residual error networks may be implemented by the following technical solution: performing fusion processing on the output of the (n-1)-th residual error network and the input of the (n-1)-th residual error network to obtain a fusion processing result; and performing activation processing on the fusion processing result, and performing multi-size convolution processing on the activation processing result through the convolution layers of the n-th residual error network.
As an example, referring to fig. 6, fig. 6 is a schematic diagram of a residual error network of the artificial intelligence-based image scene recognition method provided in an embodiment of the present application. The residual error network is composed of three convolutional layers, a fusion operator, and an activation function. The output of the (n-1)-th residual error network and the input of the (n-1)-th residual error network are fused, for example, added through an addition operator, to obtain a fusion processing result; the fusion processing result is activated, for example, through a ReLU activation function; and the convolutional layers perform multi-size convolution processing on the activation processing result, for example, three layers of convolution processing. As the network depth increases, training becomes increasingly difficult, mainly because multi-layer back propagation of the error signal very easily causes gradient vanishing or gradient explosion during network training based on stochastic gradient descent. The residual error network shown in fig. 6 alleviates the training difficulty caused by network depth, so the network performance (the accuracy and precision of completing tasks) is high.
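A minimal sketch of one residual error network following the order described for fig. 6 (fusion by addition, then activation, then three convolutional layers); the channel sizes are assumptions, and note that the canonical ResNet bottleneck instead applies the addition after the convolutions:

```python
# Hypothetical residual error network per fig. 6: add the previous output and the previous input,
# apply ReLU, then perform multi-size (1x1, 3x3, 1x1) convolutions. Channel sizes are assumptions.
import torch
import torch.nn as nn

class ResidualNetwork(nn.Module):
    def __init__(self, channels: int, bottleneck: int):
        super().__init__()
        self.relu = nn.ReLU(inplace=True)
        self.conv1 = nn.Conv2d(channels, bottleneck, kernel_size=1)
        self.conv2 = nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(bottleneck, channels, kernel_size=1)

    def forward(self, prev_output: torch.Tensor, prev_input: torch.Tensor) -> torch.Tensor:
        fused = prev_output + prev_input                        # fusion of the (n-1)-th output and input
        activated = self.relu(fused)                            # activation of the fusion processing result
        return self.conv3(self.conv2(self.conv1(activated)))   # multi-size convolution processing

x = torch.randn(1, 256, 56, 56)
y = ResidualNetwork(256, 64)(x, x)                              # output keeps the 1 x 256 x 56 x 56 shape
```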
In step 202, attention processing is performed on the image to obtain at least one local area of the background in the image.
In some embodiments, referring to fig. 4C, fig. 4C is a flowchart illustrating the step 202 of the image scene recognition method based on artificial intelligence provided in the embodiment of the present application, and the step 202 performs attention processing on the image to obtain at least one local area of the background in the image, which may be implemented by the step 2021 and the step 2023.
In step 2021, global convolution features of the background in the image are extracted.
In step 2022, pooling is performed on the global convolution features to obtain global pooling features of the image.
In step 2023, residual processing of multiple levels is performed on the global pooled features, and local region prediction processing is performed on feature extraction results obtained by the residual processing, so as to obtain at least one local region.
As an example, the feature extraction result obtained by residual processing of the image is determined through the structure shown in Table 1 of the feature extraction network of the attention localization model: the global convolution features of the background in the image are extracted by the convolutional layer Conv1 shown in Table 1 of the feature extraction network, the global convolution features are pooled by the max pooling layer in the convolutional layer Conv2_x shown in Table 1 to obtain the global pooling features of the image, and the global pooling features are subjected to residual processing of multiple levels through the multiple residual modules shown in Table 1 of the feature extraction network to obtain the feature extraction result.
In some embodiments, the local region prediction processing is performed on the feature extraction result obtained by the residual error processing to obtain at least one local region, and the method can be implemented by the following technical solutions: performing downsampling processing on the feature extraction result, and performing attention intensity prediction processing on the downsampling processing result to obtain the attention intensity of each space coordinate in the downsampling processing result; backtracking each space coordinate to obtain a candidate area corresponding to each space coordinate; and performing non-maximum suppression processing on the plurality of candidate regions based on the attention intensities of the plurality of candidate regions to obtain at least one local region.
As an example, the feature extraction result is down-sampled by a down-sampling layer of an attention network of the attention localization model to obtain a down-sampling processing result (feature matrix), the down-sampling processing is not limited to pooling, the attention intensity of each point in the feature matrix is predicted by an attention intensity prediction layer of the attention network of the attention localization model, and the hard non-maximum suppression processing is performed on a candidate area backtraced by each point by an area extraction layer of the attention network to obtain at least one local area.
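The backtracking from a spatial coordinate of the attention intensity matrix to a candidate region can be illustrated with a simple receptive-field mapping; the stride and window size below are assumptions, not the actual mapping used by the attention localization model:

```python
# Illustrative backtracking: map a grid position (row, col) of the attention intensity matrix back to
# a candidate region (x1, y1, x2, y2) in image coordinates. Stride and window size are assumptions.
def trace_back(row: int, col: int, stride: int = 32, window: int = 96):
    cx = col * stride + stride // 2            # centre of the receptive field on the x axis
    cy = row * stride + stride // 2            # centre of the receptive field on the y axis
    x1, y1 = max(cx - window // 2, 0), max(cy - window // 2, 0)
    return (x1, y1, x1 + window, y1 + window)

box = trace_back(row=3, col=7)                 # one grid point traced back to a candidate region
```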
In some embodiments, the performing, based on the attention intensities of the plurality of candidate regions, non-maximum suppression processing on the plurality of candidate regions to obtain at least one local region may be implemented by the following technical solutions: when the number of candidate regions is greater than the region number threshold, performing the following: sorting the attention intensities of the candidate regions, and determining the candidate region with the highest attention intensity as a local region according to the sorting result; and for each candidate region except the candidate region with the highest attention intensity in the sorting result, performing the following processing: determining the intersection ratio between each candidate region and the candidate region with the highest attention intensity in the sorting result, and marking the candidate regions whose intersection ratio is greater than the intersection ratio threshold as non-candidate regions.
As an example, a point in the matrix output by the attention intensity prediction layer can be traced back to a region of the image sample through the process shown in fig. 7. Fig. 7 is a schematic diagram of region backtracking in the artificial intelligence-based image scene recognition method provided in an embodiment of the present application and gives a general example of region backtracking. After an image passes through a plurality of convolutional layers (a first convolutional layer and a second convolutional layer), a plurality of pooling layers (a first pooling layer and a second pooling layer), and a fully-connected layer, each region in the image can be classified through a maximum likelihood function, that is, converted into the attention intensity of the corresponding region (an attention intensity matrix) or a probability obtained in any other task. The backtracking process is the exact reverse: each point in the attention intensity matrix is traced back to the corresponding region in the image to obtain a plurality of candidate regions. After a transformation such as a reshaping (reshape) operation, the 6 × 9 × 15 attention intensity matrix output by the attention intensity prediction layer becomes the predicted attention intensities of 810 points, and the 810 points are traced back to the coordinates (x1, y1, x2, y2) of candidate regions of the image. The 810 candidate regions are subjected to hard non-maximum suppression processing by the region extraction layer of the attention network to obtain K candidate regions as local regions. In the hard non-maximum suppression processing, the candidate regions are sorted from large to small according to their attention intensities, the candidate region with the largest attention intensity is retained as a local region, and the other candidate regions whose intersection ratio with it is greater than the intersection ratio threshold are deleted, that is, marked as non-candidate regions. For example, suppose there are 4 candidate regions: (candidate region 1, 0.8), (candidate region 2, 0.9), (candidate region 3, 0.7), (candidate region 4, 0.5). Sorted from large to small by attention intensity, candidate region 2 > candidate region 1 > candidate region 3 > candidate region 4. Candidate region 2, which has the greatest attention intensity, is retained, and the intersection ratios between the remaining three candidate regions and candidate region 2 are calculated; any candidate region whose intersection ratio exceeds the threshold is deleted. Assuming the intersection ratio threshold is 0.5: intersection ratio (candidate region 1, candidate region 2) = 0.1 < 0.5, so candidate region 1 is retained; intersection ratio (candidate region 3, candidate region 2) = 0.7 > 0.5, so candidate region 3 is deleted; intersection ratio (candidate region 4, candidate region 2) = 0.2 < 0.5, so candidate region 4 is retained. The sorting and intersection ratio calculation are then repeated for candidate region 1 and candidate region 4 to obtain the next local region, with part of the candidate regions deleted each round (the candidate regions with an intersection ratio greater than the intersection ratio threshold are marked as non-candidate regions).
As an example, the number of local regions is controlled by a region number threshold or by the number of iterations. When the region number threshold is zero, the above example is repeated until all candidate regions are either marked as local regions or deleted; when the region number threshold is not zero, the above example is repeated until the number of remaining candidate regions equals the region number threshold. Alternatively, when the specified number of local regions is K, the above example is repeated K times, each iteration taking the candidate region with the highest current attention intensity as a local region.
In step 203, local features of each local region are obtained, and at least one local feature and the global feature are subjected to fusion processing to obtain a fusion feature of a background in the image.
In some embodiments, acquiring the local feature of each local region in step 203 may be implemented by the following technical solution: extracting local convolution features of each local area in the image; pooling the local convolution features to obtain the pooling feature of each local area in the image; and performing residual processing of multiple levels on the pooling feature of each local area, and performing max pooling processing on the feature extraction result obtained by the residual processing to obtain the local feature of each local area.
As an example, the local convolution feature of each local region of the image is extracted by the convolution network of the feature extraction network; pooling processing (max pooling or average pooling) is performed on the local convolution features through a pooling network to obtain the pooling feature of each local area of the image; and multi-level residual processing is performed on the pooling feature of each local area through N cascaded residual networks, and pooling processing (max pooling or average pooling) is performed on the feature extraction result obtained by the residual processing to obtain the local feature of each local area of the image.
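The following sketch illustrates one possible way to obtain the local feature of each local region: crop the sub-image of the region, resize it, run it through the same backbone sketched earlier, and apply max pooling. The resize target, the integer box coordinates and the reuse of the `Backbone` class are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def local_features(image, regions, backbone, out_size=(224, 224)):
    """Crop each local region, extract residual features, and max-pool them into a 2048-d local feature."""
    feats = []
    for (x1, y1, x2, y2) in regions:                             # integer pixel coordinates assumed
        crop = image[:, :, y1:y2, x1:x2]                         # sub-image of one local region
        crop = F.interpolate(crop, size=out_size, mode='bilinear', align_corners=False)
        fmap = backbone(crop)                                    # multi-level residual processing
        feats.append(F.adaptive_max_pool2d(fmap, 1).flatten(1))  # max pooling -> 1 x 2048 local feature
    return torch.cat(feats, dim=0)                               # K x 2048 local features
```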
As an example, the input of the nth residual network in the N cascaded residual networks is subjected to feature extraction processing through the nth residual network, and the nth feature extraction result output by the nth residual network is transmitted to the (n+1)th residual network to continue the feature extraction processing; wherein N is an integer greater than or equal to 2, n is an integer whose value increases from 1, and the value range of n is 1 ≤ n ≤ N-1. When the value of n is 1, the input of the nth residual network is the pooling feature obtained in the previous step; when n is greater than or equal to 2 and less than or equal to N-1, the input of the nth residual network is the feature extraction result of the (n-1)th residual network; and when the value of n is N-1, maximum pooling is performed on the feature extraction result output by the (n+1)th residual network.
As an example, the output of the (n-1)th residual network and the input of the (n-1)th residual network are subjected to fusion processing to obtain a fusion processing result; activation processing is then performed on the fusion processing result, and multi-size convolution processing is performed on the activation processing result through the convolution layer of the nth residual network.
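A hedged sketch of one residual module, consistent with this description and with the later description of fig. 6 (a roughly 256-dimensional input, several convolutions, a skip connection, and a relu activation), is given below; the bottleneck kernel sizes and channel counts are stand-in assumptions rather than the patent's exact configuration.

```python
import torch
import torch.nn as nn

class ResidualModule(nn.Module):
    """Sketch of a residual module: stacked convolutions plus a skip connection and relu activation."""
    def __init__(self, channels=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels // 4, kernel_size=1),                  # 1x1 convolution
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels // 4, kernel_size=3, padding=1),  # 3x3 convolution
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, kernel_size=1),                  # 1x1 convolution
        )

    def forward(self, x):
        # fuse the convolution result with the module input (skip connection), then apply the relu activation
        return torch.relu(self.conv(x) + x)

out = ResidualModule()(torch.randn(1, 256, 9, 15))   # output keeps the 256-channel input shape
```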
In step 204, the image is subjected to scene classification processing based on the fusion features, so as to obtain a scene to which the image belongs.
In some embodiments, the fusion processing performed on the at least one local feature and the global feature in step 203 to obtain the fusion feature of the background in the image may be implemented by the following technical solution: concatenating the at least one local feature and the global feature end to end to obtain the fusion feature of the background in the image. In step 204, the scene classification processing performed on the image based on the fusion feature to obtain the scene to which the image belongs can be implemented by the following technical scheme: performing probability mapping processing on the fusion feature to obtain the joint probability of the image belonging to each candidate scene; and determining the candidate scene corresponding to the maximum joint probability as the scene to which the image belongs.
As an example, after the global feature of the image is extracted through the feature extraction network of the scene recognition model, the local feature of each local region is determined through the same feature extraction network; fusion processing is performed on the global feature and the at least one local feature through the fusion network, for example by concatenating the plurality of features end to end; and the prediction joint probability of the image sample, that is, the probability that the image belongs to the pre-labeled category, is determined based on the fusion processing result by the joint fully-connected layer of the scene recognition model.
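A minimal sketch of this fusion and classification step is shown below: the global feature and the K local features are concatenated end to end and mapped by a joint fully-connected layer to joint probabilities over the candidate scenes. The name Fc_all follows the later description, while the feature dimension, K and the number of classes are assumed values.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Sketch of the fusion network plus the joint fully-connected layer (Fc_all)."""
    def __init__(self, feat_dim=2048, num_regions=4, num_classes=10):
        super().__init__()
        self.fc_all = nn.Linear((1 + num_regions) * feat_dim, num_classes)   # (1 + K) x 2048 -> M

    def forward(self, global_feat, local_feats):
        fused = torch.cat([global_feat] + list(local_feats), dim=1)   # end-to-end concatenation (fusion feature)
        joint_prob = torch.softmax(self.fc_all(fused), dim=1)         # joint probability per candidate scene
        return joint_prob.argmax(dim=1), joint_prob                   # scene with the maximum joint probability

global_feat = torch.randn(1, 2048)
local_feats = [torch.randn(1, 2048) for _ in range(4)]
scene, probs = FusionClassifier()(global_feat, local_feats)
```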
Next, an exemplary application of the embodiment of the present application in a practical application scenario will be described.
In some embodiments, the artificial intelligence-based image scene recognition method provided by the embodiments of the present application is applied to a video recommendation scenario: a terminal uploads a video to be published to a server, the server performs scene classification processing on a key video frame of the video to obtain the scene classification result of the key video frame, the scene classification result is used as an attribute tag of the video to be published, and the video is published and recommended to users whose portraits match the attribute tag.
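A hypothetical sketch of this server-side flow is shown below; every function passed in (key-frame extraction, scene recognition, user matching, publishing) is a placeholder for components the patent does not specify.

```python
def tag_and_recommend(video, extract_key_frames, recognize_scene, find_matching_users, publish):
    """Classify key video frames, use the scene labels as attribute tags, and recommend the video."""
    tags = {recognize_scene(frame) for frame in extract_key_frames(video)}   # scene label per key frame
    for user in find_matching_users(tags):                                   # users whose portraits match a tag
        publish(video, user, tags)
    return tags
```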
Referring to fig. 5A, fig. 5A is an architecture schematic diagram of the artificial intelligence-based image scene recognition method provided in an embodiment of the present application. An image A is input to a convolutional neural network to obtain the global feature of image A, and the global feature is mapped through a global fully-connected layer to the probability that image A belongs to each candidate class (the predicted global probability). A global loss is determined based on the predicted global probability that image A belongs to the pre-labeled class (the pre-labeled class is one of the candidate classes) and the pre-labeled class of image A. Attention processing is performed through the attention network to obtain the attention intensity of each position, that is, the attention intensity of each point in the feature matrix of the global feature, and the position of at least one local region is obtained based on these attention intensities, so that the content of each local region is obtained in combination with the image. The content of each local region is input into the convolutional neural network to obtain the local feature corresponding to each local region, each local feature is further mapped by the localization prediction fully-connected layer to the probability of belonging to each candidate class (the predicted localization probability), and the localization loss of each local region is determined based on the predicted localization probability that the content of the local region belongs to the pre-labeled class and the pre-labeled class of image A. The local feature of each local region and the global feature are input to the fusion network for fusion processing; the probability that image A belongs to each candidate class (the predicted joint probability) is determined based on the fusion processing result, and the joint loss is determined based on the predicted joint probability that image A belongs to the pre-labeled class and the pre-labeled class of image A. The parameters in the framework of fig. 5A are updated based on the aggregate of the global loss, the joint loss, and the localization loss of each local region.
Referring to fig. 5B, fig. 5B is a schematic diagram of the architecture of the artificial intelligence-based image scene recognition method provided in an embodiment of the present application, which mainly includes a feature extraction network, an attention network, a fusion network, and a prediction network. The feature extraction network is trained first; its structure refers to the pooling-related structures in tables 1 and 2, and during training the structures shown in tables 1 and 2 are trained as a whole. Referring to fig. 6, fig. 6 is a schematic diagram of a residual network of the image scene recognition method based on artificial intelligence provided by an embodiment of the application; the structure of a residual module in table 1 is shown in fig. 6. The input of the residual module is 256-dimensional, and after convolution processing with convolution kernels of three sizes, the convolution processing result and the input are added together to serve as the input of the next residual module, where relu denotes the activation function and represents activation processing.
In some embodiments, assuming that the image recognition task is M-class image recognition, the structure in table 1 is initialized using the parameters of ResNet101 pre-trained on the ImageNet dataset as initialization parameters, and the newly added layers (e.g., the global fully-connected layer FC_cr) are initialized using a Gaussian distribution with a variance of 0.01 and a mean of 0. The convolution template parameters w and bias parameters b of the neural network model composed of tables 1 and 2 are solved by a stochastic gradient descent method, and all parameters of this neural network model are set to a state to be learned. In each iteration, M image samples are extracted to participate in forward propagation and backward updating: the predicted global probabilities of the extracted M image samples are computed in the forward pass, the global loss of each image sample is then calculated based on the image classification loss function and propagated back through the neural network model, and the gradients are calculated to update the parameters of the neural network model. In the calculation of the global loss, the predicted global probability that the original image belongs to the pre-labeled category and the pre-labeled category are substituted into the image classification loss function (such as a cross-entropy loss function) to obtain the global loss, the value of the global loss is propagated back through the neural network model, and the weight parameters are updated by the stochastic gradient descent method, thereby realizing one optimization of the weight parameters. After multiple iterations of the above process, a trained neural network model is obtained, and the subsequent learning of the whole framework is performed based on the trained neural network model.
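A minimal sketch of this pre-training stage, assuming a recent torchvision and an assumed number of candidate categories, learning rate and momentum, is given below; it initializes the backbone from ImageNet-pretrained ResNet101 weights, initializes the new fully-connected layer from a zero-mean Gaussian with variance 0.01, and updates the weights by stochastic gradient descent against a cross-entropy loss.

```python
import torch
import torch.nn as nn
import torchvision

num_classes = 10  # M candidate categories (assumed value)

# backbone initialized from ImageNet-pretrained ResNet101; a new global fully-connected layer replaces the head
model = torchvision.models.resnet101(weights=torchvision.models.ResNet101_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, num_classes)
nn.init.normal_(model.fc.weight, mean=0.0, std=0.1)   # std 0.1 = sqrt(variance 0.01)
nn.init.zeros_(model.fc.bias)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)   # stochastic gradient descent
criterion = nn.CrossEntropyLoss()                                        # image classification loss function

def train_step(images, labels):
    """One iteration: forward pass, global loss, backward propagation, weight update."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)   # predicted global probability vs. pre-labeled category
    loss.backward()                           # propagate the global loss back through the model
    optimizer.step()                          # one optimization of the weight parameters
    return loss.item()
```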
In some embodiments, table 3 is used as the attention network. The attention network includes a down-sampling layer (down1) and an attention intensity prediction layer (prompt2); the input of the down-sampling layer is the output of table 1. The matrix output by the attention intensity prediction layer has size bs × 6 × 9 × 15, where bs is the number of image samples in a forward pass and 6 is the number of channels: the down-sampling layer outputs 128 channels, which are compressed to 6 after the attention intensity prediction layer. 9 × 15 is the spatial height and width after convolution, and each value in the 9 × 15 map is the attention intensity of the spatial coordinate at which the corresponding point is located.
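The sketch below mirrors this description of the attention network: a down-sampling layer that maps the 2048-channel backbone output to 128 channels and an attention intensity prediction layer that compresses them to 6 channels over the 9 × 15 grid. The kernel sizes and strides are assumptions, since table 3 is not reproduced here.

```python
import torch
import torch.nn as nn

class AttentionNetwork(nn.Module):
    """Sketch of the attention network: down-sampling layer (down1) + attention intensity prediction layer."""
    def __init__(self, in_channels=2048):
        super().__init__()
        self.down1 = nn.Conv2d(in_channels, 128, kernel_size=3, padding=1)   # 2048 -> 128 channels
        self.predict = nn.Conv2d(128, 6, kernel_size=1)                      # 128 -> 6 channels

    def forward(self, backbone_output):               # backbone_output: bs x 2048 x 9 x 15
        x = torch.relu(self.down1(backbone_output))
        return self.predict(x)                        # bs x 6 x 9 x 15 attention intensity matrix

intensity = AttentionNetwork()(torch.randn(2, 2048, 9, 15))
print(intensity.shape)                                # torch.Size([2, 6, 9, 15])
```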
In some embodiments, a point in the matrix output by the attention intensity prediction layer can be traced back to a region of the image sample through the process shown in fig. 7. Referring to fig. 7, after the image passes through the plurality of convolutional layers, the plurality of pooling layers and the fully-connected layer, each region in the image is transformed into the attention intensity of the corresponding region (the attention intensity matrix); the backtracking process is the exact reverse, and each point in the attention intensity matrix is traced back to the corresponding region in the image to obtain a plurality of candidate regions. After the reshaping (reshape) operation, the 6 × 9 × 15 attention intensity matrix output by the attention intensity prediction layer becomes the predicted attention intensities of 810 points, and the 810 points are traced back to the coordinates (x1, y1, x2, y2) of candidate regions of the image. The 810 candidate regions are processed by the region extraction layer of the attention network through hard non-maximum suppression to obtain K candidate regions as local regions. The hard non-maximum suppression sorts the candidate regions by attention intensity from large to small, retains the candidate region with the largest attention intensity as a local region, and deletes the other candidate regions whose intersection ratio with it is greater than the intersection ratio threshold, that is, marks them as non-candidate regions. For example, suppose there are 4 candidate regions: (candidate region 1, 0.8), (candidate region 2, 0.9), (candidate region 3, 0.7), (candidate region 4, 0.5). Sorted by attention intensity from large to small, candidate region 2 > candidate region 1 > candidate region 3 > candidate region 4. Candidate region 2, which has the greatest attention intensity, is retained, and the intersection ratios between the remaining three candidate regions and candidate region 2 are calculated; any candidate region whose intersection ratio exceeds the threshold is deleted. Assuming the intersection ratio threshold is 0.5: intersection ratio (candidate region 1, candidate region 2) = 0.1 < 0.5, so candidate region 1 is retained; intersection ratio (candidate region 3, candidate region 2) = 0.7 > 0.5, so candidate region 3 is deleted; intersection ratio (candidate region 4, candidate region 2) = 0.2 < 0.5, so candidate region 4 is retained. The sorting and intersection ratio calculation are then repeated for candidate region 1 and candidate region 4 to obtain the next local region, with part of the candidate regions deleted each round (the candidate regions with an intersection ratio greater than the intersection ratio threshold are marked as non-candidate regions).
In some embodiments, the calculation process of the localization loss is as follows. After attention is extracted through the attention network (the local areas are determined), the final M categories are learned through the attention localization prediction of the localization prediction fully-connected layer (Fc_locate) shown in table 4, so that each attention output result has perception capability with respect to the categories. The sub-images of the K corresponding local areas are input into table 1, pooled through the pooling layer of table 2, and the pooling results are input into the localization prediction fully-connected layer; the output is the predicted localization probability that each of the K sub-images belongs to each of the M candidate categories. Finally, the loss between the predicted localization probabilities of the K sub-images and the pre-labeled category is calculated through the localization loss function to obtain the localization loss of the image.
In some embodiments, the pooling result corresponding to the complete image (the global feature) and the pooling result corresponding to each local region (the local features) are concatenated end to end to obtain a feature vector of size (1 + K) × 2048. A joint fully-connected layer (Fc_all) is then used to predict the probability that the image belongs to each of the M candidate categories based on this feature vector: the input of the joint fully-connected layer (table 5) is the (1 + K) × 2048 feature vector, and the output is 1 × M predicted joint probabilities, each being the probability, based on all predicted features of the image, that the image belongs to a certain candidate category. Finally, the candidate category with the highest predicted joint probability is used as the scene classification result.
In some embodiments, the global loss Loss_cr is calculated using formula (1), which is the cross-entropy loss function of the classification:

L = -[y·log(ŷ) + (1 - y)·log(1 - ŷ)]    (1)

where L is the global loss Loss_cr, the input is an image with a pre-labeled category, y is the value corresponding to the pre-labeled category of the image, and ŷ is the predicted global probability of belonging to the pre-labeled category.
In some embodiments, the calculation of the localization loss Loss_locate also relies on formula (1): the sum over the sub-images of the K local regions of the losses calculated by formula (1) is taken as the localization loss of the whole image. Since the prediction for the local regions is based on the result of deep feature activation, and the activated regions in the initial stage may be inaccurate, the localization loss needs to be considered; it is constrained so that the image is localized to correlated local regions. The input of the localization loss is the output of the localization prediction fully-connected layer Fc_locate, and the label is the pre-labeled category of the image. The joint loss Loss_all is also calculated using formula (1); its input is the output of the joint fully-connected layer Fc_all, and the label is the pre-labeled category of the image. The final overall loss is a·Loss_cr + b·Loss_locate + c·Loss_all, where a, b and c are the weight parameters of the respective loss functions.
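A short sketch of the overall loss is given below, using the cross-entropy of formula (1) for each term; the weight values a, b and c and the use of logits rather than probabilities are assumptions.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()   # formula (1): classification cross-entropy

def overall_loss(global_logits, locate_logits, joint_logits, label, a=1.0, b=0.5, c=1.0):
    """Overall loss = a * Loss_cr + b * Loss_locate + c * Loss_all (weights are assumed values)."""
    loss_cr = criterion(global_logits, label)                      # global loss from the global FC layer
    loss_locate = sum(criterion(l, label) for l in locate_logits)  # sum over the K local-region sub-images
    loss_all = criterion(joint_logits, label)                      # joint loss from Fc_all
    return a * loss_cr + b * loss_locate + c * loss_all

label = torch.tensor([3])
loss = overall_loss(torch.randn(1, 10), [torch.randn(1, 10) for _ in range(4)], torch.randn(1, 10), label)
```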
In some embodiments, the scene recognition model may be subjected to noise training, and the scene recognition model obtained through noise training is loaded on a cloud server to provide a scene recognition service. Referring to fig. 8, fig. 8 is a processing flow diagram of the artificial intelligence-based image scene recognition method provided in the embodiments of the present application: terminal A receives an image input by a user and uploads it to the server, the server performs scene classification on the image using the scene recognition model provided in the embodiments of the present application, and the scene classification result is output to terminal B and/or terminal A for corresponding display. Terminal B is a terminal different from terminal A; for example, terminal A uploads the image to the server for distribution, and terminal B receives the image issued by the server together with the classification result of the corresponding image.
The embodiment of the application provides an image scene recognition method based on artificial intelligence, which can perform end-to-end scene recognition based on the combination of local features and global features of an image, perform self-supervised attention feature (local feature) extraction in a high-dimensional image space, and perform scene recognition by combining the global features. Its advantages are: 1) local features in the background are mined by attention, so that the secondary labeling and model investment caused by manual labeling or a staged target detection method are avoided; 2) complete key information of the scene is extracted for recognition by fusing local features and global features, so that the problem of inaccurate recognition based on local features alone is avoided; 3) based on the input image and the corresponding pre-labeled category, attention learning is carried out in a self-supervised manner so as to extract local features through attention processing, and joint recognition is carried out by combining the local features and the global features to obtain the scene classification result, thereby realizing end-to-end scene recognition; and 4) local features are quickly acquired through attention processing by utilizing open-source labeled data (scene label data) and models (such as ResNet101 models) without additional labeling investment, and the data is expanded at the semantic level to improve generalization capability.
The image scene recognition method based on artificial intelligence provided above can use different network structures and different pre-trained model weights as base models; the attention network can have different network parameters, other network layers can be added, and so on.
Continuing with the exemplary structure of the artificial intelligence based image scene recognition device 255 provided by the embodiments of the present application as software modules, in some embodiments, as shown in fig. 3, the software modules stored in the artificial intelligence based image scene recognition device 255 of the memory 250 may include: a global module 2551, configured to obtain global features of the image; an attention module 2552, configured to perform attention processing on the image to obtain at least one local area of a background in the image; a fusion module 2553, configured to obtain a local feature of each local region, and perform fusion processing on at least one local feature and the global feature to obtain a fusion feature of a background in the image; and the classification module 2554 is configured to perform scene classification processing on the image based on the fusion features to obtain a scene to which the image belongs.
In some embodiments, global module 2551 is further configured to: extracting global convolution characteristics of the image; performing pooling processing on the global convolution characteristics to obtain global pooling characteristics of the image; and performing residual processing on the global pooling characteristics in multiple levels, and performing pooling processing on the characteristic extraction result obtained by the residual processing to obtain the global characteristics of the image.
In some embodiments, global module 2551 is further configured to: perform feature extraction processing on the input of the nth residual network in the N cascaded residual networks; and transmit the nth feature extraction result output by the nth residual network to the (n+1)th residual network to continue the feature extraction processing; wherein N is an integer greater than or equal to 2, n is an integer whose value increases from 1, and the value range of n is 1 ≤ n ≤ N-1; when the value of n is 1, the input of the nth residual network is the global pooling feature of the image; when n is greater than or equal to 2 and less than or equal to N-1, the input of the nth residual network is the feature extraction result of the (n-1)th residual network; and when the value of n is N-1, maximum pooling is performed on the feature extraction result output by the (n+1)th residual network.
In some embodiments, global module 2551 is further configured to: performing fusion processing on the output of the (n-1) th residual error network and the input of the (n-1) th residual error network to obtain a fusion processing result; and activating the fusion processing result, and performing multi-size convolution processing on the activation processing result through the convolution layer of the nth residual error network.
In some embodiments, attention module 2552 is further configured to: extracting global convolution characteristics of a background in an image; performing pooling processing on the global convolution characteristics to obtain global pooling characteristics of the image; and carrying out residual error processing of multiple levels on the global pooling characteristics, and carrying out local region prediction processing on a characteristic extraction result obtained by residual error processing to obtain at least one local region.
In some embodiments, attention module 2552 is further configured to: performing downsampling processing on the feature extraction result, and performing attention intensity prediction processing on the downsampling processing result to obtain the attention intensity of each space coordinate in the downsampling processing result; backtracking each space coordinate to obtain a candidate area corresponding to each space coordinate; and performing non-maximum suppression processing on the plurality of candidate regions based on the attention intensities of the plurality of candidate regions to obtain at least one local region.
In some embodiments, attention module 2552 is further configured to: when the number of candidate regions is greater than the region number threshold, performing the following: sequencing the attention intensities of the candidate regions, and determining the candidate region with the highest attention intensity as a local region according to a sequencing result; for each candidate region except the candidate region with the highest attention intensity in the ranking result, performing the following processing: and determining the intersection ratio between each candidate region and the candidate region with the highest attention intensity in the sequencing result, and marking the candidate region with the intersection ratio larger than the intersection ratio threshold as a non-candidate region.
In some embodiments, a fusion module 2553 for: extracting local convolution characteristics of each local area in the image; pooling the local convolution characteristics to obtain pooling characteristics of each local area in the image; and performing residual error processing of multiple layers on the pooled feature of each local area, and performing pooled processing on the feature extraction result obtained by the residual error processing to obtain the local feature of each local area.
In some embodiments, a fusion module 2553 for: performing end-to-end processing on at least one local feature and global feature to obtain a fusion feature of a background in the image; a classification module 2554, further configured to: performing probability mapping processing on the fusion characteristics to obtain joint probability of the image belonging to each candidate scene; and determining the candidate scene corresponding to the maximum joint probability as the scene to which the image belongs.
In some embodiments, the scene classification processing for the image is implemented by a scene recognition model, and the scene recognition model is obtained with the auxiliary training of an image recognition model and an attention localization model; the device further includes a training module 2555, configured to: train the image recognition model separately based on image samples and an image classification loss function; perform fusion processing on the image classification loss function, the joint classification loss function and the localization loss function to obtain an overall loss function; and train the scene recognition model, the separately trained image recognition model and the attention localization model as a whole based on the image samples and the overall loss function; wherein the scene recognition model, the image recognition model and the attention localization model share a feature extraction network.
In some embodiments, training module 2555 is further configured to: the following processing is executed in each iterative training process of the image recognition model: extracting global features of the image samples through a feature extraction network, and mapping the global features into predicted global probabilities belonging to pre-labeled categories through a global full-link layer of an image recognition model; and substituting the pre-marked category and the prediction global probability of the corresponding image sample into the image classification loss function to determine the parameters of the image identification model when the image classification loss function obtains the minimum value.
In some embodiments, training module 2555 is further configured to: determining a prediction joint probability that the image sample belongs to the pre-marked category through a scene recognition model; determining a predicted global probability that the image sample belongs to the pre-labeled category through an image recognition model; predicting a plurality of sample local areas of the image sample by an attention localization model to determine a prediction localization probability that image content in each sample local area belongs to a pre-labeled category; and substituting the prediction joint probability, the prediction positioning probability, the prediction global probability and the pre-marking category into the overall loss function to determine parameters of the scene recognition model, the image recognition model and the attention positioning model when the overall loss function obtains the minimum value.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the artificial intelligence-based image scene recognition method according to the embodiment of the application.
Embodiments of the present application provide a computer-readable storage medium storing executable instructions, which when executed by a processor, perform an artificial intelligence based image scene recognition method provided by embodiments of the present application, for example, the artificial intelligence based image scene recognition method shown in fig. 4A to 4C.
In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a Hypertext Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, taking an electronic device as a computer device, executable instructions may be deployed to be executed on one computing device or on multiple computing devices located at one site or distributed across multiple sites and interconnected by a communication network.
In summary, according to the embodiments of the application, attention processing is performed on an image to obtain at least one local region of the background in the image and the local feature of each local region, so that salient features of the image background are mined through an attention mechanism. At least one local feature and the global feature are fused to obtain the fusion feature of the background in the image, so that key information about the scene in the background is extracted through fusion for scene classification; this avoids the problem that the scene type cannot be accurately identified based on local regions alone and improves scene recognition accuracy.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (15)

1. An artificial intelligence based image scene recognition method, characterized in that the method comprises:
acquiring global features of an image;
performing attention processing on the image to obtain at least one local area of a background in the image;
acquiring local features of each local area, and performing fusion processing on at least one local feature and the global feature to obtain fusion features of a background in the image;
and carrying out scene classification processing on the image based on the fusion characteristics to obtain a scene to which the image belongs.
2. The method of claim 1, wherein obtaining global features of the image comprises:
extracting global convolution characteristics of the image;
performing pooling processing on the global convolution characteristics to obtain global pooling characteristics of the image;
and carrying out residual error processing of multiple levels on the global pooling characteristics, and carrying out pooling processing on a characteristic extraction result obtained by the residual error processing to obtain the global characteristics of the image.
3. The method of claim 2,
the performing a plurality of levels of residual processing on the global pooled features and performing pooled processing on feature extraction results obtained by the residual processing includes:
performing feature extraction processing on the input of an nth residual error network in N cascaded residual error networks;
transmitting the nth characteristic extraction result output by the nth residual error network to an (n + 1) th residual error network to continue the characteristic extraction processing;
wherein N is an integer greater than or equal to 2, n is an integer whose value increases from 1, and the value range of n is 1 ≤ n ≤ N-1; when the value of n is 1, the input of the nth residual network is the global pooling characteristic of the image, and when the value of n is greater than or equal to 2 and less than or equal to N-1, the input of the nth residual network is the characteristic extraction result of the (n-1)th residual network;
and when the value of n is N-1, performing maximum pooling on the feature extraction result output by the (n+1)th residual network.
4. The method of claim 3,
the feature extraction processing is performed on the input of the nth residual error network through the nth residual error network of the N cascaded residual error networks, and includes:
performing fusion processing on the output of the (n-1) th residual error network and the input of the (n-1) th residual error network to obtain a fusion processing result;
and performing activation processing on the fusion processing result, and performing multi-size convolution processing on the activation processing result through the convolution layer of the nth residual error network.
5. The method of claim 1,
the performing attention processing on the image to obtain at least one local area of a background in the image includes:
extracting global convolution characteristics of a background in the image;
performing pooling processing on the global convolution characteristics to obtain global pooling characteristics of the image;
and carrying out residual error processing of multiple levels on the global pooling characteristics, and carrying out local region prediction processing on a characteristic extraction result obtained by the residual error processing to obtain at least one local region.
6. The method according to claim 5, wherein said performing a local region prediction process on the feature extraction result obtained by the residual error process to obtain at least one local region comprises:
performing downsampling processing on the feature extraction result, and performing attention intensity prediction processing on the downsampling processing result to obtain the attention intensity of each space coordinate in the downsampling processing result;
backtracking each spatial coordinate to obtain a candidate region corresponding to each spatial coordinate;
and performing non-maximum suppression processing on the candidate regions based on the attention intensities of the candidate regions to obtain at least one local region.
7. The method of claim 6, wherein said performing non-maximum suppression processing on a plurality of said candidate regions based on the attention intensities of said candidate regions to obtain at least one said local region comprises:
when the number of candidate regions is greater than a region number threshold, performing the following: sequencing the attention intensities of the candidate regions, and determining the candidate region with the highest attention intensity as the local region according to a sequencing result;
the method further comprises the following steps:
for each candidate region except the candidate region with the highest attention intensity in the ranking result, performing the following processing: and determining the intersection ratio between each candidate region and the candidate region with the highest attention intensity in the sorting result, and marking the candidate region with the intersection ratio larger than the intersection ratio threshold as a non-candidate region.
8. The method of claim 1,
the acquiring local features of each local region includes:
extracting the local convolution characteristics of each local area in the image;
pooling the local convolution characteristics to obtain pooling characteristics of each local area in the image;
and performing residual error processing on the pooled feature of each local area in multiple layers, and performing maximum pooled processing on a feature extraction result obtained by the residual error processing to obtain the local feature of each local area.
9. The method of claim 1,
the performing fusion processing on at least one local feature and the global feature to obtain a fusion feature of a background in the image includes:
performing end-to-end processing on at least one local feature and the global feature to obtain a fusion feature of a background in the image;
the scene classification processing is performed on the image based on the fusion characteristics to obtain a scene to which the image belongs, and the method comprises the following steps:
performing probability mapping processing on the fusion features to obtain the joint probability of the image belonging to each candidate scene;
and determining the candidate scene corresponding to the maximum joint probability as the scene to which the image belongs.
10. The method according to claim 1, wherein the scene classification processing for the image is implemented by a scene recognition model, and the scene recognition model is obtained by auxiliary training of an image recognition model and an attention localization model;
before the acquiring global features of the image, the method further includes:
training the image recognition model individually based on image samples and an image classification loss function;
performing fusion processing on the image classification loss function, the combined classification loss function and the positioning loss function to obtain an overall loss function;
training the scene recognition model, the separately trained image recognition model and the attention localization model as a whole based on the image samples and the whole loss function;
wherein the scene recognition model, the image recognition model, and the attention localization model share a feature extraction network.
11. The method of claim 10,
the training the image recognition model separately based on the image samples and the image classification loss function includes:
executing the following processing in each iterative training process of the image recognition model:
extracting global features of the image samples through the feature extraction network, and mapping the global features into predicted global probabilities belonging to pre-labeled categories through a global full-link layer of the image recognition model;
and substituting the pre-marked category corresponding to the image sample and the prediction global probability into the image classification loss function to determine the parameters of the image identification model when the image classification loss function obtains the minimum value.
12. The method of claim 10,
the training the scene recognition model, the separately trained image recognition model, and the attention localization model as a whole based on the image samples and the overall loss function includes:
determining, by the scene recognition model, a predictive joint probability that the image sample belongs to a pre-labeled category;
determining, by the image recognition model, a predicted global probability that the image sample belongs to the pre-labeled class;
predicting at least one sample local area of the image sample by the attention localization model to determine a predicted localization probability that image content in each of the sample local areas belongs to the pre-labeled class;
and substituting the prediction joint probability, the prediction positioning probability, the prediction global probability and the pre-marking category into the overall loss function to determine parameters of the scene recognition model, the image recognition model and the attention positioning model when the overall loss function obtains a minimum value.
13. An artificial intelligence based image scene recognition apparatus, the apparatus comprising:
the global module is used for acquiring global characteristics of the image;
the attention module is used for carrying out attention processing on the image to obtain at least one local area of a background in the image;
the fusion module is used for acquiring local features of each local area and performing fusion processing on at least one local feature and the global feature to obtain fusion features of a background in the image;
and the classification module is used for carrying out scene classification processing on the image based on the fusion characteristics to obtain the scene to which the image belongs.
14. An electronic device, comprising:
a memory for storing executable instructions;
a processor for implementing the artificial intelligence based image scene recognition method of any one of claims 1 to 12 when executing executable instructions stored in the memory.
15. A computer-readable storage medium storing executable instructions for implementing the artificial intelligence based image scene recognition method of any one of claims 1 to 12 when executed by a processor.
Also Published As

Publication number Publication date
CN112699855B (en) 2021-10-22

Similar Documents

Publication Publication Date Title
CN112699855B (en) Image scene recognition method and device based on artificial intelligence and electronic equipment
CN112052787B (en) Target detection method and device based on artificial intelligence and electronic equipment
US20230025317A1 (en) Text classification model training method, text classification method, apparatus, device, storage medium and computer program product
CN111582409B (en) Training method of image tag classification network, image tag classification method and device
CN111612134B (en) Neural network structure searching method and device, electronic equipment and storage medium
CN113627447B (en) Label identification method, label identification device, computer equipment, storage medium and program product
CN112396106B (en) Content recognition method, content recognition model training method, and storage medium
CN115223020B (en) Image processing method, apparatus, device, storage medium, and computer program product
WO2021093435A1 (en) Semantic segmentation network structure generation method and apparatus, device, and storage medium
CN111291819A (en) Image recognition method and device, electronic equipment and storage medium
CN112949620B (en) Scene classification method and device based on artificial intelligence and electronic equipment
US20220171935A1 (en) Machine-learning techniques for augmenting electronic documents with data-verification indicators
CN113705293B (en) Image scene recognition method, device, equipment and readable storage medium
CN112990378B (en) Scene recognition method and device based on artificial intelligence and electronic equipment
JP2022520000A (en) Data processing methods, data processing equipment, computer programs and electronic equipment
CN116630609B (en) Image object detection method and device
CN114911915B (en) Knowledge graph-based question and answer searching method, system, equipment and medium
CN114332893B (en) Table structure recognition method, device, computer equipment and storage medium
CN115187772A (en) Target detection network training and target detection method, device and equipment
CN114492601A (en) Resource classification model training method and device, electronic equipment and storage medium
CN111507407A (en) Training method and device of image classification model
CN113434722B (en) Image classification method, device, equipment and computer readable storage medium
WO2022063076A1 (en) Adversarial example identification method and apparatus
HK40041978A (en) Method and apparatus for recognizing image scenes based on artificial intelligence, and electronic device
HK40041978B (en) Method and apparatus for recognizing image scenes based on artificial intelligence, and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40041978
Country of ref document: HK
GR01 Patent grant