
WO2016112797A1 - Method and device for determining image display information - Google Patents


Info

Publication number
WO2016112797A1
Authority
WO
WIPO (PCT)
Prior art keywords
picture
display information
information
display
training
Prior art date
Application number
PCT/CN2016/070157
Other languages
French (fr)
Chinese (zh)
Inventor
石克阳
曹阳
Original Assignee
阿里巴巴集团控股有限公司
石克阳
曹阳
Application filed by 阿里巴巴集团控股有限公司, 石克阳, 曹阳 filed Critical 阿里巴巴集团控股有限公司
Publication of WO2016112797A1 publication Critical patent/WO2016112797A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • the present application relates to the field of computers, and in particular, to a method and apparatus for determining picture display information.
  • the online shopping platform provides various commodity information release mechanisms for various e-commerce providers, and merchants can upload photos of products with multiple angles and multiple backgrounds to attract users.
  • the poor image display method not only hinders the user from obtaining the required information, but also wastes the user's valuable bandwidth resources and reduces the user's screen utilization.
  • due to the open nature of the Internet, such a situation will continue to exist; moreover, due to the explosive growth of Internet information, it is not feasible to attempt to manually review the display of these pictures.
  • a method for determining picture display information includes:
  • an apparatus for determining picture display information comprising:
  • a first device configured to acquire a plurality of training pictures that have been marked with display information
  • a second device configured to obtain a corresponding picture detection model by training a convolutional neural network based on the plurality of training pictures
  • a third device configured to determine, according to the picture detection model, picture display information of the picture to be detected.
  • the present application models the different display modes of pictures and determines the picture display information of a picture to be detected through the built model, thereby realizing efficient and accurate recognition of the picture display mode; this in turn supports further improvement of the displayed pictures or their display modes, thereby improving the efficiency with which users obtain information, increasing the utilization rate of the screen resources of the user terminal, and improving the user experience.
  • FIG. 1 shows a schematic diagram of an apparatus for determining picture display information in accordance with an aspect of the present application
  • FIG. 2 is a schematic diagram showing a correspondence relationship between training pictures and display information acquired in an apparatus for determining picture display information according to an aspect of the present application;
  • FIG. 3 shows a schematic diagram of a first device in an apparatus for determining picture display information in accordance with a preferred embodiment of the present application
  • FIG. 4 shows a flow chart executed by a second device in an apparatus for determining picture display information in accordance with a preferred embodiment of the present application
  • FIG. 5 is a schematic diagram showing a third device in an apparatus for determining picture display information according to a preferred embodiment of the present application.
  • FIG. 6 shows a flow chart of a method for determining picture display information according to another aspect of the present application.
  • Figure 7 shows a flow chart of step S1 in a method for determining picture display information in accordance with a preferred embodiment of the present application
  • FIG. 8 shows a flow chart of step S3 in a method for determining picture display information in accordance with another preferred embodiment of the present application.
  • the terminal, the device of the service network, and the trusted party each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • the memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer readable medium, such as read only memory (ROM) or flash memory.
  • Memory is an example of a computer readable medium.
  • Computer readable media includes both permanent and non-persistent, removable and non-removable media.
  • Information storage can be implemented by any method or technology.
  • the information can be computer readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, and the like.
  • computer readable media, as defined herein, does not include transitory media such as modulated data signals and carrier waves.
  • FIG. 1 shows an apparatus 1 for determining picture display information in accordance with an aspect of the present application.
  • the device 1 includes a first device 11, a second device 12, and a third device 13.
  • the first device 11 is configured to acquire a plurality of training pictures that have been labeled with display information
  • the second device 12 is configured to obtain a corresponding picture detection model by using a convolutional neural network based on the plurality of training pictures
  • the third device 13 is configured to determine picture display information of the to-be-detected picture according to the picture detection model.
  • the device 1 may be implemented by a network host, a single network server, a plurality of network server sets, or a cloud composed of a plurality of servers.
  • the cloud is composed of a large number of hosts or network servers based on cloud computing, which is a kind of distributed computing: a super virtual computer composed of a group of loosely coupled computers.
  • the device 1 includes an electronic device capable of automatically performing numerical calculation and information processing according to instructions set or stored in advance, the hardware of which includes, but is not limited to, a microprocessor, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), embedded devices, and the like.
  • the first device 11 acquires training pictures that meet the size, format, and other requirements of the picture detection model to be built by the second device 12, either remotely through an agreed communication method such as http or https, or by local reading.
  • the training picture may be the stored source picture, or may be a picture obtained after the source picture is trimmed.
  • the first device 11 uniformly acquires each training picture in accordance with the classification of the display information.
  • the display information includes any information that can describe the placement information of the item displayed by the training picture, and the details or overall effect of the item.
  • for example, the display information describing a training picture of clothing includes, but is not limited to: model upper body display information, model lower body display information, model whole body display information, upper body tile display information, lower body tile display information, whole body tile display information, detail display information, stacked display information, multi-picture display information, other display information, and the like; its correspondence with the training pictures is exemplified in FIG. 2.
  • the display information describing the training picture of the furniture includes, but is not limited to, front display information, three-dimensional display information, side display information, detail display information, other display information, and the like.
  • the display information corresponding to each of the training pictures may be directly obtained from a database corresponding to each training picture. It may also be determined according to the obtained display manner of each training picture, wherein the display manner includes a plurality of display classification information, and the marked display information includes at least one of the plurality of display classification information.
  • for example, the display manner preset in the first device 11 for lower-body garments includes three types of display classification information, specifically: model lower body display classification information, lower body tile display classification information, and lower body detail display classification information.
  • the training pictures acquired by the first device 11 include: a picture of a model showing pants and a picture of a model showing a skirt; the display manner obtained for both training pictures is the model lower body display classification information, so the display information corresponding to each picture is: model lower body display information.
  • in another example, a picture among the training pictures acquired by the first device 11 includes both an image of a model showing pants and an image of the pants' details; according to the acquired corresponding display manner, the first device 11 determines that the display information of the picture includes the model lower body display information and the lower body detail display classification information.
  • the first device 11 obtains a plurality of training pictures by trimming the source picture.
  • the first device 11 includes: a first unit 111 and a second unit 112 (shown in FIG. 3).
  • the first unit 111 is configured to acquire a plurality of sample pictures that have been labeled with display information; the second unit 112 is configured to pre-process each sample picture to obtain a corresponding training picture.
  • the first unit 111 acquires a plurality of sample pictures remotely by means of an agreed communication method such as http or https, or by local reading or the like; since the acquired sample pictures differ in size, color, and the like, the second unit 112 pre-processes each sample picture to obtain training pictures that meet the preset size and color requirements.
  • the manner in which the second unit 112 pre-processes each sample picture includes selecting, from the acquired sample pictures, pictures that meet preset size, color, and similar requirements as the training pictures.
  • the processing manner comprises: normalizing each sample picture to obtain a corresponding training picture.
  • the manner of the normalization processing includes, but is not limited to, at least one of the following: 1) converting a sample picture into a three primary color (RGB) representation; for example, if the acquired sample picture is in JPG format, the second unit 112 converts the sample picture into RGB format; 2) scaling the sample picture so that one side has a fixed length.
  • for example, the second unit 112 scales the sample picture so that its short side has size a and its long side has size (a*b_i/a_i), where i is the sequence number of the sample picture, 0 < i ≤ n, n is the number of sample pictures acquired, a_i is the short side size of the i-th sample picture, and b_i is the long side size of the i-th sample picture.
  • 3) cropping the sample picture to make it square; for example, the second unit 112 trims the acquired sample picture by a width of (a_i - a)/2 on each side along its short dimension and by a width of (b_i - a)/2 on each side along its long dimension, where i is the sequence number of each sample picture, 0 < i ≤ n, n is the number of sample pictures acquired, a_i is the short side size of the i-th sample picture, b_i is the long side size of the i-th sample picture, and a is the side length of the cropped sample picture.
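The scale-and-crop geometry above can be sketched in a few lines. This is an illustration, not the patent's implementation; it assumes the scaled long side preserves the aspect ratio (a*b_i/a_i) and that trims are taken symmetrically, with the example sizes (400 x 600 sample, target a = 256) chosen arbitrarily.

```python
# Sketch of the normalization geometry: scale the short side to a,
# or crop the a_i x b_i sample picture down to an a x a square.

def scaled_size(a, a_i, b_i):
    """Scale so the short side becomes a; the long side becomes a*b_i/a_i."""
    return a, a * b_i // a_i  # integer pixel sizes

def crop_margins(a, a_i, b_i):
    """Width trimmed from each side to crop an a_i x b_i picture to a x a."""
    return (a_i - a) // 2, (b_i - a) // 2

# Example: a 400 x 600 sample picture, target side a = 256.
short, long_ = scaled_size(256, 400, 600)     # (256, 384)
m_short, m_long = crop_margins(256, 400, 600)  # (72, 172)
```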
  • the second unit 112 can also first convert the sample picture into a three primary color representation, then scale the sample picture so that one side has a fixed length, and/or crop it.
  • the number of the sample pictures may be the same as the number of training pictures, or may be less than the number of training pictures.
  • in a preferred embodiment, the second unit 112 intercepts a plurality of corresponding training pictures from each normalized sample picture by using a moving window.
  • for example, the number of sample pictures acquired by the first unit 111 is n; the second unit 112 first normalizes the acquired sample pictures according to any one or more of the foregoing manners.
  • then, each cropped sample picture of size a*a is traversed by a moving window of size a'*a', wherein the moving step is t.
  • the number of training pictures intercepted from each sample picture is 1+(a-a')/t, so the total number of training pictures obtained by the second unit 112 is n*(1+(a-a')/t).
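The window count above can be checked with a short sketch (an illustration with arbitrary example sizes, not the patent's code): along one axis, an a'-wide window moved in steps of t over an a-wide picture visits offsets 0, t, 2t, ..., a - a', which is exactly 1 + (a - a')/t positions.

```python
# Enumerate the offsets of an a' x a' moving window over an a x a picture.

def window_offsets(a, a_prime, t):
    """Valid window offsets along one axis, stepping by t."""
    return list(range(0, a - a_prime + 1, t))

a, a_prime, t = 256, 224, 8
offsets = window_offsets(a, a_prime, t)
count = len(offsets)  # matches 1 + (a - a')/t = 1 + 32/8 = 5
```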
  • in a preferred embodiment, the second unit 112 intercepts, from each normalized sample picture, a plurality of corresponding training pictures by using a moving window such that each obtained training picture retains the lower half of the original sample picture.
  • for example, the second unit 112 keeps the moving window aligned with the bottom of the sample picture during the movement, thereby intercepting from the sample picture a plurality of training pictures that retain the lower half of the original sample picture.
  • the second unit 112 can also obtain more training pictures by applying rotation processing, such as mirror flipping and in-plane rotation, to the intercepted training pictures; for example, after performing normalization according to any one or more of the above manners, and even after intercepting a plurality of corresponding training pictures, the second unit 112 rotates the obtained training pictures so as to obtain more training pictures, which are then delivered to the second device 12.
  • the items displayed in the training pictures acquired by the first device 11 should belong to the same type of items.
  • for example, all the acquired training pictures display clothing items; or, all the acquired training pictures display digital products, and the like.
  • the second device 12 trains the convolutional neural network based on the plurality of training pictures to obtain a corresponding picture detection model.
  • specifically, the second device 12 performs convolutional neural network training on each training picture acquired by the first device 11 to obtain the feature vectors (i.e., neurons) corresponding to each type of display information, and then classifies the obtained feature vectors according to each type of display information to obtain the picture detection model.
  • that is, the second device 12 performs convolutional neural network training on each training picture and associates the obtained feature vector with the display information of that training picture; when all training pictures have completed the convolutional neural network training, the fixed feature vectors corresponding to the same display information are subjected to normalized classification processing in all dimensions, and the classified feature vectors of each dimension, each corresponding to one type of display information, finally constitute the picture detection model.
  • the convolutional neural network includes three convolution layers and two fully connected layers.
  • the second device 12 iterates the results obtained by each convolution layer in a gradient descent manner, and then uses the two fully connected layers to establish connection relationships between the obtained feature vectors.
  • the convolutional neural network may also preferably set a dropout layer (shown in FIG. 4) in one of the fully connected layers to improve the efficiency of model convergence; here, the role of the dropout layer is to put some of the parameters in its corresponding convolution layer or fully connected layer into dormancy, wherein their parameter values are retained but not updated until they are no longer selected for dormancy.
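The dormancy mechanism above can be illustrated with the standard (inverted) dropout formulation, a simplification of what the patent describes: here randomly selected units output zero for one pass while kept units are rescaled, rather than freezing parameter updates. The drop probability and input values are arbitrary examples.

```python
import random

# Illustrative standard inverted dropout on a layer's activations:
# dropped units output 0.0 this pass; kept units are divided by the
# keep probability so the expected activation is unchanged.

def dropout(activations, p_drop, rng):
    keep = 1.0 - p_drop
    return [0.0 if rng.random() < p_drop else x / keep for x in activations]

rng = random.Random(0)  # seeded for reproducibility
out = dropout([1.0, 2.0, 3.0, 4.0], p_drop=0.5, rng=rng)
```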
  • the convolutional neural network also includes a softmax layer; during the training phase, the training pictures and the corresponding display information are used together, and the whole network is trained through multiple layers, such as the dropout layer, the convolution layers, and the fully connected layers; the display information comes into play in the softmax layer, which is the last layer.
  • specifically, the softmax layer includes a nonlinear classifier that uses the feature vectors output by the fully connected layer and the corresponding labels for classifier training.
  • the whole softmax process can be divided into three steps: the first step finds the maximum value over all dimensions of the fixed feature vector X, denoted Max_i; the second step uses the exponential function exp to map each dimension, shifted by Max_i, into the interval (0, 1]; the third step normalizes the exponentiated values by their sum so that the output forms a probability distribution.
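The softmax steps above can be sketched directly; this is the standard numerically stable formulation (subtracting Max_i keeps exp from overflowing), with an arbitrary three-dimensional example vector.

```python
import math

# The softmax steps: find the maximum, exponentiate the shifted values,
# then normalize by the sum to obtain a probability distribution.

def softmax(x):
    max_i = max(x)                            # step 1: maximum over all dimensions
    exps = [math.exp(v - max_i) for v in x]   # step 2: exp of shifted values, in (0, 1]
    total = sum(exps)
    return [e / total for e in exps]          # step 3: normalize to probabilities

probs = softmax([2.0, 1.0, 0.0])
```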
  • in another preferred embodiment, the second device 12 inputs the picture itself as features into the convolutional neural network for training: each acquired training picture is directly converted into a feature matrix [W, H, C], where W is the width dimension of the training picture, H is the height dimension of the training picture, and C is the display information, such as the display classification information, of the training picture; all the pictures are then fed into the model for training in batches of K.
  • the stochastic gradient descent method is used to iteratively train the above convolutional neural network, where K is generally 32 or 64.
  • each iteration updates the parameters of each layer in the network, such as the weight values and bias values of the nodes in the network layers, until the values of these parameters converge to an optimal solution.
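The iterate-until-convergence loop above can be illustrated with a toy gradient descent on a one-dimensional quadratic; this is not the patent's network, just the same update rule (move each parameter against its gradient) with arbitrary learning rate and target.

```python
# Toy gradient descent: minimize (w - 3)^2, so the weight w should
# converge to 3. Each iteration applies w <- w - lr * gradient, the
# same parameter-update step described for the network's weights.

def train(w, lr, steps):
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)  # gradient of (w - 3)^2 with respect to w
        w -= lr * grad
    return w

w_final = train(w=0.0, lr=0.1, steps=100)  # converges close to 3.0
```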
  • after the three convolution layers, the second device 12 can downsample the results of the convolution layer processing (as shown by the Maxpooling (max pooling) layer in FIG. 4); then, the second device 12 uses the fully connected layer to establish connection relationships among all the feature vectors (i.e., neurons) output through the downsampling, thereby implementing abstract expression.
  • in a preferred embodiment, the second device 12 is provided with a RELU (rectified linear unit, an activation function) layer and a normalization layer after each convolution layer.
  • the normalization layer performs normalization based on a local window around each pixel, that is, a local normalization operation, which can enhance the overall generalization performance of the model.
  • the convolution layer comprises a Gaussian convolution layer
  • the Gaussian convolution layer is configured to perform a convolution operation on the output result of the previous layer with a plurality of Gaussian filter kernels, wherein the Gaussian filter kernels are learned based on the plurality of training pictures.
  • specifically, the second device 12 uses the Gaussian convolution layer to perform a convolution operation on the output result of the previous layer with a plurality of preset Gaussian filter kernels, and the parameters of the Gaussian kernels are learned.
  • for example, the size of the Gaussian kernels set by the second device 12 for the three Gaussian convolution layers is 5*5, and in each Gaussian convolution layer the convolution kernel traverses all pixels of the picture for calculation.
  • the second device 12 learns 64 Gaussian convolution kernels for the first convolution layer, 32 Gaussian convolution kernels for the second convolution layer, and 16 Gaussian convolution kernels for the third convolution layer.
  • the numbers of Gaussian convolution kernels of the above convolution layers are only examples; in practice, the number of Gaussian convolution kernels of each convolution layer may be determined by actual needs.
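A single 5*5 Gaussian kernel and one "valid" convolution pass can be sketched as follows. This is an illustration of the operation only: the patent's kernels are learned, whereas here the kernel is a fixed Gaussian with an assumed sigma of 1.0, applied to a small constant grayscale image.

```python
import math

# Build a normalized 5 x 5 Gaussian filter kernel, then slide it over
# an image ("valid" convolution: the output shrinks by kernel size - 1).

def gaussian_kernel(size=5, sigma=1.0):
    c = size // 2
    k = [[math.exp(-((i - c) ** 2 + (j - c) ** 2) / (2 * sigma ** 2))
          for j in range(size)] for i in range(size)]
    s = sum(sum(row) for row in k)
    return [[v / s for v in row] for row in k]  # entries sum to 1

def convolve(img, kernel):
    n, ksz = len(img), len(kernel)
    out_sz = n - ksz + 1
    return [[sum(img[i + u][j + v] * kernel[u][v]
                 for u in range(ksz) for v in range(ksz))
             for j in range(out_sz)] for i in range(out_sz)]

kernel = gaussian_kernel()
flat = [[1.0] * 8 for _ in range(8)]  # constant 8 x 8 image
out = convolve(flat, kernel)          # 4 x 4 output, still ~1.0 everywhere
```

Because the kernel is normalized to sum 1, a constant image passes through unchanged, which is a quick sanity check on the implementation.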
  • after the picture detection model is established, the second device 12 provides the picture detection model to the third device 13; when a user uploads a picture to be detected, the third device 13 determines the picture display information of the picture to be detected according to the picture detection model.
  • specifically, the third device 13 inputs the picture to be detected into the picture detection model and obtains a probability vector over the display information corresponding to the picture to be detected; the display information corresponding to the maximum value of the probability vector, or to the probability values exceeding a preset threshold, is taken as the picture display information of the picture to be detected.
  • the picture display information may be only one piece of display information, or may include at least one of the plurality of display classification information.
  • for example, the display information that can be detected by the picture detection model includes three types of display classification information, specifically: a front display type, a side display type, and a detail display type.
  • if the probability values of the front display type and the detail display type both exceed the preset threshold, the third device 13 determines that the picture display information of the picture to be detected includes: the front display type and the detail display type.
  • if every probability value obtained by the third device 13 is less than the preset threshold, the picture to be detected is determined to be non-compliant.
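The selection rule above can be sketched as follows. The labels, probabilities, and threshold are hypothetical illustrations; the logic simply keeps every display type whose probability exceeds the threshold, and flags the picture as non-compliant when none does.

```python
# Thresholded selection of display information from a probability vector.

def select_display_info(probs, labels, threshold):
    """Return labels whose probability exceeds the threshold,
    or None to mark a non-compliant picture."""
    picked = [lab for lab, p in zip(labels, probs) if p > threshold]
    return picked if picked else None

labels = ["front display", "side display", "detail display"]
result = select_display_info([0.55, 0.10, 0.35], labels, threshold=0.30)
bad = select_display_info([0.05, 0.04, 0.02], labels, threshold=0.30)
```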
  • in a preferred embodiment, the third device 13 includes: a first unit 131 and a second unit 132 (as shown in FIG. 5).
  • the unit 131 is configured to determine, according to the picture related information of the picture to be detected, the corresponding picture detection submodel from the picture detection model.
  • the unit 132 is configured to determine, according to the picture detection submodel, the picture display information of the picture to be detected, where the picture display information includes at least one of the plurality of display classification information.
  • each of the picture detection sub-models corresponds to detecting a type of item picture.
  • the picture detection sub-model A corresponds to detecting a clothing type picture
  • the picture detection sub-model B corresponds to detecting a digital product type picture.
  • the unit 131 can also acquire the picture related information of the picture to be detected while acquiring the picture to be detected; for example, the unit 131 acquires, through communication protocols such as http and https, a form containing the picture to be detected and its picture related information.
  • the picture related information includes, but is not limited to: 1) display subject information of the to-be-detected picture.
  • the display subject information is used to indicate an item name, a category, and the like displayed in the to-be-detected picture.
  • the display subject information includes: a garment, a top.
  • 2) display position information of the picture to be detected; the display position information is used to indicate the placement position of the item displayed in the picture to be detected, and the like.
  • the display position information includes: a main body view of the furniture, a left side view of the furniture, a right side view of the furniture, a partial view of the furniture, and the like.
  • 3) application related information of the application to which the picture to be detected belongs, which is used to indicate the source from which the picture to be detected was uploaded; for example, the application related information includes: digital product information provided by an application client, clothing information uploaded via a WEB page, and so on.
  • the unit 131 can obtain the corresponding picture detection submodel according to the picture related information.
  • if the unit 131 cannot obtain a corresponding picture detection submodel according to the picture related information, the acquired picture to be detected is determined to be non-compliant.
  • after that, the unit 132 determines the picture display information of the picture to be detected according to the picture detection submodel; the manner in which the unit 132 does so is the same as or similar to the manner in which the third device 13 determines the picture display information according to the picture detection model, and will not be described in detail here.
  • FIG. 6 illustrates a method for determining picture display information in accordance with an aspect of the present application.
  • the method is mainly performed by a determining device.
  • the method comprises steps S1, S2 and S3; specifically, in step S1, the determining device acquires a plurality of training pictures that have been labeled with display information; in step S2, the determining device obtains a corresponding picture detection model by training a convolutional neural network based on the plurality of training pictures; in step S3, the determining device determines the picture display information of the picture to be detected according to the picture detection model.
  • the determining device may be implemented by a network host, a single network server, a plurality of network server sets, or a cloud composed of a plurality of servers.
  • the cloud is composed of a large number of hosts or network servers based on cloud computing, which is a kind of distributed computing: a super virtual computer composed of a group of loosely coupled computers.
  • the determining device includes an electronic device capable of automatically performing numerical calculation and information processing according to instructions set or stored in advance, the hardware of which includes, but is not limited to, a microprocessor, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), embedded devices, and the like.
  • the determining device acquires the training pictures and the corresponding display information, which meet the size, format, and other requirements for building the picture detection model, either remotely through an agreed communication method such as http or https, or by local reading.
  • the training picture may be the stored source picture, or may be a picture obtained by trimming the source picture or the like.
  • the determining device uniformly acquires each training picture according to the classification of the display information.
  • the display information includes any information that can describe the placement information of the item displayed by the training picture, and the details or overall effect of the item.
  • for example, the display information describing a training picture of clothing includes, but is not limited to: model upper body display information, model lower body display information, model whole body display information, upper body tile display information, lower body tile display information, whole body tile display information, detail display information, stacked display information, multi-picture display information, other display information, and the like; its correspondence with the training pictures is exemplified in FIG. 2.
  • the display information describing the training picture of the furniture includes, but is not limited to, front display information, three-dimensional display information, side display information, detail display information, other display information, and the like.
  • the display information corresponding to each of the training pictures may be directly obtained from a database corresponding to each training picture. It may also be determined according to the obtained display manner of each training picture, wherein the display manner includes a plurality of display classification information, and the marked display information includes at least one of the plurality of display classification information.
  • for example, the display manner for lower-body garments preset in the determining device includes three types of display classification information, specifically: model lower body display classification information, lower body tile display classification information, and lower body detail display classification information.
  • the training pictures acquired by the determining device include: a picture of a model showing pants and a picture of a model showing a skirt; the display manner obtained for both training pictures is the model lower body display classification information, so the display information corresponding to each picture is: model lower body display information.
  • in another example, a picture among the training pictures acquired by the determining device includes both an image of a model showing pants and an image of the pants' details; according to the acquired corresponding display manner, the determining device determines that the display information of the picture includes the model lower body display information and the lower body detail display classification information.
  • the determining device obtains a plurality of training pictures by trimming the source picture.
  • in a preferred embodiment, the step S1 includes: step S11 and step S12, as shown in FIG. 7.
  • in step S11, the determining device acquires a plurality of sample pictures that have been labeled with display information; in step S12, the determining device pre-processes each sample picture to obtain a corresponding training picture.
  • the determining device acquires a plurality of sample pictures remotely through an agreed communication method such as http or https, or by local reading or the like; since the acquired sample pictures differ in size, color, and the like, the determining device pre-processes each sample picture to obtain training pictures that meet the preset size and color requirements.
  • the manner in which the determining device performs preprocessing on each sample picture includes selecting, from the acquired sample pictures, a picture that meets a requirement of a preset size, color, and the like as the training picture.
  • the processing manner comprises: normalizing each sample picture to obtain a corresponding training picture.
  • the manner of the normalization processing includes, but is not limited to, at least one of the following: 1) converting a sample picture into a three primary color (RGB) representation; for example, if the acquired sample picture is in JPG format, the determining device converts the sample picture into RGB format; 2) scaling the sample picture so that one side has a fixed length.
  • For example, since the acquired sample pictures differ in size, the determining device converts each sample picture so that its short side has size a and its long side has size (a*a_i/b_i), where i is the sequence number of the sample picture, 0 < i < n, n is the number of sample pictures acquired, a_i is the short-side size of the i-th sample picture, and b_i is the long-side size of the i-th sample picture.
  • 3) Cropping the sample picture so that it is square. For example, the determining device crops a width of (a_i - a)/2 from each of the two short sides of each acquired sample picture and a width of (b_i - a)/2 from each of the two long sides, where i is the sequence number of each sample picture, 0 < i < n, n is the number of sample pictures acquired, a_i is the short-side size of the i-th sample picture, b_i is the long-side size of the i-th sample picture, and a is the side length (length and width) of the cropped sample picture.
  • It should be noted that the above normalization manners are only examples. In fact, the determining device may first convert the sample picture into a three-primary-color representation, and then scale the sample picture so that one side has a fixed length, and/or crop it.
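The scaling and cropping arithmetic above can be sketched in plain Python. This is an illustrative sketch that follows the formulas exactly as stated in the text; the function names and example sizes are hypothetical and not part of the application:

```python
def scaled_size(a, ai, bi):
    """Scale the i-th sample picture (short side ai, long side bi) so that
    its short side becomes the fixed length a; the long side becomes
    a * ai / bi, per the formula stated in the text."""
    return a, a * ai / bi

def crop_margins(a, ai, bi):
    """Widths trimmed from each short side ((ai - a)/2) and each long side
    ((bi - a)/2) so that the picture becomes an a*a square."""
    return (ai - a) / 2, (bi - a) / 2

# Example: a sample picture with short side 300 and long side 500,
# normalized toward a target side length a = 256.
short_cut, long_cut = crop_margins(256, 300, 500)
print(short_cut, long_cut)  # 22.0 122.0
```

Usage note: these helpers only compute sizes; applying them to actual pixels would require an image library, which the text does not specify.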
  • the number of the sample pictures may be the same as the number of training pictures, or may be less than the number of training pictures.
  • Preferably, the determining device uses a moving window to intercept a plurality of corresponding training pictures from each normalized sample picture.
  • For example, suppose the number of sample pictures acquired by the determining device is n, and the determining device normalizes the acquired sample pictures according to any one or more of the foregoing manners. Each cropped sample picture of size a*a is then swept exhaustively by a moving window of size a'*a', where the step of the movement is t. In this way, the number of training pictures intercepted from each sample picture is 1+(a-a')/t, and the total number of training pictures obtained by the determining device is n*(1+(a-a')/t).
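The window-count arithmetic above can be sketched as follows; this is a hypothetical illustration that assumes, as the formula 1+(a-a')/t implies, that (a-a') is divisible by the step t:

```python
def crops_per_picture(a, a_prime, t):
    """Number of training pictures cut from one a*a sample picture by an
    a'*a' window moving with step t, per the formula 1 + (a - a') / t."""
    return 1 + (a - a_prime) // t

def total_training_pictures(n, a, a_prime, t):
    """Total over n sample pictures: n * (1 + (a - a') / t)."""
    return n * crops_per_picture(a, a_prime, t)

# Example: 100 sample pictures of size 256*256, window 224*224, step 8.
print(crops_per_picture(256, 224, 8))             # 5
print(total_training_pictures(100, 256, 224, 8))  # 500
```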
  • More preferably, the determining device uses the moving window to intercept a plurality of corresponding training pictures from each normalized sample picture in such a way that the obtained training pictures retain the lower-half information of the original sample picture; for example, the determining device intercepts, from each sample picture, a plurality of training pictures that retain the lower half of the original sample picture by keeping the moving window aligned with the bottom of the sample picture during the movement.
  • In addition, the determining device may further obtain more training pictures by applying rotation processing, such as mirror flipping and in-plane rotation, to the intercepted training pictures. For example, after the determining device performs normalization according to any one or more of the above manners, and possibly intercepts a plurality of corresponding training pictures, it rotates the obtained training pictures to obtain still more training pictures, after which step S2 is performed.
  • Preferably, all training pictures acquired by the determining device should belong to the same type of item. For example, the acquired training pictures all display clothing items; or the acquired training pictures all display digital items, and the like.
  • In step S2, the determining device trains a convolutional neural network based on the plurality of training pictures to obtain a corresponding picture detection model.
  • Specifically, the determining device performs convolutional neural network training on each training picture acquired in step S1 to obtain a feature vector (i.e., a neuron) corresponding to each item of display information, and then classifies the obtained feature vectors according to each item of display information to obtain the picture detection model.
  • For example, the determining device performs convolutional neural network training on each training picture and associates the obtained feature vector with the display information of that training picture; the feature vectors corresponding to the same display information are subjected to normalized classification processing in all dimensions, and finally the classified feature vectors of each dimension correspond to the display information in the picture detection model.
  • Preferably, the convolutional neural network includes three convolution layers and two fully connected layers. Specifically, the determining device iterates the results obtained by each convolution layer in a gradient-descent manner, and then uses the two fully connected layers to establish connection relationships among the obtained feature vectors.
  • More preferably, the convolutional neural network may also set a dropout layer (as shown in FIG. 4) in one of the fully connected layers to improve the efficiency of model convergence; here, the role of the dropout layer is to put part of the parameters in the corresponding convolution layer or fully connected layer into hibernation, where the corresponding parameter values are retained but not updated until those parameters are no longer selected for hibernation.
  • More preferably, the convolutional neural network also includes a softmax layer. During the training phase, the training pictures and the corresponding display information are used together and trained through the multi-layer network, such as the dropout layer, the convolution layers, and the fully connected layers; the display information comes into play in the final softmax layer.
  • Here, the softmax layer includes a nonlinear classifier that uses the feature vectors output by the fully connected layer and the corresponding labels for classifier training.
  • The whole softmax process can be divided into three steps. The first step finds the maximum value over all dimensions of the feature vector X, denoted Max_i. The second step subtracts Max_i from each dimension and uses the exponential function exp to convert each dimension of the vector into the range (0, 1]. The third step, which completes the standard softmax computation, divides each dimension by the sum over all dimensions so that the result is a probability vector.
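A numerically stable softmax along these lines can be sketched in plain Python; subtracting the maximum corresponds to the Max_i step, and the final normalization is the standard completion of the process:

```python
import math

def softmax(x):
    """Convert a feature vector x into a probability vector.
    Step 1: find the maximum over all dimensions (Max_i).
    Step 2: subtract it and exponentiate, mapping each dimension into (0, 1].
    Step 3: divide by the sum so the dimensions sum to 1."""
    max_i = max(x)
    exps = [math.exp(v - max_i) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```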
  • Preferably, the determining device inputs the picture itself as a feature into the convolutional neural network for training: each acquired training picture is directly converted into a feature matrix [W, H, C], where W is the width dimension of the training picture, H is the height dimension of the training picture, and C is information such as the display classification information of the training picture.
  • K is generally 32 or 64.
  • Each iteration updates the parameters of each layer in the network, such as the weight values and bias values of the nodes in the network layer, until the values of these parameters converge to an optimal solution.
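The gradient-descent iteration described above can be illustrated on a single parameter; the learning rate, tolerance, and objective below are illustrative assumptions, not values from the application:

```python
def gradient_descent(grad, w0, lr=0.1, tol=1e-8, max_iter=10_000):
    """Repeatedly update a parameter w (e.g., a weight or bias value) in
    the direction opposite its gradient until the updates converge."""
    w = w0
    for _ in range(max_iter):
        step = lr * grad(w)
        w -= step
        if abs(step) < tol:
            break
    return w

# Example: minimize (w - 3)^2, whose gradient is 2*(w - 3).
w_opt = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(round(w_opt, 6))  # 3.0
```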
  • Preferably, the determining device may downsample the results of the three convolution layers' processing (as shown by the max-pooling layer in FIG. 4). Then, the determining device uses the fully connected layer to establish connection relationships among all the feature vectors (i.e., neurons) output through the downsampling, thereby implementing abstract expression.
  • Preferably, the determining device sets a ReLU (rectified linear unit, an activation function) layer and a normalization layer after each convolution layer. The ReLU layer exploits the non-saturating nonlinear characteristics of each neuron in the neural network to improve the overall training efficiency of the model.
  • the normalization layer performs normalization processing based on a local window of each pixel point, that is, a local normalization operation, which can enhance the overall generalization performance of the model.
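As an illustrative sketch of the two layers just described (the patent gives no exact normalization formula, so the constants k, alpha, beta, and radius below are assumptions in the spirit of a local-window normalization):

```python
def relu(v):
    """Non-saturating activation: max(0, v)."""
    return max(0.0, v)

def local_normalize(values, k=2.0, alpha=1e-4, beta=0.75, radius=2):
    """Normalize each value by a term computed over a local 1-D window of
    neighbors, mirroring the per-pixel local-window normalization above.
    All constants are illustrative assumptions."""
    out = []
    for i, v in enumerate(values):
        lo, hi = max(0, i - radius), min(len(values), i + radius + 1)
        scale = (k + alpha * sum(x * x for x in values[lo:hi])) ** beta
        out.append(v / scale)
    return out

activated = [relu(v) for v in [-1.0, 0.5, 2.0]]
print(activated)  # [0.0, 0.5, 2.0]
```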
  • Preferably, the convolution layers comprise Gaussian convolution layers. A Gaussian convolution layer is configured to perform a convolution operation between the output of the previous layer and a plurality of Gaussian filter kernels, where the Gaussian filter kernels are learned based on the multiple training pictures. That is, the determining device uses a Gaussian convolution layer to convolve the output of the previous layer with a plurality of preset Gaussian filter kernels, where the parameters of the Gaussian kernels are obtained by learning.
  • For example, the determining device sets the Gaussian kernel size of the three Gaussian convolution layers to 5*5, and in each Gaussian convolution layer the convolution kernel traverses all pixel points of the picture. For example, the determining device learns 64 Gaussian convolution kernels for the first convolution layer, 32 Gaussian convolution kernels for the second convolution layer, and 16 Gaussian convolution kernels for the third convolution layer.
  • It should be noted that the numbers of Gaussian convolution kernels of the above convolution layers are only examples; in fact, the number of Gaussian convolution kernels of each convolution layer may be determined by actual needs.
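To make the layer sizes concrete, the following sketch walks an input picture through three 5*5 convolution layers with 64, 32, and 16 kernels; the 224*224 input size, stride 1, valid padding, and 2*2 max pooling after each layer are illustrative assumptions, not values specified by the text:

```python
def conv_out(size, kernel=5, stride=1):
    """Spatial size after a valid (no-padding) convolution."""
    return (size - kernel) // stride + 1

def pool_out(size, window=2):
    """Spatial size after non-overlapping max pooling."""
    return size // window

size, shapes = 224, []
for channels in (64, 32, 16):        # kernels per convolution layer
    size = pool_out(conv_out(size))  # 5*5 conv, then 2*2 max pooling
    shapes.append((size, size, channels))
print(shapes)  # [(110, 110, 64), (53, 53, 32), (24, 24, 16)]
```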
  • After the picture detection model is established, the determining device saves the picture detection model.
  • the determining device performs step S3, that is, determines picture display information of the to-be-detected picture according to the picture detection model.
  • Specifically, the determining device inputs the to-be-detected picture into the picture detection model and obtains a probability vector over the items of display information corresponding to the to-be-detected picture; the display information corresponding to the largest probability value, or to a probability value exceeding a preset threshold, is the picture display information of the picture to be detected.
  • The picture display information may be a single item of display information, or may include at least one of the plurality of display classification information.
  • For example, the display information that can be detected by the picture detection model includes three items of display classification information, specifically: a front display type, a side display type, and a detail display type. When the two probability values that exceed the preset threshold among the probability vector values obtained by the determining device correspond to the front display type and the detail display type respectively, the determining device determines that the picture display information of the to-be-detected picture includes the front display type and the detail display type. When every probability value obtained by the determining device is less than the preset threshold, it is determined that the picture to be detected is not in compliance.
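The decision rule described above can be sketched as follows; the label names and threshold value are illustrative, not specified by the application:

```python
def picture_display_info(probabilities, labels, threshold=0.5):
    """Return every display classification whose probability exceeds the
    threshold; an empty result means the picture is not in compliance."""
    return [lab for p, lab in zip(probabilities, labels) if p > threshold]

labels = ["front display", "side display", "detail display"]
print(picture_display_info([0.81, 0.10, 0.62], labels))
# ['front display', 'detail display']
print(picture_display_info([0.2, 0.1, 0.3], labels))  # [] -> not in compliance
```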
  • Preferably, step S3 comprises steps S31 and S32, as shown in Figure 8.
  • In step S31, the determining device determines, according to the picture related information of the picture to be detected, the corresponding picture detection sub-model from the picture detection model. In step S32, the determining device determines the picture display information of the to-be-detected picture according to the picture detection sub-model, where the picture display information includes at least one of the plurality of display classification information.
  • Here, each picture detection sub-model corresponds to detecting one type of item picture. For example, picture detection sub-model A corresponds to detecting clothing-type pictures, and picture detection sub-model B corresponds to detecting digital-product-type pictures.
  • the determining device can acquire the picture related information of the to-be-detected picture while acquiring the picture to be detected.
  • For example, the determining device acquires, through a communication protocol such as HTTP or HTTPS, a form that includes the picture to be detected and the picture related information.
  • the picture related information includes, but is not limited to: 1) display subject information of the to-be-detected picture.
  • the display subject information is used to indicate an item name, a category, and the like displayed in the to-be-detected picture.
  • the display subject information includes: a garment, a top. 2) Display position information of the picture to be detected.
  • the display position information is used to indicate a placement position of the item displayed in the picture to be detected, and the like.
  • the display position information includes: a main body view of the furniture, a left side view of the furniture, a right side view of the furniture, a partial view of the furniture, and the like.
  • 3) Application related information of the application to which the picture to be detected belongs, which is used to indicate the upload source information of the to-be-detected picture.
  • the application related information includes: digital information provided by the application client, clothing category upload information in the WEB page, and the like.
  • Thus, the determining device can obtain the corresponding picture detection sub-model according to the picture related information. When the determining device cannot obtain a corresponding picture detection sub-model according to the picture related information, it is determined that the acquired picture to be detected is not in compliance.
  • the determining device determines the picture display information of the to-be-detected picture according to the picture detection sub-model.
  • In step S32, the manner in which the picture display information of the to-be-detected picture is determined according to the picture detection sub-model is the same as or similar to the manner, described in step S3 above, in which the picture display information of the to-be-detected picture is determined according to the picture detection model, and will not be described in detail here.
  • In summary, the method and device for determining picture display information of the present application model the display of different display modes of similar items and determine the picture display information of the picture to be detected through the built model, thereby realizing efficient and accurate recognition of the display mode of a picture. This supports further improvement of the displayed picture or of its display mode, which in turn improves the efficiency with which users obtain information, improves the utilization rate of the user terminal's screen, and improves the user experience.
  • Moreover, the present application normalizes the acquired sample pictures, which facilitates unified processing of the training pictures during modeling, makes it possible to obtain enough training pictures from fewer sample pictures, and improves modeling efficiency. In addition, using three convolution layers and two fully connected layers for neural network training can effectively improve the accuracy of the picture detection model, so that the recognition accuracy when identifying the picture to be detected exceeds 90%. Therefore, the present application effectively overcomes various shortcomings in the prior art and has high industrial utilization value.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Image Analysis (AREA)

Abstract

A method and device for determining image display information, the method comprising: first acquiring, by a device, a plurality of training images having labelled display information (S1); then acquiring a corresponding image detection model by training a convolutional neural network based on the plurality of training images (S2); and, when an image to be detected is obtained, determining, by the device, image display information of the image to be detected according to the image detection model (S3). The method realizes efficient and accurate recognition of the display manner of images, and in turn supports further improvement of the displayed images or of the manner in which they are displayed, thus improving information acquisition efficiency for the user and the screen resource utilization rate of the user terminal, and improving the user experience.

Description

一种用于确定图片陈列信息的方法及设备Method and device for determining picture display information 技术领域Technical field
本申请涉及计算机领域,尤其涉及一种用于确定图片陈列信息的方法及设备。The present application relates to the field of computers, and in particular, to a method and apparatus for determining picture display information.
背景技术Background technique
随着互联网技术的发展,图片因其相对文字具有表达直观、内容丰富等优势,在越来越多的网页及应用中被广泛应用。例如,网购平台为各电商提供了各种商品信息发布机制,商家可以上传多角度、多背景的商品照片,以吸引用户。With the development of Internet technology, pictures are widely used in more and more web pages and applications because of their intuitive expression and rich content. For example, the online shopping platform provides various commodity information release mechanisms for various e-commerce providers, and merchants can upload photos of products with multiple angles and multiple backgrounds to attract users.
然而,在实际应用中,糟糕的图片陈列方式不仅阻碍了用户获取所需信息,也浪费了用户宝贵的带宽资源、降低了用户的屏幕利用率。显然,鉴于互联网的开放性本质,这样的情况将会持续存在;而且,由于互联网信息的爆发性,试图通过人工来审核这些图片的陈列方式也是不可行的。However, in practical applications, the poor image display method not only hinders the user from obtaining the required information, but also wastes the user's valuable bandwidth resources and reduces the user's screen utilization. Obviously, given the open nature of the Internet, such a situation will continue to exist; and, due to the explosive nature of Internet information, it is not feasible to attempt to manually review the display of these images.
发明内容Summary of the invention
本申请的目的是提供一种用于确定图片陈列信息的方法及设备。It is an object of the present application to provide a method and apparatus for determining picture display information.
根据本申请的一个方面,提供了一种用于确定图片陈列信息的方法,其中,该方法包括:According to an aspect of the present application, a method for determining picture display information is provided, wherein the method includes:
获取已标注陈列信息的多个训练图片;Obtaining multiple training pictures with labeled display information;
基于所述多个训练图片经卷积神经网络训练得对应的图片检测模型;Performing a corresponding picture detection model based on the plurality of training pictures via a convolutional neural network;
根据所述图片检测模型确定待检测图片的图片陈列信息。Determining picture display information of the picture to be detected according to the picture detection model.
根据本申请的另一方面,还提供了一种用于确定图片陈列信息的设备,其中,该设备包括:According to another aspect of the present application, there is also provided an apparatus for determining picture display information, wherein the apparatus comprises:
第一装置,用于获取已标注陈列信息的多个训练图片;a first device, configured to acquire a plurality of training pictures that have been marked with display information;
第二装置,用于基于所述多个训练图片经卷积神经网络训练得对应的图片检测模型;a second device, configured to perform a corresponding picture detection model by using a convolutional neural network based on the plurality of training pictures;
第三装置,用于根据所述图片检测模型确定待检测图片的图片陈列信 息。a third device, configured to determine, according to the picture detection model, a picture display letter of the picture to be detected interest.
与现有技术相比,本申请通过对图片的不同陈列方式的展示进行建模,并通过所建模型来确定待检测图片的图片陈列信息,实现高效、准确地识别图片的陈列方式,从而支持进一步改进所陈列图片或该图片的陈列方式,进而提高用户获取信息效率、提供用户终端屏幕资源利用率并改善用户的使用体验。Compared with the prior art, the present application models the display of different display modes of the picture, and determines the picture display information of the picture to be detected through the built model, thereby realizing efficient and accurate recognition of the picture display mode, thereby supporting The picture displayed or the display mode of the picture is further improved, thereby improving the efficiency of the user to obtain information, providing the utilization rate of the screen resources of the user terminal, and improving the user experience.
附图说明DRAWINGS
通过阅读参照以下附图所作的对非限制性实施例所作的详细描述,本申请的其它特征、目的和优点将会变得更明显:Other features, objects, and advantages of the present application will become more apparent from the detailed description of the accompanying drawings.
图1示出根据本申请一个方面的一种用于确定图片陈列信息的设备示意图;1 shows a schematic diagram of an apparatus for determining picture display information in accordance with an aspect of the present application;
图2示出根据本申请一个方面的一种用于确定图片陈列信息的设备中所获取的训练图片与陈列信息的对应关系示意图;2 is a schematic diagram showing a correspondence relationship between training pictures and display information acquired in an apparatus for determining picture display information according to an aspect of the present application;
图3示出根据本申请一个优选实施例的一种用于确定图片陈列信息的设备中第一装置的示意图;3 shows a schematic diagram of a first device in an apparatus for determining picture display information in accordance with a preferred embodiment of the present application;
图4示出根据本申请一个优选实施例的一种用于确定图片陈列信息的设备中第二装置所执行的流程图;4 shows a flow chart executed by a second device in an apparatus for determining picture display information in accordance with a preferred embodiment of the present application;
图5示出根据本申请一个优选实施例的一种用于确定图片陈列信息的设备中第三装置的示意图;FIG. 5 is a schematic diagram showing a third device in an apparatus for determining picture display information according to a preferred embodiment of the present application; FIG.
图6示出根据本申请另一个方面的一种用于确定图片陈列信息的方法流程图;6 shows a flow chart of a method for determining picture display information according to another aspect of the present application;
图7示出根据本申请一个优选实施例的一种用于确定图片陈列信息的方法中步骤S1的流程图;Figure 7 shows a flow chart of step S1 in a method for determining picture display information in accordance with a preferred embodiment of the present application;
图8示出根据本申请另一个优选实施例的一种用于确定图片陈列信息的方法中步骤S3的流程图。FIG. 8 shows a flow chart of step S3 in a method for determining picture display information in accordance with another preferred embodiment of the present application.
附图中相同或相似的附图标记代表相同或相似的部件。The same or similar reference numerals in the drawings denote the same or similar components.
具体实施方式 detailed description
下面结合附图对本申请作进一步详细描述。The present application is further described in detail below with reference to the accompanying drawings.
在本申请一个典型的配置中,终端、服务网络的设备和可信方均包括一个或多个处理器(CPU)、输入/输出接口、网络接口和内存。In a typical configuration of the present application, the terminal, the device of the service network, and the trusted party each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
内存可能包括计算机可读介质中的非永久性存储器,随机存取存储器(RAM)和/或非易失性内存等形式,如只读存储器(ROM)或闪存(flash RAM)。内存是计算机可读介质的示例。The memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer readable medium, such as read only memory (ROM) or flash memory. Memory is an example of a computer readable medium.
计算机可读介质包括永久性和非永久性、可移动和非可移动媒体可以由任何方法或技术来实现信息存储。信息可以是计算机可读指令、数据结构、程序的模块或其他数据。计算机的存储介质的例子包括,但不限于相变内存(PRAM)、静态随机存取存储器(SRAM)、动态随机存取存储器(DRAM)、其他类型的随机存取存储器(RAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、快闪记忆体或其他内存技术、只读光盘只读存储器(CD-ROM)、数字多功能光盘(DVD)或其他光学存储、磁盒式磁带,磁带磁盘存储或其他磁性存储设备或任何其他非传输介质,可用于存储可以被计算设备访问的信息。按照本文中的界定,计算机可读介质不包括非暂存电脑可读媒体(transitory media),如调制的数据信号和载波。Computer readable media includes both permanent and non-persistent, removable and non-removable media. Information storage can be implemented by any method or technology. The information can be computer readable instructions, data structures, modules of programs, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory. (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disk read only memory (CD-ROM), digital versatile disk (DVD) or other optical storage, A magnetic tape cartridge, magnetic tape storage or other magnetic storage device or any other non-transportable medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media does not include non-transitory computer readable media, such as modulated data signals and carrier waves.
图1示出根据本申请一个方面的一种用于确定图片陈列信息的设备1。其中,所述设备1包括:第一装置11、第二装置12、第三装置13。具体地,所述第一装置11用于获取已标注陈列信息的多个训练图片;所述第二装置12用于基于所述多个训练图片经卷积神经网络训练得到对应的图片检测模型;所述第三装置13用于根据所述图片检测模型确定待检测图片的图片陈列信息。1 shows an apparatus 1 for determining picture display information in accordance with an aspect of the present application. The device 1 includes a first device 11, a second device 12, and a third device 13. Specifically, the first device 11 is configured to acquire a plurality of training pictures that have been labeled with display information; and the second device 12 is configured to obtain a corresponding picture detection model by using a convolutional neural network based on the plurality of training pictures; The third device 13 is configured to determine picture display information of the to-be-detected picture according to the picture detection model.
在此,所述设备1可由网络主机、单个网络服务器、多个网络服务器集或多个服务器构成的云等实现。在此,云由基于云计算(Cloud Computing)的大量主机或网络服务器构成,其中,云计算是分布式计算的一种,由一群松散耦合的计算机集组成的一个超级虚拟计算机。本领域技术人员应能理解上述网络设备仅为举例,其他现有的或今后可能出现的网络设备如可适用于本申请,也应包含在本申请保护范围以内,并 在此以引用方式包含于此。在此,所述设备1包括一种能够按照事先设定或存储的指令,自动进行数值计算和信息处理的电子设备,其硬件包括但不限于微处理器、专用集成电路(ASIC)、可编程门阵列(FPGA)、数字处理器(DSP)、嵌入式设备等。Here, the device 1 may be implemented by a network host, a single network server, a plurality of network server sets, or a cloud composed of a plurality of servers. Here, the cloud is composed of a large number of host or network servers based on Cloud Computing, which is a kind of distributed computing, a super virtual computer composed of a group of loosely coupled computers. A person skilled in the art should understand that the foregoing network device is only an example, and other existing or future network devices may be applicable to the present application, and should also be included in the protection scope of the present application. It is hereby incorporated by reference. Here, the device 1 includes an electronic device capable of automatically performing numerical calculation and information processing according to an instruction set or stored in advance, the hardware of which includes but is not limited to a microprocessor, an application specific integrated circuit (ASIC), and a programmable Gate array (FPGA), digital processor (DSP), embedded devices, etc.
具体地,所述第一装置11按照所述第二装置12构建图片检测模型所要求的尺寸、格式等,通过http、https等约定通信方式远程调用、或通过本地读取等方式获取训练图片及所对应的陈列信息。其中,所述训练图片可以是所存储的源图片,也可以是对源图片进行修剪之后所得到的图片等。所述第一装置11按照陈列信息的分类均匀地获取各训练图片。其中,所述陈列信息包括任何能够描述所述训练图片所展示的物品的摆放信息、以及所述物品的细节或整体效果的信息等。例如,描述服装的训练图片的陈列信息包括但不限于:模特上身陈列信息、模特下身陈列信息、模特全身陈列信息、上身平铺陈列信息、下身平铺陈列信息、全身平铺陈列信息、细节陈列信息、堆叠陈列信息、多图陈列信息、其他陈列信息等。其与训练图片的对应关系如图2举例。又如,描述家具的训练图片的陈列信息包括但不限于:正面陈列信息、立体陈列信息、侧面陈列信息、细节陈列信息、其他陈列信息等。Specifically, the first device 11 constructs a size, format, and the like required by the second device 12 according to the size, format, and the like of the image detection model, and obtains the training image by using a predetermined communication method such as http, https, or by local reading. Corresponding display information. The training picture may be the stored source picture, or may be a picture obtained after the source picture is trimmed. The first device 11 uniformly acquires each training picture in accordance with the classification of the display information. Wherein, the display information includes any information that can describe the placement information of the item displayed by the training picture, and the details or overall effect of the item. For example, the display information describing the training picture of the clothing includes but is not limited to: the upper body display information, the model lower body display information, the model body display information, the upper body tile display information, the lower body tile display information, the whole body tile display information, the detail display Information, stacked display information, multi-picture display information, other display information, etc. Its correspondence with the training picture is exemplified in FIG. 2. 
For another example, the display information describing the training picture of the furniture includes, but is not limited to, front display information, three-dimensional display information, side display information, detail display information, other display information, and the like.
在此,各所述训练图片所对应的陈列信息可以由对应各训练图片的数据库中直接获取。也可以根据所获取的各训练图片的陈列方式来确定,其中,所述陈列方式包括多种陈列分类信息,所标注的陈列信息包括所述多种陈列分类信息中至少一个。Here, the display information corresponding to each of the training pictures may be directly obtained from a database corresponding to each training picture. It may also be determined according to the obtained display manner of each training picture, wherein the display manner includes a plurality of display classification information, and the marked display information includes at least one of the plurality of display classification information.
例如,所述第一装置11中预设了针对下身服装的陈列方式包括三种陈列分类信息,具体为:模特下身陈列分类信息、下身平铺陈列分类信息和下身细节陈列分类信息。所述第一装置11所获取的训练图片包括:模特展示裤子的图片、模特展示半身裙的图片,同时一并获取了该两幅训练图片的陈列方式均为模特下身陈列分类信息,则对应各图片的陈列信息均为:模特下身陈列信息。For example, the display manner of the first device 11 for the lower body garment includes three types of display classification information, specifically: the model lower body display classification information, the lower body tile display classification information, and the lower body detail display classification information. The training picture acquired by the first device 11 includes: a model showing a picture of the pants, a model showing a picture of the skirt, and simultaneously obtaining the display manners of the two training pictures, the display information of the model lower body, corresponding to each The display information of the pictures are: the model shows the information under the body.
又如,所述第一装置11所获取的训练图片中的一幅图片中既包含模特展示裤子的图像还包含裤子细节的图像,则根据所获取的对应的陈列方 式,所述第一装置11确定所述图片的陈列信息包含模特下身陈列信息和下身细节陈列分类信息。For another example, a picture in the training picture acquired by the first device 11 includes an image of the model showing the pants and an image of the pants details, according to the acquired corresponding display side. The first device 11 determines that the display information of the picture includes the model lower body display information and the lower body detail display classification information.
优选地,所述第一装置11通过对源图片进行修剪得到多个训练图片。具体地,所述第一装置11包括:第一一单元111、第一二单元112(如图3所示)。Preferably, the first device 11 obtains a plurality of training pictures by trimming the source picture. Specifically, the first device 11 includes: a first unit 111 and a first two unit 112 (shown in FIG. 3).
具体地,所述第一一单元111用于获取已标注陈列信息的多个样本图片;所述第一二单元112用于对每个样本图片进行预处理以获得对应的训练图片。Specifically, the first unit 111 is configured to acquire a plurality of sample pictures that have been labeled with display information; and the first two units 112 are configured to perform pre-processing on each sample picture to obtain a corresponding training picture.
在此,所述第一一单元111通过http、https等约定通信方式远程调用、或通过本地读取等方式获取多个样本图片。由于所获取的样本图片的尺寸、色彩等各不相同,则所述第一二单元112对每个样本图片进行预处理,以得到符合预设尺寸、色彩要求的各训练图片。Here, the first unit 111 acquires a plurality of sample pictures remotely by means of an agreed communication method such as http, https, or the like by local reading or the like. Since the size, color, and the like of the acquired sample pictures are different, the first two units 112 preprocess each sample picture to obtain each training picture that meets the preset size and color requirements.
在此，所述第一二单元112对每个样本图片进行预处理的方式包括从所获取的样本图片中选取符合预设尺寸、色彩等要求的图片作为所述训练图片。优选地，所述预处理方式包括：对每个样本图片进行归一化处理以获得对应的训练图片。具体地，所述归一化处理的方式包括但不限于以下至少任一项：1)将样本图片转换为三原色表示。例如，所获取的样本图片为JPG格式，则所述第一二单元112将该样本图片转换为RGB格式。2)对样本图片按比例缩放使其一边为定长。例如，所获取的各样本图片的尺寸各不相同，则所述第一二单元112将该样本图片转换成短边尺寸为a、长边尺寸为(a*bi/ai)。其中，i为样本图片的序号，0<i<n，n为所获取的样本图片的数量，ai为第i个样本图片的短边尺寸，bi为第i个样本图片的长边尺寸。3)裁剪样本图片使其为正方形。例如，所述第一二单元112将所获取的各样本图片的两短边分别裁剪(ai-a)/2的宽度，两长边分别裁剪(bi-a)/2的宽度。其中，i为各样本图片的序号，0<i<n，n为所获取的样本图片的数量，ai为第i个样本图片的短边尺寸，bi为第i个样本图片的长边尺寸，a为裁剪后的样本图片的长和宽尺寸。Here, the preprocessing performed by the first two units 112 on each sample picture includes selecting, from the acquired sample pictures, pictures that meet requirements on preset size, color, and the like as the training pictures. Preferably, the preprocessing includes: normalizing each sample picture to obtain a corresponding training picture. Specifically, the normalization includes, but is not limited to, at least any one of the following: 1) Converting the sample picture into a three-primary-color representation. For example, if an acquired sample picture is in JPG format, the first two units 112 convert it into RGB format. 2) Scaling the sample picture proportionally so that one side has a fixed length. For example, if the acquired sample pictures differ in size, the first two units 112 convert a sample picture so that its short side has size a and its long side has size (a*bi/ai), where i is the sequence number of the sample picture, 0<i<n, n is the number of acquired sample pictures, ai is the short-side size of the i-th sample picture, and bi is its long-side size. 3) Cropping the sample picture into a square. For example, the first two units 112 crop a width of (ai-a)/2 from each of the two short sides of each acquired sample picture, and a width of (bi-a)/2 from each of the two long sides, where i is the sequence number of each sample picture, 0<i<n, n is the number of acquired sample pictures, ai is the short-side size of the i-th sample picture, bi is its long-side size, and a is the side length (both length and width) of the cropped sample picture.
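The normalization steps above (three-primary-color conversion, fixed-length scaling of the short side, and square cropping) can be sketched as follows. This is only an illustrative reading, not the patented implementation: the function name, the nearest-neighbour resize, and the centre-crop choice are assumptions, and NumPy arrays stand in for picture files.

```python
import numpy as np

def normalize_sample(img: np.ndarray, a: int = 64) -> np.ndarray:
    """Normalize one sample picture: img is an H x W (grayscale) or
    H x W x C uint8 array; returns an a x a x 3 array."""
    if img.ndim == 2:
        img = img[:, :, None]
    if img.shape[2] == 1:                        # step 1: three-primary-color (RGB) representation
        img = np.repeat(img, 3, axis=2)
    h, w = img.shape[:2]
    s = a / min(h, w)                            # step 2: scale so the short side equals a
    nh, nw = max(a, round(h * s)), max(a, round(w * s))
    ys = (np.arange(nh) * h / nh).astype(int)    # nearest-neighbour resize indices
    xs = (np.arange(nw) * w / nw).astype(int)
    img = img[ys][:, xs]
    top, left = (nh - a) // 2, (nw - a) // 2     # step 3: centre-crop to a square a x a
    return img[top:top + a, left:left + a]
```

A grayscale 100x200 input, for instance, comes out as a 64x64x3 square.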
需要说明的是，本领域技术人员应该理解，上述归一化处理的方式仅为举例。事实上，所述第一二单元112还可以先将样本图片转换为三原色表示，再对样本图片按比例缩放使其一边为定长、和/或进行裁剪。It should be noted that those skilled in the art should understand that the above normalization manners are merely examples. In fact, the first two units 112 may also first convert the sample picture into a three-primary-color representation, and then scale it proportionally so that one side has a fixed length, and/or crop it.
其中,所述样本图片的数量可以与训练图片的数量相同,也可以少于训练图片的数量。The number of the sample pictures may be the same as the number of training pictures, or may be less than the number of training pictures.
优选地，所述第一二单元112利用移动窗从经所述归一化处理的每个样本图片中截取多个对应的训练图片。例如，所述第一一单元111所获取的样本图片的数量为n，所述第一二单元112先将所获取的样本图片按照上述任一种或多种方式进行归一化处理。接着，以a'*a'的移动窗对每幅裁剪后的尺寸为a*a的样本图片进行地毯式的移动，其中，移动的步进为t。如此，每幅样本图片被截取出的训练图片的数量为1+(a-a')/t，则所述第一二单元112共得到的训练图片的数量为n*(1+(a-a')/t)。更优选地，所述第一二单元112利用移动窗从经所述归一化处理的每个样本图片中截取多个对应的训练图片，以使所得到的训练图片保留原样本图片的下半部信息；例如，第一二单元112通过使移动窗在移动过程中保持与样本图片底部对齐，从该样本图片中截取多个保留原样本图片的下半部信息的训练图片。Preferably, the first two units 112 use a moving window to cut a plurality of corresponding training pictures out of each normalized sample picture. For example, if the number of sample pictures acquired by the first unit 111 is n, the first two units 112 first normalize the acquired sample pictures in any one or more of the above manners, and then sweep a moving window of size a'*a' exhaustively over each cropped sample picture of size a*a with a moving step of t. In this way, the number of training pictures cut out of each sample picture is 1+(a-a')/t, so the first two units 112 obtain n*(1+(a-a')/t) training pictures in total. More preferably, the first two units 112 cut the training pictures out of each normalized sample picture such that the obtained training pictures retain the lower-half information of the original sample picture; for example, by keeping the moving window aligned with the bottom of the sample picture while it moves, the first two units 112 cut out of that sample picture a plurality of training pictures that retain the lower-half information of the original.
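A minimal sketch of the bottom-aligned moving-window cropping described above, assuming square a*a inputs as NumPy arrays. Sliding only horizontally with step t while staying flush with the bottom edge is one plausible reading of the text; the names are illustrative.

```python
import numpy as np

def window_crops(img: np.ndarray, ap: int, t: int) -> list:
    """Cut a'*a' (here: ap*ap) training crops out of an a*a normalized sample
    with horizontal step t, keeping the window aligned with the bottom edge so
    every crop retains the lower-half information of the original picture."""
    a = img.shape[0]
    # one crop per horizontal window position: 1 + (a - a')/t crops in total
    return [img[a - ap:a, x:x + ap] for x in range(0, a - ap + 1, t)]
```

For a = 8, a' = 4, t = 2 this yields 1 + (8-4)/2 = 3 crops, matching the count in the text.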
更为优选地，所述第一二单元112还可以通过对所截取的训练图片进行如镜像翻转、平面旋转等的旋转处理，得到更多的训练图片。例如，所述第一二单元112按照上述任一种或多种方式进行归一化处理、甚至截取了多个对应的训练图片之后，将所得到的训练图片进行旋转处理，如此得到更多的训练图片，并将其输送至所述第二装置12。More preferably, the first two units 112 can also obtain more training pictures by applying rotation processing, such as mirror flipping or in-plane rotation, to the cut-out training pictures. For example, after performing normalization in any one or more of the above manners, and even cutting out a plurality of corresponding training pictures, the first two units 112 rotate the obtained training pictures, thereby obtaining more training pictures, which are delivered to the second device 12.
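The rotation-based augmentation mentioned above could look like the following sketch; the exact set of transforms is an assumption, since the text names only mirror flipping and plane rotation as examples.

```python
import numpy as np

def augment(img: np.ndarray) -> list:
    """Expand one training crop into several by mirror flip and in-plane rotation."""
    return [
        img,               # original crop
        np.fliplr(img),    # mirror (left-right) flip
        np.rot90(img, 1),  # rotated 90 degrees in the image plane
        np.rot90(img, 2),  # 180 degrees
        np.rot90(img, 3),  # 270 degrees
    ]
```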
需要说明的是,所述第一装置11所获取的各训练图片中所展示的物品应属于同一类物品。例如,所获取的各训练图片中所展示的均为服装类物品;或者,所获取的各训练图片中所展示的均为数码类物品等。It should be noted that the items displayed in the training pictures acquired by the first device 11 should belong to the same type of items. For example, all the acquired training pictures are displayed as clothing items; or, the obtained training pictures are all digital items and the like.
所述第二装置12基于所述多个训练图片经卷积神经网络训练得到对应的图片检测模型。The second device 12 trains the convolutional neural network based on the plurality of training pictures to obtain a corresponding picture detection model.
具体地，所述第二装置12将所述第一装置11所获取的各训练图片进行卷积神经网络训练，得到对应各陈列信息的特征向量(即神经元)，再按照各陈列信息对所得到的各特征向量进行分类处理，得到图片检测模型。Specifically, the second device 12 performs convolutional neural network training on the training pictures acquired by the first device 11 to obtain feature vectors (i.e., neurons) corresponding to the pieces of display information, and then classifies the obtained feature vectors according to the display information to obtain the picture detection model.
例如，所述第二装置12将每个训练图片进行卷积神经网络训练，并将得到的特征向量与所属训练图片的陈列信息相对应，当所有训练图片完成卷积神经网络训练后，将对应同一陈列信息的各固定特征向量在所有维度上进行归一化的分类处理，最终得到分类后的每个维度的特征向量对应一个陈列信息的图片检测模型。For example, the second device 12 performs convolutional neural network training on each training picture and associates the resulting feature vector with the display information of that training picture. After all training pictures have completed the convolutional neural network training, the fixed feature vectors corresponding to the same display information are subjected to normalized classification over all dimensions, finally yielding a picture detection model in which, after classification, each dimension's feature vector corresponds to one piece of display information.
在此，所述卷积神经网络(convolutional neural network)包括三层卷积层及两层全连通层。具体地，所述第二装置12采用梯度下降的方式对每层卷积层所得到的结果进行迭代。再利用两层全连通层将所得到的各特征向量建立连接关系。其中，所述卷积神经网络还可以优选地在其中一层全连通层设置dropout(休眠)层(如图4所示)，用以提升模型收敛的效率；在此，dropout层的作用是将其对应的卷积层或全连通层中的部分参数休眠，但是其对应的参数值会保留但是不更新，直到下一次不被选中进行休眠才会更新。该卷积神经网络还包括softmax层；训练阶段，训练图片和对应的陈列信息会一起被利用，整个问题会经过多层网络进行训练，比如dropout层、卷积层、全连通层等等；其中陈列信息是在最后一层的softmax层发挥作用。softmax层中包含的是一个非线性分类器，其利用全连通层输出的特征向量与对应的标签进行分类器训练。整个softmax的过程可以分为三步：第一步是对固定特征向量X所有维的值求最大值，记为Max_i；第二步使用指数函数exp将向量中的每一维都转化到0~1之间的数，即向量X中的每一维x[i]=exp(x[i]-Max_i)；第三步对所有的值求和，然后相应地做归一化，即x[i]=x[i]/sum(x[i])。Here, the convolutional neural network includes three convolution layers and two fully connected layers. Specifically, the second device 12 iterates the results of each convolution layer by gradient descent, and then uses the two fully connected layers to establish connections between the obtained feature vectors. Preferably, the convolutional neural network may further include a dropout layer in one of the fully connected layers (as shown in FIG. 4) to improve the efficiency of model convergence. Here, the dropout layer puts some parameters of its corresponding convolution layer or fully connected layer to sleep; the values of those parameters are retained but not updated until the next round in which they are not selected for sleeping. The convolutional neural network further includes a softmax layer. In the training phase, the training pictures and the corresponding display information are used together, and the whole problem is trained through the multi-layer network, e.g., the dropout layer, the convolution layers, and the fully connected layers; the display information comes into play at the final softmax layer. The softmax layer contains a nonlinear classifier, which is trained using the feature vectors output by the fully connected layer together with the corresponding labels. The whole softmax process can be divided into three steps: the first step takes the maximum over all dimensions of the fixed feature vector X, denoted Max_i; the second step uses the exponential function exp to map each dimension of the vector to a number between 0 and 1, i.e., x[i]=exp(x[i]-Max_i) for each dimension of X; the third step sums all the values and normalizes accordingly, i.e., x[i]=x[i]/sum(x[i]).
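The three softmax steps spelled out above translate directly into code; this is the standard numerically stable softmax, with the Max_i subtraction as step one (the function name is illustrative).

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Three-step softmax over a fixed feature vector X."""
    max_i = x.max()          # step 1: maximum over all dimensions of X, denoted Max_i
    e = np.exp(x - max_i)    # step 2: x[i] = exp(x[i] - Max_i), each value now in (0, 1]
    return e / e.sum()       # step 3: x[i] = x[i] / sum(x[i]), normalizing to sum 1
```

Subtracting Max_i leaves the output unchanged while preventing overflow for large inputs, which is why the first step exists.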
例如，所述第二装置12将图片本身作为一个特征输入所述卷积神经网络进行训练，得到的每张训练图片直接转化为一个特征矩阵[W,H,C]，其中，W为所述训练图片的宽度尺寸，H为所述训练图片的高度尺寸，C为所述训练图片的陈列分类信息等陈列信息。然后所有图片以K张为单位调入模型中进行训练，训练过程中使用了随机梯度下降方法对上述的卷积神经网络进行迭代学习，此处K一般取32或64。其中，每一轮迭代都会更新网络中每一层的参数，如网络层内结点的权重值以及偏置值等，直到这些参数值收敛，取得最优解。更为优选地，所述第二装置12可将三层卷积层处理后的结果进行降采样(如图4中的Maxpooling层(最大值合并层)所示)。接着，所述第二装置12使用全连通层将经过降采样所输出的所有特征向量(即神经元)互相之间建立连接关系，从而实现抽象化表达。For example, the second device 12 feeds the picture itself as a feature into the convolutional neural network for training, and each training picture is directly converted into a feature matrix [W, H, C], where W is the width of the training picture, H is its height, and C is display information such as the display classification information of the training picture. All pictures are then fed into the model for training in batches of K, and during training the convolutional neural network described above is iteratively learned using stochastic gradient descent, where K is generally 32 or 64. Each round of iteration updates the parameters of every layer in the network, such as the weight values and bias values of the nodes within a layer, until these parameter values converge to an optimal solution. More preferably, the second device 12 may downsample the results processed by the three convolution layers (as shown by the Maxpooling (maximum merging) layer in FIG. 4). Then, the second device 12 uses the fully connected layers to establish connections among all the feature vectors (i.e., neurons) output after downsampling, thereby achieving an abstract representation.
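The batching and parameter-update loop described above can be illustrated with a deliberately simplified stand-in: a single-layer softmax classifier trained by mini-batch stochastic gradient descent with batch size K. The real network's convolution and fully connected layers are omitted here; only the K-sized batching and the iterative SGD update are shown, and all names are illustrative.

```python
import numpy as np

def sgd_train(X, y, n_classes, K=32, lr=0.1, epochs=5, seed=0):
    """Mini-batch SGD on a single-layer softmax classifier (stand-in for the CNN)."""
    rng = np.random.default_rng(seed)
    W = np.zeros((X.shape[1], n_classes))          # parameters updated each iteration
    for _ in range(epochs):
        order = rng.permutation(len(X))            # shuffle, then feed pictures K at a time
        for start in range(0, len(X), K):
            idx = order[start:start + K]
            logits = X[idx] @ W
            logits -= logits.max(axis=1, keepdims=True)
            p = np.exp(logits)
            p /= p.sum(axis=1, keepdims=True)      # softmax probabilities per sample
            p[np.arange(len(idx)), y[idx]] -= 1.0  # gradient of the cross-entropy loss
            W -= lr * X[idx].T @ p / len(idx)      # SGD parameter update
    return W

def accuracy(W, X, y):
    return float((np.argmax(X @ W, axis=1) == y).mean())
```

With well-separated synthetic clusters this loop reaches near-perfect training accuracy within a few epochs, mirroring the "iterate until the parameters converge" description.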
优选地,如图4所示,所述第二装置12在每层卷积层后均设置RELU层和归一化层。其中,RELU(rectified linear unit,校正线性单元,一种激活函数)层利用神经网络中的各神经元的不饱和的非线性特性,提高模型整体的训练效率。所述归一化层基于每个像素点的局部窗口进行归一化处理,也就是局部归一化操作,能够增强模型整体的泛化性能。Preferably, as shown in FIG. 4, the second device 12 is provided with a RELU layer and a normalized layer after each layer of the convolution layer. Among them, RELU (rectified linear unit, an activation function) layer utilizes the unsaturated nonlinear characteristics of each neuron in the neural network to improve the overall training efficiency of the model. The normalization layer performs normalization processing based on a local window of each pixel point, that is, a local normalization operation, which can enhance the overall generalization performance of the model.
其中，所述卷积层包括高斯卷积层，所述高斯卷积层用于对前一层的输出结果与多个高斯滤波核进行卷积操作，其中，所述高斯滤波核是基于所述多个训练图片经学习获得的。The convolution layers include Gaussian convolution layers, and a Gaussian convolution layer is configured to convolve the output of the previous layer with a plurality of Gaussian filter kernels, wherein the Gaussian filter kernels are obtained by learning based on the plurality of training pictures.
例如，所述第二装置12利用高斯卷积层对前一层的输出结果与多个预设的高斯滤波核进行卷积操作。其中，高斯核的参数是经过学习得到的。所述第二装置12设置三层高斯卷积层所使用的高斯核的尺寸均为5*5，并且，在每一个高斯卷积层中，卷积核均是对图片所有的像素点进行遍历计算。其中，所述第二装置12针对第一层卷积层学习了64个高斯卷积核，针对第二层卷积层学习了32个高斯卷积核，针对第三层卷积层学习了16个高斯卷积核。For example, the second device 12 uses a Gaussian convolution layer to convolve the output of the previous layer with a plurality of preset Gaussian filter kernels, where the parameters of the Gaussian kernels are obtained by learning. The second device 12 sets the Gaussian kernels used by the three Gaussian convolution layers to a size of 5*5, and in each Gaussian convolution layer the convolution kernels traverse all the pixels of the picture. The second device 12 learns 64 Gaussian convolution kernels for the first convolution layer, 32 Gaussian convolution kernels for the second convolution layer, and 16 Gaussian convolution kernels for the third convolution layer.
需要说明的是,本领域技术人员应该理解,上述各层卷积层的高斯卷积核的数量仅为举例,事实上,各层卷积层的高斯卷积核的数量可由实际需求而定。It should be noted that those skilled in the art should understand that the number of Gaussian convolution kernels of the above-mentioned layers of convolution layers is only an example. In fact, the number of Gaussian convolution kernels of each layer convolution layer may be determined by actual needs.
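Putting the pieces together, the feature-map shapes through the three 5*5 convolution layers (64/32/16 kernels) can be traced with a small helper. The 'same' padding and a 2*2 max-pooling step after each convolution are assumptions about placement that the text leaves open; the helper only does shape arithmetic, no actual convolution.

```python
def feature_shapes(a: int = 64):
    """Trace (side, side, channels) through conv1..conv3 with 64/32/16 kernels.
    Each 5x5 convolution is assumed 'same'-padded (spatial size preserved, as
    the ReLU and local normalization steps also are); each pooling halves the sides."""
    shapes = [("input", a, a, 3)]
    size, channels = a, 3
    for name, kernels in (("conv1", 64), ("conv2", 32), ("conv3", 16)):
        channels = kernels
        shapes.append((name, size, size, channels))
        size //= 2  # 2x2 max-pooling downsampling
        shapes.append((name + "_pool", size, size, channels))
    flat = size * size * channels  # length of the vector entering the fully connected layers
    shapes.append(("flatten", flat))
    return shapes, flat
```

For a 64*64*3 input this gives an 8*8*16 map (a 1024-dimensional vector) at the first fully connected layer, under the stated assumptions.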
在建立了图片检测模型后,所述第二装置12将所述图片检测模型提供给所述第三装置13。当用户上传一待检测图片时,所述第三装置13根据所述图片检测模型确定待检测图片的图片陈列信息。After the picture detection model is established, the second device 12 provides the picture detection model to the third device 13. When the user uploads a to-be-detected picture, the third device 13 determines the picture display information of the picture to be detected according to the picture detection model.
具体地，所述第三装置13将所述待检测图片输入所述图片检测模型，得到所述待检测图片对应各陈列信息的概率向量，取概率向量的值最大者、或者概率值超出预设阈值者所对应的陈列信息为所述待检测图片的图片陈列信息。Specifically, the third device 13 inputs the to-be-detected picture into the picture detection model to obtain a probability vector of the to-be-detected picture over the pieces of display information, and takes the display information corresponding to the largest value in the probability vector, or to the probability values exceeding a preset threshold, as the picture display information of the to-be-detected picture.
在此，所述图片陈列信息可以仅为一个陈列信息，还可以包括所述多种陈列分类信息中至少一个。Here, the picture display information may be a single piece of display information, or may include at least one of the plurality of pieces of display classification information.
例如，所述图片检测模型所能检测的陈列信息包括三种陈列分类信息，具体为：正面陈列类型、侧面陈列类型及细节陈列类型。当所述第三装置13所得到的各概率向量的值中超出预设阈值的两个概率向量的值分别对应正面陈列类型和细节陈列类型，则所述第三装置13确定所述待检测图片的图片陈列信息包括：正面陈列类型和细节陈列类型。For example, the display information that the picture detection model can detect includes three pieces of display classification information, specifically: a front display type, a side display type, and a detail display type. When, among the probability values obtained by the third device 13, the two values exceeding the preset threshold correspond to the front display type and the detail display type respectively, the third device 13 determines that the picture display information of the to-be-detected picture includes: the front display type and the detail display type.
若所述第三装置13所得到的各概率向量的值均小于预设阈值，则认定所对应的待检测图片不合规。If all the values of the probability vector obtained by the third device 13 are less than the preset threshold, the corresponding to-be-detected picture is deemed non-compliant.
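The decision rule in the preceding paragraphs (take the largest probability, or every probability above a preset threshold, with an empty result marking the picture non-compliant) can be sketched as follows; the names are illustrative.

```python
import numpy as np

def decide_display_info(probs, labels, threshold=None):
    """Map a probability vector over display-information labels to a decision.
    With a threshold: keep every label whose probability exceeds it (an empty
    result means the to-be-detected picture is deemed non-compliant).
    Without one: keep the single label with the largest probability."""
    if threshold is None:
        return [labels[int(np.argmax(probs))]]
    return [lab for lab, p in zip(labels, probs) if p > threshold]
```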
优选地,所述第三装置13包括:第三一单元131、第三二单元132。(如图5所示)Preferably, the third device 13 includes: a third unit 131 and a third unit 132. (as shown in Figure 5)
所述第三一单元131用于根据待检测图片的图片相关信息从所述图片检测模型中确定对应的所述图片检测子模型。所述第三二单元132用于根据所述图片检测子模型确定所述待检测图片的图片陈列信息,其中,所述图片陈列信息包括所述多种陈列分类信息中至少一个。The third unit 131 is configured to determine, according to the picture related information of the picture to be detected, the corresponding picture detection submodel from the picture detection model. The third two unit 132 is configured to determine, according to the picture detection submodel, picture display information of the to-be-detected picture, where the picture display information includes at least one of the plurality of display classification information.
在此,每个所述图片检测子模型对应检测一类物品图片。例如,图片检测子模型A对应检测服装类图片,图片检测子模型B对应检测数码产品类图片。Here, each of the picture detection sub-models corresponds to detecting a type of item picture. For example, the picture detection sub-model A corresponds to detecting a clothing type picture, and the picture detection sub-model B corresponds to detecting a digital product type picture.
所述第三一单元131在获取待检测图片的同时,还能获取所述待检测图片的图片相关信息。The third unit 131 can also acquire the picture related information of the to-be-detected picture while acquiring the picture to be detected.
例如，所述第三一单元131通过http、https等通信约定获取包含待检测图片和图片相关信息的表格。其中，所述图片相关信息包括但不限于：1)所述待检测图片的展示主体信息。其中，所述展示主体信息用于表示所述待检测图片中所展示的物品名称、类别等。例如，所述展示主体信息包括：服装、上衣。2)所述待检测图片的陈列位置信息。其中，所述陈列位置信息用于表示所述待检测图片中所展示的物品的摆放位置等。例如，所述陈列位置信息包括：家具的主体图、家具的左侧图、家具的右侧图、家具的局部图等。3)所述待检测图片所属应用的应用相关信息。其中，所述应用相关信息用于表示上传所述待检测图片的来源信息等。例如，所述应用相关信息包括：应用客户端所提供的数码类信息、WEB页面中的服装类上传信息等。For example, the third unit 131 acquires, through communication protocols such as http and https, a table containing the to-be-detected picture and picture-related information. The picture-related information includes, but is not limited to: 1) display subject information of the to-be-detected picture, which indicates the name, category, etc. of the item shown in the to-be-detected picture; for example, the display subject information includes: clothing, tops. 2) Display position information of the to-be-detected picture, which indicates the placement of the item shown in the to-be-detected picture; for example, the display position information includes: a main view of a piece of furniture, a left-side view, a right-side view, a partial view, and the like. 3) Application-related information of the application to which the to-be-detected picture belongs, which indicates the source from which the to-be-detected picture was uploaded; for example, the application-related information includes: digital-category information provided by an application client, clothing-category upload information in a WEB page, and the like.
由上可见,所述第三一单元131可以根据所述图片相关信息得到所对应的图片检测子模型。当所述第三一单元131根据所述图片相关信息无法得到所对应的图片检测子模型时,则认定所获取的待检测图片不合规。As can be seen from the above, the third unit 131 can obtain the corresponding picture detection submodel according to the picture related information. When the third unit 131 cannot obtain the corresponding picture detection submodel according to the picture related information, it is determined that the acquired picture to be detected is not in compliance.
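Routing a to-be-detected picture to its sub-model by the picture-related information, and flagging non-compliance when no sub-model matches, might look like this minimal sketch (the key name "subject" and the sub-model values are hypothetical).

```python
def pick_submodel(picture_info: dict, submodels: dict):
    """Return the detection sub-model for the picture's item category,
    or None when no sub-model corresponds (picture deemed non-compliant)."""
    category = picture_info.get("subject")  # e.g. display-subject info: "clothing", "digital"
    return submodels.get(category)          # None => non-compliant
```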
接着,所述第三二单元132根据所述图片检测子模型确定所述待检测图片的图片陈列信息。Next, the third two unit 132 determines the picture display information of the to-be-detected picture according to the picture detection sub-model.
需要说明的是，本领域技术人员应该理解，所述第三二单元132根据所述图片检测子模型确定所述待检测图片的图片陈列信息的方式与前述第三装置13根据所述图片检测模型确定所述待检测图片的图片陈列信息的方式相同或相似，在此不再详述。It should be noted that those skilled in the art should understand that the manner in which the third two unit 132 determines the picture display information of the to-be-detected picture according to the picture detection sub-model is the same as or similar to the manner in which the foregoing third device 13 determines it according to the picture detection model, and is not detailed again here.
图6示出根据本申请一个方面的一种用于确定图片陈列信息的方法。其中,所述方法主要由确定设备来执行。其中,所述方法包括步骤S1、S2和S3。具体地,在步骤S1中,所述确定设备获取已标注陈列信息的多个训练图片;在步骤S2中,所述确定设备基于所述多个训练图片经卷积神经网络训练得到对应的图片检测模型;在步骤S3中,所述确定设备根据所述图片检测模型确定待检测图片的图片陈列信息。FIG. 6 illustrates a method for determining picture display information in accordance with an aspect of the present application. Wherein, the method is mainly performed by a determining device. Wherein the method comprises steps S1, S2 and S3. Specifically, in step S1, the determining device acquires a plurality of training pictures that have been labeled with the display information; in step S2, the determining device performs corresponding picture detection by using the convolutional neural network training based on the plurality of training pictures. a model; in step S3, the determining device determines picture display information of the picture to be detected according to the picture detection model.
在此，所述确定设备可由网络主机、单个网络服务器、多个网络服务器集或多个服务器构成的云等实现。在此，云由基于云计算(Cloud Computing)的大量主机或网络服务器构成，其中，云计算是分布式计算的一种，是由一群松散耦合的计算机集组成的一个超级虚拟计算机。本领域技术人员应能理解上述网络设备仅为举例，其他现有的或今后可能出现的网络设备如可适用于本申请，也应包含在本申请保护范围以内，并在此以引用方式包含于此。在此，所述确定设备包括一种能够按照事先设定或存储的指令，自动进行数值计算和信息处理的电子设备，其硬件包括但不限于微处理器、专用集成电路(ASIC)、可编程门阵列(FPGA)、数字处理器(DSP)、嵌入式设备等。Here, the determining device may be implemented by a network host, a single network server, a set of multiple network servers, or a cloud composed of multiple servers. Here, the cloud is composed of a large number of hosts or network servers based on Cloud Computing, where cloud computing is a kind of distributed computing: a super virtual computer composed of a group of loosely coupled computers. Those skilled in the art should understand that the above network devices are merely examples; other existing or future network devices, if applicable to the present application, should also be included in the protection scope of the present application and are incorporated herein by reference. Here, the determining device includes an electronic device capable of automatically performing numerical calculation and information processing according to preset or stored instructions, the hardware of which includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
具体地，所述确定设备按照构建图片检测模型所要求的尺寸、格式等，通过http、https等约定通信方式远程调用、或通过本地读取等方式获取训练图片及所对应的陈列信息。其中，所述训练图片可以是所存储的源图片，也可以是对源图片进行修剪之后所得到的图片等。所述确定设备按照陈列信息的分类均匀地获取各训练图片。其中，所述陈列信息包括任何能够描述所述训练图片所展示的物品的摆放信息、以及所述物品的细节或整体效果的信息等。例如，描述服装的训练图片的陈列信息包括但不限于：模特上身陈列信息、模特下身陈列信息、模特全身陈列信息、上身平铺陈列信息、下身平铺陈列信息、全身平铺陈列信息、细节陈列信息、堆叠陈列信息、多图陈列信息、其他陈列信息等。其与训练图片的对应关系如图2举例。Specifically, the determining device acquires the training pictures and the corresponding display information, in the size, format, etc. required for building the picture detection model, by remote invocation through agreed communication protocols such as http or https, or by local reading. The training pictures may be stored source pictures, or pictures obtained by trimming source pictures, and the like. The determining device acquires the training pictures evenly across the categories of display information, where the display information includes any information that can describe the placement of the item shown in the training picture, as well as the details or overall effect of the item. For example, the display information describing a training picture of clothing includes, but is not limited to: model upper-body display information, model lower-body display information, model full-body display information, upper-body flat-lay display information, lower-body flat-lay display information, full-body flat-lay display information, detail display information, stacked display information, multi-picture display information, other display information, and the like. Its correspondence with the training pictures is exemplified in FIG. 2.
For another example, the display information describing the training picture of the furniture includes, but is not limited to, front display information, three-dimensional display information, side display information, detail display information, other display information, and the like.
在此,各所述训练图片所对应的陈列信息可以由对应各训练图片的数据库中直接获取。也可以根据所获取的各训练图片的陈列方式来确定,其中,所述陈列方式包括多种陈列分类信息,所标注的陈列信息包括所述多种陈列分类信息中至少一个。Here, the display information corresponding to each of the training pictures may be directly obtained from a database corresponding to each training picture. It may also be determined according to the obtained display manner of each training picture, wherein the display manner includes a plurality of display classification information, and the marked display information includes at least one of the plurality of display classification information.
例如，所述确定设备中预设了针对下身服装的陈列方式，包括三种陈列分类信息，具体为：模特下身陈列分类信息、下身平铺陈列分类信息和下身细节陈列分类信息。所述确定设备所获取的训练图片包括：模特展示裤子的图片、模特展示半身裙的图片，同时一并获取了该两幅训练图片的陈列方式均为模特下身陈列分类信息，则对应各图片的陈列信息均为：模特下身陈列信息。For example, the determining device presets display manners for lower-body clothing, including three pieces of display classification information, specifically: model lower-body display classification information, lower-body flat-lay display classification information, and lower-body detail display classification information. The training pictures acquired by the determining device include a picture of a model showing pants and a picture of a model showing a skirt; since the display manner acquired along with both training pictures is the model lower-body display classification information, the display information corresponding to each picture is: model lower-body display information.
又如，所述确定设备所获取的训练图片中的一幅图片中既包含模特展示裤子的图像还包含裤子细节的图像，则根据所获取的对应的陈列方式，所述确定设备确定所述图片的陈列信息包含模特下身陈列信息和下身细节陈列分类信息。For another example, if one of the training pictures acquired by the determining device contains both an image of a model showing pants and an image of pants details, then, according to the acquired corresponding display manner, the determining device determines that the display information of the picture includes the model lower-body display information and the lower-body detail display classification information.
优选地,所述确定设备通过对源图片进行修剪得到多个训练图片。具体地,所述步骤S1包括:步骤S11、步骤S12。如图7所示。在步骤S11中,所述确定设备获取已标注陈列信息的多个样本图片;在步骤S12中,所述确定设备对每个样本图片进行预处理以获得对应的训练图片。Preferably, the determining device obtains a plurality of training pictures by trimming the source picture. Specifically, the step S1 includes: step S11, step S12. As shown in Figure 7. In step S11, the determining device acquires a plurality of sample pictures that have been labeled with the display information; in step S12, the determining device performs pre-processing on each sample picture to obtain a corresponding training picture.
在此，所述确定设备通过http、https等约定通信方式远程调用、或通过本地读取等方式获取多个样本图片。由于所获取的样本图片的尺寸、色彩等各不相同，则所述确定设备对每个样本图片进行预处理，以得到符合预设尺寸、色彩要求的各训练图片。Here, the determining device acquires a plurality of sample pictures remotely through an agreed communication protocol such as http or https, or locally, e.g. by reading them from storage. Since the acquired sample pictures differ in size, color, and the like, the determining device preprocesses each sample picture to obtain training pictures that meet the preset size and color requirements.
在此，所述确定设备对每个样本图片进行预处理的方式包括从所获取的样本图片中选取符合预设尺寸、色彩等要求的图片作为所述训练图片。优选地，所述预处理方式包括：对每个样本图片进行归一化处理以获得对应的训练图片。具体地，所述归一化处理的方式包括但不限于以下至少任一项：1)将样本图片转换为三原色表示。例如，所获取的样本图片为JPG格式，则所述确定设备将该样本图片转换为RGB格式。2)对样本图片按比例缩放使其一边为定长。例如，所获取的各样本图片的尺寸各不相同，则所述确定设备将该样本图片转换成短边尺寸为a、长边尺寸为(a*bi/ai)。其中，i为样本图片的序号，0<i<n，n为所获取的样本图片的数量，ai为第i个样本图片的短边尺寸，bi为第i个样本图片的长边尺寸。3)裁剪样本图片使其为正方形。例如，所述确定设备将所获取的各样本图片的两短边分别裁剪(ai-a)/2的宽度，两长边分别裁剪(bi-a)/2的宽度。其中，i为各样本图片的序号，0<i<n，n为所获取的样本图片的数量，ai为第i个样本图片的短边尺寸，bi为第i个样本图片的长边尺寸，a为裁剪后的样本图片的长和宽尺寸。Here, the preprocessing performed by the determining device on each sample picture includes selecting, from the acquired sample pictures, pictures that meet requirements on preset size, color, and the like as the training pictures. Preferably, the preprocessing includes: normalizing each sample picture to obtain a corresponding training picture. Specifically, the normalization includes, but is not limited to, at least any one of the following: 1) Converting the sample picture into a three-primary-color representation. For example, if an acquired sample picture is in JPG format, the determining device converts it into RGB format. 2) Scaling the sample picture proportionally so that one side has a fixed length. For example, if the acquired sample pictures differ in size, the determining device converts a sample picture so that its short side has size a and its long side has size (a*bi/ai), where i is the sequence number of the sample picture, 0<i<n, n is the number of acquired sample pictures, ai is the short-side size of the i-th sample picture, and bi is its long-side size. 3) Cropping the sample picture into a square. For example, the determining device crops a width of (ai-a)/2 from each of the two short sides of each acquired sample picture, and a width of (bi-a)/2 from each of the two long sides, where i is the sequence number of each sample picture, 0<i<n, n is the number of acquired sample pictures, ai is the short-side size of the i-th sample picture, bi is its long-side size, and a is the side length (both length and width) of the cropped sample picture.
需要说明的是,本领域技术人员应该理解,上述归一化处理的方式仅为举例。事实上,所述确定设备还可以先将样本图片转换为三原色表示,再对样本图片按比例缩放使其一边为定长、和/或进行裁剪。It should be noted that those skilled in the art should understand that the manner of the above normalization processing is merely an example. In fact, the determining device may first convert the sample picture into a three primary color representation, and then scale the sample picture to make the side length, and/or crop.
其中,所述样本图片的数量可以与训练图片的数量相同,也可以少于训练图片的数量。The number of the sample pictures may be the same as the number of training pictures, or may be less than the number of training pictures.
优选地，所述确定设备利用移动窗从经所述归一化处理的每个样本图片中截取多个对应的训练图片。例如，所述确定设备所获取的样本图片的数量为n，所述确定设备先将所获取的样本图片按照上述任一种或多种方式进行归一化处理。接着，以a'*a'的移动窗对每幅裁剪后的尺寸为a*a的样本图片进行地毯式的移动，其中，移动的步进为t。如此，每幅样本图片被截取出的训练图片的数量为1+(a-a')/t，则所述确定设备共得到的训练图片的数量为n*(1+(a-a')/t)。更优选地，所述确定设备利用移动窗从经所述归一化处理的每个样本图片中截取多个对应的训练图片，以使所得到的训练图片保留原样本图片的下半部信息；例如，该确定设备通过使移动窗在移动过程中保持与样本图片底部对齐，从该样本图片中截取多个保留原样本图片的下半部信息的训练图片。Preferably, the determining device uses a moving window to cut a plurality of corresponding training pictures out of each normalized sample picture. For example, if the number of sample pictures acquired by the determining device is n, the determining device first normalizes the acquired sample pictures in any one or more of the above manners, and then sweeps a moving window of size a'*a' exhaustively over each cropped sample picture of size a*a with a moving step of t. In this way, the number of training pictures cut out of each sample picture is 1+(a-a')/t, so the determining device obtains n*(1+(a-a')/t) training pictures in total. More preferably, the determining device cuts the training pictures out of each normalized sample picture such that the obtained training pictures retain the lower-half information of the original sample picture; for example, by keeping the moving window aligned with the bottom of the sample picture while it moves, the determining device cuts out of that sample picture a plurality of training pictures that retain the lower-half information of the original.
更为优选地,所述确定设备还可以通过对所截取的训练图片进行如镜像翻转、平面旋转等的旋转处理,得到更多的训练图片。例如,所述确定设备按照上述任一种或多种方式进行归一化处理、甚至截取了多个对应的训练图片之后,将所得到的训练图片进行旋转处理,如此得到更多的训练图片,并执行步骤S2。More preferably, the determining device may further obtain more training pictures by performing rotation processing on the intercepted training pictures such as mirror flipping, plane rotation, and the like. For example, after the determining device performs normalization processing according to any one or more of the above manners, and even intercepts a plurality of corresponding training pictures, the obtained training pictures are rotated, so that more training pictures are obtained. And step S2 is performed.
需要说明的是,所述确定设备所获取的各训练图片中所展示的物品应属于同一类物品。例如,所获取的各训练图片中所展示的均为服装类物品;或者,所获取的各训练图片中所展示的均为数码类物品等。It should be noted that the items displayed in each training picture acquired by the determining device should belong to the same type of items. For example, all the acquired training pictures are displayed as clothing items; or, the obtained training pictures are all digital items and the like.
在步骤S2中,所述确定设备基于所述多个训练图片经卷积神经网络训练得到对应的图片检测模型。In step S2, the determining device trains the convolutional neural network based on the plurality of training pictures to obtain a corresponding picture detection model.
具体地，所述确定设备将在所述步骤S1中所获取的各训练图片进行卷积神经网络训练，得到对应各陈列信息的特征向量(即神经元)，再按照各陈列信息对所得到的各特征向量进行分类处理，得到图片检测模型。Specifically, the determining device performs convolutional neural network training on the training pictures acquired in step S1 to obtain feature vectors (i.e., neurons) corresponding to the pieces of display information, and then classifies the obtained feature vectors according to the display information to obtain the picture detection model.
例如，所述确定设备将每个训练图片进行卷积神经网络训练，并将得到的特征向量与所属训练图片的陈列信息相对应，当所有训练图片完成卷积神经网络训练后，将对应同一陈列信息的各固定特征向量在所有维度上进行归一化的分类处理，最终得到分类后的每个维度的特征向量对应一个陈列信息的图片检测模型。For example, the determining device performs convolutional neural network training on each training picture and associates the resulting feature vector with the display information of that training picture. After all training pictures have completed the convolutional neural network training, the fixed feature vectors corresponding to the same display information are subjected to normalized classification over all dimensions, finally yielding a picture detection model in which, after classification, each dimension's feature vector corresponds to one piece of display information.
Here, the convolutional neural network includes three convolutional layers and two fully connected layers. Specifically, the determining device iterates over the results of each convolutional layer using gradient descent, and then uses the two fully connected layers to establish connections among the obtained feature vectors. Preferably, the convolutional neural network may further include a dropout layer attached to one of the fully connected layers (as shown in FIG. 4) to improve the efficiency of model convergence. Here, the dropout layer temporarily deactivates some of the parameters of its corresponding convolutional or fully connected layer; the deactivated parameter values are retained but not updated until the next iteration in which they are not selected for deactivation. The convolutional neural network further includes a softmax layer. During the training phase, the training pictures and the corresponding display information are used together, and the whole problem is trained through a multi-layer network comprising the dropout layer, the convolutional layers, the fully connected layers, and so on; the display information comes into play at the final softmax layer. The softmax layer contains a nonlinear classifier that is trained on the feature vectors output by the fully connected layers together with the corresponding labels.
The softmax computation can be divided into three steps. First, the maximum over all dimensions of the fixed feature vector X is computed and denoted Max_i. Second, the exponential function exp maps each dimension of the vector to a value between 0 and 1, i.e., each dimension x[i] of X becomes x[i] = exp(x[i] − Max_i). Third, all values are summed and each is normalized accordingly, i.e., x[i] = x[i] / sum(x[i]).
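The three steps above can be written directly as code; this is a generic, numerically stable softmax sketch matching the description, not code from the application itself:

```python
import numpy as np

def softmax(x):
    """The three-step softmax from the description: subtract the maximum
    (for numerical stability), exponentiate, then normalize to sum to 1."""
    max_i = np.max(x)                 # step 1: maximum over all dimensions
    e = np.exp(x - max_i)             # step 2: exp of shifted values, each in (0, 1]
    return e / np.sum(e)              # step 3: normalize so entries sum to 1

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
```

Subtracting the maximum leaves the result unchanged mathematically while preventing overflow in the exponential.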
For example, the determining device feeds the picture itself as a feature into the convolutional neural network for training; each training picture is converted directly into a feature matrix [W, H, C], where W is the width of the training picture, H is its height, and C is its display information such as display classification information. All pictures are then fed into the model for training in batches of K, where K is typically 32 or 64; during training, stochastic gradient descent is used to iteratively train the above convolutional neural network. Each iteration updates the parameters of every layer in the network, such as the weights and biases of the nodes in each layer, until these parameter values converge to an optimal solution. More preferably, the determining device may downsample the output of the three convolutional layers (as shown by the Maxpooling layer in FIG. 4). The determining device then uses the fully connected layers to establish connections among all the feature vectors (i.e., neurons) output after downsampling, thereby achieving an abstract representation.
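The minibatch training loop described here (batches of K samples, one stochastic-gradient update per batch, repeated until the parameters converge) can be illustrated on a toy one-parameter problem; the data, learning rate, and epoch count below are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: fit y = 2*x with a single weight, standing in for the network parameters
X = rng.uniform(-1, 1, size=(256, 1))
y = 2.0 * X[:, 0]

K = 32            # minibatch size, as in the text (typically 32 or 64)
w = 0.0           # the "network parameter" being learned
lr = 0.5

for epoch in range(20):
    perm = rng.permutation(len(X))           # reshuffle each epoch (stochastic)
    for start in range(0, len(X), K):
        batch = perm[start:start + K]
        pred = X[batch, 0] * w
        grad = np.mean(2 * (pred - y[batch]) * X[batch, 0])  # dMSE/dw on the minibatch
        w -= lr * grad                       # one SGD update per minibatch
```

The same pattern applies per layer in the real network: each minibatch produces a gradient that updates that layer's weights and biases.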
Preferably, as shown in FIG. 4, the determining device places a ReLU (rectified linear unit, an activation function) layer and a normalization layer after each convolutional layer. The ReLU layer exploits the non-saturating nonlinearity of the neurons in the neural network to improve the overall training efficiency of the model. The normalization layer normalizes over a local window around each pixel, i.e., a local normalization operation, which enhances the overall generalization performance of the model.
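As an illustration of these two layers, the following sketch implements ReLU and a simplified per-pixel local-window normalization; the exact normalization formula used by the model is not given in the text, so the root-mean-square form below is an assumption:

```python
import numpy as np

def relu(x):
    """Non-saturating activation: negative responses are clamped to zero."""
    return np.maximum(0.0, x)

def local_normalize(fmap, radius=1, eps=1e-5):
    """Simplified local normalization: divide each pixel by the RMS of the
    values in its (2*radius+1)^2 neighborhood (a hedged stand-in for the
    local-window normalization mentioned in the text)."""
    h, w = fmap.shape
    out = np.empty_like(fmap, dtype=float)
    for i in range(h):
        for j in range(w):
            win = fmap[max(0, i - radius):i + radius + 1,
                       max(0, j - radius):j + radius + 1]
            out[i, j] = fmap[i, j] / np.sqrt(np.mean(win ** 2) + eps)
    return out

fmap = relu(np.array([[-1.0, 2.0],
                      [ 3.0, -4.0]]))
normed = local_normalize(fmap)
```
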
The convolutional layers include Gaussian convolutional layers. A Gaussian convolutional layer convolves the output of the previous layer with a plurality of Gaussian filter kernels, where the Gaussian filter kernels are learned from the plurality of training pictures.
For example, the determining device uses a Gaussian convolutional layer to convolve the output of the previous layer with a plurality of preset Gaussian filter kernels, whose parameters are obtained through learning. The determining device sets the size of the Gaussian kernels used in all three Gaussian convolutional layers to 5×5, and within each Gaussian convolutional layer the convolution kernel traverses all pixels of the picture. The determining device learns 64 Gaussian convolution kernels for the first convolutional layer, 32 Gaussian convolution kernels for the second, and 16 Gaussian convolution kernels for the third.
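A sketch of a single 5×5 Gaussian kernel and a "valid" convolution pass over an image may make this concrete; note that in the described model the kernel parameters are learned rather than fixed by a sigma, so the fixed-sigma kernel here is purely illustrative:

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """A size x size Gaussian filter kernel, normalized to sum to 1.
    (In the described model the parameters would instead be learned.)"""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()

def conv2d_valid(img, kernel):
    """Slide the kernel over every position of the image ('valid' convolution)."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

k = gaussian_kernel()            # 5x5, as in the example
img = np.ones((8, 8))            # a flat test image
smoothed = conv2d_valid(img, k)  # a normalized kernel leaves a flat image flat
```

A layer with 64 such kernels simply applies 64 kernels in parallel, producing 64 output feature maps.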
It should be noted that those skilled in the art will understand that the above numbers of Gaussian convolution kernels per convolutional layer are merely examples; in practice, the number of Gaussian convolution kernels in each convolutional layer may be determined by actual requirements.
After the picture detection model has been built, the determining device saves the picture detection model. When a user uploads a picture to be detected, the determining device performs step S3, i.e., determines the picture display information of the picture to be detected according to the picture detection model.
Specifically, the determining device feeds the picture to be detected into the picture detection model to obtain a probability vector over the kinds of display information; the display information corresponding to the largest entry of the probability vector, or to the entries whose probability exceeds a preset threshold, is taken as the picture display information of the picture to be detected.
Here, the picture display information may be a single piece of display information, or may include at least one of the plurality of kinds of display classification information.
For example, the display information that the picture detection model can detect includes three kinds of display classification information: front display type, side display type, and detail display type. If, among the entries of the probability vector obtained by the determining device, the two entries exceeding the preset threshold correspond to the front display type and the detail display type respectively, the determining device determines that the picture display information of the picture to be detected includes the front display type and the detail display type.
If all entries of the probability vector obtained by the determining device are below the preset threshold, the corresponding picture to be detected is deemed non-compliant.
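The decision rule described above (keep every label whose probability exceeds the threshold, otherwise deem the picture non-compliant) can be sketched as follows; the label names come from the example, while the threshold value is an assumption for illustration:

```python
LABELS = ["front display", "side display", "detail display"]  # example types from the text

def decide(probs, threshold=0.5):
    """Map a model's probability vector to picture display information:
    keep every label whose probability exceeds the threshold; if none
    qualifies, the picture is deemed non-compliant."""
    chosen = [label for label, p in zip(LABELS, probs) if p > threshold]
    return chosen if chosen else "non-compliant"

r1 = decide([0.7, 0.1, 0.6])   # two entries above threshold -> two labels
r2 = decide([0.2, 0.3, 0.1])   # all below threshold -> non-compliant
```
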
Preferably, step S3 includes steps S31 and S32, as shown in FIG. 4.
In step S31, the determining device determines the corresponding picture detection sub-model from the picture detection model according to the picture-related information of the picture to be detected. In step S32, the determining device determines the picture display information of the picture to be detected according to the picture detection sub-model, where the picture display information includes at least one of the plurality of kinds of display classification information.
Here, each picture detection sub-model detects pictures of one category of items. For example, picture detection sub-model A detects clothing pictures, and picture detection sub-model B detects digital product pictures.
When acquiring the picture to be detected, the determining device can also acquire the picture-related information of the picture to be detected.
For example, the determining device acquires, via a communication protocol such as HTTP or HTTPS, a form containing the picture to be detected and its picture-related information. The picture-related information includes, but is not limited to: 1) the display subject information of the picture to be detected, which indicates the name, category, etc. of the item shown in the picture, e.g., clothing or tops; 2) the display position information of the picture to be detected, which indicates the placement of the item shown in the picture, e.g., a main view of a piece of furniture, its left view, its right view, or a close-up; 3) the application-related information of the application to which the picture to be detected belongs, which indicates the source from which the picture was uploaded, e.g., digital-product information provided by an application client or clothing uploads from a web page.
As can be seen from the above, the determining device can obtain the corresponding picture detection sub-model from the picture-related information. If the determining device cannot obtain a corresponding picture detection sub-model from the picture-related information, the acquired picture to be detected is deemed non-compliant.
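The sub-model lookup with its non-compliance fallback might look like the following sketch; the registry keys and sub-model names are hypothetical:

```python
# hypothetical registry mapping an item category (from the picture-related
# information) to its picture detection sub-model; the names are illustrative
SUB_MODELS = {"clothing": "sub-model A", "digital": "sub-model B"}

def select_sub_model(picture_info):
    """Pick the sub-model for the picture's category; if no sub-model
    matches, the picture is deemed non-compliant, as in the text."""
    return SUB_MODELS.get(picture_info.get("category"), "non-compliant")

m1 = select_sub_model({"category": "clothing"})   # known category
m2 = select_sub_model({"category": "furniture"})  # no matching sub-model
```
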
The determining device then determines the picture display information of the picture to be detected according to the picture detection sub-model.
It should be noted that those skilled in the art will understand that the manner of determining the picture display information of the picture to be detected according to the picture detection sub-model in step S32 is the same as or similar to that of determining it according to the picture detection model in step S3 above, and is not detailed again here.
In summary, the method and device for determining picture display information of the present application model the different display modes of items of the same category, and use the built model to determine the picture display information of a picture to be detected, thereby efficiently and accurately identifying the display mode of the picture. This supports further improvement of the displayed picture or its display mode, improves the efficiency with which users obtain information, raises the screen-resource utilization of user terminals, and improves the user experience. In addition, the present application normalizes the acquired sample pictures, which facilitates uniform processing of training pictures during modeling and allows a sufficient number of training pictures to be obtained from fewer sample pictures, improving modeling efficiency. Furthermore, training the neural network with three convolutional layers and two fully connected layers effectively improves the accuracy of the picture detection model, so that the recognition accuracy when discriminating and identifying pictures to be detected reaches above 90%. The present application therefore effectively overcomes the various shortcomings of the prior art and has high industrial utility.
It is obvious to those skilled in the art that the present application is not limited to the details of the above exemplary embodiments, and that the application can be implemented in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in all respects as illustrative and not restrictive, the scope of the application being defined by the appended claims rather than by the foregoing description; all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claims concerned. Moreover, the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. Multiple units or devices recited in a device claim may also be implemented by one unit or device through software or hardware. Words such as "first" and "second" are used to denote names and do not denote any particular order.

Claims (20)

  1. A method for determining picture display information, wherein the method comprises:
    acquiring a plurality of training pictures labeled with display information;
    training a convolutional neural network on the plurality of training pictures to obtain a corresponding picture detection model;
    determining picture display information of a picture to be detected according to the picture detection model.
  2. The method according to claim 1, wherein acquiring the plurality of training pictures labeled with display information comprises:
    acquiring a plurality of sample pictures labeled with display information;
    preprocessing each sample picture to obtain a corresponding training picture.
  3. The method according to claim 2, wherein preprocessing each sample picture to obtain a corresponding training picture comprises:
    normalizing each sample picture to obtain a corresponding training picture.
  4. The method according to claim 3, wherein preprocessing each sample picture to obtain a corresponding training picture further comprises:
    cropping, with a moving window, a plurality of corresponding training pictures from each normalized sample picture.
  5. The method according to claim 3 or 4, wherein the normalization comprises at least one of the following:
    converting the sample picture to a three-primary-color representation;
    scaling the sample picture so that one side has a fixed length;
    cropping the sample picture so that it is square.
  6. The method according to any one of claims 1 to 5, wherein the convolutional neural network comprises three convolutional layers and two fully connected layers.
  7. The method according to claim 6, wherein the convolutional layers comprise a Gaussian convolutional layer for convolving the output of the previous layer with a plurality of Gaussian filter kernels, wherein the Gaussian filter kernels are learned from the plurality of training pictures.
  8. The method according to any one of claims 1 to 7, wherein acquiring the plurality of training pictures labeled with display information comprises:
    acquiring a plurality of training pictures labeled with display information, wherein the display modes of the training pictures include a plurality of kinds of display classification information, and the labeled display information includes at least one of the plurality of kinds of display classification information;
    wherein determining the picture display information of the picture to be detected according to the picture detection model comprises:
    determining the picture display information of the picture to be detected according to the picture detection model, wherein the picture display information includes at least one of the plurality of kinds of display classification information.
  9. The method according to claim 8, wherein the picture detection model comprises a plurality of picture detection sub-models;
    wherein determining the picture display information of the picture to be detected according to the picture detection model comprises:
    determining the corresponding picture detection sub-model from the picture detection model according to picture-related information of the picture to be detected;
    determining the picture display information of the picture to be detected according to the picture detection sub-model, wherein the picture display information includes at least one of the plurality of kinds of display classification information.
  10. The method according to claim 9, wherein the picture-related information comprises at least one of the following:
    display subject information of the picture to be detected;
    display position information of the picture to be detected;
    application-related information of the application to which the picture to be detected belongs.
  11. A device for determining picture display information, wherein the device comprises:
    a first means for acquiring a plurality of training pictures labeled with display information;
    a second means for training a convolutional neural network on the plurality of training pictures to obtain a corresponding picture detection model;
    a third means for determining picture display information of a picture to be detected according to the picture detection model.
  12. The device according to claim 11, wherein the first means comprises:
    a first-first unit for acquiring a plurality of sample pictures labeled with display information;
    a first-second unit for preprocessing each sample picture to obtain a corresponding training picture.
  13. The device according to claim 12, wherein the first-second unit is configured to:
    normalize each sample picture to obtain a corresponding training picture.
  14. The device according to claim 13, wherein the first-second unit is further configured to:
    crop, with a moving window, a plurality of corresponding training pictures from each normalized sample picture.
  15. The device according to claim 13 or 14, wherein the normalization comprises at least one of the following:
    converting the sample picture to a three-primary-color representation;
    scaling the sample picture so that one side has a fixed length;
    cropping the sample picture so that it is square.
  16. The device according to any one of claims 11 to 15, wherein the convolutional neural network comprises three convolutional layers and two fully connected layers.
  17. The device according to claim 16, wherein the convolutional layers comprise a Gaussian convolutional layer for convolving the output of the previous layer with a plurality of Gaussian filter kernels, wherein the Gaussian filter kernels are learned from the plurality of training pictures.
  18. The device according to any one of claims 11 to 17, wherein the first means is configured to:
    acquire a plurality of training pictures labeled with display information, wherein the display modes of the training pictures include a plurality of kinds of display classification information, and the labeled display information includes at least one of the plurality of kinds of display classification information;
    wherein the third means is configured to:
    determine the picture display information of the picture to be detected according to the picture detection model, wherein the picture display information includes at least one of the plurality of kinds of display classification information.
  19. The device according to claim 18, wherein the picture detection model comprises a plurality of picture detection sub-models;
    wherein the third means comprises:
    a third-first unit for determining the corresponding picture detection sub-model from the picture detection model according to picture-related information of the picture to be detected;
    a third-second unit for determining the picture display information of the picture to be detected according to the picture detection sub-model, wherein the picture display information includes at least one of the plurality of kinds of display classification information.
  20. The device according to claim 19, wherein the picture-related information comprises at least one of the following:
    display subject information of the picture to be detected;
    display position information of the picture to be detected;
    application-related information of the application to which the picture to be detected belongs.
PCT/CN2016/070157 2015-01-15 2016-01-05 Method and device for determining image display information WO2016112797A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510020689.9 2015-01-15
CN201510020689.9A CN105843816A (en) 2015-01-15 2015-01-15 Method and device for determining display information of picture

Publications (1)

Publication Number Publication Date
WO2016112797A1 true WO2016112797A1 (en) 2016-07-21

Family

ID=56405240

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/070157 WO2016112797A1 (en) 2015-01-15 2016-01-05 Method and device for determining image display information

Country Status (2)

Country Link
CN (1) CN105843816A (en)
WO (1) WO2016112797A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107886344A (en) * 2016-09-30 2018-04-06 北京金山安全软件有限公司 Convolutional neural network-based cheating advertisement page identification method and device

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107145908B (en) * 2017-05-08 2019-09-03 江南大学 A small target detection method based on R-FCN
CN108052523A (en) * 2017-11-03 2018-05-18 中国互联网络信息中心 Gambling site recognition methods and system based on convolutional neural networks
CN107944022A (en) * 2017-12-11 2018-04-20 努比亚技术有限公司 Picture classification method, mobile terminal and computer-readable recording medium
CN109657681A (en) * 2018-12-28 2019-04-19 北京旷视科技有限公司 Mask method, device, electronic equipment and the computer readable storage medium of picture
CN110705744B (en) * 2019-08-26 2022-10-21 南京苏宁加电子商务有限公司 Planogram generation method, planogram generation apparatus, computer device, and storage medium
CN110851902B (en) * 2019-11-06 2023-04-07 广东博智林机器人有限公司 Method and device for generating spatial arrangement scheme
CN117612159B (en) * 2022-11-08 2024-08-27 郑州英视江河生态环境科技有限公司 Microscopic biological image processing method, neural network training method, device and equipment
CN115601631B (en) * 2022-12-15 2023-04-07 深圳爱莫科技有限公司 Cigarette display image recognition method, system, equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101950400A (en) * 2010-10-09 2011-01-19 姚建 Network shopping guiding method
US7920745B2 (en) * 2006-03-31 2011-04-05 Fujifilm Corporation Method and apparatus for performing constrained spectral clustering of digital image data
CN103544506A (en) * 2013-10-12 2014-01-29 Tcl集团股份有限公司 Method and device for classifying images on basis of convolutional neural network
CN103793717A (en) * 2012-11-02 2014-05-14 阿里巴巴集团控股有限公司 Methods for determining image-subject significance and training image-subject significance determining classifier and systems for same
CN103955718A (en) * 2014-05-15 2014-07-30 厦门美图之家科技有限公司 Image subject recognition method
CN104050568A (en) * 2013-03-11 2014-09-17 阿里巴巴集团控股有限公司 Method and system for commodity picture displaying
CN104077577A (en) * 2014-07-03 2014-10-01 浙江大学 Trademark detection method based on convolutional neural network
CN104268524A (en) * 2014-09-24 2015-01-07 朱毅 Convolutional neural network image recognition method based on dynamic adjustment of training targets

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102034116B (en) * 2010-05-07 2013-05-01 大连交通大学 Commodity image classifying method based on complementary features and class description
CN103345645B (en) * 2013-06-27 2016-09-28 复旦大学 Commodity image class prediction method towards net purchase platform


Also Published As

Publication number Publication date
CN105843816A (en) 2016-08-10


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16737023

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16737023

Country of ref document: EP

Kind code of ref document: A1

点击 这是indexloc提供的php浏览器服务,不要输入任何密码和下载