Video conference scene human shape detection method based on deep learning
Technical Field
The invention relates to the field of machine vision, in particular to a video conference scene human shape detection method based on deep learning.
Background
Human shape detection in video conferencing is commonly used to implement speaker focusing, directional speech enhancement, and similar functions. In a complex meeting-room scene, however, factors such as a large number of participants, densely packed seats, unbalanced lighting, and the random movement of people greatly degrade the performance of conventional target detection algorithms. Moreover, the camera used in a conference room is usually an ultra-wide-angle or panoramic camera: the picture is wide, a single human target occupies only a small fraction of it, and the available features are limited. In addition, people in a meeting room usually sit, with their bodies partially occluded by facilities such as tables, chairs, and computers, which further reduces the available information. Human shape detection in video conference scenes is therefore a difficult problem.
Traditional human shape detection algorithms typically extract features with hand-designed operators and analyze them. The Viola-Jones detector slides a window over all possible positions and scales in the image and checks whether an object is present in each window; by combining three key techniques, the integral image, feature selection, and detection cascades, it greatly improved detection speed. Histogram of Oriented Gradients (HOG) feature descriptors have also been used for pedestrian detection, balancing feature invariance (to translation, scale, illumination, etc.) against discriminability between object classes. The DPM algorithm improves and extends HOG: it consists of a root filter and several part filters, and raises detection precision through hard negative mining, bounding-box regression, and context priming. As one of the best traditional detection algorithms, DPM is fast and adapts to object deformation, but it cannot handle large rotations, so its stability is poor.
Deep learning, which has developed rapidly in recent years, is widely applied to detection. Target detection based on deep learning overcomes the traditional algorithms' dependence on hand-designed features. Current detectors fall into two categories, two-stage and single-stage: a two-stage detector works in two steps, first generating candidate regions and then classifying them, as in the R-CNN series; a single-stage detector, by contrast, needs no separate candidate-region search, with SSD and the YOLO series as typical examples. Between the two, candidate-region-based two-stage methods are superior in detection and localization accuracy, while end-to-end single-stage methods are superior in speed. However, these algorithms usually target general multi-class detection and perform well only when object features are rich, targets are large, distribution is sparse, and illumination is consistent. Because of the complexity of reality, an actual meeting-room scene is likely to deviate substantially from a general training set; although building a dedicated dataset can compensate for some of this, general deep-learning detection models remain inadequate at dense detection, small-target detection, and capturing occluded human bodies.
Therefore, solving the problems of dense human targets, uneven illumination, wide pictures with small targets, and occlusion by irregular objects in a conference room is the key to improving human shape detection in video conference scenes, and is of significant research value for improving detection performance, video conference call quality, and meeting experience.
Disclosure of Invention
The invention aims to solve the problems of dense human targets, uneven illumination, wide pictures, small targets, and occlusion by irregular objects in the human shape detection task of video conference scenes, thereby improving the performance of the detection algorithm, raising detection accuracy and recall, and improving metrics such as detection IoU precision. The deep-learning-based video conference scene human shape detection model of the invention uses a bounding-box regression network that simultaneously computes the bounding-box position, confidence, and center weight of each human target, so that densely arranged human targets are better detected and both missed and duplicate detections are avoided. At the same time, an adaptive focal loss is introduced to train the model and address sample imbalance, chiefly the balance between positive and negative samples and between hard and easy samples. The model thus adapts better to the human shape detection scene, greatly improving detection performance; the method is ingenious and novel and has good application prospects.
In order to achieve the above purpose, the invention adopts the following technical scheme:
the video conference scene human shape detection method based on deep learning comprises the following steps:
step (A), acquiring an original panoramic image of the conference scene using a camera and rectifying it;
step (B), splicing and mapping the rectified rectangular panoramic image into a square image of equal length and width, then performing data enhancement and normalization preprocessing, the data enhancement specifically comprising random flipping, region cropping, and region masking and recombination;
step (C), constructing a deep learning model based on a residual network and feature pyramid network as the baseline model, inputting the image processed in step (B), and outputting rectangular bounding boxes of all human bodies in the image;
step (D), introducing a bounding-box regression network on top of the baseline model of step (C), and simultaneously computing the bounding-box position, confidence, and center weight of each human target;
and step (E), introducing an adaptive focal loss to train the model, inputting the conference scene image obtained in step (B) into the trained human shape detection model, and detecting the positions of the corresponding conference participants.
Further, the video conference scene human shape detection method based on deep learning comprises the following steps:
step (A), acquiring a rectangular panoramic image of the conference scene using a camera and rectifying it, comprising the following steps:
step (A1), placing a camera device with a 180-degree fisheye lens at the center of the conference room and shooting a distorted panoramic image of the conference room;
and step (A2), rectifying the collected panoramic image of the conference room using the OpenCV checkerboard calibration method: shooting a checkerboard picture of fixed size with the fisheye camera; binarizing the image and applying erosion and dilation; traversing the contour of each square to obtain the corner points of all the small squares; calibrating the lens parameters according to the distortion of these corner points to obtain the camera lens parameters; and then inputting the image to be rectified and performing a coordinate transformation based on the correspondence between coordinates before and after lens distortion, yielding a normal, undistorted rectangular panoramic picture.
Step (B), splicing and mapping the rectified rectangular panoramic image into a square image of equal length and width, then performing data enhancement and normalization, the data enhancement specifically comprising random flipping, region cropping, and region masking and recombination, comprises the following steps:
step (B1), taking the rectified rectangular panoramic image output by the camera in step (A), of size 3000 × 1000 and containing the 360-degree annular scene information of the conference room; cutting out two 2000 × 1000 rectangles along the vertical direction and splicing them top to bottom into a 2000 × 2000 square image, so as to fit the input aspect ratio of the deep-learning detector;
step (B2), mapping the spliced square image one to one onto the original rectangular image: the upper half of the square image maps directly to x ∈ [0, 2000) of the original rectangular image, and the lower half is a 2000 × 1000 image spliced in sections from x ∈ [1500, 3000) and x ∈ [0, 500) of the original image, so that the wrap-around seam of the panorama does not split a target;
step (B3), after the detection results are mapped back to original-image positions, performing non-maximum suppression to avoid duplicate detections in the overlap produced by the splicing of step (B1);
step (B4), performing data enhancement on the spliced square image: randomly flipping it up-down and left-right on the basis of the original image, then randomly cropping partial image regions containing human targets, and covering image regions containing no human target by smearing or mosaicking;
and step (B5), normalizing the data-enhanced square image so that each pixel value becomes a decimal in the interval (0, 1), and compressing the image to 512 × 512 as the input image of the model.
Step (C), constructing a deep learning model based on a residual network and feature pyramid network as the baseline model, inputting the image processed in step (B), and outputting rectangular bounding boxes of all human bodies in the image, comprises the following steps:
step (C1), inputting the image processed in step (B) and constructing the baseline model by connecting a residual convolutional network and a feature pyramid network in sequence;
step (C2), using the residual convolutional network, which learns the spatial semantic features of the original image, as the backbone, and adopting the feature pyramid network to fuse multi-scale image features and model features at different scales;
and step (C3), using a shallow convolutional network followed by a fully connected layer as the detection head to obtain the positions of the target human shapes, obtaining anchor boxes adapted to the human shapes in the dataset using the k-means clustering algorithm, and then outputting rectangular bounding boxes of all human bodies in the image.
Step (D), introducing a bounding-box regression network on top of the baseline model of step (C) to compute the bounding-box position, confidence, and center weight of each human target, comprises the following steps:
step (D1), introducing the bounding-box regression network: the first branch takes as input the feature map output by the first level of the feature pyramid network and, through multi-layer convolution that changes the number of channels of the upper-layer input, regresses the boundary of the target candidate region; it outputs a tensor of shape H × W × 5, where H and W are the length and width output by the upper layer and 5 is the number of channels, encoding the distances (l, t, r, b) from the center point of the current detection region to the top, bottom, left, and right boundaries of the target human body, together with a confidence score;
step (D2), mapping the feature map one to one onto original-image positions: let d be the reduction ratio between the original image and the feature map; the original-image center point corresponding to coordinate (x, y) in the feature map is then taken as (xc, yc) = (d·x + d/2, d·y + d/2) (the published expression is rendered as an image; the form shown here is reconstructed from the definition of d). For the human body actually present in the current region, the bounding-box regression network regresses the distances (l, t, r, b) from this center point to the target's top, bottom, left, and right boundaries, and these four values determine the corner coordinates of the target's bounding box:
x1 = xc − l, y1 = yc − t, x2 = xc + r, y2 = yc + b;
wherein (x1, y1) and (x2, y2) are the coordinates of the upper-left and lower-right corner points of the bounding box of the human target, d is the reduction ratio between the original image and the feature map, and H and W are the length and width of the network input feature map;
and step (D3), taking the same input as step (D1) together with the feature map output by the second level of the feature pyramid network, the second branch of multi-layer convolution changes the number of channels of the upper-layer input and outputs a tensor of shape H × W × 1 whose value ω represents the distance coefficient between the center point of the current region and the center point of the ground-truth box; this center weight is used to ensure, as far as possible, that each detection region detects only the real human body closest to it.
Step (E), introducing an adaptive focal loss to train the model, inputting the conference scene image obtained in step (B) into the trained human shape detection model, and detecting the positions of the corresponding conference participants, comprises the following steps:
step (E1), constructing the training loss by adding two hyperparameters, the weight coefficients α and γ, to the cross-entropy loss, thereby introducing an adaptive focal loss as the binary classification loss for judging whether an object exists in the current detection region;
step (E2), inputting the conference scene image obtained in step (B) and using the center weight ω output in step (D3) as a parameter of the loss function: the farther the center of the detection region lies from the center of the real human body, the smaller ω is and the smaller its contribution to the loss; when no human body exists in the detection region, ω = 0;
the loss function is formulated as follows:
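The published formula is rendered as an image and does not survive in this text. A plausible reconstruction, assuming the standard focal-loss form extended by the center weight ω and the stabilizer σ defined below, is:

$$\mathcal{L} = -\alpha\,\omega\,(1-p)^{\gamma}\log(p+\sigma)\;-\;(1-\alpha)\,p^{\gamma}\log(1-p+\sigma)$$

Here the first term applies at positions containing a human body and the second at background positions; since ω = 0 where no human body exists, the positive term vanishes there, consistent with step (E2).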
wherein the weight α is used to balance the imbalance between positive and negative samples, the weight γ is used to distinguish hard samples from easy ones, p is the predicted confidence value, ω is the center weight output in step (D3), and σ is an arbitrarily small number greater than 0 used to prevent division by zero;
in step (E3), γ is 0 in the initial state; as γ increases, the influence of the modulating factor grows, that is, the loss contributed by easy samples is progressively suppressed, and the larger the value of γ, the more the easy-sample loss is reduced.
The beneficial effects of the invention are as follows: first, the anchor-box clustering of the original deep-learning-based detection algorithm is discarded in favor of direct regression, with a distance operator representing the position of the human body, and on the basis of the regressed distances, the confidence and center-weighting parameters are computed by regression at the same time; second, an adaptive focal loss is designed to replace the cross-entropy loss commonly used for training, accelerating model convergence and improving detection accuracy in complex scenes. The method therefore improves the robustness and performance of human body detection, is ingenious and novel, and has good application prospects.
Drawings
FIG. 1 is a flow chart of a method for detecting human forms in a video conference scene based on deep learning according to the present invention;
FIG. 2 is a schematic diagram of clipping and splicing rectangular images output by the panoramic camera in the invention;
FIG. 3 is a block diagram of the overall structure of the proposed model;
FIG. 4 is a structural diagram of the detection-head network designed by the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings.
As shown in FIG. 1 and FIG. 3, the video conference scene human shape detection method based on deep learning of the present invention comprises the following steps. Step (A), acquiring a rectangular panoramic image of the conference scene using a camera and rectifying it, comprises the following steps:
step (A1), placing a camera device with a 180-degree fisheye lens at the center of the conference room and shooting a distorted panoramic image of the conference room;
and step (A2), rectifying the collected panoramic image of the conference room using the OpenCV checkerboard calibration method: shooting a checkerboard picture of fixed size with the fisheye camera; binarizing the image and applying erosion and dilation; traversing the contour of each square to obtain the corner points of all the small squares; calibrating the lens parameters according to the distortion of these corner points to obtain the camera lens parameters; and then inputting the image to be rectified and performing a coordinate transformation based on the correspondence between coordinates before and after lens distortion, yielding a normal, undistorted rectangular panoramic picture.
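A minimal Python sketch of the calibration and rectification of step (A2) follows. It relies on OpenCV's built-in checkerboard corner detection rather than the manual binarization and erosion/dilation pipeline described above, and the board size, file paths, and flags are illustrative assumptions, not values taken from the invention.

```python
# Sketch of the step (A2) calibration/rectification using OpenCV's fisheye module.
import glob
import cv2
import numpy as np

PATTERN = (9, 6)  # inner-corner grid of the checkerboard (assumed)
objp = np.zeros((1, PATTERN[0] * PATTERN[1], 3), np.float32)
objp[0, :, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for path in glob.glob("calib/*.jpg"):          # checkerboard shots (assumed path)
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Calibrate the fisheye lens parameters K (intrinsics) and D (distortion).
K, D = np.zeros((3, 3)), np.zeros((4, 1))
cv2.fisheye.calibrate(obj_points, img_points, gray.shape[::-1], K, D,
                      flags=cv2.fisheye.CALIB_RECOMPUTE_EXTRINSICS)

# Rectify an incoming distorted panorama with the calibrated parameters.
img = cv2.imread("meeting_room.jpg")           # image to be rectified (assumed path)
h, w = img.shape[:2]
map1, map2 = cv2.fisheye.initUndistortRectifyMap(
    K, D, np.eye(3), K, (w, h), cv2.CV_16SC2)
rectified = cv2.remap(img, map1, map2, interpolation=cv2.INTER_LINEAR)
```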
Step (B), as shown in FIG. 2, splicing and mapping the rectified rectangular panoramic image into a square image of equal length and width, then performing data enhancement and normalization, the data enhancement specifically comprising random flipping, region cropping, and region masking and recombination, comprises the following steps:
step (B1), taking the rectified rectangular panoramic image output by the image acquisition equipment in step (A), of size 3000 × 1000 and containing the 360-degree annular scene information of the conference room; cutting out two 2000 × 1000 rectangles along the vertical direction and splicing them top to bottom into a 2000 × 2000 square image, so as to fit the input aspect ratio of the deep-learning detector;
step (B2), mapping the spliced square image one to one onto the original rectangular image: the upper half of the square image maps directly to x ∈ [0, 2000) of the original rectangular image, and the lower half is a 2000 × 1000 image spliced in sections from x ∈ [1500, 3000) and x ∈ [0, 500) of the original image, so that the wrap-around seam of the panorama does not split a target (a code sketch of this splicing and mapping follows step (B5) below);
step (B3), after the detection results are mapped back to original-image positions, performing non-maximum suppression to avoid duplicate detections in the overlap produced by the splicing of step (B1);
step (B4), performing data enhancement on the spliced square image: randomly flipping it up-down and left-right on the basis of the original image, then randomly cropping partial image regions containing human targets, and covering image regions containing no human target by smearing or mosaicking;
and step (B5), normalizing the data-enhanced square image so that each pixel value becomes a decimal in the interval (0, 1), and compressing the image to 512 × 512 as the input image of the model.
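The following is a minimal NumPy sketch of the splicing and inverse mapping of steps (B1)-(B2). The (H, W, C) array layout and the order in which the two wrap-around sections of the lower half are concatenated are assumptions of this sketch.

```python
# Sketch of the steps (B1)-(B2) splicing and inverse mapping with NumPy.
import numpy as np

def panorama_to_square(pano: np.ndarray) -> np.ndarray:
    """pano: rectified 3000x1000 panorama, shape (1000, 3000, 3)."""
    top = pano[:, 0:2000]                              # x in [0, 2000)
    bottom = np.concatenate([pano[:, 1500:3000],       # x in [1500, 3000)
                             pano[:, 0:500]], axis=1)  # wraps to x in [0, 500)
    return np.concatenate([top, bottom], axis=0)       # (2000, 2000, 3)

def square_x_to_pano_x(x_sq: int, y_sq: int) -> int:
    """Map a square-image coordinate back to the panorama x coordinate."""
    if y_sq < 1000:                  # upper half: direct mapping
        return x_sq
    return (x_sq + 1500) % 3000      # lower half: undo the wrap-around
```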
Step (C), constructing a deep learning model based on a residual network and feature pyramid network as the baseline model, inputting the image processed in step (B), and outputting rectangular bounding boxes of all human bodies in the image, comprises the following steps:
step (C1), inputting the image processed in step (B) and constructing the baseline model by connecting a residual convolutional network and a feature pyramid network in sequence;
step (C2), using the residual convolutional network, which learns the spatial semantic features of the original image, as the backbone, and adopting the feature pyramid network to fuse multi-scale image features and model features at different scales;
and step (C3), using a shallow convolutional network followed by a fully connected layer as the detection head to obtain the positions of the target human shapes, obtaining anchor boxes adapted to the human shapes in the dataset using the k-means clustering algorithm, and then outputting rectangular bounding boxes of all human bodies in the image.
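A minimal PyTorch sketch of the step (C) baseline, a residual backbone feeding a feature pyramid network, is given below. The choice of ResNet-50, the 256-channel pyramid, and the level names are assumptions of this sketch, not specified by the invention.

```python
# Sketch of the step (C) baseline: residual backbone + feature pyramid network.
from collections import OrderedDict

import torch
from torchvision.models import resnet50
from torchvision.ops import FeaturePyramidNetwork

class Baseline(torch.nn.Module):
    def __init__(self):
        super().__init__()
        r = resnet50()
        self.stem = torch.nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.stages = torch.nn.ModuleList([r.layer1, r.layer2, r.layer3, r.layer4])
        self.fpn = FeaturePyramidNetwork([256, 512, 1024, 2048], 256)

    def forward(self, x):                      # x: (N, 3, 512, 512)
        x = self.stem(x)
        feats = OrderedDict()
        for i, stage in enumerate(self.stages):
            x = stage(x)
            feats[f"p{i + 2}"] = x
        return self.fpn(feats)                 # multi-scale 256-channel maps
```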
Step (D), introducing a bounding-box regression network on top of the baseline model of step (C) to compute the bounding-box position, confidence, and center weight of each human target, as shown in FIG. 4, comprises the following steps:
step (D1), introducing the bounding-box regression network: the first branch takes as input the feature map output by the first level of the feature pyramid network and, through multi-layer convolution that changes the number of channels of the upper-layer input, regresses the boundary of the target candidate region; it outputs a tensor of shape H × W × 5, where H and W are the length and width output by the upper layer and 5 is the number of channels, encoding the distances (l, t, r, b) from the center point of the current detection region to the top, bottom, left, and right boundaries of the target human body, together with a confidence score;
step (D2), mapping the feature map one to one onto original-image positions: let d be the reduction ratio between the original image and the feature map; the original-image center point corresponding to coordinate (x, y) in the feature map is then taken as (xc, yc) = (d·x + d/2, d·y + d/2) (the published expression is rendered as an image; the form shown here is reconstructed from the definition of d). For the human body actually present in the current region, the bounding-box regression network regresses the distances (l, t, r, b) from this center point to the target's top, bottom, left, and right boundaries, and these four values determine the corner coordinates of the target's bounding box:
x1 = xc − l, y1 = yc − t, x2 = xc + r, y2 = yc + b;
wherein (x1, y1) and (x2, y2) are the coordinates of the upper-left and lower-right corner points of the bounding box of the human target, d is the reduction ratio between the original image and the feature map, and H and W are the length and width of the network input feature map;
and step (D3), taking the same input as step (D1) together with the feature map output by the second level of the feature pyramid network, the second branch of multi-layer convolution changes the number of channels of the upper-layer input and outputs a tensor of shape H × W × 1 whose value ω represents the distance coefficient between the center point of the current region and the center point of the ground-truth box; this center weight is used to ensure, as far as possible, that each detection region detects only the real human body closest to it.
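A minimal PyTorch sketch of decoding the step (D) outputs into boxes follows: the first branch's H × W × 5 tensor of (l, t, r, b) distances plus confidence, and the second branch's H × W × 1 center-weight map. The center-point mapping follows the reconstruction given in step (D2), and fusing confidence with the center weight by multiplication is an assumption of this sketch.

```python
# Sketch of decoding the step (D) regression and center-weight outputs.
import torch

def decode_boxes(reg: torch.Tensor, center_w: torch.Tensor,
                 d: int, score_thresh: float = 0.5):
    """reg: (H, W, 5) = (l, t, r, b, conf); center_w: (H, W, 1); d: stride."""
    H, W, _ = reg.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    xc, yc = d * xs + d // 2, d * ys + d // 2    # feature cell -> image center
    l, t, r, b, conf = reg.unbind(-1)
    score = conf * center_w.squeeze(-1)          # down-weight off-center cells
    keep = score > score_thresh
    boxes = torch.stack([xc[keep] - l[keep], yc[keep] - t[keep],
                         xc[keep] + r[keep], yc[keep] + b[keep]], dim=-1)
    return boxes, score[keep]                    # boxes: (K, 4) = (x1, y1, x2, y2)
```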
Step (E), introducing an adaptive focal loss to train the model, inputting the conference scene image obtained in step (B) into the trained human shape detection model, and detecting the positions of the corresponding conference participants, comprises the following steps:
step (E1), constructing the training loss by adding two hyperparameters, the weight coefficients α and γ, to the cross-entropy loss, thereby introducing an adaptive focal loss as the binary classification loss for judging whether an object exists in the current detection region;
step (E2), inputting the conference scene image obtained in step (B) and using the center weight ω output in step (D3) as a parameter of the loss function: the farther the center of the detection region lies from the center of the real human body, the smaller ω is and the smaller its contribution to the loss; when no human body exists in the detection region, ω = 0;
the loss function is formulated as follows:
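As above, the published formula is rendered as an image; a plausible reconstruction, following the standard focal loss and the parameter definitions below, is:

$$\mathcal{L} = -\alpha\,\omega\,(1-p)^{\gamma}\log(p+\sigma)\;-\;(1-\alpha)\,p^{\gamma}\log(1-p+\sigma)$$

where the first term applies at positions containing a human body and the second at background positions; since ω = 0 where no human body exists, the positive term vanishes there, consistent with step (E2).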
wherein the weight α is used to balance the imbalance between positive and negative samples, the weight γ is used to distinguish hard samples from easy ones, p is the predicted confidence value, ω is the center weight output in step (D3), and σ is an arbitrarily small number greater than 0 used to prevent division by zero;
in step (E3), γ is 0 in the initial state; as γ increases, the influence of the modulating factor grows, that is, the loss contributed by easy samples is progressively suppressed, and the larger the value of γ, the more the easy-sample loss is reduced.
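A minimal PyTorch sketch of the adaptive focal loss of step (E) is given below. It implements the hedged reconstruction presented above, so the exact placement of ω and σ, as well as the default values α = 0.25 and γ = 2.0, are assumptions of this sketch rather than the published formula.

```python
# Sketch of the step (E) adaptive focal loss under the hedged reconstruction.
import torch

def adaptive_focal_loss(p: torch.Tensor, target: torch.Tensor,
                        omega: torch.Tensor, alpha: float = 0.25,
                        gamma: float = 2.0, sigma: float = 1e-8) -> torch.Tensor:
    """p, target, omega: (H, W) maps; target is 1 where a human body exists."""
    pos = -alpha * omega * (1 - p) ** gamma * torch.log(p + sigma)
    neg = -(1 - alpha) * p ** gamma * torch.log(1 - p + sigma)
    return (target * pos + (1 - target) * neg).mean()
```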
To fully evaluate the performance of the proposed deep-learning-based video conference scene human shape detection method, experiments were arranged on a self-built conference room human shape detection dataset containing 9183 pictures, each conference room picture containing about 9 people on average. The mean average precision (mAP) of the model on the test set of this dataset is used as the performance metric: mAP@0.5 denotes the mean average precision at an IoU threshold of 0.5, and mAP@0.75 likewise at a threshold of 0.75. Once the IoU threshold is fixed, each prediction is judged by whether its intersection-over-union with a ground-truth box exceeds the threshold, from which precision and recall at different confidence (conf) levels are computed; the results are then averaged over thresholds IoU = 0.5:0.05:0.95 (i.e., AP values computed at every 0.05 step over the IoU range 0.5 to 0.95), yielding the technical index mAP adopted in the experiments of the present invention. The experimental results show that the baseline model proposed by the invention achieves an mAP of 48.7; after the bounding-box regression network, the center-weight network, and the adaptive focal loss are introduced, the accuracy improves to 73.6.
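A small sketch of the COCO-style averaging described above, in which AP is computed at IoU thresholds from 0.5 to 0.95 in steps of 0.05 and averaged; average_precision is a hypothetical helper returning the AP of the predictions at a single threshold.

```python
# Sketch of mAP@[0.5:0.05:0.95]; `average_precision` is a hypothetical helper.
import numpy as np

def map_50_95(preds, gts, average_precision):
    thresholds = np.arange(0.5, 1.0, 0.05)   # 0.50, 0.55, ..., 0.95
    return float(np.mean([average_precision(preds, gts, iou_thresh=t)
                          for t in thresholds]))
```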
In summary, the deep-learning-based video conference scene human shape detection method provided by the invention first discards the anchor-box clustering of the original deep-learning detection algorithm in favor of regression, representing the position of the human body with a distance operator and simultaneously regressing the confidence and center-weighting parameters on the basis of the regressed distances; second, it designs an adaptive focal loss to replace the cross-entropy loss commonly used for training, accelerating model convergence and improving detection accuracy in complex scenes. The method therefore improves the robustness and performance of human body detection, is ingenious and novel, and has good application prospects.
The foregoing illustrates and describes the principles, main features, and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are presented in the specification and drawings only to illustrate the principle of the invention; various changes and modifications may be made without departing from the spirit and scope of the present invention, and such changes and modifications fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and their equivalents.