Video conference scene human shape detection method based on deep learning
Technical Field
The invention relates to the field of machine vision, in particular to a video conference scene human shape detection method based on deep learning.
Background
Human shape detection in video conferencing is commonly used to implement speaker focusing, directional speech enhancement, and similar functions. In a complex meeting-room scene, however, factors such as a large number of participants, densely packed seats, unbalanced lighting, and the random movement of people greatly degrade the performance of conventional target detection algorithms. Moreover, the camera used in a conference room is usually an ultra-wide-angle or panoramic camera: the picture is wide, a single human target occupies only a small fraction of it, and the available features are limited. In addition, people in a meeting room usually sit, with their bodies partially occluded by facilities such as tables, chairs, and computers, which further reduces the available information. Human shape detection in video conference scenes is therefore a difficult problem.
Traditional human shape detection algorithms typically extract features with hand-designed operators and analyze them. The Viola-Jones detector slides a window over all possible positions and scales in the image and checks whether an object is present in each window; by combining three key techniques, the integral image, feature selection, and detection cascades, it greatly improved detection speed. Histogram of Oriented Gradients (HOG) feature descriptors have also been used for pedestrian detection, balancing feature invariance (to translation, scale, illumination, etc.) against discriminability between object classes. The DPM algorithm improves and extends HOG: it consists of a root filter and several part filters, and raises detection precision through hard negative mining, bounding-box regression, and context priming. As one of the best traditional detection algorithms, DPM is fast and adapts to object deformation, but it cannot handle large rotations, so its stability is poor.
Deep learning, which has developed rapidly in recent years, is widely applied to detection. Target detection based on deep learning overcomes the traditional algorithms' dependence on hand-designed features. Current detectors fall into two categories, two-stage and single-stage: a two-stage detector works in two steps, first generating candidate regions and then classifying them, as in the R-CNN series; a single-stage detector, by contrast, needs no separate candidate-region search, with SSD and the YOLO series as typical examples. Between the two, candidate-region-based two-stage methods are superior in detection and localization accuracy, while end-to-end single-stage methods are superior in speed. However, these algorithms usually target general multi-class detection and perform well only when object features are rich, targets are large, distribution is sparse, and illumination is consistent. Because of the complexity of reality, an actual meeting-room scene is likely to deviate substantially from a general training set; although building a dedicated dataset can compensate for some of this, general deep-learning detection models remain inadequate at dense detection, small-target detection, and capturing occluded human bodies.
Therefore, solving the problems of dense human targets, uneven illumination, wide pictures with small targets, and occlusion by irregular objects in a conference room is the key to improving human shape detection in video conference scenes, and is of significant research value for improving detection performance, video conference call quality, and meeting experience.
Disclosure of Invention
The invention aims to solve the problems of dense human targets, uneven illumination, wide pictures, small targets, and occlusion by irregular objects in the human shape detection task of video conference scenes, thereby improving the performance of the detection algorithm, raising detection accuracy and recall, and improving metrics such as detection IoU precision. The deep-learning-based video conference scene human shape detection model of the invention uses a bounding-box regression network that simultaneously computes the bounding-box position, confidence, and center weight of each human target, so that densely arranged human targets are better detected and both missed and duplicate detections are avoided. At the same time, an adaptive focal loss is introduced to train the model and address sample imbalance, chiefly the balance between positive and negative samples and between hard and easy samples. The model thus adapts better to the human shape detection scene, greatly improving detection performance; the method is ingenious and novel and has good application prospects.
In order to achieve the above purpose, the invention adopts the following technical scheme:
the video conference scene human shape detection method based on deep learning comprises the following steps:
step (A), acquiring an original panoramic image of the conference scene using a camera and rectifying it;
step (B), splicing and mapping the rectified rectangular panoramic image into a square image of equal length and width, then performing data enhancement and normalization preprocessing, the data enhancement specifically comprising random flipping, region cropping, and region masking and recombination;
step (C), constructing a deep learning model based on a residual network and feature pyramid network as the baseline model, inputting the image processed in step (B), and outputting rectangular bounding boxes of all human bodies in the image;
step (D), introducing a bounding-box regression network on top of the baseline model of step (C), and simultaneously computing the bounding-box position, confidence, and center weight of each human target;
and step (E), introducing an adaptive focal loss to train the model, inputting the conference scene image obtained in step (B) into the trained human shape detection model, and detecting the positions of the corresponding conference participants.
Further, the video conference scene human shape detection method based on deep learning comprises the following steps:
step (A), acquiring a rectangular panoramic image of the conference scene using a camera and rectifying it, comprising the following steps:
step (A1), placing a camera device with a 180-degree fisheye lens at the center of the conference room and shooting a distorted panoramic image of the conference room;
and step (A2), rectifying the collected panoramic image of the conference room using the OpenCV checkerboard calibration method: shooting a checkerboard picture of fixed size with the fisheye camera; binarizing the image and applying erosion and dilation; traversing the contour of each square to obtain the corner points of all the small squares; calibrating the lens parameters according to the distortion of these corner points to obtain the camera lens parameters; and then inputting the image to be rectified and performing a coordinate transformation based on the correspondence between coordinates before and after lens distortion, yielding a normal, undistorted rectangular panoramic picture.
Step (B), splicing and mapping the rectified rectangular panoramic image into a square image of equal length and width, then performing data enhancement and normalization, the data enhancement specifically comprising random flipping, region cropping, and region masking and recombination, comprises the following steps:
step (B1), taking the rectified rectangular panoramic image output by the camera in step (A), of size 3000 × 1000 and containing the 360-degree annular scene information of the conference room; cutting out two 2000 × 1000 rectangles along the vertical direction and splicing them top to bottom into a 2000 × 2000 square image, so as to fit the input aspect ratio of the deep-learning detector;
step (B2), mapping the spliced square image one to one onto the original rectangular image: the upper half of the square image maps directly to x ∈ [0, 2000) of the original rectangular image, and the lower half is a 2000 × 1000 image spliced in sections from x ∈ [1500, 3000) and x ∈ [0, 500) of the original image, so that the wrap-around seam of the panorama does not split a target;
step (B3), after the detection results are mapped back to original-image positions, performing non-maximum suppression to avoid duplicate detections in the overlap produced by the splicing of step (B1);
step (B4), performing data enhancement on the spliced square image: randomly flipping it up-down and left-right on the basis of the original image, then randomly cropping partial image regions containing human targets, and covering image regions containing no human target by smearing or mosaicking;
and step (B5), normalizing the data-enhanced square image so that each pixel value becomes a decimal in the interval (0, 1), and compressing the image to 512 × 512 as the input image of the model.
Step (C), constructing a deep learning model based on a residual network and feature pyramid network as the baseline model, inputting the image processed in step (B), and outputting rectangular bounding boxes of all human bodies in the image, comprises the following steps:
step (C1), inputting the image processed in step (B) and constructing the baseline model by connecting a residual convolutional network and a feature pyramid network in sequence;
step (C2), using the residual convolutional network, which learns the spatial semantic features of the original image, as the backbone, and adopting the feature pyramid network to fuse multi-scale image features and model features at different scales;
and step (C3), using a shallow convolutional network followed by a fully connected layer as the detection head to obtain the positions of the target human shapes, obtaining anchor boxes adapted to the human shapes in the dataset using the k-means clustering algorithm, and then outputting rectangular bounding boxes of all human bodies in the image.
Step (D), introducing a bounding-box regression network on top of the baseline model of step (C) to compute the bounding-box position, confidence, and center weight of each human target, comprises the following steps:
step (D1), introducing the bounding-box regression network: the first branch takes as input the feature map output by the first level of the feature pyramid network and, through multi-layer convolution that changes the number of channels of the upper-layer input, regresses the boundary of the target candidate region; it outputs a tensor of shape H × W × 5, where H and W are the length and width output by the upper layer and 5 is the number of channels, encoding the distances (l, t, r, b) from the center point of the current detection region to the top, bottom, left, and right boundaries of the target human body, together with a confidence score;
step (D2), mapping the feature map one to one onto original-image positions: let d be the reduction ratio between the original image and the feature map; the original-image center point corresponding to coordinate (x, y) in the feature map is then taken as (xc, yc) = (d·x + d/2, d·y + d/2) (the published expression is rendered as an image; the form shown here is reconstructed from the definition of d). For the human body actually present in the current region, the bounding-box regression network regresses the distances (l, t, r, b) from this center point to the target's top, bottom, left, and right boundaries, and these four values determine the corner coordinates of the target's bounding box:
x1 = xc − l, y1 = yc − t, x2 = xc + r, y2 = yc + b;
wherein (x1, y1) and (x2, y2) are the coordinates of the upper-left and lower-right corner points of the bounding box of the human target, d is the reduction ratio between the original image and the feature map, and H and W are the length and width of the network input feature map;
and step (D3), taking the same input as step (D1) together with the feature map output by the second level of the feature pyramid network, the second branch of multi-layer convolution changes the number of channels of the upper-layer input and outputs a tensor of shape H × W × 1 whose value ω represents the distance coefficient between the center point of the current region and the center point of the ground-truth box; this center weight is used to ensure, as far as possible, that each detection region detects only the real human body closest to it.
Step (E), introducing an adaptive focal loss to train the model, inputting the conference scene image obtained in step (B) into the trained human shape detection model, and detecting the positions of the corresponding conference participants, comprises the following steps:
step (E1), constructing the training loss by adding two hyperparameters, the weight coefficients α and γ, to the cross-entropy loss, thereby introducing an adaptive focal loss as the binary classification loss for judging whether an object exists in the current detection region;
step (E2), inputting the conference scene image obtained in step (B) and using the center weight ω output in step (D3) as a parameter of the loss function: the farther the center of the detection region lies from the center of the real human body, the smaller ω is and the smaller its contribution to the loss; when no human body exists in the detection region, ω = 0;
the loss function is formulated as follows:
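The published formula is rendered as an image and does not survive in this text. A plausible reconstruction, assuming the standard focal-loss form extended by the center weight ω and the stabilizer σ defined below, is:

$$\mathcal{L} = -\alpha\,\omega\,(1-p)^{\gamma}\log(p+\sigma)\;-\;(1-\alpha)\,p^{\gamma}\log(1-p+\sigma)$$

Here the first term applies at positions containing a human body and the second at background positions; since ω = 0 where no human body exists, the positive term vanishes there, consistent with step (E2).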
wherein the weight α is used to balance the imbalance between positive and negative samples, the weight γ is used to distinguish hard samples from easy ones, p is the predicted confidence value, ω is the center weight output in step (D3), and σ is an arbitrarily small number greater than 0 used to prevent division by zero;
in step (E3), γ is 0 in the initial state; as γ increases, the influence of the modulating factor grows, that is, the loss contributed by easy samples is progressively suppressed, and the larger the value of γ, the more the easy-sample loss is reduced.
The beneficial effects of the invention are as follows: first, the anchor-box clustering of the original deep-learning-based detection algorithm is discarded in favor of direct regression, with a distance operator representing the position of the human body, and on the basis of the regressed distances, the confidence and center-weighting parameters are computed by regression at the same time; second, an adaptive focal loss is designed to replace the cross-entropy loss commonly used for training, accelerating model convergence and improving detection accuracy in complex scenes. The method therefore improves the robustness and performance of human body detection, is ingenious and novel, and has good application prospects.
Drawings
FIG. 1 is a flow chart of a method for detecting human forms in a video conference scene based on deep learning according to the present invention;
FIG. 2 is a schematic diagram of clipping and splicing rectangular images output by the panoramic camera in the invention;
FIG. 3 is a block diagram of the overall structure of the proposed model;
FIG. 4 is a structural diagram of the detection-head network designed by the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings.
As shown in FIG. 1 and FIG. 3, the video conference scene human shape detection method based on deep learning of the present invention comprises the following steps. Step (A), acquiring a rectangular panoramic image of the conference scene using a camera and rectifying it, comprises the following steps:
step (A1), placing a camera device with a 180-degree fisheye lens at the center of the conference room and shooting a distorted panoramic image of the conference room;
and step (A2), rectifying the collected panoramic image of the conference room using the OpenCV checkerboard calibration method: shooting a checkerboard picture of fixed size with the fisheye camera; binarizing the image and applying erosion and dilation; traversing the contour of each square to obtain the corner points of all the small squares; calibrating the lens parameters according to the distortion of these corner points to obtain the camera lens parameters; and then inputting the image to be rectified and performing a coordinate transformation based on the correspondence between coordinates before and after lens distortion, yielding a normal, undistorted rectangular panoramic picture.
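A minimal Python sketch of the calibration and rectification of step (A2) follows. It relies on OpenCV's built-in checkerboard corner detection rather than the manual binarization and erosion/dilation pipeline described above, and the board size, file paths, and flags are illustrative assumptions, not values taken from the invention.

```python
# Sketch of the step (A2) calibration/rectification using OpenCV's fisheye module.
import glob
import cv2
import numpy as np

PATTERN = (9, 6)  # inner-corner grid of the checkerboard (assumed)
objp = np.zeros((1, PATTERN[0] * PATTERN[1], 3), np.float32)
objp[0, :, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for path in glob.glob("calib/*.jpg"):          # checkerboard shots (assumed path)
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Calibrate the fisheye lens parameters K (intrinsics) and D (distortion).
K, D = np.zeros((3, 3)), np.zeros((4, 1))
cv2.fisheye.calibrate(obj_points, img_points, gray.shape[::-1], K, D,
                      flags=cv2.fisheye.CALIB_RECOMPUTE_EXTRINSICS)

# Rectify an incoming distorted panorama with the calibrated parameters.
img = cv2.imread("meeting_room.jpg")           # image to be rectified (assumed path)
h, w = img.shape[:2]
map1, map2 = cv2.fisheye.initUndistortRectifyMap(
    K, D, np.eye(3), K, (w, h), cv2.CV_16SC2)
rectified = cv2.remap(img, map1, map2, interpolation=cv2.INTER_LINEAR)
```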
Step (B), as shown in FIG. 2, splicing and mapping the rectified rectangular panoramic image into a square image of equal length and width, then performing data enhancement and normalization, the data enhancement specifically comprising random flipping, region cropping, and region masking and recombination, comprises the following steps:
step (B1), taking the rectified rectangular panoramic image output by the image acquisition equipment in step (A), of size 3000 × 1000 and containing the 360-degree annular scene information of the conference room; cutting out two 2000 × 1000 rectangles along the vertical direction and splicing them top to bottom into a 2000 × 2000 square image, so as to fit the input aspect ratio of the deep-learning detector;
step (B2), mapping the spliced square image one to one onto the original rectangular image: the upper half of the square image maps directly to x ∈ [0, 2000) of the original rectangular image, and the lower half is a 2000 × 1000 image spliced in sections from x ∈ [1500, 3000) and x ∈ [0, 500) of the original image, so that the wrap-around seam of the panorama does not split a target (a code sketch of this splicing and mapping follows step (B5) below);
step (B3), after the detection results are mapped back to original-image positions, performing non-maximum suppression to avoid duplicate detections in the overlap produced by the splicing of step (B1);
step (B4), performing data enhancement on the spliced square image: randomly flipping it up-down and left-right on the basis of the original image, then randomly cropping partial image regions containing human targets, and covering image regions containing no human target by smearing or mosaicking;
and step (B5), normalizing the data-enhanced square image so that each pixel value becomes a decimal in the interval (0, 1), and compressing the image to 512 × 512 as the input image of the model.
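The following is a minimal NumPy sketch of the splicing and inverse mapping of steps (B1)-(B2). The (H, W, C) array layout and the order in which the two wrap-around sections of the lower half are concatenated are assumptions of this sketch.

```python
# Sketch of the steps (B1)-(B2) splicing and inverse mapping with NumPy.
import numpy as np

def panorama_to_square(pano: np.ndarray) -> np.ndarray:
    """pano: rectified 3000x1000 panorama, shape (1000, 3000, 3)."""
    top = pano[:, 0:2000]                              # x in [0, 2000)
    bottom = np.concatenate([pano[:, 1500:3000],       # x in [1500, 3000)
                             pano[:, 0:500]], axis=1)  # wraps to x in [0, 500)
    return np.concatenate([top, bottom], axis=0)       # (2000, 2000, 3)

def square_x_to_pano_x(x_sq: int, y_sq: int) -> int:
    """Map a square-image coordinate back to the panorama x coordinate."""
    if y_sq < 1000:                  # upper half: direct mapping
        return x_sq
    return (x_sq + 1500) % 3000      # lower half: undo the wrap-around
```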
Step (C), constructing a deep learning model based on a residual network and feature pyramid network as the baseline model, inputting the image processed in step (B), and outputting rectangular bounding boxes of all human bodies in the image, comprises the following steps:
step (C1), inputting the image processed in step (B) and constructing the baseline model by connecting a residual convolutional network and a feature pyramid network in sequence;
step (C2), using the residual convolutional network, which learns the spatial semantic features of the original image, as the backbone, and adopting the feature pyramid network to fuse multi-scale image features and model features at different scales;
and step (C3), using a shallow convolutional network followed by a fully connected layer as the detection head to obtain the positions of the target human shapes, obtaining anchor boxes adapted to the human shapes in the dataset using the k-means clustering algorithm, and then outputting rectangular bounding boxes of all human bodies in the image.
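A minimal PyTorch sketch of the step (C) baseline, a residual backbone feeding a feature pyramid network, is given below. The choice of ResNet-50, the 256-channel pyramid, and the level names are assumptions of this sketch, not specified by the invention.

```python
# Sketch of the step (C) baseline: residual backbone + feature pyramid network.
from collections import OrderedDict

import torch
from torchvision.models import resnet50
from torchvision.ops import FeaturePyramidNetwork

class Baseline(torch.nn.Module):
    def __init__(self):
        super().__init__()
        r = resnet50()
        self.stem = torch.nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.stages = torch.nn.ModuleList([r.layer1, r.layer2, r.layer3, r.layer4])
        self.fpn = FeaturePyramidNetwork([256, 512, 1024, 2048], 256)

    def forward(self, x):                      # x: (N, 3, 512, 512)
        x = self.stem(x)
        feats = OrderedDict()
        for i, stage in enumerate(self.stages):
            x = stage(x)
            feats[f"p{i + 2}"] = x
        return self.fpn(feats)                 # multi-scale 256-channel maps
```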
Step (D), introducing a bounding-box regression network on top of the baseline model of step (C) to compute the bounding-box position, confidence, and center weight of each human target, as shown in FIG. 4, comprises the following steps:
step (D1), introducing the bounding-box regression network: the first branch takes as input the feature map output by the first level of the feature pyramid network and, through multi-layer convolution that changes the number of channels of the upper-layer input, regresses the boundary of the target candidate region; it outputs a tensor of shape H × W × 5, where H and W are the length and width output by the upper layer and 5 is the number of channels, encoding the distances (l, t, r, b) from the center point of the current detection region to the top, bottom, left, and right boundaries of the target human body, together with a confidence score;
step (D2), mapping the feature map one to one onto original-image positions: let d be the reduction ratio between the original image and the feature map; the original-image center point corresponding to coordinate (x, y) in the feature map is then taken as (xc, yc) = (d·x + d/2, d·y + d/2) (the published expression is rendered as an image; the form shown here is reconstructed from the definition of d). For the human body actually present in the current region, the bounding-box regression network regresses the distances (l, t, r, b) from this center point to the target's top, bottom, left, and right boundaries, and these four values determine the corner coordinates of the target's bounding box:
x1 = xc − l, y1 = yc − t, x2 = xc + r, y2 = yc + b;
wherein (x1, y1) and (x2, y2) are the coordinates of the upper-left and lower-right corner points of the bounding box of the human target, d is the reduction ratio between the original image and the feature map, and H and W are the length and width of the network input feature map;
and step (D3), taking the same input as step (D1) together with the feature map output by the second level of the feature pyramid network, the second branch of multi-layer convolution changes the number of channels of the upper-layer input and outputs a tensor of shape H × W × 1 whose value ω represents the distance coefficient between the center point of the current region and the center point of the ground-truth box; this center weight is used to ensure, as far as possible, that each detection region detects only the real human body closest to it.
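A minimal PyTorch sketch of decoding the step (D) outputs into boxes follows: the first branch's H × W × 5 tensor of (l, t, r, b) distances plus confidence, and the second branch's H × W × 1 center-weight map. The center-point mapping follows the reconstruction given in step (D2), and fusing confidence with the center weight by multiplication is an assumption of this sketch.

```python
# Sketch of decoding the step (D) regression and center-weight outputs.
import torch

def decode_boxes(reg: torch.Tensor, center_w: torch.Tensor,
                 d: int, score_thresh: float = 0.5):
    """reg: (H, W, 5) = (l, t, r, b, conf); center_w: (H, W, 1); d: stride."""
    H, W, _ = reg.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    xc, yc = d * xs + d // 2, d * ys + d // 2    # feature cell -> image center
    l, t, r, b, conf = reg.unbind(-1)
    score = conf * center_w.squeeze(-1)          # down-weight off-center cells
    keep = score > score_thresh
    boxes = torch.stack([xc[keep] - l[keep], yc[keep] - t[keep],
                         xc[keep] + r[keep], yc[keep] + b[keep]], dim=-1)
    return boxes, score[keep]                    # boxes: (K, 4) = (x1, y1, x2, y2)
```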
Step (E), introducing an adaptive focal loss to train the model, inputting the conference scene image obtained in step (B) into the trained human shape detection model, and detecting the positions of the corresponding conference participants, comprises the following steps:
step (E1), constructing the training loss by adding two hyperparameters, the weight coefficients α and γ, to the cross-entropy loss, thereby introducing an adaptive focal loss as the binary classification loss for judging whether an object exists in the current detection region;
step (E2), inputting the conference scene image obtained in step (B) and using the center weight ω output in step (D3) as a parameter of the loss function: the farther the center of the detection region lies from the center of the real human body, the smaller ω is and the smaller its contribution to the loss; when no human body exists in the detection region, ω = 0;
the loss function is formulated as follows:
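As above, the published formula is rendered as an image; a plausible reconstruction, following the standard focal loss and the parameter definitions below, is:

$$\mathcal{L} = -\alpha\,\omega\,(1-p)^{\gamma}\log(p+\sigma)\;-\;(1-\alpha)\,p^{\gamma}\log(1-p+\sigma)$$

where the first term applies at positions containing a human body and the second at background positions; since ω = 0 where no human body exists, the positive term vanishes there, consistent with step (E2).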
wherein the weight α is used to balance the imbalance between positive and negative samples, the weight γ is used to distinguish hard samples from easy ones, p is the predicted confidence value, ω is the center weight output in step (D3), and σ is an arbitrarily small number greater than 0 used to prevent division by zero;
in step (E3), γ is 0 in the initial state; as γ increases, the influence of the modulating factor grows, that is, the loss contributed by easy samples is progressively suppressed, and the larger the value of γ, the more the easy-sample loss is reduced.
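A minimal PyTorch sketch of the adaptive focal loss of step (E) is given below. It implements the hedged reconstruction presented above, so the exact placement of ω and σ, as well as the default values α = 0.25 and γ = 2.0, are assumptions of this sketch rather than the published formula.

```python
# Sketch of the step (E) adaptive focal loss under the hedged reconstruction.
import torch

def adaptive_focal_loss(p: torch.Tensor, target: torch.Tensor,
                        omega: torch.Tensor, alpha: float = 0.25,
                        gamma: float = 2.0, sigma: float = 1e-8) -> torch.Tensor:
    """p, target, omega: (H, W) maps; target is 1 where a human body exists."""
    pos = -alpha * omega * (1 - p) ** gamma * torch.log(p + sigma)
    neg = -(1 - alpha) * p ** gamma * torch.log(1 - p + sigma)
    return (target * pos + (1 - target) * neg).mean()
```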
To fully evaluate the performance of the proposed deep-learning-based video conference scene human shape detection method, experiments were arranged on a self-built conference room human shape detection dataset containing 9183 pictures, each conference room picture containing about 9 people on average. The mean average precision (mAP) of the model on the test set of this dataset is used as the performance metric: mAP@0.5 denotes the mean average precision at an IoU threshold of 0.5, and mAP@0.75 likewise at a threshold of 0.75. Once the IoU threshold is fixed, each prediction is judged by whether its intersection-over-union with a ground-truth box exceeds the threshold, from which precision and recall at different confidence (conf) levels are computed; the results are then averaged over thresholds IoU = 0.5:0.05:0.95 (i.e., AP values computed at every 0.05 step over the IoU range 0.5 to 0.95), yielding the technical index mAP adopted in the experiments of the present invention. The experimental results show that the baseline model proposed by the invention achieves an mAP of 48.7; after the bounding-box regression network, the center-weight network, and the adaptive focal loss are introduced, the accuracy improves to 73.6.
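A small sketch of the COCO-style averaging described above, in which AP is computed at IoU thresholds from 0.5 to 0.95 in steps of 0.05 and averaged; average_precision is a hypothetical helper returning the AP of the predictions at a single threshold.

```python
# Sketch of mAP@[0.5:0.05:0.95]; `average_precision` is a hypothetical helper.
import numpy as np

def map_50_95(preds, gts, average_precision):
    thresholds = np.arange(0.5, 1.0, 0.05)   # 0.50, 0.55, ..., 0.95
    return float(np.mean([average_precision(preds, gts, iou_thresh=t)
                          for t in thresholds]))
```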
In summary, the deep-learning-based video conference scene human shape detection method provided by the invention first discards the anchor-box clustering of the original deep-learning detection algorithm in favor of regression, representing the position of the human body with a distance operator and simultaneously regressing the confidence and center-weighting parameters on the basis of the regressed distances; second, it designs an adaptive focal loss to replace the cross-entropy loss commonly used for training, accelerating model convergence and improving detection accuracy in complex scenes. The method therefore improves the robustness and performance of human body detection, is ingenious and novel, and has good application prospects.
The foregoing illustrates and describes the principles, main features, and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are presented in the specification and drawings only to illustrate the principle of the invention; various changes and modifications may be made without departing from the spirit and scope of the present invention, and such changes and modifications fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and their equivalents.