
CN113989850A - Humanoid detection method in video conference scene based on deep learning - Google Patents


Info

Publication number
CN113989850A
CN113989850A (application CN202111315469.0A)
Authority
CN
China
Prior art keywords: image, network, input, deep learning, scene
Prior art date
Legal status
Granted
Application number
CN202111315469.0A
Other languages
Chinese (zh)
Other versions
CN113989850B (en)
Inventor
丁帆
任永忠
梅宇青
王沛
曾德军
陶宇
Current Assignee
Shenzhen Innotrik Technology Co ltd
Original Assignee
Shenzhen Innotrik Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Innotrik Technology Co ltd filed Critical Shenzhen Innotrik Technology Co ltd
Priority to CN202111315469.0A
Publication of CN113989850A
Application granted
Publication of CN113989850B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a deep-learning-based human figure detection method for video conference scenes, which extracts panoramic images of the conference scene and detects the positions of the people in them, supporting functions such as local picture focusing and assisted speech enhancement. The method comprises: capturing an original panoramic image of the conference scene with a camera and rectifying it; stitching and mapping the rectified rectangular panorama into a square image of equal width and height, then applying normalization preprocessing and data augmentation; constructing a deep learning model based on a residual network and feature pyramid network; building a bounding box regression network that simultaneously computes the surrounding box positions, confidence scores, and center weights of human targets; and training the model with an adaptive focal loss on conference scene images. Through boundary regression, center weighting, and the adaptive focal loss, the method guides the model to adapt to human targets in this specialized scene, improving detection accuracy and recall in densely attended meetings; it has good application prospects.

Description

Video conference scene human shape detection method based on deep learning
Technical Field
The invention relates to the field of machine vision, and in particular to a deep-learning-based human figure detection method for video conference scenes.
Background
Human figure detection in video conferencing is commonly used for speaker focusing, assisted directional speech enhancement, and similar functions. In a complex meeting room, however, factors such as the large number of participants, dense seating, uneven lighting, and people moving at random sharply degrade the performance of conventional object detection algorithms. Moreover, the camera used in a conference room is usually ultra-wide-angle or panoramic: the frame is wide, a single human body occupies only a small fraction of it, and the usable features are limited. In addition, participants usually sit, with their bodies partially occluded by tables, chairs, computers, and other meeting facilities, further reducing the available information. Human figure detection in video conference scenes therefore remains a challenging problem.
Traditional human figure detection algorithms typically extract features with hand-designed operators. The Viola-Jones detector slides a window over all possible positions and scales in the image, checking whether an object is present in each window; by combining three key techniques, the integral image, feature selection, and detection cascades, it greatly improved detection speed. Histogram of Oriented Gradients (HOG) feature descriptors have also been used for pedestrian detection; they balance feature invariance (to translation, scale, illumination, and so on) against nonlinearity (the ability to distinguish different object classes). The DPM algorithm improves and extends HOG: it consists of a root filter and several part filters, and raises detection precision through hard negative mining, bounding box regression, and context priming. As the best-performing traditional detector, DPM runs fast and adapts to object deformation, but it cannot handle large rotations, so its stability is poor.
Deep learning has developed rapidly in recent years and is widely applied to detection. Deep-learning-based object detection overcomes the traditional reliance on hand-designed features. Current detectors fall into two families: two-stage methods, which first generate candidate regions and then classify them (e.g., the R-CNN series), and single-stage methods, which detect end-to-end without a separate candidate search (e.g., SSD and the YOLO series). The region-based two-stage methods are superior in detection and localization accuracy, while the end-to-end single-stage methods are faster. These algorithms, however, target general multi-class detection and perform well only when object features are rich, targets are large, distributions are sparse, and illumination is consistent. Real meeting rooms can deviate greatly from generic training sets, and although a purpose-built dataset can compensate for some of this, general deep learning detection models remain weak at dense detection, small-target detection, and capturing occluded bodies.
Therefore, solving the problems of dense human targets, uneven illumination, wide frames with small targets, and occlusion by irregular objects in the conference room is the key to better human figure detection in video conference scenes, and doing so is of real significance for improving detection quality, video call quality, and the meeting experience.
Disclosure of Invention
The invention aims to solve the problems of dense human targets, uneven illumination, wide frames, small targets, and occlusion by irregular objects in the human figure detection task for video conference scenes, thereby improving detection accuracy, recall, and metrics such as detection IoU precision. The proposed deep-learning-based detection model uses a bounding box regression network that simultaneously computes the surrounding box positions, confidence scores, and center weights of human targets, handling densely seated people while avoiding missed and duplicate detections. It also introduces an adaptive focal loss to train the model, addressing sample imbalance between positive and negative samples and between hard and easy samples, so that the model adapts to the human figure detection scene and detection performance improves substantially. The method is ingenious, novel, and has good application prospects.
In order to achieve the purpose, the invention adopts the technical scheme that:
the video conference scene human shape detection method based on deep learning comprises the following steps:
step (A), using a camera to capture an original panoramic image of the conference scene and rectifying it;
step (B), stitching and mapping the rectified rectangular panoramic image into a square image of equal width and height, applying data augmentation (specifically random flipping, region cropping, and region masking and recombination), and then performing normalization preprocessing;
step (C), constructing a deep learning model based on a residual network and feature pyramid network as the baseline model, taking the image processed in step (B) as input and outputting rectangular bounding boxes for all human bodies in the image;
step (D), introducing a bounding box regression network on top of the baseline model of step (C) and computing the surrounding box positions, confidence scores, and center weights of human targets;
and step (E), introducing an adaptive focal loss to train the model, then feeding conference scene images obtained as in step (B) to the trained human figure detection model to detect the positions of the corresponding participants.
In more detail, the method proceeds as follows:
the method comprises the following steps of (A) acquiring a rectangular panoramic image from a conference scene by using a camera, and correcting the rectangular panoramic image,
Step (A1), placing a camera device with a 180-degree fisheye lens at the center of the conference room and capturing a distorted panoramic image of the room;
Step (A2), rectifying the captured panoramic image using the OpenCV checkerboard calibration method: photographing a fixed-size checkerboard with the fisheye camera, binarizing the image, applying erosion and dilation, traversing the contour of each square to obtain the corner points of all the small squares, and calibrating the lens parameters from the distortion of these corner points; then, for an image to be rectified, applying the coordinate transform defined by the correspondence between pre- and post-distortion coordinates to obtain a normal, undistorted rectangular panoramic picture.
Step (B), stitching and mapping the rectified rectangular panorama into a square image of equal width and height, applying data augmentation (random flipping, region cropping, and region masking and recombination) and then normalizing, comprises the following steps:
Step (B1), taking the rectified 3000×1000 rectangular panoramic image output by the camera in step (A), which contains the 360-degree annular scene of the conference room, cropping it vertically into two 2000×1000 rectangles, and stacking these top-to-bottom into a 2000×2000 square image to match the input aspect ratio of the deep learning detector;
Step (B2), mapping positions in the stitched square image one-to-one to the original rectangle: the upper half maps directly to x ∈ [0, 2000) of the original image, while the lower half is the 2000×1000 strip formed by concatenating the segments x ∈ [0, 500) and x ∈ [1500, 3000) of the original, so that no part of the panorama is split;
Step (B3), after detections are mapped back to original-image positions, applying non-maximum suppression to avoid duplicate detections on the image stitched in step (B1);
Step (B4), augmenting the stitched square image: randomly flipping it vertically and horizontally, then randomly cropping partial regions that contain human targets, and painting over or mosaicing regions that contain no human targets;
Step (B5), normalizing the augmented square image so that each pixel value becomes a decimal in the interval (0, 1), and resizing the image to 512×512 as the model input.
Step (C), constructing a deep learning model based on a residual network and feature pyramid network as the baseline model, taking the image processed in step (B) as input and outputting rectangular bounding boxes for all human bodies in the image, comprises the following steps:
Step (C1), inputting the image processed in step (B) and constructing the baseline model by connecting a residual convolutional network and a feature pyramid network in sequence;
Step (C2), using the residual convolutional network, which learns the spatial-semantic features of the original image, as the backbone, and using the feature pyramid network for multi-scale feature fusion, modeling features at different scales;
Step (C3), attaching fully connected layers through a shallow convolutional network as the detection head to obtain the positions of the target human figures, using a k-means clustering algorithm to obtain anchor boxes adapted to the human figures in the dataset, and then outputting rectangular bounding boxes for all human bodies in the image.
Step (D), introducing a bounding box regression network on top of the baseline model of step (C) to compute the surrounding box positions, confidence scores, and center weights of human targets, comprises the following steps:
Step (D1), introducing the bounding box regression network, whose input is the feature map output by the first feature-pyramid level; a first multi-layer convolution branch regresses the boundary of the target candidate region by changing the channel count of its upper-layer input, and outputs a tensor of shape H × W × 5, where H and W are the height and width of the previous layer's output and 5 is the number of channels, encoding the distances (l, t, r, b) from the top, bottom, left, and right boundaries of the target human body to the center point of the current detection region, plus a confidence score;
Step (D2), mapping feature-map locations one-to-one to original-image positions; with d the downscaling ratio between the original image and the feature map, the original-image center point corresponding to coordinate (x, y) in the feature map is
(cx, cy) = (⌊d/2⌋ + x·d, ⌊d/2⌋ + y·d)
For the human body actually present in the current region, the bounding box regression network regresses the distances (l, t, r, b) from its top, bottom, left, and right boundaries to the center point of the current detection region; these four values determine the coordinates of the corners of the box around the human target:
x1 = cx - l
y1 = cy - t
x2 = cx + r
y2 = cy + b
where (x1, y1) and (x2, y2) are the coordinates of the top-left and bottom-right corners of the box around the human target, d is the downscaling ratio between the original image and the feature map, and H and W are the height and width of the network's input feature map;
Step (D3), with the same input as step (D1), feeding the feature map output by the second feature-pyramid level to a second multi-layer convolution branch, which regresses the boundary of the target candidate region by changing the channel count of its upper-layer input and outputs a tensor of shape H × W × 1 representing the distance coefficient between the center point of the current region and the center point of the ground-truth box; this center weight, whose value is denoted ω, ensures that each detection region detects, as far as possible, only the real human body nearest to it.
Step (E), introducing an adaptive focal loss to train the model, then feeding conference scene images obtained as in step (B) to the trained human figure detection model to detect the positions of the corresponding participants, comprises the following steps:
Step (E1), training the model with a loss function built by adding two hyperparameters, the weight coefficients α and γ, to the cross entropy, and introducing the adaptive focal loss for the binary classification that decides whether an object is present in the current detection region;
Step (E2), inputting the conference scene images obtained in step (B) and using the center weight ω output in step (D3) as a parameter of the loss function; the farther the center of the detection region lies from the center of the real human body, the smaller the loss; when no human body is present in the detection region, ω = 0;
the loss function is formulated as follows:
(adaptive focal loss: a binary cross entropy whose positive and negative terms are weighted by α and 1 - α, modulated by the focusing factors (1 - p)^γ and p^γ, and scaled by the center weight ω, with a small constant σ guarding against division by zero)
where the weight α balances positive and negative samples, the weight γ distinguishes hard from easy samples, p is the predicted confidence, ω is the center weight output in step (D3), and σ is an arbitrarily small positive number that prevents division by zero;
In step (E3), γ is 0 in the initial state; as γ increases, the modulating factor grows, so the loss contributed by easy samples is progressively suppressed, shrinking sharply for larger γ.
The beneficial effects of the invention are as follows. First, it abandons the anchor box clustering of earlier deep-learning detectors in favor of regression, representing the position of a human body with a distance operator and regressing confidence and center weighting parameters alongside the distances. Second, it designs an adaptive focal loss to replace the commonly used cross entropy, which speeds up model convergence and improves detection precision in complex scenes. The method therefore improves the robustness and performance of human body detection; it is ingenious, novel, and has good application prospects.
Drawings
FIG. 1 is a flow chart of the deep-learning-based human figure detection method for video conference scenes of the invention;
FIG. 2 is a schematic diagram of cropping and stitching the rectangular image output by the panoramic camera;
FIG. 3 is a block diagram of the overall structure of the proposed model;
FIG. 4 is a structure diagram of the detection head network designed by the invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings.
As shown in FIG. 1 and FIG. 3, the deep-learning-based human figure detection method for video conference scenes of the invention comprises the following steps. Step (A), acquiring a rectangular panoramic image of the conference scene with a camera and rectifying it, comprises the following steps:
Step (A1), placing a camera device with a 180-degree fisheye lens at the center of the conference room and capturing a distorted panoramic image of the room;
Step (A2), rectifying the captured panoramic image using the OpenCV checkerboard calibration method: photographing a fixed-size checkerboard with the fisheye camera, binarizing the image, applying erosion and dilation, traversing the contour of each square to obtain the corner points of all the small squares, and calibrating the lens parameters from the distortion of these corner points; then, for an image to be rectified, applying the coordinate transform defined by the correspondence between pre- and post-distortion coordinates to obtain a normal, undistorted rectangular panoramic picture.
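By way of a non-limiting illustration, step (A2) can be sketched with OpenCV's fisheye module as follows; the checkerboard geometry, file paths, and flag choices are assumptions of this sketch, not parameters fixed by the method:

```python
import glob
import cv2
import numpy as np

# Assumed checkerboard geometry: 9x6 inner corners (illustrative only).
PATTERN = (9, 6)
objp = np.zeros((1, PATTERN[0] * PATTERN[1], 3), np.float32)
objp[0, :, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for path in glob.glob("checkerboard/*.jpg"):  # assumed calibration shots
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (3, 3), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-6))
        obj_points.append(objp)
        img_points.append(corners)

# Calibrate the fisheye lens from the corner-point distortions.
K, D = np.zeros((3, 3)), np.zeros((4, 1))
h, w = gray.shape  # size of the last calibration image
cv2.fisheye.calibrate(
    obj_points, img_points, (w, h), K, D, None, None,
    cv2.fisheye.CALIB_RECOMPUTE_EXTRINSIC + cv2.fisheye.CALIB_FIX_SKEW,
    (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-6))

# Build the pre/post-distortion coordinate maps once, then rectify frames.
map1, map2 = cv2.fisheye.initUndistortRectifyMap(
    K, D, np.eye(3), K, (w, h), cv2.CV_16SC2)
frame = cv2.imread("conference_panorama.jpg")  # assumed input frame
rectified = cv2.remap(frame, map1, map2, interpolation=cv2.INTER_LINEAR)
```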
Step (B), as shown in FIG. 2, stitching and mapping the rectified rectangular panorama into a square image of equal width and height, applying data augmentation (random flipping, region cropping, and region masking and recombination) and then normalizing, comprises the following steps:
Step (B1), taking the rectified 3000×1000 rectangular panoramic image output by the image acquisition equipment in step (A), which contains the 360-degree annular scene of the conference room, cropping it vertically into two 2000×1000 rectangles, and stacking these top-to-bottom into a 2000×2000 square image to match the input aspect ratio of the deep learning detector;
Step (B2), mapping positions in the stitched square image one-to-one to the original rectangle: the upper half maps directly to x ∈ [0, 2000) of the original image, while the lower half is the 2000×1000 strip formed by concatenating the segments x ∈ [0, 500) and x ∈ [1500, 3000) of the original, so that no part of the panorama is split;
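A minimal sketch of the stitch of steps (B1) and (B2) and its inverse coordinate mapping, assuming NumPy; the concatenation order of the lower strip is an assumption chosen so that it stays continuous across the panorama seam:

```python
import numpy as np

def to_square(panorama: np.ndarray) -> np.ndarray:
    """Stack the 3000x1000 panorama into a 2000x2000 square (steps B1/B2)."""
    assert panorama.shape[:2] == (1000, 3000)
    top = panorama[:, 0:2000]                      # x in [0, 2000)
    bottom = np.concatenate(                       # wrap-around strip
        [panorama[:, 1500:3000], panorama[:, 0:500]], axis=1)
    return np.concatenate([top, bottom], axis=0)   # 2000 x 2000

def square_to_panorama(x_sq: int, y_sq: int) -> tuple:
    """Map a square-image pixel back to panorama coordinates (step B2)."""
    if y_sq < 1000:                                # upper half: identity
        return x_sq, y_sq
    # Lower half: columns [0, 1500) came from x in [1500, 3000),
    # columns [1500, 2000) from x in [0, 500).
    x = x_sq + 1500 if x_sq < 1500 else x_sq - 1500
    return x, y_sq - 1000
```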
Step (B3), after detections are mapped back to original-image positions, applying non-maximum suppression to avoid duplicate detections on the image stitched in step (B1);
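The non-maximum suppression of step (B3) is the standard IoU-based procedure; a sketch follows, where the 0.5 IoU threshold is an assumption since the text does not fix it:

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thr: float = 0.5) -> list:
    """boxes: (N, 4) as (x1, y1, x2, y2); returns indices of kept boxes."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection of the top-scoring box with every remaining box.
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thr]   # drop duplicates of the kept box
    return keep
```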
Step (B4), augmenting the stitched square image: randomly flipping it vertically and horizontally, then randomly cropping partial regions that contain human targets, and painting over or mosaicing regions that contain no human targets;
Step (B5), normalizing the augmented square image so that each pixel value becomes a decimal in the interval (0, 1), and resizing the image to 512×512 as the model input.
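Steps (B4) and (B5) amount to random flips, masking, and rescaling; a simplified sketch, assuming OpenCV and NumPy, with a constant gray fill standing in for the painting or mosaic cover-up and the box-aware region selection omitted:

```python
import cv2
import numpy as np

rng = np.random.default_rng(0)

def augment_and_normalize(img: np.ndarray) -> np.ndarray:
    """Steps (B4)-(B5) on one 2000x2000 stitched square image."""
    if rng.random() < 0.5:          # random vertical flip
        img = img[::-1]
    if rng.random() < 0.5:          # random horizontal flip
        img = img[:, ::-1]
    img = img.copy()
    # Mask one 200x200 region; a real pipeline would pick regions that
    # contain no person boxes and could mosaic instead of gray-filling.
    x = int(rng.integers(0, img.shape[1] - 200))
    y = int(rng.integers(0, img.shape[0] - 200))
    img[y:y + 200, x:x + 200] = 127
    # Scale pixel values toward (0, 1) and resize to the model input.
    return cv2.resize(img, (512, 512)).astype(np.float32) / 255.0
```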
Step (C), constructing a deep learning model based on a residual network and feature pyramid network as the baseline model, taking the image processed in step (B) as input and outputting rectangular bounding boxes for all human bodies in the image, comprises the following steps:
Step (C1), inputting the image processed in step (B) and constructing the baseline model by connecting a residual convolutional network and a feature pyramid network in sequence;
Step (C2), using the residual convolutional network, which learns the spatial-semantic features of the original image, as the backbone, and using the feature pyramid network for multi-scale feature fusion, modeling features at different scales;
Step (C3), attaching fully connected layers through a shallow convolutional network as the detection head to obtain the positions of the target human figures, using a k-means clustering algorithm to obtain anchor boxes adapted to the human figures in the dataset, and then outputting rectangular bounding boxes for all human bodies in the image.
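The baseline of steps (C1) to (C3) matches the shape of torchvision's ResNet-FPN backbone; the sketch below (torchvision >= 0.13 API assumed) wires such a backbone to a small convolutional head and is one possible realization rather than the exact network of the invention; the k-means anchor step is omitted:

```python
import torch
from torch import nn
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

class BaselineDetector(nn.Module):
    """ResNet-50 backbone + FPN fusion (C1/C2) with a shallow conv head (C3)."""
    def __init__(self, out_channels: int = 5):
        super().__init__()
        self.backbone = resnet_fpn_backbone("resnet50", weights=None)
        self.head = nn.Sequential(            # shared across pyramid levels
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, out_channels, 1))  # (l, t, r, b) + confidence

    def forward(self, x: torch.Tensor) -> dict:
        feats = self.backbone(x)              # dict of FPN levels '0'..'pool'
        return {name: self.head(f) for name, f in feats.items()}

model = BaselineDetector()
preds = model(torch.randn(1, 3, 512, 512))    # one 512x512 input from step (B5)
```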
Step (D), introducing a bounding box regression network on top of the baseline model of step (C) to compute the surrounding box positions, confidence scores, and center weights of human targets, as shown in FIG. 4, comprises the following steps:
Step (D1), introducing the bounding box regression network, whose input is the feature map output by the first feature-pyramid level; a first multi-layer convolution branch regresses the boundary of the target candidate region by changing the channel count of its upper-layer input, and outputs a tensor of shape H × W × 5, where H and W are the height and width of the previous layer's output and 5 is the number of channels, encoding the distances (l, t, r, b) from the top, bottom, left, and right boundaries of the target human body to the center point of the current detection region, plus a confidence score;
Step (D2), mapping feature-map locations one-to-one to original-image positions; with d the downscaling ratio between the original image and the feature map, the original-image center point corresponding to coordinate (x, y) in the feature map is
(cx, cy) = (⌊d/2⌋ + x·d, ⌊d/2⌋ + y·d)
For the human body actually present in the current region, the bounding box regression network regresses the distances (l, t, r, b) from its top, bottom, left, and right boundaries to the center point of the current detection region; these four values determine the coordinates of the corners of the box around the human target:
x1 = cx - l
y1 = cy - t
x2 = cx + r
y2 = cy + b
where (x1, y1) and (x2, y2) are the coordinates of the top-left and bottom-right corners of the box around the human target, d is the downscaling ratio between the original image and the feature map, and H and W are the height and width of the network's input feature map;
Step (D3), with the same input as step (D1), feeding the feature map output by the second feature-pyramid level to a second multi-layer convolution branch, which regresses the boundary of the target candidate region by changing the channel count of its upper-layer input and outputs a tensor of shape H × W × 1 representing the distance coefficient between the center point of the current region and the center point of the ground-truth box; this center weight, whose value is denoted ω, ensures that each detection region detects, as far as possible, only the real human body nearest to it.
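Combining steps (D1) to (D3): given the H×W×5 regression tensor and the H×W×1 center-weight tensor, boxes can be decoded into original-image coordinates with the formulas above. A sketch, in which the tensor layout, the use of ω to scale the score, and the confidence threshold are assumptions; the non-maximum suppression of step (B3) would follow:

```python
import torch

def decode_boxes(reg: torch.Tensor, omega: torch.Tensor, d: int,
                 conf_thr: float = 0.3) -> torch.Tensor:
    """reg: (H, W, 5) holding (l, t, r, b, confidence); omega: (H, W).
    Returns (N, 5) boxes (x1, y1, x2, y2, score) in original-image pixels."""
    H, W, _ = reg.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    cx = xs * d + d // 2                 # feature cell -> image center (D2)
    cy = ys * d + d // 2
    l, t, r, b, conf = reg.unbind(dim=-1)
    x1, y1 = cx - l, cy - t              # corner decode from (l, t, r, b)
    x2, y2 = cx + r, cy + b
    score = torch.sigmoid(conf) * omega  # confidence scaled by omega (D3)
    boxes = torch.stack([x1, y1, x2, y2, score], dim=-1).reshape(-1, 5)
    return boxes[boxes[:, 4] > conf_thr]
```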
Step (E), introducing an adaptive focal loss to train the model, then feeding conference scene images obtained as in step (B) to the trained human figure detection model to detect the positions of the corresponding participants, comprises the following steps:
Step (E1), training the model with a loss function built by adding two hyperparameters, the weight coefficients α and γ, to the cross entropy, and introducing the adaptive focal loss for the binary classification that decides whether an object is present in the current detection region;
Step (E2), inputting the conference scene images obtained in step (B) and using the center weight ω output in step (D3) as a parameter of the loss function; the farther the center of the detection region lies from the center of the real human body, the smaller the loss; when no human body is present in the detection region, ω = 0;
the loss function is formulated as follows:
(adaptive focal loss: a binary cross entropy whose positive and negative terms are weighted by α and 1 - α, modulated by the focusing factors (1 - p)^γ and p^γ, and scaled by the center weight ω, with a small constant σ guarding against division by zero)
where the weight α balances positive and negative samples, the weight γ distinguishes hard from easy samples, p is the predicted confidence, ω is the center weight output in step (D3), and σ is an arbitrarily small positive number that prevents division by zero;
In step (E3), γ is 0 in the initial state; as γ increases, the modulating factor grows, so the loss contributed by easy samples is progressively suppressed, shrinking sharply for larger γ.
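Since the source gives the loss only as an image, the sketch below is one plausible reading of steps (E1) to (E3): a binary focal loss whose positive term is scaled by the center weight ω, with σ guarding the logarithms; treat it as an assumption, not the exact loss of the invention:

```python
import torch

def adaptive_focal_loss(p: torch.Tensor, target: torch.Tensor,
                        omega: torch.Tensor, alpha: float = 0.25,
                        gamma: float = 2.0, sigma: float = 1e-8) -> torch.Tensor:
    """p: predicted confidence in (0, 1); target: 1 where a human exists;
    omega: center weight from step (D3), equal to 0 where no human exists.
    alpha/gamma defaults follow the original focal loss paper (assumption)."""
    pos = -alpha * omega * (1 - p) ** gamma * torch.log(p + sigma)
    neg = -(1 - alpha) * p ** gamma * torch.log(1 - p + sigma)
    loss = torch.where(target > 0, pos, neg)
    # gamma = 0 reduces to weighted cross entropy; larger gamma suppresses
    # the loss contributed by easy, well-classified samples (step E3).
    return loss.mean()
```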
To fully evaluate the proposed deep-learning-based human figure detection method for video conference scenes, experiments were arranged on a self-built meeting room human figure detection dataset of 9,183 pictures, each containing about 9 people on average. The mean average precision (mAP) of the model on the test set serves as the performance metric: mAP@0.5 is the mAP at an IoU threshold of 0.5, and mAP@0.75 likewise at a threshold of 0.75. Once the IoU threshold is fixed, each prediction is judged by whether its intersection-over-union with a ground-truth box exceeds the threshold, from which precision and recall at different confidence (conf) levels are computed; averaging the results over the thresholds IoU = 0.5:0.05:0.95 (i.e., an AP value every 0.05 step from 0.5 to 0.95) yields the mAP index adopted in the experiments. The results show that the proposed baseline model reaches an mAP of 48.7; after the boundary regression network, center weight network, and adaptive focal loss are introduced, accuracy improves to 73.6.
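The evaluation protocol described here is the standard averaging of AP over IoU thresholds 0.5:0.05:0.95; a minimal single-image, single-class sketch of the computation (greedy matching; the COCO-style 101-point interpolation is omitted):

```python
import numpy as np

def iou(a, b):
    """Intersection over union of boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def average_precision(preds, gts, iou_thr):
    """preds: list of (score, box); gts: list of boxes; one image, one class."""
    preds = sorted(preds, key=lambda x: -x[0])
    matched, hits = set(), []
    for score, box in preds:
        ious = [iou(box, g) for g in gts]
        j = int(np.argmax(ious)) if ious else -1
        ok = j >= 0 and ious[j] >= iou_thr and j not in matched
        if ok:
            matched.add(j)
        hits.append(1 if ok else 0)
    tp = np.cumsum(hits)
    recall = tp / max(len(gts), 1)
    precision = tp / np.arange(1, len(hits) + 1)
    return float(np.trapz(precision, recall))

# mAP over IoU thresholds 0.5:0.05:0.95, given preds/gts for one image:
# m_ap = np.mean([average_precision(preds, gts, t)
#                 for t in np.arange(0.5, 0.96, 0.05)])
```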
In summary, the deep-learning-based human figure detection method for video conference scenes first abandons the anchor box clustering of earlier deep-learning detectors in favor of regression, representing the position of a human body with a distance operator and regressing confidence and center weighting parameters alongside the distances; second, it designs an adaptive focal loss to replace the commonly used cross entropy, speeding up model convergence and improving detection precision in complex scenes. The method thus improves the robustness and performance of human body detection; it is ingenious, novel, and has good application prospects.
The foregoing illustrates and describes the principles, main features, and advantages of the invention. Those skilled in the art will understand that the invention is not limited to the embodiments described above, which merely illustrate its principles; various changes and improvements may be made without departing from the spirit and scope of the invention, and all such changes and improvements fall within the scope of the claimed invention, which is defined by the appended claims and their equivalents.

Claims (6)

1. A deep-learning-based human figure detection method for video conference scenes, characterized by comprising the following steps:
step (A), using a camera to acquire a rectangular panoramic image of the conference scene and rectifying it;
step (B), stitching and mapping the rectified rectangular panoramic image into a square image of equal width and height, applying data augmentation (specifically random flipping, region cropping, and region masking and recombination), and then normalizing the result;
step (C), constructing a deep learning model based on a residual network and feature pyramid network as the baseline model, taking the image processed in step (B) as input and outputting rectangular bounding boxes for all human bodies in the image;
step (D), introducing a bounding box regression network on top of the baseline model of step (C) and computing the surrounding box positions, confidence scores, and center weights of human targets;
step (E), introducing an adaptive focal loss to train the model, then feeding conference scene images obtained as in step (B) to the trained human figure detection model to detect the positions of the corresponding participants.

2. The method of claim 1, characterized in that step (A), acquiring a rectangular panoramic image of the conference scene with a camera and rectifying it, comprises the following steps:
step (A1), placing a camera device with a 180-degree fisheye lens at the center of the conference room and capturing a distorted panoramic image of the room;
step (A2), rectifying the captured panoramic image using the OpenCV checkerboard calibration method: photographing a fixed-size checkerboard with the fisheye camera, binarizing the image, applying erosion and dilation, traversing the contour of each square to obtain the corner points of all the small squares, and calibrating the lens parameters from the distortion of these corner points; then, given an image to be rectified, applying the coordinate transform defined by the correspondence between pre- and post-distortion coordinates to obtain a normal, undistorted rectangular panoramic photograph.

3. The method of claim 1, characterized in that step (B), stitching and mapping the rectified rectangular panorama into a square image of equal width and height, applying data augmentation (random flipping, region cropping, and region masking and recombination) and then normalizing, comprises the following steps:
step (B1), taking the rectified 3000×1000 rectangular panoramic image output by the camera in step (A), which contains the 360-degree annular scene of the conference room, cropping it vertically into two 2000×1000 rectangles, and stacking these top-to-bottom into a 2000×2000 square image to match the input aspect ratio of the deep learning detector;
step (B2), mapping positions in the stitched square image one-to-one to the original rectangle: the upper half maps directly to x ∈ [0, 2000) of the original image, while the lower half is the 2000×1000 strip formed by concatenating the segments x ∈ [0, 500) and x ∈ [1500, 3000) of the original, so that no part of the panorama is split;
step (B3), after detections are mapped back to original-image positions, applying non-maximum suppression to avoid duplicate detections on the image stitched in step (B1);
step (B4), augmenting the stitched square image: randomly flipping it vertically and horizontally, then randomly cropping partial regions that contain human targets, and painting over or mosaicing regions that contain no human targets;
step (B5), normalizing the augmented square image so that each pixel value becomes a decimal in the interval (0, 1), and resizing the image to 512×512 as the model input.

4. The method of claim 1, characterized in that step (C), constructing a deep learning model based on a residual network and feature pyramid network as the baseline model, taking the image processed in step (B) as input and outputting rectangular bounding boxes for all human bodies, comprises the following steps:
step (C1), inputting the image processed in step (B) and constructing the baseline model by connecting a residual convolutional network and a feature pyramid network in sequence;
step (C2), using the residual convolutional network, which learns the spatial-semantic features of the original image, as the backbone, and using the feature pyramid network for multi-scale feature fusion, modeling features at different scales;
step (C3), attaching fully connected layers through a shallow convolutional network as the detection head to obtain the positions of the target human figures, using a k-means clustering algorithm to obtain anchor boxes adapted to the human figures in the dataset, and then outputting rectangular bounding boxes for all human bodies in the image.

5. The method of claim 3, characterized in that step (D), introducing a bounding box regression network on top of the baseline model of step (C) and computing the surrounding box positions, confidence scores, and center weights of human targets, comprises the following steps:
step (D1), introducing the bounding box regression network, whose input is the feature map output by the first feature-pyramid level; a first multi-layer convolution branch regresses the boundary of the target candidate region by changing the channel count of its upper-layer input, and outputs a tensor of shape H × W × 5, where H and W are the height and width of the previous layer's output and 5 is the number of channels, encoding the distances (l, t, r, b) from the top, bottom, left, and right boundaries of the target human body to the center point of the current detection region, plus a confidence score;
step (D2), mapping feature-map locations one-to-one to original-image positions; with d the downscaling ratio between the original image and the feature map, the original-image center point corresponding to coordinate (x, y) in the feature map is
(cx, cy) = (⌊d/2⌋ + x·d, ⌊d/2⌋ + y·d);
for the human body actually present in the current region, the bounding box regression network regresses the distances (l, t, r, b) from its top, bottom, left, and right boundaries to the center point of the current detection region, and these four values determine the corner coordinates of the box around the human target:
x1 = cx - l, y1 = cy - t,
x2 = cx + r, y2 = cy + b,
where (x1, y1) and (x2, y2) are the coordinates of the top-left and bottom-right corners of the box around the human target, d is the downscaling ratio between the original image and the feature map, and H and W are the height and width of the network's input feature map;
step (D3), with the same input as step (D1), feeding the feature map output by the second feature-pyramid level to a second multi-layer convolution branch, which regresses the boundary of the target candidate region by changing the channel count of its upper-layer input and outputs a tensor of shape H × W × 1 representing the distance coefficient between the center point of the current region and the center point of the ground-truth box; this center weight, whose value is denoted ω, ensures that each detection region detects, as far as possible, only the real human body nearest to it.

6. The method of claim 4, characterized in that step (E), introducing an adaptive focal loss to train the model and feeding conference scene images obtained as in step (B) to the trained human figure detection model to detect the positions of the corresponding participants, comprises the following steps:
step (E1), training the model with a loss function built by adding two hyperparameters, the weight coefficients α and γ, to the cross entropy, and introducing the adaptive focal loss for the binary classification that decides whether an object is present in the current detection region;
step (E2), inputting the conference scene images obtained in step (B) and using the center weight ω output in step (D3) as a parameter of the loss function; the farther the center of the detection region lies from the center of the real human body, the smaller the loss; when no human body is present in the detection region, ω = 0;
the loss function is as follows:
(adaptive focal loss: a binary cross entropy whose positive and negative terms are weighted by α and 1 - α, modulated by the focusing factors (1 - p)^γ and p^γ, and scaled by the center weight ω, with a small constant σ guarding against division by zero)
where the weight α balances positive and negative samples, the weight γ distinguishes hard from easy samples, p is the predicted confidence, ω is the center weight output in step (D3), and σ is an arbitrarily small positive number that prevents division by zero;
step (E3), in the initial state γ = 0; as γ increases, the modulating factor grows, so the loss contributed by easy samples is progressively suppressed, shrinking sharply for larger γ.
CN202111315469.0A, filed 2021-11-08: Human figure detection method in video conference scene based on deep learning; granted as CN113989850B (Active)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202111315469.0A | 2021-11-08 | 2021-11-08 | Human figure detection method in video conference scene based on deep learning (granted as CN113989850B)

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202111315469.0A | 2021-11-08 | 2021-11-08 | Human figure detection method in video conference scene based on deep learning (granted as CN113989850B)

Publications (2)

Publication Number | Publication Date
CN113989850A (en) | 2022-01-28
CN113989850B (en) | 2025-04-25

Family

ID=79747204

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111315469.0A Active CN113989850B (en) 2021-11-08 2021-11-08 Human figure detection method in video conference scene based on deep learning

Country Status (1)

Country Link
CN (1) CN113989850B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115082960A (en) * 2022-06-21 2022-09-20 京东方科技集团股份有限公司 Image processing method, computer device and readable storage medium
CN116511063A (en) * 2023-03-30 2023-08-01 合肥中科深谷科技发展有限公司 A Robot Vision Guided Tracking and Sorting System Based on Lightweight Object Detection Algorithm

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101776952A (en) * 2010-01-29 2010-07-14 联动天下科技(大连)有限公司 Novel interactive projection system
CN107169421A (en) * 2017-04-20 2017-09-15 华南理工大学 A kind of car steering scene objects detection method based on depth convolutional neural networks
CN111611998A (en) * 2020-05-21 2020-09-01 中山大学 An Adaptive Feature Block Extraction Method Based on Area, Width and Height of Candidate Regions
CN111898406A (en) * 2020-06-05 2020-11-06 东南大学 Face detection method based on focal loss and multi-task cascade
US11074711B1 (en) * 2018-06-15 2021-07-27 Bertec Corporation System for estimating a pose of one or more persons in a scene

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101776952A (en) * 2010-01-29 2010-07-14 联动天下科技(大连)有限公司 Novel interactive projection system
CN107169421A (en) * 2017-04-20 2017-09-15 华南理工大学 A kind of car steering scene objects detection method based on depth convolutional neural networks
US11074711B1 (en) * 2018-06-15 2021-07-27 Bertec Corporation System for estimating a pose of one or more persons in a scene
CN111611998A (en) * 2020-05-21 2020-09-01 中山大学 An Adaptive Feature Block Extraction Method Based on Area, Width and Height of Candidate Regions
CN111898406A (en) * 2020-06-05 2020-11-06 东南大学 Face detection method based on focal loss and multi-task cascade

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YANG, QIRUI: "Pedestrian detection model based on an improved deep residual network for oilfield security", Computer Measurement & Control, no. 11, 25 November 2018 (2018-11-25) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115082960A (en) * 2022-06-21 2022-09-20 京东方科技集团股份有限公司 Image processing method, computer device and readable storage medium
CN115082960B (en) * 2022-06-21 2025-08-29 京东方科技集团股份有限公司 Image processing method, computer device and readable storage medium
CN116511063A (en) * 2023-03-30 2023-08-01 合肥中科深谷科技发展有限公司 A Robot Vision Guided Tracking and Sorting System Based on Lightweight Object Detection Algorithm

Also Published As

Publication number Publication date
CN113989850B (en) 2025-04-25

Similar Documents

Publication Publication Date Title
CN113065558A (en) Lightweight small target detection method combined with attention mechanism
CN110941594B (en) Splitting method and device of video file, electronic equipment and storage medium
CN112150493B (en) Semantic guidance-based screen area detection method in natural scene
CN109767422A (en) Pipe detection recognition methods, storage medium and robot based on deep learning
CN114140623B (en) Image feature point extraction method and system
CN110598610A (en) Target significance detection method based on neural selection attention
CN108960404B (en) Image-based crowd counting method and device
CN110766720A (en) Multi-camera vehicle tracking system based on deep learning
CN111783685B (en) An improved target detection algorithm based on a single-stage network model
CN113989850B (en) Human figure detection method in video conference scene based on deep learning
CN109635814B (en) Forest fire automatic detection method and device based on deep neural network
CN109829924B (en) Image quality evaluation method based on principal feature analysis
CN111882555B (en) Net clothing detection methods, devices, equipment and storage media based on deep learning
CN113627302A (en) Method and system for detecting compliance of ascending construction
CN110110131A (en) It is a kind of based on the aircraft cable support of deep learning and binocular stereo vision identification and parameter acquiring method
CN112036259A (en) Form correction and recognition method based on combination of image processing and deep learning
CN103955949A (en) Moving target detection method based on Mean-shift algorithm
CN104361357B (en) Photo album categorizing system and sorting technique based on image content analysis
CN114863302B (en) A small target detection method for UAV based on improved multi-head self-attention
Liu et al. Extended faster R-CNN for long distance human detection: Finding pedestrians in UAV images
CN112287802A (en) Face image detection method, system, storage medium and equipment
CN116342519A (en) Image processing method based on machine learning
CN109344758B (en) Face recognition method based on improved local binary pattern
CN109919832A (en) A traffic image stitching method for unmanned driving
CN110490170A (en) A kind of face candidate frame extracting method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: 518108, Bao'an District, Shenzhen City, Guangdong Province, Tangtou Community, Shiyan Street, Baishi Road, Baishi Science and Technology Park, 2nd and 3rd floors of Building A

Patentee after: SHENZHEN INNOTRIK TECHNOLOGY Co.,Ltd.

Country or region after: China

Address before: 518000 Guangdong Province, Shenzhen City, Bao'an District, Xin'an Street, Lingzhiyuan Community, Area 22, Qinchengda Garden, Building 13, 1302-1307; the business premises are located on the sixth floor, north side of Building E, Gem Science and Technology Park, engaged in production and business activities.

Patentee before: SHENZHEN INNOTRIK TECHNOLOGY Co.,Ltd.

Country or region before: China
