Disclosure of Invention
The invention aims to solve the above problems by providing a multi-scale pedestrian detection method and device based on the combination of head and overall information, which improve the detection precision of a detector on multi-scale pedestrian targets and occluded pedestrian targets in complex, crowded scenes and reduce the detector's missed-detection rate.
In a first aspect, the present invention provides a multi-scale pedestrian detection method based on a combination of head and overall information, comprising:
Constructing a Faster R-CNN network model;
The backbone network of the Faster R-CNN network model is fused with an improved feature extraction network, and an image to be detected is input into the Faster R-CNN model of the fused and improved feature extraction network for feature extraction, so that an extracted feature map is obtained;
The sampling mode of the regional suggestion network is improved, a non-uniform difficult sample mining strategy based on shielding overlapping rate discrimination is constructed, the shielding overlapping rate of each sample in a sample set is calculated, a sample with higher shielding overlapping rate is given with higher weight, and the regional suggestion network is utilized to simultaneously generate a head candidate frame and an integral candidate frame set for all pedestrian instances in a scene;
Constructing a pedestrian head detection branch module and a pedestrian overall detection branch module, and obtaining a preliminary target detection result through the pedestrian head detection branch module and the pedestrian overall detection branch module, wherein the preliminary target detection result comprises a pedestrian head detection frame and a pedestrian overall detection frame;
Post-processing the obtained pedestrian head detection frame and the pedestrian overall detection frame, and screening out redundant detection frames generated in the combined detection process to obtain a final pedestrian detection result;
In order to further suppress false detections and missed detections arising in the post-processing link, a joint loss function is constructed in the loss function part; the joint loss function penalizes false detections that non-maximum suppression fails to suppress and missed detections caused by erroneous non-maximum suppression, so that the final detection result is more accurate.
As a preferred solution, fusing the backbone network of the Faster R-CNN network model with the improved feature extraction network, and inputting the image to be detected into the Faster R-CNN model fused with the improved feature extraction network for feature extraction to obtain an extracted feature map, includes:
Taking a ResNet network as the backbone network of the Faster R-CNN network model;
Learning feature information of the image to be detected by using the backbone network;
Performing feature fusion on the feature information acquired by the backbone network: drawing on the idea of dense connections, the feature concatenation strategy of the feature pyramid network (FPN) is improved so that features of all scales participate in computing the output features, which absorbs both the high-level semantic information expressed by large-scale feature maps and the detail information, such as low-level texture, expressed by small-scale feature maps, strengthens the network's fusion of multi-scale features, and yields the predicted values corresponding to each layer's scale features, as shown in formula (1) and formula (2);
(1)
(2)
Here, the coefficients in formulas (1) and (2) are all weight parameters, the mapping is a feature mapping function, and the inputs are the scale features of each layer obtained by the feature extraction network; during training, the weight parameters participate in gradient back-propagation and are updated to their most suitable values as the model is trained.
As a preferred solution, improving the sampling mode of the regional suggestion network, constructing a non-uniform difficult sample mining strategy based on occlusion overlap rate discrimination, calculating the occlusion overlap rate of each sample in the sample set, giving a higher weight to samples with a higher occlusion overlap rate, and using the regional suggestion network to simultaneously generate a set of head candidate frames and overall candidate frames for all pedestrian instances in the scene, includes:
The non-uniform difficult sample mining strategy based on occlusion overlap rate discrimination introduces a decision threshold and, based on the average occlusion overlap rate, divides the sample set into two classes, a difficult set and an ordinary set; the probability of each sample being drawn is then defined as shown in formula (3);
(3)
Here, the occlusion coefficient of each sample reflects the degree to which that sample is occluded;
The occlusion coefficient of each sample is further expressed as:
(4)
Here, the quantities in formula (4) are the number of samples required, the total number of candidate samples, the occlusion overlap rate of the individual sample, and the average occlusion overlap rate of the whole sample set; the decision threshold may be set to different values depending on the overall degree of occlusion in the dataset;
The occlusion overlap rate of each sample in the sample set and the average occlusion overlap rate of the whole sample set are further expressed as:
(5)
(6)
As a preferred solution, post-processing the obtained pedestrian head detection frames and pedestrian overall detection frames, and screening out the redundant detection frames generated during joint detection to obtain a final pedestrian detection result, includes:
A confidence-decay penalty factor is introduced for each of the head detection frames and the pedestrian overall detection frames obtained by joint detection, gradually lowering the confidence scores of overlapping detection frames so that competition between overlapping frames is reduced without suppressing them excessively; the score of each pedestrian overall detection frame and the score of each head detection frame are expressed as:
(7)
(8)
(9)
(10)
Here, the two reference frames are the overall frame and the head frame with the highest scores in the previous iteration, the two NMS thresholds correspond to overall detection and head detection respectively, and the remaining constant is a small fixed value set at initialization;
The confidence scores of the resulting head frame and overall frame are combined in a weighted sum to obtain a joint confidence score, expressed as:
(11)
Here, the weighting coefficient represents the weight given to the head detection score; if the head overlap or the overall overlap exceeds a preset threshold, the detection frame with the lower confidence is suppressed.
As a preferred solution, constructing a joint loss function in the loss function part to further suppress false detections and missed detections arising in the post-processing link, the joint loss function penalizing false detections that non-maximum suppression fails to suppress and missed detections caused by erroneous non-maximum suppression so that the final detection result is more accurate, includes:
The joint loss function is specifically expressed as:
(12)
Here, the four coefficients are weights that balance the loss terms; the two pull terms draw falsely detected head frames and overall frames closer together so that they can be removed, and the two push terms push apart head frames and overall frames that would otherwise be missed through erroneous non-maximum suppression, so that each head frame and overall frame correctly corresponds to its target frame;
The two loss functions for the head frames and overall frames that are missed through erroneous non-maximum suppression are further expressed as:
(13)
(14)
The two loss functions for pulling together falsely detected head frames and overall frames for removal are further expressed as:
(15)
(16)
Here, the two thresholds are the NMS thresholds of the head detection branch and the overall detection branch respectively; once the redundant frames generated by the two branches are removed, the final pedestrian detection result is obtained.
In a second aspect, the present invention provides a multi-scale pedestrian detection device based on a combination of head and overall information, comprising:
The building unit is used for building a Faster R-CNN network model;
The extraction unit is used for fusing the backbone network of the Faster R-CNN network model with the improved feature extraction network, inputting the image to be detected into the Faster R-CNN model fused with the improved feature extraction network for feature extraction, and obtaining an extracted feature map;
The generation unit is used for improving the sampling mode of the regional suggestion network, constructing a non-uniform difficult sample mining strategy based on shielding overlapping rate discrimination, calculating the shielding overlapping rate of each sample in the sample set, giving higher weight to the sample with higher shielding overlapping rate, and simultaneously generating a head candidate frame and an integral candidate frame set for all pedestrian examples in a scene by using the regional suggestion network;
The primary detection unit is used for constructing a pedestrian head detection branch module and a pedestrian overall detection branch module, and obtaining a primary target detection result through the pedestrian head detection branch module and the pedestrian overall detection branch module, wherein the primary target detection result comprises a pedestrian head detection frame and a pedestrian overall detection frame;
The post-processing unit is used for carrying out post-processing on the obtained pedestrian head detection frame and the pedestrian overall detection frame, and screening out redundant detection frames generated in the combined detection process to obtain a final pedestrian detection result;
The output unit is used for constructing a joint loss function in the loss function part to further suppress false detections and missed detections arising in the post-processing link, the joint loss function penalizing false detections that non-maximum suppression fails to suppress and missed detections caused by erroneous non-maximum suppression, so that the final detection result is more accurate.
As a preferred embodiment, the extraction unit is specifically configured to:
Taking a ResNet network as the backbone network of the Faster R-CNN network model;
Learning feature information of the image to be detected by using the backbone network;
Performing feature fusion on the feature information acquired by the backbone network: drawing on the idea of dense connections, the feature concatenation strategy of the feature pyramid network (FPN) is improved so that features of all scales participate in computing the output features, which absorbs both the high-level semantic information expressed by large-scale feature maps and the detail information, such as low-level texture, expressed by small-scale feature maps, strengthens the network's fusion of multi-scale features, and yields the predicted values corresponding to each layer's scale features, as shown in formula (1) and formula (2);
(1)
(2)
Here, the coefficients in formulas (1) and (2) are all weight parameters, the mapping is a feature mapping function, and the inputs are the scale features of each layer obtained by the feature extraction network; during training, the weight parameters participate in gradient back-propagation and are updated to their most suitable values as the model is trained.
As a preferred solution, the generating unit is specifically configured to:
The non-uniform difficult sample mining strategy based on occlusion overlap rate discrimination introduces a decision threshold and, based on the average occlusion overlap rate, divides the sample set into two classes, a difficult set and an ordinary set; the probability of each sample being drawn is then defined as shown in formula (3);
(3)
Here, the occlusion coefficient of each sample reflects the degree to which that sample is occluded;
The occlusion coefficient of each sample is further expressed as:
(4)
Here, the quantities in formula (4) are the number of samples required, the total number of candidate samples, the occlusion overlap rate of the individual sample, and the average occlusion overlap rate of the whole sample set; the decision threshold may be set to different values depending on the overall degree of occlusion in the dataset;
The occlusion overlap rate of each sample in the sample set and the average occlusion overlap rate of the whole sample set are further expressed as:
(5)
(6)
As a preferred solution, the post-processing unit is specifically configured to:
A confidence-decay penalty factor is introduced for each of the head detection frames and the pedestrian overall detection frames obtained by joint detection, gradually lowering the confidence scores of overlapping detection frames so that competition between overlapping frames is reduced without suppressing them excessively; the score of each pedestrian overall detection frame and the score of each head detection frame are expressed as:
(7)
(8)
(9)
(10)
Here, the two reference frames are the overall frame and the head frame with the highest scores in the previous iteration, the two NMS thresholds correspond to overall detection and head detection respectively, and the remaining constant is a small fixed value set at initialization;
The confidence scores of the resulting head frame and overall frame are combined in a weighted sum to obtain a joint confidence score, expressed as:
(11)
Here, the weighting coefficient represents the weight given to the head detection score; if the head overlap or the overall overlap exceeds a preset threshold, the detection frame with the lower confidence is suppressed.
As a preferred solution, the output unit is specifically configured to:
The joint loss function is specifically expressed as:
(12)
Here, the four coefficients are weights that balance the loss terms; the two pull terms draw falsely detected head frames and overall frames closer together so that they can be removed, and the two push terms push apart head frames and overall frames that would otherwise be missed through erroneous non-maximum suppression, so that each head frame and overall frame correctly corresponds to its target frame;
The two loss functions for the head frames and overall frames that are missed through erroneous non-maximum suppression are further expressed as:
(13)
(14)
The two loss functions for pulling together falsely detected head frames and overall frames for removal are further expressed as:
(15)
(16)
Here, the two thresholds are the NMS thresholds of the head detection branch and the overall detection branch respectively; once the redundant frames generated by the two branches are removed, the final pedestrian detection result is obtained.
Compared with the prior art, the invention has the following beneficial effects:
The embodiment of the invention provides a multi-scale pedestrian detection method and device based on the combination of head and overall information. First, features of different layers are densely connected, that is, all scale features participate in computing the final output features, so that each layer's features cover both semantic information and detailed texture information, which improves the network's sensitivity to multi-scale pedestrian targets. Second, the sampling mode of the regional suggestion network is optimized: the occlusion overlap rate of each sample in the sample set is calculated and samples with a higher occlusion overlap rate are given a higher weight, which strengthens learning of heavily occluded samples in the difficult set during training and improves the model's ability to detect occluded pedestrian targets. Third, a joint detection framework for pedestrian head and overall information is constructed, in which head detection assists pedestrian detection and thereby reduces the adverse effect of body occlusion on detection. Finally, the post-processing link and the loss function module are optimized, which weakens the interference of adjacent pedestrian targets, makes the screening of redundant frames more reasonable, and removes false detection frames generated by the two detection branches without excessively suppressing detection frames in dense areas, further reducing the missed-detection rate and the false detection rate. Compared with previous algorithms, the proposed pedestrian detection algorithm therefore has stronger detection capability for multi-scale pedestrian targets and occluded pedestrian targets in complex, crowded scenes and reduces the missed-detection rate of pedestrian targets.
Detailed Description
Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. In the following description, like modules are denoted by like reference numerals. In the case of the same reference numerals, their names and functions are also the same. Therefore, a detailed description thereof will not be repeated.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not to be construed as limiting the invention.
Referring to fig. 1, an embodiment of the present invention provides a multi-scale pedestrian detection method based on head and overall information association, including:
S101, constructing a Faster R-CNN network model;
S102, integrating the backbone network of the Faster R-CNN network model with an improved feature extraction network, and inputting an image to be detected into the Faster R-CNN model of the integrated and improved feature extraction network for feature extraction to obtain an extracted feature map;
S103, improving a sampling mode of a regional suggestion network, constructing a non-uniform difficult sample mining strategy based on shielding overlapping rate discrimination, calculating the shielding overlapping rate of each sample in a sample set, giving higher weight to samples with higher shielding overlapping rate, and simultaneously generating a head candidate frame and an integral candidate frame set for all pedestrian examples in a scene by using the regional suggestion network;
S104, constructing a pedestrian head detection branch module and a pedestrian overall detection branch module, and obtaining a preliminary target detection result through the pedestrian head detection branch module and the pedestrian overall detection branch module, wherein the preliminary target detection result comprises a pedestrian head detection frame and a pedestrian overall detection frame;
S105, performing post-processing on the obtained pedestrian head detection frame and the pedestrian overall detection frame, and screening out redundant detection frames generated in the combined detection process to obtain a final pedestrian detection result;
S106, constructing a joint loss function in the loss function part to further suppress false detections and missed detections arising in the post-processing link, wherein the joint loss function penalizes false detections that non-maximum suppression fails to suppress and missed detections caused by erroneous non-maximum suppression, so that the final detection result is more accurate.
In S102, fusing the backbone network of the Faster R-CNN network model with the improved feature extraction network, and inputting the image to be detected into the Faster R-CNN model fused with the improved feature extraction network for feature extraction to obtain an extracted feature map, includes:
Taking a ResNet network as the backbone network of the Faster R-CNN network model;
Learning feature information of the image to be detected by using the backbone network;
Performing feature fusion on the feature information acquired by the backbone network: drawing on the idea of dense connections, the feature concatenation strategy of the feature pyramid network (FPN) is improved so that features of all scales participate in computing the output features, which absorbs both the high-level semantic information expressed by large-scale feature maps and the detail information, such as low-level texture, expressed by small-scale feature maps, strengthens the network's fusion of multi-scale features, and yields the predicted values corresponding to each layer's scale features, as shown in formula (1) and formula (2);
(1)
(2)
Here, the coefficients in formulas (1) and (2) are all weight parameters, the mapping is a feature mapping function, and the inputs are the scale features of each layer obtained by the feature extraction network; during training, the weight parameters participate in gradient back-propagation and are updated to their most suitable values as the model is trained.
Further, in S103, improving the sampling mode of the regional suggestion network, constructing a non-uniform difficult sample mining strategy based on occlusion overlap rate discrimination, calculating the occlusion overlap rate of each sample in the sample set, giving a higher weight to samples with a higher occlusion overlap rate, and using the regional suggestion network to simultaneously generate a set of head candidate frames and overall candidate frames for all pedestrian instances in the scene, includes:
The non-uniform difficult sample mining strategy based on occlusion overlap rate discrimination introduces a decision threshold and, based on the average occlusion overlap rate, divides the sample set into two classes, a difficult set and an ordinary set; the probability of each sample being drawn is then defined as shown in formula (3);
(3)
Here, the occlusion coefficient of each sample reflects the degree to which that sample is occluded;
The occlusion coefficient of each sample is further expressed as:
(4)
Here, the quantities in formula (4) are the number of samples required, the total number of candidate samples, the occlusion overlap rate of the individual sample, and the average occlusion overlap rate of the whole sample set; the decision threshold may be set to different values depending on the overall degree of occlusion in the dataset;
The occlusion overlap rate of each sample in the sample set and the average occlusion overlap rate of the whole sample set are further expressed as:
(5)
(6)
Further, in S105, post-processing the obtained pedestrian head detection frames and pedestrian overall detection frames, and screening out the redundant detection frames generated during joint detection to obtain a final pedestrian detection result, includes:
A confidence-decay penalty factor is introduced for each of the head detection frames and the pedestrian overall detection frames obtained by joint detection, gradually lowering the confidence scores of overlapping detection frames so that competition between overlapping frames is reduced without suppressing them excessively; the score of each pedestrian overall detection frame and the score of each head detection frame are expressed as:
(7)
(8)
(9)
(10)
Here, the two reference frames are the overall frame and the head frame with the highest scores in the previous iteration, the two NMS thresholds correspond to overall detection and head detection respectively, and the remaining constant is a small fixed value set at initialization;
The confidence scores of the resulting head frame and overall frame are combined in a weighted sum to obtain a joint confidence score, expressed as:
(11)
Here, the weighting coefficient represents the weight given to the head detection score; if the head overlap or the overall overlap exceeds a preset threshold, the detection frame with the lower confidence is suppressed.
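The post-processing described for S105 can be illustrated with a minimal Python sketch. Formulas (7) to (11) are not reproduced here, so the sketch substitutes the linear Soft-NMS penalty (scale a score by one minus the overlap once the overlap reaches the NMS threshold) and assumes that the overall score receives the complement of the head weight in the weighted sum. The names `decay`, `joint_score`, `joint_soft_nms`, `min_score` and the dictionary keys are hypothetical and chosen only for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def decay(score, overlap, nms_threshold):
    """Assumed linear Soft-NMS penalty: shrink the score only at or above the threshold."""
    return score * (1.0 - overlap) if overlap >= nms_threshold else score


def joint_score(det, head_weight):
    """Weighted sum of head and overall scores (complementary weights are an assumption)."""
    return head_weight * det["head_score"] + (1.0 - head_weight) * det["body_score"]


def joint_soft_nms(dets, body_thr=0.5, head_thr=0.5, head_weight=0.5, min_score=0.3):
    """Keep detections greedily by joint score, decaying the scores of overlapping ones."""
    remaining = [dict(d) for d in dets]  # copy so the caller's scores are untouched
    kept = []
    while remaining:
        remaining.sort(key=lambda d: joint_score(d, head_weight), reverse=True)
        best = remaining.pop(0)
        kept.append(best)
        survivors = []
        for det in remaining:
            # Decay each branch's score by its own overlap with the current best frame.
            det["body_score"] = decay(det["body_score"], iou(best["body"], det["body"]), body_thr)
            det["head_score"] = decay(det["head_score"], iou(best["head"], det["head"]), head_thr)
            # Drop detections whose joint score has fallen below the small fixed floor.
            if joint_score(det, head_weight) >= min_score:
                survivors.append(det)
        remaining = survivors
    return kept
```

Because each branch decays independently, a pedestrian whose body is heavily overlapped but whose head is clearly separate keeps a usable head score, which is the motivation the text gives for combining the two scores.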
Further, in S106, constructing a joint loss function in the loss function part to further suppress false detections and missed detections arising in the post-processing link, the joint loss function penalizing false detections that non-maximum suppression fails to suppress and missed detections caused by erroneous non-maximum suppression so that the final detection result is more accurate, includes:
The joint loss function is specifically expressed as:
(12)
Here, the four coefficients are weights that balance the loss terms; the two pull terms draw falsely detected head frames and overall frames closer together so that they can be removed, and the two push terms push apart head frames and overall frames that would otherwise be missed through erroneous non-maximum suppression, so that each head frame and overall frame correctly corresponds to its target frame;
The two loss functions for the head frames and overall frames that are missed through erroneous non-maximum suppression are further expressed as:
(13)
(14)
The two loss functions for pulling together falsely detected head frames and overall frames for removal are further expressed as:
(15)
(16)
Here, the two thresholds are the NMS thresholds of the head detection branch and the overall detection branch respectively; once the redundant frames generated by the two branches are removed, the final pedestrian detection result is obtained.
The embodiment of the invention provides a multi-scale pedestrian detection method based on the combination of head and overall information. First, features of different layers are densely connected, that is, all scale features participate in computing the final output features, so that each layer's features cover both semantic information and detailed texture information, which improves the network's sensitivity to multi-scale pedestrian targets. Second, the sampling mode of the regional suggestion network is optimized: the occlusion overlap rate of each sample in the sample set is calculated and samples with a higher occlusion overlap rate are given a higher weight, which strengthens learning of heavily occluded samples in the difficult set during training and improves the model's ability to detect occluded pedestrian targets. Third, a joint detection framework for pedestrian head and overall information is constructed, in which head detection assists pedestrian detection and thereby reduces the adverse effect of body occlusion on detection. Finally, the post-processing link and the loss function module are optimized, which weakens the interference of adjacent pedestrian targets, makes the screening of redundant frames more reasonable, and removes false detection frames generated by the two detection branches without excessively suppressing detection frames in dense areas, further reducing the missed-detection rate and the false detection rate. Compared with previous algorithms, the proposed pedestrian detection algorithm therefore has stronger detection capability for multi-scale pedestrian targets and occluded pedestrian targets in complex, crowded scenes and reduces the missed-detection rate of pedestrian targets.
Referring to figs. 2, 3, 4 and 5, the invention addresses the problem that existing target detection algorithms lose detection precision and have a high missed-detection rate on small-scale pedestrian targets and occluded pedestrian targets in complex, dense scenes. To further facilitate understanding of the scheme, a multi-scale pedestrian detection method based on the combination of head and overall information according to one embodiment of the invention is described below; the overall method is shown in fig. 2, and the specific steps include:
Step 1, constructing a Faster R-CNN network model;
Step 2, fusing a backbone network of the Faster R-CNN network model with the improved feature extraction network, and inputting an image to be detected into the Faster R-CNN network model fused with the improved feature extraction network to perform feature extraction to obtain an extracted feature map;
Step 2.1, taking ResNet network as backbone network of the Faster R-CNN network model, wherein the structure diagram of ResNet network is shown in figure 3;
Step 2.2, fusing the improved feature extraction network with the Faster R-CNN network model, wherein the overall structure of the improved feature extraction network is shown in fig. 4. The specific implementation steps are as follows:
Step 2.2.1, learning characteristic information of an image to be detected by using a backbone network ResNet;
Step 2.2.2, performing feature fusion on the feature information acquired by the backbone network: drawing on the idea of dense connections, the feature concatenation strategy of the feature pyramid network (Feature Pyramid Network, FPN) is improved so that features of all scales participate in computing the output features, which simultaneously absorbs the high-level semantic information expressed by large-scale feature maps and the detail information, such as low-level texture, expressed by small-scale feature maps, strengthens the network's fusion of multi-scale features, and yields the predicted values corresponding to each layer's scale features, as shown in formula (1) and formula (2).
(1)
(2)
Here, the coefficients in formulas (1) and (2) are all weight parameters, the mapping is a feature mapping function, and the inputs are the scale features of each layer obtained by the feature extraction network. During training, the weight parameters participate in gradient back-propagation and are updated to their most suitable values as the model is trained. In this way the model learns which connections help the detection module's prediction and which connections are wasted computation, thereby improving the performance of the feature extraction network.
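As a rough illustration of the dense fusion in step 2.2.2, the following Python sketch lets every pyramid level contribute to every output level through a scalar weight. It does not reproduce formulas (1) and (2): nearest-neighbour resizing stands in for the feature mapping function, single-channel 2D lists stand in for feature maps, and the weights are plain floats, whereas in the described method they are parameters updated by back-propagation. The names `resize_nearest` and `dense_fuse` are hypothetical.

```python
def resize_nearest(fmap, out_h, out_w):
    """Nearest-neighbour resize of a 2D feature map stored as a list of rows."""
    in_h, in_w = len(fmap), len(fmap[0])
    return [
        [fmap[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def dense_fuse(features, weights):
    """Fuse all scales into each output scale.

    features: list of 2D maps, one per pyramid level (sizes may differ).
    weights:  weights[i][j] is the contribution of input level j to output level i
              (fixed here; learnable in the described method).
    """
    outputs = []
    for i, target in enumerate(features):
        h, w = len(target), len(target[0])
        fused = [[0.0] * w for _ in range(h)]
        for j, source in enumerate(features):
            # Bring level j to the resolution of level i before the weighted sum.
            resized = resize_nearest(source, h, w)
            for r in range(h):
                for c in range(w):
                    fused[r][c] += weights[i][j] * resized[r][c]
        outputs.append(fused)
    return outputs
```

With identity weights each output equals its own input level, which corresponds to no cross-scale connections; off-diagonal weights that shrink towards zero during training would mark the connections the text describes as wasted computation.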
Step 3, improving the sampling mode of the Region Proposal Network (RPN) by constructing a non-uniform hard sample mining strategy based on occlusion overlap rate discrimination. The occlusion overlap rate of each sample in the sample set is calculated, and samples with a higher occlusion overlap rate are given a higher weight, so that the learning of severely occluded samples in the hard sample set is strengthened during training and the model's ability to detect occluded pedestrian targets is further improved. The region proposal network is then used to simultaneously generate a set of head candidate frames and overall candidate frames for all pedestrian instances in the scene.
Step 3.1, for the non-uniform hard sample mining strategy based on occlusion overlap rate discrimination constructed in the invention, a decision threshold is introduced, and the sample set is divided, based on its average occlusion overlap rate, into two classes: a hard set and an ordinary set. The probability that each sample is drawn is then defined as shown in formula (3).
(3)
Here, the occlusion coefficient of each sample reflects the degree to which that sample is occluded.
Step 3.2, the occlusion coefficient of each sample in step 3.1 is further expressed as:
(4)
Here, the quantities in formula (4) are the number of samples required, the total number of candidate samples, the occlusion overlap rate of the individual sample, and the average occlusion overlap rate of the entire sample set. The decision threshold may be set to different values depending on the overall degree of occlusion in the dataset.
Step 3.3, the occlusion overlap rate of each sample in the sample set in step 3.2 and the average occlusion overlap rate of the entire sample set are further expressed as:
(5)
(6)
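The sampling idea in steps 3.1 to 3.3 can be sketched roughly as below. Formulas (3) to (6) are not reproduced here, so every concrete choice in this sketch is an assumption: the occlusion overlap rate is taken as a box's largest IoU with any other box, a sample is treated as hard when its rate exceeds the set average by more than a hypothetical threshold `tau`, and hard samples receive the weight `1 + rate`.

```python
import numpy as np

def iou_matrix(boxes):
    """Pairwise IoU for boxes given as rows of (x1, y1, x2, y2)."""
    x1 = np.maximum(boxes[:, None, 0], boxes[None, :, 0])
    y1 = np.maximum(boxes[:, None, 1], boxes[None, :, 1])
    x2 = np.minimum(boxes[:, None, 2], boxes[None, :, 2])
    y2 = np.minimum(boxes[:, None, 3], boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area[:, None] + area[None, :] - inter)

def occlusion_sampling_probs(boxes, tau=0.0):
    """Return (sampling probabilities, hard-set mask) for the given boxes."""
    iou = iou_matrix(boxes)
    np.fill_diagonal(iou, 0.0)           # ignore each box's overlap with itself
    occ = iou.max(axis=1)                # assumed per-sample occlusion overlap rate
    hard = occ > occ.mean() + tau        # assumed split into hard / ordinary sets
    weights = np.where(hard, 1.0 + occ, 1.0)  # assumed weighting of hard samples
    return weights / weights.sum(), hard

def draw_samples(boxes, num_samples, tau=0.0, seed=0):
    """Draw sample indices without replacement, favouring occluded boxes."""
    probs, _ = occlusion_sampling_probs(boxes, tau)
    rng = np.random.default_rng(seed)
    return rng.choice(len(boxes), size=num_samples, replace=False, p=probs)

# Two overlapping pedestrians and one isolated pedestrian.
boxes = np.array([[0, 0, 10, 10], [5, 0, 15, 10], [100, 100, 110, 110]], dtype=float)
probs, hard = occlusion_sampling_probs(boxes)
picked = draw_samples(boxes, num_samples=2)
```

In this toy scene the two overlapping boxes fall into the hard set and are drawn with higher probability than the isolated box, which is the qualitative behaviour the strategy aims for.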
Step 4, constructing a pedestrian head detection branch module and a pedestrian overall detection branch module, and obtaining a preliminary target detection result through the two modules, wherein the preliminary target detection result comprises pedestrian head detection frames and pedestrian overall detection frames.
Step 5, post-processing the obtained pedestrian head detection frames and pedestrian overall detection frames, and screening out the redundant detection results generated during joint detection, wherein pseudo code for the overall flow is shown in figure 5. The specific implementation steps are as follows:
Step 5.1, penalty factors that decay confidence are introduced for the head detection frames and the pedestrian overall detection frames obtained by joint detection, gradually lowering the confidence scores of overlapping detection frames so that their competitiveness is reduced without suppressing them excessively. First, the score of each pedestrian overall detection frame and the score of each head detection frame can be expressed as:
(7)
(8)
(9)
(10)
Here, the formulas refer to the overall frame and the head frame with the highest scores in the previous iteration, to the NMS thresholds corresponding to overall detection and head detection respectively, and to a small fixed value set at initialization.
Step 5.2, the confidence scores of the head frame and the overall frame obtained in step 5.1 are weighted and summed to obtain a joint confidence score, which can be expressed as:
(11)
Here, the weight is the share assigned to the head detection score. If the head overlap or the overall overlap is larger than a preset threshold, the detection frame with the lower confidence is suppressed.
Step 6, to address the false detections and missed detections that arise in the non-maximum suppression stage, the invention constructs a joint loss function in the loss function part. Its purpose is to penalize false detections that non-maximum suppression fails to suppress and missed detections caused by erroneous non-maximum suppression. The joint loss function is expressed as:
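A minimal sketch of the score handling in steps 5.1 and 5.2 follows. Formulas (7) to (11) are not reproduced here, so the sketch substitutes the well-known linear Soft-NMS decay for the penalty factor and assumes that the head and overall weights sum to one; the function names and the parameter `head_weight` are hypothetical.

```python
def decay_score(score, iou_with_top, nms_thresh):
    """Linear Soft-NMS-style decay (assumed form of the penalty factor).

    The score is left unchanged when the overlap with the current
    highest-scoring frame is at or below the threshold, and shrinks in
    proportion to the overlap otherwise.
    """
    if iou_with_top > nms_thresh:
        return score * (1.0 - iou_with_top)
    return score

def joint_score(head_score, body_score, head_weight=0.5):
    """Weighted sum of head and overall scores (weights assumed to sum to 1)."""
    return head_weight * head_score + (1.0 - head_weight) * body_score

# A frame overlapping the top-scoring frame by 0.6 with an NMS threshold of 0.5.
decayed_body = decay_score(0.9, iou_with_top=0.6, nms_thresh=0.5)
decayed_head = decay_score(0.8, iou_with_top=0.3, nms_thresh=0.5)
combined = joint_score(decayed_head, decayed_body, head_weight=0.5)
```

Here the heavily overlapped overall frame keeps a reduced score instead of being discarded outright, while the clearly separated head frame keeps its full score, so the joint score still supports the detection.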
(12)
Here, the four coefficients are weights for balancing the loss terms. Two of the loss terms "pull in" the falsely detected head frames and overall frames so that they can be eliminated, and the other two "push away" the head frames and overall frames that were missed because of erroneous non-maximum suppression, so that they correspond correctly to their target frames.
Step 6.1, the loss functions that "push away" the missed head frames and overall frames suppressed by erroneous non-maximum suppression can be further expressed as:
(13)
(14)
Step 6.2, the loss functions that "pull in" the falsely detected head frames and overall frames so that they can be eliminated can be further expressed as:
(15)
(16)
Here, the two thresholds are the NMS thresholds corresponding to the head detection branch and the overall detection branch, respectively. After the redundant frames generated by the head detection branch and the overall detection branch are removed, the final pedestrian detection result is obtained.
The technical solution of the invention is well suited to fields such as intelligent surveillance, driver assistance, video monitoring of large venues, and intelligent robots.
The effectiveness of the invention can be demonstrated by the following experimental verification:
(I) Experimental conditions and experimental contents
1. The ImageNet dataset is used as the pre-training dataset of the pedestrian detection model. The model is then trained on the CrowdHuman, CityPersons, and TJU-DHD-pedestrian datasets respectively, with data augmentation strategies including random cropping, random horizontal flipping, and random brightness perturbation applied during training. In the actual training process, a stochastic gradient descent optimizer is used to train for 35 epochs, with the momentum factor set to 0.9. The learning rate follows a warm-up strategy: the initial learning rate is set to a low value, and from iteration 1 to iteration 800 of each epoch the learning rate increases linearly to its target value and then remains unchanged. At the 24th epoch the learning rate is reduced to 10% of the original value, and at the 28th epoch it is reduced to 1%.
2. The overall performance of the method of the invention is evaluated using the average precision (AP) and the log-average miss rate (MR^-2) as evaluation indices. The specific calculation of the average precision and the log-average miss rate can be expressed as follows:
Here, precision is the proportion of correctly predicted pedestrian targets among all prediction results; recall is the proportion of correctly predicted pedestrian targets among all pedestrian targets actually annotated in the image; the miss rate is the proportion of annotated pedestrian targets that are not detected; and the remaining quantity is the average number of false positives in each image. The higher the AP value and the lower the MR^-2 value, the better the detection performance of the algorithm.
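For illustration, the quantities behind these indices can be computed as in the sketch below. The document's own formulas are not reproduced here, so the sketch assumes the commonly used convention in which MR^-2 is the geometric mean of the miss rate sampled at nine false-positives-per-image (FPPI) values spaced evenly in log space over [10^-2, 10^0]. The function names are hypothetical.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def log_average_miss_rate(fppi, miss_rate, num_points=9):
    """Geometric mean of the miss rate at log-spaced FPPI reference points.

    fppi must be sorted in ascending order, and miss_rate holds the miss
    rate (1 - recall) at the same operating points.
    """
    fppi = np.asarray(fppi, dtype=float)
    miss_rate = np.asarray(miss_rate, dtype=float)
    refs = np.logspace(-2.0, 0.0, num_points)
    sampled = []
    for ref in refs:
        idx = np.where(fppi <= ref)[0]
        # Use the last operating point not exceeding the reference FPPI;
        # if the curve starts above it, count the miss rate as 1.
        sampled.append(miss_rate[idx[-1]] if idx.size else 1.0)
    sampled = np.maximum(np.array(sampled), 1e-10)  # avoid log(0)
    return float(np.exp(np.log(sampled).mean()))

p, r = precision_recall(tp=8, fp=2, fn=2)
mr2 = log_average_miss_rate([0.001, 0.01, 0.1, 1.0], [0.5, 0.5, 0.5, 0.5])
```

A detector that misses half of the pedestrians at every operating point yields an MR^-2 of 0.5 under this convention, and lower values indicate fewer missed pedestrians at a comparable false-positive budget.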
(II) Experimental results
As shown in figs. 6, 7 and 8, on the CrowdHuman, CityPersons and TJU-DHD-pedestrian pedestrian detection datasets, the method achieves a higher average precision and a lower log-average miss rate than several popular pedestrian detection algorithms and than the unimproved baseline Faster R-CNN algorithm. The experimental results therefore show that the method performs better than these algorithms and can better detect multi-scale pedestrian targets and severely occluded pedestrian targets.
Compared with the prior art, the multi-scale pedestrian detection method provided by the invention has the following characteristics:
1. Compared with existing YOLO-series and R-CNN-series network models, the method of the invention densely connects features from different layers in the feature extraction stage, so that features of all scales participate in computing the final output features. Each resulting feature layer therefore covers both semantic information and detailed texture information, which improves the network's ability to detect multi-scale pedestrian targets. From the perspective of information flow, the feature extraction network constructed by the invention can learn rich and detailed multi-scale features: deep semantic information is fused into shallow features and shallow detailed texture information is fused into deep features, which improves the network's sensitivity to multi-scale pedestrian targets. From the perspective of parameter optimization, when multi-scale features are fused with a dense connection strategy, gradients can be back-propagated more efficiently during training, so that the convergence of the network model is accelerated without excessively increasing network complexity, which is more conducive to obtaining a high-quality global optimal solution.
2. Compared with existing R-CNN-series network models, the invention optimizes the sampling mode of the region proposal network: the occlusion overlap rate of each sample in the sample set is calculated and samples with a higher occlusion overlap rate are given a higher weight, which strengthens the learning of severely occluded samples in the hard sample set during training. As training proceeds, the ability to detect occluded pedestrian targets gradually improves, so that the method performs better in complex, densely crowded scenes.
3. Compared with existing YOLO-series and R-CNN-series network models, the method of the invention designs dual head and overall detection branches at the core of the model for joint detection, so that head detection is fully used to assist pedestrian detection and the influence of body occlusion on detector performance is reduced.
4. Compared with existing YOLO-series and R-CNN-series network models, the method of the invention optimizes the post-processing stage and the loss function module. This weakens the interference that adjacent pedestrian targets cause in detection and makes the screening of redundant frames more intelligent and reasonable, so that detection frames in dense regions are not excessively suppressed while the false detection frames generated by the two detection branches are eliminated, further reducing the miss rate and the false detection rate of pedestrian detection.
Accordingly, as shown in fig. 9, in an embodiment of the present invention, there is provided a multi-scale pedestrian detection device based on a combination of head and overall information, including:
The building unit 901 is configured to build a Faster R-CNN network model;
The extracting unit 902 is configured to fuse the backbone network of the Faster R-CNN network model with an improved feature extraction network, and to input an image to be detected into the Faster R-CNN model fused with the improved feature extraction network for feature extraction, so as to obtain an extracted feature map;
The generating unit 903 is configured to improve the sampling mode of the region proposal network by constructing a non-uniform hard sample mining strategy based on occlusion overlap rate discrimination, in which the occlusion overlap rate of each sample in the sample set is calculated and samples with a higher occlusion overlap rate are given a higher weight, and to use the region proposal network to simultaneously generate a set of head candidate frames and overall candidate frames for all pedestrian instances in the scene;
The primary detection unit 904 is configured to construct a pedestrian head detection branch module and a pedestrian overall detection branch module, and obtain a primary target detection result through the pedestrian head detection branch module and the pedestrian overall detection branch module, where the primary target detection result includes a pedestrian head detection frame and a pedestrian overall detection frame;
The post-processing unit 905 is configured to perform post-processing on the obtained pedestrian head detection frame and the obtained pedestrian overall detection frame, and screen out a redundant detection frame generated in the combined detection process, so as to obtain a final pedestrian detection result;
The output unit 906 is configured to construct a joint loss function in the loss function part in order to further suppress the false detections and missed detections that occur in the post-processing stage, wherein the joint loss function penalizes false detections that non-maximum suppression fails to suppress and missed detections caused by erroneous non-maximum suppression, so that the final detection result is more accurate.
As a preferred solution, the extracting unit 902 is specifically configured to:
Taking a ResNet network as the backbone network of the Faster R-CNN network model;
Learning feature information of the image to be detected by using the backbone network;
Performing feature fusion on the feature information acquired by the backbone network, wherein the feature splicing strategy of the feature pyramid network (FPN) is improved by incorporating the idea of dense connections, that is, features of all scales participate in computing the output feature information, so that the strength of the large-scale feature maps in expressing high-level semantic information and the strength of the small-scale feature maps in expressing detail information such as low-level texture are both absorbed, the fusion of multi-scale features by the network is enhanced, and the predicted value corresponding to each layer's scale feature is obtained, as shown in formula (1) and formula (2);
(1)
(2)
Here, the coefficients in formulas (1) and (2) are all weight parameters, the mapping applied to the feature maps is a function of the feature map, and the inputs are the per-layer scale features acquired by the feature extraction network; during training, the weight parameters participate in gradient back-propagation and are updated to suitable values as the model is trained.
As a preferable solution, the generating unit 903 is specifically configured to:
The non-uniform hard sample mining strategy based on occlusion overlap rate discrimination is realized by introducing a decision threshold and dividing the sample set, based on its average occlusion overlap rate, into two classes, a hard set and an ordinary set; the probability that each sample is drawn is defined as shown in formula (3);
(3)
Here, the occlusion coefficient of each sample reflects the degree to which that sample is occluded;
The occlusion coefficient of each sample is further expressed as:
(4)
Here, the quantities in formula (4) are the number of samples required, the total number of candidate samples, the occlusion overlap rate of the individual sample, and the average occlusion overlap rate of the entire sample set; the decision threshold may be set to different values depending on the overall degree of occlusion in the dataset;
The occlusion overlap rate of each sample in the sample set and the average occlusion overlap rate of the entire sample set are further expressed as:
(5)
(6)
As a preferred solution, the post-processing unit 905 is specifically configured to:
Introducing penalty factors that decay confidence for the head detection frames and the pedestrian overall detection frames obtained by joint detection, and gradually lowering the confidence scores of overlapping detection frames, so that the competitiveness of the overlapping frames is reduced without suppressing them excessively, wherein the score of each pedestrian overall detection frame and the score of each head detection frame are expressed as:
(7)
(8)
(9)
(10)
Here, the formulas refer to the overall frame and the head frame with the highest scores in the previous iteration, to the NMS thresholds corresponding to overall detection and head detection respectively, and to a small fixed value set at initialization;
Weighting and summing the confidence scores of the obtained head frame and overall frame to obtain a joint confidence score, which is expressed as:
(11)
Here, the weight is the share assigned to the head detection score; if the head overlap or the overall overlap is larger than a preset threshold, the detection frame with the lower confidence is suppressed.
As a preferred solution, the output unit 906 is specifically configured to:
The joint loss function is expressed as:
(12)
Here, the four coefficients are weights for balancing the loss terms; two of the loss terms pull in the falsely detected head frames and overall frames so that they can be eliminated, and the other two push away the head frames and overall frames that were missed because of erroneous non-maximum suppression, so that they correspond correctly to their target frames;
The loss functions that push away the missed head frames and overall frames suppressed by erroneous non-maximum suppression are further expressed as:
(13)
(14)
The loss functions that pull in the falsely detected head frames and overall frames so that they can be eliminated are further expressed as:
(15)
(16)
Here, the two thresholds are the NMS thresholds corresponding to the head detection branch and the overall detection branch, respectively; after the redundant frames generated by the head detection branch and the overall detection branch are removed, the final pedestrian detection result is obtained.
The embodiment of the invention provides a multi-scale pedestrian detection device based on the combination of head and overall information. First, features from different layers are densely connected, that is, features of all scales participate in computing the final output features, so that each feature layer covers both semantic information and detailed texture information, which improves the network's sensitivity to multi-scale pedestrian targets. Second, the sampling mode of the region proposal network is optimized: the occlusion overlap rate of each sample in the sample set is calculated and samples with a higher occlusion overlap rate are given a higher weight, which strengthens the learning of severely occluded samples in the hard sample set during training and further improves the model's ability to detect occluded pedestrian targets. Third, a joint detection framework for the pedestrian head and the pedestrian as a whole is constructed, so that head detection assists pedestrian detection and the adverse effect of body occlusion on detection is reduced. Finally, the post-processing stage and the loss function module are optimized, which weakens the interference of adjacent pedestrian targets on detection and makes the screening of redundant frames more intelligent and reasonable, so that detection frames in dense regions are not excessively suppressed while the false detection frames generated by the two detection branches are removed, further reducing the miss rate and the false detection rate of pedestrian detection. Compared with previous algorithms, the pedestrian detection algorithm provided by the invention therefore has a stronger ability to detect multi-scale pedestrian targets and occluded pedestrian targets in complex, densely crowded scenes, and can reduce the miss rate for pedestrian targets.
While embodiments of the present invention have been illustrated and described above, it will be appreciated that the above described embodiments are illustrative and should not be construed as limiting the invention. Changes, modifications, substitutions and variations of the above-described embodiments may be made by those of ordinary skill in the art within the scope of the present invention.
The above embodiments of the present invention do not limit the scope of the present invention. Any other corresponding changes and modifications made in accordance with the technical idea of the present invention shall be included in the scope of the claims of the present invention.