CN119904894A

CN119904894A - Multi-scale pedestrian detection method and device based on joint head and overall information

Info

Publication number: CN119904894A
Application number: CN202510409873.6A
Authority: CN
Inventors: 马晞茗; 李宁; 吴迪
Original assignee: Changchun Institute of Optics Fine Mechanics and Physics of CAS
Current assignee: Changchun Institute of Optics Fine Mechanics and Physics of CAS
Priority date: 2025-04-02
Filing date: 2025-04-02
Publication date: 2025-04-29
Anticipated expiration: 2045-04-02
Also published as: CN119904894B

Abstract

The present invention relates to the field of computer vision, and specifically provides a multi-scale pedestrian detection method and device based on the joint use of head and overall information. First, features at different levels are densely connected to improve the sensitivity of the network to multi-scale pedestrian targets. Second, the sampling method of the region proposal network is optimized: the occlusion overlap rate of each sample in the sample set is calculated, which improves the adaptability of the model to occluded pedestrian targets. Third, a joint detection framework for pedestrian heads and overall information is constructed to reduce the adverse effect of body occlusion on detection. Finally, the post-processing stage and the loss function module are optimized to reduce the interference of adjacent pedestrian targets with detection, to make the screening of redundant frames more intelligent and reasonable, and to further reduce the missed detection rate and false detection rate of pedestrian detection. The method enhances the detection of multi-scale and occluded pedestrian targets in complex, crowded scenes and reduces the pedestrian missed detection rate.

Description

Multi-scale pedestrian detection method and device based on head and whole information combination

Technical Field

The invention relates to the technical field of computer vision, and in particular provides a multi-scale pedestrian detection method and device based on the combination of head and overall information.

Background

In the field of computer vision, pedestrian detection is one of the most popular and widely applied research directions. The pedestrian detection task is to accurately detect and locate pedestrian instances in images, videos, and video streams using computer vision techniques, and is essentially a classification and regression process. In practice, pedestrian detection plays an important role in autonomous driving, intelligent surveillance, intelligent robots, human-machine interaction, and similar fields, and has high application value. In autonomous driving and driver assistance, pedestrian detection can determine in real time whether a pedestrian has suddenly entered the area in front of the vehicle, so that the vehicle can adjust in time and the safety of both pedestrians and driving is ensured. The benefit to autonomous driving safety is especially pronounced in crowded scenes on congested road sections. In intelligent surveillance, most public places today use cameras to monitor the whole scene and count crowd flow in real time. During an epidemic in particular, most large public venues strictly control pedestrian density; pedestrian detection allows the number of people in a venue to be counted accurately in real time, and analyzing and forecasting these data helps managers take appropriate measures.
In the field of intelligent robots, sensors such as cameras pass signals describing the surrounding scene to the robot, and the pedestrian detection algorithm acts as an important perception network in the robot's "brain", helping it perceive pedestrian targets quickly and accurately and make timely decisions. In human-machine interaction, devices such as intelligent meal delivery vehicles and intelligent parcel delivery vehicles on campuses and in restaurants integrate pedestrian detection among their functions, and by interacting with pedestrians they apply artificial intelligence to better serve people in daily life.

Deep-learning-based pedestrian detection algorithms can be divided by detection approach into single-stage and two-stage algorithms. Single-stage algorithms are mainly represented by the YOLO series of network models, and two-stage algorithms by the R-CNN series. However, neither class has been studied and designed for specific detection scenes, and both detect small-scale pedestrian targets at far viewing distances and heavily occluded, overlapping pedestrian targets in dense scenes poorly, which shows mainly as low detection precision and a high missed detection rate. In addition, both classes overlook an important point: although a pedestrian's body is easily occluded and overlapped in complex dense scenes, the head region is usually only slightly occluded or not occluded at all. Even when the body is severely occluded, the head region can still provide important feature information, which matters greatly for detecting pedestrian targets in dense scenes. The head region is small overall, however, and is easily confused with hands, elbows, and small surrounding objects, causing false detections. Accurate pedestrian detection is therefore difficult with overall detection alone or head detection alone. Head detection and overall detection need to be combined effectively, and richer, finer multi-scale feature information needs to be obtained at the feature extraction stage, so that the strengths of both are better exploited and detection accuracy improves.
It is therefore necessary to design a multi-scale pedestrian detection algorithm based on the combination of head and overall information, which is of great significance for improving the detection accuracy of occluded and multi-scale pedestrian targets in dense scenes.

Many related solutions addressing these needs have been proposed in China and abroad.

Chinese patent application CN111767882A, published on October 13, 2020 and entitled "a multi-mode pedestrian detection method based on an improved YOLO model", improves pedestrian detection by incorporating the CBAM attention mechanism and optimizing the loss function. However, it has the following problems: (1) its feature extraction stage is not sensitive enough to multi-scale pedestrian targets, and its ability to learn feature information is especially weak for small-scale pedestrians at far viewing distances; (2) there is still room for improvement in the missed detection of heavily overlapping pedestrian targets. Chinese patent application CN113989939A, published on January 28, 2022, is entitled "a small target pedestrian detection system based on an improved YOLO algorithm". Dense crowding of pedestrians is unavoidable in real detection scenes, however, so the method's ability to detect occluded pedestrian targets is weak, which limits its practical use. Chinese patent application CN114882527A, published on August 9, 2022 and entitled "pedestrian detection method and system based on dynamic grouping convolution", performs pedestrian detection mainly with grouped convolution, but it does not consider the specific scenes of pedestrian detection, and its execution efficiency in complex, crowded scenes is limited.

In existing pedestrian detection technology, most schemes rely on classical object detection algorithms to detect pedestrian targets in a scene. By detection approach, these divide into single-stage algorithms represented by the YOLO series of network models and two-stage algorithms represented by the Faster R-CNN network model. However, most classical object detection algorithms do not specifically consider the detection scene. In complex, crowded scenes in particular, the robustness of these mainstream detection algorithms suffers, mainly in the following two respects:

(1) Variable pedestrian scale degrades the overall performance of the detector. Most current pedestrian detection datasets are captured with cameras and then annotated, and cameras follow the rule that near objects appear large and far objects appear small: pedestrian targets at near viewing distances are large overall, and those at far viewing distances are small. Because small-scale pedestrian targets have relatively low resolution, the algorithm tends to learn limited feature information or weak feature representations during feature extraction, and it is difficult to remain highly sensitive to pedestrian targets of different scales, which easily leads to missed or false detections.

(2) Heavy occlusion of pedestrian targets degrades the overall performance of the detector. In pedestrian detection tasks in complex, crowded scenes, pedestrian targets are often occluded to some degree. Analysis of the images in pedestrian detection datasets shows that pedestrian occlusion falls into two cases: intra-class occlusion and inter-class occlusion. Intra-class occlusion means that pedestrian targets occlude one another. Inter-class occlusion means that pedestrians are obscured by background content, which mainly includes buildings, trees, vehicles, articles carried by the pedestrians themselves, and articles carried by other nearby pedestrians. Both kinds of occlusion reduce the visible proportion of a pedestrian's whole body, make feature extraction harder, and reduce the information available to the detection module for inference. This impairs the accuracy of pedestrian localization and therefore the overall performance of the pedestrian detection algorithm.

Disclosure of Invention

The invention aims to solve the above problems and provides a multi-scale pedestrian detection method and device based on the combination of head and overall information, which can improve the detection precision of a detector for multi-scale and occluded pedestrian targets in complex, crowded scenes and reduce the detector's missed detection rate.

In a first aspect, the present invention provides a multi-scale pedestrian detection method based on a combination of head and overall information, comprising:

Constructing a Faster R-CNN network model;

Fusing the backbone network of the Faster R-CNN network model with an improved feature extraction network, and inputting an image to be detected into the Faster R-CNN model fused with the improved feature extraction network for feature extraction, to obtain an extracted feature map;

Improving the sampling mode of the region proposal network by constructing a non-uniform hard sample mining strategy based on occlusion overlap rate discrimination: calculating the occlusion overlap rate of each sample in a sample set, giving a higher weight to samples with a higher occlusion overlap rate, and using the region proposal network to simultaneously generate a set of head candidate frames and overall candidate frames for all pedestrian instances in the scene;

Constructing a pedestrian head detection branch module and a pedestrian overall detection branch module, and obtaining a preliminary target detection result through the two branch modules, wherein the preliminary target detection result comprises pedestrian head detection frames and pedestrian overall detection frames;

Post-processing the obtained pedestrian head detection frames and pedestrian overall detection frames, and screening out redundant detection frames generated during joint detection, to obtain a final pedestrian detection result;

To further suppress false detections and missed detections arising in the post-processing stage, constructing a joint loss function in the loss function part, wherein the joint loss function penalizes false detections that non-maximum suppression fails to suppress and missed detections caused by erroneous non-maximum suppression, so that the final detection result is more accurate.

As a preferred solution, fusing the backbone network of the Faster R-CNN network model with the improved feature extraction network, and inputting the image to be detected into the Faster R-CNN model fused with the improved feature extraction network for feature extraction to obtain the extracted feature map, includes:

Taking a ResNet network as the backbone network of the Faster R-CNN network model;

Learning feature information of the image to be detected using the backbone network;

Performing feature fusion on the feature information obtained by the backbone network: drawing on the idea of dense connection, the feature concatenation strategy of the feature pyramid network (FPN) is improved so that features of all scales take part in computing and outputting the feature information. This combines the high-level semantic information expressed by large-scale feature maps with the detail information, such as low-level texture, expressed by small-scale feature maps, strengthens the network's fusion of multi-scale features, and yields the predicted values corresponding to the features of each scale level, as shown in formulas (1) and (2);

(1)

(2)

Here, the coefficients in formulas (1) and (2) are all weight parameters, the mapping applied in the formulas is a function of the feature map, and the inputs are the features of each scale level obtained by the feature extraction network. During training, the weight parameters take part in gradient back-propagation and are updated to the most suitable values as the model is trained.
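The dense fusion described above can be illustrated with a minimal sketch. Formulas (1) and (2) are not reproduced in this text, so the form used here is an assumption: every output level is a normalized weighted sum of all input levels, each resized to the output resolution. The softmax normalization, the nearest-neighbour resizing, and the names `resize_nearest`, `softmax`, and `dense_fuse` are hypothetical choices for illustration, and the weights are plain floats rather than trained parameters.

```python
import math


def resize_nearest(fmap, out_h, out_w):
    """Nearest-neighbour resize of a 2-D feature map stored as a list of rows."""
    in_h, in_w = len(fmap), len(fmap[0])
    return [[fmap[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]


def softmax(logits):
    """Turn unconstrained weights into positive weights that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense_fuse(features, weight_logits):
    """Fuse every input level into every output level (assumed form).

    features: list of 2-D maps, one per scale level.
    weight_logits[k][j]: weight of input level j when building output level k.
    In a real network these would be learnable and receive gradients.
    """
    outputs = []
    for k, target in enumerate(features):
        h, w = len(target), len(target[0])
        weights = softmax(weight_logits[k])
        fused = [[0.0] * w for _ in range(h)]
        for wj, fmap in zip(weights, features):
            resized = resize_nearest(fmap, h, w)
            for r in range(h):
                for c in range(w):
                    fused[r][c] += wj * resized[r][c]
        outputs.append(fused)
    return outputs


# Toy example: a 4x4 map filled with 1.0 and a 2x2 map filled with 3.0.
level_a = [[1.0] * 4 for _ in range(4)]
level_b = [[3.0] * 2 for _ in range(2)]
fused = dense_fuse([level_a, level_b], [[0.0, 0.0], [0.0, 0.0]])
```

With equal weights, each output level keeps its own resolution but mixes both inputs, so every fused value in this toy case is the average of 1.0 and 3.0.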

As a preferred solution, improving the sampling mode of the region proposal network by constructing a non-uniform hard sample mining strategy based on occlusion overlap rate discrimination, calculating the occlusion overlap rate of each sample in the sample set, giving a higher weight to samples with a higher occlusion overlap rate, and using the region proposal network to simultaneously generate a set of head candidate frames and overall candidate frames for all pedestrian instances in the scene, includes:

The non-uniform hard sample mining strategy based on occlusion overlap rate discrimination introduces a decision threshold and, according to the average occlusion overlap rate, divides the sample set into two classes, a hard set and an ordinary set. The probability of each sample being drawn is defined as shown in formula (3);

(3)

Here, the occlusion coefficient of a given sample reflects the degree to which that sample is occluded;

The occlusion coefficient of each sample is further expressed as:

(4)

Here, the quantities in formula (4) denote, respectively, the number of samples required, the total number of candidate samples, the occlusion overlap rate of the given sample in the sample set, and the average occlusion overlap rate of the whole sample set. The decision threshold may be set to different values depending on the overall degree of occlusion in the dataset;

The occlusion overlap rate of each sample in the sample set and the average occlusion overlap rate of the whole sample set are further expressed as:

(5)

(6)
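The sampling strategy above can be sketched as follows. Formulas (3) to (6) are not reproduced in this text, so every concrete rule here is an assumption made for illustration: the occlusion overlap rate of a sample is taken as its highest IoU with any other box, a sample is placed in the hard set when its rate exceeds `tau` times the average rate, and hard samples receive a fixed `boost` in sampling weight. The names `iou`, `occlusion_overlap_rates`, and `sampling_probabilities` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def occlusion_overlap_rates(boxes):
    """Assumed definition: highest IoU between a box and any other box."""
    return [max((iou(box, other) for j, other in enumerate(boxes) if j != i),
                default=0.0)
            for i, box in enumerate(boxes)]


def sampling_probabilities(boxes, tau=1.0, boost=2.0):
    """Return (probabilities, is_hard) for non-uniform sampling.

    tau: decision threshold relative to the average overlap rate (assumed rule).
    boost: extra sampling weight given to hard samples (assumed value).
    """
    if not boxes:
        return [], []
    rates = occlusion_overlap_rates(boxes)
    mean_rate = sum(rates) / len(rates)
    is_hard = [rate > tau * mean_rate for rate in rates]
    weights = [boost if hard else 1.0 for hard in is_hard]
    total = sum(weights)
    return [w / total for w in weights], is_hard


# Two overlapping pedestrians and one isolated pedestrian.
boxes = [(0, 0, 10, 20), (4, 0, 14, 20), (50, 0, 60, 20)]
probs, is_hard = sampling_probabilities(boxes)
```

In this toy case the two overlapping boxes fall into the hard set and are each drawn twice as often as the isolated one, which is the qualitative behaviour the text describes.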

As a preferred solution, post-processing the obtained pedestrian head detection frames and pedestrian overall detection frames, and screening out redundant detection frames generated during joint detection to obtain the final pedestrian detection result, includes:

Introducing a confidence-decay penalty factor for the head detection frames and the pedestrian overall detection frames obtained from joint detection, so that the confidence scores of overlapping detection frames are reduced gradually. This lessens competition between overlapping frames without suppressing them excessively. The score of each pedestrian overall detection frame and the score of each head detection frame are expressed as:

(7)

(8)

(9)

(10)

Here, the two reference frames are, respectively, the overall frame and the head frame with the highest scores in the previous iteration, the two thresholds are the NMS thresholds for overall detection and head detection, respectively, and the remaining constant is a small fixed value set initially;
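The confidence decay described above can be sketched as follows. Formulas (7) to (10) are not reproduced in this text, so the decay rule is an assumption: a linear penalty in the style of Soft-NMS, where a frame overlapping the current top-scoring frame by at least the NMS threshold has its score multiplied by one minus the overlap. Treating the "small fixed value" as a minimum score `min_score` is also an assumption, and `decay_nms` is a hypothetical name. The same routine would be run once on head frames and once on overall frames, each with its own threshold.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def decay_nms(boxes, scores, nms_thresh, min_score=0.05):
    """Decay, rather than delete, the scores of overlapping frames.

    Returns a list of (index, final_score) in the order frames were selected.
    """
    scores = list(scores)            # work on a copy
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        if scores[best] < min_score:
            break                    # every remaining score is lower still
        keep.append((best, scores[best]))
        for i in remaining:
            overlap = iou(boxes[best], boxes[i])
            if overlap >= nms_thresh:
                scores[i] *= 1.0 - overlap   # assumed linear penalty
    return keep


boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 0, 30, 10)]
scores = [0.9, 0.8, 0.7]
keep = decay_nms(boxes, scores, nms_thresh=0.5)
strict = decay_nms(boxes, scores, nms_thresh=0.5, min_score=0.2)
```

The heavily overlapping second frame is not discarded outright: its score drops and it is ranked last, and it is removed only when the minimum score is raised above its decayed value.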

Weighting and summing the confidence scores of the resulting head frame and overall frame to obtain a joint confidence score, expressed as:

(11)

Here, the weighting coefficient represents the weight given to the head detection score. If the head overlap or the overall overlap exceeds a preset threshold, the detection frame with the lower confidence is suppressed.
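A minimal sketch of the joint scoring and suppression step follows. Formula (11) is not reproduced in this text, so the form used here is an assumption: the head score is weighted by `head_weight` and the overall score by its complement. The greedy ordering by joint score, the dictionary layout of each head/overall pair, and the names `joint_score` and `joint_suppress` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def joint_score(s_head, s_body, head_weight=0.5):
    """Weighted sum of head and overall confidence (assumed complementary weights)."""
    return head_weight * s_head + (1.0 - head_weight) * s_body


def joint_suppress(pairs, head_thresh=0.5, body_thresh=0.5, head_weight=0.5):
    """Keep the higher-scoring pair whenever head OR overall overlap is too large.

    pairs: list of dicts with keys 'head', 'body', 's_head', 's_body'.
    """
    ranked = sorted(
        pairs,
        key=lambda p: joint_score(p["s_head"], p["s_body"], head_weight),
        reverse=True,
    )
    kept = []
    for cand in ranked:
        clash = any(
            iou(cand["head"], k["head"]) > head_thresh
            or iou(cand["body"], k["body"]) > body_thresh
            for k in kept
        )
        if not clash:
            kept.append(cand)
    return kept


pairs = [
    {"head": (2, 0, 6, 4), "body": (0, 0, 10, 30), "s_head": 0.9, "s_body": 0.8},
    # Near-duplicate of the first pedestrian with lower confidence.
    {"head": (2.5, 0, 6.5, 4), "body": (0.5, 0, 10.5, 30), "s_head": 0.6, "s_body": 0.7},
    # A separate pedestrian far away.
    {"head": (42, 0, 46, 4), "body": (40, 0, 50, 30), "s_head": 0.7, "s_body": 0.5},
]
kept = joint_suppress(pairs)
```

Here the near-duplicate is removed because its head frame overlaps the stronger detection, while the distant pedestrian survives.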

As a preferred solution, constructing the joint loss function in the loss function part to further suppress false detections and missed detections arising in the post-processing stage, wherein the joint loss function penalizes false detections that non-maximum suppression fails to suppress and missed detections caused by erroneous non-maximum suppression so that the final detection result is more accurate, includes:

The joint loss function is expressed as:

(12)

Here, the four coefficients are weights that balance the loss terms. The two pull terms draw falsely detected head frames and overall frames closer so that they can be removed, and the two push terms push apart head frames and overall frames that would be missed because of erroneous non-maximum suppression, so that each head frame and overall frame corresponds correctly to its target frame;

The loss functions for the missed head frames and overall frames wrongly suppressed by non-maximum suppression are further expressed as:

(13)

(14)

The loss functions for the falsely detected head frames and overall frames that are drawn closer for removal are further expressed as:

(15)

(16)

Here, the two thresholds are the NMS thresholds for the head detection branch and the overall detection branch, respectively. After the redundant frames generated by the two branches are removed, the final pedestrian detection result is obtained.
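The structure of the joint loss can be sketched as a weighted sum of four terms. Formulas (12) to (16) are not reproduced in this text, so the term shapes below are assumptions: log-shaped penalties in which the pull term falls to zero as the overlap rises to the NMS threshold, and the push term falls to zero as the overlap falls to zero. The names `pull_term`, `push_term`, and `joint_loss` and the default weights are hypothetical.

```python
import math


def pull_term(overlap, nms_thresh, eps=1e-7):
    """Penalty for a false detection that NMS did not remove (assumed form).

    Such a frame overlaps the kept frame by less than nms_thresh; the penalty
    shrinks to 0 as the overlap grows to the threshold, at which point NMS
    would remove it.
    """
    capped = min(overlap, nms_thresh)
    return -math.log(max(1.0 - nms_thresh + capped, eps))


def push_term(overlap, eps=1e-7):
    """Penalty for a valid frame that NMS suppressed by mistake (assumed form).

    The penalty shrinks to 0 as the overlap with the kept frame falls.
    """
    return -math.log(max(1.0 - overlap, eps))


def joint_loss(pull_head, pull_body, push_head, push_body,
               weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four pull and push terms."""
    w_pull_h, w_pull_b, w_push_h, w_push_b = weights
    return (w_pull_h * pull_head + w_pull_b * pull_body
            + w_push_h * push_head + w_push_b * push_body)


# Example: head and overall branches with their own NMS thresholds.
loss = joint_loss(
    pull_head=pull_term(0.3, nms_thresh=0.5),
    pull_body=pull_term(0.4, nms_thresh=0.6),
    push_head=push_term(0.6),
    push_body=push_term(0.7),
)
```

Under these assumed shapes, minimizing the pull terms moves stray duplicates close enough to be removed, while minimizing the push terms separates frames belonging to different pedestrians so that neither is suppressed.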

In a second aspect, the present invention provides a multi-scale pedestrian detection device based on a combination of head and overall information, comprising:

A construction unit, configured to construct a Faster R-CNN network model;

An extraction unit, configured to fuse the backbone network of the Faster R-CNN network model with an improved feature extraction network, and to input an image to be detected into the Faster R-CNN model fused with the improved feature extraction network for feature extraction, to obtain an extracted feature map;

A generation unit, configured to improve the sampling mode of the region proposal network by constructing a non-uniform hard sample mining strategy based on occlusion overlap rate discrimination, to calculate the occlusion overlap rate of each sample in the sample set and give a higher weight to samples with a higher occlusion overlap rate, and to use the region proposal network to simultaneously generate a set of head candidate frames and overall candidate frames for all pedestrian instances in the scene;

A preliminary detection unit, configured to construct a pedestrian head detection branch module and a pedestrian overall detection branch module, and to obtain a preliminary target detection result through the two branch modules, wherein the preliminary target detection result comprises pedestrian head detection frames and pedestrian overall detection frames;

A post-processing unit, configured to post-process the obtained pedestrian head detection frames and pedestrian overall detection frames, and to screen out redundant detection frames generated during joint detection, to obtain a final pedestrian detection result;

An output unit, configured to construct a joint loss function in the loss function part to further suppress false detections and missed detections arising in the post-processing stage, wherein the joint loss function penalizes false detections that non-maximum suppression fails to suppress and missed detections caused by erroneous non-maximum suppression, so that the final detection result is more accurate.

As a preferred embodiment, the extraction unit is specifically configured to:

Taking a ResNet network as the backbone network of the Faster R-CNN network model;

(1)

(2)

As a preferred solution, the generating unit is specifically configured to:

(3)

The occlusion coefficient of each sample is further expressed as:

(4)

(5)

(6)

As a preferred solution, the post-processing unit is specifically configured to:

(7)

(8)

(9)

(10)

(11)

As a preferred solution, the output unit is specifically configured to:

The joint loss function is expressed as:

(12)

(13)

(14)

(15)

(16)

Compared with the prior art, the invention has the following beneficial effects:

The embodiment of the invention provides a multi-scale pedestrian detection method and device based on the combination of head and overall information. First, features at different levels are densely connected, that is, features of all scales take part in computing the final output features, so that each level of features covers both semantic information and detailed texture information and the network becomes more sensitive to multi-scale pedestrian targets. Second, the sampling mode of the region proposal network is optimized: the occlusion overlap rate of each sample in the sample set is calculated and samples with a higher occlusion overlap rate are given a higher weight, which strengthens the learning of severely occluded samples in the hard set during training and improves the model's ability to detect occluded pedestrian targets. Third, a joint detection framework for pedestrian heads and overall information is constructed, in which head detection assists pedestrian detection and reduces the adverse effect of body occlusion. Finally, the post-processing stage and the loss function module are optimized to weaken the interference of adjacent pedestrian targets with detection and to make the screening of redundant frames more intelligent and reasonable, so that the false detection frames generated by the two detection branches are removed without excessively suppressing detection frames in dense regions, which further reduces the missed detection rate and false detection rate. Compared with previous algorithms, the pedestrian detection algorithm provided by the invention therefore detects multi-scale and occluded pedestrian targets in complex, crowded scenes more effectively and can reduce the missed detection rate of pedestrian targets.

Drawings

Fig. 1 is a flow chart of a multi-scale pedestrian detection method based on the combination of head and overall information according to an embodiment of the present invention.

Fig. 2 is an overall logic schematic diagram of the multi-scale pedestrian detection method based on the combination of head and overall information according to an embodiment of the present invention.

Fig. 3 is a schematic structural diagram of the ResNet network in the multi-scale pedestrian detection method based on the combination of head and overall information according to an embodiment of the present invention.

Fig. 4 is a structure diagram of the feature extraction network in the multi-scale pedestrian detection method based on the combination of head and overall information according to an embodiment of the present invention.

Fig. 5 is a pseudocode flow chart of the joint post-processing algorithm in the multi-scale pedestrian detection method based on the combination of head and overall information according to an embodiment of the present invention.

Fig. 6 is a comparison of the overall performance of the multi-scale pedestrian detection method based on the combination of head and overall information with some mainstream detection algorithms on the CrowdHuman dataset according to an embodiment of the present invention.

Fig. 7 is a comparison of the overall performance of the multi-scale pedestrian detection method based on the combination of head and overall information with some mainstream detection algorithms on the CityPersons dataset according to an embodiment of the present invention.

Fig. 8a is a comparison of the performance of the multi-scale pedestrian detection method based on the combination of head and overall information with some mainstream detection algorithms on the TJU-Ped-campus subset according to an embodiment of the present invention.

Fig. 8b is a comparison of the performance of the multi-scale pedestrian detection method based on the combination of head and overall information with some mainstream detection algorithms on the TJU-Ped-traffic subset according to an embodiment of the present invention.

Fig. 9 is a block diagram of a multi-scale pedestrian detection device based on the combination of head and overall information according to an embodiment of the present invention.

Detailed Description

Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. In the following description, like modules are denoted by like reference numerals. In the case of the same reference numerals, their names and functions are also the same. Therefore, a detailed description thereof will not be repeated.

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not to be construed as limiting the invention.

Referring to fig. 1, an embodiment of the present invention provides a multi-scale pedestrian detection method based on the combination of head and overall information, including:

S101, constructing a Faster R-CNN network model;

S102, fusing the backbone network of the Faster R-CNN network model with an improved feature extraction network, and inputting an image to be detected into the Faster R-CNN model fused with the improved feature extraction network for feature extraction, to obtain an extracted feature map;

S103, improving the sampling mode of the region proposal network by constructing a non-uniform hard sample mining strategy based on occlusion overlap rate discrimination, calculating the occlusion overlap rate of each sample in the sample set, giving a higher weight to samples with a higher occlusion overlap rate, and using the region proposal network to simultaneously generate a set of head candidate frames and overall candidate frames for all pedestrian instances in the scene;

S104, constructing a pedestrian head detection branch module and a pedestrian overall detection branch module, and obtaining a preliminary target detection result through the two branch modules, wherein the preliminary target detection result comprises pedestrian head detection frames and pedestrian overall detection frames;

S105, post-processing the obtained pedestrian head detection frames and pedestrian overall detection frames, and screening out redundant detection frames generated during joint detection, to obtain a final pedestrian detection result;

S106, to further suppress false detections and missed detections arising in the post-processing stage, constructing a joint loss function in the loss function part, wherein the joint loss function penalizes false detections that non-maximum suppression fails to suppress and missed detections caused by erroneous non-maximum suppression, so that the final detection result is more accurate.

In S102, fusing the backbone network of the Faster R-CNN network model with the improved feature extraction network, and inputting the image to be detected into the Faster R-CNN model fused with the improved feature extraction network for feature extraction to obtain the extracted feature map, includes:

Taking a ResNet network as the backbone network of the Faster R-CNN network model;

(1)

(2)

Further, in S103, improving the sampling mode of the region proposal network by constructing a non-uniform hard sample mining strategy based on occlusion overlap rate discrimination, calculating the occlusion overlap rate of each sample in the sample set, giving a higher weight to samples with a higher occlusion overlap rate, and using the region proposal network to simultaneously generate a set of head candidate frames and overall candidate frames for all pedestrian instances in the scene, includes:

(3)

The occlusion coefficient of each sample is further expressed as:

(4)

(5)

(6)

Further, in S105, post-processing the obtained pedestrian head detection frames and pedestrian overall detection frames, and screening out redundant detection frames generated during joint detection to obtain the final pedestrian detection result, includes:

(7)

(8)

(9)

(10)

(11)

Further, in S106, constructing the joint loss function in the loss function part to further suppress false detections and missed detections arising in the post-processing stage, wherein the joint loss function penalizes false detections that non-maximum suppression fails to suppress and missed detections caused by erroneous non-maximum suppression so that the final detection result is more accurate, includes:

The joint loss function is expressed as:

(12)

(13)

(14)

(15)

(16)

The embodiment of the invention provides a multi-scale pedestrian detection method based on the combination of head and whole information. First, features of different levels are densely connected, so that features of all scales participate in computing the final output features and each level of features covers both semantic information and detailed texture information, which improves the sensitivity of the network to multi-scale pedestrian targets. Second, the sampling mode of the region proposal network is optimized: the occlusion overlap rate of each sample in the sample set is calculated and samples with a higher occlusion overlap rate are given a higher weight, which strengthens learning of the severely occluded samples in the difficult sample set during training and improves the model's ability to detect occluded pedestrian targets. Then, a joint detection framework for pedestrian head and whole-body information is constructed, in which head detection assists pedestrian detection and reduces the adverse effect of body occlusion on detection. Finally, the post-processing stage and the loss function module are optimized, which weakens the interference of adjacent pedestrian targets, makes the screening of redundant frames more reasonable, and removes the false detection frames generated by the two detection branches without excessively suppressing detection frames in dense areas, further reducing the missed detection rate and the false detection rate of pedestrian detection. Compared with previous algorithms, the pedestrian detection algorithm provided by the invention therefore has a stronger detection capability for multi-scale pedestrian targets and occluded pedestrian targets in complex, crowded scenes, and can reduce the missed detection rate of pedestrian targets.

Referring to FIGS. 2, 3, 4 and 5, the invention addresses the problem that existing target detection algorithms lose detection accuracy and suffer a high missed detection rate for small-scale pedestrian targets and occluded pedestrian targets in complex, dense scenes. To further facilitate understanding of the scheme, a multi-scale pedestrian detection method based on the combination of head and whole information according to one embodiment of the invention is described below. The overall method is shown in FIG. 2, and the specific steps include:

Step 1, constructing a Faster R-CNN network model;

Step 2, fusing the backbone network of the Faster R-CNN network model with the improved feature extraction network, and inputting the image to be detected into the Faster R-CNN network model fused with the improved feature extraction network for feature extraction to obtain an extracted feature map;

Step 2.1, taking the ResNet network as the backbone network of the Faster R-CNN network model, where the structure of the ResNet network is shown in FIG. 3;

Step 2.2, fusing the improved feature extraction network with the Faster R-CNN network model, where the overall structure of the improved feature extraction network is shown in FIG. 4. The specific implementation steps are as follows:

Step 2.2.1, using the ResNet backbone network to learn feature information of the image to be detected;

Step 2.2.2, performing feature fusion on the feature information acquired by the backbone network. The feature concatenation strategy of the feature pyramid network (Feature Pyramid Network, FPN) is improved by incorporating the idea of dense connection, so that features of all scales participate in computing the output feature information. This combines the strength of large-scale feature maps in expressing high-level semantic information with the strength of small-scale feature maps in expressing low-level texture and other detail information, and enhances the network's fusion of multi-scale features. The predicted values corresponding to the scale features of each layer are then obtained as shown in formula (1) and formula (2).

(1)

(2)

Here, the coefficients in formulas (1) and (2) are weight parameters, the mapping is the feature mapping function, and the inputs are the scale features of each layer acquired by the feature extraction network. During training, the weight parameters participate in gradient back-propagation and are updated to the most appropriate values as the model trains. In this way the model learns which connections help the detection module's predictions and which connections are wasted computation, which improves the performance of the feature extraction network.
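The densely connected, weighted fusion of step 2.2.2 can be illustrated with a small sketch. Formulas (1) and (2) are not reproduced in this text, so the sketch below only assumes one plausible form: every output level is a weighted sum of all input levels, resampled to the target size, with the weights normalized by a softmax. The function names `resize_nearest`, `softmax` and `dense_fuse`, the softmax normalization, and the use of 1-D toy feature vectors are all illustrative assumptions, not the patent's definition.

```python
import math


def resize_nearest(feature, size):
    """Resample a 1-D feature vector to `size` entries (nearest neighbour)."""
    n = len(feature)
    return [feature[min(n - 1, (i * n) // size)] for i in range(size)]


def softmax(logits):
    """Normalize raw weight logits so the weights of one output level sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense_fuse(features, logits):
    """Fuse every input level into every output level (assumed form).

    features: list of 1-D feature vectors, one per pyramid level.
    logits:   logits[k][j] is the weight logit of the connection from
              input level j to output level k. In a real network these
              would be trainable parameters updated by back-propagation.
    """
    fused = []
    for k, target in enumerate(features):
        weights = softmax(logits[k])
        out = [0.0] * len(target)
        for j, source in enumerate(features):
            resized = resize_nearest(source, len(target))
            for i, value in enumerate(resized):
                out[i] += weights[j] * value
        fused.append(out)
    return fused


if __name__ == "__main__":
    levels = [[1.0, 1.0, 1.0, 1.0], [2.0, 2.0], [4.0]]
    equal_logits = [[0.0, 0.0, 0.0] for _ in levels]
    print(dense_fuse(levels, equal_logits))
```

With equal logits every level contributes one third to each output; raising one logit lets that connection dominate, which is the kind of behaviour the learned weights are meant to provide.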

Step 3, improving the sampling mode of the region proposal network (Region Proposal Network, RPN) to construct a non-uniform difficult sample mining strategy based on occlusion overlap rate discrimination. The occlusion overlap rate of each sample in the sample set is calculated, and samples with a higher occlusion overlap rate are given a higher weight, which strengthens learning of the severely occluded samples in the difficult sample set during training and improves the model's ability to detect occluded pedestrian targets. The region proposal network is then used to simultaneously generate a set of head candidate frames and overall candidate frames for all pedestrian instances in the scene.

Step 3.1, the non-uniform difficult sample mining strategy based on occlusion overlap rate discrimination introduces a decision threshold. Based on the average occlusion overlap rate of the sample set, the samples are divided into two classes: a difficult set and an ordinary set. The probability of each sample being drawn is then defined as shown in formula (3).

(3)

Here, the occlusion coefficient of a sample reflects the degree to which that sample is occluded.

Step 3.2, the occlusion coefficient of each sample in step 3.1 is further expressed as:

(4)

Here, the two count parameters denote the number of samples required and the total number of candidate samples, respectively. The two rate terms denote the occlusion overlap rate of an individual sample in the sample set and the average occlusion overlap rate of the entire sample set, respectively. The threshold may be set to different values depending on the overall degree of occlusion in the dataset.

Step 3.3, the occlusion overlap rate of each sample in the sample set and the average occlusion overlap rate of the entire sample set in step 3.2 are further expressed as:

(5)

(6)
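The sampling strategy of step 3 can be sketched as follows. Formulas (3) to (6) are not reproduced in this text, so every concrete form here is an assumption made for illustration: a sample's occlusion overlap rate is taken as its maximum IoU with any other box, the difficult set is every sample whose rate exceeds `tau` times the mean rate, and difficult samples receive the weight `1 + hard_boost * rate` before normalization. The names `tau`, `hard_boost` and `draw_samples` are hypothetical.

```python
import random


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def occlusion_overlap_rates(boxes):
    """Assumed definition: maximum IoU of each box with any other box."""
    rates = []
    for i, box in enumerate(boxes):
        others = [iou(box, other) for j, other in enumerate(boxes) if j != i]
        rates.append(max(others) if others else 0.0)
    return rates


def sampling_probabilities(rates, tau=1.0, hard_boost=2.0):
    """Give samples in the difficult set a higher chance of being drawn."""
    mean_rate = sum(rates) / len(rates)
    weights = []
    for rate in rates:
        is_hard = rate > tau * mean_rate  # difficult set vs ordinary set
        weights.append(1.0 + hard_boost * rate if is_hard else 1.0)
    total = sum(weights)
    return [w / total for w in weights]


def draw_samples(probabilities, k, seed=0):
    """Draw k sample indices (with replacement) using the given probabilities."""
    rng = random.Random(seed)
    return rng.choices(range(len(probabilities)), weights=probabilities, k=k)


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (5, 0, 15, 10), (100, 100, 110, 110)]
    probs = sampling_probabilities(occlusion_overlap_rates(boxes))
    print(probs, draw_samples(probs, k=5))
```

In this toy scene the two overlapping boxes fall into the difficult set and are drawn more often than the isolated one, which mirrors the intent of weighting heavily occluded samples more strongly.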

Step 4, constructing a pedestrian head detection branch and a pedestrian overall detection branch, and obtaining a preliminary target detection result through the pedestrian head detection branch module and the pedestrian overall detection branch module.

Step 5, post-processing the obtained pedestrian head detection frames and pedestrian overall detection frames, and screening out the redundant detection results generated during joint detection. Pseudocode for the overall flow is shown in FIG. 5. The specific implementation steps are as follows:

Step 5.1, introducing a confidence-decreasing penalty factor for the head detection frames and for the pedestrian overall detection frames obtained by the joint detection framework. Gradually lowering the confidence scores of overlapping detection frames reduces their competitiveness without suppressing them excessively. The score of each pedestrian overall detection frame and the score of each head detection frame can be expressed as:

(7)

(8)

(9)

(10)

Here, formulas (7) to (10) refer to the overall frame and the head frame with the highest scores in the previous iteration, to the NMS thresholds corresponding to overall detection and head detection, respectively, and to a small fixed value set at initialization.

Step 5.2, computing a weighted sum of the confidence scores of the head frame and the overall frame obtained in step 5.1 to obtain a joint confidence score, which can be expressed as:

(11)

Here, the weighting coefficient represents the weight given to the head detection score. If the head overlap or the overall overlap is greater than a preset threshold, the detection frame with the lower confidence is suppressed.
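The post-processing of step 5 can be sketched as a joint, score-decaying variant of non-maximum suppression. Formulas (7) to (11) are not reproduced in this text, so the sketch assumes a linear decay (a score is multiplied by `1 - IoU` once the overlap exceeds the branch threshold) and a joint score of the form `lam * head_score + (1 - lam) * body_score`. The parameter names `nt_body`, `nt_head`, `lam` and `eps`, and the dictionary layout of a detection, are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def joint_soft_nms(detections, nt_body=0.5, nt_head=0.5, lam=0.5, eps=0.05):
    """Keep detections whose joint head/body score survives the decay.

    Each detection is a dict with keys "body", "head" (boxes) and
    "body_score", "head_score" (confidences). Inputs are not modified.
    """
    remaining = [dict(d) for d in detections]
    kept = []

    def joint_score(d):
        return lam * d["head_score"] + (1.0 - lam) * d["body_score"]

    while remaining:
        best_index = max(range(len(remaining)), key=lambda i: joint_score(remaining[i]))
        best = remaining.pop(best_index)
        if joint_score(best) < eps:
            break  # every remaining detection scores even lower
        kept.append(best)
        for d in remaining:
            body_overlap = iou(d["body"], best["body"])
            head_overlap = iou(d["head"], best["head"])
            if body_overlap > nt_body:
                d["body_score"] *= 1.0 - body_overlap  # decay instead of deleting
            if head_overlap > nt_head:
                d["head_score"] *= 1.0 - head_overlap
    return kept


if __name__ == "__main__":
    dets = [
        {"body": (0, 0, 10, 20), "head": (3, 0, 7, 4), "body_score": 0.9, "head_score": 0.9},
        {"body": (0, 0, 10, 19), "head": (3, 0, 7, 4), "body_score": 0.6, "head_score": 0.6},
        {"body": (50, 0, 60, 20), "head": (53, 0, 57, 4), "body_score": 0.8, "head_score": 0.8},
    ]
    for det in joint_soft_nms(dets):
        print(det["body"], det["body_score"], det["head_score"])
```

Because scores are decayed rather than zeroed, a neighbouring pedestrian whose body overlaps heavily but whose head is clearly separate keeps most of its head score and can still pass the joint threshold.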

Step 6, to address false detections and missed detections in the non-maximum suppression stage, the invention constructs a joint loss function in the loss function part. Its purpose is to penalize false detections that non-maximum suppression fails to suppress and missed detections that non-maximum suppression wrongly suppresses. The joint loss function is expressed as:

(12)

Here, the four coefficients are weights for balancing the losses. Two of the loss terms "pull" the falsely detected head frames and overall frames closer so that they can be eliminated, and the other two "push away" the missed head frames and overall frames that were wrongly suppressed by non-maximum suppression, so that they correctly correspond to their target frames.

Step 6.1, the two loss terms that "push away" the missed head frames and overall frames wrongly suppressed by non-maximum suppression can be further expressed as:

(13)

(14)

Step 6.2, the two loss terms that "pull" the falsely detected head frames and overall frames closer to facilitate their elimination can be further expressed as:

(15)

(16)

Here, the two thresholds are the NMS thresholds corresponding to the head detection branch and the overall detection branch, respectively. After the redundant frames generated by the head detection branch and the overall detection branch are removed, the final pedestrian detection result is obtained.
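The structure of the joint loss in step 6 (two "pull" terms, two "push" terms, four balancing weights) can be sketched as follows. Formulas (12) to (16) are not reproduced in this text, so the individual terms here are assumptions: the pull term is taken as `-ln(1 - Nt + IoU)` for a false detection whose IoU with the top-scoring frame is still below the NMS threshold `Nt`, and the push term as `-ln(1 - IoU)` for a frame of a different pedestrian whose IoU exceeds `Nt`. All function and parameter names are hypothetical.

```python
import math


def pull_loss(iou_value, nms_threshold, eps=1e-6):
    """Assumed pull term: positive while a false detection's IoU is below Nt."""
    if iou_value >= nms_threshold:
        return 0.0  # NMS already suppresses this frame
    return -math.log(max(eps, 1.0 - nms_threshold + iou_value))


def push_loss(iou_value, nms_threshold, eps=1e-6):
    """Assumed push term: positive while a valid frame's IoU is above Nt."""
    if iou_value <= nms_threshold:
        return 0.0  # NMS already keeps this frame
    return -math.log(max(eps, 1.0 - iou_value))


def _mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def joint_loss(pull_head_ious, pull_body_ious, push_head_ious, push_body_ious,
               nt_head=0.5, nt_body=0.5, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four terms for the head and overall branches."""
    terms = (
        _mean(pull_loss(v, nt_head) for v in pull_head_ious),
        _mean(pull_loss(v, nt_body) for v in pull_body_ious),
        _mean(push_loss(v, nt_head) for v in push_head_ious),
        _mean(push_loss(v, nt_body) for v in push_body_ious),
    )
    return sum(w * t for w, t in zip(weights, terms))


if __name__ == "__main__":
    print(joint_loss([0.1, 0.3], [0.2], [0.8], [0.7, 0.9]))
```

Under these assumed forms, minimizing the pull terms raises the overlap of duplicate frames until NMS removes them, while minimizing the push terms lowers the overlap between frames of different pedestrians until NMS keeps both.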

The technical scheme designed by the invention can be well applied to the fields of intelligent monitoring, auxiliary safe driving, large-scale place video monitoring, intelligent robots and the like.

The effectiveness of the invention can be demonstrated by the following experimental verification:

(I) Experimental conditions and experimental contents

1. The ImageNet dataset is used as the pre-training dataset for the pedestrian detection model, which is then trained separately on the CrowdHuman, CityPersons and TJU-DHD-pedestrian datasets. Several augmentation strategies are applied to the datasets during training, including random cropping, random horizontal flipping and random image brightness perturbation. Training uses a stochastic gradient descent optimizer for 35 epochs with the momentum factor set to 0.9. The learning rate follows a warm-up strategy: the initial learning rate is set to a low value, increases linearly from iteration 1 to iteration 800 of each epoch up to its target value, and then remains unchanged. At the 24th epoch the learning rate is reduced to 10% of its original value, and at the 28th epoch to 1%.

2. The overall performance of the method is evaluated using the average precision (Average Precision, AP) and the log-average miss rate (MR^-2) as evaluation metrics. The specific calculation of the average precision and the log-average miss rate can be expressed as follows:

Here, precision is the proportion of predicted pedestrian targets that are correct, and recall is the proportion of the pedestrian targets actually annotated in the image that are correctly detected. The miss rate is the proportion of annotated pedestrian targets that are not detected, and FPPI (false positives per image) is the average number of false detections in each image. A higher AP value and a lower MR^-2 value indicate better detection performance.

(II) Experimental results
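The evaluation quantities can be made concrete with a short sketch. The formulas themselves are not reproduced in this text, so the code follows the conventional definitions rather than the patent's own wording: precision is TP / (TP + FP), recall is TP / (TP + FN), and MR^-2 is the geometric mean of the miss rate sampled at nine FPPI values spaced evenly in log scale between 10^-2 and 10^0. The function names and the fallback miss rate of 1.0 for reference points below the first measured FPPI are assumptions in line with common benchmark practice.

```python
import math


def precision_recall(tp, fp, fn):
    """Precision and recall from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def log_average_miss_rate(fppi, miss_rate, num_points=9):
    """Log-average miss rate over FPPI in [1e-2, 1e0].

    fppi:      false positives per image, sorted in ascending order.
    miss_rate: miss rate (1 - recall) at each FPPI value.
    """
    refs = [10 ** (-2.0 + 2.0 * i / (num_points - 1)) for i in range(num_points)]
    sampled = []
    for ref in refs:
        value = 1.0  # assumed miss rate when no operating point lies at or below ref
        for f, m in zip(fppi, miss_rate):
            if f <= ref:
                value = m  # keep the last operating point with FPPI <= ref
            else:
                break
        sampled.append(max(value, 1e-10))  # avoid log(0)
    return math.exp(sum(math.log(v) for v in sampled) / len(sampled))


if __name__ == "__main__":
    print(precision_recall(tp=8, fp=2, fn=2))
    print(log_average_miss_rate([0.005, 0.05, 0.5], [0.6, 0.4, 0.2]))
```

A detector that misses fewer pedestrians at the same number of false positives per image lowers every sampled miss rate and therefore lowers MR^-2.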

As shown in FIG. 6, FIG. 7 and FIG. 8, on the CrowdHuman, CityPersons and TJU-DHD-pedestrian pedestrian detection datasets, the method achieves a higher average precision and a lower log-average miss rate than several popular pedestrian detection algorithms and the unimproved Faster R-CNN baseline. The experimental results therefore show that the method outperforms these algorithms and better detects multi-scale pedestrian targets and severely occluded pedestrian targets.

Compared with the prior art, the multi-scale pedestrian detection method provided by the invention has the following characteristics:

1. Compared with existing YOLO-series and R-CNN-series network models, the method densely connects features of different levels in the feature extraction stage, so that features of all scales participate in computing the final output features. Each resulting level of features therefore covers both semantic information and detailed texture information, which improves the network's ability to detect multi-scale pedestrian targets. From the perspective of information flow, the feature extraction network constructed by the invention learns rich and detailed multi-scale features: deep semantic information is fused into shallow features and shallow detail texture information is fused into deep features, which improves the sensitivity of the network to multi-scale pedestrian targets. From the perspective of parameter optimization, fusing multi-scale features with a dense connection strategy lets gradients back-propagate more efficiently during training, which speeds up convergence without excessively increasing network complexity and helps the model reach a high-quality solution.

2. Compared with existing R-CNN-series network models, the invention optimizes the sampling mode of the region proposal network. It calculates the occlusion overlap rate of each sample in the sample set and gives a higher weight to samples with a higher occlusion overlap rate, which strengthens learning of the severely occluded samples in the difficult sample set during training. As training proceeds, the ability to detect occluded pedestrian targets gradually improves, so the model performs better in complex, crowded scenes.

3. Compared with existing YOLO-series and R-CNN-series network models, the method places two detection branches, one for the head and one for the whole body, at the core of the model and performs joint detection. Head detection is thus fully used to assist pedestrian detection, which reduces the impact of body occlusion on detector performance.

4. Compared with existing YOLO-series and R-CNN-series network models, the method optimizes the post-processing stage and the loss function module. This weakens the interference of adjacent pedestrian targets on detection, makes the screening of redundant frames more reasonable, and removes the false detection frames generated by the two detection branches without excessively suppressing detection frames in dense areas, which further reduces the missed detection rate and the false detection rate of pedestrian detection.

Accordingly, as shown in FIG. 9, an embodiment of the invention provides a multi-scale pedestrian detection device based on the combination of head and overall information, including:

The building unit 901 is configured to construct a Faster R-CNN network model;

The extracting unit 902 is configured to fuse the backbone network of the Faster R-CNN network model with the improved feature extraction network, and to input the image to be detected into the Faster R-CNN model fused with the improved feature extraction network for feature extraction, so as to obtain an extracted feature map;

The generating unit 903 is configured to improve the sampling mode of the region proposal network, construct a non-uniform difficult sample mining strategy based on occlusion overlap rate discrimination, calculate the occlusion overlap rate of each sample in the sample set and give a higher weight to samples with a higher occlusion overlap rate, and use the region proposal network to simultaneously generate a set of head candidate frames and overall candidate frames for all pedestrian instances in the scene;

The primary detection unit 904 is configured to construct a pedestrian head detection branch module and a pedestrian overall detection branch module, and to obtain a preliminary target detection result through the two modules, where the preliminary target detection result includes pedestrian head detection frames and pedestrian overall detection frames;

The post-processing unit 905 is configured to post-process the obtained pedestrian head detection frames and pedestrian overall detection frames, and to screen out the redundant detection frames generated during joint detection, so as to obtain the final pedestrian detection result;

The output unit 906 is configured to construct a joint loss function in the loss function part to further suppress the false detections and missed detections arising in the post-processing stage, where the joint loss function penalizes false detections that non-maximum suppression fails to suppress and missed detections that non-maximum suppression wrongly suppresses, so that the final detection result is more accurate.

As a preferred solution, the extracting unit 902 is specifically configured to:

Taking the ResNet network as the backbone network of the Faster R-CNN network model;

(1)

(2)

As a preferable solution, the generating unit 903 is specifically configured to:

(3)

The occlusion coefficient of each sample is further expressed as:

(4)

(5)

(6)

As a preferred solution, the post-processing unit 905 is specifically configured to:

(7)

(8)

(9)

(10)

(11)

As a preferred solution, the output unit 906 is specifically configured to:

The joint loss function is specifically expressed as:

(12)

(13)

(14)

(15)

(16)

The embodiment of the invention provides a multi-scale pedestrian detection device based on the combination of head and whole information. First, features of different levels are densely connected, so that features of all scales participate in computing the final output features and each level of features covers both semantic information and detailed texture information, which improves the sensitivity of the network to multi-scale pedestrian targets. Second, the sampling mode of the region proposal network is optimized: the occlusion overlap rate of each sample in the sample set is calculated and samples with a higher occlusion overlap rate are given a higher weight, which strengthens learning of the severely occluded samples in the difficult sample set during training and improves the model's ability to detect occluded pedestrian targets. Then, a joint detection framework for pedestrian head and whole-body information is constructed, in which head detection assists pedestrian detection and reduces the adverse effect of body occlusion on detection. Finally, the post-processing stage and the loss function module are optimized, which weakens the interference of adjacent pedestrian targets, makes the screening of redundant frames more reasonable, and removes the false detection frames generated by the two detection branches without excessively suppressing detection frames in dense areas, further reducing the missed detection rate and the false detection rate of pedestrian detection. Compared with previous algorithms, the pedestrian detection algorithm provided by the invention therefore has a stronger detection capability for multi-scale pedestrian targets and occluded pedestrian targets in complex, crowded scenes, and can reduce the missed detection rate of pedestrian targets.

While embodiments of the present invention have been illustrated and described above, it will be appreciated that the above-described embodiments are illustrative and should not be construed as limiting the invention. Changes, modifications, substitutions and variations of the above-described embodiments may be made by those of ordinary skill in the art within the scope of the present invention.

The above embodiments of the present invention do not limit the scope of the present invention. Any other corresponding changes and modifications made in accordance with the technical idea of the present invention shall be included in the scope of the claims of the present invention.

Claims

1. A multi-scale pedestrian detection method based on the combination of head and overall information, characterized by comprising:

Build the Faster R-CNN network model;

The backbone network of the Faster R-CNN network model is integrated with the improved feature extraction network, and the image to be detected is input into the Faster R-CNN model integrated with the improved feature extraction network to extract features, thereby obtaining an extracted feature map;

The sampling method of the region proposal network is improved, and a non-uniform difficult sample mining strategy based on occlusion overlap rate discrimination is constructed. By calculating the occlusion overlap rate of each sample in the sample set and assigning a higher weight to samples with a higher occlusion overlap rate, the region proposal network is used to simultaneously generate head candidate boxes and overall candidate box sets for all pedestrian instances in the scene.

Constructing a pedestrian head detection branch module and a pedestrian overall detection branch module, and obtaining a preliminary target detection result through the pedestrian head detection branch module and the pedestrian overall detection branch module, wherein the preliminary target detection result includes a pedestrian head detection frame and a pedestrian overall detection frame;

Post-processing the obtained pedestrian head detection frame and the pedestrian overall detection frame, and filtering out redundant detection frames generated in the joint detection process to obtain a final pedestrian detection result;

In order to further suppress false detection and missed detection in the post-processing stage, a joint loss function is constructed in the loss function part. The joint loss function is used to punish false detections that are not suppressed by correct non-maxima and missed detections that are suppressed by incorrect non-maxima, so that the final detection result is more accurate.

2. The multi-scale pedestrian detection method based on the combination of head and overall information according to claim 1 is characterized in that the backbone network of the Faster R-CNN network model is fused with the improved feature extraction network, and the image to be detected is input into the Faster R-CNN model fused with the improved feature extraction network for feature extraction to obtain an extracted feature map, including:

Use the ResNet50 network as the backbone network of the Faster R-CNN network model;

Using the backbone network to learn feature information of the image to be detected;

The feature information obtained by the backbone network is fused, and the feature concatenation strategy of the feature pyramid network FPN is improved by incorporating the idea of dense connection, that is, features of all scales participate in computing the output feature information, combining the strength of large-scale feature maps in expressing high-level semantic information with the strength of small-scale feature maps in expressing low-level texture and other detail information, and enhancing the network's fusion of multi-scale features, thereby obtaining the predicted values corresponding to the scale features of each layer, as shown in formula (1) and formula (2);

(1)

(2)

wherein the coefficients are weight parameters, the mapping is the feature mapping function, and the inputs are the scale features of each layer obtained by the feature extraction network; during training, the weight parameters participate in gradient back-propagation and are updated to the most appropriate values through model training.

3. The multi-scale pedestrian detection method based on the joint head and overall information as claimed in claim 2 is characterized in that the sampling method of the region proposal network is improved, and a non-uniform difficult sample mining strategy based on occlusion overlap rate discrimination is constructed, by calculating the occlusion overlap rate of each sample in the sample set, and giving a higher weight to the sample with a higher occlusion overlap rate, and using the region proposal network to simultaneously generate a head candidate box and an overall candidate box set for all pedestrian instances in the scene, including:

The non-uniform difficult sample mining strategy based on occlusion overlap rate discrimination introduces a decision threshold; based on the average occlusion overlap rate, the sample set is divided into two categories, a difficult set and an ordinary set; the probability of each sample being drawn is defined as shown in formula (3);

(3)

wherein the occlusion coefficient of a sample reflects the degree to which that sample is occluded;

The occlusion coefficient of each sample is further expressed as:

(4)

wherein the two count parameters denote the number of samples required and the total number of candidate samples, respectively; the two rate terms denote the occlusion overlap rate of an individual sample and the average occlusion overlap rate of the entire sample set, respectively; and the decision threshold can be set to different values according to the overall occlusion level of the dataset;

The occlusion overlap rate of each sample in the sample set and the average occlusion overlap rate of the entire sample set are further expressed as:

(5)

(6).

4. The multi-scale pedestrian detection method based on joint head and overall information according to claim 3, characterized in that the obtained pedestrian head detection frame and the pedestrian overall detection frame are post-processed, and redundant detection frames generated in the joint detection process are filtered out to obtain the final pedestrian detection result, including:

For the head detection frames and the pedestrian overall detection frames obtained by the joint detection framework, a confidence-decreasing penalty factor is introduced for each. By gradually reducing the confidence scores of overlapping detection frames, their competitiveness is reduced without suppressing them excessively. The score of each pedestrian overall detection frame and the score of each head detection frame are expressed as:

(7)

(8)

(9)

(10)

wherein formulas (7) to (10) refer to the overall frame and the head frame with the highest scores in the previous iteration, to the NMS thresholds corresponding to overall detection and head detection, respectively, and to a small fixed value set at initialization;

The confidence scores of the head frame and the overall frame are summed with weights to obtain the joint confidence score, specifically expressed as:

(11)

wherein the weighting coefficient indicates the weight of the head detection score; if the head overlap or the overall overlap is greater than the preset threshold, the detection frame with the lower confidence is suppressed.

5. The multi-scale pedestrian detection method based on the joint head and overall information according to claim 4 is characterized in that constructing a joint loss function in the loss function part to further suppress false detections and missed detections in the post-processing stage, the joint loss function penalizing false detections that non-maximum suppression fails to suppress and missed detections that non-maximum suppression wrongly suppresses, so that the final detection result is more accurate, includes:

The joint loss function is specifically expressed as:

(12)

wherein the four coefficients are weights for balancing the losses; two of the loss terms pull the falsely detected head frames and overall frames closer so that they can be eliminated, and the other two push away the missed head frames and overall frames that were wrongly suppressed by non-maximum suppression, so that they correctly correspond to their target frames;

The two loss terms that push away the missed head frames and overall frames wrongly suppressed by non-maximum suppression are further expressed as:

(13)

(14)

The two loss terms that pull the falsely detected head frames and overall frames closer to facilitate their elimination are further expressed as:

(15)

(16)

wherein the two thresholds are the NMS thresholds corresponding to the head detection branch and the overall detection branch, respectively. After the redundant frames generated by the head detection branch and the overall detection branch are removed, the final pedestrian detection result is obtained.

6. A multi-scale pedestrian detection device based on the combination of head and overall information, characterized by comprising:

A construction unit, used to build a Faster R-CNN network model;

An extraction unit is used to fuse the backbone network of the Faster R-CNN network model with the improved feature extraction network, and input the image to be detected into the Faster R-CNN model fused with the improved feature extraction network to perform feature extraction, so as to obtain an extracted feature map;

A generation unit is used to improve the sampling method of the region proposal network, construct a non-uniform difficult sample mining strategy based on occlusion overlap rate discrimination, calculate the occlusion overlap rate of each sample in the sample set, and give a higher weight to the sample with a higher occlusion overlap rate, and use the region proposal network to simultaneously generate a head candidate box and a whole candidate box set for all pedestrian instances in the scene;

A preliminary detection unit, used to construct a pedestrian head detection branch module and a pedestrian overall detection branch module, and obtain preliminary target detection results through the pedestrian head detection branch module and the pedestrian overall detection branch module, wherein the preliminary target detection results include a pedestrian head detection frame and a pedestrian overall detection frame;

A post-processing unit, used to post-process the obtained pedestrian head detection frame and the pedestrian whole detection frame, and filter out redundant detection frames generated in the joint detection process to obtain a final pedestrian detection result;

An output unit, used to construct a joint loss function in the loss function part to further suppress false detections and missed detections in the post-processing stage, wherein the joint loss function penalizes false detections that non-maximum suppression fails to suppress and missed detections that non-maximum suppression wrongly suppresses, so that the final detection result is more accurate.

7. The multi-scale pedestrian detection device based on the combination of head and overall information according to claim 6, characterized in that the extraction unit is specifically used for:

(1)

(2)

8. The multi-scale pedestrian detection device according to claim 7, wherein the generating unit is specifically used for:

(3)

The occlusion coefficient of each sample is further expressed as:

(4)

(5)

(6).

9. The multi-scale pedestrian detection device based on the combination of head and overall information according to claim 8, characterized in that the post-processing unit is specifically used for:

(7)

(8)

(9)

(10)

(11)

10. The multi-scale pedestrian detection device based on the combination of head and overall information according to claim 9, characterized in that the output unit is specifically used for:

The joint loss function is specifically expressed as:

(12)

(13)

(14)

(15)

(16)