CN108875519B - Object detection method, device and system and storage medium
Abstract
Embodiments of the present invention provide an object detection method, apparatus, and system, and a storage medium. The method comprises the following steps: acquiring a video; and selecting a corresponding object detection model from at least two object detection models according to an object detection result of at least one previous video frame, and performing object detection on the current video frame using the selected object detection model. With the object detection method, apparatus, system, and storage medium, the most suitable object detection model can be continuously re-selected according to previous object detection results, striking a good balance between detection accuracy and hardware power consumption.
Description
Technical Field
The present invention relates to the field of image processing, and more particularly, to an object detection method, apparatus, and system, and a storage medium.
Background
A device that detects objects using a deep-learning neural network algorithm, such as a face-capture camera, must run a neural network model with a very large computational load to achieve high-accuracy object detection, which greatly increases the camera's power consumption. If a neural network model with a smaller computational load is adopted, power consumption is reduced, but detection accuracy drops correspondingly. In short, there is a tension between detection accuracy and hardware power consumption, and a scheme that achieves a better balance between the two is needed.
Disclosure of Invention
The present invention has been made in view of the above-described problems. The invention provides an object detection method, an object detection device, an object detection system and a storage medium.
According to an aspect of the present invention, there is provided an object detection method. The method comprises the following steps: acquiring a video; and selecting a corresponding object detection model from at least two object detection models according to the object detection result of at least one previous video frame, and performing object detection on the current video frame by using the selected object detection model.
Illustratively, the higher the detection difficulty level indicated by the object detection result of the at least one previous video frame, the higher the estimated detection accuracy of the selected object detection model.
Illustratively, at least two object detection models have different network structures and/or different computational accuracies from each other.
Illustratively, selecting a corresponding object detection model from the at least two object detection models based on the object detection result of the at least one previous video frame comprises: determining an object condition from object detection results of at least one previous video frame, wherein the object condition includes one or more of a number of objects, a size of objects, and a density of objects; and selecting a corresponding object detection model according to the object condition.
Illustratively, the at least two object detection models respectively correspond to at least two preset conditions, the at least two preset conditions respectively correspond to at least two detection difficulty levels, and selecting the corresponding object detection model according to the object condition includes: and judging whether the object condition accords with one of at least two preset conditions, and if the object condition accords with a specific preset condition in the at least two preset conditions, selecting an object detection model corresponding to the specific preset condition, wherein the detection difficulty level indicated by the object detection result of at least one previous video frame is the detection difficulty level corresponding to the specific preset condition.
Illustratively, the at least one previous video frame comprises a plurality of previous video frames, and determining the object condition from the object detection results of the at least one previous video frame includes one or more of: calculating a median or average of the numbers of objects in the previous video frames as the number of objects in the object condition; calculating a median or average of the object sizes of the same object across the previous video frames as the object size corresponding to that object in the object condition; and calculating an average distance between objects in the previous video frames as the object density in the object condition.
Illustratively, calculating the average distance between objects in the at least one previous video frame includes: for each pair of objects in the at least one previous video frame, calculating the distance between the two objects in each previous video frame containing both of them; for each such pair, calculating a median or average of the distances between the two objects across all previous video frames containing both of them, to obtain an object distance value for that pair; and calculating a median or average of all the obtained object distance values to obtain the average distance.
Illustratively, the at least two object detection models include a first object detection model and a second object detection model, the first object detection model has fewer network layers than the second object detection model and/or the first object detection model has lower computational accuracy than the second object detection model, and the initial object detection model is the first object detection model. Selecting the corresponding object detection model according to the object condition includes: if, in the object condition, the number of objects is greater than a preset number, or the object sizes of a first number of objects are smaller than a preset size, or the object density is within a preset density range, selecting the second object detection model; otherwise, selecting the first object detection model.
Illustratively, the step of selecting a corresponding object detection model from at least two object detection models according to an object detection result of at least one previous video frame, and performing object detection on the current video frame using the selected object detection model is performed in a case where the current video frame is a non-first video frame in the video; the object detection method further comprises the following steps: and in the case that the current video frame is the first video frame in the video, performing object detection on the current video frame by using an initial object detection model in at least two object detection models.
According to another aspect of the present invention, there is provided an object detection apparatus including: the video acquisition module is used for acquiring videos; and the object detection module is used for selecting a corresponding object detection model from at least two object detection models according to the object detection result of at least one previous video frame, and carrying out object detection on the current video frame by utilizing the selected object detection model.
Illustratively, the higher the detection difficulty level indicated by the object detection result of the at least one previous video frame, the higher the estimated detection accuracy of the selected object detection model.
Illustratively, at least two object detection models have different network structures and/or different computational accuracies from each other.
Illustratively, the object detection module includes: an object condition determining sub-module for determining an object condition according to an object detection result of at least one previous video frame in the case that the current video frame is a non-first video frame in the video, wherein the object condition includes one or more of the number of objects, the size of objects, and the density of objects; and a model selection sub-module for selecting a corresponding object detection model according to the object condition.
Illustratively, the at least two object detection models correspond to at least two preset conditions, respectively, the at least two preset conditions correspond to at least two detection difficulty levels, respectively, and the model selection submodule includes: the judging unit is used for judging whether the object condition accords with one of at least two preset conditions, and if the object condition accords with a specific preset condition in the at least two preset conditions, selecting an object detection model corresponding to the specific preset condition, wherein the detection difficulty level indicated by the object detection result of at least one previous video frame is the detection difficulty level corresponding to the specific preset condition.
Illustratively, the at least one previous video frame comprises a plurality of previous video frames, and the object condition determination submodule includes one or more of: a first calculation unit for calculating a median or average of the numbers of objects in the previous video frames as the number of objects in the object condition; a second calculation unit for calculating a median or average of the object sizes of the same object across the previous video frames as the object size corresponding to that object in the object condition; and a third calculation unit for calculating an average distance between objects in the previous video frames as the object density in the object condition.
Illustratively, the third calculation unit includes: a first calculation subunit for calculating, for each pair of objects in the at least one previous video frame, the distance between the two objects in each previous video frame containing both of them; a second calculation subunit for calculating, for each such pair, a median or average of the distances between the two objects across all previous video frames containing both of them, to obtain an object distance value for that pair; and a third calculation subunit for calculating a median or average of all the obtained object distance values to obtain the average distance.
Illustratively, the at least two object detection models include a first object detection model and a second object detection model, the first object detection model having fewer network layers than the second object detection model and/or the first object detection model having lower computational accuracy than the second object detection model, the initial object detection model being the first object detection model, wherein the model selection submodule includes: a selection unit for selecting the second object detection model if, in the object condition, the number of objects is greater than a preset number, or the object sizes of a first number of objects are smaller than a preset size, or the object density is within a preset density range, and otherwise selecting the first object detection model.
Illustratively, the object detection module includes: the first detection sub-module is used for selecting a corresponding object detection model from at least two object detection models according to an object detection result of at least one previous video frame under the condition that the current video frame is a non-first video frame in the video, and performing object detection on the current video frame by utilizing the selected object detection model; and the second detection sub-module is used for carrying out object detection on the current video frame by utilizing an initial object detection model in at least two object detection models under the condition that the current video frame is the first video frame in the video.
According to another aspect of the present invention there is provided an object detection system comprising a processor and a memory, wherein the memory has stored therein computer program instructions which, when executed by the processor, are adapted to carry out the steps of: acquiring a video; and selecting a corresponding object detection model from at least two object detection models according to the object detection result of at least one previous video frame, and performing object detection on the current video frame by using the selected object detection model.
Illustratively, the object detection system includes a camera including an image sensor for capturing video.
Illustratively, the higher the detection difficulty level indicated by the object detection result of the at least one previous video frame, the higher the estimated detection accuracy of the selected object detection model.
Illustratively, at least two object detection models have different network structures and/or different computational accuracies from each other.
Illustratively, the step of selecting a corresponding object detection model from the at least two object detection models according to the object detection result of the at least one previous video frame, which the computer program instructions are used to perform when executed by the processor, comprises: determining an object condition from the object detection results of the at least one previous video frame, wherein the object condition includes one or more of a number of objects, a size of objects, and a density of objects; and selecting a corresponding object detection model according to the object condition.
Illustratively, at least two object detection models respectively correspond to at least two preset conditions, at least two preset conditions respectively correspond to at least two detection difficulty levels, and the step of selecting the corresponding object detection model according to the object condition, which is executed by the processor, includes: and judging whether the object condition accords with one of at least two preset conditions, and if the object condition accords with a specific preset condition in the at least two preset conditions, selecting an object detection model corresponding to the specific preset condition, wherein the detection difficulty level indicated by the object detection result of at least one previous video frame is the detection difficulty level corresponding to the specific preset condition.
Illustratively, the at least one previous video frame comprises a plurality of previous video frames, and the step of determining the object condition from the object detection results of the at least one previous video frame, which the computer program instructions are used to perform when executed by the processor, comprises one or more of: calculating a median or average of the numbers of objects in the previous video frames as the number of objects in the object condition; calculating a median or average of the object sizes of the same object across the previous video frames as the object size corresponding to that object in the object condition; and calculating an average distance between objects in the previous video frames as the object density in the object condition.
Illustratively, the step of calculating the average distance between objects in the at least one previous video frame, which the computer program instructions are used to perform when executed by the processor, comprises: for each pair of objects in the at least one previous video frame, calculating the distance between the two objects in each previous video frame containing both of them; for each such pair, calculating a median or average of the distances between the two objects across all previous video frames containing both of them, to obtain an object distance value for that pair; and calculating a median or average of all the obtained object distance values to obtain the average distance.
Illustratively, the at least two object detection models comprise a first object detection model and a second object detection model, the first object detection model having fewer network layers than the second object detection model and/or the first object detection model having lower computational accuracy than the second object detection model, the initial object detection model being the first object detection model, wherein the step of selecting the corresponding object detection model according to the object condition, which the computer program instructions are used to perform when executed by the processor, comprises: if, in the object condition, the number of objects is greater than a preset number, or the object sizes of a first number of objects are smaller than a preset size, or the object density is within a preset density range, selecting the second object detection model; otherwise, selecting the first object detection model.
Illustratively, the computer program instructions, when executed by the processor, are configured to perform the steps of selecting a corresponding object detection model from at least two object detection models based on an object detection result of at least one previous video frame, and performing object detection on a current video frame using the selected object detection model, if the current video frame is a non-first video frame of the video; the computer program instructions, when executed by the processor, are further operable to perform the steps of: and in the case that the current video frame is the first video frame in the video, performing object detection on the current video frame by using an initial object detection model in at least two object detection models.
According to another aspect of the present invention, there is provided a storage medium having stored thereon program instructions, which when executed, are adapted to carry out the steps of: acquiring a video; and selecting a corresponding object detection model from at least two object detection models according to the object detection result of at least one previous video frame, and performing object detection on the current video frame by using the selected object detection model.
Illustratively, the higher the detection difficulty level indicated by the object detection result of the at least one previous video frame, the higher the estimated detection accuracy of the selected object detection model.
Illustratively, at least two object detection models have different network structures and/or different computational accuracies from each other.
Illustratively, the program instructions are operable, at run-time, to perform the step of selecting a corresponding object detection model from at least two object detection models based on object detection results of at least one previous video frame, comprising: determining an object condition from object detection results of at least one previous video frame, wherein the object condition includes one or more of a number of objects, a size of objects, and a density of objects; and selecting a corresponding object detection model according to the object condition.
Illustratively, the at least two object detection models respectively correspond to at least two preset conditions, the at least two preset conditions respectively correspond to at least two detection difficulty levels, and the step of selecting the corresponding object detection model according to the object condition, which the program instructions are used to perform when run, includes: judging whether the object condition accords with one of the at least two preset conditions, and if the object condition accords with a specific preset condition among the at least two preset conditions, selecting the object detection model corresponding to the specific preset condition, wherein the detection difficulty level indicated by the object detection result of the at least one previous video frame is the detection difficulty level corresponding to the specific preset condition.
Illustratively, the at least one previous video frame comprises a plurality of previous video frames, and the step of determining the object condition from the object detection results of the at least one previous video frame, which the program instructions are used to perform when run, comprises one or more of: calculating a median or average of the numbers of objects in the previous video frames as the number of objects in the object condition; calculating a median or average of the object sizes of the same object across the previous video frames as the object size corresponding to that object in the object condition; and calculating an average distance between objects in the previous video frames as the object density in the object condition.
Illustratively, the step of calculating the average distance between objects in the at least one previous video frame, which the program instructions are used to perform when run, comprises: for each pair of objects in the at least one previous video frame, calculating the distance between the two objects in each previous video frame containing both of them; for each such pair, calculating a median or average of the distances between the two objects across all previous video frames containing both of them, to obtain an object distance value for that pair; and calculating a median or average of all the obtained object distance values to obtain the average distance.
Illustratively, the at least two object detection models include a first object detection model and a second object detection model, the first object detection model having fewer network layers than the second object detection model and/or the first object detection model having lower computational accuracy than the second object detection model, the initial object detection model being the first object detection model, wherein the step of selecting the corresponding object detection model according to the object condition, which the program instructions are used to perform when run, comprises: if, in the object condition, the number of objects is greater than a preset number, or the object sizes of a first number of objects are smaller than a preset size, or the object density is within a preset density range, selecting the second object detection model; otherwise, selecting the first object detection model.
Illustratively, the step of selecting a corresponding object detection model from the at least two object detection models according to the object detection result of the at least one previous video frame and performing object detection on the current video frame using the selected object detection model, which the program instructions are used to perform when run, is performed in the case that the current video frame is a non-first video frame in the video; the program instructions, when run, are further used to perform the step of: in the case that the current video frame is the first video frame in the video, performing object detection on the current video frame using an initial object detection model of the at least two object detection models.
According to the object detection method, apparatus, and system and the storage medium of the embodiments of the present invention, the most suitable object detection model can be continuously re-selected according to previous object detection results, striking a good balance between detection accuracy and hardware power consumption.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent from the following more detailed description of embodiments of the present invention, as illustrated in the accompanying drawings. The accompanying drawings are included to provide a further understanding of embodiments of the invention, are incorporated in and constitute a part of this specification, serve together with the embodiments to explain the invention, and do not constitute a limitation of the invention. In the drawings, like reference numerals generally refer to like parts or steps.
FIG. 1 shows a schematic block diagram of an example electronic device for implementing object detection methods and apparatus in accordance with embodiments of the invention;
FIG. 2 shows a schematic flow chart of an object detection method according to one embodiment of the invention;
FIG. 3 shows a schematic block diagram of an object detection apparatus according to one embodiment of the invention; and
FIG. 4 shows a schematic block diagram of an object detection system according to an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, exemplary embodiments according to the present invention will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some, and not all, of the embodiments of the present invention, and it should be understood that the present invention is not limited by the example embodiments described herein. Based on the embodiments of the invention described in the present application, all other embodiments obtained by a person skilled in the art without inventive effort shall fall within the scope of the invention.
In order to solve the contradiction between the detection precision and the hardware power consumption, the embodiment of the invention provides an object detection method, an object detection device, an object detection system and a storage medium. The specific implementation is as follows: multiple object detection models are prepared in advance, and an object detection device (such as a camera) optimally selects one of the most suitable object detection models according to the previous detection results to detect the current video frame, so that the detection accuracy and the hardware power consumption are balanced better. The object detection method according to the embodiment of the invention can be applied to any field in which detection of certain specific objects is required, such as the field of security monitoring (pedestrian monitoring, vehicle monitoring and the like), the field of internet finance and the field of banking.
First, an example electronic device 100 for implementing the object detection method and apparatus according to an embodiment of the present invention is described with reference to fig. 1.
As shown in fig. 1, electronic device 100 includes one or more processors 102, one or more storage devices 104, an input device 106, an output device 108, and an image capture device 110, which are interconnected by a bus system 112 and/or other forms of connection mechanisms (not shown). It should be noted that the components and structures of the electronic device 100 shown in fig. 1 are exemplary only and not limiting, as the electronic device may have other components and structures as desired.
The processor 102 may be implemented in at least one hardware form of a digital signal processor (DSP), a field-programmable gate array (FPGA), or a programmable logic array (PLA); the processor 102 may be one of, or a combination of, a central processing unit (CPU), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), or another form of processing unit having data processing and/or instruction execution capabilities, and may control other components in the electronic device 100 to perform desired functions.
The storage 104 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and may be executed by the processor 102 to implement client functions and/or other desired functions in embodiments of the present invention as described below. Various applications and various data, such as data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
The input device 106 may be a device used by a user to input instructions and may include one or more of a keyboard, mouse, microphone, touch screen, and the like.
The output device 108 may output various information (e.g., images and/or sounds) to the outside (e.g., a user), and may include one or more of a display, a speaker, and the like. The output device 108 may also be a network communication interface.
The image acquisition device 110 may acquire images (including video frames) and store the acquired images in the storage device 104 for use by other components. The image acquisition device 110 may be an image sensor in a camera. It should be understood that the image capturing apparatus 110 is merely an example, and the electronic device 100 may not include the image capturing apparatus 110. In this case, the image to be processed may be acquired by other devices having image acquisition capability, and the acquired image may be transmitted to the electronic apparatus 100.
By way of example, example electronic devices for implementing the object detection methods and apparatus according to embodiments of the invention may be implemented on devices such as personal computers or remote servers.
Next, an object detection method according to an embodiment of the present invention will be described with reference to fig. 2. Fig. 2 shows a schematic flow chart of an object detection method 200 according to an embodiment of the invention. As shown in fig. 2, the object detection method 200 includes the following steps.
In step S210, a video is acquired.
The video may include a number of video frames. The video may be an original video acquired by an image acquisition device (e.g., an image sensor in a camera) or may be a video obtained after preprocessing (such as digitizing, normalizing, smoothing, etc.) the original video.
In one example, after the complete video is acquired, the following step S220 may be performed, i.e., object detection is performed for the current video frame. In another example, steps S210 and S220 may be performed synchronously, i.e., the video stream is acquired in real time and the object detection is performed on the current video frame in real time.
In step S220, a corresponding object detection model is selected from the at least two object detection models according to the object detection result of the at least one previous video frame, and the object detection is performed on the current video frame using the selected object detection model.
The object described herein may be any object including, but not limited to: a person or a part of a human body (such as a human face), an animal, a vehicle, a building, etc.
At least two object detection models may be prepared in an object detection device (e.g., a camera). In one example, the at least two object detection models are all neural networks. These object detection models have different network structures and/or different computational accuracies from each other. A different network structure means a different network complexity; for example, the number of network layers, the number of channels, and the connection patterns may differ between neural networks. A different computational accuracy means that a different bit-width arithmetic mode is used inside the neural network; for example, the internal convolutions of a neural network may use 32-bit floating-point operations, 16-bit floating-point operations, 16-bit fixed-point operations, 8-bit fixed-point operations, or even 4/2/1-bit fixed-point operations.
It will be appreciated that different network structures and/or different computational accuracies mean that different object detection models have different computational loads and different detection accuracies (i.e., accuracies of the detection results), and correspondingly different power consumption. In general, for the same computational accuracy, the more complex the network structure of an object detection model, the larger its computational load, the higher its detection accuracy, and the higher its power consumption. For the same network structure, the higher the computational accuracy of an object detection model, the larger its computational load, the higher its detection accuracy, and the higher its power consumption.
In addition to convolutional neural networks, the at least two object detection models may also include other types of detection models, such as support vector machines (SVMs). The at least two object detection models may all be the same type of detection model, or may be two or more different types of detection models; for example, they may include both a neural network and a support vector machine. Regardless of how each object detection model is implemented, the prepared at least two object detection models preferably differ from each other in accuracy and power consumption.
It should be noted that the power consumption of the object detection model may be related to many factors, including, but not limited to, the network structure and the computational accuracy described above.
Illustratively, the higher the detection difficulty level indicated by the object detection result of the at least one previous video frame, the higher the estimated detection accuracy of the selected object detection model.
It should be noted that the estimated detection accuracies of the at least two object detection models may be detection accuracies predicted from theory, or detection accuracies measured by testing the at least two object detection models on the same sample images. For example, suppose three neural networks are prepared: the first has 4 network layers in total (each layer comprising convolution, pooling, and the like; likewise below), with the low-level multiply-add operations using 2-bit x 2-bit computational accuracy; the second has 5 network layers, with the low-level multiply-add operations using 2-bit x 2-bit computational accuracy; and the third has 6 network layers, with the low-level multiply-add operations using 4-bit x 4-bit computational accuracy. The estimated detection accuracies can then be ordered from low to high as: first neural network, second neural network, third neural network.
It will be appreciated that the amount of computation, the accuracy of detection and the power consumption are substantially in a positive correlation, and if a relatively high accuracy of detection is desired, the amount of computation required tends to be large, with correspondingly large power consumption. Therefore, whether the current detection difficulty (represented by the detection difficulty level herein) is high can be judged according to the object detection result of the previous video frame, if the detection difficulty is not high, the object detection model with low detection precision and low power consumption can be selected for subsequent detection, and if the detection difficulty is relatively high, the object detection model with high detection precision and high power consumption can be selected for detection.
Illustratively, step S220 may be performed in the case where the current video frame is a non-first video frame in the video; the object detection method 200 may further include: and in the case that the current video frame is the first video frame in the video, performing object detection on the current video frame by using an initial object detection model in at least two object detection models.
Illustratively, the initial object detection model is the model with the lowest estimated detection accuracy among the at least two object detection models. That is, the first video frame may initially be detected using an object detection model with low network complexity and/or computational accuracy (which also has low estimated detection accuracy and low power consumption). Then, for each subsequent video frame, the object detection model best suited to the current situation may be selected according to previous object detection results (not limited to the immediately preceding frame; the object detection results of one or more previous frames may be combined), for example, how many faces are in the picture, the sizes of the faces, whether the faces are densely distributed, and so on. In this way, a better balance between detection performance and device power consumption can be achieved, as sketched below.
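The following is a minimal sketch of this per-frame selection flow, written in Python for illustration only. The detector interface with a detect() method and the caller-supplied choose_model() policy are assumptions, not part of the patent text.

```python
def detect_video(frames, models, choose_model):
    """models[0] is the initial model (lowest estimated accuracy and power)."""
    results = []
    current = models[0]                 # first video frame: initial model
    for i, frame in enumerate(frames):
        if i > 0:
            # re-select from the detection results of one or more
            # previous video frames (here: all results so far)
            current = choose_model(results, models)
        results.append(current.detect(frame))
    return results
```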
The object detection method according to the embodiment of the present invention can run multiple object detection model algorithms that differ in suitable scene, detection performance, and power consumption, and can continuously re-select the object detection model best suited to the current situation according to previous object detection results, so as to achieve a better balance between detection accuracy and hardware power consumption.
The object detection method according to the embodiment of the present invention may be implemented in an apparatus, device or system having a memory and a processor, for example.
The object detection method according to the embodiment of the invention can be deployed at an image acquisition end, for example, in the security application field, can be deployed at the image acquisition end of an access control system; in the field of financial applications, it may be deployed at personal terminals, such as smart phones, tablets, personal computers, etc.
Alternatively, the object detection method according to the embodiment of the present invention may be distributed at the server side and the personal terminal. For example, in the security application field, a video may be collected at an image collection end, where the image collection end transmits the collected video to a server end (or cloud end), and the server end (or cloud end) performs object detection.
According to an embodiment of the present invention, selecting a corresponding object detection model from at least two object detection models according to an object detection result of at least one previous video frame may include: determining an object condition from object detection results of at least one previous video frame, wherein the object condition includes one or more of a number of objects, a size of objects, and a density of objects; and selecting a corresponding object detection model according to the object condition.
The factors included in the object condition are merely examples and do not limit the invention; the object condition may include other factors related to the detection difficulty of the objects. Illustratively, the detection difficulty level may be divided according to the object condition, for example according to one or more of the number of objects, the object size, and the object density.
It will be appreciated that, other factors in the object condition being equal, the greater the number of objects, the higher the detection difficulty level. Other factors being equal, the smaller the object size, the higher the detection difficulty level. Other factors being equal, the greater the object density, the higher the detection difficulty level.
If multiple factors in the object condition need to be considered together, the detection difficulty level corresponding to each combined (i.e., preset) condition that the factors may satisfy can be set in advance. For example, if the number of objects is smaller than 3, the object sizes of all objects are larger than a preset size, and no two objects are closer to each other than a preset distance, the situation belongs to the lowest detection difficulty level, and the object detection model with the lowest estimated detection accuracy can be selected (see the sketch below). For the other detection difficulty levels, suitable preset conditions can likewise be defined; during actual detection, it is judged which preset condition the object condition meets, so as to determine which detection difficulty level applies.
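As an illustration only, the lowest-difficulty combined condition just described might be checked as follows; the (center_x, center_y, size) tuple layout and the threshold values min_size and min_distance are hypothetical placeholders, not values from the patent.

```python
import math
from itertools import combinations

def meets_lowest_difficulty(detections, min_size=64, min_distance=100.0):
    """detections: list of (center_x, center_y, size) for one frame's objects."""
    if len(detections) >= 3:
        return False                      # need fewer than 3 objects
    if any(size <= min_size for _, _, size in detections):
        return False                      # every object must exceed the preset size
    for (xa, ya, _), (xb, yb, _) in combinations(detections, 2):
        if math.hypot(xa - xb, ya - yb) < min_distance:
            return False                  # no two objects may be closer than the preset distance
    return True
```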
According to an embodiment of the present invention, at least two object detection models respectively correspond to at least two preset conditions, the at least two preset conditions respectively correspond to at least two detection difficulty levels, and selecting the corresponding object detection model according to the object condition includes: and judging whether the object condition accords with one of at least two preset conditions, and if the object condition accords with a specific preset condition in the at least two preset conditions, selecting an object detection model corresponding to the specific preset condition, wherein the detection difficulty level indicated by the object detection result of at least one previous video frame is the detection difficulty level corresponding to the specific preset condition.
As described above, the preset condition may be used as a judging condition, and which preset condition is met by the object condition is judged, so that it can be determined what level of detection difficulty the current detection difficulty reaches, and then the corresponding object detection model is selected.
The following description takes the case where the object condition includes only the number of objects as an example. Assume that the detection difficulty is divided into three levels, corresponding to three different object detection models. For example, if the number of objects detected in the at least one previous video frame is 1-10, the detection difficulty level is determined to be the first level; in this case a first object detection model may be selected, which has the fewest network layers, the smallest computational load, the lowest estimated detection accuracy, and the lowest power consumption. If the number of detected objects is 10-50, the detection difficulty level is determined to be the second level; in this case a second object detection model may be selected, with a medium number of network layers, medium computational load, medium estimated detection accuracy, and medium power consumption. If the number of detected objects is more than 50, the detection difficulty level is determined to be the third level; in this case a third object detection model may be selected, which has the most network layers, the largest computational load, the highest estimated detection accuracy, and the highest power consumption. A sketch of this mapping follows.
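A minimal sketch of the three-level mapping, assuming the models list is ordered from lightest to heaviest; since the patent's ranges share the boundary values 10 and 50, the boundary handling here is an assumption.

```python
def select_model_by_count(num_objects, models):
    """models: [first, second, third], ordered by estimated detection
    accuracy (and power consumption) from low to high."""
    if num_objects <= 10:
        return models[0]    # first difficulty level: lightest model
    if num_objects <= 50:
        return models[1]    # second difficulty level: medium model
    return models[2]        # third difficulty level: heaviest model
```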
According to an embodiment of the present invention, the at least two object detection models include a first object detection model and a second object detection model, the first object detection model has fewer network layers than the second object detection model and/or the first object detection model has lower computational accuracy than the second object detection model, and the initial object detection model is the first object detection model, wherein selecting the corresponding object detection model according to the object condition includes: if, in the object condition, the number of objects is greater than a preset number, or the object sizes of a first number of objects are smaller than a preset size, or the object density is within a preset density range, selecting the second object detection model; otherwise, selecting the first object detection model.
The following example assumes two neural network models are prepared in the device. The first is a full-resolution, full-depth neural network model: its detection resolution relative to the input image is 1:1, and its detection accuracy is high. However, this model requires a large amount of computation, and since power consumption is proportional to the amount of computation, its power consumption is also large. The second neural network model downsamples the input image to 1/4 of its original area, and its number of network layers (depth) is also smaller. This model yields detection results at a lower resolution (1/4 of the original image) and with lower detection accuracy (more false detections). However, since its computational load is about 1/8 that of the first model, its power consumption is reduced to roughly 1/8 as well. In addition, the neural network models can be further differentiated by computational accuracy. For example, a high-precision network may use 4-bit x 4-bit computational accuracy in its low-level multiply-add operations, while a low-precision network uses 2-bit x 2-bit. With the network structure otherwise identical, the power consumption of the low-precision network may drop to 1/4 that of the high-precision network (at some cost in the accuracy of the final detection results).
When these two neural network models are used for face detection, the first video frame can be detected with the second model, which has low detection accuracy and low power consumption. If more than 3 faces are detected, or the size of any detected face box is smaller than a preset threshold, the next video frame can instead be detected with the first model, which has high detection accuracy and high power consumption. A similar process is repeated for subsequent video frames until fewer than 3 faces are detected with the first model and all detected face boxes are larger than the preset threshold, at which point detection switches back to the low-precision, low-power second model. A sketch of this switching rule follows.
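A minimal sketch of that switching rule, assuming (x, y, w, h) face boxes and a hypothetical size threshold; note that with the patent's own numbers (switch up on more than 3 faces, switch back on fewer than 3), a count of exactly 3 keeps the current model.

```python
def next_face_model(current, face_boxes, low_model, high_model, size_thresh=32):
    """face_boxes: (x, y, w, h) boxes detected in the previous video frame."""
    sizes = [min(w, h) for (_, _, w, h) in face_boxes]
    if current is low_model:
        # switch up: too many faces, or any face box below the threshold
        if len(sizes) > 3 or any(s < size_thresh for s in sizes):
            return high_model
    elif len(sizes) < 3 and all(s > size_thresh for s in sizes):
        return low_model        # switch back once the scene is easy again
    return current
```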
In addition, whether to switch to the high-accuracy algorithm can be decided according to the density of faces in the image. For example, if two or three of the previously detected faces are relatively close to one another, the current face density can be considered relatively high, and the neural network model with high detection accuracy and high power consumption can be selected to detect faces in the current video frame.
In one example, the at least one previous video frame is a single previous video frame; that is, for every video frame other than the first, the current object condition is determined from the object detection result of the immediately preceding video frame, the current detection difficulty level is determined, and the object detection model corresponding to that difficulty level is selected.
In another example, the at least one previous video frame comprises a plurality of previous video frames. In this case, determining the object condition from the object detection results of the at least one previous video frame includes one or more of: calculating a median or average of the numbers of objects in the previous video frames as the number of objects in the object condition; calculating a median or average of the object sizes of the same object across the previous video frames as the object size corresponding to that object in the object condition; and calculating an average distance between objects in the previous video frames as the object density in the object condition.
The following description will be given by taking face detection as an example.
In the case where the object condition includes the number of objects, the median or average of the numbers of objects detected in the previous several video frames (e.g., 5 to 10 frames) may be taken as one of the criteria for model switching. Suppose the at least one previous video frame consists of 10 frames; some faces are detected in each frame, giving a face count per frame. A median or average of these 10 face counts may then be computed, and the result represents the number of faces in the face condition (which may be understood as the average number of faces over the 10 previous video frames).
In the case where the object condition includes the object size, the median or average of the sizes of the objects detected in the previous several video frames (e.g., 5 to 10 frames) may be taken as one of the criteria for model switching. Suppose the at least one previous video frame consists of 10 frames; some faces are detected in each frame, each with a corresponding face size. Different faces may appear in different video frames; for example, face A appears in frames 1 to 5 and face B appears in frames 4 to 9. For each face, its size can be collected from every previous video frame containing it: 5 face-size values for face A and 6 for face B. The median or average of face A's 5 values can then be computed, and the result represents the face size corresponding to face A in the face condition; the operation for face B is similar and is not repeated. Thus, in the finally determined object condition, every object appearing in the at least one previous video frame has a corresponding object size. Alternatively, the median or average may be taken once more over the sizes of all objects appearing in the previous video frames, yielding a single final value as the object size in the object condition.
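A minimal sketch of the count aggregation and the per-face size aggregation described in the last two paragraphs; the assumption that faces can be matched across frames by a tracked face ID is mine, as the patent does not specify how the same object is identified between frames.

```python
from statistics import median

def aggregated_count(per_frame_counts):
    # number of objects in the object condition: median of the
    # per-frame counts (the average would serve equally well)
    return median(per_frame_counts)

def aggregated_sizes(sizes_by_face):
    # sizes_by_face maps a tracked face ID to that face's size in each
    # previous frame containing it; yields one aggregated size per face
    return {face_id: median(sizes) for face_id, sizes in sizes_by_face.items()}
```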
In the case where the object condition includes the object density, the average distance between the objects detected in the previous several video frames (e.g., 5 to 10 frames) may be taken as one of the criteria for model switching.
Illustratively, calculating the average distance between objects in the at least one previous video frame may include: for each pair of objects in the at least one previous video frame, calculating the distance between the two objects in each previous video frame containing both of them; for each such pair, calculating a median or average of the distances between the two objects across all previous video frames containing both of them, to obtain an object distance value for that pair; and calculating a median or average of all the obtained object distance values to obtain the average distance.
Suppose the at least one previous video frame consists of 10 frames, with some faces detected in each frame. For example, face C and face D both appear in frames 2 to 7, and face D and face E both appear in frames 4 to 6. For face C and face D, their distance in each of frames 2 to 7 can be calculated, giving 6 distance values; the median or average of these 6 values is then taken as the face distance value for face C and face D. The operation for face D and face E is similar (the distances only need to be calculated for frames 4 to 6), yielding the face distance value for face D and face E. Performing the same operation for every pair of faces appearing in the 10 video frames produces a number of face distance values. Finally, the median or average of all the obtained face distance values can be computed, and the resulting value is the average distance.
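A minimal sketch of this three-step computation, again assuming hypothetical tracked face IDs with per-frame center coordinates:

```python
import math
from itertools import combinations
from statistics import median

def average_object_distance(tracks):
    """tracks: {face_id: {frame_index: (x, y)}} over the previous frames."""
    pair_values = []
    for id_a, id_b in combinations(tracks, 2):
        shared = tracks[id_a].keys() & tracks[id_b].keys()   # frames containing both faces
        dists = [math.dist(tracks[id_a][f], tracks[id_b][f]) for f in shared]
        if dists:
            pair_values.append(median(dists))                # distance value for this pair
    return median(pair_values) if pair_values else float("inf")
```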
The above-described manner of calculating the object density in the object situation is merely an example and not a limitation of the present invention, and the object density may have other expressions (not necessarily average distance) and calculation manners.
Integrating the object conditions over a plurality of video frames, and deciding whether to switch the object detection model based on the integrated result, prevents the algorithm from being switched too frequently due to a momentary change (for example, a person suddenly lowering their head) or an occasional missed detection of an object.
According to another aspect of the present invention, there is provided an object detection apparatus. Fig. 3 shows a schematic block diagram of an object detection apparatus 300 according to an embodiment of the invention.
As shown in fig. 3, the object detection apparatus 300 according to an embodiment of the present invention includes a video acquisition module 310 and an object detection module 320. The various modules may perform the various steps/functions of the object detection method described above in connection with fig. 2, respectively. Only the main functions of the respective components of the object detection apparatus 300 will be described below, and the details already described above will be omitted.
The video acquisition module 310 is configured to acquire video. The video acquisition module 310 may be implemented by the processor 102 in the electronic device shown in fig. 1 running program instructions stored in the storage 104.
The object detection module 320 is configured to select a corresponding object detection model from at least two object detection models according to an object detection result of at least one previous video frame, and perform object detection on a current video frame using the selected object detection model. The object detection module 320 may be implemented by the processor 102 in the electronic device shown in fig. 1 running program instructions stored in the storage 104.
Illustratively, the higher the detection difficulty level indicated by the object detection result of the at least one previous video frame, the higher the estimated detection accuracy of the selected object detection model.
Illustratively, at least two object detection models have different network structures and/or different computational accuracies from each other.
Illustratively, the object detection module 320 includes: an object condition determining sub-module for determining an object condition according to an object detection result of at least one previous video frame in the case that the current video frame is a non-first video frame in the video, wherein the object condition includes one or more of the number of objects, the size of objects, and the density of objects; and a model selection sub-module for selecting a corresponding object detection model according to the object condition.
Illustratively, the at least two object detection models respectively correspond to at least two preset conditions, the at least two preset conditions respectively correspond to at least two detection difficulty levels, and the model selection submodule includes: a judging unit for judging whether the object condition meets one of the at least two preset conditions and, if the object condition meets a specific preset condition of the at least two preset conditions, selecting the object detection model corresponding to the specific preset condition, wherein the detection difficulty level indicated by the object detection result of the at least one previous video frame is the detection difficulty level corresponding to the specific preset condition.
Illustratively, the number of the at least one previous video frame is plural, and the object condition determination submodule includes one or more of: a first calculation unit for calculating a median or average of the numbers of objects in the at least one previous video frame as the number of objects in the object condition; a second calculation unit for calculating a median or average of the object sizes of the same object in the at least one previous video frame as the object size corresponding to that object in the object condition; and a third calculation unit for calculating an average distance of objects in the at least one previous video frame as the object density in the object condition.
Illustratively, the third computing unit includes: a first calculation subunit for calculating, for each two objects in at least one previous video frame, a distance between the two objects in each previous video frame containing the two objects in the at least one previous video frame; a second calculation subunit for calculating, for each two objects in at least one previous video frame, a median or average value of distances between the two objects in all previous video frames containing the two objects in the at least one previous video frame, to obtain object distance values related to the two objects; and a third calculation subunit, configured to calculate a median or average value of all the obtained object distance values, so as to obtain an average distance.
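Complementing the distance sketch above, the first and second calculation units might be sketched as follows; representing each frame as `{object_id: (w, h)}` box sizes and treating box area as the object size are illustrative assumptions, not requirements of the apparatus:

```python
from collections import defaultdict
from statistics import median

def object_condition(frames):
    """frames: one dict {object_id: (w, h)} of detected box sizes per
    previous video frame. Returns the median object count over the
    frames and, for each tracked object, the median of its box areas
    over the frames that contain it."""
    sizes = defaultdict(list)
    for detections in frames:
        for obj_id, (w, h) in detections.items():
            sizes[obj_id].append(w * h)  # area stands in for object size
    return {
        "num_objects": median(len(f) for f in frames) if frames else 0,
        "object_sizes": {obj: median(a) for obj, a in sizes.items()},
    }
```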
Illustratively, the at least two object detection models include a first object detection model and a second object detection model, the first object detection model having fewer network layers than the second object detection model and/or a lower computational accuracy than the second object detection model, and the initial object detection model being the first object detection model, wherein the model selection submodule includes: a selection unit for selecting the second object detection model if the number of objects is greater than a preset number, or the object sizes of a first number of objects are smaller than a preset size, or the object density is within a preset density range, and otherwise selecting the first object detection model.
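A sketch of this selection rule follows; the preset number, preset size, preset density range, and first number used as defaults below are placeholder values chosen for illustration, not values given by the patent:

```python
def select_model(num_objects, object_sizes, avg_distance,
                 first_model, second_model,
                 preset_number=10, preset_size=32 * 32,
                 preset_density_range=(0.0, 50.0), first_number=3):
    """Choose the heavier second model when the scene looks hard to
    detect: many objects, at least `first_number` small objects, or a
    crowd-like average distance; otherwise keep the lighter model."""
    small = sum(1 for s in object_sizes if s < preset_size)
    low, high = preset_density_range
    if (num_objects > preset_number
            or small >= first_number
            or low <= avg_distance <= high):
        return second_model
    return first_model
```

With the "or" semantics above, any single sign of a hard scene escalates to the heavier model; requiring several signs at once would instead favour lower power consumption over detection accuracy.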
Illustratively, the object detection module 320 includes: the first detection sub-module is used for selecting a corresponding object detection model from at least two object detection models according to an object detection result of at least one previous video frame under the condition that the current video frame is a non-first video frame in the video, and performing object detection on the current video frame by utilizing the selected object detection model; and the second detection sub-module is used for carrying out object detection on the current video frame by utilizing an initial object detection model in at least two object detection models under the condition that the current video frame is the first video frame in the video.
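Tying the sketches together, a possible frame-by-frame driver is shown below; `run(model, frame)` is a hypothetical inference helper assumed to return `{object_id: (x, y, w, h)}` boxes, and the sliding window of 10 previous frames is likewise an assumption:

```python
def detect_video(frames, first_model, second_model, run, window=10):
    """First video frame: use the initial (first) model. Later frames:
    derive the object condition from up to `window` previous results
    and select a model before running detection."""
    history, outputs = [], []
    for index, frame in enumerate(frames):
        if index == 0:
            model = first_model  # initial object detection model
        else:
            recent = history[-window:]
            # adapt stored boxes to the inputs of the earlier sketches
            size_frames = [{k: (w, h) for k, (_, _, w, h) in f.items()}
                           for f in recent]
            centre_frames = [{k: (x + w / 2, y + h / 2)
                              for k, (x, y, w, h) in f.items()}
                             for f in recent]
            cond = object_condition(size_frames)
            model = select_model(cond["num_objects"],
                                 cond["object_sizes"].values(),
                                 average_distance(centre_frames),
                                 first_model, second_model)
        result = run(model, frame)
        history.append(result)
        outputs.append(result)
    return outputs
```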
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
Fig. 4 shows a schematic block diagram of an object detection system 400 according to an embodiment of the invention. The object detection system 400 includes an image acquisition device 410, a storage device 420, and a processor 430.
The image acquisition device 410 is used for acquiring video. The image acquisition device 410 is optional and the object detection system 400 may not include the image acquisition device 410. In this case, video may be acquired using other image acquisition apparatuses and the acquired video may be transmitted to the object detection system 400.
The storage means 420 stores computer program instructions for implementing the respective steps in the object detection method according to an embodiment of the invention.
The processor 430 is configured to run the computer program instructions stored in the storage device 420 to perform the respective steps of the object detection method according to an embodiment of the present invention, and to implement the video acquisition module 310 and the object detection module 320 of the object detection apparatus 300 according to an embodiment of the present invention.
In one embodiment, the computer program instructions, when executed by the processor 430, are configured to perform the steps of: acquiring a video; and selecting a corresponding object detection model from at least two object detection models according to the object detection result of at least one previous video frame, and performing object detection on the current video frame by using the selected object detection model.
Illustratively, the object detection system 400 includes a camera, and the camera includes an image sensor for capturing video; in this embodiment, the image acquisition device 410 is the image sensor.
Illustratively, the higher the detection difficulty level indicated by the object detection result of the at least one previous video frame, the higher the preset detection accuracy of the corresponding object detection model.
Illustratively, at least two object detection models have different network structures and/or different computational accuracies from each other.
Illustratively, the step of selecting a corresponding object detection model from at least two object detection models based on object detection results of at least one previous video frame, as executed by the processor 430, comprises: determining an object condition from object detection results of at least one previous video frame, wherein the object condition includes one or more of a number of objects, a size of objects, and a density of objects; and selecting a corresponding object detection model according to the object condition.
Illustratively, the at least two object detection models respectively correspond to at least two preset conditions, the at least two preset conditions respectively correspond to at least two detection difficulty levels, and the step, executed by the processor 430, of selecting the corresponding object detection model according to the object condition includes: judging whether the object condition meets one of the at least two preset conditions, and if the object condition meets a specific preset condition of the at least two preset conditions, selecting the object detection model corresponding to the specific preset condition, wherein the detection difficulty level indicated by the object detection result of the at least one previous video frame is the detection difficulty level corresponding to the specific preset condition.
Illustratively, the number of the at least one previous video frame is plural, and the step, executed by the processor 430, of determining the object condition according to the object detection result of the at least one previous video frame includes one or more of: calculating a median or average of the numbers of objects in the at least one previous video frame as the number of objects in the object condition; calculating a median or average of the object sizes of the same object in the at least one previous video frame as the object size corresponding to that object in the object condition; calculating an average distance of objects in the at least one previous video frame as the object density in the object condition.
Illustratively, the step, executed by the processor 430, of calculating the average distance of objects in the at least one previous video frame includes: for each two objects in the at least one previous video frame, calculating the distance between the two objects in each previous video frame that contains both; for each two objects, calculating the median or average of the distances between them over all previous video frames containing both, to obtain an object distance value for that pair; and calculating the median or average of all the object distance values obtained, to obtain the average distance.
Illustratively, the at least two object detection models include a first object detection model and a second object detection model, the first object detection model having fewer network layers than the second object detection model and/or a lower computational accuracy than the second object detection model, and the initial object detection model being the first object detection model, wherein the step, performed when the computer program instructions are executed by the processor 430, of selecting the corresponding object detection model according to the object condition includes: if the number of objects is greater than a preset number, or the object sizes of a first number of objects are smaller than a preset size, or the object density is within a preset density range, selecting the second object detection model; otherwise, selecting the first object detection model.
Illustratively, the step, performed when the computer program instructions are executed by the processor 430, of selecting a corresponding object detection model from at least two object detection models according to an object detection result of at least one previous video frame and performing object detection on the current video frame using the selected object detection model is executed in the case where the current video frame is a non-first video frame in the video; the computer program instructions, when executed by the processor 430, are further configured to perform the step of: performing object detection on the current video frame using an initial object detection model of the at least two object detection models in the case where the current video frame is the first video frame in the video.
Furthermore, according to an embodiment of the present invention, there is also provided a storage medium on which program instructions are stored, which program instructions, when being executed by a computer or a processor, are for performing the respective steps of the object detection method of the embodiment of the present invention, and for realizing the respective modules in the object detection device according to the embodiment of the present invention. The storage medium may include, for example, a memory card of a smart phone, a memory component of a tablet computer, a hard disk of a personal computer, read-only memory (ROM), erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), USB memory, or any combination of the foregoing storage media.
In one embodiment, the program instructions may cause a computer or processor to implement the respective functional modules of the object detection apparatus according to the embodiments of the present invention and/or may perform the object detection method according to the embodiments of the present invention when executed by the computer or processor.
In one embodiment, the program instructions, when executed, are configured to perform the steps of: acquiring a video; and selecting a corresponding object detection model from at least two object detection models according to the object detection result of at least one previous video frame, and performing object detection on the current video frame by using the selected object detection model.
Illustratively, the higher the detection difficulty level indicated by the object detection result of the at least one previous video frame, the higher the preset detection accuracy of the corresponding object detection model.
Illustratively, at least two object detection models have different network structures and/or different computational accuracies from each other.
Illustratively, the program instructions are operable, at run-time, to perform the step of selecting a corresponding object detection model from at least two object detection models based on object detection results of at least one previous video frame, comprising: determining an object condition from object detection results of at least one previous video frame, wherein the object condition includes one or more of a number of objects, a size of objects, and a density of objects; and selecting a corresponding object detection model according to the object condition.
Illustratively, the at least two object detection models respectively correspond to at least two preset conditions, the at least two preset conditions respectively correspond to at least two detection difficulty levels, and the step, which the program instructions are used to perform at run time, of selecting the corresponding object detection model according to the object condition includes: judging whether the object condition meets one of the at least two preset conditions, and if the object condition meets a specific preset condition of the at least two preset conditions, selecting the object detection model corresponding to the specific preset condition, wherein the detection difficulty level indicated by the object detection result of the at least one previous video frame is the detection difficulty level corresponding to the specific preset condition.
Illustratively, the number of the at least one previous video frame is plural, and the step, which the program instructions are used to perform at run time, of determining the object condition according to the object detection result of the at least one previous video frame includes one or more of: calculating a median or average of the numbers of objects in the at least one previous video frame as the number of objects in the object condition; calculating a median or average of the object sizes of the same object in the at least one previous video frame as the object size corresponding to that object in the object condition; calculating an average distance of objects in the at least one previous video frame as the object density in the object condition.
Illustratively, the step of calculating the average distance of objects in at least one previous video frame, which the program instructions are used to perform at run-time, comprises: for each two objects in at least one previous video frame, calculating a distance between the two objects in each previous video frame containing the two objects in the at least one previous video frame; for each two objects in at least one previous video frame, calculating a median or average value of distances between the two objects in all previous video frames containing the two objects in the at least one previous video frame to obtain object distance values related to the two objects; the median or average of all the object distance values obtained is calculated to obtain an average distance.
Illustratively, the at least two object detection models include a first object detection model and a second object detection model, the first object detection model having fewer network layers than the second object detection model and/or a lower computational accuracy than the second object detection model, and the initial object detection model being the first object detection model, wherein the step, which the program instructions are used to perform at run time, of selecting the corresponding object detection model according to the object condition includes: if the number of objects is greater than a preset number, or the object sizes of a first number of objects are smaller than a preset size, or the object density is within a preset density range, selecting the second object detection model; otherwise, selecting the first object detection model.
Illustratively, the step, which the program instructions are used to perform at run time, of selecting a corresponding object detection model from at least two object detection models according to an object detection result of at least one previous video frame and performing object detection on the current video frame using the selected object detection model is executed in the case where the current video frame is a non-first video frame in the video; the program instructions, when run, are further used to perform the step of: performing object detection on the current video frame using an initial object detection model of the at least two object detection models in the case where the current video frame is the first video frame in the video.
The modules in the object detection system according to the embodiment of the present invention may be implemented by a processor of an electronic device implementing object detection according to the embodiment of the present invention running computer program instructions stored in a memory, or may be implemented when computer instructions stored in a computer readable storage medium of a computer program product according to the embodiment of the present invention are run by a computer.
Although the illustrative embodiments have been described herein with reference to the accompanying drawings, it is to be understood that the above illustrative embodiments are merely illustrative and are not intended to limit the scope of the present invention thereto. Various changes and modifications may be made therein by one of ordinary skill in the art without departing from the scope and spirit of the invention. All such changes and modifications are intended to be included within the scope of the present invention as set forth in the appended claims.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, e.g., the division of the elements is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple elements or components may be combined or integrated into another device, or some features may be omitted or not performed.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in order to streamline the invention and aid in understanding one or more of the various inventive aspects, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the description of exemplary embodiments of the invention. However, the method of the present invention should not be construed as reflecting the following intent: i.e., the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
It will be understood by those skilled in the art that all of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be combined in any combination, except combinations where the features are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
Various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that some or all of the functions of some of the modules in an object detection device according to embodiments of the present invention may be implemented in practice using a microprocessor or Digital Signal Processor (DSP). The present invention can also be implemented as an apparatus program (e.g., a computer program and a computer program product) for performing a portion or all of the methods described herein. Such a program embodying the present invention may be stored on a computer readable medium, or may have the form of one or more signals. Such signals may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. do not denote any order. These words may be interpreted as names.
The foregoing description is merely illustrative of specific embodiments of the present invention and the scope of the present invention is not limited thereto, and any person skilled in the art can easily think about variations or substitutions within the scope of the present invention. The protection scope of the invention is subject to the protection scope of the claims.
Claims (11)
1. An object detection method, comprising:
acquiring a video;
selecting a corresponding object detection model from at least two object detection models according to an object detection result of at least one previous video frame, and performing object detection on the current video frame by using the selected object detection model;
wherein the higher the detection difficulty level indicated by the object detection result of the at least one previous video frame is, the higher the preset detection accuracy of the corresponding object detection model is;
wherein selecting a corresponding object detection model from at least two object detection models according to the object detection result of at least one previous video frame comprises:
determining an object condition according to an object detection result of the at least one previous video frame, wherein the object condition comprises one or more of the number of objects, the size of the objects and the density of the objects, and the detection difficulty level is divided according to the object condition; and
and selecting the corresponding object detection model according to the object condition.
2. The method of claim 1, wherein the at least two object detection models have different network structures and/or different computational accuracies from each other.
3. The method of claim 1, wherein the at least two object detection models correspond to at least two preset conditions, respectively, the at least two preset conditions correspond to at least two detection difficulty levels, respectively,
the selecting the corresponding object detection model according to the object condition includes:
judging whether the object condition accords with one of the at least two preset conditions, and if the object condition accords with a specific preset condition in the at least two preset conditions, selecting an object detection model corresponding to the specific preset condition, wherein the detection difficulty level indicated by the object detection result of the at least one previous video frame is the detection difficulty level corresponding to the specific preset condition.
4. The method of claim 1, wherein the number of the at least one previous video frame is a plurality, the determining an object condition from the object detection result of the at least one previous video frame comprising one or more of:
calculating a median or average of the number of objects of the at least one previous video frame as the number of objects in the object condition;
calculating a median or average value of object sizes of the same object in the at least one previous video frame as an object size corresponding to the same object in the object condition;
calculating an average distance of objects in the at least one previous video frame as the object density in the object condition.
5. The method of claim 4, wherein the calculating the average distance of objects in the at least one previous video frame comprises:
for each two objects in the at least one previous video frame,
calculating a distance between the two objects in each previous video frame containing the two objects in the at least one previous video frame;
calculating a median or average value of distances between the two objects in all previous video frames containing the two objects in the at least one previous video frame to obtain object distance values related to the two objects;
the median or average of all object distance values obtained is calculated to obtain the average distance.
6. The method of claim 1, wherein the at least two object detection models comprise a first object detection model and a second object detection model, the first object detection model having a smaller number of network layers than the second object detection model and/or the first object detection model having a lower computational accuracy than the second object detection model, wherein,
The selecting the corresponding object detection model according to the object condition includes:
if the number of objects is greater than a preset number, or the object sizes of a first number of objects are smaller than a preset size, or the object density is within a preset density range, selecting the second object detection model; otherwise, selecting the first object detection model.
7. The method of claim 1, wherein the selecting a corresponding object detection model from at least two object detection models according to an object detection result of at least one previous video frame and the performing object detection on the current video frame using the selected object detection model are performed in the case where the current video frame is a non-first video frame in the video;
the object detection method further includes: and under the condition that the current video frame is the first video frame in the video, performing object detection on the current video frame by utilizing an initial object detection model in the at least two object detection models.
8. An object detection apparatus comprising:
the video acquisition module is used for acquiring videos;
The object detection module is used for selecting a corresponding object detection model from at least two object detection models according to an object detection result of at least one previous video frame, and carrying out object detection on the current video frame by utilizing the selected object detection model;
wherein the higher the detection difficulty level indicated by the object detection result of the at least one previous video frame is, the higher the preset detection accuracy of the corresponding object detection model is;
wherein the object detection module comprises:
an object condition determining sub-module for determining an object condition according to an object detection result of the at least one previous video frame, wherein the object condition includes one or more of the number of objects, the size of objects, and the density of objects, and the detection difficulty level is divided according to the object condition; and
and the model selection sub-module is used for selecting the corresponding object detection model according to the object condition.
9. An object detection system comprising a processor and a memory, wherein the memory has stored therein computer program instructions which, when executed by the processor, are adapted to carry out the object detection method of any of claims 1-7.
10. The system of claim 9, wherein the object detection system comprises a camera comprising an image sensor for capturing the video.
11. A storage medium having stored thereon program instructions for performing the object detection method according to any of claims 1-7 when run.