CN120495616A - Intelligent UAV real-time spatial perception and target precision detection system - Google Patents
Intelligent UAV real-time spatial perception and target precision detection system
- Publication number
- CN120495616A (application CN202510396604.0A)
- Authority
- CN
- China
- Prior art keywords
- camera
- real-time
- unmanned aerial vehicle
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Image Analysis (AREA)
Abstract
In the intelligent unmanned aerial vehicle real-time spatial perception and target fine detection system, the spatial perception module of the unmanned aerial vehicle is perfected through an integrated three-dimensional visual RGB space extraction method, a content extraction module with real-time target fine positioning and fine recognition capability is added through an optimized target detection method, the two methods are fused to provide a basis for navigation and path planning, and an intelligent unmanned aerial vehicle system comprising both software and hardware is built. With the unmanned aerial vehicle and a Jetson TX2 as the carrying and development platform, visual real-time spatial perception modeling and target detection are combined for the unmanned aerial vehicle in a specific scene whose flight speed is limited by that scene; the spatial perception efficiency is high, the target detection is fast and accurate, and obstacle avoidance and path planning are successfully completed when the environment is unknown and the unmanned aerial vehicle must perform obstacle-avoidance flight and exploration.
Description
Technical Field
The application relates to an unmanned aerial vehicle space perception target detection system, in particular to an intelligent unmanned aerial vehicle real-time space perception and target precision detection system, and belongs to the technical field of unmanned aerial vehicle target detection.
Background
Along with the continuous increase in the application value of unmanned aerial vehicles, their performance also keeps improving; they no longer carry only a simple camera for recording video but have begun to carry special equipment such as depth cameras and laser radars to complete specific tasks. Current application directions include vegetation protection, street-view shooting, electric power inspection, post-disaster rescue and the like, but these are basically elementary applications. Obstacle avoidance and path planning have long been problems in unmanned aerial vehicle applications; they involve the spatial perception of the surrounding environment by the unmanned aerial vehicle and content extraction based on target recognition, and at present most solutions require the environment to be known, with the flight manually planned and controlled. Successfully completing obstacle avoidance and path planning when the environment is unknown and the unmanned aerial vehicle must perform obstacle-avoidance flight and exploration is still being explored, and the appearance of content extraction methods based on convolutional neural network target detection provides an opportunity for solving this problem. A real-time three-dimensional map provided to the unmanned aerial vehicle can help it position itself accurately, while a content extraction method based on target detection can identify targets in the scene and give their positions; combining the two provides meaningful reference data for obstacle avoidance and path planning, allowing the unmanned aerial vehicle to fly with no or only partial human intervention. More importantly, if both methods achieve real-time processing, an unmanned aerial vehicle-mounted intelligent processing system can be established by embedding the spatial scene perception method and the target-detection-based content extraction method in a microcomputer. In addition, realizing onboard real-time spatial perception and target-recognition-based content extraction is not limited to solving obstacle avoidance and path planning; the collected data can be used efficiently and extended to other application directions, which amounts to a deep exploitation of unmanned aerial vehicle applications and therefore has large application value.
In the aspect of spatial scene perception, with the improvement of the hardware performance of visual cameras, laser radar cameras and the like, scene perception can not only obtain reliable data sources but also realize information complementation through the cooperative work of multiple devices. Laser radar and real-time spatial perception modeling each have their own characteristics, and using either alone has certain limitations, while fusing them lets their advantages and disadvantages complement each other. For example, vision works relatively stably in dynamic environments with rich textures and can provide very accurate point cloud matching for the laser radar, while the relatively accurate direction and distance information provided by the laser radar can in turn help correct the point cloud images. In environments with dim light or an obvious lack of texture, the advantages of the laser radar can assist real-time spatial perception modeling in recording the scene from a small amount of information. In addition, neither laser radar systems nor real-time spatial perception modeling systems are structurally limited to a single solution; basically all solutions can be configured with auxiliary positioning tools such as inertial elements, satellite positioning systems and indoor base-station positioning systems to form a complementary setup, which has been the research trend in recent years, namely data fusion and cooperative work between the radar system and other sensors. Compared with earlier loose-coupling fusion methods based on Kalman filtering, the current trend is tight-coupling fusion based on nonlinear global optimization. For example, fusing real-time spatial perception modeling with an IMU (inertial measurement unit) enables real-time mutual calibration, so that the vision module maintains a certain positioning precision during sudden acceleration, deceleration or rotation, tracking loss is prevented, and positioning and map construction errors are greatly reduced.
In the aspect of content scene perception, namely target detection, the current trend is to balance accuracy and speed. The key is to start from target detection based on candidate frames; specifically, the method shares as much computation as possible among different ROIs, removes redundant computation, and efficiently reuses the features obtained by the CNN, thereby improving the speed of the whole detection. Meanwhile, candidate-frame target detection still produces a certain number of false detections and missed detections, and these two problems are the key problems to be solved in specific applications of target detection.
The problems to be solved by the unmanned aerial vehicle space perception target detection in the prior art and the key technical difficulties of the application include:
(1) Obstacle avoidance and path planning are problems in current unmanned aerial vehicle applications; they involve the spatial perception of the surrounding environment by the unmanned aerial vehicle and content extraction based on target identification, and current solutions mostly require the environment to be known, with the flight manually planned and controlled. The appearance of content extraction methods based on convolutional neural network target detection provides an opportunity to successfully complete obstacle avoidance and path planning when the environment is unknown and the unmanned aerial vehicle must perform obstacle-avoidance flight and exploration, but the methods are not mature enough: a real-time three-dimensional map cannot be provided, the unmanned aerial vehicle cannot be accurately positioned, and content extraction by target detection cannot fully identify the targets in the scene or give their positions. Combining the two could provide meaningful reference data for obstacle avoidance and path planning, but the combination has many problems and the technology is immature, so the unmanned aerial vehicle cannot fly with no or only partial human intervention. Moreover, the two methods cannot achieve real-time processing, so an unmanned aerial vehicle-mounted intelligent processing system embedding the spatial scene perception method and the target-detection-based content extraction method in a microcomputer cannot be established. The problems of obstacle avoidance and path planning cannot be effectively solved, which restricts the application of unmanned aerial vehicles.
(2) The prior art lacks an integrated three-dimensional visual RGB space extraction method to perfect the spatial perception module of the unmanned aerial vehicle, and lacks a target detection method to add a content extraction module with real-time target fine positioning and fine recognition capability; the two methods cannot be fused to provide a basis for navigation and path planning, and no intelligent unmanned aerial vehicle system including software and hardware has been established, so the unmanned aerial vehicle cannot complete real-time spatial perception and fine target detection. The prior art lacks a fast deep convolutional neural network feature extractor capable of learning and extracting the features of all targets of interest for specific target detection and identification, lacks a method to realize fast composition of specific scenes with monocular ORB visual real-time spatial perception modeling, and lacks an integrated target detection module to mark the targets of interest in the visual real-time spatial perception modeling map and give their specific position information, resulting in poor real-time spatial perception capability, low target detection accuracy, poor obstacle avoidance capability and poor application safety.
(3) The prior art lacks a sample library and a feature library of indoor targets of interest, cannot realize real-time detection and matching of targets with an end-to-end fast neural network, cannot compose in real time with a three-dimensional visual RGB space extraction system while detecting targets, and cannot mark the identified targets of interest in a simulation map or give their specific position information. It lacks a visual real-time spatial perception modeling method and therefore cannot provide a three-dimensional point cloud map, the spatial positioning of the unmanned aerial vehicle, or data support for navigation; it cannot provide deep-learning-based target detection content extraction or the spatial positions of the various targets, so the unmanned aerial vehicle cannot use the spatial relations for obstacle avoidance and path planning. A real-time perception and target detection system for the unmanned aerial vehicle, that is, an intelligent system including software and hardware suitable for real-time perception and target detection, is missing, so the problems of obstacle avoidance and path planning cannot be well solved and intelligent unmanned aerial vehicle flight cannot be realized.
Disclosure of Invention
The method comprises the steps of constructing a fast deep convolutional neural network feature extractor, learning and extracting the features of all targets of interest for specific target detection and identification, using monocular ORB visual real-time spatial perception modeling to realize fast composition of specific scenes, marking the targets of interest in the visual real-time spatial perception modeling map with an integrated target detection module and giving their specific position information while ensuring a certain accuracy, applying the machine vision target detection method to content-based scene perception, fusing it with the space-based scene perception method, establishing an intelligent unmanned aerial vehicle system, and, taking the unmanned aerial vehicle and a Jetson TX2 as the carrying and development platform, combining visual real-time spatial perception modeling and target detection for the unmanned aerial vehicle in specific scenes.
In order to achieve the technical effects, the technical scheme adopted by the application is as follows:
The intelligent unmanned aerial vehicle real-time space perception and target fine detection system is characterized in that a space perception module of the unmanned aerial vehicle is perfected through an integrated three-dimensional visual RGB space extraction method, a target detection method is optimized to add real-time target fine positioning and target fine recognition capability to the unmanned aerial vehicle to perfect a content extraction module, navigation and path planning basis are provided for the unmanned aerial vehicle through fusion of the two methods, and an intelligent unmanned aerial vehicle system comprising software and hardware is built;
The application establishes a sample library and an interesting target feature library of indoor interesting targets, realizes real-time detection and matching of targets by using an end-to-end based rapid neural network, performs real-time composition by using a three-dimensional visual RGB space extraction system while detecting the targets, marks the identified interesting targets in a simulation graph and gives specific position information, and the core method comprises the following steps:
(1) The method is based on a visual real-time space perception modeling method, which comprises the steps of carrying a depth camera on an unmanned aerial vehicle, providing an RGB acquisition chart and a depth acquisition chart for real-time space perception modeling, adopting real-time space perception modeling, namely three-dimensional visual RGB space extraction to finish the perception of a space scene, finally providing a three-dimensional point cloud map for the unmanned aerial vehicle, providing the space positioning of the unmanned aerial vehicle, and providing data support for unmanned aerial vehicle navigation;
(2) The method comprises the steps of extracting target detection content based on deep learning, optimizing real-time processing capacity of a target detection network based on the flight speed of an unmanned aerial vehicle, converting a target detection frame target into a three-dimensional space by combining a conversion relation obtained by scene perception, providing the spatial positions of various targets for the unmanned aerial vehicle, and utilizing the spatial relation unmanned aerial vehicle to perform obstacle avoidance and path planning;
(3) And establishing a real-time sensing and target detection system of the unmanned aerial vehicle, namely based on optimization of real-time scene sensing and real-time target detection, embedding the realization of the two modules into a microcomputer, and establishing an intelligent system comprising software and hardware and suitable for the real-time sensing and target detection of the unmanned aerial vehicle.
Preferably, in the target real-time fine detection network data format, each training data sample comprises a large acquisition graph containing a plurality of objects; for each object in the acquisition graph, the training label includes not only the class of the object but also the coordinates of each corner point of its bounding box. Because the number of objects differs between training acquisition graphs, label formats of different lengths and dimensions would make the loss function difficult to define; this problem is solved by introducing a fixed three-dimensional label format, and the defined format can accept an acquisition graph of any size containing any number of objects;
The acquisition graph is segmented with a regular grid whose cell size is slightly smaller than the smallest object to be detected. Each grid cell carries two pieces of key information: the class of the object and the corner coordinates of the bounding box of the object covering that cell. In addition, when a cell contains no object, a special custom class, the 'dontcare' class, is used so that the data representation keeps a uniform fixed size, and an object coverage value of 0 or 1 is also set to indicate whether the cell contains an object. When several objects fall in the same cell, the object occupying the most pixels in the cell is selected, and when objects overlap, the object whose bounding box has the minimum Y value is used.
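A minimal sketch of how such a fixed-size grid label could be encoded, assuming an illustrative cell size, class-id convention and array layout (none of these values are taken from the patent):

```python
import numpy as np

# Hypothetical encoder for the fixed-size grid label described above: each cell
# stores a 0/1 coverage flag, a class id ("dontcare" when empty) and the corner
# coordinates of the bounding box that covers it.
DONTCARE = 0  # assumed id for the special "dontcare" class

def encode_grid_labels(image_hw, boxes, classes, cell=16):
    """boxes: list of (x1, y1, x2, y2); classes: list of int class ids (> 0)."""
    H, W = image_hw
    gh, gw = H // cell, W // cell
    coverage = np.zeros((gh, gw), dtype=np.float32)         # 0/1 object coverage
    cls_map  = np.full((gh, gw), DONTCARE, dtype=np.int32)  # class per cell
    corners  = np.zeros((gh, gw, 4), dtype=np.float32)      # x1, y1, x2, y2 per cell
    best_px  = np.zeros((gh, gw), dtype=np.float32)         # pixels of current winner

    for (x1, y1, x2, y2), c in zip(boxes, classes):
        for gy in range(int(y1) // cell, min(int(y2) // cell + 1, gh)):
            for gx in range(int(x1) // cell, min(int(x2) // cell + 1, gw)):
                # pixel area of the box inside this cell (tie-break: most pixels,
                # then the box with the smaller top Y, as the text specifies)
                ix = max(0.0, min(x2, (gx + 1) * cell) - max(x1, gx * cell))
                iy = max(0.0, min(y2, (gy + 1) * cell) - max(y1, gy * cell))
                area = ix * iy
                if area > best_px[gy, gx] or (area == best_px[gy, gx] and
                                              coverage[gy, gx] and y1 < corners[gy, gx, 1]):
                    best_px[gy, gx] = area
                    coverage[gy, gx] = 1.0
                    cls_map[gy, gx] = c
                    corners[gy, gx] = (x1, y1, x2, y2)
    return coverage, cls_map, corners
```

Because the arrays always have shape (H/cell, W/cell, ...), labels for images with any number of objects keep the same dimensions, which is exactly what makes the loss straightforward to define.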
Preferably, the real-time accurate target detection network training is divided into three steps:
the first step, a data layer acquires a training acquisition graph and a label, and a conversion layer carries out online data enhancement;
The second step, the full convolution network performs feature extraction and prediction on the object class and the boundary frame of each grid;
Predicting the object category and the target boundary box of each grid respectively, and then simultaneously calculating errors of two prediction tasks by using a loss function;
the prediction process comprises two points: a clustering function generates the final set of boxes during verification, and a simplified mAP (mean average precision) value measures the performance of the model on the verification data set;
The network receives input acquisition graphs of different sizes and efficiently applies the CNN in a strided sliding-window manner, outputting a multi-dimensional array overlaid on the acquisition graph; GoogLeNet with the final pooling layer removed is used, giving the CNN a sliding-window receptive field of 555 × 555 pixels with a stride of 16 pixels;
A final optimized loss function is generated as a linear combination of two independent loss functions: over the training data samples, the sum of the squares of the differences between the true and predicted object coverage of all grid cells, and the mean absolute difference between the true and predicted bounding-box corner points of the object covered at each cell.
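A minimal sketch of the combined objective described above, assuming illustrative weights and array shapes (the patent does not give concrete values):

```python
import numpy as np

# Sketch of the combined loss: an L2 term on the per-cell object-coverage map
# plus an L1 term on the bounding-box corners of cells that contain an object.
def detection_loss(cov_true, cov_pred, box_true, box_pred, w_cov=1.0, w_box=1.0):
    # sum of squared differences between true and predicted coverage of all cells
    cov_loss = np.sum((cov_true - cov_pred) ** 2)
    # mean absolute difference of the box corners, only where an object is present
    mask = cov_true > 0.5
    box_loss = np.mean(np.abs(box_true[mask] - box_pred[mask])) if mask.any() else 0.0
    # final objective is a linear combination of the two independent losses
    return w_cov * cov_loss + w_box * box_loss
```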
Preferably, the flow of real-time spatial perception modeling comprises:
Reading sensor acquisition graph data, namely reading and preprocessing acquisition graph information of an unmanned aerial vehicle camera in real-time space perception modeling, wherein the data of a depth camera comprises an RGB acquisition graph and a depth graph corresponding to the RGB acquisition graph;
Modeling a visual odometer, namely calculating the attitude change of a camera and a local map by estimating the rotation and translation relation between every two adjacent acquisition graphs by the visual odometer, wherein the key of the step is feature point extraction and acquisition graph matching;
the back end adopts a nonlinear global optimization algorithm to optimize the camera positions and attitudes from the front end together with the loop-closure detection results from another thread, and corrects the globally consistent trajectory map and point cloud map;
Loop-closure detection judges whether the sensor, or the unmanned aerial vehicle carrying it, has previously passed through the current scene; if a place has been visited before, this information is provided to the back end to correct the position and attitude again;
and fourthly, constructing a cruise map of the unmanned aerial vehicle which meets the task requirements according to the estimated camera track.
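A schematic sketch of this flow; every module class below is a hypothetical placeholder used only to show how the steps chain together:

```python
# Schematic main loop of the modeling flow above (rgbd_camera, odometry, backend,
# loop_detector and mapper are hypothetical placeholders, not an existing API).
def run_realtime_mapping(rgbd_camera, odometry, backend, loop_detector, mapper):
    while rgbd_camera.is_open():
        rgb, depth = rgbd_camera.read()              # step 1: read RGB + depth frame
        pose, keyframe = odometry.track(rgb, depth)  # step 2: front-end visual odometry
        if keyframe is not None:
            loop = loop_detector.query(keyframe)     # loop closure: have we been here?
            backend.optimize(keyframe, loop)         # step 3: nonlinear global optimization
            mapper.insert(keyframe, pose)            # step 4: update the cruise/point-cloud map
    return mapper.export_map()
```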
Preferably, the real-time spatial perception modeling framework:
1) Visual odometer modeling
The front end is in charge of receiving a video stream of a camera, namely an acquisition graph sequence, estimating the motion of the camera between adjacent frames by a feature matching method, and preliminarily obtaining mileage information with certain error accumulation, wherein the visual odometer modeling comprises four parts:
firstly, collecting a frame, wherein the carried information comprises the pose of an unmanned aerial vehicle camera, an RGB (red green blue) collection chart and a depth chart when the frame collection chart is shot;
secondly, the camera model, which corresponds to the camera used in actual shooting and comprises only the intrinsic parameters;
Thirdly, the local map comprises key frames and landmark information points, wherein the key frames and the landmark information points conforming to the matching rule are added into the map, the map is only the local map but not the global map, and only the landmark information points near the current position are included, and the more distant landmark information points are deleted;
fourthly, the landmark information points are map points with known information in the map, wherein the known information included in the landmark information points is feature description corresponding to the landmark information points, and the obtaining mode is to apply a feature matching algorithm to extract the landmark information points in batches;
2) Backend global optimization
The back end analyzes and processes noise in the data over the whole process, using linear or nonlinear global optimization algorithms. Linear global optimization assumes that the acquisition graphs of successive frames in the shooting process are linearly related and uses a Kalman filtering algorithm for state estimation; if the linear relation holds only between a previous frame and the next frame, state estimation is completed with extended Kalman filtering. The difference between the observed value and the estimated value is computed, namely the error between a pixel coordinate and the pixel coordinate of the corresponding 3D point projected onto the two-dimensional plane through the camera pose. Linear global optimization assumes a causal relation between the camera pose error and the spatial point error: the camera pose is solved first, and the positions of the spatial points are then further computed from the camera pose;
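Under the standard pinhole model, the reprojection error described here can be written as (a standard formulation, not quoted from the patent):

```latex
e_{ij} = u_{ij} - \pi\!\left( K \left( R_i P_j + t_i \right) \right),
\qquad
\pi\!\left([x,\; y,\; z]^{\mathsf T}\right) = \left[ \tfrac{x}{z},\; \tfrac{y}{z} \right]^{\mathsf T}
```

where u_ij is the observed pixel of spatial point P_j in frame i, K is the camera intrinsic matrix and (R_i, t_i) the camera pose; the back end minimizes the sum of squared errors over all such observations.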
3) Loop closure detection
The key point is that a word bag model is established, the word bag model abstracts the features into words, the detection process is to match the words appearing in the two images to judge whether the two images describe the same scene, the features are classified into words, a dictionary comprising all possible word sets needs to be trained, massive data are needed to be established for training the dictionary, the dictionary is established as a clustering process, 1 hundred million features are assumed to be extracted from all the images, the K-means clustering method is used for gathering the features into hundred thousand words, a tree with K branches and depth d is constructed for the dictionary in the training process, coarse classification is provided for the upper node of the tree, fine classification is provided for the lower node of the tree, the tree extends to leaf nodes, the time complexity is reduced to logarithmic level by using the tree, and the feature matching speed is accelerated;
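A minimal sketch of such a k-branch, depth-d vocabulary tree built by recursive K-means; the branching factor, depth and use of scikit-learn are illustrative assumptions, and the descriptors are treated as plain float vectors:

```python
import numpy as np
from sklearn.cluster import KMeans

# Sketch of a bag-of-words dictionary built as a k-branch tree of depth d by
# recursive K-means over an (N, D) array of feature descriptors.
def build_vocab_tree(descriptors, k=10, depth=3):
    if depth == 0 or len(descriptors) <= k:
        return {"words": descriptors}                 # leaf node: the "words"
    km = KMeans(n_clusters=k, n_init=4).fit(descriptors)
    children = []
    for c in range(k):
        sub = descriptors[km.labels_ == c]            # coarse split at this level...
        children.append(build_vocab_tree(sub, k, depth - 1))  # ...finer split below
    return {"centers": km.cluster_centers_, "children": children}

def quantize(tree, desc):
    """Descend the tree in O(k * depth) steps instead of scanning every word."""
    path = []
    while "centers" in tree:
        c = int(np.argmin(np.linalg.norm(tree["centers"] - desc, axis=1)))
        path.append(c)
        tree = tree["children"][c]
    return tuple(path)                                # word id = path to the leaf
```

Two images can then be compared by the words their descriptors quantize to, which is what turns loop-closure detection into a logarithmic-time lookup rather than exhaustive feature matching.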
4) Mapping
After the camera pose has been optimized and corrected, the data collected by the camera are used to convert the two-dimensional plane points into the three-dimensional space, forming three-dimensional point cloud information. Besides the point cloud map, the optimization process of the camera pose is displayed in the g2o tool as a pose graph, and the map can be defined and described according to the specific situation.
Preferably, the three-dimensional visual RGB space extraction method comprises the following steps:
The sensor is a monocular camera that also acquires depth information (a depth camera), and the data source comprises an RGB acquisition graph and a depth map.
The depth camera provides a color acquisition graph and a depth acquisition graph, and a geometric model transfers the 2D plane data into 3D space. The origin of the pixel coordinate system and the origin of the acquisition-graph coordinate system differ only by an offset, and the axes of the acquisition-graph coordinate system and the camera coordinate system are parallel and differ only by a scaling, so the conversion from pixel coordinates to the camera coordinate system can be written out directly,
where u, v denote the pixel coordinates, the coordinate-system origins differ from the center of the acquisition plane only by an offset, d_x, d_y are the scaling factors between pixel coordinates and the physical imaging plane (d_x = Z_c/f_x, d_y = Z_c/f_y), and f_x, f_y are the focal lengths of the camera along the x and y axes; the relation can also be written in matrix form.
During camera motion the camera coordinate system and the world coordinate system are not parallel and are related by a rotation and a translation; when the visual odometer is computed later, the relation between the previous and next frames has the same form, and the corresponding matrix relations are given below.
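The conversion relations referred to above, written out under the standard pinhole camera model (a reconstruction of the standard equations, not the patent's own display formulas):

```latex
% Pixel <-> camera coordinates (pinhole model, depth Z_c from the depth map):
Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
  = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}
    \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix}
\quad\Longleftrightarrow\quad
X_c = \frac{(u - c_x)\, Z_c}{f_x}, \qquad
Y_c = \frac{(v - c_y)\, Z_c}{f_y}

% Camera <-> world (and, for the visual odometer, previous <-> next frame):
P_c = R\, P_w + t
```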
Converting the points of the two-dimensional plane into the three-dimensional space finally yields a series of point cloud data; assigning the RGB color attributes gives a preliminary colored three-dimensional map.
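A minimal sketch of this back-projection step; the depth scale and the assumption that the depth map is pixel-aligned with the RGB image are illustrative:

```python
import numpy as np

# Turn one RGB image plus an aligned depth map into a colored point cloud.
# fx, fy, cx, cy are camera intrinsics; depth_scale (metres per raw unit) is
# an assumption and depends on the sensor.
def rgbd_to_pointcloud(rgb, depth, fx, fy, cx, cy, depth_scale=0.001):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.astype(np.float32) * depth_scale
    valid = z > 0                                  # skip pixels with no depth reading
    x = (u - cx) * z / fx                          # pixel -> camera coordinates
    y = (v - cy) * z / fy
    points = np.stack([x[valid], y[valid], z[valid]], axis=1)
    colors = rgb[valid]                            # attach the RGB attribute per point
    return points, colors
```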
Preferably, three-dimensional visual RGB space extraction is implemented:
(1) Front-end visual odometer
Initialization starts the search for key frames with the first frame acquisition graph as the reference. Matching between every two acquisition graphs uses the ORB algorithm to extract key points, then computes a BRIEF descriptor for each key point, and finally performs fast matching with a fast approximate nearest-neighbour algorithm. The ORB corner extraction algorithm adds scale and rotation descriptions to the FAST corner extraction algorithm, adding feature information, so the feature description is richer, the matching precision is high, and the composition is more accurate and reliable; the BRIEF descriptor is a binary descriptor built by comparing randomly selected point pairs;
After matching is finished, 2D points are projected to a 3D space according to the depth acquisition graph, 2D coordinates and corresponding 3D coordinates of a series of points are obtained, the position information of a camera is estimated by solving a PnP problem, the actual calculation result is a rotation and translation matrix between the front frame acquisition graph and the rear frame acquisition graph, all data are matched in sequence in pairs, and the pose of the camera is calculated, so that a complete visual odometer is finally obtained;
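A minimal sketch of one front-end step using OpenCV; a brute-force Hamming matcher stands in for the fast approximate nearest-neighbour search mentioned above, and the intrinsic matrix K, the aligned depth map and the depth scale are assumed inputs:

```python
import cv2
import numpy as np

# ORB + BRIEF matching between two frames, then pose recovery by solving PnP.
def estimate_relative_pose(img1, img2, depth1, K, depth_scale=0.001):
    orb = cv2.ORB_create(1000)                      # FAST corners + BRIEF descriptors
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)

    obj_pts, img_pts = [], []
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    for m in matches:
        u, v = kp1[m.queryIdx].pt
        z = depth1[int(v), int(u)] * depth_scale
        if z <= 0:
            continue                                # no depth -> cannot lift to 3D
        obj_pts.append([(u - cx) * z / fx, (v - cy) * z / fy, z])
        img_pts.append(kp2[m.trainIdx].pt)

    ok, rvec, tvec, _ = cv2.solvePnPRansac(
        np.array(obj_pts, np.float32), np.array(img_pts, np.float32), K, None)
    R, _ = cv2.Rodrigues(rvec)                      # rotation matrix + translation
    return ok, R, tvec
```

Chaining the recovered (R, t) pairs over consecutive frames gives the visual odometer trajectory described in the text.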
(2) Backend nonlinear global optimization
In three-dimensional visual RGB space extraction, the pose computed by the visual odometer is expressed through a pose graph consisting of nodes and edges: the nodes represent the individual camera poses and the edges represent the transformations between camera poses. The pose graph not only describes the visual odometer intuitively but also makes the change of the camera pose easy to understand. The nonlinear global optimization is expressed as graph optimization; since the same scene cannot appear at many positions, the pose graph is sparse, and a sparse BA algorithm is used to solve the pose graph and correct the camera poses.
The unmanned aerial vehicle hardware system comprises an onboard computer, an onboard module assembly, a camera and gimbal, and an M100 quad-rotor unmanned aerial vehicle. The M100 quad-rotor unmanned aerial vehicle provides the flight platform and realizes the basic flight functions, the camera and gimbal are the acquisition-graph acquisition components of the onboard hardware system, and the onboard module assembly is the hardware assembly located between the camera gimbal and the M100 quad-rotor unmanned aerial vehicle: 1) it sends the video data collected by the camera to the onboard computer; 2) the onboard computer controls the gimbal through the onboard module assembly; 3) the onboard computer realizes flight control of the M100 quad-rotor unmanned aerial vehicle through the onboard module assembly; 4) the video acquisition-graph data of the camera can be input to the image transmission system of the M100 quad-rotor unmanned aerial vehicle through the onboard module assembly; 5) voltage conversion: the onboard module assembly converts the 24 V voltage obtained from the battery of the M100 unmanned aerial vehicle into 12 V to supply power to the gimbal and the onboard computer;
(1) Airborne computer
The onboard computer adopts an NVIDIA Jetson TX2 RTS-ASG003 microcomputer with a total weight of 170 g;
(2) Airborne module assembly
The onboard module assembly is the intermediate execution and processing unit of the whole onboard hardware system: 1) the video data output by the high-definition camera are split into two channels by an HDMI distributor, one channel entering the video capture device, which outputs it to the onboard computer, and the other being output to the wireless image transmission system of the M100 unmanned aerial vehicle through the N1 encoder; 2) the USB-to-UART and PWM module lets the onboard computer control the flight of the M100 unmanned aerial vehicle and the gimbal; 3) the vision sensor realizes autonomous obstacle avoidance of the M100 unmanned aerial vehicle; 4) the RC receiver receives the control signals of the ground remote controller to control the gimbal; 5) the wireless data transmission module provides a low-bandwidth data link between the onboard hardware system and the ground system; and 6) power is supplied to all components of the onboard computer, the gimbal and the onboard module assembly;
The video acquisition part in the airborne module assembly consists of an HDMI distributor, a video acquisition device and an N1 encoder, wherein the HDMI distributor divides a video stream from a high-definition camera into two paths, and the two paths of video streams respectively enter the video acquisition device and the N1 encoder through HDMI interfaces;
The onboard wireless data transmission module and the ground-side wireless data transmission module realize a low-bandwidth wireless data link, providing a bidirectional data path between the sky side and the ground side. A relatively independent vision sensing system is built with the Guidance, which provides five groups of combined vision-ultrasonic sensors, monitors environmental information in multiple directions in real time, senses obstacles and cooperates with the unmanned aerial vehicle flight controller so that the aircraft can avoid possible collisions in time during high-speed flight. The RC wireless receiver R7008SB receives the control signals of the ground remote controller, processes them internally and outputs PWM waveforms to control the motion state of the gimbal; the receiver provides 16 receiving channels. The USB-to-UART and PWM module realizes two functions: it completes the conversion from USB to UART (TTL level), with the UART interface of the module connected to the UART interface of the M100 unmanned aerial vehicle so that the onboard computer can realize flight control of the M100 through it, and it lets the onboard computer output PWM control signals, which are connected to the heading and pitch control signal lines of the gimbal so that the onboard computer can control the motion state of the gimbal;
(3) Camera and gimbal
A GoPro Hero4 high-definition camera collects the video data, and a MiNi3DPro gimbal carries the GoPro Hero4 high-definition camera to control the camera viewing angle. The gimbal is a three-axis gimbal realizing motion control in the pitch, roll and heading directions. Two control modes are provided: in one, the onboard computer outputs PWM waveforms through the USB-to-UART and PWM module to control the gimbal; in the other, the ground remote controller controls the onboard gimbal receiver R7008SB, which outputs PWM waveforms to control the gimbal;
The control signals of the gimbal heading axis and pitch axis are PWM waveforms with a frequency of 50 Hz, and position control of the heading and pitch axes is realized by adjusting the duty cycle of the control signal: a duty cycle of 5.1 % corresponds to the minimum position, 7.6 % to the neutral position and 10.1 % to the maximum position;
The mode control signal of the gimbal selects among a lock mode, a heading-and-pitch follow mode and a heading follow mode. When the signal on the mode control line is a 50 Hz PWM signal with a duty cycle between 5 % and 6 %, the gimbal enters the lock mode: heading, pitch and roll are locked, and heading and pitch are controlled by the remote controller or the onboard computer. When the duty cycle is between 6 % and 9 %, the gimbal enters the heading-and-pitch follow mode: roll is locked, heading smoothly follows the nose direction, and pitch smoothly follows the elevation angle of the aircraft. When the duty cycle is between 9 % and 10 %, the MiNi3DPro enters the heading follow mode: pitch and roll are locked, heading smoothly follows the nose direction, and pitch is controlled by the remote controller or the onboard computer.
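A small worked sketch of this duty-cycle mapping; the angle range is an assumed example, while the 5.1 % / 7.6 % / 10.1 % anchor points and the 50 Hz signal come from the description above:

```python
# Map a desired gimbal axis position to the PWM duty cycle of the 50 Hz
# heading/pitch control signal (5.1 % = minimum, 7.6 % = centre, 10.1 % = maximum).
def angle_to_duty(angle_deg, angle_min=-90.0, angle_max=90.0,
                  duty_min=5.1, duty_max=10.1):
    angle_deg = max(angle_min, min(angle_max, angle_deg))       # clamp the command
    frac = (angle_deg - angle_min) / (angle_max - angle_min)
    return duty_min + frac * (duty_max - duty_min)              # percent duty cycle

def duty_to_pulse_us(duty_percent, freq_hz=50):
    return 1e6 / freq_hz * duty_percent / 100.0                 # pulse width in microseconds

# e.g. the centre position: angle_to_duty(0.0) -> 7.6 %, i.e. about 1520 us at 50 Hz
```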
The unmanned aerial vehicle hardware system is integrated by arranging each unit module of the onboard module assembly and designing the gimbal power supply system. The 25 V voltage output by the unmanned aerial vehicle battery is converted by the three DC-DC voltage conversion modules of the onboard module assembly into 19 V, 12 V and 5 V respectively: the 19 V supply powers the onboard computer, the 12 V supply powers the gimbal, and the 5 V supply powers the RC receiver and the HDMI distributor. The internal power supply system of the unmanned aerial vehicle platform powers the Guidance vision sensors, the N1 encoder, the onboard sensors and the wireless data transmission respectively; the HDMI video capture device and the onboard wireless data transmission draw power from the USB interfaces of the onboard computer, and the camera is powered by its own battery.
Preferably, the integration of the unmanned aerial vehicle software system is completed on the ROS platform, with communication between the modules carried out through the ROS message mechanism, which keeps the two modules loosely coupled;
1) Target detection flow
The target detection system adopts the jetson-inference framework, which comprises classification, detection and segmentation; the detection module covers acquisition-graph detection, video detection and real-time camera detection, and since the video stream is ultimately decomposed into acquisition-graph frames, detection is in essence acquisition-graph detection;
The target detection system converts the video stream into an acquisition image frame, then detects the acquisition image by using a trained network model, finally obtains the category of the target and the pixel coordinates of the target frame, and the output of the target detection system is transmitted to the visual real-time space perception modeling system for positioning and navigation;
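A minimal sketch of such a detection loop, assuming the jetson-inference Python bindings; the network name, camera URI and the publish() hook are placeholders rather than the system's actual trained model and message channel:

```python
import jetson.inference
import jetson.utils

def publish(class_id, box, conf):
    # placeholder: hand the result to the SLAM/positioning module (e.g. a ROS topic)
    print(class_id, box, conf)

# Detection loop on the onboard computer: video stream -> frames -> class + box.
net = jetson.inference.detectNet("ssd-mobilenet-v2", threshold=0.5)
camera = jetson.utils.videoSource("csi://0")        # or the HDMI capture device

while True:
    frame = camera.Capture()                        # decompose the stream into frames
    detections = net.Detect(frame)                  # class + pixel box per target
    for d in detections:
        publish(d.ClassID, (d.Left, d.Top, d.Right, d.Bottom), d.Confidence)
```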
2) Three-dimensional visual RGB space extraction process
The three-dimensional visual RGB space extraction system receives the RGB acquisition graph and the depth graph, first matches the RGB acquisition graphs to obtain key frames, then matches the depth acquisition graphs to construct the point cloud map, next corrects the point cloud map through nonlinear global optimization and loop-closure detection, and finally receives the target detection results to complete target positioning in three-dimensional space;
3) Unmanned aerial vehicle airborne processing method integration
The onboard processing process of the software system is that after TX2 receives collected image data from a camera, a target detection module detects collected image content to obtain a position coordinate of a target, and then the position coordinate is transmitted to a visual real-time space perception modeling system in real time, and at the moment, the visual real-time space perception modeling system reconstructs a target detection result corresponding to a key frame in a three-dimensional space according to the key frame composition to realize space content extraction.
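A minimal sketch of this loose ROS coupling; the topic name, the message layout and the slam object with its annotate_current_keyframe() method are hypothetical placeholders for the system's own interfaces:

```python
import rospy
from std_msgs.msg import Float32MultiArray

# The detection node publishes [class_id, x1, y1, x2, y2] arrays; the SLAM side
# subscribes and re-projects each box into the 3D map via the current key frame.
def on_detection(msg, slam):
    class_id, x1, y1, x2, y2 = msg.data
    slam.annotate_current_keyframe(int(class_id), (x1, y1, x2, y2))

def main(slam):
    rospy.init_node("target_to_map_bridge")
    rospy.Subscriber("/detector/boxes", Float32MultiArray,
                     lambda m: on_detection(m, slam))
    rospy.spin()
```

Using a plain message topic is what keeps the two modules loosely coupled: either side can be restarted or replaced without modifying the other.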
Compared with the prior art, the application has the innovation points and advantages that:
(1) In the application, the spatial perception module of the unmanned aerial vehicle is perfected by integrating the three-dimensional visual RGB space extraction method, a content extraction module with real-time target fine positioning and fine recognition capability is added by optimizing the target detection method, and finally the two methods are fused to provide a basis for navigation and path planning, establishing an intelligent unmanned aerial vehicle system comprising software and hardware. The content extraction method based on convolutional neural network target detection, together with a real-time three-dimensional map, helps the unmanned aerial vehicle position itself accurately; the target-detection-based content extraction identifies the targets in the scene and gives their positions, and combining the two provides meaningful reference data for obstacle avoidance and path planning, so that the unmanned aerial vehicle can fly with no or only partial human intervention. The spatial scene perception method and the target-detection-based content extraction method are embedded in a microcomputer to establish the unmanned aerial vehicle-mounted intelligent processing system, and both methods achieve real-time processing. Realizing onboard real-time spatial perception and target-recognition-based content extraction is not limited to solving obstacle avoidance and path planning: the collected data can be used efficiently, obstacle avoidance and path planning are successfully completed when the environment is unknown and the unmanned aerial vehicle must perform obstacle-avoidance flight and exploration, and the approach can be extended to other application directions, which is a deep exploitation and expansion of unmanned aerial vehicle applications and therefore has large application value.
(2) The application constructs a fast deep convolutional neural network feature extractor, learns and extracts the features of all targets of interest for specific target detection and identification, uses monocular ORB visual real-time spatial perception modeling to realize fast composition of specific scenes, marks the targets of interest in the visual real-time spatial perception modeling map with the integrated target detection module and gives their specific position information while ensuring a certain accuracy, applies the machine vision target detection method to content-based scene perception, fuses it with the space-based scene perception method, establishes an intelligent unmanned aerial vehicle system, and, with the unmanned aerial vehicle and a Jetson TX2 as the carrying and development platform, combines visual real-time spatial perception modeling and target detection for the unmanned aerial vehicle in specific scenes.
(3) The application 1) carries a depth camera on the unmanned aerial vehicle, provides RGB and depth acquisition graphs for real-time spatial perception modeling, and finally provides a three-dimensional point cloud map, the spatial positioning of the unmanned aerial vehicle and data support for navigation; 2) extracts target detection content based on deep learning, optimizes the real-time processing capability of the detection network according to the flight speed of the unmanned aerial vehicle, converts the detected targets into three-dimensional space using the conversion relation obtained from scene perception, and provides the spatial positions of the various targets so that the unmanned aerial vehicle can use the spatial relations for obstacle avoidance and path planning; 3) establishes the real-time perception and target detection system of the unmanned aerial vehicle, embedding the two optimized modules into a microcomputer and building an intelligent system, comprising software and hardware, suitable for real-time perception and target detection of the unmanned aerial vehicle, with high spatial perception efficiency and fast, accurate target detection.
Drawings
Fig. 1 is a general block diagram of the on-board hardware of the unmanned aerial vehicle hardware system.
Fig. 2 is a system block diagram of an unmanned aerial vehicle hardware on-board module assembly.
FIG. 3 is a schematic diagram illustrating the connection between the output pins of the USB-to-UART and PWM module and the gimbal control signal lines.
Fig. 4 is a schematic diagram of the control signals of the gimbal heading axis and pitch axis.
Fig. 5 is a schematic diagram of the mode control signal of the gimbal.
Fig. 6 is a schematic diagram of each unit module of the unmanned aerial vehicle on-board module assembly.
Fig. 7 is a schematic diagram of the overall framework of the unmanned aerial vehicle on-board processing method.
Fig. 8 is a schematic diagram of an outdoor application case of the unmanned aerial vehicle of the present application.
Fig. 9 is a schematic diagram of an indoor application case of the unmanned aerial vehicle of the present application.
Detailed Description
The technical scheme of the intelligent unmanned aerial vehicle real-time space sensing and target fine detection system provided by the application is further described below with reference to the accompanying drawings, so that the application can be better understood and implemented by those skilled in the art.
With the great improvement of software and hardware performance, machine vision has improved in both accuracy and real-time performance, and content-based and space-based scene perception methods have multiplied in recent years. The target detection method achieves real-time performance while ensuring a certain accuracy; the machine vision target detection method is applied to content-based scene perception and fused with the space-based scene perception method to build an intelligent unmanned aerial vehicle system. With the unmanned aerial vehicle and a Jetson TX2 as the carrying and development platform, visual real-time spatial perception modeling and target detection are combined for the unmanned aerial vehicle in a specific scene whose flight speed is limited by that scene;
(1) The content-based scene perception method completes the content-based scene perception task with a deep-learning target detection method. Using the unmanned aerial vehicle as the carrying platform differs from ordinary target detection: first, the unmanned aerial vehicle has a certain flight speed, which imposes a real-time requirement on target detection; second, as the unmanned aerial vehicle flies past an object it moves from far to near, and the shooting angle may be a front view, an oblique view or a top-down view depending on the relative position of the unmanned aerial vehicle and the object, so the target detection method must be rotation- and scale-invariant. A neural network is optimized and constructed to complete the real-time content extraction task of the unmanned aerial vehicle, and with a sufficient amount of data it is trained to accurately identify the targets in the specific scene.
(2) The space-based scene perception method comprises the steps of completing matching and realizing rapid composition by adopting a fast ORB corner detection method, and integrating a visual real-time space perception modeling method into the real-time space perception of a specific scene of the unmanned aerial vehicle.
(3) The intelligent unmanned aerial vehicle system is formed by embedding a space perception module based on a visual real-time space perception modeling method and a target detection content extraction method based on a convolutional neural network into a microcomputer and combining the space perception module and the target detection content extraction method with the unmanned aerial vehicle system. Through experiments, the problem of carrying two kinds of perception modules in the actual flight process is solved.
The method comprises the steps of improving a space perception module of an unmanned aerial vehicle through an integrated three-dimensional visual RGB space extraction method, improving a target detection method, adding a real-time target fine positioning and target fine recognition capability improvement content extraction module to the unmanned aerial vehicle, finally providing navigation and path planning basis for the unmanned aerial vehicle through fusion of the two methods, and establishing an intelligent unmanned aerial vehicle system comprising software and hardware;
The application establishes a sample library and an interesting target feature library of indoor interesting targets, realizes real-time detection and matching of targets by using an end-to-end based rapid neural network, performs real-time composition by using a three-dimensional visual RGB space extraction system while detecting the targets, marks the identified interesting targets in a simulation graph and gives specific position information, and the core method comprises the following steps:
(1) The method is based on a visual real-time space perception modeling method, which comprises the steps of carrying a depth camera on an unmanned aerial vehicle, providing an RGB acquisition chart and a depth acquisition chart for real-time space perception modeling, adopting real-time space perception modeling, namely three-dimensional visual RGB space extraction to finish the perception of a space scene, finally providing a three-dimensional point cloud map for the unmanned aerial vehicle, providing the space positioning of the unmanned aerial vehicle, and providing data support for unmanned aerial vehicle navigation;
(2) The method comprises the steps of extracting target detection content based on deep learning, optimizing real-time processing capacity in terms of target detection by a network based on the speed of unmanned aerial vehicle flight, converting a target detection frame target into a three-dimensional space by combining a conversion relation obtained by scene perception, providing the spatial positions of various targets for the unmanned aerial vehicle, and utilizing the spatial relation unmanned aerial vehicle to perform obstacle avoidance and path planning;
(3) And establishing a real-time sensing and target detection system of the unmanned aerial vehicle, namely based on optimization of real-time scene sensing and real-time target detection, embedding the realization of the two modules into a microcomputer, and establishing an intelligent system comprising software and hardware and suitable for the real-time sensing and target detection of the unmanned aerial vehicle.
1. Content extraction method for real-time accurate target detection
(I) Target real-time accurate detection network data format
Each training data sample comprises a large acquisition graph containing a plurality of objects; for each object in the acquisition graph, the training label includes not only the class of the object but also the coordinates of each corner point of its bounding box. Because the number of objects differs between training acquisition graphs, label formats of different lengths and dimensions would make the loss function difficult to define; this problem is solved by introducing a fixed three-dimensional label format, and the defined format can accept an acquisition graph of any size containing any number of objects.
The acquisition graph is segmented with a regular grid whose cell size is slightly smaller than the smallest object to be detected. Each grid cell carries two pieces of key information: the class of the object and the corner coordinates of the bounding box of the object covering that cell. In addition, when a cell contains no object, a special custom class, the 'dontcare' class, is used so that the data representation keeps a uniform fixed size, and an object coverage value of 0 or 1 is also set to indicate whether the cell contains an object. When several objects fall in the same cell, the object occupying the most pixels in the cell is selected, and when objects overlap, the object whose bounding box has the minimum Y value is used.
(II) real-time accurate detection network framework for targets
The real-time accurate target detection network training is divided into three steps:
the first step, a data layer acquires a training acquisition graph and a label, and a conversion layer carries out online data enhancement;
The second step, the full convolution network performs feature extraction and prediction on the object class and the boundary frame of each grid;
Predicting the object category and the target boundary box of each grid respectively, and then simultaneously calculating errors of two prediction tasks by using a loss function;
the prediction process comprises two points: a clustering function generates the final set of boxes during verification, and a simplified mAP (mean average precision) value measures the performance of the model on the verification data set;
The network receives input acquisition graphs of different sizes and efficiently applies the CNN in a strided sliding-window manner, outputting a multi-dimensional array overlaid on the acquisition graph; GoogLeNet with the final pooling layer removed is used, giving the CNN a sliding-window receptive field of 555 × 555 pixels with a stride of 16 pixels;
A final optimized loss function is generated as a linear combination of two independent loss functions: over the training data samples, the sum of the squares of the differences between the true and predicted object coverage of all grid cells, and the mean absolute difference between the true and predicted bounding-box corner points of the object covered at each cell.
2. Visual real-time space perception modeling method
First, real-time space perception modeling framework
The real-time space perception modeling process comprises the following steps:
Reading sensor acquisition graph data, namely reading and preprocessing acquisition graph information of an unmanned aerial vehicle camera in real-time space perception modeling, wherein the data of a depth camera comprises an RGB acquisition graph and a depth graph corresponding to the RGB acquisition graph;
Modeling a visual odometer, namely calculating the attitude change of a camera and a local map by estimating the rotation and translation relation between every two adjacent acquisition graphs by the visual odometer, wherein the key of the step is feature point extraction and acquisition graph matching;
the back end adopts a nonlinear global optimization algorithm to optimize the camera positions and attitudes from the front end together with the loop-closure detection results from another thread, and corrects the globally consistent trajectory map and point cloud map;
Loop-closure detection judges whether the sensor, or the unmanned aerial vehicle carrying it, has previously passed through the current scene; if a place has been visited before, this information is provided to the back end to correct the position and attitude again;
and fourthly, constructing a cruise map of the unmanned aerial vehicle which meets the task requirements according to the estimated camera track.
1. Visual odometer modeling
The front end receives the video stream of the camera, i.e., the image sequence, estimates the camera motion between adjacent frames by feature matching, and preliminarily obtains odometry information with some accumulated error. Visual odometry modeling comprises four parts (a minimal data-structure sketch follows the list):
First, the frame: the information it carries comprises the pose of the unmanned aerial vehicle camera at the moment the frame was captured, the RGB image, and the depth map;
Second, the camera model: it corresponds to the camera actually used for shooting and contains only the intrinsic parameters;
Third, the map: key frames and landmark points satisfying the matching rule are added to the map; this map is only a local map rather than a global one, containing only landmark points near the current position, while more distant landmark points are deleted.
Fourth, the landmark points: these are map points whose information is known, the known information being the feature descriptor associated with each point; they are obtained by extracting features in batches with a feature matching algorithm.
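A minimal data-structure sketch of these four parts; the field names are illustrative assumptions rather than definitions from this description.

```python
# Hedged sketch of the visual-odometry entities: frame, camera model (intrinsics
# only), local map, and landmark/map point with its feature descriptor.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Camera:                      # intrinsics only, matching the actual camera
    fx: float; fy: float; u0: float; v0: float

@dataclass
class Frame:                       # one captured frame with its pose and images
    pose: np.ndarray               # 4x4 camera pose at capture time
    rgb: np.ndarray                # RGB image
    depth: np.ndarray              # aligned depth map

@dataclass
class MapPoint:                    # landmark: known 3D position plus descriptor
    xyz: np.ndarray
    descriptor: np.ndarray         # feature description from batch extraction

@dataclass
class LocalMap:                    # local map only: keyframes and nearby landmarks
    keyframes: list = field(default_factory=list)
    points: list = field(default_factory=list)

    def prune_far(self, center: np.ndarray, radius: float):
        # drop landmarks far from the current position, as described above
        self.points = [p for p in self.points
                       if np.linalg.norm(p.xyz - center) <= radius]
```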
2. Backend global optimization
The back end analyzes the data to handle noise, using either a linear or a nonlinear global optimization algorithm;
Linear global optimization assumes a linear relation between the image frames captured during shooting and performs state estimation with a Kalman filter; if only adjacent frames are assumed to be linearly related, state estimation is completed with an extended Kalman filter. The quantity optimized is the difference between the observed value and the estimated value, i.e., the error between the observed pixel coordinates and the pixel coordinates of the corresponding 3D point projected onto the two-dimensional plane through the camera pose. Linear global optimization assumes a causal relation between the errors of the camera pose and of the spatial points: the camera pose is solved first, and the positions of the spatial points are then solved from that pose. Nonlinear global optimization instead puts all the data into the same model and solves them jointly, weakening the sequential relation between the data (a sketch of the reprojection error term follows).
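The error term described above can be sketched as follows; the function and parameter names are illustrative, and the projection uses the standard pinhole model with the intrinsics defined later in this description.

```python
# Hedged sketch: difference between an observed pixel and the projection of the
# corresponding 3D point through the current camera pose estimate.
import numpy as np

def reprojection_error(p_world, uv_observed, R, t, fx, fy, u0, v0):
    p_cam = R @ p_world + t                      # world point into camera frame
    u = fx * p_cam[0] / p_cam[2] + u0            # pinhole projection
    v = fy * p_cam[1] / p_cam[2] + v0
    return np.array([uv_observed[0] - u, uv_observed[1] - v])

# Nonlinear global optimization minimizes the sum of squared reprojection errors
# over all poses and points jointly, instead of filtering frame by frame.
```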
3. Loop closure detection
Loop closure detection is implemented with a bag-of-words model, which abstracts features into words; detection then matches the words appearing in two images to judge whether they describe the same scene. Classifying features into words requires training a dictionary that covers all possible words, which in turn needs a massive amount of data; building the dictionary is a clustering process. Assuming 100 million features are extracted from all images, the K-means method clusters them into 100 thousand words. During training, the dictionary is organized as a tree with k branches and depth d, the upper nodes of the tree providing coarse classification and the lower nodes fine classification, extending down to the leaf nodes. Using the tree reduces the time complexity of word lookup to a logarithmic level and accelerates feature matching (see the dictionary-tree sketch below).
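A minimal sketch of such a dictionary tree, assuming the descriptors are stored as a NumPy array and using scikit-learn's KMeans for the clustering; k, the depth, and all function names are illustrative.

```python
# Hedged sketch: recursive k-means with k branches and depth d, then
# logarithmic-time word lookup by descending one branch per level.
import numpy as np
from sklearn.cluster import KMeans

def build_vocab_tree(descriptors, k=10, depth=3):
    if depth == 0 or len(descriptors) < k:
        return {"leaf": True}                            # word ids assigned afterwards
    km = KMeans(n_clusters=k, n_init=4).fit(descriptors)
    children = []
    for i in range(k):
        subset = descriptors[km.labels_ == i]
        children.append(build_vocab_tree(subset, k, depth - 1))
    return {"leaf": False, "kmeans": km, "children": children}

def lookup(tree, descriptor):
    # O(k * d) instead of scanning all words
    node, path = tree, []
    while not node["leaf"]:
        i = int(node["kmeans"].predict(descriptor.reshape(1, -1))[0])
        path.append(i)
        node = node["children"][i]
    return tuple(path)   # the leaf path identifies the visual word
```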
4. Mapping
After the camera poses have been optimized and corrected, the data collected by the camera are used to convert two-dimensional plane points into three-dimensional space, forming three-dimensional point cloud information. Besides the point cloud map, the camera pose optimization process can be displayed in the g2o tool as a pose graph, and the map can be defined and described according to the specific situation.
(II) three-dimensional visual RGB space extraction method
The sensor is a monocular camera equipped to acquire depth information (a depth camera), and the data source comprises an RGB image and a depth map.
A color image and a depth image are obtained with the depth camera, and a geometric model transfers the 2D plane data into 3D space. The origins of the pixel coordinate system and the image coordinate system differ only by an offset, and the axes of the image coordinate system and the camera coordinate system are parallel, differing only by a scaling. The conversion from pixel coordinates to the camera coordinate system is expressed as follows:
x_c = (u - u_0) * d_x, y_c = (v - v_0) * d_y, z_c = d, where u_0, v_0 are the offsets between the origin of the pixel coordinate system and the center of the imaging plane, d_x, d_y are the scaling factors between the pixel coordinates and the actual imaging plane, with d_x = z_c / f_x and d_y = z_c / f_y, and f_x, f_y are the focal lengths of the camera on the x and y axes; written in matrix form: z_c * [u, v, 1]^T = [[f_x, 0, u_0], [0, f_y, v_0], [0, 0, 1]] * [x_c, y_c, z_c]^T.
Because of camera motion, the camera coordinate system and the world coordinate system are not parallel but related by a rotation and a translation; in the subsequent visual odometry calculation the relation between consecutive frames has the same form. The matrix relation is: [x_c, y_c, z_c, 1]^T = [[R, t], [0, 1]] * [x_w, y_w, z_w, 1]^T, where R is the 3x3 rotation matrix and t the translation vector.
Converting the points of the two-dimensional plane into three-dimensional space finally yields a series of point cloud data, and assigning RGB color attributes to these points gives a preliminary colored three-dimensional map (a back-projection sketch follows).
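A minimal back-projection sketch under the model above, assuming a depth map stored in millimeters (the depth_scale value is therefore an assumption) and NumPy arrays for the images.

```python
# Hedged sketch: each pixel (u, v) with depth d is back-projected into camera
# coordinates with the intrinsics fx, fy, u0, v0 and colored from the RGB image.
import numpy as np

def backproject(rgb, depth, fx, fy, u0, v0, depth_scale=1000.0):
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    z = depth.astype(np.float32) / depth_scale      # raw depth to meters (assumed)
    x = (u - u0) * z / fx                           # x_c = (u - u0) * z / fx
    y = (v - v0) * z / fy                           # y_c = (v - v0) * z / fy
    valid = z > 0
    points = np.stack([x[valid], y[valid], z[valid]], axis=1)
    colors = rgb[valid]                             # attach RGB attributes
    return points, colors

# A world-frame cloud is obtained by applying the estimated pose: P_w = R @ P_c + t.
```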
1. Three-dimensional visual RGB space extraction implementation
(1) Front-end visual odometry
Initialization takes the first frame as the reference for key frame search. Matching between pairs of images uses the ORB algorithm to extract key points, then computes a BRIEF descriptor for each key point, and finally performs fast matching with a fast approximate nearest neighbor algorithm. The ORB corner extraction algorithm adds scale and rotation description to the FAST corner extractor, enriching the feature information, so the feature description is richer, the matching accuracy is higher, and the mapping is more accurate and reliable; the BRIEF descriptor is a binary descriptor built by comparing randomly selected point pairs;
After matching is completed, the 2D points are projected into 3D space according to the depth image, giving the 2D coordinates and corresponding 3D coordinates of a series of points; the camera pose is then estimated by solving a PnP problem, the actual result being the rotation and translation matrix between the two frames. All frames are matched pairwise in sequence and the camera poses computed, finally yielding a complete visual odometry (a sketch of this front-end step follows).
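The front-end step can be sketched with OpenCV as follows. A brute-force Hamming matcher stands in here for the fast approximate nearest neighbor (FLANN-type) matcher named above, purely to keep the sketch short; the camera matrix K, the depth scale, and the function name are assumptions.

```python
# Hedged sketch: ORB keypoints, descriptor matching, back-projection of matched
# points to 3D via the depth map, then solvePnPRansac for the frame-to-frame motion.
import cv2
import numpy as np

def estimate_motion(rgb1, depth1, rgb2, K, depth_scale=1000.0):
    orb = cv2.ORB_create(1000)
    kp1, des1 = orb.detectAndCompute(rgb1, None)
    kp2, des2 = orb.detectAndCompute(rgb2, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)  # binary descriptors
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

    fx, fy, u0, v0 = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    pts3d, pts2d = [], []
    for m in matches:
        u, v = kp1[m.queryIdx].pt
        z = depth1[int(v), int(u)] / depth_scale
        if z <= 0:
            continue
        pts3d.append([(u - u0) * z / fx, (v - v0) * z / fy, z])  # 2D -> 3D
        pts2d.append(kp2[m.trainIdx].pt)

    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.array(pts3d, np.float32), np.array(pts2d, np.float32), K, None)
    R, _ = cv2.Rodrigues(rvec)          # rotation matrix between the two frames
    return R, tvec
```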
(2) Backend nonlinear global optimization
In building the front-end visual odometry, only adjacent pairs of frames are matched and the corresponding camera poses solved, so error accumulation cannot be avoided and the camera poses need to be corrected.
Three-dimensional visual RGB space extraction expresses the poses computed by the visual odometry as a pose graph consisting of nodes and edges: the nodes represent the individual camera poses and the edges represent the transformations between them. The pose graph not only describes the visual odometry intuitively but also makes the changes of camera pose easy to understand. The nonlinear global optimization is expressed as graph optimization; since the same scene cannot appear in many positions, the pose graph is sparse, and a sparse BA algorithm is used to solve it and correct the camera poses (a sketch of the pose graph structure follows).
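A minimal sketch of the pose graph structure only; actually solving it with sparse BA (for example through g2o) is left to the optimization library, and the class and method names are illustrative.

```python
# Hedged sketch: nodes hold camera poses, edges hold relative transforms from
# odometry or loop closures; loop closure edges are what make correction possible.
import numpy as np

class PoseGraph:
    def __init__(self):
        self.nodes = []      # 4x4 camera poses
        self.edges = []      # (i, j, T_ij): relative transform between poses i and j

    def add_pose(self, T: np.ndarray) -> int:
        self.nodes.append(T)
        return len(self.nodes) - 1

    def add_edge(self, i: int, j: int, T_ij: np.ndarray):
        # odometry edges connect consecutive poses; loop closure edges connect
        # distant poses, which is why the graph stays sparse
        self.edges.append((i, j, T_ij))
```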
(3) Loop closure detection
Although back-end global optimization corrects the camera poses to some extent, a problem remains: after some time, when the unmanned aerial vehicle returns to the starting point or to the same place, the system must be able to tell whether this is the starting point or a place it has already visited. This is the problem loop closure detection solves. If image matching identifies the same scene, the back end obtains an additional piece of optimization information and adjusts the trajectory and the map according to the loop closure result, completing the global optimization. Detecting whether a place has been visited requires comparing the similarity of the current frame with all previous frames, but the longer the flight, the larger the data volume, which greatly reduces real-time performance. To achieve a better effect, three-dimensional visual RGB space extraction replaces exhaustive traversal with a combination of close-range loops and random loops: a close-range loop matches the current frame against the previous n frames, with n chosen according to the situation, while a random loop matches the current frame against n randomly selected earlier frames. A loop closure is reported once such a match is detected (see the candidate-selection sketch below).
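The candidate selection can be sketched as follows; the function name and the default n are illustrative.

```python
# Hedged sketch: compare the current frame against the last n frames (close-range
# loop) plus n randomly chosen earlier frames (random loop) instead of all frames.
import random

def loop_candidates(num_frames_so_far, current_idx, n=5):
    recent = list(range(max(0, current_idx - n), current_idx))   # close-range loop
    older = list(range(0, max(0, current_idx - n)))
    randomly = random.sample(older, min(n, len(older)))          # random loop
    return recent + randomly
```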
3. Unmanned aerial vehicle scene perception and target precision detection system
(I) Unmanned aerial vehicle hardware system
The overall block diagram of the onboard hardware is shown in FIG. 1 and comprises an onboard computer, an onboard module assembly, a camera with a gimbal, and an M100 quadrotor unmanned aerial vehicle. The M100 quadrotor provides the flight platform and the basic flight functions; the camera and gimbal are the image acquisition components of the onboard hardware system; the onboard module assembly is the hardware located between the camera gimbal, the M100 quadrotor, and the onboard computer. Its functions are: 1) sending the video data captured by the camera to the onboard computer; 2) letting the onboard computer control the gimbal through the onboard module assembly; 3) letting the onboard computer control the flight of the M100 quadrotor through the onboard module assembly; 4) feeding the camera's video data into the image transmission system of the M100 quadrotor through the onboard module assembly; and 5) voltage conversion, in which the 24 V obtained from the battery of the M100 unmanned aerial vehicle is converted to 12 V to power the gimbal and the onboard computer.
1. Unmanned aerial vehicle hardware system composition
(1) Airborne computer
The onboard computer is an NVIDIA Jetson TX2 (RTS-ASG003) microcomputer with a total weight of 170 g and the size of a bank card, making it very light.
(2) Airborne module assembly
The onboard module assembly is the intermediate execution and processing unit of the whole onboard hardware system; its block diagram is shown in FIG. 2. Its functions are: 1) the video output of the high-definition camera is split into two paths by an HDMI distributor, one path entering the video capture device and being output to the onboard computer, the other being output through the N1 encoder to the wireless image transmission system of the M100 unmanned aerial vehicle; 2) the USB-to-UART (universal asynchronous receiver/transmitter) and PWM module lets the onboard computer control the flight of the M100 and the gimbal; 3) the vision sensor enables autonomous obstacle avoidance of the M100 unmanned aerial vehicle; 4) the RC receiver receives control signals from the ground remote controller to control the gimbal; 5) the wireless data transmission module provides a low-bandwidth data link between the onboard hardware system and the ground system; and 6) power is supplied to the onboard computer, the gimbal, and all components of the onboard module assembly.
The video acquisition part of the onboard module assembly consists of the HDMI distributor, the video capture device, and the N1 encoder; the HDMI distributor splits the video stream from the high-definition camera into two paths, which enter the video capture device and the N1 encoder respectively through HDMI interfaces.
The wireless data transmission module of the onboard module assembly and the one at the ground end form a low-bandwidth wireless data link, with the sky end and the ground end providing a path for bidirectional data transmission. The Guidance unit is a relatively independent vision sensing system equipped with five sets of combined vision-ultrasonic sensors that monitor environmental information in multiple directions in real time and perceive obstacles; working with the unmanned aerial vehicle's flight controller, it lets the aircraft avoid possible collisions in time during high-speed flight. The RC wireless receiver R7008SB receives the control signals of the ground remote controller and, after internal processing, outputs PWM waveforms to control the motion state of the gimbal; the receiver provides 16 receiving channels. The USB-to-UART and PWM module performs USB to UART (TTL level) conversion; its UART interface is connected to the UART interface of the M100 unmanned aerial vehicle, so the onboard computer can autonomously control the flight of the unmanned aerial vehicle, and the module can also output PWM waveforms to control the motion state of the gimbal. FIG. 3 shows the connection of the output pins of the USB-to-UART and PWM module to the gimbal control signal lines.
(3) Camera and gimbal
A GoPro Hero4 high-definition camera collects the video data, and a MiNi3D Pro gimbal carries the GoPro Hero4 camera to control the viewing angle. The gimbal is a three-axis gimbal realizing motion control in pitch, roll, and heading. Two control modes are provided: in one, the onboard computer outputs PWM waveforms through the USB-to-UART and PWM module to control the gimbal; in the other, the ground remote controller drives the onboard gimbal receiver R7008SB to output PWM waveforms that control the gimbal.
FIG. 4 shows the control signals of the gimbal heading (yaw) axis and pitch axis. Both are 50 Hz PWM waveforms, and position control of the heading and pitch axes is achieved by adjusting the duty cycle of the control signal: a duty cycle of 5.1% corresponds to the minimum position, 7.6% to the neutral position, and 10.1% to the maximum position.
A mode control signal of the gimbal selects among a lock mode, a heading-and-pitch follow mode, and a heading follow mode. When the signal on the mode control line is a 50 Hz PWM signal with a duty cycle between 5% and 6%, the gimbal enters the lock mode: heading, pitch, and roll are locked, and heading and pitch can be controlled by the remote controller or the onboard computer. When the duty cycle is between 6% and 9%, the gimbal enters the heading-and-pitch follow mode: roll is locked, heading smoothly follows the direction of the aircraft nose, and pitch follows the elevation of the nose. When the duty cycle is between 9% and 10%, the MiNi3D Pro gimbal enters the heading follow mode: heading smoothly follows the direction of the nose, while pitch can be controlled by the remote controller or the onboard computer (a small duty-cycle conversion sketch follows).
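A small conversion sketch based on the figures above, mapping a normalized position command to a duty cycle and the corresponding pulse width; the normalized command interface is an assumption for illustration.

```python
# Hedged sketch: map a command in [-1, 1] to a duty cycle between 5.1% (minimum),
# 7.6% (neutral) and 10.1% (maximum) on a 50 Hz PWM signal.
def gimbal_duty_cycle(command: float) -> float:
    command = max(-1.0, min(1.0, command))
    mid, half_range = 7.6, 2.5          # 7.6% center, +/-2.5% to the end stops
    return mid + command * half_range   # command=0 -> 7.6%, command=1 -> 10.1%

def duty_to_pulse_us(duty_pct: float, freq_hz: float = 50.0) -> float:
    period_us = 1e6 / freq_hz           # 20 ms period at 50 Hz
    return period_us * duty_pct / 100.0 # e.g. 7.6% -> 1520 us pulse width
```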
(4) Unmanned aerial vehicle interface
The battery power output interface of the unmanned aerial vehicle feeds the power input interface of the onboard module assembly, so the onboard equipment is powered. The input and output of the vision obstacle avoidance system are connected to the CAN-Bus of the unmanned aerial vehicle, and together with the unmanned aerial vehicle's flight controller it realizes autonomous obstacle avoidance; the power and video interfaces of the N1 encoder are connected to dedicated interfaces on the unmanned aerial vehicle.
2. Unmanned aerial vehicle hardware system integration
The layout of the unmanned aerial vehicle onboard module assembly is shown in FIG. 6. A gimbal power supply system is designed: the 25 V output of the unmanned aerial vehicle battery passes through three DC-DC voltage conversion modules of the onboard module assembly and is converted to 19 V, 12 V, and 5 V respectively, where the 19 V rail powers the onboard computer, the 12 V rail powers the gimbal, and the 5 V rail powers the RC receiver and the HDMI distributor. The internal power supply system of the unmanned aerial vehicle platform powers the Guidance vision sensors, the N1 encoder, the onboard wireless image transmission, and the wireless data transmission; the HDMI video capture device and the onboard wireless data transmission draw power from the USB interface of the onboard computer, and the camera is powered by its own battery.
(II) Unmanned aerial vehicle software system design
Integration of the unmanned aerial vehicle software system is completed on the ROS platform; the modules communicate through the ROS message mechanism, and because the two modules are only loosely coupled, they can continue to be improved separately later (a minimal messaging sketch follows).
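A minimal sketch of this loose message-based coupling using rospy; the topic name and the flat [class_id, x1, y1, x2, y2] message layout are assumptions rather than this system's interface, and the two functions would run as separate ROS nodes.

```python
# Hedged sketch: the detection module publishes target boxes on a topic and the
# space perception/mapping module subscribes to them.
import rospy
from std_msgs.msg import Float32MultiArray

def detection_publisher():
    rospy.init_node('target_detection')
    pub = rospy.Publisher('/target_detection/boxes', Float32MultiArray, queue_size=10)
    rate = rospy.Rate(30)
    while not rospy.is_shutdown():
        msg = Float32MultiArray()
        msg.data = [0.0, 120.0, 80.0, 260.0, 200.0]   # [class_id, x1, y1, x2, y2]
        pub.publish(msg)
        rate.sleep()

def mapping_subscriber():
    rospy.init_node('space_perception')
    rospy.Subscriber('/target_detection/boxes', Float32MultiArray,
                     lambda msg: rospy.loginfo('box: %s', msg.data))
    rospy.spin()
```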
1. Target detection flow
The target detection system adopts the jetson-inference framework and comprises classification, detection, and segmentation; the detection module covers image detection, video detection, and real-time camera detection. Since a video stream is ultimately decomposed into image frames, detection is in essence image detection.
The target detection system converts the video stream into image frames, detects each image with the trained network model, and finally obtains the category of the target and the pixel coordinates of the target box. The output of the target detection system is passed to the visual real-time space perception modeling system for positioning and navigation (a sketch of real-time camera detection follows).
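A sketch of real-time camera detection with the jetson-inference Python API; the pretrained model name and the video source URI are illustrative, and the system described here would instead load its own trained DetectNet-style weights.

```python
# Hedged sketch: capture frames from a camera and run detection on each one,
# printing class, confidence, and box coordinates for every detected target.
import jetson.inference
import jetson.utils

net = jetson.inference.detectNet("ssd-mobilenet-v2", threshold=0.5)  # placeholder model
camera = jetson.utils.videoSource("csi://0")                         # or a file / stream
display = jetson.utils.videoOutput("display://0")

while display.IsStreaming():
    img = camera.Capture()                        # one frame from the video stream
    detections = net.Detect(img)                  # class + box for each target
    for det in detections:
        print(net.GetClassDesc(det.ClassID), det.Confidence,
              det.Left, det.Top, det.Right, det.Bottom)
    display.Render(img)
```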
2. Three-dimensional visual RGB space extraction process
The three-dimensional visual RGB space extraction system receives the RGB image and the depth image. It first matches the RGB images to obtain key frames, then uses the depth images to construct the point cloud map, then performs nonlinear global optimization and loop closure detection to correct the point cloud map, and finally receives the target detection result to complete target positioning in three-dimensional space.
3. Unmanned aerial vehicle airborne processing method integration
The overall framework of the unmanned aerial vehicle onboard processing method is shown in FIG. 7. In the onboard software processing flow, the TX2 receives image data from the camera, the target detection module detects the image content to obtain the position coordinates of the target, and these are transmitted in real time to the visual real-time space perception modeling system, which at that moment is mapping based on key frames. Experiments show that, because the video stream is decomposed into image frames, the continuous images include the key frames, and the key frames therefore also carry target detection results; the detection results corresponding to the key frames are directly reconstructed in the three-dimensional space to realize spatial content extraction (a sketch of this step follows).
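A minimal sketch of attaching a detection result to the three-dimensional map: the center pixel of the detected box is read from the key frame's depth map, back-projected with the intrinsics, and transformed by the key frame pose. All names and the depth scale are illustrative assumptions.

```python
# Hedged sketch: project a 2D detection box into the 3D world frame.
import numpy as np

def locate_target(box, depth, pose, fx, fy, u0, v0, depth_scale=1000.0):
    x1, y1, x2, y2 = box
    u, v = int((x1 + x2) / 2), int((y1 + y2) / 2)      # box center pixel
    z = depth[v, u] / depth_scale
    p_cam = np.array([(u - u0) * z / fx, (v - v0) * z / fy, z, 1.0])
    return (pose @ p_cam)[:3]                          # 4x4 keyframe pose -> world point
```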
4. Unmanned aerial vehicle outdoor application case
The experimental site is a playground with many pedestrians, which fully tests the speed and accuracy of target detection on the microcomputer and the adaptability of target detection in flight.
As shown in FIG. 8, the experimental results show that the unmanned aerial vehicle carrying the microcomputer detects dense crowds very well with DetectNet, with essentially no missed detections. The effect differs little in a slow-moving state, and the task is also completed well in a fast-moving state; the only issue is that at the instant the viewing angle is rotated, the detection box can drift briefly but quickly returns to the correct position. The experiments verify that the application can meet the goal of real-time and accurate target detection in practical use.
5. Unmanned aerial vehicle indoor application case
The experimental site is a room with a relatively complex scene; the indoor environment is characterized by narrow space and many obstacles. When the unmanned aerial vehicle flies indoors and the GPS navigation system fails, navigation is completed by combining inertial navigation with visual perception. Inertial navigation is accurate when the carrier changes direction suddenly but accumulates large errors over long runs, so the unmanned aerial vehicle is positioned by scene perception based on real-time space perception modeling, while the attitude detected by inertial navigation during sudden direction changes is added to the back-end global optimization stage of the real-time space perception modeling to correct the unmanned aerial vehicle's attitude; at the same time, the specific spatial positions of obstacles found by target detection provide a basis for unmanned aerial vehicle navigation and path planning. The indoor mapping and detection effect is shown in FIG. 9:
The output data of the system are the spatial position of the detected target and the attitude of the unmanned aerial vehicle. The spatial perception effect is accurate, and after the target box is normalized an accurate target position can be provided to the unmanned aerial vehicle. It follows that accurate indoor positioning of the unmanned aerial vehicle can also be achieved by combining real-time space perception modeling with target detection.
6. Summary
Beyond integrating real-time space perception modeling, target detection, and the unmanned aerial vehicle system, the unmanned aerial vehicle also carries subsystems for control, positioning and navigation, path planning, and other functions, so when further applications are integrated there is still much work to do on cooperation between the modules and on improving their combined effect. The real-time perception method and the real-time target detection method are integrated on the microcomputer using the unmanned aerial vehicle software platform ROS and, together with the unmanned aerial vehicle system, form an intelligent unmanned aerial vehicle system.
(1) The method realizes SLAM-based spatial scene perception of the unmanned aerial vehicle, establishes a three-dimensional point cloud picture and positioning by utilizing real-time spatial perception modeling, and provides spatial position information for the unmanned aerial vehicle.
(2) Content scene perception based on the DetectNet target detection method is realized, providing the unmanned aerial vehicle with the position of a specific target in three-dimensional space for intelligent navigation and path planning.
(3) A bare unmanned aerial vehicle carries no equipment such as a camera or a microcomputer; this research therefore also completed the hardware design of the intelligent unmanned aerial vehicle and the framework design for the cooperative operation of the unmanned aerial vehicle carrying this specific equipment.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202510396604.0A CN120495616A (en) | 2025-04-01 | 2025-04-01 | Intelligent UAV real-time spatial perception and target precision detection system |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202510396604.0A CN120495616A (en) | 2025-04-01 | 2025-04-01 | Intelligent UAV real-time spatial perception and target precision detection system |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN120495616A true CN120495616A (en) | 2025-08-15 |
Family
ID=96662469
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202510396604.0A Pending CN120495616A (en) | 2025-04-01 | 2025-04-01 | Intelligent UAV real-time spatial perception and target precision detection system |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN120495616A (en) |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |