CN119516379A - A step-by-step detection network for remote sensing images based on Transformer region proposal - Google Patents
Publication number: CN119516379A (Application CN202411599828.3A) · Authority: CN (China) · Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis).
Classifications: G06V20/10 (terrestrial scenes); G06N3/0455 (auto-encoder and encoder-decoder networks); G06N3/047 (probabilistic or stochastic networks); G06N3/0475 (generative networks); G06V10/25 (determination of region of interest); G06V10/42 (global feature extraction); G06V10/454 (hierarchical filter structures, e.g. convolutional neural networks); G06V10/806 (fusion of extracted features); G06V10/82 (recognition using neural networks); G06V2201/07 (target detection)
Abstract
The invention discloses a remote sensing image step-by-step detection network based on a Transformer region proposal. The network comprises a first input layer, a label embedding layer, a feature fusion layer, a Transformer encoder, a Transformer decoder and a fully connected layer, as well as a second input layer, a variational autoencoder, an embedding layer and a single attention layer, wherein the first input layer is connected with the label embedding layer, the label embedding layer is connected with the feature fusion layer, and the feature fusion layer is connected with the Transformer encoder. The invention solves the technical problems of the existing remote sensing image target detection technology, which either shrinks the input remote sensing image to process a large-size image or cuts the remote sensing image into smaller image slices with an overlap rate before detecting the targets, so that detection is time-consuming and resource-intensive, small-target information is lost, and the miss rate is high.
Description
Technical Field
The invention relates to the technical field of target detection, in particular to a remote sensing image step-by-step detection network based on a Transformer region proposal.
Background
In recent years, progress in deep learning has driven significant development in target detection. Single-stage detection networks, represented by the YOLO series, extract features directly from the entire image with a convolutional neural network (CNN) and place a grid at the detection head to predict target information within each cell. This approach markedly improves detection speed and, without relying on a candidate region proposal mechanism, achieves accuracy comparable to other detection models, playing an important role in the development of target detection for industrial applications. However, the learning ability of CNN-based methods improves only to a limited extent as model parameters increase. An attention mechanism can therefore be introduced to correlate global features, so that the model learns more feature-subspace information and its representation capability and decision accuracy improve.
In recent years, transformer has successfully introduced computer vision from the field of Natural Language Processing (NLP) and achieved significant results. Unlike CNN, the transducer associates global information of an image through a self-attention mechanism. In the target detection, a model such as DETR maps the feature map extracted by CNN into an embedded sequence, and sends the embedded sequence to a transducer encoding and decoding module, so that the target detection is regarded as a Query process. After the Query and the image features are subjected to mutual attention calculation through a decoder, the FPN of the detection head outputs the position and type information of the target. The improved Deformable-DETR adopts a Deformable multi-head attention mechanism, which allows the model to focus on a specific significant region in the image instead of the global region, thereby improving convergence speed and processing efficiency and significantly improving the detection capability of small targets in high-resolution images.
These models perform well in general target detection. However, existing detection algorithms are aimed mainly at ordinary images; remote sensing images, which have much larger pixel dimensions and a small proportion of target pixels, have been studied less. Most methods cut the original high-resolution image into smaller slices and set an overlap rate to prevent targets from being split by the cutting lines, then train and detect on the slices. This approach, however, multiplies detection time and resource consumption geometrically. Other methods process large images by shrinking the input, which loses small-target information and raises the miss rate. Although these strategies are simple and direct, they cause serious detection errors or increased computational cost, and do not meet the real-time and safety requirements of fields such as aviation, aerospace and autonomous driving.
To solve these problems, the invention designs a Transformer region proposal step-by-step detection network based on feature learning, which shortens the detection time of large-size remote sensing images as much as possible while preserving detection accuracy.
Disclosure of Invention
The embodiment of the invention provides a remote sensing image step-by-step detection network based on a Transformer region proposal, which at least solves the technical problems that the existing remote sensing image target detection technology, by shrinking the input remote sensing image to process a large-size image or by cutting the remote sensing image into smaller slices with an overlap rate, makes detection time-consuming and resource-intensive, loses small-target information and suffers a high miss rate.
According to one aspect of the embodiment of the invention, a remote sensing image step-by-step detection network based on a Transformer region proposal is provided. The network comprises a first input layer, a label embedding layer, a feature fusion layer, a Transformer encoder, a Transformer decoder and a fully connected layer, as well as a second input layer, a variational autoencoder, an embedding layer and a single attention layer. The first input layer is connected with the label embedding layer, the label embedding layer is connected with the feature fusion layer, and the feature fusion layer is connected with the Transformer encoder; the second input layer is connected with the variational autoencoder, the variational autoencoder is connected with the embedding layer, the embedding layer is connected with the feature fusion layer, and the feature fusion layer is connected with the single attention layer; the single attention layer and the Transformer encoder are connected with the Transformer decoder, and the Transformer decoder is connected with the fully connected layer. The first input layer acquires the original remote sensing image, which is input to the label embedding layer to generate a memory hidden variable; the feature fusion layer adds the memory hidden variable and the position code and inputs them to the Transformer encoder to generate a target hidden variable. The second input layer acquires a feature map of the large-size target distribution corresponding to the original remote sensing image and inputs it to the variational autoencoder to obtain a Gaussian-distributed feature map, which is input to the embedding layer to obtain a high-dimensional feature map; the feature fusion layer adds the high-dimensional feature map and the position code to obtain the prior detection feature, which is input to the single attention layer to generate a target vector. The target vector generated by the single attention layer and the target hidden variable generated by the Transformer encoder are input to the Transformer decoder to obtain shallow features of the original remote sensing image, and the shallow features are input to the fully connected layer to obtain a small-size target two-dimensional distribution probability map of the original remote sensing image.
Optionally, the expression of the a priori detection feature is:

Q_0 = K_0 = V_0 = Conv(F_G) + PE

wherein Q_0 is the initial query vector, K_0 is the initial key vector, V_0 is the initial value vector, F is the feature map of the large-size target distribution corresponding to the original remote sensing image, N(·) is a Gaussian distribution, and PE is the position code, whose concrete expression is

PE(i, 2k) = sin(i / 10000^(2k/n)),  PE(i, 2k+1) = cos(i / 10000^(2k/n))

where i indexes the position in the sequence, n is the feature dimension, k is a positive integer, Conv(·) is the convolution operation, and F_G = N(F) is the Gaussian-distributed feature map corresponding to the feature map of the large-size target distribution.
Optionally, after the a priori detection feature is input to the single attention layer, the expression for generating the target vector is:

Q = Q_0 W_Q,  K = K_0 W_K,  V = V_0 W_V

wherein Q is the target query vector, W_Q is the query vector parameter, K is the target key vector, W_K is the key vector parameter, V is the target value vector, and W_V is the value vector parameter;

Z = softmax(Q K^T / sqrt(d_K)) V

wherein T denotes transposition, d_K is the dimension of the K vector, and Z is the target vector.
Optionally, the expression of the loss function of the remote sensing image step-by-step detection network based on the Transformer region proposal is:

L = L_CE(P, P_hat) + KL( N(mu, sigma^2) || N(0, 1) )

wherein L is the value of the loss function, L_CE(P, P_hat) is the cross-entropy loss between P and P_hat, P is the small-size target two-dimensional distribution feature map of the original remote sensing image output by the step-by-step detection network, P_hat is the real small-size target two-dimensional distribution feature map of the original remote sensing image, KL is the regularization function, epsilon is the random noise used when sampling the latent variable, N(mu, sigma^2) is the normal distribution corresponding to the output small-size target two-dimensional distribution feature map, with mean mu and variance sigma^2, and N(0, 1) is the standard normal distribution with mean 0 and variance 1.
Optionally, the evaluation index is determined as follows: obtain the number of targets contained in each slice of the original remote sensing image, the probability value of each slice of the small-size target two-dimensional distribution feature map of the original remote sensing image, the initial slice number and the total slice number; sort the numbers of targets contained in the slices from large to small to obtain a first sequence; sort the probability values of the slices from large to small to obtain a second sequence; obtain, from the initial slice number and the first and second sequences truncated to that number, the coincidence degree of the first sequence and the second sequence; and obtain the evaluation index of the second sequence from that coincidence degree and the total slice number.
Optionally, the expression for obtaining the coincidence degree of the first sequence and the second sequence corresponding to the initial slice number is:

g(m) = (1/m) * sum_{i=1..m} 1(A_i = B_i)

wherein g(m) is the coincidence degree of the first sequence and the second sequence corresponding to the initial slice number, m is the initial slice number, A is the first sequence corresponding to the initial slice number, B is the second sequence corresponding to the initial slice number, and the indicator 1(A_i = B_i) equals 1 when the i-th entries agree and 0 otherwise.
Optionally, the expression for obtaining the evaluation index of the second sequence based on the coincidence degree and the total slice number is:

MG = (1/M) * sum_{m=1..M} g(m)

wherein MG is the evaluation index of the second sequence and M is the total number of slices.
The invention has the beneficial effects that:
(1) The adopted Transformer model can capture global information in the image through a self-attention mechanism, which lets the model take more context into account when processing a large-size image, learn the positional relationships among targets, and thus accurately identify the target regions. Compared with the detection result of a pure variational autoencoder (VAE) proposal network, the recommended region of the target distribution is greatly reduced thanks to the added image features.
(2) A multi-step detection framework based on the Transformer region proposal network is designed. The multi-step detection strategy predicts the regions where targets may exist from the coarse detection result and then detects those regions carefully, which effectively improves the detection speed on large-size remote sensing images and reduces the consumption of computing resources.
(3) An evaluation index MG based on a probability model is proposed to quantify and evaluate the performance of the region proposal network accurately. In comparison tests, the Transformer region proposal detection network provided by the invention scores higher on this index than the VAE-based proposal network, which shows that the invention effectively removes the noise interference of the VAE model and estimates the positions of small targets better. In addition, compared with the conventional block-wise detection method, the invention achieves remarkable improvements in detection time, computing resource consumption and the number of small targets detected. This fully demonstrates that the invention is superior to other methods in real-time performance and detection accuracy, and has considerable practicality.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a flow chart of a remote sensing image step-by-step detection network based on a Transformer region proposal according to an embodiment of the invention;
FIG. 2 is a schematic diagram of the difference between a variational autoencoder and an ordinary autoencoder according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a region suggestion result according to an embodiment of the invention;
FIG. 4 is a schematic diagram of dense detection based results according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a step-by-step detection result according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, the technical solution in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some embodiments of the present invention, not all of them. All other embodiments which can be made by those skilled in the art based on the embodiments of the invention without inventive effort shall fall within the scope of the invention.
It should be noted that the terms "first," "second," and the like in the description, the claims and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequence or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances, such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
According to an embodiment of the present invention, there is provided a remote sensing image step-by-step detection network based on a Transformer region proposal. It should be noted that the steps illustrated in the flowchart of the figures may be performed in a computer system containing at least one set of computer-executable instructions, and that although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in a different order.
Fig. 1 is a flowchart of a remote sensing image step-by-step detection network based on a Transformer region proposal according to an embodiment of the invention. As shown in fig. 1, the network comprises a first input layer, a label embedding layer, a feature fusion layer, a Transformer encoder, a Transformer decoder and a fully connected layer, as well as a second input layer, a variational autoencoder, an embedding layer and a single attention layer, and operates through the following steps:
Step S101, the first input layer is connected with the label embedding layer, the label embedding layer is connected with the feature fusion layer, the feature fusion layer is connected with the Transformer encoder; the second input layer is connected with the variational autoencoder, the variational autoencoder is connected with the embedding layer, the embedding layer is connected with the feature fusion layer, the feature fusion layer is connected with the single attention layer; the single attention layer and the Transformer encoder are connected with the Transformer decoder, and the Transformer decoder is connected with the fully connected layer.
In this embodiment, as shown in fig. 1, the single attention layer is a multi-head attention layer and the fully connected layer is an MLP adapter. Fig. 2 is a schematic diagram of the differences between the variational autoencoder and an ordinary autoencoder: the encoding and decoding process of the ordinary autoencoder is shown above the dotted line in fig. 2, and that of the variational autoencoder below it. The first input layer, the label embedding layer, the feature fusion layer and the Transformer encoder form the first step of the detection network; the second input layer, the variational autoencoder, the embedding layer, the feature fusion layer and the single attention layer form the second step of the detection network.
Step S102, the first input layer acquires an original remote sensing image, the original remote sensing image is input to the label embedding layer to generate a memory hidden variable, and the feature fusion layer adds the memory hidden variable and the position code and inputs them to the Transformer encoder to generate a target hidden variable.
In this embodiment, as shown in fig. 1, the first input layer acquires the original remote sensing image, the label embedding layer processes the original remote sensing image to obtain the memory hidden variable, and the memory hidden variable and the position code are added and then input to the Transformer encoder to obtain the target hidden variable.
Step S103, the second input layer acquires a feature map of the large-size target distribution corresponding to the original remote sensing image and inputs it to the variational autoencoder to obtain a Gaussian-distributed feature map; the Gaussian-distributed feature map is input to the embedding layer to obtain a high-dimensional feature map; the feature fusion layer adds the high-dimensional feature map and the position code to obtain the prior detection feature; and the prior detection feature is input to the single attention layer to generate a target vector, wherein the prior detection feature comprises an initial query vector, an initial key vector and an initial value vector.
In this embodiment, as shown in fig. 1, the second input layer acquires the feature map of the large-size target distribution corresponding to the original remote sensing image; the feature map is processed by the variational autoencoder to obtain a Gaussian-distributed feature map; the Gaussian-distributed feature map is input to the embedding layer to obtain a high-dimensional feature map; the high-dimensional feature map is added with the position code to obtain the prior detection feature; and the prior detection feature is input to the single attention layer to generate the target vector, wherein the prior detection feature comprises an initial query vector, an initial key vector and an initial value vector.
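The variational autoencoder in this step produces a Gaussian-distributed feature map, which implies sampling a latent variable from N(mu, sigma^2). A minimal NumPy sketch of the standard reparameterization trick such a VAE typically uses (function names here are illustrative, not from the patent):

```python
import numpy as np

def reparameterize(mu, log_var, rng=None):
    """Reparameterization trick used by variational autoencoders:
    sample z = mu + sigma * eps with eps ~ N(0, 1), so the sampling
    step stays differentiable with respect to mu and log_var."""
    rng = np.random.default_rng(rng)
    eps = rng.standard_normal(mu.shape)   # random noise epsilon
    sigma = np.exp(0.5 * log_var)         # standard deviation from log-variance
    return mu + sigma * eps

# Toy check: when log_var is very negative (sigma -> 0), z collapses to mu.
mu = np.array([0.5, -1.0])
z = reparameterize(mu, np.full(2, -50.0), rng=0)
```

Parameterizing the log-variance rather than sigma directly keeps the standard deviation positive without constraints, which is the common design choice.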
Step S104, the target vector generated by the single attention layer and the target hidden variable generated by the Transformer encoder are input to the Transformer decoder to obtain shallow features of the original remote sensing image, and the shallow features of the original remote sensing image are input to the fully connected layer to obtain a small-size target two-dimensional distribution probability map of the original remote sensing image.
In this embodiment, as shown in fig. 1, the target vector and the target hidden variable are input to the Transformer decoder to obtain the shallow features of the original remote sensing image, and the shallow features are input to the fully connected layer to obtain the small-size target two-dimensional distribution probability map of the original remote sensing image.
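The probability map ends the coarse step: the second, fine detection step only examines the slices the map ranks highest. A minimal sketch (the function name and top-k selection rule are illustrative assumptions, not taken from the patent) of picking candidate slices from such a two-dimensional probability map:

```python
import numpy as np

def select_slices(prob_map, top_k):
    """Return the (row, col) indices of the top_k slices of a
    two-dimensional small-size-target probability map, i.e. the
    regions that the fine detection step would then examine."""
    flat = prob_map.ravel()
    order = np.argsort(flat)[::-1][:top_k]                 # highest probability first
    return [np.unravel_index(i, prob_map.shape) for i in order]

prob = np.array([[0.10, 0.80],
                 [0.40, 0.05]])
picked = select_slices(prob, top_k=2)  # → [(0, 1), (1, 0)]
```

Only the picked regions are cropped from the full-resolution image and passed to the detector, which is what saves the time and memory that exhaustive slicing would consume.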
The above-described method of this embodiment is further described below.
As an alternative embodiment, in step S103, the expression of the a priori detection feature is:

Q_0 = K_0 = V_0 = Conv(F_G) + PE

wherein Q_0 is the initial query vector, K_0 is the initial key vector, V_0 is the initial value vector, F is the feature map of the large-size target distribution corresponding to the original remote sensing image, N(·) is a Gaussian distribution, and PE is the position code, whose concrete expression is

PE(i, 2k) = sin(i / 10000^(2k/n)),  PE(i, 2k+1) = cos(i / 10000^(2k/n))

where i indexes the position in the sequence, n is the feature dimension, k is a positive integer, Conv(·) is the convolution operation, and F_G = N(F) is the Gaussian-distributed feature map corresponding to the feature map of the large-size target distribution.
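The sinusoidal position code can be computed directly from its definition. A minimal NumPy sketch (names are illustrative; it assumes an even feature dimension n):

```python
import numpy as np

def positional_encoding(seq_len, n):
    """Sinusoidal position code:
    PE(i, 2k) = sin(i / 10000**(2k/n)), PE(i, 2k+1) = cos(i / 10000**(2k/n))."""
    pe = np.zeros((seq_len, n))
    pos = np.arange(seq_len)[:, None]      # position index i, shape (seq_len, 1)
    two_k = np.arange(0, n, 2)             # even dimension indices 2k
    div = 10000.0 ** (two_k / n)
    pe[:, 0::2] = np.sin(pos / div)        # even dimensions get sine
    pe[:, 1::2] = np.cos(pos / div)        # odd dimensions get cosine
    return pe

pe = positional_encoding(seq_len=4, n=8)   # shape (4, 8)
```

At position 0 every sine entry is 0 and every cosine entry is 1, which gives a quick sanity check on the layout.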
As an optional embodiment, in step S103, after the a priori detection feature is input to the single attention layer, the expression for generating the target vector is:

Q = Q_0 W_Q,  K = K_0 W_K,  V = V_0 W_V

wherein Q is the target query vector, W_Q is the query vector parameter, K is the target key vector, W_K is the key vector parameter, V is the target value vector, and W_V is the value vector parameter;

Z = softmax(Q K^T / sqrt(d_K)) V

wherein T denotes transposition, d_K is the dimension of the K vector, and Z is the target vector.
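The two expressions above combine into standard scaled dot-product attention. A minimal NumPy sketch under that reading (function names and shapes are illustrative assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))   # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def single_attention(Q0, K0, V0, Wq, Wk, Wv):
    """Project the initial vectors with the learned parameters, then apply
    Z = softmax(Q K^T / sqrt(d_K)) V."""
    Q, K, V = Q0 @ Wq, K0 @ Wk, V0 @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # scaled dot products
    return softmax(scores) @ V             # target vector Z

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))            # prior detection feature: 5 tokens, dim 8
W = [rng.standard_normal((8, 8)) for _ in range(3)]
Z = single_attention(X, X, X, *W)          # target vector, shape (5, 8)
```

The sqrt(d_K) scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into a near one-hot regime.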
As an alternative embodiment, the expression of the loss function of the remote sensing image step-by-step detection network based on the Transformer region proposal is:

L = L_CE(P, P_hat) + KL( N(mu, sigma^2) || N(0, 1) )

wherein L is the value of the loss function, L_CE(P, P_hat) is the cross-entropy loss between P and P_hat, P is the small-size target two-dimensional distribution feature map of the original remote sensing image output by the step-by-step detection network, P_hat is the real small-size target two-dimensional distribution feature map of the original remote sensing image, KL is the regularization function, epsilon is the random noise used when sampling the latent variable, N(mu, sigma^2) is the normal distribution corresponding to the output small-size target two-dimensional distribution feature map, with mean mu and variance sigma^2, and N(0, 1) is the standard normal distribution with mean 0 and variance 1.
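Both terms of this loss have closed forms for diagonal Gaussians. A minimal NumPy sketch (a binary pixel-wise cross entropy is assumed for L_CE; all names are illustrative):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ) for a diagonal Gaussian:
    0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2)."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def cross_entropy(p_pred, p_true, eps=1e-12):
    """Pixel-wise binary cross entropy between the predicted distribution
    map P and the real distribution map P_hat."""
    p_pred = np.clip(p_pred, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(p_true * np.log(p_pred) + (1 - p_true) * np.log(1 - p_pred))

def total_loss(p_pred, p_true, mu, log_var):
    return cross_entropy(p_pred, p_true) + kl_to_standard_normal(mu, log_var)

# The KL term vanishes exactly when the latent already matches N(0, 1).
kl0 = kl_to_standard_normal(np.zeros(4), np.zeros(4))   # → 0.0
```

The KL term pulls the latent distribution toward the standard normal, which is what lets epsilon ~ N(0, 1) be reused at inference time.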
In this embodiment, the cross-entropy term is weighted: the modulating factor takes values in [0, 5], n is the number of detection categories, w_c is the weight of each category, and p_c is the probability that the network predicts the true target-type label c.
As an alternative embodiment, the evaluation index of the remote sensing image step-by-step detection network based on the Transformer region proposal is the average slice sequence coincidence degree, and determining it comprises: obtaining the number of targets contained in each slice of the original remote sensing image, the probability value of each slice of the small-size target two-dimensional distribution feature map of the original remote sensing image, the initial slice number and the total slice number; sorting the numbers of targets contained in the slices from large to small to obtain a first sequence; sorting the probability values of the slices from large to small to obtain a second sequence; obtaining, from the initial slice number and the first and second sequences truncated to that number, the coincidence degree of the first sequence and the second sequence; and obtaining the evaluation index of the second sequence from that coincidence degree and the total slice number.
In this embodiment, the original remote sensing image is divided into regions of equal size, and each slice is one region: for example, the first slice contains 3 targets and the second slice contains 6 targets, while each slice of the small-size target two-dimensional distribution feature map carries a probability value. The numbers of targets contained in the slices are sorted from large to small to obtain the first sequence, and the probability values of the slices are sorted from large to small to obtain the second sequence. From the initial slice number and the first and second sequences truncated to that number, the coincidence degree of the two sequences is calculated, and from this coincidence degree and the total slice number the evaluation index of the second sequence is obtained.
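Building the two rankings is a pair of sorts over slice indices. A minimal sketch (names and the example counts are illustrative, following the 3-targets/6-targets example above):

```python
def build_sequences(target_counts, prob_values):
    """Sort slice ids by ground-truth target count (first sequence) and by
    predicted probability (second sequence), both from large to small."""
    ids = range(len(target_counts))
    first = sorted(ids, key=lambda i: -target_counts[i])
    second = sorted(ids, key=lambda i: -prob_values[i])
    return first, second

# Slice 0 holds 3 targets, slice 1 holds 6; the map ranks slice 1 higher too.
first, second = build_sequences([3, 6], [0.2, 0.9])  # → ([1, 0], [1, 0])
```

A good proposal network makes the second sequence mimic the first, which is exactly what the coincidence degree below measures.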
As an optional embodiment, the obtaining the expression of the overlap ratio of the first sequence corresponding to the initial slice number and the second sequence corresponding to the initial slice number based on the initial slice number, the first sequence corresponding to the initial slice number, and the second sequence corresponding to the initial slice number is:
G(s) = (1/s) Σ_{i=1}^{s} 1[A_i = B_i]
Wherein, G(s) is the coincidence degree of the first sequence and the second sequence corresponding to the initial slice number, s is the initial slice number, A is the first sequence corresponding to the initial slice number, B is the second sequence corresponding to the initial slice number, and 1[A_i = B_i] takes the value 1 when the i-th entries of the two sequences are the same slice and 0 otherwise.
In this embodiment, for example, s = 5; that is, the top 5 entries of the first sequence and of the second sequence are taken. For each position i, if the slice at position i of the first sequence is the same slice as the one at position i of the second sequence, 1[A_i = B_i] is 1; otherwise it is 0. Summing these indicator values over the 5 positions and dividing by the number of slices, i.e. 5, yields the coincidence degree G(5).
As an optional embodiment, the expression for obtaining the evaluation index of the second sequence based on the coincidence ratio and the total slice number of the first sequence and the second sequence corresponding to the initial slice number is:
MG = (1/N) Σ_{s=1}^{N} G(s)
Wherein, MG is the evaluation index of the second sequence, N is the total slice number, and G(s) is the coincidence degree at initial slice number s.
In this embodiment, for example, N is 10; s takes each value from 1 to 10, G(s) is computed for each value, and the results are accumulated and summed to obtain the evaluation index of the second sequence.
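As a minimal sketch of the metric described above (the function names are illustrative, not taken from the patent), the coincidence degree at a given initial slice number and its average over all truncation lengths can be computed as follows, where both sequences are lists of slice indices:

```python
def slice_overlap(first_seq, second_seq, s):
    """Fraction of the top-s positions at which the ground-truth ranking
    (first_seq, by target count) and the predicted ranking (second_seq,
    by probability) hold the same slice."""
    matches = sum(1 for a, b in zip(first_seq[:s], second_seq[:s]) if a == b)
    return matches / s

def mean_overlap(first_seq, second_seq, total_slices):
    """Average the coincidence degree over s = 1 .. total_slices (MG)."""
    return sum(slice_overlap(first_seq, second_seq, s)
               for s in range(1, total_slices + 1)) / total_slices
```

For instance, with `first_seq = [2, 1, 3, 0, 4]` and `second_seq = [2, 3, 1, 0, 4]`, positions 0, 3, and 4 coincide, so `slice_overlap(..., 5)` is 0.6.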
Experimental part:
The average slice sequence coincidence degree MG is adopted as the evaluation index, with sequence proportions of 0.25, 0.5, and 1, and the method is compared with a common detection pipeline that uses no proposal network. The results are shown in Table 1:
Table 1 Evaluation index (MG) of different proposal models
As shown in Table 1, the evaluation index of the invention is higher than those of the network without region proposals and the dense detection network, which shows that the proposal network can learn the geographical position relationships of targets in the image and search the regions more likely to contain small targets according to the large-target distribution. Compared with a proposal network using the VAE alone, the method is 3.2%, 9.1%, and 4.8% higher at sequence proportions of 0.25, 0.5, and 1, respectively, which indicates that the model can effectively remove the noise interference of the VAE model and thus estimate small-target positions more accurately.
On the other hand, the resource consumption of large-size image detection is compared in Table 2:
Table 2 comparison of resource consumption for different detection strategies
As shown in Table 2, the method of the present invention is optimal in both detection time and FLOPs, and suboptimal in Params. Compared with conventional block-wise detection, the method reduces computing resource consumption by 71.8% and shortens the detection time while still detecting close to 94.6% of small targets, an 11.5 percentage-point improvement over the 83.1% of the dense detection method. The experimental results show that the method can effectively learn the spatial position relationships among targets and integrate image features when inferring proposed small-target regions, making it well suited as a region proposal network for improving detection performance. In addition, the staged detection strategy adopted by the invention greatly reduces detection time and computing resource consumption when processing large-scale image detection tasks, thereby markedly improving the real-time performance of the detection task.
To further verify the effectiveness of the strategy of the invention, a subjective comparison of remote sensing image detection results is analyzed in depth. Fig. 3 is a schematic view of a region proposal result according to an embodiment of the present invention; Fig. 3(a) is the region proposal result using the VAE alone, and Fig. 3(b) is the region proposal result of the present invention. As shown in the figure, the proposal result of the invention is more accurate, and the addition of image features greatly narrows the proposed region of the target distribution, so that computing resources can be effectively saved.
A large-size remote sensing image is randomly selected; the dense detection result is shown in Fig. 4, a schematic diagram of dense-detection-based results according to an embodiment of the present invention: Fig. 4(a) is the large-size remote sensing image, i.e., the original image; Fig. 4(b) shows the dense features and proposed regions corresponding to the large-size remote sensing image; Fig. 4(c) is the second-stage image to be detected; and Fig. 4(d) is the detection result. Fig. 5 is a schematic diagram of the step-by-step detection result according to an embodiment of the present invention: Fig. 5(a) is the large-size remote sensing image, i.e., the original image; Fig. 5(b) is the first-stage detection result; Fig. 5(c) is the feature-learning region proposal result; and Fig. 5(d) is the final detection result. First, the source image is downsampled to 1024×1024 for first-stage detection, in which the detector coarsely locates the large-size targets in the image; for example, large-size targets such as the port and the playground in the figure are accurately detected. Then, using the Transformer network, regions where small targets may exist are proposed according to the position distribution information of the large targets, and the fine detection of the second stage is performed. The detection results show that the detection boxes wrap the targets well and their boundaries closely coincide with the target bounding boxes, which verifies the effectiveness of step-by-step detection based on feature learning. The second-stage detector recovers the weak and small targets missed among the large-size targets detected in the first stage, effectively reducing the waste of computing resources.
Compared with the dense detection strategy, the Transformer network can propose regions that lie outside the range of large targets, as shown in Fig. 5(c); the proposals are based on the latent spatial position relationships mapped from high-dimensional abstract information, which further improves detection accuracy and efficiency.
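The two-stage strategy discussed above can be sketched as the following skeleton. The downsampling function, the two detectors, the region proposer, and the crop function are placeholder callables standing in for the patent's actual models; only the control flow reflects the description:

```python
def staged_detection(image, downsample, coarse_detector, region_proposer,
                     fine_detector, crop):
    # Stage 1: coarse detection of large targets on a downsampled image
    # (e.g. the source image resized to 1024x1024).
    small = downsample(image)
    large_targets = coarse_detector(small)
    # Region proposal: infer where small targets are likely to appear,
    # given the spatial distribution of the detected large targets.
    proposals = region_proposer(image, large_targets)
    # Stage 2: fine detection only inside the proposed regions, so full
    # dense detection of the large image is avoided.
    small_targets = []
    for region in proposals:
        small_targets.extend(fine_detector(crop(image, region)))
    return large_targets + small_targets
```

Because the fine detector runs only on the proposed crops, the cost of the second stage scales with the number of proposals rather than with the full image area, which is the source of the resource savings reported above.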
In conclusion, the subjective and objective comparison experiments show that, compared with other methods, the method can accurately detect weak and small targets by means of feature learning, greatly reduce detection time and computing resource consumption, and improve the real-time performance of detection tasks. These advantages give the method important application value and broad market prospects in the field of remote sensing image detection.
In the embodiment of the invention, the first input layer is connected with the tag embedding layer, the tag embedding layer is connected with the feature fusion layer, and the feature fusion layer is connected with the Transformer encoder; the second input layer is connected with the variational auto-encoder, the variational auto-encoder is connected with the embedding layer, the embedding layer is connected with the feature fusion layer, and the feature fusion layer is connected with the single attention layer; the single attention layer and the Transformer encoder are connected with the Transformer decoder, and the Transformer decoder is connected with the fully connected layer. The first input layer acquires the original remote sensing image and inputs it to the tag embedding layer to generate a memory hidden variable; the feature fusion layer adds the memory hidden variable and the position code and feeds the result to the Transformer encoder to generate a target hidden variable. The second input layer acquires the feature map of the large-size target distribution corresponding to the original remote sensing image and inputs it to the variational auto-encoder to obtain a Gaussian-distributed feature map; the Gaussian-distributed feature map is input to the embedding layer to obtain a high-dimensional feature map; the feature fusion layer adds the high-dimensional feature map and the position code to obtain the prior detection features, which comprise an initial query vector, an initial key vector, and an initial value vector; and the prior detection features are input to the single attention layer to generate a target vector. The target vector generated by the single attention layer and the target hidden variable generated by the Transformer encoder are input to the Transformer decoder to obtain the shallow features of the original remote sensing image, and the shallow features are input to the fully connected layer to obtain the small-size target two-dimensional distribution probability map of the original remote sensing image. This design solves the technical problems of long detection time, heavy resource usage, loss of small-target information, and high missed-detection rate that arise when targets in large-size remote sensing images are detected directly, and achieves the technical effect of shortening the time consumed in detecting large-size remote sensing images as much as possible while ensuring detection accuracy, through a Transformer region proposal step-by-step detection network based on feature learning.
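Two of the components described above can be sketched minimally in numpy: the sinusoidal position code that the feature fusion layer adds to the embedded feature map, and the single (one-head) attention layer that maps the prior detection features (Q, K, V) to the target vector. Shapes, function names, and parameter names are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def position_encoding(length, dim):
    """Sinusoidal position code:
    PE[pos, 2k] = sin(pos / 10000^(2k/dim)), PE[pos, 2k+1] = cos(...)."""
    pe = np.zeros((length, dim))
    pos = np.arange(length)[:, None]
    div = np.power(10000.0, np.arange(0, dim, 2) / dim)
    pe[:, 0::2] = np.sin(pos / div)
    pe[:, 1::2] = np.cos(pos / div)
    return pe

def single_attention(q, k, v, wq, wk, wv):
    """One attention head: softmax(QWq (KWk)^T / sqrt(d_k)) VWv."""
    q, k, v = q @ wq, k @ wk, v @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # numerically stable row-wise softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

In the network described above, the same fused feature map (embedded Gaussian feature map plus position code) would serve as q, k, and v, and the attention output would be the target vector passed on to the Transformer decoder.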
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
In the foregoing embodiments of the present invention, the descriptions of the embodiments are emphasized, and for a portion of this disclosure that is not described in detail in this embodiment, reference is made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technology may be implemented in other manners. The above-described embodiments of the apparatus are merely exemplary; the division of units is a logical function division, and another division manner may be used in actual implementation: for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings, direct couplings, or communication connections shown or discussed may be realized through interfaces, units, or modules, and may be electrical or take other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one first processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The foregoing is merely a preferred embodiment of the present invention and it should be noted that modifications and adaptations to those skilled in the art may be made without departing from the principles of the present invention, which are intended to be comprehended within the scope of the present invention.
Claims (10)
1. A remote sensing image step-by-step detection network based on a Transformer region proposal, characterized by comprising a first input layer, a tag embedding layer, a feature fusion layer, a Transformer encoder, a Transformer decoder, a fully connected layer, a second input layer, a variational auto-encoder, an embedding layer, and a single attention layer;
The first input layer is connected with the tag embedding layer, the tag embedding layer is connected with the feature fusion layer, and the feature fusion layer is connected with the Transformer encoder; the second input layer is connected with the variational auto-encoder, the variational auto-encoder is connected with the embedding layer, the embedding layer is connected with the feature fusion layer, and the feature fusion layer is connected with the single attention layer; the single attention layer and the Transformer encoder are connected with the Transformer decoder, and the Transformer decoder is connected with the fully connected layer;
The first input layer acquires an original remote sensing image and inputs the original remote sensing image to the tag embedding layer to generate a memory hidden variable; the feature fusion layer adds the memory hidden variable and the position code and inputs the result to the Transformer encoder to generate a target hidden variable;
The second input layer acquires a feature map of the large-size target distribution corresponding to the original remote sensing image and inputs the feature map of the large-size target distribution to the variational auto-encoder to obtain a Gaussian-distributed feature map; the Gaussian-distributed feature map is input to the embedding layer to obtain a high-dimensional feature map; the feature fusion layer adds the high-dimensional feature map and the position code to obtain prior detection features, wherein the prior detection features comprise an initial query vector, an initial key vector, and an initial value vector; and the prior detection features are input to the single attention layer to generate a target vector;
The target vector generated by the single attention layer and the target hidden variable generated by the Transformer encoder are input to the Transformer decoder to obtain shallow features of the original remote sensing image, and the shallow features of the original remote sensing image are input to the fully connected layer to obtain a small-size target two-dimensional distribution probability map of the original remote sensing image.
2. The network of claim 1, wherein the expression for the prior detection features is:
Q = K = V = Conv(z) + PE
PE(pos, 2k) = sin(pos / 10000^(2k/n)), PE(pos, 2k+1) = cos(pos / 10000^(2k/n)), 0 ≤ pos < l
Wherein, Q is the initial query vector, K is the initial key vector, V is the initial value vector, x is the feature map of the large-size target distribution corresponding to the original remote sensing image, N is the Gaussian distribution, PE is the position code with the concrete expression given above, l is its sequence length, n is the feature dimension, k is a positive integer, Conv is the convolution operation, and z is the Gaussian-distributed feature map corresponding to the feature map of the large-size target distribution.
3. The network of claim 1, wherein the expression for the target vector generated after the prior detection features are input to the single attention layer is:
Q' = Q·W_Q, K' = K·W_K, V' = V·W_V
Wherein, Q' is the target query vector, W_Q is the query vector parameter, K' is the target key vector, W_K is the key vector parameter, V' is the target value vector, and W_V is the value vector parameter;
T = softmax(Q'·K'^T / √d_K)·V'
Wherein, ^T denotes the transposition, d_K is the dimension of the K vector, and T is the target vector.
4. The network of claim 1, wherein the loss function is expressed as:
L = CE(ŷ, y) + KL(N(μ, σ²) ‖ N(0, 1))
Wherein, L is the loss function value, CE(ŷ, y) is the cross entropy loss function value of ŷ and y, ŷ is the small-size target two-dimensional distribution feature map of the original remote sensing image output by the step-by-step detection network, y is the real small-size target two-dimensional distribution feature map of the original remote sensing image, KL is the regularization function, ε is the random noise used in reparameterizing the sampling, N(μ, σ²) is the normal distribution, with mean μ and variance σ², corresponding to the feature map output by the step-by-step detection network, and N(0, 1) is the standard normal distribution with mean 0 and variance 1.
5. The network of claim 1, wherein the evaluation index is an average slice sequence overlap ratio, and wherein the determining the average slice sequence overlap ratio is:
Acquiring the number of targets contained in each slice of an original remote sensing image, the probability value of each slice of a small-size target two-dimensional distribution characteristic map of the original remote sensing image, the number of initial slices and the total number of slices;
sequencing the number of targets contained in each slice from large to small to obtain a first sequence;
sequencing each slice probability value from big to small to obtain a second sequence;
Obtaining the coincidence ratio of the first sequence corresponding to the initial slice number and the second sequence corresponding to the initial slice number based on the initial slice number, the first sequence corresponding to the initial slice number and the second sequence corresponding to the initial slice number;
and obtaining an evaluation index of the second sequence based on the coincidence ratio of the first sequence and the second sequence corresponding to the initial slice number and the total slice number.
6. The network of claim 5, wherein the expression for obtaining the overlap ratio of the first sequence corresponding to the initial number of slices and the second sequence corresponding to the initial number of slices based on the initial number of slices, the first sequence corresponding to the initial number of slices, and the second sequence corresponding to the initial number of slices is:
G(s) = (1/s) Σ_{i=1}^{s} 1[A_i = B_i]
Wherein, G(s) is the coincidence degree of the first sequence and the second sequence corresponding to the initial slice number, s is the initial slice number, A is the first sequence corresponding to the initial slice number, B is the second sequence corresponding to the initial slice number, and 1[A_i = B_i] takes the value 1 when the i-th entries of the two sequences are the same slice and 0 otherwise.
7. The network of claim 5, wherein the expression for obtaining the evaluation index of the second sequence based on the coincidence degree of the first sequence and the second sequence corresponding to the initial number of slices and the total number of slices is:
MG = (1/N) Σ_{s=1}^{N} G(s)
Wherein, MG is the evaluation index of the second sequence, N is the total slice number, and G(s) is the coincidence degree at initial slice number s.
8. A computer system comprising one or more processors and a computer-readable storage medium storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the network of claim 1.
9. A computer readable storage medium, characterized by storing computer executable instructions that, when executed, are configured to implement the network of claim 1.
10. A computer program product comprising computer executable instructions which, when executed, are adapted to implement the network of claim 1.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411599828.3A CN119516379B (en) | 2024-11-11 | 2024-11-11 | A step-by-step detection network for remote sensing images based on Transformer region proposal |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN119516379A true CN119516379A (en) | 2025-02-25 |
| CN119516379B CN119516379B (en) | 2025-09-12 |
Family
ID=94667026
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202411599828.3A Active CN119516379B (en) | 2024-11-11 | 2024-11-11 | A step-by-step detection network for remote sensing images based on Transformer region proposal |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN119516379B (en) |
Citations (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111126282A (en) * | 2019-12-25 | 2020-05-08 | 中国矿业大学 | A Content Description Method for Remote Sensing Images Based on Variational Self-Attention Reinforcement Learning |
| US11222217B1 (en) * | 2020-08-14 | 2022-01-11 | Tsinghua University | Detection method using fusion network based on attention mechanism, and terminal device |
| CN114581770A (en) * | 2022-02-17 | 2022-06-03 | 深圳信息职业技术学院 | Automatic extraction and processing method of remote sensing image buildings based on TransUnet |
| CN116229295A (en) * | 2023-02-28 | 2023-06-06 | 西安电子科技大学 | Remote sensing image target detection method based on fusion convolution attention mechanism |
| CN117152303A (en) * | 2023-08-25 | 2023-12-01 | 西安电子科技大学 | Unknown scene remote sensing image subtitle generation method based on attribute learning |
| CN117994518A (en) * | 2024-02-21 | 2024-05-07 | 山东师范大学 | Pancreatic cancer image segmentation system based on multi-view feature fusion network |
| CN118097318A (en) * | 2024-04-28 | 2024-05-28 | 武汉大学 | Controllable defect image generation method and device based on visual semantic fusion |
| CN118298305A (en) * | 2024-04-12 | 2024-07-05 | 北京航空航天大学 | Remote sensing image building change detection method based on multi-scale attention |
| CN118691826A (en) * | 2024-08-23 | 2024-09-24 | 山东理工大学 | A remote sensing image semantic segmentation method integrating diffusion model and transformer |
Non-Patent Citations (2)
| Title |
|---|
| Liu Hang et al., "Remote sensing image segmentation model based on attention mechanism", Laser & Optoelectronics Progress, no. 04, 31 December 2020 (2020-12-31) * |
| Dai Yuan et al., "Target detection in remote sensing images based on an improved rotated region proposal network", Acta Optica Sinica, no. 01, 31 December 2020 (2020-12-31) * |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||