CN116778391B - Multi-mode crop disease phenotype collaborative analysis model and device - Google Patents
Multi-mode crop disease phenotype collaborative analysis model and device

Info
- Publication number
- CN116778391B CN116778391B CN202310828903.8A CN202310828903A CN116778391B CN 116778391 B CN116778391 B CN 116778391B CN 202310828903 A CN202310828903 A CN 202310828903A CN 116778391 B CN116778391 B CN 116778391B
- Authority
- CN
- China
- Prior art keywords
- text
- model
- disease
- visual
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/02—Agriculture; Fishing; Forestry; Mining
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Mathematical Physics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Business, Economics & Management (AREA)
- Multimedia (AREA)
- Strategic Management (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Agronomy & Crop Science (AREA)
- Economics (AREA)
- Human Resources & Organizations (AREA)
- Marketing (AREA)
- Mining & Mineral Resources (AREA)
- Tourism & Hospitality (AREA)
- Primary Health Care (AREA)
- General Business, Economics & Management (AREA)
- Animal Husbandry (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Marine Sciences & Fisheries (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a multi-modal crop disease phenotype collaborative analysis model, device and model construction system. The method comprises constructing a crop disease phenotype text generation model based on an improved CNN and LSTM, a visual language localization model (MQVL) based on query text guidance and multi-stage reasoning, and a CNN-Transformer dual-stream multi-modal few-shot recognition model (CTMF), and training each of them on the constructed multi-modal training dataset.
Description
Technical Field
The invention relates to the field of crop diseases, in particular to a multi-mode crop disease phenotype collaborative analysis model and a device.
Background
Plant diseases are responsible for significant economic losses in the global agricultural sector and are directly related to food safety and sustainable food production. Quantifying the impact of plant pathology on crops is one of the most challenging problems in agriculture. A lack of nutrition or an imbalance between soil moisture and oxygen makes plants more susceptible to pathogens, and abnormalities in plants may be caused by pests, diseases or other abiotic stresses (e.g., low temperature). Disease identification is typically time-consuming, labor-intensive and subjective. Traditionally, crop inspection has been performed by persons with expertise in the area; however, this approach can introduce a degree of uncertainty or error, resulting in erroneous decisions.
Recent advances in plant phenotyping allow the development of efficient and automated diagnostic systems for plant abnormality identification. Although existing methods have shown some effect, they remain limited in disease localization and identification, especially in real-world scenarios. To address this limitation, we propose a method that detects and localizes plant abnormalities more effectively in a multi-modal manner by combining visual object recognition with language generation, producing detailed information about the symptoms.
Nuthalapati et al. use geographical position and time as priors, obtain their features through nonlinear embedding, and then fuse this information with the visual features through Relative Transformer layers, improving recognition accuracy on the CUB-200-2011 bird dataset. Huang et al. propose an attributes-guided attention module (AGAM) that, for datasets with attributes, merges attribute and visual features in an attribute-guided branch and, for the branch without attributes, learns attention weights through feature selection.
Thus, in addition to the image itself, information such as shooting location, date, time, image attributes and text descriptions can be an important source of prior knowledge. In particular, the text description of an image contains rich semantic information. Text-modality and image-modality information are complementary, which can to a certain extent alleviate the problem of insufficient image training samples. However, acquiring and constructing multi-modal datasets in the agricultural field is difficult: it requires manual annotation by students and specialists in the relevant fields and is time- and cost-intensive. Therefore, in this patent, plant protection experts design the questions and options around 5 characteristics of plant diseases, namely the number, color, shape and features of the lesions and the proportion of leaf area they occupy, so that multi-modal text descriptions can be collected rapidly.
Reasonable multi-modal application also benefits visual language localization, the task of locating a target object or region in an image according to a natural language expression. At present, most visual language localization research focuses on natural images taken from a head-up viewpoint, such as people, animals and cars. Existing methods mainly extract visual features and text embeddings independently and then perform fused reasoning on them to locate the target object mentioned in the query text. However, the features obtained by an independent visual feature extraction module often contain many visual features unrelated to the query text, and these redundant, irrelevant features can lead to unreasonable inference in the subsequent multi-modal fusion module and thereby degrade target localization.
To address these problems in visual language localization, this patent designs a combined network model based on the Swin-Transformer architecture, comprising a query text feature extraction module, a query-text-guided visual feature generation module and a multi-stage fusion reasoning module. The query text features are introduced into the visual feature extraction module as guidance, which reduces the interference of irrelevant visual features and generates visual features related to the query text; the multi-stage fusion reasoning module then performs multi-stage interactive reasoning on these related visual features and the query text features to further focus on the accurate localization of the queried target object.
Disclosure of Invention
The invention aims to provide a multi-mode crop disease phenotype collaborative analysis model and a device thereof so as to solve the technical problems.
The invention aims to solve the technical problems, and is realized by adopting the following technical scheme:
a multimodal crop disease phenotype collaborative analysis model comprising the steps of:
S1, constructing a multi-mode data set;
A multi-modal disease dataset is built by crowdsourcing, namely botanists design the options and a large number of non-professionals are used to quickly acquire disease text descriptions;
s2, constructing a crop disease phenotype text generation model based on the improved CNN and LSTM;
S3, constructing a visual language positioning model based on query text guidance and multi-stage reasoning;
S4, after the text descriptive sentences of the pictures in the step S2 are obtained, inputting the text descriptive sentences into a visual language positioning model of multi-stage reasoning in the step S3 to position disease positions;
S5, constructing a CNN-Transformer dual-stream multi-modal few-shot recognition model;
S6, after the disease type is identified in S5, the picture is saved and an early warning is issued.
Preferably, the disease dataset comprises 24 diseases: apple scab, apple rust, cherry powdery mildew, corn rust, corn leaf blight, grape black rot, grape leaf blight, peach bacterial spot, pepper bacterial spot, potato early blight, potato late blight, tomato bacterial spot, tomato leaf mold, tomato leaf spot, apple black rot, pumpkin powdery mildew, tomato early blight, tomato yellow leaf curl virus, corn gray leaf spot, citrus huanglongbing, strawberry leaf scorch, tomato late blight, tomato round spot and tomato mosaic virus.
Preferably, the disease text description requires non-professionals to select options from 5 angles: number, color, shape, characteristics and lesion area. After selection, a program collects the options to form text description sentences, and potential annotators must achieve more than 90% accuracy on a test of basic image-annotation knowledge before answering our questions.
Preferably, the multi-modal few-shot recognition model adopts a dual-stream architecture that attends to the global information and local information of the current task simultaneously, and comprises a dual-embedding module, a feature fusion module and a metric module.
Preferably, the dual embedding module consists of two branches, a local branch and a global branch.
Preferably, the visual language localization model of multi-stage reasoning mainly comprises three modules, namely a query text feature extraction module, a query text guided visual feature generation module and a multi-stage reasoning module.
A computer device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to execute a multimodal crop disease phenotype collaborative analysis model.
The beneficial effects of the invention are as follows:
1. The multi-modal crop disease phenotype collaborative analysis model and device have high practical application value: possible important text characteristics of diseases are fully considered when preparing the dataset, and the dataset is constructed from 5 different angles. In practical application, new disease types can be added quickly through crowdsourcing so that crop phenotypes can be monitored continuously, providing theoretical guidance and technical support for crop phenotype research.
2. The invention introduces crowdsourcing into the construction of a multi-modal agricultural dataset, delegating the dataset construction task to the crowd, and on this basis designs an image description generation model, a visual language localization model (MQVL) based on query text guidance and multi-stage reasoning, and a CNN-Transformer dual-stream multi-modal few-shot recognition model (CTMF). The image description generation model automatically generates the disease text description, the MQVL model automatically identifies possible disease regions, and the multi-modal few-shot classification model performs the final identification. The invention combines text generation, disease localization, disease identification and early warning, and can effectively improve the accuracy of plant leaf disease identification.
Drawings
FIG. 1 is a flow chart of a multimodal crop disease phenotype collaborative analysis model and apparatus of the present invention;
FIG. 2 is a display of crop disease in a data set in different complex contexts;
FIG. 3 is a schematic diagram of CTMF;
FIG. 4 is a schematic diagram of a dual channel mixed attention architecture of CTMF;
FIG. 5 is a schematic diagram of MQVL;
FIG. 6 is a schematic diagram of a text model structure generated from image descriptions.
Detailed Description
In order that the above-recited features, objects and advantages of the present invention can be more clearly understood, the invention is described in more detail below with reference to specific embodiments illustrated in the appended drawings. Based on these embodiments, all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of the invention.
Specific embodiments of the present invention are described below with reference to the accompanying drawings.
Example 1:
As shown in FIGS. 1-6, a multimodal crop disease phenotype collaborative analysis model comprises the following steps:
S1, quickly acquiring disease text descriptions through crowdsourcing, with botanists designing the options and a large number of non-professionals answering them, and constructing a multi-modal disease dataset;
S2, classifying and labeling the dataset by disease type and dividing it proportionally into a training set and a validation set;
S3, adjusting the model pre-training parameters and designing the image description generation model, MQVL and CTMF;
S4, training the image description generation model, MQVL and CTMF of step S3 using the dataset processed in step S2, and saving the optimal models;
And S5, identifying crop diseases by using the trained model.
Further, in step S1, the crop diseases in the dataset include 24 diseases: apple scab, apple rust, cherry powdery mildew, corn rust, corn leaf blight, grape black rot, grape leaf blight, peach bacterial spot, pepper bacterial spot, potato early blight, potato late blight, tomato bacterial spot, tomato leaf mold, tomato leaf spot, apple black rot, pumpkin powdery mildew, tomato early blight, tomato yellow leaf curl virus, corn gray leaf spot, citrus huanglongbing, strawberry leaf scorch, tomato late blight, tomato round spot and tomato mosaic virus. The disease text description requires non-professionals to select options from 5 angles: number, color, shape, characteristics and lesion area; after selection, a program collects the options to form text description sentences. Before answering our questions, potential annotators must achieve more than 90% accuracy on a test of basic image-annotation knowledge, to prevent them from choosing options at will.
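For illustration, the following is a minimal sketch of how the crowdsourced option selections could be assembled into a description sentence; the option fields and the sentence template are illustrative assumptions, not the collection program used in the patent.

```python
from dataclasses import dataclass

@dataclass
class DiseaseOptions:
    """Hypothetical option fields matching the 5 angles: number, color, shape,
    characteristics and lesion area."""
    number: str          # e.g. "many"
    color: str           # e.g. "brown"
    shape: str           # e.g. "round"
    characteristic: str  # e.g. "with yellow halos"
    area: str            # e.g. "less than 10% of the leaf"

def compose_description(opt: DiseaseOptions) -> str:
    """Assemble the selected options into one disease description sentence."""
    return (f"The leaf shows {opt.number} {opt.color} {opt.shape} lesions "
            f"{opt.characteristic}, covering {opt.area}.")

print(compose_description(DiseaseOptions(
    "many", "brown", "round", "with yellow halos", "less than 10% of the leaf")))
```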
Further, in step S3, CTMF adds the dual-channel mixed attention after the CNN and Swin Transformer branches and minimizes the classification loss function.
The dual-channel mixed attention is a simple and effective attention module: the disease text description is encoded by a BERT model into a 768-dimensional feature vector, max-pooling and average-pooling operations are applied simultaneously to this feature vector and the picture features, and adaptive feature refinement along the channel and spatial dimensions improves the generalization of the model.
The attention weights are produced by a sigmoid function, denoted σ(·).
Further, in step S5, disease identification and positioning are performed using the trained model, and the specific location and name of the disease in the image can be identified after the image is input.
Example 2:
As shown in FIGS. 1-6, with other parts the same as in Embodiment 1, this embodiment differs in that the multi-modal crop disease phenotype collaborative analysis model and device use the public picture disease dataset PlantVillage, design the options specifically around disease characteristics to obtain disease text descriptions, and train the models accordingly. The disease location and the disease type are identified automatically, which improves the accuracy and efficiency of disease identification; accurate identification of the disease type together with early warning plays an important role in disease prevention and control. Pesticide spraying is one of the important technical measures of modern agriculture for preventing and controlling crop diseases, and correct identification of the disease type helps determine the pesticide formulation. The invention is therefore of practical value for crop phenotype management and prevention.
The invention provides a multimodal crop disease phenotype collaborative analysis model and a device, which specifically comprise the following steps:
S1, multi-modal dataset construction
A framework is employed to collect descriptive text of diseases in the field, with the aim of providing a comprehensive set of benchmarks and annotation types to facilitate subsequent classification research. The image description task is executed on Amazon Mechanical Turk (AMT), where workers can anonymously complete short online tasks in exchange for a small fee. The main problem with using a large number of non-professionals is ensuring that the descriptive text of the disease is accurate while keeping annotation fast and economical; from an economic standpoint, the most accurate descriptive text should be obtained at the lowest price. Botanists design the questions and options, descriptive text is collected quickly by selection, and potential annotators must achieve more than 90% accuracy on a test of basic image-annotation knowledge before answering our questions, to prevent them from choosing options at will. Meanwhile, the picture dataset is taken from PlantVillage; it covers 24 kinds of diseased leaves (such as apple scab, apple rust, cherry powdery mildew, corn rust, corn leaf blight, grape anthracnose, etc.), each class containing 275 pictures, for a total of 6600 images. The auxiliary text description dataset contains 720 text descriptions; each class of image is attached with Chinese text descriptions, and images and texts are randomly combined to form image-text pairs.
S2, crop disease phenotype text generation model construction based on improved CNN and LSTM
Text is generated for the processed plant images, producing text that contains crop disease phenotype information. The text covers several main features: number, color, shape, characteristics and location. First, an object detector is trained using a region-based deep neural network to obtain a set of region features containing plant anomalies. Second, a language generator takes the features of the object detection results as input and uses a Long Short-Term Memory (LSTM) network to generate descriptive sentences of the crop disease phenotype. For object detection we use Faster R-CNN (Ren et al., 2016), which detects objects in an image in two stages. In the first stage, a region proposal network takes the image feature map as input and outputs a set of object proposals with region scores. In the second stage, the feature vectors of the object proposals are fed into the network to predict the bounding-box locations. For language generation, the same region features produced by the object detector are used as input to a language generator that associates each region with text. In this part, the LSTM module predicts a word at each time step and uses these predictions to predict the next word, from the init token to the end of the sentence. LSTM is a special RNN unit that contains a built-in memory cell to store information and exploit long-range context (Hochreiter and Schmidhuber, 1997); it can learn long-range dependencies while avoiding the long-term dependency problem of standard RNNs.
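A minimal PyTorch sketch of the region-feature-to-sentence step is given below; the layer sizes, vocabulary handling and the pooled `region_feat` input are illustrative assumptions, and the detector producing the region features is assumed to be a standard Faster R-CNN.

```python
import torch
import torch.nn as nn

class RegionCaptioner(nn.Module):
    """Toy LSTM language generator conditioned on one detected region feature."""
    def __init__(self, feat_dim=1024, embed_dim=256, hidden_dim=512, vocab_size=1000):
        super().__init__()
        self.init_proj = nn.Linear(feat_dim, hidden_dim)   # region feature -> initial hidden state
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, region_feat, tokens):
        # region_feat: (B, feat_dim) pooled region feature from the detector
        # tokens: (B, T) word ids of the target sentence (teacher forcing)
        h0 = torch.tanh(self.init_proj(region_feat)).unsqueeze(0)   # (1, B, H)
        c0 = torch.zeros_like(h0)
        emb = self.embed(tokens)                                    # (B, T, E)
        hidden, _ = self.lstm(emb, (h0, c0))
        return self.out(hidden)                                     # (B, T, vocab) word logits

# usage: logits = RegionCaptioner()(torch.randn(2, 1024), torch.randint(0, 1000, (2, 12)))
```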
S3, constructing a visual language positioning Model (MQVL) based on query text guidance and multi-stage reasoning
The MQVL model mainly comprises three modules: a query text feature extraction module, a query-text-guided visual feature generation module and a multi-stage reasoning module. The query text feature extraction module encodes the query text to generate text embeddings. The query-text-guided visual feature generation module introduces the context information of the query text encoded by the text feature extraction module into each level of the Swin-Transformer architecture, guides the learning of visual features at different scales through an attention mechanism, and aggregates the visual features at different scales to obtain visual features related to the query text. The query text features and visual features produced by the first two modules are then input into the multi-stage reasoning module, and an accurate localization representation of the query object is gradually obtained through multi-stage interactive reasoning in the module's Transformer decoder.
(1) Query text feature extraction module
The query text feature extraction module extracts features of the query text using the BERT model. First, the query text is tokenized, and [CLS] and [SEP] tokens are added to the head and tail of the tokenized query text expression respectively as the input of the text feature extractor. Encoding this input yields a token carrying the context information of the query text (the [CLS] token) and a token for each word in the query text, giving the query text features F_t ∈ R^(N_t × C_t), where the channel size C_t is 768 and N_t is the number of word tokens.
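A minimal sketch of this step using the HuggingFace Transformers BERT encoder; the model checkpoint and the example query are illustrative assumptions.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def encode_query(text: str):
    # The tokenizer adds [CLS] at the head and [SEP] at the tail automatically.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = bert(**inputs)
    token_feats = out.last_hidden_state      # (1, N_t, 768): one feature per token
    context_feat = token_feats[:, 0]         # (1, 768): [CLS] token = sentence context
    return context_feat, token_feats

ctx, tokens = encode_query("brown round lesions on the upper half of the leaf")
```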
(2) Query text guided visual feature generation module
Given a picture I ∈ R^(H × W × 3) as input to the visual feature generation module, where H and W represent the height and width of the picture respectively, MQVL employs the query text to guide the network to extract relevant visual features and flattens them into a feature sequence F_v ∈ R^(N_v × C_v), where C_v is the channel dimension and N_v is the number of input tokens. The visual feature generation module extracts visual features under the guidance of the query text features through an attention mechanism, and fuses visual features of different scales to obtain visual features closely related to the query text.
(3) Visual characteristic diagram
Owing to the hierarchical structure of the Swin-Transformer, the output of the feature extractor is a hierarchical list of visual feature maps. In MQVL each stage consists of several Swin-Transformer blocks (i.e., a Swin module) and an attention module. The image is first segmented by patch embedding into an initial representation Z_0 with embedding dimension C. Z_0 and the query text features F_t are then input together into the Swin-Transformer architecture, and the attention module guides the visual feature extraction of the four stages; that is, at the m-th stage (1 <= m <= 4), the visual feature map Z_{m-1} of the previous stage and F_t are input together into the Swin-Transformer blocks, and the attention module implements query-text-guided learning of the visual features, yielding the visual feature map Z_m of each stage.
The query-text-guided learning of the visual feature extraction follows the idea of QRNet: a dynamic linear layer is used to compute a channel attention map and a spatial attention map that relate the visual features to the query text, so as to obtain query-relevant visual features.
The dynamic linear layer exploits the contextual feature F_cls of the query text to guide the mapping from a given input vector x to an output vector y:

y = W(F_cls) · x + b(F_cls)

where W(F_cls) and b(F_cls) are the linear layer weight and bias generated from F_cls, and W(F_cls) is computed by matrix decomposition to keep the number of parameters manageable.
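A minimal sketch of such a dynamic linear layer is shown below; the low-rank factorization of the weight and the dimensions are illustrative assumptions in the spirit of QRNet, not the patented implementation.

```python
import torch
import torch.nn as nn

class DynamicLinear(nn.Module):
    """Linear layer whose weight and bias are generated from the query-text context feature."""
    def __init__(self, in_dim, out_dim, text_dim=768, rank=16):
        super().__init__()
        self.rank, self.in_dim, self.out_dim = rank, in_dim, out_dim
        # Generate a low-rank weight W = U @ V and a bias from the [CLS] feature.
        self.to_u = nn.Linear(text_dim, out_dim * rank)
        self.to_v = nn.Linear(text_dim, rank * in_dim)
        self.to_b = nn.Linear(text_dim, out_dim)

    def forward(self, x, f_cls):
        # x: (B, N, in_dim) visual tokens; f_cls: (B, text_dim) query context feature
        B = x.shape[0]
        u = self.to_u(f_cls).view(B, self.out_dim, self.rank)
        v = self.to_v(f_cls).view(B, self.rank, self.in_dim)
        w = torch.bmm(u, v)                        # (B, out_dim, in_dim) text-conditioned weight
        b = self.to_b(f_cls).unsqueeze(1)          # (B, 1, out_dim)
        return torch.einsum("boi,bni->bno", w, x) + b

# usage: DynamicLinear(96, 96)(torch.randn(2, 49, 96), torch.randn(2, 768))
```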
The channel attention map A_c is computed first. For the visual feature map Z_m generated by the Swin module at each stage, spatial information is aggregated by average pooling and max pooling to produce the pooled features Z_avg and Z_max. The pooled features are then processed by a dynamic linear layer (DL) and a ReLU function, and the processed average-pooled and max-pooled features are summed and passed through a sigmoid function to obtain the channel attention map:

A_c = Sigmoid(DL(ReLU(DL(Z_avg))) + DL(ReLU(DL(Z_max))))

The visual features Z_m are multiplied element-wise with A_c to obtain the channel-refined visual features:

Z_c = A_c ⊗ Z_m

The spatial attention map A_s is computed next. Instead of compressing the channel dimension by pooling, a dynamic linear layer is used to reduce the channel dimension so as to learn the regions relevant to the query text, and a sigmoid function generates the spatial attention map:

A_s = Sigmoid(DL(Z_c)),  Z'_m = A_s ⊗ Z_c

where A_s denotes the spatial attention map and Z'_m is the final output of the attention module.
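A minimal sketch of this channel-then-spatial refinement, reusing the DynamicLinear sketch above; the pooling over tokens and the reduction ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QueryGuidedAttention(nn.Module):
    """Channel and spatial attention conditioned on the query-text context feature."""
    def __init__(self, dim, text_dim=768, reduction=4):
        super().__init__()
        self.ch_down = DynamicLinear(dim, dim // reduction, text_dim)
        self.ch_up = DynamicLinear(dim // reduction, dim, text_dim)
        self.spatial = DynamicLinear(dim, 1, text_dim)

    def forward(self, z, f_cls):
        # z: (B, N, C) visual tokens of one Swin stage; f_cls: (B, text_dim)
        avg = z.mean(dim=1, keepdim=True)                 # (B, 1, C) average pooling over tokens
        mx = z.max(dim=1, keepdim=True).values            # (B, 1, C) max pooling over tokens
        a_c = torch.sigmoid(self.ch_up(torch.relu(self.ch_down(avg, f_cls)), f_cls)
                            + self.ch_up(torch.relu(self.ch_down(mx, f_cls)), f_cls))
        z_c = z * a_c                                      # channel refinement
        a_s = torch.sigmoid(self.spatial(z_c, f_cls))      # (B, N, 1) spatial attention
        return z_c * a_s                                   # spatially refined output

# usage: QueryGuidedAttention(96)(torch.randn(2, 49, 96), torch.randn(2, 768))
```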
(4) Multi-scale feature fusion
Through the hierarchical structure of the Swin-Transformer, MQVL obtains visual feature maps at 4 different scales, with resolutions of H/4 × W/4, H/8 × W/8, H/16 × W/16 and H/32 × W/32 respectively. To fuse the visual feature maps obtained at the different stages effectively, MQVL applies average pooling that halves the spatial resolution of the multi-scale visual features; that is, the visual feature map Z'_m generated at the m-th stage (1 <= m <= 3) is average-pooled so that its dimensions match those of the (m+1)-th stage, and the mean of the two visual feature maps is taken. Finally, the fused visual feature map is flattened into the sequence F_v, which serves as the input of the following multi-modal reasoning module.
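A minimal sketch of this coarse-to-fine averaging; the 2 × 2, stride-2 pooling is an assumption consistent with the halving of resolution between Swin stages, and matching channel dimensions are assumed for simplicity.

```python
import torch
import torch.nn.functional as F

def fuse_multiscale(feature_maps):
    """feature_maps: list of 4 tensors (B, C, H_m, W_m) from Swin stages 1-4,
    each stage at half the resolution of the previous one."""
    fused = feature_maps[0]
    for m in range(1, len(feature_maps)):
        # Pool the finer map down to the next stage's resolution, then average the two.
        pooled = F.avg_pool2d(fused, kernel_size=2, stride=2)
        # NOTE: assumes channel dims already match; a 1x1 conv projection would be needed otherwise.
        fused = (pooled + feature_maps[m]) / 2
    return fused.flatten(2).transpose(1, 2)      # (B, H*W, C) sequence for the reasoning module

# usage with equal channel dims for illustration:
maps = [torch.randn(1, 96, s, s) for s in (56, 28, 14, 7)]
seq = fuse_multiscale(maps)
```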
(5) Multi-stage reasoning module
The relevance between the generated visual features and the query text is coarse-grained; to obtain more accurate localization, fine-grained relevance must be established. MQVL therefore performs iterative reasoning with a multi-stage decoder, repeatedly letting the visual and language information interact through a cross-attention mechanism, which reduces ambiguity during inference and gradually localizes the final target object.
Following the decoder-layer setting of VLTVG, the number of decoder layers in MQVL is set to 6, corresponding to 6 stages. Each stage has the same network architecture, and the feature output of each decoder stage is used as the target query object feature input to the next stage, so reasoning proceeds iteratively. Specifically, in the first stage a learnable query object q_0 is set as the initial representation of the target object and input into the first decoder layer. Through the multi-head cross-attention module, q_0 interacts with the text embedding F_t and the visual features F_v and gathers from the visual features the features related to the query text object; the first-stage target object feature q_1 is then obtained through a feed-forward network (FFN) and layer normalization. The object feature q_1 generated by the first stage is input to the decoder as the target object of the second stage, whose processing is identical to that of the first stage, and the optimal target object is obtained through 6 stages of iterative reasoning. The target object is updated at each stage i (1 <= i <= 6) as:

q'_i = LN(q_{i-1} + CrossAttn(q_{i-1}, F_t, F_v))
q_i = LN(q'_i + FFN(q'_i))

where LN denotes layer normalization and the FFN is composed of two linear projection layers and a ReLU activation function.
By dynamically updating the query object q_i at the different decoder stages, each stage can attend to different parts of the query text description, so the target object is located more finely and more complete target object features are aggregated, which in turn yields a more accurate visual representation of the target object described by the query text.
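A minimal sketch of one such reasoning stage and the 6-stage loop; the head count, feature dimension and the separate text/visual cross-attention are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ReasoningStage(nn.Module):
    """One decoder stage: cross-attend the query object to text and visual features."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, q, f_t, f_v):
        # q: (B, 1, dim) target query object; f_t: (B, N_t, dim) text; f_v: (B, N_v, dim) visual
        q = self.norm1(q + self.text_attn(q, f_t, f_t)[0] + self.vis_attn(q, f_v, f_v)[0])
        return self.norm2(q + self.ffn(q))

# six stages applied iteratively, each reusing the previous stage's output as its query
stages = nn.ModuleList(ReasoningStage() for _ in range(6))
q = torch.zeros(2, 1, 256)          # stands in for the learnable initial query object
f_t, f_v = torch.randn(2, 12, 256), torch.randn(2, 196, 256)
outputs = []
for stage in stages:
    q = stage(q, f_t, f_v)
    outputs.append(q)               # every stage's output feeds the box-prediction MLP
```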
(6) Query object localization
MQVL feeds the target object features output by each stage of the multi-modal reasoning module into an MLP with a ReLU activation function; the target object coordinates output by each intermediate stage are used to compute the loss function, and the output of the last stage is taken as the final coordinate position of the target object.
MQVL outputs the bounding-box coordinates of the final target object through the final MLP, and computes and sums the loss between the bounding box predicted by each decoder stage and the ground-truth box. Letting b_i (i = 1, ..., 6) denote the target box coordinates predicted at the i-th decoder stage and b_gt denote the ground-truth box, the training objective is:

L = Σ_{i=1}^{6} ( λ_giou · L_GIoU(b_i, b_gt) + λ_1 · L_1(b_i, b_gt) )

where L_GIoU and L_1 are the GIoU and L1 loss functions respectively, and λ_giou and λ_1 are hyper-parameters that balance the two losses during training.
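A minimal sketch of this training objective using torchvision's generalized IoU loss; the box format and the loss weights are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss

def localization_loss(stage_boxes, gt_box, w_giou=2.0, w_l1=5.0):
    """stage_boxes: list of (B, 4) predicted boxes (x1, y1, x2, y2), one per decoder stage.
    gt_box: (B, 4) ground-truth box in the same format."""
    total = 0.0
    for pred in stage_boxes:                      # sum the loss over all decoder stages
        giou = generalized_box_iou_loss(pred, gt_box, reduction="mean")
        l1 = F.l1_loss(pred, gt_box)
        total = total + w_giou * giou + w_l1 * l1
    return total

# usage (boxes in x1, y1, x2, y2 format):
# loss = localization_loss([torch.tensor([[12., 14., 58., 62.]])] * 6,
#                          torch.tensor([[10., 10., 60., 60.]]))
```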
S4, after the text descriptive sentences of the pictures in the step S2 are obtained, inputting the text descriptive sentences into a MQVL model in the step S3 to locate the disease positions.
S5, constructing the CNN-Transformer-based dual-stream multi-modal few-shot recognition model (CTMF)
CTMF employs a dual-stream architecture that attends to the global information and local information of the current task simultaneously. The Swin Transformer network, with its shifted-window mechanism, can better capture global feature information and perform global information interaction, which greatly improves the classification accuracy and generalization ability of the multi-modal few-shot model in classification tasks. The proposed model comprises a dual embedding module, a feature fusion module and a metric module. The dual embedding module consists of two branches, a local branch and a global branch. Given a support sample x_s ∈ S and a query sample x_q ∈ Q, both are input to the global and local branches of the model. In the global branch, a Swin Transformer is used as the feature extractor, and the extracted image features are input into the feature fusion module. Meanwhile, the local branch uses a ResNet to obtain local features, which are also sent to the feature fusion module. Finally, the local and global branch features are fused and input into a cosine function for classification.
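A minimal sketch of the dual-stream embedding and cosine metric; the timm backbones (with ResNet18 standing in for the ResNet12 described below) and the simple concatenation fusion are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import timm

class DualStreamEncoder(nn.Module):
    """Global (Swin Transformer) + local (ResNet) embedding, fused by concatenation."""
    def __init__(self):
        super().__init__()
        # num_classes=0 makes timm return pooled feature vectors instead of logits.
        self.global_branch = timm.create_model("swin_tiny_patch4_window7_224",
                                               pretrained=False, num_classes=0)
        # Stand-in for the ResNet12 local branch described below.
        self.local_branch = timm.create_model("resnet18", pretrained=False, num_classes=0)

    def forward(self, x):                                   # x: (B, 3, 224, 224)
        return torch.cat([self.global_branch(x), self.local_branch(x)], dim=-1)

def cosine_classify(support_feats, support_labels, query_feats, n_way):
    """Build one prototype per class and score queries by cosine similarity."""
    protos = torch.stack([support_feats[support_labels == c].mean(0) for c in range(n_way)])
    return F.cosine_similarity(query_feats.unsqueeze(1), protos.unsqueeze(0), dim=-1)

# usage: 5-way episode with pre-computed 1280-d (768 + 512) features
# scores = cosine_classify(torch.randn(25, 1280), torch.arange(5).repeat(5),
#                          torch.randn(10, 1280), 5)
```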
(1) Local branching
The effect of the model is tested using a ResNet backbone as the local embedding module to obtain picture features F ∈ R^(H × W × C). ResNet12 consists of 4 consecutive basic blocks, with the number of filters set to 64-128-256-512. Each block contains three convolutional layers with a kernel size of 3 × 3, three batch-normalization layers, a ReLU activation layer and a 2 × 2 pooling layer with a stride of 2. The Swin module of the global branch is identical to the Swin module of MQVL.
(2) Dual channel mixing attention
Considering that a single model has difficulty learning global information and local details simultaneously, and inspired by the class-attribute-guided mixed attention of AGAM, a dual-path mixed attention is designed to fuse the auxiliary modality. In Multimodal-Plant, a 768-dimensional feature vector is obtained through BERT. After the global and local visual features are obtained, max-pooling and average-pooling operations are applied to them simultaneously, and the auxiliary-modality features are concatenated to each pooled result. By performing adaptive feature refinement along the channel and spatial dimensions, the generalization of the model is intended to be improved.
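A minimal sketch of this dual-path mixed attention; the pooling choices, projection sizes and the way the 768-dimensional BERT vector is concatenated are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DualChannelMixedAttention(nn.Module):
    """Refine visual features along channel and spatial dims, guided by a BERT text vector."""
    def __init__(self, channels, text_dim=768, reduction=4):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels + text_dim, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, feat, text_vec):
        # feat: (B, C, H, W) visual features; text_vec: (B, 768) BERT description feature
        B, C, _, _ = feat.shape
        avg = feat.mean(dim=(2, 3))                      # (B, C) average-pooled descriptor
        mx = feat.amax(dim=(2, 3))                       # (B, C) max-pooled descriptor
        # Concatenate the auxiliary text modality to each pooled descriptor.
        ch = torch.sigmoid(self.channel_mlp(torch.cat([avg, text_vec], 1))
                           + self.channel_mlp(torch.cat([mx, text_vec], 1)))
        feat = feat * ch.view(B, C, 1, 1)                # channel refinement
        sp = torch.sigmoid(self.spatial_conv(torch.cat(
            [feat.mean(1, keepdim=True), feat.amax(1, keepdim=True)], 1)))
        return feat * sp                                 # spatial refinement

# usage: DualChannelMixedAttention(512)(torch.randn(2, 512, 7, 7), torch.randn(2, 768))
```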
S5, the image cropped at the disease position located by the visual language localization model and the global leaf image are acquired and input into the CTMF model for disease identification and classification.
S6, after the disease type is identified in S5, the picture is saved and an early warning is issued.
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the above-described embodiments, and that the above-described embodiments and descriptions are only preferred embodiments of the present invention, and are not intended to limit the invention, and that various changes and modifications may be made therein without departing from the spirit and scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.
Claims (5)
1. A multimodal crop disease phenotype collaborative analysis model, comprising the steps of:
S1, constructing a multi-mode data set;
the multi-modal disease dataset is built by crowdsourcing, namely botanists design the options and a large number of non-professionals are used to quickly acquire disease text descriptions; Chinese text descriptions are attached to each type of image, and image-text pairs are formed by random combination;
S2, constructing a crop disease phenotype text generation model based on improved CNN and LSTM, generating a text for a plant image, and generating a text containing crop disease phenotype information;
S3, constructing a visual language positioning model based on query text guidance and multi-stage reasoning, wherein the visual language positioning model comprises three modules, namely a query text feature extraction module, a query text guidance visual feature generation module and a multi-stage reasoning module;
The query text feature extraction module is used for encoding the query text to generate text embeddings; the query-text-guided visual feature generation module is used for introducing the context information of the query text encoded by the query text feature extraction module into each level of a Swin-Transformer architecture, guiding the learning of visual features at different scales by means of an attention mechanism, and aggregating the visual features at different scales to obtain visual features related to the query text;
S4, inputting the text descriptive sentences of the plant images obtained in the step S2 into a visual language positioning model in the step S3 to position the disease position;
S5, constructing a CNN-Transformer-based dual-stream multi-modal few-shot recognition model;
The multi-modal few-shot recognition model comprises a dual-embedding module, a feature fusion module and a metric module, wherein the dual-embedding module consists of a local branch and a global branch, and a plant image support sample and a query sample are simultaneously input into the global branch and the local branch of the model;
and S6, after the disease type is identified in S5, the picture is saved and an early warning is issued.
2. The multi-modal crop disease phenotype collaborative analysis model according to claim 1, wherein the disease dataset comprises 24 diseases, including apple scab, apple rust, cherry powdery mildew, corn rust, corn leaf blight, grape black rot, grape leaf blight, peach bacterial spot, pepper bacterial spot, potato early blight, potato late blight, tomato bacterial spot, tomato leaf mold, tomato leaf spot, apple black rot, pumpkin powdery mildew, tomato early blight, tomato yellow leaf curl virus, corn gray leaf spot, citrus huanglongbing, strawberry leaf scorch, tomato late blight, tomato round spot and tomato mosaic virus.
3. The multi-modal crop disease phenotype collaborative analysis model according to claim 1, wherein the disease text description requires non-professionals to select options from 5 angles of quantity, color, shape, characteristics and lesion area; after selection, a program collects the options to form text description sentences, and potential annotators must achieve more than 90% accuracy on a test of basic image-annotation knowledge before answering the questions.
4. The multi-modal crop disease phenotype collaborative analysis model according to claim 1, wherein the multi-modal few-shot recognition model employs a dual-stream architecture that attends to the global and local information of the current task simultaneously.
5. A computer device comprising a memory and a processor, wherein the memory has stored therein a computer program, the processor being configured to run the computer program to execute the multimodal crop disease phenotype collaborative analysis model of claim 1.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310828903.8A CN116778391B (en) | 2023-07-07 | 2023-07-07 | Multi-mode crop disease phenotype collaborative analysis model and device |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310828903.8A CN116778391B (en) | 2023-07-07 | 2023-07-07 | Multi-mode crop disease phenotype collaborative analysis model and device |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN116778391A CN116778391A (en) | 2023-09-19 |
| CN116778391B true CN116778391B (en) | 2025-09-16 |
Family
ID=88008029
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310828903.8A Active CN116778391B (en) | 2023-07-07 | 2023-07-07 | Multi-mode crop disease phenotype collaborative analysis model and device |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN116778391B (en) |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117708761B (en) * | 2024-02-06 | 2024-05-03 | 四川省亿尚农业旅游开发有限公司 | System and method for raising seedlings of hippeastrum with fusion of multi-index environmental conditions |
| CN118506349B (en) * | 2024-07-18 | 2025-01-21 | 安徽高哲信息技术有限公司 | Training method for grain identification model, grain identification method, equipment and medium |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114298158A (en) * | 2021-12-06 | 2022-04-08 | 湖南工业大学 | A Multimodal Pre-training Method Based on Linear Combination of Graphics and Text |
| CN115048537A (en) * | 2022-07-11 | 2022-09-13 | 河北农业大学 | Disease recognition system based on image-text multi-mode collaborative representation |
Family Cites Families (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108764268A (en) * | 2018-04-02 | 2018-11-06 | 华南理工大学 | A kind of multi-modal emotion identification method of picture and text based on deep learning |
| KR102425487B1 (en) * | 2019-10-25 | 2022-07-27 | 전북대학교산학협력단 | Method and system for glocal description of phytopathology based on Deep learning |
| CN113241135B (en) * | 2021-04-30 | 2023-05-05 | 山东大学 | Disease risk prediction method and system based on multi-modal fusion |
| CN116129164A (en) * | 2022-09-09 | 2023-05-16 | 广西大学 | A Crop Disease Recognition Model Based on Transformer Hybrid Architecture |
| CN115455970A (en) * | 2022-09-13 | 2022-12-09 | 北方民族大学 | Image-text combined named entity recognition method for multi-modal semantic collaborative interaction |
| CN115937689B (en) * | 2022-12-30 | 2023-08-11 | 安徽农业大学 | A technology for intelligent identification and monitoring of agricultural pests |
| CN116152810A (en) * | 2023-02-27 | 2023-05-23 | 中国科学院合肥物质科学研究院 | Visual positioning method and device based on hierarchical cross-modal contextual attention mechanism |
2023
- 2023-07-07 CN CN202310828903.8A patent/CN116778391B/en active Active
Patent Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114298158A (en) * | 2021-12-06 | 2022-04-08 | 湖南工业大学 | A Multimodal Pre-training Method Based on Linear Combination of Graphics and Text |
| CN115048537A (en) * | 2022-07-11 | 2022-09-13 | 河北农业大学 | Disease recognition system based on image-text multi-mode collaborative representation |
Also Published As
| Publication number | Publication date |
|---|---|
| CN116778391A (en) | 2023-09-19 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Wang et al. | Plant disease detection and classification method based on the optimized lightweight YOLOv5 model | |
| Temniranrat et al. | A system for automatic rice disease detection from rice paddy images serviced via a Chatbot | |
| Tzampazaki et al. | Machine vision—moving from Industry 4.0 to Industry 5.0 | |
| CN116778391B (en) | Multi-mode crop disease phenotype collaborative analysis model and device | |
| Aldakheel et al. | Detection and identification of plant leaf diseases using YOLOv4 | |
| Patel et al. | Deep learning-based plant organ segmentation and phenotyping of sorghum plants using LiDAR point cloud | |
| Liao et al. | A hybrid CNN-LSTM model for diagnosing rice nutrient levels at the rice panicle initiation stage | |
| Prashanthi et al. | Plant disease detection using Convolutional neural networks | |
| López-Barrios et al. | Green sweet pepper fruit and peduncle detection using mask R-CNN in greenhouses | |
| Teimouri et al. | Novel assessment of region-based CNNs for detecting monocot/dicot weeds in dense field environments | |
| Zhao et al. | Implementation of large language models and agricultural knowledge graphs for efficient plant disease detection | |
| Zhu et al. | Harnessing large vision and language models in agriculture: A review | |
| Hu et al. | Crop node detection and internode length estimation using an improved YOLOv5 model | |
| Hao et al. | CountShoots: Automatic detection and counting of slash pine new shoots using UAV imagery | |
| Jing et al. | Optimizing the yolov7-tiny model with multiple strategies for citrus fruit yield estimation in complex scenarios | |
| Geng et al. | Research on segmentation method of maize seedling plant instances based on uav multispectral remote sensing images | |
| Hou et al. | An occluded cherry tomato recognition model based on improved YOLOv7 | |
| Siri et al. | Enhanced deep learning models for automatic fish species identification in underwater imagery | |
| Yue et al. | Detection and Counting Model of Soybean at the Flowering and Podding Stage in the Field Based on Improved YOLOv5 | |
| Wang et al. | YOLOv5-AC: a method of uncrewed rice transplanter working quality detection | |
| Rahim et al. | Comparison of grape flower counting using patch-based instance segmentation and density-based estimation with convolutional neural networks | |
| Xu et al. | GLL-YOLO: A Lightweight Network for Detecting the Maturity of Blueberry Fruits | |
| Qian et al. | Cucumber Leaf Segmentation Based on Bilayer Convolutional Network | |
| Zhang et al. | E-CLIP: An Enhanced CLIP-Based Visual Language Model for Fruit Detection and Recognition | |
| Corrigan | An investigation into machine learning solutions involving time series across different problem domains |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |