CN116778391B - Multi-mode crop disease phenotype collaborative analysis model and device - Google Patents
Multi-mode crop disease phenotype collaborative analysis model and device

Info
- Publication number
- CN116778391B CN116778391B CN202310828903.8A CN202310828903A CN116778391B CN 116778391 B CN116778391 B CN 116778391B CN 202310828903 A CN202310828903 A CN 202310828903A CN 116778391 B CN116778391 B CN 116778391B
- Authority
- CN
- China
- Prior art keywords
- text
- model
- disease
- visual
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/02—Agriculture; Fishing; Forestry; Mining
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Mathematical Physics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Business, Economics & Management (AREA)
- Multimedia (AREA)
- Strategic Management (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Agronomy & Crop Science (AREA)
- Economics (AREA)
- Human Resources & Organizations (AREA)
- Marketing (AREA)
- Mining & Mineral Resources (AREA)
- Tourism & Hospitality (AREA)
- Primary Health Care (AREA)
- General Business, Economics & Management (AREA)
- Animal Husbandry (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Marine Sciences & Fisheries (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a multi-modal crop disease phenotype collaborative analysis model, device and model construction system. The method comprises constructing a crop disease phenotype text generation model based on an improved CNN and LSTM, a visual language localization model (MQVL) based on query text guidance and multi-stage reasoning, and a CNN-Transformer dual-stream multi-modal few-shot recognition model (CTMF), and training each of them on the constructed multi-modal training dataset.
Description
Technical Field
The invention relates to the field of crop diseases, in particular to a multi-mode crop disease phenotype collaborative analysis model and a device.
Background
Plant diseases are responsible for significant economic losses in the global agricultural sector and are directly related to food safety and sustainable food production. Quantifying the impact of plant pathology on crops is one of the most challenging problems in agriculture. A lack of nutrition or an imbalance between soil moisture and oxygen makes plants more susceptible to pathogens, and abnormalities in plants may be caused by pests, diseases or other abiotic stresses (e.g., low temperature). Disease identification is typically time-consuming, labor-intensive and subjective. Traditionally, crop inspection has been performed by persons with expertise in the area; however, this approach can introduce a degree of uncertainty or error, resulting in erroneous decisions.
Recent advances in plant phenotyping allow the development of efficient and automated diagnostic systems for plant abnormality identification. Although existing methods have shown some effect, they remain limited in disease localization and identification, especially in real-world scenarios. To address this limitation, we propose a method that detects and localizes plant abnormalities more effectively in a multi-modal manner by combining visual object recognition with language generation, producing detailed information about the symptoms.
Nuthalapati et al. use geographical position and time as priors, obtain their features through nonlinear embedding, and then fuse this information with the visual features through Relative Transformer layers, improving recognition accuracy on the CUB-200-2011 bird dataset. Huang et al. propose an attributes-guided attention module (AGAM) that, for datasets with attributes, merges attribute and visual features in an attribute-guided branch and, for the branch without attributes, learns attention weights through feature selection.
Thus, in addition to the image itself, information such as shooting location, date, time, image attributes and text descriptions can be an important source of prior knowledge. In particular, the text description of an image contains rich semantic information. Text-modality and image-modality information are complementary, which can to a certain extent alleviate the problem of insufficient image training samples. However, acquiring and constructing multi-modal datasets in the agricultural field is difficult: it requires manual annotation by students and specialists in the relevant fields and is time- and cost-intensive. Therefore, in this patent, plant protection experts design the questions and options around 5 characteristics of plant diseases, namely the number, color, shape and features of the lesions and the proportion of leaf area they occupy, so that multi-modal text descriptions can be collected rapidly.
Reasonable multi-modal application also benefits visual language localization, the task of locating a target object or region in an image according to a natural language expression. At present, most visual language localization research focuses on natural images taken from a head-up viewpoint, such as people, animals and cars. Existing methods mainly extract visual features and text embeddings independently and then perform fused reasoning on them to locate the target object mentioned in the query text. However, the features obtained by an independent visual feature extraction module often contain many visual features unrelated to the query text, and these redundant, irrelevant features can lead to unreasonable inference in the subsequent multi-modal fusion module and thereby degrade target localization.
To address these problems in visual language localization, this patent designs a combined network model based on the Swin-Transformer architecture, comprising a query text feature extraction module, a query-text-guided visual feature generation module and a multi-stage fusion reasoning module. The query text features are introduced into the visual feature extraction module as guidance, which reduces the interference of irrelevant visual features and generates visual features related to the query text; the multi-stage fusion reasoning module then performs multi-stage interactive reasoning on these related visual features and the query text features to further focus on the accurate localization of the queried target object.
Disclosure of Invention
The invention aims to provide a multi-mode crop disease phenotype collaborative analysis model and a device thereof so as to solve the technical problems.
The invention aims to solve the technical problems, and is realized by adopting the following technical scheme:
a multimodal crop disease phenotype collaborative analysis model comprising the steps of:
S1, constructing a multi-mode data set;
A multi-modal disease dataset is built by crowdsourcing, namely botanists design the options and a large number of non-professionals are used to quickly acquire disease text descriptions;
s2, constructing a crop disease phenotype text generation model based on the improved CNN and LSTM;
S3, constructing a visual language positioning model based on query text guidance and multi-stage reasoning;
S4, after the text descriptive sentences of the pictures in the step S2 are obtained, inputting the text descriptive sentences into a visual language positioning model of multi-stage reasoning in the step S3 to position disease positions;
S5, constructing a CNN-Transformer dual-stream multi-modal few-shot recognition model;
S6, after the disease type is identified in S5, the picture is saved and an early warning is issued.
Preferably, the disease dataset comprises 24 diseases: apple scab, apple rust, cherry powdery mildew, corn rust, corn leaf blight, grape black rot, grape leaf blight, peach bacterial spot, pepper bacterial spot, potato early blight, potato late blight, tomato bacterial spot, tomato leaf mold, tomato leaf spot, apple black rot, pumpkin powdery mildew, tomato early blight, tomato yellow leaf curl virus, corn gray leaf spot, citrus huanglongbing, strawberry leaf scorch, tomato late blight, tomato round spot and tomato mosaic virus.
Preferably, the disease text description requires non-professionals to select options from 5 angles: number, color, shape, characteristics and lesion area. After selection, a program collects the options to form text description sentences, and potential annotators must achieve more than 90% accuracy on a test of basic image-annotation knowledge before answering our questions.
Preferably, the multi-modal few-shot recognition model adopts a dual-stream architecture that attends to the global information and local information of the current task simultaneously, and comprises a dual-embedding module, a feature fusion module and a metric module.
Preferably, the dual embedding module consists of two branches, a local branch and a global branch.
Preferably, the visual language localization model of multi-stage reasoning mainly comprises three modules, namely a query text feature extraction module, a query text guided visual feature generation module and a multi-stage reasoning module.
A computer device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to execute a multimodal crop disease phenotype collaborative analysis model.
The beneficial effects of the invention are as follows:
1. The multi-modal crop disease phenotype collaborative analysis model and device have high practical application value: possible important text characteristics of diseases are fully considered when preparing the dataset, and the dataset is constructed from 5 different angles. In practical application, new disease types can be added quickly through crowdsourcing so that crop phenotypes can be monitored continuously, providing theoretical guidance and technical support for crop phenotype research.
2. The invention introduces crowdsourcing into the construction of a multi-modal agricultural dataset, delegating the dataset construction task to the crowd, and on this basis designs an image description generation model, a visual language localization model (MQVL) based on query text guidance and multi-stage reasoning, and a CNN-Transformer dual-stream multi-modal few-shot recognition model (CTMF). The image description generation model automatically generates the disease text description, the MQVL model automatically identifies possible disease regions, and the multi-modal few-shot classification model performs the final identification. The invention combines text generation, disease localization, disease identification and early warning, and can effectively improve the accuracy of plant leaf disease identification.
Drawings
FIG. 1 is a flow chart of a multimodal crop disease phenotype collaborative analysis model and apparatus of the present invention;
FIG. 2 is a display of crop disease in a data set in different complex contexts;
FIG. 3 is a schematic diagram of CTMF;
FIG. 4 is a schematic diagram of a dual channel mixed attention architecture of CTMF;
FIG. 5 is a schematic diagram of MQVL;
FIG. 6 is a schematic diagram of a text model structure generated from image descriptions.
Detailed Description
In order that the above-recited features, objects and advantages of the present invention can be more clearly understood, the invention is described in more detail below with reference to specific embodiments illustrated in the appended drawings. Based on these embodiments, all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of the invention.
Specific embodiments of the present invention are described below with reference to the accompanying drawings.
Example 1:
As shown in FIGS. 1-6, a multimodal crop disease phenotype collaborative analysis model comprises the following steps:
S1, quickly acquiring disease text descriptions through crowdsourcing, with botanists designing the options and a large number of non-professionals answering them, and constructing a multi-modal disease dataset;
S2, classifying and labeling the dataset by disease type and dividing it proportionally into a training set and a validation set;
S3, adjusting the model pre-training parameters and designing the image description generation model, MQVL and CTMF;
S4, training the image description generation model, MQVL and CTMF of step S3 using the dataset processed in step S2, and saving the optimal models;
And S5, identifying crop diseases by using the trained model.
Further, in step S1, the crop diseases in the dataset include 24 diseases: apple scab, apple rust, cherry powdery mildew, corn rust, corn leaf blight, grape black rot, grape leaf blight, peach bacterial spot, pepper bacterial spot, potato early blight, potato late blight, tomato bacterial spot, tomato leaf mold, tomato leaf spot, apple black rot, pumpkin powdery mildew, tomato early blight, tomato yellow leaf curl virus, corn gray leaf spot, citrus huanglongbing, strawberry leaf scorch, tomato late blight, tomato round spot and tomato mosaic virus. The disease text description requires non-professionals to select options from 5 angles: number, color, shape, characteristics and lesion area; after selection, a program collects the options to form text description sentences. Before answering our questions, potential annotators must achieve more than 90% accuracy on a test of basic image-annotation knowledge, to prevent them from choosing options at will.
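For illustration, the following is a minimal sketch of how the crowdsourced option selections could be assembled into a description sentence; the option fields and the sentence template are illustrative assumptions, not the collection program used in the patent.

```python
from dataclasses import dataclass

@dataclass
class DiseaseOptions:
    """Hypothetical option fields matching the 5 angles: number, color, shape,
    characteristics and lesion area."""
    number: str          # e.g. "many"
    color: str           # e.g. "brown"
    shape: str           # e.g. "round"
    characteristic: str  # e.g. "with yellow halos"
    area: str            # e.g. "less than 10% of the leaf"

def compose_description(opt: DiseaseOptions) -> str:
    """Assemble the selected options into one disease description sentence."""
    return (f"The leaf shows {opt.number} {opt.color} {opt.shape} lesions "
            f"{opt.characteristic}, covering {opt.area}.")

print(compose_description(DiseaseOptions(
    "many", "brown", "round", "with yellow halos", "less than 10% of the leaf")))
```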
Further, in step S3, CTMF adds the dual-channel mixed attention after the CNN and Swin Transformer branches and minimizes the classification loss function.
The dual-channel mixed attention is a simple and effective attention module: the disease text description is encoded by a BERT model into a 768-dimensional feature vector, max-pooling and average-pooling operations are applied simultaneously to this feature vector and the picture features, and adaptive feature refinement along the channel and spatial dimensions improves the generalization of the model.
The attention weights are produced by a sigmoid function, denoted σ(·).
Further, in step S5, disease identification and positioning are performed using the trained model, and the specific location and name of the disease in the image can be identified after the image is input.
Example 2:
As shown in FIGS. 1-6, with other parts the same as in Embodiment 1, this embodiment differs in that the multi-modal crop disease phenotype collaborative analysis model and device use the public picture disease dataset PlantVillage, design the options specifically around disease characteristics to obtain disease text descriptions, and train the models accordingly. The disease location and the disease type are identified automatically, which improves the accuracy and efficiency of disease identification; accurate identification of the disease type together with early warning plays an important role in disease prevention and control. Pesticide spraying is one of the important technical measures of modern agriculture for preventing and controlling crop diseases, and correct identification of the disease type helps determine the pesticide formulation. The invention is therefore of practical value for crop phenotype management and prevention.
The invention provides a multimodal crop disease phenotype collaborative analysis model and a device, which specifically comprise the following steps:
S1, multi-modal dataset construction
A framework is employed to collect descriptive text of diseases in the field, with the aim of providing a comprehensive set of benchmarks and annotation types to facilitate subsequent classification research. The image description task is executed on Amazon Mechanical Turk (AMT), where workers can anonymously complete short online tasks in exchange for a small fee. The main problem with using a large number of non-professionals is ensuring that the descriptive text of the disease is accurate while keeping annotation fast and economical; from an economic standpoint, the most accurate descriptive text should be obtained at the lowest price. Botanists design the questions and options, descriptive text is collected quickly by selection, and potential annotators must achieve more than 90% accuracy on a test of basic image-annotation knowledge before answering our questions, to prevent them from choosing options at will. Meanwhile, the picture dataset is taken from PlantVillage; it covers 24 kinds of diseased leaves (such as apple scab, apple rust, cherry powdery mildew, corn rust, corn leaf blight, grape anthracnose, etc.), each class containing 275 pictures, for a total of 6600 images. The auxiliary text description dataset contains 720 text descriptions; each class of image is attached with Chinese text descriptions, and images and texts are randomly combined to form image-text pairs.
S2, crop disease phenotype text generation model construction based on improved CNN and LSTM
Text is generated for the processed plant images, producing text that contains crop disease phenotype information. The text covers several main features: number, color, shape, characteristics and location. First, an object detector is trained using a region-based deep neural network to obtain a set of region features containing plant anomalies. Second, a language generator takes the features of the object detection results as input and uses a Long Short-Term Memory (LSTM) network to generate descriptive sentences of the crop disease phenotype. For object detection we use Faster R-CNN (Ren et al., 2016), which detects objects in an image in two stages. In the first stage, a region proposal network takes the image feature map as input and outputs a set of object proposals with region scores. In the second stage, the feature vectors of the object proposals are fed into the network to predict the bounding-box locations. For language generation, the same region features produced by the object detector are used as input to a language generator that associates each region with text. In this part, the LSTM module predicts a word at each time step and uses these predictions to predict the next word, from the init token to the end of the sentence. LSTM is a special RNN unit that contains a built-in memory cell to store information and exploit long-range context (Hochreiter and Schmidhuber, 1997); it can learn long-range dependencies while avoiding the long-term dependency problem of standard RNNs.
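A minimal PyTorch sketch of the region-feature-to-sentence step is given below; the layer sizes, vocabulary handling and the pooled `region_feat` input are illustrative assumptions, and the detector producing the region features is assumed to be a standard Faster R-CNN.

```python
import torch
import torch.nn as nn

class RegionCaptioner(nn.Module):
    """Toy LSTM language generator conditioned on one detected region feature."""
    def __init__(self, feat_dim=1024, embed_dim=256, hidden_dim=512, vocab_size=1000):
        super().__init__()
        self.init_proj = nn.Linear(feat_dim, hidden_dim)   # region feature -> initial hidden state
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, region_feat, tokens):
        # region_feat: (B, feat_dim) pooled region feature from the detector
        # tokens: (B, T) word ids of the target sentence (teacher forcing)
        h0 = torch.tanh(self.init_proj(region_feat)).unsqueeze(0)   # (1, B, H)
        c0 = torch.zeros_like(h0)
        emb = self.embed(tokens)                                    # (B, T, E)
        hidden, _ = self.lstm(emb, (h0, c0))
        return self.out(hidden)                                     # (B, T, vocab) word logits

# usage: logits = RegionCaptioner()(torch.randn(2, 1024), torch.randint(0, 1000, (2, 12)))
```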
S3, constructing a visual language positioning Model (MQVL) based on query text guidance and multi-stage reasoning
The MQVL model mainly comprises three modules: a query text feature extraction module, a query-text-guided visual feature generation module and a multi-stage reasoning module. The query text feature extraction module encodes the query text to generate text embeddings. The query-text-guided visual feature generation module introduces the context information of the query text encoded by the text feature extraction module into each level of the Swin-Transformer architecture, guides the learning of visual features at different scales through an attention mechanism, and aggregates the visual features at different scales to obtain visual features related to the query text. The query text features and visual features produced by the first two modules are then input into the multi-stage reasoning module, and an accurate localization representation of the query object is gradually obtained through multi-stage interactive reasoning in the module's Transformer decoder.
(1) Query text feature extraction module
The query text feature extraction module extracts features of the query text using the BERT model. First, the query text is tokenized, and [CLS] and [SEP] tokens are added to the head and tail of the tokenized query text expression respectively as the input of the text feature extractor. Encoding this input yields a token carrying the context information of the query text (the [CLS] token) and a token for each word in the query text, giving the query text features F_t ∈ R^(N_t × C_t), where the channel size C_t is 768 and N_t is the number of word tokens.
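A minimal sketch of this step using the HuggingFace Transformers BERT encoder; the model checkpoint and the example query are illustrative assumptions.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def encode_query(text: str):
    # The tokenizer adds [CLS] at the head and [SEP] at the tail automatically.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = bert(**inputs)
    token_feats = out.last_hidden_state      # (1, N_t, 768): one feature per token
    context_feat = token_feats[:, 0]         # (1, 768): [CLS] token = sentence context
    return context_feat, token_feats

ctx, tokens = encode_query("brown round lesions on the upper half of the leaf")
```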
(2) Query text guided visual feature generation module
Given a picture I ∈ R^(H × W × 3) as input to the visual feature generation module, where H and W represent the height and width of the picture respectively, MQVL employs the query text to guide the network to extract relevant visual features and flattens them into a feature sequence F_v ∈ R^(N_v × C_v), where C_v is the channel dimension and N_v is the number of input tokens. The visual feature generation module extracts visual features under the guidance of the query text features through an attention mechanism, and fuses visual features of different scales to obtain visual features closely related to the query text.
(3) Visual characteristic diagram
Owing to the hierarchical structure of the Swin-Transformer, the output of the feature extractor is a hierarchical list of visual feature maps. In MQVL each stage consists of several Swin-Transformer blocks (i.e., a Swin module) and an attention module. The image is first segmented by patch embedding into an initial representation Z_0 with embedding dimension C. Z_0 and the query text features F_t are then input together into the Swin-Transformer architecture, and the attention module guides the visual feature extraction of the four stages; that is, at the m-th stage (1 <= m <= 4), the visual feature map Z_{m-1} of the previous stage and F_t are input together into the Swin-Transformer blocks, and the attention module implements query-text-guided learning of the visual features, yielding the visual feature map Z_m of each stage.
The query-text-guided learning of the visual feature extraction follows the idea of QRNet: a dynamic linear layer is used to compute a channel attention map and a spatial attention map that relate the visual features to the query text, so as to obtain query-relevant visual features.
The dynamic linear layer exploits the contextual feature F_cls of the query text to guide the mapping from a given input vector x to an output vector y:

y = W(F_cls) · x + b(F_cls)

where W(F_cls) and b(F_cls) are the linear layer weight and bias generated from F_cls, and W(F_cls) is computed by matrix decomposition to keep the number of parameters manageable.
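A minimal sketch of such a dynamic linear layer is shown below; the low-rank factorization of the weight and the dimensions are illustrative assumptions in the spirit of QRNet, not the patented implementation.

```python
import torch
import torch.nn as nn

class DynamicLinear(nn.Module):
    """Linear layer whose weight and bias are generated from the query-text context feature."""
    def __init__(self, in_dim, out_dim, text_dim=768, rank=16):
        super().__init__()
        self.rank, self.in_dim, self.out_dim = rank, in_dim, out_dim
        # Generate a low-rank weight W = U @ V and a bias from the [CLS] feature.
        self.to_u = nn.Linear(text_dim, out_dim * rank)
        self.to_v = nn.Linear(text_dim, rank * in_dim)
        self.to_b = nn.Linear(text_dim, out_dim)

    def forward(self, x, f_cls):
        # x: (B, N, in_dim) visual tokens; f_cls: (B, text_dim) query context feature
        B = x.shape[0]
        u = self.to_u(f_cls).view(B, self.out_dim, self.rank)
        v = self.to_v(f_cls).view(B, self.rank, self.in_dim)
        w = torch.bmm(u, v)                        # (B, out_dim, in_dim) text-conditioned weight
        b = self.to_b(f_cls).unsqueeze(1)          # (B, 1, out_dim)
        return torch.einsum("boi,bni->bno", w, x) + b

# usage: DynamicLinear(96, 96)(torch.randn(2, 49, 96), torch.randn(2, 768))
```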
The channel attention map A_c is computed first. For the visual feature map Z_m generated by the Swin module at each stage, spatial information is aggregated by average pooling and max pooling to produce the pooled features Z_avg and Z_max. The pooled features are then processed by a dynamic linear layer (DL) and a ReLU function, and the processed average-pooled and max-pooled features are summed and passed through a sigmoid function to obtain the channel attention map:

A_c = Sigmoid(DL(ReLU(DL(Z_avg))) + DL(ReLU(DL(Z_max))))

The visual features Z_m are multiplied element-wise with A_c to obtain the channel-refined visual features:

Z_c = A_c ⊗ Z_m

The spatial attention map A_s is computed next. Instead of compressing the channel dimension by pooling, a dynamic linear layer is used to reduce the channel dimension so as to learn the regions relevant to the query text, and a sigmoid function generates the spatial attention map:

A_s = Sigmoid(DL(Z_c)),  Z'_m = A_s ⊗ Z_c

where A_s denotes the spatial attention map and Z'_m is the final output of the attention module.
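A minimal sketch of this channel-then-spatial refinement, reusing the DynamicLinear sketch above; the pooling over tokens and the reduction ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QueryGuidedAttention(nn.Module):
    """Channel and spatial attention conditioned on the query-text context feature."""
    def __init__(self, dim, text_dim=768, reduction=4):
        super().__init__()
        self.ch_down = DynamicLinear(dim, dim // reduction, text_dim)
        self.ch_up = DynamicLinear(dim // reduction, dim, text_dim)
        self.spatial = DynamicLinear(dim, 1, text_dim)

    def forward(self, z, f_cls):
        # z: (B, N, C) visual tokens of one Swin stage; f_cls: (B, text_dim)
        avg = z.mean(dim=1, keepdim=True)                 # (B, 1, C) average pooling over tokens
        mx = z.max(dim=1, keepdim=True).values            # (B, 1, C) max pooling over tokens
        a_c = torch.sigmoid(self.ch_up(torch.relu(self.ch_down(avg, f_cls)), f_cls)
                            + self.ch_up(torch.relu(self.ch_down(mx, f_cls)), f_cls))
        z_c = z * a_c                                      # channel refinement
        a_s = torch.sigmoid(self.spatial(z_c, f_cls))      # (B, N, 1) spatial attention
        return z_c * a_s                                   # spatially refined output

# usage: QueryGuidedAttention(96)(torch.randn(2, 49, 96), torch.randn(2, 768))
```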
(4) Multi-scale feature fusion
Through the hierarchical structure of the Swin-Transformer, MQVL obtains visual feature maps at 4 different scales, with resolutions of H/4 × W/4, H/8 × W/8, H/16 × W/16 and H/32 × W/32 respectively. To fuse the visual feature maps obtained at the different stages effectively, MQVL applies average pooling that halves the spatial resolution of the multi-scale visual features; that is, the visual feature map Z'_m generated at the m-th stage (1 <= m <= 3) is average-pooled so that its dimensions match those of the (m+1)-th stage, and the mean of the two visual feature maps is taken. Finally, the fused visual feature map is flattened into the sequence F_v, which serves as the input of the following multi-modal reasoning module.
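A minimal sketch of this coarse-to-fine averaging; the 2 × 2, stride-2 pooling is an assumption consistent with the halving of resolution between Swin stages, and matching channel dimensions are assumed for simplicity.

```python
import torch
import torch.nn.functional as F

def fuse_multiscale(feature_maps):
    """feature_maps: list of 4 tensors (B, C, H_m, W_m) from Swin stages 1-4,
    each stage at half the resolution of the previous one."""
    fused = feature_maps[0]
    for m in range(1, len(feature_maps)):
        # Pool the finer map down to the next stage's resolution, then average the two.
        pooled = F.avg_pool2d(fused, kernel_size=2, stride=2)
        # NOTE: assumes channel dims already match; a 1x1 conv projection would be needed otherwise.
        fused = (pooled + feature_maps[m]) / 2
    return fused.flatten(2).transpose(1, 2)      # (B, H*W, C) sequence for the reasoning module

# usage with equal channel dims for illustration:
maps = [torch.randn(1, 96, s, s) for s in (56, 28, 14, 7)]
seq = fuse_multiscale(maps)
```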
(5) Multi-stage reasoning module
The relevance between the generated visual features and the query text is coarse-grained; to obtain more accurate localization, fine-grained relevance must be established. MQVL therefore performs iterative reasoning with a multi-stage decoder, repeatedly letting the visual and language information interact through a cross-attention mechanism, which reduces ambiguity during inference and gradually localizes the final target object.
Following the decoder-layer setting of VLTVG, the number of decoder layers in MQVL is set to 6, corresponding to 6 stages. Each stage has the same network architecture, and the feature output of each decoder stage is used as the target query object feature input to the next stage, so reasoning proceeds iteratively. Specifically, in the first stage a learnable query object q_0 is set as the initial representation of the target object and input into the first decoder layer. Through the multi-head cross-attention module, q_0 interacts with the text embedding F_t and the visual features F_v and gathers from the visual features the features related to the query text object; the first-stage target object feature q_1 is then obtained through a feed-forward network (FFN) and layer normalization. The object feature q_1 generated by the first stage is input to the decoder as the target object of the second stage, whose processing is identical to that of the first stage, and the optimal target object is obtained through 6 stages of iterative reasoning. The target object is updated at each stage i (1 <= i <= 6) as:

q'_i = LN(q_{i-1} + CrossAttn(q_{i-1}, F_t, F_v))
q_i = LN(q'_i + FFN(q'_i))

where LN denotes layer normalization and the FFN is composed of two linear projection layers and a ReLU activation function.
By dynamically updating the query object q_i at the different decoder stages, each stage can attend to different parts of the query text description, so the target object is located more finely and more complete target object features are aggregated, which in turn yields a more accurate visual representation of the target object described by the query text.
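A minimal sketch of one such reasoning stage and the 6-stage loop; the head count, feature dimension and the separate text/visual cross-attention are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ReasoningStage(nn.Module):
    """One decoder stage: cross-attend the query object to text and visual features."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, q, f_t, f_v):
        # q: (B, 1, dim) target query object; f_t: (B, N_t, dim) text; f_v: (B, N_v, dim) visual
        q = self.norm1(q + self.text_attn(q, f_t, f_t)[0] + self.vis_attn(q, f_v, f_v)[0])
        return self.norm2(q + self.ffn(q))

# six stages applied iteratively, each reusing the previous stage's output as its query
stages = nn.ModuleList(ReasoningStage() for _ in range(6))
q = torch.zeros(2, 1, 256)          # stands in for the learnable initial query object
f_t, f_v = torch.randn(2, 12, 256), torch.randn(2, 196, 256)
outputs = []
for stage in stages:
    q = stage(q, f_t, f_v)
    outputs.append(q)               # every stage's output feeds the box-prediction MLP
```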
(6) Query object localization
MQVL feeds the target object features output by each stage of the multi-modal reasoning module into an MLP with a ReLU activation function; the target object coordinates output by each intermediate stage are used to compute the loss function, and the output of the last stage is taken as the final coordinate position of the target object.
MQVL outputs the bounding-box coordinates of the final target object through the final MLP, and computes and sums the loss between the bounding box predicted by each decoder stage and the ground-truth box. Letting b_i (i = 1, ..., 6) denote the target box coordinates predicted at the i-th decoder stage and b_gt denote the ground-truth box, the training objective is:

L = Σ_{i=1}^{6} ( λ_giou · L_GIoU(b_i, b_gt) + λ_1 · L_1(b_i, b_gt) )

where L_GIoU and L_1 are the GIoU and L1 loss functions respectively, and λ_giou and λ_1 are hyper-parameters that balance the two losses during training.
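A minimal sketch of this training objective using torchvision's generalized IoU loss; the box format and the loss weights are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss

def localization_loss(stage_boxes, gt_box, w_giou=2.0, w_l1=5.0):
    """stage_boxes: list of (B, 4) predicted boxes (x1, y1, x2, y2), one per decoder stage.
    gt_box: (B, 4) ground-truth box in the same format."""
    total = 0.0
    for pred in stage_boxes:                      # sum the loss over all decoder stages
        giou = generalized_box_iou_loss(pred, gt_box, reduction="mean")
        l1 = F.l1_loss(pred, gt_box)
        total = total + w_giou * giou + w_l1 * l1
    return total

# usage (boxes in x1, y1, x2, y2 format):
# loss = localization_loss([torch.tensor([[12., 14., 58., 62.]])] * 6,
#                          torch.tensor([[10., 10., 60., 60.]]))
```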
S4, after the text descriptive sentences of the pictures in the step S2 are obtained, inputting the text descriptive sentences into a MQVL model in the step S3 to locate the disease positions.
S5, constructing the CNN-Transformer-based dual-stream multi-modal few-shot recognition model (CTMF)
CTMF employs a dual-stream architecture that attends to the global information and local information of the current task simultaneously. The Swin Transformer network, with its shifted-window mechanism, can better capture global feature information and perform global information interaction, which greatly improves the classification accuracy and generalization ability of the multi-modal few-shot model in classification tasks. The proposed model comprises a dual embedding module, a feature fusion module and a metric module. The dual embedding module consists of two branches, a local branch and a global branch. Given a support sample x_s ∈ S and a query sample x_q ∈ Q, both are input to the global and local branches of the model. In the global branch, a Swin Transformer is used as the feature extractor, and the extracted image features are input into the feature fusion module. Meanwhile, the local branch uses a ResNet to obtain local features, which are also sent to the feature fusion module. Finally, the local and global branch features are fused and input into a cosine function for classification.
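A minimal sketch of the dual-stream embedding and cosine metric; the timm backbones (with ResNet18 standing in for the ResNet12 described below) and the simple concatenation fusion are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import timm

class DualStreamEncoder(nn.Module):
    """Global (Swin Transformer) + local (ResNet) embedding, fused by concatenation."""
    def __init__(self):
        super().__init__()
        # num_classes=0 makes timm return pooled feature vectors instead of logits.
        self.global_branch = timm.create_model("swin_tiny_patch4_window7_224",
                                               pretrained=False, num_classes=0)
        # Stand-in for the ResNet12 local branch described below.
        self.local_branch = timm.create_model("resnet18", pretrained=False, num_classes=0)

    def forward(self, x):                                   # x: (B, 3, 224, 224)
        return torch.cat([self.global_branch(x), self.local_branch(x)], dim=-1)

def cosine_classify(support_feats, support_labels, query_feats, n_way):
    """Build one prototype per class and score queries by cosine similarity."""
    protos = torch.stack([support_feats[support_labels == c].mean(0) for c in range(n_way)])
    return F.cosine_similarity(query_feats.unsqueeze(1), protos.unsqueeze(0), dim=-1)

# usage: 5-way episode with pre-computed 1280-d (768 + 512) features
# scores = cosine_classify(torch.randn(25, 1280), torch.arange(5).repeat(5),
#                          torch.randn(10, 1280), 5)
```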
(1) Local branching
The effect of the model is tested using a ResNet backbone as the local embedding module to obtain picture features F ∈ R^(H × W × C). ResNet12 consists of 4 consecutive basic blocks, with the number of filters set to 64-128-256-512. Each block contains three convolutional layers with a kernel size of 3 × 3, three batch-normalization layers, a ReLU activation layer and a 2 × 2 pooling layer with a stride of 2. The Swin module of the global branch is identical to the Swin module of MQVL.
(2) Dual channel mixing attention
Considering that a single model has difficulty learning global information and local details simultaneously, and inspired by the class-attribute-guided mixed attention of AGAM, a dual-path mixed attention is designed to fuse the auxiliary modality. In Multimodal-Plant, a 768-dimensional feature vector is obtained through BERT. After the global and local visual features are obtained, max-pooling and average-pooling operations are applied to them simultaneously, and the auxiliary-modality features are concatenated to each pooled result. By performing adaptive feature refinement along the channel and spatial dimensions, the generalization of the model is intended to be improved.
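A minimal sketch of this dual-path mixed attention; the pooling choices, projection sizes and the way the 768-dimensional BERT vector is concatenated are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DualChannelMixedAttention(nn.Module):
    """Refine visual features along channel and spatial dims, guided by a BERT text vector."""
    def __init__(self, channels, text_dim=768, reduction=4):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels + text_dim, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, feat, text_vec):
        # feat: (B, C, H, W) visual features; text_vec: (B, 768) BERT description feature
        B, C, _, _ = feat.shape
        avg = feat.mean(dim=(2, 3))                      # (B, C) average-pooled descriptor
        mx = feat.amax(dim=(2, 3))                       # (B, C) max-pooled descriptor
        # Concatenate the auxiliary text modality to each pooled descriptor.
        ch = torch.sigmoid(self.channel_mlp(torch.cat([avg, text_vec], 1))
                           + self.channel_mlp(torch.cat([mx, text_vec], 1)))
        feat = feat * ch.view(B, C, 1, 1)                # channel refinement
        sp = torch.sigmoid(self.spatial_conv(torch.cat(
            [feat.mean(1, keepdim=True), feat.amax(1, keepdim=True)], 1)))
        return feat * sp                                 # spatial refinement

# usage: DualChannelMixedAttention(512)(torch.randn(2, 512, 7, 7), torch.randn(2, 768))
```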
S5, the image cropped at the disease position located by the visual language localization model and the global leaf image are acquired and input into the CTMF model for disease identification and classification.
S6, after the disease type is identified in S5, the picture is saved and an early warning is issued.
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the above-described embodiments, and that the above-described embodiments and descriptions are only preferred embodiments of the present invention, and are not intended to limit the invention, and that various changes and modifications may be made therein without departing from the spirit and scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.
Claims (5)
1. A multimodal crop disease phenotype collaborative analysis model, comprising the steps of:
S1, constructing a multi-mode data set;
the multi-modal disease dataset is built by crowdsourcing, namely botanists design the options and a large number of non-professionals are used to quickly acquire disease text descriptions; Chinese text descriptions are attached to each type of image, and image-text pairs are formed by random combination;
S2, constructing a crop disease phenotype text generation model based on improved CNN and LSTM, generating a text for a plant image, and generating a text containing crop disease phenotype information;
S3, constructing a visual language positioning model based on query text guidance and multi-stage reasoning, wherein the visual language positioning model comprises three modules, namely a query text feature extraction module, a query text guidance visual feature generation module and a multi-stage reasoning module;
The query text feature extraction module is used for encoding the query text to generate text embeddings; the query-text-guided visual feature generation module is used for introducing the context information of the query text encoded by the query text feature extraction module into each level of a Swin-Transformer architecture, guiding the learning of visual features at different scales by means of an attention mechanism, and aggregating the visual features at different scales to obtain visual features related to the query text;
S4, inputting the text descriptive sentences of the plant images obtained in the step S2 into a visual language positioning model in the step S3 to position the disease position;
S5, constructing a CNN-Transformer-based dual-stream multi-modal few-shot recognition model;
The multi-modal few-shot recognition model comprises a dual-embedding module, a feature fusion module and a metric module, wherein the dual-embedding module consists of a local branch and a global branch, and a plant image support sample and a query sample are simultaneously input into the global branch and the local branch of the model;
and S6, after the disease type is identified in S5, the picture is saved and an early warning is issued.
2. The multi-modal crop disease phenotype collaborative analysis model according to claim 1, wherein the disease dataset comprises 24 diseases, including apple scab, apple rust, cherry powdery mildew, corn rust, corn leaf blight, grape black rot, grape leaf blight, peach bacterial spot, pepper bacterial spot, potato early blight, potato late blight, tomato bacterial spot, tomato leaf mold, tomato leaf spot, apple black rot, pumpkin powdery mildew, tomato early blight, tomato yellow leaf curl virus, corn gray leaf spot, citrus huanglongbing, strawberry leaf scorch, tomato late blight, tomato round spot and tomato mosaic virus.
3. The multi-modal crop disease phenotype collaborative analysis model according to claim 1, wherein the disease text description requires non-professionals to select options from 5 angles of quantity, color, shape, characteristics and lesion area; after selection, a program collects the options to form text description sentences, and potential annotators must achieve more than 90% accuracy on a test of basic image-annotation knowledge before answering the questions.
4. The multi-modal crop disease phenotype collaborative analysis model according to claim 1, wherein the multi-modal few-shot recognition model employs a dual-stream architecture that attends to the global and local information of the current task simultaneously.
5. A computer device comprising a memory and a processor, wherein the memory has stored therein a computer program, the processor being configured to run the computer program to execute the multimodal crop disease phenotype collaborative analysis model of claim 1.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310828903.8A CN116778391B (en) | 2023-07-07 | 2023-07-07 | Multi-mode crop disease phenotype collaborative analysis model and device |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310828903.8A CN116778391B (en) | 2023-07-07 | 2023-07-07 | Multi-mode crop disease phenotype collaborative analysis model and device |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN116778391A CN116778391A (en) | 2023-09-19 |
| CN116778391B true CN116778391B (en) | 2025-09-16 |
Family
ID=88008029
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310828903.8A Active CN116778391B (en) | 2023-07-07 | 2023-07-07 | Multi-mode crop disease phenotype collaborative analysis model and device |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN116778391B (en) |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117708761B (en) * | 2024-02-06 | 2024-05-03 | 四川省亿尚农业旅游开发有限公司 | System and method for raising seedlings of hippeastrum with fusion of multi-index environmental conditions |
| CN118506349B (en) * | 2024-07-18 | 2025-01-21 | 安徽高哲信息技术有限公司 | Training method for grain identification model, grain identification method, equipment and medium |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114298158A (en) * | 2021-12-06 | 2022-04-08 | 湖南工业大学 | A Multimodal Pre-training Method Based on Linear Combination of Graphics and Text |
| CN115048537A (en) * | 2022-07-11 | 2022-09-13 | 河北农业大学 | Disease recognition system based on image-text multi-mode collaborative representation |
Family Cites Families (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108764268A (en) * | 2018-04-02 | 2018-11-06 | 华南理工大学 | A kind of multi-modal emotion identification method of picture and text based on deep learning |
| KR102425487B1 (en) * | 2019-10-25 | 2022-07-27 | 전북대학교산학협력단 | Method and system for glocal description of phytopathology based on Deep learning |
| CN113241135B (en) * | 2021-04-30 | 2023-05-05 | 山东大学 | Disease risk prediction method and system based on multi-modal fusion |
| CN116129164A (en) * | 2022-09-09 | 2023-05-16 | 广西大学 | A Crop Disease Recognition Model Based on Transformer Hybrid Architecture |
| CN115455970A (en) * | 2022-09-13 | 2022-12-09 | 北方民族大学 | Image-text combined named entity recognition method for multi-modal semantic collaborative interaction |
| CN115937689B (en) * | 2022-12-30 | 2023-08-11 | 安徽农业大学 | A technology for intelligent identification and monitoring of agricultural pests |
| CN116152810A (en) * | 2023-02-27 | 2023-05-23 | 中国科学院合肥物质科学研究院 | Visual positioning method and device based on hierarchical cross-modal contextual attention mechanism |
2023
- 2023-07-07 CN CN202310828903.8A patent/CN116778391B/en active Active
Patent Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114298158A (en) * | 2021-12-06 | 2022-04-08 | 湖南工业大学 | A Multimodal Pre-training Method Based on Linear Combination of Graphics and Text |
| CN115048537A (en) * | 2022-07-11 | 2022-09-13 | 河北农业大学 | Disease recognition system based on image-text multi-mode collaborative representation |
Also Published As
| Publication number | Publication date |
|---|---|
| CN116778391A (en) | 2023-09-19 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Wang et al. | Plant disease detection and classification method based on the optimized lightweight YOLOv5 model | |
| Temniranrat et al. | A system for automatic rice disease detection from rice paddy images serviced via a Chatbot | |
| Tzampazaki et al. | Machine vision—moving from Industry 4.0 to Industry 5.0 | |
| CN116778391B (en) | Multi-mode crop disease phenotype collaborative analysis model and device | |
| Aldakheel et al. | Detection and identification of plant leaf diseases using YOLOv4 | |
| Patel et al. | Deep learning-based plant organ segmentation and phenotyping of sorghum plants using LiDAR point cloud | |
| Liao et al. | A hybrid CNN-LSTM model for diagnosing rice nutrient levels at the rice panicle initiation stage | |
| Prashanthi et al. | Plant disease detection using Convolutional neural networks | |
| López-Barrios et al. | Green sweet pepper fruit and peduncle detection using mask R-CNN in greenhouses | |
| Teimouri et al. | Novel assessment of region-based CNNs for detecting monocot/dicot weeds in dense field environments | |
| Zhao et al. | Implementation of large language models and agricultural knowledge graphs for efficient plant disease detection | |
| Zhu et al. | Harnessing large vision and language models in agriculture: A review | |
| Hu et al. | Crop node detection and internode length estimation using an improved YOLOv5 model | |
| Hao et al. | CountShoots: Automatic detection and counting of slash pine new shoots using UAV imagery | |
| Jing et al. | Optimizing the yolov7-tiny model with multiple strategies for citrus fruit yield estimation in complex scenarios | |
| Geng et al. | Research on segmentation method of maize seedling plant instances based on uav multispectral remote sensing images | |
| Hou et al. | An occluded cherry tomato recognition model based on improved YOLOv7 | |
| Siri et al. | Enhanced deep learning models for automatic fish species identification in underwater imagery | |
| Yue et al. | Detection and Counting Model of Soybean at the Flowering and Podding Stage in the Field Based on Improved YOLOv5 | |
| Wang et al. | YOLOv5-AC: a method of uncrewed rice transplanter working quality detection | |
| Rahim et al. | Comparison of grape flower counting using patch-based instance segmentation and density-based estimation with convolutional neural networks | |
| Xu et al. | GLL-YOLO: A Lightweight Network for Detecting the Maturity of Blueberry Fruits | |
| Qian et al. | Cucumber Leaf Segmentation Based on Bilayer Convolutional Network | |
| Zhang et al. | E-CLIP: An Enhanced CLIP-Based Visual Language Model for Fruit Detection and Recognition | |
| Corrigan | An investigation into machine learning solutions involving time series across different problem domains |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |