CN118468138A

CN118468138A - A multimodal sentiment analysis model construction method, analysis model and analysis method

Info

Publication number: CN118468138A
Application number: CN202410842227.4A
Authority: CN
Inventors: 王硕; 陈柯舟; 何向南; 郝艳宾
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2024-06-27
Filing date: 2024-06-27
Publication date: 2024-08-09
Anticipated expiration: 2044-06-27
Also published as: CN118468138B

Abstract

The invention discloses a multi-mode emotion analysis model construction method, an analysis model and an analysis method, wherein the construction method comprises the following steps: obtaining a pre-training language sub-model and a training set; extracting sample features in a training set, constructing a multi-modal expert in a pre-training language sub-model, and introducing the multi-modal features into the pre-training language sub-model to obtain the output of the multi-modal expert; adding the mixed output to an output of a transducer layer of the pre-trained language sub-model; predicting emotion by adopting a preset MLP classifier to obtain a target model; and taking the trained target model as an emotion analysis model. By inserting the multi-modal expert into the pre-training language sub-model, the problem of overlarge parameter caused by using the pre-training language sub-model in multi-modal emotion analysis is solved; the characteristics of different modes at different times are analyzed through the multi-mode routing, and the token is distributed to an appropriate expert for processing, so that the judgment on the validity of the multi-mode characteristics is increased, and the classification accuracy is improved.

Description

A multimodal sentiment analysis model construction method, analysis model and analysis method

技术领域Technical Field

本申请涉及模型训练技术领域，具体的说是一种多模态情感分析模型构建方法、分析模型及分析方法。The present application relates to the field of model training technology, and specifically to a multimodal sentiment analysis model construction method, an analysis model, and an analysis method.

背景技术Background Art

情感分析的目标是从人类对话中识别出情感的表达。随着预训练语言子模型和相关Transformer结构的发展，文本的情感分析取得了重大的进步。然而，仅仅处理文本信息不能精确的表达人类的情感，这是因为不仅仅包含文字描述，也包含肢体动作，表情和语气等等。这些不同模态的信息可以共同反应说话者的情感。为了解决这个问题，人们提出了多模态情感分析 (MSA) 这一任务。其旨在融合多模态特征来分析情感。The goal of sentiment analysis is to identify emotional expressions from human conversations. With the development of pre-trained language sub-models and related Transformer structures, sentiment analysis of text has made significant progress. However, processing text information alone cannot accurately express human emotions, because it contains not only text descriptions, but also body movements, expressions, and tone of voice. These different modal information can jointly reflect the speaker's emotions. To solve this problem, the task of multimodal sentiment analysis (MSA) was proposed. It aims to integrate multimodal features to analyze emotions.

首先，目前在多模态情感分析(MSA)任务中往往采用全微调预训练模型的方法。尽管这种方法和从头训练相比有巨大的优势，但是随着数据量和预训练模型参数量的增加，使用这种方法训练模型会带来太多需要训练的参数，导致消耗资源过多。其次，现有的融合手段往往只关注怎样实现更加理想化的融合。然而，多模态情感分析需要分析视频和音频中隐藏的情感，这二者往往都包含较多的噪声和误导性内容，例如视频背景中人物的移动和音频中的环境噪音等等。分析这些无关的内容反而会对模型预测造成不利的影响。First, the method of fully fine-tuning pre-trained models is often used in multimodal sentiment analysis (MSA) tasks. Although this method has great advantages over training from scratch, as the amount of data and the number of pre-trained model parameters increase, using this method to train the model will bring too many parameters that need to be trained, resulting in excessive resource consumption. Secondly, existing fusion methods often only focus on how to achieve a more ideal fusion. However, multimodal sentiment analysis requires analyzing the emotions hidden in video and audio, both of which often contain more noise and misleading content, such as the movement of people in the background of the video and environmental noise in the audio. Analyzing these irrelevant contents will have an adverse effect on model prediction.

自然语言处理(NLP)中的常用解决方法是参数高效微调(PEFT)，其通过向预训练语言子模型中引入一小部分新的参数，就可以实现预训练模型向各类的下游任务上的迁移。但是这些PEFT方法仅仅支持单模态数据，限制了它们在多模态情感分析上的应用。此外，现有的多模态情感分析方法往往会在特征融合之前进行对齐。然而，他们忽视了对不同模态不同时间特征的分析，并且没有在融合之前对无效特征进行过滤。A common solution in natural language processing (NLP) is parameter efficient fine tuning (PEFT), which can achieve the migration of pre-trained models to various downstream tasks by introducing a small number of new parameters into the pre-trained language sub-model. However, these PEFT methods only support unimodal data, which limits their application in multimodal sentiment analysis. In addition, existing multimodal sentiment analysis methods often align features before fusion. However, they ignore the analysis of different temporal features of different modalities and do not filter invalid features before fusion.

基于上述分析，目前如何减少模型中需要训练的参数量，同时在特征融合之前对多模态信息的有效性做判断，从而提升模型的训练效率和准确度，是多模态情感分析中急需解决的问题。Based on the above analysis, how to reduce the number of parameters that need to be trained in the model and judge the validity of multimodal information before feature fusion to improve the training efficiency and accuracy of the model is an urgent problem to be solved in multimodal sentiment analysis.

发明内容Summary of the invention

在本实施例中提供了一种多模态情感分析模型构建方法、分析模型及分析方法，以解决相关技术中在多模态情感分析上的应用被限制以及未对无效特征进行过滤的问题。In this embodiment, a multimodal sentiment analysis model construction method, an analysis model and an analysis method are provided to solve the problem that the application of multimodal sentiment analysis in the related art is limited and invalid features are not filtered.

第一方面，本发明提供了一种多模态情感分析模型构建方法，所述构建方法包括：In a first aspect, the present invention provides a method for constructing a multimodal sentiment analysis model, the method comprising:

获取预训练语言子模型、训练集，所述训练集包括多个具有标签的训练样本；Obtaining a pre-trained language sub-model and a training set, wherein the training set includes a plurality of labeled training samples;

对所述训练集中的样本特征提取，得到多模态特征并对多模态特征对齐处理；Extracting features of samples in the training set to obtain multimodal features and aligning the multimodal features;

在预训练语言子模型中构建多模态专家，将所述多模态特征引入所述预训练语言子模型，得到多模态专家的输出；Constructing a multimodal expert in a pre-trained language sub-model, introducing the multimodal features into the pre-trained language sub-model, and obtaining an output of the multimodal expert;

计算多模态专家的门控值，通过离散路由集成所述多模态专家的输出，将所述多模态专家的输出和所述门控值的加权和作为混合专家网络的输出，将混合输出加至预训练语言子模型的Transformer层的输出；Calculating the gating value of the multimodal expert, integrating the output of the multimodal expert through discrete routing, taking the weighted sum of the output of the multimodal expert and the gating value as the output of the hybrid expert network, and adding the hybrid output to the output of the Transformer layer of the pre-trained language sub-model;

采用预设置的MLP分类器预测情感，得到目标模型；Use the preset MLP classifier to predict sentiment and obtain the target model;

所述训练样本输入目标模型进行分类训练，将训练后的目标模型作为情感分析模型。The training samples are input into the target model for classification training, and the trained target model is used as a sentiment analysis model.

在一个可选的实施例中，对所述训练集中的样本特征提取，得到多模态特征并对多模态特征对齐处理包括；In an optional embodiment, extracting sample features from the training set to obtain multimodal features and aligning the multimodal features includes:

通过所述预训练语言子模型获取训练集中的样本的文本特征；Acquire text features of samples in a training set through the pre-trained language sub-model;

通过预训练的编码器对训练集中的样本编码，得到底层特征，所述底层特征包括视频特征和音频特征。The samples in the training set are encoded by a pre-trained encoder to obtain underlying features, where the underlying features include video features and audio features.

在一个可选的实施例中，在预训练语言子模型中构建多模态专家，将所述多模态特征引入所述预训练语言子模型，得到多模态专家的输出包括：In an optional embodiment, a multimodal expert is constructed in a pre-trained language sub-model, and the multimodal features are introduced into the pre-trained language sub-model to obtain the output of the multimodal expert, including:

将多模态专家在第i个词语处输入记为三元组，通过计算获得多模态专家的输入；The multimodal expert input at the i-th word is recorded as a triple, and the multimodal expert input is obtained by calculation;

通过多模态专家捕捉融合过程的表征；Capturing the representation of the fusion process through multimodal experts;

通过上投影矩阵将融合表示映射至预训练语言子模型的嵌入维度，得到多模态专家的输出。The fused representation is mapped to the embedding dimension of the pre-trained language sub-model through the up-projection matrix to obtain the output of the multimodal expert.

在一个可选的实施例中，计算多模态专家的门控值，集成所述多模态专家的输出，将所述多模态专家的输出和所述门控值的加权和作为混合专家网络的输出，将混合输出加至预训练语言子模型的Transformer层的输出包括；In an optional embodiment, calculating a gating value of a multimodal expert, integrating the output of the multimodal expert, taking a weighted sum of the output of the multimodal expert and the gating value as the output of a hybrid expert network, and adding the hybrid output to the output of a Transformer layer of a pre-trained language sub-model includes:

通过多模态门计算多模态专家的门控值；Calculate the gating value of the multimodal expert through the multimodal gate;

通过离散路由集成多模态专家的输出；Integrate the output of multimodal experts through discrete routing;

将所述多模态专家的输出和所述门控值的加权和作为混合专家网络的输出；Taking the weighted sum of the output of the multimodal expert and the gating value as the output of the hybrid expert network;

将混合输出加至预训练语言子模型的Transformer层的输出。Add the mixed output to the output of the Transformer layer of the pre-trained language sub-model.

在一个可选的实施例中，采用预设置的MLP分类器预测情感，得到目标模型包括；In an optional embodiment, a preset MLP classifier is used to predict emotions, and the target model obtained includes:

将预训练语言子模型的Transformer最后一层视为最终的多模态表征；The last layer of the Transformer of the pre-trained language sub-model is regarded as the final multimodal representation;

将最终的多模态表征输入至MLP分类器预测情感；The final multimodal representation is input into the MLP classifier to predict sentiment;

得到目标模型。Get the target model.

在一个可选的实施例中，所述训练样本输入目标模型进行分类训练，将训练后的目标模型作为情感分析模型：In an optional embodiment, the training samples are input into a target model for classification training, and the trained target model is used as a sentiment analysis model:

通过梯度下降算法训练所述目标模型；Training the target model by a gradient descent algorithm;

计算损失函数，根据损失函数更新参数，所述参数包括混合专家网络参数和MLP分类器参数；Calculating a loss function, and updating parameters according to the loss function, wherein the parameters include hybrid expert network parameters and MLP classifier parameters;

使训练迭代次数达到预设值，得到训练后的目标模型；Make the number of training iterations reach the preset value and obtain the trained target model;

将训练后的目标模型作为情感分析模型。The trained target model is used as the sentiment analysis model.

与现有技术相比，本发明的多模态情感分析模型构建方法的有益效果如下：Compared with the prior art, the multimodal sentiment analysis model construction method of the present invention has the following beneficial effects:

通过向预训练语言子模型中插入多模态专家，并在训练时冻结预训练参数，解决了多模态情感分析使用预训练语言子模型带来的参数量过大的问题。借助混合专家网络将不同的适配器集成，解决了多模态情感分析中视频和音频中误导性内容影响模型预测的问题；进一步地，通过多模态路由分析不同模态在不同时间的特征，并将token分配给恰当的专家处理，相较于传统的融合方法，增加了对多模态特征有效性的判断，提升了分类准确度。By inserting multimodal experts into the pre-trained language sub-model and freezing the pre-trained parameters during training, the problem of too many parameters in multimodal sentiment analysis using pre-trained language sub-models is solved. By integrating different adapters with the help of hybrid expert networks, the problem of misleading content in video and audio affecting model prediction in multimodal sentiment analysis is solved; further, by analyzing the features of different modalities at different times through multimodal routing and assigning tokens to appropriate experts for processing, compared with traditional fusion methods, the judgment of the effectiveness of multimodal features is increased, and the classification accuracy is improved.

第二方面，本发明提出一种多模态情感分析模型，采用上述的多模态情感分析模型构建方法进行构建，所述分析模型包括预训练语言子模型、输入层、多模态融合层和输出层；In a second aspect, the present invention proposes a multimodal sentiment analysis model, which is constructed using the above-mentioned multimodal sentiment analysis model construction method, wherein the analysis model includes a pre-trained language sub-model, an input layer, a multimodal fusion layer, and an output layer;

所述输入层和预训练语言子模型用于对输入样本特征提取，以获取多模态特征；所述多模态融合层用于将所述多模态特征选择性融合，并将其输出与预训练语言子模型的Transformer层联合输出；所述输出层用于输出输入样本属于不同情感标签的概率。The input layer and the pre-trained language sub-model are used to extract features of the input samples to obtain multimodal features; the multimodal fusion layer is used to selectively fuse the multimodal features and jointly output the output with the Transformer layer of the pre-trained language sub-model; the output layer is used to output the probability that the input samples belong to different emotion labels.

在一个可选的实施例中，所述输入层包括编码器，所述编码器用于对输入样本的视频数据和音频数据处理得到视频特征和音频特征；所述预训练语言子模型用于对输入样本的文本信息提取，以获得文本特征。In an optional embodiment, the input layer includes an encoder, which is used to process the video data and audio data of the input sample to obtain video features and audio features; the pre-trained language sub-model is used to extract text information of the input sample to obtain text features.

在一个可选的实施例中，所述多模态融合层包括多模态专家和多模态路由，所述多模态路由用于对多模态特征选择性融合，所述多模态专家用于将多模态特征引入预训练语言子模型中，并将其输出与预训练语言子模型的Transformer层联合输出，所述预训练语言子模型的Transformer最后一层作为多模态表征。In an optional embodiment, the multimodal fusion layer includes a multimodal expert and a multimodal routing, the multimodal routing is used to selectively fuse multimodal features, the multimodal expert is used to introduce multimodal features into a pre-trained language sub-model, and outputs the multimodal features together with the Transformer layer of the pre-trained language sub-model, and the last Transformer layer of the pre-trained language sub-model is used as the multimodal representation.

第三方面，本发明提出一种多模态情感分析方法，所述分析方法包括：In a third aspect, the present invention proposes a multimodal sentiment analysis method, the analysis method comprising:

获取待识别情感数据集；Obtain the emotion dataset to be identified;

将待识别情感数据集的样本输入上述的多模态情感分析模型中；Input samples of the emotion data set to be identified into the above multimodal emotion analysis model;

判断所述待识别情感数据集中的样本属于不同情感标签的概率；Determine the probability that samples in the emotion data set to be identified belong to different emotion labels;

输出概率最大的作为分析结果。The output with the highest probability is taken as the analysis result.

与现有技术相比，本发明的多模态情感分析模型与多模态情感分析方法的有益效果与第一方面所述的多模态情感分析模型构建方法相同，故此处不再赘述。Compared with the prior art, the beneficial effects of the multimodal sentiment analysis model and the multimodal sentiment analysis method of the present invention are the same as the multimodal sentiment analysis model construction method described in the first aspect, so they will not be repeated here.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

图1为本发明实施例中多模态情感分析模型构建方法的流程图；FIG1 is a flow chart of a method for constructing a multimodal sentiment analysis model in an embodiment of the present invention;

图2为本发明实施例中预训练语言子模型文本特征提取的示意图；FIG2 is a schematic diagram of text feature extraction of a pre-trained language sub-model in an embodiment of the present invention;

图3为本发明实施例中多模态专家与多模态路由的原理示意图；FIG3 is a schematic diagram of the principles of multimodal experts and multimodal routing in an embodiment of the present invention;

图4为本发明实施例中多模态情感分析方法的流程图；FIG4 is a flow chart of a multimodal sentiment analysis method according to an embodiment of the present invention;

图5为本发明实施例中多模态情感分析模型构建装置的结构框图。FIG5 is a structural block diagram of a multimodal sentiment analysis model building device in an embodiment of the present invention.

具体实施方式DETAILED DESCRIPTION

为了更清楚地理解本申请的目的、技术方案和优点，下面结合附图和实施例，对本申请进行了描述和说明。In order to more clearly understand the purpose, technical solutions and advantages of the present application, the present application is described and illustrated below in conjunction with the accompanying drawings and embodiments.

除另作定义外，本申请所涉及的技术术语或者科学术语应具有本申请所属技术领域具备一般技能的人所理解的一般含义。在本申请中的“一”、“一个”、“一种”、“该”、“这些”等类似的词并不表示数量上的限制，它们可以是单数或者复数。在本申请中所涉及的术语“包括”、“包含”、“具有”及其任何变体，其目的是涵盖不排他的包含；例如，包含一系列步骤或模块（单元）的过程、方法和系统、产品或设备并未限定于列出的步骤或模块（单元），而可包括未列出的步骤或模块（单元），或者可包括这些过程、方法、产品或设备固有的其他步骤或模块（单元）。在本申请中所涉及的“连接”、“相连”、“耦接”等类似的词语并不限定于物理的或机械连接，而可以包括电气连接，无论是直接连接还是间接连接。在本申请中所涉及的“多个”是指两个或两个以上。“和/或”描述关联对象的关联关系，表示可以存在三种关系，例如，“A和/或B”可以表示：单独存在A，同时存在A和B，单独存在B这三种情况。通常情况下，字符“/”表示前后关联的对象是一种“或”的关系。在本申请中所涉及的术语“第一”、“第二”、“第三”等，只是对相似对象进行区分，并不代表针对对象的特定排序。Unless otherwise defined, the technical terms or scientific terms involved in this application shall have the general meaning understood by people with general skills in the technical field to which this application belongs. The words "one", "a", "the", "these" and the like in this application do not indicate a quantitative limitation, and they may be singular or plural. The terms "include", "comprise", "have" and any variants thereof involved in this application are intended to cover non-exclusive inclusions; for example, a process, method and system, product or device comprising a series of steps or modules (units) is not limited to the listed steps or modules (units), but may include unlisted steps or modules (units), or may include other steps or modules (units) inherent to these processes, methods, products or devices. The words "connect", "connected", "coupled" and the like involved in this application are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The "multiple" involved in this application refers to two or more. "And/or" describes the association relationship of associated objects, indicating that there may be three relationships, for example, "A and/or B" may mean: A exists alone, A and B exist at the same time, and B exists alone. Generally, the character "/" indicates that the objects associated with each other are in an "or" relationship. The terms "first", "second", "third", etc. involved in this application are only used to distinguish similar objects and do not represent a specific ordering of the objects.

在本发明实施例中提供了一种多模态情感分析模型构建方法，图1是本发明的多模态情感分析模型构建方法的流程图，如图1所示，该构建方法包括如下步骤：In an embodiment of the present invention, a method for constructing a multimodal sentiment analysis model is provided. FIG. 1 is a flow chart of the method for constructing a multimodal sentiment analysis model of the present invention. As shown in FIG. 1 , the construction method includes the following steps:

S101、获取预训练语言子模型和训练集，训练集包括多个具有标签的训练样本；S101, obtaining a pre-trained language sub-model and a training set, where the training set includes a plurality of labeled training samples;

需要说明的是，本实施例中，预训练语言子模型采用基于层Transformer的预训练语言子模型，其可以提取训练样本的文本特征，需要进一步说明的是，训练集中的训练样本包括视频数据、音频数据和文本数据。It should be noted that in this embodiment, the pre-trained language sub-model is based on The pre-trained language sub-model of the layer Transformer can extract text features of training samples. It should be further explained that the training samples in the training set include video data, audio data and text data.

S102、对训练集中的样本特征提取，得到多模态特征并对多模态特征对齐处理；本实施例中，多模态特征包括视频特征、音频特征和文本特征；S102, extracting sample features in the training set to obtain multimodal features and aligning the multimodal features; in this embodiment, the multimodal features include video features, audio features, and text features;

具体地，对情感数据集中样本特征提取，得到多模态特征包括并对多模态特征对齐处理包括；Specifically, extracting sample features from the sentiment data set to obtain multimodal features includes and aligning the multimodal features includes;

通过预训练语言子模型获取训练集中的样本的文本特征；Obtain text features of samples in the training set through the pre-trained language sub-model;

结合图2和图3所示，将文本数据送入预训练语言子模型的分词器和嵌入层，得到文本序列,其中，是文本序列的长度。是预训练语言子模型的嵌入维度。As shown in Figure 2 and Figure 3, the text data is fed into the word segmenter and embedding layer of the pre-trained language sub-model to obtain the text sequence ,in, is the length of the text sequence. is the embedding dimension of the pre-trained language sub-model.

将第个Transformer层的输入和输出分别记为和，其中，满足The first The input and output of the Transformer layer are recorded as and ,in, satisfy

； ;

根据上式对每层Transformer的输入使用预训练的自注意力层对其做进一步的编码。具体地，自注意力层采用LoRA进行低秩适配（LoRA是低秩微调技术）。According to the above formula, the input of each layer of Transformer is further encoded using the pre-trained self-attention layer. Specifically, the self-attention layer uses LoRA for low-rank adaptation (LoRA is a low-rank fine-tuning technology).

其表达式为：。Its expression is: .

通过预训练的编码器对训练集中的样本编码，得到底层特征，底层特征包括视频特征和音频特征。The samples in the training set are encoded by a pre-trained encoder to obtain the underlying features, which include video features and audio features.

本实施例中，使用预训练的编码器对视频数据和音频数据编码，得到底层特征；底层特征包括视频特征和音频特征。In this embodiment, a pre-trained encoder is used to encode the video data. and audio data Encode to obtain underlying features; the underlying features include video features and audio features.

其中，和是冻结的视频和音频特征提取工具。和是提取的视频特征与音频特征，和是视频和音频的特征维度，和是视频和音频的序列长度。in, and It is a frozen video and audio feature extraction tool. and are the extracted video and audio features, and is the feature dimension of video and audio, and is the sequence length of video and audio.

将多模态特征对齐处理；Align multimodal features;

具体地，将视频特征和音频特征进行字对齐，针对每个文本特征中词语出现的时间，将对应的视频特征和音频特征在该时间段内取均值，使三种模态特征的序列长度相等，即，将其记作。Specifically, the video features and audio features are word-aligned, and for each word appearance time in the text feature, the corresponding video features and audio features are averaged within the time period to make the sequence lengths of the three modal features equal, that is, , which is recorded as .

S103、在预训练语言子模型中构建多模态专家，将多模态特征引入预训练语言子模型，得到多模态专家的输出；S103, constructing a multimodal expert in the pre-trained language sub-model, introducing the multimodal features into the pre-trained language sub-model, and obtaining the output of the multimodal expert;

需要说明的是，为了使预训练语言子模型执行情感分析任务，因此需要将多模态特征引入预训练语言子模型中，对于每种模态，设计一定数量的专家（并行适配器）来捕捉其位于该时间的特征表示。It should be noted that in order to enable the pre-trained language sub-model to perform sentiment analysis tasks, multimodal features need to be introduced into the pre-trained language sub-model. For each modality, a certain number of experts (parallel adapters) are designed to capture its feature representation at that time.

具体地，在预训练语言子模型中构建多模态专家，将多模态特征引入预训练语言子模型，得到多模态专家的输出包括：Specifically, a multimodal expert is constructed in the pre-trained language sub-model, and multimodal features are introduced into the pre-trained language sub-model. The output of the multimodal expert includes:

结合上述，本实施例中，多模态专家包括文本专家、视频专家和音频专家，多模态特征包括文本特征、视频特征和音频特征，其分别对应了一种模态。In combination with the above, in this embodiment, the multimodal experts include text experts, video experts and audio experts, and the multimodal features include text features, video features and audio features, which respectively correspond to a modality.

因此，将多模态专家在第i个词语（token）处的输入记为三元组（）；Therefore, the input of the multimodal expert at the i-th word (token) is recorded as a triple ( );

其中，，使用下式得到文本专家、视频专家和音频专家的输入；in, , use the following formula to get the input of text expert, video expert and audio expert;

； ;

其中，式中的{；}表示拼接。Among them, {;} in the formula represents splicing.

首先按照下式得到降维后的融合表示；First, the fusion representation after dimension reduction is obtained according to the following formula;

； ;

其中，，是适配器的内在维度，是非线性激活函数，为下投影矩阵，是偏置向量。。适配器的内在维度远远小于是预训练语言子模型的嵌入维度，以确保并行适配器中的参数量远远小于预训练语言子模型中的参数量。in, , is the intrinsic dimension of the adapter, is a nonlinear activation function, is the down-projection matrix, is the bias vector. The intrinsic dimensions of the adapter Much smaller than the embedding dimension of the pre-trained language sub-model , to ensure that the number of parameters in the parallel adapter is much smaller than the number of parameters in the pre-trained language sub-model.

进一步地，按照下式用上投影矩阵将降维后的融合表示映射至预训练语言子模型的嵌入维度维度；Further, the projection matrix is used as follows The fusion after dimensionality reduction is represented as Mapping to the embedding dimension of the pre-trained language sub-model Dimension;

； ;

其中，是基于第i个词（token）关于模态m的第n个专家的输出。是并行适配器的索引，是模态m的专家数量。为上投影矩阵，是偏置向量，是一个初始化为1的可学习的缩放因子，其可控制对应的并行适配器的影响。in, It is the output of the nth expert on modality m based on the i-th word (token). is the index of the parallel adapter, is the number of experts for modality m. is the upper projection matrix, is the bias vector, is a learnable scaling factor initialized to 1 that controls the influence of the corresponding parallel adapter.

S104、计算多模态专家的门控值，通过离散路由集成多模态专家的输出，将多模态专家的输出和门控值的加权和作为混合专家网络的输出，将混合输出加至预训练语言子模型的Transformer层的输出。S104, calculating the gating value of the multimodal expert, integrating the output of the multimodal expert through discrete routing, taking the weighted sum of the output of the multimodal expert and the gating value as the output of the hybrid expert network, and adding the hybrid output to the output of the Transformer layer of the pre-trained language sub-model.

需要说明的是，为了避免无关信息引入，本实施例中，对多模态特征选择性融合，为每个专家计算对应的门控值，根据门控值选择前K个专家。It should be noted that, in order to avoid the introduction of irrelevant information, in this embodiment, multimodal features are selectively fused, a corresponding gating value is calculated for each expert, and the first K experts are selected according to the gating value.

具体地，计算多模态专家的门控值，集成多模态专家的输出，将多模态专家的输出和门控值的加权和作为混合专家网络的输出，将混合输出加至预训练语言子模型的Transformer层的输出包括；Specifically, the gating value of the multimodal expert is calculated, the output of the multimodal expert is integrated, the weighted sum of the output of the multimodal expert and the gating value is used as the output of the hybrid expert network, and the hybrid output is added to the output of the Transformer layer of the pre-trained language sub-model, including;

进一步地，通过下式对门控值计算：Furthermore, the gate value is calculated by the following formula:

； ;

其中，，是第i个token的门控向量。in, , is the gating vector of the i-th token.

按照下式得到门控向量中的前K个值：The first K values in the gated vector are obtained as follows:

； ;

其中，是门控向量前K大的值。是门控向量前K大值的索引。in, is the gate vector The first K largest values. is the gate vector The indices of the top K values.

将多模态专家的输出和门控值的加权和作为混合专家网络的输出；The weighted sum of the multimodal expert output and the gate value is used as the output of the hybrid expert network;

具体地，多模态专家和离散路由组成混合专家网络，通过softmax运算对前K个值进行归一化，按照下式将选中专家的输出和门控值的加权和作为混合专家的输出：Specifically, the multimodal expert and discrete routing form a hybrid expert network. The first K values are normalized by softmax operation, and the weighted sum of the selected expert's output and the gated value is used as the output of the hybrid expert according to the following formula:

。 .

按照下式将混合专家网络的输出加入至预训练语言子模型的Transformer层的输出上：The output of the hybrid expert network is added to the output of the Transformer layer of the pre-trained language sub-model as follows:

； ;

其中，FFN是预训练语言子模型的前馈神经网络模块。Among them, FFN is a feedforward neural network module of the pre-trained language sub-model.

S105、采用预设置的MLP（多层感知机）分类器预测情感，得到目标模型；S105, using a preset MLP (multi-layer perceptron) classifier to predict sentiment and obtain a target model;

具体地，采用预设置的MLP分类器预测情感，得到目标模型包括；Specifically, the preset MLP classifier is used to predict emotions, and the target model includes;

将最终的多模态表征输入至MLP分类器预测情感，具体如下式所示：The final multimodal representation is input into the MLP classifier to predict sentiment, as shown in the following formula:

； ;

其中是第L层Transformer最后一个token的表征。in It is the representation of the last token of the L-th layer Transformer.

得到目标模型。Get the target model.

S106、训练样本输入目标模型进行分类训练，将训练后的目标模型作为情感分析模型。S106. The training samples are input into the target model for classification training, and the trained target model is used as a sentiment analysis model.

示例性地，给定样本标签，损失函数按照下式，采用平均绝对误差进行训练：For example, given a sample label , the loss function is trained using the mean absolute error according to the following formula:

； ;

具体地，训练样本输入目标模型进行分类训练，将训练后的目标模型作为情感分析模型：Specifically, the training samples are input into the target model for classification training, and the trained target model is used as the sentiment analysis model:

通过梯度下降算法训练目标模型；Train the target model through the gradient descent algorithm;

计算损失函数，根据损失函数更新参数，参数包括混合专家网络参数和MLP分类器参数；需要说明的是，参数还包括LORA的参数，通过更新参数以进一步优化目标模型。Calculate the loss function and update the parameters according to the loss function. The parameters include the hybrid expert network parameters and the MLP classifier parameters. It should be noted that the parameters also include the parameters of LORA. The target model is further optimized by updating the parameters.

通常在模型构建后得到MLP分类器，或者对MLP分类器进行参数更新或训练，以得到所需模型。Usually, an MLP classifier is obtained after the model is constructed, or parameters of the MLP classifier are updated or trained to obtain the desired model.

本发明实施例还提出一种多模态情感分析模型，采用上述的多模态情感分析模型构建方法进行构建，分析模型包括预训练语言子模型、输入层、多模态融合层和输出层；The embodiment of the present invention further proposes a multimodal sentiment analysis model, which is constructed using the above-mentioned multimodal sentiment analysis model construction method, and the analysis model includes a pre-trained language sub-model, an input layer, a multimodal fusion layer and an output layer;

输入层和预训练语言子模型用于对输入样本特征提取，以获取多模态特征；多模态融合层用于将多模态特征选择性融合，并将其输出与预训练语言子模型的Transformer层联合输出；输出层用于输出输入样本属于不同情感标签的概率。The input layer and the pre-trained language sub-model are used to extract features from input samples to obtain multimodal features; the multimodal fusion layer is used to selectively fuse multimodal features and jointly output their output with the Transformer layer of the pre-trained language sub-model; the output layer is used to output the probability that the input samples belong to different sentiment labels.

进一步地，输入层包括编码器，编码器用于对输入样本的视频数据和音频数据处理得到视频特征和音频特征；预训练语言子模型用于对输入样本的文本信息提取，以获得文本特征。Furthermore, the input layer includes an encoder, which is used to process the video data and audio data of the input sample to obtain video features and audio features; the pre-trained language sub-model is used to extract text information of the input sample to obtain text features.

更进一步地，多模态融合层包括多模态专家和多模态路由，多模态路由用于对多模态特征选择性融合，多模态专家用于将多模态特征引入预训练语言子模型中，并将其输出与预训练语言子模型的Transformer层联合输出，预训练语言子模型的Transformer最后一层作为多模态表征。Furthermore, the multimodal fusion layer includes a multimodal expert and a multimodal routing. The multimodal routing is used to selectively fuse multimodal features. The multimodal expert is used to introduce multimodal features into the pre-trained language sub-model and jointly output its output with the Transformer layer of the pre-trained language sub-model. The last Transformer layer of the pre-trained language sub-model is used as the multimodal representation.

本发明通过预训练语言子模型提取文本特征，通过编码器提取视频特征和音频特征，通过在预训练语言子模型中构建多模态专家，借助混合专家网络将不同的适配器集成，解决了多模态情感分析中视频和音频中误导性内容影响模型预测的问题；通过多模态路由分析不同模态在不同时间的特征，并将token分配给恰当的专家处理，相较于传统的融合方法，增加了对多模态特征有效性的判断，提升了分类准确度。The present invention extracts text features through a pre-trained language sub-model, extracts video features and audio features through an encoder, builds a multimodal expert in the pre-trained language sub-model, and integrates different adapters with the help of a hybrid expert network, thereby solving the problem of misleading content in video and audio affecting model prediction in multimodal sentiment analysis; by analyzing the features of different modalities at different times through multimodal routing, and assigning tokens to appropriate experts for processing, compared with traditional fusion methods, the judgment on the effectiveness of multimodal features is increased, and the classification accuracy is improved.

如图4所示，本发明实施例还提出一种多模态情感分析方法，分析方法包括如下步骤：As shown in FIG4 , the embodiment of the present invention further proposes a multimodal sentiment analysis method, which includes the following steps:

S201、获取待识别情感数据集；S201, obtaining a dataset of emotions to be identified;

S202、将待识别情感数据集的样本输入如上述的多模态情感分析模型中；S202, inputting samples of the emotion data set to be identified into the multimodal emotion analysis model as described above;

S203、判断待识别情感数据集中的样本属于不同情感标签的概率；S203, determining the probability that the samples in the emotion data set to be identified belong to different emotion labels;

S204、输出概率最大的作为分析结果。S204. Output the one with the highest probability as the analysis result.

需要说明的是，待识别情感数据集中的样本包括文本数据、视频数据和音频数据。通过将样本输入多模态情感分析模型中，获取该样本属于不同情感标签的概率，从而预测情感分析结果。It should be noted that the samples in the emotion data set to be identified include text data, video data and audio data. By inputting the samples into the multimodal emotion analysis model, the probability of the samples belonging to different emotion labels is obtained, thereby predicting the emotion analysis results.

本发明实施例还提出一种多模态情感模型构建装置，包括：The embodiment of the present invention further provides a multimodal emotion model construction device, comprising:

获取模块100，用于获取预训练语言子模型、训练集，训练集包括多个具有标签的训练样本；An acquisition module 100 is used to acquire a pre-trained language sub-model and a training set, where the training set includes a plurality of labeled training samples;

多模态特征提取模块200，用于对训练集中的样本特征提取，得到多模态特征并对多模态特征对齐处理；The multimodal feature extraction module 200 is used to extract features of samples in the training set, obtain multimodal features and perform alignment processing on the multimodal features;

多模态专家构建模块300，用于在预训练语言子模型中构建多模态专家，将多模态特征引入预训练语言子模型，得到多模态专家的输出；A multimodal expert construction module 300 is used to construct a multimodal expert in a pre-trained language sub-model, introduce multimodal features into the pre-trained language sub-model, and obtain the output of the multimodal expert;

混合输出模块400，用于计算多模态专家的门控值，通过离散路由集成多模态专家的输出，将多模态专家的输出和门控值的加权和作为混合专家网络的输出，将混合输出加至预训练语言子模型的Transformer层的输出；A hybrid output module 400 is used to calculate the gating value of the multimodal expert, integrate the output of the multimodal expert through discrete routing, take the weighted sum of the output of the multimodal expert and the gating value as the output of the hybrid expert network, and add the hybrid output to the output of the Transformer layer of the pre-trained language sub-model;

情感预测模块500，采用预设置的MLP分类器预测情感，得到目标模型；The emotion prediction module 500 uses a preset MLP classifier to predict emotions and obtain a target model;

训练模块600，训练样本输入目标模型进行分类训练，将训练后的目标模型作为情感分析模型。In the training module 600, the training samples are input into the target model for classification training, and the trained target model is used as a sentiment analysis model.

通过获取模块100获取预训练语言子模型和训练集，通过多模态特征提取模块200对训练集中的样本特征提取，并对多模态特征对齐处理；通过多模态专家构建模块300在预训练语言子模型中构建多模态专家，将多模态特征引入预训练语言子模型，得到多模态专家的输出；通过混合输出模块400计算多模态专家的门控值，通过离散路由集成多模态专家的输出，将多模态专家的输出和门控值的加权和作为混合专家网络的输出，将混合输出加至预训练语言子模型的Transformer层的输出；通过情感预测模块500预测情感，得到目标模型；通过训练模块600对目标模型进行分类训练，以得到情感分析模型。The acquisition module 100 acquires the pre-trained language sub-model and the training set, the multi-modal feature extraction module 200 extracts the sample features in the training set, and aligns the multi-modal features; the multi-modal expert construction module 300 constructs the multi-modal expert in the pre-trained language sub-model, introduces the multi-modal features into the pre-trained language sub-model, and obtains the output of the multi-modal expert; the gating value of the multi-modal expert is calculated by the mixed output module 400, the output of the multi-modal expert is integrated by discrete routing, the weighted sum of the output of the multi-modal expert and the gating value is used as the output of the mixed expert network, and the mixed output is added to the output of the Transformer layer of the pre-trained language sub-model; the emotion is predicted by the emotion prediction module 500 to obtain the target model; the target model is classified and trained by the training module 600 to obtain the emotion analysis model.

本发明实施例还提供了一种电子设备，电子设备包括存储器以及处理器，存储器存储有至少一条计算机可执行指令，处理器被配置为运行计算机可执行指令，计算机可执行指令被处理器运行时以实现上述方法实施例，例如如下方法：An embodiment of the present invention further provides an electronic device, the electronic device comprising a memory and a processor, the memory storing at least one computer executable instruction, the processor being configured to execute the computer executable instruction, and the computer executable instruction being executed by the processor to implement the above method embodiment, for example, the following method:

获取预训练语言子模型、训练集，训练集包括多个具有标签的训练样本；Obtain a pre-trained language sub-model and a training set, where the training set includes multiple labeled training samples;

对训练集中的样本特征提取，得到多模态特征并对多模态特征对齐处理；Extract sample features from the training set, obtain multimodal features, and align the multimodal features;

在预训练语言子模型中构建多模态专家，将多模态特征引入预训练语言子模型，得到多模态专家的输出；Construct a multimodal expert in the pre-trained language sub-model, introduce multimodal features into the pre-trained language sub-model, and obtain the output of the multimodal expert;

计算多模态专家的门控值，通过离散路由集成多模态专家的输出，将多模态专家的输出和门控值的加权和作为混合专家网络的输出，将混合输出加至预训练语言子模型的Transformer层的输出；Calculate the gating value of the multimodal expert, integrate the output of the multimodal expert through discrete routing, take the weighted sum of the multimodal expert output and the gating value as the output of the hybrid expert network, and add the hybrid output to the output of the Transformer layer of the pre-trained language sub-model;

训练样本输入目标模型进行分类训练，将训练后的目标模型作为情感分析模型。The training samples are input into the target model for classification training, and the trained target model is used as the sentiment analysis model.

上述的存储器中的逻辑指令可以通过软件功能单元的形式实现并作为独立的产品销售或使用时，可以存储在一个计算机可读取存储介质中。基于这样的理解，本发明的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质中，包括若干指令用以使得一台计算机设备(可以是个人计算机，服务器，或者网络设备等)执行本发明各个实施例方法的全部或部分步骤。而前述的存储介质包括：U盘、移动硬盘、只读存储器(ROM，Read-OnlyMemory)、随机存取存储器(RAM，RandomAccessMemory)、磁碟或者光盘等各种可以存储程序代码的介质。The logic instructions in the above-mentioned memory can be implemented in the form of software functional units and can be stored in a computer-readable storage medium when sold or used as an independent product. Based on this understanding, the technical solution of the present invention can be essentially or partly embodied in the form of a software product that contributes to the prior art. The computer software product is stored in a storage medium, including several instructions to enable a computer device (which can be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods of each embodiment of the present invention. The aforementioned storage medium includes: U disk, mobile hard disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), disk or optical disk, etc. Various media that can store program codes.

本发明实施例还提供一种非暂态计算机可读存储介质，其上存储有计算机程序，该计算机程序被处理器执行时实现以执行上述各实施例提供的方法，例如包括：An embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the method provided in each of the above embodiments is implemented, for example, including:

通过以上的实施方式的描述，本领域的技术人员可以清楚地了解到各实施方式可借助软件加必需的通用硬件平台的方式来实现，当然也可以通过硬件。基于这样的理解，上述技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来，该计算机软件产品可以存储在计算机可读存储介质中，如ROM/RAM、磁碟、光盘等，包括若干指令用以使得一台计算机设备(可以是个人计算机，服务器，或者网络设备等)执行各个实施例或者实施例的某些部分的方法。Through the description of the above implementation methods, those skilled in the art can clearly understand that each implementation method can be implemented by means of software plus a necessary general hardware platform, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solution is essentially or the part that contributes to the prior art can be embodied in the form of a software product, and the computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a disk, an optical disk, etc., including a number of instructions for a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods of each embodiment or some parts of the embodiment.

最后应说明的是：以上实施例仅用以说明本发明的技术方案，而非对其限制；尽管参照前述实施例对本发明进行了详细的说明，本领域的普通技术人员应当理解：其依然可以对前述各实施例所记载的技术方案进行修改，或者对其中部分技术特征进行等同替换；而这些修改或者替换，并不使相应技术方案的本质脱离本发明各实施例技术方案的精神和范围。Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, rather than to limit it. Although the present invention has been described in detail with reference to the aforementioned embodiments, those skilled in the art should understand that they can still modify the technical solutions described in the aforementioned embodiments, or make equivalent replacements for some of the technical features therein. However, these modifications or replacements do not deviate the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for constructing a multimodal sentiment analysis model, characterized in that the construction method comprises:

Obtaining a pre-trained language sub-model and a training set, wherein the training set includes a plurality of labeled training samples;

Extracting features of samples in the training set to obtain multimodal features and aligning the multimodal features;

Constructing a multimodal expert in the pre-trained language sub-model, introducing the multimodal features into the pre-trained language sub-model, and obtaining an output of the multimodal expert;

Calculating the gating value of the multimodal expert, integrating the output of the multimodal expert through discrete routing, taking the weighted sum of the output of the multimodal expert and the gating value as the output of the hybrid expert network, and adding the hybrid output to the output of the Transformer layer of the pre-trained language sub-model;

Use the preset MLP classifier to predict sentiment and obtain the target model;

The training samples are input into the target model for classification training, and the trained target model is used as a sentiment analysis model.

2. The method for constructing a multimodal sentiment analysis model according to claim 1, characterized in that:

Extracting sample features from the training set to obtain multimodal features and aligning the multimodal features includes:

Acquire text features of samples in a training set through the pre-trained language sub-model;

The samples in the training set are encoded by a pre-trained encoder to obtain underlying features, where the underlying features include video features and audio features.

3. The method for constructing a multimodal sentiment analysis model according to claim 2, characterized in that a multimodal expert is constructed in a pre-trained language sub-model, the multimodal features are introduced into the pre-trained language sub-model, and the output of the multimodal expert is obtained, comprising:

The multimodal expert input at the i-th word is recorded as a triple, and the multimodal expert input is obtained by calculation;

Capturing the representation of the fusion process through multimodal experts;

The fused representation is mapped to the embedding dimension of the pre-trained language sub-model through the up-projection matrix to obtain the output of the multimodal expert.

4. The method for constructing a multimodal sentiment analysis model according to claim 3, characterized in that the gating value of the multimodal expert is calculated, the output of the multimodal expert is integrated, the weighted sum of the output of the multimodal expert and the gating value is used as the output of the hybrid expert network, and the hybrid output is added to the output of the Transformer layer of the pre-trained language sub-model;

Calculate the gating value of the multimodal expert through the multimodal gate;

Integrate the output of multimodal experts through discrete routing;

Taking the weighted sum of the output of the multimodal expert and the gating value as the output of the hybrid expert network;

Add the mixed output to the output of the Transformer layer of the pre-trained language sub-model.

5. The method for constructing a multimodal sentiment analysis model according to claim 4, wherein the target model is obtained by using a preset MLP classifier to predict sentiments, and the target model comprises:

The last layer of the Transformer of the pre-trained language sub-model is regarded as the final multimodal representation;

The final multimodal representation is input into the MLP classifier to predict sentiment;

Get the target model.

6. The method for constructing a multimodal sentiment analysis model according to claim 5, characterized in that the training samples are input into a target model for classification training, and the trained target model is used as a sentiment analysis model:

Training the target model by a gradient descent algorithm;

Calculating a loss function, and updating parameters according to the loss function, wherein the parameters include hybrid expert network parameters and MLP classifier parameters;

Make the number of training iterations reach the preset value and obtain the trained target model;

The trained target model is used as the sentiment analysis model.

7. A multimodal sentiment analysis model, constructed by the multimodal sentiment analysis model construction method according to any one of claims 1 to 6, characterized in that the analysis model comprises a pre-trained language sub-model, an input layer, a multimodal fusion layer and an output layer;

The input layer and the pre-trained language sub-model are used to extract features of the input samples to obtain multimodal features; the multimodal fusion layer is used to selectively fuse the multimodal features and jointly output the output with the Transformer layer of the pre-trained language sub-model; the output layer is used to output the probability that the input samples belong to different emotion labels.

8. A multimodal sentiment analysis model according to claim 7, characterized in that the input layer includes an encoder, which is used to process the video data and audio data of the input sample to obtain video features and audio features; the pre-trained language sub-model is used to extract text information of the input sample to obtain text features.

9. A multimodal sentiment analysis model according to claim 7, characterized in that the multimodal fusion layer includes a multimodal expert and a multimodal routing, the multimodal routing is used to selectively fuse multimodal features, the multimodal expert is used to introduce multimodal features into a pre-trained language sub-model, and outputs the multimodal features together with the Transformer layer of the pre-trained language sub-model, and the last Transformer layer of the pre-trained language sub-model is used as a multimodal representation.

10. A multimodal sentiment analysis method, characterized in that the analysis method comprises:

Obtain the emotion dataset to be identified;

Inputting a sample of the emotion data set to be identified into the multimodal emotion analysis model according to any one of claims 7 to 9;

Determine the probability that samples in the emotion data set to be identified belong to different emotion labels;

The output with the highest probability is taken as the analysis result.