CN114330279A - Cross-modal semantic consistency recovery method


Info

Publication number
CN114330279A
Authority
CN
China
Prior art keywords: attention, matrix, sentence, head, dimension
Legal status
Granted
Application number
CN202111638661.3A
Other languages
Chinese (zh)
Other versions
CN114330279B (en)
Inventor
杨阳
史文浩
宾燚
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Application filed by University of Electronic Science and Technology of China
Priority to CN202111638661.3A
Publication of CN114330279A
Application granted
Publication of CN114330279B
Current legal status: Active

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a cross-modal semantic consistency recovery method. The method first extracts basic features and context features for a text modality and an image modality, then maps the context features of both modalities into a common semantic space by linear projection to obtain cross-modal ordered-position attention information, and finally uses this ordered-position attention information to sort the out-of-order sentences, thereby restoring their coherence.

Description

A Cross-Modal Semantic Coherence Restoration Method

Technical Field

The invention belongs to the technical field of natural language processing and, more particularly, relates to a cross-modal semantic coherence restoration method.

Background

Coherence modeling has long been an important research topic in natural language processing: it aims to organize a set of sentences into a coherent text whose order is logically consistent. Although progress has been made, research on semantic coherence modeling is still limited to the single text modality. Existing methods for semantic coherence analysis and restoration are unimodal: given a set of sentences in the text modality, they typically adopt an encoder-decoder architecture and use a pointer network for sequence prediction.

Semantic coherence originally measured whether a text is linguistically meaningful; the notion can be extended to evaluate the logical, ordered, and consistent relationships among elements in various modalities. For humans, coherence modeling is a natural and essential ability for perceiving the world, one that lets us understand and perceive the world as a whole, so modeling the coherence of information is very important for promoting human perception and understanding of the physical world.

The current mainstream unimodal method for semantic coherence analysis and restoration is an autoregressive attention-based approach. It uses a Bi-LSTM to extract basic sentence feature vectors and, inspired by the self-attention mechanism, adopts a Transformer variant with positional encoding removed to extract reliable paragraph representations that are unaffected by the sentence input order, thereby obtaining the sentence features within the paragraph. Average pooling then yields a paragraph feature that initializes the hidden state of a recurrent-neural-network decoder, and a pointer network with greedy search or beam search recursively predicts the ordered, coherent composition of the paragraph, completing unimodal semantic coherence analysis and restoration.

Existing coherence-modeling work therefore focuses on the single text modality. During encoding, a bidirectional long short-term memory network extracts the basic feature vector of each sentence, the self-attention mechanism extracts sentence context features, and average pooling produces the paragraph feature; note that a Transformer variant with positional encoding removed is used here. During decoding, a pointer-network architecture built from LSTM units serves as the decoder: the basic sentence feature vectors are its inputs, the input of the first step is a zero vector, and the paragraph feature is the initial hidden state. Although such methods handle unimodal semantic coherence analysis and restoration effectively and keep improving unimodal performance, they ignore information integration and semantic consistency across modalities and lack cross-modal information.

Summary of the Invention

The purpose of the present invention is to overcome the deficiencies of the prior art and to provide a cross-modal semantic coherence restoration method that, based on the semantic consistency between the text and image modalities, effectively uses cross-modal information to guide semantic coherence restoration in the text modality.

To achieve the above purpose, the cross-modal semantic coherence restoration method of the present invention comprises the following steps:

(1) Let the out-of-order sentences whose semantic coherence is to be restored in the text modality be X = {x_1, x_2, ..., x_i, ..., x_m}, where x_i denotes the i-th sentence and m is the number of out-of-order sentences. Let a set of ordered, coherent images in the image modality be Y = {y_1, y_2, ..., y_j, ..., y_n}, where y_j denotes the j-th image and n is the number of images. The text modality and the image modality are assumed to share similar semantics.

(2) Obtain the basic features of the text modality and the image modality.

(2.1) Use a bidirectional long short-term memory (Bi-LSTM) network to obtain the basic features of the out-of-order sentences: input X into the Bi-LSTM, which outputs the basic sentence features E^s = {e^s_1, e^s_2, ..., e^s_m}, where e^s_i denotes the basic feature of the i-th sentence and has dimension 1×d.

(2.2) Use a convolutional neural network to obtain the basic features of the ordered, coherent images: input Y into the convolutional neural network, which outputs the basic image features E^v = {e^v_1, e^v_2, ..., e^v_n}, where e^v_j denotes the basic feature of the j-th image and has dimension 1×d.
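
The following is a minimal PyTorch sketch of step (2). It is an illustration only: the patent specifies just "a bidirectional long short-term memory network" and "a convolutional neural network", so the word-embedding size, the pooling over time steps, and the small stand-in CNN used here are assumptions.

```python
import torch
import torch.nn as nn

d = 512  # feature dimension d used in the embodiment below

class SentenceEncoder(nn.Module):
    """Bi-LSTM that turns m token sequences into m basic sentence features of dimension d."""
    def __init__(self, vocab_size=10000, emb_dim=300, hidden=d // 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # forward and backward hidden states concatenate to d dimensions
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, token_ids):                # (m, seq_len) integer tokens
        h, _ = self.lstm(self.embed(token_ids))  # (m, seq_len, d)
        return h.mean(dim=1)                     # (m, d): one vector per sentence

class ImageEncoder(nn.Module):
    """Tiny stand-in CNN that turns n images into n basic image features of dimension d."""
    def __init__(self):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, d))

    def forward(self, images):                   # (n, 3, H, W)
        return self.cnn(images)                  # (n, d): one vector per image
```

In practice the image branch would usually be a pretrained backbone; the tiny CNN above only keeps the sketch self-contained.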

(3) Obtain the context features of the text modality and the image modality.

(3.1) Use a Transformer variant with position embeddings removed to obtain the context features of the text modality.

(3.1.1) Concatenate the basic features of the sentences into a matrix E^s of dimension m×d.

(3.1.2) Use the h-head attention layer of the Transformer to map the basic features E^s to a query matrix Q_k, a key matrix K_k and a value matrix V_k:

Q_k = E^s W^Q_k,  K_k = E^s W^K_k,  V_k = E^s W^V_k

where k ∈ [1, h] indexes the k-th attention head and W^Q_k, W^K_k, W^V_k are the weight matrices of the k-th attention head, each of dimension d × d/h.

Then extract the interaction information head_k of each attention head through the attention mechanism:

head_k = softmax(Q_k K_k^T / sqrt(d_k)) V_k

where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition.

Finally, concatenate the interaction information of all attention heads, head = [head_1; head_2; ...; head_h], and pass it through a feed-forward network to obtain the context features of the out-of-order sentences C^s = {c^s_1, c^s_2, ..., c^s_m}, where c^s_i denotes the context feature of the i-th sentence.
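
A compact NumPy sketch of the multi-head self-attention used in step (3.1.2) is given below. Parameter names are assumptions and the feed-forward sub-layer that follows in the text is omitted. Because no position embedding is added on the text side, the output is invariant to the input order of the sentences.

```python
import numpy as np

def multi_head_self_attention(E, Wq, Wk, Wv):
    """E: (m, d) basic sentence features; Wq/Wk/Wv: length-h lists of
    (d, d/h) weight matrices, one triple per attention head.
    Returns the concatenated head outputs, shape (m, d)."""
    heads = []
    for Wq_k, Wk_k, Wv_k in zip(Wq, Wk, Wv):
        Q, K, V = E @ Wq_k, E @ Wk_k, E @ Wv_k         # (m, d/h) each
        scores = Q @ K.T / np.sqrt(Q.shape[-1])        # (m, m) scaled dot products
        scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
        attn = np.exp(scores)
        attn /= attn.sum(axis=-1, keepdims=True)       # row-wise softmax
        heads.append(attn @ V)                         # (m, d/h) head_k
    return np.concatenate(heads, axis=-1)              # (m, d) concatenated heads

# Toy usage with m = 5 sentences, d = 512, h = 4 heads (random weights).
m, d, h = 5, 512, 4
rng = np.random.default_rng(0)
E = rng.normal(size=(m, d))
Wq, Wk, Wv = ([rng.normal(size=(d, d // h)) for _ in range(h)] for _ in range(3))
C = multi_head_self_attention(E, Wq, Wk, Wv)  # (5, 512), before the feed-forward network
```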

(3.2) Use a Transformer variant that retains position embeddings to obtain the context features of the image modality.

(3.2.1) Concatenate the basic features of the images into a matrix E^v of dimension n×d.

(3.2.2) Project the discrete position of each image basic feature e^v_j in E^v into a compact position embedding, denoted p_j.

For the even dimensions of the embedding: p_{j,2l} = sin(j / 10000^{2l/d}); for the odd dimensions: p_{j,2l+1} = cos(j / 10000^{2l/d});

where p_{j,2l} and p_{j,2l+1} denote the embedded values of the even and odd dimensions respectively, l is a constant, and 2l, 2l+1 ∈ [1, d].

After all dimensions have been embedded, the compact position p_j is obtained. Finally, concatenate the compact positions p_j of all images into the position embedding matrix P^v of dimension n×d.
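
A short NumPy sketch of the sinusoidal position embedding in step (3.2.2) follows; it assumes d is even and that positions are indexed from 1, which the text does not state explicitly.

```python
import numpy as np

def sinusoidal_positions(n, d):
    """Compact position embeddings p_j, j = 1..n, each of dimension d:
    even dimensions get sin(j / 10000^(2l/d)), odd dimensions get cos."""
    assert d % 2 == 0, "this sketch assumes an even feature dimension"
    P = np.zeros((n, d))
    j = np.arange(1, n + 1)[:, None]           # positions 1..n
    two_l = np.arange(0, d, 2)[None, :]        # 2l = 0, 2, ..., d-2
    angle = j / np.power(10000.0, two_l / d)   # (n, d/2)
    P[:, 0::2] = np.sin(angle)                 # even dimensions
    P[:, 1::2] = np.cos(angle)                 # odd dimensions
    return P                                   # the matrix P^v (or P^s for sentences)
```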

(3.2.3) Add the basic features E^v and the position embeddings P^v, then use the Transformer's h-head attention layer to map the sum to a query matrix Q_k, a key matrix K_k and a value matrix V_k:

Q_k = (E^v + P^v) W^Q_k,  K_k = (E^v + P^v) W^K_k,  V_k = (E^v + P^v) W^V_k

where k ∈ [1, h] indexes the k-th attention head and W^Q_k, W^K_k, W^V_k are the weight matrices of the k-th attention head, each of dimension d × d/h.

Then extract the interaction information head_k of each attention head through the attention mechanism:

head_k = softmax(Q_k K_k^T / sqrt(d_k)) V_k

where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition.

Finally, concatenate the interaction information of all attention heads, head = [head_1; head_2; ...; head_h], and pass it through a feed-forward network to obtain the context features of the ordered, coherent images C^v = {c^v_1, c^v_2, ..., c^v_n}, where c^v_j denotes the context feature of the j-th image.

(4) Obtain the attention information of cross-modal ordered positions.

(4.1) Convert the context features of the two modalities into a common semantic space by linear projection.

(4.1.1) Linearly project the context features of the two modalities:

s_i = ReLU(c^s_i W_1 + b_1)

v_j = ReLU(c^v_j W_2 + b_2)

where W_1, W_2 are weight parameters, b_1, b_2 are bias terms, and ReLU(·) is the rectified linear activation function.

(4.1.2) Semantic common-space conversion.

Concatenate the linearly projected sentence context features s_i into the semantic representation matrix M^s of the text modality, and concatenate the linearly projected image context features v_j into the semantic representation matrix M^v of the image modality.

(4.2) Compute the semantic correlation Corr between the two modalities from the semantic representation matrices M^s and M^v.
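
As a hedged illustration of steps (4.1) and (4.2), the NumPy sketch below projects both context-feature matrices into the common space and computes Corr. The patent does not show the Corr formula itself; the plain inner product M^s (M^v)^T used here, and the parameter shapes, are assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def semantic_correlation(Cs, Cv, W1, b1, W2, b2):
    """Cs: (m, d) sentence context features, Cv: (n, d) image context
    features; W1, W2: (d, d) projection weights; b1, b2: (d,) biases.
    Returns (Ms, Mv, Corr) with Corr of shape (m, n)."""
    Ms = relu(Cs @ W1 + b1)   # text semantics in the common space
    Mv = relu(Cv @ W2 + b2)   # image semantics in the common space
    Corr = Ms @ Mv.T          # assumed similarity: dot products of the rows
    return Ms, Mv, Corr
```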

(4.3) Use the semantic correlation of the two modalities to convert the position embeddings of the ordered images in the image modality into attention information in the text modality.

(4.3.1) Use the attention mechanism to obtain the implicit position information of each sentence in the text modality:

α = softmax(Corr)

and the implicit position information of the sentences is obtained by weighting the image position embeddings P^v with the attention weights α.

(4.3.2) Concatenate the context features of the sentences and add the result to the implicit position information, obtaining the sentence context features D with ordered-position attention information, of dimension n×d.
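
The sketch below continues step (4.3): each sentence attends over the ordered images, the image position embeddings are aggregated with those weights, and the result is added to the projected sentence contexts. The row-wise softmax over images and the use of the projected features here are my reading of the text, not something the patent states verbatim.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ordered_position_attention(Corr, Pv, Ms):
    """Corr: (m, n) text-image correlation; Pv: (n, d) image position
    embeddings; Ms: (m, d) projected sentence contexts.
    Returns D, the sentence contexts carrying ordered-position attention."""
    alpha = softmax(Corr, axis=-1)   # each sentence attends over the images
    implicit_pos = alpha @ Pv        # (m, d) implicit position per sentence
    return Ms + implicit_pos         # (m, d) matrix D
```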

(5) Restore the coherence of the out-of-order sentences.

(5.1) Project the discrete position of each sentence basic feature e^s_i in E^s into a compact position embedding, denoted p_i.

For the even dimensions of the embedding: p_{i,2l} = sin(i / 10000^{2l/d}); for the odd dimensions: p_{i,2l+1} = cos(i / 10000^{2l/d});

where p_{i,2l} and p_{i,2l+1} denote the embedded values of the even and odd dimensions respectively, l is a constant, and 2l, 2l+1 ∈ [1, d].

After all dimensions have been embedded, the compact position p_i is obtained. Finally, concatenate the compact positions p_i of all sentences into the position embedding matrix P^s of dimension m×d.

(5.2) Use the Transformer's h-head attention layer to map the position embedding matrix P^s to a query matrix Q_k, a key matrix K_k and a value matrix V_k:

Q_k = P^s W^Q_k,  K_k = P^s W^K_k,  V_k = P^s W^V_k

where k ∈ [1, h] indexes the k-th attention head and W^Q_k, W^K_k, W^V_k are the weight matrices of the k-th attention head, each of dimension d × d/h.

Then extract the interaction information head_k of each attention head through the attention mechanism:

head_k = softmax(Q_k K_k^T / sqrt(d_k)) V_k

where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition.

Finally, concatenate the interaction information of all attention heads, head = [head_1; head_2; ...; head_h], and pass it through a feed-forward network to obtain the interaction features between sentence positions O = {o_1, o_2, ..., o_m}, where o_i denotes the interaction feature of the i-th sentence position.

(5.3) Use a multi-head mutual-attention module to obtain the attention features of the sentences with respect to each position.

(5.3.1) Concatenate the interaction features o_i of the sentence positions into a matrix O of dimension m×d.

(5.3.2) Use the Transformer's h-head attention layer to map the matrix O to a query matrix Q_k, and map the matrix D (the sentence context features with ordered-position attention information from step (4.3.2)) to a key matrix K_k and a value matrix V_k:

Q_k = O W^Q_k,  K_k = D W^K_k,  V_k = D W^V_k

where k ∈ [1, h] indexes the k-th attention head and W^Q_k, W^K_k, W^V_k are the weight matrices of the k-th attention head, each of dimension d × d/h.

Then extract the interaction information head_k of each attention head through the attention mechanism:

head_k = softmax(Q_k K_k^T / sqrt(d_k)) V_k

where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition.

Finally, concatenate the interaction information of all attention heads, head = [head_1; head_2; ...; head_h], and pass it through a feed-forward network to obtain the attention features of the sentences with respect to the positions F = {f_1, f_2, ..., f_m}, where f_i denotes the attention feature of the sentences with respect to the i-th position.

(5.4) Compute the probability of the position of each sentence.

(5.4.1) Compute the probability that the i-th sentence occupies each of the m positions, where the attention value of the i-th sentence at the i-th position is ω_i. The attention value ω_i is computed from the position attention features using the weight matrices W_p and W_b and the column weight vector u, and the position probabilities are

ptr_i = softmax(ω_i)

In the same way, the probabilities of the i-th sentence over the m positions are computed according to the above formula and recorded as the position probability set {ptr_1, ptr_2, ..., ptr_i, ..., ptr_m}.

(5.4.2) Take the position probability with the largest value in the position probability set as the final probability of the position of the i-th sentence, denoted Ptr_i; in the same way, obtain the final probability of the position of each sentence, denoted {Ptr_1, Ptr_2, ..., Ptr_i, ..., Ptr_m}.

(5.5) Sort the out-of-order sentences according to the position probabilities.

Starting from the first position, select from the set {Ptr_1, Ptr_2, ..., Ptr_i, ..., Ptr_m} the sentence with the largest probability value and place it at the first position, then set the probability of the already-placed sentence to zero, and so on until the m-th position has been filled, thereby completing the coherence restoration of the out-of-order sentences.
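
A small NumPy sketch of the greedy selection with mask exclusion in steps (5.4.2) and (5.5) follows. It assumes the full m×m matrix of position probabilities from step (5.4.1) is available; the way the per-sentence maximum interacts with the position being filled is a literal reading of the text rather than a detail the patent spells out.

```python
import numpy as np

def recover_order(ptr):
    """ptr[i, p]: probability that sentence i occupies position p (step 5.4.1).
    Returns order, where order[p] is the index of the sentence placed at
    position p, using greedy selection with mask exclusion (step 5.5)."""
    Ptr = ptr.max(axis=1).astype(float).copy()  # final per-sentence probabilities (step 5.4.2)
    order = []
    for _ in range(ptr.shape[0]):               # fill positions from first to last
        i = int(np.argmax(Ptr))                 # remaining sentence with the largest Ptr
        order.append(i)
        Ptr[i] = 0.0                            # zero out the already-placed sentence
    return order
```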

The object of the invention is achieved as follows:

The cross-modal semantic coherence restoration method of the present invention first obtains the basic features and context features of the text modality and the image modality, then converts the context features of the two modalities into a common semantic space by linear projection to obtain the attention information of cross-modal ordered positions, and finally uses the ordered-position attention information to sort the out-of-order sentences, thereby completing their coherence restoration.

In addition, the cross-modal semantic coherence restoration method of the present invention has the following beneficial effects:

(1) The proposed cross-modal semantic coherence analysis and restoration method can effectively extract features for the elements of different modalities, make full use of cross-modal position information to assist and promote semantic coherence analysis and restoration in a single modality, and predict the element at every position in parallel, further improving both the speed and the accuracy of the task.

(2) The invention effectively connects text and image modalities with similar semantics in a cross-modal manner, which benefits the analysis of semantic coherence and introduces position attention information from the ordered, coherent modality.

Brief Description of the Drawings

Fig. 1 is a flowchart of the cross-modal semantic coherence restoration method of the present invention.

Detailed Description of the Embodiments

Specific embodiments of the present invention are described below with reference to the accompanying drawings so that those skilled in the art can better understand the present invention. Note that, in the following description, detailed descriptions of known functions and designs are omitted where they might obscure the main content of the present invention.

Embodiment

Fig. 1 is a flowchart of the cross-modal semantic coherence restoration method of the present invention.

In this embodiment, as shown in Fig. 1, the cross-modal semantic coherence restoration method of the present invention includes the following steps:

S1. Let the out-of-order sentences whose semantic coherence is to be restored in the text modality be X = {x_1, x_2, ..., x_i, ..., x_m}, where x_i denotes the i-th sentence and m is the number of out-of-order sentences. Let a set of ordered, coherent images in the image modality be Y = {y_1, y_2, ..., y_j, ..., y_n}, where y_j denotes the j-th image and n is the number of images. The text modality and the image modality are assumed to share similar semantics, and the images are used to help restore the text into ordered, coherent paragraphs.

S2. Obtain the basic features of the text modality and the image modality.

S2.1. Use a bidirectional long short-term memory (Bi-LSTM) network to obtain the basic features of the out-of-order sentences: input X into the Bi-LSTM, which outputs the basic sentence features E^s = {e^s_1, e^s_2, ..., e^s_m}, where e^s_i denotes the basic feature of the i-th sentence and has dimension 1×d, with d set to 512.

S2.2. Use a convolutional neural network to obtain the basic features of the ordered, coherent images: input Y into the convolutional neural network, which outputs the basic image features E^v = {e^v_1, e^v_2, ..., e^v_n}, where e^v_j denotes the basic feature of the j-th image and has dimension 1×d.

S3. Obtain the context features of the text modality and the image modality.

S3.1. To exploit the contextual semantic relationships, a Transformer variant with position embeddings removed is used to obtain the context features of the text modality; it relies on the scaled dot-product self-attention mechanism to use the contextual information.

S3.1.1. Concatenate the basic features of the sentences into a matrix E^s of dimension m×d.

S3.1.2. Use the h-head attention layer of the Transformer to map the basic features E^s to a query matrix Q_k, a key matrix K_k and a value matrix V_k:

Q_k = E^s W^Q_k,  K_k = E^s W^K_k,  V_k = E^s W^V_k

where k ∈ [1, h] indexes the k-th attention head and W^Q_k, W^K_k, W^V_k are the weight matrices of the k-th attention head, each of dimension d × d/h; h is set to 4.

Then extract the interaction information head_k of each attention head through the attention mechanism:

head_k = softmax(Q_k K_k^T / sqrt(d_k)) V_k

where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition.

Finally, concatenate the interaction information of all attention heads, head = [head_1; head_2; ...; head_h], and pass it through a feed-forward network to obtain the context features of the out-of-order sentences C^s = {c^s_1, c^s_2, ..., c^s_m}, where c^s_i denotes the context feature of the i-th sentence.

S3.2. To model the coherent semantic information of the images, a Transformer variant that retains position embeddings is used to obtain the context features of the image modality.

S3.2.1. Concatenate the basic features of the images into a matrix E^v of dimension n×d.

S3.2.2. Project the discrete position of each image basic feature e^v_j in E^v into a compact position embedding, denoted p_j.

For the even dimensions of the embedding: p_{j,2l} = sin(j / 10000^{2l/d}); for the odd dimensions: p_{j,2l+1} = cos(j / 10000^{2l/d});

where p_{j,2l} and p_{j,2l+1} denote the embedded values of the even and odd dimensions respectively, l is a constant, and 2l, 2l+1 ∈ [1, d].

After all dimensions have been embedded, the compact position p_j is obtained. Finally, concatenate the compact positions p_j of all images into the position embedding matrix P^v of dimension n×d.

S3.2.3. Add the basic features E^v and the position embeddings P^v, then use the Transformer's h-head attention layer to map the sum to a query matrix Q_k, a key matrix K_k and a value matrix V_k:

Q_k = (E^v + P^v) W^Q_k,  K_k = (E^v + P^v) W^K_k,  V_k = (E^v + P^v) W^V_k

where k ∈ [1, h] indexes the k-th attention head and W^Q_k, W^K_k, W^V_k are the weight matrices of the k-th attention head, each of dimension d × d/h.

Then extract the interaction information head_k of each attention head through the attention mechanism:

head_k = softmax(Q_k K_k^T / sqrt(d_k)) V_k

where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition.

Finally, concatenate the interaction information of all attention heads, head = [head_1; head_2; ...; head_h], and pass it through a feed-forward network to obtain the context features of the ordered, coherent images C^v = {c^v_1, c^v_2, ..., c^v_n}, where c^v_j denotes the context feature of the j-th image.

S4. Obtain the attention information of cross-modal ordered positions.

S4.1. To exploit the cross-modal order information from the image modality, a cross-modal position attention module connects the semantic consistency between the two modalities. First, the context features of the two modalities are converted into a common semantic space by linear projection.

S4.1.1. Linearly project the context features of the two modalities:

s_i = ReLU(c^s_i W_1 + b_1)

v_j = ReLU(c^v_j W_2 + b_2)

where W_1, W_2 are weight parameters, b_1, b_2 are bias terms, and ReLU(·) is the rectified linear activation function.

S4.1.2. Semantic common-space conversion.

Concatenate the linearly projected sentence context features s_i into the semantic representation matrix M^s of the text modality, and concatenate the linearly projected image context features v_j into the semantic representation matrix M^v of the image modality.

S4.2. Compute the semantic correlation Corr between the two modalities from the semantic representation matrices M^s and M^v.

S4.3. Use the semantic correlation of the two modalities to convert the position embeddings of the ordered images in the image modality into attention information in the text modality.

S4.3.1. Use the attention mechanism to obtain the implicit position information of each sentence in the text modality:

α = softmax(Corr)

and the implicit position information of the sentences is obtained by weighting the image position embeddings P^v with the attention weights α.

S4.3.2. Concatenate the context features of the sentences and add the result to the implicit position information, obtaining the sentence context features D with ordered-position attention information, of dimension n×d.

S5. Perform coherence restoration.

S5.1. Project the discrete position of each sentence basic feature e^s_i in E^s into a compact position embedding, denoted p_i.

For the even dimensions of the embedding: p_{i,2l} = sin(i / 10000^{2l/d}); for the odd dimensions: p_{i,2l+1} = cos(i / 10000^{2l/d});

where p_{i,2l} and p_{i,2l+1} denote the embedded values of the even and odd dimensions respectively, l is a constant, and 2l, 2l+1 ∈ [1, d].

After all dimensions have been embedded, the compact position p_i is obtained. Finally, concatenate the compact positions p_i of all sentences into the position embedding matrix P^s of dimension m×d.

S5.2. Use the Transformer's h-head attention layer to map the position embedding matrix P^s to a query matrix Q_k, a key matrix K_k and a value matrix V_k:

Q_k = P^s W^Q_k,  K_k = P^s W^K_k,  V_k = P^s W^V_k

where k ∈ [1, h] indexes the k-th attention head and W^Q_k, W^K_k, W^V_k are the weight matrices of the k-th attention head, each of dimension d × d/h.

Then extract the interaction information head_k of each attention head through the attention mechanism:

head_k = softmax(Q_k K_k^T / sqrt(d_k)) V_k

where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition.

Finally, concatenate the interaction information of all attention heads, head = [head_1; head_2; ...; head_h], and pass it through a feed-forward network to obtain the interaction features between sentence positions O = {o_1, o_2, ..., o_m}, where o_i denotes the interaction feature of the i-th sentence position.

S5.3. Use a multi-head mutual-attention module to obtain the attention features of the sentences with respect to each position.

S5.3.1. Concatenate the interaction features o_i of the sentence positions into a matrix O of dimension m×d.

S5.3.2. Use the Transformer's h-head attention layer to map the matrix O to a query matrix Q_k, and map the matrix D (the sentence context features with ordered-position attention information from step S4.3.2) to a key matrix K_k and a value matrix V_k:

Q_k = O W^Q_k,  K_k = D W^K_k,  V_k = D W^V_k

where k ∈ [1, h] indexes the k-th attention head and W^Q_k, W^K_k, W^V_k are the weight matrices of the k-th attention head, each of dimension d × d/h.

Then extract the interaction information head_k of each attention head through the attention mechanism:

head_k = softmax(Q_k K_k^T / sqrt(d_k)) V_k

where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition.

Finally, concatenate the interaction information of all attention heads, head = [head_1; head_2; ...; head_h], and pass it through a feed-forward network to obtain the attention features of the sentences with respect to the positions F = {f_1, f_2, ..., f_m}, where f_i denotes the attention feature of the sentences with respect to the i-th position.

S5.4. Compute the probability of the position of each sentence.

S5.4.1. Compute the probability that the i-th sentence occupies each of the m positions, where the attention value of the i-th sentence at the i-th position is ω_i. The attention value ω_i is computed from the position attention features using the weight matrices W_p and W_b and the column weight vector u, and the position probabilities are

ptr_i = softmax(ω_i)

In the same way, the probabilities of the i-th sentence over the m positions are computed according to the above formula and recorded as the position probability set {ptr_1, ptr_2, ..., ptr_i, ..., ptr_m}.

S5.4.2. Take the position probability with the largest value in the position probability set as the final probability of the position of the i-th sentence, denoted Ptr_i; in the same way, obtain the final probability of the position of each sentence, denoted {Ptr_1, Ptr_2, ..., Ptr_i, ..., Ptr_m}.

S5.5. Sort the out-of-order sentences according to the position probabilities.

Starting from the first position, select from the set {Ptr_1, Ptr_2, ..., Ptr_i, ..., Ptr_m} the sentence with the largest probability value and place it at the first position, then set the probability of the already-placed sentence to zero, and so on until the m-th position has been filled, thereby completing the coherence restoration of the out-of-order sentences.

In this embodiment, the invention is applied to several commonly used datasets, including SIND and TACoS, two visual storytelling and story-understanding corpora that contain data in both text and image form. Perfect match ratio (PMR), accuracy (Acc) and the τ metric are adopted as evaluation indicators. PMR measures the performance of element position prediction as a whole; Acc computes the accuracy of the absolute position prediction of individual elements and is a looser metric; the τ metric measures the relative order between all pairs of elements in the prediction and is closer to human judgment.
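
A small sketch of how these three metrics can be computed is given below. The patent does not give the formulas, so the usual definitions (exact match for PMR, element-wise accuracy for Acc, Kendall's tau averaged over samples for τ) are assumptions.

```python
import numpy as np
from scipy.stats import kendalltau

def evaluate(pred_orders, gold_orders):
    """pred_orders/gold_orders: lists of predicted and ground-truth orderings,
    e.g. [2, 0, 1, 4, 3] per paragraph. Returns (PMR, Acc, tau)."""
    pmr = np.mean([list(p) == list(g) for p, g in zip(pred_orders, gold_orders)])
    acc = np.mean([np.mean(np.asarray(p) == np.asarray(g))
                   for p, g in zip(pred_orders, gold_orders)])
    tau = np.mean([kendalltau(p, g)[0]        # Kendall rank correlation per sample
                   for p, g in zip(pred_orders, gold_orders)])
    return pmr, acc, tau
```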

Sentence coherence restoration was carried out with the present invention and with existing methods; the experimental results are shown in Table 1. Here, LSTM+PtrNet is a long short-term memory network plus pointer network; AON-UM is a unimodal autoregressive attention restoration method; AON-CM is a cross-modal autoregressive attention restoration method; NAD-UM is a unimodal non-autoregressive restoration method; NAD-CM1 is a cross-modal non-autoregressive method without position embedding and position attention; NAD-CM2 is a cross-modal non-autoregressive method without position attention; NAD-CM3 is a cross-modal non-autoregressive method without position embedding; NACON (no exl) is the method without greedy selection and mask exclusion; and NACON is the method of the present invention. The experimental results show that the cross-modal semantic coherence analysis and restoration method performs far better than the existing unimodal methods. Compared with NAD-CM1, NAD-CM2 and NAD-CM3, all evaluation indicators of the present invention are improved, which verifies the effectiveness of the cross-modal position attention information. In addition, compared with AON-CM and NACON (no exl), the performance is also clearly improved, which verifies the effectiveness of the greedy-selection and mask-exclusion inference scheme of the coherence restoration method designed by the present invention.

Table 1 shows the experimental results on the SIND and TACoS datasets.


Although illustrative specific embodiments of the present invention have been described above to help those skilled in the art understand the present invention, it should be clear that the present invention is not limited to the scope of the specific embodiments. To those of ordinary skill in the art, various changes are obvious as long as they fall within the spirit and scope of the present invention as defined and determined by the appended claims, and all inventions and creations that make use of the inventive concept are under protection.

Claims (1)

1. A cross-modal semantic consistency recovery method is characterized by comprising the following steps:
(1) and setting the out-of-order sentence with semantic consistency to be restored under the text mode as X ═ X1,x2,…,xi,…,xm},xiThe i sentence is expressed, and m is the quantity of the unordered sentences; let a set of ordered consecutive images in the image modality be Y ═ Y1,y2,…,yj,…,yn},yjRepresents the j image, and n represents the number of images; setting similar semantics between a text mode and an image mode;
(2) acquiring basic characteristics of a text mode and an image mode;
(2.1) acquiring basic characteristics of the out-of-order statements by using a bidirectional long-time and short-time memory network: inputting X into bidirectional long-time memory network to output basic characteristics of out-of-order statement
Figure FDA0003442397970000011
Wherein,
Figure FDA0003442397970000012
representing the basic characteristics of the sentence of the ith sentence, wherein the dimension size of the sentence is 1 × d;
(2.2) acquiring basic features of the ordered and coherent images by adopting a convolutional neural network: inputting Y into the convolution neural network, thereby outputting basic features of the sequential image
Figure FDA0003442397970000013
Figure FDA0003442397970000014
Representing the basic characteristics of the jth image, and the dimension size of the jth image is 1 x d;
(3) acquiring context characteristics of a text mode and an image mode;
(3.1) acquiring context characteristics of a text mode by using a Transformer variant structure with embedded position positions removed;
(3.1.1) splicing the basic characteristics of the sentences to obtain a matrix
Figure FDA0003442397970000015
The dimension size is mxd;
(3.1.2) Using Transformer's h-head attention layer to characterize the underlying features
Figure FDA0003442397970000016
First mapping to a query matrix
Figure FDA0003442397970000017
Key matrix
Figure FDA0003442397970000018
Sum matrix
Figure FDA0003442397970000019
Figure FDA00034423979700000110
Wherein k is ∈ [1, h ]]The k-th head of attention is shown,
Figure FDA00034423979700000111
the weight matrix of the kth attention head has the dimensions of
Figure FDA00034423979700000112
Then extracting the mutual information among the attention heads through the attention mechanism
Figure FDA00034423979700000113
Figure FDA00034423979700000114
Wherein,
Figure FDA00034423979700000115
the dimension of the kth attention head is represented, and the superscript T represents transposition;
finally, the mutual information among all attention heads is obtained
Figure FDA00034423979700000119
Are connected together
Figure FDA00034423979700000116
And obtaining the context characteristics of the out-of-order statement through a forward feedback network
Figure FDA00034423979700000117
Figure FDA00034423979700000118
Representing the context characteristics of the sentence i;
(3.2) acquiring context characteristics of an image modality by using a Transformer variant structure embedded in a reserved position;
(3.2.1) splicing the basic characteristics of the images to obtain a matrix
Figure FDA0003442397970000021
The dimension size is n x d;
(3.2.2) general features
Figure FDA0003442397970000022
Basic features of each image
Figure FDA0003442397970000023
Is embedded as a compact position, denoted as pj
In the basic characteristics
Figure FDA0003442397970000024
In (2), embedding the projection of the dimension of the even term as: p is a radical ofj,2l=sin(j/100002l/d) (ii) a The projection embedding is performed on the dimensions of the odd terms as: p is a radical ofj,2l+1=cos(j/100002l/d);
Wherein p isj,2l、pj,2l+1Respectively representing the values of the projection embedding of even number item dimension and odd number item dimension, wherein l is a constant, 2l,2l +1 belongs to [1, d ]];
Basic features
Figure FDA0003442397970000025
Obtaining a compact position p after embedding of all-dimension projectionj
Finally, the compact position p of each image is determinedjSplicing to obtain a position embedded matrix
Figure FDA0003442397970000026
The dimension size is n x d;
(3.2.3) basic features
Figure FDA0003442397970000027
And position embedding
Figure FDA0003442397970000028
Mapping the h-head attention layer into a query matrix by using a Transformer after addition
Figure FDA0003442397970000029
Key matrix
Figure FDA00034423979700000210
Sum matrix
Figure FDA00034423979700000211
Figure FDA00034423979700000212
Wherein k is ∈ [1, h ]]The k-th head of attention is shown,
Figure FDA00034423979700000213
the weight matrix of the kth attention head has the dimensions of
Figure FDA00034423979700000214
Then extracting the mutual information among the attention heads through the attention mechanism
Figure FDA00034423979700000215
Figure FDA00034423979700000216
Wherein,
Figure FDA00034423979700000217
the dimension of the kth attention head is represented, and the superscript T represents transposition;
finally, the mutual information among all attention heads is obtained
Figure FDA00034423979700000218
Are connected together
Figure FDA00034423979700000219
And obtaining the context characteristics of the ordered coherent images through a forward feedback network
Figure FDA00034423979700000220
Figure FDA00034423979700000221
Representing the context feature of the jth image;
(4) obtaining attention information of cross-modal ordered positions
(4.1) converting the context features of the two modes into a semantic common space through linear projection;
(4.1.1) carrying out linear projection on the context characteristics of the two modes;
Figure FDA0003442397970000031
Figure FDA0003442397970000032
wherein, W1、W2As weight parameter, b1、b2For the bias term, ReLU (-) is the correct linear activation function;
(4.1.2) semantic public space conversion;
linearly projected contextual features
Figure FDA0003442397970000033
Splicing to obtain a semantic representation matrix in the text mode
Figure FDA0003442397970000034
Linearly projected contextual features
Figure FDA0003442397970000035
Splicing to obtain a semantic representation matrix under the image mode
Figure FDA0003442397970000036
(4.2) calculating semantic correlation Corr between the two modes;
Figure FDA0003442397970000037
(4.3) embedding and converting the position of the ordered image in the image modality into attention information in the text modality by utilizing semantic correlation of the two modalities;
(4.31) obtaining implicit position information of each statement in text mode by using attention mechanism
Figure FDA00034423979700000317
α=soft max(Corr)
Figure FDA0003442397970000038
(4.3.2) mixing
Figure FDA0003442397970000039
After the context features of the sentences are spliced and the implicit position information is obtained
Figure FDA00034423979700000310
Adding to obtain the sentence context characteristics with ordered position attention information
Figure FDA00034423979700000311
The dimension size is n x d;
(5) restoring the coherence of the out-of-order sentences;
(5.1) embedding the basic feature of each sentence as a compact position, denoted p_i;
in the basic feature, the projection embedding of the even-numbered dimensions is p_{i,2l} = sin(i / 10000^{2l/d}), and the projection embedding of the odd-numbered dimensions is p_{i,2l+1} = cos(i / 10000^{2l/d});
wherein p_{i,2l} and p_{i,2l+1} respectively represent the projection-embedding values of the even-numbered and odd-numbered dimensions, l is a constant, and 2l, 2l+1 ∈ [1, d];
after the projection embedding of all dimensions of the basic feature of the i-th sentence, the compact position p_i is obtained;
finally, the compact positions p_i of all sentences are concatenated to obtain a position embedding matrix of dimension m × d;
(5.2) using an h-head attention layer of a Transformer, first mapping the position embedding matrix, for each head, into a query matrix Q_k, a key matrix K_k and a value matrix V_k;
wherein k ∈ [1, h] denotes the k-th attention head, and the weight matrices of the k-th attention head map the position embedding to the head dimension d_k;
then extracting the mutual information of each attention head through the attention mechanism:
head_k = Attention(Q_k, K_k, V_k) = softmax(Q_k K_k^T / √d_k) V_k
wherein d_k denotes the dimension of the k-th attention head, and the superscript T denotes transposition;
finally, connecting the outputs head_1, …, head_h of all attention heads together and obtaining, through a feed-forward network, the interaction features among the sentence positions, in which the i-th entry represents the interaction feature of the i-th sentence position;
(5.3) acquiring the attention features of each sentence with respect to the positions through a multi-head mutual-attention module;
(5.3.1) concatenating the interaction features of the sentence positions to obtain a matrix of dimension m × d;
(5.3.2) using an h-head attention layer of a Transformer, first mapping this matrix, for each head, into a query matrix Q_k, a key matrix K_k and a value matrix V_k;
wherein k ∈ [1, h] denotes the k-th attention head, and the weight matrices of the k-th attention head map the matrix to the head dimension d_k;
then extracting the mutual information of each attention head through the attention mechanism:
head_k = Attention(Q_k, K_k, V_k) = softmax(Q_k K_k^T / √d_k) V_k
wherein d_k denotes the dimension of the k-th attention head, and the superscript T denotes transposition;
finally, connecting the outputs head_1, …, head_h of all attention heads together and obtaining, through a feed-forward network, the attention features of the sentences with respect to the positions, in which the i-th entry represents the attention feature of the sentence with respect to the i-th position;
(5.4) calculating the probability of the position of each sentence;
(5.4.1) calculating the probability that the i-th sentence occupies each of the m positions, wherein the score of the i-th sentence at the i-th position is ω_i; ω_i is computed from the attention features through the weight matrices W_p and W_b and the column weight vector u, and ptr_i = softmax(ω_i);
similarly, the probabilities of the i-th sentence over the m positions are calculated according to this formula and recorded as the position probability set {ptr_1, ptr_2, …, ptr_i, …, ptr_m};
(5.4.2) taking the position probability with the largest value in the position probability set as the final probability of the position of the i-th sentence, denoted Ptr_i; in the same way, obtaining the final probability of the position of each sentence, denoted {Ptr_1, Ptr_2, …, Ptr_i, …, Ptr_m};
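The scoring formula behind ω_i is given only as an image in the original; the sketch below therefore assumes one common additive pointer-style form, ω_{i,j} = u^T tanh(W_p a_j + W_b s_i), purely for illustration, and then applies the softmax and max steps described above (Python/NumPy, all names illustrative):

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def position_probabilities(A_pos, S_sent, rng):
        # A_pos: (m, d) attention features of the positions, S_sent: (m, d) sentence
        # features; returns an (m, m) matrix whose row i is ptr_i over the m positions
        m, d = A_pos.shape
        W_p = rng.standard_normal((d, d)) * 0.02
        W_b = rng.standard_normal((d, d)) * 0.02
        u = rng.standard_normal(d) * 0.02              # column weight vector u
        proj_pos = A_pos @ W_p.T                       # W_p applied to the position features
        proj_sent = S_sent @ W_b.T                     # W_b applied to the sentence features
        omega = np.tanh(proj_sent[:, None, :] + proj_pos[None, :, :]) @ u   # omega[i, j]
        return softmax(omega, axis=-1)                 # ptr_i = softmax(omega_i)

    rng = np.random.default_rng(0)
    ptr = position_probabilities(rng.standard_normal((4, 64)),
                                 rng.standard_normal((4, 64)), rng)
    Ptr = ptr.max(axis=-1)                             # final probability of each sentence, step (5.4.2)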
(5.5) sorting the out-of-order sentences according to the position probabilities;
starting from the first position, selecting from the set {Ptr_1, Ptr_2, …, Ptr_i, …, Ptr_m} the sentence with the largest probability value and placing it at the first position, setting the probability value of the placed sentence to zero, and repeating these steps until the m-th position is filled, thereby completing the coherence recovery of the out-of-order sentences.
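Read literally, step (5.5) is a greedy assignment over the final probabilities; a minimal sketch of that reading (Python, names illustrative) could be:

    import numpy as np

    def greedy_order(Ptr):
        # Ptr: length-m array, Ptr[i] = final probability of sentence i (step (5.4.2));
        # positions are filled one by one with the remaining sentence of largest
        # probability, whose value is then set to zero once it has been placed
        P = np.array(Ptr, dtype=float)
        order = []
        for _ in range(P.size):
            sent = int(np.argmax(P))
            order.append(sent)
            P[sent] = 0.0
        return order        # order[k] = index of the sentence restored at position k+1

    print(greedy_order([0.1, 0.7, 0.2]))   # -> [1, 2, 0]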
CN202111638661.3A 2021-12-29 2021-12-29 Cross-modal semantic consistency recovery method Active CN114330279B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111638661.3A CN114330279B (en) 2021-12-29 2021-12-29 Cross-modal semantic consistency recovery method

Publications (2)

Publication Number Publication Date
CN114330279A (en) 2022-04-12
CN114330279B (en) 2023-04-18

Family

ID=81016638

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111638661.3A Active CN114330279B (en) 2021-12-29 2021-12-29 Cross-modal semantic consistency recovery method

Country Status (1)

Country Link
CN (1) CN114330279B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090235228A1 (en) * 2008-03-11 2009-09-17 Ching-Tsun Chou Methodology and tools for table-based protocol specification and model generation
CN108897852A (en) * 2018-06-29 2018-11-27 北京百度网讯科技有限公司 Judgment method, device and the equipment of conversation content continuity
CN110472242A (en) * 2019-08-05 2019-11-19 腾讯科技(深圳)有限公司 A kind of text handling method, device and computer readable storage medium
US20210117778A1 (en) * 2019-10-16 2021-04-22 Apple Inc. Semantic coherence analysis of deep neural networks
CN111951207A (en) * 2020-08-25 2020-11-17 福州大学 Image Quality Enhancement Method Based on Deep Reinforcement Learning and Semantic Loss
CN112991350A (en) * 2021-02-18 2021-06-18 西安电子科技大学 RGB-T image semantic segmentation method based on modal difference reduction
CN112966127A (en) * 2021-04-07 2021-06-15 北方民族大学 Cross-modal retrieval method based on multilayer semantic alignment
CN113378546A (en) * 2021-06-10 2021-09-10 电子科技大学 Non-autoregressive sentence sequencing method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHILIANG WU et al.: "DAPC-Net: Deformable Alignment and Pyramid Context Completion Networks for Video Inpainting" *
李京谕 et al.: "基于联合注意力机制的篇章级机器翻译" [Document-level Machine Translation Based on a Joint Attention Mechanism] *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118839699A (en) * 2024-07-12 2024-10-25 University of Electronic Science and Technology of China Weakly supervised cross-modal semantic consistency recovery method
CN118839699B (en) * 2024-07-12 2025-09-26 University of Electronic Science and Technology of China A weakly supervised cross-modal semantic coherence restoration method

Also Published As

Publication number Publication date
CN114330279B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
CN109783657B (en) Multi-step self-attention cross-media retrieval method and system based on limited text space
CN107133211B (en) Composition scoring method based on attention mechanism
CN109492227A (en) It is a kind of that understanding method is read based on the machine of bull attention mechanism and Dynamic iterations
CN113779996B (en) Standard entity text determining method and device based on BiLSTM model and storage medium
CN109871538A (en) A Named Entity Recognition Method for Chinese Electronic Medical Records
CN113157885B (en) An efficient intelligent question answering system for artificial intelligence domain knowledge
CN110781683A (en) A method for joint extraction of entity relations
CN112100348A (en) Knowledge base question-answer relation detection method and system of multi-granularity attention mechanism
CN116151256A (en) A Few-Shot Named Entity Recognition Method Based on Multi-task and Hint Learning
CN114091450B (en) Judicial domain relation extraction method and system based on graph convolution network
CN116450883B (en) Video moment retrieval method based on fine-grained information of video content
CN112231491A (en) Similar test question identification method based on knowledge structure
CN110347857A (en) The semanteme marking method of remote sensing image based on intensified learning
CN114912512A (en) A method for automatic evaluation of the results of image descriptions
CN117708339B (en) ICD automatic coding method based on pre-training language model
CN115630145A (en) A dialogue recommendation method and system based on multi-granularity emotion
CN116821292B (en) Entity and relation linking method based on abstract semantic representation in knowledge base question and answer
CN112651225A (en) Multi-item selection machine reading understanding method based on multi-stage maximum attention
CN115329120A (en) Weak label Hash image retrieval framework with knowledge graph embedded attention mechanism
CN118035433A (en) An adaptive text summarization method with multimodal anchors
CN119313966A (en) Small sample image classification method and system based on multi-scale cross-modal cue enhancement
CN119988664A (en) Cross-modal image and text retrieval processing method and system
CN118069877A (en) Lightweight multimodal image description generation method based on CLIP encoder
CN113806551A (en) Domain knowledge extraction method based on multi-text structure data
CN114330279B (en) Cross-modal semantic consistency recovery method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant