CN114330279A - Cross-modal semantic consistency recovery method


Info

Publication number
CN114330279A
Authority
CN
China
Prior art keywords: attention, matrix, sentence, head, dimension
Legal status
Granted
Application number
CN202111638661.3A
Other languages
Chinese (zh)
Other versions
CN114330279B (en)
Inventor
杨阳
史文浩
宾燚
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Application filed by University of Electronic Science and Technology of China
Priority to CN202111638661.3A
Publication of CN114330279A
Application granted
Publication of CN114330279B
Current legal status: Active

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a cross-modal semantic consistency recovery method. The method first extracts basic features and context features for a text modality and an image modality, then maps the context features of both modalities into a common semantic space by linear projection to obtain cross-modal ordered-position attention information, and finally uses this ordered-position attention information to sort the out-of-order sentences, thereby restoring their coherence.

Description

A Cross-Modal Semantic Coherence Restoration Method

Technical Field

The invention belongs to the technical field of natural language processing and, more particularly, relates to a cross-modal semantic coherence restoration method.

Background

Coherence modeling has long been an important research topic in natural language processing: it aims to organize a set of sentences into a coherent text whose order is logically consistent. Although progress has been made, research on semantic coherence modeling is still limited to the single text modality. Existing methods for semantic coherence analysis and restoration are unimodal: given a set of sentences in the text modality, they typically adopt an encoder-decoder architecture and use a pointer network for sequence prediction.

Semantic coherence originally measured whether a text is linguistically meaningful; the notion can be extended to evaluate the logical, ordered, and consistent relationships among elements in various modalities. For humans, coherence modeling is a natural and essential ability for perceiving the world, one that lets us understand and perceive the world as a whole, so modeling the coherence of information is very important for promoting human perception and understanding of the physical world.

The current mainstream unimodal method for semantic coherence analysis and restoration is an autoregressive attention-based approach. It uses a Bi-LSTM to extract basic sentence feature vectors and, inspired by the self-attention mechanism, adopts a Transformer variant with positional encoding removed to extract reliable paragraph representations that are unaffected by the sentence input order, thereby obtaining the sentence features within the paragraph. Average pooling then yields a paragraph feature that initializes the hidden state of a recurrent-neural-network decoder, and a pointer network with greedy search or beam search recursively predicts the ordered, coherent composition of the paragraph, completing unimodal semantic coherence analysis and restoration.

Existing coherence-modeling work therefore focuses on the single text modality. During encoding, a bidirectional long short-term memory network extracts the basic feature vector of each sentence, the self-attention mechanism extracts sentence context features, and average pooling produces the paragraph feature; note that a Transformer variant with positional encoding removed is used here. During decoding, a pointer-network architecture built from LSTM units serves as the decoder: the basic sentence feature vectors are its inputs, the input of the first step is a zero vector, and the paragraph feature is the initial hidden state. Although such methods handle unimodal semantic coherence analysis and restoration effectively and keep improving unimodal performance, they ignore information integration and semantic consistency across modalities and lack cross-modal information.

Summary of the Invention

The purpose of the present invention is to overcome the deficiencies of the prior art and to provide a cross-modal semantic coherence restoration method that, based on the semantic consistency between the text and image modalities, effectively uses cross-modal information to guide semantic coherence restoration in the text modality.

To achieve the above purpose, the cross-modal semantic coherence restoration method of the present invention comprises the following steps:

(1) Let the out-of-order sentences whose semantic coherence is to be restored in the text modality be X = {x_1, x_2, ..., x_i, ..., x_m}, where x_i denotes the i-th sentence and m is the number of out-of-order sentences. Let a set of ordered, coherent images in the image modality be Y = {y_1, y_2, ..., y_j, ..., y_n}, where y_j denotes the j-th image and n is the number of images. The text modality and the image modality are assumed to share similar semantics.

(2) Obtain the basic features of the text modality and the image modality.

(2.1) Use a bidirectional long short-term memory (Bi-LSTM) network to obtain the basic features of the out-of-order sentences: input X into the Bi-LSTM, which outputs the basic sentence features E^s = {e^s_1, e^s_2, ..., e^s_m}, where e^s_i denotes the basic feature of the i-th sentence and has dimension 1×d.

(2.2) Use a convolutional neural network to obtain the basic features of the ordered, coherent images: input Y into the convolutional neural network, which outputs the basic image features E^v = {e^v_1, e^v_2, ..., e^v_n}, where e^v_j denotes the basic feature of the j-th image and has dimension 1×d.
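
The following is a minimal PyTorch sketch of step (2). It is an illustration only: the patent specifies just "a bidirectional long short-term memory network" and "a convolutional neural network", so the word-embedding size, the pooling over time steps, and the small stand-in CNN used here are assumptions.

```python
import torch
import torch.nn as nn

d = 512  # feature dimension d used in the embodiment below

class SentenceEncoder(nn.Module):
    """Bi-LSTM that turns m token sequences into m basic sentence features of dimension d."""
    def __init__(self, vocab_size=10000, emb_dim=300, hidden=d // 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # forward and backward hidden states concatenate to d dimensions
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, token_ids):                # (m, seq_len) integer tokens
        h, _ = self.lstm(self.embed(token_ids))  # (m, seq_len, d)
        return h.mean(dim=1)                     # (m, d): one vector per sentence

class ImageEncoder(nn.Module):
    """Tiny stand-in CNN that turns n images into n basic image features of dimension d."""
    def __init__(self):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, d))

    def forward(self, images):                   # (n, 3, H, W)
        return self.cnn(images)                  # (n, d): one vector per image
```

In practice the image branch would usually be a pretrained backbone; the tiny CNN above only keeps the sketch self-contained.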

(3) Obtain the context features of the text modality and the image modality.

(3.1) Use a Transformer variant with position embeddings removed to obtain the context features of the text modality.

(3.1.1) Concatenate the basic features of the sentences into a matrix E^s of dimension m×d.

(3.1.2) Use the h-head attention layer of the Transformer to map the basic features E^s to a query matrix Q_k, a key matrix K_k and a value matrix V_k:

Q_k = E^s W^Q_k,  K_k = E^s W^K_k,  V_k = E^s W^V_k

where k ∈ [1, h] indexes the k-th attention head and W^Q_k, W^K_k, W^V_k are the weight matrices of the k-th attention head, each of dimension d × d/h.

Then extract the interaction information head_k of each attention head through the attention mechanism:

head_k = softmax(Q_k K_k^T / sqrt(d_k)) V_k

where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition.

Finally, concatenate the interaction information of all attention heads, head = [head_1; head_2; ...; head_h], and pass it through a feed-forward network to obtain the context features of the out-of-order sentences C^s = {c^s_1, c^s_2, ..., c^s_m}, where c^s_i denotes the context feature of the i-th sentence.
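
A compact NumPy sketch of the multi-head self-attention used in step (3.1.2) is given below. Parameter names are assumptions and the feed-forward sub-layer that follows in the text is omitted. Because no position embedding is added on the text side, the output is invariant to the input order of the sentences.

```python
import numpy as np

def multi_head_self_attention(E, Wq, Wk, Wv):
    """E: (m, d) basic sentence features; Wq/Wk/Wv: length-h lists of
    (d, d/h) weight matrices, one triple per attention head.
    Returns the concatenated head outputs, shape (m, d)."""
    heads = []
    for Wq_k, Wk_k, Wv_k in zip(Wq, Wk, Wv):
        Q, K, V = E @ Wq_k, E @ Wk_k, E @ Wv_k         # (m, d/h) each
        scores = Q @ K.T / np.sqrt(Q.shape[-1])        # (m, m) scaled dot products
        scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
        attn = np.exp(scores)
        attn /= attn.sum(axis=-1, keepdims=True)       # row-wise softmax
        heads.append(attn @ V)                         # (m, d/h) head_k
    return np.concatenate(heads, axis=-1)              # (m, d) concatenated heads

# Toy usage with m = 5 sentences, d = 512, h = 4 heads (random weights).
m, d, h = 5, 512, 4
rng = np.random.default_rng(0)
E = rng.normal(size=(m, d))
Wq, Wk, Wv = ([rng.normal(size=(d, d // h)) for _ in range(h)] for _ in range(3))
C = multi_head_self_attention(E, Wq, Wk, Wv)  # (5, 512), before the feed-forward network
```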

(3.2) Use a Transformer variant that retains position embeddings to obtain the context features of the image modality.

(3.2.1) Concatenate the basic features of the images into a matrix E^v of dimension n×d.

(3.2.2) Project the discrete position of each image basic feature e^v_j in E^v into a compact position embedding, denoted p_j.

For the even dimensions of the embedding: p_{j,2l} = sin(j / 10000^{2l/d}); for the odd dimensions: p_{j,2l+1} = cos(j / 10000^{2l/d});

where p_{j,2l} and p_{j,2l+1} denote the embedded values of the even and odd dimensions respectively, l is a constant, and 2l, 2l+1 ∈ [1, d].

After all dimensions have been embedded, the compact position p_j is obtained. Finally, concatenate the compact positions p_j of all images into the position embedding matrix P^v of dimension n×d.
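
A short NumPy sketch of the sinusoidal position embedding in step (3.2.2) follows; it assumes d is even and that positions are indexed from 1, which the text does not state explicitly.

```python
import numpy as np

def sinusoidal_positions(n, d):
    """Compact position embeddings p_j, j = 1..n, each of dimension d:
    even dimensions get sin(j / 10000^(2l/d)), odd dimensions get cos."""
    assert d % 2 == 0, "this sketch assumes an even feature dimension"
    P = np.zeros((n, d))
    j = np.arange(1, n + 1)[:, None]           # positions 1..n
    two_l = np.arange(0, d, 2)[None, :]        # 2l = 0, 2, ..., d-2
    angle = j / np.power(10000.0, two_l / d)   # (n, d/2)
    P[:, 0::2] = np.sin(angle)                 # even dimensions
    P[:, 1::2] = np.cos(angle)                 # odd dimensions
    return P                                   # the matrix P^v (or P^s for sentences)
```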

(3.2.3) Add the basic features E^v and the position embeddings P^v, then use the Transformer's h-head attention layer to map the sum to a query matrix Q_k, a key matrix K_k and a value matrix V_k:

Q_k = (E^v + P^v) W^Q_k,  K_k = (E^v + P^v) W^K_k,  V_k = (E^v + P^v) W^V_k

where k ∈ [1, h] indexes the k-th attention head and W^Q_k, W^K_k, W^V_k are the weight matrices of the k-th attention head, each of dimension d × d/h.

Then extract the interaction information head_k of each attention head through the attention mechanism:

head_k = softmax(Q_k K_k^T / sqrt(d_k)) V_k

where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition.

Finally, concatenate the interaction information of all attention heads, head = [head_1; head_2; ...; head_h], and pass it through a feed-forward network to obtain the context features of the ordered, coherent images C^v = {c^v_1, c^v_2, ..., c^v_n}, where c^v_j denotes the context feature of the j-th image.

(4) Obtain the attention information of cross-modal ordered positions.

(4.1) Convert the context features of the two modalities into a common semantic space by linear projection.

(4.1.1) Linearly project the context features of the two modalities:

s_i = ReLU(c^s_i W_1 + b_1)

v_j = ReLU(c^v_j W_2 + b_2)

where W_1, W_2 are weight parameters, b_1, b_2 are bias terms, and ReLU(·) is the rectified linear activation function.

(4.1.2) Semantic common-space conversion.

Concatenate the linearly projected sentence context features s_i into the semantic representation matrix M^s of the text modality, and concatenate the linearly projected image context features v_j into the semantic representation matrix M^v of the image modality.

(4.2) Compute the semantic correlation Corr between the two modalities from the semantic representation matrices M^s and M^v.
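
As a hedged illustration of steps (4.1) and (4.2), the NumPy sketch below projects both context-feature matrices into the common space and computes Corr. The patent does not show the Corr formula itself; the plain inner product M^s (M^v)^T used here, and the parameter shapes, are assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def semantic_correlation(Cs, Cv, W1, b1, W2, b2):
    """Cs: (m, d) sentence context features, Cv: (n, d) image context
    features; W1, W2: (d, d) projection weights; b1, b2: (d,) biases.
    Returns (Ms, Mv, Corr) with Corr of shape (m, n)."""
    Ms = relu(Cs @ W1 + b1)   # text semantics in the common space
    Mv = relu(Cv @ W2 + b2)   # image semantics in the common space
    Corr = Ms @ Mv.T          # assumed similarity: dot products of the rows
    return Ms, Mv, Corr
```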

(4.3) Use the semantic correlation of the two modalities to convert the position embeddings of the ordered images in the image modality into attention information in the text modality.

(4.3.1) Use the attention mechanism to obtain the implicit position information of each sentence in the text modality:

α = softmax(Corr)

and the implicit position information of the sentences is obtained by weighting the image position embeddings P^v with the attention weights α.

(4.3.2) Concatenate the context features of the sentences and add the result to the implicit position information, obtaining the sentence context features D with ordered-position attention information, of dimension n×d.
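
The sketch below continues step (4.3): each sentence attends over the ordered images, the image position embeddings are aggregated with those weights, and the result is added to the projected sentence contexts. The row-wise softmax over images and the use of the projected features here are my reading of the text, not something the patent states verbatim.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ordered_position_attention(Corr, Pv, Ms):
    """Corr: (m, n) text-image correlation; Pv: (n, d) image position
    embeddings; Ms: (m, d) projected sentence contexts.
    Returns D, the sentence contexts carrying ordered-position attention."""
    alpha = softmax(Corr, axis=-1)   # each sentence attends over the images
    implicit_pos = alpha @ Pv        # (m, d) implicit position per sentence
    return Ms + implicit_pos         # (m, d) matrix D
```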

(5) Restore the coherence of the out-of-order sentences.

(5.1) Project the discrete position of each sentence basic feature e^s_i in E^s into a compact position embedding, denoted p_i.

For the even dimensions of the embedding: p_{i,2l} = sin(i / 10000^{2l/d}); for the odd dimensions: p_{i,2l+1} = cos(i / 10000^{2l/d});

where p_{i,2l} and p_{i,2l+1} denote the embedded values of the even and odd dimensions respectively, l is a constant, and 2l, 2l+1 ∈ [1, d].

After all dimensions have been embedded, the compact position p_i is obtained. Finally, concatenate the compact positions p_i of all sentences into the position embedding matrix P^s of dimension m×d.

(5.2) Use the Transformer's h-head attention layer to map the position embedding matrix P^s to a query matrix Q_k, a key matrix K_k and a value matrix V_k:

Q_k = P^s W^Q_k,  K_k = P^s W^K_k,  V_k = P^s W^V_k

where k ∈ [1, h] indexes the k-th attention head and W^Q_k, W^K_k, W^V_k are the weight matrices of the k-th attention head, each of dimension d × d/h.

Then extract the interaction information head_k of each attention head through the attention mechanism:

head_k = softmax(Q_k K_k^T / sqrt(d_k)) V_k

where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition.

Finally, concatenate the interaction information of all attention heads, head = [head_1; head_2; ...; head_h], and pass it through a feed-forward network to obtain the interaction features between sentence positions O = {o_1, o_2, ..., o_m}, where o_i denotes the interaction feature of the i-th sentence position.

(5.3) Use a multi-head mutual-attention module to obtain the attention features of the sentences with respect to each position.

(5.3.1) Concatenate the interaction features o_i of the sentence positions into a matrix O of dimension m×d.

(5.3.2) Use the Transformer's h-head attention layer to map the matrix O to a query matrix Q_k, and map the matrix D (the sentence context features with ordered-position attention information from step (4.3.2)) to a key matrix K_k and a value matrix V_k:

Q_k = O W^Q_k,  K_k = D W^K_k,  V_k = D W^V_k

where k ∈ [1, h] indexes the k-th attention head and W^Q_k, W^K_k, W^V_k are the weight matrices of the k-th attention head, each of dimension d × d/h.

Then extract the interaction information head_k of each attention head through the attention mechanism:

head_k = softmax(Q_k K_k^T / sqrt(d_k)) V_k

where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition.

Finally, concatenate the interaction information of all attention heads, head = [head_1; head_2; ...; head_h], and pass it through a feed-forward network to obtain the attention features of the sentences with respect to the positions F = {f_1, f_2, ..., f_m}, where f_i denotes the attention feature of the sentences with respect to the i-th position.

(5.4) Compute the probability of the position of each sentence.

(5.4.1) Compute the probability that the i-th sentence occupies each of the m positions, where the attention value of the i-th sentence at the i-th position is ω_i. The attention value ω_i is computed from the position attention features using the weight matrices W_p and W_b and the column weight vector u, and the position probabilities are

ptr_i = softmax(ω_i)

In the same way, the probabilities of the i-th sentence over the m positions are computed according to the above formula and recorded as the position probability set {ptr_1, ptr_2, ..., ptr_i, ..., ptr_m}.

(5.4.2) Take the position probability with the largest value in the position probability set as the final probability of the position of the i-th sentence, denoted Ptr_i; in the same way, obtain the final probability of the position of each sentence, denoted {Ptr_1, Ptr_2, ..., Ptr_i, ..., Ptr_m}.

(5.5) Sort the out-of-order sentences according to the position probabilities.

Starting from the first position, select from the set {Ptr_1, Ptr_2, ..., Ptr_i, ..., Ptr_m} the sentence with the largest probability value and place it at the first position, then set the probability of the already-placed sentence to zero, and so on until the m-th position has been filled, thereby completing the coherence restoration of the out-of-order sentences.
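
A small NumPy sketch of the greedy selection with mask exclusion in steps (5.4.2) and (5.5) follows. It assumes the full m×m matrix of position probabilities from step (5.4.1) is available; the way the per-sentence maximum interacts with the position being filled is a literal reading of the text rather than a detail the patent spells out.

```python
import numpy as np

def recover_order(ptr):
    """ptr[i, p]: probability that sentence i occupies position p (step 5.4.1).
    Returns order, where order[p] is the index of the sentence placed at
    position p, using greedy selection with mask exclusion (step 5.5)."""
    Ptr = ptr.max(axis=1).astype(float).copy()  # final per-sentence probabilities (step 5.4.2)
    order = []
    for _ in range(ptr.shape[0]):               # fill positions from first to last
        i = int(np.argmax(Ptr))                 # remaining sentence with the largest Ptr
        order.append(i)
        Ptr[i] = 0.0                            # zero out the already-placed sentence
    return order
```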

The object of the invention is achieved as follows:

The cross-modal semantic coherence restoration method of the present invention first obtains the basic features and context features of the text modality and the image modality, then converts the context features of the two modalities into a common semantic space by linear projection to obtain the attention information of cross-modal ordered positions, and finally uses the ordered-position attention information to sort the out-of-order sentences, thereby completing their coherence restoration.

In addition, the cross-modal semantic coherence restoration method of the present invention has the following beneficial effects:

(1) The proposed cross-modal semantic coherence analysis and restoration method can effectively extract features for the elements of different modalities, make full use of cross-modal position information to assist and promote semantic coherence analysis and restoration in a single modality, and predict the element at every position in parallel, further improving both the speed and the accuracy of the task.

(2) The invention effectively connects text and image modalities with similar semantics in a cross-modal manner, which benefits the analysis of semantic coherence and introduces position attention information from the ordered, coherent modality.

Brief Description of the Drawings

Fig. 1 is a flowchart of the cross-modal semantic coherence restoration method of the present invention.

Detailed Description of the Embodiments

Specific embodiments of the present invention are described below with reference to the accompanying drawings so that those skilled in the art can better understand the present invention. Note that, in the following description, detailed descriptions of known functions and designs are omitted where they might obscure the main content of the present invention.

Embodiment

Fig. 1 is a flowchart of the cross-modal semantic coherence restoration method of the present invention.

In this embodiment, as shown in Fig. 1, the cross-modal semantic coherence restoration method of the present invention includes the following steps:

S1. Let the out-of-order sentences whose semantic coherence is to be restored in the text modality be X = {x_1, x_2, ..., x_i, ..., x_m}, where x_i denotes the i-th sentence and m is the number of out-of-order sentences. Let a set of ordered, coherent images in the image modality be Y = {y_1, y_2, ..., y_j, ..., y_n}, where y_j denotes the j-th image and n is the number of images. The text modality and the image modality are assumed to share similar semantics, and the images are used to help restore the text into ordered, coherent paragraphs.

S2. Obtain the basic features of the text modality and the image modality.

S2.1. Use a bidirectional long short-term memory (Bi-LSTM) network to obtain the basic features of the out-of-order sentences: input X into the Bi-LSTM, which outputs the basic sentence features E^s = {e^s_1, e^s_2, ..., e^s_m}, where e^s_i denotes the basic feature of the i-th sentence and has dimension 1×d, with d set to 512.

S2.2. Use a convolutional neural network to obtain the basic features of the ordered, coherent images: input Y into the convolutional neural network, which outputs the basic image features E^v = {e^v_1, e^v_2, ..., e^v_n}, where e^v_j denotes the basic feature of the j-th image and has dimension 1×d.

S3. Obtain the context features of the text modality and the image modality.

S3.1. To exploit the contextual semantic relationships, a Transformer variant with position embeddings removed is used to obtain the context features of the text modality; it relies on the scaled dot-product self-attention mechanism to use the contextual information.

S3.1.1. Concatenate the basic features of the sentences into a matrix E^s of dimension m×d.

S3.1.2. Use the h-head attention layer of the Transformer to map the basic features E^s to a query matrix Q_k, a key matrix K_k and a value matrix V_k:

Q_k = E^s W^Q_k,  K_k = E^s W^K_k,  V_k = E^s W^V_k

where k ∈ [1, h] indexes the k-th attention head and W^Q_k, W^K_k, W^V_k are the weight matrices of the k-th attention head, each of dimension d × d/h; h is set to 4.

Then extract the interaction information head_k of each attention head through the attention mechanism:

head_k = softmax(Q_k K_k^T / sqrt(d_k)) V_k

where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition.

Finally, concatenate the interaction information of all attention heads, head = [head_1; head_2; ...; head_h], and pass it through a feed-forward network to obtain the context features of the out-of-order sentences C^s = {c^s_1, c^s_2, ..., c^s_m}, where c^s_i denotes the context feature of the i-th sentence.

S3.2. To model the coherent semantic information of the images, a Transformer variant that retains position embeddings is used to obtain the context features of the image modality.

S3.2.1. Concatenate the basic features of the images into a matrix E^v of dimension n×d.

S3.2.2. Project the discrete position of each image basic feature e^v_j in E^v into a compact position embedding, denoted p_j.

For the even dimensions of the embedding: p_{j,2l} = sin(j / 10000^{2l/d}); for the odd dimensions: p_{j,2l+1} = cos(j / 10000^{2l/d});

where p_{j,2l} and p_{j,2l+1} denote the embedded values of the even and odd dimensions respectively, l is a constant, and 2l, 2l+1 ∈ [1, d].

After all dimensions have been embedded, the compact position p_j is obtained. Finally, concatenate the compact positions p_j of all images into the position embedding matrix P^v of dimension n×d.

S3.2.3. Add the basic features E^v and the position embeddings P^v, then use the Transformer's h-head attention layer to map the sum to a query matrix Q_k, a key matrix K_k and a value matrix V_k:

Q_k = (E^v + P^v) W^Q_k,  K_k = (E^v + P^v) W^K_k,  V_k = (E^v + P^v) W^V_k

where k ∈ [1, h] indexes the k-th attention head and W^Q_k, W^K_k, W^V_k are the weight matrices of the k-th attention head, each of dimension d × d/h.

Then extract the interaction information head_k of each attention head through the attention mechanism:

head_k = softmax(Q_k K_k^T / sqrt(d_k)) V_k

where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition.

Finally, concatenate the interaction information of all attention heads, head = [head_1; head_2; ...; head_h], and pass it through a feed-forward network to obtain the context features of the ordered, coherent images C^v = {c^v_1, c^v_2, ..., c^v_n}, where c^v_j denotes the context feature of the j-th image.

S4. Obtain the attention information of cross-modal ordered positions.

S4.1. To exploit the cross-modal order information from the image modality, a cross-modal position attention module connects the semantic consistency between the two modalities. First, the context features of the two modalities are converted into a common semantic space by linear projection.

S4.1.1. Linearly project the context features of the two modalities:

s_i = ReLU(c^s_i W_1 + b_1)

v_j = ReLU(c^v_j W_2 + b_2)

where W_1, W_2 are weight parameters, b_1, b_2 are bias terms, and ReLU(·) is the rectified linear activation function.

S4.1.2. Semantic common-space conversion.

Concatenate the linearly projected sentence context features s_i into the semantic representation matrix M^s of the text modality, and concatenate the linearly projected image context features v_j into the semantic representation matrix M^v of the image modality.

S4.2. Compute the semantic correlation Corr between the two modalities from the semantic representation matrices M^s and M^v.

S4.3. Use the semantic correlation of the two modalities to convert the position embeddings of the ordered images in the image modality into attention information in the text modality.

S4.3.1. Use the attention mechanism to obtain the implicit position information of each sentence in the text modality:

α = softmax(Corr)

and the implicit position information of the sentences is obtained by weighting the image position embeddings P^v with the attention weights α.

S4.3.2. Concatenate the context features of the sentences and add the result to the implicit position information, obtaining the sentence context features D with ordered-position attention information, of dimension n×d.

S5. Perform coherence restoration.

S5.1. Project the discrete position of each sentence basic feature e^s_i in E^s into a compact position embedding, denoted p_i.

For the even dimensions of the embedding: p_{i,2l} = sin(i / 10000^{2l/d}); for the odd dimensions: p_{i,2l+1} = cos(i / 10000^{2l/d});

where p_{i,2l} and p_{i,2l+1} denote the embedded values of the even and odd dimensions respectively, l is a constant, and 2l, 2l+1 ∈ [1, d].

After all dimensions have been embedded, the compact position p_i is obtained. Finally, concatenate the compact positions p_i of all sentences into the position embedding matrix P^s of dimension m×d.

S5.2. Use the Transformer's h-head attention layer to map the position embedding matrix P^s to a query matrix Q_k, a key matrix K_k and a value matrix V_k:

Q_k = P^s W^Q_k,  K_k = P^s W^K_k,  V_k = P^s W^V_k

where k ∈ [1, h] indexes the k-th attention head and W^Q_k, W^K_k, W^V_k are the weight matrices of the k-th attention head, each of dimension d × d/h.

Then extract the interaction information head_k of each attention head through the attention mechanism:

head_k = softmax(Q_k K_k^T / sqrt(d_k)) V_k

where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition.

Finally, concatenate the interaction information of all attention heads, head = [head_1; head_2; ...; head_h], and pass it through a feed-forward network to obtain the interaction features between sentence positions O = {o_1, o_2, ..., o_m}, where o_i denotes the interaction feature of the i-th sentence position.

S5.3. Use a multi-head mutual-attention module to obtain the attention features of the sentences with respect to each position.

S5.3.1. Concatenate the interaction features o_i of the sentence positions into a matrix O of dimension m×d.

S5.3.2. Use the Transformer's h-head attention layer to map the matrix O to a query matrix Q_k, and map the matrix D (the sentence context features with ordered-position attention information from step S4.3.2) to a key matrix K_k and a value matrix V_k:

Q_k = O W^Q_k,  K_k = D W^K_k,  V_k = D W^V_k

where k ∈ [1, h] indexes the k-th attention head and W^Q_k, W^K_k, W^V_k are the weight matrices of the k-th attention head, each of dimension d × d/h.

Then extract the interaction information head_k of each attention head through the attention mechanism:

head_k = softmax(Q_k K_k^T / sqrt(d_k)) V_k

where d_k denotes the dimension of the k-th attention head and the superscript T denotes transposition.

Finally, concatenate the interaction information of all attention heads, head = [head_1; head_2; ...; head_h], and pass it through a feed-forward network to obtain the attention features of the sentences with respect to the positions F = {f_1, f_2, ..., f_m}, where f_i denotes the attention feature of the sentences with respect to the i-th position.

S5.4. Compute the probability of the position of each sentence.

S5.4.1. Compute the probability that the i-th sentence occupies each of the m positions, where the attention value of the i-th sentence at the i-th position is ω_i. The attention value ω_i is computed from the position attention features using the weight matrices W_p and W_b and the column weight vector u, and the position probabilities are

ptr_i = softmax(ω_i)

In the same way, the probabilities of the i-th sentence over the m positions are computed according to the above formula and recorded as the position probability set {ptr_1, ptr_2, ..., ptr_i, ..., ptr_m}.

S5.4.2. Take the position probability with the largest value in the position probability set as the final probability of the position of the i-th sentence, denoted Ptr_i; in the same way, obtain the final probability of the position of each sentence, denoted {Ptr_1, Ptr_2, ..., Ptr_i, ..., Ptr_m}.

S5.5. Sort the out-of-order sentences according to the position probabilities.

Starting from the first position, select from the set {Ptr_1, Ptr_2, ..., Ptr_i, ..., Ptr_m} the sentence with the largest probability value and place it at the first position, then set the probability of the already-placed sentence to zero, and so on until the m-th position has been filled, thereby completing the coherence restoration of the out-of-order sentences.

In this embodiment, the invention is applied to several commonly used datasets, including SIND and TACoS, two visual storytelling and story-understanding corpora that contain data in both text and image form. Perfect match ratio (PMR), accuracy (Acc) and the τ metric are adopted as evaluation indicators. PMR measures the performance of element position prediction as a whole; Acc computes the accuracy of the absolute position prediction of individual elements and is a looser metric; the τ metric measures the relative order between all pairs of elements in the prediction and is closer to human judgment.
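
A small sketch of how these three metrics can be computed is given below. The patent does not give the formulas, so the usual definitions (exact match for PMR, element-wise accuracy for Acc, Kendall's tau averaged over samples for τ) are assumptions.

```python
import numpy as np
from scipy.stats import kendalltau

def evaluate(pred_orders, gold_orders):
    """pred_orders/gold_orders: lists of predicted and ground-truth orderings,
    e.g. [2, 0, 1, 4, 3] per paragraph. Returns (PMR, Acc, tau)."""
    pmr = np.mean([list(p) == list(g) for p, g in zip(pred_orders, gold_orders)])
    acc = np.mean([np.mean(np.asarray(p) == np.asarray(g))
                   for p, g in zip(pred_orders, gold_orders)])
    tau = np.mean([kendalltau(p, g)[0]        # Kendall rank correlation per sample
                   for p, g in zip(pred_orders, gold_orders)])
    return pmr, acc, tau
```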

Sentence coherence restoration was carried out with the present invention and with existing methods; the experimental results are shown in Table 1. Here, LSTM+PtrNet is a long short-term memory network plus pointer network; AON-UM is a unimodal autoregressive attention restoration method; AON-CM is a cross-modal autoregressive attention restoration method; NAD-UM is a unimodal non-autoregressive restoration method; NAD-CM1 is a cross-modal non-autoregressive method without position embedding and position attention; NAD-CM2 is a cross-modal non-autoregressive method without position attention; NAD-CM3 is a cross-modal non-autoregressive method without position embedding; NACON (no exl) is the method without greedy selection and mask exclusion; and NACON is the method of the present invention. The experimental results show that the cross-modal semantic coherence analysis and restoration method performs far better than the existing unimodal methods. Compared with NAD-CM1, NAD-CM2 and NAD-CM3, all evaluation indicators of the present invention are improved, which verifies the effectiveness of the cross-modal position attention information. In addition, compared with AON-CM and NACON (no exl), the performance is also clearly improved, which verifies the effectiveness of the greedy-selection and mask-exclusion inference scheme of the coherence restoration method designed by the present invention.

Table 1 shows the experimental results on the SIND and TACoS datasets.


Although illustrative specific embodiments of the present invention have been described above to help those skilled in the art understand the present invention, it should be clear that the present invention is not limited to the scope of the specific embodiments. To those of ordinary skill in the art, various changes are obvious as long as they fall within the spirit and scope of the present invention as defined and determined by the appended claims, and all inventions and creations that make use of the inventive concept are under protection.

Claims (1)

1. A cross-modal semantic consistency recovery method is characterized by comprising the following steps:
(1) and setting the out-of-order sentence with semantic consistency to be restored under the text mode as X ═ X1,x2,…,xi,…,xm},xiThe i sentence is expressed, and m is the quantity of the unordered sentences; let a set of ordered consecutive images in the image modality be Y ═ Y1,y2,…,yj,…,yn},yjRepresents the j image, and n represents the number of images; setting similar semantics between a text mode and an image mode;
(2) acquiring basic characteristics of a text mode and an image mode;
(2.1) acquiring basic characteristics of the out-of-order statements by using a bidirectional long-time and short-time memory network: inputting X into bidirectional long-time memory network to output basic characteristics of out-of-order statement
Figure FDA0003442397970000011
Wherein,
Figure FDA0003442397970000012
representing the basic characteristics of the sentence of the ith sentence, wherein the dimension size of the sentence is 1 × d;
(2.2) acquiring basic features of the ordered and coherent images by adopting a convolutional neural network: inputting Y into the convolution neural network, thereby outputting basic features of the sequential image
Figure FDA0003442397970000013
Figure FDA0003442397970000014
Representing the basic characteristics of the jth image, and the dimension size of the jth image is 1 x d;
(3) acquiring context characteristics of a text mode and an image mode;
(3.1) acquiring context characteristics of a text mode by using a Transformer variant structure with embedded position positions removed;
(3.1.1) splicing the basic characteristics of the sentences to obtain a matrix
Figure FDA0003442397970000015
The dimension size is mxd;
(3.1.2) Using Transformer's h-head attention layer to characterize the underlying features
Figure FDA0003442397970000016
First mapping to a query matrix
Figure FDA0003442397970000017
Key matrix
Figure FDA0003442397970000018
Sum matrix
Figure FDA0003442397970000019
Figure FDA00034423979700000110
Wherein k is ∈ [1, h ]]The k-th head of attention is shown,
Figure FDA00034423979700000111
the weight matrix of the kth attention head has the dimensions of
Figure FDA00034423979700000112
Then extracting the mutual information among the attention heads through the attention mechanism
Figure FDA00034423979700000113
Figure FDA00034423979700000114
Wherein,
Figure FDA00034423979700000115
the dimension of the kth attention head is represented, and the superscript T represents transposition;
finally, the mutual information among all attention heads is obtained
Figure FDA00034423979700000119
Are connected together
Figure FDA00034423979700000116
And obtaining the context characteristics of the out-of-order statement through a forward feedback network
Figure FDA00034423979700000117
Figure FDA00034423979700000118
Representing the context characteristics of the sentence i;
(3.2) acquiring context characteristics of an image modality by using a Transformer variant structure embedded in a reserved position;
(3.2.1) splicing the basic characteristics of the images to obtain a matrix
Figure FDA0003442397970000021
The dimension size is n x d;
(3.2.2) general features
Figure FDA0003442397970000022
Basic features of each image
Figure FDA0003442397970000023
Is embedded as a compact position, denoted as pj
In the basic characteristics
Figure FDA0003442397970000024
In (2), embedding the projection of the dimension of the even term as: p is a radical ofj,2l=sin(j/100002l/d) (ii) a The projection embedding is performed on the dimensions of the odd terms as: p is a radical ofj,2l+1=cos(j/100002l/d);
Wherein p isj,2l、pj,2l+1Respectively representing the values of the projection embedding of even number item dimension and odd number item dimension, wherein l is a constant, 2l,2l +1 belongs to [1, d ]];
Basic features
Figure FDA0003442397970000025
Obtaining a compact position p after embedding of all-dimension projectionj
Finally, the compact position p of each image is determinedjSplicing to obtain a position embedded matrix
Figure FDA0003442397970000026
The dimension size is n x d;
(3.2.3) basic features
Figure FDA0003442397970000027
And position embedding
Figure FDA0003442397970000028
Mapping the h-head attention layer into a query matrix by using a Transformer after addition
Figure FDA0003442397970000029
Key matrix
Figure FDA00034423979700000210
Sum matrix
Figure FDA00034423979700000211
Figure FDA00034423979700000212
Wherein k is ∈ [1, h ]]The k-th head of attention is shown,
Figure FDA00034423979700000213
the weight matrix of the kth attention head has the dimensions of
Figure FDA00034423979700000214
Then extracting the mutual information among the attention heads through the attention mechanism
Figure FDA00034423979700000215
Figure FDA00034423979700000216
Wherein,
Figure FDA00034423979700000217
the dimension of the kth attention head is represented, and the superscript T represents transposition;
finally, the mutual information among all attention heads is obtained
Figure FDA00034423979700000218
Are connected together
Figure FDA00034423979700000219
And obtaining the context characteristics of the ordered coherent images through a forward feedback network
Figure FDA00034423979700000220
Figure FDA00034423979700000221
Representing the context feature of the jth image;
(4) obtaining attention information of cross-modal ordered positions
(4.1) converting the context features of the two modes into a semantic common space through linear projection;
(4.1.1) carrying out linear projection on the context characteristics of the two modes;
Figure FDA0003442397970000031
Figure FDA0003442397970000032
wherein, W1、W2As weight parameter, b1、b2For the bias term, ReLU (-) is the correct linear activation function;
(4.1.2) semantic public space conversion;
linearly projected contextual features
Figure FDA0003442397970000033
Splicing to obtain a semantic representation matrix in the text mode
Figure FDA0003442397970000034
Linearly projected contextual features
Figure FDA0003442397970000035
Splicing to obtain a semantic representation matrix under the image mode
Figure FDA0003442397970000036
(4.2) calculating semantic correlation Corr between the two modes;
Figure FDA0003442397970000037
(4.3) embedding and converting the position of the ordered image in the image modality into attention information in the text modality by utilizing semantic correlation of the two modalities;
(4.31) obtaining implicit position information of each statement in text mode by using attention mechanism
Figure FDA00034423979700000317
α=soft max(Corr)
Figure FDA0003442397970000038
(4.3.2) mixing
Figure FDA0003442397970000039
After the context features of the sentences are spliced and the implicit position information is obtained
Figure FDA00034423979700000310
Adding to obtain the sentence context characteristics with ordered position attention information
Figure FDA00034423979700000311
The dimension size is n x d;
(5) restoring the coherence of the out-of-order sentences;
(5.1) embedding the basic feature of each sentence as a compact position, denoted p_i;
in the basic feature, the projection embedding of the even-numbered dimensions is p_{i,2l} = sin(i / 10000^{2l/d}), and the projection embedding of the odd-numbered dimensions is p_{i,2l+1} = cos(i / 10000^{2l/d});
wherein p_{i,2l} and p_{i,2l+1} respectively represent the projection-embedding values of the even-numbered and odd-numbered dimensions, l is a constant, and 2l, 2l+1 ∈ [1, d];
after the projection embedding of all dimensions of the basic feature of the i-th sentence, the compact position p_i is obtained;
finally, the compact positions p_i of all sentences are concatenated to obtain a position embedding matrix of dimension m × d;
(5.2) using an h-head attention layer of a Transformer, first mapping the position embedding matrix, for each head, into a query matrix Q_k, a key matrix K_k and a value matrix V_k;
wherein k ∈ [1, h] denotes the k-th attention head, and the weight matrices of the k-th attention head map the position embedding to the head dimension d_k;
then extracting the mutual information of each attention head through the attention mechanism:
head_k = Attention(Q_k, K_k, V_k) = softmax(Q_k K_k^T / √d_k) V_k
wherein d_k denotes the dimension of the k-th attention head, and the superscript T denotes transposition;
finally, connecting the outputs head_1, …, head_h of all attention heads together and obtaining, through a feed-forward network, the interaction features among the sentence positions, in which the i-th entry represents the interaction feature of the i-th sentence position;
(5.3) acquiring the attention features of each sentence with respect to the positions through a multi-head mutual-attention module;
(5.3.1) concatenating the interaction features of the sentence positions to obtain a matrix of dimension m × d;
(5.3.2) using an h-head attention layer of a Transformer, first mapping this matrix, for each head, into a query matrix Q_k, a key matrix K_k and a value matrix V_k;
wherein k ∈ [1, h] denotes the k-th attention head, and the weight matrices of the k-th attention head map the matrix to the head dimension d_k;
then extracting the mutual information of each attention head through the attention mechanism:
head_k = Attention(Q_k, K_k, V_k) = softmax(Q_k K_k^T / √d_k) V_k
wherein d_k denotes the dimension of the k-th attention head, and the superscript T denotes transposition;
finally, connecting the outputs head_1, …, head_h of all attention heads together and obtaining, through a feed-forward network, the attention features of the sentences with respect to the positions, in which the i-th entry represents the attention feature of the sentence with respect to the i-th position;
(5.4) calculating the probability of the position of each sentence;
(5.4.1) calculating the probability that the i-th sentence occupies each of the m positions, wherein the score of the i-th sentence at the i-th position is ω_i; ω_i is computed from the attention features through the weight matrices W_p and W_b and the column weight vector u, and ptr_i = softmax(ω_i);
similarly, the probabilities of the i-th sentence over the m positions are calculated according to this formula and recorded as the position probability set {ptr_1, ptr_2, …, ptr_i, …, ptr_m};
(5.4.2) taking the position probability with the largest value in the position probability set as the final probability of the position of the i-th sentence, denoted Ptr_i; in the same way, obtaining the final probability of the position of each sentence, denoted {Ptr_1, Ptr_2, …, Ptr_i, …, Ptr_m};
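The scoring formula behind ω_i is given only as an image in the original; the sketch below therefore assumes one common additive pointer-style form, ω_{i,j} = u^T tanh(W_p a_j + W_b s_i), purely for illustration, and then applies the softmax and max steps described above (Python/NumPy, all names illustrative):

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def position_probabilities(A_pos, S_sent, rng):
        # A_pos: (m, d) attention features of the positions, S_sent: (m, d) sentence
        # features; returns an (m, m) matrix whose row i is ptr_i over the m positions
        m, d = A_pos.shape
        W_p = rng.standard_normal((d, d)) * 0.02
        W_b = rng.standard_normal((d, d)) * 0.02
        u = rng.standard_normal(d) * 0.02              # column weight vector u
        proj_pos = A_pos @ W_p.T                       # W_p applied to the position features
        proj_sent = S_sent @ W_b.T                     # W_b applied to the sentence features
        omega = np.tanh(proj_sent[:, None, :] + proj_pos[None, :, :]) @ u   # omega[i, j]
        return softmax(omega, axis=-1)                 # ptr_i = softmax(omega_i)

    rng = np.random.default_rng(0)
    ptr = position_probabilities(rng.standard_normal((4, 64)),
                                 rng.standard_normal((4, 64)), rng)
    Ptr = ptr.max(axis=-1)                             # final probability of each sentence, step (5.4.2)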
(5.5) sorting the out-of-order sentences according to the position probabilities;
starting from the first position, selecting from the set {Ptr_1, Ptr_2, …, Ptr_i, …, Ptr_m} the sentence with the largest probability value and placing it at the first position, setting the probability value of the placed sentence to zero, and repeating these steps until the m-th position is filled, thereby completing the coherence recovery of the out-of-order sentences.
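Read literally, step (5.5) is a greedy assignment over the final probabilities; a minimal sketch of that reading (Python, names illustrative) could be:

    import numpy as np

    def greedy_order(Ptr):
        # Ptr: length-m array, Ptr[i] = final probability of sentence i (step (5.4.2));
        # positions are filled one by one with the remaining sentence of largest
        # probability, whose value is then set to zero once it has been placed
        P = np.array(Ptr, dtype=float)
        order = []
        for _ in range(P.size):
            sent = int(np.argmax(P))
            order.append(sent)
            P[sent] = 0.0
        return order        # order[k] = index of the sentence restored at position k+1

    print(greedy_order([0.1, 0.7, 0.2]))   # -> [1, 2, 0]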
CN202111638661.3A 2021-12-29 2021-12-29 Cross-modal semantic consistency recovery method Active CN114330279B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111638661.3A CN114330279B (en) 2021-12-29 2021-12-29 Cross-modal semantic consistency recovery method

Publications (2)

Publication Number Publication Date
CN114330279A (en) 2022-04-12
CN114330279B (en) 2023-04-18

Family

ID=81016638

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111638661.3A Active CN114330279B (en) 2021-12-29 2021-12-29 Cross-modal semantic consistency recovery method

Country Status (1)

Country Link
CN (1) CN114330279B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090235228A1 (en) * 2008-03-11 2009-09-17 Ching-Tsun Chou Methodology and tools for table-based protocol specification and model generation
CN108897852A (en) * 2018-06-29 2018-11-27 北京百度网讯科技有限公司 Judgment method, device and the equipment of conversation content continuity
CN110472242A (en) * 2019-08-05 2019-11-19 腾讯科技(深圳)有限公司 A kind of text handling method, device and computer readable storage medium
US20210117778A1 (en) * 2019-10-16 2021-04-22 Apple Inc. Semantic coherence analysis of deep neural networks
CN111951207A (en) * 2020-08-25 2020-11-17 福州大学 Image Quality Enhancement Method Based on Deep Reinforcement Learning and Semantic Loss
CN112991350A (en) * 2021-02-18 2021-06-18 西安电子科技大学 RGB-T image semantic segmentation method based on modal difference reduction
CN112966127A (en) * 2021-04-07 2021-06-15 北方民族大学 Cross-modal retrieval method based on multilayer semantic alignment
CN113378546A (en) * 2021-06-10 2021-09-10 电子科技大学 Non-autoregressive sentence sequencing method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHILIANG WU et al.: "DAPC-Net: Deformable Alignment and Pyramid Context Completion Networks for Video Inpainting" *
李京谕 et al.: "基于联合注意力机制的篇章级机器翻译" [Document-level Machine Translation Based on a Joint Attention Mechanism] *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118839699A (en) * 2024-07-12 2024-10-25 University of Electronic Science and Technology of China Weakly supervised cross-modal semantic consistency recovery method
CN118839699B (en) * 2024-07-12 2025-09-26 University of Electronic Science and Technology of China A weakly supervised cross-modal semantic coherence restoration method

Also Published As

Publication number Publication date
CN114330279B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
CN109783657B (en) Multi-step self-attention cross-media retrieval method and system based on limited text space
CN107133211B (en) Composition scoring method based on attention mechanism
CN109492227A (en) It is a kind of that understanding method is read based on the machine of bull attention mechanism and Dynamic iterations
CN113779996B (en) Standard entity text determining method and device based on BiLSTM model and storage medium
CN109871538A (en) A Named Entity Recognition Method for Chinese Electronic Medical Records
CN113157885B (en) An efficient intelligent question answering system for artificial intelligence domain knowledge
CN110781683A (en) A method for joint extraction of entity relations
CN112100348A (en) Knowledge base question-answer relation detection method and system of multi-granularity attention mechanism
CN116151256A (en) A Few-Shot Named Entity Recognition Method Based on Multi-task and Hint Learning
CN114091450B (en) Judicial domain relation extraction method and system based on graph convolution network
CN116450883B (en) Video moment retrieval method based on fine-grained information of video content
CN112231491A (en) Similar test question identification method based on knowledge structure
CN110347857A (en) The semanteme marking method of remote sensing image based on intensified learning
CN114912512A (en) A method for automatic evaluation of the results of image descriptions
CN117708339B (en) ICD automatic coding method based on pre-training language model
CN115630145A (en) A dialogue recommendation method and system based on multi-granularity emotion
CN116821292B (en) Entity and relation linking method based on abstract semantic representation in knowledge base question and answer
CN112651225A (en) Multi-item selection machine reading understanding method based on multi-stage maximum attention
CN115329120A (en) Weak label Hash image retrieval framework with knowledge graph embedded attention mechanism
CN118035433A (en) An adaptive text summarization method with multimodal anchors
CN119313966A (en) Small sample image classification method and system based on multi-scale cross-modal cue enhancement
CN119988664A (en) Cross-modal image and text retrieval processing method and system
CN118069877A (en) Lightweight multimodal image description generation method based on CLIP encoder
CN113806551A (en) Domain knowledge extraction method based on multi-text structure data
CN114330279B (en) Cross-modal semantic consistency recovery method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant