CN114330279A - Cross-modal semantic coherence recovery method
Description
Technical Field
The invention belongs to the technical field of natural language processing and, more particularly, relates to a cross-modal semantic coherence recovery method.
Background Art
Coherence modeling has long been an important research topic, widely studied in natural language processing, that aims to organize a set of sentences into a coherent text whose order is logically consistent. Although some progress has been made, research on semantic coherence modeling remains confined to the single modality of text. Existing methods for semantic coherence analysis and recovery are unimodal: given a set of sentences in the text modality, they typically adopt an encoder-decoder architecture and use a pointer network for sequence prediction.
Semantic coherence originally measured whether a text is linguistically meaningful; the notion extends to a broader sense, assessing the logical, ordered, and consistent relations among elements in various modalities. For humans, coherence modeling is a natural and essential ability that allows us to understand and perceive the world as a whole, so modeling the coherence of information is very important for promoting human perception and understanding of the physical world.
The current mainstream unimodal semantic coherence analysis and recovery method is an autoregressive attention-based one. It uses a Bi-LSTM to extract basic sentence feature vectors and, inspired by the self-attention mechanism, adopts a Transformer variant with positional encoding removed to extract reliable paragraph representations, eliminating the influence of sentence input order and yielding the sentence features within a paragraph. Average pooling then produces a paragraph feature that initializes the hidden state of a recurrent decoder, and a pointer network with greedy or beam search recursively predicts the ordered, coherent composition of the paragraph, completing unimodal semantic coherence analysis and recovery.
Existing semantic coherence modeling work thus focuses on the single modality of text. During encoding, a bidirectional long short-term memory (Bi-LSTM) network extracts the basic feature vector of each sentence, a self-attention mechanism extracts the sentence context features, and average pooling yields the paragraph feature; note in particular that a Transformer variant with positional encoding removed is used here. During decoding, a pointer network composed of LSTM units serves as the decoder: the basic sentence feature vectors are its inputs, the input vector at the first step is the zero vector, and the paragraph feature initializes the hidden state. Although these methods effectively handle unimodal semantic coherence analysis and recovery and steadily improve unimodal performance, they ignore information integration and semantic consistency across modalities and lack cross-modal information.
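For contrast with the non-autoregressive scheme of the invention, the autoregressive pointer-network decoding described above can be sketched in PyTorch as follows. The bilinear attention score and all names here are assumptions on the sketch's part, since the background does not fix a parameterization.

```python
import torch
import torch.nn as nn

def pointer_decode(sent_feats, para_feat, cell, Wc):
    """Autoregressive pointer decoding: sent_feats (m, d) are basic sentence
    features; para_feat (d,) is the average-pooled paragraph feature that
    initializes the LSTM hidden state; cell is an nn.LSTMCell(d, d); Wc: (d, d)."""
    m, d = sent_feats.shape
    h, c = para_feat.unsqueeze(0), torch.zeros(1, d)
    inp = torch.zeros(1, d)                      # first-step input is the zero vector
    order, used = [], torch.zeros(m, dtype=torch.bool)
    for _ in range(m):
        h, c = cell(inp, (h, c))
        scores = (sent_feats @ Wc) @ h.squeeze(0)         # pointer attention over sentences
        scores = scores.masked_fill(used, float("-inf"))  # a placed sentence cannot recur
        i = int(scores.argmax())                          # greedy search step
        order.append(i)
        used[i] = True
        inp = sent_feats[i].unsqueeze(0)                  # feed the chosen sentence back in
    return order

m, d = 5, 512
order = pointer_decode(torch.randn(m, d), torch.randn(d),
                       nn.LSTMCell(d, d), torch.randn(d, d))
```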
Summary of the Invention
The purpose of the present invention is to overcome the deficiencies of the prior art by providing a cross-modal semantic coherence recovery method that, based on the semantic consistency between the text and image modalities, effectively uses cross-modal information to guide semantic coherence recovery in the text modality.
To achieve the above object, the cross-modal semantic coherence recovery method of the present invention comprises the following steps:
(1) Let the out-of-order sentences whose semantic coherence is to be recovered in the text modality be $X=\{x_1,x_2,\dots,x_i,\dots,x_m\}$, where $x_i$ is the $i$-th sentence and $m$ is the number of out-of-order sentences. Let a group of ordered, coherent images in the image modality be $Y=\{y_1,y_2,\dots,y_j,\dots,y_n\}$, where $y_j$ is the $j$-th image and $n$ is the number of images. Assume the text modality and the image modality share similar semantics.
(2) Obtain the basic features of the text modality and the image modality.
(2.1) Use a bidirectional long short-term memory network to obtain the basic features of the out-of-order sentences: feed $X$ into the Bi-LSTM, which outputs the basic features $\{\bar{x}_1,\dots,\bar{x}_m\}$, where $\bar{x}_i$ denotes the basic feature of the $i$-th sentence and has dimension $1\times d$.
(2.2) Use a convolutional neural network to obtain the basic features of the ordered, coherent images: feed $Y$ into the convolutional neural network, which outputs the basic features $\{\bar{y}_1,\dots,\bar{y}_n\}$, where $\bar{y}_j$ denotes the basic feature of the $j$-th image and has dimension $1\times d$.
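To make step (2) concrete, a minimal PyTorch sketch follows; it is not the patent's reference implementation. The mean pooling of the Bi-LSTM states, the small CNN stack (in practice a pretrained backbone such as ResNet would be usual), and all class and parameter names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """Bi-LSTM sentence encoder: token ids -> one 1 x d basic feature per sentence."""
    def __init__(self, vocab_size, emb_dim=300, d=512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # d // 2 units per direction so the concatenated output has dimension d
        self.lstm = nn.LSTM(emb_dim, d // 2, bidirectional=True, batch_first=True)

    def forward(self, tokens):                    # tokens: (m, seq_len)
        states, _ = self.lstm(self.emb(tokens))   # (m, seq_len, d)
        return states.mean(dim=1)                 # (m, d): basic sentence features

class ImageEncoder(nn.Module):
    """CNN image encoder: pixels -> one 1 x d basic feature per image."""
    def __init__(self, d=512):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(128, d)

    def forward(self, images):                    # images: (n, 3, H, W)
        return self.proj(self.cnn(images).flatten(1))   # (n, d)
```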
(3) Obtain the context features of the text modality and the image modality.
(3.1) Use a Transformer variant with positional embedding removed to obtain the context features of the text modality.
(3.1.1) Stack the basic features of the sentences into a matrix $\bar{X}\in\mathbb{R}^{m\times d}$.
(3.1.2) Use the Transformer's $h$-head attention layer to first map the basic features $\bar{X}$ to query, key, and value matrices:

$$Q_k=\bar{X}W_k^Q,\quad K_k=\bar{X}W_k^K,\quad V_k=\bar{X}W_k^V$$

where $k\in[1,h]$ indexes the $k$-th attention head and $W_k^Q$, $W_k^K$, $W_k^V$ are the weight matrices of the $k$-th attention head, each of dimension $d\times d/h$.
Then extract the interaction information of each attention head through the attention mechanism:

$$\mathrm{head}_k=\mathrm{softmax}\!\left(\frac{Q_kK_k^\top}{\sqrt{d_k}}\right)V_k$$

where $d_k=d/h$ is the dimension of the $k$-th attention head and the superscript $\top$ denotes transposition.
Finally, concatenate the outputs $\mathrm{head}_1,\dots,\mathrm{head}_h$ of the attention heads and pass them through a feed-forward network to obtain the context features $\{\tilde{x}_1,\dots,\tilde{x}_m\}$ of the out-of-order sentences, where $\tilde{x}_i$ denotes the context feature of the $i$-th sentence.
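Step (3.1) is standard multi-head scaled dot-product self-attention without positional encoding; a simplified PyTorch sketch follows. The residual connections and layer normalization of a full Transformer block are omitted, and all names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    """h-head scaled dot-product self-attention over the stacked features,
    with no positional encoding, so the output is invariant to input order."""
    def __init__(self, d=512, h=4):
        super().__init__()
        self.h, self.dk = h, d // h
        self.Wq = nn.Linear(d, d, bias=False)   # holds all W_k^Q side by side
        self.Wk = nn.Linear(d, d, bias=False)
        self.Wv = nn.Linear(d, d, bias=False)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))

    def forward(self, X):                       # X: (m, d) stacked basic features
        m, d = X.shape
        split = lambda M: M.view(m, self.h, self.dk).transpose(0, 1)   # (h, m, dk)
        Q, K, V = split(self.Wq(X)), split(self.Wk(X)), split(self.Wv(X))
        att = F.softmax(Q @ K.transpose(1, 2) / self.dk ** 0.5, dim=-1)  # (h, m, m)
        heads = (att @ V).transpose(0, 1).reshape(m, d)   # concatenate the h heads
        return self.ffn(heads)                  # (m, d) context features
```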
(3.2) Use a Transformer variant that retains positional embedding to obtain the context features of the image modality.
(3.2.1) Stack the basic features of the images into a matrix $\bar{Y}\in\mathbb{R}^{n\times d}$.
(3.2.2) Embed the discrete position of each image's basic feature $\bar{y}_j$ in $\bar{Y}$ as a compact position, denoted $p_j$.
For the basic feature $\bar{y}_j$, the even dimensions are projected and embedded as $p_{j,2l}=\sin\!\left(j/10000^{2l/d}\right)$ and the odd dimensions as $p_{j,2l+1}=\cos\!\left(j/10000^{2l/d}\right)$, where $p_{j,2l}$ and $p_{j,2l+1}$ are the embedded values of the even and odd dimensions respectively, $l$ is a constant, and $2l,2l+1\in[1,d]$. After all dimensions have been embedded, the compact position $p_j$ is obtained.
Finally, stack the compact positions $p_j$ of the images into the position embedding matrix $P_y\in\mathbb{R}^{n\times d}$.
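The projection embedding of step (3.2.2) is the familiar sinusoidal position encoding; a short sketch with an illustrative function name:

```python
import torch

def position_embedding(num, d=512):
    """Sinusoidal position embedding: row j is the compact position p_j,
    p[j, 2l] = sin(j / 10000^(2l/d)), p[j, 2l+1] = cos(j / 10000^(2l/d))."""
    pos = torch.arange(num, dtype=torch.float).unsqueeze(1)   # (num, 1)
    two_l = torch.arange(0, d, 2, dtype=torch.float)          # even dimensions 2l
    freq = torch.pow(10000.0, two_l / d)
    P = torch.zeros(num, d)
    P[:, 0::2] = torch.sin(pos / freq)   # even dimensions
    P[:, 1::2] = torch.cos(pos / freq)   # odd dimensions
    return P                             # (num, d) position embedding matrix

P_y = position_embedding(5)              # e.g. positions of n = 5 ordered images
```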
(3.2.3) Add the basic features $\bar{Y}$ and the position embedding $P_y$, and use the Transformer's $h$-head attention layer to map the sum to query, key, and value matrices:

$$Q_k=(\bar{Y}+P_y)W_k^Q,\quad K_k=(\bar{Y}+P_y)W_k^K,\quad V_k=(\bar{Y}+P_y)W_k^V$$

where $k\in[1,h]$ indexes the $k$-th attention head and $W_k^Q$, $W_k^K$, $W_k^V$ are its weight matrices, each of dimension $d\times d/h$.
Then extract the interaction information of each attention head through the attention mechanism:

$$\mathrm{head}_k=\mathrm{softmax}\!\left(\frac{Q_kK_k^\top}{\sqrt{d_k}}\right)V_k$$

where $d_k=d/h$ is the dimension of the $k$-th attention head and $\top$ denotes transposition.
Finally, concatenate the outputs of the attention heads and pass them through a feed-forward network to obtain the context features $\{\tilde{y}_1,\dots,\tilde{y}_n\}$ of the ordered, coherent images, where $\tilde{y}_j$ denotes the context feature of the $j$-th image.
(4) Obtain the attention information of the cross-modal ordered positions.
(4.1) Convert the context features of the two modalities into a common semantic space through linear projection.
(4.1.1) Linearly project the context features of the two modalities:

$$s^x_i=\mathrm{ReLU}(W_1\tilde{x}_i+b_1),\quad s^y_j=\mathrm{ReLU}(W_2\tilde{y}_j+b_2)$$

where $W_1$ and $W_2$ are weight parameters, $b_1$ and $b_2$ are bias terms, and $\mathrm{ReLU}(\cdot)$ is the rectified linear activation function.
(4.1.2) Common semantic space conversion: stack the projected context features $s^x_i$ to obtain the semantic representation matrix $S_x$ of the text modality, and stack the projected context features $s^y_j$ to obtain the semantic representation matrix $S_y$ of the image modality.
(4.2) Compute the semantic correlation between the two modalities: $Corr=S_xS_y^\top$.
(4.3) Use the semantic correlation of the two modalities to convert the position embeddings of the ordered images in the image modality into attention information in the text modality.
(4.3.1) Use the attention mechanism to obtain the implicit position information $\tilde{P}$ of each sentence in the text modality:

$$\alpha=\mathrm{softmax}(Corr),\quad \tilde{P}=\alpha P_y$$
(4.3.2) Stack the context features $\tilde{x}_i$ of the sentences and add the result to the implicit position information $\tilde{P}$, obtaining the sentence context features $F$ with ordered-position attention information, of dimension $m\times d$.
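Steps (4.1) to (4.3) condense into a few lines. The correlation $Corr=S_xS_y^\top$ and the transfer $\tilde{P}=\alpha P_y$ reconstruct formulas that are not fully legible in the source, so this sketch is an interpretation rather than the patent's exact computation.

```python
import torch
import torch.nn.functional as F

# Assumed shapes: Cx (m, d) sentence context features, Cy (n, d) image context
# features, Py (n, d) image position embeddings; W1, b1, W2, b2 are learnable.
def cross_modal_position_attention(Cx, Cy, Py, W1, b1, W2, b2):
    Sx = F.relu(Cx @ W1 + b1)         # text features in the common space   (m, d)
    Sy = F.relu(Cy @ W2 + b2)         # image features in the common space  (n, d)
    corr = Sx @ Sy.T                  # semantic correlation Corr           (m, n)
    alpha = F.softmax(corr, dim=-1)   # attention over the ordered images
    P_hat = alpha @ Py                # implicit position info per sentence (m, d)
    return Cx + P_hat                 # context features with ordered-position attention

m, n, d = 5, 5, 512
F_feat = cross_modal_position_attention(
    torch.randn(m, d), torch.randn(n, d), torch.randn(n, d),
    torch.randn(d, d), torch.randn(d), torch.randn(d, d), torch.randn(d))
```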
(5) Coherence recovery of the out-of-order sentences.
(5.1) Embed the discrete position of each sentence's basic feature $\bar{x}_i$ in $\bar{X}$ as a compact position, denoted $p_i$. For the basic feature $\bar{x}_i$, the even dimensions are projected and embedded as $p_{i,2l}=\sin\!\left(i/10000^{2l/d}\right)$ and the odd dimensions as $p_{i,2l+1}=\cos\!\left(i/10000^{2l/d}\right)$, where $p_{i,2l}$ and $p_{i,2l+1}$ are the embedded values of the even and odd dimensions respectively, $l$ is a constant, and $2l,2l+1\in[1,d]$. After all dimensions have been embedded, the compact position $p_i$ is obtained. Finally, stack the compact positions $p_i$ of the sentences into the position embedding matrix $P_x\in\mathbb{R}^{m\times d}$.
(5.2) Use the Transformer's $h$-head attention layer to map the position embedding matrix $P_x$ to query, key, and value matrices:

$$Q_k=P_xW_k^Q,\quad K_k=P_xW_k^K,\quad V_k=P_xW_k^V$$

where $k\in[1,h]$ indexes the $k$-th attention head and $W_k^Q$, $W_k^K$, $W_k^V$ are its weight matrices, each of dimension $d\times d/h$. Then extract the interaction information of each attention head through the attention mechanism:

$$\mathrm{head}_k=\mathrm{softmax}\!\left(\frac{Q_kK_k^\top}{\sqrt{d_k}}\right)V_k$$

where $d_k=d/h$ is the dimension of the $k$-th attention head and $\top$ denotes transposition. Finally, concatenate the outputs of the attention heads and pass them through a feed-forward network to obtain the interaction features $\{o_1,\dots,o_m\}$ between sentence positions, where $o_i$ denotes the interaction feature of the $i$-th sentence position.
(5.3) Obtain the attention feature of each sentence with respect to the positions through a multi-head mutual-attention module.
(5.3.1) Stack the interaction features $o_i$ of the sentence positions into a matrix $O\in\mathbb{R}^{m\times d}$.
(5.3.2) Use the Transformer's $h$-head attention layer to map the matrix $O$ to the query matrix and the matrix $F$ to the key and value matrices:

$$Q_k=OW_k^Q,\quad K_k=FW_k^K,\quad V_k=FW_k^V$$

where $k\in[1,h]$ indexes the $k$-th attention head and $W_k^Q$, $W_k^K$, $W_k^V$ are its weight matrices, each of dimension $d\times d/h$.
Then extract the interaction information of each attention head through the attention mechanism:

$$\mathrm{head}_k=\mathrm{softmax}\!\left(\frac{Q_kK_k^\top}{\sqrt{d_k}}\right)V_k$$

where $d_k=d/h$ is the dimension of the $k$-th attention head and $\top$ denotes transposition.
Finally, concatenate the outputs of the attention heads and pass them through a feed-forward network to obtain the attention features $\{g_1,\dots,g_m\}$ of the sentences with respect to the positions, where $g_i$ denotes the attention feature for the $i$-th position.
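The mutual attention of step (5.3) reuses the attention block sketched after step (3.1), except that queries come from the position-interaction features while keys and values come from the position-aware sentence features. A hedged sketch, with all names illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadMutualAttention(nn.Module):
    """Queries from the position-interaction features O; keys and values from
    the position-aware sentence features F_feat (step 5.3)."""
    def __init__(self, d=512, h=4):
        super().__init__()
        self.h, self.dk = h, d // h
        self.Wq = nn.Linear(d, d, bias=False)
        self.Wk = nn.Linear(d, d, bias=False)
        self.Wv = nn.Linear(d, d, bias=False)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))

    def forward(self, O, F_feat):                # O: (m, d), F_feat: (m, d)
        m, d = O.shape
        split = lambda M: M.view(-1, self.h, self.dk).transpose(0, 1)   # (h, m, dk)
        Q = split(self.Wq(O))                                # positions ask ...
        K, V = split(self.Wk(F_feat)), split(self.Wv(F_feat))  # ... sentences answer
        att = F.softmax(Q @ K.transpose(1, 2) / self.dk ** 0.5, dim=-1)
        G = (att @ V).transpose(0, 1).reshape(m, d)
        return self.ffn(G)                       # (m, d): per-position attention features
```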
(5.4) Compute the probability of each sentence's position.
(5.4.1) Compute the probability that the $i$-th sentence occupies each of the $m$ positions, where the attention score of the $i$-th sentence for the $t$-th position is

$$\omega_{i,t}=u^\top\tanh(W_p\bar{x}_i+W_bg_t),\quad ptr_i=\mathrm{softmax}(\omega_i)$$

where $W_p$ and $W_b$ are weight matrices and $u$ is a column weight vector.
Likewise, compute by the above formula the probabilities of every sentence over the $m$ positions, recorded as the position probability set $\{ptr_1,ptr_2,\dots,ptr_i,\dots,ptr_m\}$.
(5.4.2) Take the largest value in $ptr_i$ as the final probability of the $i$-th sentence's position, denoted $Ptr_i$; likewise obtain the final probability of every sentence's position, denoted $\{Ptr_1,Ptr_2,\dots,Ptr_i,\dots,Ptr_m\}$.
(5.5) Sort the out-of-order sentences according to the position probabilities.
Starting from the first position, select from $\{Ptr_1,Ptr_2,\dots,Ptr_m\}$ the sentence with the largest probability value and place it at the first position; set the probability of the placed sentence to zero, and proceed likewise until the $m$-th position has been filled, completing the coherence recovery of the out-of-order sentences.
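A sketch of steps (5.4) and (5.5). The score form $\omega_{i,t}=u^\top\tanh(W_p\bar{x}_i+W_bg_t)$ is reconstructed from the description of $W_p$, $W_b$, and $u$, and the greedy loop follows one natural reading of step (5.5): each position, from first to last, takes the highest-probability sentence not yet placed (mask exclusion).

```python
import torch
import torch.nn.functional as F

def order_sentences(X_bar, G, Wp, Wb, u):
    """X_bar: (m, d) basic sentence features; G: (m, d) per-position attention
    features; Wp, Wb: (d, d) weight matrices; u: (d,) column weight vector."""
    m = X_bar.shape[0]
    # scores[i, t] = u^T tanh(Wp x_i + Wb g_t): sentence i's score for position t
    scores = torch.tanh((X_bar @ Wp.T).unsqueeze(1) + (G @ Wb.T).unsqueeze(0)) @ u
    ptr = F.softmax(scores, dim=1)     # ptr[i, t]: prob. that sentence i sits at position t
    order, assigned = [], torch.zeros(m, dtype=torch.bool)
    for t in range(m):                 # fill positions left to right
        p = ptr[:, t].masked_fill(assigned, 0.0)   # mask exclusion of placed sentences
        i = int(p.argmax())                        # greedy selection
        order.append(i)
        assigned[i] = True
    return order                       # order[t] = index of the sentence at position t
```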
The object of the invention is achieved as follows:
The cross-modal semantic coherence recovery method of the present invention first obtains the basic and context features of the text modality and the image modality, then converts the context features of the two modalities into a common semantic space through linear projection to obtain the attention information of the cross-modal ordered positions, and finally sorts the out-of-order sentences using the ordered-position attention information, thereby completing their coherence recovery.
Meanwhile, the cross-modal semantic coherence recovery method of the present invention has the following beneficial effects:
(1) The proposed cross-modal semantic coherence analysis and recovery method can effectively extract features from elements in different modalities, make full use of cross-modal position information to assist and promote semantic coherence analysis and recovery in a single modality, and predict the element recovered at every position in parallel, further improving the speed and accuracy of the task.
(2) The invention effectively connects text and image modalities with similar semantics in a cross-modal manner, which benefits the analysis of semantic coherence and the introduction of position attention information from the ordered, coherent modality.
Brief Description of the Drawings
Fig. 1 is a flowchart of the cross-modal semantic coherence recovery method of the present invention.
Detailed Description of the Embodiments
Specific embodiments of the present invention are described below with reference to the accompanying drawings so that those skilled in the art can better understand the present invention. Note that, in the following description, detailed descriptions of known functions and designs are omitted where they might obscure the main content of the present invention.
Embodiment
Fig. 1 is a flowchart of the cross-modal semantic coherence recovery method of the present invention.
In this embodiment, as shown in Fig. 1, the cross-modal semantic coherence recovery method of the present invention comprises the following steps:
S1. Let the out-of-order sentences whose semantic coherence is to be recovered in the text modality be $X=\{x_1,x_2,\dots,x_i,\dots,x_m\}$, where $x_i$ is the $i$-th sentence and $m$ is the number of out-of-order sentences; let a group of ordered, coherent images in the image modality be $Y=\{y_1,y_2,\dots,y_j,\dots,y_n\}$, where $y_j$ is the $j$-th image and $n$ is the number of images. Assuming the text and image modalities share similar semantics, the images are used to assist in recovering the text into an ordered, coherent paragraph.
S2. Obtain the basic features of the text modality and the image modality.
S2.1. Use a bidirectional long short-term memory network to obtain the basic features of the out-of-order sentences: feed $X$ into the Bi-LSTM, which outputs the basic features $\{\bar{x}_1,\dots,\bar{x}_m\}$, where $\bar{x}_i$ denotes the basic feature of the $i$-th sentence and has dimension $1\times d$, with $d=512$.
S2.2. Use a convolutional neural network to obtain the basic features of the ordered, coherent images: feed $Y$ into the convolutional neural network, which outputs the basic features $\{\bar{y}_1,\dots,\bar{y}_n\}$, where $\bar{y}_j$ denotes the basic feature of the $j$-th image and has dimension $1\times d$.
S3. Obtain the context features of the text modality and the image modality.
S3.1. To exploit contextual semantic relations, use a Transformer variant with positional embedding removed to obtain the context features of the text modality; it uses scaled dot-product self-attention to exploit the context information.
S3.1.1. Stack the basic features of the sentences into a matrix $\bar{X}\in\mathbb{R}^{m\times d}$.
S3.1.2. Use the Transformer's $h$-head attention layer to first map the basic features $\bar{X}$ to query, key, and value matrices:

$$Q_k=\bar{X}W_k^Q,\quad K_k=\bar{X}W_k^K,\quad V_k=\bar{X}W_k^V$$

where $k\in[1,h]$ indexes the $k$-th attention head and $W_k^Q$, $W_k^K$, $W_k^V$ are its weight matrices, each of dimension $d\times d/h$, with $h=4$.
Then extract the interaction information of each attention head through the attention mechanism:

$$\mathrm{head}_k=\mathrm{softmax}\!\left(\frac{Q_kK_k^\top}{\sqrt{d_k}}\right)V_k$$

where $d_k=d/h$ is the dimension of the $k$-th attention head and $\top$ denotes transposition.
Finally, concatenate the outputs of the attention heads and pass them through a feed-forward network to obtain the context features $\{\tilde{x}_1,\dots,\tilde{x}_m\}$ of the out-of-order sentences, where $\tilde{x}_i$ denotes the context feature of the $i$-th sentence.
S3.2. To model the coherent semantic information of the images, a Transformer variant that retains positional embedding is used here to obtain the context features of the image modality.
S3.2.1. Stack the basic features of the images into a matrix $\bar{Y}\in\mathbb{R}^{n\times d}$.
S3.2.2. Embed the discrete position of each image's basic feature $\bar{y}_j$ in $\bar{Y}$ as a compact position, denoted $p_j$.
For the basic feature $\bar{y}_j$, the even dimensions are projected and embedded as $p_{j,2l}=\sin\!\left(j/10000^{2l/d}\right)$ and the odd dimensions as $p_{j,2l+1}=\cos\!\left(j/10000^{2l/d}\right)$, where $p_{j,2l}$ and $p_{j,2l+1}$ are the embedded values of the even and odd dimensions respectively, $l$ is a constant, and $2l,2l+1\in[1,d]$. After all dimensions have been embedded, the compact position $p_j$ is obtained.
Finally, stack the compact positions $p_j$ of the images into the position embedding matrix $P_y\in\mathbb{R}^{n\times d}$.
S3.2.3. Add the basic features $\bar{Y}$ and the position embedding $P_y$, and use the Transformer's $h$-head attention layer to map the sum to query, key, and value matrices:

$$Q_k=(\bar{Y}+P_y)W_k^Q,\quad K_k=(\bar{Y}+P_y)W_k^K,\quad V_k=(\bar{Y}+P_y)W_k^V$$

where $k\in[1,h]$ indexes the $k$-th attention head and $W_k^Q$, $W_k^K$, $W_k^V$ are its weight matrices, each of dimension $d\times d/h$.
Then extract the interaction information of each attention head through the attention mechanism:

$$\mathrm{head}_k=\mathrm{softmax}\!\left(\frac{Q_kK_k^\top}{\sqrt{d_k}}\right)V_k$$

where $d_k=d/h$ is the dimension of the $k$-th attention head and $\top$ denotes transposition.
Finally, concatenate the outputs of the attention heads and pass them through a feed-forward network to obtain the context features $\{\tilde{y}_1,\dots,\tilde{y}_n\}$ of the ordered, coherent images, where $\tilde{y}_j$ denotes the context feature of the $j$-th image.
S4. Obtain the attention information of the cross-modal ordered positions.
S4.1. To exploit the cross-modal order information coming from the image modality, a cross-modal position attention module connects the semantic consistency between the two modalities.
First, the context features of the two modalities are converted into a common semantic space through linear projection.
S4.1.1. Linearly project the context features of the two modalities:

$$s^x_i=\mathrm{ReLU}(W_1\tilde{x}_i+b_1),\quad s^y_j=\mathrm{ReLU}(W_2\tilde{y}_j+b_2)$$

where $W_1$ and $W_2$ are weight parameters, $b_1$ and $b_2$ are bias terms, and $\mathrm{ReLU}(\cdot)$ is the rectified linear activation function.
S4.1.2. Common semantic space conversion: stack the projected context features $s^x_i$ to obtain the semantic representation matrix $S_x$ of the text modality, and stack the projected context features $s^y_j$ to obtain the semantic representation matrix $S_y$ of the image modality.
S4.2. Compute the semantic correlation between the two modalities: $Corr=S_xS_y^\top$.
S4.3. Use the semantic correlation of the two modalities to convert the position embeddings of the ordered images in the image modality into attention information in the text modality.
S4.3.1. Use the attention mechanism to obtain the implicit position information $\tilde{P}$ of each sentence in the text modality:

$$\alpha=\mathrm{softmax}(Corr),\quad \tilde{P}=\alpha P_y$$
S4.3.2. Stack the context features $\tilde{x}_i$ of the sentences and add the result to the implicit position information $\tilde{P}$, obtaining the sentence context features $F$ with ordered-position attention information, of dimension $m\times d$.
S5. Perform coherence recovery.
S5.1. Embed the discrete position of each sentence's basic feature $\bar{x}_i$ in $\bar{X}$ as a compact position, denoted $p_i$. For the basic feature $\bar{x}_i$, the even dimensions are projected and embedded as $p_{i,2l}=\sin\!\left(i/10000^{2l/d}\right)$ and the odd dimensions as $p_{i,2l+1}=\cos\!\left(i/10000^{2l/d}\right)$, where $p_{i,2l}$ and $p_{i,2l+1}$ are the embedded values of the even and odd dimensions respectively, $l$ is a constant, and $2l,2l+1\in[1,d]$. After all dimensions have been embedded, the compact position $p_i$ is obtained. Finally, stack the compact positions $p_i$ of the sentences into the position embedding matrix $P_x\in\mathbb{R}^{m\times d}$.
S5.2. Use the Transformer's $h$-head attention layer to map the position embedding matrix $P_x$ to query, key, and value matrices:

$$Q_k=P_xW_k^Q,\quad K_k=P_xW_k^K,\quad V_k=P_xW_k^V$$

where $k\in[1,h]$ indexes the $k$-th attention head and $W_k^Q$, $W_k^K$, $W_k^V$ are its weight matrices, each of dimension $d\times d/h$. Then extract the interaction information of each attention head through the attention mechanism:

$$\mathrm{head}_k=\mathrm{softmax}\!\left(\frac{Q_kK_k^\top}{\sqrt{d_k}}\right)V_k$$

where $d_k=d/h$ is the dimension of the $k$-th attention head and $\top$ denotes transposition. Finally, concatenate the outputs of the attention heads and pass them through a feed-forward network to obtain the interaction features $\{o_1,\dots,o_m\}$ between sentence positions, where $o_i$ denotes the interaction feature of the $i$-th sentence position.
S5.3. Obtain the attention feature of each sentence with respect to the positions through a multi-head mutual-attention module.
S5.3.1. Stack the interaction features $o_i$ of the sentence positions into a matrix $O\in\mathbb{R}^{m\times d}$.
S5.3.2. Use the Transformer's $h$-head attention layer to map the matrix $O$ to the query matrix and the matrix $F$ to the key and value matrices:

$$Q_k=OW_k^Q,\quad K_k=FW_k^K,\quad V_k=FW_k^V$$

where $k\in[1,h]$ indexes the $k$-th attention head and $W_k^Q$, $W_k^K$, $W_k^V$ are its weight matrices, each of dimension $d\times d/h$.
Then extract the interaction information of each attention head through the attention mechanism:

$$\mathrm{head}_k=\mathrm{softmax}\!\left(\frac{Q_kK_k^\top}{\sqrt{d_k}}\right)V_k$$

where $d_k=d/h$ is the dimension of the $k$-th attention head and $\top$ denotes transposition. Finally, concatenate the outputs of the attention heads and pass them through a feed-forward network to obtain the attention features $\{g_1,\dots,g_m\}$ of the sentences with respect to the positions, where $g_i$ denotes the attention feature for the $i$-th position.
S5.4. Compute the probability of each sentence's position.
S5.4.1. Compute the probability that the $i$-th sentence occupies each of the $m$ positions, where the attention score of the $i$-th sentence for the $t$-th position is

$$\omega_{i,t}=u^\top\tanh(W_p\bar{x}_i+W_bg_t),\quad ptr_i=\mathrm{softmax}(\omega_i)$$

where $W_p$ and $W_b$ are weight matrices and $u$ is a column weight vector.
Likewise, compute by the above formula the probabilities of every sentence over the $m$ positions, recorded as the position probability set $\{ptr_1,ptr_2,\dots,ptr_i,\dots,ptr_m\}$.
S5.4.2. Take the largest value in $ptr_i$ as the final probability of the $i$-th sentence's position, denoted $Ptr_i$; likewise obtain the final probability of every sentence's position, denoted $\{Ptr_1,Ptr_2,\dots,Ptr_i,\dots,Ptr_m\}$.
S5.5. Sort the out-of-order sentences according to the position probabilities.
Starting from the first position, select from $\{Ptr_1,Ptr_2,\dots,Ptr_m\}$ the sentence with the largest probability value and place it at the first position; set the probability of the placed sentence to zero, and proceed likewise until the $m$-th position has been filled, completing the coherence recovery of the out-of-order sentences.
In this embodiment, the invention is applied to several commonly used datasets, including SIND and TACoS, two visual storytelling and story-understanding corpora with data in both text and image form. The perfect match rate (PMR), accuracy (Acc), and the τ metric are adopted as evaluation indicators. PMR measures the performance of element position prediction as a whole; Acc computes the accuracy of the absolute position prediction of individual elements and is a looser metric; the τ metric measures the relative order between all pairs of elements in a prediction and is closer to human judgment.
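The patent does not give formulas for these metrics; the sketch below assumes their standard definitions in the sentence-ordering literature, with orders represented as lists of sentence indices.

```python
def pmr(pred_orders, gold_orders):
    """Perfect match rate: fraction of paragraphs predicted entirely correctly."""
    return sum(p == g for p, g in zip(pred_orders, gold_orders)) / len(gold_orders)

def acc(pred_orders, gold_orders):
    """Accuracy: fraction of sentences placed at their correct absolute position."""
    hits = total = 0
    for p, g in zip(pred_orders, gold_orders):
        hits += sum(a == b for a, b in zip(p, g))
        total += len(g)
    return hits / total

def kendall_tau(pred, gold):
    """tau = 1 - 2 * inversions / C(m, 2) over all sentence pairs of one paragraph."""
    rank = {s: t for t, s in enumerate(pred)}   # predicted position of each sentence
    seq = [rank[s] for s in gold]               # predicted positions in gold order
    m = len(seq)
    if m < 2:
        return 1.0
    inversions = sum(seq[a] > seq[b] for a in range(m) for b in range(a + 1, m))
    return 1 - 2 * inversions / (m * (m - 1) / 2)
```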
Sentence coherence recovery is performed with the present invention and with existing methods; the experimental results are shown in Table 1, where LSTM+PtrNet is the LSTM-plus-pointer-network method, AON-UM is the unimodal autoregressive attention recovery method, AON-CM is the cross-modal autoregressive attention recovery method, NAD-UM is the unimodal non-autoregressive recovery method, NAD-CM1 is the cross-modal non-autoregressive method without position embedding or position attention, NAD-CM2 is the cross-modal non-autoregressive method without position attention, NAD-CM3 is the cross-modal non-autoregressive method without position embedding, NACON (no exl) is the variant without greedy selection and mask exclusion, and NACON is the method of the present invention. The results show that the cross-modal semantic coherence analysis and recovery method greatly outperforms the existing unimodal methods. The invention improves on all evaluation indicators over NAD-CM1, NAD-CM2, and NAD-CM3, verifying the effectiveness of the cross-modal position attention information. Moreover, compared with AON-CM and NACON (no exl), performance also improves notably, verifying the effectiveness of the greedy-selection and mask-exclusion inference scheme of the coherence recovery method designed in the present invention.
Table 1 presents the experimental results on the SIND and TACoS datasets.
Although illustrative specific embodiments of the present invention have been described above to help those skilled in the art understand the present invention, it should be clear that the present invention is not limited to the scope of these specific embodiments. To those of ordinary skill in the art, various changes will be apparent as long as they fall within the spirit and scope of the present invention as defined and determined by the appended claims, and all inventions and creations that make use of the inventive concept are under protection.