CN115936997A - An image super-resolution reconstruction method for cross-modal communication
- Publication number: CN115936997A (application CN202310011043.9A)
- Authority: CN (China)
- Legal status: Pending
Abstract
The invention belongs to the technical field of super-resolution reconstruction of visual signals and discloses an image super-resolution reconstruction method for cross-modal communication. First, only a low-resolution visual signal and the corresponding tactile signal are transmitted at the sending end; the semantic gap between the different modalities is then bridged by learning intra-modal discrimination and inter-modal correlation. After transmission over the channel, complementarity is learned at the receiving end through effective feature fusion, and the resulting fused features are finally used to generate the required high-resolution visual signal. The invention solves the problem in multimodal services that visual quality, and ultimately the user experience, degrades because of limited bandwidth and competition among multimodal signals. By learning the consistency and complementarity of cross-modal signals, it guarantees that the receiving end can obtain high-resolution visual signals under limited bandwidth and improves the user's immersive experience.
Description
Technical Field

The present invention belongs to the technical field of super-resolution reconstruction of visual signals, and specifically relates to an image super-resolution reconstruction method for cross-modal communication.
Background Art

With the rapid development of wireless and multimedia communication technologies, human audiovisual needs have been largely satisfied, and people have begun to pursue more diverse and richer experiences. When tactile signals are combined with traditional audiovisual signals, multimodal services emerge, providing finer-grained interaction and immersive experiences. Many studies have found that in online multimodal service scenarios, high-resolution visual signals and high-fidelity tactile signals improve people's perception of products and their interactive experience. For example, in online shopping, consumers can obtain detailed information about a product's appearance, perceived texture, hardness, and other characteristics through touch and observation. To support multimodal services, cross-modal communication has emerged; it exploits the correlation between different modalities to ensure coordinated transmission and processing across multiple modalities. However, limited bandwidth and inter-modal competition make existing cross-modal communication schemes difficult to implement, degrading the user's immersive experience and, in particular, the visual experience.

Specifically, on the one hand, high-fidelity visual signals are an important guarantee of an immersive user experience, yet limited bandwidth makes it difficult to transmit such high-resolution images/videos in online multimedia communication services. On the other hand, the visual and tactile modalities compete during transmission: to meet the low-latency, high-reliability requirements of tactile signals, existing schemes usually give them higher priority, but the frequent and irregular arrival of tactile signals seriously degrades the transmission quality of visual signals, especially when users touch frequently, as in online shopping.

At present, the degradation of visual signal quality caused by insufficient bandwidth and inter-modal competition can be addressed from two directions: multimodal communication and super-resolution reconstruction. Multimodal communication schemes mainly comprise traditional audio/video communication schemes and tactile communication schemes; each can achieve high-fidelity transmission of audio/video or tactile signals alone, but when audio/video and tactile signals must be transmitted simultaneously, quality at the receiving end cannot be guaranteed. Super-resolution reconstruction schemes reconstruct a high-resolution visual signal from a low-resolution one, either from the single visual signal itself or with reference information (such as visual signals from different angles, adjacent frames, or edge maps), but most of them perform the reconstruction at a local terminal and do not involve a communication task.

The existing multimodal transmission schemes for limited bandwidth described above have two main defects: each modality is considered in isolation, without properly exploiting the consistency and complementarity among multimodal data; and the communication process is ignored, with data processed only at the terminal, so the inter-modal competition that occurs during multimodal data transmission is not taken into account.
Summary of the Invention

To overcome the shortcomings of the prior art, the present invention provides an image super-resolution reconstruction method for cross-modal communication. Relying on the semantic consistency between tactile and visual signals, it extracts and maps the features of each modality by fully considering intra-modal and inter-modal relationships; then, with a powerful feature fusion network, it uses the low-resolution visual signal and the tactile signal to generate fused features as similar as possible to the features of the high-resolution visual signal, and finally obtains the high-resolution visual signal, thereby safeguarding the user's immersive experience in bandwidth-limited multimodal application scenarios.

To achieve the above object, the present invention is realized through the following technical solutions:

The present invention is an image super-resolution reconstruction method for cross-modal communication, comprising the following steps:
Step (1): using the complete high-resolution visual signal, perform encoding and decoding of the high-resolution visual signal; the encoding step trains the encoding network of the high-resolution visual signal and yields its encoding features, and the decoding step trains the decoding (generation) network of the high-resolution visual signal, which supports the subsequent visual signal super-resolution reconstruction model;
Step (2): design a haptic-aided visual signal super-resolution reconstruction model, HaSR (Haptic-aided Super-resolution Reconstruction); the HaSR model is as follows:
After the visual and tactile signals are collected at the terminal, the visual signal is first downsampled at an edge node to obtain a low-resolution visual signal; a pre-trained, widely used encoding network then performs preliminary feature extraction on the low-resolution visual signal and the corresponding tactile signal;
Next, exploiting intra-modal discriminability and inter-modal consistency, a mapping network reduces the differences between modalities and mines the correlations between them to bridge the inter-modal semantic gap, so that mapping features with semantic discrimination and semantic association are obtained from the preliminary encoded features; the mapping features are then normalized and passed through the channel model for the feature fusion of the next step;
Based on the mapping features of the low-resolution visual signal and of the tactile signal, together with the obtained encoding features of the high-resolution visual signal, fused features are obtained by exploiting the powerful data-fitting ability of a generative adversarial network;
Finally, the fused features are input into the generation network of the high-resolution visual signal to reconstruct the high-resolution visual signal;
Step (3): train the HaSR model with an optimization method to obtain the optimal model parameters for the subsequent testing stage;
Step (4): input the paired low-resolution visual signal and tactile signal under test into the optimal HaSR model, which extracts and fuses the features of the low-resolution visual signal and the corresponding tactile signal and uses the fused features to generate the required high-resolution visual signal.
A further improvement of the present invention is that step (1) comprises the following steps:
(1-1) For the training dataset $\mathcal{D} = \{d_i\}_{i=1}^{N}$, which contains paired tactile signals, low-resolution visual signals, and high-resolution visual signals, N is the number of paired visual and tactile signals, and $d_i = \{h_i, l_i, t_i\}$ denotes the high-resolution visual signal, the low-resolution visual signal, and the corresponding tactile signal, respectively. The i-th high-resolution visual signal $h_i$ is input into the encoding network of the high-resolution visual signal to extract the encoding feature $z_h$ of the visual signal;
(1-2) The obtained encoding feature $z_h$ of the high-resolution visual signal is input into the decoder of the high-resolution visual signal, which is built from a generative adversarial network; the high-resolution visual signal reconstructed by the decoder is then input into the discriminator of the high-resolution visual signal. The encoder and decoder are trained jointly and optimized with a reconstruction loss and a discrimination loss, so that the encoding features of the high-resolution visual signal and the corresponding decoding network are finally learned. Specifically, the loss function is defined as:
$$L_{pre} = L_{rec} + \alpha L_{pre\text{-}adv},$$
where α is a coefficient that adjusts the proportion of the different losses. The first term is the reconstruction loss:
$$L_{rec} = \frac{1}{C \times H \times W}\,\bigl\lVert G_h(z_h) - h \bigr\rVert_1,$$

where (C, H, W) is the size of the high-resolution visual signal, $G_h$ denotes the decoding (generation) network of the high-resolution visual signal, $z_h$ denotes the encoding feature produced by the corresponding encoding network, and $\lVert\cdot\rVert_1$ denotes the L1 norm. The second term is the loss of the generative adversarial network, defined as:
$$L_{pre\text{-}adv} = \min_{\theta_{gh}}\max_{\theta_{dh}}\ \mathbb{E}_{h\sim p(h)}\bigl[\log D_h(h)\bigr] + \mathbb{E}_{z_h\sim p(z_h)}\bigl[\log\bigl(1 - D_h(G_h(z_h))\bigr)\bigr],$$

where E(·) denotes the expectation over a distribution, $p(z_h)$ the distribution of high-resolution image features, and $p(h)$ the distribution of real high-resolution images; $D_h$ denotes the discrimination network of the high-resolution visual signal, which judges the reconstructed high-resolution visual signal, and $\theta_{gh}$ and $\theta_{dh}$ denote the parameters of the generator and discriminator of the high-resolution visual signal, respectively. Minimizing $L_{pre}$ yields the optimal encoding network of the high-resolution visual signal and its encoding features, the corresponding decoding network (i.e., the generation network), and the corresponding discrimination network of the high-resolution visual signal.
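As an illustration of how this stage-1 objective fits together, the following PyTorch sketch assembles $L_{pre}$; the module names (encoder, generator, discriminator) and the use of binary cross-entropy for the adversarial terms are illustrative assumptions, not the patent's implementation:

```python
import torch
import torch.nn.functional as F

def pretrain_losses(encoder, generator, discriminator, h, alpha=0.01):
    """Stage-1 objective L_pre = L_rec + alpha * L_pre-adv for a batch of
    real high-resolution images h of shape (B, C, H, W)."""
    z_h = encoder(h)                         # encoding feature of the HR signal
    h_rec = generator(z_h)                   # reconstructed HR signal G_h(z_h)

    # Reconstruction loss: L1 distance; reduction='mean' divides by C*H*W (and B).
    l_rec = F.l1_loss(h_rec, h)

    # Discriminator loss: D_h separates real h from reconstructions.
    d_real = discriminator(h)
    d_fake = discriminator(h_rec.detach())
    l_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
           + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))

    # Generator adversarial term (non-saturating form) plus reconstruction.
    d_fake_g = discriminator(h_rec)
    l_adv = F.binary_cross_entropy_with_logits(d_fake_g, torch.ones_like(d_fake_g))
    l_pre = l_rec + alpha * l_adv            # objective for encoder + generator
    return l_pre, l_d
```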
A further improvement of the present invention is that step (2) comprises the following steps:
(2-1) Preliminary feature extraction of the low-resolution visual signal and the corresponding tactile signal: based on the paired low-resolution visual signals $l_i$ and tactile signals $t_i$ in $\mathcal{D}$, existing mature deep neural networks perform the preliminary feature extraction of the low-resolution visual signal and the tactile signal, yielding the encoding feature $f_l$ of the low-resolution visual signal and the corresponding tactile encoding feature $f_t$;
(2-2) Based on the obtained encoding feature $f_l$ of the low-resolution visual signal and the corresponding tactile encoding feature $f_t$, a feature mapping network is built to effectively reduce the heterogeneity between modalities. It learns discriminative representations within each modality and consistent representations across modalities, from both the intra-modal and the cross-modal perspective, and finally produces the mapping features, comprising the mapping feature $z_l$ of the low-resolution visual signal and the mapping feature $z_t$ of the tactile signal. The mapping features are then input into the channel model, and the corresponding feature fusion step is executed after they are received at the main terminal;
Cross-modal semantic correlation learning: a triplet loss is chosen for cross-modal semantic correlation learning. After the mapping network is trained, the following effect is achieved: low-resolution visual features and tactile features from the same category should be close to each other, while those from different categories should be far apart. The following loss functions are defined:
$$L_{l\rightarrow t} = \frac{1}{N}\sum_{p\neq q}\max\Bigl(0,\ \lVert z_l^{\,p} - z_t^{\,p}\rVert_2 - \lVert z_l^{\,p} - z_t^{\,q}\rVert_2 + \sigma\Bigr),\qquad L_{t\rightarrow l} = \frac{1}{N}\sum_{p\neq q}\max\Bigl(0,\ \lVert z_t^{\,p} - z_l^{\,p}\rVert_2 - \lVert z_t^{\,p} - z_l^{\,q}\rVert_2 + \sigma\Bigr),$$

where $\theta_l$ and $\theta_t$ denote the parameters of the mapping networks of the low-resolution visual signal and of the tactile signal, p and q denote different categories, N denotes the number of low-resolution visual and tactile instances, σ denotes the corresponding threshold, and $L_2 = \lVert\cdot\rVert_2$ denotes the L2 norm. The total semantic correlation loss is the sum of the two terms above:

$$L_{corr} = L_{l\rightarrow t} + L_{t\rightarrow l};$$
Intra-modal discriminative learning: while guaranteeing semantic correlation, the problem of semantic discrimination within a modality should also be solved effectively; that is, for samples within the same modality (visual or tactile), samples of the same category should be closer and samples of different categories farther apart. This is achieved mainly by adding a common classifier after the mapping networks, with the loss:
$$L_{dis} = -\frac{1}{N}\sum_{i=1}^{N} y_i \log p_i(z),$$

where $p_i(z)$ denotes the probability distribution predicted by the classifier, $y_i$ is the true label, and $\theta_c$ denotes the parameters of the common classifier. After the above processing, the mapping feature $z_l$ of the low-resolution visual signal and the mapping feature $z_t$ of the tactile signal are normalized and input into the channel model;
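A minimal sketch of these two mapping losses follows; the batch layout, in which row i of the visual and tactile feature matrices forms a positive pair and shifted rows serve as negatives, is an assumption made for illustration:

```python
import torch
import torch.nn.functional as F

def triplet_correlation_loss(z_l, z_t, sigma=1.0):
    """Cross-modal triplet loss on mapped features z_l, z_t of shape (N, d),
    assuming row i of each matrix belongs to class i, so a shifted row is a
    negative sample from a different class."""
    pos = torch.norm(z_l - z_t, p=2, dim=1)                         # same-class pairs
    neg = torch.norm(z_l - torch.roll(z_t, 1, dims=0), p=2, dim=1)  # shifted negatives
    l_vt = F.relu(pos - neg + sigma).mean()                         # visual anchor

    neg2 = torch.norm(z_t - torch.roll(z_l, 1, dims=0), p=2, dim=1)
    l_tv = F.relu(pos - neg2 + sigma).mean()                        # tactile anchor
    return l_vt + l_tv

def discrimination_loss(classifier, z_l, z_t, labels):
    """Intra-modal discriminative loss: one common classifier on both modalities."""
    return (F.cross_entropy(classifier(z_l), labels)
            + F.cross_entropy(classifier(z_t), labels))
```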
(2-3) After transmission over the channel, the noisy mapping feature $z_{l\text{-}n}$ of the low-resolution visual signal and the noisy mapping feature $z_{t\text{-}n}$ of the tactile signal are obtained at the main terminal. The purpose of this step is to exploit their complementarity to generate fused features that are as similar as possible to the encoding feature $z_h$ of the high-resolution visual signal from step (1). To this end, a generative adversarial network is chosen for the feature fusion task because of its ability to fit data distributions; $z_h$ is regarded as the real sample, $z_{l\text{-}n}$ and $z_{t\text{-}n}$ are regarded as the generator's input, and $z_m$ denotes the obtained fused feature. The loss of the fusion network is defined as:
$$L_m = L_{m\text{-}adv}(G_m, D_m) + L_2(z_m, z_h),$$
where the first term is an ordinary generative adversarial loss, $G_m$ denotes the feature fusion network, and $D_m$ denotes the feature discrimination network corresponding to it; this term is specifically:
$$L_{m\text{-}adv} = \min_{\theta_{gm}}\max_{\theta_{dm}}\ \mathbb{E}_{z_h\sim p(z_h)}\bigl[\log D_m(z_h)\bigr] + \mathbb{E}\bigl[\log\bigl(1 - D_m(G_m(z_{l\text{-}n}, z_{t\text{-}n}))\bigr)\bigr],$$

where $\theta_{gm}$ denotes the parameters of the fusion network generator and $\theta_{dm}$ denotes the parameters of the discrimination network corresponding to it; the second term is an L2 loss, which helps maintain semantic consistency;
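The fusion objective can be sketched as below; implementing the $L_2$ term as a mean-squared error and the adversarial terms as binary cross-entropy are illustrative choices:

```python
import torch
import torch.nn.functional as F

def fusion_losses(g_m, d_m, z_ln, z_tn, z_h):
    """Fusion objective L_m = L_m-adv(G_m, D_m) + L2(z_m, z_h); z_h is the real
    sample, the concatenated noisy features are the generator input."""
    z_m = g_m(torch.cat([z_ln, z_tn], dim=1))   # fused feature from noisy inputs

    # Adversarial term: D_m tries to tell real HR features z_h from fused z_m.
    d_real = d_m(z_h)
    d_fake = d_m(z_m.detach())
    l_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
           + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))

    # Generator term: fool D_m and stay close to z_h (L2 implemented as MSE)
    # to keep the semantics of the fused feature consistent.
    d_fake_g = d_m(z_m)
    l_g = (F.binary_cross_entropy_with_logits(d_fake_g, torch.ones_like(d_fake_g))
           + F.mse_loss(z_m, z_h))
    return l_g, l_d, z_m
```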
(2-4) After the fused features are obtained, they are used to reconstruct the high-resolution visual signal. The high-resolution visual signal decoding (generation) network obtained in the first step, and the corresponding discrimination network, provide the network structure and parameters used to initialize this step. In addition, on top of the earlier losses, a perceptual loss is added so that the generated visual signal better matches human perceptual characteristics. The loss of this step is:
$$L_{finet} = L_{per} + \beta L_{adv\text{-}finet} + \gamma L_{rec},$$
where the first term is the perceptual loss, which can be expressed as:
$$L_{per} = \frac{1}{M_{i,j}}\,\bigl\lVert F_{i,j}\bigl(G_h(z_m)\bigr) - F_{i,j}(h) \bigr\rVert_2^2,$$

where $M_{i,j}$ denotes the number of elements of the corresponding feature map and $F_{i,j}$ denotes the output of the VGG-19 network after the i-th convolutional layer and before the j-th max-pooling layer. The second term is the loss of the generation network of the high-resolution visual signal, which can be expressed as
$$L_{adv\text{-}finet} = \min_{\theta_{gh}}\max_{\theta_{dh}}\ \mathbb{E}_{h\sim p(h)}\bigl[\log D_h(h)\bigr] + \mathbb{E}_{z_m\sim p(z_m)}\bigl[\log\bigl(1 - D_h(G_h(z_m))\bigr)\bigr],$$

where $\theta_{gh}$ denotes the parameters of the generation network that reconstructs the high-resolution visual signal from the fused feature $z_m$ and $\theta_{dh}$ denotes the parameters of the corresponding discrimination network; the third term is the reconstruction loss. In addition, β and γ are hyperparameters.
A further improvement of the present invention is that step (3) comprises the following steps:

(3-1) Using the available high-resolution visual signals, train the encoding network of the high-resolution visual signal, the corresponding decoding (generation) network, and the corresponding discrimination network of the high-resolution visual signal. The specific process is as follows:
Step 311: initialize the parameters $\theta_{eh}(0)$, $\theta_{gh}(0)$, $\theta_{dh}(0)$ as the values of the corresponding parameters at iteration 0;

Step 312: set the number of iterations to $n_1$ and the learning rate to $\mu_1$;
Step 313: optimize the network parameters with the RMSProp algorithm; here $\theta_{eh}(n+1)$, $\theta_{dh}(n+1)$, $\theta_{gh}(n+1)$ and $\theta_{eh}(n)$, $\theta_{dh}(n)$, $\theta_{gh}(n)$ denote the parameters of the encoding network, the discrimination network, and the generation network of the high-resolution visual signal at iterations n+1 and n, respectively, and each parameter set is updated along the partial derivative of its corresponding loss function;
Step 314: if $n < n_1$, repeat step 313; after $n_1$ iterations, the networks converge to the optimum, comprising the encoding network, the discrimination network, and the generation network of the high-resolution visual signal;
(3-2) Based on the encoding network, decoding (generation) network, and corresponding discrimination network of the high-resolution visual signal obtained in the first step, use the low-resolution visual signals and corresponding tactile signals to train the encoding and mapping networks of the low-resolution visual signal, the encoding and mapping networks of the tactile signal, and the feature fusion network, and fine-tune the generation network of the high-resolution visual signal; the encoding networks of the low-resolution visual signal and of the tactile signal are loaded from pre-trained networks and are not updated in this step. The specific process is as follows:
Step 321: initialize the parameters $\theta_l(0)$, $\theta_t(0)$, $\theta_c(0)$, which are the initial random parameters of the corresponding networks, and load the parameters of the generation and discrimination networks of the high-resolution visual signal obtained in step (3-1) as the initial values of $\theta_{gh}(0)$, $\theta_{dh}(0)$;

Step 322: start iterating, setting the number of iterations to $n_2$ and the learning rate to $\mu_2$;
Step 323: optimize the parameters of the low-resolution visual signal mapping network, the tactile signal mapping network, and the common classifier with the Adam algorithm; here $\theta_c(n+1)$, $\theta_l(n+1)$, $\theta_t(n+1)$ and $\theta_c(n)$, $\theta_l(n)$, $\theta_t(n)$ are the parameters of the common classifier, the low-resolution visual signal mapping network, and the tactile signal mapping network at iterations n+1 and n, respectively, and each parameter set is updated along the partial derivative of its corresponding loss function;
Step 324: optimize the parameters of the feature fusion network and the corresponding fused-feature discrimination network with the Adam algorithm; here $\theta_{gm}(n+1)$, $\theta_{dm}(n+1)$ and $\theta_{gm}(n)$, $\theta_{dm}(n)$ are the parameters of the fused-feature generation network and the fused-feature discrimination network at iterations n+1 and n, respectively, and each parameter set is updated along the partial derivative of its corresponding loss function;
Step 325: fine-tune the generation and discrimination networks of the high-resolution visual signal with the RMSProp algorithm; here $\theta_{gh}(n+1)$, $\theta_{dh}(n+1)$ and $\theta_{gh}(n)$, $\theta_{dh}(n)$ denote the parameters of the generation and discrimination networks of the high-resolution visual signal at iterations n+1 and n, respectively, and each parameter set is updated along the partial derivative of its corresponding loss function;
Step 326: if $n < n_2$, jump to step 323; after $n_2$ iterations, the HaSR networks converge to the optimum, comprising the mapping network of the low-resolution visual signal, the mapping network of the tactile signal, the feature fusion network and the corresponding fused-feature discrimination network, as well as the fine-tuned, converged discrimination network and generation network of the high-resolution visual signal.
A further improvement of the present invention is that step (4) comprises the following steps:
(4-1) Use the trained HaSR model;

(4-2) Input a set of paired low-resolution visual signals and corresponding tactile signals into the HaSR model to complete the encoding, mapping, and fusion of the modal features and finally obtain the corresponding high-resolution visual signal.
The beneficial effects of the present invention are:

By matching tactile signals with visual signals, the present invention overcomes the ill-posedness of traditional single-modal super-resolution reconstruction of visual signals, in which one low-resolution visual signal may correspond to multiple high-resolution visual signals;

By fully mining the intra-modal and inter-modal relationships of visual and tactile signals, the present invention overcomes the heterogeneity between different modalities;

Relying on an effective feature fusion scheme, the present invention fully exploits the complementarity of different modalities to improve the quality of the generated high-resolution visual signal.
Brief Description of the Drawings

FIG. 1 is a flow chart of the image super-resolution reconstruction method of the present invention.

FIG. 2 is a schematic diagram of the complete network structure of the present invention.

FIG. 3 shows the super-resolution reconstruction results of the present invention and of the comparison methods.
Detailed Description

To make the purpose, technical solutions, and advantages of the present invention clearer, the present invention is described in detail below with reference to the accompanying drawings and specific embodiments.
The present invention provides an image super-resolution reconstruction method for cross-modal communication, whose flow chart is shown in FIG. 1. The method comprises the following steps:
Step 1: using the complete high-resolution visual signal, perform encoding and decoding of the high-resolution visual signal, i.e., stage 1 in FIG. 2. The encoding step trains the encoding network of the high-resolution visual signal and yields its encoding features, and the decoding step trains the decoding (generation) network of the high-resolution visual signal, which supports the subsequent visual signal super-resolution reconstruction model.
Step 1-1: for the training dataset $\mathcal{D} = \{d_i\}_{i=1}^{N}$, which contains paired tactile signals, low-resolution visual signals, and high-resolution visual signals, N is the number of paired visual and tactile signals, and $d_i = \{h_i, l_i, t_i\}$ denotes the 128×128 high-resolution visual signal, the 32×32 low-resolution visual signal, and the corresponding tactile signal, represented as a spectrogram obtained by the short-time Fourier transform. The i-th high-resolution visual signal $h_i$ is input into the encoding network of the high-resolution visual signal, which consists of a VGG-16 pre-trained on ImageNet followed by fully connected layers of dimensions 512 and 128, to extract the 128-dimensional encoding feature $z_h$ of the visual signal;
Step 1-2: the obtained encoding feature $z_h$ of the high-resolution visual signal is input into the decoder of the high-resolution visual signal, built from a generative adversarial network; the decoder consists of transposed convolutions with 3×3 kernels and 512, 256, 128, 64, and 3 filters, respectively. The high-resolution visual signal reconstructed by the decoder is then input into the discriminator of the high-resolution visual signal; the discriminator is a stack of convolutional layers with 3×3 kernels and 64, 128, 256, and 512 filters, followed by fully connected layers of dimensions 512, 128, and 1. The encoder and decoder are trained jointly and optimized with a reconstruction loss and a discrimination loss, so that the encoding features of the high-resolution visual signal and the corresponding decoding network are finally learned. Specifically, the loss function is defined as:
$$L_{pre} = L_{rec} + \alpha L_{pre\text{-}adv},$$

where α is a coefficient that adjusts the proportion of the different losses. The first term is the reconstruction loss:

$$L_{rec} = \frac{1}{C \times H \times W}\,\bigl\lVert G_h(z_h) - h \bigr\rVert_1,$$

where (C, H, W) is the size of the high-resolution visual signal, $G_h$ denotes the decoding (generation) network of the high-resolution visual signal, $z_h$ denotes the encoding feature produced by the corresponding encoding network, and $\lVert\cdot\rVert_1$ denotes the L1 norm. The second term is the loss of the generative adversarial network, defined as:

$$L_{pre\text{-}adv} = \min_{\theta_{gh}}\max_{\theta_{dh}}\ \mathbb{E}_{h\sim p(h)}\bigl[\log D_h(h)\bigr] + \mathbb{E}_{z_h\sim p(z_h)}\bigl[\log\bigl(1 - D_h(G_h(z_h))\bigr)\bigr],$$

where E(·) denotes the expectation over a distribution, $p(z_h)$ the distribution of high-resolution image features, and $p(h)$ the distribution of real high-resolution images; $D_h$ denotes the discrimination network of the high-resolution visual signal, which judges the reconstructed high-resolution visual signal, and $\theta_{gh}$ and $\theta_{dh}$ denote the parameters of the generator and discriminator of the high-resolution visual signal, respectively. Minimizing $L_{pre}$ yields the optimal encoding network of the high-resolution visual signal and its encoding features, the corresponding decoding network (i.e., the generation network), and the corresponding discrimination network of the high-resolution visual signal.
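A sketch of the three stage-1 networks under the dimensions stated in step 1-2 follows; strides, paddings, activations, and the initial linear reshape in the generator are assumptions needed to make the tensor shapes consistent, since the patent only lists kernel sizes and filter counts:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class HREncoder(nn.Module):
    """ImageNet-pretrained VGG-16 backbone + FC(512) + FC(128) -> z_h."""
    def __init__(self):
        super().__init__()
        self.backbone = vgg16(pretrained=True).features  # (B,512,4,4) for 128x128 input
        self.fc = nn.Sequential(nn.Flatten(),
                                nn.Linear(512 * 4 * 4, 512), nn.ReLU(),
                                nn.Linear(512, 128))
    def forward(self, h):
        return self.fc(self.backbone(h))

class HRGenerator(nn.Module):
    """Transposed 3x3 convolutions with 512, 256, 128, 64, 3 filters; the
    initial linear reshape to (1024, 4, 4) is an assumed input stage."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(128, 1024 * 4 * 4)
        chans = [1024, 512, 256, 128, 64, 3]
        layers = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            layers += [nn.ConvTranspose2d(cin, cout, 3, stride=2,
                                          padding=1, output_padding=1),
                       nn.ReLU() if cout != 3 else nn.Tanh()]
        self.deconv = nn.Sequential(*layers)             # 4x4 -> 128x128
    def forward(self, z_h):
        return self.deconv(self.fc(z_h).view(-1, 1024, 4, 4))

class HRDiscriminator(nn.Module):
    """3x3 conv stack (64, 128, 256, 512 filters) + FC(512, 128, 1)."""
    def __init__(self):
        super().__init__()
        chans = [3, 64, 128, 256, 512]
        layers = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                       nn.LeakyReLU(0.2)]
        self.conv = nn.Sequential(*layers)               # 128x128 -> 8x8
        self.fc = nn.Sequential(nn.Flatten(),
                                nn.Linear(512 * 8 * 8, 512), nn.LeakyReLU(0.2),
                                nn.Linear(512, 128), nn.LeakyReLU(0.2),
                                nn.Linear(128, 1))
    def forward(self, x):
        return self.fc(self.conv(x))
```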
Step 2: design a haptic-aided visual signal super-resolution reconstruction model, HaSR; the structure of the HaSR model is shown as stage 2 in FIG. 2:
After the visual and tactile signals are collected at the terminal, the visual signal is first downsampled at an edge node to obtain a low-resolution visual signal; a pre-trained, widely used encoding network then performs preliminary feature extraction on the low-resolution visual signal and the corresponding tactile signal;

Next, exploiting intra-modal discriminability and inter-modal consistency, a mapping network reduces the differences between modalities and mines the correlations between them to bridge the inter-modal semantic gap, so that mapping features with semantic discrimination and semantic association are obtained from the preliminary encoded features; the mapping features are then normalized and passed through the channel model for the feature fusion of the next step;

Based on the mapping features of the low-resolution visual signal and of the tactile signal, together with the obtained encoding features of the high-resolution visual signal, fused features are obtained by exploiting the powerful data-fitting ability of a generative adversarial network;

Finally, the fused features are input into the generation network of the high-resolution visual signal to reconstruct the high-resolution visual signal.
The specific implementation steps are as follows:
Step 2-1: preliminary feature extraction of the low-resolution visual signal and the corresponding tactile signal. Based on the paired low-resolution visual signals $l_i$ and tactile signals $t_i$ in $\mathcal{D}$, existing mature deep neural networks (specifically, a DenseNet-121 network for the tactile signal and a VGG-16 network for the low-resolution visual signal) perform the preliminary feature extraction, yielding the 512-dimensional encoding feature $f_l$ of the low-resolution visual signal and the corresponding 512-dimensional tactile encoding feature $f_t$;
Step 2-2: based on the obtained encoding feature $f_l$ of the low-resolution visual signal and the corresponding tactile encoding feature $f_t$, a feature mapping network is built, comprising a low-resolution visual feature mapping network of 512-256-128-dimensional fully connected layers and a tactile mapping network of 512-256-128-dimensional fully connected layers, to effectively reduce the heterogeneity between modalities. It learns discriminative representations within each modality and consistent representations across modalities, from both the intra-modal and the cross-modal perspective, and finally produces the mapping features, comprising the 128-dimensional mapping feature $z_l$ of the low-resolution visual signal and the 128-dimensional mapping feature $z_t$ of the tactile signal. The mapping features are then input into the channel model, and the corresponding feature fusion step is executed after they are received at the main terminal;
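A sketch of one 512-256-128 mapping network; the ReLU between the layers and the placement of the L2 normalization are assumptions consistent with the statement that features are normalized before the channel model:

```python
import torch.nn as nn
import torch.nn.functional as F

class MappingNet(nn.Module):
    """512 -> 256 -> 128 fully connected mapping network."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(512, 256), nn.ReLU(),
                                 nn.Linear(256, 128))
    def forward(self, f):
        z = self.net(f)
        return F.normalize(z, p=2, dim=1)   # normalized before the channel model
```

Two independent instances are used, one mapping $f_l$ to $z_l$ and the other mapping $f_t$ to $z_t$.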
Cross-modal semantic correlation learning: a triplet loss is chosen for cross-modal semantic correlation learning. After the mapping network is trained, the following effect is achieved: low-resolution visual features and tactile features from the same category should be close to each other, while those from different categories should be far apart. The following loss functions are defined:
$$L_{l\rightarrow t} = \frac{1}{N}\sum_{p\neq q}\max\Bigl(0,\ \lVert z_l^{\,p} - z_t^{\,p}\rVert_2 - \lVert z_l^{\,p} - z_t^{\,q}\rVert_2 + \sigma\Bigr),\qquad L_{t\rightarrow l} = \frac{1}{N}\sum_{p\neq q}\max\Bigl(0,\ \lVert z_t^{\,p} - z_l^{\,p}\rVert_2 - \lVert z_t^{\,p} - z_l^{\,q}\rVert_2 + \sigma\Bigr),$$

where $\theta_l$ and $\theta_t$ denote the parameters of the mapping networks of the low-resolution visual signal and of the tactile signal, p and q denote different categories, N denotes the number of low-resolution visual and tactile instances, σ denotes the corresponding threshold, and $L_2 = \lVert\cdot\rVert_2$ denotes the L2 norm. The total semantic correlation loss is the sum of the two terms above:

$$L_{corr} = L_{l\rightarrow t} + L_{t\rightarrow l};$$
Intra-modal discriminative learning: while guaranteeing semantic correlation, the problem of semantic discrimination within a modality should also be solved effectively; that is, for samples within the same modality (visual or tactile), samples of the same category should be closer and samples of different categories farther apart. This is achieved mainly by adding a common classifier after the mapping networks; the classifier consists of fully connected layers of dimensions 128, 32, and 9, and the loss is:
$$L_{dis} = -\frac{1}{N}\sum_{i=1}^{N} y_i \log p_i(z),$$

where $p_i(z)$ denotes the probability distribution predicted by the classifier, $y_i$ is the true label, and $\theta_c$ denotes the parameters of the common classifier. After the above processing, the mapping feature $z_l$ of the low-resolution visual signal and the mapping feature $z_t$ of the tactile signal are normalized and input into the channel model;
Step 2-3: after transmission over the channel, the noisy mapping feature $z_{l\text{-}n}$ of the low-resolution visual signal and the noisy mapping feature $z_{t\text{-}n}$ of the tactile signal are obtained at the main terminal. The purpose of this step is to exploit their complementarity to generate fused features that are as similar as possible to the encoding feature $z_h$ of the high-resolution visual signal from step 1. To this end, a generative adversarial network is chosen for the feature fusion task because of its ability to fit data distributions. Specifically, $z_{l\text{-}n}$ and $z_{t\text{-}n}$ are first concatenated and then input into a generation network consisting of 256- and 128-dimensional fully connected layers, where $z_h$ is regarded as the real sample, $z_{l\text{-}n}$ and $z_{t\text{-}n}$ are regarded as the generator's input, and $z_m$ denotes the obtained fused feature. The loss of the fusion network is defined as:
$$L_m = L_{m\text{-}adv}(G_m, D_m) + L_2(z_m, z_h),$$

where the first term is an ordinary generative adversarial loss, $G_m$ denotes the feature fusion network, and $D_m$ denotes the feature discrimination network corresponding to it, consisting of fully connected layers of dimensions 128, 64, and 1; this term is specifically:

$$L_{m\text{-}adv} = \min_{\theta_{gm}}\max_{\theta_{dm}}\ \mathbb{E}_{z_h\sim p(z_h)}\bigl[\log D_m(z_h)\bigr] + \mathbb{E}\bigl[\log\bigl(1 - D_m(G_m(z_{l\text{-}n}, z_{t\text{-}n}))\bigr)\bigr],$$

where $\theta_{gm}$ denotes the parameters of the fusion network generator and $\theta_{dm}$ denotes the parameters of the discrimination network corresponding to it; the second term is an L2 loss, which helps maintain semantic consistency.
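The fusion generator and its feature discriminator can be sketched as below; since the patent does not specify the channel model at this point, the AWGN function is a hypothetical stand-in used only for illustration:

```python
import torch
import torch.nn as nn

def awgn_channel(z, snr_db=20.0):
    """Illustrative AWGN channel: additive Gaussian noise at a chosen SNR
    stands in for the unspecified channel model."""
    power = z.pow(2).mean()
    noise = torch.randn_like(z) * torch.sqrt(power / (10 ** (snr_db / 10)))
    return z + noise

class FusionGenerator(nn.Module):
    """G_m: concat(z_l-n, z_t-n) (256-d) -> 256 -> 128 fused feature z_m."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(256, 256), nn.ReLU(),
                                 nn.Linear(256, 128))
    def forward(self, z_ln, z_tn):
        return self.net(torch.cat([z_ln, z_tn], dim=1))

class FusionDiscriminator(nn.Module):
    """D_m: fully connected layers of dimensions 128, 64, 1."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(128, 128), nn.ReLU(),
                                 nn.Linear(128, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, z):
        return self.net(z)
```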
Step 2-4: after the fused features are obtained, they are used to reconstruct the high-resolution visual signal. The high-resolution visual signal decoding (generation) network obtained in the first step, and the corresponding discrimination network, provide the network structure and parameters used to initialize this step. In addition, on top of the loss $L_{pre}$ of step 1-2, a perceptual loss is added so that the generated visual signal better matches human perceptual characteristics. The loss of this step is:
$$L_{finet} = L_{per} + \beta L_{adv\text{-}finet} + \gamma L_{rec},$$

where the first term is the perceptual loss, which can be expressed as:

$$L_{per} = \frac{1}{M_{i,j}}\,\bigl\lVert F_{i,j}\bigl(G_h(z_m)\bigr) - F_{i,j}(h) \bigr\rVert_2^2,$$

where $M_{i,j}$ denotes the number of elements of the corresponding feature map and $F_{i,j}$ denotes the output of the VGG-19 network after the i-th convolutional layer and before the j-th max-pooling layer; here we set i = 4 and j = 5. The second term is the loss of the generation network of the high-resolution visual signal, which can be expressed as

$$L_{adv\text{-}finet} = \min_{\theta_{gh}}\max_{\theta_{dh}}\ \mathbb{E}_{h\sim p(h)}\bigl[\log D_h(h)\bigr] + \mathbb{E}_{z_m\sim p(z_m)}\bigl[\log\bigl(1 - D_h(G_h(z_m))\bigr)\bigr],$$

where $\theta_{gh}$ denotes the parameters of the generation network that reconstructs the high-resolution visual signal from the fused feature $z_m$ and $\theta_{dh}$ denotes the parameters of the corresponding discrimination network; the third term is the reconstruction loss. In addition, β and γ are hyperparameters.
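A sketch of the perceptual term with i = 4, j = 5, i.e., features taken after the 4th convolution of the 5th block of VGG-19 and before the 5th max-pool; the exact torchvision slice index follows the common SRGAN convention and is an assumption:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class PerceptualLoss(nn.Module):
    """VGG-19 perceptual loss; features[:35] ends after the 4th conv of
    block 5, just before the 5th max-pooling layer."""
    def __init__(self):
        super().__init__()
        self.features = vgg19(pretrained=True).features[:35].eval()
        for p in self.features.parameters():
            p.requires_grad = False          # the loss network stays fixed
    def forward(self, sr, hr):
        f_sr, f_hr = self.features(sr), self.features(hr)
        return torch.mean((f_sr - f_hr) ** 2)  # mean over the M_{i,j} elements
```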
Step 3: train the HaSR model with an optimization method to obtain the optimal model parameters for the subsequent testing stage;

Step 3-1: using the available high-resolution visual signals, train the encoding network of the high-resolution visual signal, the corresponding decoding (generation) network, and the corresponding discrimination network of the high-resolution visual signal. The specific process is as follows:

Step 311: initialize the parameters $\theta_{eh}(0)$, $\theta_{gh}(0)$, $\theta_{dh}(0)$ as the values of the corresponding parameters at iteration 0;

Step 312: set the number of iterations to $n_1 = 3000$ and the learning rate to $\mu_1 = 0.0008$;
Step 313: optimize the network parameters with the RMSProp algorithm; here $\theta_{eh}(n+1)$, $\theta_{dh}(n+1)$, $\theta_{gh}(n+1)$ and $\theta_{eh}(n)$, $\theta_{dh}(n)$, $\theta_{gh}(n)$ denote the parameters of the encoding network, the discrimination network, and the generation network of the high-resolution visual signal at iterations n+1 and n, respectively, and each parameter set is updated along the partial derivative of its corresponding loss function;
Step 314: if $n < n_1$, repeat step 313; after $n_1$ iterations, the networks converge to the optimum, comprising the encoding network, the discrimination network, and the generation network of the high-resolution visual signal;
Step 3-2: based on the encoding network, decoding (generation) network, and corresponding discrimination network of the high-resolution visual signal obtained in the first step, use the low-resolution visual signals and corresponding tactile signals to train the encoding and mapping networks of the low-resolution visual signal, the encoding and mapping networks of the tactile signal, and the feature fusion network, and fine-tune the generation network of the high-resolution visual signal; the encoding networks of the low-resolution visual signal and of the tactile signal are loaded from pre-trained networks and are not updated in this step. The specific process is as follows:
Step 321: initialize the parameters $\theta_l(0)$, $\theta_t(0)$, $\theta_c(0)$, which are the initial random parameters of the corresponding networks, and load the parameters of the generation and discrimination networks of the high-resolution visual signal obtained in step 3-1 as the initial values of $\theta_{gh}(0)$, $\theta_{dh}(0)$;

Step 322: start iterating, setting the number of iterations to $n_2 = 2000$ and the learning rate to $\mu_2 = 0.0015$;
Step 323: optimize the parameters of the low-resolution visual signal mapping network, the tactile signal mapping network, and the common classifier with the Adam algorithm; here $\theta_c(n+1)$, $\theta_l(n+1)$, $\theta_t(n+1)$ and $\theta_c(n)$, $\theta_l(n)$, $\theta_t(n)$ are the parameters of the common classifier, the low-resolution visual signal mapping network, and the tactile signal mapping network at iterations n+1 and n, respectively, and each parameter set is updated along the partial derivative of its corresponding loss function;
Step 324: optimize the parameters of the feature fusion network and the corresponding fused-feature discrimination network with the Adam algorithm; here $\theta_{gm}(n+1)$, $\theta_{dm}(n+1)$ and $\theta_{gm}(n)$, $\theta_{dm}(n)$ are the parameters of the fused-feature generation network and the fused-feature discrimination network at iterations n+1 and n, respectively, and each parameter set is updated along the partial derivative of its corresponding loss function;
Step 325: fine-tune the generation and discrimination networks of the high-resolution visual signal with the RMSProp algorithm; here $\theta_{gh}(n+1)$, $\theta_{dh}(n+1)$ and $\theta_{gh}(n)$, $\theta_{dh}(n)$ denote the parameters of the generation and discrimination networks of the high-resolution visual signal at iterations n+1 and n, respectively, and each parameter set is updated along the partial derivative of its corresponding loss function;
Step 326: if $n < n_2$, jump to step 323; after $n_2$ iterations, the HaSR networks converge to the optimum, comprising the mapping network of the low-resolution visual signal, the mapping network of the tactile signal, the feature fusion network and the corresponding fused-feature discrimination network, as well as the fine-tuned, converged discrimination network and generation network of the high-resolution visual signal.
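Putting the pieces together, the following is a compressed sketch of the stage-2 schedule (steps 321-326); the dictionary wiring of the networks and the reuse of the loss helpers from the earlier sketches are assumptions, and the step-325 fine-tuning of $G_h$/$D_h$ is omitted for brevity:

```python
import torch

def train_stage2(nets, data_loader, n2=2000, mu2=0.0015):
    """Stage-2 training schedule; `nets` is assumed to bundle the modules from
    the sketches above, and triplet_correlation_loss, discrimination_loss,
    fusion_losses, and awgn_channel are the helpers defined earlier."""
    opt_map = torch.optim.Adam(                                   # step 323: Adam
        list(nets["map_l"].parameters()) + list(nets["map_t"].parameters())
        + list(nets["classifier"].parameters()), lr=mu2)
    opt_gm = torch.optim.Adam(nets["g_m"].parameters(), lr=mu2)   # step 324: Adam
    opt_dm = torch.optim.Adam(nets["d_m"].parameters(), lr=mu2)

    for n in range(n2):                                           # step 326: n2 rounds
        for l_img, t_spec, h_img, label in data_loader:
            # Frozen pre-trained encoders give the preliminary features.
            z_l = nets["map_l"](nets["enc_l"](l_img))
            z_t = nets["map_t"](nets["enc_t"](t_spec))

            # Step 323: update mapping networks + common classifier.
            loss_map = (triplet_correlation_loss(z_l, z_t)
                        + discrimination_loss(nets["classifier"], z_l, z_t, label))
            opt_map.zero_grad(); loss_map.backward(); opt_map.step()

            # Step 324: fusion GAN on channel-corrupted mapping features.
            z_h = nets["enc_h"](h_img).detach()
            l_g, l_d, _ = fusion_losses(nets["g_m"], nets["d_m"],
                                        awgn_channel(z_l.detach()),
                                        awgn_channel(z_t.detach()), z_h)
            opt_gm.zero_grad(); l_g.backward(); opt_gm.step()
            opt_dm.zero_grad(); l_d.backward(); opt_dm.step()
            # Step 325 (RMSProp fine-tuning of G_h / D_h with L_finet) omitted.
```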
Step 4: input the paired low-resolution visual signal and tactile signal under test into the optimal HaSR model, which extracts and fuses the features of the low-resolution visual signal and the corresponding tactile signal and uses the fused features to generate the required high-resolution visual signal.
Step 4-1: use the trained HaSR model;

Step 4-2: input a set of paired low-resolution visual signals and corresponding tactile signals into the HaSR model to complete the encoding, mapping, and fusion of the modal features and finally obtain the corresponding high-resolution visual signal.
The experimental results below show that, compared with existing methods, the present invention exploits the consistency and complementarity of multimodal signals to achieve super-resolution reconstruction of visual signals and obtains better generation results.
The present invention uses the LMT-108 multimodal dataset for the experiments. This dataset is commonly used in cross-modal retrieval and cross-modal generation tasks and contains 108 materials common in daily life. According to their physical properties, these surface materials fall roughly into nine categories: mesh, stone, metal, wood, rubber, fiber, foam, foil and paper, and textiles and fabrics. Each category further contains 5-17 subcategories. For each category, the dataset includes visual signals of various textures and three-axis (X, Y, Z) acceleration signals produced by sliding or tapping. Following previous cross-modal learning work on this dataset, the visual signal samples in the present invention are non-flash RGB images; for the tactile data, the z-axis acceleration signal exhibits the most pronounced vibration while the probe moves, so it is chosen as the tactile signal. The original high-resolution visual signal is 128×128, and the low-resolution visual signal is obtained by 4× downsampling, so its size is 32×32. For the tactile signal, considering the pressure transients when the probe makes and breaks contact, only the middle segment is kept, and the corresponding spectrogram is then obtained through the STFT.
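A sketch of how one paired training sample can be prepared as described above; the interpolation mode and the STFT window and hop sizes are illustrative assumptions, not values given in the patent:

```python
import torch
import torch.nn.functional as F

def prepare_sample(hr_image, z_accel, n_fft=256, hop=32):
    """Build one (h_i, l_i, t_i) training triple.

    hr_image: (3, 128, 128) tensor; z_accel: 1-D z-axis acceleration trace
    (middle segment, with contact/release transients already trimmed)."""
    lr_image = F.interpolate(hr_image.unsqueeze(0), scale_factor=0.25,
                             mode="bicubic", align_corners=False).squeeze(0)  # 4x down -> 32x32
    spec = torch.stft(z_accel, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    spectrogram = spec.abs().log1p()      # log-magnitude spectrogram as the tactile signal
    return hr_image, lr_image, spectrogram
```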
现有方法一:文献“Image-to-image translation with conditionaladversarial networks”(作者P.Isola,J.-Y.Zhu,T.Zhou,and A.A.Efros)中的PIX2PIX方法是将生成对抗网络(GAN)应用于有监督的不同风格的视觉信号转换的经典方法,本发明把触觉信号的频谱图和配对的高分辨率视觉信号作为成对的训练数据,并使用触觉信号的频谱图作为条件信息去生成高分辨率视觉信号。Existing method 1: The PIX2PIX method in the document “Image-to-image translation with conditional adversarial networks” (authors P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros) is a classic method for applying generative adversarial networks (GAN) to supervised conversion of visual signals of different styles. The present invention uses the spectrogram of tactile signals and the paired high-resolution visual signals as paired training data, and uses the spectrogram of tactile signals as conditional information to generate high-resolution visual signals.
现有方法二:文献“Learning to discover cross-domain relations withgenerative adversarial networks”(作者T.Kim,M.Cha,H.Kim,J.K.Lee,and J.Kim)的Discogan利用GAN发现不同域的关系,并实现从一个域到另一个域的转换。本发明实现从触觉信号的频谱图到高分辨率视觉信号的转换。Existing method 2: Discogan in the document “Learning to discover cross-domain relations with generative adversarial networks” (authors T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim) uses GAN to discover relations between different domains and achieve conversion from one domain to another. The present invention achieves conversion from the spectrogram of tactile signals to high-resolution visual signals.
现有方法三:双线性插值是一种常用的视觉信号上采样方法,直接插值低分辨率视觉信号得到更清晰的视觉信号。它具有平滑功能并能有效克服传统邻域插值的不足。Existing method 3: Bilinear interpolation is a commonly used visual signal upsampling method, which directly interpolates low-resolution visual signals to obtain clearer visual signals. It has a smoothing function and can effectively overcome the shortcomings of traditional neighborhood interpolation.
Existing method 4: "Photo-realistic single image super-resolution using a generative adversarial network" (C. Ledig, L. Theis, F. Huszár et al.) proposed SRGAN, the first method to bring GANs into super-resolution reconstruction; by introducing a perceptual loss, it produces visual signals that agree better with human perception.
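The perceptual loss introduced by SRGAN compares deep features of the reconstruction and the ground truth rather than raw pixels. A minimal PyTorch-style sketch is shown below; the particular VGG19 layer index and the plain MSE comparison are simplifying assumptions, not the exact configuration of the cited paper:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class VGGPerceptualLoss(nn.Module):
    """MSE between frozen VGG19 feature maps of the SR output and the HR target."""
    def __init__(self, layer_idx: int = 36):        # features through relu5_4 (one common choice)
        super().__init__()
        extractor = vgg19(weights="IMAGENET1K_V1").features[:layer_idx]
        for p in extractor.parameters():
            p.requires_grad = False                 # the feature extractor stays fixed
        self.extractor = extractor.eval()
        self.mse = nn.MSELoss()

    def forward(self, sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
        return self.mse(self.extractor(sr), self.extractor(hr))
```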
Existing method 5: "ESRGAN: Enhanced super-resolution generative adversarial networks" (X. Wang, K. Yu, S. Wu et al.) builds on SRGAN by introducing a densely connected residual block to enable deeper network training, and further improves visual quality by balancing perceptual quality and fidelity.
The present invention: the method of this embodiment.
In this embodiment, the performance indicators used to evaluate the proposed super-resolution reconstruction scheme fall into three categories: peak signal-to-noise ratio, structural similarity, and Fréchet Inception distance.
Peak signal-to-noise ratio: The peak signal-to-noise ratio (PSNR) is a visual-signal quality metric based on pixel-wise error; it is the ratio of the peak signal energy to the mean noise energy and is the most common and most widely used objective evaluation metric for visual signals. A higher PSNR indicates less distortion.
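Concretely, for 8-bit visual signals PSNR is computed from the mean squared error as in the sketch below (the peak value of 255 assumes 8-bit data):

```python
import numpy as np

def psnr(reference: np.ndarray, test: np.ndarray, peak: float = 255.0) -> float:
    """PSNR in dB: 10 * log10(peak^2 / MSE); higher means less distortion."""
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")        # the two signals are identical
    return 10.0 * np.log10(peak ** 2 / mse)
```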
Structural similarity: The structural similarity index (SSIM) measures the similarity of visual signals along three dimensions: luminance, contrast, and structure. SSIM takes values in [0, 1], and the larger the value, the smaller the distortion. For assessing visual-signal quality, SSIM agrees with human perceptual judgments better than PSNR does.
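SSIM is rarely implemented from scratch; the scikit-image routine below is one standard option (the data_range and channel_axis arguments are assumptions matching 8-bit RGB input):

```python
import numpy as np
from skimage.metrics import structural_similarity

def ssim_rgb(reference: np.ndarray, test: np.ndarray) -> float:
    """Mean SSIM over an 8-bit RGB image pair; 1.0 indicates identical signals."""
    return structural_similarity(reference, test,
                                 data_range=255, channel_axis=-1)
```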
Fréchet Inception distance: The Fréchet Inception distance (FID) evaluates how similar the visual signals generated by a generative adversarial network are to real visual signals. It measures the distance between the real and generated visual signals in feature space: an Inception network first extracts features, the feature distributions are then modeled as Gaussians, and the distance between the two distributions is computed. A lower FID indicates higher image quality and diversity.
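Under the Gaussian assumption, FID reduces to the Fréchet distance between two Gaussians, FID = ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^(1/2)). A minimal sketch over precomputed Inception features is given below; extraction of the features themselves is omitted:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """FID between two feature sets of shape (N, D); lower is better."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):   # numerical error can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```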
Table 1. Performance comparison results of the super-resolution reconstruction schemes (the table data itself is not reproduced in this text).
As Table 1 and Figure 3 show, the method of the present invention has a clear advantage over the competitive methods above; among the five comparison schemes, the super-resolution reconstruction method based on cross-modal fusion delivers the best performance. This result indicates that the intra-modal and inter-modal mapping feature learning proposed in the present invention better exploits the properties of each modality, that the effective feature fusion scheme makes use of the complementarity between modalities, and that the reconstruction network consequently improves the reconstruction quality of the visual signal, making it more similar to the original.
The above is only a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto; any variation or substitution readily conceivable by a person skilled in the art within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention.
Claims (7)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310011043.9A CN115936997A (en) | 2023-01-05 | 2023-01-05 | An image super-resolution reconstruction method for cross-modal communication |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN115936997A true CN115936997A (en) | 2023-04-07 |
Family
ID=86554122
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310011043.9A Pending CN115936997A (en) | 2023-01-05 | 2023-01-05 | An image super-resolution reconstruction method for cross-modal communication |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN115936997A (en) |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116843548A (en) * | 2023-06-06 | 2023-10-03 | 南京邮电大学 | Touch-assisted point cloud super-resolution method |
| CN117078518A (en) * | 2023-09-08 | 2023-11-17 | 南京邮电大学 | Three-dimensional point cloud superdivision method based on multi-mode iterative fusion |
| CN117611445A (en) * | 2023-12-05 | 2024-02-27 | 长春理工大学 | A cross-modal super-resolution reconstruction method, terminal equipment and readable storage medium |
| CN118506769A (en) * | 2024-05-31 | 2024-08-16 | 清华大学 | Multi-mode generation audio and video identification method and device for self-adaptive mode balance |
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113628294A (en) * | 2021-07-09 | 2021-11-09 | 南京邮电大学 | Image reconstruction method and device for cross-modal communication system |
| CN113642604A (en) * | 2021-07-09 | 2021-11-12 | 南京邮电大学 | Audio and video auxiliary tactile signal reconstruction method based on cloud edge cooperation |
| CN116719085A (en) * | 2023-08-07 | 2023-09-08 | 东北石油大学三亚海洋油气研究院 | High-resolution processing method, device and equipment for seismic records and storage medium |
Non-Patent Citations (1)
| Title |
|---|
| ZHOU, LIANG ET AL.: "Super-resolution reconstruction algorithm for compressed video based on Bayesian theory", JOURNAL OF IMAGE AND GRAPHICS, vol. 11, no. 5, 31 May 2006 (2006-05-31) * |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN115936997A (en) | An image super-resolution reconstruction method for cross-modal communication | |
| CN113642604B (en) | Audio-video auxiliary touch signal reconstruction method based on cloud edge cooperation | |
| CN113628294A (en) | Image reconstruction method and device for cross-modal communication system | |
| CN109919204B (en) | A Deep Learning Clustering Method for Noisy Images | |
| JP2025515925A (en) | A method for reconstructing fine-grained tactile signals for audiovisual assistance | |
| CN111429355A (en) | A Generative Adversarial Network-Based Image Super-Resolution Reconstruction Method | |
| CN113763268B (en) | Blind restoration method and system for face image | |
| CN113192147B (en) | Method, system, storage medium, computer device and application for significance compression | |
| CN111797891B (en) | Method and device for generating unpaired heterogeneous face image based on generation countermeasure network | |
| CN111681188B (en) | Image Deblurring Method Based on Combining Image Pixel Prior and Image Gradient Prior | |
| CN113160032B (en) | Unsupervised multi-mode image conversion method based on generation countermeasure network | |
| CN114240735A (en) | Arbitrary style transfer method, system, storage medium, computer equipment and terminal | |
| CN115880762B (en) | Human-machine hybrid vision-oriented scalable face image coding method and system | |
| CN112365551A (en) | Image quality processing system, method, device and medium | |
| CN113947136A (en) | Image compression and classification method, device and electronic device | |
| CN115984117A (en) | Variational self-encoding image super-resolution method and system based on channel attention | |
| CN115034963A (en) | Generative network model, DD-SRGAN model and infrared image super-resolution reconstruction algorithm | |
| CN115311144A (en) | Wavelet domain-based standard flow super-resolution image reconstruction method | |
| CN117274059A (en) | Low-resolution image reconstruction method and system based on image coding-decoding | |
| CN113298906B (en) | Sketch guidance-based paired clothing image generation method | |
| CN115993888A (en) | A visual-tactile signal adaptive reconstruction method based on transfer learning | |
| CN115457568A (en) | Historical document image noise reduction method and system based on generation countermeasure network | |
| CN111738919A (en) | A realistic illusion method for low-definition small faces based on linear multi-step residual dense network | |
| CN110569763A (en) | A Glasses Removal Method for Fine-grained Face Recognition | |
| CN104299256B (en) | Almost-lossless compression domain volume rendering method for three-dimensional volume data |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||