JP7742677B2

JP7742677B2 - A method for reconstructing fine-grained tactile signals for audiovisual aids

Info

Publication number: JP7742677B2
Application number: JP2024568338A
Authority: JP
Inventors: 亮周; ▲シン▼ 魏; ▲ジョー▼ 張; ▲イン▼▲イン▼ 石
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2022-11-18
Filing date: 2023-11-15
Publication date: 2025-09-22
Anticipated expiration: 2043-11-15
Also published as: CN115905838B; WO2024104376A1; JP2025515925A; CN115905838A

Description

本発明は触覚信号を生成する技術分野に関し、特に視聴覚補助用の細粒度触覚信号の再構築方法に関する。 The present invention relates to the technical field of generating haptic signals, and in particular to a method for reconstructing fine-grained haptic signals for audiovisual aids.

従来のマルチメディアアプリケーションの関連技術が成熟するにつれて、人々は視聴覚ニーズが大きく満たされるとともに、一層多くの次元、一層高い階層の知覚体験を追求し始める。そして、触覚情報が徐々に既存のオーディオ／ビデオマルチメディアサービスに統合されて、マルチモーダルサービスを形成することになって、より極めて豊富なインタラクティブ体験をもたらすことが望まれている。クロスモーダル通信技術は、クロスモーダルサービスをサポートするために提案されており、マルチモーダルストリームの品質を確保する面で一定の有効性を有するが、クロスモーダル通信を、触覚を主とするマルチモーダルサービスに適用する場合、依然としていくつかの技術的挑戦に直面している。まず、触覚ストリームが無線リンクにおける干渉及び騒音に非常に敏感であり、その結果、受信端で触覚信号が劣化ひいては失い、特に遠隔操作例えば遠隔工業制御、遠隔手術などの応用シーンにおいて、この問題は深刻で避けられない。次に、サービスプロバイダは触覚収集装置を持っていないが、ユーザーは触覚を感知する必要があり、特にオンライン没入型ショッピング、ホログラフィックミュージアムガイド、バーチャルインタラクティブムービーなどのバーチャルインタラクティブ応用シーンにおいて、触覚知覚に対するユーザーのニーズは極めて高く、従って、ビデオ及びオーディオ信号に基づいて「バーチャル」タッチ感覚又は触覚信号を生成することができるように求めている。 As traditional multimedia application technologies mature, people's audiovisual needs are largely met, and they are beginning to pursue more dimensional and higher-level perceptual experiences. Haptic information is gradually being integrated into existing audio/video multimedia services to form multimodal services, providing a richer interactive experience. Cross-modal communication technologies have been proposed to support cross-modal services and have demonstrated some effectiveness in ensuring the quality of multimodal streams. However, applying cross-modal communication to haptic-based multimodal services still faces several technical challenges. First, haptic streams are highly sensitive to interference and noise in wireless links, resulting in degradation or even loss of haptic signals at the receiving end. This problem is particularly serious and unavoidable in remote control applications, such as remote industrial control and remote surgery. Secondly, service providers do not have tactile collection devices, but users need to sense touch. 
This is particularly true in virtual interactive application scenarios such as online immersive shopping, holographic museum guides, and virtual interactive movies, where users' demand for tactile perception is very high; it must therefore be possible to generate "virtual" touch sensations or tactile signals from the video and audio signals.

現在、無線通信の不信頼性及び通信騒音干渉により破損したり部分的に欠損したりした触覚信号は、２つの面で自己回復し得る。第１としては、従来の信号処理技術に基づくものである。それはスパース表現を用いることで最も類似した構造を有する特定の信号を検索し、次に該特定の信号を用いて破損した信号の欠損部分を推定する。第２としては、信号自身の時空相関性をマイニング及び利用して、モーダル内の自己修復及び再構築を実現するものである。しかし、触覚信号がひどく破壊され、ひいては存在しない場合、モーダル内に基づく再構築スキームが失敗してしまう。 Currently, tactile signals that are corrupted or partially lost due to unreliable wireless communication and communication noise interference can self-recover in two ways. The first is based on traditional signal processing techniques, which use sparse representations to search for a specific signal with the most similar structure, and then use this specific signal to estimate the missing parts of the corrupted signal. The second is to mine and utilize the spatiotemporal correlations of the signal itself to achieve intra-modal self-recovery and reconstruction. However, when the tactile signal is severely corrupted or even non-existent, intra-modal reconstruction schemes fail.

近年以来、いくつかの研究は異なるモーダル間の相関性に注目しており、且つこれによりクロスモーダル再構築を実現した。Ｌｉらは文献「Ｌｅａｒｎｉｎｇｃｒｏｓｓ－ｍｏｄａｌｖｉｓｕａｌ－ｔａｃｔｉｌｅｒｅｐｒｅｓｅｎｔａｔｉｏｎｕｓｉｎｇｅｎｓｅｍｂｌｅｄｇｅｎｅｒａｔｉｖｅａｄｖｅｒｓａｒｉａｌｎｅｔｗｏｒｋｓ」において、画像特徴を利用して必要なカテゴリ情報を取得し、次に、該情報を騒音とともに敵対的生成ネットワークの入力として対応するカテゴリの触覚スペクトルマップを生成することを提案した。該方法では各モーダル間の意味的相関、カテゴリをマイニングして取得した情報が限られるため、生成される結果は多くの場合不正確である。ＫｕｎｉｙｕｋｉＴａｋａｈａｓｈｉらは文献「ＤｅｅｐＶｉｓｕｏ－ＴａｃｔｉｌｅＬｅａｒｎｉｎｇ：ＥｓｔｉｍａｔｉｏｎｏｆＴａｃｔｉｌｅＰｒｏｐｅｒｔｉｅｓｆｒｏｍＩｍａｇｅｓ」において１つのエンコーダ－デコーダネットワークを拡張し、視覚及び触覚属性をいずれも潜在空間に埋め込み、重点的に潜在変数で示される材料の触覚属性の程度に注目している。更に、ＭａｔｔｈｅｗＰｕｒｒらは文献「ＴｅａｃｈｉｎｇＣａｍｅｒａｓｔｏＦｅｅｌ：ＥｓｔｉｍａｔｉｎｇＴａｃｔｉｌｅＰｈｙｓｉｃａｌＰｒｏｐｅｒｔｉｅｓｏｆＳｕｒｆａｃｅｓＦｒｏｍＩｍａｇｅｓ」において、１つの敵対的学習及びクロスドメイン連合分類付きのクロスモーダル学習フレームワークが単一の画像から触覚の物理特性を推定することを提案した。このような方法は、モーダルの意味情報を利用したが、完全な触覚信号を生成しないため、クロスモーダルサービスにとって実際の意味がない。 In recent years, several studies have focused on the correlation between different modalities and achieved cross-modal reconstruction. In their paper "Learning cross-modal visual-tactile representation using ensembled generative adversarial networks," Li et al. proposed using image features to obtain the necessary category information, and then using this information along with noise as input to a generative adversarial network to generate a tactile spectral map of the corresponding category. However, due to the limited information obtained by mining the semantic correlation and categories between each modality, the results generated are often inaccurate. In their paper "Deep Visuo-Tactile Learning: Estimation of Tactile Properties from Images," Kuniyuki Takahashi et al. extend a single encoder-decoder network to embed both visual and tactile attributes into a latent space, focusing on the degree of tactile properties of materials represented by latent variables. Furthermore, in the paper "Teaching Cameras to Feel: Estimating Tactile Physical Properties of Surfaces from Images," Matthew Purr et al. 
proposed a cross-modal learning framework with adversarial learning and cross-domain joint classification that estimates tactile physical properties from a single image. Such methods make use of modal semantic information but do not generate a complete tactile signal, so they are of little practical value for cross-modal services.

上記既存のクロスモーダル生成方法には、次の欠陥もある。そのモデルの訓練はいずれも大規模な訓練データに依存してモデルの効果を確保し、また、それらはいずれも単一モーダルの情報のみを利用したが、実際に単一モーダルの優位性が十分多くの情報量をもたらすことができず、異なるモーダルが同じ意味を共同で記述する場合、不均等な量の情報を含む可能性があり、モーダル間の情報の相補及び強化が生成効果の向上に寄与する。実際の応用シーンにおいて、大規模なデータセットのアノテーションコストが膨大であり、粗粒度の大分類カテゴリをより容易に取得する場合が多く、細粒度カテゴリが明確でない。また、異なるモーダルのサンプル同士に直接的な対応関係がなく、弱監督及び弱マッチングの難題がある。 The above-mentioned existing cross-modal generation methods also suffer from the following shortcomings: The training of these models all relies on large-scale training data to ensure the effectiveness of the models, and they all only use information from a single modality. However, in reality, the advantages of a single modality cannot provide a sufficient amount of information. When different modalities jointly describe the same meaning, they may contain unequal amounts of information. Complementary and enhanced information between modalities contributes to improved generation effectiveness. In practical applications, the annotation costs for large datasets are enormous, and coarse-grained, broad classification categories are often more easily obtained, while fine-grained categories are unclear. Furthermore, there is no direct correspondence between samples from different modalities, posing challenges in weak supervision and weak matching.

本発明が解決しようとする技術的問題は、従来技術の欠点を克服して視聴覚補助用の細粒度触覚信号の再構築方法を提供することである。まず、粗粒度カテゴリの内部でクラスタ分析を行ってサンプルの細粒度分類を取得する。次に、細粒度カテゴリにおけるモーダル共通意味制約を行い、その目的はカテゴリ内の差異を最小化してカテゴリ間の差異を最大化することである。最後に、細粒度サブカテゴリにおいて触覚信号にポジティブマッチする視聴覚サンプルを検索して相関性制約を行い、触覚信号の高品質で細粒度の再構築を実現する。 The technical problem that the present invention aims to solve is to provide a method for reconstructing fine-grained haptic signals for audiovisual assistance that overcomes the shortcomings of the prior art. First, cluster analysis is performed within coarse-grained categories to obtain fine-grained classifications of samples. Next, modal co-semantic constraints are performed in the fine-grained categories, with the goal of minimizing intra-category differences and maximizing inter-category differences. Finally, correlation constraints are performed by searching for audiovisual samples that positively match the haptic signal in the fine-grained subcategories, thereby achieving high-quality, fine-grained reconstruction of the haptic signal.

上記技術的問題を解決するために、本発明は視聴覚補助用の細粒度触覚信号の再構築方法を提供し、
触覚信号を触覚オートエンコーダに入力し、クラスタリングタスクにより触覚信号を特徴抽出するステップと、
クロスモーダル転移学習方法によって触覚オートエンコーダの特徴抽出能力をオーディオ特徴抽出ネットワーク及び画像特徴抽出ネットワークに転移して最適化するステップと、
クラスタリング制約、中心制約、及びソート制約を共同で考慮して抽出された触覚、オーディオ、及び画像のモーダル特徴を制約することで、同じ意味に属する各モーダル特徴を接近させるが、同じ意味に属しない各モーダル特徴を分離させ、細粒度分類を有する触覚、オーディオ、及び画像のモーダル特徴を取得するステップと、
触覚、オーディオ、及び画像のモーダル特徴に基づいてトリプレット集合を構築し、トリプレット制約の共有意味学習を行い、マルチモーダル融合マッピング関数を最適化し、共有意味情報を含む融合特徴を取得するステップと、
触覚生成ネットワークを事前設定し、融合特徴を触覚生成ネットワークに入力して触覚信号を再構築するステップと、を含む。 In order to solve the above technical problems, the present invention provides a method for reconstructing fine-grained tactile signals for audiovisual assistance,
inputting haptic signals into a haptic autoencoder and extracting features from the haptic signals through a clustering task;
transferring the feature extraction capability of the haptic autoencoder to an audio feature extraction network and an image feature extraction network by a cross-modal transfer learning method, and optimizing these networks;
constraining the extracted haptic, audio, and image modal features by jointly considering clustering constraints, center constraints, and sorting constraints, so that modal features with the same semantics are drawn together while modal features with different semantics are separated, thereby obtaining haptic, audio, and image modal features with fine-grained classification;
constructing triplet sets based on the haptic, audio, and image modal features, performing shared semantic learning under triplet constraints, optimizing a multimodal fusion mapping function, and obtaining fused features containing shared semantic information; and
presetting a haptic generation network and inputting the fused features into the haptic generation network to reconstruct the haptic signal.

更に、触覚生成ネットワークを事前設定し、融合特徴を触覚生成ネットワークに入力して触覚信号を再構築する前記ステップは、
触覚オートエンコーダ、オーディオ特徴抽出ネットワーク、及び画像特徴抽出ネットワークのパラメータを事前設定することと、
マルチモーダル融合マッピング関数のパラメータ及び触覚生成ネットワークのパラメータを事前設定することと、
触覚オートエンコーダ、オーディオ特徴抽出ネットワーク、画像特徴抽出ネットワーク、マルチモーダル融合マッピング関数、及び触覚生成ネットワークを訓練することと、
受信したばかりの画像信号及びオーディオ信号をそれぞれ訓練済みの画像特徴抽出ネットワーク及びオーディオ特徴ネットワークに入力し、それぞれ画像特徴及びオーディオ特徴を取得し、次に、上記特徴をマルチモーダル融合マッピング関数に入力し、融合特徴を取得し、最後に、融合特徴を訓練済みの触覚生成ネットワークに入力し、再構築された触覚信号を取得することと、を含む。 Furthermore, the step of preconfiguring a haptic generation network and inputting the fused features into the haptic generation network to reconstruct a haptic signal includes:
presetting parameters of the haptic autoencoder, the audio feature extraction network, and the image feature extraction network;
presetting parameters of the multimodal fusion mapping function and parameters of the haptic generation network;
training the haptic autoencoder, the audio feature extraction network, the image feature extraction network, the multimodal fusion mapping function, and the haptic generation network; and
inputting newly received image and audio signals into the trained image feature extraction network and the trained audio feature extraction network, respectively, to obtain image features and audio features, then inputting these features into the multimodal fusion mapping function to obtain fused features, and finally inputting the fused features into the trained haptic generation network to obtain the reconstructed haptic signal.
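The reconstruction path described above (image and audio features, fusion, generation) can be sketched as a simple function composition. This is only an illustration of the data flow: the function name `reconstruct_haptic` and the callables passed to it are hypothetical stand-ins for the trained networks, not components defined by the patent.

```python
from typing import Callable, List

Vector = List[float]


def reconstruct_haptic(
    image: Vector,
    audio: Vector,
    image_net: Callable[[Vector], Vector],
    audio_net: Callable[[Vector], Vector],
    fuse: Callable[[Vector, Vector], Vector],
    generator: Callable[[Vector], Vector],
) -> Vector:
    """Run the inference path: extract features, fuse them, generate a haptic signal."""
    f_v = image_net(image)   # image feature f^v from the trained image network
    f_a = audio_net(audio)   # audio feature f^a from the trained audio network
    f_m = fuse(f_a, f_v)     # fused feature f^m = F_m(f^a, f^v)
    return generator(f_m)    # reconstructed haptic signal h' = G(f^m)
```

In a real system each callable would wrap a trained model; here any function with the matching signature can be plugged in for testing the wiring.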

更に、触覚信号を触覚オートエンコーダに入力し、クラスタリングタスクにより触覚信号を特徴抽出することは、
触覚信号を触覚オートエンコーダに入力して学習し、対応する触覚特徴を抽出し、触覚特徴に基づいて触覚信号に対してＫ－ｍｅａｎｓアルゴリズムに基づくクラスタリングを実施し、即ち、
ｈが入力された触覚信号であり、ｈ＝｛ｈ_ｉ｝_{ｉ＝１,…,Ｎ}であり、ｉが入力触覚信号のソート下付き文字を示し、Ｎが入力された触覚信号の総量であり、オートエンコーダの符号化モジュールＥ_ｈ（・）を通過した後、ｆ_ｉ ^ｈ＝Ｅ_ｈ（ｈ_ｉ;θ_ｈｅ）が触覚信号ｈ_ｉの特徴表現であり、ｆ^ｈ＝｛ｆ_ｉ ^ｈ｝_{ｉ＝１,…,Ｎ}であり、θ_ｈｅが符号化モジュールのパラメータであり、ｆ_ｉ ^ｈを復号モジュールＤ_ｈ（・）に入力し、出力触覚信号
を取得し、ここで、θ_ｈｄが復号モジュールのパラメータであり、また、特徴ｆ_ｉ ^ｈに対してＫ－ｍｅａｎｓアルゴリズムに基づくクラスタリングを実施し、対応するカテゴリタグｓ_ｉ ^ｈを出力し、上記過程におけるパラメータを共同で推計し、損失関数
（ただし、
はエンコーダの再構築誤差であり、Ｎは触覚信号の数であり、
はＫ－ｍｅａｎｓのクラスタリング誤差であり、ＭはＫ－ｍｅａｎｓアルゴリズムにより触覚データを取得するクラスタ中心ベクトル行列であり、Ｍ行列における第ｃ列のｍ_ｃはｃ番目のクラスタの質量中心を示し、θ_ｈ＝［θ_ｈｅ,θ_ｈｄ］はエンコーダモジュール及びデコーダモジュールのパラメータであり、ｓ_ｊ，ｉ ^ｈはｓ_ｉ ^ｈのｊ番目の要素であり、その中の要素のｓ_ｊ，ｉ ^ｈが１、他の要素がいずれも０であれば、ｓ_ｉ ^ｈに対応するオリジナルの触覚信号ｈ_ｉが第ｊカテゴリに属することを示し、ｌは最小二乗損失
であり、λは正則化パラメータであり、λ≧０である）を設計することと、
Ｌ_ｃｌｕ ^ｈを最小化することで、θ_ｈを推計し、ｆ_ｉ ^ｈ及びｓ_ｉ ^ｈを取得することと、を含む。 Furthermore, inputting tactile signals into a tactile autoencoder and extracting features from the tactile signals through a clustering task is
carried out as follows. The haptic signals are input into the haptic autoencoder for learning, the corresponding haptic features are extracted, and the haptic signals are clustered with the K-means algorithm on the basis of these features. Specifically:
Let $h = \{h_i\}_{i=1,\dots,N}$ be the input haptic signals, where $i$ is the index of an input haptic signal and $N$ is the total number of input haptic signals. After passing through the encoding module $E_h(\cdot)$ of the autoencoder, $f_i^h = E_h(h_i; \theta_{he})$ is the feature representation of the haptic signal $h_i$, with $f^h = \{f_i^h\}_{i=1,\dots,N}$ and $\theta_{he}$ the parameters of the encoding module. $f_i^h$ is input into the decoding module $D_h(\cdot)$ to obtain the output haptic signal $D_h(f_i^h; \theta_{hd})$, where $\theta_{hd}$ denotes the parameters of the decoding module. In addition, K-means clustering is performed on the features $f_i^h$ to output the corresponding category tags $s_i^h$. The parameters of this process are estimated jointly by designing a loss function $L_{clu}^h$ that consists of the reconstruction error of the autoencoder and the clustering error of K-means. Here $N$ is the number of haptic signals; $M$ is the cluster-center matrix obtained by applying the K-means algorithm to the haptic data, whose $c$-th column $m_c$ is the centroid of the $c$-th cluster; $\theta_h = [\theta_{he}, \theta_{hd}]$ denotes the parameters of the encoder and decoder modules; $s_{j,i}^h$ is the $j$-th element of $s_i^h$, and if $s_{j,i}^h = 1$ while all other elements are 0, the original haptic signal $h_i$ corresponding to $s_i^h$ belongs to the $j$-th category; $l$ is the least-squares loss; and $\lambda \ge 0$ is a regularization parameter;
minimizing $L_{clu}^h$ to estimate $\theta_h$ and obtain $f_i^h$ and $s_i^h$.
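The joint objective described above can be illustrated with a short sketch. The exact formulas are not reproduced in this text, so the form used here is an assumption: a mean least-squares reconstruction error plus $\lambda/2$ times the mean squared distance of each feature to its assigned centroid. The linear encoder and decoder and all function names are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def sq_dist(x: Vector, y: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w (given as a list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def assign_clusters(features: List[Vector], centroids: List[Vector]) -> List[int]:
    """K-means assignment step: index of the nearest centroid for each feature."""
    return [
        min(range(len(centroids)), key=lambda c: sq_dist(f, centroids[c]))
        for f in features
    ]


def clustering_loss(
    signals: List[Vector],
    w_enc: Matrix,
    w_dec: Matrix,
    centroids: List[Vector],
    lam: float,
) -> Tuple[float, List[int]]:
    """Assumed form of L_clu^h: reconstruction error + lam/2 * K-means error."""
    feats = [matvec(w_enc, h) for h in signals]    # f_i^h = E_h(h_i)
    recons = [matvec(w_dec, f) for f in feats]     # D_h(f_i^h)
    labels = assign_clusters(feats, centroids)     # s_i^h stored as an index
    n = len(signals)
    rec = sum(sq_dist(h, r) for h, r in zip(signals, recons)) / n
    km = sum(sq_dist(f, centroids[c]) for f, c in zip(feats, labels)) / n
    return rec + 0.5 * lam * km, labels
```

Storing each tag as an integer index rather than a one-hot vector is only a convenience; the one-hot vector $s_i^h$ of the text selects the same column $m_c$ of $M$.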

更に、クロスモーダル転移学習方法によって触覚オートエンコーダの特徴抽出能力をオーディオ特徴抽出ネットワーク及び画像特徴抽出ネットワークに転移することは、
特徴自己適応方法で触覚領域と視聴覚領域との間の最大平均差異準則を最小化して転移を実現し、即ち、
触覚、オーディオ、及び画像信号セットの分布がそれぞれＰ、Ｑ、及びＲであり、触覚信号とオーディオ信号との間のＭＭＤをＭＭＤ_ｋ（Ｐ,Ｑ）として示し、触覚信号と視覚信号との間のＭＭＤがＭＭＤ_ｋ（Ｐ,Ｒ）であり、再生カーネルＨｉｂｅｒｔ空間Ｈ_ｋにおいて、Ｈ_ｋが非空集合に定義された関数セットｆを含み、ＭＭＤの２乗が、
であり、
ただし、触覚、オーディオ、及び画像信号にそれぞれの特徴抽出ネットワークφを通過させ、抽出された特徴ベクトルを取得し、触覚特徴ベクトルがφ^ｈ（ｈ;θ_ｈｅ）として示され、即ちオートエンコーダの符号化モジュールの出力であり、オーディオ及び画像の特徴ベクトルがφ^ａ（ａ;θ_ａ）＝ｆ^ａ、φ^ｖ（ｖ;θ_ｖ）＝ｆ^ｖであり、３つのモーダルの特徴集合が（ｆ^ｈ,ｆ^ａ,ｆ^ｖ）として示され、θ_ａ及びθ_ｖがそれぞれオーディオ及び画像特徴抽出ネットワークのパラメータであり、θ_ｈｅが符号化モジュールのパラメータであり、任意の関数ｆ∈Ｈ_ｋ且つ任意のＸ∈Ｐであり、
であり、μ_ｋ（Ｐ）がＰのＨ_ｋにおける平均埋め込みであり、即ち分布ＰのＨ_ｋ空間における１つの要素表現であり、ｆ（Ｘ）はＸが関数ｆによりＨ_ｋ空間にマッピングすることを示し、＜・,・＞_Ｈｋが内積演算であり、同様に、
であり、μ_ｋ（Ｑ）がＱのＨ_ｋにおける平均埋め込みであり、
であり、μ_ｋ（Ｒ）がＲのＨ_ｋにおける平均埋め込みであることと、
対応する
の値をクロスモーダル転移の損失関数として計算し、具体的な公式が、
であることと、
Ｌ_ＣＴを最適化することで、触覚特徴抽出オートエンコーダモデルとオーディオ・画像特徴抽出ネットワークとの間の情報が流れるように案内し、触覚モーダルのためのオートエンコーダの特徴抽出能力をオーディオ・画像の特徴抽出ネットワークに効果的に転移し、即ちθ_ａ及びθ_ｖを推計することと、を含む。 Furthermore, transferring the feature extraction ability of a haptic autoencoder to an audio feature extraction network and an image feature extraction network through a cross-modal transfer learning method is
carried out as follows. Transfer is achieved by a feature adaptation method that minimizes the maximum mean discrepancy (MMD) criterion between the haptic domain and the audiovisual domain. Specifically:
Let the distributions of the haptic, audio, and image signal sets be $P$, $Q$, and $R$, respectively. The MMD between the haptic and audio signals is denoted $\mathrm{MMD}_k(P,Q)$, and the MMD between the haptic and visual signals is denoted $\mathrm{MMD}_k(P,R)$. The squared MMDs are defined in a reproducing kernel Hilbert space (RKHS) $H_k$, which contains the set of functions $f$ defined on a non-empty set. The haptic, audio, and image signals are passed through their respective feature extraction networks $\phi$ to obtain the extracted feature vectors: the haptic feature vector is denoted $\phi^h(h;\theta_{he})$, i.e. the output of the encoding module of the autoencoder, and the audio and image feature vectors are $\phi^a(a;\theta_a) = f^a$ and $\phi^v(v;\theta_v) = f^v$. The feature set of the three modalities is denoted $(f^h, f^a, f^v)$; $\theta_a$ and $\theta_v$ are the parameters of the audio and image feature extraction networks, respectively, and $\theta_{he}$ is the parameter of the encoding module. For any function $f \in H_k$ and any $X$ drawn from $P$, $\mu_k(P)$ is the mean embedding of $P$ in $H_k$, i.e. the element of $H_k$ that represents the distribution $P$; $f(X)$ denotes the mapping of $X$ into $H_k$ by the function $f$, and $\langle\cdot,\cdot\rangle_{H_k}$ is the inner product. Similarly, $\mu_k(Q)$ is the mean embedding of $Q$ in $H_k$, and $\mu_k(R)$ is the mean embedding of $R$ in $H_k$;
the corresponding MMD values are computed and used as the loss function $L_{CT}$ of cross-modal transfer; and
by optimizing $L_{CT}$, information is guided to flow between the haptic feature extraction autoencoder and the audio and image feature extraction networks, so that the feature extraction capability of the autoencoder for the haptic modality is effectively transferred to the audio and image feature extraction networks, i.e. $\theta_a$ and $\theta_v$ are estimated.
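For orientation, in the standard kernel mean-embedding formulation (whose notation may differ from the formulas of the patent) the mean embedding satisfies $\mathbb{E}_{X \sim P}[f(X)] = \langle f, \mu_k(P) \rangle_{H_k}$ and the squared MMD is $\mathrm{MMD}_k^2(P,Q) = \lVert \mu_k(P) - \mu_k(Q) \rVert_{H_k}^2$. The sketch below computes the usual biased empirical estimate of this quantity from two sets of feature vectors. The Gaussian kernel, the bandwidth `sigma`, and the choice of summing the two MMD terms into a transfer loss are assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]


def gaussian_kernel(x: Vector, y: Vector, sigma: float = 1.0) -> float:
    """Gaussian (RBF) kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def mmd_squared(xs: List[Vector], ys: List[Vector], sigma: float = 1.0) -> float:
    """Biased empirical estimate of MMD^2 between two feature sets."""

    def mean_k(us: List[Vector], vs: List[Vector]) -> float:
        total = sum(gaussian_kernel(u, v, sigma) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)


def cross_modal_transfer_loss(
    f_h: List[Vector], f_a: List[Vector], f_v: List[Vector], sigma: float = 1.0
) -> float:
    """Assumed form of L_CT: haptic-audio MMD^2 plus haptic-image MMD^2."""
    return mmd_squared(f_h, f_a, sigma) + mmd_squared(f_h, f_v, sigma)
```

Minimizing such a loss with respect to the audio and image network parameters pulls their feature distributions toward the haptic feature distribution, which is the transfer effect the text describes.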

更に、オーディオ特徴抽出ネットワーク及び画像特徴抽出ネットワークを更に最適化することは、
分類損失関数が、
（ただし、ｓ_ｉ ^ａ及びｓ_ｉ ^ｖはそれぞれオーディオ信号及び画像信号のカテゴリタグであり、Ｌ_ｃｌｕ ^ａｖを最小化することで、θ_ａ及びθ_ｖの最適値を更に取得し、ｐ（ｆ_ｉ ^ａ;θ_ａ）の意味はオーディオ特徴抽出ネットワーク入力がｆ_ｉ ^ａ、ネットワークパラメータがθ_ａである場合にオーディオ信号カテゴリｓ_ｉ ^ａを取得する確率であり、ｐ（ｆ_ｉ ^ｖ;θ_ｖ）の意味は画像特徴抽出ネットワーク入力がｆ_ｉ ^ｖ、ネットワークパラメータがθ_ｖである場合に画像信号カテゴリｓ_ｉ ^ｖを取得する確率であり、ｆ^ａ＝｛ｆ_ｉ ^ａ｝_{ｉ＝１,・・・,Ｎ}、ｆ^ｖ＝｛ｆ_ｉ ^ｖ｝_{ｉ＝１,・・・,Ｎ}である）であることと、Ｌ_ｃｌｕ ^ｈ、Ｌ_ＣＴ及びＬ_ｃｌｕ ^ａｖ
を組み合わせ、
総目的関数Ｌ_ｃｌｕ＝Ｌ_ｃｌｕ ^ｈ＋Ｌ_ＣＴ＋Ｌ_ｃｌｕ ^ａｖを取得することと、Ｌ_ｃｌｕを最小化することで、最適なθ_ｈ、θ_ａ、及びθ_ｖを取得することができ、パラメータを決定した後に、触覚信号、オーディオ信号、及び画像信号に対応する特徴ｆ^ｈ、ｆ^ａ、ｆ^ｖを取得することができることと、を含む。 Further optimizing the audio feature extraction network and the image feature extraction network may include:
defining a classification loss function $L_{clu}^{av}$, where $s_i^a$ and $s_i^v$ are the category tags of the audio signal and the image signal, respectively; $p(f_i^a;\theta_a)$ is the probability of obtaining the audio signal category $s_i^a$ when the input of the audio feature extraction network is $f_i^a$ and its parameters are $\theta_a$; $p(f_i^v;\theta_v)$ is the probability of obtaining the image signal category $s_i^v$ when the input of the image feature extraction network is $f_i^v$ and its parameters are $\theta_v$; $f^a = \{f_i^a\}_{i=1,\dots,N}$ and $f^v = \{f_i^v\}_{i=1,\dots,N}$; and minimizing $L_{clu}^{av}$ further yields the optimal values of $\theta_a$ and $\theta_v$; and
combining $L_{clu}^h$, $L_{CT}$, and $L_{clu}^{av}$ to obtain the overall objective function $L_{clu} = L_{clu}^h + L_{CT} + L_{clu}^{av}$; minimizing $L_{clu}$ gives the optimal $\theta_h$, $\theta_a$, and $\theta_v$, and once these parameters are determined, the features $f^h$, $f^a$, and $f^v$ corresponding to the haptic, audio, and image signals can be obtained.
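A classification loss of the kind described above is commonly written as a cross-entropy over the category tags. The sketch below assumes that form (mean negative log-probability of the assigned tag, summed over the audio and image branches) and assumes that each network ends in a softmax over per-category scores; neither assumption is confirmed by this text. The final helper simply adds the three terms of $L_{clu}$.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classification_loss(
    audio_logits: List[List[float]],
    image_logits: List[List[float]],
    audio_tags: List[int],
    image_tags: List[int],
) -> float:
    """Assumed form of L_clu^av: mean negative log-probability of the tags."""
    n = len(audio_tags)
    loss = 0.0
    for za, zv, sa, sv in zip(audio_logits, image_logits, audio_tags, image_tags):
        loss -= math.log(softmax(za)[sa])   # -log p(f_i^a; theta_a) for tag s_i^a
        loss -= math.log(softmax(zv)[sv])   # -log p(f_i^v; theta_v) for tag s_i^v
    return loss / n


def total_clustering_objective(l_clu_h: float, l_ct: float, l_clu_av: float) -> float:
    """L_clu = L_clu^h + L_CT + L_clu^av, as stated in the text."""
    return l_clu_h + l_ct + l_clu_av
```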

更に、クラスタリング制約、中心制約、及びソート制約を共同で考慮して抽出された触覚、オーディオ、及び画像のモーダル特徴を制約することで、同じ意味に属する各モーダル特徴を接近させるが、同じ意味に属しない各モーダル特徴を分離させ、細粒度分類を有する触覚、オーディオ、及び画像のモーダル特徴を取得することは、
同じ細粒度サブカテゴリの特徴間のコンパクト性を確保するために、３種類のモーダル信号に対して中心制約におけるクラスタリング学習を行い、より良い細粒度の分類性能を実現するために、同じサブカテゴリの特徴が共通空間において隣接すべきであり、この目的がカテゴリ内分散を最小化することであり、特徴からそのサブカテゴリ中心までの距離を最小化することでクラスタリング学習を駆動し、触覚信号のサブカテゴリ中心をクロスモーダル信号の共通サブカテゴリ中心とすることで同じカテゴリのクロスモーダル信号意味特徴間のコンパクト性を確保し、中心制約におけるクラスタリング学習の損失関数が、
として定義され、
ただし、Ｎが触覚、オーディオ、及び画像信号の数であり、ｓ_ｉ ^ｈ、ｓ_ｉ ^ａ、及びｓ_ｉ ^ｖがそれぞれ触覚信号、オーディオ信号、及び画像信号のカテゴリを示し、オーディオ信号と画像信号が触覚信号のクラスタ中心ベクトル行列Ｍを共有し、上記過程によって類似意味を持つ各モーダル特徴ｆ_ｉ ^ａ、ｆ_ｉ ^ｖ、及びｆ_ｉ ^ｈを互いに接近させることを含む。 Furthermore, by constraining the extracted tactile, audio, and image modal features by jointly considering the clustering constraint, the centrality constraint, and the sorting constraint, the modal features belonging to the same meaning are brought close to each other, but the modal features not belonging to the same meaning are separated, thereby obtaining tactile, audio, and image modal features with fine-grained classification.
To ensure compactness among features of the same fine-grained subcategory, clustering learning under the center constraint is performed on the three types of modal signals. To achieve better fine-grained classification performance, features of the same subcategory should be adjacent in the common space; the goal is to minimize intra-category variance. Clustering learning is driven by minimizing the distance from each feature to its subcategory center, and the subcategory centers of the haptic signals are used as the common subcategory centers of the cross-modal signals, which ensures compactness among the semantic features of cross-modal signals of the same category. In the loss function of clustering learning under the center constraint, $N$ is the number of haptic, audio, and image signals; $s_i^h$, $s_i^a$, and $s_i^v$ indicate the categories of the haptic, audio, and image signals, respectively; and the audio and image signals share the cluster-center matrix $M$ of the haptic signals. Through this process, the modal features $f_i^a$, $f_i^v$, and $f_i^h$ with similar semantics are drawn closer to one another.
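A center constraint of this kind can be sketched as a center loss in which all three modalities are pulled toward the same haptic cluster centers. The normalization by $2N$ and the use of squared Euclidean distance are assumptions; `centers` holds the columns $m_c$ of $M$ as a list of vectors, and tags are stored as integer indices.

```python
from typing import List

Vector = List[float]


def center_constraint_loss(
    f_h: List[Vector],
    f_a: List[Vector],
    f_v: List[Vector],
    s_h: List[int],
    s_a: List[int],
    s_v: List[int],
    centers: List[Vector],
) -> float:
    """Assumed center loss: squared distance of each modality's feature to the
    shared haptic center of its category, averaged and halved."""
    n = len(f_h)
    total = 0.0
    for feats, tags in ((f_h, s_h), (f_a, s_a), (f_v, s_v)):
        for f, c in zip(feats, tags):
            total += sum((x - m) ** 2 for x, m in zip(f, centers[c]))
    return total / (2.0 * n)
```

Because the audio and image features are compared against the haptic centers rather than centers of their own, minimizing this term aligns the three modalities category by category.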

更に、異なる細粒度サブカテゴリの特徴が一定のスパース性を備えるように確保するために、３種類のモーダル信号に対してソート制約におけるクラスタリング学習を行い、中心制約の目標はカテゴリ内分散を最小化することであるが、ソート制約の目標はカテゴリ間分散を最大化することであり、それにより異なるサブカテゴリの特徴出力が同じサブカテゴリの特徴出力よりも類似しないようにし、ソート制約は、
として定義され、
ただし、Ｃは触覚信号がＫ－ｍｅａｎｓアルゴリズムによりクラスタリングした後の総カテゴリ数であり、上記過程によって類似意味を持つ各モーダル特徴ｆ_ｉ ^ａ、ｆ_ｉ ^ｖ、及びｆ_ｉ ^ｈを更に接近させるが、異なる意味を持つ各モーダル特徴をできる限り分離させる。 Furthermore, to ensure that the features of different fine-grained subcategories have a certain sparsity, we perform clustering learning under sorting constraints on the three types of modal signals. The goal of the centrality constraint is to minimize the intra-category variance, while the goal of the sorting constraint is to maximize the inter-category variance, so that the feature outputs of different subcategories are less similar than the feature outputs of the same subcategory. The sorting constraint is:
defined in terms of $C$, the total number of categories obtained after the haptic signals are clustered with the K-means algorithm. Through this process, the modal features $f_i^a$, $f_i^v$, and $f_i^h$ with similar semantics are drawn even closer together, while modal features with different semantics are separated as far as possible.
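The exact sorting constraint is not reproduced in this text. One common way to express the stated goal (a feature should be closer to its own category center than to the other $C - 1$ centers) is a margin-based hinge, sketched below. The hinge form, the `margin` parameter, and the function names are assumptions, not the formula of the patent.

```python
from typing import List

Vector = List[float]


def sq_dist(x: Vector, y: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def sorting_constraint_loss(
    features: List[Vector],
    tags: List[int],
    centers: List[Vector],
    margin: float = 1.0,
) -> float:
    """Assumed hinge form: penalize a feature whenever another category's
    center is not at least `margin` farther away than its own center."""
    total = 0.0
    for f, c in zip(features, tags):
        d_own = sq_dist(f, centers[c])
        for other, m in enumerate(centers):
            if other != c:
                total += max(0.0, margin + d_own - sq_dist(f, m))
    return total / len(features)
```

The same function can be applied to the haptic, audio, and image features in turn, all against the shared haptic centers.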

更に、触覚、オーディオ、及び画像のモーダル特徴に基づいてトリプレット集合を構築し、トリプレット制約の共有意味学習を行い、マルチモーダル融合マッピング関数を最適化し、共有意味情報を含む融合特徴を取得することは、
ある細粒度サブカテゴリから１つの触覚信号サンプルｈ_ｉをランダムに選択し、該サンプルをアンカーとすることと、
画像データセットからｈ_ｉと同じカテゴリに属し且つ意味特徴がｆ_ｉ ^ｈに最も近いサンプルｖ_ｉ ^＋をポジティブマッチサンプルとして選択し、ｈ_ｉと同じカテゴリに属せず且つ意味特徴がｆ_ｉ ^ｈに最も近いサンプルｖ_ｊ ^－をネガティブマッチサンプルとして選択することと、
これによりデータセット内のサンプルのためにトリプレット集合｛（ｈ_ｉ，ｖ_ｉ ^＋，ｖ_ｊ ^－）｝を構成することと、
同様に触覚信号サンプルとオーディオ信号サンプルとで構成されるトリプレット集合｛（ｈ_ｉ，ａ_ｉ ^＋，ａ_ｊ ^－）｝を取得することと、
アンカー点ｈ_ｉの意味特徴ｆ_ｉ ^ｈとオーディオ・画像モーダルにおける対応する小分類内のポジティブマッチの意味特徴
との距離を最小化し、且つｆ_ｉ ^ｈとネガティブマッチの意味特徴
と間の意味特徴を最大化し、且つ１つの最小の間隔δがあることで、取得した２つのトリプレット損失関数が、
であり、
統合パラダイムを導入し、マルチモーダル特徴を高度融合し、即ち、
ｆ^ａとｆ^ｖを融合し、過程が、
ｆ^ｍ＝Ｆ_ｍ（ｆ^ａ,ｆ^ｖ;θ_ｍ）
（ただし、ｆ^ｍは共有意味部分空間におけるマルチモーダル融合の出力即ち融合特徴であり、Ｆ_ｍ（・）はパラメータがθ_ｍのマルチモーダル融合マッピング関数であり、Ｆ_ｍ（・）はｆ^ａ及びｆ^ｖの線形重み付けを取る）であることと、
ｆ_＋ ^ａとｆ_＋ ^ｖをｆ_＋ ^ｍに融合し、ｆ_－ ^ａとｆ_－ ^ｖをｆ_－ ^ｍに融合することと、
トリプレット損失を利用して融合特徴を制約し、即ち、
であり、
共有意味学習の目的関数が３つの損失関数を組み合わせることでモデリングし、
Ｌ_ｓｙｎ＝Ｌ_ｓｙｎ ^ａ＋Ｌ_ｓｙｎ ^ｖ＋Ｌ_ｓｙｎ ^ｍとして示されることと、
Ｌ_Ｓｙｎを最小化することで、最適なθ_ｍを取得することと、を含む。 Furthermore, constructing a triplet set based on the modal features of tactile, audio, and image, conducting shared semantic learning of triplet constraints, optimizing the multimodal fusion mapping function, and obtaining fusion features containing shared semantic information is
carried out as follows:
randomly selecting one haptic signal sample $h_i$ from a fine-grained subcategory and taking this sample as the anchor;
selecting from the image dataset the sample $v_i^+$ that belongs to the same category as $h_i$ and whose semantic feature is closest to $f_i^h$ as the positive match sample, and selecting the sample $v_j^-$ that does not belong to the same category as $h_i$ and whose semantic feature is closest to $f_i^h$ as the negative match sample;
thereby constructing the triplet set $\{(h_i, v_i^+, v_j^-)\}$ for the samples in the dataset;
similarly obtaining the triplet set $\{(h_i, a_i^+, a_j^-)\}$ consisting of haptic signal samples and audio signal samples;
minimizing the distance between the semantic feature $f_i^h$ of the anchor $h_i$ and the semantic features of its positive matches in the corresponding subcategory of the audio and image modalities, and maximizing the distance between $f_i^h$ and the semantic features of its negative matches, subject to a minimum margin $\delta$, which yields two triplet loss functions, $L_{syn}^a$ and $L_{syn}^v$;
introducing an integration paradigm to fuse the multimodal features at a high level, i.e. fusing $f^a$ and $f^v$ as
$f^m = F_m(f^a, f^v; \theta_m)$
where $f^m$ is the output of multimodal fusion in the shared semantic subspace, i.e. the fused feature, $F_m(\cdot)$ is the multimodal fusion mapping function with parameters $\theta_m$, and $F_m(\cdot)$ takes a linear weighting of $f^a$ and $f^v$;
fusing $f_+^a$ and $f_+^v$ into $f_+^m$, and $f_-^a$ and $f_-^v$ into $f_-^m$;
constraining the fused features with a triplet loss $L_{syn}^m$;
modeling the objective function of shared semantic learning as the combination of the three loss functions, expressed as $L_{syn} = L_{syn}^a + L_{syn}^v + L_{syn}^m$; and
minimizing $L_{syn}$ to obtain the optimal $\theta_m$.
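The triplet construction and the margin-based loss described above can be sketched as follows. The hinge form of the triplet loss with squared Euclidean distance is the conventional one and is assumed here, since the formulas themselves are not reproduced in this text. The scalar `weight` is a hypothetical stand-in for the fusion parameters $\theta_m$.

```python
from typing import List, Tuple

Vector = List[float]


def sq_dist(x: Vector, y: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def mine_triplet(
    anchor_feat: Vector,
    anchor_tag: int,
    cand_feats: List[Vector],
    cand_tags: List[int],
) -> Tuple[int, int]:
    """Return indices of the nearest same-category candidate (positive) and the
    nearest different-category candidate (negative).
    Raises ValueError if either group is empty."""

    def nearest(indices: List[int]) -> int:
        return min(indices, key=lambda i: sq_dist(anchor_feat, cand_feats[i]))

    pos = [i for i, t in enumerate(cand_tags) if t == anchor_tag]
    neg = [i for i, t in enumerate(cand_tags) if t != anchor_tag]
    return nearest(pos), nearest(neg)


def triplet_loss(anchor: Vector, positive: Vector, negative: Vector, delta: float = 1.0) -> float:
    """Hinge triplet loss with margin delta (assumed form)."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + delta)


def fuse(f_a: Vector, f_v: Vector, weight: float = 0.5) -> Vector:
    """Linear weighting of audio and image features; `weight` stands in for theta_m."""
    return [weight * a + (1.0 - weight) * v for a, v in zip(f_a, f_v)]


def shared_semantic_loss(l_syn_a: float, l_syn_v: float, l_syn_m: float) -> float:
    """L_syn = L_syn^a + L_syn^v + L_syn^m, as stated in the text."""
    return l_syn_a + l_syn_v + l_syn_m
```

Picking the nearest negative means the loss concentrates on the hardest confusable sample in each other category, which is what drives fine-grained separation.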

更に、触覚生成ネットワークを構築し、即ち、触覚生成ネットワークＧ（・）を構築し、その構造がＤ_ｈ（・）と同じであり、且つそのネットワークパラメータθ_ｈｄをＧ（・）のパラメータθ_Ｇの初期値とし、
必要な意味情報を含む融合特徴を触覚生成ネットワークＧ（・）に入力して所望の触覚信号ｈ′を取得し、且つ生成された触覚信号ｈ′をＥ_ｈ（・）により触覚特徴ｆ^ｈ′に再マッピングし、カテゴリ中心を選択して該触覚特徴ｆ^ｈ′に対して意味制約を行い、最終的な損失関数が、
として示され、
ただし、
であり、
が特徴ｆ^ｈとｆ^ｈ′との類似度を示し、
がｆ^ｈ′のクラスタリング損失であり、それらが一緒に損失関数の正則化項とされ、該損失関数を最適化することで、θ_Ｇの最適値を取得し、即ちＧ（・）を決定する。 Furthermore, a haptic generation network is constructed, that is, a haptic generation network G(·) is constructed, and its structure is the same as D _h (·), and its network parameter θ _hd is set as the initial value of the parameter θ _G of G(·);
the fused features containing the required semantic information are input into the haptic generation network $G(\cdot)$ to obtain the desired haptic signal $h'$; the generated haptic signal $h'$ is remapped by $E_h(\cdot)$ to a haptic feature $f^{h'}$, and a category center is selected to impose a semantic constraint on $f^{h'}$. In the final loss function $L_{Gen}$, one term measures the similarity between the features $f^h$ and $f^{h'}$ and another term is the clustering loss of $f^{h'}$; together they serve as the regularization terms of the loss function. By optimizing this loss function, the optimal value of $\theta_G$ is obtained, i.e. $G(\cdot)$ is determined.
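The structure of the generation objective (a main term plus two regularizers) can be illustrated as below. The specific terms are assumptions: a squared error between the real and generated signals, a squared feature distance as a stand-in for the similarity measure, and a squared distance to the selected category center as the clustering loss. The weights `alpha` and `beta` and the helper `init_generator` are hypothetical.

```python
import copy
from typing import Any, List

Vector = List[float]


def sq_dist(x: Vector, y: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def init_generator(theta_hd: Any) -> Any:
    """Initialize theta_G from the decoder parameters theta_hd (independent copy)."""
    return copy.deepcopy(theta_hd)


def generation_loss(
    h: Vector,
    h_gen: Vector,
    f_h: Vector,
    f_gen: Vector,
    center: Vector,
    alpha: float = 1.0,
    beta: float = 1.0,
) -> float:
    """Assumed form of L_Gen: signal error + alpha * feature dissimilarity
    + beta * clustering error of the regenerated feature."""
    rec = sq_dist(h, h_gen)        # error between real h and generated h'
    sim = sq_dist(f_h, f_gen)      # stand-in for the f^h vs f^h' similarity term
    clu = sq_dist(f_gen, center)   # distance of f^h' to its category center
    return rec + alpha * sim + beta * clu
```

Starting $G(\cdot)$ from the trained decoder weights, as the text specifies, gives the generator a reasonable initial mapping from feature space to signal space before fine-tuning.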

更に、触覚オートエンコーダ、オーディオ特徴抽出ネットワーク、及び画像特徴抽出ネットワークのパラメータを事前設定し、
マルチモーダル融合マッピング関数のパラメータ及び触覚生成ネットワークのパラメータを事前設定し、
触覚オートエンコーダ、オーディオ特徴抽出ネットワーク、画像特徴抽出ネットワーク、マルチモーダル融合マッピング関数、及び触覚生成ネットワークを訓練することは、ステップ１及びステップ２を含み、
前記ステップ１において、θ_ｖ、θ_ａ、θ_ｈｅ、θ_ｈｄ、及びＭを事前設定し、且つ｛ｓ_ｉ ^ｈ｝、｛ｓ_ｉ ^ａ｝、｛ｓ_ｉ ^ｖ｝を最適化し、前記ステップ１は、
ネットワークパラメータθ_ｖ、θ_ａ、θ_ｈｅ、θ_ｈｄ、ノード系タグ｛ｓ_ｉ ^ｈ｝、｛ｓ_ｉ ^ａ｝、｛ｓ_ｉ ^ｖ｝、及びカテゴリ中心行列Ｍを初期化し、クラスタ数Ｃ、学習率μ_１、及び反復回数Ｔを設定するステップ１１と、
｛ｓ_ｉ ^ｈ｝、｛ｓ_ｉ ^ａ｝、｛ｓ_ｉ ^ｖ｝、及びＭを固定し、確率的勾配降下法に基づいてθ_ｖ、θ_ａ、θ_ｈｅ、θ_ｈｄを最適化し、即ち、
であり、
ただし、∇が各損失関数を偏微分することであるステップ１２と、
θ_ｖ、θ_ａ、θ_ｈｅ、θ_ｈｄ、及びＭを固定し、｛ｓ_ｉ ^ｈ｝、｛ｓ_ｉ ^ａ｝、｛ｓ_ｉ ^ｖ｝を最適化し、即ち、
であるステップ１３と、
θ_ｖ、θ_ａ、θ_ｈｅ、θ_ｈｄ、及び｛ｓ_ｉ ^ｈ｝、｛ｓ_ｉ ^ａ｝、｛ｓ_ｉ ^ｖ｝を固定し、Ｍを最適化し、即ち、
であるステップ１４と、
ｔ＜Ｔの場合、ステップ４１２にジャンプし、ｔ＝ｔ＋１の場合、次回の反復を継続し、そうでない場合、反復を終了するステップ１５と、
Ｔ回反復した後、最適なオーディオ特徴抽出ネットワークのパラメータθ_ａ、画像特徴抽出ネットワークのパラメータθ_ｖ、触覚オートエンコーダのパラメータθ_ｈｅ、θ_ｈｄ、ノード系タグ｛ｓ_ｉ ^ｈ｝、｛ｓ_ｉ ^ａ｝、｛ｓ_ｉ ^ｖ｝、及び触覚データのクラスタ中心ベクトル行列Ｍを取得するステップ１６と、を含み、
前記ステップ２において、確率的勾配降下法に基づいてθ_ｍ及びθ_Ｇを推計し、前記ステップ２は、
θ_ｍ、学習率μ_２、μ_３、反復回数ｎ_１を初期化するステップ２１と、
Ｌ_Ｓｙｎに基づいて確率的勾配降下法を利用してθ_ｍを推定し、即ち、
であるステップ２２と、
Ｌ_Ｇｅｎに基づいて確率的勾配降下法を利用してθ_Ｇを更新し、即ち、
であるステップ２３と、
ｎ＜ｎ_１の場合、ステップ２２にジャンプし、ｎ＝ｎ＋１の場合、次回の反復を継続し、そうでない場合、反復を終了するステップ２４と、
ｎ_１回反復した後、最適なθ_ｍ及びθ_Ｇを取得するステップ２５と、を含む。 Furthermore, we pre-set the parameters of the haptic autoencoder, audio feature extraction network, and image feature extraction network.
The parameters of the multimodal fusion mapping function and of the haptic generation network are likewise preset, and training the haptic autoencoder, the audio feature extraction network, the image feature extraction network, the multimodal fusion mapping function, and the haptic generation network comprises Step 1 and Step 2.
In Step 1, $\theta_v$, $\theta_a$, $\theta_{he}$, $\theta_{hd}$, and $M$ are preset, and $\{s_i^h\}$, $\{s_i^a\}$, and $\{s_i^v\}$ are optimized. Step 1 comprises:
Step 11: initialize the network parameters $\theta_v$, $\theta_a$, $\theta_{he}$, $\theta_{hd}$, the category tags $\{s_i^h\}$, $\{s_i^a\}$, $\{s_i^v\}$, and the category-center matrix $M$, and set the number of clusters $C$, the learning rate $\mu_1$, and the number of iterations $T$;
Step 12: with $\{s_i^h\}$, $\{s_i^a\}$, $\{s_i^v\}$, and $M$ fixed, optimize $\theta_v$, $\theta_a$, $\theta_{he}$, and $\theta_{hd}$ by stochastic gradient descent, where $\nabla$ denotes the partial derivative of each loss function;
Step 13: with $\theta_v$, $\theta_a$, $\theta_{he}$, $\theta_{hd}$, and $M$ fixed, optimize $\{s_i^h\}$, $\{s_i^a\}$, and $\{s_i^v\}$;
Step 14: with $\theta_v$, $\theta_a$, $\theta_{he}$, $\theta_{hd}$, and $\{s_i^h\}$, $\{s_i^a\}$, $\{s_i^v\}$ fixed, optimize $M$;
Step 15: if $t < T$, set $t = t + 1$ and return to Step 12 for the next iteration; otherwise, end the iteration;
Step 16: after $T$ iterations, obtain the optimal audio feature extraction network parameters $\theta_a$, image feature extraction network parameters $\theta_v$, haptic autoencoder parameters $\theta_{he}$ and $\theta_{hd}$, category tags $\{s_i^h\}$, $\{s_i^a\}$, $\{s_i^v\}$, and cluster-center matrix $M$ of the haptic data.
In Step 2, $\theta_m$ and $\theta_G$ are estimated by stochastic gradient descent. Step 2 comprises:
Step 21: initialize $\theta_m$, the learning rates $\mu_2$ and $\mu_3$, and the number of iterations $n_1$;
Step 22: estimate $\theta_m$ from $L_{syn}$ by stochastic gradient descent;
Step 23: update $\theta_G$ from $L_{Gen}$ by stochastic gradient descent;
Step 24: if $n < n_1$, set $n = n + 1$ and return to Step 22 for the next iteration; otherwise, end the iteration;
Step 25: after $n_1$ iterations, obtain the optimal $\theta_m$ and $\theta_G$.
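The alternating scheme of Step 1 (update network parameters, then tags, then centers) can be sketched as below. The update formulas are not reproduced in this text, so the sketch assumes K-means style updates: tags are reassigned to the nearest center and each center becomes the mean of its members. The gradient step of Step 12 is represented only by an optional hypothetical callback `sgd_step`.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def sq_dist(x: Vector, y: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def nearest_center(f: Vector, centers: List[Vector]) -> int:
    """Index of the center closest to feature f."""
    return min(range(len(centers)), key=lambda c: sq_dist(f, centers[c]))


def update_centers(features: List[Vector], tags: List[int], centers: List[Vector]) -> List[Vector]:
    """Assumed Step 14: each center becomes the mean of its members;
    a center with no members is left unchanged."""
    new_centers = []
    for c, old in enumerate(centers):
        members = [f for f, t in zip(features, tags) if t == c]
        if not members:
            new_centers.append(list(old))
            continue
        new_centers.append(
            [sum(m[d] for m in members) / len(members) for d in range(len(old))]
        )
    return new_centers


def train_step1(
    features: List[Vector],
    centers: List[Vector],
    iterations: int,
    sgd_step: Optional[Callable[[List[Vector], List[int], List[Vector]], List[Vector]]] = None,
) -> Tuple[List[int], List[Vector]]:
    """Alternate Steps 12 to 14 for a fixed number of iterations T."""
    tags = [nearest_center(f, centers) for f in features]
    for _ in range(iterations):
        if sgd_step is not None:
            # Step 12: update network parameters; the callback returns new features.
            features = sgd_step(features, tags, centers)
        tags = [nearest_center(f, centers) for f in features]   # Step 13
        centers = update_centers(features, tags, centers)       # Step 14
    return tags, centers
```

Step 2 follows the same loop shape, with one gradient update of $\theta_m$ and one of $\theta_G$ per iteration in place of the tag and center updates.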

従来技術に比べて、本発明は以上の技術的解決手段を用いることにより、以下の技術的効果を有する。 Compared to the prior art, the present invention has the following technical advantages by using the above technical solutions:

本発明は、クロスモーダル転移の深層クラスタリングアルゴリズムによって３種類のモーダルサンプルの細粒度分類を学習した後、共有意味学習を行い、マルチモーダル特徴融合の優位性を十分に発揮し、最終的にクラスタリング制約に基づく細粒度触覚信号の生成を実現し、既存の弱監督及び弱マッチングの問題のあるデータセットを最大限に利用し、これにより高品質で細粒度の触覚信号を生成し、それによりクロスモーダルサービスの要件に一層合致する。 This invention uses a deep clustering algorithm for cross-modal transfer to learn fine-grained classification of three types of modal samples, then performs shared semantic learning, fully utilizing the advantages of multi-modal feature fusion, ultimately realizing the generation of fine-grained haptic signals based on clustering constraints, making full use of existing datasets with weak supervision and weak matching problems, thereby generating high-quality fine-grained haptic signals that better meet the requirements of cross-modal services.

本発明に係る視聴覚補助用の細粒度触覚信号の再構築方法のフローチャートである。 FIG. 1 is a flowchart of the method for reconstructing fine-grained haptic signals for audiovisual aids according to the present invention.
本発明に係る完全なネットワークの構造模式図である。 FIG. 2 is a structural schematic diagram of the complete network according to the present invention.
本発明に係るクロスモーダル転移に基づく深層クラスタリングモデルのアーキテクチャ模式図である。 FIG. 3 is a schematic diagram of the architecture of the deep clustering model based on cross-modal transfer according to the present invention.
本発明及び他の比較方法の触覚信号の再構築結果を示す図である。 FIG. 4 shows the haptic signal reconstruction results of the present invention and of other comparative methods.

本発明の目的、技術的解決手段、及び利点をより明確にするために、以下に図面及び具体的な実施例を参照しながら本発明を詳しく説明する。 To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention will be described in detail below with reference to the drawings and specific embodiments.

本発明は視聴覚補助用の細粒度触覚信号の再構築方法を提供し、そのフローチャートは図１に示され、該方法は以下のステップ１～ステップ５を含む。 The present invention provides a method for reconstructing fine-grained tactile signals for audiovisual assistance, the flowchart of which is shown in Figure 1, and the method includes the following steps 1 to 5.

ステップ１では、まず、触覚信号を触覚オートエンコーダに入力し、クラスタリングによって触覚信号の特徴抽出を実現し、次に、クロスモーダル転移学習技術を利用し、触覚オートエンコーダの特徴抽出能力をそれぞれオーディオ特徴抽出ネットワーク及び画像特徴抽出ネットワークに転移して最適化し、その後、クラスタリング制約、中心制約、及びソート制約を共同で考慮して抽出された触覚、オーディオ、及び画像のモーダル特徴を更に制約することで、同じ意味に属する各モーダル特徴を接近させるが、同じ意味に属しない各モーダル特徴を分離させ、細粒度分類を有する触覚、オーディオ、及び画像のモーダル特徴を取得する。 In step 1, haptic signals are first input into a haptic autoencoder, and features of the haptic signals are extracted with the help of clustering. Next, cross-modal transfer learning is used to transfer the feature extraction capability of the haptic autoencoder to an audio feature extraction network and an image feature extraction network, respectively, and to optimize them. The extracted haptic, audio, and image modal features are then further constrained by jointly considering the clustering constraint, the center constraint, and the sorting constraint, so that modal features with the same semantics are drawn together and modal features with different semantics are pushed apart, yielding haptic, audio, and image modal features with fine-grained classification.

（１－１）、まず、触覚、画像、及びオーディオの３種類のモーダル信号に対してクラスタリング制約の特徴学習を行い、細粒度サブカテゴリを有する区分特徴を取得する。これについて以下の３つのステップ即ちステップ（１－１－１）～ステップ（１－１－３）に分けられてもよい。 (1-1) First, clustering-constrained feature learning is performed on three types of modal signals: haptic, image, and audio, to obtain segmental features with fine-grained subcategories. This can be divided into the following three steps: Step (1-1-1) to Step (1-1-3).

（１－１－１）、第１ステップとしては、まず、触覚信号をオートエンコーダに入力して学習し、対応する触覚特徴を抽出し、触覚特徴に基づいて触覚信号に対してＫ－ｍｅａｎｓアルゴリズムに基づくクラスタリングを実施し、即ち、
具体的に、ｈ_ｉが入力された触覚信号であり、ｈ＝｛ｈ_ｉ｝_{ｉ＝１,・・・,Ｎ}であり、ｉが入力触覚信号のソート下付き文字を示し、Ｎが入力された触覚信号の総量であり、オートエンコーダの符号化モジュールＥ_ｈ（・）を通過した後、ｆ_ｉ ^ｈ＝Ｅ_ｈ（ｈ_ｉ;θ_ｈｅ）が触覚信号ｈ_ｉの特徴表現であり、ｆ^ｈ＝｛ｆ_ｉ ^ｈ｝_{ｉ＝１,・・・,Ｎ}であり、θ_ｈｅが符号化モジュールのパラメータであると仮定し、ｆ_ｉ ^ｈを復号モジュールＤ_ｈ（・）に入力し、出力触覚信号
を取得し、ここで、θ_ｈｄが復号モジュールのパラメータであり、また、特徴ｆ_ｉ ^ｈに対してＫ－ｍｅａｎｓアルゴリズムに基づくクラスタリングを実施し、対応するカテゴリタグｓ_ｉ ^ｈを出力し、上記過程におけるパラメータを共同で推計し、損失関数
（ただし、
はエンコーダの再構築誤差であり、Ｎは触覚信号の数であり、
はＫ－ｍｅａｎｓのクラスタリング誤差であり、ＭはＫ－ｍｅａｎｓアルゴリズムにより触覚データを取得するクラスタ中心ベクトル行列であり、Ｍ行列における第ｃ列のｍ_ｃはｃ番目のクラスタの質量中心を示し、θ_ｈ＝［θ_ｈｅ,θ_ｈｄ］はエンコーダモジュール及びデコーダモジュールのパラメータであり、ｓ_ｊ，ｉ ^ｈはｓ_ｉ ^ｈのｊ番目の要素であり、その中の要素のｓ_ｊ，ｉ ^ｈ値が１、他の要素がいずれも０であれば、ｓ_ｉ ^ｈに対応するオリジナルの触覚信号ｈ_ｉは第ｊカテゴリに属することを示し、ｌは最小二乗損失
であり、λ≧０は正則化パラメータである）を設計し、
Ｌ_ｃｌｕ ^ｈを最小化することで、θ_ｈを推計し、ｆ_ｉ ^ｈ及びｓ_ｉ ^ｈを取得することができる。 (1-1-1) In the first step, the haptic signals are input to an autoencoder for learning and the corresponding haptic features are extracted; the haptic signals are then clustered with the K-means algorithm on the basis of these features.
Specifically, let h_i be an input haptic signal, h = {h_i}_{i=1,...,N}, where i is the index of the input haptic signal and N is the total number of input haptic signals. After the encoding module E_h(·) of the autoencoder, f_i^h = E_h(h_i; θ_he) is the feature representation of h_i, with f^h = {f_i^h}_{i=1,...,N} and θ_he the parameters of the encoding module. Feeding f_i^h into the decoding module D_h(·) gives the output haptic signal
$$\hat{h}_i = D_h(f_i^h; \theta_{hd}),$$
where θ_hd denotes the parameters of the decoding module. K-means clustering is also performed on the features f_i^h, which outputs the corresponding category tags s_i^h. The parameters of this process are estimated jointly by designing the loss function
$$L_{clu}^h = \sum_{i=1}^{N} \left[ l(\hat{h}_i, h_i) + \frac{\lambda}{2} \left\| f_i^h - M s_i^h \right\|_2^2 \right],$$
where $l(\hat{h}_i, h_i)$ is the reconstruction error of the autoencoder, N is the number of haptic signals, $\| f_i^h - M s_i^h \|_2^2$ is the K-means clustering error, M is the matrix of cluster center vectors obtained by applying the K-means algorithm to the haptic data, the c-th column m_c of M is the centroid of the c-th cluster, θ_h = [θ_he, θ_hd] collects the parameters of the encoder and decoder modules, s_{j,i}^h is the j-th element of s_i^h (if s_{j,i}^h is 1 and all other elements are 0, the original haptic signal h_i corresponding to s_i^h belongs to the j-th category), l is the least-squares loss
$$l(x, y) = \left\| x - y \right\|_2^2,$$
and λ ≥ 0 is a regularization parameter.
By minimizing L_clu^h, θ_h can be estimated and f_i^h and s_i^h can be obtained.
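As an illustration of the objective in step (1-1-1), the following minimal NumPy sketch evaluates a loss made of a reconstruction error plus a K-means clustering error for a batch of features. It assumes a squared-error reconstruction term and a clustering term weighted by λ/2, in the style of a deep clustering network; the function names `dcn_loss` and `assign`, the array shapes, and the default `lam` are illustrative assumptions rather than part of the described method.

```python
import numpy as np

def dcn_loss(h, h_rec, f, M, s, lam=1.0):
    """Reconstruction error plus K-means clustering error (assumed form).

    h, h_rec : (N, Z) input haptic signals and autoencoder outputs
    f        : (N, d) encoder features f_i^h
    M        : (d, C) cluster-center matrix, column c is m_c
    s        : (N, C) one-hot category tags s_i^h
    lam      : regularization weight lambda >= 0 (value is an assumption)
    """
    recon = np.sum((h_rec - h) ** 2, axis=1)      # least-squares loss l
    clust = np.sum((f - s @ M.T) ** 2, axis=1)    # ||f_i - M s_i||^2
    return float(np.sum(recon + 0.5 * lam * clust))

def assign(f, M):
    """One-hot tags: each feature is assigned to its nearest center."""
    d2 = ((f[:, None, :] - M.T[None, :, :]) ** 2).sum(axis=2)  # (N, C)
    return np.eye(M.shape[1])[d2.argmin(axis=1)]
```

In practice the encoder and decoder would be neural networks trained by gradient descent on this quantity; the sketch only shows how the two terms are combined.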

（１－１－２）、第２ステップとしては、クロスモーダル転移学習技術によって、取得された触覚特徴を画像信号及びオーディオ信号の特徴抽出過程に転移する。即ち、特徴自己適応方法で触覚領域と視聴覚領域との間の最大平均差異（ＭＭＤ）準則を最小化して転移を実現する。 (1-1-2) In the second step, the acquired tactile features are transferred to the feature extraction process of the image and audio signals using cross-modal transfer learning technology. That is, the transfer is achieved by minimizing the maximum mean difference (MMD) criterion between the tactile and audiovisual domains using a feature self-adaptation method.

具体的に、触覚、オーディオ、及び画像信号セットの分布をそれぞれＰ、Ｑ、及びＲとする。触覚信号とオーディオ信号との間のＭＭＤをＭＭＤ_ｋ（Ｐ,Ｑ）として示し、触覚信号と視覚信号との間のＭＭＤがＭＭＤ_ｋ（Ｐ,Ｒ）である。再生カーネルＨｉｂｅｒｔ空間Ｈ_ｋにおいて、Ｈ_ｋが非空集合に定義された関数セットｆを含み、ＭＭＤの２乗が、
であり、
ただし、触覚、オーディオ、及び画像信号にそれぞれの特徴抽出ネットワークφを通過させ、抽出された特徴ベクトルを取得し、触覚特徴ベクトルがφ^ｈ（ｈ;θ_ｈｅ）として示されてもよく、即ち前のステップにおけるオートエンコーダの符号化モジュールの出力であり、オーディオ及び画像の特徴ベクトルがφ^ａ（ａ;θ_ａ）＝ｆ^ａ、φ^ｖ（ｖ;θ_ｖ）＝ｆ^ｖであり、３つのモーダルの特徴集合が（ｆ^ｈ,ｆ^ａ,ｆ^ｖ）として示されてもよい。θ_ａ及びθ_ｖがそれぞれオーディオ及び画像特徴抽出ネットワークのパラメータであり、θ_ｈｅが符号化モジュールのパラメータである。任意の関数ｆ∈Ｈ_ｋ且つ任意のＸ∈Ｐであり、
であり、μ_ｋ（Ｐ）がＰのＨ_ｋにおける平均埋め込みであり、即ち分布ＰのＨ_ｋ空間における１つの要素表現であり、ｆ（Ｘ）はＸが関数ｆによりＨ_ｋ空間にマッピングすることを示し、＜・,・＞_Ｈｋが内積演算であり、同様に、
であり、μ_ｋ（Ｑ）がＱのＨ_ｋにおける平均埋め込みであり、
であり、μ_ｋ（Ｒ）がＲのＨ_ｋにおける平均埋め込みである。 Specifically, let the distributions of the haptic, audio, and image signal sets be P, Q, and R, respectively. The MMD between the haptic and audio signals is denoted MMD_k(P,Q), and the MMD between the haptic and visual signals is denoted MMD_k(P,R). In the reproducing kernel Hilbert space H_k, which contains a set of functions f defined on a non-empty set, the squared MMDs are
$$\mathrm{MMD}_k^2(P,Q) = \left\| \mathbb{E}_P\!\left[\varphi^h(h;\theta_{he})\right] - \mathbb{E}_Q\!\left[\varphi^a(a;\theta_a)\right] \right\|_{H_k}^2, \qquad \mathrm{MMD}_k^2(P,R) = \left\| \mathbb{E}_P\!\left[\varphi^h(h;\theta_{he})\right] - \mathbb{E}_R\!\left[\varphi^v(v;\theta_v)\right] \right\|_{H_k}^2,$$
where the haptic, audio, and image signals are passed through their respective feature extraction networks φ to obtain the extracted feature vectors. The haptic feature vector may be written φ^h(h; θ_he), i.e., the output of the encoding module of the autoencoder in the previous step; the audio and image feature vectors are φ^a(a; θ_a) = f^a and φ^v(v; θ_v) = f^v; and the feature set of the three modalities may be written (f^h, f^a, f^v). θ_a and θ_v are the parameters of the audio and image feature extraction networks, respectively, and θ_he is the parameter of the encoding module. For any function f ∈ H_k and any X ∼ P,
$$\mathbb{E}_{X \sim P}\, f(X) = \left\langle f(X), \mu_k(P) \right\rangle_{H_k},$$
where μ_k(P) is the mean embedding of P in H_k, i.e., the element of H_k that represents the distribution P, f(X) denotes the mapping of X into H_k by the function f, and ⟨·,·⟩_{H_k} is the inner product. Similarly,
$$\mathbb{E}_{Y \sim Q}\, f(Y) = \left\langle f(Y), \mu_k(Q) \right\rangle_{H_k},$$
where μ_k(Q) is the mean embedding of Q in H_k, and
$$\mathbb{E}_{Z \sim R}\, f(Z) = \left\langle f(Z), \mu_k(R) \right\rangle_{H_k},$$
where μ_k(R) is the mean embedding of R in H_k.

対応する
の値をクロスモーダル転移の損失関数として計算し、具体的な公式は、
である。 The corresponding values of MMD_k^2(P,Q) and MMD_k^2(P,R) are calculated as the loss function of the cross-modal transfer; the specific formula is
$$L_{CT} = \mathrm{MMD}_k^2(P,Q) + \mathrm{MMD}_k^2(P,R).$$

Ｌ_ＣＴを最適化することで、触覚特徴抽出オートエンコーダモデルとオーディオ・画像特徴抽出ネットワークとの間の情報が流れるように案内することができ、それにより触覚モーダルのためのオートエンコーダの特徴抽出能力をオーディオ・画像の特徴抽出ネットワークに効果的に転移し、即ちθ_ａ及びθ_ｖを推計することができる。 Optimizing L_CT guides the flow of information between the haptic feature extraction autoencoder and the audio and image feature extraction networks, thereby effectively transferring the feature extraction capability of the autoencoder for the haptic modality to the audio and image feature extraction networks; in other words, θ_a and θ_v can be estimated.
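The MMD criterion can be estimated from finite batches of features. The sketch below is one hedged way to do this in NumPy. It assumes a Gaussian kernel with bandwidth `sigma` (the text does not name a kernel) and assumes that the transfer loss simply adds the haptic-audio and haptic-image terms; `gaussian_kernel`, `mmd2`, and `transfer_loss` are hypothetical names.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Pairwise Gaussian kernel values between rows of x (n, d) and y (m, d)."""
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    """Biased empirical estimate of the squared MMD between two feature sets."""
    return float(gaussian_kernel(x, x, sigma).mean()
                 + gaussian_kernel(y, y, sigma).mean()
                 - 2.0 * gaussian_kernel(x, y, sigma).mean())

def transfer_loss(f_h, f_a, f_v, sigma=1.0):
    """Assumed form of the transfer loss: haptic-audio plus haptic-image MMD."""
    return mmd2(f_h, f_a, sigma) + mmd2(f_h, f_v, sigma)
```

The estimate is zero when both feature sets coincide and grows as the two sets drift apart, which is the behavior the transfer step relies on when it pulls the audio and image features toward the haptic features.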

（１－１－３）、第３ステップとしては、オーディオ・画像の特徴抽出ネットワークを更に最適化し、ここで、ビデオ特徴抽出ネットワークはＶＧＧネットワークの設計スタイルを選択し、即ち３×３の畳み込みフィルタ及びステップ幅が２である充填なしの２×２の最大プーリング層を有し、ネットワークは４つのブロックに分けられ、各ブロックに２つの畳み込み層及び１つのプーリング層が含まれ、連続ブロックの間に倍になるフィルタ数を有し、最後に、全ての空間位置に最大プーリングを実行することで単一の５１２次元の意味特徴ベクトルを生成する。次に、該意味特徴ベクトルを、１つの３層の完全接続ニューラルネットワーク（２５６－１２８－３２）、及びＫ個のノードとｓｏｆｔｍａｘ関数付きの１つの完全接続層に入力し、３２次元ベクトルは視覚信号の特徴ベクトルであり、更に後続のトリプレット制約に基づく共有意味学習の訓練を受けることとなり、Ｋが各粗粒度カテゴリにおける細粒度サブカテゴリの個数であり、本実験のデータセットにおいてＫが３である。オーディオ信号の関連するネットワーク構造は設定が視覚信号と同様であり、視覚信号と分類器を共有する。 (1-1-3) In the third step, the audio and image feature extraction networks are further optimized. The video feature extraction network follows the VGG design style, i.e., 3×3 convolutional filters and unpadded 2×2 max-pooling layers with a stride of 2. The network is divided into four blocks, each containing two convolutional layers and one pooling layer, and the number of filters doubles from one block to the next. Finally, max pooling over all spatial locations produces a single 512-dimensional semantic feature vector. This vector is then fed into a three-layer fully connected neural network (256-128-32) followed by one fully connected layer with K nodes and a softmax function. The 32-dimensional vector is the feature vector of the visual signal and is trained further in the subsequent shared semantic learning based on triplet constraints. K is the number of fine-grained subcategories in each coarse-grained category; K is 3 for the dataset used in this experiment. The network structure for audio signals is configured in the same way as that for visual signals and shares its classifier with the visual branch.
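The layer sizes implied by this description can be traced with a few lines of arithmetic. The sketch below assumes a 224×224 input, 3×3 convolutions padded so that they keep the spatial size, and 64 filters in the first block; the padding and the base filter count are assumptions, chosen because doubling 64 three times gives the 512 channels stated in the text.

```python
def vgg_style_shapes(size=224, blocks=4, base_filters=64):
    """Return (spatial size, channels) after each conv-conv-pool block.

    Assumes size-preserving 3x3 convolutions and an unpadded 2x2 max pool
    with stride 2; base_filters=64 is an assumption, not stated in the text.
    """
    shapes = []
    channels = base_filters
    for _ in range(blocks):
        size = (size - 2) // 2 + 1      # 2x2 pool, stride 2, no padding
        shapes.append((size, channels))
        channels *= 2                   # filters double between blocks
    return shapes

shapes = vgg_style_shapes()
feature_dim = shapes[-1][1]             # global max pooling keeps channels only
head = [feature_dim, 256, 128, 32]      # three fully connected layers
K = 3                                   # softmax nodes per coarse category
print(shapes, head, K)
```

Under these assumptions the spatial size halves in every block (224 to 112, 56, 28, 14) while the channel count grows from 64 to 512, so global max pooling over the last 14×14 map yields the 512-dimensional vector that feeds the 256-128-32 head.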

設計された分類損失関数は、
（ただし、ｓ_ｉ ^ａ及びｓ_ｉ ^ｖはそれぞれオーディオ信号及び画像信号のカテゴリタグであり、Ｌ_ｃｌｕ ^ａｖを最小化することで、θ_ａ及びθ_ｖの最適値を更に取得することができ、ｐ（ｆ_ｉ ^ａ;θ_ａ）の意味はオーディオ特徴抽出ネットワーク入力がｆ_ｉ ^ａ、ネットワークパラメータがθ_ａである場合にオーディオ信号カテゴリｓ_ｉ ^ａを取得する確率であり、ｐ（ｆ_ｉ ^ｖ;θ_ｖ）の意味は画像特徴抽出ネットワーク入力がｆ_ｉ ^ｖ、ネットワークパラメータがθ_ｖである場合に画像信号カテゴリｓ_ｉ ^ｖを取得する確率である）である。 The designed classification loss function is
(where s _i ^a and s _i ^v are the category tags of the audio signal and the image signal, respectively; by minimizing L _clu ^av , the optimal values of θ _a and θ _v can further be obtained; p(f _i ^a ; θ _a ) means the probability of obtaining the audio signal category s _i ^a when the audio feature extraction network input is f _i ^a and the network parameter is θ _a ; and p(f _i ^v ; θ _v ) means the probability of obtaining the image signal category s _i ^v when the image feature extraction network input is f _i ^v and the network parameter is θ _v ).

特に説明すべきなのは、該タグの取得は以下の漸進ポリシーを採用することである。まず、データに疑似タグを付け、これらのデータを利用してモデルを最適化し、最適化されたモデルによる視聴覚信号の分類能力が強化される。更に、訓練後のモデルを利用して疑似タグ操作を実行し、それにより疑似タグを更新する。このような漸進的最適化方法によって、ネットワークの細粒度分類能力を徐々に向上させる。 Notably, the tags are obtained using the following incremental policy: First, pseudo-tags are added to the data, and then these data are used to optimize the model, enhancing the model's ability to classify audiovisual signals. Then, the trained model is used to perform pseudo-tag operations, thereby updating the pseudo-tags. This incremental optimization method gradually improves the network's fine-grained classification ability.

要するに、ステップ（１－１）における総目的関数は上記３つの損失関数の組合せであり、
Ｌ_ｃｌｕ＝Ｌ_ｃｌｕ ^ｈ＋Ｌ_ＣＴ＋Ｌ_ｃｌｕ ^ａｖとして示されてもよく、
Ｌ_ｃｌｕを最小化することで、最適なθ_ｈ、θ_ａ、及びθ_ｖを取得することができ、パラメータを決定した後に、触覚信号、オーディオ信号、及び画像信号に対応する特徴ｆ^ｈ、ｆ^ａ、ｆ^ｖを取得することができ、
（１－２）、同じ細粒度サブカテゴリの特徴間のコンパクト性を確保するために、３種類のモーダル信号に対して中心制約におけるクラスタリング学習を行う。より良い細粒度の分類性能を実現するために、同じサブカテゴリの特徴が共通空間において隣接すべきであり、この目的はカテゴリ内分散を最小化することである。具体的に、特徴からそのサブカテゴリ中心までの距離を最小化することでクラスタリング学習を駆動する。触覚信号のサブカテゴリ中心をクロスモーダル信号の共通サブカテゴリ中心とすることで、同じカテゴリのクロスモーダル信号意味特徴間のコンパクト性を確保する。中心制約におけるクラスタリング学習の損失関数は、
として定義され、
ただし、Ｎが触覚、オーディオ、及び画像信号の数であり、ｓ_ｉ ^ｈ、ｓ_ｉ ^ａ、及びｓ_ｉ ^ｖがそれぞれ触覚信号、オーディオ信号、及び画像信号のカテゴリを示し、なお、オーディオデータと画像データが触覚データのクラスタ中心ベクトル行列Ｍを共有する。Ｌ_ｃｅｎを最小化することで、細粒度分類におけるカテゴリ内差異が大きい問題を効果的に解決することができる。上記過程によって、類似意味を持つ各モーダル特徴ｆ_ｉ ^ａ、ｆ_ｉ ^ｖ、及びｆ_ｉ ^ｈを互いに接近させる。 In short, the overall objective function of step (1-1) is the combination of the three loss functions above and may be written as
$$L_{clu} = L_{clu}^h + L_{CT} + L_{clu}^{av}.$$
By minimizing L_clu, the optimal θ_h, θ_a, and θ_v are obtained; once these parameters are determined, the features f^h, f^a, and f^v corresponding to the haptic, audio, and image signals can be obtained.
(1-2) To ensure compactness among features of the same fine-grained subcategory, clustering learning under the center constraint is performed on the three types of modal signals. To achieve better fine-grained classification performance, features of the same subcategory should lie close together in the common space; the aim is to minimize the intra-category variance. Specifically, clustering learning is driven by minimizing the distance from each feature to its subcategory center. Using the subcategory centers of the haptic signals as the common subcategory centers of the cross-modal signals ensures compactness among the semantic features of cross-modal signals of the same category. The loss function for clustering learning under the center constraint is defined as
$$L_{cen} = \sum_{i=1}^{N} \left( \left\| f_i^h - M s_i^h \right\|_2^2 + \left\| f_i^a - M s_i^a \right\|_2^2 + \left\| f_i^v - M s_i^v \right\|_2^2 \right),$$
where N is the number of haptic, audio, and image signals, s_i^h, s_i^a, and s_i^v indicate the categories of the haptic, audio, and image signals, respectively, and the audio data and image data share the cluster center vector matrix M of the haptic data. Minimizing L_cen effectively addresses the problem of large intra-category differences in fine-grained classification. Through this process, the modal features f_i^a, f_i^v, and f_i^h with similar semantics are brought closer to one another.

（１－３）、異なる細粒度サブカテゴリの特徴が一定のスパース性を備えるように確保するために、３種類のモーダルデータに対してソート制約におけるクラスタリング学習を行う。中心制約の目標はカテゴリ内分散を最小化することであるが、ソート制約の目標はカテゴリ間分散を最大化することであり、それにより異なるサブカテゴリの特徴出力が同じサブカテゴリの特徴出力よりも類似しないようにする。ソート制約は、
として定義され、
ただし、Ｃは触覚信号がＫ－ｍｅａｎｓアルゴリズムによりクラスタリングした後の総カテゴリ数、即ちクラスタ数である。Ｌ_ｒａｎｋを最小化することで、細粒度分類におけるカテゴリ間差異が小さい問題を効果的に解決することができる。上記過程によって、類似意味を持つ各モーダル特徴ｆ_ｉ ^ａ、ｆ_ｉ ^ｖ、及びｆ_ｉ ^ｈを更に接近させるが、異なる意味を持つ各モーダル特徴をできる限り分離させる。 (1-3) To ensure that the features of different fine-grained subcategories have a certain degree of sparsity, clustering learning under the sorting constraint is performed on the three types of modal data. The goal of the center constraint is to minimize the within-category variance, whereas the goal of the sorting constraint is to maximize the between-category variance, so that the feature outputs of different subcategories are less similar than the feature outputs of the same subcategory. The sorting constraint is defined as
where C is the total number of categories, i.e., the number of clusters, after the tactile signals are clustered using the K-means algorithm. Minimizing L _rank can effectively solve the problem of small inter-category differences in fine-grained classification. Through the above process, modal features f _i ^a , f _i ^v , and f _i ^h with similar meanings are brought closer together, while modal features with different meanings are separated as much as possible.
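The interplay of the center constraint and the sorting constraint can be illustrated for a single modality. The NumPy sketch below is only one possible reading: the center term sums squared distances to the assigned haptic centers, and the ranking term is a margin-based hinge that penalizes a feature for being almost as close to another center as to its own. The hinge form, the `margin` value, and the name `center_and_rank` are assumptions, not the formulas of the described method.

```python
import numpy as np

def center_and_rank(f, labels, M, margin=1.0):
    """Center term and an assumed margin-based ranking term.

    f      : (N, d) features of one modality
    labels : (N,) integer subcategory index of each feature
    M      : (d, C) shared haptic cluster centers, one column per cluster
    """
    rows = np.arange(len(f))
    d2 = ((f[:, None, :] - M.T[None, :, :]) ** 2).sum(axis=2)  # (N, C)
    own = d2[rows, labels]                       # distance to own center
    center = own.sum()                           # pulls same-class features in
    hinge = np.maximum(0.0, margin + own[:, None] - d2)
    hinge[rows, labels] = 0.0                    # do not compare with itself
    return float(center), float(hinge.sum())     # hinge pushes classes apart
```

A feature contributes nothing to the ranking term once every other center is farther away than its own center by at least the margin, so the two terms together favor compact and well-separated subcategories.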

ステップ２では、細粒度分類を有する触覚、オーディオ、及び画像特徴を取得した後、３種類のモーダル特徴をトリプレット集合に構築し、トリプレット制約の共有意味学習を行い、マルチモーダル融合マッピング関数を最適化し、共有意味情報を含む融合特徴を取得し、触覚信号の生成に基礎を築く。 In step 2, after obtaining tactile, audio, and image features with fine-grained classification, the three modal features are constructed into a triplet set, triplet-constrained shared semantic learning is performed, a multimodal fusion mapping function is optimized, and fusion features containing shared semantic information are obtained, laying the foundation for generating tactile signals.

ステップ２は具体的に以下のとおりである。 Step 2 specifically includes the following:

（２－１）、ある細粒度サブカテゴリから１つの触覚信号サンプルｈ_ｉをランダムに選択し、該サンプルをアンカー（Ａｎｃｈｏｒ）とし、次に、画像データセットからｈ_ｉと同じカテゴリに属し且つ意味特徴がｆ_ｉ ^ｈに最も近いサンプルｖ_ｉ ^＋をポジティブマッチサンプルとして選択し、その後、ｈ_ｉと同じカテゴリに属せず且つ意味特徴がｆ_ｉ ^ｈに最も近いサンプルｖ_ｊ ^－をネガティブマッチサンプルとして選択する。これにより、データセット内のすべてのサンプルのためにトリプレット集合｛（ｈ_ｉ，ｖ_ｉ ^＋，ｖ_ｊ ^－）｝を構成する。同様に触覚信号サンプルとオーディオ信号サンプルとで構成されるトリプレット集合｛（ｈ_ｉ，ａ_ｉ ^＋，ａ_ｊ ^－）｝を取得することができる。アンカー点ｈ_ｉの意味特徴ｆ_ｉ ^ｈとオーディオ・画像モーダルにおける対応する小分類内のポジティブマッチの意味特徴
との距離を最小化し、且つｆ_ｉ ^ｈとネガティブマッチの意味特徴
との間の意味特徴を最大化し、且つ１つの最小の間隔δがある。ここで、δが１である。これにより取得した２つのトリプレット損失関数は、
である。 (2-1) One haptic signal sample h_i is randomly selected from a fine-grained subcategory and used as the anchor. From the image dataset, the sample v_i^+ that belongs to the same category as h_i and whose semantic feature is closest to f_i^h is selected as the positive match; the sample v_j^- that does not belong to the same category as h_i and whose semantic feature is closest to f_i^h is selected as the negative match. In this way a triplet set {(h_i, v_i^+, v_j^-)} is built for all samples in the dataset. A triplet set {(h_i, a_i^+, a_j^-)} consisting of haptic and audio signal samples is obtained in the same way. The distance between the semantic feature f_i^h of the anchor h_i and the positive-match semantic features f_+^a and f_+^v in the corresponding subcategory of the audio and image modalities is minimized, while the distance between f_i^h and the negative-match semantic features f_-^a and f_-^v is maximized, subject to a minimum margin δ; here δ = 1. The two resulting triplet loss functions are
$$L_{syn}^a = \sum_i \max\!\left(0,\; \delta + \left\| f_i^h - f_+^a \right\|_2 - \left\| f_i^h - f_-^a \right\|_2 \right), \qquad L_{syn}^v = \sum_i \max\!\left(0,\; \delta + \left\| f_i^h - f_+^v \right\|_2 - \left\| f_i^h - f_-^v \right\|_2 \right).$$
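The triplet construction in step (2-1) can be sketched as follows. The code picks, for a haptic anchor, the nearest same-category feature as the positive match and the nearest other-category feature as the negative match, and then evaluates a standard hinge-style triplet loss with margin δ. The Euclidean distance, the hinge form, and the names `mine_triplet` and `triplet_loss` are assumptions made for illustration.

```python
import numpy as np

def mine_triplet(f_anchor, label, feats, labels):
    """Indices of the nearest same-category and nearest other-category features."""
    d = np.linalg.norm(feats - f_anchor, axis=1)
    pos_idx = np.where(labels == label)[0]
    neg_idx = np.where(labels != label)[0]
    return pos_idx[d[pos_idx].argmin()], neg_idx[d[neg_idx].argmin()]

def triplet_loss(f_anchor, f_pos, f_neg, delta=1.0):
    """Hinge triplet loss with margin delta (delta = 1 in the text)."""
    d_pos = np.linalg.norm(f_anchor - f_pos)
    d_neg = np.linalg.norm(f_anchor - f_neg)
    return float(max(0.0, delta + d_pos - d_neg))
```

Choosing the closest negative makes the triplet hard: the loss stays positive until that nearest impostor is at least δ farther from the anchor than the positive match.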

（２－２）、上記に基づいて、統合パラダイムを導入し、マルチモーダル特徴を高度融合する。具体的には、まず、視聴覚データがそれぞれオーディオ特徴抽出ネットワーク及び画像特徴抽出ネットワークを通過して特徴ｆ^ａ及びｆ^ｖを取得し、その後でｆ^ａとｆ^ｖを融合し、過程は、
ｆ^ｍ＝Ｆ_ｍ（ｆ^ａ,ｆ^ｖ;θ_ｍ）
（ただし、ｆ^ｍは共有意味部分空間におけるマルチモーダル融合の出力即ち融合特徴であり、Ｆ_ｍ（・）はパラメータがθ_ｍのマッピング関数であり、一般的には、Ｆ_ｍ（・）はｆ^ａ及びｆ^ｖの線形重み付けを取る）である。 (2-2) Based on the above, we introduce an integration paradigm to achieve advanced multimodal feature fusion. Specifically, audiovisual data first passes through an audio feature extraction network and an image feature extraction network to obtain features f ^a and f ^v , respectively, and then f ^a and f ^v are fused. The process is as follows:
f ^m =F _m (f ^a ,f ^v ;θ _m )
(where f ^m is the output of multimodal fusion in the shared semantic subspace, i.e., the fusion feature, and F _m (·) is a mapping function with parameter θ _m ; in general, F _m (·) takes a linear weighting of f ^a and f ^v ).

ｆ_＋ ^ａとｆ_＋ ^ｖをｆ_＋ ^ｍに融合し、ｆ_－ ^ａとｆ_― ^ｖをｆ_― ^ｍに融合する。同様にトリプレット損失を利用して融合特徴を制約し、即ち、
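である。 f_+^a and f_+^v are fused into f_+^m, and f_-^a and f_-^v are fused into f_-^m. A triplet loss is likewise used to constrain the fused features, i.e.,
$$L_{syn}^m = \sum_i \max\!\left(0,\; \delta + \left\| f_i^h - f_+^m \right\|_2 - \left\| f_i^h - f_-^m \right\|_2 \right).$$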

（２－３）、共有意味学習の目的関数は３つの損失関数を組み合わせることでモデリングしてもよく、
Ｌ_Ｓｙｎ＝Ｌ_ｓｙｎ ^ａ＋Ｌ_ｓｙｎ ^ｖ＋Ｌ_ｓｙｎ ^ｍとして示されてもよい。 (2-3) The objective function of shared semantic learning may be modeled as the combination of the three loss functions and may be written as
$$L_{Syn} = L_{syn}^a + L_{syn}^v + L_{syn}^m.$$

Ｌ_Ｓｙｎを最小化することで、最適なθ_ｍを取得し、それにより次の段階の触覚信号の生成に基礎を築く。 By minimizing L _Syn , the optimal θ _m is obtained, which lays the foundation for the next stage of haptic signal generation.

ステップ３では、融合特徴を触覚生成ネットワークに入力して所望の触覚信号ｈ′を生成する。 In step 3, the fused features are input into a haptic generation network to generate the desired haptic signal h'.

ステップ３は具体的に以下のとおりである。 Step 3 specifically includes the following:

まず、触覚生成ネットワークを構築しており、触覚デコーダＤ_ｈ（・）がステップ（１）において完全なオートエンコーダの一部として訓練されるため、ここで１つの触覚生成ネットワークＧ（・）を別に構築し、その構造がＤ_ｈ（・）と同様（３２－１２８－２５６－Ｚ）であり、且つそのネットワークパラメータθ_ｈｄをＧ（・）のパラメータθ_Ｇの初期値とする。必要な意味情報を含む融合特徴を触覚生成ネットワークＧ（・）に入力して所望の触覚信号ｈ′を取得し、且つ生成された触覚信号ｈ′をＥ_ｈ（・）により３２次元の触覚特徴ｆ^ｈ′に再マッピングし、カテゴリ中心を選択して該触覚特徴ｆ^ｈ′に対して意味制約を行う。明らかに、該触覚特徴ｆ^ｈ′と対応するカテゴリのカテゴリ中心との距離をできる限り小さくするが、他のカテゴリ中心との距離をできる限り大きくする。最終的な損失関数は、
として示されてもよく、
ただし、
であり、
が特徴ｆ^ｈとｆ^ｈ′との類似度を示し、
がｆ^ｈ′のクラスタリング損失であり、それらが一緒に損失関数の正則化項とされる。該損失関数を最適化することで、θ_Ｇの最適値を取得し、即ちＧ（・）を決定することができる。 First, a haptic generation network is constructed. Because the haptic decoder D_h(·) was trained in step 1 as part of the complete autoencoder, a separate haptic generation network G(·) is built here; its structure is the same as that of D_h(·) (32-128-256-Z), and the decoder parameters θ_hd are used as the initial values of its parameters θ_G. The fused feature containing the required semantic information is input into G(·) to obtain the desired haptic signal h′. The generated signal h′ is then remapped by E_h(·) to a 32-dimensional haptic feature f^h′, and category centers are used to impose a semantic constraint on f^h′: its distance to the center of its own category should be as small as possible, while its distance to the other category centers should be as large as possible. The final loss function L_Gen includes two terms that jointly serve as its regularization terms: one indicating the similarity between the features f^h and f^h′, and one giving the clustering loss of f^h′. Optimizing this loss function yields the optimal value of θ_G, i.e., determines G(·).

ステップ４では、上記モデルの訓練を行い、即ち、モデル訓練を２つのステップに分け、第１ステップとしては、触覚オートエンコーダ、オーディオ特徴抽出ネットワーク、及び画像特徴抽出ネットワークにおけるパラメータを推定し、第２ステップとしては、マルチモーダル融合マッピング関数のパラメータ及び触覚生成ネットワークのパラメータを推定する。該ステップによって、触覚オートエンコーダ、オーディオ特徴抽出ネットワーク、画像特徴抽出ネットワーク、マルチモーダル融合マッピング関数、及び触覚生成ネットワークを訓練する。 In step 4, the model is trained. That is, model training is divided into two steps. In the first step, parameters in the haptic autoencoder, audio feature extraction network, and image feature extraction network are estimated. In the second step, parameters of the multimodal fusion mapping function and the haptic generation network are estimated. Through these steps, the haptic autoencoder, audio feature extraction network, image feature extraction network, multimodal fusion mapping function, and haptic generation network are trained.

ステップ４は具体的には、以下のとおりである。 Specifically, Step 4 is as follows:

（４－１）、θ_ｖ、θ_ａ、θ_ｈｅ、θ_ｈｄ、及びＭの推定を完了し、且つ｛ｓ_ｉ ^ｈ｝、｛ｓ_ｉ ^ａ｝、｛ｓ_ｉ ^ｖ｝を最適化する。 (4-1) The estimation of θ_v, θ_a, θ_he, θ_hd, and M is completed, and {s_i^h}, {s_i^a}, {s_i^v} are optimized.

ステップ４１１、ネットワークパラメータθ_ｖ、θ_ａ、θ_ｈｅ、θ_ｈｄ、ノード系タグ｛ｓ_ｉ ^ｈ｝、｛ｓ_ｉ ^ａ｝、｛ｓ_ｉ ^ｖ｝、及びカテゴリ中心行列Ｍを初期化し、クラスタ数Ｃを設定し、学習率μ_１＝０．０００１、反復回数Ｔ＝６００である。 In step 411, the network parameters θ _v , θ _a , θ _he , θ _hd , the node system tags {s _i ^h }, {s _i ^a }, {s _i ^v }, and the category center matrix M are initialized, the number of clusters C is set, the learning rate μ ₁ = 0.0001, and the number of iterations T = 600.

ステップ４１２、｛ｓ_ｉ ^ｈ｝、｛ｓ_ｉ ^ａ｝、｛ｓ_ｉ ^ｖ｝、及びＭを固定し、θ_ｖ、θ_ａ、θ_ｈｅ、θ_ｈｄを最適化し、即ち、
であり、
ただし、∇が各損失関数を偏微分することである。 Step 412, {s _i ^h }, {s _i ^a }, {s _i ^v }, and M are fixed, and θ _v , θ _a , θ _he , θ _hd are optimized, i.e.,
and
where ∇ is the partial derivative of each loss function.

ステップ４１３、θ_ｖ、θ_ａ、θ_ｈｅ、θ_ｈｄ、及びＭを固定し、｛ｓ_ｉ ^ｈ｝、｛ｓ_ｉ ^ａ｝、｛ｓ_ｉ ^ｖ｝を最適化し、即ち、
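である。 Step 413: θ_v, θ_a, θ_he, θ_hd, and M are fixed and {s_i^h}, {s_i^a}, {s_i^v} are optimized, i.e., each sample is assigned to its nearest cluster center:
$$s_{j,i}^{x} = \begin{cases} 1, & j = \arg\min_{k} \left\| f_i^{x} - m_k \right\|_2, \\ 0, & \text{otherwise}, \end{cases} \qquad x \in \{h, a, v\}.$$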

ステップ４１４、θ_ｖ、θ_ａ、θ_ｈｅ、θ_ｈｄ、及び｛ｓ_ｉ ^ｈ｝、｛ｓ_ｉ ^ａ｝、｛ｓ_ｉ ^ｖ｝を固定し、Ｍを最適化し、即ち、
である。 Step 414: θ_v, θ_a, θ_he, θ_hd, and {s_i^h}, {s_i^a}, {s_i^v} are fixed and M is optimized, i.e.,
$$m_k \leftarrow m_k - \frac{1}{c_k^i} \left( m_k - f_i^h \right) s_{k,i}^h.$$

ステップ４１５、ｔ＜Ｔの場合、ステップ４１２にジャンプし、ｔ＝ｔ＋１の場合、次回の反復を継続し、そうでない場合、反復を終了する。 Step 415: if t < T, set t = t + 1 and return to step 412 for the next iteration; otherwise, end the iteration.

ステップ４１６、Ｔ回反復した後、最適なオーディオ特徴抽出ネットワークのパラメータθ_ａ、画像特徴抽出ネットワークのパラメータθ_ｖ、触覚オートエンコーダのパラメータθ_ｈｅ、θ_ｈｄ、ノード系タグ｛ｓ_ｉ ^ｈ｝、｛ｓ_ｉ ^ａ｝、｛ｓ_ｉ ^ｖ｝、及び触覚データのクラスタ中心ベクトル行列Ｍを取得する。 In step 416, after T iterations, the optimal audio feature extraction network parameters θ _a , image feature extraction network parameters θ _v , haptic autoencoder parameters θ _he and θ _hd , node system tags {s _i ^h }, {s _i ^a }, {s _i ^v }, and cluster center vector matrix M of the haptic data are obtained.

特に、ｍ_ｋを更新する際に
を簡単に使用せず、ここでｃ_ｋ ^ｉは１番目のサンプルから現在のサンプルまでクラスタｋに割り当てられたインデックスセットであるが、既に出現した履歴データは全体のクラスタ構造状況を表すには不十分であり、且つｓ_ｉ ^ｈが正しくない恐れがある。このため、本アルゴリズムは、各クラスタに含まれるデータサンプルの数がほぼバランスするという前提を仮定し、これに基づいて、上記勾配更新ステップを設計してｍ_ｋを更新し、１／ｃ_ｋ ^ｉを用いて学習速度を制御し、ｃ_ｋ ^ｉはアルゴリズムによってｉ番目のサンプルを処理する前にサンプルをクラスタｋに割り当てる回数である。このようにして、Ｍの更新はＳＧＤステップとして見なされてもよい。 In particular, m_k is not updated simply by
$$m_k = \frac{1}{\left| c_k^i \right|} \sum_{j \in c_k^i} f_j^h,$$
where c_k^i here denotes the set of indices of the samples assigned to cluster k from the first sample up to the current one. The historical data seen so far is insufficient to represent the overall cluster structure, and s_i^h may be incorrect. The algorithm therefore assumes that the clusters contain roughly balanced numbers of data samples and, on that basis, uses the gradient update step above to update m_k, with 1/c_k^i controlling the learning rate, where c_k^i is the number of times a sample has been assigned to cluster k before the i-th sample is processed. In this way, the update of M can be regarded as an SGD step.
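Steps 413 and 414 alternate between assigning samples to centers and nudging the centers. A minimal NumPy sketch of one pass over the haptic features is shown below, assuming nearest-center assignment and a per-cluster step size of 1/c_k. Incrementing the counter before using it (so that it is never zero) and the name `update_assignments_and_centers` are assumptions made for this illustration.

```python
import numpy as np

def update_assignments_and_centers(feats, M, counts):
    """One pass of assignment and center updates over haptic features.

    feats  : (N, d) features f_i^h
    M      : (d, C) float cluster centers, updated in place
    counts : (C,) integer assignment counters, updated in place
    """
    labels = np.empty(len(feats), dtype=int)
    for i, f in enumerate(feats):
        # Step 413: assign the sample to its nearest center.
        k = int(np.argmin(((M - f[:, None]) ** 2).sum(axis=0)))
        labels[i] = k
        # Step 414: move that center toward the sample with step 1 / c_k.
        counts[k] += 1
        M[:, k] -= (M[:, k] - f) / counts[k]
    return labels
```

Because the step size shrinks as a cluster accumulates assignments, early and possibly wrong assignments move a center a lot while later ones only refine it, which matches the role of 1/c_k as a learning rate.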

（４－２）、ＳＧＤに基づいて、θ_ｍ及びθ_Ｇの推定を完了する。 (4-2) Based on SGD, the estimation of θ _m and θ _G is completed.

ステップ４２１、θ_ｍを初期化し、バッチサイズｂａｔｃｈ＝６４、学習率μ_２、μ_３＝０．０００１、反復回数ｎ_１＝６００である。 Step 421: θ_m is initialized; the batch size is batch = 64, the learning rates are μ_2, μ_3 = 0.0001, and the number of iterations is n_1 = 600.

ステップ４２２、Ｌ_Ｓｙｎに基づいて確率的勾配降下法を利用してθ_ｍを微調整し、即ち、
である。 Step 422: θ_m is fine-tuned by stochastic gradient descent based on L_Syn, i.e.,
$$\theta_m \leftarrow \theta_m - \mu_2 \nabla_{\theta_m} L_{Syn}.$$

ステップ４２３、Ｌ_Ｇｅｎに基づいて確率的勾配降下法を利用してθ_Ｇを更新し、即ち、
である。 Step 423: θ_G is updated by stochastic gradient descent based on L_Gen, i.e.,
$$\theta_G \leftarrow \theta_G - \mu_3 \nabla_{\theta_G} L_{Gen}.$$

ステップ４２４、ｎ＜ｎ_１の場合、ステップ４２２にジャンプし、ｎ＝ｎ＋１の場合、次回の反復を継続し、そうでない場合、反復を終了する。 Step 424: if n < n_1, set n = n + 1 and return to step 422 for the next iteration; otherwise, end the iteration.

ステップ４２５、６００回反復した後、最適なθ_ｍ及びθ_Ｇを取得する。 Step 425: After 600 iterations, the optimal θ_m and θ_G are obtained.

ステップ５では、受信したばかりの画像信号及びオーディオ信号をそれぞれ訓練済みの画像特徴抽出ネットワーク及びオーディオ特徴ネットワークに入力し、それぞれ画像特徴及びオーディオ特徴を取得し、次に、上記特徴をマルチモーダル融合マッピング関数に入力し、融合特徴を取得し、最後に、融合特徴を訓練済みの触覚生成ネットワークに入力し、再構築された触覚信号を取得する。 In step 5, the just-received image signal and audio signal are input into the trained image feature extraction network and audio feature network, respectively, to obtain image features and audio features, respectively; then, the above features are input into the multimodal fusion mapping function to obtain fusion features; and finally, the fusion features are input into the trained haptic generation network to obtain a reconstructed haptic signal.

ステップ５は具体的に以下のとおりである。 Step 5 specifically includes the following:

受信したばかりの画像信号ｖ及びオーディオ信号ａをそれぞれ訓練済みの画像特徴抽出ネットワーク及びオーディオ特徴抽出ネットワークに入力し、画像特徴
を取得し、且つ訓練済みのマルチモーダル融合マッピング関数を入力し、融合特徴
を取得し、
を訓練済みのＧ（・）に入力し、最終的に再構築された触覚信号
を取得する。 The newly received image signal v and audio signal a are input to the trained image feature extraction network and audio feature extraction network, respectively, to obtain the image feature f^v = φ^v(v; θ_v) and the audio feature f^a = φ^a(a; θ_a). These are passed to the trained multimodal fusion mapping function to obtain the fused feature f^m = F_m(f^a, f^v; θ_m), and f^m is input to the trained G(·), which yields the final reconstructed haptic signal h′ = G(f^m; θ_G).
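The inference path of step 5 is a simple composition of the trained components. The sketch below shows that data flow with placeholder callables. The linear weighting in `fuse` follows the remark that F_m generally takes a linear weighting of f^a and f^v; the single scalar weight `w`, the function names, and the use of plain callables for the networks are assumptions for illustration only.

```python
import numpy as np

def fuse(f_a, f_v, w=0.5):
    """Linear weighting of audio and image features; w stands in for theta_m."""
    return w * f_a + (1.0 - w) * f_v

def reconstruct_haptic(v, a, image_net, audio_net, generator, w=0.5):
    """Image and audio signals in, reconstructed haptic signal out."""
    f_v = image_net(v)          # trained image feature extraction network
    f_a = audio_net(a)          # trained audio feature extraction network
    f_m = fuse(f_a, f_v, w)     # multimodal fusion mapping function
    return generator(f_m)       # trained haptic generation network G
```

Any objects that map an input array to a 32-dimensional feature, and a 32-dimensional feature to a haptic signal, can be passed in as `image_net`, `audio_net`, and `generator`; no haptic input is needed at this stage.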

以下の実験結果から分かるように、従来の方法に比べて、本発明はマルチモーダル意味の相補融合により触覚信号の合成を実現し、より高い生成効果を得る。 As can be seen from the experimental results below, compared to conventional methods, our invention achieves tactile signal synthesis through the complementary fusion of multimodal meanings, achieving a higher generation effect.

本実施例はＬＭＴクロスモーダルデータセットを用いて実験を行い、該データセットは文献「Multimodal feature-based surface material classification」において提案されており、９つの意味カテゴリのサンプル、即ちグリッド、石、金属、木材、ゴム、繊維、泡沫、箔及び紙、繊維製品及び織物を含む。本実施例は５つの大分類（各大分類に３つの小分類が含まれる）を選択して実験を行う。ＬＭＴデータセットを再構築し、まず、各材料の実例における訓練セット及びテストセットを参照し、それぞれ各実例における２０個の画像サンプル、２０個のオーディオ信号サンプル、及び２０個の触覚信号サンプルを取得する。次に、データを拡張することでニューラルネットワークを訓練し、具体的に、各画像を水平及び垂直に反転し、任意の角度でそれらを回転させ、且つ従来の方法に加えて、ランダム拡大縮小、カット、及びオフセットなどの技術を使用する。これをもって、各カテゴリのデータを１００まで拡張し、従って、合計して１５００個の画像があり、寸法が２２４×２２４である。データセットにおいて、８０％が訓練に用いられるものとして選択されるが、残りの２０％がテスト及び性能評価に用いられる。本実験は細粒度カテゴリが未知のものであるように初期設定される。 This example uses the LMT cross-modal dataset proposed in the paper "Multimodal feature-based surface material classification," which includes samples from nine semantic categories: grid, stone, metal, wood, rubber, fiber, foam, foil and paper, and textiles and fabrics. Five major categories, each containing three subcategories, are selected for the experiment. The LMT dataset is restructured as follows. First, with reference to the training and test sets of each material instance, 20 image samples, 20 audio signal samples, and 20 haptic signal samples are obtained for each instance. Next, the data are augmented for training the neural network: each image is flipped horizontally and vertically and rotated by an arbitrary angle, and, in addition to these conventional methods, techniques such as random scaling, cutting, and shifting are used. In this way the data of each category are expanded to 100 samples, giving a total of 1500 images of size 224×224. Of the dataset, 80% is selected for training and the remaining 20% is used for testing and performance evaluation. The experiment is initialized with the fine-grained categories unknown.
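The dataset sizes quoted above follow from simple arithmetic, shown here as a short check. It assumes that "each category" refers to each of the 15 fine-grained subcategories (5 major categories with 3 subcategories each), which is the reading that reproduces the stated total of 1500 images; the variable names are illustrative.

```python
major_categories = 5
subcategories_per_major = 3
samples_per_subcategory = 100      # after augmentation

total = major_categories * subcategories_per_major * samples_per_subcategory
n_train = total * 80 // 100        # 80% for training
n_test = total - n_train           # remaining 20% for testing
print(total, n_train, n_test)
```

Under this reading, 1200 images are used for training and 300 for testing and performance evaluation.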

（１）クラスタリング結果
本発明に係るクラスタリング方法の有効性を検証するために、該クラスタリング方法を複数のベースライン方法と比較し、これらの方法は以下を含む。 (1) Clustering Results To verify the effectiveness of the clustering method of the present invention, the clustering method was compared with several baseline methods, including:

Ｋ－ｍｅａｎｓ（ＫＭ）
Ｋ－Ｍｅａｎｓアルゴリズムによってそれぞれ画像、オーディオ、及び触覚モーダルのサンプルをクラスタリングする。 K-means (KM)
The K-Means algorithm clusters the samples of image, audio, and haptic modal, respectively.

オートエンコーダ＋Ｋ－ｍｅａｎｓ（ＡＥ＋ＫＭ、ＡｕｔｏｅｎｃｏｄｅｒｆｏｌｌｏｗｅｄｂｙＫ－ｍｅａｎｓ）
これは２段階の方法である。まず、異なるモーダルの信号サンプルを再構築学習することで各モーダルの特徴表現を取得し、更にＫ－ｍｅａｎｓを用いてクラスタリングする。 Autoencoder + K-means (AE+KM, Autoencoder followed by K-means)
This is a two-step method: first, we obtain feature representations for each modality by reconstructing and learning signal samples of different modalities, and then cluster them using K-means.

トリプル深層クラスタリングモデル（３－ＤＣＮ、３－ＤｅｅｐＣｌｕｓｔｅｒｉｎｇＮｅｔｗｏｒｋ）
異なるモーダルの信号に対してそれぞれＤＣＮモデルを利用してクラスタリングする。 Triple deep clustering model (3-DCN, 3-Deep Clustering Network)
The DCN model is used to cluster signals of different modalities.

本発明は、本実施例の方法を用いる。 The present invention uses the method of this example.

結果を示す際に複数のサブカテゴリにおける平均値を選択し、主に３つの指標、即ち規格化相互情報（ＮＭＩ）、調整ランド指数（ＡＲＩ）、及びクラスタリング精度（ＡＣＣ）を採用する。実験結果は表１に示される。 When presenting the results, we select average values across multiple subcategories and mainly use three indices: normalized mutual information (NMI), adjusted Rand index (ARI), and clustering accuracy (ACC). The experimental results are shown in Table 1.
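Two of the three clustering indices can be computed with the standard library alone. The sketch below gives a brute-force clustering accuracy (the best accuracy over all one-to-one relabelings of the predicted clusters, practical only for a small number of clusters such as K = 3) and a normalized mutual information that normalizes by the geometric mean of the two entropies. The normalization choice and the function names are assumptions; library implementations may use a different default normalization.

```python
from collections import Counter
from itertools import permutations
from math import log, sqrt

def clustering_accuracy(true, pred):
    """Best accuracy over one-to-one mappings of predicted to true labels.

    Assumes there are no more predicted clusters than true categories.
    """
    t_ids, p_ids = sorted(set(true)), sorted(set(pred))
    best = 0
    for perm in permutations(t_ids, len(p_ids)):
        mapping = dict(zip(p_ids, perm))
        best = max(best, sum(mapping[p] == t for t, p in zip(true, pred)))
    return best / len(true)

def nmi(true, pred):
    """Mutual information divided by the geometric mean of the entropies."""
    n = len(true)
    ct, cp = Counter(true), Counter(pred)
    joint = Counter(zip(true, pred))
    mi = sum(c / n * log(c * n / (ct[t] * cp[p])) for (t, p), c in joint.items())
    h_t = -sum(c / n * log(c / n) for c in ct.values())
    h_p = -sum(c / n * log(c / n) for c in cp.values())
    return mi / sqrt(h_t * h_p) if h_t > 0 and h_p > 0 else 0.0
```

Both scores are invariant to how the clusters are numbered, which is why they suit a setting where the fine-grained categories are initially unknown.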

表１に、本発明、３－ＤＣＮ、ＡＥ＋ＫＭ、及びＫＭをＬＭＴデータセットに適用した結果を示す。以上から分かるように、本発明の方法はこのデータセットにおいて極めて高い競争力を示しており、結果が従来のクラスタリングアルゴリズム及び一般的な深層クラスタリングアルゴリズムよりも明らかに良い。分析したところ、他のアルゴリズムの応用シーンがいずれも単一モーダルであるため、クラスタリング結果のアンバランスをもたらしやすいためである恐れがある。理論的に、共に存在するクロスモーダルデータのあるカテゴリにおけるサンプル数が等しいべきである。それ以外に、本クラスタリング方法により学習したサブカテゴリの特徴は明らかに一層コンパクトであり、且つ異なるカテゴリ間の区別性も一層高く、これは後の触覚信号の再構築に寄与する。 Table 1 shows the results of applying the present invention, 3-DCN, AE+KM, and KM to the LMT dataset. As can be seen, the method of the present invention is highly competitive on this dataset, with results significantly better than those of traditional clustering algorithms and common deep clustering algorithms. Analysis suggests that this may be due to the fact that the application scenarios of other algorithms are all unimodal, which can easily lead to imbalances in clustering results. Theoretically, the number of samples in a given category of coexisting cross-modal data should be equal. Furthermore, the subcategory features learned by this clustering method are significantly more compact and more distinctive between different categories, which contributes to the subsequent reconstruction of tactile signals.

（２）触覚の再構築結果
細粒度カテゴリを決定した上で、提案された細粒度触覚の再構築方法を以下のいくつかの方法と比較する。 (2) Tactile Reconstruction Results After determining the fine-grained category, we compare the proposed fine-grained tactile reconstruction method with the following several methods.

既存の方法１
文献「Ｌｅａｒｎｉｎｇｃｒｏｓｓ－ｍｏｄａｌｖｉｓｕａｌ－ｔａｃｔｉｌｅｒｅｐｒｅｓｅｎｔａｔｉｏｎｕｓｉｎｇｅｎｓｅｍｂｌｅｄｇｅｎｅｒａｔｉｖｅａｄｖｅｒｓａｒｉａｌｎｅｔｗｏｒｋｓ」（作者Ｘ．Ｌｉ，Ｈ．Ｌｉｕ，Ｊ．Ｚｈｏｕ，ａｎｄＦ．Ｓｕｎ）におけるアンサンブル敵対的生成ネットワーク（Ｅ－ＧＡＮｓ、ＥｎｓｅｍｂｌｅｄＧＡＮｓ）は画像特徴を利用して必要なカテゴリ情報を取得し、次に、該カテゴリ情報を騒音とともに敵対的生成ネットワークの入力として対応するカテゴリの触覚スペクトルマップを生成し、最後に触覚信号に変換する。 Existing Method 1
The ensembled generative adversarial networks (E-GANs, Ensembled GANs) in the paper "Learning cross-modal visual-tactile representation using ensembled generative adversarial networks" (authors X. Li, H. Liu, J. Zhou, and F. Sun) use image features to obtain the necessary category information. The category information, together with noise, is then input to the generative adversarial network to generate a tactile spectral map of the corresponding category, which is finally converted into a tactile signal.

既存の方法２
文献「ＤｅｅｐＶｉｓｕｏ－ＴａｃｔｉｌｅＬｅａｒｎｉｎｇ：ＥｓｔｉｍａｔｉｏｎｏｆＴａｃｔｉｌｅＰｒｏｐｅｒｔｉｅｓｆｒｏｍＩｍａｇｅｓ」（作者：ＫｕｎｉｙｕｋｉＴａｋａｈａｓｈｉａｎｄＪｅｔｈｒｏＴａｎ）における深層視覚－触覚学習方法（ＤＶＴＬ、Ｄｅｅｐｖｉｓｉｏ－ｔａｃｔｉｌｅｌｅａｒｎｉｎｇ）は従来の潜在変数を有するエンコーダ－デコーダネットワークを拡張し、視覚及び触覚属性を潜在空間に埋め込む。 Existing Method 2
The deep visuo-tactile learning method (DVTL, Deep Visuo-Tactile Learning) in the paper "Deep Visuo-Tactile Learning: Estimation of Tactile Properties from Images" (authors: Kuniyuki Takahashi and Jethro Tan) extends the traditional encoder-decoder network with latent variables to embed visual and tactile attributes into the latent space.

既存の方法３ Existing Method 3
文献「ＴｅａｃｈｉｎｇＣａｍｅｒａｓｔｏＦｅｅｌ：ＥｓｔｉｍａｔｉｎｇＴａｃｔｉｌｅＰｈｙｓｉｃａｌＰｒｏｐｅｒｔｉｅｓｏｆＳｕｒｆａｃｅｓＦｒｏｍＩｍａｇｅｓ」（作者：ＭａｔｔｈｅｗＰｕｒｒｉａｎｄＫｒｉｓｔｉｎＤａｎａ）には結合符号化分類生成ネットワーク（ＪＥＣ－ＧＡＮ、Ｊｏｉｎｔ－ｅｎｃｏｄｉｎｇ－ｃｌａｓｓｉｆｉｃａｔｉｏｎＧＡＮ）が提案されており、該結合符号化分類生成ネットワークは異なる符号化ネットワークにより各モーダルの実例を１つの共有する内部空間に符号化し、対になった制約によって埋め込んだ視覚サンプル及び触覚サンプルを潜在空間で接近させる。最後に、視覚情報を入力として、生成ネットワークによって対応する触覚信号を再構築する。 The paper "Teaching Cameras to Feel: Estimating Tactile Physical Properties of Surfaces from Images" (authors: Matthew Purri and Kristin Dana) proposes a joint-encoding-classification GAN (JEC-GAN), which uses separate encoding networks to encode the instances of each modality into one shared latent space and uses pairwise constraints to pull the embedded visual and tactile samples close together in that space. Finally, with visual information as input, a generative network reconstructs the corresponding tactile signal.

本実験は定量及び定性の２つの観点から分析する。まず、表２は二乗平均平方根誤差（ＲＭＳＥ）、構造類似度（ＳＩＭ）、及び分類精度（ＡＣＣ）の複数の観点から各方法における触覚信号の再構築性能を示す。表２は本発明の実験結果を示す。 This experiment is analyzed from two perspectives: quantitative and qualitative. First, Table 2 shows the tactile signal reconstruction performance of each method from multiple perspectives: root mean square error (RMSE), structural similarity (SIM), and classification accuracy (ACC). Table 2 shows the experimental results of the present invention.

表２及び図４から分かるように、上記の最も先進的な方法に比べて、本発明に係る方法は明らかな優位性を有する。その理由は、以下のとおりである。本発明に係る触覚信号の再構築方法は、クロスモーダルクラスタリングアルゴリズムによって細粒度サブカテゴリを明確にし、且つ中心制約及びソート制約を利用して意味特徴のコンパクト性及び区別性を効果的に向上させ、（２）、視覚・聴覚・触覚サンプルの対応関係は人間が指定した場合、主観的な意識が比較的強いため、精度が十分ではなく、それとは逆に、本発明に係る視聴覚補助用の細粒度触覚信号の再構築方法は、訓練時にモデルが触覚意味特徴に最も近い視聴覚意味特徴をその生成ネットワークの入力として自己選択する。 As can be seen from Table 2 and Figure 4, the method of the present invention has clear advantages over the most advanced methods mentioned above. The reasons are as follows: (1) The tactile signal reconstruction method of the present invention uses a cross-modal clustering algorithm to clarify fine-grained subcategories and utilizes centrality and sorting constraints to effectively improve the compactness and distinctiveness of semantic features. (2) When the correspondence between visual, auditory, and tactile samples is specified by humans, it is relatively subjective and therefore not accurate enough. In contrast, in the fine-grained tactile signal reconstruction method for audiovisual assistance of the present invention, during training, the model self-selects the audiovisual semantic features that are closest to the tactile semantic features as the input of its generative network.

他の実施例では、本発明のステップ１における触覚エンコーダはフィードフォワードニューラルネットワークを用いて、１次元畳み込み（１Ｄ－ＣＮＮ、Ｏｎｅ－ｄｉｍｅｎｓｉｏｎａｌｃｏｎｖｏｌｕｔｉｏｎａｌｎｅｕｒａｌｎｅｔｗｏｒｋｓ）で代替してもよい。 In another embodiment, the haptic encoder in step 1 of the present invention, which uses a feedforward neural network, may instead be replaced by a one-dimensional convolutional neural network (1D-CNN).

以上の説明は、単に本発明の具体的な実施形態であるが、本発明の保護範囲はこれに限らず、当業者が本発明に開示される技術的範囲内で容易に想到し得る変化又は置換は、いずれも本発明の保護範囲に含まれるべきである。 The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited to this. Any modifications or substitutions that a person skilled in the art can easily make within the technical scope disclosed in the present invention should be included in the scope of protection of the present invention.

Claims

1. A method for reconstructing fine-grained haptic signals for audiovisual aids, comprising:
inputting a haptic signal into a haptic autoencoder and extracting features from the haptic signal through a clustering task;
transferring the feature extraction capability of the haptic autoencoder to an audio feature extraction network and an image feature extraction network by a cross-modal transfer learning method and optimizing those networks, i.e., optimizing the parameters of the audio feature extraction network and the image feature extraction network by minimizing a cross-modal transfer loss function, which is derived from the maximum mean discrepancy (MMD) criterion between the haptic domain and the audiovisual domain, together with a classification loss function;
performing clustering-constrained feature learning, i.e., clustering learning under a center constraint and a sorting constraint, on the three modal signals (haptic, audio, and image), thereby bringing together haptic, audio, and image modal features that share the same meaning while separating modal features that do not, and obtaining haptic, audio, and image modal features with fine-grained classification;
constructing a triplet set based on the haptic, audio, and image modal features and performing triplet-constrained shared semantic learning, i.e., optimizing the parameters of a multimodal fusion mapping function by modeling and minimizing the objective function of shared semantic learning, and inputting the audio features and image features into the optimized multimodal fusion mapping function to obtain fusion features containing shared semantic information; and
presetting a haptic generation network and inputting the fusion features into the haptic generation network to reconstruct a haptic signal.

2. The method for reconstructing fine-grained haptic signals for audiovisual aids according to claim 1, wherein the step of presetting a haptic generation network and inputting the fusion features into the haptic generation network to reconstruct a haptic signal comprises:
presetting parameters of the haptic autoencoder, the audio feature extraction network, and the image feature extraction network;
presetting parameters of the multimodal fusion mapping function and parameters of the haptic generation network;
training the haptic autoencoder, the audio feature extraction network, the image feature extraction network, the multimodal fusion mapping function, and the haptic generation network; and
inputting a newly received image signal and audio signal into the trained image feature extraction network and the trained audio feature extraction network, respectively, to obtain image features and audio features; inputting these features into the multimodal fusion mapping function to obtain fusion features; and finally inputting the fusion features into the trained haptic generation network to obtain a reconstructed haptic signal.
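As a non-limiting illustration of the inference path recited in this claim, the following Python sketch chains the four trained components in the stated order. The function name `reconstruct_haptic` and the toy lambdas standing in for the trained networks are hypothetical and serve only to make the data flow concrete.

```python
def reconstruct_haptic(image, audio, image_net, audio_net, fuse, generator):
    """Run the inference path: feature extraction -> fusion -> haptic generation."""
    image_features = image_net(image)              # trained image feature extraction network
    audio_features = audio_net(audio)              # trained audio feature extraction network
    fused = fuse(audio_features, image_features)   # multimodal fusion mapping function
    return generator(fused)                        # trained haptic generation network


# Toy stand-ins (hypothetical) so that the sketch runs end to end.
demo = reconstruct_haptic(
    image=[1.0, 2.0],
    audio=[3.0, 4.0],
    image_net=lambda v: [2.0 * x for x in v],
    audio_net=lambda a: [x + 1.0 for x in a],
    fuse=lambda fa, fv: [0.5 * x + 0.5 * y for x, y in zip(fa, fv)],
    generator=lambda fm: list(fm),
)
```

In a real system each callable would be a trained model; only the order of the calls is taken from the claim.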

3. The method for reconstructing fine-grained haptic signals for audiovisual aids according to claim 2, wherein inputting the haptic signal into the haptic autoencoder and extracting features from the haptic signal through a clustering task comprises:
inputting the haptic signal into the haptic autoencoder for learning and extracting the corresponding haptic features, and then clustering the haptic signals with the K-means algorithm according to the haptic features, i.e.,
the input haptic signals are h = {h_i}_{i=1,...,N}, where i is the index of an input haptic signal and N is the total number of input haptic signals; after the encoding module E_h(·) of the autoencoder, f_i^h = E_h(h_i; θ_he) is the feature representation of the haptic signal h_i, with f^h = {f_i^h}_{i=1,...,N}, where θ_he is the parameter of the encoding module; f_i^h is input to the decoding module D_h(·), which outputs the reconstructed haptic signal D_h(f_i^h; θ_hd), where θ_hd is the parameter of the decoding module; K-means clustering is performed on the features f_i^h to output the corresponding category tags s_i^h; and the parameters in the above process are jointly estimated with a loss function L_clu^h that combines the reconstruction error of the autoencoder, measured by the least-squares loss l, with the clustering error of K-means, balanced by a regularization parameter λ ≥ 0;
where N is the number of haptic signals; M is the cluster center vector matrix obtained by applying the K-means algorithm to the haptic data, whose c-th column m_c is the centroid of the c-th cluster; θ_h = [θ_he, θ_hd] is the parameter of the encoding and decoding modules; and s_{j,i}^h is the j-th element of s_i^h, such that if s_{j,i}^h is 1 and all other elements are 0, the original haptic signal h_i corresponding to s_i^h belongs to the j-th category; and
minimizing L_clu^h to estimate θ_h and obtain f_i^h and s_i^h.
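As a rough illustration of the joint estimation in this claim, the following Python sketch evaluates one plausible form of L_clu^h for given encoder outputs, decoder outputs, cluster centers, and assignments. The exact weighting (a factor of λ/2 on the clustering term, and summation rather than averaging over N) is an assumption, and all function and argument names are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def joint_clustering_loss(signals, reconstructions, features, centers, labels, lam):
    """Assumed form: sum_i [ ||h_i - D_h(f_i)||^2 + (lam / 2) * ||f_i - m_{c_i}||^2 ].

    signals         -- original haptic signals h_i
    reconstructions -- decoder outputs D_h(f_i^h) for each h_i
    features        -- encoder outputs f_i^h
    centers         -- cluster centers m_c (the columns of M), one list per cluster
    labels          -- cluster index c_i of each sample, i.e. the position of the
                       single 1 in the one-hot tag s_i^h
    lam             -- regularization parameter, lam >= 0
    """
    if lam < 0:
        raise ValueError("lam must be non-negative")
    total = 0.0
    for h, h_rec, f, c in zip(signals, reconstructions, features, labels):
        reconstruction_error = squared_distance(h, h_rec)    # least-squares loss l
        clustering_error = squared_distance(f, centers[c])   # ||f_i - M s_i||^2
        total += reconstruction_error + 0.5 * lam * clustering_error
    return total
```

Setting `lam` to 0 reduces the sketch to a plain autoencoder reconstruction loss, which makes the role of the regularization parameter easy to check.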

Transferring the feature extraction capability of the haptic autoencoder to the audio feature extraction network and the image feature extraction network by the cross-modal transfer learning method comprises:
achieving the transfer with a feature adaptation method that minimizes the maximum mean discrepancy (MMD) criterion between the haptic domain and the audiovisual domain, i.e.,
letting the distributions of the haptic, audio, and image signal sets be P, Q, and R, respectively, denoting the MMD between the haptic and audio signals as MMD_k(P, Q) and the MMD between the haptic and image signals as MMD_k(P, R), and working in the reproducing kernel Hilbert space H_k, which contains a set of functions f defined on a non-empty set, the squared MMDs are
MMD_k^2(P, Q) = ||E_P[φ^h(h; θ_he)] − E_Q[φ^a(a; θ_a)]||^2_{H_k}
and
MMD_k^2(P, R) = ||E_P[φ^h(h; θ_he)] − E_R[φ^v(v; θ_v)]||^2_{H_k},
where the haptic, audio, and image signals are passed through their respective feature extraction networks φ to obtain the extracted feature vectors; the haptic feature vector is φ^h(h; θ_he) = f^h, i.e., the output of the encoding module of the autoencoder; the audio and image feature vectors are φ^a(a; θ_a) = f^a and φ^v(v; θ_v) = f^v; the three-modality feature set is denoted (f^h, f^a, f^v); θ_a and θ_v are the parameters of the audio and image feature extraction networks, respectively, and θ_he is the parameter of the encoding module; and for any function f ∈ H_k and any X ∼ P,
E_{X∼P}[f(X)] = <f(X), μ_k(P)>_{H_k},
where μ_k(P) is the mean embedding of P in H_k, i.e., the element of H_k that represents the distribution P, f(X) denotes the mapping of X into H_k by the function f, and <·,·>_{H_k} is the inner product; similarly,
E_{Y∼Q}[f(Y)] = <f(Y), μ_k(Q)>_{H_k},
where μ_k(Q) is the mean embedding of Q in H_k, and
E_{Z∼R}[f(Z)] = <f(Z), μ_k(R)>_{H_k},
where μ_k(R) is the mean embedding of R in H_k;
computing the values of MMD_k^2(P, Q) and MMD_k^2(P, R) and taking them as the cross-modal transfer loss function L_CT; and
optimizing L_CT to guide the information flow between the haptic feature extraction autoencoder and the audio and image feature extraction networks, thereby effectively transferring the feature extraction capability of the autoencoder for the haptic modality to the audio and image feature extraction networks, i.e., estimating θ_a and θ_v.
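To make the MMD criterion in this claim concrete, the following Python sketch computes a biased empirical estimate of the squared MMD between two sets of feature vectors and sums the haptic-audio and haptic-image terms. The Gaussian kernel, its bandwidth, the biased estimator, and the unweighted sum of the two terms are all assumptions; the claim text does not fix them, and the function names are hypothetical.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel value k(x, y); the kernel choice is an assumption."""
    squared = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate: mean k(x, x') - 2 * mean k(x, y) + mean k(y, y')."""
    return (mean_kernel(xs, xs, bandwidth)
            - 2.0 * mean_kernel(xs, ys, bandwidth)
            + mean_kernel(ys, ys, bandwidth))


def cross_modal_transfer_loss(f_h, f_a, f_v, bandwidth=1.0):
    """Assumed combination: MMD^2(haptic, audio) + MMD^2(haptic, image)."""
    return mmd_squared(f_h, f_a, bandwidth) + mmd_squared(f_h, f_v, bandwidth)
```

The estimate is zero when both feature sets are identical and grows as the two sets move apart, which is the behavior the transfer loss relies on.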

The method for reconstructing fine-grained haptic signals for audiovisual aids according to claim 4, wherein further optimizing the audio feature extraction network and the image feature extraction network comprises:
defining a classification loss function L_clu^av over the probabilities p(f_i^a; θ_a) and p(f_i^v; θ_v), where s_i^a and s_i^v are the category tags of the audio signal and the image signal, respectively; p(f_i^a; θ_a) is the probability of obtaining the audio signal category s_i^a when the input of the audio feature extraction network is f_i^a and its parameter is θ_a; p(f_i^v; θ_v) is the probability of obtaining the image signal category s_i^v when the input of the image feature extraction network is f_i^v and its parameter is θ_v; f^a = {f_i^a}_{i=1,...,N} and f^v = {f_i^v}_{i=1,...,N}; and minimizing L_clu^av further refines the optimal values of θ_a and θ_v;
combining L_clu^h, L_CT, and L_clu^av to obtain the total objective function L_clu = L_clu^h + L_CT + L_clu^av; and
minimizing L_clu to obtain the optimal θ_h, θ_a, and θ_v, and, once these parameters are determined, obtaining the features f^h, f^a, and f^v corresponding to the haptic signal, the audio signal, and the image signal.
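The following Python sketch illustrates how the classification term and the total objective of this claim could be evaluated. Using mean cross-entropy for L_clu^av is an assumption, since the claim only states that the loss is built from the class probabilities p(f_i^a; θ_a) and p(f_i^v; θ_v); only the unweighted sum L_clu = L_clu^h + L_CT + L_clu^av is taken from the claim. All function names are hypothetical.

```python
import math


def cross_entropy(probabilities, labels):
    """Mean negative log-probability that each sample assigns to its true category."""
    total = sum(math.log(p[c]) for p, c in zip(probabilities, labels))
    return -total / len(labels)


def audiovisual_classification_loss(audio_probs, audio_labels, image_probs, image_labels):
    """Assumed L_clu^av: audio cross-entropy plus image cross-entropy."""
    return (cross_entropy(audio_probs, audio_labels)
            + cross_entropy(image_probs, image_labels))


def total_clustering_objective(l_clu_h, l_ct, l_clu_av):
    """Total objective as stated in the claim: L_clu = L_clu^h + L_CT + L_clu^av."""
    return l_clu_h + l_ct + l_clu_av
```

Here each entry of `audio_probs` or `image_probs` is a list of class probabilities for one sample, and each label is the index of that sample's category tag.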

The method for reconstructing fine-grained haptic signals for audiovisual aids according to claim 1, wherein the extracted haptic, audio, and image modal features are constrained by jointly considering the clustering constraint, the center constraint, and the sorting constraint, so that modal features sharing the same meaning are brought close together while modal features that do not are separated, and haptic, audio, and image modal features with fine-grained classification are obtained, comprising:
performing clustering learning under the center constraint on the three modal signals to ensure compactness among features of the same fine-grained subcategory, wherein, to achieve better fine-grained classification performance, features of the same subcategory should be adjacent in a common space so that the intra-category variance is minimized; the clustering learning is driven by minimizing the distance from each feature to its subcategory center; the subcategory centers of the haptic signal are used as the common subcategory centers of the cross-modal signals, which ensures compactness among semantic features of the same category across modalities; in the loss function of clustering learning under the center constraint, N is the number of haptic, audio, and image signals, s_i^h, s_i^a, and s_i^v indicate the categories of the haptic, audio, and image signals, respectively, and the audio signals and image signals share the cluster center vector matrix M of the haptic signals; and this process brings modal features f_i^a, f_i^v, and f_i^h with similar meanings closer to each other; and
performing clustering learning under the sorting constraint on the three modal signals to ensure that the features of different fine-grained subcategories are sufficiently spread apart, wherein the center constraint aims to minimize the intra-category variance whereas the sorting constraint aims to maximize the inter-category variance, so that the feature outputs of different subcategories are less similar than the feature outputs of the same subcategory; in the sorting constraint, C is the total number of categories after the haptic signals are clustered with the K-means algorithm; and this process brings modal features f_i^a, f_i^v, and f_i^h with similar meanings closer together while separating modal features with different meanings as far as possible.
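By way of a hedged illustration of the two constraints in this claim, the Python sketch below scores features from several modalities against one shared set of haptic cluster centers. Averaging the squared distances for the center constraint and using a hinge with a fixed margin for the sorting constraint are assumptions; in particular, the hinge is only a stand-in for the sorting constraint, whose formula is not given in the claim text. All names are hypothetical, and at least two centers are required.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def center_constraint_loss(features_by_modality, labels_by_modality, centers):
    """Mean squared distance from every feature to its assigned shared center."""
    total, count = 0.0, 0
    for features, labels in zip(features_by_modality, labels_by_modality):
        for f, c in zip(features, labels):
            total += sq_dist(f, centers[c])
            count += 1
    return total / count


def sorting_constraint_loss(features_by_modality, labels_by_modality, centers, margin=1.0):
    """Hinge stand-in: the own center must beat the nearest other center by `margin`."""
    total, count = 0.0, 0
    for features, labels in zip(features_by_modality, labels_by_modality):
        for f, c in zip(features, labels):
            own = sq_dist(f, centers[c])
            nearest_other = min(sq_dist(f, m) for j, m in enumerate(centers) if j != c)
            total += max(0.0, margin + own - nearest_other)
            count += 1
    return total / count
```

The first function shrinks intra-category spread, while the second is zero only when every feature is clearly closer to its own subcategory center than to any other, mirroring the intra-category and inter-category goals stated above.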

Constructing the triplet set based on the haptic, audio, and image modal features, performing triplet-constrained shared semantic learning, optimizing the multimodal fusion mapping function, and obtaining fusion features containing shared semantic information comprises:
randomly selecting one haptic signal sample h_i from a fine-grained subcategory and setting the sample as the anchor;
selecting from the image dataset a sample v_i^+ that belongs to the same category as h_i and whose semantic feature is closest to f_i^h as the positive match, and a sample v_j^- that does not belong to the same category as h_i and whose semantic feature is closest to f_i^h as the negative match, thereby constructing a triplet set {(h_i, v_i^+, v_j^-)} for the samples in the dataset;
similarly obtaining a triplet set {(h_i, a_i^+, a_j^-)} consisting of haptic signal samples and audio signal samples;
obtaining two triplet loss functions L_syn^a and L_syn^v by minimizing the distance between the semantic feature f_i^h of the anchor h_i and the positive-match semantic features f_+^a and f_+^v in the corresponding subcategory of the audiovisual modalities, while maximizing the distance between f_i^h and the negative-match semantic features f_-^a and f_-^v, such that the two distances are separated by a minimum margin δ;
introducing an integration paradigm to achieve high-level fusion of the multimodal features, i.e., fusing f^a and f^v as
f^m = F_m(f^a, f^v; θ_m),
where f^m is the output of multimodal fusion in the shared semantic subspace, i.e., the fusion feature, F_m(·) is the multimodal fusion mapping function with parameter θ_m, and F_m(·) takes a linear weighting of f^a and f^v;
fusing f_+^a and f_+^v into f_+^m, and f_-^a and f_-^v into f_-^m;
constraining the fused features with a triplet loss L_syn^m in the same manner;
modeling the objective function of shared semantic learning by combining the three loss functions, i.e., L_Syn = L_syn^a + L_syn^v + L_syn^m; and
minimizing L_Syn to obtain the optimal θ_m.
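The triplet construction and fusion described in this claim can be sketched in Python as follows. The hinge form of the triplet loss with squared Euclidean distances and the single scalar fusion weight are assumptions (the claim states only a minimum margin δ and a linear weighting), and the function names are hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def build_triplet(anchor_feature, anchor_label, candidate_features, candidate_labels):
    """Return (positive_index, negative_index) for one haptic anchor.

    Positive: the same-category candidate closest to the anchor feature.
    Negative: the different-category candidate closest to the anchor feature.
    """
    same = [i for i, c in enumerate(candidate_labels) if c == anchor_label]
    other = [i for i, c in enumerate(candidate_labels) if c != anchor_label]
    positive = min(same, key=lambda i: sq_dist(anchor_feature, candidate_features[i]))
    negative = min(other, key=lambda i: sq_dist(anchor_feature, candidate_features[i]))
    return positive, negative


def triplet_loss(anchor, positive, negative, delta):
    """Assumed hinge form: max(0, d(anchor, pos) - d(anchor, neg) + delta)."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + delta)


def fuse(audio_feature, image_feature, weight):
    """Linear weighting of f^a and f^v; `weight` plays the role of theta_m here."""
    return [weight * a + (1.0 - weight) * v for a, v in zip(audio_feature, image_feature)]
```

The same `triplet_loss` can be applied to audio features, image features, and fused features in turn, and the three results summed to obtain a value playing the role of L_Syn.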

Inputting the fusion features into the haptic generation network to reconstruct the haptic signal comprises:
constructing a haptic generation network G(·) whose structure is the same as that of D_h(·), and using the network parameters θ_hd as the initial values of the parameters θ_G of G(·); and
inputting the fusion features containing the required semantic information into the haptic generation network G(·) to obtain the desired haptic signal h′, remapping the generated haptic signal h′ to a haptic feature f^h′ by E_h(·), and selecting a category center to impose a semantic constraint on the haptic feature f^h′, so that the final loss function L_Gen includes, as regularization terms used together, a term indicating the similarity between the features f^h and f^h′ and the clustering loss of f^h′; optimizing this loss function yields the optimal value of θ_G, i.e., G(·) is determined.

Presetting the parameters of the haptic autoencoder, the audio feature extraction network, and the image feature extraction network, presetting the parameters of the multimodal fusion mapping function and the parameters of the haptic generation network, and training the haptic autoencoder, the audio feature extraction network, the image feature extraction network, the multimodal fusion mapping function, and the haptic generation network comprises step 1 and step 2;
in step 1, θ_v, θ_a, θ_he, θ_hd, and M are initialized and then optimized together with the category tags {s_i^h}, {s_i^a}, and {s_i^v}, step 1 comprising:
step 11: initializing the network parameters θ_v, θ_a, θ_he, and θ_hd, the category tags {s_i^h}, {s_i^a}, and {s_i^v}, and the category center matrix M, and setting the number of clusters C, the learning rate μ_1, and the number of iterations T;
step 12: fixing {s_i^h}, {s_i^a}, {s_i^v}, and M, and optimizing θ_v, θ_a, θ_he, and θ_hd by stochastic gradient descent, where ∇ denotes the partial derivative of each loss function;
step 13: fixing θ_v, θ_a, θ_he, θ_hd, and M, and optimizing {s_i^h}, {s_i^a}, and {s_i^v};
step 14: fixing θ_v, θ_a, θ_he, θ_hd, {s_i^h}, {s_i^a}, and {s_i^v}, and optimizing M;
step 15: if t < T, setting t = t + 1 and returning to step 12 for the next iteration; otherwise ending the iteration; and
step 16: after T iterations, obtaining the optimal audio feature extraction network parameter θ_a, image feature extraction network parameter θ_v, haptic autoencoder parameters θ_he and θ_hd, category tags {s_i^h}, {s_i^a}, and {s_i^v}, and cluster center vector matrix M of the haptic data;
in step 2, θ_m and θ_G are estimated by stochastic gradient descent, step 2 comprising:
step 21: initializing θ_m, the learning rates μ_2 and μ_3, and the number of iterations n_1;
step 22: estimating θ_m by stochastic gradient descent based on L_Syn;
step 23: updating θ_G by stochastic gradient descent based on L_Gen;
step 24: if n < n_1, setting n = n + 1 and returning to step 22 for the next iteration; otherwise ending the iteration; and
step 25: after n_1 iterations, obtaining the optimal θ_m and θ_G.
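The alternating structure of step 1 can be illustrated with the following Python sketch, which covers only the tag update (step 13) and the center update (step 14) while holding the features fixed. Nearest-center assignment and mean-based center updates are assumptions borrowed from standard K-means, the gradient update of the network parameters in step 12 is deliberately omitted, and all function names are hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def update_tags(features, centers):
    """Step 13 (assumed form): tag each feature with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda c: sq_dist(f, centers[c]))
            for f in features]


def update_centers(features, tags, centers):
    """Step 14 (assumed form): move each center to the mean of its members."""
    new_centers = []
    for c, old in enumerate(centers):
        members = [f for f, t in zip(features, tags) if t == c]
        if not members:
            new_centers.append(list(old))  # keep an empty cluster's center unchanged
            continue
        new_centers.append([sum(m[d] for m in members) / len(members)
                            for d in range(len(old))])
    return new_centers


def alternate(features, centers, iterations):
    """Repeat steps 13 and 14 for a fixed number of iterations T."""
    tags = update_tags(features, centers)
    for _ in range(iterations):
        # Step 12 (gradient update of the network parameters) is omitted here,
        # so the features stay fixed between iterations.
        tags = update_tags(features, centers)
        centers = update_centers(features, tags, centers)
    return tags, centers
```

In the full procedure the features would change after every pass through step 12, so the tags and centers would track the evolving feature space rather than converge on a fixed one.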