CN117851750A - An intelligent cleaning method for low-quality monitoring big data of intelligent manufacturing production lines - Google Patents

- Publication number: CN117851750A
- Application number: CN202311575671.6A
- Authority: CN (China)
- Prior art keywords: data, noise, model, big data, sample
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F18/10 — Pattern recognition: pre-processing; data cleansing
- G06F18/213 — Feature extraction, e.g. by transforming the feature space; summarisation; mappings, e.g. subspace methods
- G06F18/214 — Generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06N3/0455 — Auto-encoder networks; encoder-decoder networks
- G06N3/084 — Backpropagation, e.g. using gradient descent
- G06Q50/04 — Manufacturing (ICT specially adapted for implementation of business processes of specific business sectors)
Abstract
Description
Technical Field

The present invention belongs to the technical field of data cleaning, and in particular relates to an intelligent cleaning method for low-quality monitoring big data of intelligent manufacturing production lines.
Background Art

With the rise of intelligent manufacturing, enterprises have begun to deliberately collect the data generated by intelligent production line equipment during machining, in order to perform predictive maintenance on that equipment. However, the machining processes of industrial intelligent equipment are complex, and a production line usually runs multiple machines working in concert, so the data recording files come in many types with intricate relationships between them, making it difficult to extract the required information directly from the massive monitoring data. In addition, production line equipment usually moves at high speed during machining and operates in a harsh service environment with many random disturbances: external interference such as foreign-object impacts, together with performance degradation of the internal data acquisition devices, sensor failures and data recording anomalies, contaminates the monitoring data of key components with noise, outliers, missing values and other corrupted data. This lowers data quality and degrades the performance of subsequent intelligent diagnosis and prediction models. It is therefore particularly important to study cleaning methods for the low-quality monitoring big data of intelligent manufacturing production lines.
For low-quality monitoring data, traditional processing approaches are mainly manual elimination and signal processing techniques. In manual elimination, equipment maintenance personnel identify and remove abnormal data based on their own expertise and experience. This approach only suits cases where the data volume is small and the quality damage is severe, places high demands on the maintainers' expertise, and inevitably introduces human factors into the cleaning process that affect the accuracy of subsequent analysis. Signal processing techniques are widely used for data cleaning; common methods include the wavelet transform and empirical mode decomposition. The theory and implementation of the wavelet transform are relatively complex, it demands substantial expert knowledge, its computational complexity is high, and its cleaning effect depends on the chosen wavelet basis function. For methods based on empirical mode decomposition, parameter selection is difficult and different parameter choices may yield different decompositions; moreover, the number of resulting mode functions is not fixed, which makes it hard to decide how many mode functions the cleaned data should retain.
In recent years, deep neural networks have shown superior performance in feature extraction and data reconstruction, offering a new direction for cleaning low-quality data. Compared with traditional cleaning methods, data-driven intelligent cleaning methods (Qiang Zhang, Yaming Zheng, Qiangqiang Yuan, et al., "Hyperspectral Image Denoising: From Model-Driven, Data-Driven, to Model-Data-Driven", IEEE Transactions on Neural Networks and Learning Systems, 2023) are highly adaptable, require no manual feature extraction, and are efficient. They nevertheless have limitations. On the one hand, they often model only a single type of noise, which makes them hard to apply to intelligent manufacturing production line data containing complex noise. On the other hand, they usually rely on convolutional neural networks or long short-term memory networks for feature extraction and data reconstruction: convolutional networks and their variants struggle to capture long-range features, while long short-term memory networks and their variants compute serially, are computationally inefficient, and perform poorly on large-scale data; in addition, long short-term memory networks can suffer from vanishing or exploding gradients, making model training difficult.
Summary of the Invention

To overcome the above defects of the prior art, the purpose of the present invention is to propose an intelligent cleaning method for low-quality monitoring big data of intelligent manufacturing production lines, which uses a Transformer network to extract robust high-level representations from data corrupted by complex noise and reconstruct them back into the original multi-source monitoring data, thereby reducing the impact of noise, outliers and missing values in the data on the accuracy of subsequent fault diagnosis and life prediction.
To achieve the above object, the present invention adopts the following technical solution:
An intelligent cleaning method for low-quality monitoring big data of intelligent manufacturing production lines, which simulates noisy data by additionally injecting multiple types of noise into the original multi-source monitoring data and masking a certain proportion of the subsequences of each sample, uses a Transformer network to map the simulated noisy data to latent representations, and reconstructs the latent representations into the original multi-source monitoring data, thereby reducing the impact of the complex noise in the original multi-source monitoring data on subsequent analysis.
An intelligent cleaning method for low-quality monitoring big data of intelligent manufacturing production lines, comprising the following steps:
Step 1: Obtain the original multi-source monitoring data set X = {x_i}_{i=1}^N from the data recording files of the production line during equipment machining, where N denotes the number of samples and x_i ∈ R^{L×C} is the i-th sample, with sequence length L and channel number C; then divide the data set X into a training set X1 = {x_i}_{i=1}^{N1} and a test set X2 = {x_i}_{i=N1+1}^{N}, where N1 denotes the number of training samples;
Step 2: Add Gaussian noise x_G, Poisson noise x_P and impulse noise x_S to each sample x_i of the training set X1 to obtain the simulated noisy data x_n; the Gaussian component is generated at a prescribed signal-to-noise ratio as

P_n = P_s / 10^(SNR/10), x_G = sqrt(P_n / Var(x_rand)) · x_rand, x_n = x_i + x_G + x_P + x_S (1)

where P_s denotes the power of the original vacuum-value signal, P_n the power of the Gaussian noise signal, SNR the signal-to-noise ratio, Var(·) the variance, and x_rand ∈ N(0,1) a random sample sequence following the standard normal distribution with the same length as x_i;
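The noise simulation of step 2 can be sketched in NumPy as follows. This is an illustrative sketch only: the scaling of the Poisson term, the impulse amplitude and the `impulse_prob` parameter are assumptions, since the patent only specifies the Gaussian component via the signal-to-noise ratio.

```python
import numpy as np

def add_mixed_noise(x, snr_db=0.0, impulse_prob=0.01, rng=None):
    """Superimpose Gaussian, Poisson-style and impulse noise on a clean sample."""
    rng = np.random.default_rng(rng)
    p_s = np.mean(x ** 2)                          # signal power P_s
    p_n = p_s / (10 ** (snr_db / 10))              # Gaussian noise power from the target SNR
    x_rand = rng.standard_normal(x.shape)          # x_rand drawn from N(0, 1)
    x_g = np.sqrt(p_n / np.var(x_rand)) * x_rand   # Gaussian component x_G
    x_p = rng.poisson(lam=1.0, size=x.shape) - 1.0   # zero-mean Poisson-style term (assumed form)
    spikes = rng.random(x.shape) < impulse_prob
    x_s = np.where(spikes, np.max(np.abs(x)), 0.0)   # sparse impulse spikes x_S (assumed form)
    return x + x_g + 0.01 * x_p + x_s              # x_n = x_i + x_G + x_P + x_S
```

A call such as `add_mixed_noise(x, snr_db=0.0)` corresponds to the 0 dB setting used later in the embodiment.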
Step 3: Input the simulated noise data set X_n into a single convolutional layer to segment each sample into multiple regular, non-overlapping subsequences and map each subsequence to a high-dimensional vector; the segmented and embedded noise signal is Embed(X_n) ∈ R^{N1×Np×D}, where Np denotes the number of patches, D the dimension of the embedded vectors, and Embed(·) the segmentation-and-embedding operation;
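The segmentation-and-embedding of step 3 is equivalent to a one-layer convolution whose kernel size and stride both equal the patch length. A minimal NumPy sketch, with a random placeholder weight matrix standing in for the trained convolution weights:

```python
import numpy as np

def patch_embed(x, patch_len, dim, rng=None):
    """Split an (L, C) sample into non-overlapping patches of length patch_len
    and linearly map each to a dim-dimensional vector (Conv1d with
    kernel = stride = patch_len)."""
    rng = np.random.default_rng(rng)
    L, C = x.shape
    n_patches = L // patch_len                              # Np
    patches = x[:n_patches * patch_len].reshape(n_patches, patch_len * C)
    W = rng.standard_normal((patch_len * C, dim)) / np.sqrt(patch_len * C)
    return patches @ W                                      # shape (Np, D)
```

For a 300-point single-channel sample with patches of 5 points and D = 64, the output has shape (60, 64).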
Step 4: Randomly shuffle the embedded subsequences of each sample to obtain the data set Shuffle(Embed(X_n)), then randomly mask (delete) a fraction r of the subsequences, and position-encode the masked data set with sine and cosine functions:

PE(pos, 2i) = sin(pos / 10000^(2i/D)), PE(pos, 2i+1) = cos(pos / 10000^(2i/D)) (2)

where pos denotes the position, i the index of the encoding dimension, Shuffle(·) the random shuffling operation, and Mask(·) the masking operation; X′_n is the noise data set after noise injection, masking and positional encoding;
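The shuffle-mask-encode procedure of step 4 can be sketched as follows. The sinusoidal encoding follows the standard Transformer form of Eq. (2); the masking bookkeeping (returning the permutation so it can later be undone by Unshuffle) is an illustrative assumption.

```python
import numpy as np

def sincos_position_encoding(n_pos, dim):
    """Standard sinusoidal encoding: PE(pos,2i)=sin(pos/10000^(2i/D)),
    PE(pos,2i+1)=cos(pos/10000^(2i/D)); dim assumed even."""
    pe = np.zeros((n_pos, dim))
    pos = np.arange(n_pos)[:, None]
    i = np.arange(0, dim, 2)[None, :]
    angle = pos / np.power(10000.0, i / dim)
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

def shuffle_and_mask(tokens, mask_ratio, rng=None):
    """Randomly permute the (Np, D) patch tokens, keep a (1 - r) fraction,
    add positional encoding, and return the permutation for Unshuffle."""
    rng = np.random.default_rng(rng)
    n = tokens.shape[0]
    perm = rng.permutation(n)
    n_keep = int(n * (1 - mask_ratio))
    kept = tokens[perm[:n_keep]]
    return kept + sincos_position_encoding(n_keep, tokens.shape[1]), perm
```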
Step 5: Build the big data intelligent cleaning model on a masked autoencoder whose encoder and decoder are both Transformer networks. Feeding the data set X′_n into the encoder yields the feature map Z = {z_j}; a fully connected layer Fc(·) maps the dimension of z_j to the decoder dimension. The decoder input Z′ consists of the high-level features Fc(z_j) and a set of learnable mask vectors Z_m:

Z′ = Unshuffle([Fc(z_j); Z_m]) + E_pos (3)

where Unshuffle(·) undoes the random shuffling and E_pos denotes the positional encoding vector;
Step 6: Feed Z′ into the decoder to obtain the model prediction Y1 = {y_i}. The denoising result is measured by the mean squared error between the prediction Y1 and the training set X1, and the cleaning ability of the model is improved by minimizing this mean squared error L_mse:

L_mse = (1/N1) Σ_{i=1}^{N1} ||y_i − x_i||² (4)
Step 7: Take the reconstruction error L_mse obtained in step 6 as the optimization objective of the training stage and update the model parameters θ by gradient descent:

θ ← θ − α · ∂L_mse/∂θ (5)

where α denotes the learning rate;
Step 8: Repeat steps 2-7 to iteratively optimize the big data intelligent cleaning model until the maximum number of iterations is reached;
Step 9: Test the model. First, test the big data intelligent cleaning model with simulated noisy data: process the test set X2 through steps 2-4 to obtain the noise-injected, masked and position-encoded data set X′_n2; feed X′_n2 into the trained big data intelligent cleaning model to obtain its denoising result Y2 on the noisy data, and compute the signal-to-noise ratio between Y2 and the test set X2 as

SNR = 10·lg(P_s2 / P_n2) (6)

where SNR denotes the signal-to-noise ratio, P_s2 the power of the test signal, and P_n2 the power of the noise signal;
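Equation (6) can be evaluated directly; here the residual between the clean test signal and the denoised output is taken as the noise term, which is an assumption about how P_n2 is obtained in practice:

```python
import numpy as np

def snr_db(clean, denoised):
    """SNR = 10 * lg(P_s2 / P_n2), with P_n2 the power of the residual."""
    p_s2 = np.mean(clean ** 2)                 # power of the test signal
    p_n2 = np.mean((clean - denoised) ** 2)    # power of the remaining noise
    return 10.0 * np.log10(p_s2 / p_n2)
```

Halving a signal's amplitude, for example, leaves a residual with one quarter of the signal power, i.e. an SNR of about 6.02 dB.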
At the same time, test the big data intelligent cleaning model with real data: feed the test set X2 directly into the trained big data intelligent cleaning model to obtain the denoising result Y2′ for the original multi-source monitoring data set.
In step 5, the big data intelligent cleaning model consists of an encoder and a decoder, both composed of Transformer networks. The encoder maps the unmasked subsequences of each sample to latent representations; it stacks three Transformer blocks with a hidden dimension of 256, each consisting of layer normalization, a multi-head self-attention mechanism and a multi-layer perceptron. The decoder reconstructs the latent representations and the learnable mask vectors into the original input; it stacks two identically structured Transformer blocks with a hidden dimension of 64. In addition, a fully connected layer maps the encoder hidden dimension to the decoder hidden dimension.
Compared with the prior art, the present invention has the following beneficial effects:

The present invention proposes an intelligent cleaning method for low-quality monitoring big data of intelligent manufacturing production lines. Noisy data are simulated by additionally injecting multiple types of noise into the original multi-source monitoring data and masking a certain proportion of the subsequences of each sample; a Transformer network, in place of a traditional convolutional neural network or long short-term memory network, maps the simulated noisy data to latent representations and reconstructs them into the original multi-source monitoring data, effectively reducing the impact of complex noise in the monitoring data on subsequent analysis and improving data quality. In addition, the constructed big data intelligent cleaning model has an asymmetric structure: its encoder operates only on the unmasked subsequences, and the decoder is lightweight compared with the encoder, being both narrower and shallower; this design greatly reduces the computational overhead of the model and shortens training time. Moreover, the decoder is used only for the data reconstruction task during training; after mere fine-tuning, the encoder alone can serve downstream diagnosis or prediction tasks, improving the generalization ability of the model.
Brief Description of the Drawings

FIG. 1 is a flow chart of the present invention.

FIG. 2 is a schematic diagram of the structure of the big data intelligent cleaning model of the present invention.
Detailed Description of the Embodiments

The present invention is described in further detail below with reference to the accompanying drawings and an embodiment.
As shown in FIG. 1, an intelligent cleaning method for low-quality monitoring big data of intelligent manufacturing production lines comprises the following steps:
Step 1: Obtain the original multi-source monitoring data set X = {x_i}_{i=1}^N from the data recording files of the production line during equipment machining, where N denotes the number of samples and x_i ∈ R^{L×C} is the i-th sample, with sequence length L and channel number C; then divide the data set X into a training set X1 = {x_i}_{i=1}^{N1} and a test set X2 = {x_i}_{i=N1+1}^{N}, where N1 denotes the number of training samples;
Step 2: Add Gaussian noise x_G, Poisson noise x_P and impulse noise x_S to each sample x_i of the training set X1 to obtain the simulated noisy data x_n; the Gaussian component is generated at a prescribed signal-to-noise ratio as

P_n = P_s / 10^(SNR/10), x_G = sqrt(P_n / Var(x_rand)) · x_rand, x_n = x_i + x_G + x_P + x_S (1)

where P_s denotes the power of the original vacuum-value signal, P_n the power of the Gaussian noise signal, SNR the signal-to-noise ratio, Var(·) the variance, and x_rand ∈ N(0,1) a random sample sequence following the standard normal distribution with the same length as x_i;
Since the type and intensity of the noise produced during production line machining are unknown, noisy data are simulated by additionally superimposing common Gaussian white noise, Poisson noise and impulse noise on the original multi-source monitoring data;
Step 3: Input the simulated noise data set X_n into a single convolutional layer to segment each sample into multiple regular, non-overlapping subsequences and map each subsequence to a high-dimensional vector; the segmented and embedded noise signal is Embed(X_n) ∈ R^{N1×Np×D}, where Np denotes the number of patches, D the dimension of the embedded vectors, and Embed(·) the segmentation-and-embedding operation;
Step 4: Randomly shuffle the embedded subsequences of each sample to obtain the data set Shuffle(Embed(X_n)), then randomly mask (delete) a fraction r of the subsequences, and position-encode the masked data set with sine and cosine functions:

PE(pos, 2i) = sin(pos / 10000^(2i/D)), PE(pos, 2i+1) = cos(pos / 10000^(2i/D)) (2)

where pos denotes the position, i the index of the encoding dimension, Shuffle(·) the random shuffling operation, and Mask(·) the masking operation; X′_n is the noise data set after noise injection, masking and positional encoding;

Randomly masking a certain proportion of the subsequences simulates missing values in the original multi-source monitoring data. Unlike recurrent or convolutional neural networks, the Transformer has no built-in memory of order when processing sequence data and cannot capture the positional information of the input sequence, so positional encoding must be introduced in addition;
Step 5: Build the big data intelligent cleaning model on a masked autoencoder. As shown in FIG. 2, the model consists of an encoder and a decoder, both composed of Transformer networks. The encoder maps the unmasked subsequences of each sample to latent representations; it stacks three Transformer blocks with a hidden dimension of 256, each consisting of layer normalization, a multi-head self-attention mechanism and a multi-layer perceptron. The decoder reconstructs the latent representations and the learnable mask vectors into the original input; it stacks two identically structured Transformer blocks with a hidden dimension of 64. In addition, a fully connected layer maps the encoder hidden dimension to the decoder hidden dimension;
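The computational saving of this asymmetric design comes from the encoder processing only the unmasked tokens while the lightweight decoder handles the full set. The token bookkeeping can be sketched as follows; the sequence length, patch length and mask ratio are illustrative values, not fixed by the claims:

```python
def mae_token_counts(seq_len, patch_len, mask_ratio):
    """Count total, encoder-visible and masked patch tokens per sample."""
    n_patches = seq_len // patch_len                 # tokens after segmentation
    n_visible = int(n_patches * (1 - mask_ratio))    # tokens the encoder processes
    n_masked = n_patches - n_visible                 # learnable mask tokens the decoder fills in
    return n_patches, n_visible, n_masked

print(mae_token_counts(300, 5, 0.25))  # -> (60, 45, 15)
```

With a 25% mask ratio the three wide encoder blocks attend over only three quarters of the tokens, while the two narrow decoder blocks process the full, unshuffled set.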
Feeding the data set X′_n into the encoder yields the feature map Z = {z_j}; a fully connected layer Fc(·) maps the dimension of z_j to the decoder dimension. The decoder input Z′ consists of the high-level features Fc(z_j) and the mask vector set Z_m:

Z′ = Unshuffle([Fc(z_j); Z_m]) + E_pos (3)

where Unshuffle(·) undoes the random shuffling and E_pos denotes the positional encoding vector;
Step 6: Feed Z′ into the decoder to obtain the model prediction Y1 = {y_i}. The denoising result is measured by the mean squared error between the prediction Y1 and the training set X1, and the cleaning ability of the model is improved by minimizing this mean squared error L_mse:

L_mse = (1/N1) Σ_{i=1}^{N1} ||y_i − x_i||² (4)

The mean squared error between the output of the big data intelligent cleaning model and the original training data serves as the loss; by minimizing this loss, the model learns to reconstruct the original data from the noisy data, that is, it acquires a certain data cleaning capability;
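The reconstruction loss of Eq. (4) is a plain mean squared error over the batch; a minimal sketch:

```python
import numpy as np

def mse_loss(y_pred, x_clean):
    """L_mse: mean squared error between the model output Y1 and the clean
    training samples X1, averaged over all elements of the batch."""
    return float(np.mean((np.asarray(y_pred) - np.asarray(x_clean)) ** 2))
```

In a training loop this scalar would be the quantity differentiated with respect to θ in the gradient descent update of step 7.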
Step 7: Take the reconstruction error L_mse obtained in step 6 as the optimization objective of the training stage and update the model parameters θ by gradient descent:

θ ← θ − α · ∂L_mse/∂θ (5)

where α denotes the learning rate;
Step 8: Repeat steps 2-7 to iteratively optimize the big data intelligent cleaning model until the maximum number of iterations E is reached;
Step 9: Test the model. First, test the big data intelligent cleaning model with simulated noisy data: process the test set X2 through steps 2-4 to obtain the noise-injected, masked and position-encoded data set X′_n2; feed X′_n2 into the trained big data intelligent cleaning model to obtain its denoising result Y2 on the noisy data, and compute the signal-to-noise ratio between Y2 and the test set X2 as

SNR = 10·lg(P_s2 / P_n2) (6)

where SNR denotes the signal-to-noise ratio, P_s2 the power of the test signal, and P_n2 the power of the noise signal. By injecting noise into and masking the test set before feeding it to the big data intelligent cleaning model, and computing the signal-to-noise ratio of the test data after cleaning, the cleaning performance and generalization ability of the model can be measured quantitatively;
At the same time, the model is tested with real data: the test set X2 is fed directly into the trained big data intelligent cleaning model to obtain the denoising result Y2′ for the original multi-source monitoring data set. Since ideal clean data free of any noise cannot be obtained, the signal-to-noise ratio cannot be computed directly to quantify the cleaning effect on real data; instead, downstream tasks such as fault diagnosis or life prediction can be run on the real data and on the denoised data separately, and the cleaning effect measured by the diagnosis or prediction performance.
Embodiment: the effectiveness of the method of the present invention is verified by cleaning the multi-source monitoring data collected from the machining process of one machine tool during mass production at an intelligent manufacturing production base.
The original multi-source monitoring data of the machining process are extracted from the data acquisition system of the intelligent manufacturing production line and preprocessed. The original multi-source monitoring data over a continuous time span are divided into a training set and a test set at a ratio of 9:1, and the two segments are randomly sampled to obtain 2000 training samples and 200 test samples, each containing 300 sample points. Gaussian white noise is added to the training data at a fixed signal-to-noise ratio of 0 dB. Each sample is divided into 50 subsequences of 5 sample points each, and about 25% of the subsequences are randomly masked. The preprocessed noisy data are fed into the big data intelligent cleaning model for training: the encoder of the model stacks three Transformer blocks with a hidden dimension of 256, and the decoder, narrower and shallower than the encoder, stacks two Transformer blocks with a hidden dimension of 64. The training parameters of the big data intelligent cleaning model are shown in Table 1:
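The 9:1 time split and random window sampling described above can be sketched as follows; the stand-in sine signal, its length and the RNG seeds are placeholders for the real monitoring data, which is not reproduced here:

```python
import numpy as np

def make_samples(series, n_samples, sample_len, rng=None):
    """Randomly sample fixed-length windows from a long monitoring series."""
    rng = np.random.default_rng(rng)
    starts = rng.integers(0, len(series) - sample_len, size=n_samples)
    return np.stack([series[s:s + sample_len] for s in starts])

signal = np.sin(np.linspace(0, 1000, 30000))    # stand-in for real monitoring data
split = int(len(signal) * 0.9)                  # 9:1 train/test split along time
train_set = make_samples(signal[:split], 2000, 300, rng=0)   # 2000 samples of 300 points
test_set = make_samples(signal[split:], 200, 300, rng=1)     # 200 samples of 300 points
```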
Table 1. Training parameters of the big data intelligent cleaning model
After training of the big data intelligent cleaning model is completed, the test set is preprocessed in the same way and fed into the model to test the cleaning effect. To reduce experimental randomness, the experiment is repeated five times and statistics of the cleaning results are computed. The average signal-to-noise ratios of several test samples after cleaning are 4.10, 6.02, 5.16 and 5.94 respectively, showing that the big data intelligent cleaning model can effectively remove the additionally injected noise, reconstruct the missing values, and generalize well.
To verify the effectiveness of the inventive method more fully, the data cleaned by the method is used to perform fault diagnosis on the machine. The constructed fault diagnosis model comprises two parts: feature extraction and state recognition. The feature extraction module is the encoder of the big data intelligent cleaning model; the state recognition module contains two fully connected layers. The first fully connected layer uses a LeakyReLU activation followed by a Dropout layer with probability 0.2; the second fully connected layer outputs the predicted state, which is normalized with a Softmax function. The training parameters of the fault diagnosis model are shown in Table 2, and the diagnosis results in Table 3. In Table 3, comparison method 1 has the same model structure as the inventive method, but the weights of its feature extraction module are randomly initialized; that is, the input data is not cleaned. The feature extraction module of comparison method 2 is a convolutional neural network, likewise three layers deep. As Table 3 shows, with the intelligent cleaning method of the present invention, the intelligent fault diagnosis accuracy reaches 92.75%. Comparison method 1, which does not clean the raw data, achieves an average diagnostic accuracy of only 86.09%, far below the inventive method; moreover, noise interference makes its training unstable and its test results fluctuate widely. Comparison method 2 uses a convolutional neural network for feature extraction, whose feature-mining ability is weaker than that of the Transformer network; its average diagnostic accuracy is only 80.72%, far below the inventive method.
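The state recognition head described above (two fully connected layers, LeakyReLU plus Dropout 0.2, then Softmax) can be sketched as a forward pass in plain NumPy. The layer widths and number of machine states below are hypothetical, since the patent text does not specify them; this is an illustration of the head's structure, not the patented implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, negative_slope=0.01):
    return np.where(x > 0, x, negative_slope * x)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # numerically stabilised
    return e / e.sum(axis=-1, keepdims=True)

def head_forward(feats, W1, b1, W2, b2, train=False, p_drop=0.2):
    """Two-FC-layer state recognition head: LeakyReLU + Dropout, then Softmax."""
    h = leaky_relu(feats @ W1 + b1)
    if train:  # inverted dropout, active only during training
        mask = rng.random(h.shape) >= p_drop
        h = h * mask / (1.0 - p_drop)
    return softmax(h @ W2 + b2)

# Hypothetical sizes: 64-d encoder features, 32 hidden units, 4 machine states
d_in, d_hidden, n_states = 64, 32, 4
W1 = 0.1 * rng.normal(size=(d_in, d_hidden)); b1 = np.zeros(d_hidden)
W2 = 0.1 * rng.normal(size=(d_hidden, n_states)); b2 = np.zeros(n_states)

probs = head_forward(rng.normal(size=(8, d_in)), W1, b1, W2, b2)
print(probs.shape)  # (8, 4); each row is a probability distribution over states
```

The Softmax guarantees each output row sums to 1, so the head's outputs can be read directly as per-state probabilities; the inverted-dropout scaling keeps the expected activation magnitude identical at train and test time.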
Table 2 Fault diagnosis model training parameters
Table 3 Diagnostic results of different methods
Comparing the diagnostic results of the inventive method with those of comparison methods 1 and 2 shows that the method effectively reduces the noise in the raw multi-source monitoring data and improves data quality, thereby improving the effect of subsequent intelligent diagnosis and verifying the generalization ability of the method. In addition, the Transformer network used in the invention has strong feature extraction capability, and its performance is better than that of a traditional convolutional neural network.
Claims (3)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311575671.6A CN117851750A (en) | 2023-11-23 | 2023-11-23 | An intelligent cleaning method for low-quality monitoring big data of intelligent manufacturing production lines |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN117851750A true CN117851750A (en) | 2024-04-09 |
Family
ID=90543914
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118596711A (en) * | 2024-07-05 | 2024-09-06 | 深圳市森棋印刷有限公司 | Printing equipment consumables monitoring method, device, equipment and storage medium |
| CN119044370A (en) * | 2024-10-15 | 2024-11-29 | 云南白药集团无锡药业有限公司 | Product quality testing method |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN112200244B (en) | Intelligent detection method for anomaly of aerospace engine based on hierarchical countermeasure training | |
| CN114266339B (en) | Rolling bearing fault diagnosis method based on small sample and GAF-DCGAN | |
| CN117851750A (en) | An intelligent cleaning method for low-quality monitoring big data of intelligent manufacturing production lines | |
| CN114970774B (en) | Intelligent transformer fault prediction method and device | |
| CN117473872A (en) | Rolling bearing fault diagnosis method based on FFT (fast Fourier transform) coding and L-CNN (linear-to-CNN) | |
| CN113704075B (en) | Fault log-based high-performance computing system fault prediction method | |
| CN117992863B (en) | Rotating Machinery Fault Diagnosis Method Based on Interpretable Stationary Wavelet Packet Convolutional Network | |
| CN110991471B (en) | Fault diagnosis method for high-speed train traction system | |
| CN114841208A (en) | Rolling bearing performance decline prediction method and device based on SAE and TCN-Attention model | |
| CN113627375A (en) | A planetary gear fault diagnosis method, system, storage medium and computing device | |
| CN116380466A (en) | Intelligent fault diagnosis method and system for rolling bearings based on enhanced event vision data | |
| CN118171167A (en) | Early warning method and system for bearing capacity of uplift pile | |
| CN114119393A (en) | A Semi-Supervised Image Rain Removal Method Based on Feature Domain Cycle Consistency | |
| CN116910574A (en) | Mechanical equipment diagnosis method and device based on meta-learning and time convolution network | |
| CN115130519A (en) | Ship structure fault prediction method using convolutional neural network | |
| CN118428548A (en) | Air quality prediction method and system based on data decomposition and residual correction | |
| CN116257972B (en) | Equipment state evaluation method and system based on field self-adaption and federal learning | |
| CN115586005A (en) | Bearing fault detection method and system based on recursive graph and depth self-adaptive Transformer | |
| CN114997210A (en) | A deep learning-based machine abnormal noise recognition and detection method | |
| CN118296470B (en) | An intelligent classification method based on multi-classifier fusion | |
| CN119577445A (en) | A deep learning approach for structural health monitoring acceleration data restoration | |
| CN118839200A (en) | Non-invasive load monitoring method, device, equipment and storage medium | |
| CN117633686A (en) | Unsupervised model vibration anomaly detection method based on sparse self-attention and VAE | |
| CN114926679B (en) | Image classification system and method for performing countermeasure defense | |
| CN117932429A (en) | Deep foundation pit deformation displacement prediction method based on VMD-FEDformer network |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||