
CN106503731A - Unsupervised feature selection method based on conditional mutual information and K-means - Google Patents

Unsupervised feature selection method based on conditional mutual information and K-means

Info

Publication number
CN106503731A
CN106503731A
Authority
CN
China
Prior art keywords
feature
character
subset
cluster
metric
Prior art date
Legal status
Pending
Application number
CN201610888945.0A
Other languages
Chinese (zh)
Inventor
马廷淮
邵文晔
曹杰
薛羽
Current Assignee
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN201610888945.0A
Publication of CN106503731A

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides an unsupervised feature selection method based on conditional mutual information and K-means. First, unlabeled data are clustered by running the K-means algorithm multiple times with different initial conditions. Then, on the basis of each clustering, the method jointly considers the modularity value of each feature and the conditional mutual information between different features, and uses a relative-independence index between features to select a feature subset with high relevance and low redundancy. The final feature subset is obtained by aggregating the subsets produced from the different K-means clustering results. The invention can be effectively applied to unlabeled and imbalanced data sets, and the selected feature subset has high relevance and low redundancy.

Description

An Unsupervised Feature Selection Method Based on Conditional Mutual Information and K-means

Technical Field

The invention relates to the feature selection problem in the field of machine learning, and specifically to a method for unsupervised feature selection on unlabeled data sets using conditional mutual information and the K-means algorithm.

Background Art

In practical machine learning applications, the number of features is often large; some features may be irrelevant, and features may also depend on one another. The more features there are, the longer it takes to analyze them and train a model, and the more easily the "curse of dimensionality" arises, making the model more complex and degrading its generalization ability. Feature selection is therefore particularly important.

Feature selection, also known as feature subset selection or attribute selection, refers to selecting a subset of all features so that the constructed model performs better. Feature selection removes irrelevant or redundant features, thereby reducing the number of features, improving model accuracy, and shortening running time. In addition, selecting the truly relevant features simplifies the model and makes it easier for researchers to understand how the data were generated.

Depending on how the search for the optimal feature subset is combined with building the learning model, feature selection methods can be roughly divided into wrapper methods and filter methods. Wrapper feature selection repeatedly runs a learning algorithm to evaluate candidate feature sets; it is more accurate than filter feature selection but generalizes poorly to other classifiers. For high-dimensional data sets, wrapper selection must be tightly coupled with a specific learning algorithm, so its computational cost is high. Filter feature selection does not require a specific learning algorithm; instead it uses suitable criteria to evaluate features quickly, making it a computationally efficient approach.

Most existing feature selection methods take classification accuracy as the optimization objective without fully considering the distribution of the data samples, and they generally pursue good performance on the majority classes while neglecting the minority classes. To address class imbalance at the data level, the positive samples of the training set can be resampled before training so that the positive and negative classes are balanced, followed by the corresponding learning (Exploratory under-sampling for class-imbalance learning. Liu X Y, Wu J, Zhou Z H), but this does not make use of all the data and reduces classification accuracy. At the algorithm level, traditional feature selection algorithms can be modified to suit the characteristics of imbalanced class distributions (A new feature selection algorithm for imbalanced problems: IM-IG. You Mingyu, Chen Yan, Li Guozheng), but this approach is limited to two-class imbalance and does not apply to multi-class imbalance.

For filter feature selection, many supervised methods have been proposed, such as evaluating candidate features with mutual information and selecting the top-ranked features as inputs to a neural network classifier (Using mutual information for selecting features in supervised neural net learning. R. Battiti). However, this method ignores redundancy between features, so many redundant features are selected, which hinders the performance of subsequent classifiers. Moreover, it applies only to data with class labels and is unsuitable for unsupervised feature selection.

In the field of unsupervised feature selection, many methods designed for text have been proposed, but they cannot be applied directly to numerical data. Some unsupervised methods do apply to numerical data; for example, an unsupervised filter-style feature selection algorithm for categorical features builds on a one-pass clustering algorithm, uses the importance each feature exhibits across different clusters as the judgment criterion, and finally selects the feature subset according to how that importance varies (Research on unsupervised feature selection methods for categorical features. Wang Lianxi, Jiang Shengyi). Because this method partitions the data with only a single clustering pass, the clustering result is random and the accuracy of the feature selection cannot be guaranteed.

The present invention first clusters unlabeled data by running the K-means algorithm multiple times with different initial conditions. On the basis of each clustering, it jointly considers the modularity value of each feature and the conditional mutual information between different features to obtain feature subsets with high relevance and low redundancy, and finally aggregates the subsets obtained from the different K-means clustering results.

Summary of the Invention

Purpose: the technical problem to be solved by this invention is feature selection on unlabeled data sets, for which an unsupervised feature selection method based on conditional mutual information and K-means is proposed. Unlabeled data are clustered by running the K-means algorithm multiple times with different initial conditions, which removes the randomness of selecting features from a single clustering result and reduces the influence of data imbalance on feature selection. On the basis of each clustering, the modularity value of each feature and the conditional mutual information between different features are considered jointly, and a relative-independence index between features is used to select feature combinations with high relevance and low redundancy. The final feature subset is obtained by aggregating the subsets produced from the different K-means clustering results. The invention can be effectively applied to unlabeled and imbalanced data sets, and the selected feature subset has high relevance and low redundancy.

The technical scheme of the present invention is as follows:

An unsupervised feature selection method based on conditional mutual information and K-means comprises the following steps:

Step 1): perform K-means clustering on the unlabeled data set multiple times with different values of K and different cluster centers, and record each clustering result;

Step 2): from the different clustering results obtained in step 1), construct the feature vector graph of each feature for each clustering result in turn;

Step 3): from the feature vector graphs constructed in step 2), compute the modularity value of each feature, and place the feature with the largest modularity value into the feature subset;

Step 4): given the initial feature subset obtained in step 3), compute the conditional mutual information of each remaining feature with respect to every feature in the subset, and from it compute each remaining feature's relative-independence value with respect to the subset;

Step 5): combine each remaining feature's modularity value from step 3) with its relative-independence value from step 4) as a weighted sum, and take the result as the feature's score;

Step 6): place the highest-scoring feature from step 5) into the feature subset, then iterate steps 4), 5), and 6) until the subset contains the required number of features;

Step 7): aggregate the feature subsets formed from the different K-means clustering results in step 6) to obtain the final feature subset.

Further, in step 1) of the unsupervised feature selection method based on conditional mutual information and K-means, K-means clustering is performed on the unlabeled data set multiple times with different values of K and different cluster centers, and each clustering result is recorded. The invention first runs the K-means clustering algorithm on the unlabeled data set multiple times with different initial values. At initialization, the maximum number of clusters, the minimum number of clusters, and the number of clustering runs of the K-means algorithm are specified manually. For each run, the K-means algorithm randomly chooses a number between the minimum and maximum as the number of clusters k and randomly selects k points from the data set as initial centroids; each run of the K-means clustering algorithm then yields one clustering result, i.e., the class labels C.
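
As a minimal illustration (the patent prescribes no particular implementation), step 1) could be sketched as follows in Python; the function name repeated_kmeans, the parameter names, and the use of scikit-learn's KMeans are assumptions of this sketch:

```python
import numpy as np
from sklearn.cluster import KMeans

def repeated_kmeans(X, k_min, k_max, T, seed=0):
    """Run K-means T times, each run with a random cluster count k in
    [k_min, k_max] and random initial centroids; return the T label vectors."""
    rng = np.random.default_rng(seed)
    labelings = []
    for _ in range(T):
        k = int(rng.integers(k_min, k_max + 1))      # random number of clusters
        km = KMeans(n_clusters=k, init="random", n_init=1,
                    random_state=int(rng.integers(2**31 - 1)))
        labelings.append(km.fit_predict(X))          # class labels C for this run
    return labelings
```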

Further, in step 2), the feature vector graphs of each feature are constructed in turn for each clustering result obtained in step 1). The feature vector graph of a given feature in the data set is constructed as follows: with the feature's values and the class labels known, each sample is treated as a node. If the class containing a given sample has x samples, the node for that sample is connected to the x-1 sample nodes whose values of this feature are closest to its own. Performing this operation on every sample in the data set, for the same feature, yields that feature's feature vector graph.
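
A sketch of this graph construction, under the assumption that the samples' feature values and cluster labels are held in NumPy arrays (the helper name feature_vector_graph is chosen here, not taken from the patent):

```python
import numpy as np

def feature_vector_graph(values, labels):
    """Adjacency matrix of the feature vector graph for one feature:
    each sample i is linked to the x-1 samples whose values of this
    feature are closest to values[i], where x is the size of i's class."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    A = np.zeros((n, n), dtype=int)
    class_size = {c: int(np.sum(labels == c)) for c in np.unique(labels)}
    for i in range(n):
        x = class_size[labels[i]]
        order = np.argsort(np.abs(values - values[i]))  # closest values first
        neighbors = [j for j in order if j != i][:x - 1]
        A[i, neighbors] = 1
        A[neighbors, i] = 1                             # undirected edges
    return A
```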

Further, in step 3), the modularity value of each feature is computed from the feature vector graph constructed in step 2) according to:

Q = \sum_{ij} \left[ \frac{A_{ij}}{2M} - \frac{k_i k_j}{(2M)(2M)} \right] \delta(C_i, C_j)

where i and j are two nodes of the feature vector graph constructed in step 2); A_{ij} is the adjacency matrix of the graph, with A_{ij} = 1 if there is an edge between i and j and 0 otherwise; M is the total number of connections, i.e., the total number of edges in the feature vector graph; k_i and k_j are the degrees of nodes i and j, respectively; and the binary function \delta(C_i, C_j) equals 1 if nodes i and j belong to the same cluster and 0 otherwise. After the modularity value of each feature has been computed from its feature vector graph, all modularity values are normalized to obtain Q', and the feature with the largest Q' is placed into the feature subset.
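
The modularity computation translates directly into code; the sketch below assumes the adjacency matrix produced above and a NumPy array of cluster labels:

```python
import numpy as np

def modularity(A, labels):
    """Q = sum_ij [A_ij/(2M) - k_i*k_j/(2M)^2] * delta(C_i, C_j),
    with M the total edge count and k the node degrees."""
    M = A.sum() / 2.0
    if M == 0:
        return 0.0
    k = A.sum(axis=1)
    same = labels[:, None] == labels[None, :]           # delta(C_i, C_j)
    return float(((A / (2 * M) - np.outer(k, k) / (2 * M) ** 2) * same).sum())
```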

Further, in step 4), given the initial feature subset obtained in step 3), the conditional mutual information of each remaining feature with respect to every feature in the subset is computed, from which each remaining feature's relative-independence value with respect to the subset is obtained:

I_{ri}(f_r; C \mid S) = \sum_{f_j \in S} RI(f_r, f_j)

where f_r is a remaining feature not yet selected into the feature subset, f_j is a feature in the subset, and S is the feature subset. RI(f_r, f_j) denotes the relative independence of the remaining feature f_r with respect to the subset feature f_j, computed as:

RI(f_r, f_j) = \frac{I(f_r; C \mid f_j) + I(f_j; C \mid f_r)}{2 H(C)}

where H(C) is the entropy of the target variable C, and I(f_r; C \mid f_j) and I(f_j; C \mid f_r) are the conditional mutual information of features f_r and f_j, computed as:

I(X_i; Y \mid X_j) = \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{k=1}^{C} p(x_i, x_j, y_k) \log \frac{p(x_i, y_k \mid x_j)}{p(x_i \mid x_j)\, p(y_k \mid x_j)}

where N is the number of samples in the data set and C is the number of classes. After each remaining feature's relative-independence value with respect to the feature subset has been computed, all the values are normalized to obtain I_ri'.
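
For illustration, a plug-in estimate of the conditional mutual information might look like the sketch below; it assumes the feature columns have been discretized (e.g., binned), which the patent leaves implicit, and the helper name is an assumption:

```python
import numpy as np

def conditional_mutual_information(x, z, y):
    """Plug-in estimate of I(X; Y | Z) from discrete samples:
    sum over (x, z, y) of p(x,z,y) * log[ p(x,y|z) / (p(x|z) p(y|z)) ]."""
    x, z, y = (np.asarray(a) for a in (x, z, y))
    n = len(x)
    triples, counts = np.unique(np.stack([x, z, y]), axis=1, return_counts=True)
    cmi = 0.0
    for (xv, zv, yv), c in zip(triples.T, counts):
        p_xzy = c / n
        p_z = np.mean(z == zv)
        p_xz = np.mean((x == xv) & (z == zv))
        p_zy = np.mean((z == zv) & (y == yv))
        # p(x,y|z) / (p(x|z) p(y|z)) simplifies to p(x,z,y) p(z) / (p(x,z) p(z,y))
        cmi += p_xzy * np.log(p_xzy * p_z / (p_xz * p_zy))
    return cmi
```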

Further, in step 5), each remaining feature's normalized modularity value from step 3) and normalized relative-independence value from step 4) are combined as a weighted sum, i.e., s = wQ' + (1-w)I_ri', where w is specified manually with range [0,1]; the result is taken as the score of each remaining feature.

Further, in step 6), the feature with the largest s from step 5) is placed into the feature subset, and steps 4), 5), and 6) are then iterated until the number of features in the subset reaches the required number, which is specified manually.
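
The scoring and greedy selection of steps 4)-6) can then be sketched as follows; select_subset and ri_fn are names assumed here, and the per-iteration normalization of the relative-independence values is elided for brevity:

```python
import numpy as np

def select_subset(Q, ri_fn, n_features, a, w=0.3):
    """Greedy feature selection for one clustering run.
    Q     : normalized modularity value Q' of every feature (array).
    ri_fn : ri_fn(r, S) -> relative independence of feature r w.r.t. subset S.
    a, w  : required subset size and the weight in s = w*Q' + (1-w)*I_ri'."""
    S = [int(np.argmax(Q))]                  # seed with the largest modularity
    remaining = set(range(n_features)) - set(S)
    while len(S) < a and remaining:
        scores = {r: w * Q[r] + (1 - w) * ri_fn(r, S) for r in remaining}
        best = max(scores, key=scores.get)   # feature with the largest score s
        S.append(best)
        remaining.remove(best)
    return S
```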

Further, in step 7), the feature subsets formed from the different K-means clustering results in step 6) are aggregated, and the features that occur most often are selected, up to the required number, to constitute the final feature subset.
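
Step 7) reduces to counting how often each feature was selected across the runs; a minimal sketch:

```python
from collections import Counter

def aggregate_subsets(subsets, a):
    """Final subset: the a features selected most often across all runs."""
    votes = Counter(f for S in subsets for f in S)
    return [f for f, _ in votes.most_common(a)]
```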

Beneficial Effects

Aimed at the feature selection problem for unlabeled data sets in machine learning, the invention combines the K-means algorithm with the conditional mutual information between features, which helps select the most important features in an unlabeled data set. By clustering the unlabeled data with multiple runs of K-means under different initial conditions, the method removes the randomness of selecting features from a single clustering result and reduces the influence of data imbalance on feature selection, remedying the shortcomings of earlier feature selection methods, which either performed poorly on imbalanced data sets or applied only to labeled data sets. Meanwhile, to obtain a feature subset with high relevance and low redundancy, the method jointly considers, on the basis of each clustering, the modularity value of each feature and the conditional mutual information between different features, uses a relative-independence index between features to select feature combinations with high relevance and low redundancy, and aggregates the subsets extracted across runs into the final feature subset. The combination of the K-means algorithm and conditional mutual information allows the feature selection algorithm to be applied to both balanced and imbalanced unlabeled data sets while increasing the relevance and reducing the redundancy of the feature subset, so that the most important set of features is selected.

Brief Description of the Drawings

Figure 1 is a flowchart of the unsupervised feature selection method based on conditional mutual information and K-means.

Figure 2 is an example of constructing feature vector graphs for a data set.

Detailed Description

The implementation of the technical scheme is described in further detail below with reference to the accompanying drawings:

The unsupervised feature selection method based on conditional mutual information and K-means of the present invention is described in further detail in conjunction with the flowchart and an implementation case.

In this implementation case, conditional mutual information and the K-means algorithm are used to select features from an unlabeled data set. As shown in Figure 1, the method comprises the following steps:

Step 10: perform K-means clustering on the unlabeled data set multiple times with different values of K and different cluster centers, and record each clustering result;

Step 101: the maximum number of clusters MAX and the minimum number of clusters MIN for the K-means algorithm are given in advance at the input stage; before each run, a number is chosen at random in the range [MIN, MAX] as the number of clusters k, and k points are chosen at random from the data set as initial centroids;

Step 102: the total number of K-means runs T is given in advance at the input stage; each run of the K-means algorithm yields one clustering result, i.e., the class labels C; the clustering is repeated until the preset total number of runs is reached, finally producing T different clustering results;

Step 20: from the clustering results obtained in the previous step, construct the feature vector graphs of each feature for each clustering result in turn;

Step 201: the feature vector graph of a given feature in the data set is constructed, with the samples' values of that feature and their class labels known, by first treating each sample as a node. For the data with two features shown in Figure 2, each circular and square point on the right represents a sample, and the number next to a point is that sample's value of the feature;

Step 202: if the class containing a given sample has x samples in total, the node for that sample is connected to the x-1 sample nodes whose feature values are closest to its own. In Figure 2, sample 1 belongs to class C1, which contains 4 samples in total, so the node for sample 1 is connected to the 3 sample nodes with the closest feature values, namely samples 2, 7, and 6;

Step 203: performing the operation of step 202 on all samples in the data set, for the same feature, yields that feature's feature vector graph;

Step 204: performing steps 201-203 on all features of the data set yields the feature vector graphs of all features. In Figure 2, the data set on the left contains 2 features; after one K-means run in step 10, class labels C1 and C2 are obtained, and the right side shows the feature vector graphs of feature 1 and feature 2;

Step 30: from the feature vector graphs constructed in the previous step, compute the modularity value of each feature, and place the feature with the largest modularity value into the feature subset;

Step 301: compute each feature's modularity value according to the modularity formula given in step 3) above;

Step 302: normalize the modularity values of all features to obtain Q';

Step 303: place the feature with the largest Q' into the feature subset and remove it from the remaining features;

Step 40: given the feature subset obtained in the previous step, compute each remaining feature's relative-independence value with respect to the subset;

Step 401: using the conditional mutual information formula given in step 4) above, compute the values I(f_r; C|f_j) and I(f_j; C|f_r), i.e., the conditional mutual information between the remaining features and the selected features;

Step 402: compute each remaining feature's relative independence RI(f_r, f_j) with respect to each feature in the subset, using the formula given in step 4) above;

Step 403: compute each remaining feature's relative-independence value I_ri(f_r; C|S) with respect to the feature subset;

Step 404: normalize the relative-independence values of all remaining features to obtain I_ri';

Step 50: add each remaining feature's modularity value Q' from step 30 and relative-independence value I_ri' from step 40 with a certain weight, and take the result as each remaining feature's score;

Step 501: the weight w between the modularity value and the relative-independence value is preset at the input stage, with range [0,1] and a default of 0.3;

Step 502: compute each remaining feature's score according to s = wQ' + (1-w)I_ri';

Step 60: place the highest-scoring feature from the previous step into the feature subset and remove it from the remaining features; repeat steps 40, 50, and 60 until the number of features in the subset reaches the required number a, which is preset at the input stage;

Step 70: aggregate the feature subsets obtained in the previous step from the different K-means clustering results, select the a features that occur most often according to the required number of features, and form and output the final feature subset.
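
To make the flow of steps 10-70 concrete, the sketches above could be stitched together as below. This is a hypothetical driver, not the patent's reference implementation: normalize and entropy are helpers defined here, the per-iteration normalization of the relative-independence values (step 404) is folded into the single division by 2H(C), and the feature columns are assumed to be discrete (or pre-binned) so that the plug-in mutual-information estimate applies:

```python
import numpy as np

def normalize(v):
    """Min-max normalize scores to [0, 1]."""
    v = np.asarray(v, dtype=float)
    span = v.max() - v.min()
    return (v - v.min()) / span if span > 0 else np.zeros_like(v)

def entropy(labels):
    """Plug-in entropy H(C) of the cluster labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def unsupervised_feature_selection(X, MIN, MAX, T, a, w=0.3):
    """Steps 10-70: T randomized K-means runs, per-run greedy selection,
    then majority aggregation of the per-run feature subsets."""
    subsets = []
    for labels in repeated_kmeans(X, MIN, MAX, T):
        n_feat = X.shape[1]
        graphs = [feature_vector_graph(X[:, f], labels) for f in range(n_feat)]
        Q = normalize([modularity(A, labels) for A in graphs])
        H = entropy(labels)

        def ri_fn(r, S):   # I_ri(f_r; C | S) = sum_j RI(f_r, f_j)
            return sum((conditional_mutual_information(X[:, r], X[:, j], labels)
                        + conditional_mutual_information(X[:, j], X[:, r], labels))
                       / (2 * H) for j in S)

        subsets.append(select_subset(Q, ri_fn, n_feat, a, w))
    return aggregate_subsets(subsets, a)
```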

Claims (8)

1. An unsupervised feature selection method based on conditional mutual information and K-means, characterized in that it comprises the following steps:
Step 1): performing K-means clustering on the unlabeled data set multiple times with different values of K and different cluster centers, and obtaining each clustering result;
Step 2): according to the different clustering results obtained in step 1), constructing the feature vector graph of each feature for each clustering result in turn;
Step 3): according to the feature vector graphs constructed in step 2), computing the modularity value of each feature, and placing the feature with the largest modularity value into the feature subset;
Step 4): according to the initial feature subset obtained in step 3), computing the conditional mutual information of each remaining feature with respect to every feature in the subset, so as to compute each remaining feature's relative-independence value with respect to the subset;
Step 5): adding each remaining feature's modularity value from step 3) and its relative-independence value from step 4) with a certain weight, and taking the result as the feature's score;
Step 6): placing the highest-scoring feature from step 5) into the feature subset, then iterating steps 4), 5), and 6) until the number of features in the subset reaches the required number;
Step 7): aggregating the feature subsets formed from the different K-means clustering results in step 6) to obtain the final feature subset.
2. The method of claim 1, characterized in that in step 1), K-means clustering is performed on the unlabeled data set multiple times with different values of K and different cluster centers, and each clustering result is obtained; at initialization, the maximum number of clusters, the minimum number of clusters, and the number of clustering runs of the K-means algorithm are specified manually; for each run, the K-means algorithm randomly chooses a number between the minimum and maximum number of clusters as the number of clusters k and randomly selects k points from the data set as initial centroids, and each run of the K-means clustering algorithm yields one clustering result, i.e., the class labels C.
3. The method of claim 1, characterized in that in step 2), the feature vector graphs of each feature are constructed in turn for each clustering result obtained in step 1); the feature vector graph of a given feature in the data set is constructed, with the feature's values and the class labels known, by treating each sample as a node; if the class containing a given sample has x samples, the node for that sample is connected to the x-1 sample nodes whose values of this feature are closest to its own; performing this operation on all samples in the data set, for the same feature, yields that feature's feature vector graph.
4. The method of claim 1, characterized in that in step 3), the modularity value of each feature is computed from the feature vector graph constructed in step 2) according to:

Q = \sum_{ij} \left[ \frac{A_{ij}}{2M} - \frac{k_i k_j}{(2M)(2M)} \right] \delta(C_i, C_j)

where i and j are two nodes of the feature vector graph constructed in step 2); A_{ij} is the adjacency matrix of the graph, with A_{ij} = 1 if there is an edge between i and j and 0 otherwise; M is the total number of connections, i.e., the total number of edges in the feature vector graph; k_i and k_j are the degrees of nodes i and j, respectively; and the binary function \delta(C_i, C_j) is 1 if nodes i and j belong to the same cluster and 0 otherwise; after the modularity value of each feature has been computed from its feature vector graph, all modularity values are normalized to obtain Q', and the feature with the largest Q' is placed into the feature subset.
5. The method of claim 1, characterized in that in step 4), given the initial feature subset obtained in step 3), the conditional mutual information of each remaining feature with respect to every feature in the subset is computed, from which each remaining feature's relative-independence value with respect to the subset is obtained:

I_{ri}(f_r; C \mid S) = \sum_{f_j \in S} RI(f_r, f_j)

where f_r is a remaining feature not yet selected into the feature subset, f_j is a feature in the subset, and S is the feature subset; RI(f_r, f_j) denotes the relative independence of the remaining feature f_r with respect to the subset feature f_j, computed as:

RI(f_r, f_j) = \frac{I(f_r; C \mid f_j) + I(f_j; C \mid f_r)}{2 H(C)}

where H(C) is the entropy of the target variable C, and I(f_r; C \mid f_j) and I(f_j; C \mid f_r) are the conditional mutual information of features f_r and f_j, computed as:

I(X_i; Y \mid X_j) = \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{k=1}^{C} p(x_i, x_j, y_k) \log \frac{p(x_i, y_k \mid x_j)}{p(x_i \mid x_j)\, p(y_k \mid x_j)}

where N is the number of samples in the data set and C is the number of classes; after each remaining feature's relative-independence value with respect to the feature subset has been computed, all the values are normalized to obtain I_ri'.
6. The method of claim 1, characterized in that in step 5), each remaining feature's normalized modularity value from step 3) and normalized relative-independence value from step 4) are added with a certain weight, i.e., s = wQ' + (1-w)I_ri', where w is specified manually with range [0,1], and the result is taken as the score of each remaining feature.
7. The method of claim 1, characterized in that in step 6), the feature with the largest s from step 5) is placed into the feature subset, and steps 4), 5), and 6) are then iterated until the number of features in the subset reaches the required number, which is specified manually.
8. The method of claim 1, characterized in that in step 7), the feature subsets formed from the different K-means clustering results in step 6) are aggregated, and the features that occur most often are selected according to the required number of features to constitute the final feature subset.
CN201610888945.0A 2016-10-11 2016-10-11 Unsupervised feature selection method based on conditional mutual information and K-means Pending CN106503731A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610888945.0A CN106503731A (en) 2016-10-11 2016-10-11 Unsupervised feature selection method based on conditional mutual information and K-means

Publications (1)

Publication Number Publication Date
CN106503731A true CN106503731A (en) 2017-03-15

Family

ID=58293652

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610888945.0A Pending CN106503731A (en) 2016-10-11 2016-10-11 A kind of based on conditional mutual information and the unsupervised feature selection approach of K means

Country Status (1)

Country Link
CN (1) CN106503731A (en)

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239798B (en) * 2017-05-24 2020-06-09 武汉大学 A feature selection method for predicting the number of software defects
CN107239798A (en) * 2017-05-24 2017-10-10 武汉大学 A feature selection method for predicting the number of software defects
EP3456673A1 (en) * 2017-08-07 2019-03-20 Otis Elevator Company Predictive elevator condition monitoring using qualitative and quantitative informations
US10737904B2 (en) 2017-08-07 2020-08-11 Otis Elevator Company Elevator condition monitoring using heterogeneous sources
CN108363784A (en) * 2018-01-20 2018-08-03 西北工业大学 A public opinion trend estimation method based on text machine learning
CN109506761A (en) * 2018-06-12 2019-03-22 国网四川省电力公司乐山供电公司 A transformer surface vibration feature extraction method
CN109506761B (en) * 2018-06-12 2021-08-27 国网四川省电力公司乐山供电公司 Transformer surface vibration feature extraction method
CN109255368A (en) * 2018-08-07 2019-01-22 平安科技(深圳)有限公司 Method, apparatus, electronic device and storage medium for randomly selecting features
CN109255368B (en) * 2018-08-07 2023-12-22 平安科技(深圳)有限公司 Method, device, electronic equipment and storage medium for randomly selecting characteristics
CN109493929B (en) * 2018-09-20 2022-03-15 北京工业大学 Low redundancy feature selection method based on grouping variables
CN109493929A (en) * 2018-09-20 2019-03-19 北京工业大学 Low redundancy feature selection method based on grouping variable
CN109068180B (en) * 2018-09-28 2021-02-02 武汉斗鱼网络科技有限公司 Method for determining video fine selection set and related equipment
CN109068180A (en) * 2018-09-28 2018-12-21 武汉斗鱼网络科技有限公司 Method and related device for determining a video highlight set
CN109816034B (en) * 2019-01-31 2021-08-27 清华大学 Signal characteristic combination selection method and device, computer equipment and storage medium
CN109816034A (en) * 2019-01-31 2019-05-28 清华大学 Signal feature combination selection method, device, computer equipment and storage medium
CN110298398B (en) * 2019-06-25 2021-08-03 大连大学 Feature Selection Method of Wireless Protocol Frame Based on Improved Mutual Information
CN110298398A (en) * 2019-06-25 2019-10-01 大连大学 Wireless protocol frame feature selection method based on improved mutual information
CN110426612A (en) * 2019-08-17 2019-11-08 福州大学 Two-stage method for selecting optimal time-domain dielectric response feature quantities of transformer oil-paper insulation
CN110942149B (en) * 2019-10-31 2020-09-22 河海大学 A Feature Variable Selection Method Based on Information Change Rate and Conditional Mutual Information
CN110942149A (en) * 2019-10-31 2020-03-31 河海大学 A Feature Variable Selection Method Based on Information Change Rate and Conditional Mutual Information
CN117076962A (en) * 2023-10-13 2023-11-17 腾讯科技(深圳)有限公司 Data analysis method, device and equipment applied to artificial intelligence field
CN117076962B (en) * 2023-10-13 2024-01-26 腾讯科技(深圳)有限公司 Data analysis method, device and equipment applied to artificial intelligence field
CN117454314A (en) * 2023-12-19 2024-01-26 深圳航天科创泛在电气有限公司 Wind turbine component running state prediction method, device, equipment and storage medium
CN117454314B (en) * 2023-12-19 2024-03-05 深圳航天科创泛在电气有限公司 Wind turbine component running state prediction method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN106503731A (en) Unsupervised feature selection method based on conditional mutual information and K-means
CN114841257B (en) A small sample target detection method based on self-supervised contrast constraints
CN103559504B (en) Image target category identification method and device
CN111754345B Bitcoin address classification method based on improved random forest
CN106951499B (en) A Knowledge Graph Representation Method Based on Translation Model
CN111260064A (en) Knowledge inference method, system and medium based on knowledge graph of meta knowledge
CN104991974A (en) Particle swarm algorithm-based multi-label classification method
CN111428768A (en) Clustering Method Based on Hellinger Distance-Gaussian Mixture Model
CN104702465B A parallel network traffic classification method
CN109446420B (en) Cross-domain collaborative filtering method and system
CN104268572B Feature extraction and feature selection method for background multi-source data
CN102254020A (en) Global K-Means Clustering Method Based on Feature Weight
CN110472061A A knowledge graph fusion method based on short text similarity calculation
CN110516950A (en) A Risk Analysis Method Oriented to Entity Resolution Task
CN112149760A (en) Heterogeneous inner hyperplane-based fuzzy support vector machine design method
CN107291895A A fast hierarchical document retrieval method
CN113449802A (en) Graph classification method and device based on multi-granularity mutual information maximization
CN114299362A (en) Small sample image classification method based on k-means clustering
CN103093247B An automatic classification method for plant pictures
CN113837266B (en) Software defect prediction method based on feature extraction and Stacking ensemble learning
CN110866134A (en) A Distribution Consistency Preserving Metric Learning Method for Image Retrieval
CN103886030A Cost-sensitive decision-tree-based data classification method for cyber-physical systems
CN112418987B (en) Method and system for rating credit of transportation unit, electronic device and storage medium
WO2020147259A1 (en) User portait method and apparatus, readable storage medium, and terminal device
CN110674293A (en) A text classification method based on semantic transfer

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170315
