WO2024245413A1

WO2024245413A1 - Post-correction-based high-entropy knn clustering method and device, and medium

Info

Publication number: WO2024245413A1
Application number: PCT/CN2024/096756
Authority: WO
Inventors: 徐同明; 鹿海洋; 魏代森; 谭宁宁; 祝静; 林卉; 孙帅; 马娉婷; 陈杰
Original assignee: Inspur General Software Co Ltd
Current assignee: Inspur General Software Co Ltd
Priority date: 2023-06-01
Filing date: 2024-05-31
Publication date: 2024-12-05
Anticipated expiration: 2025-12-01
Also published as: CN116361671B; CN116361671A

Abstract

The present application relates to the field of electric digital data processing, and discloses a post-correction-based high-entropy KNN clustering method and device, and a medium. The method comprises: determining a set of samples needing to be clustered, and carrying out initial classification on a plurality of specified samples in the set of samples on the basis of the mode of a same similarity; selecting K prior samples closest to samples to be classified as comparison samples; on the basis of the mode of different similarities, obtaining category labels of the samples to be classified; and re-classifying the plurality of prior samples on the basis of the mode of different similarities. The accuracy of prior samples is effectively guaranteed, then on the basis of the mode of different similarities, the requirements of inter-class homogeneity and intra-class heterogeneity are effectively met, and finally, the prior samples are subjected to post-correction and re-classification, so that the high-entropy clustering process of all the samples can be achieved, and the requirements for high-entropy clustering are met.

Description

A high entropy KNN clustering method, device and medium based on post-correction

本申请要求于2023年06月01日提交中国专利局、申请号为202310636506.0、发明名称为“一种基于后校正的高熵KNN聚类方法、设备及介质”的中国专利申请的优先权，其全部内容通过引用结合在本申请中。This application claims the priority of the Chinese patent application filed with the China Patent Office on June 1, 2023, with application number 202310636506.0 and invention name “A high entropy KNN clustering method, device and medium based on post-correction”, the entire contents of which are incorporated by reference in this application.

Technical Field

本申请涉及电数字数据处理领域，具体涉及一种基于后校正的高熵KNN聚类方法、设备及介质。The present application relates to the field of electronic digital data processing, and in particular to a post-correction based high entropy KNN clustering method, device and medium.

Background Art

K最邻近分类算法(K-NearestNeighbor，KNN)是一种监督学习算法，其能够根据K个最近的邻居的状态来决定样本的状态，常用于样本分类。通常来说，KNN算法能够呈现类间迥异、类内同质的特点，也就是能起到类间高熵、类内低熵的效果。K-NearestNeighbor (KNN) is a supervised learning algorithm that can determine the state of a sample based on the states of its K nearest neighbors and is often used for sample classification. Generally speaking, the KNN algorithm can present the characteristics of being very different between classes and homogeneous within classes, that is, it can achieve the effect of high entropy between classes and low entropy within classes.

但是，随着技术的发展，出现一些类间同质、类内迥异的应用需求，比如，在对多类型产品或者多类型的数据进行分类时，只需要保证每个类别中，各类型的产品或者数据是符合一定比例的即可。此时在分类过程中，需要保证实现类间低熵、类内高熵的效果，通过传统的KNN算法是难以实现的。However, with the development of technology, some application requirements have emerged, such as homogeneity between classes and heterogeneity within classes. For example, when classifying multiple types of products or multiple types of data, it is only necessary to ensure that each type of product or data in each category meets a certain proportion. In this case, during the classification process, it is necessary to ensure the effect of low entropy between classes and high entropy within classes, which is difficult to achieve through the traditional KNN algorithm.

发明内容Summary of the invention

为了解决上述问题，本申请提出了一种基于后校正的高熵KNN聚类方法，包括：In order to solve the above problems, this application proposes a high entropy KNN clustering method based on post-correction, including:

确定需要进行聚类的样本集合，并基于相似度相同的方式，为所述样本集合中的若干个指定样本进行初始化分类；Determine a sample set that needs to be clustered, and initialize classification for a number of designated samples in the sample set based on the same similarity;

将完成所述初始化分类的样本作为先验样本，并针对所述样本集合中，除所述先验样本以外剩余的待分类样本，选取与所述待分类样本距离最近的K个先验样本，作为对比样本；所述K为预先设置的正整数值；The sample that has completed the initial classification is used as a priori sample, and for the remaining samples to be classified in the sample set except the priori sample, the sample with the closest distance to the sample to be classified is selected. K prior samples are used as comparison samples; K is a preset positive integer value;

基于相似度相异的方式，以及所述对比样本在初始化分类中确定的类别标签，得到所述待分类样本的类别标签，直至对所有待分类样本完成分类；Based on the difference in similarity and the category labels of the comparison samples determined in the initial classification, the category labels of the samples to be classified are obtained until all the samples to be classified are classified;

基于相似度相异的方式，将若干个先验样本进行重新分类。Based on the difference in similarity, several prior samples are reclassified.

另一方面，本申请还提出了一种基于后校正的高熵KNN聚类设备，包括：On the other hand, the present application also proposes a post-correction based high entropy KNN clustering device, comprising:

至少一个处理器；以及，at least one processor; and,

与所述至少一个处理器通信连接的存储器；其中，a memory communicatively connected to the at least one processor; wherein,

所述存储器存储有可被所述至少一个处理器执行的指令，所述指令被所述至少一个处理器执行，以使所述至少一个处理器能够执行如：The memory stores instructions that can be executed by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform, for example:

将完成所述初始化分类的样本作为先验样本，并针对所述样本集合中，除所述先验样本以外剩余的待分类样本，选取与所述待分类样本距离最近的K个先验样本，作为对比样本；所述K为预先设置的正整数值；The sample that has completed the initialization classification is used as a priori sample, and for the remaining samples to be classified in the sample set except the priori sample, K priori samples that are closest to the samples to be classified are selected as comparison samples; K is a preset positive integer value;

本申请还提出了一种非易失性计算机存储介质，存储有计算机可执行指令，所述计算机可执行指令设置为：The present application also proposes a non-volatile computer storage medium storing computer executable instructions, wherein the computer executable instructions are configured as follows:

基于相似度相异的方式，以及所述对比样本在初始化分类中确定的类别标签，得到所述待分类样本的类别标签，直至对所有待分类样本完成分类； Based on the difference in similarity and the category labels of the comparison samples determined in the initial classification, the category labels of the samples to be classified are obtained until all the samples to be classified are classified;

通过本申请提出基于后校正的高熵KNN聚类方法能够带来如下有益效果：The high entropy KNN clustering method based on post-correction proposed in this application can bring the following beneficial effects:

通过传统的相似度相同的方式得到先验样本，有效保证了先验样本的准确性，然后基于相似度相异的方式，有效实现类间同质、类内迥异的需求，最终再对先验样本进行后校正重新分类，即可实现对所有样本的高熵聚类过程，满足了对于高熵聚类的需求。The prior samples are obtained through the traditional method of the same similarity, which effectively guarantees the accuracy of the prior samples. Then, based on the method of different similarities, the requirements of homogeneity between classes and differences within classes are effectively met. Finally, the prior samples are post-corrected and reclassified to realize the high-entropy clustering process of all samples, meeting the needs for high-entropy clustering.

BRIEF DESCRIPTION OF THE DRAWINGS

此处所说明的附图用来提供对本申请的进一步理解，构成本申请的一部分，本申请的示意性实施例及其说明用于解释本申请，并不构成对本申请的不当限定。在附图中：The drawings described herein are used to provide a further understanding of the present application and constitute a part of the present application. The illustrative embodiments of the present application and their descriptions are used to explain the present application and do not constitute an improper limitation on the present application. In the drawings:

图1为本申请实施例中基于后校正的高熵KNN聚类方法的流程示意图；FIG1 is a schematic diagram of a process of a high entropy KNN clustering method based on post-correction in an embodiment of the present application;

图2为本申请实施例中初始化分类的示意图；FIG2 is a schematic diagram of initialization classification in an embodiment of the present application;

图3为本申请实施例中传统KNN聚类算法的结果示意图；FIG3 is a schematic diagram of the results of a traditional KNN clustering algorithm in an embodiment of the present application;

图4为本申请实施例中，第一种情况下对应的相似度相异的方式进行分类的示意图；FIG4 is a schematic diagram of classification in different ways corresponding to the similarities in the first case in an embodiment of the present application;

图5为本申请实施例中，第二种情况下对应的相似度相异的方式进行分类的示意图；FIG5 is a schematic diagram of classification in different ways corresponding to the similarities in the second case in an embodiment of the present application;

图6为本申请实施例中，第三种情况下对应的相似度相异的方式进行分类的示意图；FIG6 is a schematic diagram of classification in different ways corresponding to the similarities in the third case in an embodiment of the present application;

图7为本申请实施例中相似度相异的方式的分类结果示意图；FIG7 is a schematic diagram of classification results in different similarities according to an embodiment of the present application;

图8为本申请实施例中先验样本后校正的示意图；FIG8 is a schematic diagram of a priori sample post-correction in an embodiment of the present application;

图9为本申请实施例中基于后校正的高熵KNN聚类设备的示意图。FIG9 is a schematic diagram of a high entropy KNN clustering device based on post-correction in an embodiment of the present application.

DETAILED DESCRIPTION

为使本申请的目的、技术方案和优点更加清楚，下面将结合本申请具体实施例及相应的附图对本申请技术方案进行清楚、完整地描述。显然，所描述的实施例仅是本申请一部分实施例，而不是全部的实施例。基于本申请中的实施例，本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例，都属于本申请保护的范围。In order to make the purpose, technical solution and advantages of this application clearer, the technical solution of this application will be clearly and completely described below in combination with the specific embodiments of this application and the corresponding drawings. Obviously, the described embodiments are only part of the embodiments of this application, not all of them. The embodiments in the application and all other embodiments obtained by ordinary technicians in this field without making any creative work are within the scope of protection of this application.

以下结合附图，详细说明本申请各实施例提供的技术方案。The technical solutions provided by various embodiments of the present application are described in detail below in conjunction with the accompanying drawings.

如图1所示，本申请实施例提供基于后校正的高熵KNN聚类方法，包括：As shown in FIG1 , the embodiment of the present application provides a high entropy KNN clustering method based on post-correction, including:

S101：确定需要进行聚类的样本集合，并基于相似度相同的方式，为所述样本集合中的若干个指定样本进行初始化分类。S101: Determine a sample set that needs to be clustered, and perform initial classification for a number of designated samples in the sample set based on the same similarity.

与传统的KNN聚类不同的是，在本文中的高熵KNN聚类所要实现的目的不同。在预先获取的数据集合中，选取若干个数据，该数据里可以是产品数据、图像数据、音频数据等。Different from the traditional KNN clustering, the purpose of high entropy KNN clustering in this paper is different. In the pre-acquired data set, several data are selected, which can be product data, image data, audio data, etc.

将若干个数据作为样本集合，以对样本集合进行聚类，此时，聚类的目的不再是将相同或相似类别的数据汇集在一个类簇中，而是在聚类结果的类簇中，不同类别的数据符合预设比例。比如，以产品数据为例，最终得到的每个类簇中，产品质量的比例符合预设比例，优品、良品、差品的比例符合5:3:2的比例，即可达到预先的目的。Several data are used as sample sets to cluster the sample sets. At this time, the purpose of clustering is no longer to gather data of the same or similar categories into one cluster, but to make the data of different categories meet the preset ratio in the clusters of the clustering results. For example, taking product data as an example, in each cluster finally obtained, the ratio of product quality meets the preset ratio, and the ratio of excellent products, good products, and poor products meets the ratio of 5:3:2, which can achieve the preset purpose.

在初始化分类时，在样本集合中，确定已经选取的若干个指定样本，此处的指定样本为具有可识别特点(也可以称作显著特点)的样本，比如，以产品数据为例，某些产品的质量非常优秀，或者具有非常明显的残次，则可以认为其具有可识别特点。或者，对图像数据进行识别时，图像中明显存在指定物品，或者明显不存在指定物品的，认为其具有可识别特点。通常来说选取的指定样本数量相比于样本集合为少量，When initializing the classification, several designated samples are selected from the sample set. The designated samples here are samples with identifiable characteristics (also called significant characteristics). For example, taking product data as an example, some products are of excellent quality or have very obvious defects, which can be considered to have identifiable characteristics. Or, when recognizing image data, if the designated object is obviously present in the image, or if the designated object is obviously absent, it is considered to have identifiable characteristics. Generally speaking, the number of designated samples selected is small compared to the sample set.

针对每个指定样本，选取距离该指定样本最近的K个样本，并将K个样本中，出现次数最多的类别标签，作为该指定样本的类别标签。如图2所示，其中共选取了12个指定样本，共分为两类，在图中以不同的图标进行标识。此时，这些指定样本是通过相似度相同的方式得到的，其符合传统KNN聚类过程中的聚类过程和效果。For each designated sample, select the K samples closest to the designated sample, and use the category label with the most occurrences among the K samples as the category label of the designated sample. As shown in Figure 2, a total of 12 designated samples are selected and divided into two categories, which are marked with different icons in the figure. At this time, these designated samples are obtained in the same similarity, which conforms to the clustering process and effect in the traditional KNN clustering process.

S102：将完成所述初始化分类的样本作为先验样本，并针对所述样本集合中，除所述先验样本以外剩余的待分类样本，选取与所述待分类样本距离最近的K个先验样本，作为对比样本；所述K为预先设置的正整数值。S102: taking the samples that have completed the initialization classification as prior samples, and selecting K prior samples that are closest to the samples to be classified from the remaining samples to be classified in the sample set except the prior samples as comparison samples; K is a preset positive integer value.

K值的选取不宜过大或过小，通常来说，其与样本集合的样本容量相关。此时，确定样本集合对应的样本容量，根据样本容量确定分类过程中对应的K值以及指定样本的选取数量，其中，K值与样本容量的比值范围为[0.03,0.09]之内的正整数，且K值为奇数，当样本容量为100时，K值可以是3、5、7、9。而指定样本的选取数量则需要高于K值选取范围的最大值，以便于对比样本的选取。The K value should not be too large or too small. Generally speaking, it is proportional to the sample size of the sample set. At this time, determine the sample size corresponding to the sample set, and determine the corresponding K value and the number of selected samples in the classification process according to the sample size. The ratio of the K value to the sample size is a positive integer within [0.03, 0.09], and the K value is an odd number. When the sample size is 100, the K value can be 3, 5, 7, or 9. The number of selected samples needs to be higher than the maximum value of the K value selection range to facilitate the selection of comparative samples.

S103：基于相似度相异的方式，以及所述对比样本在初始化分类中确定的类别标签，得到所述待分类样本的类别标签，直至对所有待分类样本完成分类。S103: Based on the different similarities and the category labels of the comparison samples determined in the initial classification, the category labels of the samples to be classified are obtained until all the samples to be classified are classified.

如图3所示，按照传统的KNN聚类算法，仍以相似度相同的方式进行聚类时，则最终得到的结果仍是类间迥异，类内同质的效果，此时类内仍处于低熵的状态，不符合本文中的需求。As shown in Figure 3, according to the traditional KNN clustering algorithm, when clustering is still performed in the same similarity manner, the final result is still very different between classes and homogeneous within classes. At this time, the class is still in a low entropy state, which does not meet the requirements of this article.

基于此，采用相似度相异的方式，针对每个待分类样本，确定其对应的对比样本中出现的类别标签，以及出现的各类别标签分别对应的出现次数。在所有类别标签中，选取出现次数最少的类别标签，作为待分类样本的类别标签。Based on this, the method of similarity difference is adopted to determine the category labels that appear in the corresponding comparison samples for each sample to be classified, as well as the number of occurrences of each category label. Among all the category labels, the category label with the least number of occurrences is selected as the category label of the sample to be classified.

如图4、图5以及图6所示，设K值为3，为方便描述，在此将空心方框的图标代表第一类别，将方框内包含叉的图标代表第二类别，将实心方框的图标代表未确定的类别。在图4中，样本1周围最接近的3个样本中，第一类别的数量为1，第二类别的数量为2，第一类被的数量更少，故而样本1为第一类别。在图5中，针对样本2，第一类别和第二类别的数量分别为2和1，故而样本2为第二类别。在图6中，针对样本3，第一类别和第二类别的数量分别为3和0，故而样本3的类别为第二类别，类似可以得到，样本4的类别为第一类别。As shown in Figures 4, 5 and 6, let the K value be 3. For the convenience of description, the icons in the hollow boxes represent the first category, the icons with a cross in the box represent the second category, and the icons in the solid boxes represent the undetermined category. In Figure 4, among the three samples closest to sample 1, the number of the first category is 1, and the number of the second category is 2. The number of the first category is smaller, so sample 1 is the first category. In Figure 5, for sample 2, the number of the first category and the second category are 2 and 1 respectively, so sample 2 is the second category. In Figure 6, for sample 3, the number of the first category and the second category are 3 and 0 respectively, so the category of sample 3 is the second category. Similarly, it can be obtained that the category of sample 4 is the first category.

此时，最终实现的效果可以如图7所示，达到类间同质、类内迥异的效果，此时类内处于高熵的状态，符合需求。At this point, the final effect can be shown in Figure 7, achieving the effect of homogeneity between classes and great differences within classes. At this time, the class is in a high entropy state, which meets the requirements.

S104：基于相似度相异的方式，将若干个先验样本进行重新分类。S104: Reclassify a number of prior samples based on different similarities.

上文中已将待分类样本，按照相似度相异的方式进行了聚类，然而最开始得到的先验样本仍是按照相似度相同的方式聚类的，不符合需求，此时，针对每个先验样本，在已经完成分类的待分类样本中(由于先验样本不符合需求，故而在选取对比文件时，从已经完成分类的待分类样本中选取)，选取最近的K个先验样本，作为对比样本。类似地，确定对比样本中出现的类别标签，以及出现的各类别标签分别对应的出现次数，在所有类别标签中，选取出现次数最少的类别标签，作为先验样本重新分类后得到的类别标签。In the above text, the samples to be classified have been clustered according to the method of different similarities. However, the a priori samples obtained at the beginning are still clustered according to the method of the same similarity, which does not meet the requirements. At this time, for each a priori sample, among the samples to be classified that have been classified (because the a priori samples do not meet the requirements, when selecting the comparison file, select the samples to be classified that have been classified). Take), select the nearest K prior samples as comparison samples. Similarly, determine the category labels that appear in the comparison samples, as well as the number of occurrences of each category label. Among all category labels, select the category label with the least number of occurrences as the category label obtained after reclassification of the prior sample.

如图8所示，圈出的样本为先验样本，通过将其与最接近的3个对比样本，按照相似度相异的方式进行对比后发现，其需要更改类别，则将其由第一类别改为第二类别，由此，通过后校正的方式，对先验样本进行了重新分类，完成了所有样本的相似度相异方式的聚类，最终使得样本集合所有的样本均符合需求。As shown in Figure 8, the circled samples are prior samples. By comparing them with the three closest comparison samples in terms of similarity difference, it is found that they need to be changed in category, so they are changed from the first category to the second category. Therefore, the prior samples are reclassified through post-correction, and the clustering of all samples in terms of similarity difference is completed, so that all samples in the sample set finally meet the requirements.

在一个实施例中，上文中描述过需求选取距离最近的样本作为对比样本，而在计算样本间距离时，确定样本集合中，各样本所包含的维度数量，并根据维度数量计算待分类样本与其他所有先验样本之间的距离，从而选取距离最近的K个先验样本，作为对比样本。In one embodiment, it is described above that the closest sample needs to be selected as a comparison sample. When calculating the distance between samples, the number of dimensions contained in each sample in the sample set is determined, and the distance between the sample to be classified and all other prior samples is calculated based on the number of dimensions, so as to select the K prior samples with the closest distance as comparison samples.

维度数量通常包括一维至三维数据，比如，文本数据中包含文字的一维数据，2D平面图像中包含像素在x轴和y轴二维数据，产品数据中包含外观、功能、价格的三维数据。The number of dimensions usually includes one-dimensional to three-dimensional data. For example, text data contains one-dimensional data of text, 2D plane images contain two-dimensional data of pixels on the x-axis and y-axis, and product data contains three-dimensional data of appearance, function, and price.

当样本集合中样本为一维时，通过d_i＝|x_i-x₁|得到待分类样本与先验样本之间的距离，其中，d_i为待分类样本与第i个先验样本之间的距离，x_i为第i个先验样本的坐标，x₁为待分类样本的坐标。When the samples in the sample set are one-dimensional, the distance between the sample to be classified and the prior sample is obtained by d _i =| _xi - _x1 |, where d _i is the distance between the sample to be classified and the i-th prior sample, _xi is the coordinate of the i-th prior sample, and _x1 is the coordinate of the sample to be classified.

当样本集合中样本为二维时，通过得到待分类样本与先验样本之间的距离，其中，d_i为待分类样本与第i个先验样本之间的距离，(x_i，y_i)为第i个先验样本的坐标，(x₁，y₁)为待分类样本的坐标；When the samples in the sample set are two-dimensional, The distance between the sample to be classified and the prior sample is obtained, where d _i is the distance between the sample to be classified and the i-th prior sample, (x _i , y _i ) is the coordinate of the i-th prior sample, and (x ₁ , y ₁ ) is the coordinate of the sample to be classified;

当样本集合中样本为三维，通过得到待分类样本与先验样本之间的距离，其中，d_i为待分类样本与第i个先验样本之间的距离，(x_i，y_i，z_i)为第i个先验样本的坐标，(x₁，y₁，z₁)为待分类样本的坐标。When the samples in the sample set are three-dimensional, Get the distance between the sample to be classified and the prior sample, where d _i is the distance between the sample to be classified and the i-th prior sample, (x _i , y _i , z _i ) is the coordinate of the i-th prior sample, and (x ₁ , y ₁ , z ₁ ) is the coordinate of the sample to be classified. Coordinates of the classified samples.

当然，维度数量还可以包括更多维度，此时推导类似地公式即可计算样本之间的距离。Of course, the number of dimensions can also include more dimensions, in which case a similar formula can be derived to calculate the distance between samples.

如图9所示，本申请还提出了一种基于后校正的高熵KNN聚类设备，包括：As shown in FIG9 , the present application also proposes a post-correction based high entropy KNN clustering device, comprising:

至少一个处理器；以及，at least one processor; and,

本申请中的各个实施例均采用递进的方式描述，各个实施例之间相同相似的部分互相参见即可，每个实施例重点说明的都是与其他实施例的不同之处。尤其，对于设备和介质实施例而言，由于其基本相似于方法实施例，所以描述的比较简单，相关之处参见方法实施例的部分说明即可。The various embodiments in this application are described in a progressive manner. The same or similar parts between the various embodiments can be referred to each other. Each embodiment focuses on the differences from other embodiments. In particular, for the device and medium embodiments, since they are basically similar to the method embodiments, the description The method is relatively simple, and the relevant parts can be referred to the partial description of the method embodiment.

本申请实施例提供的设备和介质与方法是一一对应的，因此，设备和介质也具有与其对应的方法类似的有益技术效果，由于上面已经对方法的有益技术效果进行了详细说明，因此，这里不再赘述设备和介质的有益技术效果。The devices and media provided in the embodiments of the present application correspond one-to-one to the methods. Therefore, the devices and media also have similar beneficial technical effects as the corresponding methods. Since the beneficial technical effects of the methods have been described in detail above, the beneficial technical effects of the devices and media will not be repeated here.

本领域内的技术人员应明白，本申请的实施例可提供为方法、系统、或计算机程序产品。因此，本申请可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且，本申请可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。Those skilled in the art will appreciate that the embodiments of the present application may be provided as methods, systems, or computer program products. Therefore, the present application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment in combination with software and hardware. Moreover, the present application may adopt the form of a computer program product implemented in one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) that include computer-usable program code.

本申请是参照根据本申请实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器，使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。The present application is described with reference to the flowchart and/or block diagram of the method, device (system) and computer program product according to the embodiment of the present application. It should be understood that each process and/or box in the flowchart and/or block diagram, and the combination of the process and/or box in the flowchart and/or block diagram can be realized by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for realizing the function specified in one process or multiple processes in the flowchart and/or one box or multiple boxes in the block diagram.

这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中，使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品，该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce a manufactured product including an instruction device that implements the functions specified in one or more processes in the flowchart and/or one or more boxes in the block diagram.

这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上，使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理，从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。These computer program instructions may also be loaded onto a computer or other programmable data processing device so that a series of operational steps are executed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes in the flowchart and/or one or more boxes in the block diagram.

在一个典型的配置中，计算设备包括一个或多个处理器(CPU)、输入/输出接口、网络接口和内存。In a typical configuration, a computing device includes one or more processors (CPU), input/output interfaces, network interfaces, and memory.

内存可能包括计算机可读介质中的非永久性存储器，随机存取存储器(RAM)和/或非易失性内存等形式，如只读存储器(ROM)或闪存(flash RAM)。内存是计算机可读介质的示例。Memory may include non-permanent storage in a computer-readable medium, in the form of random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

计算机可读介质包括永久性和非永久性、可移动和非可移动媒体可以由任何方法或技术来实现信息存储。信息可以是计算机可读指令、数据结构、程序的模块或其他数据。计算机的存储介质的例子包括，但不限于相变内存(PRAM)、静态随机存取存储器(SRAM)、动态随机存取存储器(DRAM)、其他类型的随机存取存储器(RAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、快闪记忆体或其他内存技术、只读光盘只读存储器(CD-ROM)、数字多功能光盘(DVD)或其他光学存储、磁盒式磁带，磁带磁磁盘存储或其他磁性存储设备或任何其他非传输介质，可用于存储可以被计算设备访问的信息。按照本文中的界定，计算机可读介质不包括暂存电脑可读媒体(transitory media)，如调制的数据信号和载波。Computer-readable media include permanent and non-permanent, removable and non-removable media that can be stored in any computer. Any method or technology to achieve information storage. Information can be computer-readable instructions, data structures, modules of programs or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disk read-only memory (CD-ROM), digital versatile disk (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices or any other non-transmission media that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media does not include transitory media such as modulated data signals and carrier waves.

还需要说明的是，术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、商品或者设备不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、商品或者设备所固有的要素。在没有更多限制的情况下，由语句“包括一个……”限定的要素，并不排除在包括所述要素的过程、方法、商品或者设备中还存在另外的相同要素。It should also be noted that the terms "include", "comprises" or any other variations thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity or device including a series of elements includes not only those elements, but also other elements not explicitly listed, or also includes elements inherent to such process, method, commodity or device. In the absence of more restrictions, the elements defined by the sentence "comprises a ..." do not exclude the existence of other identical elements in the process, method, commodity or device including the elements.

以上所述仅为本申请的实施例而已，并不用于限制本申请。对于本领域技术人员来说，本申请可以有各种更改和变化。凡在本申请的精神和原理之内所作的任何修改、等同替换、改进等，均应包含在本申请的权利要求范围之内。 The above is only an embodiment of the present application and is not intended to limit the present application. For those skilled in the art, the present application may have various changes and variations. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

A high entropy KNN clustering method based on post-correction, comprising:

Determine a sample set that needs to be clustered, and initialize classification for a number of designated samples in the sample set based on the same similarity; wherein, in a pre-acquired data set, select a number of data as the sample set, and the number of data is product data; the designated samples are samples with identifiable characteristics, and the identifiable characteristics include excellent product quality and defective products;

The sample that has completed the initialization classification is used as a priori sample, and for the remaining samples to be classified in the sample set except the priori sample, K priori samples that are closest to the samples to be classified are selected as comparison samples; K is a preset positive integer value;

Based on the difference in similarity and the category labels of the comparison samples determined in the initial classification, the category labels of the samples to be classified are obtained until all the samples to be classified are classified;

Based on the difference in similarity, several prior samples are reclassified so that the proportion of product quality in each cluster finally obtained meets the preset proportion;

Initializing classification for a number of designated samples in the sample set based on the same similarity, specifically including:

Determining a number of designated samples that have been selected from the sample set;

For each specified sample, select the K samples closest to the specified sample, and use the category label with the largest number of occurrences among the K samples as the category label of the specified sample;

Based on the different similarities and the category labels of the comparison samples determined in the initial classification, the category labels of the samples to be classified are obtained, specifically including:

Determine the category labels that appear in the comparison sample, and the number of occurrences of each category label that appears;

Among all the category labels, the category label with the least number of occurrences is selected as the category label of the sample to be classified.

The method according to claim 1, wherein a plurality of The prior samples are reclassified, including:

For each priori sample, select the nearest K classified samples from the classified samples as comparison samples;

Among all the category labels, the category label with the least number of occurrences is selected as the category label obtained after the prior sample is reclassified.

The method according to claim 1, wherein after determining the sample set to be clustered, the method further comprises:

Determining a sample capacity corresponding to the sample set;

The corresponding K value and the number of selected samples in the classification process are determined according to the sample capacity, wherein the ratio range of the K value to the sample capacity is a positive integer within [0.03, 0.09], the K value is an odd number, and the number of selected samples is higher than the maximum value of the K value selection range.

The method according to claim 1, wherein selecting K prior samples closest to the sample to be classified as comparison samples specifically comprises:

Calculate the distance between the sample to be classified and all other prior samples according to the number of dimensions contained in each sample in the sample set;

Select the K prior samples closest to each other as comparison samples.

The method according to claim 4, wherein calculating the distance between the sample to be classified and all other prior samples specifically comprises:

When the samples in the sample set are one-dimensional, the distance between the sample to be classified and the priori sample is obtained by d _i =| _xi - _x1 |, wherein d _i is the distance between the sample to be classified and the i-th priori sample, _xi is the coordinate of the i-th priori sample, and _x1 is the coordinate of the sample to be classified;

When the samples in the sample set are two-dimensional, Obtaining the distance between the sample to be classified and the priori sample, wherein d _i is the distance between the sample to be classified and the i-th priori sample, (x _i , y _i ) is the coordinate of the i-th priori sample, and (x ₁ , y ₁ ) is the coordinate of the sample to be classified;

When the samples in the sample set are three-dimensional, The distance between the sample to be classified and the priori sample is obtained, wherein d _i is the distance between the sample to be classified and the i-th priori sample, (xi _, _yi , z _i ) is the coordinate of the i-th priori sample, and ( _x1 , _y1 , _z1 ) is the coordinate of the sample to be classified.

A high entropy KNN clustering device based on post-correction, comprising:

at least one processor; and,

a memory communicatively coupled to the at least one processor;

The memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the following operations:

Determine a sample set that needs to be clustered, and initialize classification for a number of designated samples in the sample set based on the same similarity; wherein, in a pre-acquired data set, select a number of data as the sample set, and the number of data includes product data; the designated samples are samples with identifiable characteristics, and the identifiable characteristics include excellent product quality and defective products;

Based on the difference in similarity, and the category label determined by the comparison sample in the initial classification Label, and obtain the category label of the sample to be classified, specifically including:

A non-volatile computer storage medium storing computer executable instructions, wherein the computer executable instructions are configured to:

Determine the category labels that appear in the comparison sample, and the corresponding Number of occurrences;