
WO2018187948A1 - Local repairing method for machine learning model - Google Patents


Info

Publication number: WO2018187948A1
Authority: WO, WIPO (PCT)
Application number: PCT/CN2017/080172
Prior art keywords: data, machine learning, model, learning model, patch
Other languages: French (fr), Chinese (zh)
Inventor: 邹霞
Original Assignee: 邹霞

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N99/00: Subject matter not provided for in other groups of this subclass

Definitions

  • the present invention relates to a local repair method of a machine learning model, and belongs to the field of Internet search.
  • search engines have become an important tool for people to use Internet information resources.
  • search engines such as Google, Yahoo!, Bing, and Baidu
  • the relevance of query results has attracted more and more attention.
  • the pros and cons of sorting the results of the query have also become the main indicators for evaluating the search engine.
  • the user gives the keyword as a query request
  • the search engine queries its index database according to the user query and returns the ranked, relevance-analyzed retrieval results to the user, helping people filter out and ignore large amounts of irrelevant information and thereby serving as an information navigator.
  • the massive amount of information data means massive search results.
  • most search engine users browse only the first few pages of the returned results and rarely look at lower-ranked pages. Search results with strong relevance should be ranked higher, while weakly relevant results should be ranked lower. Sorting the query results by relevance is therefore one of the core problems of search engines, and the relevance ranking of search results has become an important indicator for evaluating search engine performance.
  • a multidimensional feature vector is used to represent the relevant attributes and information of each data pair (user query-query result). Extract some data pairs in the dataset and manually identify the relevance of the query results and user queries in each data pair.
  • the machine learning model is trained using the already identified data as a training data set, and the resulting machine learning model is used to predict the relevance of the unknown query and the query results.
  • the size of the feedback data set is much smaller than the original training data.
  • the learning model established by re-learning is mainly determined by the original training data set, so the space for performance improvement is very limited.
  • the purpose of the present invention is to modify the original machine learning model from a local perspective, to make up for the deficiencies of the retraining model, incremental learning, and the like, and to improve the performance of the machine learning model.
  • the present invention adopts the following technical solutions:
  • the present invention provides a local repair method for a machine learning model, comprising the following steps:
  • Step 1. Collecting and analyzing feedback data: collecting user feedback data and extracting the incorrectly predicted data samples
  • Step 2. Spatial transformation: converting the original data space to a new data space through scale learning; in the new data space, the distance between incorrectly predicted data samples is reduced as much as possible, and the distance between incorrectly predicted and correctly predicted data samples is increased as much as possible;
  • Step 3. In the new data space, learning from the incorrectly predicted data samples to establish a patch model, and defining the application scope of the patch model;
  • Step 4. In the new data space, learning from the incorrectly predicted data samples to establish a patch model, and defining the application scope of the patch model.
  • the user feedback data consist of a series of data pairs, and the result is evaluated by establishing a machine learning model that assesses the degree of relevance.
  • the spatial distance between incorrectly predicted data samples is reduced as much as possible, and the distance between incorrectly predicted and correctly predicted data samples is increased as much as possible.
  • step 3 after mapping the predicted error data set to the new feature space, a patch model is established on the learning data sample.
  • the process of establishing the patch model in the above step 3 is a training process of the supervised machine learning model.
  • the machine learning model is used to predict the ordering of the query results.
  • the local repair method of the machine learning model does not change the original learning model; it only learns the subspace of each local patch and the patch model from the incorrectly predicted data fed back by users. The original learning model and the generated patch model form a new learning model, which modifies the original machine learning model from a local perspective, makes up for the shortcomings of retraining, incremental learning, and similar methods, and improves the performance of the machine learning model.
  • the present invention provides a method for locally repairing a machine learning model.
  • the present invention is described in further detail in the following embodiments in order to make its objects, technical solutions, and effects clearer. The specific embodiments described herein merely illustrate the invention and are not intended to limit it.
  • in processing massive amounts of information, machine learning models, being automatic and fast, have been widely applied to many problems and have played a major role.
  • machine learning models, especially supervised ones, achieve ever higher prediction accuracy when supported by large amounts of training data.
  • the machine learning model has some drawbacks. Once the machine learning model is built, it is like a black box, only the input and output are visible. Even if you find data that predicts errors, you cannot adjust the original machine learning model. Moreover, no matter how powerful a machine learning model is, there is no guarantee that its prediction accuracy will be 100%. This requires constant adjustment of the original machine learning model based on the user's feedback data to continuously improve the prediction accuracy.
  • the machine learning model addressed by this embodiment is a collection of multiple decision trees.
  • a decision tree represents a submodel.
  • the weighted sum of the prediction results of all submodels is the final prediction result.
  • each user query corresponds to a collection of query results.
  • a feature vector is used to represent each query result.
  • the non-leaf node evaluates certain attributes of the query result and determines the path of the query result in the current decision tree according to the set threshold. When a leaf node is reached, the classification result of the query result is obtained.
  • the classification result is represented by a score.
  • the final result of the query result is obtained by weighted summation of the classification results of the query result on each decision tree.
  • the level of the score determines how relevant the query result is to the user's query. The higher the score, the stronger the relevance; the lower the score, the weaker the relevance.
  • the feedback data of the user are first collected, and the incorrectly predicted samples are extracted. These samples are used for training to establish a patch model that makes up for the defects of the original model. While the incorrectly predicted samples are corrected, it must be ensured that correctly predicted data are not adversely affected. Therefore, in the local repair method, not only does the patch model need to be established, but the scope in which the patch model applies also needs to be defined.
  • the sample space of the failed predictions is mapped into a new space by scale learning, and in the new data space the failed samples are clustered together as tightly as possible and kept far from the correctly predicted samples.
  • a patch model is created. The patch model is obtained by training on the data samples whose prediction failed. After the patch model is built, the area in which the patch model applies also needs to be defined.
  • the local repair method provided by this embodiment is mainly divided into the following four steps:
  • User feedback data D is composed of a series of data pairs.
  • the machine learning model's evaluation of a query result $d$ can be written $f(d)$. For any two data pairs $(\langle q, d_i \rangle, r_i)$ and $(\langle q, d_j \rangle, r_j)$, if $r_i < r_j$ and $f(d_i) > f(d_j)$, then they are regarded as a pair of incorrectly predicted data pairs.
  • the spatial transformation matrix is learned.
  • the feature space of the original data needs to be transformed.
  • the purpose of the spatial transformation is that, in the new feature space, the distance between incorrectly predicted data samples is as small as possible and the distance between incorrectly predicted and correctly predicted data samples is as large as possible. This minimizes the impact of the patch model on correctly predicted data, thereby preserving the prediction accuracy of the new machine learning model.
  • the two query results have different correlations with the user query, and the patch model is continuously updated by analyzing the two query results.
  • the calculation formula is as shown in (1):
  • the spatial transformation is applied first to obtain the new feature vector, and the new machine learning model then makes a prediction on that vector to obtain the final evaluation score.
  • the query results are sorted according to the scores.
  • This embodiment uses the method of scale learning to map the original data into a new feature space.
  • the objective function only considers data samples of prediction errors in the data space. This is because the model repair algorithm is mainly used to repair the data samples of the prediction errors, but it does not need to be processed for predicting the correct data samples. Moreover, in the user feedback data set, the size of the predicted error data sample is much smaller than the predicted correct data sample size. Considering only the data samples that predict errors will greatly improve the efficiency of the algorithm.

Landscapes

  • Engineering & Computer Science
  • Theoretical Computer Science
  • Physics & Mathematics
  • Computing Systems
  • General Engineering & Computer Science
  • General Physics & Mathematics
  • Mathematical Physics
  • Software Systems
  • Management, Administration, Business Operations System, And Electronic Commerce

Abstract

A local repairing method for a machine learning model, comprising: collection and analysis of feedback data: collecting user feedback data and extracting incorrectly predicted data samples; spatial transformation: converting an original data space to a new data space by means of scale learning, reducing the distance between the incorrectly predicted data samples as much as possible in the new data space, and increasing the distance between the incorrectly predicted data samples and the correctly predicted data samples as much as possible; learning incorrect data samples to establish a patch model in the new data space and defining an application range of the patch model; and learning the incorrect data samples to establish a patch model in the new data space and defining an application range of the patch model. The local repairing method for the machine learning model can improve the performance of the machine learning model.

Description

Title of Invention: Local Repair Method for a Machine Learning Model
Technical Field
[0001] The present invention relates to a local repair method for a machine learning model and belongs to the field of Internet search.
Background Art
[0002] With the rapid development of the Internet, search engines have become an important tool for using Internet information resources. With the rise and growth of search engines such as Google, Yahoo!, Bing, and Baidu, the relevance of query results has attracted increasing attention, and the quality of the result ranking has become a main indicator for evaluating a search engine.
[0003] With the rapid development and wide application of information technology, the Internet has flourished into the world's largest information resource and occupies an important place in people's lives. It has also become an important platform for sharing and exchanging information. Finding needed information in such a vast and disorderly body of resources is like looking for a needle in a haystack, and search engines address exactly this problem. A search engine is a tool, built on the Internet, that provides network information retrieval services, and search engines have become the most important application of Internet technology. The user supplies keywords as a query request; the search engine looks them up in its own index database and returns results that have been ranked and analyzed for relevance, helping people filter out large amounts of irrelevant information and thereby serving as an information navigator. A massive volume of information, however, means a massive number of search results. In practice, most search engine users browse only the first few pages of the returned results and rarely look at lower-ranked pages. Strongly relevant results should therefore be ranked near the top and weakly relevant results further down. Ranking query results by relevance is thus one of the core problems of search engines, and the relevance ranking of search results has become an important indicator of search engine performance.
[0004] In the search engine ranking problem, a multidimensional feature vector represents the attributes and information of each data pair (user query, query result). A subset of the data pairs is drawn from the data set, and the relevance of the query result to the user query in each pair is labeled manually. The labeled data serve as the training set for a machine learning model, and the trained model is then used to predict the relevance between unseen queries and query results. However, no matter how strong the theoretical foundation of a machine learning model is, occasional errors will always be found in use. Many causes can lead to prediction errors in application, such as noisy or extreme training data, an unstable data distribution, and defects in the model itself.
[0005] To improve the performance of a machine learning model, the usual practice is to keep collecting user feedback on erroneous predictions and use it as additional training data to build a new learning model. Yet the original model already performs well on most of the test data, and rebuilding it for a small amount of feedback greatly reduces search efficiency. Moreover, once a learning model has been built, modifying it is difficult.
Technical Problem
[0006] Researchers have proposed many solutions to the problem of repairing machine learning models. The most intuitive one is, once user feedback data have been obtained, to merge them with the original training data into a new training set and retrain to obtain a new machine learning model. This approach has two main problems:
[0007] 1. The feedback data set is far smaller than the original training data. The retrained model is still determined mainly by the original training set, so the room for performance improvement is very limited.
[0008] 2. Retraining a new machine learning model each time a small amount of user feedback arrives inevitably degrades search quality considerably, which users do not want.
Solution to Problem
Technical Solution
[0009] In view of the above deficiencies of the prior art, the object of the present invention is to provide a local repair method for a machine learning model.
[0010] The purpose of the present invention is to modify the original machine learning model from a local perspective, overcoming the shortcomings of methods such as retraining and incremental learning and improving the performance of the machine learning model. To achieve this purpose, the present invention adopts the following technical solution:
[0011] The present invention provides a local repair method for a machine learning model, comprising the following steps:
[0012] Step 1. Collecting and analyzing feedback data: collect user feedback data and extract the incorrectly predicted data samples;
[0013] Step 2. Spatial transformation: transform the original data space into a new data space by scale learning, such that in the new data space the distances among incorrectly predicted samples are as small as possible while the distances between incorrectly predicted and correctly predicted samples are as large as possible;
[0014] Step 3. In the new data space, learn from the incorrectly predicted samples to build a patch model, and define the scope of application of the patch model;
[0015] Step 4. In the new data space, learn from the incorrectly predicted samples to build a patch model, and define the scope of application of the patch model.
[0016] Preferably, in Step 1 the user feedback data consist of a series of data pairs, and results are evaluated by building a machine learning model that assesses the degree of relevance.
[0017] Preferably, in Step 2, in the new feature space the spatial distances among incorrectly predicted samples are made as small as possible, while the distances between incorrectly predicted and correctly predicted samples are made as large as possible.
[0018] Preferably, in Step 3, after the incorrectly predicted data set has been mapped into the new feature space, a patch model is built by learning from the data samples.
[0019] Preferably, the process of building the patch model in Step 3 is the training process of a supervised machine learning model.
[0020] Preferably, in Step 4, once N patch models have been obtained and their scopes defined, the machine learning model is used to predict the ranking of the query results.
Advantageous Effects of Invention
Advantageous Effects
[0021] Compared with the prior art, the local repair method provided by the present invention does not change the original learning model. It learns only the subspace of each local patch and the patch model itself from the incorrectly predicted data fed back by users. The original learning model and the generated patch models together form a new learning model. The original model is thus modified from a local perspective, which overcomes the shortcomings of retraining, incremental learning, and similar methods and improves the performance of the machine learning model.
Embodiments of the Invention
[0022] The present invention provides a local repair method for a machine learning model. To make its objects, technical solution, and effects clearer, the invention is described in further detail below with reference to embodiments. The specific embodiments described here serve only to explain the invention and do not limit it.
[0023] In processing massive amounts of information, machine learning models, being automatic and fast, have been widely applied to many problems and have played a major role. Machine learning models, especially supervised ones, achieve ever higher prediction accuracy when supported by large amounts of training data. They nevertheless have drawbacks. Once built, a machine learning model is like a black box: only its inputs and outputs are visible. Even when incorrectly predicted data are found, the original model cannot be adjusted. Moreover, no matter how powerful a model is, its prediction accuracy cannot be guaranteed to reach 100%. The original model therefore needs to be adjusted continually on the basis of user feedback data so that prediction accuracy keeps improving.
[0024] The machine learning model addressed by this embodiment is an ensemble of decision trees. Each decision tree is a sub-model, and the weighted sum of the sub-models' predictions is the final prediction. Each user query corresponds to a set of query results, and each query result is represented by a feature vector. Within a decision tree, each non-leaf node evaluates certain attributes of the query result and, by comparison with a set threshold, determines the path the query result takes through the tree; when a leaf node is reached, the classification result for the query result is obtained. The classification result is expressed as a score. The final result for a query result is the weighted sum of its classification results over all decision trees. The score determines how relevant the query result is to the user query: the higher the score, the stronger the relevance; the lower the score, the weaker the relevance.
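The weighted-sum scoring described in [0024] can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: the `Node` layout, the `<=` split convention, the tree weights, and the example feature values are all hypothetical and do not come from the specification.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """Hypothetical tree node; a leaf has `score` set and no children."""
    feature: int = 0
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    score: Optional[float] = None


def tree_score(node: Node, x: Sequence[float]) -> float:
    """Follow threshold tests from the root until a leaf score is reached."""
    while node.score is None:
        # Assumed convention: values <= threshold go left, others go right.
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.score


def ensemble_score(trees, weights, x) -> float:
    """Final score = weighted sum of the per-tree scores."""
    return sum(w * tree_score(t, x) for t, w in zip(trees, weights))


# Two hypothetical one-split trees over a 2-dimensional feature vector.
tree_a = Node(feature=0, threshold=0.5, left=Node(score=1.0), right=Node(score=4.0))
tree_b = Node(feature=1, threshold=2.0, left=Node(score=0.0), right=Node(score=2.0))
score = ensemble_score([tree_a, tree_b], [0.5, 0.5], [0.9, 1.0])
```

Here `tree_a` sends the sample to its right leaf and `tree_b` to its left leaf, so the ensemble score is the equally weighted sum of those two leaf scores.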
[0025] FIG. 1 shows the local repair method for a machine learning model provided by the present invention.
[0026] In the local repair method of this embodiment, user feedback data are first collected and the incorrectly predicted samples are extracted. These samples are used for training to build a patch model that compensates for the defects of the original model. While the incorrectly predicted samples are corrected, it must be ensured that correctly predicted data are not adversely affected. The local repair method therefore needs not only to build a patch model but also to define the scope within which the patch model applies.
[0027] In the user feedback data, only the incorrectly predicted data are of interest. The distributions of incorrectly and correctly predicted data are, however, intertwined, so a patch model that corrects errors can also adversely affect correctly predicted data. In this method, the sample space of the failed predictions is therefore first mapped into a new space by scale learning, so that in the new data space the failed samples cluster together as tightly as possible and lie far from the correctly predicted samples. After the incremental learning is complete, the patch model is built. The patch model is obtained by training on the data samples whose prediction failed. Once it has been built, the region in which it acts must also be defined.
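How a patch model with a limited scope might be combined with the unchanged original model can be sketched as below. The specification does not state how the scope is represented or how the two outputs are combined, so everything here is an assumption for illustration: the scope is taken to be a ball (center and radius) in the transformed space, and the patch is taken to override the original model inside that ball.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
# Hypothetical patch record: (center in transformed space, radius, patch scorer).
Patch = Tuple[Vector, float, Callable[[Vector], float]]


def patched_score(
    x: Vector,
    original_model: Callable[[Vector], float],
    transform: Callable[[Vector], Vector],
    patches: List[Patch],
) -> float:
    """Use a patch only when the transformed sample falls inside its scope."""
    z = transform(x)
    for center, radius, patch_model in patches:
        if math.dist(z, center) <= radius:
            return patch_model(z)
    # Outside every patch scope the original model is left untouched.
    return original_model(x)


# Toy stand-ins for the original model, the learned transform, and one patch.
original = lambda x: 1.0
identity = lambda x: list(x)
patches = [([0.0, 0.0], 1.0, lambda z: 3.0)]

inside = patched_score([0.5, 0.0], original, identity, patches)
outside = patched_score([5.0, 5.0], original, identity, patches)
```

A sample near the patch center receives the patch score, while a distant sample keeps the original model's score, which is the behavior [0026] and [0027] ask for.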
[0028] Specifically, the local repair method of this embodiment consists of the following four steps:
[0029] 1. Collect user feedback data and extract the incorrectly predicted data samples;
[0030] 2. Spatial transformation: transform the original data space into a new data space by scale learning, such that in the new data space the distances among incorrectly predicted samples are as small as possible while the distances between incorrectly predicted and correctly predicted samples are as large as possible;
[0031] 3. In the new data space, learn from the incorrectly predicted samples to build a patch model, and define the scope of application of the patch model;
[0032] 4. In the new data space, learn from the incorrectly predicted samples to build a patch model, and define the scope of application of the patch model.
[0033] First, the feedback data are analyzed. The user feedback data $D$ consist of a series of data pairs and can be defined mathematically as $D = \{(\langle q, d_i \rangle, r_i) \mid i = 1, 2, \ldots\}$, where $q$ denotes a user query, $d_i$ denotes a query result, and $0 \le r_i \le 5$ denotes the degree of relevance between the query result $d_i$ and the user query $q$; $r_i = 5$ is the strongest relevance and $r_i = 0$ the weakest. [Math.] indicates that the relevance between $d_i$ and $q$ is lower than the relevance between $d_j$ and $q$, in which case [Math.]. For a single data pair $(\langle q, d_i \rangle, r_i)$, judging in isolation whether its prediction is right or wrong is meaningless. Suppose the machine learning model is [Math., image imgf000006_0001], so that its evaluation of a query result $d$ can be written $f(d)$. For any two data pairs $(\langle q, d_i \rangle, r_i)$ and $(\langle q, d_j \rangle, r_j)$, if $r_i < r_j$ and $f(d_i) > f(d_j)$, then $(\langle q, d_i \rangle, r_i)$ and $(\langle q, d_j \rangle, r_j)$ are regarded as a pair of incorrectly predicted data pairs.
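The pairwise error criterion in [0033] ($r_i < r_j$ but $f(d_i) > f(d_j)$) can be sketched as a small filter over the feedback for one query. The function name, the `(d, r)` tuple layout, and the example scores are assumptions made for illustration only.

```python
from itertools import combinations


def mispredicted_pairs(samples, f):
    """Return (d_i, d_j) pairs with r_i < r_j but f(d_i) > f(d_j).

    `samples` is a list of (d, r) tuples for a single query q, and `f`
    is the model's scoring function; both layouts are hypothetical.
    """
    errors = []
    for (d_a, r_a), (d_b, r_b) in combinations(samples, 2):
        if r_a == r_b:
            continue  # equally relevant results cannot be mis-ordered
        # Order the pair so that `lo` carries the lower relevance label.
        lo, hi = (d_a, d_b) if r_a < r_b else (d_b, d_a)
        if f(lo) > f(hi):
            errors.append((lo, hi))
    return errors


# Hypothetical feedback for one query: relevance labels in the range 0..5.
samples = [("doc1", 5), ("doc2", 1), ("doc3", 3)]
model_scores = {"doc1": 0.8, "doc2": 0.9, "doc3": 0.5}
errors = mispredicted_pairs(samples, model_scores.get)
```

In this toy input the model scores `doc2` above both more relevant documents, so the two pairs involving `doc2` are flagged, while the correctly ordered pair `doc3`, `doc1` is not.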
[0034] Second, the spatial transformation matrix is learned. To prevent the patch model from affecting the set of correctly predicted samples, the feature space of the original data is transformed. The aim is that, in the new feature space, the distances among incorrectly predicted samples are as small as possible while the distances between incorrectly and correctly predicted samples are as large as possible. This keeps the patch model's effect on correctly predicted data as small as possible and so preserves the prediction accuracy of the new machine learning model.
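The goal stated in [0034] resembles a distance metric learning objective. The sketch below only evaluates such an objective for a given linear map; it does not learn the map. The objective itself (mean squared distance among error samples minus mean squared distance between error and correct samples) is an assumed stand-in, since this section does not give the actual objective function, and the matrices and sample points are toy values.

```python
from itertools import combinations, product


def transform(A, x):
    """Apply the linear map A (list of rows) to the vector x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def separation_objective(A, errors, correct):
    """Assumed objective (lower is better): error samples close to each
    other and far from correctly predicted samples after applying A."""
    e = [transform(A, x) for x in errors]
    c = [transform(A, x) for x in correct]
    within = [sq_dist(u, v) for u, v in combinations(e, 2)]
    between = [sq_dist(u, v) for u, v in product(e, c)]
    return sum(within) / len(within) - sum(between) / len(between)


# Toy data: two error samples that differ along axis 1, one correct sample.
errors = [[0.0, 0.0], [0.0, 2.0]]
correct = [[1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
candidate = [[2.0, 0.0], [0.0, 0.5]]  # stretch axis 0, shrink axis 1

before = separation_objective(identity, errors, correct)
after = separation_objective(candidate, errors, correct)
```

The candidate map shrinks the axis along which the error samples differ and stretches the axis that separates them from the correct sample, so `after` is lower than `before`.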
[0035] Third, the patch model is learned. After the mispredicted data set has been mapped into the new feature space, the data samples are learned in order to build the patch model. Building a patch model is in effect the training process of a supervised machine learning model.
[0036] Consider two different query results under the same user query that have different degrees of relevance to that query; the patch model is updated continually by analyzing such pairs. For a user query $q$, take any two query results $d_i$ and $d_j$ with different relevance labels $r_i$ and $r_j$. Define $g_i = g(\langle q, d_i \rangle)$ and $g_j = g(\langle q, d_j \rangle)$ as the evaluation scores given by the machine learning model, and let $P_{ij}$ denote the probability that the score $g_i$ of $d_i$ is greater than the score $g_j$ of $d_j$. $P_{ij}$ is computed as in (1):
[0037] [Equation (1): not recoverable from the source]
[0038] where the function symbol in (1) [lost in the source] denotes the sigmoid function. The ideal distribution of the probability $P_{ij}$ is defined as in (2):
[0039] [Equation (2): image imgf000008_0001 in the source]
[0040] Based on the probability functions above, the cross-entropy loss function is constructed as in (3):
[0041] [Equation (3): not recoverable from the source]
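Equations (1) to (3) are not legible in the source, so the sketch below substitutes the standard pairwise (RankNet-style) forms that match the surrounding description: a sigmoid of the score difference for the modeled probability, a 0/0.5/1 target for the ideal distribution, and binary cross-entropy for the loss. These forms are assumptions, not a reconstruction of the patent's own equations, and all function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pair_probability(g_i, g_j):
    """Assumed form of (1): probability that d_i outranks d_j."""
    return sigmoid(g_i - g_j)

def target_probability(r_i, r_j):
    """Assumed form of (2): 1 if d_i is more relevant, 0 if less, 0.5 if tied."""
    if r_i > r_j:
        return 1.0
    if r_i < r_j:
        return 0.0
    return 0.5

def pair_cross_entropy(g_i, g_j, r_i, r_j, eps=1e-12):
    """Assumed form of (3): cross-entropy between target and modeled probability."""
    p = pair_probability(g_i, g_j)
    t = target_probability(r_i, r_j)
    # eps guards against log(0) when the sigmoid saturates.
    return -(t * math.log(p + eps) + (1.0 - t) * math.log(1.0 - p + eps))


# A pair ordered correctly by the scores costs less than the same pair
# ordered incorrectly.
loss_correct = pair_cross_entropy(2.0, 0.0, r_i=5, r_j=1)
loss_wrong = pair_cross_entropy(0.0, 2.0, r_i=5, r_j=1)
```

Under these assumed forms the loss depends only on the score difference, so minimizing it over mispredicted pairs pushes the patch model to reverse exactly those orderings.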
[0042] From the formula above it follows that when a data sample is far from the center point of the patch model, the sample has no influence on the parameters of the patch model. This ensures that the patch model acts mainly on mispredicted samples and does not affect correctly predicted samples.
[0043] Finally, the patch model is applied. After N patch models have been obtained and the scope of each patch model has been defined, the new machine learning model [equation image imgf000008_0002 in the source] is used to predict the ranking of query results. For a user query, take one of its query results and the corresponding feature vector (the symbols are lost in the source). The feature vector is first transformed into the new feature space, and the new machine learning model then scores the transformed vector to produce the final evaluation score. Once all query results have been scored, they are sorted by score. Adding patch models repairs the erroneous samples found in the user feedback data locally, without affecting correct samples during the repair, and thereby improves the performance and prediction accuracy of the machine learning model.
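The combined model's formula is an image in the source, so the sketch below assumes one plausible form: the base model's score plus each patch's correction, weighted by a Gaussian kernel around the patch center in the transformed space. The kernel choice, the `(center, radius, patch_model)` tuple layout, and all names are assumptions made for illustration.

```python
import numpy as np

def patched_score(x, base_model, L, patches):
    """Score one feature vector with the base model plus local patches.

    patches: list of (center, radius, patch_model) tuples, all hypothetical;
             center lives in the transformed space, and patch_model maps a
             transformed vector to a score correction.
    """
    z = L @ x  # move into the transformed feature space
    total = base_model(x)
    for center, radius, patch_model in patches:
        # Gaussian weight: close to 1 near the center, close to 0 far away.
        weight = np.exp(-np.sum((z - center) ** 2) / (2.0 * radius ** 2))
        total += weight * patch_model(z)
    return float(total)

def rank_results(X, base_model, L, patches):
    """Return result indices sorted by descending patched score."""
    scores = [patched_score(x, base_model, L, patches) for x in X]
    return sorted(range(len(X)), key=lambda i: scores[i], reverse=True)


# Toy example (hypothetical): the base model scores by the first feature only,
# and a single patch boosts results near the point (0, 1).
base = lambda x: float(x[0])
L = np.eye(2)
patches = [(np.array([0.0, 1.0]), 0.5, lambda z: 2.0)]
X = np.array([[1.0, 0.0], [0.0, 1.0], [3.0, 9.0]])

order_patched = rank_results(X, base, L, patches)
order_base = rank_results(X, base, L, [])
```

In this toy case the patch lifts the second result above the first, while the third result, which lies far from the patch center, keeps its base score. That is the locality property described in paragraph [0042].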
This embodiment uses metric learning to map the original data into a new feature space. The objective function considers only the mispredicted data samples in the data space, because the model repair algorithm exists to repair mispredicted samples and correctly predicted samples need no processing. Moreover, in a user feedback data set the mispredicted samples are far fewer than the correctly predicted ones, so considering only the mispredicted samples greatly improves the efficiency of the algorithm.
[0045] Compared with the prior art, the local repair method for a machine learning model provided by the present invention does not change the original learning model. It uses only the mispredicted data fed back by users to learn the subspace of each local patch and the patch model itself, and the original learning model together with the generated patch models forms a new learning model. Modifying the original machine learning model from a local perspective makes up for the shortcomings of retraining and incremental learning methods and improves the performance of the machine learning model.
[0047] It is to be understood that a person of ordinary skill in the art may make equivalent substitutions or changes based on the technical solution and inventive concept of the present invention, and all such changes or substitutions shall fall within the scope of protection of the claims appended to the present invention.

Claims

[Claim 1] A local repair method for a machine learning model, characterized in that the local repair method comprises the following steps:
Step 1, collecting and analyzing feedback data: collect user feedback data and extract the mispredicted data samples;
Step 2, spatial transformation: use metric learning to transform the original data space into a new data space in which the distances between mispredicted data samples are as small as possible, while the distances between mispredicted and correctly predicted data samples are as large as possible;
Step 3, in the new data space, learn from the mispredicted data samples to build a patch model, and define the scope of application of the patch model;
Step 4, in the new data space, learn from the mispredicted data samples to build a patch model, and define the scope of application of the patch model.
[Claim 2] The local repair method for a machine learning model according to claim 1, characterized in that: in step 1, the user feedback data consists of a series of data pairs, and the results are evaluated by building a machine learning model that assesses the degree of relevance.
[Claim 3] The local repair method for a machine learning model according to claim 1, characterized in that: in step 2, in the new feature space, the distances between mispredicted data samples are as small as possible, while the distances between mispredicted and correctly predicted data samples are as large as possible.
[Claim 4] The local repair method for a machine learning model according to claim 1, characterized in that: in step 3, after the mispredicted data set has been mapped into the new feature space, the data samples are learned to build the patch model.
[Claim 5] The local repair method for a machine learning model according to claim 4, characterized in that: the process of building the patch model in step 3 is the training process of a supervised machine learning model.
[Claim 6] The local repair method for a machine learning model according to claim 1, characterized in that: in step 4, after N patch models have been obtained and the scope of each patch model has been defined, the machine learning model is used to predict the ranking of query results.
PCT/CN2017/080172 2017-04-12 2017-04-12 Local repairing method for machine learning model WO2018187948A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/080172 WO2018187948A1 (en) 2017-04-12 2017-04-12 Local repairing method for machine learning model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/080172 WO2018187948A1 (en) 2017-04-12 2017-04-12 Local repairing method for machine learning model

Publications (1)

Publication Number Publication Date
WO2018187948A1 true WO2018187948A1 (en) 2018-10-18

Family

ID=63793095

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/080172 WO2018187948A1 (en) 2017-04-12 2017-04-12 Local repairing method for machine learning model

Country Status (1)

Country Link
WO (1) WO2018187948A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120263376A1 (en) * 2011-04-12 2012-10-18 Sharp Laboratories Of America, Inc. Supervised and semi-supervised online boosting algorithm in machine learning framework
CN103150578A (en) * 2013-04-09 2013-06-12 山东师范大学 Training method of SVM (Support Vector Machine) classifier based on semi-supervised learning
CN104572998A (en) * 2015-01-07 2015-04-29 北京云知声信息技术有限公司 Updating method and device of question answer sequencing model for automatic question answer system
US20160162779A1 (en) * 2014-12-05 2016-06-09 RealMatch, Inc. Device, system and method for generating a predictive model by machine learning
CN106548210A (en) * 2016-10-31 2017-03-29 腾讯科技(深圳)有限公司 Machine learning model training method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17905755

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS (EPO FORM 1205A DATED 21.02.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 17905755

Country of ref document: EP

Kind code of ref document: A1
