
WO2010060746A2 - Procédé et dispositif d'analyse automatique de modèles - Google Patents

Procédé et dispositif d'analyse automatique de modèles (Method and device for the automatic analysis of models)

Info

Publication number
WO2010060746A2
WO2010060746A2 (application PCT/EP2009/064476)
Authority
WO
WIPO (PCT)
Prior art keywords
model
training
linear model
automatically
kernel
Prior art date
Application number
PCT/EP2009/064476
Other languages
German (de)
English (en)
Other versions
WO2010060746A3 (fr)
Inventor
Klaus-Robert MÜLLER
Timon Schroeter
Katja Hansen
Original Assignee
Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V.
Technische Universität Berlin
Priority date
Filing date
Publication date
Application filed by Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. and Technische Universität Berlin
Priority to DE112009002693T (patent DE112009002693A5)
Publication of WO2010060746A2
Publication of WO2010060746A3

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C 20/00: Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C 20/30: Prediction of properties of chemical compounds, compositions or mixtures
    • G16C 20/70: Machine learning, data mining or chemometrics
    • G16C 20/80: Data visualisation

Definitions

  • Non-linear models of real data are used to make predictions. These models are so complex that they can hardly be investigated by analytical methods. The user of a non-linear model is therefore confronted with a kind of black box: he does not know which factors have found their way into the model and are essential for the prediction in a particular case.
  • Fig. 2 is a schematic representation of training and model generation and analysis
  • Fig. 3 is an illustration of the effect of reducing the kernel width on the ability to generalize
  • Fig. 4 is an illustration of an example of toxic and non-toxic compounds
  • Fig. 6 shows an example of the use of the method
  • Fig. 7 shows an example of a use of the method in connection with two test molecules
  • Tab. 1 shows an example calculation of the percentage that each training molecule contributes to the prediction for test molecule 1;
  • Tab. 2 shows an example calculation for the percentage that each training molecule contributes to the prediction for test molecule 2.
  • Machine learning is understood here to mean supervised machine learning, which is known per se.
  • an automated machine learning process recognizes laws and correlations in a training set that allow statements about properties of a new object.
  • One possible example, discussed below, is the automatic generation and analysis of a non-linear model for predicting the toxicity of chemical molecules.
  • By a prediction is meant an estimate indicating for each chemical molecule its toxicity in the form of a number. If this number exceeds a threshold determined in the training process, the prediction can be interpreted as "toxic"; otherwise the prediction is "non-toxic". Gaussian processes can be trained so that the output can be interpreted directly as a probability, e.g. "72 out of a hundred molecules with these characteristics are toxic".
  • By a feature is meant a descriptive quantity calculated, e.g., from the structural formula of a molecule, such as its size, weight, surface area, the number of certain functional groups in the molecule, etc. In chemoinformatics, these features are synonymously referred to as descriptors.
  • The non-linear model is not only created and used for prediction; it is also automatically analyzed in order to determine at least one major influencing factor (e.g., the main reason why the model predicts toxicity) and to enable further use of that information.
  • The methods and apparatus are also suitable for other data, models, and objects, such as the model-based control of a chemical plant. Examples of this are also given.
  • The resulting models have the special feature of providing additional information at an interface that facilitates the user's understanding, so that this information can subsequently be visualized, for example, in the form of a ranking list or in the form of a graphic.
  • When reference is made to a model below, this may be, for example, a computer program, i.e. a formalized sequence of logical commands.
  • A model can also be represented by mathematical relationships and/or tables of values.
  • Fig. 1 shows a basic form of a method by which a non-linear model (e.g., a computer program, a look-up table, a mathematical model, etc.) can be created and automatically analyzed.
  • The model may, for example, be given as a non-linear model or be generated as part of a machine learning procedure. For the example according to Fig. 1, the latter case is assumed.
  • the aim is to obtain a model with which properties (for example toxicity, control behavior) of an a priori uncharacterized object (eg molecule, chemical plant) can be predicted and at the same time the influencing variables of the model are determined.
  • A property will hereinafter be understood to mean in each case a measured or measurable property of a chemical compound, e.g. its water solubility or toxicity.
  • The non-linear model is automatically formed by a machine learning method from a multiplicity of known training objects in such a way that it allows a statement about at least one property for at least one object.
  • the automatically determined non-linear model allows a binary statement, toxic - non-toxic.
  • This statement can relate, for example, to compliance with or violation of a specific metric or quality characteristic.
  • An analysis means is used in the second method step 200 to automatically determine at least one measure that indicates which training object or objects that have become part of the non-linear model have the greatest influence on the non-linear model.
  • Influence is understood here as the size of the normalized coefficient of a training object in the linear combination with which the prediction can be calculated.
  • a quantitative example is given in connection with FIG. 7 and Tables 1 and 2.
  • The analysis means uses a special property of the automatically determined non-linear model to determine this measure.
  • This measure can be determined, for example, with the aid of the representer theorem (or a mathematically equivalent formulation of the prediction function), whose core message is that the prediction of kernel-based models can be formulated mathematically as a linear combination, which will be explained in more detail below (see also the sketch after the next item).
  • The validity of the representer theorem is a well-known property of a kernel-based non-linear model that is determined by a machine learning method.
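  • For orientation only, one common reading of this linear-combination form and of the influence measure, written in LaTeX notation (the symbols k, x_i, x_new and α follow the notation introduced further below; the normalization by the sum of contributions is an assumption consistent with the worked example in Tables 1 and 2):

      f(x_{\mathrm{new}}) = \sum_{i=1}^{n} \alpha_i \, k(x_i, x_{\mathrm{new}}), \qquad
      \text{influence}_i = 100\,\% \cdot \frac{\lvert \alpha_i \, k(x_i, x_{\mathrm{new}}) \rvert}{\sum_j \lvert \alpha_j \, k(x_j, x_{\mathrm{new}}) \rvert}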
  • a ranking data record is then automatically created, in which the measures are arranged according to a predetermined condition.
  • the following describes a concrete embodiment, namely the automatic analysis of a model for the prediction of the toxicity of chemical molecules.
  • Those skilled in the art will recognize that other properties of chemical molecules, such as water solubility, metabolic stability, binding to certain receptors, etc., can be similarly analyzed.
  • a special feature of the resulting models is that the predictions are comprehensible in terms of content due to the automatic quantitative evaluation of influencing variables.
  • additional information is provided at an interface, so that these can be visualized in the following or used in the context of a model reduction. Optimization aids can be used for model reduction, which will be explained in more detail below.
  • the relationship between the structure of the molecules and their toxicity is determined from a training set of data.
  • The features (here the number of different functional groups or other molecular constituents) and the measured toxicity of a set of chemical molecules serve as the training set for a supervised machine learning method.
  • Toxicity is the property of a substance to harm an organism. The damage can affect the whole organism or one of its substructures such as organs, cells or DNA.
  • In the embodiment, the toxicity considered is genotoxicity, measured as Ames mutagenicity.
  • The method is by no means limited to this measure. Alternatively, measurements such as the micronucleus test or chromosome aberration can also be used.
  • In the described embodiment, the training is performed by a program called "ToxTrain". This program contains an implementation of a supervised machine learning method, namely a Gaussian process known per se.
  • The result of the training is a program "ToxExplain", which contains the learned relationship as a model and can thus generate toxicity predictions for new molecules.
  • The embodiment also determines so-called explanatory components and optimization aids, which can be made available via an interface, e.g. for visualization or model reduction.
  • The model can be a stand-alone program, can be output as a module or plug-in for existing software in a company, or can be implemented in hardware form, i.e. on a chip.
  • The procedure, which will be described in detail below, automatically provides explanatory components that characterize the non-linear model, which is itself also generated automatically.
  • The presentation remains clear, and it is understandable how the prediction of the model comes about.
  • By automatically determining the explanatory components, i.e. the parts of the model that have a great influence on the predictions, a reduction of the complexity of the non-linear model can easily be obtained automatically.
  • A model that can automatically identify such explanatory components is called explanatory.
  • By an ordered list (i.e., the list elements are each assigned a measure of the feature) is meant here a list of those features of a molecule on which the prediction of the model most depends.
  • Such a list makes it possible, for example, to design variations of the molecule that are less toxic than the parent molecule.
  • Fig. 2 describes the training, in which a training set is examined with the aid of the program ToxTrain. This automatically generates the program ToxExplain, which not only makes predictions possible but also has explanatory components. This is shown in the lower part of Fig. 2.
  • the models of type ToxExplain explain their predictions and provide optimization aids and can then be further processed, e.g. for model reduction or visualization of the ranking of influencing factors.
  • The kernel function in kernel-based learning has the task of implicitly transforming the features of two objects (e.g., chemical molecules) into a very high-dimensional feature space and calculating the scalar product there. Since the kernel function can perform a non-linear transformation, any linear learning method in which the features of the objects (here: molecules) appear exclusively in the form of scalar products can, by using a suitable kernel function, be generalized so that it can be used for learning non-linear relationships.
  • Examples of kernel functions are the RBF kernel (synonyms: radial basis function kernel, Gaussian kernel, squared exponential kernel), the polynomial kernel, and graph or tree kernels; a small sketch of the RBF kernel follows below.
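  • A minimal Python sketch of an RBF kernel and of the corresponding kernel matrix, given purely for illustration (the function names and the width parameter sigma are assumptions, not taken from the patent):

      import numpy as np

      def rbf_kernel(x, y, sigma=1.0):
          """RBF (Gaussian / squared exponential) kernel between two feature vectors."""
          d = np.asarray(x, float) - np.asarray(y, float)
          return np.exp(-np.dot(d, d) / (2.0 * sigma ** 2))

      def kernel_matrix(X, sigma=1.0):
          """Complete kernel matrix K[i, j] = k(x_i, x_j) of a training set X (n x d)."""
          X = np.asarray(X, float)
          sq = np.sum(X ** 2, axis=1)
          d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T  # squared pairwise distances
          return np.exp(-d2 / (2.0 * sigma ** 2))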
  • An example of a machine learning method is a Gaussian process that can be used to generate models that, in addition to predictions, also output the variance of the respective prediction.
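  • A compact numpy sketch of Gaussian-process regression with predictive mean and variance, following the standard GP-regression equations (the noise level and the reuse of rbf_kernel/kernel_matrix from the sketch above are assumptions for illustration):

      import numpy as np

      def gp_fit(X, y, sigma=1.0, noise=0.1):
          """Precompute the GP weight vector alpha = (K + noise^2 I)^-1 y."""
          K = kernel_matrix(X, sigma)              # kernel matrix from the sketch above
          Ky = K + noise ** 2 * np.eye(len(K))
          alpha = np.linalg.solve(Ky, np.asarray(y, float))
          return alpha, Ky

      def gp_predict(x_new, X, alpha, Ky, sigma=1.0):
          """Predictive mean and variance for a new feature vector x_new."""
          k_star = np.array([rbf_kernel(x_new, x_i, sigma) for x_i in X])
          mean = k_star @ alpha                    # linear combination over the training objects
          var = rbf_kernel(x_new, x_new, sigma) - k_star @ np.linalg.solve(Ky, k_star)
          return mean, var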
  • Gaussian processes were originally developed for the regression of data, but can also be used for classification.
  • The support vector machine is a machine learning method that was originally developed for the classification of data, but can also be used for regression.
  • Classification is the construction of a model for properties that can be expressed by categories or membership of groups.
  • Molecules are, e.g., "mutagenic" or "not mutagenic". This is contrasted with regression, in which a model is constructed for properties that can be expressed by real numbers, e.g. the strength of the binding of a molecule to a receptor protein. Toxicity can also be given in the form of real numbers.
  • The models resulting from the kernel-based learning process can, based on various features of new molecules, generate predictions for previously unobserved or unmeasured properties of these new molecules, i.e. predict, for example, their toxicity.
  • the more observed / measured data from the past is available as a training set, the better a given relationship can be modeled and the more accurate the predictions for previously unseen molecules become.
  • Due to their high performance, statistical learning methods of this kind are already used in many fields. However, they have a decisive disadvantage: For the user of such a model, it is generally not comprehensible how the prediction comes about in a specific individual case.
  • By a series is meant a group of chemical compounds having the same basic scaffold but differing in which functional group is present at a particular position, how long a particular side chain is, etc.
  • The method and the device make it possible to identify, from the training set, a few molecules relevant to the respective prediction. These are referred to below as explanatory components.
  • In the past, attempts have been made by various research groups to estimate the reliability of predictions taking the training set into account. However, the previous strategies are independent of the learning algorithm and therefore not adapted to its specifics. Only the close coupling or integration of the determination of the explanatory components with the learning algorithm makes it possible to identify the molecules on which the prediction really depends.
  • The most important features for the respective prediction are automatically identified with the help of the method and serve as optimization aids. In this way, the features on which the toxicity of each molecule depends the most are determined automatically. The most important features are determined locally, which will be explained in more detail later: the local gradients are determined and then used as optimization aids.
  • a gradient is a differential operator that can be applied to a scalar field.
  • The term is also used synonymously for the vector whose elements are the partial derivatives of a function with respect to all of its variables.
  • Here, the gradient is understood to mean the vector of partial derivatives of the prediction of a model for a specific molecule with respect to its features.
  • The gradients can be calculated directly analytically.
  • Alternatively, the gradients are calculated using a differentiable density estimator (e.g., a Parzen window) that is closely matched to the prediction function, so that the gradients of the density estimator can be regarded as an approximation of the gradients of the prediction function.
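  • For the RBF kernel, such a gradient can be written down analytically; a sketch in the notation used here (the kernel width σ is the hyperparameter discussed below, and this specific form is given as an assumption for illustration):

      \frac{\partial f(x_{\mathrm{new}})}{\partial x_{\mathrm{new},j}}
        = \sum_i \alpha_i \,\frac{\partial k(x_{\mathrm{new}}, x_i)}{\partial x_{\mathrm{new},j}}
        = \sum_i \alpha_i \, k(x_{\mathrm{new}}, x_i)\,\frac{x_{i,j} - x_{\mathrm{new},j}}{\sigma^2}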
  • Generalization ability is understood here as the ability of a model to produce accurate predictions for molecules that are not included in its training set.
  • In the upper part of Fig. 3, the number of explanatory components is plotted as a function of the kernel width.
  • Many kernel functions, including the RBF kernel, have a hyperparameter called the kernel width.
  • The kernel width controls whether predictions of a model depend in each case only on the properties of molecules which are closely adjacent to the new molecule in the feature space or whether more distant molecules are also taken into account.
  • The kernel width becomes smaller from left to right, and the number of explanatory components decreases. That is, the prediction for a new molecule relies on fewer and fewer molecules from the training set. If a prediction relies on very few (e.g. five) molecules, they can be visualized in a clear way. Visualization enables human experts to understand predictions and assess their reliability. A model that can provide the necessary information via an interface is called explanatory. The quantitative treatment of the ability to explain will be described further below.
  • A novelty is that the kernel width learned by the Gaussian process (left vertical line) is subsequently reduced (see Fig. 5).
  • This results in a slightly increased mean error, i.e. a slight degradation of the model.
  • the different crosses in the curves represent different measures for the number of explanatory components or the error (median, mean etc.) and show uniform trends.
  • Fig. 3 shows the relationship between the generalizability of the model and its kernel width.
  • In the lower part, the kernel width is plotted on the same scale as in the upper part, i.e. it decreases from left to right.
  • Generalizability is measured by the mean error the model makes in predicting new molecules. This was determined for various kernel widths with a test set of molecules that were not considered in the training of the model. In the left half (relatively large kernel width), the mean error for new molecules is small. If the kernel width is reduced, the mean error increases (right half).
  • A test set for a supervised machine learning process consists of molecules that were not used in training the model.
  • The optimization of the kernel width is part of the normal training process for Gaussian processes. However, this optimization is basically carried out with the aim of achieving the lowest possible mean error for new molecules. This optimum is symbolized in Fig. 3 by the left vertical line.
  • The kernel width is automatically reduced (see Fig. 5) to obtain a clear number of explanatory components.
  • The ability to generalize deteriorates measurably, but not severely. This means that a compromise between the ability to generalize and the ability to explain can be achieved, i.e. the user can use ToxTrain to generate models from his own datasets which can be explained and nevertheless generalize relatively well.
  • the present embodiment allows the influence of certain features to be determined locally.
  • Fig. 4 illustrates this relationship.
  • In Fig. 4, steroids are shown as circles and non-steroids as squares.
  • The steroids are located in the lower left corner of the quadrant, the non-steroids in the upper right corner.
  • Non-steroids containing an epoxy group are usually toxic (hatched symbols), while steroids containing an epoxy group may be both toxic and non-toxic (hatched and non-hatched symbols).
  • The epoxy group is an important feature in terms of the toxicity of the particular compound. In the local neighborhood of the steroids, however, this globally obtained information is misleading. This example shows that considering the local environment can be essential for generating optimization aids.
  • The optimization aids would therefore not include the epoxy group as a criterion for toxic steroids but would instead name features relevant for the particular molecule. For toxic non-steroids, however, the optimization aids would certainly include the epoxy group as a toxicity-relevant feature.
  • The problem can be solved better with the program ToxTrain. All available data can be used as the training set.
  • The resulting model ToxExplain always generates its optimization aids for each new molecule from the local gradient of the prediction with respect to the features of the molecule. In this way, the user receives targeted optimization help that is extracted from all of the available data.
  • The prediction f_new for a new molecule is calculated as the linear combination f_new = Σ_i α_i · k(x_new, x_i), where α denotes the weight vector.
  • K_{i,j} = k(x_i, x_j) denotes the complete kernel matrix of the training set.
  • k(x_new, x_i) denotes the kernel function between the features x_new of the new molecule and the features x_i of the respective molecule i from the training set.
  • Kernel-based methods differ in how the elements α_i of the weight vector are determined.
  • The above expression for the weight vector relates to a Gaussian process; in principle, other kernel-based methods are also possible for implementing the method.
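  • For a Gaussian process, this weight vector has the standard textbook form given below (stated here as an assumption for orientation, not quoted from the patent; σ_n denotes the noise level and y the vector of training labels):

      \alpha = (K + \sigma_n^2 I)^{-1} \, y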
  • x_i and x_new are vectors, and the partial derivative is formed with respect to their j-th component.
  • The partial derivatives together then form the local gradient of f_new with respect to the features of the new molecule and form the basis for the calculation of the optimization aids by the program ToxExplain. Those skilled in the art will recognize that the same approach can be used with other features to determine properties other than toxicity.
  • the determination of the partial derivatives also allows the automatic determination of optimization aids for other features and thus a possibility for better model reduction.
  • In order to carry out the model reduction, the optimization aids are calculated for all molecules in the respective training set, i.e. for each molecule one obtains the sensitivity with respect to each feature (measured in percent). Then, for each feature, the average magnitude that the sensitivity for this feature reaches across all molecules is calculated. The list of features can now be sorted by this average magnitude and thus converted into a ranking list (a sketch follows below).
  • For the reduced model, (exclusively) the features at the head of the feature list generated in this way can now be used.
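  • A minimal Python sketch of this ranking step (the per-molecule sensitivities are assumed to be available, e.g. from the analytic gradient above; the function name and the cut-off are illustrative assumptions):

      import numpy as np

      def feature_ranking(sensitivities):
          """sensitivities: array of shape (n_molecules, n_features), in percent.
          Returns feature indices sorted by average absolute sensitivity, most important first."""
          avg = np.mean(np.abs(np.asarray(sensitivities, float)), axis=0)
          return list(np.argsort(avg)[::-1]), avg

      # Usage sketch: keep only the top features for a reduced model
      # ranking, avg = feature_ranking(S)
      # top_features = ranking[:10]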
  • A flowchart for a run of the program ToxTrain is shown in Fig. 5.
  • A data set specified by the user is first loaded.
  • A Gaussian process is then trained in method step 2, i.e., using a machine learning algorithm known per se, the relationship between the molecular structure of the chemical compounds contained in the data record and their toxicity is learned.
  • Part of this training process is the internal optimization of the evidence. This is a mathematical function that is used as a criterion in various methods of machine learning to optimize parameters. In Gaussian processes, the fit of the predictions and of the predicted variances is thereby taken into account equally.
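  • For Gaussian-process regression, this evidence is the log marginal likelihood; its standard textbook form is given below for orientation (with noise level σ_n and n training points; the formula is quoted from the general literature, not from the patent):

      \log p(y \mid X, \theta)
        = -\tfrac{1}{2}\, y^{\top} (K + \sigma_n^2 I)^{-1} y
          - \tfrac{1}{2}\, \log \lvert K + \sigma_n^2 I \rvert
          - \tfrac{n}{2}\, \log 2\pi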
  • The goodness of fit describes how closely a model is adapted to its training set.
  • Underfitting means adapting a model not closely enough, i.e. not making it complex enough, such as when trying to represent a non-linear relationship by a straight line.
  • Overfitting means too tight an adaptation of the function to the training set, so that exact predictions are obtained for all molecules from the training set, but only very inaccurate predictions are achieved for new molecules (poor generalization ability).
  • In this way, a value for the kernel width is automatically determined which is optimal with regard to the expected generalization capability. This model is referred to below as GP_gen.
  • A model is subsequently trained on the entire input data record (method step 10).
  • This second model is denoted GP_fit.
  • Both models, GP_gen and GP_fit, are stored (step 11). Together they form an explanatory overall model of the type ToxExplain.
  • In the GP_fit model, the function is generally less smooth than that of the GP_gen model, and the local gradients are less useful as optimization aids. Both models are therefore saved so that the program ToxExplain can determine both good predictions and helpful optimization aids.
  • Process step 2: A data record is loaded. This contains the following information for a number of chemical compounds:
  • Process step 3: Using the entire data set from process step 2, a Gaussian process model is trained. In the process, all model parameters are automatically adjusted using the gradients of the evidence function so that the evidence is maximized.
  • This parameter estimation or model selection strategy is state of the art in machine learning and generally leads to models that generalize well.
  • Process step 4: The molecules from the data set obtained in method step 2 are randomly and independently separated t times into non-overlapping sub-data sets, which are referred to below as training and test sets. It makes sense to use at this point the cross-validation strategy known per se, with at least 10 repetitions.
  • A 5-times repeated 3-fold cross-validation means that the molecules in the dataset are randomly distributed into three equal parts (folds). Subsequently, two of these folds are used as a training set, i.e. a model is trained on their basis. This model is used to generate predictions for the third fold. In the same way, folds 1+3 and 2+3 combined are used as training sets, and the resulting models are used to generate predictions for the remaining fold 2 or 1, respectively. Predictions have then been made for the entire dataset, using for each prediction a model that did not have that molecule in its training set (a sketch of such a split follows below).
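  • A minimal Python sketch of such a repeated k-fold split, with no claim to match the patent's implementation (the repetition and fold counts are simply the example's values):

      import numpy as np

      def repeated_kfold_indices(n_molecules, n_folds=3, n_repeats=5, seed=0):
          """Yield (train_idx, test_idx) pairs for a repeated k-fold cross-validation."""
          rng = np.random.default_rng(seed)
          for _ in range(n_repeats):
              perm = rng.permutation(n_molecules)
              folds = np.array_split(perm, n_folds)
              for i in range(n_folds):
                  test_idx = folds[i]
                  train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
                  yield train_idx, test_idx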
  • Process step 5: With each of the training sets generated in method step 4, a Gaussian process is trained. However, unlike usual, no internal optimization of all parameters is made; instead, the kernel width is excluded from this optimization. The kernel width determined in step 3 (or, from the 2nd run of the loop onwards, the reduced kernel width) is adopted and not further adapted.
  • Method step 6: The models trained in method step 5 are now used to generate predictions for the test sets belonging to the respective training sets from method step 4. That is, the toxicity is predicted for all molecules in the respective test set, and the mean error of these predictions is determined.
  • Method step 7: For all predictions made in method step 6, the explanatory components are determined. That is, those molecules i from the respective training set are determined which together contribute more than 80% to the respective prediction f(x_new).
  • The prediction is formed by means of the representer theorem, which applies to all kernel-based methods:
  • The running index i runs over all molecules in the training set.
  • The quantities a, b_i, and c_i are (depending on the learning algorithm) various local and global parameters.
  • k(x_i, x_n) is the kernel function between the features x_i of one molecule of the training set and the features x_n of the new molecule.
  • The contributions to the sum in equation (4) are sorted by size, and contributions are added from the head of the list until the subtotal of the contributions reaches 80% (a sketch of this selection follows below).
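  • A minimal Python sketch of this 80% selection; the contribution of training molecule i is taken here to be its weighted kernel term, as in the worked example with Tables 1 and 2 below (equation (4) itself is not reproduced in this excerpt, so this specific form of the contributions is an assumption):

      import numpy as np

      def explanatory_components(alpha, k_new, threshold=0.80):
          """alpha: weight entries; k_new: kernel values k(x_new, x_i) for all training molecules.
          Returns the indices of the training molecules that together contribute >= threshold,
          plus the percentage share of every training molecule."""
          contrib = np.abs(np.asarray(alpha, float) * np.asarray(k_new, float))
          share = contrib / contrib.sum()          # fraction contributed by each training molecule
          order = np.argsort(share)[::-1]          # sort contributions by size, largest first
          cumulative = np.cumsum(share[order])
          n_needed = int(np.searchsorted(cumulative, threshold) + 1)
          return order[:n_needed], 100.0 * share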
  • a measure is calculated, on the basis of which a ranking data record is automatically created.
  • the measures of the influencing factors can then be arranged according to a predetermined condition.
  • Method step 8: If more than 50% of the predictions from step 6 can be explained by five or fewer explanatory components (i.e., five or fewer training objects together provide 80% of the contribution to the prediction in step 7), the procedure continues with step 10. Otherwise, step 9 follows, i.e. a return to step 5.
  • Method step 9: Since too many explanatory components were required, the kernel width is now reduced, and the procedure continues with method step 5 using the new, reduced kernel width.
  • Method step 10: Since more than 50% of the chemical compounds require five or fewer explanatory components, the current kernel width is retained, and with exactly this kernel width (without further internal optimization) a Gaussian process model is trained on the entire data set from method step 2. Taken together, steps 5 to 10 thus form a loop, as sketched below.
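  • A compact Python sketch of this loop over the kernel width (the shrink factor and the two callables standing in for the per-fold GP training and the explanatory-component determination are assumptions for illustration):

      def choose_kernel_width(sigma_gen, folds, train_model, explain,
                              shrink=0.9, max_components=5, min_fraction=0.5):
          """train_model(train_idx, sigma) -> model; explain(model, test_index) -> component indices.
          Reduces the kernel width sigma until more than min_fraction of the predictions
          need at most max_components explanatory components."""
          sigma = sigma_gen
          while True:
              n_ok = n_total = 0
              for train_idx, test_idx in folds:
                  model = train_model(train_idx, sigma)      # step 5: width fixed, not re-optimized
                  for i in test_idx:                         # steps 6/7: predict, find explanatory components
                      n_ok += len(explain(model, i)) <= max_components
                      n_total += 1
              if n_ok > min_fraction * n_total:              # step 8: enough explainable predictions?
                  return sigma                               # step 10: keep this kernel width
              sigma *= shrink                                # step 9: reduce the kernel width and repeat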
  • Process step 11: The models from process step 10 (GP_fit) and process step 3 (GP_gen) are both stored in a file, together with the data set from process step 3, and form the data-set-specific part of the program ToxExplain.
  • In addition, ToxExplain contains program code that is not data-set-specific.
  • The mean errors and explanatory capabilities determined in method steps 5 to 9 are stored in a log file.
  • Process step 12: Proper end of the procedure.
  • the ToxTrain program ( Figure 5) generates a program called ToxExplain from a record. This allows the automatically determined models to be used for new data.
  • Explanatory models of the type ToxExplain can be generated in the manner described in connection with FIG. 5 for any data records. These are each a program that (unlike previously available models) not only produces a prediction of the toxicity of chemical compounds, but also provides the following information at an interface:
  • A list of exactly those compounds in the training set on which the above prediction mainly relies (explanatory components). As described above, the contributions to the sum according to equation (4) are sorted by size, and then the compounds at the head of the list are selected. The influence of these training compounds is quantified in percent.
  • The information generated is most useful to the user when it is presented to him without context switching, within his normal working environment.
  • existing programs for editing and visualizing molecules in the respective company are connected via interfaces and, if necessary, extended by plug-ins.
  • At least the explanatory components (and their measured toxicity), the optimization aids, their respective proportions in percent, the structural formula of the new molecule and the prediction for the new molecule should be displayed simultaneously.
  • An example of such a graph is found in FIG. 6.
  • a prediction is shown, wherein the prediction for the new molecule is based primarily on two compounds from the training set. If the observing chemist considers these explanatory components to be plausible, the prediction makes his decision easier.
  • Molecule A from the training set has an influence of 51% on the prediction; molecule B has an influence of only 43%. This automatically determined finding is also automatically applied to the new molecule C, which is considered non-toxic.
  • In the lower part of Fig. 6, another example is shown.
  • the molecule D should be changed so that it is no longer toxic.
  • To the right of it are shown the structural features E, F of the molecule, which in the concrete case lead to the prediction being "toxic". These optimization aids make it easier for the chemist to specifically vary the new molecule so that it is no longer toxic.
  • a training dataset with active and inactive molecules as well as two test molecules is used.
  • the main aim is to show, by means of an example, how the explanatory components for kernel-based models are calculated and used.
  • Fig. 7 shows a training data set with active and inactive molecules and two test molecules with unknown activity.
  • the activity may be any property, such as binding to a receptor.
  • Above a certain binding strength, molecules are called active; below it, they are called inactive.
  • the molecules are represented by two numerical descriptors each, which are plotted on the X and Y axes. It should be noted that in real applications usually several hundred descriptors are used. The approach described here works the same way and is demonstrated here for the sake of clarity with two descriptors.
  • the coordinates and sequence numbers of the training molecules can be found in Tables 1 and 2 in columns A, B and C.
  • Table 1 in columns D and E contains the coordinates of test molecule 1 (identical for simplicity in all rows).
  • Table 2 contains the analog information, but for test molecule 2.
  • Equation (1) leads to the following expression for the prediction f(x_new): the weighted sum f(x_new) = Σ_n α_n · k(x_new, x_n) over all training molecules.
  • Columns F contain the value of the selected RBF kernel function between the training molecule listed in the respective row and test molecule 1 (Table 1) or test molecule 2 (Table 2), respectively.
  • For the kernel width of the RBF kernel, the value 2 is assumed in this example calculation.
  • Columns G of Tables 1 and 2 contain, for each training molecule, the label-corrected entry of this weight vector α. It should be noted that there are two different definitions of α.
  • The correct definition can be ensured by simply taking absolute values.
  • For this purpose, the weight vector for kernel-based models has to be defined so that it has only positive entries, and the continuous labels are saved separately.
  • The index n runs over all training objects that have become part of the model.
  • This can be the entire training set (e.g. in the Gaussian process) or a subset of the training set. For support vector machines, this subset is called the support vectors.
  • The prediction for test molecule 1 is based to about 86% on training molecule number 5, while all other training molecules contribute little to the prediction.
  • The prediction for test molecule 2 is based to approximately 87% on training molecule number 5, while all other training molecules (again) contribute little to the prediction. A sketch of this percentage calculation follows below.
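  • A small Python sketch of this percentage calculation (the coordinates, weights, and the kernel width of 2 would come from Tables 1 and 2, which are not reproduced here; the values in the usage comment are placeholders, not the tables' values):

      import numpy as np

      def contribution_percent(alpha, X_train, x_test, sigma=2.0):
          """Percentage that each training molecule contributes to the prediction for x_test."""
          X_train = np.asarray(X_train, float)
          x_test = np.asarray(x_test, float)
          d2 = np.sum((X_train - x_test) ** 2, axis=1)
          k = np.exp(-d2 / (2.0 * sigma ** 2))             # RBF kernel values (columns F)
          contrib = np.abs(np.asarray(alpha, float)) * k   # weighted kernel terms (columns G times F)
          return 100.0 * contrib / contrib.sum()

      # Placeholder usage (NOT the values of Tables 1 and 2):
      # percent = contribution_percent(alpha=[0.2, 0.1, 0.5], X_train=[[0, 0], [1, 1], [2, 2]], x_test=[2.1, 1.9])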
  • Sorting by these contributions gives a list of the training molecules in order of their relevance for the prediction of the respective test molecule.
  • This contribution thus represents a measure with which a ranking data record is automatically created, in which the measures of the influencing factors are arranged according to a predetermined condition.
  • the method is thus to be understood as a kind of automatic measuring method for influencing factors, whereby this determination of the influencing factors enables further applications.
  • For example, a model of a building could be used to determine the parameters on which particularly energy-efficient air conditioning depends. From a model of a production plant, one could, e.g., automatically determine the parts of a production chain that represent a particular bottleneck or on which the quality of certain product elements depends particularly sensitively.
  • From models of technical systems, e.g. electronic circuits or machines, influencing factors can be determined automatically, which can then be used, e.g. in the form of a reduced model, for control purposes.


Abstract

The invention relates to a method and a device for the automatic analysis of a non-linear model in order to predict properties of an a priori uncharacterized object. According to the invention, a) the non-linear model is formed by a machine learning method, in particular a kernel-based learning method, from training objects in such a way that it allows a statement about at least one property for at least one object; b) at least one measure is determined automatically by an analysis means using the representer theorem, this measure indicating which training object or objects that have become part of the non-linear model have the greatest influence on the predictions of the non-linear model; c) a ranking data record is created automatically, in which the measures of the influencing factors are arranged according to a predetermined condition.
PCT/EP2009/064476 2008-11-26 2009-11-02 Procédé et dispositif d'analyse automatique de modèles WO2010060746A2 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
DE112009002693T DE112009002693A5 (de) 2008-11-26 2009-11-02 Verfahren und Vorrichtung zur automatischen Analyse von Modellen

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
DE102008059045.2 2008-11-26
DE102008059045 2008-11-26

Publications (2)

Publication Number Publication Date
WO2010060746A2 true WO2010060746A2 (fr) 2010-06-03
WO2010060746A3 WO2010060746A3 (fr) 2010-11-18

Family

ID=42133384

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2009/064476 WO2010060746A2 (fr) 2008-11-26 2009-11-02 Procédé et dispositif d'analyse automatique de modèles

Country Status (2)

Country Link
DE (1) DE112009002693A5 (fr)
WO (1) WO2010060746A2 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10036219B1 (en) 2017-02-01 2018-07-31 Chevron U.S.A. Inc. Systems and methods for well control using pressure prediction

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001031580A2 (fr) * 1999-10-27 2001-05-03 Biowulf Technologies, Llc Procedes et dispositifs pouvant identifier des modeles dans des systemes biologiques, et procedes d'utilisation
GB0518665D0 (en) * 2005-09-13 2005-10-19 Imp College Innovations Ltd Support vector inductive logic programming


Also Published As

Publication number Publication date
WO2010060746A3 (fr) 2010-11-18
DE112009002693A5 (de) 2013-01-10


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09796643

Country of ref document: EP

Kind code of ref document: A2

WWE Wipo information: entry into national phase

Ref document number: 1120090026931

Country of ref document: DE

122 Ep: pct application non-entry in european phase

Ref document number: 09796643

Country of ref document: EP

Kind code of ref document: A2

REG Reference to national code

Ref country code: DE

Ref legal event code: R225

Ref document number: 112009002693

Country of ref document: DE

Effective date: 20130110
