
WO2003063030A1 - System and method for clustering data - Google Patents

System and method for clustering data

Info

Publication number
WO2003063030A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
state
state value
values
updating
Prior art date
Application number
PCT/US2003/001806
Other languages
English (en)
Inventor
Guangzhou Zou
Xun Wang
Zhen Su
Original Assignee
Syngenta Participations Ag
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Syngenta Participations Ag filed Critical Syngenta Participations Ag
Publication of WO2003063030A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques

Definitions

  • the present invention relates generally to analysis of data, and more particularly, to a method and apparatus for data clustering.
  • Clustering is a type of pattern recognition. Clustering is a process of organizing data into clusters by revealing naturally occurring patterns or structures. The resulting clusters allow users to discover similarities and differences among patterns and to derive useful conclusions about them. Some general applications of clustering are data reduction, hypothesis generation, hypothesis testing, and prediction based on clusters.
  • Clustering is useful in many fields such as life sciences, medical sciences, social sciences, earth sciences, and engineering. Clustering is often found under different names in different contexts such as data mining in software engineering, machine learning in pattern recognition, numerical taxonomy in biology and ecology, typology in social sciences, and partition in graph theory.
  • One example of clustering is grouping genes according to similarities in their expression patterns. Other examples are helping marketers discover and characterize customers, identifying areas of similar land use in an earth observation database, identifying groups of automobile insurance policy holders with a high average claim cost, and identifying groups of houses in an area according to house types, value, and geographic location.
  • Some conventional clustering methods are not efficient for large data sets and suffer from long computation times.
  • the agglomerative type of hierarchical clustering algorithm has a computational complexity of O(n³).
  • Other methods have inflexible clustering criteria which result in clusters that are either too coarse or too fine so that the natural patterns in the data are missed.
  • some methods, such as partitioning methods, try to fit data to predefined or arbitrary patterns and, thus, they too are unable to reveal the natural patterns in the data. In addition, few methods are scalable for massively parallel computation. Therefore, a need exists for a scalable method that lends itself to parallel computation and that employs flexible clustering criteria.
  • the present invention provides systems and methods that cluster data in a data space.
  • the present invention is scalable and thus allows for parallel computation. Additionally, the clustering criteria of the present invention impose minimal a priori restriction on the data and thus allow for a more natural clustering that does not obscure the natural patterns in the data.
  • values of cluster parameters are selected and a data set is created from the data points in the data space that satisfy the selected cluster parameter values.
  • Each data point in the data set is assigned a different initial state value.
  • the state value for each data point in the data set is updated according to one or more rules that are related to the cluster parameters.
  • the update process is capable of being performed in a substantially parallel manner. After the update process has been applied to the entire data set, the update process is repeated for the entire data set until the state values of respective data points in the data set have stabilized. For example, stabilization of the state values is indicated by all of the state values of the respective data points in the data set remaining unchanged after a completed update process of the entire data set.
  • the data points in the data set can be grouped into data clusters as a function of the state values. For example, data points with the same stabilized state value belong to the same data cluster.
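  • As an illustration only, the following Python sketch shows one way the process just described could be realized: points satisfying the cluster parameters are selected, assigned distinct initial state values, and repeatedly updated to the lowest state value in their neighborhood until a full pass changes nothing. The function name, the use of a Euclidean distance, and the synchronous update loop are assumptions made for this sketch, not details taken from the patent text.

```python
import numpy as np

def cluster_by_state_propagation(points, r, n):
    """Illustrative sketch: group data points by iterated state-value updates.

    points -- (N, k) array of data points
    r      -- neighborhood radius (cluster parameter)
    n      -- minimum number of neighbors (cluster parameter)
    Returns a dict mapping each stabilized state value to the indices of the
    data points in that cluster.
    """
    points = np.asarray(points, dtype=float)
    N = len(points)

    # Pairwise Euclidean distances (one possible similarity measure).
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))

    # Neighborhoods: other points within radius r of each point.
    neighbors = [np.where((dist[i] <= r) & (np.arange(N) != i))[0] for i in range(N)]

    # Data set: points that satisfy the cluster parameters (at least n neighbors).
    selected = [i for i in range(N) if len(neighbors[i]) >= n]

    # Assign a different initial state value to each selected data point.
    state = {i: s for s, i in enumerate(selected)}

    # Update every state to the lowest state value in its neighborhood, and
    # repeat until a full pass leaves every state value unchanged (stabilization).
    changed = True
    while changed:
        new_state = {i: min([state[i]] +
                            [state[j] for j in neighbors[i] if j in state])
                     for i in selected}
        changed = new_state != state
        state = new_state

    # Data points with the same stabilized state value belong to the same cluster.
    clusters = {}
    for i, s in state.items():
        clusters.setdefault(s, []).append(i)
    return clusters
```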
  • the process can be repeated for different cluster parameter values.
  • data clusters can be discovered for different sets of cluster parameter values to reveal the natural patterns of the data.
  • FIG. 1 is a representation illustrating an exemplary embodiment of a system that clusters data according to the present invention.
  • FIG. 2 is a flowchart illustrating an exemplary embodiment of a method for clustering data according to the present invention.
  • FIGS. 3, 4, and 5 are flowcharts of alternative embodiments of methods for clustering data according to the present invention.
  • FIG. 6 is a representation illustrating an exemplary embodiment of a hierarchy or an aspect of a hierarchy according to the present invention.
  • the present invention generally relates to systems and methods that cluster data.
  • the present invention provides a system and a method for discovering data clusters in a data space.
  • An exemplary embodiment of a system 10 according to the present invention is illustrated in FIG. 1.
  • the system 10 includes a computing unit 90 such as at least one data processor, at least one computer (e.g., a Beowulf cluster, a supercomputer, a server, a mainframe, a desktop, a laptop, a notebook, a portable or a handheld computer) and/or any equivalents thereof.
  • the computing unit 90 includes a controller 20, a memory 30, a bus 40, an input device 50 and/or an output device 60.
  • the system 10 includes a data device 70 (e.g., an external data storage device or a remote data storage device) and/or a link 80 (e.g., a cable link or a wireless link).
  • the controller 20 includes a computing device or a plurality of computing devices (e.g., processors, microprocessors and/or state machines).
  • the memory 30 includes volatile memory components and/or non- volatile memory components.
  • the controller 20 and the memory 30 are in two-way communications with the bus 40.
  • the input device 50 (e.g., receiver, microphone, mouse, keypad, keyboard, sensor and/or touch-sensitive display) provides a conventional input/output interface with the bus 40.
  • the output device 60 (e.g., display, speaker and/or transmitter) provides a conventional input/output interface with the bus 40.
  • the data device 70 (e.g., a conventional memory and/or conventional data storage device with or without computing power) is coupled to the system 10 via the link 80.
  • the link 80 also provides a conventional input/output interface with the bus 40.
  • the components of the system 10 are connected directly to each other in addition to or instead of being connected via the bus 40.
  • Other conventional methods of communicating between components (e.g., conventional wireless communications means) are also contemplated by the present invention.
  • various levels of integration between components are contemplated by the present invention. For example, any component may be integrated in part or in whole with any other component or components.
  • the controller 20 controls data flow and/or access on the bus 40. Programs and/or data are accessed by the controller 20 from, for example, the memory 30, the input device 50 and/or the data device 70.
  • the input device 50 provides a user interface for entering data and/or commands (e.g., via a keyboard, programming the system 10 or entering values for parameters during the execution of a program).
  • the output device 60 provides, for example, a display and/or an interface that transmits information to a user or to another device, under control of the controller 20.
  • a method 100 (shown in Figure 2) that clusters data according to an exemplary embodiment of the present invention is stored, for example, in the memory 30, the controller 20 or some combination thereof. Furthermore, the method 100 is encompassed in software, hardware (e.g., an application specific integrated circuit (ASIC)) or some combination thereof.
  • a data space represents the set of all data points from which data sets are generated and/or from which clusters are formed. Each data point includes a plurality of dimensions and/or degrees of freedom.
  • the data space is stored, at least partly, in the memory 30, the controller 20, the data device 70 or some combination thereof. Furthermore, information including data points is transmitted and stored between, for example, the memory 30, the controller 20 and the data device 70. For example, if the data space is not efficiently stored within the memory 30 and/or the controller 20, then the data device 70 via the link 80 at least partly provides storage for the data space. Furthermore, the data device 70 provides additional computing power that can process at least portions of the data space stored in the data device 70 in some embodiments.
  • the controller 20 controls the other components of the system 10. For example, the controller 20 accesses the method 100 stored in the memory 30. The controller 20 then executes the method 100 and processes the data points received via the bus 40, the memory 30 and/or the data device 70. In one example, the memory 30, the data device 70, the input device 50 and/or the controller 20 serve as a source of data points. In another example, the memory 30 provides a local cache corresponding to the memory of the data device 70.
  • At least a portion of the data can be processed substantially in parallel by the controller 20, in one embodiment.
  • parallel processing is achieved via one or more processors and/or state machines.
  • the controller 20 can process data stored in the memory 30, the data device 70, the controller 20 or some combination thereof. In another example, the processing power is distributed between the controller 20 and the data device 70.
  • the controller 20, in executing the method 100 according to an exemplary embodiment of the present invention, creates at least one data set from the data space. In one example, the data set is further processed via a substantially parallel and possibly iterative process resulting in the clustering of data.
  • according to cluster conditions (e.g., cluster parameter values), possibly different data sets are generated from the data space, resulting in one or more possibly different data clusters.
  • the clusters of data are organized, for example, according to cluster conditions to illustrate one or more hierarchical levels in one or more hierarchies.
  • a flowchart shows an exemplary embodiment of the method 100 that clusters data from a data space according to the present invention.
  • the method 100 begins in step 110 and proceeds with the selection of values for cluster parameters in step 120.
  • the values for the cluster parameters are automatically selected.
  • the method 100 automatically selects initial values for the cluster parameters and automatically changes the values for the cluster parameters.
  • one or more of the cluster parameter values are increased or decreased by the integer multiples of a particular resolution value up to or down to a particular threshold value.
  • the values for the cluster parameters are selected or updated manually by an operator.
  • the number and/or type of cluster parameters are preset or chosen as a function of, for example, the application and/or the data point type.
  • the method 100 employs two cluster parameters, namely a number of neighbors n and a radius r.
  • a data set is generated by selecting data points of the data space that satisfy the particular values of the cluster parameters.
  • the data set includes data points that have at least n neighbors within a radius r.
  • the data set includes data points that satisfy at least one of the cluster parameter values.
  • Other methods such as conventional methods for applying the cluster parameter values are used to create the data set in some embodiments.
  • each of the data points in the data set is assigned a different initial state value. Integers and other types of numbers or representations are used for initial state values.
  • In step 150, the state value of each data point in the data set is updated according to particular rules.
  • the rules are preset by the user, programmed by the user and/or selected automatically.
  • the rules are selected automatically as a function of the type of data space being processed or the type of application.
  • the rule is related to the cluster parameters.
  • An example rule is that a particular data point should be given the lowest state value in a corresponding neighborhood as defined by the cluster parameters.
  • the step 150 is carried out as a parallel process with all the data points simultaneously undergoing the updating process in step 150. Parallel processing usually reduces computing time, especially when the data sets are very large.
  • In step 160, the method 100 determines whether or not any state values were changed in the previous step (i.e., step 150). If a state value of any of the data points was changed, then the method 100 jumps back to step 150. If no state values were changed, then the state values have stabilized and the method 100 proceeds with step 170.
  • Step 170 determines the clusters according to state values.
  • the data points are grouped into clusters according to state value.
  • a first cluster is formed from the data points with a first state value.
  • a second cluster is formed from the data points with a second state value.
  • the number of clusters is determined by the number of different state values.
  • In step 180, the method 100 determines whether or not the value of any cluster parameter is to be changed. If not, then the method 100 terminates in step 190. Otherwise, if at least one of the cluster parameter values is to be changed, then the method 100 updates the particular cluster parameter value (step 190) and the process proceeds again to step 130, in which possibly different data sets are generated according to the one or more updated cluster parameter values. Any of the parameter values are manually and/or automatically changeable by the method 100. Furthermore, the change in at least one cluster parameter value is by a constant or variable incremental amount. In one embodiment, the incremental amounts are selected according to the noise level of the data. The changes are made by addition, subtraction, multiplication, division or any other conventional method for changing the value of a particular parameter.
  • For cluster parameters n and r as described above, for a particular value of n, the values of r are changed by adding a constant value to r until r reaches a particular threshold value. In another example, for a particular value of r, the values of n are changed by subtracting a constant value from n until n reaches a particular threshold value.
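  • As a hedged illustration of such a parameter sweep, the fragment below varies r in constant increments for a fixed n, reusing the cluster_by_state_propagation sketch shown earlier; the function name and the increment and threshold arguments are illustrative assumptions, not details from the patent.

```python
def sweep_radius(points, n, r_start, r_step, r_max):
    """Illustrative sketch: vary r for a fixed n to build one aspect of a hierarchy."""
    hierarchy = {}
    r = r_start
    while r <= r_max:
        # Each (r, n) pair can yield a different data set and a different clustering.
        hierarchy[(r, n)] = cluster_by_state_propagation(points, r, n)
        r += r_step  # constant incremental change; a variable increment is also possible
    return hierarchy
```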
  • Each set of cluster parameter values generates a possibly different data set and possibly clusters of different composition, size and number.
  • the clusters generated for each set of cluster parameter values form, for example, at least portions of hierarchical levels in one or more hierarchies.
  • FIG. 3 is a flowchart of an alternative embodiment of a method 300 for clustering data according to the present invention.
  • Cluster criteria are selected 302 and a different state is assigned to each data point in the data 304.
  • the state of each data point is updated according to at least one rule that is a function of the cluster criteria 306 and this is repeated until all of the states remain unchanged 308.
  • at least one of the cluster criteria are changed and the method 300 is repeated for the new criteria.
  • data points grouped by states are displayed as a result of method 300.
  • the data points are grouped by states in a hierarchy according to cluster criteria.
  • FIG. 4 is a flowchart of another alternative embodiment of a method 400 for clustering data according to the present invention.
  • Data points are selected from a k-dimensional space that have at least n neighboring data points with a similarity measure less than or equal to r 402.
  • Each selected data point is labeled with a unique initial state 404.
  • the state of each labeled data point is updated to the lowest state in its neighborhood, if the state differs from the lowest state in its neighborhood 406.
  • the updating is repeated until there is no state change in the k-dimensional space 408.
  • each data point represents a gene and its characteristics.
  • k, n, and r are predetermined values.
  • the states are updated simultaneously.
  • genes grouped by state are displayed.
  • r is increased by a resolution Δr and the method is repeated for the new r.
  • the resolution Δr is selected according to noise level.
  • the resulting clusters are provided before selecting the resolution Δr so that the resolution Δr is selected according to resulting clusters.
  • r is varied by Δr over a range of values to produce a hierarchy of clusters. In another embodiment, the hierarchy of clusters is displayed.
  • FIG. 5 is a flowchart of another alternative embodiment of a method 500 for clustering data according to the present invention.
  • a system for clustering data comprises one or more memory units that store at least a portion of a data space and a controller coupled to the one or more memory units.
  • the data space contains a plurality of data points.
  • the controller includes a plurality of computing devices that operate to perform a method.
  • a state value is updated for each data point according to at least one rule that is a function of cluster criteria 502.
  • the cluster criteria comprise a minimum number of neighbors (n) and a similarity value (r).
  • the similarity value (r) is increased by a pre-determined increment and, then, the updating is repeated 504.
  • In one embodiment, the method is performed by the computing devices in parallel. In another embodiment, the controller and the plurality of computing devices are part of a multicomputer architecture capable of parallel processing. In another embodiment, the controller and the plurality of computing devices are part of a supercomputer. In another embodiment, the method further comprises displaying a hierarchy of data points grouped by state values over a range of similarity values (r).
  • n represents the minimum number of neighbors required and r represents a value of a similarity measure.
  • the parameters n and r allow for control of data density.
  • a similarity measure is chosen depending on the application.
  • Additional examples of similarity measures include the family of Minkowski metrics, of which the Euclidean distance is a member.
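  • For concreteness, a minimal sketch of a Minkowski-family similarity measure is shown below; p = 2 gives the Euclidean distance and p = 1 the Manhattan distance. The function name and signature are assumptions made only for illustration.

```python
import numpy as np

def minkowski(a, b, p=2):
    """Minkowski distance between two points; p = 2 is the Euclidean special case."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float((np.abs(a - b) ** p).sum() ** (1.0 / p))
```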
  • An example of a data point is a gene and its characteristics, such as an expression pattern.
  • a group is a collection of similar data points.
  • Neighbors are data points having a defined similarity measure with respect to a particular data point. States are associated with data points and identify groups or clusters.
  • a k-dimensional space is a data set of size k, such as k number of experiments. The example is to be construed merely as an illustration and is not to be construed as a limitation in any manner.
  • Step 1 Select data points that have at least n neighboring data points within a given radius r.
  • the initial value of the minimum neighbors requirement n and the neighborhood size r are pre-determined.
  • the distance between any two data points can be calculated in a k-dimensional space.
  • Step 2 Label each selected data point with a unique integer i, which becomes the initial state of the data point.
  • Step 3 Simultaneously update the state of all labeled data points according to the following rules:
      ◦ Change the state of the data point under consideration to the lowest state that occurred in its neighborhood.
      ◦ Keep the state of the data point unchanged if its state is the lowest one in its neighborhood.
  • Step 4 Repeat step 3 until there is no state change in the entire data space.
  • Step 5 Output the groups of the data points that have the same state as the clusters formed at the parameter point (r, n).
  • Step 6 Increase the value of parameter r (or n) by a user-specified resolution Δr and repeat steps 1-5 to create the set of the lower level clusters with a smaller neighborhood (or number of minimum neighbors requirement).
  • Step 7 Repeat step 6 to cover a meaningful range in the parameter space. This will produce a hierarchy of clusters with respect to the selected resolution Δr.
  • Steps 3, 6, and 7 can all be carried out in a massively parallel fashion.
  • a data space includes ten data points labeled A to J.
  • Each data point represents, for example, a point in an m-dimensional space where m is an integer.
  • point A includes m parameters that are represented, for example, as the m coordinates of point A, e.g., (A₀, A₁, ..., Aₘ₋₁). These parameters represent information such as measurement(s), characteristic(s) or representative value(s) of the same type(s) or different type(s).
  • each data point represents a particular person or group of persons with a particular financial history.
  • Information stored in the coordinates of each data point includes, for example, income data, asset data, debt data, liability data, overhead data, statistical scores relating to personal financial history and other data that is relevant as to whether or not a bank should approve a loan or extend a pre-approved credit card offer to a particular person or group of persons.
  • each data point represents a test subject (e.g., one or more organisms, cells, organic material, DNA, etc.) that is the focus of scientific research.
  • Information stored in the coordinates of each data point includes, for example, test conditions, subject characteristics, statistical data relating to the test conditions and/or subject. It will be appreciated that these are merely illustrations and not intended to limit the present invention in any way.
  • the systems and methods for clustering data according to the present invention find application in a wide variety of applications in which information is processed and/or analyzed.
  • a distance is determined between each point and every other point of the data set.
  • the term “distance” includes, for example, a conventional spatial distance or its equivalent between two points (e.g., A and B) as represented, for example, by the conventional relation of the square root of the sum of the squares of the differences of corresponding coordinates between the two points.
  • the term “distance” also includes, for example, a conventional correlation value or its equivalent.
  • the distance is a normalized correlation between two points (e.g., A and B) subtracted from an offset such as, for example, one. Accordingly, the distance between, for example, point A and itself would be zero.
  • the term “distance” need not be limited to any one of the above-identified embodiments, but includes any conventional mathematical (e.g., statistical) parameters as known to one of ordinary skill in the art.
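  • A minimal sketch of the correlation-style distance described above, assuming an offset of one and a Pearson correlation as the normalized correlation; both choices are offered in the text only as examples, and the function name is an assumption.

```python
import numpy as np

def correlation_distance(a, b, offset=1.0):
    """Distance as an offset minus a normalized correlation between two points."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    corr = np.corrcoef(a, b)[0, 1]  # normalized (Pearson) correlation
    return offset - corr            # a point compared with itself gives a distance of 0
```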
  • the distance information is storable in a matrix.
  • the example matrix below is for storing distances between the points of a data set and contains ten points A-J.
  • the information need not be stored in a matrix format and there can be more or less than ten data points.
  • other methods for organizing and/or keeping track of data (e.g., coordinate systems, pointers, etc.) are also employed.
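  • As one hedged illustration of storing distances in a matrix, the helper below computes an N x N matrix of pairwise Euclidean distances, with zeros on the diagonal because a point has zero distance to itself; the function name and the choice of Euclidean distance are assumptions for the sketch.

```python
import numpy as np

def distance_matrix(points):
    """Pairwise distances stored in an N x N matrix; entry (i, j) is d(point_i, point_j)."""
    points = np.asarray(points, dtype=float)
    diffs = points[:, None, :] - points[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))  # symmetric, with zeros on the diagonal
```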
  • the distance between a particular point and itself is zero.
  • the distance between point A and point B is, for example, shown to have a value of one. This value is reflected in row one, column two, and row two, column one.
  • Other distance values between points are also stored in the matrix.
  • Although the distance values are shown in the matrix as integers, other types of numbers (e.g., non-integers) are also storable. Integers were employed, in this example, to simplify the discussion.
  • point C has at least two neighbors (i.e., points F and G), each within a distance of 1.
  • points B, C, D, E, G, H and I each have at least two neighbors, each within a distance of 1.
  • Each point is assigned an initial state value. In this example, the selected points are each assigned a different integer value. However, non-integer values may be used and may be arbitrarily assigned. The following indicates the initial state values for each of the points:
  • point B has a state value of 0.
  • point B has two neighbors in the clustering process: point D with a state value of 2 and point H with a state value of 5. Accordingly, since point B has the lowest state value (i.e., 0) when compared with its neighbors' state values (i.e., 2 and 5), the state value of point B remains unchanged at 0.
  • point C has a state value of 1 and has one neighbor in the clustering process: point G with a state value of 4. Accordingly, since point C has the lowest state value (i.e., 1) when compared with its neighbor's state value (i.e., 4), the state value of point C remains unchanged at 1.
  • Point D has a state value of 2 and has two neighbors in the clustering process: point B with a state value of 0 (the initial value, not an updated value) and point H with a state value of 5. Accordingly, since point D does not have the lowest state value in its neighborhood, the state value of point D is changed to 0, the lowest state value among its neighbors.
  • a second update stage using the rules of the first update stage is performed.
  • the difference in the second update stage is that the updated state values from the first update stage, instead of the initial state values, are employed.
  • point E has a state value of 3 after the first update stage.
  • Point E has two neighbors: point G with a state value of 1 after the first update stage and point I with a state value of 3 after the first update stage. Accordingly, point E takes on the lowest state value of 1 from neighbor G.
  • point I has a state value of 3 after the first update stage.
  • Point I has two neighbors: point E with a state value of 3 after the first update stage and point G with a state value of 1 after the first update stage. Accordingly, point I takes on the lowest state value of 1 from neighbor G.
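  • To make the update stages above concrete, the short script below replays the worked example using an adjacency list and initial state values reconstructed from the text; the symmetric neighbor relation and the initial state value of 6 for point I are assumptions, since they are not stated explicitly.

```python
# Adjacency and initial state values reconstructed from the worked example in the text.
# A symmetric neighbor relation is assumed; the initial state value 6 for point I is
# an assumption, since it is not stated explicitly.
neighbors = {
    "B": ["D", "H"], "C": ["G"], "D": ["B", "H"], "E": ["G", "I"],
    "G": ["C", "E", "I"], "H": ["B", "D"], "I": ["E", "G"],
}
state = {"B": 0, "C": 1, "D": 2, "E": 3, "G": 4, "H": 5, "I": 6}

stage = 0
while True:
    # Simultaneous update: every point takes the lowest state value found in its
    # neighborhood, computed only from the previous stage's values.
    new_state = {p: min([state[p]] + [state[q] for q in neighbors[p]]) for p in state}
    stage += 1
    print(f"after stage {stage}: {new_state}")
    if new_state == state:  # no change over a full pass means the values have stabilized
        break
    state = new_state
```

  • With these reconstructed inputs, the state values stabilize with points B, D and H sharing state value 0 and points C, E, G and I sharing state value 1, which matches the first and second clusters shown on one hierarchical level of FIG. 6.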
  • a third update stage using the rules of the second update stage is performed.
  • the difference in the third update stage is that the updated state values from the second update stage are employed.
  • the result of the above-disclosed clustering process is a single cluster containing all of the points A to I.
  • FIG. 6 illustrates the clustering information shown on different hierarchical levels of a hierarchy.
  • a first cluster is formed by the set including the points B, D and H.
  • a second cluster is formed by the set including the points C, E, G and I.
  • FIG. 6 illustrates a hierarchy generated by keeping a first cluster parameter n constant and by changing a second cluster parameter r.
  • Another hierarchy or another aspect of the same hierarchy is illustrated by keeping the second cluster parameter r constant and changing the first cluster parameter n.
  • cluster parameters n and r are both changed.
  • the hierarchy includes more or less than two hierarchical levels in other embodiments.
  • an exemplary embodiment provides that at least some steps of the process (e.g., the update stage) are performed in parallel and/or simultaneously, wherein the terms “in parallel” and “simultaneously” have overlapping meanings.
  • the updating of all the state values of respective data points in the data set can be performed in parallel since the updating process uses the state values from the previously completed update stage.
  • the state values can be updated separately from each other.
  • the data points are split among processes.
  • selected steps are performed in parallel, while other steps are performed sequentially.
  • Parallel processing often reduces processing time and embodiments are scalable for massively parallel computations, in particular, when the number of data points becomes very large. Such parallel processing is achieved, for example, by one or more processors and/or state machines.
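  • One hedged way to split the update among worker processes, as suggested above: because each stage reads only the previous stage's state values, the labeled data points can be partitioned and updated independently. The chunking scheme, worker count and function names below are illustrative assumptions rather than details from the patent.

```python
from concurrent.futures import ProcessPoolExecutor

def update_chunk(chunk, state, neighbors):
    """Update one chunk of points, reading only the previous stage's state values."""
    return {p: min([state[p]] + [state[q] for q in neighbors[p] if q in state])
            for p in chunk}

def parallel_update(state, neighbors, workers=4):
    """Compute one full update stage by splitting the data points across processes."""
    points = list(state)
    chunks = [points[i::workers] for i in range(workers)]  # simple round-robin split
    new_state = {}
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(update_chunk, chunks,
                           [state] * workers, [neighbors] * workers)
        for part in results:
            new_state.update(part)
    return new_state
```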
  • FIG. 2 illustrates a particular order of steps
  • the present invention also contemplates other orderings and groupings.
  • the present invention includes fewer or more steps than illustrated in FIG. 2.
  • the present invention contemplates that a process is formed from a subset of the steps illustrated in FIG. 2 such as, for example, a process including steps 140 to 160.
  • additional steps not illustrated in FIG. 2 are included such as, for example, forming and/or displaying hierarchical levels and/or at least some aspects of one or more hierarchies.
  • The data source for the above-described application can be obtained from microarrays and DNA chips, which give expression levels for hundreds to thousands of genes.
  • the methods and systems of the present invention can be used to process the data from microarrays and DNA chips, obtained from either a single experiment or multiple experiments, to group together genes whose expression profiles are similar to each other.
  • the data source can also be nucleotide sequences.
  • the methods and systems of the present invention can also be used to align nucleotide sequences in order to produce a global alignment of the sequences collected from an organism or across organisms.
  • Other examples of applications of the systems and methods of the present invention include determining the socio-economic demographic of the world population or of each country or city in the world or hemisphere.
  • the source data could be the World Bank statistics of countries from a selected period of time.
  • the data could include various quality of life factors such as state of health, nutrition, educational service, etc.
  • countries that have similar values will be grouped together with each group assigned its own unique color.
  • the socio-economic demographic of each country of the world can then be visualized in a straightforward manner, wherein each country on the geographic map is colored according to its socio-economic type.
  • Embodiments of the present invention have many advantages over existing technology.
  • the dependence of the clustering results on the selection of parameters has been minimized so that the natural or true structure of the data can be revealed.
  • Using simple unified state transition rules allows computations to be carried out much faster or in a parallel fashion, especially for problems involving a large data set.
  • Various embodiments have computational complexities of O(n), after the distance matrix computation. Searching clusters by simultaneous state transition operations and constructing a cluster hierarchy by continuous parameter changes provide these and other advantages.

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Human Resources & Organizations (AREA)
  • Strategic Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a system and method for clustering data by searching for clusters through simultaneous state transition operations and by constructing a cluster hierarchy through continuous parameter changes. Parallel systems and methods can be used. A data set is created (130) from the data points by applying selected cluster parameter values. After a different initial state value is assigned to each data point in the data set (140), the state value for each data point in the data set is updated according to one or more rules relating to the cluster parameters. Once the update process has been applied to the entire data set (150), the process is repeated until the state values of the respective data points in the data set have stabilized. Data clusters can be formed as a function of the stabilized state values. The process can be repeated for different cluster parameter values (190).
PCT/US2003/001806 2002-01-22 2003-01-22 Systeme et procede de regroupement de donnees WO2003063030A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US35057002P 2002-01-22 2002-01-22
US60/350,570 2002-01-22

Publications (1)

Publication Number Publication Date
WO2003063030A1 (fr) 2003-07-31

Family

ID=27613403

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2003/001806 WO2003063030A1 (fr) 2002-01-22 2003-01-22 Systeme et procede de regroupement de donnees

Country Status (1)

Country Link
WO (1) WO2003063030A1 (fr)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6529891B1 (en) * 1997-12-04 2003-03-04 Microsoft Corporation Automatic determination of the number of clusters by mixtures of bayesian networks
US6263334B1 (en) * 1998-11-11 2001-07-17 Microsoft Corporation Density-based indexing method for efficient execution of high dimensional nearest-neighbor queries on large databases

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111340104A (zh) * 2020-02-24 2020-06-26 中移(杭州)信息技术有限公司 智能设备的控制规则的生成方法和装置、电子设备及可读存储介质
CN111340104B (zh) * 2020-02-24 2023-10-31 中移(杭州)信息技术有限公司 智能设备的控制规则的生成方法和装置、电子设备及可读存储介质

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SC SD SE SG SK SL TJ TM TR TT TZ UA UG US UZ VC VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP
