WO2003100660A1 - Method of determining if there is a relationship between measurement objects in a measurement data set
- Publication number
- WO2003100660A1 (PCT/SE2003/000815)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- measurement
- subset
- cluster
- objects
- property index
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
Definitions
- the present invention generally refers to analysis of measurement data, and more specifically to a method of determining if there is a relationship between measurement objects in a measurement data set, wherein said measurement data set consists of a set of measurement objects each having a set of numeric measurement values.
- the cluster concept occurs in various areas, such as in biotechnology, geographic information systems, etc., where measurement results from experiments, tests, etc., may be described by a measurement data set comprising objects each having a set of numeric measurement values, and where this data set is to be analysed with respect to the similarity of selected numeric measurement values of the objects.
- the identification of clusters i.e. the identification of objects with similar numeric measurement values, is very useful when an analyst sets out to find similarities between measurement results in order to identify relations between measurement objects.
- Such relations may e.g. be used to identify a common cause to similar results and/or to predict the results of future experiments, tests, etc.
- the identification of potential clusters may be rather straightforward in some cases just by plotting the measurement objects in a two-dimensional diagram.
- the analyst may choose between a number of different prior art clustering methods and dissimilarity measures.
- Most prior art clustering methods are either partitional, in which case they return a user-specified number of subsets that partition the set of objects, or hierarchical, in which case they return a hierarchy of subsets.
- cluster validation methods are applied to the output from a clustering method with the purpose of helping the analyst to distinguish between real and superficial clusters.
- a first type of cluster validation methods which has been proposed in prior art, estimates the number of clusters present in a data set given the results from a clustering method for a partition of the data set in 1, 2, ..., k subsets. Examples of such cluster validation methods are the Calinski-Harabasz method (R.B. Calinski and J. Harabasz. A dendrite method for cluster analysis. Communications in Statistics, 3, p.1-27, 1974), and the Gap statistic (R. Tibshirani, et al. Estimating the number of clusters in a dataset via the Gap statistic. Tech. Report, March 29, 2000).
- a general problem for this first type of cluster validation methods is that they are based on the assumption that a data set can be partitioned into a set of crisp clusters. However, if the data set, in addition to a number of significant clusters, also comprises objects which do not belong to any cluster, this type of method will not produce any useful result.
- a second type of cluster validation methods which has been proposed in prior art assesses individual potential clusters derived from a clustering method.
- Examples of such cluster validation methods are the Zhang-Zhao method (K. Zhang and H. Zhao. Assessing reliability of gene clusters from gene expression data. Functional Integrative Genomics, 1:156-173, 2000), and the Silhouette method (L. Kaufman and P. Rousseeuw. Finding groups in data: an introduction to cluster analysis. New York, Wiley, 1990).
- In the Zhang-Zhao method, a standard deviation for the numeric measurement values of an object must be estimated.
- the numeric measurement values of objects of a given set of potential clusters identified by a clustering method are then perturbed using a normal distribution with said estimated standard deviation.
- the validation is then based on whether or not a potential cluster from the original result from the given clustering method is still identified for repetitions of the given clustering method for the objects with perturbed numeric measurement values.
- a problem with this method is that a standard deviation (e.g. corresponding to measurement errors) of numeric measurement values must be estimated by the analyst, and that the result of the method is strongly affected by this estimated standard deviation.
- On the one hand, this standard deviation could be so small that the perturbation will not affect the cluster properties of a potential cluster.
- In this case, the risk is high that a superficial cluster is identified as a real cluster.
- On the other hand, this standard deviation could be so large that the perturbation will affect the cluster properties of a potential cluster too much. In this case, the risk is high that a real cluster is identified as a superficial cluster.
- In the Silhouette method, a measure is calculated for each object in an identified potential cluster describing how similar the object is to other objects in the identified potential cluster compared to the similarity of the object to objects in other identified potential clusters.
- a problem of this method is that it will give a cluster property measure which is difficult to interpret.
- Another problem is that the cluster property measure will depend on the partition into potential clusters of objects which are not members of the identified potential cluster which is to be validated.
- the known cluster validation methods will not give reliable results which can be used to identify similarities between measurement results from experiments, tests etc., in order to identify relationships between measurement objects.
- the invention overcomes or alleviates the problems of the prior art by providing a method of determining if there is a relationship between measurement objects in a measurement data set, wherein said measurement data set consists of a set of measurement objects each having a set of numeric measurement values.
- a method of determining if there is a relationship between measurement objects in a measurement data set wherein the measurement data set consists of a set of measurement objects each having a given number of numeric measurement values, and wherein a measurement subset consisting of a subset of the measurement objects has been identified in the measurement data set by means of a given clustering method.
- a measurement object in the method is typically associated with a physical entity in the technical area in which the method is applied, such as an mRNA sample in a biotechnology application.
- the numeric measurement values are typically values derived from measurements for different attributes of the physical entity, such as expression levels of different nucleotide sequences of the mRNA sample in the biotechnology application.
- the method is not limited to these examples but may be used in any technical area where such entities with a set of attributes for which numeric measurement values have been derived are found. Such entities, attributes, numeric measurement values, and the application of the method to these will be apparent to the skilled person within the technical area of interest.
- the method is not limited to any particular clustering method.
- any suitable known or future clustering method may be used.
- the same clustering method should be used throughout the method.
- the application of a selected clustering method for a measurement data set as described above will be apparent to the skilled person within the area of cluster analysis.
- a value of a cluster property index is calculated for a measurement subset of the identified measurement subsets.
- the method is not limited to any particular cluster property index.
- any suitable known or future cluster property index may be used.
- the same cluster property index should be used throughout the method. The selection of a suitable cluster property index in dependence of the clusters to be found, will be apparent to the skilled person within the area of cluster analysis.
- Each reference data set consists of a set of reference objects having randomly chosen numeric reference values. Each reference object is given the same number of numeric reference values as the number of numeric measurement values of a measurement object.
- reference subsets are identified in each of the multiple reference data sets.
- the same clustering method is used for identifying reference subsets as was used for identifying the measurement subset.
- values of the cluster property index for at least a subset of the reference subsets are calculated.
- the value of the cluster property index for the measurement subset is then compared to the distribution of the values of the cluster property index for the subset of the reference subsets.
- the cluster properties of the measurement subset are assessed based on the comparison.
- a single measurement subset in a measurement data set may be assessed in the method according to the invention.
- the method according to the invention will produce a useful result also for a measurement subset of a measurement data set, which measurement data set, in addition to a number of significant clusters, also comprises objects which do not belong to any cluster. Furthermore, in the method there is no need for a user to provide any information as an input to the method other than the measurement data set. Thus, the method is simple to use and does not rely on any estimations of measurement errors etc.
- an assessment is facilitated of a subset of measurement objects identified in the measurement data set by means of the clustering method in terms of whether the similarities between the numeric measurement values of the measurement objects are actually due to a relationship between the measurement objects and not the result of random similarities. Furthermore, even though it is relative in its nature, the assessment will produce an absolute assessment of the cluster properties of a subset of measurement objects since it is performed with randomly generated data as a reference.
- the method according to the invention will increase the possibilities for an analyst to distinguish between real and superficial clusters in a measurement data set. Consequently, the method according to the invention will increase the possibilities for an analyst to identify a relationship between measurement objects in a measurement data set which is the result of for example an experiment or a test. Such a relationship may be, but is not limited to, the fact that there is a common cause to the measurement results derived from an experiment or test for the measurement objects in the measurement subset.
- the method is advantageously used when an analyst sets out to identify relationships between measurement objects in a measurement data set.
- the cluster property index is a function of the relation between the dissimilarity, or preferably the minimum dissimilarity, between objects within a given subset of objects and objects outside the subset of objects, and the dissimilarity, or preferably the maximum dissimilarity, between objects within the subset of objects.
- This type of index is advantageous in the case of "ball-like" clusters. Furthermore, this type of index does not depend on the partition into measurement subsets of objects which are not elements in the measurement subset being assessed.
- the cluster property index in the first embodiment is preferably a function of the sum of quotients derived by dividing, for each object within a given subset of objects, the minimum distance from that object to an object outside the given subset of objects by the maximum distance from that object to an object within the given subset of objects.
- the distance measure may for example be, but is not limited to, the Euclidean distance or one minus the Pearson correlation.
- the choice of a distance measure will be apparent to the skilled person within the area of clustering methods.
- the same distance measure should be used throughout the method for the clustering method.
- the same distance measure should be used for the cluster property index.
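The two distance measures named above can be sketched as follows. This is a minimal illustration only; the function names `euclidean` and `pearson_distance` are hypothetical and not taken from the source, and objects are assumed to be equal-length sequences of numbers.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length sequences of numbers."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_distance(x, y):
    """One minus the Pearson correlation of x and y (ranges from 0 to 2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # The correlation is undefined when either vector is constant.
        raise ValueError("Pearson correlation is undefined for constant vectors")
    return 1.0 - cov / math.sqrt(var_x * var_y)
```

Whichever function is chosen, the same one would be passed both to the clustering step and to the cluster property index, in line with the requirement stated above.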
- the assessment of the cluster properties of the measurement subset comprises the decision that said measurement subset is a cluster if the value of the cluster property index for the measurement subset is higher than the largest of the values of the cluster property index for the subset of the reference subsets.
- the assessment of the cluster properties of the measurement subset comprises the decision that the measurement subset is not a cluster if the value of the cluster property index for the measurement subset is lower than the lower limit of the upper quartile of the values of the cluster property index for the subset of the reference subsets.
- the assessment of the cluster properties of the measurement subset comprises the decision that the question of whether the measurement subset is a cluster or not is indeterminate if the value of the cluster property index for the measurement subset lies between the lower limit of the upper quartile and the largest value of the values of said cluster property index for the subset of the reference subsets.
- the measurement subset is decided to be a cluster.
- the assessment of the cluster properties of the measurement subset is preferably done by determining the unusualness of the value of the cluster property index for said measurement subset in respect to the distribution of the values of the cluster property index for said subset of said reference subsets.
- a high unusualness indicates a high probability that the measurement subset is a cluster. From the results of the comparison it is possible to determine an unusualness since the reference subsets have been identified in the reference data set which does not comprise any clusters.
- One way of presenting the unusualness of the measurement subset is to allot it one of an arbitrary number of grades. The selection of the number of grades will depend on the resolution desired by a user.
- the cluster properties of the measurement subset are preferably assessed by deciding that the measurement subset is a cluster if the value of the cluster property index for the measurement subset lies within a first interval.
- the first interval is derived from the distribution of the values of the cluster property index for the subset of the reference subsets.
- the second interval is derived from the distribution of the values of the cluster property index for the subset of the reference subsets.
- the cluster properties of the measurement subset are preferably assessed by further deciding that the question of whether or not the measurement subset is a cluster is indeterminate if the value of the cluster property index for the measurement subset lies within a third interval.
- the third interval is derived from the distribution of the values of the cluster property index for the subset of the reference subsets.
- the setting of the first, second and third intervals will depend on the cluster property index selected. For example, if a cluster property index is selected for which real clusters will have a high value and superficial clusters will have a low value, the first interval will be from a first value and up, the second interval will be from a second value and down, and the third interval will be between the second and the first values. It should be noted that it is not necessary to use all of the three intervals. For example, only the first and the second interval may be used. In this case a measurement subset is either decided to be a cluster or not to be a cluster. Furthermore, it should also be noted that the number of intervals may also be more than three.
- the intervals could be used in order to decide a grading of the measurement in terms of the probability that the measurement subset is a cluster. Furthermore, in the method according to the invention it is preferably determined that there is a relationship between the measurement objects in the measurement data set if the measurement subset is decided to be a cluster.
- the decision that the measurement subset is a cluster in the method indicates that the numeric measurement values of the measurement objects of the measurement subset have so many similarities compared to measurement objects which are not in the measurement subset that it is not likely that these similarities are the result of a random process. Thus, a relationship between the measurement objects of the measurement subset is identified.
- the same cluster property index as in the first embodiment is preferably used.
- the comparison of the value of the cluster property index for the measurement subset and the distribution of the values of the cluster property index for the subset of the reference subsets includes the calculation of a value of a cluster assessment index for the measurement subset.
- This cluster assessment index is a function of the quotient of the difference between the value of the cluster property index for the measurement subset and the median of the values of the cluster property index for the subset of the reference subsets, and the average deviation of the values of the cluster property index for the subset of the reference subsets. Furthermore, the cluster properties of the measurement subset are assessed based on the value of the cluster assessment index.
- the multiple reference data sets generated in the method according to the invention each preferably consist of essentially the same number of reference objects as the number of measurement objects of the measurement data set. Hence, the generated reference data sets are more similar to the measurement data set in which the measurement subset has been identified.
- Each numeric reference value is also preferably chosen at random in essentially the same region of attribute space as occupied by the numeric measurement values of the measurement objects of said measurement data set. More specifically, each reference value is preferably chosen at random using a uniform distribution in the same region of attribute space as occupied by the numeric measurement values of the measurement objects of the measurement data set.
- the numeric reference values of the reference objects of the generated reference data sets will constitute a better simulation of randomly generated numeric values.
- there will be a better basis for the comparison of the measurement subset to the identified reference subsets which in turn is the basis for assessment of the cluster properties of the measurement subset in terms of the probability that it is actually a real cluster and not the result of random similarities between the measurement objects of the measurement subset.
- the performance of the method will be enhanced.
- the subset of the reference subsets preferably consists of reference subsets having essentially the same number of reference objects as the number of measurement objects of said measurement subset.
- figure 1 shows a flow chart of a first embodiment of the method according to the invention
- figure 2 shows a flow chart of a second embodiment of the method according to the invention
- figure 3 shows a plot of the measurement objects of a measurement data set in a first example
- figure 4 shows the value of a cluster property measure called the NEN/FIN ratio for identified measurement subsets and reference subsets of different sizes in the first example
- figure 5 shows a blow up of a part of figure 4
- figure 6 shows a dendrogram showing the hierarchical relationship between the identified measurement subsets in the first example
- figure 7 shows a plot of the measurement objects of the measurement data set in the first example, where measurement objects belonging to different measurement subsets A-E are illustrated by different symbols
- figure 8 shows a plot of the measurement objects of the measurement data set in the first example, where measurement objects belonging to different measurement subsets are illustrated by different symbols, and where a grade of each measurement subset is given
- Let $A = \{x_1, x_2, \ldots, x_n\}$ be a set of n objects to be analysed in terms of clusters.
- the distance or dissimilarity between two objects $x_i$ and $x_j$ is defined by a distance measure, $d(x_i, x_j)$, such as Euclidean distance or one minus the Pearson correlation.
- Let D be a proper subset of A containing at least two objects.
- This cluster property index will be called the average nearest-exterior-neighbour versus furthest-interior-neighbour (NEN/FIN) ratio in the following.
- The average NEN/FIN ratio is defined as the arithmetic mean of the log-transformed nearest-exterior-neighbour versus furthest-interior-neighbour ratios.
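The verbal definition above can be sketched in code as follows. This is an illustration under stated assumptions, not the patent's own formula, which is not reproduced in this text: the natural logarithm is assumed as the log transform, objects are addressed by index, and the names `nen_fin_ratio` and `euclidean` are hypothetical.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length sequences of numbers."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def nen_fin_ratio(objects, subset, dist=euclidean):
    """Average log-transformed NEN/FIN ratio of `subset` (indices into `objects`).

    Assumes all pairwise distances are non-zero and uses the natural logarithm;
    both are assumptions of this sketch.
    """
    inside = sorted(set(subset))
    inside_set = set(inside)
    outside = [i for i in range(len(objects)) if i not in inside_set]
    if len(inside) < 2 or not outside:
        raise ValueError("subset must be a proper subset with at least two objects")
    total = 0.0
    for i in inside:
        # Nearest exterior neighbour: closest object outside the subset.
        nen = min(dist(objects[i], objects[j]) for j in outside)
        # Furthest interior neighbour: most distant object inside the subset.
        fin = max(dist(objects[i], objects[j]) for j in inside if j != i)
        total += math.log(nen / fin)
    return total / len(inside)
```

With this convention a compact, isolated subset yields a positive value, while a subset whose members are closer to outside objects than to each other yields a negative one.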
- In figure 1, a flow chart of a first embodiment of the method according to the invention is shown.
- the measurement objects are typically associated with a physical entity, and the numeric measurement values of a measurement object are typically the results from measurements of attributes associated with the measurement object.
- a measurement subset C which consists of a subset of the measurement objects of the measurement data set S is also input to the first embodiment.
- the measurement subset has been identified by means of a given clustering method.
- the clustering method used may be any one of a number of clustering methods proposed in prior art (see e.g. M.S. Aldenderfer and R.K. Blashfield, "Cluster Analysis", Sage Publications, 1984; B.S. Everitt, "Cluster Analysis", E. Arnold, 1993; A.D. Gordon, "Classification", 2nd edition, Chapman & Hall, 1999; A.K. Jain and R.C.
- In step 105, the NEN/FIN ratio r(C) as defined above is calculated for the identified measurement subset C.
- Each of the reference data sets $R_i$ generated in step 110 contains n reference objects, i.e. the same number of reference objects as the number of measurement objects in the measurement data set S.
- Each reference object has numeric reference values selected independently at random based on a uniform distribution within the same region of attribute space as occupied by the numeric measurement values of the measurement objects of the measurement data set S. Obviously, the more reference data sets are generated, the longer the execution time.
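The generation of reference data sets can be sketched as follows. The sketch assumes that the "region of attribute space" is the axis-aligned bounding box of the measurement data, which is only one possible choice; the function name `make_reference_sets` and its parameters are hypothetical.

```python
import random

def make_reference_sets(data, count, seed=None):
    """Generate `count` reference data sets, each with len(data) objects.

    Each value is drawn uniformly and independently between the minimum and
    maximum of the corresponding attribute in `data` (bounding-box assumption).
    """
    rng = random.Random(seed)
    dims = len(data[0])
    lows = [min(row[k] for row in data) for k in range(dims)]
    highs = [max(row[k] for row in data) for k in range(dims)]
    return [
        [tuple(rng.uniform(lows[k], highs[k]) for k in range(dims))
         for _ in range(len(data))]
        for _ in range(count)
    ]
```

Increasing `count` gives a better estimate of the reference distribution at the cost of a proportionally longer run, since the clustering method has to be applied to every generated set.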
- In step 115, the same clustering method as used for S is applied to each $R_i$. In step 120, for each subset identified by the clustering method, its NEN/FIN ratio is calculated and its size is recorded.
- In step 125, r(C) is compared to the NEN/FIN ratios of reference subsets of the same size as C, i.e. reference subsets that consist of the same number of reference objects as the number of measurement objects of the measurement subset.
- Alternatively, r(C) is compared not only to the NEN/FIN ratios of reference subsets of the same size as C, but also to those of reference subsets of a size similar to that of C.
- In steps 130-155, it is decided whether or not the measurement subset C of S is a cluster based on the comparison in step 125.
- In step 130, it is determined if r(C) is larger than the maximum of the NEN/FIN ratios of the reference subsets of the same size as C. If this is the case, it is decided in step 135 that the measurement subset C is a cluster and the method is continued in step 140 where it is determined that there is a relationship between the measurement objects in the measurement subset since the measurement subset is decided to be a cluster. After step 140 the method is ended.
- If r(C) is not larger than the maximum of the NEN/FIN ratios of the reference subsets of the same size as C, it is determined in step 145 if r(C) lies in the upper quartile of the NEN/FIN ratios of the reference subsets of the same size as C. If this is the case, it is decided in step 150 that the question of whether measurement subset C is a cluster or not is indeterminate and the method is ended. If not, it is decided in step 155 that the measurement subset C is not a cluster since r(C) lies below the lower limit of the upper quartile of the NEN/FIN ratios of the reference subsets of the same size as C. After step 155 the method is ended.
- a relative grading of measurement subsets is used where the decision above that the measurement subset C is a cluster, that the question of whether measurement subset C is a cluster or not is indeterminate, and that the measurement subset C is not a cluster, are replaced by a grading of the measurement subset C with 2, 1 and 0 respectively.
- the grading 2 corresponds to a high probability that the measurement subset is a cluster,
- the grading 1 corresponds to a medium probability that the measurement subset is a cluster, and
- the grading 0 corresponds to a low probability that the measurement subset is a cluster.
- the number of grades is three here, but any number of grades can be used. The selection of the number of grades will depend on a desired resolution in the result presented in the method.
- the embodiment preferably also comprises a lower limit on the number of identified reference clusters of the same size as C, on which a decision of whether or not the measurement subset is a cluster may be based. For example, if there are just a few identified reference clusters of the same size as C, the decision that the measurement subset is a cluster should not be considered to be valid.
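The three-way decision and the grading described above can be sketched as a single function. Several details are assumptions of this sketch rather than statements of the source: the quartile convention (the "inclusive" method of Python's `statistics.quantiles`), the default lower limit `min_refs=4`, and the name `grade_subset`.

```python
import statistics

def grade_subset(r_c, reference_ratios, min_refs=4):
    """Grade a measurement subset against same-size reference subsets.

    Returns 2 (cluster), 1 (indeterminate), 0 (not a cluster), or None when
    there are too few reference ratios to support a decision.
    """
    # statistics.quantiles needs at least two data points.
    if len(reference_ratios) < max(min_refs, 2):
        return None
    if r_c > max(reference_ratios):
        return 2
    # Lower limit of the upper quartile of the reference distribution.
    _, _, q3 = statistics.quantiles(reference_ratios, n=4, method="inclusive")
    if r_c >= q3:
        return 1
    return 0
```

For instance, with reference ratios `[1, 2, 3, 4, 5]` the lower limit of the upper quartile is 4 under this convention, so a value of 6 is graded 2, a value of 4.5 is graded 1, and a value of 3 is graded 0.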
- a flow chart of a second embodiment of the method according to the invention is shown in figure 2.
- the steps 205, 210, 215 and 220 illustrated in the flow chart of figure 2 are identical to the steps 105, 110, 115, and 120 of the first embodiment described with reference to the flow chart of figure 1. Thus, these steps are not described further herein.
- In step 225, a value of a cluster assessment index, describing the unusualness of a subset, is calculated for the measurement subset.
- Such a value of the cluster assessment index for the measurement subset is obtained by computing $q(C) = \frac{r(C) - m(|C|)}{s(|C|)}$,
- where $m(|C|)$ and $s(|C|)$ denote the median and the average deviation, respectively, of the NEN/FIN ratios for the reference subsets of size $|C|$.
- Alternatively, $m(|C|)$ and $s(|C|)$ may be approximations of the median and the average deviation, respectively, of the NEN/FIN ratios for the reference subsets of size $|C|$.
- such approximations may take into account also reference subsets which have a size which is not equal but similar to the measurement subset C. This is particularly useful in cases where the number of reference subsets identified of the same size as the measurement subset is limited or zero. For such cases there is also the alternative of determining that there is not enough statistical foundation to obtain a useful value of q(C).
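The cluster assessment index can be sketched as follows. The sketch assumes that "average deviation" means the mean absolute deviation from the median, which the text does not specify; the names `cluster_assessment_index` and `pooled_ratios` and the `tolerance` parameter are hypothetical.

```python
import statistics

def pooled_ratios(ratios_by_size, size, tolerance=0):
    """Collect reference ratios for all sizes within `tolerance` of `size`."""
    return [r for s, rs in ratios_by_size.items()
            if abs(s - size) <= tolerance for r in rs]

def cluster_assessment_index(r_c, reference_ratios):
    """Quotient of (r_c - median) and the average deviation of the references.

    "Average deviation" is taken here as the mean absolute deviation from
    the median; this is an assumption of the sketch.
    """
    if not reference_ratios:
        raise ValueError("no reference ratios: not enough statistical foundation")
    med = statistics.median(reference_ratios)
    avg_dev = sum(abs(r - med) for r in reference_ratios) / len(reference_ratios)
    if avg_dev == 0:
        raise ValueError("average deviation is zero; the index is undefined")
    return (r_c - med) / avg_dev
```

Setting `tolerance` above zero corresponds to the approximation that also uses reference subsets of similar size, which helps when few or no reference subsets have exactly the size of C.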
- In step 230, the cluster properties of the measurement subset C are assessed based on the value of the cluster assessment index. More particularly, the decision of whether the measurement subset C is a cluster or not is preferably done in a similar way as in the first embodiment.
- In step 235, it is determined whether there is a relationship between the measurement objects in the measurement subset based on the assessment of the cluster properties of the measurement subset C. More particularly, it is determined that there is a relationship between the measurement objects of the measurement subset C if the measurement subset is decided to be a cluster. After step 235 the method is ended.
- two measurement data sets are described consisting of objects with two numeric measurement values, i.e. two-dimensional clustering.
- the selection of two numeric measurement values is done solely for ease of illustration.
- the embodiments are not limited to two-dimensional clusters, but are advantageously used also for clusters consisting of measurement objects each of which has a large number of numeric measurement values.
- a measurement data set consisting of two hundred points is to be analysed.
- the measurement data set was created by first selecting five well-separated positions from within [-5,5] x [-5,5]. These positions were then used as centres in "Gaussian" clusters containing respectively 10, 20, 50, 50, and 70 points.
- the average-link hierarchical clustering method was applied using the Euclidean distance measure, and the size and average NEN/FIN ratio of each measurement subset found by the clustering method were recorded.
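The clustering step of this example can be sketched with a naive average-link agglomerative procedure that records every subset it forms. This is an illustrative pure-Python sketch, not the implementation used for the example; the name `average_link_subsets` is hypothetical, ties are broken by insertion order, and the run time is cubic in the number of points.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length sequences of numbers."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def average_link_subsets(points, dist=euclidean):
    """Return every non-singleton proper subset formed by average-link merging.

    Subsets are tuples of sorted point indices, listed in merge order.
    """
    clusters = {i: (i,) for i in range(len(points))}
    # Average distance between each pair of current clusters, keyed (low, high).
    d = {(a, b): dist(points[a], points[b])
         for a in clusters for b in clusters if a < b}
    next_id = len(points)
    subsets = []
    while len(clusters) > 1:
        a, b = min(d, key=d.get)  # closest pair of clusters
        na, nb = len(clusters[a]), len(clusters[b])
        # Average-link update: size-weighted mean of the two old distances.
        new_d = {}
        for c in clusters:
            if c in (a, b):
                continue
            dac = d[(min(a, c), max(a, c))]
            dbc = d[(min(b, c), max(b, c))]
            new_d[c] = (na * dac + nb * dbc) / (na + nb)
        merged = clusters.pop(a) + clusters.pop(b)
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        for c, v in new_d.items():
            d[(c, next_id)] = v  # next_id is larger than every existing id
        clusters[next_id] = merged
        next_id += 1
        if len(merged) < len(points):  # skip the root, which is not proper
            subsets.append(tuple(sorted(merged)))
    return subsets
```

Each returned subset corresponds to one internal node of the dendrogram, so its size and NEN/FIN ratio can be recorded as described above.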
- the same clustering method and distance measure were applied to two hundred reference data sets each consisting of two hundred points drawn at random from the smallest rectangle that contains the measurement data set in figure 3, and the sides of which are parallel with the principal components of this measurement data set.
- For each reference subset found, its size and NEN/FIN ratio were recorded. It is to be noted that the region from which the points are drawn at random need not be the smallest rectangle that contains the data set, but may be any other suitable region, such as the convex hull.
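Drawing reference points from the smallest rectangle aligned with the principal components can be sketched as follows for the two-dimensional case. The function name `sample_pca_rectangle` and its parameters are hypothetical, and the sketch assumes points are uniformly distributed over that rectangle.

```python
import math
import random

def sample_pca_rectangle(data, n_points, seed=None):
    """Draw points uniformly from the principal-axis-aligned bounding
    rectangle of two-dimensional `data` (a list of (x, y) pairs)."""
    rng = random.Random(seed)
    n = len(data)
    mx = sum(p[0] for p in data) / n
    my = sum(p[1] for p in data) / n
    sxx = sum((p[0] - mx) ** 2 for p in data)
    syy = sum((p[1] - my) ** 2 for p in data)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in data)
    # Angle of the first principal axis of the 2x2 scatter matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    # Coordinates of the data in the principal-axis frame.
    u = [(p[0] - mx) * c + (p[1] - my) * s for p in data]
    v = [-(p[0] - mx) * s + (p[1] - my) * c for p in data]
    out = []
    for _ in range(n_points):
        a = rng.uniform(min(u), max(u))
        b = rng.uniform(min(v), max(v))
        # Rotate back to the original coordinate system.
        out.append((mx + a * c - b * s, my + a * s + b * c))
    return out
```

A different region, such as the convex hull mentioned above, would need a different sampler; only the rectangle case is sketched here.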
- In figure 4, each star represents a measurement subset found in the measurement data set in figure 3.
- the position of a star is defined by the NEN/FIN ratio and the size, i.e. the number of measurement objects, of the corresponding measurement subset.
- The reference subsets are represented by box plots. That is, if four or more reference subsets have the same size, they are represented with a box that spans the two middle quartiles of their NEN/FIN ratios. Their median ratio is shown by a horizontal line segment within the box. The minimum and maximum ratios are indicated with dots, which are connected to the box with vertical line segments. If one through three reference subsets have the same size, they are represented with their minimum and maximum ratios connected by a vertical line.
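The per-size summary behind such a box plot can be sketched as follows. The quartile convention (the "inclusive" method of Python's `statistics.quantiles`) and the name `summarise_by_size` are assumptions of this sketch; the source does not state how the quartiles were computed.

```python
import statistics
from collections import defaultdict

def summarise_by_size(records):
    """Summarise (size, ratio) records of reference subsets per subset size.

    Sizes with four or more ratios get quartiles and a median (the box);
    sizes with one to three ratios get only the minimum and maximum.
    """
    by_size = defaultdict(list)
    for size, ratio in records:
        by_size[size].append(ratio)
    summary = {}
    for size, ratios in by_size.items():
        entry = {"min": min(ratios), "max": max(ratios)}
        if len(ratios) >= 4:
            q1, med, q3 = statistics.quantiles(ratios, n=4, method="inclusive")
            entry.update(q1=q1, median=med, q3=q3)
        summary[size] = entry
    return summary
```

The resulting dictionary holds the values needed to draw each box or vertical line, and the `q3` and `max` entries are the thresholds used when grading a measurement subset of that size.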
- Figure 5 shows a smaller part of figure 4.
- Although figures 4 and 5 give a very clear visual indication as to which measurement subsets (if any) are unusually compact and isolated, they do not give a numerical index for how unusual a measurement subset is.
- the value of the cluster assessment index according to the second embodiment described with reference to figure 2 may be calculated for the identified measurement subsets.
- a high positive value for the cluster assessment index is an indication that a measurement subset is a good cluster.
- the assessment of the cluster properties of the identified measurement subsets is done according to the first embodiment described above with reference to figure 1. Thus, if the NEN/FIN ratio for a measurement subset is larger than the maximum NEN/FIN ratio of the reference subsets of the same size as the measurement subset, it is awarded unusualness grade two, which should be interpreted as an indication that the probability is high that the measurement subset is a cluster. If the NEN/FIN ratio for the measurement subset lies in the upper quartile (that is, above the box but below the maximum value), its grade is set to one, which should be interpreted as an indication that the probability is medium that the measurement subset is a cluster. If the NEN/FIN ratio for the measurement subset lies within or below the box, it gets grade zero, which should be interpreted as an indication that the probability that the measurement subset is a cluster is low.
- the measurement subset gets grade one if it is higher than the median of the NEN/FIN ratio for the reference subsets of the same size as the measurement subset, and grade zero if it is lower.
- a disadvantage with the kind of diagram shown in figure 4 is that it does not show hierarchical relationships. To show both the hierarchical relationship between measurement subsets and their unusualness, we may code the nodes of the corresponding dendrogram.
- Figure 6 shows the coded dendrogram for the measurement data set in figure 3: a dotted bar corresponds to grade two, a dashed bar to grade one, and a blue bar to grade zero.
- This kind of dendrogram gives the analyst a better view of the internal structure of higher-grade measurement subsets. For example, if both sub-dendrograms of a dotted node contain dotted or dashed nodes of significant size, this is an indication that the dotted node represents a super-cluster (i.e. a cluster of real clusters, i.e. measurement subsets with grade 1 or 2). On the other hand, if only one of the sub-dendrograms contains dotted or dashed nodes, and the other sub-dendrogram is relatively small, then the objects represented by the latter may be regarded as insignificant "satellites" to the cluster represented by the first sub-dendrogram.
- An analyst applying these ideas to the dendrogram in figure 6 may decide that the nodes marked A through E are the most interesting nodes.
- the clusters, i.e. measurement subsets with grade 1 or 2, corresponding to these nodes are shown in figure 7. As can be seen, they are essentially identical to the five clusters present in the measurement data set. That the A cluster has a lower grade than the others seems fair: with respect to its size, it is clearly less compact and isolated than the others are.
- Figure 8 shows various measurement subsets with their grades. Observe that low-grade clusters compared to higher-grade clusters either have large diameters (relative to their size) or are less isolated.
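- the super-cluster and satellite readings of a coded dendrogram described above can be approximated by a simple rule. The Python sketch below is a hypothetical illustration: the `Node` class, the function names, and the thresholds `min_size` ("significant size") and `satellite_fraction` ("relatively small") are assumptions, since the text leaves these judgements to the analyst.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """A dendrogram node: number of objects below it and its unusualness grade."""
    size: int
    grade: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def has_graded_node(node, min_size):
    """True if the sub-dendrogram contains a grade 1 or 2 node of at least min_size."""
    if node is None:
        return False
    if node.grade >= 1 and node.size >= min_size:
        return True
    return has_graded_node(node.left, min_size) or has_graded_node(node.right, min_size)


def interpret(node, min_size=5, satellite_fraction=0.1):
    """Suggest a reading of a high-grade node; both thresholds are hypothetical."""
    left_hit = has_graded_node(node.left, min_size)
    right_hit = has_graded_node(node.right, min_size)
    if left_hit and right_hit:
        # Both sub-dendrograms contain sizeable graded nodes.
        return "super-cluster"
    if left_hit != right_hit:
        other = node.right if left_hit else node.left
        # The ungraded side is small relative to the whole node.
        if other is not None and other.size <= satellite_fraction * node.size:
            return "cluster with satellites"
    return "undetermined"
```

- such a rule can at most pre-select candidate nodes; the final choice of interesting nodes remains with the analyst.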
- the first example is rather simple: not only is it two-dimensional, it also contains very "crisp" clusters.
- real data sets often have various kinds of "noise", that is, data that carries no relevant information.
- the same data set as used in the first example has been masked by the addition of two hundred points chosen at random from a rectangular region.
- the measurement data set is shown in figure 9. After repeating the same procedure as in the first example, figures 10 and 11 are obtained. Although the NEN/FIN ratios of the measurement subsets found in this measurement data set are substantially smaller than in the first example, there are still quite a few higher-grade measurement subsets.
- Figure 12 shows the corresponding dendrogram.
- Figure 14 shows a selection of measurement subsets that were found in the data set in figure 9.
- measurement subsets corresponding to the clusters in figure 3 all have at least grade one, while measurement subsets containing the random points have grade zero (except in one case).
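- the masking step of the second example, adding random points from a rectangular region to an existing two-dimensional data set, can be reproduced along the following lines. This Python sketch assumes a uniform distribution over the rectangle and caller-supplied bounds; neither detail, nor the function name `add_uniform_noise`, is given in the text.

```python
import random


def add_uniform_noise(points, n_noise, x_range, y_range, seed=None):
    """Return the original 2-D points followed by n_noise random points.

    The noise points are drawn uniformly (an assumption) from the rectangle
    x_range x y_range, each range given as a (low, high) pair.
    """
    rng = random.Random(seed)  # local generator, so runs are reproducible per seed
    noise = [
        (rng.uniform(*x_range), rng.uniform(*y_range))
        for _ in range(n_noise)
    ]
    return list(points) + noise
```

- calling the function with `n_noise=200` corresponds to the two hundred points mentioned in the example.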
- the application of the embodiments of the invention in different technical areas for measurement data sets consisting of measurement objects each having a number of numeric measurement values derived in an experiment or test within that technical area will be apparent to the skilled person. For example, in an application of the embodiments within the biotechnology area, consider the scenario where a micro array test has been performed for a set of mRNA samples, each associated with a separate individual from a set of individuals.
- the result of each micro array test is a set of expression levels for a selected number of nucleotide sequences comprised on the micro array, wherein the nucleotide sequences are found in each of the mRNA samples.
- each mRNA sample will be a measurement object, and the expression levels of the nucleotide sequences associated with the mRNA sample will be the numeric measurement values of the measurement object.
- an identified cluster indicates that the mRNA samples of the cluster have similar expression levels for the nucleotide sequences.
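- the mapping from micro array results to a measurement data set can be illustrated with a short Python sketch. The function names, the dictionary layout of the input and the use of Euclidean distance to compare samples are assumptions made for this illustration and are not taken from the text.

```python
from math import dist


def to_measurement_data_set(expression_by_sample, sequence_ids):
    """Turn per-sample expression levels into measurement objects and values.

    expression_by_sample: {sample_id: {sequence_id: expression_level}}
    sequence_ids: the nucleotide sequences to use, in a fixed order.
    Returns (objects, values), where values[i] holds the numeric
    measurement values of objects[i].
    """
    objects = sorted(expression_by_sample)
    values = [
        [expression_by_sample[sample][seq] for seq in sequence_ids]
        for sample in objects
    ]
    return objects, values


def pairwise_distances(values):
    """Euclidean distance between every pair of measurement objects (assumption)."""
    return [[dist(a, b) for b in values] for a in values]
```

- each row of `values` then plays the role of one measurement object, so samples with similar expression levels end up close to each other in the distance matrix.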
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2003234380A AU2003234380A1 (en) | 2002-05-24 | 2003-05-21 | A method for determining if there is a relationship between measurement objects in a measurement data set |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
SE0201576A SE0201576D0 (sv) | 2002-05-24 | 2002-05-24 | New method |
SE0201576-6 | 2002-05-24 | ||
SE0202359-6 | 2002-08-02 | ||
SE0202359A SE0202359D0 (sv) | 2002-05-24 | 2002-08-02 | New method |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2003100660A1 true WO2003100660A1 (fr) | 2003-12-04 |
Family
ID=26655706
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/SE2003/000815 WO2003100660A1 (fr) | 2002-05-24 | 2003-05-21 | Procede permettant de determiner s'il existe une relation entre des objets de mesure dans un ensemble de donnees de mesure |
Country Status (3)
Country | Link |
---|---|
AU (1) | AU2003234380A1 (fr) |
SE (1) | SE0202359D0 (fr) |
WO (1) | WO2003100660A1 (fr) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115034690A (zh) * | 2022-08-10 | 2022-09-09 | 中国航天科工集团八五一一研究所 | 一种基于改进模糊c-均值聚类的战场态势分析方法 |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2002003256A1 (fr) * | 2000-07-05 | 2002-01-10 | Camo, Inc. | Procede et systeme d'analyse dynamique de donnees |
- 2002-08-02: SE application SE0202359A (SE0202359D0), status unknown
- 2003-05-21: WO application PCT/SE2003/000815 (WO2003100660A1), not active, application discontinued
- 2003-05-21: AU application AU2003234380A (AU2003234380A1), not active, abandoned
Non-Patent Citations (2)
Title |
---|
CALINSKI T. ET AL.: "A dendrite method for cluster analysis", COMMUNICATIONS IN STATISTICS, vol. 3, no. 1, 1974, pages 1 - 27, XP002962668 * |
KUI ZHANG ET AL.: "Assessing reliability of gene cluster from gene expression data", FUNCT. INTEGR. GENOMICS, vol. 1, 2000, pages 156 - 173, XP002963984 * |
Also Published As
Publication number | Publication date |
---|---|
AU2003234380A1 (en) | 2003-12-12 |
SE0202359D0 (sv) | 2002-08-02 |
Legal Events
Code | Title | Description |
---|---|---|
AK | Designated states | Kind code of ref document: A1. Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NI NO NZ OM PH PL PT RO RU SC SD SE SG SK SL TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW |
AL | Designated countries for regional patents | Kind code of ref document: A1. Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
122 | Ep: pct application non-entry in european phase | |
NENP | Non-entry into the national phase | Ref country code: JP |
WWW | Wipo information: withdrawn in national office | Country of ref document: JP |