WO2018228667A1 - Automatic feature selection in machine learning - Google Patents
- Publication number
- WO2018228667A1 (PCT/EP2017/064317)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- analysis
- factor
- training data
- features
- rules
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/211—Selection of the most significant subset of features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
Definitions
- A common group factor snapshot picture may, for example, be produced.
- Fig. 7 shows an exemplary process of assigning an attribute vector to an area of analysis.
- The attribute vector may be created from domain information. That is, for each domain, an attribute factor (AF) may be generated as a function of the context and the attributes of the domain: context-driven factors/weights for the attributes/properties of the relevant managed object types, based on their relevance to the context.
- A typical implementation of attribute vector generation may use a factor analysis, which allows selecting independent feature components, and coefficients in a new, reduced feature space, creating an attribute vector by unsupervised learning.
- Factor analysis may operate on the variance and covariance matrix and hence be sensitive to fuzzification and normalization. Weighting of group factors may additionally be used to increase the confidence and robustness.
- For an analysis like a root-cause analysis (RCA), the unsupervised factor analysis groups may be associated with standard common factors.
- the factor analysis groups may be checked and improved using expert rules from domain context. Also, there may be a tradeoff between unsupervised factor analysis groups and context domain driven factor groups.
- As output, there may be a dynamic group-factor snapshot picture for each major object resource in a cloud.
- Factors can usually be a common group of features or a type of external or internal force. Some factors may be basic and evaluated as simple unique features.
- Fig. 8 shows a process for fusing context factors and attribute factors.
- Feature factors may be generated for the set of all features given as input, as a function of the attribute factor and the set of all features given as input.
- FeatureFactor FF_i = f3(x_i, AF), where x_i ∈ X, X represents the set of all features given as input for learning, and AF is the attribute factor, which may be obtained from domain and context.
- Fusion by concatenation of the features from context factor generation and the features from attribute factor generation may be used, and factor analysis may be applied in order to find common group correlations between some of the features.
- These features may be fused from the 'static context' indicator snapshot and the 'dynamic' attributes of each entity, object, and shared resource in the cloud. The features from the selected groups may be controlled and evaluated.
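The fusion-by-concatenation step might be sketched as follows; the feature matrices, their dimensions, and the use of a plain correlation matrix to look for common group structure are illustrative assumptions, not taken from the disclosure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative feature matrices: 100 samples each of 'static context'
# indicators and 'dynamic' per-object attribute features (invented shapes).
context_features = rng.normal(size=(100, 3))
attribute_features = rng.normal(size=(100, 4))

# Fusion by concatenation, followed by a correlation check to find
# common group correlations across the two feature sets.
fused = np.concatenate([context_features, attribute_features], axis=1)
corr = np.corrcoef(fused, rowvar=False)
print(fused.shape, corr.shape)  # (100, 7) (7, 7)
```

In practice, a factor analysis would be applied to the fused matrix rather than a raw correlation matrix, but the correlation matrix makes the "common group" idea visible in a few lines.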
- Fig. 9 shows a process of selecting features.
- Each feature 20 of the training data 16a, 16b may be assessed using the feature factor, and the features may be selected as a function of the input features and the feature factor, which is generated using the properties of the MO (App/Service/Resource/...) and the context.
- The features 20 may be selected subject to a confidence threshold.
- Fig. 10 shows an overview of the steps of the feature selection process.
- s_i could be basic fuzzified features (based on softmax, hyperbolic tangent, sigmoid functions, etc.) and f_i could be the mapping of the normalized features to factor components.
Abstract
Provided is a learning module to extract rules from training data and a feature selection module to determine features of the training data to be used for extracting the rules. The feature selection module is to receive context data of the rules to be extracted and domain information on the training data, the context data to specify an area of analysis in which the extracted rules are to be used, and the domain information indicating one or more technical environments to which the training data pertains.
Description
AUTOMATIC FEATURE SELECTION IN MACHINE LEARNING
FIELD
The present disclosure relates to machine learning. In particular, the present disclosure relates to automatic feature selection in machine learning. BACKGROUND
In machine learning, real-world problems often involve data with a large number of features. However, not all features may be essential, as features may be redundant or even irrelevant. Taking into account redundant or irrelevant features may reduce the performance of an algorithm. Feature selection aims to solve this problem by selecting only a subset of relevant features from a large set of available features. By removing redundant or irrelevant features, feature selection may help reduce the dimensionality of the data, speed up the learning process, simplify the learnt model, and/or increase the performance.
SUMMARY
According to a first aspect of the present invention, there is provided a system comprising a learning module to extract rules from training data, and a feature selection module to determine features of the training data to be used for extracting the rules, wherein the feature selection module is to receive context data of the rules to be extracted and domain information on the training data, the context data to specify an area of analysis in which the extracted rules are to be used, and the domain information indicating one or more technical environments to which the training data pertains.
In this regard, it is noted that the term "module" as used throughout the description and claims in particular refers to software, hardware, or a combination of software and hardware.
Hence, the feature selection module may automatically select the features based on the context data specifying the area of analysis in which the extracted rules are to be used, and the domain information indicating the one or more technical environments to which the training data pertains. Accordingly, the feature selection module may be enabled to map the area of analysis to the features which are relevant, while taking into account the technical environment in which the features were produced.
In a first possible implementation form of the first aspect, the system comprises an analytics module, the analytics module to provide a plurality of services, wherein the services are directed at different areas of analysis, wherein the context data is to specify one area of analysis of the different areas of analysis at which the services are directed. In this regard, it is noted that the term "service" as used throughout the description and claims in particular refers to the provision of data in response to a request.
For example, the analytics module may be directed at data mining and provide data in response to a request to identify a pattern in live data.
In a second possible implementation form of the first aspect, the context data is to further specify a technique to be applied by the learning module.
Hence, the selection of relevant features may be particularly focused on features which lend themselves to application of a particular machine learning algorithm.
In a third possible implementation form of the first aspect, the technique comprises one or more of classification, regression, clustering, prediction, and anomaly detection. In a fourth possible implementation form of the first aspect, the different areas of analysis comprise one or more of a root cause analysis, a service impact analysis, a fault prediction analysis, a traffic prediction analysis, a security/threat analysis, a service/resource optimization analysis, and a service/application performance analysis.
In a fifth possible implementation form of the first aspect, the one or more technical environments include one or more of application management, server management, telecommunications networks, wide area networks, data center network operations, cloud operations, and security operations.
In a sixth possible implementation form of the first aspect, the feature selection module is to assign different factor vectors to different areas of analysis, wherein a factor vector comprises a plurality of entries, a value of an entry being a metric for the congruence between a factor and an area of analysis.
Accordingly, a factor vector may indicate a relevance of factors to an area of analysis.
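One way to picture such a factor vector is sketched below; the factor names, areas of analysis, congruence values, and the 0.5 threshold are all illustrative assumptions, not values from the disclosure.

```python
import numpy as np

# Hypothetical factors and areas of analysis (illustrative only).
FACTORS = ["load", "latency", "error_rate", "access_pattern"]

# One factor vector per area of analysis; each entry is a congruence
# metric (0..1) between a factor and that area of analysis.
factor_vectors = {
    "root_cause_analysis": np.array([0.9, 0.7, 0.8, 0.2]),
    "security_threat":     np.array([0.1, 0.2, 0.5, 0.9]),
    "traffic_prediction":  np.array([0.8, 0.9, 0.3, 0.1]),
}

def relevant_factors(area, threshold=0.5):
    """Return the factors whose congruence with the area exceeds the threshold."""
    vec = factor_vectors[area]
    return [f for f, v in zip(FACTORS, vec) if v > threshold]

print(relevant_factors("root_cause_analysis"))  # ['load', 'latency', 'error_rate']
```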
In a seventh possible implementation form of the first aspect, the feature selection module is to determine a relationship between features and factors, wherein a congruence between a feature and a factor is to be determined based on fuzzification.
For instance, a feature may be processed based on a fuzzy logic normalization and a factor analysis may be performed to determine factors which identify the context.
In an eighth possible implementation form of the first aspect, the feature selection module is to assign different attribute vectors to the different areas of analysis, wherein an attribute vector comprises a subset of the factors, wherein the subset is selected based on a relevance score of the factors in view of the area of analysis. Hence, an attribute vector may indicate which factors are particularly relevant to an area of analysis.
In a ninth possible implementation form of the first aspect, the feature selection module is to assign scores to the features of the training data based on the attribute vector corresponding to the area of analysis. Thus, the features having scores above a threshold may be selected and used for training of the learning module.
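The scoring-and-thresholding idea in this implementation form might be sketched as follows; the feature names, the feature-to-factor weight matrix, the attribute vector, and the threshold are invented for illustration.

```python
import numpy as np

def select_features(features, attribute_vector, weights, threshold=0.5):
    """Score each training-data feature against the attribute vector of the
    chosen area of analysis and keep the features scoring above the threshold.

    weights[i, j] is an assumed congruence between feature i and factor j,
    e.g. as could be obtained from a factor analysis."""
    scores = weights @ attribute_vector          # one score per feature
    return [f for f, s in zip(features, scores) if s > threshold]

features = ["alarm_type", "occurrence_time", "location", "serial_number"]
attribute_vector = np.array([0.8, 0.6])          # two illustrative factors
weights = np.array([[0.9, 0.4],                  # alarm_type
                    [0.7, 0.1],                  # occurrence_time
                    [0.2, 0.9],                  # location
                    [0.0, 0.1]])                 # serial_number
print(select_features(features, attribute_vector, weights))
```

Here the serial number scores far below the threshold and is dropped, matching the intuition that an identifier carries little relevance to most areas of analysis.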
According to a second aspect of the present invention, there is provided a method of training data feature selection for extracting rules from the training data, comprising receiving context data of the rules to be extracted and domain information on the training data, the context data to specify an area of analysis in which the extracted rules are to be used, and the domain information indicating one or more technical environments to which the training data pertains, selecting features of the training data based on the context data and the domain information, and feeding a machine learning module with the training data and information on the selected features.
Hence, the method may automatically select the features based on the context data specifying the area of analysis in which the extracted rules are to be used, and the domain information indicating the one or more technical environments to which the training data pertains. Accordingly, the method may map the area of analysis to the features which are relevant, while taking into account the technical environment in which the features were produced.
In a first possible implementation form of the second aspect, the different areas of analysis comprise one or more of a root cause analysis, a service impact analysis, a fault prediction analysis, a traffic prediction analysis, a security/threat analysis, a service/resource optimization analysis, and a service/application performance analysis.
In a second possible implementation form of the second aspect, the one or more technical environments include one or more of application management, server management, telecommunications networks, wide area networks, data center network operations, cloud operations, and security operations.
In a third possible implementation form of the second aspect, the method comprises assigning different factor vectors to different areas of analysis, wherein a factor vector comprises a plurality of entries, a value of an entry being a metric for the congruence between a factor and an area of analysis.
Accordingly, as indicated above, a factor vector may indicate a relevance of factors to an area of analysis.
In a fourth possible implementation form of the second aspect, the method comprises determining a relationship between features and factors, wherein determining a congruence between a feature and a factor is based on fuzzification.
Hence, as indicated above, a feature may be processed based on a fuzzy logic normalization and a factor analysis may be performed to determine factors which identify the context.
BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 shows a block diagram of an exemplary system;
Fig. 2 shows a flow-chart of a machine learning process;
Fig. 3 shows another flow-chart of the machine learning process of Fig. 2;
Fig. 4 shows examples of domain information;
Fig. 5 shows examples of context data;
Fig. 6 shows an exemplary process of assigning factors to a context;
Fig. 7 shows an exemplary process of assigning an attribute vector to an area of analysis;
Fig. 8 shows a process for fusing context factors and attribute factors;
Fig. 9 shows a process of selecting features; and
Fig. 10 shows an overview of the steps of the feature selection process. DETAILED DESCRIPTION
The following exemplary system and method relate to unsupervised machine learning for addressing challenges faced in the operation of complex systems, such as cloud computing systems involving a plurality of interoperating computing devices, although the system and method are not limited to cloud computing systems. In particular, the exemplary system and method are directed at optimizing the feature selection prior to machine learning, e.g., in the area of operational analytics, and may improve the usability and accuracy as compared to feature selection by experts.
Fig. 1 shows a block diagram of an exemplary system 10. The system 10, which may be a computing system comprising one or more interoperating computing devices, may comprise a feature selection module 12 and a machine learning module 14. The feature selection module 12 may be provided with training data 16a, 16b. The training data 16a, 16b may be collected from a single source or multiple sources and may comprise a plurality of data fields 18. Each data field 18 may include one or more features 20.
For instance, a feature 20 may refer to an alarm serial number, an alarm type, a (first) occurrence time, a clearance time, a location, etc. For example, the training data 16a, 16b may be obtained from multiple sources, wherein some training data 16a is sparse and some training data 16b is dense. Hence, in a pre-processing step, multiple indicators may be extracted from the data distribution, and normalization and feature scaling may be performed, e.g., using softmax, sigmoid functions, etc.
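The normalization and feature-scaling step could look like the following sketch; the raw indicator values are invented for illustration, and standardizing before squashing is one common choice, not a requirement of the disclosure.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))   # shift for numerical stability
    return e / e.sum()

# Raw indicators extracted from sparse and dense sources (illustrative values
# on very different scales).
raw = np.array([3.0, 250.0, 0.02, 17.0])

# Standardize first so indicators from different sources are comparable,
# then squash into (0, 1) with a sigmoid.
standardized = (raw - raw.mean()) / raw.std()
scaled = sigmoid(standardized)

# Alternatively, softmax turns the indicators into a distribution summing to 1.
dist = softmax(standardized)
print(scaled.round(3), dist.sum())
```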
The feature selection module 12 may carry out a factor analysis to extract initial group components and then improve the grouping according to relevance based on context semantic reward functions and rules. The factor analysis may produce a decorrelation of features and an independent group extraction. A relevance mechanism may evaluate the factor groups and associate the factor groups with domain knowledge. The features 20 remaining after feature selection within the filtered training data 22 may then be used to train the machine learning module 14 for purposes such as pattern classification, regression, clustering, prediction, anomaly detection, etc.
This may reduce the computational cost and infrastructure needed to select the features, while allowing the whole set of features relevant to the context, including the scope of the learnt model/rules, to be considered, and provides a semantic approach to complex problems, which may be fused with the factor analysis. As shown in Fig. 2, the trained machine learning module 14 may be validated using test data. Once validated, rules may be extracted from the machine learning module 14 and used to analyze a system.
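As a rough illustration of the factor-analysis step, the sketch below uses scikit-learn's FactorAnalysis on synthetic data; the synthetic latent structure and the argmax-loading grouping heuristic are assumptions for illustration, not the patented procedure.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Synthetic training data: 200 samples, 6 features driven by 2 latent factors
# plus a little noise (invented structure for the example).
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 6))

fa = FactorAnalysis(n_components=2, random_state=0)
fa.fit(X)

# Group each feature with the latent factor on which it loads most strongly;
# this is one simple way to form initial, decorrelated factor groups.
groups = np.abs(fa.components_).argmax(axis=0)
print(groups)  # one group label (0 or 1) per feature
```

A relevance mechanism, as described above, would then score these unsupervised groups against domain knowledge rather than using them directly.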
As indicated in Fig. 3, selecting the features 20 may be based on domain information (such as the domain information shown in Fig. 4) indicating the source of the training data 16a, 16b. Furthermore, as also indicated in Fig. 3, selecting the features 20 may also be based on context data indicating the scope of application of the learnt model/rules as well as a type of machine learning algorithm/strategy employed by the machine learning module 14, as exemplarily illustrated in Fig. 5, where the broken line indicates an example of a chosen combination of a scope of application and a machine learning algorithm/strategy employed by the machine learning module 14. A context may be defined as a function of input factors that are assumed to impact the system under consideration. For example, a context may be expressed by C = f1(s_i), which may represent adaptive context modes of an environment factor that influence the feature selection, such as:
Cloud Operations

Multiple context indicators may be extracted. Raw context may be transformed to "feature space".
Cloud operation fuzzy functions may encode factors such as reliability, availability of services and data, shared resources data, number of active clients, security, complexity, energy consumption and costs, regulations and legal issues, performance, migration, reversion, lack of standards, limited customization, privacy issues, etc.:
1. Control Reliability Factor

s1 - stddev()
s2 - mean()
s3 - snapshot entropy rate = entropy(current state snapshot) / max entropy(all states)

entropy = -Σi pi log pi, where pi is the count statistic (prior probability) that the i-th cell-id occurs in serving cell-id #1 or next neighbor cell-id #2 within a timeslot (2 minutes / 1 minute / 30 sec). In this encoding there is also N, the cardinality (power of the alphabet, with entries like 334-23799 'A', 334-11277 'B'), i.e. the total number of unique cells, which gives the maximum entropy log(N).

s_rate entropy = entropy / max entropy

s4 - presence of 'error type -1'

2. Availability of services and data

s5 - presence of service

Application Management

1. Number of clients and servers
2. Number of VMs
3. Type of interaction (async/sync)

Server Management
1. Resource infrastructure factor
2. Runtime operation factor
3. Memory fragmentation factor
4. Memory free factor
5. Resource synchronization factor
6. Number of processes and their complexity
Network Operations
1. Migration stability factor
2. Network infrastructure stability factor

Security Operations
1. Access rights snapshots
2. Process and resource dependencies snapshot
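The entropy-based control reliability factor above, a snapshot entropy normalized by the maximum entropy log(N), may be sketched as follows. This is an illustrative sketch only and not part of the disclosed embodiment; the function name is an assumption.

```python
import math
from collections import Counter

def entropy_rate(cell_ids):
    """Normalized snapshot entropy: entropy of the observed cell-id counts
    divided by the maximum entropy log(N), N being the number of unique
    cell-ids (the power of the alphabet)."""
    counts = Counter(cell_ids)
    total = sum(counts.values())
    # p_i: prior probability that the i-th cell-id occurs within the timeslot
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    n = len(counts)
    max_entropy = math.log(n) if n > 1 else 1.0
    return entropy / max_entropy  # in [0, 1]; 1.0 for a uniform distribution
```

A value near 1 indicates the snapshot is spread evenly across cells; a value near 0 indicates it is dominated by a single cell.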
A factor analysis may be applied to construct factor groups in unsupervised mode. The initial groups may be decomposed into 4 categories of the cloud state:
• Common factor components detected
• Unexplained factors
• New common factors established
• Tracked factors and factor age
Groups may be compared to the previous state.
Fig. 6 shows an exemplary process of assigning factors to a context. A context may be represented by a function of defined input factors that impact the system under consideration. Hence, contexts may be represented by numerical vectors. This may involve fuzzification of the input and basic features to numerical values and initial groups that represent stronger factors. As indicated above, the fuzzy functions could be sigmoid functions, a softmax transform, tanh, logsig, etc. For better accuracy, the input may be normalized, de-noised and made to follow a normal distribution. In the case of partially sparse data, features may be considered in aggregation, and additional fuzzification may be employed using expert rules.
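The fuzzification step described above might be sketched as follows. The function name and the z-score normalization choice are illustrative assumptions; the text only names sigmoid, softmax, tanh and logsig as candidate fuzzy functions.

```python
import numpy as np

def fuzzify(raw, kind="sigmoid"):
    """Map raw inputs to fuzzy membership values after normalization."""
    x = np.asarray(raw, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-12)  # normalize (z-score)
    if kind in ("sigmoid", "logsig"):
        return 1.0 / (1.0 + np.exp(-x))     # values in (0, 1)
    if kind == "tanh":
        return np.tanh(x)                   # values in (-1, 1)
    if kind == "softmax":
        e = np.exp(x - x.max())
        return e / e.sum()                  # values sum to 1
    raise ValueError(f"unknown fuzzy function: {kind}")
```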
Hence, a context may be given by C = f1(si), where si ∈ R^m and si represents the factors considered for identifying a context, and may be represented by a context factor vector comprising common factors, new common factors, and unique factors.
Factor analysis may be regarded as a statistical method used to describe variability among observed variables in terms of fewer unobserved variables called factors. The observed variables may be modeled as linear combinations of the factors plus an error value:

x = Λ · F + μ + z

where x is the vector of observed variables, μ is the constant vector of means, Λ is the N×M matrix of factor loadings, F is the vector of common factors and z is the vector of independently distributed errors, which leads to:

xi = λi1 · F1 + … + λiM · FM + μi + zi
In factor analysis, two main types of rotation may be used: orthogonal, when the new axes are also orthogonal to each other, and oblique, when the new axes are not required to be orthogonal to each other. Because the rotations are always performed in a subspace (the so-called factor space), the new axes will always exhibit less variance than the original factors.
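The factor model x = Λ · F + μ + z and an orthogonal (varimax) rotation can be illustrated with scikit-learn's FactorAnalysis; the use of scikit-learn and the synthetic two-factor loadings below are assumptions for illustration, not part of the embodiment.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Synthetic data following x = Lambda . F + mu + z with M = 2 common factors
# and N = 6 observed variables (the loadings are illustrative assumptions).
F = rng.normal(size=(500, 2))
Lambda = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.0],
                   [0.0, 1.0], [0.1, 0.9], [0.0, 0.8]])
x = F @ Lambda.T + 0.05 * rng.normal(size=(500, 6))

# Orthogonal (varimax) rotation keeps the new axes orthogonal to each other.
fa = FactorAnalysis(n_components=2, rotation="varimax")
scores = fa.fit_transform(x)     # estimated common factors per observation
loadings = fa.components_.T      # estimated N x M loading matrix
```

The first three observed variables load on one recovered factor and the last three on the other, i.e. the unsupervised analysis recovers the independent groups.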
As output, a common group factor snapshot picture may, for example, be:
• Cloud operation
• Application management
• Server management
• Security operations
Fig. 7 shows an exemplary process of assigning an attribute vector to an area of analysis. In particular, the attribute vector may be created from domain information: for each domain, an attribute factor (AF) may be generated as a function of the context and the attributes of the domain, i.e. context-driven factors/weights for the attributes/properties of the relevant Managed Object Types, based on their relevance to the context.
In this regard, each object and shared resource in the system/cloud may provide an additional specification of the factors: AttrFactor AFi = f2(ci, ai), where ci ∈ C, ai ∈ A, and A represents the attributes/properties of the relevant Managed Object Types of the domain model.
A typical implementation of attribute vector generation may use a factor analysis that allows independent feature components, and coefficients in a new reduced feature space, to be selected, creating an attribute vector by unsupervised learning. Using factor analysis, common factor groups as well as unique (or sparse) features without a common factor may be defined. Factor analysis operates on the variance and covariance matrix and is hence sensitive to fuzzification and normalization. Weighting of group factors may additionally be used to increase confidence and robustness. An analysis such as an RCA (root-cause analysis) may operate with standard common factors that are defined from the domain context. The unsupervised factor analysis groups may be associated with these standard common factors, and may be checked and improved using expert rules from the domain context. There may also be a tradeoff between unsupervised factor analysis groups and context-domain-driven factor groups.
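The association of unsupervised factor-analysis groups with standard common factors via expert rules might be sketched as follows. The group names, member features, overlap rule and threshold are all hypothetical, chosen for illustration only.

```python
# Hypothetical standard common factors from the domain context; the group
# names and member features are invented for illustration.
EXPERT_GROUPS = {
    "cloud_operation": {"stddev", "mean", "entropy_rate"},
    "server_management": {"mem_free", "mem_fragmentation", "n_processes"},
}

def associate(fa_group, threshold=0.5):
    """Associate one unsupervised factor-analysis group (a set of feature
    names) with the standard common factor it overlaps most; groups that
    do not clear the threshold stay 'unexplained' (a hypothetical rule)."""
    best, score = "unexplained", 0.0
    for name, members in EXPERT_GROUPS.items():
        overlap = len(fa_group & members) / len(fa_group)
        if overlap > score:
            best, score = name, overlap
    return best if score >= threshold else "unexplained"
```

Groups left "unexplained" here correspond to the "unexplained factors" category of the cloud state mentioned earlier.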
As output, there may be a dynamic group factor snapshot picture for each major object resource in a cloud:
• Factors selected from principal Objects of Cloud operation indicators according to the timeframe dynamic features and background context.
• Factors selected from principal Objects in Application Management & Context.
• Factors from principal Objects in Server Management & Context.
• Factors representing principal Objects in Security Operations.
Factors can usually be a common group of features and a type of external or internal force. Some factors may be basic and evaluated as simple unique features.
Fig. 8 shows a process for fusing context factors and attribute factors. Feature factors may be generated for the set of all features given as input, as a function of the attribute factor and the set of all features given as input.
FeatureFactor FFi = f3(xi, AF), where xi ∈ X, and X represents the set of all features given as input for learning, with AF as the attribute factor, which may be obtained from the domain and context.
Hence, a fusion concatenation of the features of context factor generation and the features of attribute factor generation may be used, and factor analysis may be applied in order to find common-group correlations between some of the features. These features may be fused from the 'static context' indicator snapshot and the 'dynamic' attributes of each entity, object, and shared resource in the cloud. The features from the selected groups may be controlled and evaluated.
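A minimal sketch of the fusion concatenation and a correlation-based search for common groups is given below; the correlation threshold, function name, and synthetic data are assumptions made for illustration.

```python
import numpy as np

def fuse_and_group(context_feats, attribute_feats, corr_threshold=0.8):
    """Concatenate 'static context' indicator features with 'dynamic'
    attribute features, then flag fused feature pairs whose correlation
    suggests membership in a common group (threshold is an assumption)."""
    fused = np.hstack([context_feats, attribute_feats])
    corr = np.corrcoef(fused, rowvar=False)
    n = corr.shape[0]
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)
             if abs(corr[i, j]) >= corr_threshold]
    return fused, pairs

rng = np.random.default_rng(1)
ctx = rng.normal(size=(100, 2))
# One attribute feature is a scaled copy of a context feature plus noise,
# so it should fall into a common group with that context feature.
attr = np.hstack([ctx[:, :1] * 2 + 0.01 * rng.normal(size=(100, 1)),
                  rng.normal(size=(100, 1))])
fused, pairs = fuse_and_group(ctx, attr)
```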
Fig. 9 shows a process of selecting features. Each feature 20 of the training data 16a, 16b may be assessed using the feature factor, and the features may be selected as a function of the input features and the feature factor, which is generated using the properties of the MO (App/Service/Resource/...) and the context.
The selected feature set may thus be represented as X″ = f4(X, FF), where X represents the set of all features 20 given as input and FF represents the feature factor obtained from the domain and context. The features 20 may be selected subject to a certain confidence threshold.
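The final selection step X″ = f4(X, FF) under a confidence threshold might, assuming FF holds one relevance score per feature, be sketched as:

```python
import numpy as np

def select_features(X, feature_factor, confidence=0.6):
    """f4: keep only the columns of X whose feature-factor score clears the
    confidence threshold. `feature_factor` is assumed to hold one relevance
    score in [0, 1] per input feature (the output of f3)."""
    keep = np.asarray(feature_factor) >= confidence
    return X[:, keep], np.flatnonzero(keep)

X = np.arange(12.0).reshape(3, 4)      # 3 samples, 4 candidate features
ff = [0.9, 0.2, 0.7, 0.4]              # hypothetical feature-factor scores
X_sel, idx = select_features(X, ff)    # columns 0 and 2 survive
```

The surviving columns would then form the filtered training data fed to the machine learning module.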
Fig. 10 shows an overview of the steps of the feature selection process. In particular, si could be basic fuzzified features (based on using softmax, hyperbolic tangent, sigmoid function, etc.) and fi could be the mapping of the normalized features to factor components.
Claims
1. A system comprising: a learning module configured to extract rules from training data; and a feature selection module configured to determine features of the training data to be used for extracting the rules; wherein the feature selection module is configured to receive context data of the rules to be extracted and domain information on the training data, wherein the context data specify an area of analysis in which the extracted rules are to be used, and the domain information indicates one or more technical environments to which the training data pertains.
2. The system of claim 1, comprising: an analytics module configured to provide a plurality of services, wherein the services are directed at different areas of analysis, wherein the context data specify one area of analysis of the different areas of analysis at which the services are directed.
3. The system of claim 1 or 2, wherein the context data further specify a technique to be applied by the learning module.
4. The system of claim 3, wherein the technique comprises one or more of: classification; and clustering.
5. The system of any one of claims 1 to 4, wherein the different areas of analysis comprise one or more of: a root cause analysis; a service impact analysis; a fault prediction analysis; a traffic prediction analysis; a security/threat analysis;
a service/resource optimization analysis; and a service/application performance analysis.
6. The system of any one of claims 1 to 5, wherein the one or more technical environments include one or more of: application management; server management; telecommunications networks; wide area networks; data center network operations; cloud operations; and security operations.
7. The system of any one of claims 1 to 6, wherein the feature selection module is configured to assign different factor vectors to different areas of analysis, wherein a factor vector comprises a plurality of entries, a value of an entry being a metric for the congruence between a factor and an area of analysis.
8. The system of claim 7, wherein the feature selection module is configured to determine a relationship between features and factors, wherein a congruence between a feature and a factor is to be determined based on fuzzification.
9. The system of any one of claims 7 or 8, wherein the feature selection module is configured to assign different attribute vectors to the different areas of analysis, wherein an attribute vector comprises a subset of the factors, wherein the subset is selected based on a relevance score of the factors in view of the area of analysis.
10. The system of claim 9, wherein the feature selection module is configured to assign scores to the features of the training data based on the attribute vector corresponding to the area of analysis.
11. A method of training data feature selection for extracting rules from the training data, comprising: receiving context data of the rules to be extracted and domain information on the training data, the context data to specify an area of analysis in which the extracted rules are to be used, and the domain information indicating one or more technical environments to which the training data pertains; selecting features of the training data based on the context data and the domain information; and feeding a machine learning module with the training data and information on the selected features.
12. The method of claim 11, wherein the different areas of analysis comprise one or more of: a root cause analysis; a service impact analysis; a fault prediction analysis; a traffic prediction analysis; a security/threat analysis; a service/resource optimization analysis; and a service/application performance analysis.
13. The method of claim 11 or 12, wherein the one or more technical environments include one or more of: application management; server management; telecommunications networks; wide area networks; data center network operations; cloud operations; and
security operations.
14. The method of any one of claims 11 to 13, comprising: assigning different factor vectors to different areas of analysis, wherein a factor vector comprises a plurality of entries, a value of an entry being a metric for the congruence between a factor and an area of analysis.
15. The method of claim 14, comprising: determining a relationship between features and factors, wherein determining a congruence between a feature and a factor is based on fuzzification.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP17731111.5A EP3612980A1 (en) | 2017-06-12 | 2017-06-12 | Automatic feature selection in machine learning |
PCT/EP2017/064317 WO2018228667A1 (en) | 2017-06-12 | 2017-06-12 | Automatic feature selection in machine learning |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2018228667A1 true WO2018228667A1 (en) | 2018-12-20 |
Family
ID=59078048
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/EP2017/064317 WO2018228667A1 (en) | 2017-06-12 | 2017-06-12 | Automatic feature selection in machine learning |
Country Status (2)
Country | Link |
---|---|
EP (1) | EP3612980A1 (en) |
WO (1) | WO2018228667A1 (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2009141631A2 (en) * | 2008-05-23 | 2009-11-26 | Sanctuary Personnel Limited | An improved neuro type-2 fuzzy based method for decision making |
US20100005042A1 (en) * | 2003-11-18 | 2010-01-07 | Aureon Laboratories, Inc. | Support vector regression for censored data |
AU2013100982A4 (en) * | 2013-07-19 | 2013-08-15 | Huaiyin Institute Of Technology, China | Feature Selection Method in a Learning Machine |
US20150058993A1 (en) * | 2013-08-23 | 2015-02-26 | The Boeing Company | System and method for discovering optimal network attack paths |
2017
- 2017-06-12: EP application EP17731111.5A (published as EP3612980A1), not active (ceased)
- 2017-06-12: PCT application PCT/EP2017/064317 (published as WO2018228667A1), status unknown
Non-Patent Citations (2)
Title |
---|
STEVEN LAUWEREINS ET AL: "Context-and cost-aware feature selection in ultra-low-power sensor interfaces", ESANN 2014 PROCEEDINGS, EUROPEAN SYMPOSIUM ON ARTIFICIAL NEURAL NETWORKS, COMPUTATIONAL INTELLIGENCE AND MACHINE LEARNING, 23 April 2014 (2014-04-23), Bruges, Belgium, XP055452498, ISBN: 978-2-87419-095-7, Retrieved from the Internet <URL:http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.644.928&rep=rep1&type=pdf> [retrieved on 20180219] * |
XINCHUAN ZENG ET AL: "Feature weighting using neural networks", NEURAL NETWORKS, 2004. PROCEEDINGS. 2004 IEEE INTERNATIONAL JOINT CONFERENCE ON, IEEE, PISCATAWAY, NJ, USA, vol. 2, 25 July 2004 (2004-07-25), pages 1327 - 1330, XP010758822, ISBN: 978-0-7803-8359-3, DOI: 10.1109/IJCNN.2004.1380137 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110598760A (en) * | 2019-08-26 | 2019-12-20 | 华北电力大学(保定) | Unsupervised characteristic selection method for transformer vibration data |
CN110598760B (en) * | 2019-08-26 | 2023-10-24 | 华北电力大学(保定) | Unsupervised feature selection method for vibration data of transformer |
US12045317B2 (en) | 2021-11-23 | 2024-07-23 | International Business Machines Corporation | Feature selection using hypergraphs |
Also Published As
Publication number | Publication date |
---|---|
EP3612980A1 (en) | 2020-02-26 |
Legal Events

Date | Code | Title | Description
---|---|---|---
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 17731111; Country of ref document: EP; Kind code of ref document: A1
| ENP | Entry into the national phase | Ref document number: 2017731111; Country of ref document: EP; Effective date: 20191119
| NENP | Non-entry into the national phase | Ref country code: DE