
CN119292810A - Fault alarm self-healing system and method - Google Patents


Info

Publication number
CN119292810A
Authority
CN
China
Prior art keywords
unit
data
monitoring
event
abnormal
Prior art date
Legal status
Pending
Application number
CN202411337066.XA
Other languages
Chinese (zh)
Inventor
赵浩霖
Current Assignee
Hangzhou Ezviz Network Co Ltd
Original Assignee
Hangzhou Ezviz Network Co Ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Ezviz Network Co Ltd
Priority to CN202411337066.XA
Publication of CN119292810A
Status: Pending


Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 - Error detection; Error correction; Monitoring
    • G06F 11/07 - Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/0703 - Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F 11/0793 - Remedial or corrective actions
    • G06F 11/30 - Monitoring
    • G06F 11/3055 - Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • G06F 11/3065 - Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G06F 11/3072 - Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Debugging And Monitoring (AREA)

Abstract


An embodiment of the present application provides a fault alarm self-healing system and method. The system includes a microservice target unit, a management unit, a monitoring collection unit, a data calculation unit, and a plan handling unit. The management unit is used to determine the monitoring collection tasks to be added and the monitoring collection tasks to be deleted according to the monitoring information carried by each metadata; the monitoring collection unit is used to collect monitoring indicator data according to the monitoring collection tasks to be added, and send the monitoring indicator data to the data calculation unit; the data calculation unit is used to perform real-time detection of the monitoring indicator data according to a preset indicator detection rule and generate time series indicator data according to a first detection result, and to perform real-time detection of the monitoring indicator data and the time series indicator data according to a preset anomaly detection rule and generate an abnormal event according to a second detection result; the plan handling unit is used to retrieve a fault self-healing plan that matches each abnormal event, and perform fault self-healing processing on the abnormal event according to the fault self-healing plan.

Description

Fault alarm self-healing system and method
Technical Field
The application relates to the technical field of big data calculation, in particular to a fault alarm self-healing system and a fault alarm self-healing method.
Background
Microservices are applications composed of many smaller, loosely coupled services, as opposed to the monolithic approach of building large, tightly coupled applications. For each microservice, if monitoring coverage of the health status of every service or instance, and fault elimination after a fault occurs, are all handled manually, the cost of monitoring and fault handling is very high, so fault self-healing for microservices is an urgent problem to be solved.
Disclosure of Invention
The embodiment of the application aims to provide a fault alarm self-healing system and a fault alarm self-healing method for realizing fault self-healing of micro-service. The specific technical scheme is as follows:
In a first aspect, an embodiment of the present application provides a fault alarm self-healing system, the system including a micro-service target unit, a management unit, a monitoring acquisition unit, a data calculation unit, a plan handling unit,
The management unit is used for acquiring metadata corresponding to each micro service instance from the micro service target unit, determining a monitoring acquisition task to be added and a monitoring acquisition task to be deleted according to monitoring information carried by each metadata, and sending the monitoring acquisition task to be added and the monitoring acquisition task to be deleted to the monitoring acquisition unit;
The monitoring acquisition unit is used for receiving the monitoring acquisition task to be added and the monitoring acquisition task to be deleted, which are sent by the management unit, acquiring monitoring index data according to the monitoring acquisition task to be added, and sending the monitoring index data to the data calculation unit;
The data calculation unit is used for receiving the monitoring index data sent by the monitoring acquisition unit, carrying out real-time monitoring on the monitoring index data according to a preset index detection rule to obtain a first detection result, generating time sequence index data according to the first detection result, carrying out real-time detection on the monitoring index data and the time sequence index data according to a preset abnormality detection rule to obtain a second detection result, generating an abnormal event according to the second detection result, and sending the abnormal event to the plan disposal unit;
The plan processing unit is used for receiving the abnormal events sent by the data computing unit, searching a fault self-healing scheme matched with each abnormal event, and performing fault self-healing processing on the abnormal events according to the fault self-healing scheme.
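To make the data flow among the five units above easier to follow, the following Java sketch models the unit boundaries described in the first aspect: metadata drives collection tasks, collected metrics drive detection, and detected abnormal events drive self-healing. All type and method names here are illustrative assumptions and are not identifiers from the patent.

    // Illustrative sketch only; names are assumptions, not the patent's identifiers.
    import java.util.List;

    interface ManagementUnit {
        // Derives the collection tasks to add and to delete from instance metadata.
        TaskDelta planCollectionTasks(List<InstanceMetadata> metadata);
    }

    interface MonitoringCollectionUnit {
        // Applies the task delta and collects metric samples for the added tasks.
        List<MetricSample> applyTasksAndCollect(TaskDelta delta);
    }

    interface DataCalculationUnit {
        // Real-time detection: metric data -> time series data -> abnormal events.
        List<AbnormalEvent> detect(List<MetricSample> samples);
    }

    interface PlanHandlingUnit {
        // Retrieves and executes the self-healing plan matching each abnormal event.
        void selfHeal(AbnormalEvent event);
    }

    record InstanceMetadata(String id, String name) {}
    record TaskDelta(List<String> tasksToAdd, List<String> tasksToDelete) {}
    record MetricSample(String metric, String instance, double value, long timestampMs) {}
    record AbnormalEvent(String eventId, String instance, String metric, String rule) {}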
In some embodiments, the system further comprises a data storage unit, a plan learning unit, and a plan recommendation and management unit, the fault self-healing scheme being obtained by:
The plan learning unit is used for performing machine learning processing according to the monitoring index data and the time sequence index data to obtain a treatment action related to the abnormal event;
The plan recommending and managing unit is used for receiving the treatment action sent by the plan learning unit, optimizing the treatment action to obtain an optimized treatment action, and sending the optimized treatment action to the data storage unit;
the data storage unit is used for receiving the optimized treatment action and correlating the optimized treatment action with the abnormal event to obtain abnormal event configuration data;
The plan processing unit is used for acquiring the abnormal event configuration data from the data storage unit, carrying out structuring processing on the abnormal event configuration data to obtain a structural body, and determining a fault self-healing scheme matched with each abnormal event according to the structural body.
In some embodiments, the system further comprises an event alarm unit, and the sending of the abnormal event to the plan handling unit comprises:
The data calculation unit sends the abnormal event to the event alarm unit;
The event alarm unit receives the abnormal event sent by the data calculation unit, classifies the abnormal event to obtain a target abnormal event, and sends the target abnormal event to the plan treatment unit;
Wherein the classifying the abnormal event to obtain a target abnormal event comprises,
For the abnormal event, if a plurality of identical abnormal events are received in a first preset time, one abnormal event in the identical abnormal events is taken as a target abnormal event, and/or,
Aiming at the abnormal event, if a plurality of abnormal events with the same attribute or rule exist in a second preset time, carrying out merging processing according to a preset merging rule to obtain a target abnormal event, and/or,
And aiming at the abnormal event, if a plurality of abnormal events with similar characteristics or service attributes exist in a third preset time, merging and converging the abnormal events with similar characteristics or service attributes to obtain a target abnormal event.
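The first two classification branches above (deduplicating identical events within a time window, and merging events that share an attribute or rule) can be pictured with the small Java sketch below; converging events with merely similar characteristics would additionally require a similarity measure. The window length, grouping key, and class names are assumptions for illustration only.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.*;

    // Hypothetical sketch of window-based deduplication/merging of abnormal events.
    class EventClassifier {
        private final Duration dedupWindow;           // the "first preset time" (assumed)
        private final Map<String, Instant> lastSeen = new HashMap<>();

        EventClassifier(Duration dedupWindow) { this.dedupWindow = dedupWindow; }

        // Returns the event only if no identical event was seen within the window,
        // so repeated identical events collapse into a single target abnormal event.
        Optional<AbnormalEvent> deduplicate(AbnormalEvent e, Instant now) {
            Instant prev = lastSeen.get(e.fingerprint());
            lastSeen.put(e.fingerprint(), now);
            if (prev != null && Duration.between(prev, now).compareTo(dedupWindow) < 0) {
                return Optional.empty();              // drop the duplicate
            }
            return Optional.of(e);
        }

        // Merges events sharing an attribute (here: the same rule) from a batch
        // collected over the "second preset time" into one target event per key.
        List<AbnormalEvent> mergeByAttribute(List<AbnormalEvent> batch) {
            Map<String, List<AbnormalEvent>> groups = new LinkedHashMap<>();
            for (AbnormalEvent e : batch) {
                groups.computeIfAbsent(e.rule(), k -> new ArrayList<>()).add(e);
            }
            List<AbnormalEvent> merged = new ArrayList<>();
            for (List<AbnormalEvent> g : groups.values()) {
                merged.add(g.get(0).withCount(g.size()));  // representative + count
            }
            return merged;
        }
    }

    record AbnormalEvent(String instance, String metric, String rule, int count) {
        String fingerprint() { return instance + "|" + metric + "|" + rule; }
        AbnormalEvent withCount(int c) { return new AbnormalEvent(instance, metric, rule, c); }
    }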
In some embodiments, the event alert unit is further configured to,
And aiming at the abnormal event, if the recovery event of the abnormal event is not received in the fourth preset time, taking the abnormal event as an upgrading abnormal event.
In some embodiments, the system further comprises an alert sending unit and an alert receiving body,
The alarm sending unit is used for acquiring the target abnormal event from the event alarm unit and sending the target abnormal event to the alarm receiving main body according to an alarm rule or a subscription rule;
The alarm receiving body is used for receiving the target abnormal event sent by the alarm sending unit.
In some embodiments, the data calculation unit is further configured to detect the monitoring index data and the time sequence index data in real time according to the preset anomaly detection rule to obtain a third detection result, and generate a recovery event according to the third detection result;
the event alarm unit is further configured to receive the recovery event sent by the data calculation unit.
In some embodiments, the system further includes a configuration center unit, and the obtaining metadata corresponding to each micro service instance from the micro service target unit includes:
The micro service target unit registers service information of each micro service instance to the configuration center unit, wherein the service information comprises a service name, a monitoring acquisition address, a monitoring acquisition port and metadata of the instance;
the configuration center unit generates a service instance list according to the service information;
The management unit acquires the service instance list from the configuration center unit, and reads metadata corresponding to the instances of each micro service from the service instance list.
In some embodiments, a communication mechanism exists between the instances of the micro services and the configuration center unit,
The configuration center unit is further configured to delete, if there is a target instance that is not in communication with the configuration center unit within a fifth preset time, the target instance from the service instance list.
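Taken together, the registration, instance-list, and eviction behaviour described in the last two embodiments might look roughly like the following registry sketch. The heartbeat method, field names, and eviction logic are assumptions used only to illustrate removing instances that have not communicated within the fifth preset time.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.*;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch of the configuration-center registry behaviour described above.
    class ConfigCenterRegistry {
        // Service information registered by each microservice instance.
        record ServiceInfo(String serviceName, String metricsAddress, int metricsPort,
                           Map<String, String> metadata) {}

        private final Map<String, ServiceInfo> instances = new ConcurrentHashMap<>();
        private final Map<String, Instant> lastContact = new ConcurrentHashMap<>();
        private final Duration evictAfter;   // the "fifth preset time" (assumed value)

        ConfigCenterRegistry(Duration evictAfter) { this.evictAfter = evictAfter; }

        // Service registration: an instance registers its service information.
        void register(String instanceId, ServiceInfo info) {
            instances.put(instanceId, info);
            lastContact.put(instanceId, Instant.now());
        }

        // Any communication (e.g. a heartbeat) refreshes the instance's liveness.
        void touch(String instanceId) { lastContact.put(instanceId, Instant.now()); }

        // The service instance list from which the management unit reads metadata.
        List<ServiceInfo> serviceInstanceList() { return List.copyOf(instances.values()); }

        // Instances silent for longer than the eviction window are removed from the list.
        void evictStaleInstances() {
            Instant cutoff = Instant.now().minus(evictAfter);
            lastContact.forEach((id, seen) -> {
                if (seen.isBefore(cutoff)) {
                    instances.remove(id);
                    lastContact.remove(id);
                }
            });
        }
    }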
In some embodiments, the system further comprises a data storage unit, a data presentation unit, and a data transmission unit,
The data storage unit is used for acquiring and storing system data and sending the system data to the data display unit, wherein the system data comprises metadata, monitoring index data, time sequence index data, abnormal events, recovery events, preset index detection rules, preset abnormality detection rules, subscription rules and the fault self-healing scheme;
The data display unit is used for receiving the system data sent by the data storage unit and displaying the system data;
the sending the monitor index data to the data calculation unit includes:
The monitoring acquisition unit sends the monitoring index data to the data transmission unit;
the data transmission unit transmits the monitoring index data to the data calculation unit.
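The data transmission unit sits between the monitoring acquisition unit and the data calculation unit; conceptually it behaves like a buffered channel, and in practice it could be a message queue. The in-memory queue below is only an illustrative assumption, not the patent's implementation.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Hypothetical sketch: the transmission unit as a buffer between collector and calculator.
    class DataTransmissionUnit {
        private final BlockingQueue<String> buffer = new ArrayBlockingQueue<>(10_000);

        // Called by the monitoring acquisition unit with serialized metric samples.
        void send(String metricSample) throws InterruptedException {
            buffer.put(metricSample);          // blocks if the downstream is slow
        }

        // Called by the data calculation unit to pull the next sample for detection.
        String receive() throws InterruptedException {
            return buffer.take();
        }
    }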
In a second aspect, an embodiment of the present application further provides a self-healing method for failure alarm, where the method is applied to a self-healing system for failure alarm, and the self-healing system for failure alarm includes a micro-service target unit, a management unit, a monitoring acquisition unit, a data calculation unit, and a plan handling unit, and the method includes:
The management unit acquires metadata corresponding to each micro service instance from the micro service target unit, determines a monitoring acquisition task to be added and a monitoring acquisition task to be deleted according to monitoring information carried by each metadata, and sends the monitoring acquisition task to be added and the monitoring acquisition task to be deleted to the monitoring acquisition unit;
The monitoring acquisition unit receives the monitoring acquisition task to be added and the monitoring acquisition task to be deleted, which are sent by the management unit, acquires monitoring index data according to the monitoring acquisition task to be added, and sends the monitoring index data to the data calculation unit;
the data calculation unit receives the monitoring index data sent by the monitoring acquisition unit, detects the monitoring index data in real time according to a preset index detection rule to obtain a first detection result, generates time sequence index data according to the first detection result, detects the monitoring index data and the time sequence index data in real time according to a preset abnormality detection rule to obtain a second detection result, generates an abnormal event according to the second detection result, and sends the abnormal event to the plan disposal unit;
The plan processing unit receives the abnormal events sent by the data calculation unit, retrieves the fault self-healing schemes matched with the abnormal events, and carries out fault self-healing processing on the abnormal events according to the fault self-healing schemes.
In some embodiments, the system further comprises a data storage unit, a plan learning unit, and a plan recommendation and management unit, the fault self-healing scheme being obtained by:
The plan learning unit performs machine learning processing according to the monitoring index data and the time sequence index data to obtain a treatment action about the abnormal event;
The plan recommending and managing unit receives the treatment action sent by the plan learning unit, optimizes the treatment action to obtain an optimized treatment action, and sends the optimized treatment action to the data storage unit;
the data storage unit receives the optimized treatment action and associates the optimized treatment action with the abnormal event to obtain abnormal event configuration data;
The plan processing unit acquires the abnormal event configuration data from the data storage unit, carries out structuring processing on the abnormal event configuration data to obtain a structural body, and determines a fault self-healing scheme matched with each abnormal event according to the structural body.
In some embodiments, the system further comprises an event alarm unit, and the sending of the abnormal event to the plan handling unit comprises:
The data calculation unit sends the abnormal event to the event alarm unit;
The event alarm unit receives the abnormal event sent by the data calculation unit, classifies the abnormal event to obtain a target abnormal event, and sends the target abnormal event to the plan treatment unit;
Wherein the classifying the abnormal event to obtain a target abnormal event comprises,
For the abnormal event, if a plurality of identical abnormal events are received in a first preset time, one abnormal event in the identical abnormal events is taken as a target abnormal event, and/or,
Aiming at the abnormal event, if a plurality of abnormal events with the same attribute or rule exist in a second preset time, carrying out merging processing according to a preset merging rule to obtain a target abnormal event, and/or,
And aiming at the abnormal event, if a plurality of abnormal events with similar characteristics or service attributes exist in a third preset time, merging and converging the abnormal events with similar characteristics or service attributes to obtain a target abnormal event.
In some embodiments, for the abnormal event, if a recovery event of the abnormal event is not received within a fourth preset time, the event alarm unit takes the abnormal event as an upgrade abnormal event.
In some embodiments, the system further comprises an alert sending unit and an alert receiving body,
The alarm sending unit obtains the target abnormal event from the event alarm unit and sends the target abnormal event to the alarm receiving main body according to an alarm rule or a subscription rule;
the alarm receiving main body receives the target abnormal event sent by the alarm sending unit.
In some embodiments, the data calculation unit is further configured to detect the monitoring index data and the time sequence index data in real time according to the preset anomaly detection rule to obtain a third detection result, and generate a recovery event according to the third detection result;
the event alarm unit receives the recovery event sent by the data calculation unit.
In some embodiments, the system further includes a configuration center unit, and the obtaining metadata corresponding to each micro service instance from the micro service target unit includes:
The micro service target unit registers service information of each micro service instance to the configuration center unit, wherein the service information comprises a service name, a monitoring acquisition address, a monitoring acquisition port and metadata of the instance;
the configuration center unit generates a service instance list according to the service information;
The management unit acquires the service instance list from the configuration center unit, and reads metadata corresponding to the instances of each micro service from the service instance list.
In some embodiments, a communication mechanism exists between the instances of the micro services and the configuration center unit,
If the target instance which is not communicated with the configuration center unit in the fifth preset time exists, the configuration center unit deletes the target instance from the service instance list.
In some embodiments, the system further comprises a data storage unit, a data presentation unit, and a data transmission unit,
The data storage unit acquires and stores system data and sends the system data to the data display unit, wherein the system data comprises metadata, monitoring index data, time sequence index data, abnormal events, recovery events, preset index detection rules, preset abnormal detection rules, subscription rules and a fault self-healing scheme;
The data display unit receives the system data sent by the data storage unit and displays the system data;
the sending the monitor index data to the data calculation unit includes:
The monitoring acquisition unit sends the monitoring index data to the data transmission unit;
the data transmission unit transmits the monitoring index data to the data calculation unit.
In a third aspect, an embodiment of the present application further provides an electronic device, including:
a memory for storing a computer program;
and the processor is used for implementing the steps of any one of the above fault alarm self-healing methods when executing the program stored in the memory.
The embodiment of the application has the beneficial effects that:
In the technical scheme provided by the embodiment of the application, the management unit acquires metadata corresponding to each micro-service instance from the micro-service target unit, determines the monitoring acquisition task to be added and the monitoring acquisition task to be deleted according to the monitoring information carried by each metadata, then the management unit can send the generated monitoring acquisition task to be added and the generated monitoring acquisition task to be deleted to the monitoring acquisition unit, and the monitoring acquisition unit can increase the corresponding acquisition task based on the monitoring acquisition task to be added and stop the corresponding acquisition task according to the monitoring acquisition task to be deleted, so that the deployment and adjustment of the monitoring tasks are realized, and further the monitoring index data acquired by the monitoring acquisition task is acquired.
The monitoring acquisition unit can send monitoring index data to the data calculation unit, the data calculation unit monitors the monitoring index data in real time according to a preset index detection rule to obtain a first detection result, generates time sequence index data according to the first detection result, then detects the monitoring index data and the time sequence index data in real time according to a preset abnormality detection rule to obtain a second detection result, generates an abnormal event according to the second detection result, and therefore determines the abnormal event existing in the operation process of each micro service instance.
For each abnormal event, the data calculation unit can send each abnormal event to the plan treatment unit, and after each abnormal event is received by the plan treatment unit, the fault self-healing scheme matched with each abnormal event can be searched, and the corresponding abnormal event is processed according to the fault self-healing scheme, so that the abnormal event existing in the running process of each micro service instance can be automatically solved by the system, and the fault self-healing of the micro service is realized.
Of course, it is not necessary for any one product or method of practicing the application to achieve all of the advantages set forth above at the same time.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings used in the embodiments or in the description of the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the application, and those skilled in the art may obtain other embodiments according to these drawings.
FIG. 1 is a flow interaction diagram of a fault alarm self-healing system provided by an embodiment of the application;
FIG. 2 is a schematic structural diagram of a fault alarm self-healing system provided by an embodiment of the present application;
FIG. 3 is a schematic flow chart of service registration and service discovery according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of issuing a monitoring task and collecting monitoring data according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of an abnormality detection and alarm event transmission processing procedure according to an embodiment of the present application;
FIG. 6 is a schematic flow chart of an alarm event triggered protocol handling according to an embodiment of the present application;
FIG. 7 is a schematic flow chart of a process for learning and recommending configuration of a plan according to an embodiment of the present application;
FIG. 8 is a schematic flow chart of a self-healing method for fault alarm according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. Based on the embodiments of the present application, all other embodiments obtained by the person skilled in the art based on the present application are included in the scope of protection of the present application.
The terms appearing in the present application will be explained first.
The internet, also called an international network, refers to a huge network formed by connecting networks together; these networks are connected by a set of common protocols to form a single, logically huge international network.
The Internet originated from ARPANET in the United States in 1969. In general usage, the lowercase "internet" refers to any set of interconnected networks, while the capitalized "Internet" refers specifically to the global Internet. This method of interconnecting computer networks may be referred to as "internetworking", and the Internet is the worldwide network of networks developed on this basis, i.e., a network structure interconnected together. The Internet is not the same as the World Wide Web: the World Wide Web is simply a global system based on hypertext links and can be understood as one service that the Internet provides.
The internet of things (Internet of Things, ioT) refers to the real-time collection of any object or process needing to be monitored, connected and interacted, such as collection of various needed information of sound, light, heat, electricity, mechanics, chemistry, biology, position and the like through various information sensors, radio frequency identification technology, global positioning system, infrared sensors, laser scanners and other technologies and devices, and the ubiquitous connection of objects and people is realized through various possible network access, so that the intelligent sensing, identification and management of objects and processes are realized. The internet of things is an information carrier based on the internet, a traditional telecommunication network and the like, and enables all common physical objects which can be independently addressed to realize interconnection and intercommunication.
Microservices, also known as the microservice architecture, refer to a cloud-native architectural approach in which a single application is composed of numerous loosely coupled and independently deployable small components or services. These small components or services typically have their own technology stack, including databases and data management models; they communicate with each other through a combination of Representational State Transfer application programming interfaces (REST APIs), event streams, and message brokers; and they are organized around business capabilities, with the lines separating services commonly referred to as bounded contexts.
Microservices make code easier to update: new features or functions can be added directly without updating the entire application, and teams can use different technology stacks and different programming languages for different components. Components can also be scaled independently of each other, reducing the waste and cost associated with having to scale the entire application because a single function faces excessive load.
Big Data, also called massive data, refers to information whose volume is so huge that it cannot be retrieved, managed, processed, and organized within a reasonable time by mainstream software tools into a form that can actively support business decision-making.
Cloud Computing is a type of distributed computing. It refers to decomposing a huge data-processing program, through a network "cloud", into numerous small programs, then processing and analyzing these small programs through a system consisting of multiple servers, and returning the results to the user. In other words, cloud computing refers to a very powerful system formed over a computer network (usually the Internet) that can store, aggregate, and configure the related resources on demand to provide personalized services to users.
Early cloud computing was simply distributed computing: it solved task distribution and merged the computing results, and for this reason cloud computing was also known as grid computing. With this technique, processing of tens of thousands of data items can be completed in a short time (a few seconds), thereby providing powerful network services. However, today's cloud services are not just distributed computing; they are the result of the hybrid evolution and leap of computer technologies such as distributed computing, utility computing, load balancing, parallel computing, network storage, hot-standby redundancy, and virtualization.
In the traditional data processing flow, data is always collected first and then put into a database; when the user needs it, the data is queried through the database to obtain an answer or perform related processing. This seems very reasonable, but the results are often unsatisfactory, especially in some real-time search application environments, where offline processing in the MapReduce style does not solve the problem well. This led to a new data computation structure, namely the stream computing model: it analyzes large-scale streaming data in real time as the data continuously changes and moves, captures potentially useful information, and sends the results to the next computing node.
Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how computers can simulate or implement human learning behavior in order to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance. It is the core of artificial intelligence and the fundamental way to make computers intelligent.
Fault self-healing means discovering alarms in real time, performing pre-diagnosis and analysis, recovering from faults automatically, and connecting peripheral systems to close the loop of the whole process. Fault self-healing is divided into passive self-healing and active self-healing: passive self-healing treats the symptoms without treating the root cause, for example automatic capacity expansion after a disk alarm, whereas active self-healing requires root cause analysis and directly addresses the cause that produced the alarm. Root Cause Analysis (RCA) is a systematic problem-solving method that aims to find the root cause of a problem so as to eliminate the problem at its source.
Service registration refers to a microservice instance registering its own service information (i.e., service instance information) with a registry (which can be understood as one function of the configuration center unit described below). This service information may include the host address (Internet Protocol, IP, address) where the service resides and the port on which the service is provided, as well as information exposing the status of the service itself and its access protocols.
Service discovery refers to a service instance requesting dependent service information from the registry. Specifically, the service instance obtains registered service instance information through the registry and uses that information to request the service it needs. Service registration and service discovery also need to address issues such as monitoring the running state of service instances.
Micro-service monitoring refers to real-time, comprehensive performance, state, security and other detection and management of each instance in the micro-service architecture. By collecting, analyzing and displaying various data in the instance running process, the micro-service monitoring can help operation and maintenance personnel and developers to find potential problems in time, locate fault reasons and optimize system performance, so that the stability and usability of the whole micro-service architecture are ensured.
In the micro-service architecture, the dependency relationship among services is complicated, and any service fault can cause a chain reaction, so that the whole system is crashed, and therefore, through micro-service monitoring, operation and maintenance personnel and developers can grasp the running state and performance indexes of each instance in the system in real time, discover and process abnormal conditions in time, and avoid fault expansion and upgrading. Meanwhile, the micro-service monitoring can also provide historical data and trend analysis, help an operation and maintenance team predict system capacity and performance bottlenecks, and provide powerful support for service decision.
Apache Flink Stateful Stream Processing SQL Engine (Flink SQL): Apache Flink is a framework and distributed processing engine for performing stateful computations on unbounded and bounded data streams, and Structured Query Language (SQL) is a database query and programming language for accessing data and for querying, updating, and managing relational database systems. Flink SQL is built on Apache Flink and leverages Flink's powerful processing capability so that users can use SQL statements for both stream data and batch data processing. Flink SQL supports both real-time stream data processing and bounded batch data processing.
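As a rough illustration of how a preset detection rule could be expressed with Flink SQL and submitted from the Java Table API, consider the sketch below. The table definitions, connectors, metric names, and threshold are invented for the example and are not rules from the patent.

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

    // Hypothetical sketch: a threshold-style anomaly detection rule expressed in Flink SQL.
    public class CpuAnomalyRuleJob {
        public static void main(String[] args) {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

            // Source and sink tables are assumptions; a real deployment would declare
            // connectors (e.g. Kafka) matching its own monitoring pipeline.
            tEnv.executeSql(
                "CREATE TEMPORARY TABLE metrics (instance STRING, metric STRING, metric_value DOUBLE, " +
                " ts TIMESTAMP(3), WATERMARK FOR ts AS ts - INTERVAL '5' SECOND) " +
                "WITH ('connector' = 'datagen')");
            tEnv.executeSql(
                "CREATE TEMPORARY TABLE abnormal_events (instance STRING, avg_val DOUBLE, window_end TIMESTAMP(3)) " +
                "WITH ('connector' = 'print')");

            // One-minute tumbling window; instances whose average CPU usage exceeds 90% emit an event.
            tEnv.executeSql(
                "INSERT INTO abnormal_events " +
                "SELECT instance, AVG(metric_value) AS avg_val, TUMBLE_END(ts, INTERVAL '1' MINUTE) AS window_end " +
                "FROM metrics WHERE metric = 'cpu_usage' " +
                "GROUP BY instance, TUMBLE(ts, INTERVAL '1' MINUTE) " +
                "HAVING AVG(metric_value) > 0.9");
        }
    }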
With the rapid development of computer technology, the business data of all industries, across all aspects of society, grows exponentially. The volumes of data to be stored and computed are enormous, which genuinely tests mass data storage and computation, and the types and structures of data tend to be complex and varied, including common structured data, semi-structured data, unstructured data such as audio and video, streaming media data, and vector or graph data. This poses challenges for the storage and computing resources of the Internet. At the same time, users participating in Internet services place higher demands on those services, for example recommending goods or resources that better match the user's intent; real-time requirements on the user experience are higher, and users are more sensitive to information and data.
Meanwhile, with the rapid development of the new Internet of things, including the smart-device Internet of things, the Internet of vehicles, and so on, the way enterprises face user demands has changed: in the Internet age, enterprises communicated and interacted with users through web pages or handheld terminal devices, whereas in the new Internet of things age, device time series data is generated by interaction between users and devices, and this data grows explosively compared with the Internet age.
In the Internet of things age, the time series data of various devices is synchronized to the local network or the cloud (or the edge) in real time (or near real time) and computed together with big data in real time; business results are output in real time through computation in response to the different business demands of users. This scenario therefore presents new demands and challenges for big data computation, and the demands on big data practitioners keep increasing. This raises the technical requirements for entering the big data industry and raises the admission threshold. In the long term, the development of an industry requires more talent to participate and more business scenarios in which to validate its technologies and tools, so raising the admission threshold of the big data industry is detrimental to its development.
In the field of big data computation, as business continues to develop, the acquisition, computation, and analysis of business data become necessary. Previously, the common technical solutions in the big data field adopted a batch processing mode: a big data engineer wrote corresponding batch processing tasks to complete the collection, computation, and analysis of business data. With such solutions, every time the mining requirement changes, the batch tasks have to be modified and redeployed; that is, coping with changes in business requirements always requires adjustment and redeployment by the relevant technical personnel, and the input cost is relatively high. Currently, with the rise of real-time streaming technology, big data mining and analysis have taken a step forward. Real-time streaming is essentially an iterative update of the earlier batch processing technology: compared with batch processing it is better in terms of data fidelity and in coping with changes in business demands, changes can be made at a smaller and smaller granularity, and the results of data computation are more timely.
In the digital era, with the rapid growth of service demands and the continuous evolution of technical architecture, a micro-service architecture gradually becomes the first choice for enterprises to construct complex application systems due to the characteristics of flexibility, scalability, high cohesion, low coupling and the like. However, the fragmentation, dynamics and complexity of the micro-service architecture also present unprecedented challenges to the stability, security and performance of the system.
In the related art, the following two schemes are generally used to solve the service fault self-healing problem, and two ways in the related art are respectively described below:
Scheme 1: receive real-time alarm information sent by a real-time monitoring system, obtain the alarm type of the real-time alarm information based on that information, obtain a corresponding fault self-healing processing scheme based on the alarm type, and perform self-healing processing on the fault corresponding to the real-time alarm information based on the fault self-healing processing scheme.
It will be appreciated that although scheme 1 performs fault self-healing processing, from the perspective of the self-healing closed loop it only closes the loop of fault handling and does not address how the fault handling schemes themselves are kept in a closed loop. In an environment with a large number of microservices and massive data, closing the loop on the handling schemes is in fact very important; if they are configured and refined only by manual effort, this obviously cannot keep up with ever-growing services and fault scenarios.
Scheme 2: configure a log word-frequency analysis strategy according to the application service log information of the collected application log files, perform application log word-frequency analysis according to the strategy to obtain word-frequency monitoring indicators, train a word-frequency detection model according to the word-frequency monitoring indicators, perform anomaly detection with the word-frequency detection model, and perform fault self-healing according to the detection results.
It can be understood that although scheme 2 also performs fault self-healing and works well in the application service log scenario, application logs alone may fall short in some scenarios, such as service dependencies, multiple microservices in series, or complex scenarios involving middleware and databases. Monitoring indicators and log data in other dimensions may be needed to enrich the abnormal scenarios, and when handling faults in complex scenarios with multiple services, multiple systems, and multiple dependencies, it is necessary to synthesize and analyze the monitoring indicators, service logs, multi-link (tracing) data, and service dependency graphs of each service, and to automatically recommend and associate the corresponding fault handling plans with such complex scenarios.
In order to solve the above problems, the embodiment of the application provides a fault alarm self-healing system, which can be applied to electronic equipment, and the electronic equipment can be a server or a terminal device. In practical applications, the terminal device may be a smart phone, a tablet computer, a desktop computer, or the like.
To address the challenges that explosive data growth brings to big data computation, the fault alarm self-healing system provided by the embodiment of the application can reduce the cost and the threshold of using big data and real-time stream computing technologies and tools in various industries. Monitoring indicator data and time series indicator data are processed, analyzed, and computed in real time; the monitoring indicator data and time series indicator data are detected in real time according to preset anomaly detection rules, and abnormal events are generated according to the detection results; source data and historical fault data are learned and trained with machine learning algorithms, and different fault self-healing schemes are recommended, so that abnormal events caused by microservice faults can be associated with different fault self-healing schemes and microservice fault self-healing is achieved.
Compared with scheme 1, the fault self-healing system provided by the embodiment of the application supports statically configuring fault handling plans (i.e., fault self-healing schemes) from the experience of operation and maintenance experts, and testing and rehearsing the configured plans; this expert experience can also be fed into machine learning as a supervised-learning element that optimizes and supplements the effect of machine learning. In addition, the fault self-healing system provided by the embodiment of the application can learn from and make predictions on existing data, historical fault data, and data about conventional fault self-healing schemes through machine learning algorithms, and recommend the prediction results; operators then screen and filter the recommended results, configure and activate them to obtain new fault self-healing schemes, and further test and rehearse the new schemes, achieving positive-feedback learning so that the set of fault self-healing schemes closes its own loop in terms of richness.
The application relates to the fields of Internet big data computation, Internet or Internet of things real-time stream computation, Internet or Internet of things batch computation, real-time processing and aggregation of time series data from Internet of things device access, fault self-healing, artificial intelligence (AI), and so on. Moreover, based on the overall framework of real-time stream computation and on Flink SQL technology, the application designs, through dynamically configured adaptation, a set of requirements and structures for changes in business data structures, and uses one set of systems and services to quickly adapt to and implement data analysis and mining requirements, without requiring big data development engineers and programmers to repeatedly develop programs or continuously redeploy for this aspect of the business, so that a user only needs to know SQL to complete the related operations and use. That is, the application can quickly support changing business demands while also reducing the investment in technical cost.
The technical solution provided by the application aims to adapt quickly to business requirements and to solve the stream and batch processing of real-time big data at low cost; to lower the threshold for practitioners in using specialized techniques such as big data computation and processing, so that more operators without specialized knowledge of big data computation, or with only basic knowledge of it, can also complete business analysis and computation requirements through tools and systems in the big data field; and to promote and apply big data real-time stream computation in more fields, thereby helping those skilled in business analysis focus more on business analysis and processing. Furthermore, the technical solution provided by the application aims to help personnel engaged in development and operations (DevOps) quickly discover, monitor, and handle common microservice problems, quickly discover fault alarms, diagnose and analyze them, recover from faults automatically, and connect peripheral systems to close the loop of the whole process. This improves the availability of microservices and ensures continuous and stable operation of the business.
Specifically, the fault alarm self-healing system provided by the application provides closed loops of fault handling and healing in several respects, based on microservice monitoring: first, after a microservice fault, the corresponding handling link is matched with a fault self-healing scheme so that the fault can be handled as soon as it occurs; second, the fault self-healing schemes themselves are learned, recommended, and optimized so that the set of schemes forms its own closed loop; and third, manual intervention and post-incident review and analysis are used to improve the fault plans in advance.
The application provides a method for addressing the challenges that the fragmentation, dynamics, and complexity of a microservice architecture bring to the stability, security, and performance of a system. Through service registration and service discovery, the real-time dynamic issuing of monitoring acquisition tasks to be added and monitoring acquisition tasks to be deleted is completed according to the metadata corresponding to each registered microservice instance. The monitoring acquisition unit (i.e., a monitoring acquisition Agent or exporter component) completes the real-time acquisition of the monitoring indicator data of each instance and reports it to the data transmission unit. The data calculation unit obtains the monitoring indicator data from the data transmission unit, generates time series indicator data according to the monitoring indicator data and the preset indicator detection rules, detects the monitoring indicator data and the time series indicator data in real time according to the preset anomaly detection rules, and thereby generates abnormal events and recovery events; it packages the abnormal events and recovery events and sends them to the event alarm unit. The event alarm unit calls the alarm channel service to complete the notification sending of abnormal events and recovery events and notify the relevant alarm subscribers, thereby completing anomaly monitoring of general service health status and alarm event notification handling. The event alarm unit supports event escalation processing, event subscription processing, event readiness and disposition processing, and event filtering, convergence, and aggregation processing. Meanwhile, in the embodiment of the application, abnormal events can be divided into common events, alarm events, fault events, and the like; further, according to different service levels, they can be divided into base-layer events, middleware and database events, service base events, service level objective (SLO) availability events, and the like.
Meanwhile, the application also provides mapping between abnormal events and the corresponding fault self-healing schemes, where the fault self-healing schemes include schemes recommended after machine learning, static schemes configured by operation and maintenance experts, and the like. When an abnormal event is triggered, the plan handling unit searches the associated handling-plan tree corresponding to each event in the system, invokes the corresponding plan (i.e., fault self-healing scheme) for handling according to the association relationships, weight information, and so on, and reports the result state of the plan after handling; both the success state and the failure state of the plan need to be notified to the relevant plan handling contacts, including on-call (SRE OnCall) personnel and the business R&D personnel concerned.
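One way to picture the handling-plan tree and the weight-based selection described above is the sketch below; the tree structure, weight semantics, and notification hook are assumptions for illustration, not the patent's actual data model.

    import java.util.*;

    // Hypothetical sketch of looking up a self-healing plan from an event's plan tree.
    class PlanHandlingSketch {
        record Plan(String name, double weight, List<Plan> children) {}

        // Assumed mapping: event fingerprint -> root of its associated handling-plan tree.
        private final Map<String, Plan> planTrees = new HashMap<>();

        void associate(String eventFingerprint, Plan root) {
            planTrees.put(eventFingerprint, root);
        }

        // Walk the tree and pick the highest-weight leaf plan as the one to execute.
        Optional<Plan> selectPlan(String eventFingerprint) {
            Plan root = planTrees.get(eventFingerprint);
            if (root == null) return Optional.empty();
            Deque<Plan> stack = new ArrayDeque<>(List.of(root));
            Plan best = null;
            while (!stack.isEmpty()) {
                Plan p = stack.pop();
                if (p.children().isEmpty()) {
                    if (best == null || p.weight() > best.weight()) best = p;
                } else {
                    p.children().forEach(stack::push);
                }
            }
            return Optional.ofNullable(best);
        }

        // Whatever the outcome, the result state is reported to the on-call contacts.
        void notifyResult(String eventFingerprint, Plan plan, boolean success) {
            System.out.printf("event=%s plan=%s success=%s -> notify on-call and service owners%n",
                    eventFingerprint, plan.name(), success);
        }
    }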
In addition, the embodiment of the application also provides training on abnormal events from real-time indicator data through machine learning, learning of fault self-healing schemes, and the like. A machine learning service instance learns from and trains on abnormal events in real time and recommends fault self-healing schemes related to the abnormal events according to common machine learning algorithms; operation and maintenance experts and the relevant operators add the recommended fault self-healing schemes to the data storage unit, and permissions, execution condition relationships, and the like can be added to the various fault self-healing schemes. In this way the fault self-healing schemes of the whole fault self-healing system can be positively supplemented and optimized according to the algorithms and recommendations of machine learning, so that the fault self-healing schemes undergo continuous closed-loop learning and optimization.
In summary, the embodiment of the application collects each item of monitoring indicator data of each instance through the monitoring acquisition unit, performs real-time big data computation, aggregates the time series indicator data of each instance in real time, detects the monitoring indicator data and the time series indicator data in real time according to the preset anomaly detection rules to obtain abnormal events, and then triggers plan handling according to the fault self-healing schemes associated with and configured for those abnormal events, thereby realizing automatic handling when an abnormal event occurs in a microservice. Common fault self-healing schemes may include automatic cleanup when disk space runs low, automatic switching of Domain Name System (DNS) resolution when an entry point fails or entry-point probing fails, automatic restart of a service after its isok interface (a health-check mechanism used to detect whether an interface or service is running normally) exceeds a threshold, and the like.
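The common self-healing actions listed above (cleanup on low disk space, DNS switchover on entry-point failure, restart on failed isok checks) can be thought of as handlers keyed by event type, roughly as in the following sketch. The event-type keys and printed actions are purely illustrative assumptions; a real system would invoke its own operations interfaces.

    import java.util.Map;
    import java.util.function.Consumer;

    // Hypothetical sketch: dispatching a remediation action by abnormal-event type.
    class SelfHealingActions {
        private final Map<String, Consumer<String>> handlers = Map.of(
            "disk_space_low",       instance -> System.out.println("clean temp/log files on " + instance),
            "ingress_probe_failed", instance -> System.out.println("switch DNS resolution away from " + instance),
            "isok_check_failed",    instance -> System.out.println("restart service on " + instance)
        );

        // Executes the matching handler, or reports that manual handling is required.
        void handle(String eventType, String instance) {
            handlers.getOrDefault(eventType,
                    i -> System.out.println("no automated plan; escalate " + eventType + " on " + i))
                .accept(instance);
        }
    }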
The fault alarm self-healing system provided by the embodiment of the application is described in detail below through a specific embodiment.
Fig. 1 is a flow interaction diagram of a fault alarm self-healing system provided by the embodiment of the application, which comprises a micro-service target unit, a management unit, a monitoring acquisition unit, a data calculation unit and a plan disposal unit, wherein the specific interaction steps are as follows:
Step S11, the management unit acquires metadata corresponding to each micro service instance from the micro service target unit;
Step S12, the management unit determines a monitoring acquisition task to be added and a monitoring acquisition task to be deleted according to the monitoring information carried by each metadata;
step S13, the management unit sends the monitoring acquisition task to be added and the monitoring acquisition task to be deleted to the monitoring acquisition unit;
step S14, the monitoring acquisition unit acquires monitoring index data according to the monitoring acquisition task to be added, and deletes the corresponding monitoring acquisition task according to the monitoring acquisition task to be deleted;
Step S15, the monitoring acquisition unit sends monitoring index data to the data calculation unit;
Step S16, the data calculation unit detects the monitoring index data in real time according to a preset index detection rule to obtain a first detection result, generates time sequence index data according to the first detection result, detects the monitoring index data and the time sequence index data in real time according to a preset abnormality detection rule to obtain a second detection result, and generates an abnormal event according to the second detection result;
in step S17, the data calculation unit sends the generated abnormal event to the plan handling unit.
Step S18, the plan handling unit retrieves a fault self-healing scheme matched with each abnormal event and performs fault self-healing processing on the abnormal event according to the fault self-healing scheme.
In the technical scheme provided by the embodiment of the application, the management unit acquires metadata corresponding to each micro-service instance from the micro-service target unit, determines the monitoring acquisition task to be added and the monitoring acquisition task to be deleted according to the monitoring information carried by each metadata, then the management unit can send the generated monitoring acquisition task to be added and the generated monitoring acquisition task to be deleted to the monitoring acquisition unit, and the monitoring acquisition unit can increase the corresponding acquisition task based on the monitoring acquisition task to be added and stop the corresponding acquisition task according to the monitoring acquisition task to be deleted, so that the deployment and adjustment of the monitoring tasks are realized, and further the monitoring index data acquired by the monitoring acquisition task is acquired.
The monitoring acquisition unit can send monitoring index data to the data calculation unit, the data calculation unit monitors the monitoring index data in real time according to a preset index detection rule to obtain a first detection result, generates time sequence index data according to the first detection result, then detects the monitoring index data and the time sequence index data in real time according to a preset abnormality detection rule to obtain a second detection result, generates an abnormal event according to the second detection result, and therefore determines the abnormal event existing in the operation process of each micro service instance.
For each abnormal event, the data calculation unit can send each abnormal event to the plan treatment unit, and after each abnormal event is received by the plan treatment unit, the fault self-healing scheme matched with each abnormal event can be searched, and the corresponding abnormal event is processed according to the fault self-healing scheme, so that the abnormal event existing in the running process of each micro service instance can be automatically solved by the system, and the fault self-healing of the micro service is realized.
In step S11, the micro service target unit refers broadly to a service object deployed in a legacy mode or a cloud mode, such as the instances of one or more running nodes of a micro service implemented in Java (an object-oriented programming language), C++ (a system-level programming language), Go (an open-source programming language) or another language. The micro service target unit is the monitored subject in the application: the health of its monitoring index data in each dimension reflects the health of the subject in real time, and the influence path extends from a single node to a service cluster, from the service cluster to a service unit, and from the service unit to the whole service line, with the radius of the influence range growing from small to large.
The management unit is the central console management system in the application; most management and configuration actions involved in the functions of the application (such as adding, modifying and deleting) depend on the functional modules of the management unit. The management unit may obtain metadata corresponding to each instance of the micro service from the micro service target unit, where the structure of the metadata is as shown in Table 1 below:
TABLE 1 metadata architecture
In Table 1, the second row defines a field named "id" with field type "String", meaning the micro service Id, which is generally generated using a UUID; in some embodiments the Id may also be generated with a database auto-increment field (AUTO_INCREMENT). The third row defines a field named "name" with field type "String", meaning the micro service name. The meaning of the other rows in Table 1 follows the same pattern as the second and third rows and is not repeated here.
The structure of the load carried in the metadata shown in table 1 is shown in table 2 below.
TABLE 2 metadata payload structure
In Table 2, the fourth row defines a field named "name" with field type "String": when several tasks need to be merged for processing, this name is mapped to the corresponding name running in the link cluster, and a prefix distinguishes it from a single task. The fifth row defines a field named "metadata" with field type "JSON Object": when a task runs in single mode, this name directly corresponds to the task name in the link. The meaning of the other rows in Table 2 follows the same pattern as the fourth and fifth rows and is not repeated here.
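To make the metadata layout described in Tables 1 and 2 concrete, the following is a minimal Java sketch of one possible shape of the instance metadata and its payload. Only the id, name and metadata fields are named above; the remaining fields (such as the list of supported monitor types) are illustrative assumptions, not part of the patent text.

```java
import java.util.List;
import java.util.Map;
import java.util.UUID;

/** Hypothetical shape of the instance metadata described in Table 1. */
record InstanceMetadata(
        String id,              // micro service Id, generally generated as a UUID (Table 1, row 2)
        String name,            // micro service name (Table 1, row 3)
        MetadataPayload metadata) {

    static InstanceMetadata of(String name, MetadataPayload payload) {
        return new InstanceMetadata(UUID.randomUUID().toString(), name, payload);
    }
}

/** Hypothetical shape of the payload described in Table 2. */
record MetadataPayload(
        String name,                  // task/job name, prefixed when several tasks are merged (Table 2, row 4)
        Map<String, Object> metadata, // JSON Object payload (Table 2, row 5)
        List<String> monitorTypes) {  // assumption: monitor types the instance supports, e.g. "cpu", "memory"
}
```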
It may be understood that the instances in the embodiments of the present application refer to IP nodes in a conventional deployment mode, Pods (the smallest deployment unit in K8s) in a Kubernetes (K8s) deployment mode, and the like, and one micro service may have multiple instances.
In some embodiments, the fault alarm self-healing system provided by the embodiment of the application further comprises a configuration center unit. The micro service target unit registers service information of each micro service instance with the configuration center unit, the configuration center unit generates a service instance list according to the service information, and the management unit acquires the service instance list from the configuration center unit and obtains the metadata corresponding to each micro service instance from that list. The service information comprises the service name, the monitoring acquisition address, the monitoring acquisition port and the metadata of the instance; the list formed by each service and its deployed instance nodes is called the instance list of the service, i.e., the service instance list.
Specifically, when each instance starts, the micro service target unit may register its own service information (including the service name, IP, port and the instance metadata) with the configuration center unit, and the configuration center unit stores the service information in the form of a service instance list. The management unit may actively pull the instances and the service instance list from the configuration center unit periodically, or it may subscribe to push events of the configuration center so that the configuration center actively pushes the instances and the service instance list to the management unit in real time. The main purposes of the management unit obtaining the service instance list are to monitor the state of each instance, monitor the state of each monitoring acquisition task, configure index calculation policies and rules (the preset index detection rules described below), configure anomaly detection policies and rules (the preset anomaly detection rules described below), and meet the requirements of monitoring and event data presentation.
In some embodiments, a communication mechanism exists between each micro service instance and the configuration center unit. If a target instance has not performed heartbeat detection with the configuration center within a fifth preset time, the configuration center unit may delete that instance from the service instance list. The fifth preset time may be, for example, 1 minute (min), 2 min or 3 min, preset by the user according to actual requirements; the target instance is an instance that has not communicated with the configuration center within the fifth preset time. The communication mode in the embodiment of the application may be heartbeat detection: for example, the micro service target unit periodically initiates a heartbeat request to the configuration center unit (or the configuration center unit periodically issues a heartbeat probe), and an instance that does not reply within the fifth preset time is regarded as a target instance and deleted from the service instance list by the configuration center unit.
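A minimal sketch of the heartbeat expiry check described above, assuming the configuration center unit keeps a last-heartbeat timestamp per instance. The fifth preset time is shown as a configurable duration; all class and method names are illustrative assumptions.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch: remove instances whose last heartbeat is older than the fifth preset time. */
class ServiceInstanceRegistry {
    private final Map<String, Instant> lastHeartbeat = new ConcurrentHashMap<>();
    private final Duration fifthPresetTime; // e.g. 1, 2 or 3 minutes

    ServiceInstanceRegistry(Duration fifthPresetTime) {
        this.fifthPresetTime = fifthPresetTime;
    }

    /** Called whenever an instance answers (or sends) a heartbeat. */
    void onHeartbeat(String instanceId) {
        lastHeartbeat.put(instanceId, Instant.now());
    }

    /** Called periodically by the configuration center unit to evict target instances. */
    void evictExpired() {
        Instant deadline = Instant.now().minus(fifthPresetTime);
        lastHeartbeat.entrySet().removeIf(e -> e.getValue().isBefore(deadline));
    }
}
```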
It can be understood that the service instance list is located in the configuration center unit; after the state of an instance changes, the change may be reported to the configuration center unit automatically, or the configuration center unit may discover it through the communication mechanism. The service instance list in the configuration center unit is updated in real time, while the management unit only synchronizes data (such as the instances and the service instance list) from the configuration center unit for other units to query or display, which can be understood as a backup of the instance list held in the configuration center unit.
As described in the above embodiments, the service instance list in the configuration center unit may be added to, updated or deleted from as the state of each instance changes. In the management unit, however, deletion does not occur: service instance list data may only be added or updated, and for an instance deleted by the configuration center unit, only its state is modified in the management unit (the state of this type of instance is updated to deleted), and its data is not deleted.
In some embodiments, the fault alarm self-healing system further includes a data storage unit. After the management unit periodically synchronizes the instances and metadata from the configuration center unit, it may send them to the data storage unit through an interface, and the data storage unit performs unified persistent storage of the instances and metadata for query by other units (such as the data calculation unit and data display unit mentioned below).
The configuration center unit provides the basic functions of micro service registration and service discovery, and also centrally and uniformly manages micro service configuration. In the process of deploying, starting and maintaining numerous micro service instances, managing configuration manually is costly, so using a configuration center unit to manage all micro services and configurations is very important and can greatly reduce labor cost. The configuration center unit can be implemented in many ways, for example with Eureka (a service discovery framework), Consul (a tool for service discovery and configuration sharing), Zookeeper (a distributed coordination service) and other technical schemes.
The management unit supports the generation and issuing of monitoring acquisition tasks (such as the monitoring acquisition tasks to be added and to be deleted), synchronous storage of the service information of each instance, and configuration, modification and verification of rules such as the preset index detection rules, the preset abnormality detection rules, event filtering and upgrading rules (see the description of the event alarm unit below) and other common rules. The management unit also provides each functional module with partial data operation statistics, data operation tools, function audit tools and other tools for handling abnormal functions.
In step S12, after the management unit obtains each piece of metadata, the monitoring acquisition tasks to be added and to be deleted may be determined according to the monitoring information carried in the metadata. The monitoring information can be understood as the monitoring types supported by each piece of metadata, obtained by the management unit parsing the metadata of each instance. A monitoring acquisition task to be added is a task corresponding to a monitoring acquisition type supported by the metadata, and a monitoring acquisition task to be deleted is a task corresponding to a monitoring acquisition type not supported by the metadata; whether metadata supports a certain monitoring acquisition type can be understood as whether the metadata reports that type.
For example, when each instance starts, the management unit parses the metadata of each instance and finds that the supported monitoring types reported by the metadata are central processing unit (Central Processing Unit, CPU) class indexes and Memory class indexes. The monitoring acquisition tasks to be added are then the task corresponding to the CPU class indexes and the task corresponding to the Memory class indexes, and all monitoring acquisition tasks other than these two can be regarded as monitoring acquisition tasks to be deleted.
In a subsequent process, after the management unit updates the metadata of each instance by active acquisition or passive pushing, it finds that the current metadata supports CPU class indexes and disk usage indexes; the monitoring acquisition tasks to be added are then the task corresponding to the CPU class indexes and the task corresponding to the disk usage indexes, and the tasks to be deleted are all tasks other than these two. It should be noted that the task corresponding to the CPU class indexes was already deployed when the instance started, so although it again appears as a task to be added here, in actual operation the monitoring acquisition unit no longer needs to deploy it and only deploys the task corresponding to the disk usage indexes. Meanwhile, since the monitoring acquisition task corresponding to the Memory class indexes was deployed when each instance started, the monitoring acquisition unit should stop that task based on the tasks to be deleted determined this time.
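One way to derive the to-be-added and to-be-deleted sets is to diff the monitor types currently supported by an instance's metadata against the task types already deployed, as in the sketch below. In the text, the management unit sends the full supported/unsupported sets and the acquisition unit de-duplicates; for brevity this illustrative sketch computes the net difference directly, and all names are assumptions.

```java
import java.util.HashSet;
import java.util.Set;

/** Sketch: diff supported monitor types against deployed task types. */
class TaskDiff {
    final Set<String> toAdd;    // supported now but not yet deployed
    final Set<String> toDelete; // deployed but no longer supported

    TaskDiff(Set<String> supportedTypes, Set<String> deployedTypes) {
        toAdd = new HashSet<>(supportedTypes);
        toAdd.removeAll(deployedTypes);
        toDelete = new HashSet<>(deployedTypes);
        toDelete.removeAll(supportedTypes);
    }
}

// Example from the text: at start-up the metadata reports "cpu" and "memory";
// after an update it reports "cpu" and "disk", so "disk" is to be added,
// "memory" is to be deleted, and the "cpu" task stays in place.
```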
The monitoring types currently supported can be divided into five layers: a base layer, a middleware layer, an application layer, a service layer and a user layer, and each layer has various types of indexes. For example, the base layer may include the system CPU usage, the total system Memory, and the usage and total capacity of each main mounted disk path; the middleware layer has different indexes for different types of middleware, for example for Kafka (an open-source stream processing platform) middleware the supported indexes may include the Kafka memory usage, the total Kafka memory applied for, and the like.
In step S13, the management unit sends the determined monitoring acquisition tasks to be added and to be deleted to the monitoring acquisition unit. The management unit also supports synchronization and query of the state of each monitoring acquisition task in the monitoring acquisition unit and governs abnormal task states; for example, it provides services such as task start, task stop and task interruption to operators so that abnormal states can be handled through operator actions.
Regarding the management unit supporting synchronization and query of the state of each monitoring acquisition task in the monitoring acquisition unit, in some embodiments the monitoring acquisition unit may synchronize its monitoring acquisition tasks and their states to the management unit at regular intervals, so that the liveness of the monitoring acquisition unit and the state data of the monitoring acquisition tasks are updated to the management unit in real time, which is convenient for operation and management. In addition, the management unit periodically synchronizes each acquisition unit, the acquisition tasks distributed on it and the states of the acquired data results to the data storage unit.
In addition, a monitoring acquisition task to be added that the management unit sends to the monitoring acquisition unit carries information such as the identification, name, task type and task operation configuration content of the task. The task type indicates whether the monitoring acquisition task is a one-off task or a task executed at regular intervals; the configuration content of the acquisition task can be a script or a Shell command (a series of instructions entered through a command-line interface or command interpreter); the content of the monitoring acquisition task can be divided into telnet (remote terminal protocol) type tasks, HTTP (hypertext transfer protocol) type tasks and the like; and the configuration of the monitoring acquisition task may differ depending on the technical scheme adopted, such as Filebeat (a lightweight log collection and processing tool), ELK (a solution for log collection, processing and display), secondary development of logkit (a log collection and processing tool), and so on.
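A sketch of a task descriptor carrying the fields listed above (identification, name, task type, operation configuration content); the enum values and the extra schedule field are assumptions added for illustration.

```java
/** Sketch of a monitoring acquisition task descriptor. */
class MonitorTask {
    enum RunMode { ONE_SHOT, SCHEDULED }                      // one-off vs. executed at regular intervals
    enum TaskKind { TELNET_PROBE, HTTP_PROBE, SHELL_SCRIPT }  // assumed task kinds

    String id;             // task identification
    String name;           // task name
    RunMode runMode;       // task type
    TaskKind kind;         // telnet-type, HTTP-type, script/Shell command, ...
    String config;         // task operation configuration content (script text, URL, ...)
    String cronOrInterval; // assumption: schedule expression used when runMode == SCHEDULED
}
```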
In the step S14, after the management unit sends the to-be-added monitoring acquisition task and the to-be-deleted monitoring acquisition task to the monitoring acquisition unit, the monitoring acquisition unit may deploy a new monitoring acquisition task according to the to-be-added monitoring acquisition task, and stop the corresponding monitoring acquisition task according to the to-be-deleted monitoring acquisition task, so as to realize the acquisition of the monitoring data.
Continuing with the example in step S12: when each instance starts, the monitoring acquisition tasks to be added that the management unit sends to the monitoring acquisition unit are the task corresponding to the CPU class indexes and the task corresponding to the Memory class indexes, and the tasks to be deleted are all tasks other than these two. At this time, if the task corresponding to the CPU class indexes and the task corresponding to the Memory class indexes do not yet exist in the monitoring acquisition unit, they need to be deployed; and if monitoring acquisition tasks other than these two exist in the monitoring acquisition unit, they need to be stopped.
In the subsequent process, after the management unit updates the metadata of each instance, it again issues monitoring acquisition tasks to be added and to be deleted, where the tasks to be added are the task corresponding to the CPU class indexes and the task corresponding to the disk usage indexes, and the tasks to be deleted are all tasks other than these two. At this time, since the task corresponding to the CPU class indexes was already added to the monitoring acquisition unit when each instance started, the monitoring acquisition unit only needs to add the task corresponding to the disk usage indexes; meanwhile, since the monitoring acquisition unit also added the task corresponding to the Memory class indexes when each instance started, it needs to stop that task.
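A sketch of how the monitoring acquisition unit might apply an issued task list idempotently, deploying only tasks not yet running and stopping only tasks that are, which matches the behaviour described in the two paragraphs above. It reuses the MonitorTask sketch shown earlier; all names are illustrative assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch: idempotent handling of to-be-added / to-be-deleted task lists. */
class AcquisitionAgent {
    private final Map<String, MonitorTask> running = new ConcurrentHashMap<>();

    void apply(List<MonitorTask> toAdd, List<String> toDeleteIds) {
        for (MonitorTask task : toAdd) {
            // already deployed (e.g. the CPU task from start-up): skip re-deployment
            running.computeIfAbsent(task.id, id -> { deploy(task); return task; });
        }
        for (String id : toDeleteIds) {
            MonitorTask removed = running.remove(id);
            if (removed != null) {
                stop(removed); // e.g. stop the Memory-type task after the metadata update
            }
        }
    }

    private void deploy(MonitorTask task) { /* start collection according to task.config */ }
    private void stop(MonitorTask task)   { /* stop the running collection job */ }
}
```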
It can be understood that the unified management of monitoring acquisition tasks by the management unit is only for convenience of operation; the actual monitoring acquisition tasks are distributed to the instances on each distributed node. For example, in the application there is mainly an Agent process deployed on a physical machine node (Node), and the Agent on that node collects from the services deployed on the node, the services running in containers on the node, or the K8s Pods deployed on the node, and so on.
In the embodiment of the application, the monitoring acquisition unit mainly comprises multiple technologies and components and is mainly used to acquire multi-level, multi-dimensional monitoring index data covering hardware to software and network to application. Its collection capability includes the hardware level, such as host temperature, network card information and network bandwidth; the operating system level, such as CPU, Memory and disk; various service index data of databases and middleware; service-defined instrumentation (embedded point) index data; and so on.
After receiving a monitoring acquisition task issued by the management unit, the monitoring acquisition unit can acquire monitoring index data of the specified dimension once or repeatedly according to the task, where a dimension can be understood as an index type (in the embodiment of the application, dimension mainly refers to the dimensions used to express time sequence index data after index modeling). The monitoring acquisition unit and the management unit have real-time communication capability: the management unit can issue acquisition tasks to each monitoring acquisition unit component at any time, and after completing the acquisition of monitoring index data, the monitoring acquisition unit components report the data to the data transmission unit in real time.
Besides completing monitoring coverage of the instances, the monitoring component (i.e. the monitoring acquisition unit component) also needs to cover the calling relations and dependency relations among the instances, so that after one or more instances fail, the affected dependent services can be analyzed quickly, the influence range can be evaluated, and the calling relations, fault influence relations and so on can be displayed intuitively in a fault tree.
The monitoring acquisition unit can be implemented in various ways. A common one is Agent-based acquisition, in which the acquired monitoring index data is reported to the data transmission unit in real time. Another scheme is based on the Exporter component (a component that collects monitoring data and exposes it externally through the Prometheus monitoring standard, where Prometheus is an open-source service monitoring system and time series database): the monitoring index data is exposed in the Agent through an HTTP interface and pulled uniformly by a collection center. This scheme is used in fewer scenarios, mainly because uniform pulling aggravates the pressure on the collection center, so it is generally not adopted; instead, data is collected uniformly by the Agent and then uploaded or reported to the data transmission channel to complete the collection and reporting of the monitoring index data.
The monitoring acquisition unit also faces special situations. For example, acquisition based on network hardware requires calling interfaces provided by the hardware vendor to complete the acquisition of monitoring index data; there are various schemes at the operating system level; for middleware and databases, the acquisition schemes differ per middleware or database; and cloud-native schemes in newer technical environments are clearly different again. The acquisition schemes are therefore inconsistent across scenarios and environments, and multiple schemes need to be adopted for compatibility and adaptation.
In the step S15, the monitoring and collecting unit may send the collected monitoring index data to the data calculating unit in addition to sending the collected monitoring index data to the management unit.
In some embodiments, the fault alarm self-healing system provided by the embodiment of the application further includes a data transmission unit, the monitoring acquisition unit may send the monitoring index data to the data transmission unit, and then the data transmission unit sends the monitoring index data to the data calculation unit, or the data calculation unit may consume the monitoring index data in real time from the data transmission unit.
The data transmission unit mainly plays the role of a data service bus in the application, realizing the reporting and transmission of monitoring index data: the monitoring acquisition unit reports the acquired multi-dimensional, multi-level monitoring index data to the data transmission unit, and the data calculation unit and the plan learning unit consume the data in real time.
The data transmission unit also supports transmitting abnormal events for consumption among multiple services. For example, the event alarm unit sends an abnormal event to the data transmission unit, and the abnormal event is consumed by the plan handling unit and the plan learning unit, so that plan handling and plan learning are completed. This part is described further below.
Currently, the main technical solution for implementing the data transmission unit is the Apache Kafka (distributed data stream processing platform) technology stack; besides Apache Kafka, other technology stacks such as Apache ActiveMQ (a popular open-source message bus produced by Apache), RabbitMQ (an open-source message broker and queue server developed in Erlang for lightweight communication between systems), ZeroMQ (a high-performance asynchronous messaging library) and RocketMQ (a distributed message middleware) can also be used.
In step S16, after the data calculation unit obtains the monitoring index data, it may detect the data in real time according to the preset index detection rules to obtain a first detection result, generate time sequence index data based on the first detection result, and store the generated time sequence index data in the data storage unit. The dimensions of the generated time sequence index data include the service name (serviceName); the index type (Metric), e.g. iaas and app respectively represent indexes of the base layer and the application layer; the monitoring type (MonitorType), e.g. db for database indexes and kafka for Kafka middleware indexes; the index name (MetricName), e.g. CpuUseRate for CPU utilization; and the instance (Endpoint), e.g. 192.168.1.100 or a Pod name.
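A sketch of a time sequence index data point carrying the dimensions listed above; the timestamp and value fields are assumptions added to make the record self-contained.

```java
/** Sketch of a time sequence index data point with the dimensions named in the text. */
record TimeSeriesPoint(
        String serviceName,  // e.g. "order-service"
        String metric,       // index type, e.g. "iaas" (base layer) or "app" (application layer)
        String monitorType,  // e.g. "db" for database indexes, "kafka" for Kafka middleware indexes
        String metricName,   // e.g. "CpuUseRate" for CPU utilization
        String endpoint,     // instance, e.g. "192.168.1.100" or a Pod name
        long   epochMillis,  // assumption: sample timestamp
        double value) {      // assumption: sample value
}
```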
In the embodiment of the present application, the preset index detection rules include a static index detection rule and a dynamic index detection rule, and a specific example is used to describe the principle of generating time sequence index data according to the preset index detection rules and monitoring index data in the embodiment of the present application.
For the static index detection rules: for example, if the monitoring index data is the memory usage and total memory of a host node, the data calculation unit can perform a secondary calculation on them to obtain the per-minute memory usage rate of the host node; similarly, if the monitoring index data is the disk usage and total disk capacity under a certain path, the data calculation unit can perform a secondary calculation on them to obtain the per-minute disk usage rate of the host node under that path.
For the dynamic index detection rules: for example, error request count index data of interface error codes can be calculated statistically per minute in each service dimension and each interface dimension; or error rate index data can be calculated per minute in each service dimension and each interface dimension by dividing the number of interface errors by the total number of requests. It should be understood that these static and dynamic index detection rules are merely examples and do not represent all static and dynamic index detection rules in the present application.
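Minimal sketches of the two kinds of secondary calculation mentioned above: a static rule deriving a per-minute usage rate from a used/total pair, and a dynamic rule deriving a per-interface error rate. The formulas follow directly from the text; the class and method names are assumptions.

```java
/** Sketch of secondary (derived-metric) calculations used by index detection rules. */
final class DerivedMetrics {
    private DerivedMetrics() {}

    /** Static rule: per-minute memory (or disk) usage rate = used / total. */
    static double usageRate(double used, double total) {
        return total == 0 ? 0.0 : used / total;
    }

    /** Dynamic rule: per-minute error rate of one interface = errorCount / requestCount. */
    static double errorRate(long errorCount, long requestCount) {
        return requestCount == 0 ? 0.0 : (double) errorCount / requestCount;
    }
}
```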
In addition, after generating the time sequence index data according to the first detection result, the data calculation unit may detect the index data (i.e. the monitoring index data and the time sequence index data) in real time according to a preset abnormality detection rule, obtain a second detection result, and generate an abnormal event according to the second detection result.
Specifically, the data calculation unit detects the index data in real time with the preset abnormality detection rules, and when the index data matches a preset abnormality detection rule, a corresponding abnormal event is generated. That is, the abnormality indicated by the abnormal event can be understood as the content of the preset abnormality detection rule that was matched, or the abnormal event can be understood as indicating the reason the rule was matched. The preset abnormality detection rules include static abnormality detection rules and dynamic abnormality detection rules; specific examples of generating an abnormal event from the preset abnormality detection rules and the index data are described below.
For the static abnormality detection rules: for example, if a static rule is configured as "the interface request error count remains greater than 100 for 3 consecutive minutes", an abnormal event is generated when the index data shows that this condition holds; similarly, if a static rule is configured as "the error count remains greater than 50 for 5 consecutive minutes", an abnormal event is generated when the index data meets that condition.
For the dynamic abnormality detection rules: for example, if a dynamic rule is configured as "the throughput of a certain interface fluctuates up or down by more than 10%", an abnormal event is generated when the index data shows such a fluctuation, and the event can indicate that the interface requests are abnormal, that the interface is under attack, that the service is unavailable, and so on. As another example, if a dynamic rule is configured for the case where the throughput of a certain interface fluctuates up or down by less than 10%, an abnormal event is generated when the index data indicates that condition, and the event can indicate that a usability problem has occurred in the interface service.
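Sketches of the two anomaly rules quoted above: a static rule that fires when the per-minute error count stays above a threshold for N consecutive minutes, and a dynamic rule that fires when throughput changes by more than a given fraction relative to the previous sample. The thresholds come from the examples in the text; the structure and names are assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch of the static rule "error count > 100 for 3 consecutive minutes". */
class ConsecutiveMinutesRule {
    private final int windowMinutes;   // e.g. 3
    private final double threshold;    // e.g. 100
    private final Deque<Double> lastValues = new ArrayDeque<>();

    ConsecutiveMinutesRule(int windowMinutes, double threshold) {
        this.windowMinutes = windowMinutes;
        this.threshold = threshold;
    }

    /** Feed one per-minute sample; returns true when an abnormal event should be generated. */
    boolean onMinuteSample(double errorCount) {
        lastValues.addLast(errorCount);
        if (lastValues.size() > windowMinutes) {
            lastValues.removeFirst();
        }
        return lastValues.size() == windowMinutes
                && lastValues.stream().allMatch(v -> v > threshold);
    }
}

/** Sketch of the dynamic rule "throughput fluctuates up or down by more than 10%". */
class FluctuationRule {
    private final double maxRelativeChange; // e.g. 0.10
    private Double previous;

    FluctuationRule(double maxRelativeChange) {
        this.maxRelativeChange = maxRelativeChange;
    }

    boolean onSample(double throughput) {
        boolean abnormal = previous != null && previous != 0
                && Math.abs(throughput - previous) / previous > maxRelativeChange;
        previous = throughput;
        return abnormal;
    }
}
```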
Currently, the main technical solutions for the data calculation unit in the field of big-data real-time computing include MapReduce (a distributed computing model) and Apache Spark (a memory-based distributed computing framework) based on the batch processing mode, and Apache Storm and Apache Flink based on the real-time stream processing mode. In the embodiment of the present application, the data calculation unit is implemented with Apache Flink.
It will be appreciated that, compared with Spark and Storm, Flink addresses their shortcomings in stream processing, state management and fault tolerance, and time and watermark mechanisms. First, Flink provides a true record-at-a-time stream processing model, enabling lower latency, and can provide data processing capability closer to real time than Spark Streaming (which converts streams into small batches); this processing model gives Flink better timeliness when processing streaming data and meets the real-time requirements of a large number of services. Second, the design of Flink makes it more efficient for stateful operations in terms of state management and fault tolerance: its state management is designed to be pluggable and provides built-in keyed state and operator state. Furthermore, regarding time and watermark processing, Flink provides a more powerful and flexible mechanism that can better handle event time and processing time, which is particularly important when dealing with late or out-of-order data caused by factors such as network delay. By introducing the concepts of watermarks (mainly for handling out-of-order events) and allowed lateness (mainly for handling late data), Flink handles these problems more effectively, whereas Spark and Storm were relatively deficient here in earlier versions. In summary, Flink addresses the deficiencies of Spark and Storm in stream processing timeliness, state management, fault tolerance and time handling through a true record-at-a-time stream processing model, efficient state management and fault tolerance mechanisms, and powerful time and watermark processing capabilities; implementing the data calculation unit with Apache Flink is therefore more advantageous than using Spark or Storm.
In the step S17, the data calculating unit may send the generated abnormal event to the plan handling unit, or the plan handling unit may consume the abnormal event generated by the data calculating unit in real time.
In step S18, after the plan handling unit obtains the abnormal events, it can retrieve the fault self-healing scheme (also called a plan) matched with each abnormal event and perform fault self-healing processing on the abnormal events according to the scheme.
In some embodiments, the fault alarm self-healing system further comprises an event alarm unit. In this case, sending the abnormal events to the plan handling unit in step S17 comprises: the data calculation unit sends the abnormal events to the event alarm unit, the event alarm unit obtains target abnormal events after classification processing of the abnormal events, and the event alarm unit then sends the target abnormal events to the plan handling unit. The classification processing of abnormal events includes any one or more of the following classification modes:
In the first classification mode, if a plurality of identical abnormal events are received within a first preset time, one of them is taken as the target abnormal event. That is, the event alarm unit filters abnormal events by rule: if multiple repeated events for the same abnormality are transmitted within the prescribed time period, the event alarm unit filters these repeated abnormal events to achieve de-duplication.
In the second classification mode, if a plurality of abnormal events, other events and fault events with the same attribute or rule exist within a second preset time, they are merged according to a preset merging rule to obtain the target abnormal event. That is, the event alarm unit can merge the received abnormal events according to the preset merging rule; the abnormal events that need to be merged mainly refer to abnormal events with similar characteristics or service attributes. After such abnormal events are aggregated according to the preset merging rules, a target abnormal event and a notification describing its nature can be generated.
In the third classification mode, if a plurality of abnormal events with similar characteristics or service attributes exist within a third preset time, these abnormal events are merged and converged to obtain the target abnormal event. That is, the event alarm unit may merge and converge abnormal events with the same rule: the events to be processed in this way are the abnormal events with similar characteristics or service attributes existing within the third preset time, and after they are merged and converged the target abnormal event is obtained.
After the plan handling unit obtains the target abnormal event, it can retrieve the fault self-healing scheme matched with it. Specifically, the plan handling unit extracts relevant features from the content of the target abnormal event and then matches these features to the corresponding fault self-healing scheme; the extracted features can be the id of the target abnormal event, content keywords of the target abnormal event, and so on.
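A sketch of the matching step just described: extract features from the target abnormal event (its id and content keywords) and look up a fault self-healing scheme indexed by those features. The data structures and names are illustrative assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;

/** Sketch: match a target abnormal event to a fault self-healing scheme by its features. */
class PlanMatcher {
    private final Map<String, String> planByEventId; // event id -> plan id
    private final Map<String, String> planByKeyword; // content keyword -> plan id

    PlanMatcher(Map<String, String> planByEventId, Map<String, String> planByKeyword) {
        this.planByEventId = planByEventId;
        this.planByKeyword = planByKeyword;
    }

    Optional<String> match(String eventId, List<String> contentKeywords) {
        String byId = planByEventId.get(eventId);
        if (byId != null) {
            return Optional.of(byId);
        }
        return contentKeywords.stream()
                .map(planByKeyword::get)
                .filter(Objects::nonNull)
                .findFirst();
    }
}
```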
In addition, the plan handling unit may obtain the fault self-healing scheme in the following way:
The plan learning unit may consume the index data in the data transmission unit and the target abnormal events in the event alarm unit, obtain handling actions for the target abnormal events based on a machine learning algorithm, and then send the handling actions to the plan recommending and managing unit. After receiving the handling actions sent by the plan learning unit, the plan recommending and managing unit optimizes them to obtain optimized handling actions and sends them to the data storage unit. After receiving the optimized handling actions, the data storage unit associates them with the abnormal events to obtain abnormal event configuration data. The plan configuration unit obtains the abnormal event configuration data from the data storage unit, performs structuring processing on it to obtain a structure, and determines the fault self-healing scheme of each abnormal event according to the structure.
In the embodiment of the application, the role of the plan recommending and managing unit is mainly to let the relevant operation staff filter and configure the plans recommended by machine learning for the associated abnormal events; after a plan is activated, the associated plan processing is triggered the next time the related abnormal event occurs. When selecting and filtering plans recommended by machine learning, an operator can perform fault prediction with the recommended plan and a test event to check the effectiveness of the plan, and can also run the whole plan handling in the real environment one or more times in advance to test whether the execution result of the plan really meets expectations.
In some embodiments, the abnormal event configuration data in the data storage unit may further include configuration data obtained by associating handling methods of possible abnormal events, configured in advance by experts based on experience, with the corresponding abnormal events. Alternatively, the plan learning unit may perform machine learning in advance on historical index data and historical abnormal events to obtain handling actions corresponding to the historical abnormal events, and the plan recommending and managing unit optimizes these handling actions; the historical abnormal events can then be understood as abnormal events that may occur, and the abnormal event configuration data stored in the data storage unit is obtained based on the optimized handling actions and the historical abnormal events.
It can be understood that, when the event alarm unit performs no processing and each abnormal event is sent directly to the plan handling unit, the principle of obtaining the fault self-healing scheme is the same as when the target abnormal event is sent to the plan handling unit, and is not repeated here.
In some embodiments, the plan handling unit may, at system initialization, synchronize the abnormal event configuration data in full from the data storage unit (including the configuration data obtained from expert configuration, the configuration data obtained by the plan learning unit and the plan recommending and managing unit processing the index data and abnormal events in real time, and the configuration data obtained by them processing the historical index data and historical abnormal events). It may afterwards synchronize the incremental abnormal event configuration data (i.e., the index data and abnormal events are processed in real time by the plan learning unit and the plan recommending and managing unit, and the resulting optimized handling actions are sent to the data storage unit to form abnormal event configuration data) and perform structuring processing on the configuration data. System initialization generally means that a service loads some configuration data from a database during start-up; some systems may also, while running, load data from the database at certain intervals (e.g., once every hour) to initialize the values of some system variables.
In the embodiment of the present application, the structuring processing means that the configuration of a plan may be of various types, such as calling an interface or executing a script, and each plan must be associated with an abnormal event. This association strongly binds the id of the abnormal event to the id of the plan task configuration when the system initializes or preheats the data, and the content of the plan configuration is regarded as the value of the structured data.
The process of obtaining the fault self-healing scheme for an abnormal event is described with an example. Suppose the plan handling unit receives an abnormal event whose content is that the CPU utilization exceeds 80%; the plan handling unit may then query whether this abnormal event has corresponding abnormal event configuration data (or a structure). If the query hits, the handling action in the configuration data (or structure) is called and executed; for example, if it hits a restart-type plan, the plan handling unit calls the corresponding plan to restart the problematic node.
It will be appreciated that an abnormal event may be associated with multiple plans; for example, when a node's CPU usage exceeds 80%, its associated plans may include deleting log files in a specified directory, restarting services, calling an interface to complete a vendor switch, and so on.
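Since one abnormal event may be associated with several plans, the structured configuration can be kept as a map from abnormal-event id to an ordered list of plan actions and executed in turn, as in the sketch below. The action types follow the plan configuration types named in the text (calling an interface, executing a script); everything else is an illustrative assumption.

```java
import java.util.List;
import java.util.Map;

/** Sketch: structured plan configuration, abnormal-event id -> ordered plan actions. */
class PlanExecutor {
    enum ActionType { CALL_INTERFACE, EXECUTE_SCRIPT } // plan configuration types from the text

    record PlanAction(String planId, ActionType type, String content) {}

    private final Map<String, List<PlanAction>> plansByEventId;

    PlanExecutor(Map<String, List<PlanAction>> plansByEventId) {
        this.plansByEventId = plansByEventId;
    }

    /** Execute every plan associated with the abnormal event, recording each step. */
    void handle(String abnormalEventId) {
        for (PlanAction action : plansByEventId.getOrDefault(abnormalEventId, List.of())) {
            switch (action.type()) {
                case CALL_INTERFACE -> { /* call the configured interface, e.g. restart a node */ }
                case EXECUTE_SCRIPT -> { /* run the configured script, e.g. delete old log files */ }
            }
            // record the handling result so every step can be reviewed afterwards
        }
    }
}
```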
The plan learning unit comprises an online part and an offline part. It classifies faults from the original logs, the index data and historical fault-related data using machine learning, performs model selection, positive/negative sample ratio adjustment and algorithm parameter tuning on the classified faults, and evaluates the prediction effect; it then recommends results to operation and maintenance staff, ordered from high to low according to the evaluation, for judgment and filtering, and synchronizes the evaluation results to the plan recommending and managing unit.
In some embodiments, when a fault self-healing scheme processes an abnormal event, the result of running each handling action of the scheme is recorded. A handling event is also saved for each step of the plan handling process, and the process itself is saved, so that every step of the plan handling has a history record that can be queried for later review.
The management of plans by the plan recommending and managing unit also includes functions for manually adding, modifying and deleting plans. These functions are mainly provided for expert operation and maintenance engineers, who complete the configuration, verification and activation of historical experience plans. Such plans do not need to be recommended by machine learning; they are relatively fixed, reusable deposits of experience, for example: when the disk usage rate exceeds 80%, files in a specified directory need to be deleted.
In addition, the plan recommending and managing unit has an important function of manually scoring and labeling the plans recommended by machine learning, so that the scored plans can be fed back to the machine learning module to further optimize the learning effect and make the recommended results more accurate. Meanwhile, after a plan is triggered manually, the execution result can be fed back to the machine learning module: if execution succeeds, it is added as a positive sample; if execution fails, it is added as a negative sample.
In some embodiments, the event alarm unit may also silence certain abnormal events according to service rule requirements. Specifically, for some abnormal events, no alarm notification needs to be sent within the time range specified by the service rule, but the alarm event still needs to be recorded; silence processing can be adopted for such abnormal events.
In addition, the event alarm unit may classify and upgrade abnormal events according to specified rules, generating and notifying upgraded abnormal events according to predetermined rules. In some embodiments, if a recovery event for an abnormal event is not received within a fourth preset time (for example, if no recovery event is received for more than 10 minutes after the abnormal event is alarmed), the abnormal event may be upgraded to obtain an upgraded abnormal event, and the on-duty shift and the responsible person of the product line service are notified of it; for example, the event alarm unit sends the upgraded abnormal event to the alarm sending unit, and the alarm sending unit notifies the on-duty shift and the responsible person of the product line service.
It can be understood that, after the data calculation unit detects the index data in real time according to the preset abnormality detection rules, a third detection result can be obtained and a recovery event can be generated based on it; the data calculation unit can also send the recovery event to the event alarm unit. Specifically, the recovery event may be generated when the index data no longer matches the preset abnormality detection rule. For example, if the preset abnormality detection rule requires that the interface error request count continuously exceeds 500 within 5 minutes, the data calculation unit determines that the current index data shows this condition and therefore generates an abnormal event; after a period of time the data calculation unit obtains new index data, determines that the condition no longer holds, and generates a recovery event for that abnormal event.
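A sketch of the abnormal/recovery pairing just described: the detector keeps per-rule state, emits an abnormal event when the rule first matches and a recovery event once later samples no longer match. The event shapes and names are assumptions.

```java
/** Sketch: emit an abnormal event on first rule match, a recovery event when it clears. */
class AnomalyStateTracker {
    private boolean firing;

    enum Emission { NONE, ABNORMAL_EVENT, RECOVERY_EVENT }

    Emission onEvaluation(boolean ruleMatched) {
        if (ruleMatched && !firing) {
            firing = true;
            return Emission.ABNORMAL_EVENT; // e.g. interface errors exceeded 500 within 5 minutes
        }
        if (!ruleMatched && firing) {
            firing = false;
            return Emission.RECOVERY_EVENT; // the condition no longer holds
        }
        return Emission.NONE;
    }
}
```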
Furthermore, the event alarm unit may send alarm event notifications of different levels to specified receiving objects according to specified rules or subscription rules. The event alarm unit can also uniformly send the received abnormal events, other events, fault events and the like to a transmission channel of the data transmission unit, which serves as a standby channel and also as a bypass channel for other business logic.
In some embodiments, the fault alarm self-healing system provided by the embodiment of the present application further includes a data storage unit, which comprises all the systems and components that persistently store data (including the metadata related to service registration and discovery, monitoring index data, time sequence index data, abnormal events, recovery events, monitoring acquisition tasks, preset abnormality detection rules, alarm policies and rules, abnormal event configuration data, etc.). Part of this data is stored with a relational database scheme: monitoring acquisition tasks, preset abnormality detection rules, alarm policies and rules, abnormal event configuration data and the like can be stored directly in a relational database system such as a MySQL database cluster. The index data, abnormal events and the like may be stored in a time series database or a non-relational database, for example a distributed search and analysis engine (Elasticsearch) or a time series database (InfluxDB).
In some embodiments, the fault alarm self-healing system provided by the embodiment of the present application further includes a data display unit, which is mainly used for filtering and viewing index data (mainly the time sequence index data here), filtering and viewing abnormal events, filtering and viewing fault self-healing schemes, and the like. The filtering criteria are determined by the user; for example, index data based on a multi-dimensional fixed model is stored in the data storage unit, and the user may need to view index data of various dimensions according to actual needs, in which case the data display unit filters according to the user's instructions and displays the results. Other tasks, such as viewing monitoring acquisition tasks, alarm policies and preset abnormality detection rules, are also within its scope. The application can use the open-source Grafana as the data display component to display, query and filter the index data, abnormal events, and so on.
In some embodiments, the fault alarm self-healing system provided by the embodiment of the application further includes an alarm sending unit. The alarm sending unit notifies the corresponding receiving objects through different channels according to the different levels of the abnormal events generated by the event alarm unit, completes batch pushing, single pushing and retransmission of abnormal events through different technical schemes, and performs targeted compensation processing according to the results returned from calls to the application programming interfaces (Application Programming Interface, API).
For WeChat abnormal event notification, the alarm sending unit can call the WeChat API to complete the pushing of the WeChat abnormal event notification, and the subscribers of the corresponding WeChat official account receive the abnormal event notification through the official account push. The WeChat abnormal event pushing function requires the user receiving the push to follow and subscribe to the official account in advance, after which they can receive alarm event pushes from the WeChat official account.
Short message abnormal event notification means that the notification must be sent to the specified mobile phone number through multiple short message sending channels. Multiple channels means that each user accesses several SMS suppliers, each supplier providing one channel. On this basis, if one or more channels become unavailable, the availability of the short message alarm channel (i.e., the success rate of short message notification) can be guaranteed by switching or closing channels in time. In other words, when a channel is detected to be unavailable during short message pushing, the unavailable channel must be filtered out in time. This can be realized by presetting certain detection rules; for example, if sending fails for three consecutive users on a channel, the channel needs to be closed, and the abnormal events that failed to send need to be switched to other short message channels to complete sending, i.e., channel switching.
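A sketch of the multi-channel failover just described: a channel is closed after a configurable number of consecutive send failures (three in the example above) and failed notifications are resent through the next available channel. The interface and names are illustrative assumptions.

```java
import java.util.List;

/** Sketch: close an SMS channel after consecutive failures and fail over to another one. */
class SmsChannelFailover {
    interface SmsChannel {
        boolean send(String phoneNumber, String message); // true on success
    }

    private final List<SmsChannel> channels;
    private final int maxConsecutiveFailures; // e.g. 3
    private final int[] consecutiveFailures;

    SmsChannelFailover(List<SmsChannel> channels, int maxConsecutiveFailures) {
        this.channels = channels;
        this.maxConsecutiveFailures = maxConsecutiveFailures;
        this.consecutiveFailures = new int[channels.size()];
    }

    boolean notifyWithFailover(String phoneNumber, String message) {
        for (int i = 0; i < channels.size(); i++) {
            if (consecutiveFailures[i] >= maxConsecutiveFailures) {
                continue;                        // channel closed, skip it
            }
            if (channels.get(i).send(phoneNumber, message)) {
                consecutiveFailures[i] = 0;
                return true;
            }
            consecutiveFailures[i]++;            // failed: count and switch to the next channel
        }
        return false;                            // all channels unavailable
    }
}
```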
Meanwhile, the short message alarm channel can add a sensitive word filtering function, and the alarm channel service also needs to subscribe to and process the status codes returned by each operator or gateway service interface, including retransmitting notifications in states that can be retransmitted, and so on.
Telephone abnormal event notification means that telephone notifications (abnormal event notifications with very high real-time requirements) must be delivered as voice calls to the user at the specified mobile phone number through multiple voice sending gateway channels. If one or more channels are unavailable, the channels need to be switched or closed in time to ensure the availability of the voice abnormal event notification channel. The voice alarm event notification channel needs to subscribe to and process the sending states returned by the APIs provided by each voice operator; if a state requires retransmission, the channel must be switched in time or the retransmission delayed, and an upper limit on the number of retransmissions must be set, beyond which no further retransmission is performed and the notification is marked as failed.
In some embodiments, the fault alarm self-healing system provided by the embodiment of the present application further includes an alarm receiving body, which is the object that receives the alarm events. In the micro service availability guarantee system, the objects that need to receive abnormal event notifications generally include the SRE on-call duty person, the SRE on-call shift, the developers related to the micro service, and so on. In some embodiments, the receiving body definition may further include the receiving methods that can be used, for example DingTalk messages, short messages, telephone, and so on.
The information interaction between the units is described below with reference to the fault alarm self-healing system shown in Fig. 2. It can be understood that Fig. 2 is only an example: the units it contains are not all the units of the fault alarm self-healing system of the present application, and the data transmission shown between units does not include all the data transmission between units in the system.
As can be seen from Fig. 2, the configuration center unit 202 is an important component for service registration and service discovery. When micro service deployment starts, the micro service target unit 201 registers each instance of the micro service and the service information of each instance with the configuration center unit 202, where the service information includes information about the service itself and other information the service needs to expose, such as the monitoring acquisition address and port. The configuration center unit 202 provides the service registration and service discovery functions. Refer to the description in step S11 above.
After registration is completed, the configuration center unit 202 stores the registered service information and detects the health status of each registered instance; if detection fails a preset number of times (i.e., the configured upper limit of health detections, such as N times), the instance is considered offline, removed from the configuration center unit 202, and the service discovery clients are notified. Refer to the description in step S11 above.
The management unit 204 periodically pulls service registration and deregistration events from the configuration center unit 202 (or passively subscribes to them) and reads the service information. It generates the monitoring acquisition tasks to be added and to be deleted according to the preset index detection rules and the preset abnormality detection rules and issues them to the designated monitoring acquisition unit 203 to realize task issuing. Meanwhile, the management unit 204 synchronously stores the service information synchronized from the configuration center unit 202 into the data storage unit 207 to realize the storage and query of service information. Refer to the descriptions in steps S12 and S13 above.
After the monitoring acquisition unit 203 passively receives the monitoring acquisition tasks (including the tasks to be added and the tasks to be deleted) issued by the management unit 204, it completes the acquisition of monitoring index data according to the issued tasks and reports the acquired monitoring index data to the data transmission unit 205 in real time (i.e., monitoring data reporting). The monitoring acquisition unit 203 may implement interface detection and monitoring data acquisition against the micro-service target unit 201. In addition, the monitoring acquisition unit 203 can periodically synchronize its acquisition tasks and states to the management unit 204 (i.e., state reporting), so that the liveness of the monitoring acquisition unit 203 and the states and data of the acquisition tasks are updated to the management unit 204 in real time for subsequent operation and processing. The management unit 204 periodically synchronizes each acquisition unit, the acquisition tasks distributed on it, and the state of the acquired data results into the data storage unit 207; that is, the storage unit provides storage and query functions, for example the management unit 204 can store or query service information through the data storage unit 207. It can be understood that, during operation, a micro-service instance node can output monitoring indexes reflecting the health state of the micro service through standard behaviors at the hardware/network layer, infrastructure layer, operating system layer, application layer and so on, and can also output service monitoring indexes implemented based on technologies such as agents. These multi-dimensional monitoring index data directly or indirectly reflect the health state of the micro service on the node; they can be acquired periodically by the monitoring acquisition unit, uploaded in real time to a big-data-based cloud acquisition cluster, and transmitted in real time through a big-data transmission channel. Reference is specifically made to the descriptions in step S14 and step S15 described above.
The data storage unit 207 is mainly used to store and query all data and to back up and recover data. The data transmission unit 205 is mainly used to implement data reporting, transmission, and consumption pulling in the present application, and serves as the data bus of the whole device. See in particular the description of the data storage unit above.
The data calculation unit 206 is configured to consume data from the data transmission unit 205 in real time (i.e., to perform data consumption processing) and to implement data synchronization and rule issuing with the management unit 204. After filtering and judgment calculation according to the preset abnormality detection rule stored by the management unit 204, the data calculation unit 206 generates an abnormal event if the preset abnormality detection rule is satisfied, and generates a recovery event if the opposite condition of the rule is met; the data calculation unit 206 sends both abnormal events and recovery events to the event alarm unit 209 for processing (i.e., event transmission). In addition, the data calculation unit 206 may generate new index time-sequence data through secondary calculation according to the preset index detection rule, or such index data may be reported directly to the data transmission unit 205 and then undergo abnormality detection and judgment to generate an abnormal event or a recovery event. See for details the description in step S16 above. The data calculation unit 206 may also store data through the data storage unit 207.
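Purely as an illustration of the filtering and judgment calculation described above, a minimal sketch of generating abnormal and recovery events from a rule might look like this (the rule fields, threshold, and sample values are assumptions):

```python
# Illustrative sketch: a metric sample either raises an abnormal event or,
# when the opposite condition holds, a recovery event.
def evaluate(sample, rule):
    """sample: {"metric", "value", "node"}; rule: {"metric", "op", "threshold"}."""
    if sample["metric"] != rule["metric"]:
        return None
    if rule["op"] == ">":
        exceeded = sample["value"] > rule["threshold"]
    else:
        exceeded = sample["value"] < rule["threshold"]
    kind = "abnormal_event" if exceeded else "recovery_event"
    return {"type": kind, "metric": sample["metric"],
            "node": sample["node"], "value": sample["value"]}

rule = {"metric": "cpu_usage", "op": ">", "threshold": 80}
print(evaluate({"metric": "cpu_usage", "value": 93, "node": "192.168.1.100"}, rule))
print(evaluate({"metric": "cpu_usage", "value": 41, "node": "192.168.1.100"}, rule))
```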
The data display unit 208 is mainly used to intuitively display data such as time-sequence index data, abnormal events, and fault self-healing schemes, for example in text or graphic form, so that a user can conveniently browse and query. The data display unit 208 can pull data from the data calculation unit 206 and the data storage unit 207, and can perform alarm queries through the event alarm unit 209.
The plan learning unit 213 is mainly used to perform machine learning processing on data such as abnormal events, source index time-sequence data, and historical fault data according to a machine learning algorithm, to recommend the most similar plan according to the algorithm, and to send it to the plan recommending and managing unit 214 (i.e., plan recommending), where related personnel complete sorting, association, activation, and saving. In this way, plans recommended by machine learning can be added to the plan handling unit 211 (i.e., synchronizing plan rules). See for details the description in step S18 above.
It can be understood that the plan handling unit 211 is only a user of plans: it is the subject that executes one or more plans after an abnormal event occurs, while the data storage unit 207 only stores data and has no management function. The plan recommendation and management unit 214 is the management subject of plans; a plan is derived from the experience-based configuration of an operation and maintenance expert, or is a plan recommended by machine learning, which theoretically complements and refines the experience of the operation and maintenance expert.
The event alarm unit 209 is mainly configured to receive abnormal events, perform filtering, convergence, aggregation and other processing on them, and then invoke the corresponding channel to complete push notification of the alarm event according to the level, channel type, and target receiver of the abnormal event. For example, in fig. 2 the event alarm unit 209 sends an aggregate event to the plan learning unit 213, where an aggregate event means that the event alarm unit 209 aggregates similar abnormal events. For instance, there may be an abnormal event in which the CPU usage of node 192.168.1.100 exceeds 80% every minute for 10 consecutive minutes; in that case the event alarm unit 209 does not send an event to the plan learning unit 213 every minute, and in a possible embodiment it may aggregate the 10 abnormal events into one abnormal event and send that to the plan learning unit 213 for processing, because the plan learning unit 213 does not need the details of every event and the events also need dimension reduction.
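As a purely illustrative sketch of the aggregation described above (the grouping key and field names are assumptions, not a prescribed format):

```python
# Illustrative sketch: repeated identical abnormal events (e.g. the same CPU
# alarm every minute for 10 minutes) are aggregated into one aggregate event.
from collections import Counter

def aggregate(events):
    """Group events that share node and metric, keep a count and time span."""
    counter = Counter((e["node"], e["metric"]) for e in events)
    out = []
    for (node, metric), count in counter.items():
        times = [e["time"] for e in events
                 if e["node"] == node and e["metric"] == metric]
        out.append({"node": node, "metric": metric, "count": count,
                    "first": min(times), "last": max(times)})
    return out

raw = [{"node": "192.168.1.100", "metric": "cpu_usage", "time": t}
       for t in range(10)]                       # ten per-minute alarms
print(aggregate(raw))                            # one aggregate event
```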
In addition, the event alarm unit 209 may also support upgrade (escalation) processing of alarm events or fault events according to rules, so that on-duty personnel can acknowledge, handle, and forward abnormal events. See in particular the description of the event alarm unit above.
The alarm sending unit 210 mainly serves as the channel for pushing abnormal event notifications. It can obtain alarm notifications from the event alarm unit 209 and the plan handling unit 211, and can be divided into channels of different types, such as WeChat, SMS, telephone, and self-developed IM channels, according to the target receiving type. Each channel subscribes to the push state through the push channel API and processes resend notifications according to the returned state. See in particular the description of the alarm sending unit above.
The alarm receiving body 212 mainly refers to the devices that need to receive abnormal event notifications, such as mobile phones and computers. Abnormal events are pushed to these devices through various channels, so that the corresponding personnel are notified and can handle the pushed abnormal events. As shown in fig. 2, the alarm sending unit 210 may send alarms to the alarm receiving body 212.
The plan handling unit 211 is mainly used to retrieve the rules of existing plans according to the alarm event. If a rule is hit, the corresponding plan needs to be triggered for processing, the relevant personnel are notified according to the handling result, and the handling steps and results are saved for subsequent reuse and analysis. See for details the description in step S18 above. As shown in fig. 2, the event alarm unit 209 performs a plan call through the plan handling unit 211; in practice this can be understood as the event alarm unit 209 completing the triggering of an abnormal event, and when the abnormal event is sent to the plan handling unit 211, the plan handling unit 211 hits the plan configuration according to the mapping relationship (such as the abnormal event configuration data or structure body described above), and further triggers the execution action of the plan task.
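For illustration only, the mapping-and-trigger behaviour described above (abnormal event to plan configuration to plan task execution) might be sketched as follows; the keys, plan identifier, and plan steps are assumptions rather than a prescribed configuration format:

```python
# Illustrative sketch of a plan call: look up the abnormal event in the plan
# configuration mapping and, on a hit, trigger the plan's task steps.
PLAN_CONFIG = {
    ("cpu_usage", "high"): {
        "plan_id": "restart-service",
        "steps": ["drain traffic", "restart instance", "verify health"],
    },
}

def handle_event(event, execute_step):
    key = (event["metric"], event["level"])
    plan = PLAN_CONFIG.get(key)
    if plan is None:
        return {"hit": False}
    results = [execute_step(step) for step in plan["steps"]]   # execute plan task
    return {"hit": True, "plan_id": plan["plan_id"], "results": results}

event = {"metric": "cpu_usage", "level": "high", "node": "192.168.1.100"}
print(handle_event(event, execute_step=lambda s: f"{s}: ok"))
```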
Referring to fig. 3, fig. 3 is a schematic flow chart of service registration and service discovery according to an embodiment of the present application, which may include the following steps:
Step S31, when the service is started, the micro service target unit registers service information with the configuration center unit, and the specific reference is made to the description in the step S11.
Step S32, when the instance changes, the service information corresponding to the instance is updated to the configuration center unit, and the description in the step S11 can be seen.
Step S33, each instance and the configuration center unit communicate (such as heartbeat) by using a certain mechanism, if the configuration center and a certain instance cannot communicate for a long time, the instance is logged off;
The configuration center unit provides functions of service registration, service discovery, service health check, service information storage and the like, and eliminates the instance node if the instance health check fails, and the description in the step S11 can be seen specifically.
Step S34, the service consumer updates the service information of each instance by actively pulling it (or by subscribing to push notifications from the configuration center), where the service consumer may be understood as the management unit; see the description in step S11.
Step S35, the service consumer calls the service provider using the load balancing rule registered by the service;
In the embodiment of the present application, a service consumer refers to the party that consumes data provided by a service provider. For example, service A calls service B: in a micro-service architecture this is typically done through a service registry; service B registers with the service registry, service A pulls the registration information of service B from the registry, and finally service A can call service B. Here service B is the service provider and service A is the service consumer. The specific service consumer and service provider depend on the actual scenario.
Step S36, the management unit subscribes to service information from the configuration center unit, and parses information such as the monitoring acquisition address and port and the monitoring acquisition rule from the service information. See for details the descriptions in step S11 and step S12 above.
In the technical scheme provided by the embodiment of the present application, by registering the relevant acquisition configuration information during service registration, micro service registration and service discovery achieve the function of automatically issuing monitoring acquisition.
Specifically, when the micro service registers with the configuration center unit, it registers the monitoring index acquisition address and port, the monitoring index types, and other information that the service instance node can expose. After registration is completed, the configuration center unit synchronizes the service information of each instance to the management unit. After the management unit obtains the monitoring acquisition address, port, monitoring index dimension types and other information corresponding to a service instance, it generates the corresponding monitoring acquisition task according to the given monitoring acquisition rule; the monitoring acquisition task can be a single acquisition task or a task with a fixed acquisition frequency. After the monitoring acquisition task is generated, it can be delivered to the monitoring acquisition unit through the acquisition task channel, and the monitoring acquisition unit then initiates the task.
Before the monitoring acquisition unit initiates an acquisition task, configuration of the acquisition task can be completed in the monitoring acquisition unit. The configuration information may include the object address and port to be acquired, the acquisition mode, acquisition frequency, number of acquisitions, types of acquired index data, the acquisition protocol mode (such as HTTP JSON or Telnet JSON), the reporting channel mode for acquired data (Kafka or HTTP API JSON), the reporting channel address, and so on.
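By way of illustration only, such an acquisition-task configuration might be represented as follows; the field names, defaults, and example values are assumptions and not a prescribed format:

```python
# Illustrative sketch of an acquisition-task configuration carrying the fields
# listed above.
from dataclasses import dataclass

@dataclass
class CollectTaskConfig:
    target_addr: str          # object address to be acquired
    target_port: int
    collect_mode: str         # e.g. "pull"
    frequency_s: int          # acquisition frequency in seconds
    max_collections: int      # number of acquisitions (0 = unlimited)
    metric_types: tuple       # acquired index data types
    protocol: str             # e.g. "http-json" or "telnet-json"
    report_channel: str       # e.g. "kafka" or "http-api-json"
    report_addr: str          # reporting channel address

task = CollectTaskConfig("10.0.0.1", 9100, "pull", 60, 0,
                         ("cpu", "memory", "disk"), "http-json",
                         "kafka", "kafka://10.0.0.9:9092/monitor-metrics")
print(task)
```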
After the monitoring acquisition unit initiates an acquisition task, the corresponding monitoring index data can be reported to the data transmission channel, the data transmission channel transmits the acquired and reported monitoring index data to the back-end real-time computing unit in real time, and the computing unit completes corresponding computing processing.
Referring to fig. 4, fig. 4 is a schematic flow chart of issuing a monitoring acquisition task and acquiring monitoring index data according to an embodiment of the present application, which may include the following steps:
Step S41, the management unit synchronizes the service information from the configuration center unit, and after obtaining the monitoring address registered by the service, generates a monitoring acquisition task according to the monitoring task generation rule configured in the management unit; see the descriptions in step S11 and step S12 above.
Step S42, the management unit issues the generated monitoring acquisition task to the monitoring acquisition unit, and the monitoring acquisition unit initiates acquisition of the monitoring index data according to the task rule; see the description in step S14 above.
Step S43, the monitoring acquisition unit reports the acquired monitoring index data to the data transmission unit, and the specific reference can be seen from the description in the step S15.
Step S44, the monitoring acquisition unit reports the task state, the acquisition index number, the state data of the monitoring acquisition unit and the like to the management unit at regular time so as to manage the task, and the specific reference can be seen from the description in the step S13.
Step S45, the management unit performs node switching for tasks distributed on monitoring acquisition units whose acquisition state is abnormal, ensuring reliable operation of the acquisition tasks, and migrates tasks away from acquisition nodes with abnormal states in real time. See for details the description in step S13 above.
It is understood that machine learning plays an important role in anomaly detection, and with the development of industry and the internet, large-scale data integration is the basis for intelligent system operation. The data often contains various abnormal and fault related signals and data, such as server abnormality, communication interruption of network equipment, database access failure, disk writing failure, disk usage 100%, CPU usage 100% and other abnormal information. The abnormal information can be detected and predicted through a machine learning technology, so that abnormal events or faults possibly occurring in the system can be discovered in advance.
An anomaly may be defined as an irregular point or region in the data. Specifically, anomalies can be divided into three types. An outlier is a single value in the data that clearly falls outside the conventional range; for example, in demographic data, if the population count of one city significantly exceeds that of other cities, it can be considered an outlier. An interval anomaly is a continuous segment of values in the data that clearly falls outside the conventional range; for example, if the temperature over a period of time clearly exceeds the historical maximum, that segment of temperature data can be regarded as an interval anomaly. A mixed anomaly means that the data contains both outliers and interval anomalies; for example, in production data, if there are abnormalities both in individual products and in overall output over a period of time, this can be considered mixed anomaly data.
Anomaly detection methods can be classified into statistical methods, machine learning methods, and deep learning methods. Statistical methods detect anomalies based on the statistical characteristics of the data, for example using the mean, median, variance, or standard deviation. Machine learning methods detect anomalies with machine learning algorithms such as decision trees, random forests, or support vector machines. Deep learning methods detect anomalies with deep learning algorithms such as autoencoders or recurrent neural networks.
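As a purely illustrative sketch of the statistical approach (the z-score threshold and the sample series are assumptions):

```python
# Illustrative sketch: flag values whose z-score (distance from the mean in
# standard deviations) exceeds a threshold.
from statistics import mean, stdev

def zscore_outliers(values, threshold=2.0):
    m, s = mean(values), stdev(values)
    if s == 0:
        return []
    return [v for v in values if abs(v - m) / s > threshold]

series = [42, 41, 43, 40, 44, 42, 41, 180]      # one clearly abnormal point
print(zscore_outliers(series))                   # -> [180]
```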
Completing anomaly detection through machine learning means finding, in a large data set, the data points that do not conform to the normal pattern; such points may represent abnormal behavior in the system. Anomaly detection can be accomplished by supervised or unsupervised learning. Supervised methods require a large amount of training data labeled as abnormal or normal; a classifier is then trained with these labels and used to identify abnormal samples. Unsupervised learning does not need labeled data: it finds abnormal samples by applying dimension reduction to the data and then clustering or density estimation.
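For illustration of the unsupervised route only, the following sketch assumes scikit-learn is available, which the embodiment does not mandate; the feature matrix and contamination setting are likewise assumptions:

```python
# Illustrative sketch of unsupervised anomaly detection on metric samples.
from sklearn.ensemble import IsolationForest

# Each row is one sample of (cpu_usage, memory_usage); the last row is unusual.
X = [[40, 55], [42, 57], [41, 54], [43, 56], [39, 53], [98, 97]]
labels = IsolationForest(contamination=0.2, random_state=0).fit_predict(X)
anomalies = [x for x, label in zip(X, labels) if label == -1]   # -1 means anomaly
print(anomalies)
```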
Referring to fig. 5, fig. 5 is a schematic flow chart of an anomaly detection and anomaly event sending process according to an embodiment of the present application, which may include the following steps:
Step S51, the data calculation unit synchronizes service information and monitoring alarm detection rules from the configuration center and the storage service at regular time;
In the embodiment of the application, the storage service can be understood as a data storage unit, and the alarm detection rule can be understood as a preset abnormality detection rule.
Step S52, the data calculation unit consumes the monitoring index data from the data transmission unit in real time;
Step S53, the data calculation unit detects and filters the index data consumed from the data transmission channel according to the alarm rule, and index data meeting the condition generates an abnormal event; see the description in step S16 above.
Step S54, the data calculation unit filters index data according to the alarm detection rule, and if the opposite condition of the rule is met, a recovery event corresponding to the abnormal event is generated;
And step S55, the generated abnormal event and recovery event are sent to the event alarm unit and the alarm sending unit through interfaces, and alarm sending is completed.
Referring to fig. 6, fig. 6 is a schematic flow chart of an alarm event triggering scheme according to an embodiment of the present application, which may include the following steps:
Step S61, the plan handling unit synchronizes the plan configuration from the management unit at regular intervals;
Step S62, the plan handling unit consumes, in real time, the abnormal events transmitted in the data transmission unit;
Step S63, triggering the execution of the plan when the plan associated with the abnormal event hits;
Step S64, after the plan is executed successfully, the result of each step of the plan execution is recorded, and the relevant contact person is notified of the execution result;
Step S65, after the plan execution fails, the result of each step of the plan execution is recorded, and the relevant contact person is notified of the execution result;
Step S66, after the plan execution fails, a person needs to be notified to take over; the fault is closed through manual handling, and a work-order retrospective review is generated;
Step S67, after the plan execution fails, manual intervention handling is performed, and the reasons for the failure and the explanation of the handling steps are filled in;
Step S68, the retrospective review is completed, and the details obtained from the review are used to optimize the plan handling and improve the success rate of the plan; a minimal illustrative sketch of the execute-record-notify handling in steps S63 to S67 is given after these steps.
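The sketch below assumes hypothetical step functions and a notification hook; it is not part of the claimed method:

```python
# Illustrative sketch of plan execution with per-step result recording,
# success/failure notification, and hand-over to a person on failure.
def run_plan(plan_steps, notify):
    record = []
    for name, action in plan_steps:
        try:
            record.append({"step": name, "result": action(), "ok": True})
        except Exception as exc:                      # execution failure
            record.append({"step": name, "result": str(exc), "ok": False})
            notify(f"plan step '{name}' failed, manual takeover required")
            return {"success": False, "record": record}
    notify("plan executed successfully")
    return {"success": True, "record": record}

def restart_instance():
    raise RuntimeError("restart timed out")           # simulate a failed step

steps = [("drain traffic", lambda: "done"), ("restart instance", restart_instance)]
print(run_plan(steps, notify=print))
```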
Referring to fig. 7, fig. 7 is a schematic flow chart of a process for learning and recommending configuration of a plan according to an embodiment of the present application, which may include the following steps:
Step S71, the plan learning unit acquires index data, full-link tracing data, service logs, service dependency graphs, historical fault data, and other related data from the data transmission channel;
Step S72, the plan learning unit analyzes and classifies the data, and extracts relevant features according to a machine learning algorithm;
Step S73, training and testing are performed according to the relevant features and the configured model;
Step S74, the model that passes the test is configured online, and fault prediction detection is performed on real-time data through online real-time learning;
Step S75, the recommended predicted faults and the corresponding handling plans are pushed to the operation and maintenance plan operators; after the operators screen and filter them, they are configured into the plan handling service and activated;
Step S76, the handling plan is triggered by an abnormal alarm event, and after the corresponding fault is handled successfully, the relevant personnel or on-duty personnel grade or label the plan;
Step S77, the plan learning unit performs optimization learning on the labeled plans: plans labeled as successful are classified accordingly, and plans whose execution failed are classified as well;
Step S78, the plan is adjusted and optimized through the operation and maintenance OnCall weekly on-duty meeting and retrospective review.
In the embodiment of the present application, fault prediction refers to predicting faults that may occur in the future from historical data. Fault prediction can help an enterprise take corresponding measures before a fault occurs and repair or replace equipment in advance, thereby reducing downtime and cost. In a fault prediction business scenario, machine learning can effectively use both supervised and unsupervised learning.
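Purely as an illustration of how supervised fault prediction from labelled historical data might look, the sketch below assumes scikit-learn is available and uses invented features and labels; it is not part of the claimed method:

```python
# Illustrative sketch of supervised fault prediction from historical metrics.
from sklearn.linear_model import LogisticRegression

# Features: (cpu_usage, error_rate); label 1 = a fault followed within an hour.
history_X = [[30, 0.1], [35, 0.2], [40, 0.1], [85, 4.0], [90, 6.5], [88, 5.1]]
history_y = [0, 0, 0, 1, 1, 1]

model = LogisticRegression().fit(history_X, history_y)
print(model.predict([[87, 5.0], [33, 0.2]]))   # expected: fault, then no fault
```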
In the technical scheme provided by the embodiment of the present application, after related plans are recommended through machine learning and then configured and activated manually, when an abnormal event occurs, the plan association and handling system detects and judges the correlation between the event and the plans; if the rules and conditions are met, the corresponding emergency plan is triggered for handling. This shortens the handling time before a fault occurs, allows fault handling to intervene in advance, and achieves the goal of fault self-healing.
In addition, the embodiment of the present application can further optimize the plan handling process: the result of plan handling is labeled and judged manually, and unsuitable or improper processes and steps in the handling process are then optimized manually, so as to achieve further refined optimization of the plan and to make up for aspects in which machine learning may be less considerate than a human. In this way, the device combines machine learning with manual operation, thereby achieving an associated scheme of monitoring abnormality judgment, plan recommendation and configuration, finer-grained fault handling, and accurate plan handling; through multiple steps such as retrospective learning and optimization, a self-closed loop is reached from monitoring to anomaly detection, and from fault-linked plans to plan learning and operation.
The embodiment of the present application also provides a fault alarm self-healing method, which is applied to any of the above fault alarm self-healing systems. The fault alarm self-healing system includes a micro-service target unit, a management unit, a monitoring acquisition unit, a data calculation unit, and a plan handling unit. Referring to fig. 8, the method includes the following steps:
Step S81, a management unit acquires metadata corresponding to each micro service instance from a micro service target unit, determines a monitoring acquisition task to be added and a monitoring acquisition task to be deleted according to monitoring information carried by each metadata, and sends the monitoring acquisition task to be added and the monitoring acquisition task to be deleted to a monitoring acquisition unit;
Step S82, a monitoring acquisition unit receives a monitoring acquisition task to be added and a monitoring acquisition task to be deleted, which are sent by a management unit, acquires monitoring index data according to the monitoring acquisition task to be added, and sends the monitoring index data to a data calculation unit, and deletes the corresponding monitoring acquisition task according to the monitoring acquisition task to be deleted, wherein the monitoring acquisition task to be added carries identification, name, task type and task operation configuration content;
Step S83, the data calculation unit receives the monitoring index data sent by the monitoring acquisition unit, detects the monitoring index data in real time according to a preset index detection rule to obtain a first detection result, generates time-sequence index data according to the first detection result, detects the monitoring index data and the time-sequence index data in real time according to a preset alarm detection rule to obtain a second detection result, generates an abnormal event according to the second detection result, and sends the abnormal event to the plan handling unit;
Step S84, the plan handling unit receives the abnormal events sent by the data calculation unit, retrieves the fault self-healing scheme matching each abnormal event, and performs fault self-healing processing on the abnormal events according to the fault self-healing scheme.
In the technical scheme provided by the embodiment of the present application, the management unit acquires the metadata corresponding to each micro service instance from the micro-service target unit and determines, according to the monitoring information carried by each piece of metadata, the monitoring acquisition tasks to be added and the monitoring acquisition tasks to be deleted. The management unit then sends the generated tasks to the monitoring acquisition unit, and the monitoring acquisition unit adds the corresponding acquisition tasks based on the tasks to be added and stops the corresponding acquisition tasks according to the tasks to be deleted, thereby realizing deployment and adjustment of the monitoring tasks and obtaining the monitoring index data acquired by them.
The monitoring acquisition unit can send monitoring index data to the data calculation unit, the data calculation unit monitors the monitoring index data in real time according to a preset index detection rule to obtain a first detection result, generates time sequence index data according to the first detection result, then detects the monitoring index data and the time sequence index data in real time according to a preset abnormality detection rule to obtain a second detection result, generates an abnormal event according to the second detection result, and therefore determines the abnormal event existing in the operation process of each micro service instance.
For each abnormal event, the data calculation unit can send each abnormal event to the plan treatment unit, and after each abnormal event is received by the plan treatment unit, the fault self-healing scheme matched with each abnormal event can be searched, and the corresponding abnormal event is processed according to the fault self-healing scheme, so that the abnormal event existing in the running process of each micro service instance can be automatically solved by the system, and the fault self-healing of the micro service is realized.
In some embodiments, the fault-warning self-healing system further comprises a data storage unit, a plan learning unit, a plan recommending and managing unit, and the fault self-healing scheme is obtained by the following steps:
The plan learning unit performs machine learning processing according to the monitored index data and the time sequence index data to obtain a treatment action about the abnormal event;
the plan recommending and managing unit receives the treatment action sent by the plan learning unit, optimizes the treatment action to obtain an optimized treatment action, and sends the optimized treatment action to the data storage unit;
The data storage unit receives the optimized treatment action, and associates the optimized treatment action with the abnormal event to obtain abnormal event configuration data;
The plan handling unit obtains the abnormal event configuration data from the data storage unit, performs structuring processing on the abnormal event configuration data to obtain a structure body, and determines the fault self-healing scheme matching each abnormal event according to the structure body.
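As an illustration of the kind of structure body that might result from associating optimized handling actions with abnormal events (the field names and matching key are assumptions, not a prescribed format):

```python
# Illustrative sketch: turn stored abnormal-event configuration records into a
# keyed structure used for matching abnormal events to self-healing schemes.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class EventPlanEntry:
    event_key: str              # e.g. "<metric>:<level>"
    plan_id: str
    actions: List[str]          # the optimized handling actions

def build_structure(config_rows) -> Dict[str, EventPlanEntry]:
    """config_rows: records read from the data storage unit."""
    return {row["event_key"]: EventPlanEntry(row["event_key"], row["plan_id"],
                                             row["actions"])
            for row in config_rows}

rows = [{"event_key": "cpu_usage:high", "plan_id": "restart-service",
         "actions": ["drain traffic", "restart instance", "verify health"]}]
structure = build_structure(rows)
print(structure["cpu_usage:high"].plan_id)     # matching by event key
```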
In some embodiments, the fault alarm self-healing system further includes an event alarm unit, and sending the abnormal event to the plan handling unit includes:
the data calculation unit sends the abnormal event to the event alarm unit;
The event alarm unit receives the abnormal event sent by the data calculation unit, classifies the abnormal event to obtain a target abnormal event, and sends the target abnormal event to the plan handling unit;
wherein classifying the abnormal event to obtain the target abnormal event includes:
For the abnormal event, if a plurality of identical abnormal events are received in the first preset time, one abnormal event in the identical abnormal events is taken as a target abnormal event, and/or,
Aiming at the abnormal event, if a plurality of abnormal events with the same attribute or rule exist in the second preset time, carrying out merging processing according to a preset merging rule to obtain a target abnormal event, and/or,
Aiming at the abnormal event, if a plurality of abnormal events with similar characteristics or service attributes exist in a third preset time, merging and converging the abnormal events with similar characteristics or service attributes to obtain a target abnormal event.
In some embodiments, for an abnormal event, if a recovery event of the abnormal event is not received within a fourth preset time, the event alarm unit takes the abnormal event as an upgrade abnormal event.
In some embodiments, the self-healing system for fault alarm further comprises an alarm transmitting unit and an alarm receiving body,
The alarm sending unit obtains the target abnormal event from the event alarm unit and sends the target abnormal event to the alarm receiving main body according to an alarm rule or a subscription rule;
The alarm receiving body receives the target abnormal event sent by the alarm sending unit.
In some embodiments, the data calculation unit detects the monitoring acquisition index data and the time sequence index data in real time according to a preset abnormal detection rule to obtain a third detection result, and generates a recovery event according to the third detection result;
the event alarm unit receives the recovery event sent by the data calculation unit.
In some embodiments, the fault alarm self-healing system further includes a configuration center unit, and the obtaining metadata corresponding to each micro service instance from the micro service target unit includes:
The micro service target unit registers service information of each micro service instance to the configuration center unit, wherein the service information comprises a service name, a monitoring acquisition address, a monitoring acquisition port and metadata of the instance;
The configuration center unit generates a service instance list according to the service information;
the management unit acquires a service instance list from the configuration center unit, and reads metadata corresponding to each micro service instance from the service instance list.
In some embodiments, a communication mechanism exists between each instance of the micro-service and the configuration central unit,
If the target instance which is not communicated with the configuration center unit in the fifth preset time exists, the configuration center unit deletes the target instance from the service instance list.
In some embodiments, the self-healing system further comprises a data storage unit, a data display unit, and a data transmission unit,
The data storage unit acquires and stores system data and sends the system data to the data display unit, wherein the system data comprises metadata, monitoring index data, time sequence index data, abnormal events, recovery events, preset index detection rules, preset abnormal detection rules, subscription rules and a fault self-healing scheme;
the data display unit receives the system data sent by the data storage unit and displays the system data;
the sending the monitoring index data to the data calculating unit comprises the following steps:
The monitoring acquisition unit sends monitoring index data to the data transmission unit;
The data transmission unit transmits the monitoring index data to the data calculation unit.
The embodiment of the application also provides an electronic device, as shown in fig. 9, including:
A memory 91 for storing a computer program;
The processor 92 is configured to implement any of the above-described fault alarm self-healing systems when executing the program stored in the memory 91.
And the electronic device may further comprise a communication bus and/or a communication interface, through which the processor 92, the communication interface, and the memory 91 communicate with each other.
The communication bus mentioned for the above electronic device may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc.; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In yet another embodiment of the present application, a computer readable storage medium is provided, in which a computer program is stored, the computer program implementing any one of the above-mentioned fault alert self-healing systems when executed by a processor.
In yet another embodiment of the present application, there is also provided a computer program product containing instructions that, when run on a computer, cause the computer to implement the fault alarm self-healing system described in any one of the above embodiments.
In the above embodiments, implementation may be in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example by wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a Solid State Disk (SSD), etc.
It is noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another and do not necessarily require or imply any such actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
In this specification, each embodiment is described in a related manner; identical and similar parts of the embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, since the method embodiments are substantially similar to the system embodiments, their description is relatively simple, and for relevant parts reference may be made to the partial description of the system embodiments.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application are included in the protection scope of the present application.

Claims (10)

1. A fault alarm self-healing system, characterized in that the system includes a micro-service target unit, a management unit, a monitoring acquisition unit, a data calculation unit, and a plan handling unit, wherein
the management unit is configured to obtain metadata corresponding to the instances of each micro service from the micro-service target unit, determine the monitoring acquisition tasks to be added and the monitoring acquisition tasks to be deleted according to the monitoring information carried by each piece of metadata, and send the monitoring acquisition tasks to be added and the monitoring acquisition tasks to be deleted to the monitoring acquisition unit;
the monitoring acquisition unit is configured to receive the monitoring acquisition tasks to be added and the monitoring acquisition tasks to be deleted sent by the management unit, acquire monitoring index data according to the monitoring acquisition tasks to be added, send the monitoring index data to the data calculation unit, and delete the corresponding monitoring acquisition tasks according to the monitoring acquisition tasks to be deleted, wherein the monitoring acquisition tasks to be added carry an identifier, a name, a task type, and task running configuration content;
the data calculation unit is configured to receive the monitoring index data sent by the monitoring acquisition unit, detect the monitoring index data in real time according to a preset index detection rule to obtain a first detection result, generate time-sequence index data according to the first detection result, detect the monitoring index data and the time-sequence index data in real time according to a preset abnormality detection rule to obtain a second detection result, generate an abnormal event according to the second detection result, and send the abnormal event to the plan handling unit;
the plan handling unit is configured to receive the abnormal events sent by the data calculation unit, retrieve the fault self-healing scheme matching each abnormal event, and perform fault self-healing processing on the abnormal events according to the fault self-healing scheme.
2. The system according to claim 1, characterized in that the system further comprises a data storage unit, a plan learning unit, and a plan recommendation and management unit, and the fault self-healing scheme is obtained as follows:
the plan learning unit is configured to perform machine learning processing according to the monitoring index data and the time-sequence index data to obtain a handling action for the abnormal event, and send the handling action to the plan recommendation and management unit;
the plan recommendation and management unit is configured to receive the handling action sent by the plan learning unit, optimize the handling action to obtain an optimized handling action, and send the optimized handling action to the data storage unit;
the data storage unit is configured to receive the optimized handling action and associate the optimized handling action with the abnormal event to obtain abnormal event configuration data;
the plan handling unit is configured to obtain the abnormal event configuration data from the data storage unit, perform structuring processing on the abnormal event configuration data to obtain a structure body, and determine the fault self-healing scheme matching each abnormal event according to the structure body.
3. The system according to claim 1, characterized in that the system further comprises an event alarm unit, and sending the abnormal event to the plan handling unit comprises:
the data calculation unit sends the abnormal event to the event alarm unit;
the event alarm unit receives the abnormal event sent by the data calculation unit, classifies the abnormal event to obtain a target abnormal event, and sends the target abnormal event to the plan handling unit;
wherein classifying the abnormal event to obtain the target abnormal event comprises:
for the abnormal event, if a plurality of identical abnormal events are received within a first preset time, taking one of the plurality of identical abnormal events as the target abnormal event; and/or,
for the abnormal event, if a plurality of abnormal events with the same attributes or rules exist within a second preset time, merging them according to a preset merging rule to obtain the target abnormal event; and/or,
for the abnormal event, if a plurality of abnormal events with similar characteristics or service attributes exist within a third preset time, merging and converging the abnormal events with similar characteristics or service attributes to obtain the target abnormal event.
4. The system according to claim 3, characterized in that the event alarm unit is further configured to, for the abnormal event, treat the abnormal event as an upgraded abnormal event if no recovery event of the abnormal event is received within a fourth preset time.
5. The system according to claim 3, characterized in that the system further comprises an alarm sending unit and an alarm receiving body, wherein
the alarm sending unit is configured to obtain the target abnormal event from the event alarm unit and send the target abnormal event to the alarm receiving body according to an alarm rule or a subscription rule;
the alarm receiving body is configured to receive the target abnormal event sent by the alarm sending unit.
6. The system according to claim 4, characterized in that
the data calculation unit is further configured to detect the monitoring acquisition index data and the time-sequence index data in real time according to the preset abnormality detection rule to obtain a third detection result, generate a recovery event according to the third detection result, and send the recovery event to the event alarm unit;
the event alarm unit is further configured to receive the recovery event sent by the data calculation unit.
7. The system according to claim 1, characterized in that the system further comprises a configuration center unit, and obtaining the metadata corresponding to the instances of each micro service from the micro-service target unit comprises:
the micro-service target unit registers the service information of each instance of the micro service with the configuration center unit, where the service information includes a service name, a monitoring acquisition address, a monitoring acquisition port, and the metadata of the instance;
the configuration center unit generates a service instance list according to the service information;
the management unit obtains the service instance list from the configuration center unit and reads the metadata corresponding to the instances of each micro service from the service instance list.
8. The system according to claim 7, characterized in that a communication mechanism exists between the instances of each micro service and the configuration center unit, and
the configuration center unit is further configured to delete a target instance from the service instance list if the target instance has not communicated with the configuration center unit within a fifth preset time.
9. The system according to claim 1, characterized in that the system further comprises a data storage unit, a data display unit, and a data transmission unit, wherein
the data storage unit is configured to acquire and store system data and send the system data to the data display unit, where the system data includes the metadata, the monitoring index data, the time-sequence index data, abnormal events, recovery events, the preset index detection rule, the preset abnormality detection rule, subscription rules, and the fault self-healing scheme;
the data display unit is configured to receive the system data sent by the data storage unit and display the system data;
and sending the monitoring index data to the data calculation unit comprises:
the monitoring acquisition unit sends the monitoring index data to the data transmission unit;
the data transmission unit sends the monitoring index data to the data calculation unit.
10. A fault alarm self-healing method, characterized in that the method is applied to a fault alarm self-healing system, the fault alarm self-healing system includes a micro-service target unit, a management unit, a monitoring acquisition unit, a data calculation unit, and a plan handling unit, and the method comprises:
the management unit obtains metadata corresponding to the instances of each micro service from the micro-service target unit, determines the monitoring acquisition tasks to be added and the monitoring acquisition tasks to be deleted according to the monitoring information carried by each piece of metadata, and sends the monitoring acquisition tasks to be added and the monitoring acquisition tasks to be deleted to the monitoring acquisition unit;
the monitoring acquisition unit receives the monitoring acquisition tasks to be added and the monitoring acquisition tasks to be deleted sent by the management unit, acquires monitoring index data according to the monitoring acquisition tasks to be added, sends the monitoring index data to the data calculation unit, and deletes the corresponding monitoring acquisition tasks according to the monitoring acquisition tasks to be deleted, wherein the monitoring acquisition tasks to be added carry an identifier, a name, a task type, and task running configuration content;
the data calculation unit receives the monitoring index data sent by the monitoring acquisition unit, detects the monitoring index data in real time according to a preset index detection rule to obtain a first detection result, generates time-sequence index data according to the first detection result, detects the monitoring index data and the time-sequence index data in real time according to a preset alarm detection rule to obtain a second detection result, generates an abnormal event according to the second detection result, and sends the abnormal event to the plan handling unit;
the plan handling unit receives the abnormal events sent by the data calculation unit, retrieves the fault self-healing scheme matching each abnormal event, and performs fault self-healing processing on the abnormal events according to the fault self-healing scheme.
CN202411337066.XA 2024-09-24 2024-09-24 Fault alarm self-healing system and method Pending CN119292810A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411337066.XA CN119292810A (en) 2024-09-24 2024-09-24 Fault alarm self-healing system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411337066.XA CN119292810A (en) 2024-09-24 2024-09-24 Fault alarm self-healing system and method

Publications (1)

Publication Number Publication Date
CN119292810A true CN119292810A (en) 2025-01-10

Family

ID=94155489

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411337066.XA Pending CN119292810A (en) 2024-09-24 2024-09-24 Fault alarm self-healing system and method

Country Status (1)

Country Link
CN (1) CN119292810A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119718726A (en) * 2025-02-28 2025-03-28 中信证券股份有限公司 Microservice processing method, device, electronic device and computer-readable medium
CN119759667A (en) * 2025-03-10 2025-04-04 北京银行股份有限公司 Information system operation data snapshot storage and analysis method and device


Similar Documents

Publication Publication Date Title
US11934417B2 (en) Dynamically monitoring an information technology networked entity
US11843505B1 (en) System and method of generation of a predictive analytics model and performance of centralized analytics therewith
US12003572B1 (en) Two-way replication of search node configuration files using a mediator node
US10685283B2 (en) Demand classification based pipeline system for time-series data forecasting
US11640434B2 (en) Identifying resolutions based on recorded actions
Zeydan et al. Recent advances in data engineering for networking
US20250209164A1 (en) Detect anomalous container deployment at a container orchestration service
US8181069B2 (en) Method and system for problem determination using probe collections and problem classification for the technical support services
CN118916147A (en) Multi-source calculation force data integration and intelligent scheduling system and method
Firouzi et al. Architecting iot cloud
US11635752B2 (en) Detection and correction of robotic process automation failures
US10372572B1 (en) Prediction model testing framework
US11635953B2 (en) Proactive notifications for robotic process automation
US20210303532A1 (en) Streamlined transaction and dimension data collection
CN119292810A (en) Fault alarm self-healing system and method
US11734297B1 (en) Monitoring platform job integration in computer analytics system
US11223680B2 (en) Computer servers for datacenter management
US12273255B1 (en) Adaptive testing service that generates test cases from observed behaviors
CN113760677A (en) Abnormal link analysis method, device, equipment and storage medium
US20250307396A1 (en) Allow list of container images based on deployment configuration at a container orchestration service
US20240036962A1 (en) Product lifecycle management
CN118377768A (en) Data ETL method, device, equipment and medium based on service flow
CN111782672B (en) Multi-field data management method and related device
US20250111286A1 (en) Systems and methods for machine learning operations
CN118939380A (en) Cluster evaluation method, device, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination