CN120223501B

CN120223501B - Cloud monitoring service operation and maintenance dynamic optimization system and method based on AI intelligent agent

Info

Publication number: CN120223501B
Application number: CN202510698823.4A
Authority: CN
Inventors: 孟伟
Original assignee: Sichuan Zhixing Zhicheng Technology Co ltd
Current assignee: Sichuan Zhixing Zhicheng Technology Co ltd
Priority date: 2025-05-28
Filing date: 2025-05-28
Publication date: 2025-08-08
Anticipated expiration: 2045-05-28
Also published as: CN120223501A

Abstract

The present invention discloses a cloud monitoring service operation and maintenance dynamic optimization system and method based on an AI intelligent agent, relating to the technical field of cloud computing intelligent operation and maintenance. It addresses the problems of cross-layer data fragmentation, failure to associate hidden faults, and conflicts between operation and maintenance actions in hybrid clouds. Heterogeneous data of the physical layer, virtual layer and application layer are collected through containerized probes, and a standardized cross-layer index set is constructed through hierarchical labeling and dynamic time sequence alignment. A fault propagation map is constructed through incremental correlation analysis and dynamic time windows to identify nodes with space-time coupling characteristics. Causal relationships are verified through directional disturbance injection, and a coupling degree index is fitted from the association matrix of resource scheduling operations and fault propagation chains to generate root cause localization instructions. Action priorities are optimized based on the propagation cost gradient and an asymmetric game strategy, and the models and rules are updated through a closed feedback loop. The invention achieves cross-layer data fusion and precise fault localization, improving cloud service stability and resource utilization.

Description

Cloud monitoring service operation and maintenance dynamic optimization system and method based on AI intelligent agent

Technical Field

The invention relates to the technical field of cloud computing intelligent operation and maintenance, in particular to a cloud monitoring service operation and maintenance dynamic optimization system and method based on an AI intelligent agent.

Background

With the rapid development of information technology, cloud computing has become one of the core infrastructures for enterprise digital transformation. As hybrid cloud and multi-cloud environments become increasingly common, the service types carried by cloud platforms grow more complex, service dependency chains lengthen markedly, and system operation exhibits highly dynamic, highly concurrent and multi-level coupled behavior. To ensure the continuity and stability of key business systems, operation and maintenance management of cloud platforms is gradually evolving from static monitoring toward intelligent, automated and dynamic optimization. Potential fault risks in complex systems need to be actively identified and responded to by means such as AI intelligent agents, so as to raise the overall level of service quality assurance.

However, existing cloud monitoring schemes are driven by static rule bases or a single data source, and suffer from fragmented cross-layer indexes and insufficient ability to recognize hidden fault associations. For example, the causal link between physical layer resource fragmentation and application layer service performance offset is difficult to verify efficiently, resulting in high false alarm rates and delayed root cause localization. Resource scheduling strategies and fault repair actions lack coordination, so blind capacity expansion easily aggravates the risk of cascading faults. In addition, there is no dynamic balancing mechanism for the conflict between short-term emergency response and long-term optimization goals, so operation and maintenance strategies adapt poorly to actual scenarios and struggle to meet the elasticity requirements of hybrid cloud environments.

Disclosure of Invention

To address the defects of the prior art, the invention provides a cloud monitoring service operation and maintenance dynamic optimization system and method based on an AI intelligent agent, which solve the problems described in the Background.

The cloud monitoring service operation and maintenance dynamic optimization system based on the AI intelligent agent comprises a cross-layer index sensing module, a space-time correlation analysis module, a causal verification and coupling degree analysis module, a dynamic priority decision module, and a strategy execution and feedback module. The cross-layer index sensing module is used for deploying a containerized probe cluster, collecting in real time the physical layer hardware resource fragmentation indexes, virtual layer container life cycle events and application layer micro-service call chain performance offset data in a hybrid cloud environment, and outputting a standardized cross-layer index set through hierarchical labeling preprocessing and outlier filtering. The space-time correlation analysis module is connected to the cross-layer index sensing module and is used for eliminating the time sequence deviation between physical layer low frequency data and application layer high frequency data through an incremental time sequence alignment algorithm, constructing a fault propagation map based on a dynamic time window, and identifying hidden association nodes with space-time coupling characteristics among the cross-layer indexes. The causal verification and coupling degree analysis module is used for receiving the fault propagation map output by the space-time correlation analysis module, verifying the authenticity of cross-layer causal relationships through directional disturbance injection, fitting a coupling degree index according to the association matrix of resource scheduling operations and fault propagation chains, constructing a coupling degree analysis rule to quantify the influence weight of an operation on fault propagation, and generating root cause localization instructions for negative coupling scenes. The dynamic priority decision module is used for analyzing historical fault repair time and resource waste rate according to a fault propagation cost model to fit a propagation cost gradient, inputting the coupling degree index, the real-time service level agreement (SLA) violation rate and the fault propagation cost gradient into a dynamic weight function, calculating the cooperative priority of short-term suppression actions and long-term eradication actions, and generating the execution sequence of the two classes of actions through an asymmetric game strategy. The strategy execution and feedback module is used for calling the cloud platform interface to execute the decision actions atomically, collecting the index change data after execution, and dynamically updating the fault propagation cost model and the coupling degree analysis rule.

The cross-layer index sensing module specifically comprises: a containerized probe cluster deployed on hybrid cloud nodes in a micro-service architecture, in which physical layer probes collect hardware resource fragmentation indexes, virtual layer probes monitor container life cycle events and resource conflict behaviors, and application layer probes track micro-service call chain topology and performance offset; adjustment of the acquisition frequency according to the dynamic change rate of each index, with low-frequency triggered sampling at the physical layer and event-driven high-frequency tracking at the application layer, and transient noise suppressed through sliding window statistics; and a hierarchical labeling preprocessing unit, which attaches environment context labels such as cloud platform type and service dependency level to the raw data, filters outliers based on the isolation forest algorithm, and outputs the standardized cross-layer index set.
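
As an illustration of the sliding-window noise suppression and rate-dependent acquisition frequency described above, the following is a minimal Python sketch. The patent does not specify the statistic or the adjustment rule: the median filter, the `next_interval` formula and all parameter values are assumptions chosen for illustration, and isolation forest filtering is not shown.

```python
from collections import deque
from statistics import median


class SlidingMedianFilter:
    """Suppress transient spikes by reporting the median of the last `size` samples."""

    def __init__(self, size: int = 5) -> None:
        self.window = deque(maxlen=size)

    def update(self, value: float) -> float:
        self.window.append(value)
        return median(self.window)


def next_interval(prev: float, curr: float, base_s: float = 60.0,
                  min_s: float = 1.0, sensitivity: float = 10.0) -> float:
    """Hypothetical rule: shorten the sampling interval as the relative change rate grows."""
    rate = abs(curr - prev) / max(abs(prev), 1e-9)
    return max(min_s, base_s / (1.0 + sensitivity * rate))


# A single spike (95.0) in an otherwise stable index is smoothed away.
filt = SlidingMedianFilter(size=3)
smoothed = [filt.update(v) for v in [10.0, 11.0, 95.0, 12.0, 11.0]]

# A stable index keeps the base interval; a 50% jump shortens it.
stable_interval = next_interval(100.0, 100.0)
bursty_interval = next_interval(100.0, 150.0)
```

A median is used here rather than a mean because a single outlier cannot dominate it; other window statistics would fit the same description.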

Further, the incremental time sequence alignment and the construction of the fault propagation map proceed as follows: the physical layer low frequency data and the application layer high frequency data are dynamically interpolated, weighted by data confidence, to generate continuous time series; potential phase differences among cross-layer indexes are detected through sliding correlation analysis, and the interpolation anchor points are dynamically adjusted to eliminate time sequence offset; conditional transition probabilities among the cross-layer indexes are calculated from the aligned time series data; and the time window range is dynamically expanded, and shrunk when an abrupt index change is detected, to generate a fault propagation map with weighted edges.
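
The phase difference detection step can be sketched as a search for the shift that maximizes the correlation between an upstream and a downstream series. This is only one plausible reading of "sliding correlation analysis": the function names, the use of Pearson correlation and the sample data are assumptions, and the confidence-weighted interpolation is assumed to have already produced equally spaced samples.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def best_lag(upstream, downstream, max_lag):
    """Return (lag, corr): the shift in samples at which downstream best follows upstream."""
    best = (0, -1.0)
    for lag in range(max_lag + 1):
        x = upstream[: len(upstream) - lag]
        y = downstream[lag:]
        n = min(len(x), len(y))
        if n < 3:
            break
        r = pearson(x[:n], y[:n])
        if r > best[1]:
            best = (lag, r)
    return best


# Hypothetical data: a disk latency spike reappears two samples later in call delay.
disk_latency = [1, 1, 1, 5, 9, 5, 1, 1, 1, 1, 1, 1]
call_delay = [1, 1, 1, 1, 1, 5, 9, 5, 1, 1, 1, 1]
lag, corr = best_lag(disk_latency, call_delay, max_lag=4)
```

Once the lag is known, the downstream series can be shifted by that many samples before conditional transition probabilities are computed.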

The logic for identifying hidden association nodes with space-time coupling characteristics among cross-layer indexes is as follows: propagation paths spanning the physical layer, virtual layer and application layer are extracted from the fault propagation map, and candidate nodes whose transition probability exceeds a dynamic threshold are screened; mutual information entropy analysis is performed on the candidate nodes to quantify their dependence strength on upstream and downstream indexes, and weakly associated interference items are eliminated; the space-time causal directionality of the candidate paths is verified through a Granger causality test; and directional disturbance is injected into high-probability paths and the response amplitude of the downstream indexes is observed, thereby confirming the validity of the space-time coupling.
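
The first two screening steps (transition probability threshold, then mutual information) might look like the following sketch. It assumes the indexes have already been discretized into states (here 0 = normal, 1 = abnormal); the node names, thresholds and data are hypothetical, and the Granger causality test and disturbance injection steps are not shown.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information, in bits, between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def screen_candidates(edges, states, prob_threshold, mi_threshold):
    """Keep edges whose transition probability and mutual information both pass.

    edges:  {(src, dst): transition_probability}
    states: {node: [discrete state per time step]}
    """
    kept = []
    for (src, dst), p in edges.items():
        if p <= prob_threshold:
            continue  # below the (here fixed) probability threshold
        mi = mutual_information(states[src], states[dst])
        if mi >= mi_threshold:
            kept.append((src, dst, p, mi))
    return kept


# Hypothetical states for four indexes.
states = {
    "disk_frag": [0, 1, 0, 1, 1, 0, 1, 0],
    "pod_evict": [0, 1, 0, 1, 1, 0, 1, 0],    # moves with disk_frag
    "cache_miss": [0, 0, 1, 1, 0, 0, 1, 1],   # statistically independent of disk_frag
    "api_latency": [0, 0, 0, 1, 0, 0, 0, 1],
}
edges = {
    ("disk_frag", "pod_evict"): 0.8,
    ("disk_frag", "cache_miss"): 0.7,
    ("pod_evict", "api_latency"): 0.2,
}
kept = screen_candidates(edges, states, prob_threshold=0.5, mi_threshold=0.3)
```

In this example only the `disk_frag` to `pod_evict` edge survives: the `cache_miss` edge has high transition probability but zero mutual information, and the `api_latency` edge fails the probability threshold.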

Further, verifying the authenticity of the cross-layer causal relationship through directional disturbance injection, and fitting the coupling degree index according to the association matrix of resource scheduling operations and fault propagation chains, specifically proceeds as follows: a high-probability association path is selected in the fault propagation map, and a controllable disturbance, such as a simulated sudden increase in physical layer storage delay or a limit on the network bandwidth of a virtual layer container, is injected at the source node of the path; the response of the downstream indexes is monitored, and the response and its amplitude along the disturbance propagation path are recorded and compared for consistency with the path predicted by the original map; the time sequence relationship between historical resource scheduling operations and fault propagation chains is extracted, and the association matrix of resource scheduling operations and fault propagation chains is constructed; and the coupling degree index is fitted by gradient descent, based on the influence weight of the resource scheduling operations in the matrix on fault chain length and repair time.
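
The gradient descent fit can be illustrated with a deliberately simple model. The patent does not give the functional form, so this sketch assumes a linear model in which each row of the association matrix marks which scheduling operation types preceded a fault chain, and `impact` is a hypothetical normalized score combining chain length and repair time.

```python
def fit_coupling_weights(matrix, impact, lr=0.1, epochs=2000):
    """Fit one weight per operation type by gradient descent on mean squared error."""
    n_ops = len(matrix[0])
    m = len(matrix)
    w = [0.0] * n_ops
    for _ in range(epochs):
        grad = [0.0] * n_ops
        for row, y in zip(matrix, impact):
            err = sum(wi * xi for wi, xi in zip(w, row)) - y
            for j, xj in enumerate(row):
                grad[j] += 2.0 * err * xj / m
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


# Columns (hypothetical): [capacity expansion, container migration].
# Rows: historical fault chains; 1 means the operation preceded the chain.
association = [
    [1, 0],
    [0, 1],
    [1, 1],
    [1, 0],
]
impact = [0.8, 0.1, 0.9, 0.8]  # assumed normalized chain length / repair time score

weights = fit_coupling_weights(association, impact)
```

Under these assumptions the fitted weight for capacity expansion is much larger than that for migration, which is the kind of signal the coupling degree analysis rule would threshold on.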

Further, constructing the coupling degree analysis rule to quantify the influence weight of an operation on fault propagation, and generating root cause localization instructions for negative coupling scenes, specifically proceeds as follows: based on the association matrix of resource scheduling operations and fault propagation chains, the contribution weight of each operation to fault chain length and repair time is calculated to generate an initial coupling degree analysis rule; a dynamic judgment threshold is set, and when the contribution weight of an operation to the fault propagation chain exceeds the threshold, the operation is marked as a negative coupling operation; the nodes associated with the negative coupling operation are traced back in the fault propagation map, and nodes with high causal strength that are not covered by historical scheduling operations are screened as candidate root causes; a reverse blocking test is performed on the candidate root causes, verifying their interrupting effect on the downstream fault chain by limiting resource access or traffic distribution; and a localization instruction comprising the root cause node identifier, the influence path and a repair suggestion is generated and pushed to the operation and maintenance terminal.

Further, analyzing historical fault repair time and resource waste rate according to the fault propagation cost model to fit the propagation cost gradient specifically proceeds as follows: historical fault repair time data and the resource waste rate caused by resource scheduling operations are extracted to construct an initial propagation cost function; the cost function parameters are iteratively optimized by gradient descent, dynamically adjusting the repair time weight and the resource waste penalty factor; the current propagation cost gradient is calculated from the real-time fault propagation path length and the change in resource utilization; and the gradient parameters are continuously optimized using strategy execution feedback data, to adapt to dynamic changes in the hybrid cloud environment.

Further, inputting the coupling degree index, the real-time service level agreement (SLA) violation rate and the fault propagation cost gradient into the dynamic weight function, calculating the cooperative priority of short-term suppression actions and long-term eradication actions, and generating the execution sequence of the two classes of actions through an asymmetric game strategy, specifically proceeds as follows: a coupling degree penalty factor is introduced into the dynamic weight function to lower the priority of scheduling operations that may aggravate fault propagation in high coupling scenes; the urgency weight of short-term suppression actions is calculated based on the real-time SLA violation rate, and the benefit weight of long-term eradication actions is calculated in combination with the propagation cost gradient; the short-term suppression actions and the long-term eradication actions are defined as participants in an asymmetric game, and a benefit function is constructed to quantify the contribution of each to improving service availability and suppressing fault propagation; and the optimal cooperative strategy is solved through dynamic Nash equilibrium, with short-term suppression actions executed first to stop losses quickly and long-term eradication actions triggered asynchronously.
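
To make the game formulation concrete, the sketch below models the two action classes as players in a 2x2 game and enumerates its pure-strategy Nash equilibria. The patent does not disclose the benefit function or how the dynamic equilibrium is solved, so the payoff construction in `build_payoffs` (urgency from the SLA violation rate, benefit from the cost gradient, a coupling penalty when both classes run at once, a 0.5 discount for deferral) is entirely an assumption, as are the input values.

```python
from itertools import product

NOW, DEFER = 0, 1


def pure_nash(payoff_a, payoff_b):
    """Return all pure-strategy Nash equilibria (i, j) of a two-player matrix game."""
    rows, cols = len(payoff_a), len(payoff_a[0])
    equilibria = []
    for i, j in product(range(rows), range(cols)):
        a_best = all(payoff_a[i][j] >= payoff_a[k][j] for k in range(rows))
        b_best = all(payoff_b[i][j] >= payoff_b[i][l] for l in range(cols))
        if a_best and b_best:
            equilibria.append((i, j))
    return equilibria


def build_payoffs(sla_violation_rate, cost_gradient, coupling):
    """Hypothetical payoffs, indexed [short-term choice][long-term choice]."""
    u, g, c = sla_violation_rate, cost_gradient, coupling
    # Running both at once incurs the coupling penalty c; deferring halves the payoff.
    short = [[u - c, u], [0.5 * u, 0.5 * u]]
    long_ = [[g - c, 0.5 * g], [g, 0.5 * g]]
    return short, long_


short_payoff, long_payoff = build_payoffs(
    sla_violation_rate=0.8, cost_gradient=0.4, coupling=0.3
)
equilibria = pure_nash(short_payoff, long_payoff)
```

With these assumed inputs the only equilibrium is `(NOW, DEFER)`: the suppression action runs immediately and the eradication action is deferred, matching the execution order the text describes. Other parameter choices can yield several equilibria, in which case a tie-breaking rule would be needed.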

The strategy execution and feedback module specifically comprises: an atomization execution engine, which decomposes the decision action into independently executable atomic operations, such as calling a cloud platform API to trigger a rate limiting policy or initiating a storage volume migration task, and ensures the atomicity and consistency of cross-platform operations through a transaction lock mechanism; cross-platform index change monitoring, which captures the suppression effect of the action on the fault propagation chain and its impact on resource utilization; and feedback-driven updating, which adjusts the weight parameters in the fault propagation cost model according to the feedback data and optimizes the coupling degree analysis rule, enhancing early recognition of negative coupling scenes.
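
One way to picture the atomization execution engine is a lock-guarded runner that applies atomic operations in order and undoes the completed ones if a later one fails. The patent does not describe the transaction lock mechanism in detail; the classes, the in-process `threading.Lock` and the simulated operations below are illustrative assumptions, and no real cloud platform API is called.

```python
import threading
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AtomicOp:
    name: str
    apply: Callable[[], None]
    undo: Callable[[], None]


class AtomicExecutor:
    """Run a sequence of operations all-or-nothing under a single lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()

    def run(self, ops: List[AtomicOp]) -> bool:
        done: List[AtomicOp] = []
        with self._lock:
            try:
                for op in ops:
                    op.apply()
                    done.append(op)
            except Exception:
                # Roll back completed operations in reverse order.
                for op in reversed(done):
                    op.undo()
                return False
        return True


# Simulated platform state (hypothetical).
state = {"rate_limit": None, "volume_host": "node-a"}


def set_limit() -> None:
    state["rate_limit"] = 100


def clear_limit() -> None:
    state["rate_limit"] = None


def migrate_volume() -> None:
    raise RuntimeError("simulated migration failure")


def noop() -> None:
    pass


ok = AtomicExecutor().run([
    AtomicOp("rate-limit", set_limit, clear_limit),
    AtomicOp("migrate-volume", migrate_volume, noop),
])
```

Because the migration fails, the rate limit set by the first operation is rolled back and `state` ends up unchanged. A cross-platform deployment would need a distributed lock rather than an in-process one.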

A cloud monitoring service operation and maintenance dynamic optimization method based on an AI intelligent agent comprises the following steps: S1, deploying a containerized probe cluster, collecting in real time the physical layer hardware resource fragmentation indexes, virtual layer container life cycle events and application layer micro-service call chain performance offset data in a hybrid cloud environment, and outputting a standardized cross-layer index set through hierarchical labeling preprocessing and outlier filtering; S2, eliminating the time sequence deviation between physical layer low frequency data and application layer high frequency data through an incremental time sequence alignment algorithm, constructing a fault propagation map based on a dynamic time window, and identifying hidden association nodes with space-time coupling characteristics among the cross-layer indexes; S3, receiving the fault propagation map, verifying the authenticity of cross-layer causal relationships through directional disturbance injection, fitting a coupling degree index according to the association matrix of resource scheduling operations and fault propagation chains, constructing a coupling degree analysis rule to quantify the influence weight of an operation on fault propagation, and generating root cause localization instructions for negative coupling scenes; S4, analyzing historical fault repair time and resource waste rate according to a fault propagation cost model to fit a propagation cost gradient, inputting the coupling degree index, the real-time service level agreement (SLA) violation rate and the fault propagation cost gradient into a dynamic weight function, calculating the cooperative priority of short-term suppression actions and long-term eradication actions, and generating the execution sequence of the two classes of actions through an asymmetric game strategy; S5, calling the cloud platform interface to execute the decision actions atomically, collecting the index change data after execution, and dynamically updating the fault propagation cost model and the coupling degree analysis rule.

The invention has the following beneficial effects:

(1) The cloud monitoring service operation and maintenance dynamic optimization system based on an AI intelligent agent achieves unified acquisition and standardized processing of heterogeneous data from the physical layer, virtual layer and application layer through a containerized probe cluster, overcoming the single-layer data limitation of traditional monitoring tools and improving cross-platform resource coordination. A fault propagation map is constructed based on the incremental time sequence alignment algorithm and a dynamic time window, so that space-time coupling characteristics among cross-layer indexes are accurately identified and the false associations caused by static rules are avoided. Causality is verified through directional disturbance injection and the coupling degree analysis rule, reducing the cost of manual investigation and improving localization accuracy in complex fault scenes. The priorities of short-term suppression actions and long-term eradication actions are dynamically balanced using an asymmetric game strategy, avoiding conflicts between resource scheduling and fault repair and improving the overall resilience of the system. The models and rules are dynamically updated from execution feedback data, so the operation and maintenance strategy iterates continuously and adapts to dynamic changes in the hybrid cloud environment.

(2) The cloud monitoring service operation and maintenance dynamic optimization method based on an AI intelligent agent achieves full life cycle management of cross-layer indexes through data acquisition, time sequence alignment and map construction, eliminating the interference of data silos with operation and maintenance decisions. Through dynamic time window and space-time coupling characteristic analysis, potential cascading failure paths are identified in advance, improving fault early warning. The influence weight of resource scheduling on fault propagation is quantified by combining directional disturbance with the association matrix, enhancing the reliability and interpretability of root cause localization. A cooperative execution strategy suited to complex scenes is generated from dynamic weights computed on the fault propagation cost gradient and the real-time service level agreement (SLA) violation rate, reducing the risk of service interruption. Adaptive updating of the models and rules, driven by the closed execution feedback loop, keeps the operation and maintenance strategy matched to actual environment requirements and improves long-term operation and maintenance efficiency.

Of course, a product practicing the invention need not achieve all of the advantages set forth above at the same time.

Drawings

Fig. 1 is a flowchart of a cloud monitoring service operation and maintenance dynamic optimization system based on an AI intelligent agent.

Fig. 2 is a flowchart of a cloud monitoring service operation and maintenance dynamic optimization method based on an AI intelligent agent.

Detailed Description

The cloud monitoring service operation and maintenance dynamic optimization system and method based on an AI intelligent agent target the complex coupling between cross-layer resources and services in a hybrid cloud environment. By fusing heterogeneous data of the physical layer, virtual layer and application layer, and combining dynamic causal verification with a game decision mechanism, they solve problems of traditional operation and maintenance schemes such as data fragmentation, failure to associate hidden faults, and conflicts between resource scheduling and fault repair actions. They achieve closed-loop management covering fault early warning, root cause localization, strategy optimization and self-healing execution, and improve the stability and resource utilization efficiency of the cloud service system.

The scheme in the embodiment of the application has the following overall thought:

Cross-layer data integration and hidden association mining: heterogeneous data of physical layer hardware resources, virtual layer container clusters and application layer micro-service links are collected in real time through the containerized probe cluster; differences in data semantics and acquisition frequency are eliminated through hierarchical labeling and dynamic time sequence alignment, and a unified, standardized cross-layer index set is constructed. A fault propagation map is then generated based on a dynamic time window and incremental correlation analysis, and hidden fault paths with space-time coupling characteristics (such as the cascading effect in which storage performance degradation causes micro-service call delay) are identified, overcoming the data fragmentation of traditional single-layer monitoring.

Dynamic causal verification and coupling degree quantification: directional disturbance (such as a simulated surge in physical layer I/O delay) is injected into the fault propagation map to verify the authenticity of the causal relationships among cross-layer indexes; the coupling degree index is fitted and the analysis rule is constructed using the association matrix of historical resource scheduling operations and fault propagation chains, quantifying the influence weight of an operation on fault propagation (such as the probability that a capacity expansion operation causes a cascading fault to recur); and root cause localization instructions for negative coupling scenes are generated, addressing the low efficiency and high false alarm rate of manual investigation.

Multi-objective cooperative decision and closed-loop optimization: based on the fault propagation cost model and the real-time service level agreement (SLA) violation rate, the cooperative priorities of short-term suppression actions (rate limiting and degradation) and long-term eradication actions (storage migration and service reconfiguration) are calculated through the dynamic weight function, and an asymmetric game strategy is introduced to balance the resource conflict between the two classes of actions and generate an optimal execution sequence. After the strategy is executed, the weights of the propagation cost model and the coupling degree analysis rule are dynamically updated from feedback data, forming an adaptive closed loop of perception-decision-execution-optimization that continuously improves fault self-healing efficiency and resource utilization in the hybrid cloud environment.

Referring to FIG. 1, an embodiment of the invention provides the following technical scheme: a cloud monitoring service operation and maintenance dynamic optimization system based on an AI intelligent agent comprises a cross-layer index sensing module, a space-time correlation analysis module, a causal verification and coupling degree analysis module, a dynamic priority decision module, and a strategy execution and feedback module. The cross-layer index sensing module is used for deploying a containerized probe cluster, collecting in real time the physical layer hardware resource fragmentation indexes, virtual layer container life cycle events and application layer micro-service call chain performance offset data in a hybrid cloud environment, and outputting a standardized cross-layer index set through hierarchical labeling preprocessing and outlier filtering. The space-time correlation analysis module is connected to the cross-layer index sensing module and is used for eliminating the time sequence deviation between physical layer low frequency data and application layer high frequency data through an incremental time sequence alignment algorithm, constructing a fault propagation map based on a dynamic time window, and identifying hidden association nodes with space-time coupling characteristics among the cross-layer indexes. The causal verification and coupling degree analysis module is used for receiving the fault propagation map output by the space-time correlation analysis module, verifying the authenticity of cross-layer causal relationships through directional disturbance injection, fitting a coupling degree index according to the association matrix of resource scheduling operations and fault propagation chains, constructing a coupling degree analysis rule to quantify the influence weight of an operation on fault propagation, and generating root cause localization instructions for negative coupling scenes. The dynamic priority decision module is used for analyzing historical fault repair time and resource waste rate according to a fault propagation cost model to fit a propagation cost gradient, inputting the coupling degree index, the real-time service level agreement (SLA) violation rate and the fault propagation cost gradient into a dynamic weight function, calculating the cooperative priority of short-term suppression actions and long-term eradication actions, and generating the execution sequence of the two classes of actions through an asymmetric game strategy. The strategy execution and feedback module is used for calling the cloud platform interface to execute the decision actions atomically, collecting the index change data after execution, and dynamically updating the fault propagation cost model and the coupling degree analysis rule.

In this embodiment, the cross-layer index sensing module deploys a containerized probe cluster, continuously senses key running-state indexes at the different layers (physical layer, virtual layer and application layer) of the hybrid cloud environment, preprocesses and normalizes them, and outputs a cross-layer comparable index set. The containerized probe cluster is a group of lightweight monitoring programs deployed using container technology (such as Docker); it adapts flexibly to multi-cloud environments and provides highly scalable, minimally invasive data acquisition. Physical layer hardware resource fragmentation indexes, such as idle CPU fragments and non-contiguous memory occupation, reflect how scattered resource usage is, which can affect scheduling efficiency. Container life cycle events are state change events such as container start, pause, migration and destruction. Micro-service call chain performance offset data are the dynamic changes of indexes such as response delay and error rate along the call paths between services, and are used to analyze application layer performance bottlenecks. Hierarchical labeling preprocessing classifies and labels the collected data by source layer, service category and the like, to improve context matching in subsequent analysis. Outlier filtering removes invalid data caused by transient fluctuation, acquisition errors or network delay, ensuring the quality of the input indexes.

The space-time correlation analysis module eliminates time sequence errors among data from different layers, mines hidden associations among indexes under a dynamic time window, and constructs fault propagation paths. The incremental time sequence alignment algorithm is a sliding-window-based method for aligning time series data; it dynamically adjusts how data points are aligned as the data are continuously updated, so that high-frequency and low-frequency indexes are matched synchronously. The fault propagation map is a graph structure based on a probability model, in which nodes represent index items, edges represent potential fault transmission paths, and weights represent propagation probabilities. A hidden association node is an index or event node that appears on the surface to have no direct association, but exhibits highly correlated behavior under certain conditions or during certain periods.

The causal verification and coupling degree analysis module verifies whether real causal relationships exist among cross-layer indexes and quantifies the degree to which resource operations influence fault propagation, so as to locate the root cause of a fault. Directional disturbance injection means injecting a small change (such as a resource configuration adjustment) into a specific component or parameter without affecting system stability, observing how the whole index chain responds, and thereby identifying causal relationships. The association matrix is an influence matrix constructed between resource scheduling actions (such as container migration) and index variation, used to fit causal paths. The coupling degree index is a numerical measure of the interaction strength between operations; the higher the value, the stronger the coupling. A negative coupling scene is one in which certain operations or resource scheduling instead aggravate system faults or performance degradation. A root cause localization instruction is an instruction, generated from the above analysis, that precisely locates the core node or operation causing negative propagation in the system.

The dynamic priority decision module combines multiple factors to dynamically rank candidate actions by priority, weighing short-term loss stopping against long-term remediation, and formulates the optimal operation and maintenance strategy sequence. The fault propagation cost model analyzes the repair time and resource waste in past fault events to quantify the evolution cost of faults. The service level agreement (SLA) violation rate is the proportion of service that fails to meet the quality (such as delay and availability) required by the contract or platform. The propagation cost gradient is the marginal contribution of different nodes or paths to the increase in overall system cost. The dynamic weight function dynamically adjusts the influence weight of each input factor according to actual conditions (such as service level changes). The asymmetric game strategy reflects that the benefits of different actions are asymmetric; the optimal response strategy of both parties is solved through a game model to achieve globally optimal cooperative action.

The strategy execution and feedback module calls the underlying cloud platform interface to execute strategy actions automatically and collects feedback data for model iteration, achieving closed-loop optimization. Atomized execution means decomposing a strategy into minimal operation units that can be executed independently (such as restarting a service or migrating a container), to ensure the controllability and safety of the system. The index change data after execution include index differences and change rates before and after an action is executed, and are used to evaluate the effect of the action. Dynamically updating the cost model and rules means feeding back the execution effect to adjust the weights of the decision model and the causal rules in real time, achieving adaptive evolution of the model.
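
The fault propagation map described above (nodes as index items, edges as potential transmission paths, weights as propagation probabilities) can be sketched as a dictionary of weighted edges. Estimating an edge weight as the conditional frequency "destination abnormal one step after source abnormal" is an assumption made for illustration, as are the node names and the binary anomaly series.

```python
def transition_probability(src, dst, lag=1):
    """Estimate P(dst abnormal at t + lag | src abnormal at t) from binary series."""
    hits = total = 0
    for t in range(len(src) - lag):
        if src[t]:
            total += 1
            if dst[t + lag]:
                hits += 1
    return hits / total if total else 0.0


def path_probability(graph, path):
    """Multiply edge weights along a path; a missing edge contributes 0."""
    p = 1.0
    for a, b in zip(path, path[1:]):
        p *= graph.get((a, b), 0.0)
    return p


# Hypothetical aligned anomaly flags (1 = abnormal) for one index per layer.
disk = [0, 1, 0, 0, 1, 0, 1, 0]        # physical layer
container = [0, 0, 1, 0, 0, 1, 0, 0]   # virtual layer
service = [0, 0, 0, 1, 0, 0, 1, 0]     # application layer

graph = {
    ("disk", "container"): transition_probability(disk, container),
    ("container", "service"): transition_probability(container, service),
}
p_chain = path_probability(graph, ["disk", "container", "service"])
```

Here two of the three disk anomalies are followed by a container anomaly and every container anomaly is followed by a service anomaly, so the cross-layer path has probability 2/3 under this simple estimate.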

The cross-layer index perception module specifically comprises: a containerized probe cluster deployed on hybrid cloud nodes in a microservice architecture, in which physical layer probes collect hardware resource fragmentation indexes, virtual layer probes monitor container lifecycle events and resource conflict behavior, and application layer probes track microservice call chain topology and performance offset; the acquisition frequency is adjusted according to the dynamic rate of change of each index, with low-frequency triggered sampling at the physical layer and event-driven high-frequency tracking at the application layer, and transient noise is suppressed through sliding window statistics; and a layered labeling preprocessing unit that attaches environment context labels for cloud platform type and service dependency level to the raw data, filters outliers with an isolation forest algorithm, and outputs a standardized cross-layer index set.

In this embodiment, the cross-layer index perception module packages the probes as containers and deploys them on multiple nodes of the hybrid cloud environment according to a microservice architecture. This deployment mode allows the probes to scale out and in according to actual monitoring requirements, adapts to different cloud platforms (such as private cloud, public cloud, or edge computing nodes), and allows probe functions to be updated independently and extended dynamically. The physical layer probes are deployed mainly on physical servers or in bare metal environments and collect the fine-grained usage state of the underlying hardware resources, including but not limited to the distribution of idle CPU cores, the number of non-contiguously allocated memory regions, and instantaneous fluctuations in disk access. This information helps identify the degree of resource fragmentation and provides a basis for efficient scheduling. The virtual layer probes are deployed in a virtualization platform or container orchestration system (e.g., Kubernetes) and monitor container lifecycle events (e.g., creation, start, termination) and resource contention with other containers or processes. By recording events such as container scheduling failures and resource preemption, they help identify potential system bottlenecks or unreasonable resource allocation. The application layer probes are integrated into the service gateway or middleware of each microservice and capture key performance indexes such as the call chain paths between services, the response time of each call, and changes in latency. This mechanism reconstructs the service dependency map, monitors performance drift between services, and helps quickly locate the root cause of a fault or a node with abnormal performance.
The cross-layer index perception module adaptively adjusts the sampling frequency according to the trend of each index. For physical layer indexes that fluctuate little or change slowly, a low-frequency sampling strategy is adopted to save computing resources; for application layer indexes that change sharply, such as response time, high-frequency or event-driven sampling is adopted to ensure the timeliness and precision of the data. The physical layer typically samples periodically at fixed intervals, for example collecting resource utilization once per minute, while the application layer uses an event-triggered mode in which events such as a microservice request timeout or an error rate surge trigger sampling immediately, with the data stored in encrypted form. This hierarchical sampling strategy balances system resource consumption against monitoring precision. To reduce misjudgment caused by transient data anomalies, the module applies a sliding window mechanism for smoothing. The mechanism aggregates the data within a continuous period (e.g., by mean or median) and effectively suppresses noise caused by short-lived disturbances such as network jitter and load peaks, improving the stability of anomaly identification. To enable cross-platform and cross-level fusion of the data, the module embeds environment context labels in the raw monitoring data, including meta information such as the cloud platform type, the service layer to which a node belongs, the application name, and the geographic deployment location. This label system supports subsequent grouping analysis, location analysis, and association analysis within the system. In the data cleaning stage, the module uses an isolation forest algorithm to reject abnormal samples in an unsupervised manner.
Without any prior knowledge, the algorithm identifies extreme or malformed data by how easily a sample can be isolated, which further improves the overall quality and robustness of the cross-layer indexes and prevents single-point anomalies from causing system misjudgment. Finally, the module outputs the processed cross-layer index data in a unified format, forming a standardized index set that downstream systems can call. Standardization covers unified index names, data unit conversion, timestamp format alignment, a unified label structure, and the like, ensuring that multi-source heterogeneous data can be parsed and used consistently.
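The sliding window smoothing described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes a trailing window and a median aggregate, and the function name `sliding_median`, the window length, and the sample latency values are all hypothetical.

```python
from collections import deque
from statistics import median

def sliding_median(values, window=5):
    """Smooth a metric stream with a trailing sliding-window median."""
    recent = deque(maxlen=window)  # holds only the most recent `window` samples
    smoothed = []
    for value in values:
        recent.append(value)
        smoothed.append(median(recent))
    return smoothed

# Hypothetical response-time samples (ms) with one transient spike,
# such as might be caused by network jitter.
latency_ms = [20, 21, 19, 400, 20, 22, 21]
smoothed_ms = sliding_median(latency_ms, window=3)
print(smoothed_ms)
```

A median is used here rather than a mean because a single spike inside the window does not move it; a mean would only dilute the spike.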

The specific process comprises: dynamically interpolating the physical layer low-frequency data and the application layer high-frequency data, weighted by data confidence, to generate a continuous time series; detecting potential phase differences among cross-layer indexes through sliding correlation analysis and dynamically adjusting the interpolation anchor points to eliminate timing offset; calculating the conditional transition probabilities among cross-layer indexes from the aligned time series data; and dynamically scaling the time window, shrinking it when an abrupt index change is detected, to generate a fault propagation map with weighted edges.

In this embodiment, the difference in sampling frequency causes a timing deviation between the physical layer low-frequency data and the application layer high-frequency data. To eliminate this deviation, the low-frequency and high-frequency data are dynamically interpolated to generate a continuous time series. Data confidence weighting: the confidence of each data point is determined by the reliability of its data source. For example, data from different probes may differ in accuracy, and data with higher confidence is given more weight. These weights adjust the influence of each data point during interpolation. Interpolation method: linear interpolation or a more complex method (e.g., spline interpolation) is typically used to generate continuous time series data from the weights and confidence levels of the data. The goal of this step is to fill the gaps between data points and ensure a smooth data sequence. Detecting potential phase differences by sliding correlation analysis: once continuous time series data has been obtained by interpolation, a sliding correlation analysis is performed to detect potential timing deviations (i.e., phase differences) between cross-layer indexes. The core idea of this step is to find the time difference between the physical layer data and the application layer data and to adjust the data alignment dynamically. Sliding correlation analysis: this method detects potential phase differences by calculating the correlation between cross-layer index data. Specifically, the sliding window computes the correlation at different time points, and the alignment point of the data (i.e., the interpolation anchor point) is adjusted step by step. If the index values within the time window are found to be substantially offset, the deviation is reduced by adjusting the interpolation anchor point.
The formula is:

$$R(\tau) = \frac{\sum_{t=1}^{W}\left(X(t)-\bar{X}\right)\left(Y(t+\tau)-\bar{Y}\right)}{\sqrt{\sum_{t=1}^{W}\left(X(t)-\bar{X}\right)^{2}\,\sum_{t=1}^{W}\left(Y(t+\tau)-\bar{Y}\right)^{2}}}$$

wherein $X(t)$ and $Y(t)$ are the index values of the physical layer and the application layer, respectively; $\tau$ is a time delay indicating a possible phase difference; $\bar{X}$ and $\bar{Y}$ are the mean values of $X$ and $Y$; and $W$ is the size of the time window, which determines the extent of the sliding window. From the correlation calculated over the sliding window, the phase difference between the data can be identified and corrected. After the time series alignment is complete, the next step is to calculate the conditional transition probability between cross-layer indexes, i.e., to predict the probability of the next state from the state of the current index. Conditional transition probability: the probability that an index transitions from one state to another given the current state. The calculation takes the correlation at different time points into account, and the transition probability model is fitted from historical data. The formula is:

$$P\left(S_A' \mid S_P\right) = \frac{N\left(S_P, S_A'\right)}{\sum_{S_A'} N\left(S_P, S_A'\right)}$$

wherein $S_P$ is the state of the physical layer; $S_A'$ is the next state of the application layer; and $N(S_P, S_A')$ is the number of times the physical layer state $S_P$ and the application layer state $S_A'$ occur together. The denominator sums over all possible values of $S_A'$ and normalizes the probability. Scaling of the dynamic time window: the range of the time window is adjusted dynamically as the data changes. If a cross-layer index changes abruptly (for example, on a fault or a performance anomaly), the time window is shrunk so that the analysis concentrates on recent data and captures the transient behavior of the fault accurately. Conversely, when the system is stable, the time window is extended to cover a longer period of data and capture the overall trend of the system. The window size is kept within fixed bounds:

$$W_{\min} \le W \le W_{\max}$$

wherein $W$ is the size of the time window, and $W_{\min}$ and $W_{\max}$ are the minimum and maximum sizes of the time window, respectively.
Generating the fault propagation map with weighted edges: finally, the fault propagation map with weighted edges is generated from the conditional transition probabilities and the dynamically adjusted time window. In this map, nodes represent indexes at different levels, and edges represent the correlation between them and the probability of fault propagation. The weight of an edge represents the strength of fault propagation between two indexes and reflects the likelihood that a fault propagates from one level to another. The greater the weight, the stronger the fault propagation between the two indexes and the higher the probability that the fault spreads along that edge. The formula is:

$$w_{ij} = P(j \mid i)\cdot r_{ij}$$

wherein $w_{ij}$ is the weight of the edge from node $i$ to node $j$; $P(j \mid i)$ is the conditional transition probability from node $i$ to node $j$; and $r_{ij}$ is a reliability factor representing the quality of the connection between node $i$ and node $j$, which may be assessed from factors such as system stability and device health. Through the above steps, an accurate fault propagation map is constructed, enabling more accurate fault early warning and repair strategies in the cloud monitoring service.
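The alignment and map-building steps above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the patented method: the correlation is a plain Pearson coefficient over the overlapping samples, the edge weight is taken as the product of transition probability and reliability factor, and the window rule in `adjust_window` (halve on abrupt change, otherwise grow by a fixed step) is purely hypothetical, since the text gives only the bounds. All function names and sample data are invented for illustration.

```python
from collections import Counter
from math import sqrt

def lagged_correlation(x, y, lag):
    """Pearson correlation between x[t] and y[t + lag] over the overlapping samples."""
    pairs = [(x[t], y[t + lag]) for t in range(len(x)) if 0 <= t + lag < len(y)]
    n = len(pairs)
    if n == 0:
        return 0.0
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    dev_x = sqrt(sum((a - mean_x) ** 2 for a, _ in pairs))
    dev_y = sqrt(sum((b - mean_y) ** 2 for _, b in pairs))
    return cov / (dev_x * dev_y) if dev_x and dev_y else 0.0

def best_lag(x, y, max_lag):
    """Lag in [0, max_lag] at which y follows x most closely."""
    return max(range(max_lag + 1), key=lambda lag: lagged_correlation(x, y, lag))

def transition_probabilities(src_states, next_states):
    """Estimate P(next | src) from co-occurrence counts of aligned state pairs."""
    joint = Counter(zip(src_states, next_states))
    totals = Counter(src_states)
    return {(s, d): c / totals[s] for (s, d), c in joint.items()}

def adjust_window(size, abrupt_change, w_min=5, w_max=60):
    """Hypothetical rule: halve on abrupt change, else grow by 5; clamp to bounds."""
    size = size // 2 if abrupt_change else size + 5
    return max(w_min, min(w_max, size))

def edge_weight(probability, reliability):
    """Edge weight as transition probability scaled by a reliability factor."""
    return probability * reliability

# Hypothetical data: the application-layer series repeats the physical-layer
# series two samples later.
physical = [0, 0, 1, 5, 1, 0, 0, 0]
application = [0, 0, 0, 0, 1, 5, 1, 0]
lag = best_lag(physical, application, max_lag=3)

disk_state = ["high", "high", "low", "high", "low"]
next_latency = ["slow", "slow", "ok", "ok", "ok"]
probs = transition_probabilities(disk_state, next_latency)
weight = edge_weight(probs[("high", "slow")], reliability=0.9)
print(lag, round(weight, 3))
```

The detected lag would be used to shift the interpolation anchor before the state pairs are counted, so that each physical-layer state is paired with the application-layer state it actually precedes.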

The logic for identifying hidden association nodes with space-time coupling characteristics among cross-layer indexes comprises: extracting the propagation paths that span the physical layer, the virtual layer, and the application layer in the fault propagation map, and screening candidate nodes whose transition probability exceeds a dynamic threshold; performing mutual information entropy analysis on the candidate nodes to quantify the strength of their dependence on upstream and downstream indexes, and eliminating weakly associated interference items; verifying the directionality of the space-time causal relationship of the candidate paths through a Granger causality test; and injecting directional disturbance into high-probability paths, observing the response amplitude of the downstream indexes, and confirming the validity of the space-time coupling.

In this embodiment, candidate nodes whose transition probability exceeds the dynamic threshold are screened first. In the fault propagation map, the propagation paths spanning the physical layer, the virtual layer, and the application layer are regarded as potential association paths. The transition probability is the probability of a state change propagating from one node to another. When this probability exceeds the dynamic threshold, the path may involve a strong correlation, so the nodes on it are treated as candidate nodes. Dynamic threshold: the threshold is not fixed but is adjusted according to historical data and system state, for example according to load fluctuations or the fault history of different systems, in order to improve the sensitivity of identification. Mutual information entropy analysis: for the screened candidate nodes, the strength of their dependence on upstream and downstream indexes is quantified through mutual information entropy analysis. Mutual information entropy evaluates the dependency between two variables and quantifies the relevance between cross-layer indexes. It is the mutual information between two variables and measures the degree of information shared between them. The larger the mutual information entropy, the stronger the dependency between the two variables. The formula is:

$$I(X;Y) = \sum_{x \in X}\sum_{y \in Y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}$$

wherein $X$ and $Y$ are two variables among the cross-layer indexes (for example, an index of the physical layer and an index of the application layer); $p(x,y)$ is the joint probability distribution of $X$ and $Y$; $p(x)$ and $p(y)$ are the marginal probability distributions of $X$ and $Y$, respectively; and $I(X;Y)$ represents the mutual information between $X$ and $Y$.
By calculating the mutual information entropy, the nodes with strong relevance can be identified, interference items with weak dependencies are removed, and the important association paths are retained. Granger causality test: after the mutual information entropy analysis, a Granger causality test is applied to the candidate paths to verify the directionality of their space-time causal relationship. The Granger causality test determines whether one variable can predict future changes in another variable. Based on historical data, if the history of a variable $X$ significantly improves the prediction of the current value of a variable $Y$ beyond what the history of $Y$ alone provides, $X$ is said to Granger-cause $Y$. This helps clarify the causal relationships between different levels of the system. The formula is as follows:

$$Y_t = c + \sum_{i=1}^{p} a_i\,Y_{t-i} + \sum_{i=1}^{p} b_i\,X_{t-i} + \varepsilon_t$$

wherein $Y_t$ is the target variable (e.g., a performance index of the application layer); $X_t$ is the explanatory variable (e.g., resource consumption of the physical layer); $c$ is a constant term; $a_i$ and $b_i$ are lag coefficients representing the effect of historical data on the current value; $p$ is the number of lag periods, representing the time range of the variable history; and $\varepsilon_t$ is an error term representing the part that the model cannot explain. The Granger causality test identifies whether a time-ordered causal relationship exists between cross-layer indexes, i.e., whether an index at one level affects an index at another level. Directional disturbance and downstream index response: once the causal directionality of the candidate paths is confirmed, the next step is to inject directional disturbance into the high-probability paths and observe the response amplitude of the downstream indexes.
Directional disturbance: the state of certain nodes is changed artificially (for example, by simulating load changes or resource consumption) to test how the system responds to those changes. The purpose of the directional disturbance is to verify whether the causal relationship has an actual effect in the system. Response amplitude of the downstream index: the change in the downstream index observed after the directional disturbance. The larger the response amplitude, the stronger the space-time coupling of the causal path; a significant change in the downstream index indicates that the space-time coupling is valid. Finally, the validity of the space-time coupling is confirmed from the directional disturbance and response test. If the disturbance has a significant effect on the downstream index, the path is confirmed as a valid space-time coupled path. Validity of space-time coupling: a strong causal relationship exists between the cross-layer indexes, and this relationship has a practical effect on system performance. The relationship can then be used to optimize the system, detect potential faults, and allocate resources.
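The mutual information screening step can be illustrated with a small estimator over discretized index states. This is a sketch only: it assumes the indexes have already been binned into discrete states, estimates probabilities from raw counts, and uses base-2 logarithms so that the result is in bits. The names `mutual_information` and `drop_weak_links`, the threshold, and the sample data are hypothetical.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Estimate I(X;Y) in bits from paired samples of two discrete variables."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    return sum(
        (c / n) * log2((c / n) / ((count_x[x] / n) * (count_y[y] / n)))
        for (x, y), c in joint.items()
    )

def drop_weak_links(candidates, min_bits):
    """Keep only candidate pairs whose mutual information reaches the threshold."""
    scores = {name: mutual_information(xs, ys) for name, (xs, ys) in candidates.items()}
    return {name: mi for name, mi in scores.items() if mi >= min_bits}

# Hypothetical discretized states (0 = normal, 1 = abnormal) for two candidate links.
candidates = {
    "disk_io -> api_latency": ([0, 0, 1, 1], [0, 0, 1, 1]),   # fully dependent
    "cpu_temp -> api_latency": ([0, 0, 1, 1], [0, 1, 0, 1]),  # independent
}
strong = drop_weak_links(candidates, min_bits=0.5)
print(strong)
```

Mutual information is symmetric, so it ranks the strength of a dependency but cannot say which side drives the other; that is why the text follows it with a Granger causality test and a disturbance injection.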

The specific process comprises: selecting a high-probability association path in the fault propagation map and injecting, at the source node of the path, a controllable disturbance that simulates a surge in physical layer storage latency or limits the network bandwidth of a virtual layer container; monitoring the response of the downstream indexes, recording the propagation path and response amplitude of the disturbance, and comparing them for consistency with the path predicted by the original map; extracting the timing relationship between historical resource scheduling operations and fault propagation chains and constructing an association matrix between the resource scheduling operations and the fault propagation chains; and fitting the coupling index by a gradient descent method based on the influence weights, in the matrix, of the resource scheduling operations on fault chain length and repair time.

In this embodiment, a high-probability association path is first selected in the constructed fault propagation map. A high-probability path is a path most likely to carry a causal relationship among the physical layer, the virtual layer, and the application layer. After these paths are selected, a simulated disturbance is injected at the source node of each path to verify that the causal relationship is real. Directional disturbance injection: a controlled disturbance is injected at the source node; common disturbances include a surge in storage latency at the physical layer or a limit on the network bandwidth of a virtual layer container. For example, a sudden increase in storage latency may be injected at the physical layer, and the network bandwidth of a container may be limited at the virtual layer, to simulate a resource bottleneck or fault condition. The directional disturbance is used to verify whether the influence of the source node on the downstream indexes matches the path predicted by the fault propagation map. If the downstream index response after the disturbance is as expected, the causal relationship on the path is verified. After the disturbance is injected at the source node, the response of the downstream indexes is observed, and the propagation path and response amplitude of the disturbance are recorded. Response observation: the changes after disturbance injection are observed by monitoring indexes at other levels of the system (e.g., application layer performance and virtual layer resource usage). If the response amplitude of the downstream index is large, the causal relationship on the path is valid and the effect of the disturbance has been transmitted along the propagation path. Comparison with the predicted path: the actually observed disturbance propagation path is compared with the path predicted in the original map.
If the two are consistent, the causal path is confirmed to be valid. To further analyze the relationship between fault propagation and resource scheduling, the timing relationship between historical resource scheduling operations and fault propagation chains is extracted. The purpose of this step is to find potential associations between resource scheduling operations and fault propagation and to understand how these operations affect the length and repair time of a fault propagation chain. Timing relationship: by analyzing historical data, the timing relationship between resource scheduling operations and fault propagation chains is constructed, for example whether a resource scheduling operation advances or delays the occurrence of fault propagation, or whether certain scheduling operations shorten repair time. Based on the extracted historical data, an association matrix between the resource scheduling operations and the fault propagation chains is constructed. The matrix contains the influence weight of each resource scheduling operation on the length and repair time of a fault propagation chain. Association matrix: each element represents the influence of a given resource scheduling operation on a fault propagation chain. In particular, if an operation significantly shortens the length or repair time of the fault propagation chain, the value of the corresponding element is larger. Based on the data in the association matrix, the coupling index is fitted using a gradient descent method, quantifying the degree of coupling between the resource scheduling operations and the fault propagation chain. Definition of the coupling index: the coupling index ($C_a$) reflects the relationship between resource scheduling operations and a fault propagation chain, taking into account the influence of the operations on the length, repair time, and propagation path of the chain.
To fit the coupling index, several factors are introduced and combined with weights. The coupling index formula is:

$$C_a = \sum_{i=1}^{N} w_i\left(\alpha\,L_{i,a} + \beta\,T_{i,a} + \gamma\,P_{i,a}\right)$$

wherein $C_a$ represents the coupling index between the resource scheduling operations and the $a$-th fault propagation chain; $N$ is the total number of resource scheduling operations; $w_i$ is the weight of the $i$-th resource scheduling operation, representing the proportion of its influence in the overall fault propagation chain; $L_{i,a}$ is the influence of the $i$-th resource scheduling operation on the length of fault propagation chain $a$; $T_{i,a}$ is the influence of the $i$-th resource scheduling operation on the repair time of fault propagation chain $a$; $P_{i,a}$ is the influence of the $i$-th resource scheduling operation on the propagation path of fault propagation chain $a$, typically expressed through path length and the nodes involved; and $\alpha$, $\beta$, and $\gamma$ are the influence weight coefficients of chain length, repair time, and propagation path, respectively. Definition of the influencing factors: the chain length influence $L_{i,a}$ represents the effect of resource scheduling operation $i$ on the length of the fault propagation chain and is typically determined by the expansion or compression of nodes or of the chain caused by the scheduling operation. For example, optimized resource scheduling may shorten the chain and thereby reduce propagation delay. The repair time influence $T_{i,a}$ represents the effect of the resource scheduling operation on the repair time of the fault propagation chain. A change in scheduling operations may affect the timeliness of the recovery process and thus the repair time. The propagation path influence $P_{i,a}$ represents the effect of the resource scheduling operation on the fault propagation path.
The scheduling operation may cause a new path to be activated or may change the propagation effect of an existing path. To optimize the coupling index, the gradient descent method is used for parameter tuning so that the model adapts better to different resource scheduling and fault propagation conditions. The gradient descent update for the coupling index parameters $\alpha$, $\beta$, $\gamma$, and $w_i$ is:

$$\theta \leftarrow \theta - \eta\,\frac{\partial \mathcal{L}}{\partial \theta}$$

wherein $\theta$ is the parameter set to be optimized; $\eta$ is the learning rate; and $\mathcal{L}$ is a loss function measuring the error between the coupling index predicted by the model and the actual coupling. Definition of the loss function: the loss function $\mathcal{L}$ measures the error in the fitting process as the difference between the actually observed coupling and the calculated coupling. It is defined as follows:

$$\mathcal{L}(\theta) = \sum_{a}\left(\hat{C}_a - C_a^{\mathrm{obs}}\right)^{2}$$

wherein $\hat{C}_a$ is the coupling index predicted by the model from the resource scheduling operations, and $C_a^{\mathrm{obs}}$ is the actually observed coupling index. By minimizing the loss function $\mathcal{L}$, the optimal combination of parameters is found so that the coupling index between the resource scheduling operations and the fault propagation chain is fitted more accurately. Dynamic updating of the model: the model is updated iteratively by gradient descent, each iteration adjusting the parameters $\alpha$, $\beta$, $\gamma$, and $w_i$ so that the calculated coupling index matches the actual characteristics of the fault propagation chain more closely. As the number of iterations increases, the parameters converge toward the optimal configuration. Interpretation of the coupling index: the coupling index indicates the extent to which resource scheduling operations affect a fault propagation chain, in particular how strongly they affect its length and repair time. A higher coupling index indicates a stronger association between the resource scheduling operations and the fault propagation chain.
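The gradient descent fit can be sketched as follows. To keep the example short it makes simplifying assumptions that are not in the source: each sample is a single operation and chain pair, the per-operation weights are fixed at 1, only the three coefficients for chain length, repair time, and propagation path are fitted, and the loss is the mean squared error. The function name `fit_coupling` and the sample data are hypothetical.

```python
def fit_coupling(samples, observed, lr=0.1, epochs=2000):
    """Fit (alpha, beta, gamma) so that alpha*L + beta*T + gamma*P matches `observed`.

    samples  : list of (L, T, P) influence triples, one per operation/chain pair
    observed : list of observed coupling values, same order as `samples`
    """
    theta = [0.0, 0.0, 0.0]
    n = len(samples)
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for feats, target in zip(samples, observed):
            err = sum(t * f for t, f in zip(theta, feats)) - target
            for k in range(3):
                grad[k] += 2.0 * err * feats[k] / n   # d(MSE)/d(theta_k)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

# Hypothetical influence triples and coupling values generated from
# alpha = 0.5, beta = 0.3, gamma = 0.2.
samples = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)]
observed = [0.5, 0.3, 0.2, 1.0]
alpha, beta, gamma = fit_coupling(samples, observed)
print(round(alpha, 3), round(beta, 3), round(gamma, 3))
```

In practice the learning rate and iteration count would need tuning to the scale of the influence values; the settings here are chosen only so that this toy data converges.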

The specific process comprises: calculating, based on the association matrix between the resource scheduling operations and the fault propagation chains, the contribution weight of each operation to fault chain length and repair time, and generating an initial coupling degree analysis rule; setting a dynamic decision threshold and marking an operation as a negative coupling operation when its contribution weight to the fault propagation chain exceeds the threshold; backtracking the nodes associated with the negative coupling operations in the fault propagation map and screening, as candidate root causes, the nodes of high causal strength that are not covered by historical scheduling operations; performing a reverse blocking test on the candidate root causes, verifying the interruption effect on the downstream fault chain by limiting their resource access or traffic distribution; and generating a location instruction comprising the root cause node identifier, the impact path, and a repair suggestion, and pushing it to the operation and maintenance terminal.

In this embodiment, an association matrix is first constructed from the relationship between the resource scheduling operations and the fault propagation chains. This matrix reflects the influence of each resource scheduling operation on the length, repair time, and propagation path of a fault propagation chain, and allows the contribution of each resource scheduling operation to the fault chain to be quantified. Generating the coupling degree analysis rule: once the association matrix has been constructed, the coupling degree analysis rule is generated. The main objective of this rule is to quantify the influence weight of each resource scheduling operation on the fault propagation chain and to identify negative coupling operations. Contribution weight calculation: a weight is calculated for the relationship between each resource scheduling operation and the fault propagation chain. The weight reflects the extent to which the operation affects the overall fault chain length, repair time, and propagation path. The higher the contribution weight of a resource scheduling operation, the more significant its influence on the fault chain. Dynamic decision threshold: to identify negative coupling operations, a dynamic decision threshold is set. A resource scheduling operation is regarded as a negative coupling operation when its contribution weight to the fault propagation chain exceeds the threshold. Negative coupling operations aggravate fault propagation and therefore require special attention and handling. Generating the initial rule: based on the above analysis, an initial coupling degree analysis rule is formed that defines what is judged to be a negative coupling operation, how contribution weights are calculated, and how the resource scheduling operations requiring particular attention are screened.
Identifying negative coupling operations: after the coupling degree analysis rule is generated, each node in the fault propagation map is analyzed and the nodes related to negative coupling operations are identified. Backtracking associated nodes: starting from the marked negative coupling operations, the nodes associated with them are traced back and analyzed. These nodes are potential sources in the fault propagation map and are likely to be the most critical trigger points in the fault propagation chain. Screening nodes of high causal strength: from the backtracked nodes, those with greater causal strength are selected, these being the nodes most strongly associated with the negative coupling operations. Causal strength may be determined by measuring the influence of a node during fault propagation. Nodes not covered by historical scheduling operations: these nodes require special attention, because they may be key causes of fault outbreaks. Root cause location and verification: the screened candidate root cause nodes are verified further to determine whether they actually affect the fault propagation chain. Reverse blocking test: a reverse blocking test is executed on each candidate root cause node. Its resource access or traffic distribution is limited, and the downstream fault chain is observed to see whether it is interrupted. If it is interrupted, the node is proven to be a root cause in the fault propagation chain. Verification method: during the reverse blocking test, limiting resource access or traffic changes the state of the fault propagation chain. If the fault propagation path is effectively interrupted in this way, the node is a valid root cause.
Generating the root cause location instruction: after the candidate root causes are verified, a location instruction comprising the root cause node identifier, the impact path, and a repair suggestion is generated. These instructions are pushed to the operation and maintenance terminal for timely fault recovery and repair. Content of the root cause location instruction: the root cause node identifier, which identifies the key node in the fault propagation chain; the impact path, which gives the fault propagation path of the root cause node and an analysis of the influence of that path on the whole system; and the repair suggestion, such as resource optimization, traffic limiting, or path adjustment, provided according to the characteristics of the root cause node. Finally, the root cause location instruction is pushed to the operation and maintenance terminal by the automation system for operation and maintenance personnel to consult and act on. The operation and maintenance personnel take the corresponding repair measures according to the location instruction, ensuring stable operation of the system.
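The screening steps above can be sketched as plain data transformations. This is an illustrative sketch under stated assumptions: the source does not say how the dynamic threshold is computed, so the rule used here (mean plus `k` standard deviations of the current contribution weights) is hypothetical, as are the function names, the instruction fields, and the sample data. The reverse blocking test is omitted because it requires acting on a live system.

```python
from statistics import mean, pstdev

def dynamic_threshold(weights, k=1.0):
    """Hypothetical rule: mean plus k population standard deviations."""
    return mean(weights) + k * pstdev(weights)

def negative_coupling_ops(contribution, k=1.0):
    """Operations whose contribution weight exceeds the dynamic threshold."""
    threshold = dynamic_threshold(list(contribution.values()), k)
    return sorted(op for op, w in contribution.items() if w > threshold)

def candidate_root_causes(neg_ops, op_to_nodes, causal_strength, covered, min_strength):
    """Backtrack nodes tied to negative coupling operations, keeping strong,
    previously uncovered nodes as candidates."""
    found = set()
    for op in neg_ops:
        for node in op_to_nodes.get(op, ()):
            if node not in covered and causal_strength.get(node, 0.0) >= min_strength:
                found.add(node)
    return sorted(found)

def location_instruction(node, impact_path, suggestion):
    """Assemble the three parts of a root cause location instruction."""
    return {"root_cause": node, "impact_path": impact_path, "repair_suggestion": suggestion}

# Hypothetical contribution weights taken from an association matrix.
contribution = {"migrate_vm": 0.10, "scale_out": 0.15, "rebalance_disk": 0.80, "restart_pod": 0.12}
neg_ops = negative_coupling_ops(contribution)
candidates = candidate_root_causes(
    neg_ops,
    op_to_nodes={"rebalance_disk": ["storage-01", "vm-07"]},
    causal_strength={"storage-01": 0.9, "vm-07": 0.4},
    covered={"vm-07"},
    min_strength=0.7,
)
instruction = location_instruction(
    candidates[0], ["storage-01", "vm-07", "order-service"], "limit disk rebalancing traffic"
)
print(neg_ops, candidates)
```

Each candidate would still have to pass the reverse blocking test before an instruction is pushed to the operation and maintenance terminal.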

Analyzing the historical fault repair time and resource waste rate according to the fault propagation cost model and fitting the propagation cost gradient specifically comprises: extracting the historical fault repair time data and the resource waste rate caused by resource scheduling operations to construct an initial propagation cost function; iteratively optimizing the parameters of the cost function by a gradient descent method, dynamically adjusting the repair time weight and the resource waste penalty factor; calculating the current propagation cost gradient from the real-time fault propagation path length and the change in resource utilization; and continuously optimizing the gradient parameters with strategy execution feedback data to adapt to the dynamic changes of the hybrid cloud environment.

In this embodiment, the propagation cost model is constructed as follows. First, an initial propagation cost function is constructed from historical data. It comprises two key factors: the historical fault repair time and the resource waste rate. These factors form the basis of the propagation cost and quantify the time cost and resource consumption of the fault repair process. Repair time is the time from the occurrence of a fault to its complete repair, and is generally influenced by system response, fault location, and the repair means used. Resource waste rate is the inefficient use or waste of system resources while a fault persists, which may be caused by redundant computation, idle resources, or excessive resource allocation. The initial propagation cost function may be expressed as:

$$C_0 = \alpha \cdot T_{repair} + \beta \cdot R_{waste}$$

where $C_0$ is the initial propagation cost, $T_{repair}$ is the historical fault repair time, $R_{waste}$ is the resource waste rate, and $\alpha$ and $\beta$ are weight coefficients that control the contributions of repair time and resource waste to the propagation cost, respectively.

Optimizing the propagation cost function by gradient descent: after the initial propagation cost function has been constructed, its parameters, in particular the repair time weight and the resource waste penalty factor, are optimized by gradient descent. This step reduces the overall propagation cost by repeatedly computing gradients and updating the weights. The aim is for the propagation cost function to reflect the actual fault repair process, with the weights of repair time and resource waste adjusted dynamically to achieve more efficient resource utilization and shorter repair time. The parameters $\alpha$ and $\beta$ are updated according to the gradient of the current propagation cost function:

$$\alpha_{f+1} = \alpha_f - \eta \frac{\partial C}{\partial \alpha}, \qquad \beta_{f+1} = \beta_f - \eta \frac{\partial C}{\partial \beta}$$

where $\alpha_f$ and $\beta_f$ are the repair time weight and the resource waste penalty factor at the $f$-th iteration, $\eta$ is the learning rate, which controls the size of each update, and $\partial C / \partial \alpha$ and $\partial C / \partial \beta$ are the partial derivatives of the propagation cost function with respect to the weight parameters.

Calculating the current propagation cost gradient: once the optimization is complete and the parameters of the propagation cost function have been adjusted to appropriate values, the current propagation cost gradient is calculated from real-time data. At this point the weights of repair time and resource waste have been dynamically adjusted, and the propagation cost function reflects the actual situation. Two real-time quantities are used. Propagation path length is the span of fault propagation, which can be regarded as the number of links from the source of the fault to its destination; the longer the path, the longer the repair takes and the more resources are consumed. Resource utilization is the degree to which system resources are actually used during fault repair; high utilization reduces resource waste and therefore reduces the propagation cost. The propagation cost gradient may be expressed as:

$$\nabla C = \alpha \cdot \Delta T_{repair} + \beta \cdot \Delta R_{waste}$$

where $\nabla C$ is the current propagation cost gradient, $\Delta T_{repair}$ is the real-time change in fault repair time, and $\Delta R_{waste}$ is the real-time change in resource waste rate.

Strategy execution feedback and gradient optimization: to adapt to dynamic changes in the hybrid cloud environment, the gradient parameters are continuously optimized using strategy execution feedback data. The feedback data reflect whether system performance improved or deteriorated after a strategy adjustment, and therefore how the propagation cost changed. Feedback data: after a strategy is executed, the fault propagation path length and the system resource utilization are monitored to obtain feedback on the dynamic change of the system. Gradient adjustment: the gradient parameters of the propagation cost function are adjusted according to the feedback data, so that repair time and resource waste are optimized more accurately in future fault repairs. Strategy adjustment: the parameters of the gradient descent process are adjusted according to real-time feedback, so that the model adapts to changes in the hybrid cloud environment and ensures efficient use of system resources and rapid fault repair.

In summary, by optimizing the propagation cost model through gradient descent, the system dynamically adjusts the weights of repair time and resource waste and thereby adapts to changes in the hybrid cloud environment in real time. Under continuous optimization, the model both predicts the cost of fault repair and adjusts itself as the environment changes, improving the efficiency and responsiveness of the system.
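The weight fitting and gradient calculation described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the embodiment's implementation: it assumes the cost is a weighted sum of repair time and waste rate, that each historical incident carries an observed cost label, and that the weights are fitted by gradient descent on the mean squared error against that label (the text does not specify the loss). All names (`Incident`, `fit_weights`, `cost_gradient`) and the sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Incident:
    repair_time: float    # historical fault repair time (normalized, assumed)
    waste_rate: float     # resource waste rate during the fault
    observed_cost: float  # assumed label: cost actually observed for the incident


def propagation_cost(alpha: float, beta: float,
                     repair_time: float, waste_rate: float) -> float:
    # Weighted sum of the two cost factors.
    return alpha * repair_time + beta * waste_rate


def fit_weights(incidents: List[Incident], alpha: float = 0.5, beta: float = 0.5,
                eta: float = 0.05, iterations: int = 2000) -> Tuple[float, float]:
    # Gradient descent on the mean squared error between predicted and observed
    # cost (assumed loss); eta is the learning rate.
    n = len(incidents)
    for _ in range(iterations):
        grad_alpha = 0.0
        grad_beta = 0.0
        for inc in incidents:
            err = propagation_cost(alpha, beta, inc.repair_time,
                                   inc.waste_rate) - inc.observed_cost
            grad_alpha += 2.0 * err * inc.repair_time / n
            grad_beta += 2.0 * err * inc.waste_rate / n
        alpha -= eta * grad_alpha
        beta -= eta * grad_beta
    return alpha, beta


def cost_gradient(alpha: float, beta: float,
                  delta_repair_time: float, delta_waste_rate: float) -> float:
    # Current propagation cost gradient from real-time changes in the two factors.
    return alpha * delta_repair_time + beta * delta_waste_rate


# Synthetic history generated with weights 2 and 3, for illustration only.
history = [Incident(t, r, 2.0 * t + 3.0 * r)
           for t, r in [(1.0, 0.2), (0.5, 0.8), (2.0, 0.1), (1.5, 0.5)]]
alpha_hat, beta_hat = fit_weights(history)
```

In a feedback loop, `fit_weights` could be rerun as new incidents arrive, starting from the previous weights, so that the fitted weights follow changes in the environment.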

Specifically, the coupling index, the real-time service level agreement (SLA) violation rate, and the fault propagation cost gradient are input into a weight function; the cooperative priority of short-term suppression actions and long-term eradication actions is calculated by the dynamic weight function; and the execution sequence of the two types of actions is generated by an asymmetric game strategy. The specific process is as follows: a coupling penalty factor is introduced into the dynamic weight function to lower the priority of scheduling operations that may aggravate fault propagation in highly coupled scenarios; the urgency weight of short-term suppression actions is calculated from the real-time SLA violation rate, and the benefit weight of long-term eradication actions is calculated from the propagation cost gradient; short-term suppression actions and long-term eradication actions are defined as the participants of an asymmetric game, and payoff functions are constructed to quantify their contributions to improving service availability and suppressing fault propagation; the optimal cooperative strategy is solved through dynamic Nash equilibrium, with short-term suppression actions executed first to stop losses quickly and long-term eradication actions triggered asynchronously.

In this embodiment, the coupling penalty factor is introduced and priorities are adjusted as follows. When the cooperative priority of short-term suppression actions and long-term eradication actions is calculated, the coupling index is considered first, which helps identify operations that may aggravate fault propagation. Scheduling operations in a highly coupled scenario may harm the stability of the system and should therefore be given lower priority. Coupling penalty factor: highly coupled scenarios are penalized by introducing the coupling penalty factor $\gamma_{coupling}$. The coupling index $\zeta$ reflects how closely indicators at different layers are related during fault propagation; a higher coupling index generally indicates a higher risk of fault propagation. The formula is:

$$W_{sched} = \omega_0 \cdot (1 - \gamma_{coupling} \cdot \zeta)$$

where $W_{sched}$ is the priority weight of the scheduling operation, taking the coupling penalty into account; $\omega_0$ is the base weight coefficient, which adjusts the degree of influence on priority; $\gamma_{coupling}$ is the coupling penalty factor, which represents the penalty strength for highly coupled scenarios and takes a value between 0 and 1; and $\zeta$ is the coupling index, which reflects the strength of association between layers. With this formula, scheduling operations that may aggravate fault propagation are effectively suppressed.

Calculating the urgency weight of short-term suppression actions: a short-term suppression action is a measure that responds quickly and mitigates fault propagation, and is typically used to intervene when the service violation rate is high. The urgency weight of a short-term suppression action is therefore calculated from the real-time service level agreement (SLA) violation rate. The SLA violation rate $R_{SLA}$ reflects the probability that the system currently fails to meet its service commitments on time; the higher the violation rate, the more urgent the short-term suppression action. The formula is:

$$W_{short} = \omega_s \cdot (R_{SLA})^{\kappa}$$

where $W_{short}$ is the urgency weight of the short-term suppression action; $\omega_s$ is the base weight coefficient used to adjust the urgency weight; $R_{SLA}$ is the real-time SLA violation rate; and $\kappa$ is the exponent governing the influence of the violation rate on the urgency weight, typically $\kappa > 1$, so that the urgency weight rises sharply as the violation rate increases. With this formula, the execution urgency of short-term suppression actions is adjusted dynamically according to the SLA violation rate.

Calculating the benefit weight of long-term eradication actions: a long-term eradication action aims to reduce the probability of future faults by eliminating the source of the fault, and to improve the overall stability of the system. The benefit weight of a long-term eradication action is therefore tied to the current propagation cost gradient. The propagation cost gradient reflects how cost is changing during the current fault propagation; the higher the propagation cost, the greater the benefit of a long-term eradication action. The formula is:

$$W_{long} = \omega_l \cdot (\nabla C)^{\lambda}$$

where $W_{long}$ is the benefit weight of the long-term eradication action; $\omega_l$ is the base weight coefficient used to adjust the benefit weight; $\nabla C$ is the current propagation cost gradient, which reflects the economic cost of fault propagation; and $\lambda$ is the exponent governing the influence of the propagation cost on the benefit weight, typically $\lambda > 1$, so that the higher the cost, the greater the priority and benefit of the eradication action.

Constructing the game model and calculating cooperative priorities: in the game model, short-term suppression actions and long-term eradication actions are treated as two participants with different goals and payoffs. Short-term suppression actions focus on quickly reducing the effects of the fault, while long-term eradication actions focus on eliminating its root cause. Asymmetric game model: because the contributions and payoffs of the two types of actions differ, the cooperative priority is calculated by an asymmetric game strategy. The game model quantifies the contributions of both to fault propagation suppression and service availability, and finally determines the optimal execution sequence. The formula is:

$$s_S^{*} = \arg\max_{s_S} U_S(s_S, s_L^{*}), \qquad s_L^{*} = \arg\max_{s_L} U_L(s_S^{*}, s_L)$$

where $s_S$ and $s_L$ are the execution strategies of the short-term suppression action and the long-term eradication action, respectively, and $U_S$ and $U_L$ are their respective payoff functions, which take the urgency weight, the benefit weight, and the priority into account. The optimal cooperative strategy is obtained by solving for the Nash equilibrium of the game model, that is, by determining how to select short-term suppression actions and long-term eradication actions under given conditions so as to maximize the improvement in service availability and the suppression of fault propagation.

Execution sequence: based on the Nash equilibrium solution of the game, the system executes the short-term suppression action first to stop losses quickly, and then asynchronously triggers the long-term eradication action to resolve the fault at its root. In this way, the operating strategy of the system is adjusted dynamically according to the actual fault propagation, resource utilization, and service violation rate, achieving efficient fault suppression and service assurance.
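The weighting and equilibrium steps described above can be illustrated with a small Python sketch. Everything concrete here is an assumption made for illustration: the power-law forms of the weights, the default exponent of 2, the two strategies per participant, the contention penalties, and the discount applied to asynchronous execution are hypothetical and are not taken from this embodiment. The sketch enumerates pure-strategy Nash equilibria only, which is a simplification of the dynamic equilibrium referred to in the text.

```python
from itertools import product
from typing import List, Tuple


def scheduling_priority(base: float, penalty: float, coupling: float) -> float:
    # Priority shrinks as the coupling index grows; penalty is assumed in [0, 1].
    return base * (1.0 - penalty * coupling)


def urgency_weight(base: float, sla_violation_rate: float, exponent: float = 2.0) -> float:
    # Assumed power-law form for the short-term suppression action.
    return base * sla_violation_rate ** exponent


def benefit_weight(base: float, cost_gradient: float, exponent: float = 2.0) -> float:
    # Assumed power-law form for the long-term eradication action.
    return base * cost_gradient ** exponent


def pure_nash_equilibria(payoff_short: List[List[float]],
                         payoff_long: List[List[float]]) -> List[Tuple[int, int]]:
    # payoff_x[i][j]: payoff when the short-term player picks strategy i and the
    # long-term player picks strategy j. A cell is an equilibrium when neither
    # player can gain by deviating unilaterally.
    rows = range(len(payoff_short))
    cols = range(len(payoff_short[0]))
    equilibria = []
    for i, j in product(rows, cols):
        short_is_best = all(payoff_short[i][j] >= payoff_short[k][j] for k in rows)
        long_is_best = all(payoff_long[i][j] >= payoff_long[i][l] for l in cols)
        if short_is_best and long_is_best:
            equilibria.append((i, j))
    return equilibria


# Hypothetical strategies. Short-term player: 0 = run now, 1 = defer.
# Long-term player: 0 = run now, 1 = run asynchronously afterwards.
u = urgency_weight(1.0, 0.3)   # urgency from an assumed 30% SLA violation rate
b = benefit_weight(0.5, 1.3)   # benefit from an assumed cost gradient of 1.3
payoff_short = [[u - 0.05, u], [0.0, 0.0]]       # small contention penalty if both run now
payoff_long = [[b - 0.5, 0.8 * b], [b, 0.8 * b]]  # large contention penalty, async discount
equilibria = pure_nash_equilibria(payoff_short, payoff_long)
```

With these invented payoffs the only equilibrium is the cell in which the short-term action runs immediately and the long-term action runs asynchronously, which matches the execution order described in the text.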

In this embodiment, the strategy execution and feedback module is composed of an atomic execution engine, a transaction control mechanism, a cross-layer indicator monitoring unit, and a feedback parameter tuning mechanism, and aims to achieve reliable execution and dynamic optimization of the tuning strategy.

The atomic execution engine refines unstructured high-level strategy instructions into atomic operations at the cloud resource operation level. Each strategy action is broken down into a number of independently executable atomic tasks that directly control the resource management interface (API) of the cloud platform. Examples of atomic operations: calling the traffic limiting policy API; triggering service instance degradation; and initiating the migration of a storage volume between nodes to relieve I/O bottlenecks or balance load.

Atomicity and consistency guarantee mechanism: a distributed transaction lock mechanism is introduced to avoid inconsistent intermediate states during cross-platform or cross-service execution. Atomicity means that any compound operation either succeeds completely or is rolled back completely. Consistency means that the resource states of all subsystems are in a consistent, controllable state after the operation completes.

The monitoring content comprises: the trend of key cross-layer indicators (such as system load, service latency, link traffic, and instance failure rate) before and after strategy execution; the state response of the nodes in the fault propagation chain, which quantifies how far the scheduling action cut off or mitigated potential propagation paths; and the change in resource utilization, which shows how action execution affected the allocation efficiency of computing, storage, or bandwidth resources. The indicators of each layer are collected through an integrated monitoring system, and a propagation chain analyzer marks the indicator response delay and slope change between the controlled node and the potential propagation paths.

Model optimization target: the fault propagation cost model and the coupling analysis model used in the decision process are adjusted dynamically according to the monitoring feedback, so as to make the strategy response more targeted and forward-looking. The tuning content comprises the following. Propagation cost weight adjustment: if a certain type of action is clearly effective at blocking the propagation chain over several execution rounds, its cost-effectiveness weight in the cost model can be raised (for example, by increasing the propagation reduction benefit per unit of execution cost). Coupling analysis rule correction: the co-variation of cross-layer indicators before and after an action is analyzed to identify negative coupling scenarios not covered by the initial rules (that is, cases where an operation on one layer interferes adversely with another layer), so as to expand the rule base and improve the sensitivity of coupling discrimination. Early warning capability enhancement: the key indicator change patterns identified in the feedback are fed back into the training set to update the early recognition model (such as an attention mechanism feature extraction network for negative coupling recognition).
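The all-or-nothing behaviour of a compound operation described above can be sketched as follows. This is a minimal sketch under stated assumptions: a local `threading.Lock` stands in for the distributed transaction lock, and the task functions are hypothetical placeholders for cloud platform API calls; none of these names come from this embodiment.

```python
import threading
from typing import Callable, Dict, List, Optional, Tuple

# (name, apply, rollback): hypothetical stand-ins for cloud platform API calls.
AtomicTask = Tuple[str, Callable[[], None], Callable[[], None]]

# Local stand-in for a distributed transaction lock (assumption: a real
# deployment would use a lock service shared across platforms).
_transaction_lock = threading.Lock()


def execute_compound(tasks: List[AtomicTask]) -> bool:
    """Run all tasks in order; if one fails, undo the completed ones in reverse."""
    completed: List[AtomicTask] = []
    with _transaction_lock:
        for task in tasks:
            _, apply, _ = task
            try:
                apply()
            except Exception:
                for _, _, undo in reversed(completed):
                    undo()
                return False
            completed.append(task)
    return True


# Illustrative resource state and tasks; the failure is simulated.
state: Dict[str, Optional[object]] = {"rate_limit": None, "volume_node": "node-a"}


def set_rate_limit() -> None:
    state["rate_limit"] = 100


def clear_rate_limit() -> None:
    state["rate_limit"] = None


def migrate_volume() -> None:
    raise RuntimeError("target node unavailable")


def restore_volume() -> None:
    state["volume_node"] = "node-a"


ok = execute_compound([
    ("limit-traffic", set_rate_limit, clear_rate_limit),
    ("migrate-volume", migrate_volume, restore_volume),
])
```

Because the second task fails, the rate limit applied by the first task is rolled back and the state returns to its initial values, so no inconsistent intermediate state is left behind.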

S1, deploying a containerized probe cluster, collecting in real time physical layer hardware resource fragmentation indicators, virtual layer container lifecycle events, and application layer microservice call chain performance deviation data in a hybrid cloud environment, and outputting a standardized cross-layer indicator set through layered labeling preprocessing and outlier filtering;

S2, connecting to the cross-layer indicator perception module, eliminating the timing deviation between low-frequency physical layer data and high-frequency application layer data through an incremental time series alignment algorithm, constructing a fault propagation map based on a dynamic time window, and identifying hidden association nodes with spatiotemporal coupling characteristics among cross-layer indicators;

S3, receiving the fault propagation map output by the spatiotemporal correlation analysis module, verifying the authenticity of cross-layer causal relationships through directional disturbance injection, fitting a coupling index according to the association matrix between resource scheduling operations and fault propagation chains, constructing coupling analysis rules to quantify the influence weight of operations on fault propagation, and generating root cause location instructions for negative coupling scenarios;

S4, analyzing the historical fault repair time and resource waste rate according to the fault propagation cost model to fit the propagation cost gradient, inputting the coupling index, the real-time service level agreement violation rate, and the fault propagation cost gradient into a weight function, calculating the cooperative priority of short-term suppression actions and long-term eradication actions through the dynamic weight function, and generating the execution sequence of the two types of actions through an asymmetric game strategy;

S5, calling cloud platform interfaces to execute decision actions atomically, collecting indicator change data after execution, and dynamically updating the fault propagation cost model and the coupling analysis rules.

In this embodiment, step S1 achieves low-intrusion acquisition of multi-source heterogeneous data by deploying a containerized probe cluster, and builds a cross-layer indicator system consisting of the physical layer, the virtual layer, and the application layer. Physical layer acquisition covers server CPU utilization fragmentation, memory allocation dispersion, disk I/O fluctuation, and the like; virtual layer acquisition covers container restarts, migrations, and lifecycle events; application layer acquisition covers microservice call chain latency, failure rate, sudden throughput changes, and the like. Indicators from different layers are weighted and classified by a labeling preprocessing mechanism, and an outlier filter is incorporated to improve the quality of training samples. This enables real-time extraction and normalization of cross-layer fault symptoms and builds a high-dimensional, low-coupling data feature space that provides structured input for subsequent propagation analysis and causal modeling.

Step S2 provides an incremental time series alignment algorithm and a dynamic time window graph construction strategy, achieving unified modeling of cross-layer data streams sampled at different frequencies and capturing potential propagation paths and spatiotemporal coupling relationships. An incremental sliding window algorithm synchronizes physical layer data (for example, sampled every 5 min) with application layer data (for example, sampled every 5 s); a propagation probability map is constructed from indicator change rates and delayed response relationships; and a hidden associated node discovery mechanism is introduced to uncover nodes that show no explicit anomaly in the initial stage of propagation. This reveals cross-layer causal diffusion chains that traditional single-layer logs cannot describe, and significantly improves the interpretability of mixed fault paths in complex scenarios.

Step S3 incorporates scheduling operations into causal chain modeling, and provides a scheduling-propagation chain coupling index and a negative coupling behavior recognition mechanism to build a root cause location rule set. Coupling index calculation: an association matrix between scheduling operations and propagation chain changes is established, and the gain factors of scheduling operations on chain extension length and repair time are measured. Negative coupling determination: a dynamic contribution threshold is introduced, and an operation is marked as a negative coupling source when it significantly amplifies the propagation chain or delays recovery. Root cause screening: the nodes associated with negative operations are backtracked, nodes affected by overlapping historical scheduling are eliminated, and only nodes with high causal strength are retained. Reverse blocking verification: a resource blocking test is performed on candidate root cause nodes to verify that blocking them interrupts downstream propagation. This avoids the excessive dependence of traditional root cause location on the degree of abnormality, and achieves active, multi-path, multi-source root cause reasoning from the perspectives of causal structure and system behavior.

Step S4 brings the coupling index, the SLA violation rate, and the propagation cost gradient into a unified dynamic weight function, and introduces an asymmetric game mechanism to order the two types of actions cooperatively. Dynamic weight function design: a coupling penalty factor is introduced to prevent scheduling operations from producing negative feedback in highly coupled regions. Urgency weight: the priority of short-term suppression actions is measured by the degree of SLA violation. Long-term benefit weight: the benefit of eradication actions is calculated by a function of the propagation cost gradient. Asymmetric game modeling: the two types of actions are treated as game participants, service availability and propagation reduction capability are constructed as a dual-objective payoff function, and an optimal action combination strategy that evolves with the environment is obtained by solving for the dynamic Nash equilibrium. This overcomes the rigidity of traditional rule-driven operation and maintenance, achieves cooperative control of short-term rapid loss stopping and long-term risk elimination, and provides a high degree of operation and maintenance elasticity.

Step S5 introduces a cloud platform API-level atomic execution mechanism and an execution-monitoring-optimization feedback loop to build a closed decision loop. Atomic operation execution: strategy actions are decomposed into API-level tasks (such as bandwidth limiting, computing resource rescheduling, and volume migration), and a distributed transaction lock mechanism ensures consistent and uninterruptible execution across heterogeneous multi-cloud systems. Feedback acquisition and model optimization: key indicator changes after strategy execution are monitored in real time, the feedback results are used to adjust the coupling weight factors and the cost gradient function parameters, and the negative coupling determination rules and the early recognition strategy model are updated. This completes the closed loop of strategy execution, perception, feedback, and re-optimization, and supports dynamic adaptation and rapid response in an unstable resource environment.
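The alignment of a low-frequency series with a high-frequency series described for step S2 can be illustrated with a short Python sketch. It is a simplified, non-incremental sketch under stated assumptions: linear interpolation is used for resampling and a Pearson correlation scan over candidate lags is used to estimate the phase difference; the embodiment does not specify either choice, and all function names and sample values are hypothetical.

```python
from typing import List, Sequence


def interpolate(low_t: Sequence[float], low_v: Sequence[float],
                high_t: Sequence[float]) -> List[float]:
    """Linearly resample a low-frequency series onto high-frequency timestamps.

    Assumes at least two low-frequency samples and sorted timestamps.
    """
    out = []
    j = 0
    for t in high_t:
        # Advance to the segment [low_t[j], low_t[j + 1]] that contains t.
        while j + 1 < len(low_t) - 1 and low_t[j + 1] <= t:
            j += 1
        t0, t1 = low_t[j], low_t[j + 1]
        w = min(max((t - t0) / (t1 - t0), 0.0), 1.0)  # clamp outside the range
        out.append(low_v[j] + w * (low_v[j + 1] - low_v[j]))
    return out


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0.0 or vy == 0.0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def best_lag(x: Sequence[float], y: Sequence[float], max_lag: int) -> int:
    """Return the lag k (in samples) at which y[t + k] correlates best with x[t]."""
    best_k, best_r = 0, float("-inf")
    for k in range(max_lag + 1):
        r = pearson(x[:len(x) - k], y[k:])
        if r > best_r:
            best_k, best_r = k, r
    return best_k


# Physical layer sampled every 300 s, resampled onto a 5 s application layer grid.
low_t = [0.0, 300.0, 600.0]
low_v = [10.0, 20.0, 40.0]
high_t = [5.0 * i for i in range(121)]
aligned = interpolate(low_t, low_v, high_t)
```

Once both series share a time grid, `best_lag` gives a rough estimate of how many samples one indicator trails the other, which could then be used to shift the interpolation anchor points.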

In summary, the present application has at least the following effects:

The cloud monitoring service operation and maintenance dynamic optimization system and method based on an AI intelligent agent collects and label-preprocesses multidimensional data from the physical layer, the virtual layer, and the application layer in real time through a containerized probe cluster, forming a standardized cross-layer indicator set that provides a unified data basis for subsequent analysis. Through incremental time series alignment and dynamic propagation map construction, cross-layer spatiotemporal coupling relationships are captured accurately and hidden fault links are revealed, improving the accuracy and foresight of operation and maintenance decisions. By fitting the coupling index and constructing an operation-propagation chain association matrix, negative scheduling behavior is identified and root causes are located, avoiding the risk of system instability caused by traditional experience-driven scheduling. The service level agreement violation rate, the coupling index, and the fault propagation cost gradient are fused across dimensions, and a cooperatively optimized action sequence is output through an asymmetric game strategy, improving the overall benefit ratio and response efficiency of strategy execution. Through the cloud platform API-level atomic execution engine and the feedback tuning mechanism, full-process closed-loop control from strategy generation to execution monitoring is achieved, giving the system good evolution capability and dynamic adaptability and significantly enhancing the intelligence and resilience of the cloud monitoring system.

It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of systems, apparatuses (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.

It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

1. A cloud monitoring service operation and maintenance dynamic optimization system based on AI agents, characterized by comprising: a cross-layer indicator perception module, a spatiotemporal correlation analysis module, a causal verification and coupling analysis module, a dynamic priority decision module, and a strategy execution and feedback module;

The cross-layer indicator perception module is used to deploy a containerized probe cluster to collect real-time physical layer hardware resource fragmentation indicators, virtual layer container lifecycle events, and application layer microservice call chain performance deviation data in a hybrid cloud environment. Through layered labeling preprocessing and outlier filtering, it outputs a standardized cross-layer indicator set;

The spatiotemporal correlation analysis module is used to connect to the cross-layer indicator perception module, eliminate the timing deviation between the low-frequency data of the physical layer and the high-frequency data of the application layer through an incremental timing alignment algorithm, build a fault propagation map based on a dynamic time window, and identify implicit correlation nodes with spatiotemporal coupling characteristics between cross-layer indicators;

The causal verification and coupling analysis module is used to receive the fault propagation map output by the spatiotemporal correlation analysis module, verify the authenticity of cross-layer causal relationships through targeted disturbance injection, fit a coupling index based on the correlation matrix between resource scheduling operations and fault propagation chains, and construct coupling analysis rules to quantify the impact of operations on fault propagation, thereby generating root cause location instructions for negative coupling scenarios;

The dynamic priority decision module is used to analyze the historical fault repair time and resource waste rate according to the fault propagation cost model to fit the propagation cost gradient, input the coupling index, the real-time service level agreement breach rate and the fault propagation cost gradient into the weight function, calculate the collaborative priority of the short-term suppression action and the long-term eradication action through the dynamic weight function, and generate the execution sequence of the two types of actions through an asymmetric game strategy;

The strategy execution and feedback module is used to call the cloud platform interface to atomically execute decision actions, collect indicator change data after execution, and dynamically update the fault propagation cost model and coupling analysis rules;

The incremental timing alignment algorithm is used to dynamically adjust the alignment of data points during continuous data updates to achieve synchronous matching between high-frequency and low-frequency indicators; the negative coupling scenario refers to a scenario where resource scheduling operations exacerbate fault propagation; the propagation cost gradient refers to the marginal contribution rate of different nodes or paths to the overall system cost growth; the fault propagation map is a graph structure based on a probability model, in which nodes represent indicator items, edges represent potential fault transmission paths, and weights represent propagation probabilities; the targeted disturbance injection refers to injecting slight changes into specific components to verify causal relationships; the coupling index is a numerical indicator used to quantify the intensity of mutual influence between indicators or operations; the real-time service level agreement breach rate refers to the proportion of service that is not delivered at the quality required by the contract or platform; the short-term suppression action is used to quickly curb the impact of the fault; the long-term eradication action is used to fundamentally eliminate the fault factor; the atomic execution of decision actions refers to decomposing the strategy into the smallest operation units that can be executed independently.

2. The cloud monitoring service operation and maintenance dynamic optimization system based on AI agent according to claim 1 is characterized in that the cross-layer indicator perception module specifically includes:

The containerized probe cluster is deployed on hybrid cloud nodes using a microservices architecture. It includes physical-layer probes to collect hardware resource fragmentation indicators, virtual-layer probes to monitor container lifecycle events and resource contention, and application-layer probes to track microservice call chain topology and performance deviations.

The acquisition frequency is adjusted according to the dynamic change rate of the indicator. The physical layer uses low-frequency trigger sampling, the application layer uses event-driven high-frequency tracking, and the sliding window statistics suppress instantaneous noise.

The hierarchical labeling preprocessing unit adds environmental context labels for cloud platform type and service dependency layer to the original data, filters outliers based on the isolation forest algorithm, and outputs a standardized cross-layer indicator set.
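
The rate-adaptive sampling and sliding-window noise suppression described in this claim can be illustrated with a minimal Python sketch. The function names, the halving/doubling rule, the 10% change threshold, and the interval bounds are assumptions made for illustration and are not specified in the claim; the isolation forest outlier filter is not shown here.

```python
from collections import deque
from statistics import median


def next_interval(prev, curr, interval, lo=1.0, hi=60.0, threshold=0.1):
    """Halve the sampling interval when the metric changes fast, double it when quiet.

    All numeric defaults are hypothetical; the claim only says the frequency
    follows the indicator's dynamic change rate.
    """
    base = abs(prev) if prev else 1.0
    rate = abs(curr - prev) / base
    if rate > threshold:
        return max(lo, interval / 2)
    return min(hi, interval * 2)


def sliding_median(values, window=3):
    """Suppress instantaneous spikes with a trailing sliding-window median."""
    buf = deque(maxlen=window)
    out = []
    for v in values:
        buf.append(v)
        out.append(median(buf))
    return out
```

For example, a single spike in `[1, 1, 50, 1, 1]` is removed by `sliding_median` with a window of 3, while a jump from 100 to 150 shortens a 10-second interval to 5 seconds.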

3. The AI-agent-based cloud monitoring service operation and maintenance dynamic optimization system according to claim 2 is characterized by: using an incremental timing alignment algorithm to eliminate timing deviations between low-frequency data at the physical layer and high-frequency data at the application layer, and constructing a fault propagation graph based on a dynamic time window. The specific process is as follows:

Dynamically interpolate low-frequency data at the physical layer and high-frequency data at the application layer, and generate a continuous time series based on data confidence weighting;

Detect potential phase differences between cross-layer indicators through sliding correlation analysis and dynamically adjust interpolation anchor points to eliminate timing offsets;

The conditional transition probability between cross-layer indicators is calculated based on the aligned time series data; the time window range is dynamically expanded, and the window is contracted when a sudden change in an indicator is detected; and a fault propagation graph with weighted edges is generated.
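
The three steps above (interpolation, phase-difference detection by sliding correlation, and conditional transition probability) can be sketched as follows. This is a simplified illustration under stated assumptions: linear interpolation stands in for the confidence-weighted interpolation, both series are assumed to be already resampled to the same length for lag detection, and the function names are hypothetical. The dynamic window expansion and contraction is not modeled.

```python
from math import sqrt


def interpolate(times, values, t):
    """Linearly interpolate a low-frequency series at time t (clamped at both ends)."""
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]
    for i in range(1, len(times)):
        if t <= times[i]:
            w = (t - times[i - 1]) / (times[i] - times[i - 1])
            return values[i - 1] + w * (values[i] - values[i - 1])


def pearson(a, b):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb) if va and vb else 0.0


def best_lag(upstream, downstream, max_lag):
    """Lag (in samples) at which downstream correlates best with upstream."""
    best, best_r = 0, -2.0
    for lag in range(max_lag + 1):
        r = pearson(upstream[: len(upstream) - lag], downstream[lag:])
        if r > best_r:
            best, best_r = lag, r
    return best


def transition_probability(up_flags, down_flags, lag):
    """Estimate P(downstream anomalous at t + lag | upstream anomalous at t)."""
    pairs = list(zip(up_flags[: len(up_flags) - lag], down_flags[lag:]))
    hits = sum(1 for u, _ in pairs if u)
    if hits == 0:
        return 0.0
    return sum(1 for u, d in pairs if u and d) / hits
```

In this sketch the lag returned by `best_lag` plays the role of the phase difference used to shift interpolation anchor points, and the value returned by `transition_probability` would serve as the weight of one edge in the fault propagation graph.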

4. The AI-agent-based cloud monitoring service operation and maintenance dynamic optimization system according to claim 3 is characterized in that the logic for identifying implicit association nodes with spatiotemporal coupling characteristics between cross-layer indicators is as follows:

Extract the propagation paths across the physical layer, virtual layer, and application layer from the fault propagation graph, and screen candidate nodes whose transfer probabilities exceed dynamic thresholds;

Perform mutual information entropy analysis on candidate nodes to quantify their dependence on upstream and downstream indicators and eliminate weakly correlated interference items;

The directionality of the spatiotemporal causal relationship of the candidate paths is verified by the Granger causality test, and directional disturbances are injected into high-probability paths to observe the response amplitude of downstream indicators and confirm the validity of the spatiotemporal coupling.
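
The mutual information screening step can be illustrated with a short sketch. It assumes the indicator series have already been discretized into a small number of states (for example normal = 0, abnormal = 1); the function names and the `min_bits` cut-off are hypothetical. The Granger causality test and the disturbance injection are not shown.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information in bits between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def screen_candidates(candidates, downstream, min_bits):
    """Keep candidate nodes whose dependence on the downstream indicator is strong enough."""
    return [
        name
        for name, series in candidates.items()
        if mutual_information(series, downstream) >= min_bits
    ]
```

A candidate whose states move in lockstep with the downstream indicator scores 1 bit for a binary series, while an independent one scores 0 bits and is dropped as a weakly correlated interference item.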

5. The AI-agent-based cloud monitoring service operation and maintenance dynamic optimization system according to claim 4 is characterized by: verifying the authenticity of cross-layer causal relationships through targeted disturbance injection, and fitting the coupling index based on the correlation matrix of resource scheduling operations and fault propagation chains. The specific process is as follows:

A high-probability associated path is selected in the fault propagation graph, and a controllable disturbance is injected into the source node of the path, for example by simulating a sudden increase in physical-layer storage latency or by limiting the network bandwidth of a virtual-layer container;

Observe the response of downstream indicators, record the disturbance propagation path and amplitude, and check whether the observed path is consistent with the path predicted by the original graph;

The temporal relationship between historical resource scheduling operations and fault propagation chains is extracted, and a correlation matrix between resource scheduling operations and fault propagation chains is constructed. Based on the influence weights of resource scheduling operations on the length and repair time of the fault chain in the matrix, the coupling index is fitted using the gradient descent method.
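
One way to picture the gradient descent fitting step is the sketch below. It makes several assumptions that the claim does not state: each row of the correlation matrix is reduced to two features (influence on fault chain length and influence on repair time), the coupling index is modeled as a linear combination of those features, and a historical target score is available for each row. The learning rate and epoch count are illustrative.

```python
def fit_coupling_weights(rows, targets, lr=0.05, epochs=2000):
    """Fit index = a * chain_length_weight + b * repair_time_weight by gradient descent.

    rows: list of (chain_length_weight, repair_time_weight) tuples (hypothetical features).
    targets: observed coupling scores for the same rows (hypothetical labels).
    """
    a = b = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for (x1, x2), y in zip(rows, targets):
            err = a * x1 + b * x2 - y
            grad_a += 2 * err * x1 / n  # d(MSE)/da
            grad_b += 2 * err * x2 / n  # d(MSE)/db
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b


def coupling_index(row, weights):
    """Evaluate the fitted linear coupling index for one scheduling operation."""
    return weights[0] * row[0] + weights[1] * row[1]
```

The fitted pair `(a, b)` can then be applied through `coupling_index` to score a new scheduling operation against the same two features.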

6. The AI-agent-based cloud monitoring service operation and maintenance dynamic optimization system according to claim 5 is characterized by: constructing coupling analysis rules to quantify the weight of the impact of operations on fault propagation, and generating root cause location instructions for negative coupling scenarios. The specific process is as follows:

Based on the correlation matrix between resource scheduling operations and fault propagation chains, the contribution weight of the operations to the length of the fault chain and the repair time is calculated to generate the initial coupling analysis rules;

Set a dynamic judgment threshold. When the contribution weight of an operation to the fault propagation chain exceeds the threshold, it is marked as a negative coupling operation.

In the fault propagation graph, trace back the nodes associated with negative coupling operations and select nodes with high causal strength that are not covered by historical scheduling operations as candidate root causes;

Perform reverse blocking tests on candidate root causes to verify their interruption effect on the downstream fault chain by restricting their resource access or traffic distribution.

Generate a root cause location instruction containing the root cause node identifier, impact path, and repair suggestions, and push it to the operation and maintenance terminal.
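
The threshold marking and back-tracing steps above can be sketched as follows. The choice of mean plus `k` standard deviations as the dynamic threshold, the dictionary-based graph layout, and all names are assumptions for illustration; the reverse blocking test and the instruction push are not modeled.

```python
from statistics import mean, pstdev


def negative_coupling_ops(weights, k=1.0):
    """Mark operations whose contribution weight exceeds mean + k * std.

    The mean-plus-deviation rule is one hypothetical form of a dynamic threshold.
    """
    vals = list(weights.values())
    threshold = mean(vals) + k * pstdev(vals)
    return {op for op, w in weights.items() if w > threshold}


def candidate_root_causes(graph, start, covered, min_strength):
    """Walk edges backwards from start and collect strong upstream nodes.

    graph: {source: {target: propagation_probability}}.
    covered: nodes already handled by historical scheduling operations.
    """
    reverse = {}
    for src, edges in graph.items():
        for dst, p in edges.items():
            reverse.setdefault(dst, []).append((src, p))
    seen, stack, out = {start}, [start], []
    while stack:
        node = stack.pop()
        for src, p in reverse.get(node, []):
            if p >= min_strength and src not in seen:
                seen.add(src)
                stack.append(src)
                if src not in covered:
                    out.append(src)
    return sorted(out)
```

With a graph in which `disk` feeds `vm` and `vm` feeds `api`, tracing back from `api` while `vm` is already covered by past scheduling leaves `disk` as the only candidate root cause.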

7. The AI-agent-based cloud monitoring service operation and maintenance dynamic optimization system according to claim 6 is characterized in that the specific process of fitting the propagation cost gradient by analyzing historical fault repair time and resource waste rate according to the fault propagation cost model is as follows:

Extract historical fault repair time data and resource waste rate caused by resource scheduling operations to construct the initial propagation cost function;

Iteratively optimize the cost function parameters through the gradient descent method, and dynamically adjust the repair time weight and resource waste penalty factor;

Calculate the current propagation cost gradient based on the real-time fault propagation path length and resource utilization changes;

Continuously optimize gradient parameters through policy execution feedback data to adapt to dynamic changes in the hybrid cloud environment.
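
As a rough illustration of the propagation cost gradient, defined in claim 1 as the marginal contribution rate of each node or path to overall cost growth, the sketch below computes each path's share of the total cost increase between two observations. The linear cost form and the weights `w_time` and `w_waste` are assumptions; the claim's gradient descent fitting of those weights is not shown.

```python
def propagation_cost(repair_minutes, waste_rate, w_time, w_waste):
    """Hypothetical linear cost: weighted repair time plus weighted resource waste."""
    return w_time * repair_minutes + w_waste * waste_rate


def cost_gradient(before, after, w_time=1.0, w_waste=100.0):
    """Share of the total cost growth attributable to each path.

    before / after: {path: (repair_minutes, waste_rate)} at two points in time.
    """
    delta = {
        path: propagation_cost(*after[path], w_time, w_waste)
        - propagation_cost(*before[path], w_time, w_waste)
        for path in after
    }
    total = sum(delta.values())
    if total == 0:
        return {path: 0.0 for path in delta}
    return {path: d / total for path, d in delta.items()}
```

A path whose repair time grows by 30 minutes while another grows by 10 minutes would receive shares of 0.75 and 0.25 respectively, marking the first path as the more costly direction of propagation.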

8. The AI-agent-based cloud monitoring service operation and maintenance dynamic optimization system according to claim 7 is characterized by: inputting the coupling index, the real-time service level agreement breach rate, and the fault propagation cost gradient into a weighting function, calculating the collaborative priority of short-term suppression actions and long-term eradication actions through a dynamic weighting function, and generating the execution sequence of the two types of actions through an asymmetric game strategy. The specific process is as follows:

By introducing a coupling penalty factor into the dynamic weight function, the scheduling operation priority that is positively correlated with the fault propagation chain is suppressed;

The urgency weight of short-term suppression actions is calculated based on the real-time service level agreement breach rate, and the benefit weight of long-term eradication actions is calculated based on the propagation cost gradient;

Define short-term suppression actions and long-term eradication actions as participants in an asymmetric game, and construct a payoff function that quantifies their contribution to improving service availability and suppressing fault propagation.

The optimal collaborative strategy is solved through dynamic Nash equilibrium, prioritizing short-term suppression actions to quickly stop losses and asynchronously triggering long-term eradication actions.
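
The game-theoretic step can be pictured with a small bimatrix game in which the suppression action and the eradication action each choose between running now and deferring. The sketch below only enumerates pure-strategy Nash equilibria of a static game, which is a simplification of the dynamic Nash equilibrium named in the claim; the payoff values in the usage note are invented for illustration.

```python
from itertools import product


def pure_nash(payoff_a, payoff_b):
    """Return all pure-strategy Nash equilibria (i, j) of a bimatrix game.

    payoff_a[i][j]: payoff of the row player (suppression action).
    payoff_b[i][j]: payoff of the column player (eradication action).
    """
    rows, cols = len(payoff_a), len(payoff_a[0])
    equilibria = []
    for i, j in product(range(rows), range(cols)):
        row_best = all(payoff_a[i][j] >= payoff_a[k][j] for k in range(rows))
        col_best = all(payoff_b[i][j] >= payoff_b[i][m] for m in range(cols))
        if row_best and col_best:
            equilibria.append((i, j))
    return equilibria
```

With strategies ordered as (run now, defer) for both players and hypothetical payoffs `[[3, 4], [1, 2]]` for suppression and `[[1, 3], [2, 2]]` for eradication, the only equilibrium is `(0, 1)`: suppression runs immediately and eradication is deferred, which matches the ordering described in the claim.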

9. The cloud monitoring service operation and maintenance dynamic optimization system based on AI agent according to claim 8, characterized in that the strategy execution and feedback module specifically includes:

The atomic execution engine breaks down decision-making actions into independently executable atomic operations, such as calling cloud platform APIs to trigger rate limiting policies or initiate storage volume migration tasks. It uses a transaction lock mechanism to ensure the atomicity and consistency of cross-platform operations.

Monitor changes in cross-layer indicators after execution, capture the inhibitory effect of actions on the fault propagation chain and their impact on resource utilization, adjust the weight parameters in the fault propagation cost model based on feedback data, optimize the coupling analysis rules, and enhance the early identification capability for negative coupling scenarios.
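
The all-or-nothing behavior of the atomic execution engine can be sketched with a lock and a compensation (undo) list. This is a single-process illustration under stated assumptions: the class name and the `(do, undo)` step format are hypothetical, real cloud platform API calls are replaced by plain callables, and a distributed transaction lock across platforms is not modeled.

```python
import threading


class AtomicExecutor:
    """Run a list of atomic operations; roll back applied ones if any step fails."""

    def __init__(self):
        self._lock = threading.Lock()  # stands in for the transaction lock

    def run(self, steps):
        """steps: list of (do, undo) callables. Returns True only if every step succeeded."""
        undo_stack = []
        with self._lock:
            try:
                for do, undo in steps:
                    do()
                    undo_stack.append(undo)
            except Exception:
                # Compensate in reverse order so the system returns to its prior state.
                for undo in reversed(undo_stack):
                    undo()
                return False
        return True
```

In use, each decision action such as triggering a rate limiting policy or starting a storage volume migration would be wrapped as one `(do, undo)` pair, so that a failure in a later step reverts the earlier ones.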

10. A method for dynamically optimizing cloud monitoring service operations based on an AI agent, applying the system for dynamically optimizing cloud monitoring service operations based on an AI agent according to any one of claims 1 to 9, characterized in that it comprises the following steps:

S1. Deploy a containerized probe cluster to collect real-time data on physical layer hardware resource fragmentation, virtual layer container lifecycle events, and application layer microservice call chain performance deviations in the hybrid cloud environment. Through layered labeling preprocessing and outlier filtering, a standardized cross-layer metric set is output.

S2. Connect to the cross-layer indicator perception module and use an incremental timing alignment algorithm to eliminate the timing deviation between low-frequency data at the physical layer and high-frequency data at the application layer, construct a fault propagation graph based on a dynamic time window, and identify implicit association nodes with spatiotemporal coupling characteristics between cross-layer indicators.

S3. Receive the fault propagation graph output by the spatiotemporal correlation analysis module, verify the authenticity of cross-layer causal relationships through targeted disturbance injection, fit a coupling index based on the correlation matrix between resource scheduling operations and fault propagation chains, construct coupling analysis rules to quantify the impact of operations on fault propagation, and generate root cause location instructions for negative coupling scenarios.

S4. Analyze historical fault repair times and resource waste rates based on the fault propagation cost model to fit a propagation cost gradient. Input the coupling index, real-time service level agreement (SLA) breach rate, and fault propagation cost gradient into a weighting function. This dynamic weighting function calculates the collaborative priorities of short-term suppression actions and long-term eradication actions. An asymmetric game strategy is then used to generate execution sequences for the two types of actions.

S5. Call the cloud platform interface to atomically execute decision actions, collect indicator change data after execution, and dynamically update the fault propagation cost model and coupling analysis rules.